Ranking (information retrieval)

Ranking of query results is one of the fundamental problems in information retrieval (IR),^[1] the scientific/engineering discipline behind search engines.^[2] Given a query q and a collection D of documents that match the query, the problem is to rank, that is, sort, the documents in D according to some criterion so that the "best" results appear early in the result list displayed to the user. Ranking is an important concept in computer science and is used in many applications, such as search engine queries and recommender systems.^[3] A majority of search engines use ranking algorithms to provide users with accurate and relevant results.^[4]

Ranking models

Ranking functions are evaluated by a variety of means; one of the simplest is determining the precision of the first k top-ranked results for some fixed k; for example, the proportion of the top 10 results that are relevant, on average over many queries.
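As a minimal sketch of this measure, the Python below computes precision over the top k results for one query and averages it over several queries. The function names, document identifiers, and relevance judgments are hypothetical and chosen only for illustration; the sketch assumes relevance is binary and divides by k even when fewer than k results are returned.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranked[:k]
    return sum(1 for doc in top_k if doc in relevant) / k


def mean_precision_at_k(runs, k=10):
    """Average precision-at-k over (ranked_list, relevant_set) pairs."""
    return sum(precision_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Hypothetical results for two queries (illustrative data only).
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d5", "d6"}),
]
average = mean_precision_at_k(runs, k=2)
```

With k = 2, the first query scores 0.5 and the second 1.0, so the average is 0.75.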

IR models can be broadly divided into three types: Boolean models or BIR, Vector Space Models, and Probabilistic Models.^[9] Various comparisons between retrieval models can be found in the literature (e.g., ^[10]).

Probabilistic Model

In the probabilistic model, probability theory is used as the principal means of modeling the retrieval process in mathematical terms. The probability model of information retrieval was introduced by Maron and Kuhns in 1960 and further developed by Robertson and other researchers. According to Spärck Jones and Willett (1997), the rationale for introducing probabilistic concepts is obvious: IR systems deal with natural language, and this is far too imprecise to enable a system to state with certainty which document will be relevant to a particular query.

The model applies the theory of probability to information retrieval (an event has a probability of occurring between 0 percent and 100 percent). That is, in the probability model, relevance is expressed in terms of probability, and documents are ranked in order of decreasing probability of relevance. The model takes into account the uncertainty in the IR process, that is, uncertainty about whether documents retrieved by the system are relevant to a given query.

The probability model estimates the probability that a document will be relevant to a given query. The “event” in this context is a document being relevant to a query. Unlike other IR models, the probability model does not treat relevance as an exact match-or-miss judgment.

The model adopts various methods to determine the probability of relevance between queries and documents. Relevance in the probability model is judged according to the similarity between queries and documents. The similarity judgment is further dependent on term frequency.

Thus, for a query consisting of only one term (B), the probability that a particular document (Dm) will be judged relevant is the ratio of the number of users who submit query term (B) and consider the document (Dm) relevant to the number of users who submitted the term (B). In Maron and Kuhns's model, this is the probability that users submitting a particular query term (B) will judge an individual document (Dm) to be relevant.
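The ratio described above can be sketched as a simple frequency estimate. The following Python is an illustrative assumption, not Maron and Kuhns's own formulation: it supposes a hypothetical log of sessions, each recording the single query term a user submitted and the set of documents that user judged relevant.

```python
def estimate_relevance(sessions, term, doc):
    """Estimate the probability that `doc` is judged relevant by users who submit `term`.

    `sessions` is a hypothetical list of (query_term, judged_relevant_docs) pairs,
    one per user submission. Returns None if nobody submitted the term.
    """
    judged_sets = [judged for submitted, judged in sessions if submitted == term]
    if not judged_sets:
        return None
    # Users who submitted the term and judged the document relevant,
    # divided by all users who submitted the term.
    return sum(1 for judged in judged_sets if doc in judged) / len(judged_sets)


# Illustrative log: four users submit "ranking", one submits "indexing".
sessions = [
    ("ranking", {"Dm", "Dx"}),
    ("ranking", {"Dx"}),
    ("ranking", {"Dm"}),
    ("ranking", set()),
    ("indexing", {"Dm"}),
]
probability = estimate_relevance(sessions, "ranking", "Dm")
```

Here two of the four users who submitted "ranking" judged Dm relevant, so the estimate is 0.5.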

According to Gerard Salton and Michael J. McGill, the essence of this model is that if estimates for the probability of occurrence of various terms in relevant documents can be calculated, then the probabilities that a document will be retrieved, given that it is relevant, or that it is not, can be estimated.^[11]

Several experiments have shown that the probabilistic model can yield good results. However, such results have not been sufficiently better than those obtained using the Boolean or Vector Space model.^[12]^[13]

Evaluation Measures

The most common measures of evaluation are precision, recall, and F-score. They are computed using unordered sets of documents. These measures must be extended, or new measures must be defined, in order to evaluate the ranked retrieval results that are standard in modern search engines. In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents. For each such set, precision and recall values can be plotted to give a precision-recall curve.^[14]
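One way to obtain the points of such a curve is to walk down the ranked list and record recall and precision for each top-k prefix. The sketch below assumes binary relevance and a non-empty set of relevant documents; the function name and the sample data are hypothetical.

```python
def precision_recall_points(ranked, relevant):
    """Return a (recall, precision) pair for each top-k prefix of the ranked list."""
    points = []
    hits = 0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        # Recall uses the total number of relevant documents; precision uses k.
        points.append((hits / len(relevant), hits / k))
    return points


# Illustrative ranking of four documents, two of which are relevant.
points = precision_recall_points(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

Plotting precision against recall for these points gives the precision-recall curve for this one query.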

Precision

Precision measures the exactness of the retrieval process. If the actual set of relevant documents is denoted by I and the retrieved set of documents is denoted by O, then the precision is given by:

{\text{Precision}}={\frac {|I\cap O|}{|O|}}

Recall

Recall is a measure of completeness of the IR process. If the actual set of relevant documents is denoted by I and the retrieved set of documents is denoted by O, then the recall is given by:

{\text{Recall}}={\frac {|I\cap O|}{|I|}}

F1 Score

The F1 score combines precision and recall into a single measure: it is the harmonic mean of the two. If P is the precision and R is the recall, then the F1 score is given by:

F_{1}=2\cdot {\frac {\mathrm {P} \cdot \mathrm {R} }{\mathrm {P} +\mathrm {R} }}
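The three set-based measures above can be computed directly with set operations. This is a minimal sketch: the function name and the example sets are hypothetical, and returning 0.0 when a denominator would be zero is a convention assumed here rather than something the definitions prescribe.

```python
def precision_recall_f1(relevant, retrieved):
    """Compute precision, recall and F1 for a relevant set I and a retrieved set O."""
    overlap = len(relevant & retrieved)  # |I ∩ O|
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative sets: four relevant documents, three retrieved, two in common.
p, r, f1 = precision_recall_f1({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
```

For these sets, precision is 2/3, recall is 1/2, and F1 is 4/7.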

PageRank Algorithm

The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of documents of any size. Several research papers assume that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computation requires several passes through the collection to adjust approximate PageRank values so that they more closely reflect the theoretical true value. The formula is given below:

PR(u)=\sum _{v\in B_{u}}{\frac {PR(v)}{L(v)}}

That is, the PageRank value for a page u depends on the PageRank value of each page v in the set B_u (the set of all pages linking to page u), divided by the number L(v) of links from page v.
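The iterative computation can be sketched in Python as follows. This follows the simplified formula as written, without the damping factor used in the full PageRank formulation. It also assumes that every page has at least one outbound link and appears as a key in the link map. The function name and the three-page graph are hypothetical.

```python
def pagerank(links, iterations=100):
    """Iterate PR(u) = sum over v in B_u of PR(v) / L(v), starting from a uniform distribution.

    `links` maps each page to the list of pages it links to.
    """
    pages = list(links)
    # Start with the distribution evenly divided among all pages.
    pr = {page: 1.0 / len(pages) for page in pages}

    # Build B_u: the set of pages linking to each page u.
    inbound = {page: [] for page in pages}
    for v, targets in links.items():
        for u in targets:
            inbound[u].append(v)

    # Each pass recomputes every value from the previous pass.
    for _ in range(iterations):
        pr = {u: sum(pr[v] / len(links[v]) for v in inbound[u]) for u in pages}
    return pr


# Illustrative graph: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

On this small graph the values settle at about 0.4 for A, 0.2 for B, and 0.4 for C, which satisfy the formula exactly. On graphs with pages that have no outbound links, or with purely cyclic structure, this undamped iteration may lose probability mass or fail to converge.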

This article uses material from the Wikipedia article Ranking_(information_retrieval), and is written by contributors. Text is available under a CC BY-SA 4.0 International License; additional terms may apply. Images, videos and audio are available under their respective licenses.

[1] Piccoli, Gabriele; Pigni, Federico (July 2018). Information systems for managers: with cases (Edition 4.0). Prospect Press. p. 28. ISBN 978-1-943153-50-3. Retrieved 25 November 2018.

[2] Mogotsi, I. C. "Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to information retrieval: Cambridge University Press, Cambridge, England, 2008, 482 pp, ISBN: 978-0-521-86571-5". Information Retrieval. 13 (2): 192–195. doi:10.1007/s10791-009-9115-y. ISSN 1386-4564. S2CID 31674042.

[3] "What is Information Retrieval?". GeeksforGeeks. 2020-07-02. Retrieved 2022-03-02.

[4] "Google's Search Algorithm and Ranking System - Google Search". www.google.com. Retrieved 2022-03-02.

[5] "Scientist Finds PageRank-Type Algorithm from the 1940s". MIT Technology Review. Retrieved 2022-03-02.

[6] Pinski, Gabriel; Narin, Francis (1976). "Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics". Information Processing & Management. 12 (5): 297–312. doi:10.1016/0306-4573(76)90048-0.

[7] "What are SERP Features?". www.accuranker.com. 2019-03-28. Retrieved 2022-03-02.

[8] Franceschet, Massimo (17 February 2010). "Scientist Finds PageRank-Type Algorithm from the 1940s". www.technologyreview.com.

[9] Datta, Joydip (16 April 2010). "Ranking in Information Retrieval" (PDF). Department of Computer Science and Engineering, Indian Institute of Technology. p. 7. Retrieved 25 April 2019.

[10] Turtle, Howard R.; Croft, W. Bruce (1992). "A comparison of text retrieval models". The Computer Journal. 35 (3). OUP: 279–290. doi:10.1093/comjnl/35.3.279.

[11] Harter, Stephen P. (1984-07-01). "Introduction to modern information retrieval (Gerard Salton and Michael J. McGill)". Education for Information. 2 (3): 237–238. doi:10.3233/EFI-1984-2307.

[12] Chu, H. Information Representation and Retrieval in the Digital Age. New Delhi: Ess Ess Publication.

[13] G. G. Choudhary. Introduction to Modern Information Retrieval. Facet Publishing.

[14] Manning, Christopher; Raghavan, Prabhakar; Schütze, Hinrich. Evaluation of ranked retrieval results. Cambridge University Press.

[15] Tanase, Racula; Radu, Remus (16 April 2010). "Lecture #4: HITS Algorithm - Hubs and Authorities on the Internet".
