Re-ranking Retrieved Documents Using Query Term Distances
Kerem Ali Uluğ, Turhan O. Daybelge

CS533 Information Retrieval Systems

Outline
- Motivation: how query term distances can affect the relevance of a document
- Problem statement and a possible solution
- Scoring spans in a document according to query-term distances: different approaches
- Detecting spans contained in a document
- Combining the pieces: a numeric example
- Input data for our project
- Evaluating the performance

Motivation
In conventional IR systems, documents are retrieved and ranked according to their relevance to a query using the tf-idf approach.
However, the relevance of a document to a query should increase as the distance between the query terms in the document decreases.
We need to re-rank the documents retrieved from the IR system according to query-term distances in order to incorporate proximity information into the ranking.

Motivation Example
Query: “renk körlüğü” (color blindness), i.e. renk AND körlük
(A Turkish passage about the sense of sight and eye diseases follows; "renk" occurs in its first sentence and "körlük" only near its end.)

Göz Duyusu; ışık, şekil, renk, hareket ve derinlik gibi çok çeşitli özelliklerin toplamıdır. Görme duyusunun gelişmesi, dogumdan sonra altı yaşına kadar devam eder. Doğumda, iki göz arasındaki denge herhangi bir nedenle bozulmuş ise, bir göz, beyin tarafindan tercih edilir, diğer göz atıl kapasite ile kullanılır. Düşük kapasite ile kullanılan gözün görme yeteneği azalır ve göz tembelliği oluşur. Göz hastalıkları kalıtım ile geçen, mikrobik, çeşitli kazalar ve mekanik birçok nedenlerle ortaya çıkabilir. Ülkemizde akraba evlilikleri, çocukluk çağı körlüklerinin başta gelen sebebidir. Geri kalmış ülkelerde trahom gibi mikrobik ve A vitamini eksikliği gibi beslenme bozukluğu, başlıca körlük nedenleridir.

For our query, an IR system can assign a high rank to this irrelevant document when it relies solely on a tf-idf based method. However, the query terms are so far apart from each other that they are not semantically related.

Problem & Possible Solution
Ranking by tf-idf alone fails in situations like this because term-distance (proximity) information is not used.
By intuition, we know that semantically related terms often occur near each other in a document, and a typical query searches for such groups of related terms.
We should use a re-ranking method that considers the following criteria. The relevance of a document to a query increases when:
- the distance between appropriately chosen groups of query-term occurrences (spans) in the document decreases;
- the number of such spans in the document increases.
Such a re-ranking method should assign higher ranks to more relevant documents.

Problem & Possible Solution
We need a method that assigns a score to each retrieved document.
If we could assign a proximity score to each span S_i in document D, then we could calculate a relevance score for D from these span scores.
We can then re-rank the documents according to this measure.

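The idea of turning span scores into a document score and then re-ranking can be sketched in Python as follows. This is only a minimal illustration: the slide does not state how span scores are combined, so the plain sum in `document_score` is an assumption, and the names `document_score` and `rerank` are hypothetical.

```python
from typing import Dict, List


def document_score(span_scores: List[float]) -> float:
    """Combine the scores of a document's spans.

    Assumption: a plain sum, so that both closer spans (higher scores)
    and more spans raise the document score.
    """
    return sum(span_scores)


def rerank(span_scores_by_doc: Dict[str, List[float]]) -> List[str]:
    """Return document ids ordered from highest to lowest document score."""
    return sorted(
        span_scores_by_doc,
        key=lambda doc_id: document_score(span_scores_by_doc[doc_id]),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical span scores for three retrieved documents.
    scores = {"d1": [0.5, 0.25], "d2": [0.1], "d3": [1.0]}
    print(rerank(scores))
```

Any other aggregation that grows with the number and the tightness of spans could be substituted for the sum without changing `rerank`.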
Things to be solved
There are two problems that should be solved:
- How will we assign scores to spans using proximity information?
- How will we group related terms into spans?

Calculating Span Scores
Factors that may affect the score of a span:
1. Lexical distance between the query terms in the span.
2. Whether or not the span crosses semantic boundaries (such as sentence and paragraph boundaries).
3. Number of unique query-term occurrences in the span.

Calculating Span Scores: Lexical Distance
Suppose that one span covers w1 through w10 and that there are four query-term occurrences in this span, namely w1, w4, w8 and w10.
Several span-length measures can be defined over such a span.

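For the span above, a few candidate span-length measures can be computed directly from the hit positions 1, 4, 8 and 10. The sketch below is illustrative only: the measures shown (span width, largest gap, mean gap) are assumptions about what such measures could look like, not necessarily the ones used by the authors, and `span_length_measures` is a hypothetical name.

```python
from typing import Dict, List


def span_length_measures(positions: List[int]) -> Dict[str, float]:
    """Compute simple length measures from query-term positions in a span."""
    ordered = sorted(positions)
    # Distances between consecutive query-term occurrences.
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return {
        # Number of words covered by the span, both ends included.
        "width": ordered[-1] - ordered[0] + 1,
        # Largest distance between two neighbouring occurrences.
        "max_gap": max(gaps) if gaps else 0,
        # Average distance between neighbouring occurrences.
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
    }


if __name__ == "__main__":
    # w1, w4, w8 and w10 from the example: gaps are 3, 4 and 2.
    print(span_length_measures([1, 4, 8, 10]))
```

For the example, the width is 10 words, the largest gap is 4 and the mean gap is 3.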
Calculating Span Scores: Lexical Distance
By using this approach, we guarantee that the more the query-term occurrences are separated from each other in a document, the lower the score assigned to that document will be.
Limiting the span length: we can set an upper limit L_max on the span length. Two terms t_i and t_j will be considered unrelated if dist(t_i, t_j) > L_max.

Calculating Span Scores: Crossing Semantic Boundaries
Since a span represents a group of semantically related terms, the score of a span must drop when the span crosses a semantic boundary.
Possible semantic boundaries:
- Sentence boundaries
- Paragraph boundaries
- Section boundaries, etc.
This problem can again be solved using the lexical distance concept: a semantic boundary between a pair of terms can be treated as increasing the distance between those terms.

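Treating a boundary as extra distance can be sketched as below. This is a rough illustration under stated assumptions: sentence boundaries are given as the positions of sentence-final tokens, each crossed boundary adds a fixed `penalty` (the value 5 is arbitrary), and the names `effective_distance` and `are_related` are hypothetical.

```python
from typing import Iterable


def effective_distance(pos_a: int, pos_b: int,
                       sentence_ends: Iterable[int],
                       penalty: int = 5) -> int:
    """Lexical distance plus a penalty per sentence boundary crossed.

    A boundary is crossed when a sentence-final token position lies at or
    after the first term and strictly before the second term.
    """
    lo, hi = sorted((pos_a, pos_b))
    crossed = sum(1 for end in sentence_ends if lo <= end < hi)
    return (hi - lo) + penalty * crossed


def are_related(pos_a: int, pos_b: int, sentence_ends: Iterable[int],
                l_max: int, penalty: int = 5) -> bool:
    """Two terms count as related only if their effective distance is within l_max."""
    return effective_distance(pos_a, pos_b, sentence_ends, penalty) <= l_max


if __name__ == "__main__":
    # Terms at positions 3 and 6 with a sentence ending at position 4.
    print(effective_distance(3, 6, sentence_ends=[4]))
    print(are_related(3, 6, sentence_ends=[4], l_max=6))
```

Paragraph or section boundaries could be handled the same way with larger penalties.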
Calculating Span Scores: Unique Query Terms
The more repeated query-term occurrences a span contains, the lower its score should be.
For example, for a three-term query, the following two spans of equal length are identified in the text:
q1 x x x q2 x q3 (should have a higher score)
q1 x x x q2 x q1
The span that has more unique query terms covers the query better.

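One simple way to quantify how well a span covers the query is the fraction of distinct query terms it contains. The sketch below applies this to the two spans above; the coverage ratio itself and the name `query_coverage` are assumptions made for illustration, not a measure taken from the slides.

```python
from typing import Iterable, Sequence


def query_coverage(span_tokens: Sequence[str], query_terms: Iterable[str]) -> float:
    """Fraction of distinct query terms that occur at least once in the span."""
    query = set(query_terms)
    if not query:
        return 0.0
    present = {token for token in span_tokens if token in query}
    return len(present) / len(query)


if __name__ == "__main__":
    query = ["q1", "q2", "q3"]
    span_a = ["q1", "x", "x", "x", "q2", "x", "q3"]  # three unique query terms
    span_b = ["q1", "x", "x", "x", "q2", "x", "q1"]  # q1 repeated, q3 missing
    print(query_coverage(span_a, query), query_coverage(span_b, query))
```

The first span covers the whole query, while the second covers only two of the three terms, so a score that grows with coverage ranks the first span higher.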
Calculating Span Scores: Previous Research
- Cormack et al. (University of Waterloo & University of Toronto)
- Hawking et al. (Australian National University)
- Shin et al. (AI Lab, Seoul National University)
- Song et al. (Microsoft Research Asia)
We will present the first two approaches in the next slides.

Calculating Span Scores: Cormack et al.
Cormack et al. used a method named “ranking by solution density”.
Suppose a document contains n spans.
We calculate the score of a span S by the formula: [formula not preserved]
After scoring each span, we sort the spans in descending order of score (S1, S2, ..., Sn).
Finally, the total score of the document is calculated as: [formula not preserved]

Calculating Span Scores: Hawking et al.
Hawking et al. propose a similar distance-based relevance formula.
The relevance contribution score of span S to document D for a query Q is defined as: [formula not preserved]
- C is a constant, usually 1, but it may be adjusted according to the number of repeated query terms in the span.
- F is a function, usually the identity, but it may be adjusted to alter the rate at which the relevance contribution score decays with length.
- n = |Q|, the number of unique query terms in the query.
- L_max is the maximum allowable span length.

Determining Spans
There are many possible spans in a document.
An example query: “Türkiye Avrupa İlişkileri” (Turkey-Europe relations), i.e. Türkiye AND Avrupa AND İlişki

UNESCO Türkiye Milli Komitesi Başkan Vekili ve Büyükelçi Pulat Tacar, [Türkiye Cumhuriyeti'nin Avrupa] değerleri çerçevesi içinde olduğunu söyledi. Doğuş Üniversitesi tarafından düzenlenen “[Avrupa Birliği - Türkiye ilişkileri]” konulu panelde konuşan Tacar, [Türkiye'nin Avrupa] Birliği yolunda büyük ilerlemeler kaydettiğini, ancak hala bazı eksiklikleri olduğunu belirtti.

Determining Spans
We are planning to use an iterative algorithm:
- The algorithm iterates through the query-term occurrences in the document.
- A maximum allowable distance (MAX_DIS) between query-term occurrences is defined.
- The same query term is not allowed to occur more than once in a span.
- Every query-term occurrence is covered by a span at the end of a single pass of the algorithm.

Determining Spans
current-term = first query-term hit
do while current-term ≠ NIL
    if the distance between current-term and next-term is greater than the threshold MAX_DIS then
        the current-span ends and a new span begins with next-term
    else if current-term and next-term are identical then
        the current-span ends and a new span begins with next-term
    else if next-term is identical to a hit within the current-span then
        compare the distance between current-term and next-term with the distance between the identical hit and its successor; split the span at the bigger gap
    otherwise
        add the current-term to the current-span
    current-term = next-term
repeat

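The pseudocode above can be sketched in Python roughly as follows. This is one possible reading, not the authors' implementation: hits are assumed to be (token position, query term) pairs already sorted by position, a tie between the two gaps is resolved by cutting in front of the new hit (the slides do not say which gap wins), and the name `detect_spans` is hypothetical.

```python
from typing import List, Tuple

Hit = Tuple[int, str]  # (token position, normalized query term)


def detect_spans(hits: List[Hit], max_dis: int) -> List[List[Hit]]:
    """Group query-term hits into spans in a single left-to-right pass."""
    if not hits:
        return []
    spans: List[List[Hit]] = []
    current: List[Hit] = [hits[0]]
    for nxt in hits[1:]:
        cur_pos, cur_term = current[-1]
        nxt_pos, nxt_term = nxt
        gap = nxt_pos - cur_pos
        terms_in_span = [term for _, term in current]
        if gap > max_dis or nxt_term == cur_term:
            # Too far apart, or the same term twice in a row: start a new span.
            spans.append(current)
            current = [nxt]
        elif nxt_term in terms_in_span:
            # The term already occurs earlier in the span: cut at the bigger gap.
            k = terms_in_span.index(nxt_term)
            gap_after_duplicate = current[k + 1][0] - current[k][0]
            if gap_after_duplicate > gap:
                # Cut right after the earlier occurrence; the rest joins the new hit.
                spans.append(current[: k + 1])
                current = current[k + 1:] + [nxt]
            else:
                # Cut in front of the new hit (also used for ties, by assumption).
                spans.append(current)
                current = [nxt]
        else:
            current.append(nxt)
    spans.append(current)
    return spans


if __name__ == "__main__":
    # Hypothetical hits for a three-term query with terms a, b and c.
    hits = [(1, "a"), (6, "b"), (7, "c"), (8, "a")]
    for span in detect_spans(hits, max_dis=8):
        print(span)
```

Because every hit either joins the current span or opens a new one, each query-term occurrence ends up in exactly one span after the single pass, and no span contains the same term twice.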
Determining Spans
Looking at the example again, the algorithm marks the following spans:

UNESCO [Türkiye] Milli Komitesi Başkan Vekili ve Büyükelçi Pulat Tacar, [Türkiye Cumhuriyeti'nin Avrupa] değerleri çerçevesi içinde olduğunu söyledi. Doğuş Üniversitesi tarafından düzenlenen “[Avrupa Birliği - Türkiye ilişkileri]” konulu panelde konuşan Tacar, [Türkiye'nin Avrupa] Birliği yolunda büyük ilerlemeler kaydettiğini, ancak hala bazı eksiklikleri olduğunu belirtti.

Determining Spans
- We will try several different approaches for determining spans by modifying the algorithm.
- Since we have relevance data for the documents, we will be able to compare the different approaches.
- One example approach is to allow more than one occurrence of a query term in a span.

Ranking Example
The following two documents have the same number of query terms.

D1: UNESCO Türkiye Milli Komitesi Başkan Vekili ve Büyükelçi Pulat Tacar, Türkiye Cumhuriyeti'nin Avrupa değerleri çerçevesi içinde olduğunu söyledi. Doğuş Üniversitesi tarafından düzenlenen “Avrupa Birliği - Türkiye ilişkileri” konulu panelde konuşan Tacar, Türkiye'nin Avrupa Birliği yolunda büyük ilerlemeler kaydettiğini, ancak hala bazı eksiklikleri olduğunu belirtti.

D2: Rusya Devlet Başkanı Vladimir Putin'den önce, mesajları Türkiye'ye ulaştı. Artık Türkiye için Rusya, "komünizm tehlikesi", Rusya için Türkiye NATO üyesi hasım olmadığına göre, ilişkilere bu gözle bakmak iki ülkenin de yararına. Bunun Avrupa'ya alternatif bir blok anlayışı taşıması da gerekmez. Avrupa birliği bizim için son çare değildir. Türkiye'nin ulusal çıkarları doğrultusunda Avrupa dışında da temaslarımıza devam etmeliyiz.

Ranking Example
By running the span detection algorithm with MAX_DIS = 8, we obtain the following spans.

D1: UNESCO [Türkiye] Milli Komitesi Başkan Vekili ve Büyükelçi Pulat Tacar, [Türkiye Cumhuriyeti'nin Avrupa] değerleri çerçevesi içinde olduğunu söyledi. Doğuş Üniversitesi tarafından düzenlenen “[Avrupa Birliği - Türkiye ilişkileri]” konulu panelde konuşan Tacar, [Türkiye'nin Avrupa] Birliği yolunda büyük ilerlemeler kaydettiğini, ancak hala bazı eksiklikleri olduğunu belirtti.

D2: Rusya Devlet Başkanı Vladimir Putin'den önce, mesajları [Türkiye'ye] ulaştı. Artık [Türkiye] için Rusya, "komünizm tehlikesi", Rusya için [Türkiye NATO üyesi hasım olmadığına göre, ilişkilere bu gözle bakmak iki ülkenin de yararına. Bunun Avrupa'ya] alternatif bir blok anlayışı taşıması da gerekmez. [Avrupa] birliği bizim için son çare değildir. [Türkiye'nin ulusal çıkarları doğrultusunda Avrupa] dışında da temaslarımıza devam etmeliyiz.

Ranking Example
By using the formula proposed by Hawking et al., we calculate a score for each document with the following parameters:
- C = 1
- F(x) = x (identity function)
- L_max = MAX_DIS × (|Q| - 1) = 8 × 2 = 16
The resulting Score(D1) is higher than Score(D2). The results show that D1 is more relevant to the query and thus should be ranked higher than D2.

Input Data
We will use the Bilkent Information Retrieval Group 2006 queries run on Milliyet documents as data:
- Number of documents: 408,305
- Average article size: 234 tokens
- Total database size: 800 MB
- Number of evaluated queries: 52
- Average number of documents per query: 474
- Average number of relevant documents per query: 133
The current system retrieves documents according to its own ranking system.
We will use the query data (terms), the original rankings of the documents, the relevance judgments for the documents, and the full texts of the documents.

Evaluation Strategy
- Get data from the existing system.
- Re-rank the documents according to our method.
- Check whether relevant documents are ranked high and non-relevant documents are ranked lower (compare with the original rankings).
- If the re-ranking is successful, the precision values at the 11 standard recall levels should improve (using the TREC interpolation rule).

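The comparison at the 11 standard recall levels can be sketched as follows. The sketch assumes the usual interpolation rule, in which the interpolated precision at a recall level is the highest precision observed at that recall level or any higher one; the function name `eleven_point_precision` and the toy data are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def eleven_point_precision(ranking: Sequence[str], relevant: Iterable[str]) -> List[float]:
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    relevant_set = set(relevant)
    total = len(relevant_set)
    if total == 0:
        return [0.0] * 11
    # (number of relevant documents found so far, precision at that rank)
    points: List[Tuple[int, float]] = []
    found = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant_set:
            found += 1
            points.append((found, found / rank))
    result = []
    for level in range(11):
        # Recall found/total >= level/10, compared with integers to avoid rounding issues.
        candidates = [p for f, p in points if f * 10 >= level * total]
        result.append(max(candidates, default=0.0))
    return result


if __name__ == "__main__":
    relevant = {"d1", "d3"}
    original = ["d2", "d1", "d4", "d3"]   # hypothetical original ranking
    reranked = ["d1", "d3", "d2", "d4"]   # hypothetical re-ranked output
    print(eleven_point_precision(original, relevant))
    print(eleven_point_precision(reranked, relevant))
```

Averaging these 11 values over all evaluated queries, once for the original rankings and once for the re-ranked ones, gives two curves that can be compared level by level.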
Summary
- We have introduced the importance of proximity-based document relevance measures.
- This relevance information can be used to re-rank the output of a conventional IR system.
- We have introduced two main approaches for determining span scores and cumulative scores for each document.
- We have introduced an algorithm that chooses semantically related query terms and groups them into spans.
- We have shown on a simple example how a distance-based relevance measure can assign a higher score to a more relevant document.