***snippet->
|------------
8.7 Results snippets

Having chosen or ranked the documents matching a query, we wish to present a results list that will be informative to the user. In many cases the user will not want to examine all the returned documents and so we want to make the results list informative enough that the user can do a final ranking of the documents for themselves based on relevance to their information need.[3] The standard way of doing this is to provide a snippet, a short summary of the document, which is designed so as to allow the user to decide its relevance. Typically, the snippet consists of the document title and a short

3. There are exceptions, in domains where recall is emphasized. For instance, in many legal disclosure cases, a legal associate will review every document that matches a keyword search.
***unary code->
|------------
simplest bit-level code is unary code. The unary code of n is a string of n 1s followed by a 0 (see the first two columns of Table 5.5). Obviously, this is not a very efficient code, but it will come in handy in a moment.
|------------
A method that is within a factor of optimal is γ encoding. γ codes implement variable-length encoding by splitting the representation of a gap G into a pair of length and offset. Offset is G in binary, but with the leading 1 removed. For example, for 13 (binary 1101) offset is 101. Length encodes the length of offset in unary code. For 13, the length of offset is 3 bits, which is 1110 in unary. The γ code of 13 is therefore 1110101, the concatenation of length 1110 and offset 101. The right hand column of Table 5.5 gives additional examples of γ codes.
|------------
A γ code is decoded by first reading the unary code up to the 0 that terminates it, for example, the four bits 1110 when decoding 1110101. Now we know how long the offset is: 3 bits. The offset 101 can then be read correctly and the 1 that was chopped off in encoding is prepended: 101 → 1101 = 13.
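The encoding and decoding steps described above can be sketched in a few lines of Python. The function names are our own, not from the text:

```python
def unary(n):
    """Unary code: n ones followed by a terminating zero."""
    return "1" * n + "0"

def gamma_encode(n):
    """Gamma-encode a positive integer: unary length followed by offset."""
    assert n >= 1
    offset = bin(n)[3:]          # binary representation with the leading 1 removed
    length = unary(len(offset))  # length of offset, written in unary
    return length + offset

def gamma_decode(bits):
    """Decode a single gamma code from a bit string."""
    length = bits.index("0")                  # read the unary part up to its 0
    offset = bits[length + 1 : length + 1 + length]
    return int("1" + offset, 2)               # prepend the chopped-off leading 1
```

For 13 this reproduces the worked example: offset 101, length 1110, code 1110101.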
***bigram language model->
|------------
12.1.2 Types of language models

How do we build probabilities over sequences of terms? We can always use the chain rule from Equation (11.1) to decompose the probability of a sequence of events into the probability of each successive event conditioned on earlier events:

P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t1 t2) P(t4|t1 t2 t3)    (12.4)

The simplest form of language model simply throws away all conditioning context, and estimates each term independently. Such a model is called a unigram language model:

P_uni(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)    (12.5)

There are many more complex kinds of language models, such as bigram language models, which condition on the previous term:

P_bi(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)    (12.6)

and even more complex grammar-based language models such as probabilistic context-free grammars. Such models are vital for tasks like speech recognition, spelling correction, and machine translation, where you need the probability of a term conditioned on surrounding context. However, most language-modeling work in IR has used unigram language models.
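Equations (12.5) and (12.6) can be sketched with maximum likelihood estimates from a toy corpus. The corpus and function names below are our own illustrative assumptions:

```python
from collections import Counter

# A toy corpus, purely for illustration.
tokens = "the cat sat on the mat the cat ran".split()

uni = Counter(tokens)                  # unigram counts
bi = Counter(zip(tokens, tokens[1:]))  # bigram counts
N = len(tokens)

def p_uni(t):
    # MLE unigram probability P(t)
    return uni[t] / N

def p_bi(t, prev):
    # MLE conditional probability P(t | prev) = count(prev t) / count(prev)
    return bi[(prev, t)] / uni[prev]

def p_unigram_model(seq):
    # Equation (12.5): product of independent term probabilities
    p = 1.0
    for t in seq:
        p *= p_uni(t)
    return p

def p_bigram_model(seq):
    # Equation (12.6): condition each term on the previous one
    p = p_uni(seq[0])
    for prev, t in zip(seq, seq[1:]):
        p *= p_bi(t, prev)
    return p
```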
***polytomous classification->
***truecasing->
|------------
For English, an alternative to making every token lowercase is to just make some tokens lowercase. The simplest heuristic is to convert to lowercase words at the beginning of a sentence and all words occurring in a title that is all uppercase or in which most or all words are capitalized. These words are usually ordinary words that have been capitalized. Mid-sentence capitalized words are left as capitalized (which is usually correct). This will mostly avoid case-folding in cases where distinctions should be kept. The same task can be done more accurately by a machine learning sequence model which uses more features to make the decision of when to case-fold. This is known as truecasing. However, trying to get capitalization right in this way probably doesn't help if your users usually use lowercase regardless of the correct case of words. Thus, lowercasing everything often remains the most practical solution.
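The sentence-initial part of the simple heuristic can be sketched as follows. This is a minimal illustration of the rule described above, not a learned truecaser, and the function name is our own:

```python
def heuristic_case_fold(sentence):
    """Lowercase only the sentence-initial token; leave mid-sentence
    capitalized words (usually names) as they are."""
    tokens = sentence.split()
    if not tokens:
        return sentence
    tokens[0] = tokens[0].lower()  # sentence-initial capital is usually an ordinary word
    return " ".join(tokens)
```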
|------------
Lita et al. (2003) present a method for truecasing. Natural language processing work on computational morphology is presented in (Sproat 1992, Beesley and Karttunen 2003).
***accuracy->
|------------
Precision (P) is the fraction of retrieved documents that are relevant:

Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant|retrieved)    (8.1)

Recall (R) is the fraction of relevant documents that are retrieved:

Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved|relevant)    (8.2)

These notions can be made clear by examining the following contingency table:

(8.3)                 Relevant              Nonrelevant
      Retrieved       true positives (tp)   false positives (fp)
      Not retrieved   false negatives (fn)  true negatives (tn)

Then:

P = tp/(tp + fp)    (8.4)
R = tp/(tp + fn)

An obvious alternative that may occur to the reader is to judge an information retrieval system by its accuracy, that is, the fraction of its classifications that are correct. In terms of the contingency table above, accuracy = (tp + tn)/(tp + fp + fn + tn). This seems plausible, since there are two actual classes, relevant and nonrelevant, and an information retrieval system can be thought of as a two-class classifier which attempts to label them as such (it retrieves the subset of documents which it believes to be relevant).
|------------
There is a good reason why accuracy is not an appropriate measure for information retrieval problems. In almost all circumstances, the data is extremely skewed: normally over 99.9% of the documents are in the nonrelevant category. A system tuned to maximize accuracy can appear to perform well by simply deeming all documents nonrelevant to all queries. Even if the system is quite good, trying to label some documents as relevant will almost always lead to a high rate of false positives. However, labeling all documents as nonrelevant is completely unsatisfying to an information retrieval system user. Users are always going to want to see some documents, and can be assumed to have a certain tolerance for seeing some false positives provided that they get some useful information. The measures of precision and recall concentrate the evaluation on the return of true positives, asking what percentage of the relevant documents have been found and how many false positives have also been returned.
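The skew argument above can be made concrete with a small sketch. The collection sizes below are hypothetical numbers chosen only to illustrate the point:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# A hypothetical skewed collection: 10,000 documents, only 10 relevant.
# A system that retrieves nothing has tp = 0, fp = 0, fn = 10, tn = 9990,
# so its accuracy is 0.999 even though its recall is 0.
```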
***classification function->
|------------
Using a learning method or learning algorithm, we then wish to learn a classifier or classification function γ that maps documents to classes:

γ : X → C    (13.1)

This type of learning is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = γ. The learning method Γ takes the training set D as input and returns the learned classification function γ.
|------------
Figure 13.1 shows an example of text classification from the Reuters-RCV1 collection, introduced in Section 4.2, page 69. There are six classes (UK, China, . . . , sports), each with three training documents. We show a few mnemonic words for each document's content. The training set provides some typical examples for each class, so that we can learn the classification function γ.
***relative frequency->
|------------
This is referred to as the relative frequency of the event. Estimating the probability as the relative frequency is the maximum likelihood estimate (or MLE), because this value makes the observed data maximally likely. However, if we simply use the MLE, then the probability given to events we happened to see is usually too high, whereas other events may be completely unseen and giving them as a probability estimate their relative frequency of 0 is both an underestimate, and normally breaks our models, since anything multiplied by 0 is 0. Simultaneously decreasing the estimated probability of seen events and increasing the probability of unseen events is referred to as smoothing. One simple way of smoothing is to add a number α to each of the observed counts. These pseudocounts correspond to the use of a uniform distribution over the vocabulary as a Bayesian prior, following Equation (11.4). We initially assume a uniform distribution over events, where the size of α denotes the strength of our belief in uniformity, and we then update the probability based on observed events. Since our belief in uniformity is weak, we use α = 1/2. This is a form of maximum a posteriori (MAP) estimation, where we choose the most likely point value for probabilities based on the prior and the observed evidence, following Equation (11.4). We will further discuss methods of smoothing estimated counts to give probability models in Section 12.2.2 (page 243); the simple method of adding 1/2 to each observed count will do for now.
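The add-α estimate described above can be sketched as follows; α = 1/2 matches the pseudocount used in the text, and the function name is our own:

```python
def smoothed_estimate(count, total, vocab_size, alpha=0.5):
    """Add-alpha smoothed probability estimate: (count + alpha) divided by
    (total + alpha * vocab_size), so that unseen events (count = 0) get a
    small nonzero probability and the distribution still sums to 1."""
    return (count + alpha) / (total + alpha * vocab_size)
```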
***two-class classifier->
|------------
It is much smaller than and predates the Reuters-RCV1 collection discussed in Chapter 4 (page 69). The articles are assigned classes from a set of 118 topic categories. A document may be assigned several classes or none, but the commonest case is single assignment (documents with at least one class received an average of 1.24 classes). The standard approach to this any-of problem (Chapter 14, page 306) is to learn 118 two-class classifiers, one for each class, where the two-class classifier for class c is the classifier for the two classes c and its complement c̄.
***sentiment detection->
|------------
• Sentiment detection, or the automatic classification of a movie or product review as positive or negative. An example application is a user searching for negative reviews before buying a camera to make sure it has no undesirable features or quality problems.
***in XML retrieval->
***interpolated precision->
|------------
8.4 Evaluation of ranked retrieval results

Precision, recall, and the F measure are set-based measures. They are computed using unordered sets of documents. We need to extend these measures (or to define new measures) if we are to evaluate the ranked retrieval results that are now standard with search engines. In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents. For each such set, precision and recall values can be plotted to give a precision-recall curve, such as the one shown in Figure 8.2. Precision-recall curves have a distinctive saw-tooth shape: if the (k + 1)th document retrieved is nonrelevant then recall is the same as for the top k documents, but precision has dropped. If it is relevant, then both precision and recall increase, and the curve jags up and to the right. It is often useful to remove these jiggles and the standard way to do this is with an interpolated precision: the interpolated precision p_interp at a certain recall level r is defined as the highest precision found for any recall level r′ ≥ r:

p_interp(r) = max over r′ ≥ r of p(r′)
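Interpolated precision can be sketched directly from a ranked list of 0/1 relevance judgments. The function names below are our own:

```python
def pr_points(relevance, num_relevant):
    """(recall, precision) after each rank, from a 0/1 relevance list
    for the ranked results and the total number of relevant documents."""
    points, tp = [], 0
    for k, rel in enumerate(relevance, start=1):
        tp += rel
        points.append((tp / num_relevant, tp / k))
    return points

def interpolated_precision(points, r):
    """p_interp(r): the highest precision at any recall level >= r."""
    return max((p for rec, p in points if rec >= r), default=0.0)
```

For the ranking relevant, nonrelevant, relevant (with 2 relevant documents in total), the saw-tooth points are (0.5, 1.0), (0.5, 0.5), (1.0, 2/3), and interpolation smooths the dip at rank 2 away.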
***search engine marketing->
|------------
retrieval and microeconomics, and is beyond the scope of this book. For advertisers, understanding how search engines do this ranking and how to allocate marketing campaign budgets to different keywords and to different sponsored search engines has become a profession known as search engine marketing (SEM).
***link spam->
|------------
A doorway page contains text and metadata carefully chosen to rank highly on selected search keywords. When a browser requests the doorway page, it is redirected to a page containing content of a more commercial nature. More complex spamming techniques involve manipulation of the metadata related to a page, including (for reasons we will see in Chapter 21) the links into a web page. Given that spamming is inherently an economically motivated activity, there has sprung around it an industry of Search Engine Optimizers, or SEOs, who provide consultancy services for clients seeking to have their web pages rank highly on selected keywords. Web search engines frown on this business of attempting to decipher and adapt to their proprietary ranking techniques and indeed announce policies on forms of SEO behavior they do not tolerate (and have been known to shut down search requests from certain SEOs for violation of these). Inevitably, the parrying between such SEOs (who gradually infer features of each web search engine's ranking methods) and the web search engines (who adapt in response) is an unending struggle; indeed, the research sub-area of adversarial information retrieval has sprung up around this battle. One way to combat spammers who manipulate the text of their web pages is to exploit the link structure of the Web – a technique known as link analysis. The first web search engine known to apply link analysis on a large scale (to be detailed in Chapter 21) was Google, although all web search engines currently make use of it (and correspondingly, spammers now invest considerable effort in subverting it – this is known as link spam).
|------------
Clearly, not every citation or hyperlink implies such authority conferral; for this reason, simply measuring the quality of a web page by the number of in-links (citations from other pages) is not robust enough. For instance, one may contrive to set up multiple web pages pointing to a target web page, with the intent of artificially boosting the latter's tally of in-links. This phenomenon is referred to as link spam. Nevertheless, the phenomenon of citation is prevalent and dependable enough that it is feasible for web search engines to derive useful signals for ranking from more sophisticated link analysis. Link analysis also proves to be a useful indicator of what page(s) to crawl next while crawling the web; this is done by using link analysis to guide the priority assignment in the front queues of Chapter 20.
***machine translation->
|------------
12.1.2 Types of language models

How do we build probabilities over sequences of terms? We can always use the chain rule from Equation (11.1) to decompose the probability of a sequence of events into the probability of each successive event conditioned on earlier events:

P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t1 t2) P(t4|t1 t2 t3)    (12.4)

The simplest form of language model simply throws away all conditioning context, and estimates each term independently. Such a model is called a unigram language model:

P_uni(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)    (12.5)

There are many more complex kinds of language models, such as bigram language models, which condition on the previous term:

P_bi(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)    (12.6)

and even more complex grammar-based language models such as probabilistic context-free grammars. Such models are vital for tasks like speech recognition, spelling correction, and machine translation, where you need the probability of a term conditioned on surrounding context. However, most language-modeling work in IR has used unigram language models.
|------------
Basic LMs do not address issues of alternate expression, that is, synonymy, or any deviation in use of language between queries and documents. Berger and Lafferty (1999) introduce translation models to bridge this query-document gap. A translation model lets you generate query words not in a document by translation to alternate terms with similar meaning. This also provides a basis for performing cross-language IR. We assume that the translation model can be represented by a conditional probability distribution T(·|·) between vocabulary terms. The form of the translation query generation model is then:

P(q|Md) = ∏_{t∈q} ∑_{v∈V} P(v|Md) T(t|v)    (12.15)

The term P(v|Md) is the basic document language model, and the term T(t|v) performs translation. This model is clearly more computationally intensive and we need to build a translation model. The translation model is usually built using separate resources (such as a traditional thesaurus or bilingual dictionary or a statistical machine translation system's translation dictionary), but can be built using the document collection if there are pieces of text that naturally paraphrase or summarize other pieces of text. Candidate examples are documents and their titles or abstracts, or documents and anchor-text pointing to them in a hypertext environment.
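Equation (12.15) can be sketched directly: the toy probability tables below are our own illustrative assumptions, and with an identity translation table the model reduces to the basic query likelihood model:

```python
def translation_query_likelihood(query, p_doc, T):
    """Equation (12.15): P(q|Md) = product over t in q of
    sum over v of P(v|Md) * T(t|v).
    p_doc maps term v -> P(v|Md); T maps (t, v) -> T(t|v)."""
    prob = 1.0
    for t in query:
        prob *= sum(p_v * T.get((t, v), 0.0) for v, p_v in p_doc.items())
    return prob
```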
***assumption->
|------------
Again, if we knew the percentage of relevant documents in the collection, then we could use this number to estimate P(R = 1|~q) and P(R = 0|~q). Since a document is either relevant or nonrelevant to a query, we must have that:

P(R = 1|~x,~q) + P(R = 0|~x,~q) = 1    (11.9)

11.3.1 Deriving a ranking function for query terms

Given a query q, we wish to order returned documents by descending P(R = 1|d, q). Under the BIM, this is modeled as ordering by P(R = 1|~x,~q). Rather than estimating this probability directly, because we are interested only in the ranking of documents, we work with some other quantities which are easier to compute and which give the same ordering of documents. In particular, we can rank documents by their odds of relevance (as the odds of relevance is monotonic with the probability of relevance). This makes things easier, because we can ignore the common denominator in (11.8), giving:

O(R|~x,~q) = P(R = 1|~x,~q) / P(R = 0|~x,~q)
           = [P(R = 1|~q) P(~x|R = 1,~q) / P(~x|~q)] / [P(R = 0|~q) P(~x|R = 0,~q) / P(~x|~q)]
           = [P(R = 1|~q) / P(R = 0|~q)] · [P(~x|R = 1,~q) / P(~x|R = 0,~q)]    (11.10)

The left term in the rightmost expression of Equation (11.10) is a constant for a given query. Since we are only ranking documents, there is thus no need for us to estimate it. The right-hand term does, however, require estimation, and this initially appears to be difficult: How can we accurately estimate the probability of an entire term incidence vector occurring?
It is at this point that we make the Naive Bayes conditional independence assumption that the presence or absence of a word in a document is independent of the presence or absence of any other word (given the query):

P(~x|R = 1,~q) / P(~x|R = 0,~q) = ∏_{t=1}^{M} [P(xt|R = 1,~q) / P(xt|R = 0,~q)]    (11.11)

So:

O(R|~x,~q) = O(R|~q) · ∏_{t=1}^{M} [P(xt|R = 1,~q) / P(xt|R = 0,~q)]    (11.12)

Since each xt is either 0 or 1, we can separate the terms to give:

O(R|~x,~q) = O(R|~q) · ∏_{t:xt=1} [P(xt = 1|R = 1,~q) / P(xt = 1|R = 0,~q)] · ∏_{t:xt=0} [P(xt = 0|R = 1,~q) / P(xt = 0|R = 0,~q)]    (11.13)

Henceforth, let pt = P(xt = 1|R = 1,~q) be the probability of a term appearing in a document relevant to the query, and ut = P(xt = 1|R = 0,~q) be the probability of a term appearing in a nonrelevant document. These quantities can be visualized in the following contingency table, where the columns add to 1:

(11.14)                        document
                     relevant (R = 1)   nonrelevant (R = 0)
Term present xt = 1       pt                 ut
Term absent  xt = 0       1 − pt             1 − ut

Let us make an additional simplifying assumption that terms not occurring in the query are equally likely to occur in relevant and nonrelevant documents: that is, if qt = 0 then pt = ut. (This assumption can be changed, as when doing relevance feedback in Section 11.3.4.) Then we need only consider terms in the products that appear in the query, and so,

O(R|~q,~x) = O(R|~q) · ∏_{t:xt=qt=1} (pt/ut) · ∏_{t:xt=0,qt=1} [(1 − pt)/(1 − ut)]    (11.15)

The left product is over query terms found in the document and the right product is over query terms not found in the document.
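The document-dependent part of Equation (11.15) is usually computed in log space to avoid underflow. A minimal sketch, with toy pt and ut estimates of our own choosing:

```python
import math

def bim_log_odds_term(doc_terms, query_terms, p, u):
    """Log of the document-dependent products in Equation (11.15):
    query terms present in the document contribute log(pt/ut), and
    query terms absent contribute log((1-pt)/(1-ut)).
    p and u map each query term t to pt and ut respectively."""
    score = 0.0
    for t in query_terms:
        if t in doc_terms:
            score += math.log(p[t] / u[t])
        else:
            score += math.log((1 - p[t]) / (1 - u[t]))
    return score
```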
|------------
To reduce the number of parameters, we make the Naive Bayes conditional independence assumption. We assume that attribute values are independent of each other given the class:

Multinomial:  P(d|c) = P(〈t1, . . . , tnd〉|c) = ∏_{1≤k≤nd} P(Xk = tk|c)    (13.13)

Bernoulli:  P(d|c) = P(〈e1, . . . , eM〉|c) = ∏_{1≤i≤M} P(Ui = ei|c)    (13.14)

We have introduced two random variables here to make the two different generative models explicit. Xk is the random variable for position k in the document and takes as values terms from the vocabulary. P(Xk = t|c) is the probability that in a document of class c the term t will occur in position k. Ui is the random variable for vocabulary term i and takes as values 0 (absence) and 1 (presence). P̂(Ui = 1|c) is the probability that in a document of class c the term ti will occur – in any position and possibly multiple times.
|------------
We illustrate the conditional independence assumption in Figures 13.4 and 13.5.
|------------
In reality, the conditional independence assumption does not hold for text data. Terms are conditionally dependent on each other. But as we will discuss shortly, NB models perform well despite the conditional independence assumption.
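The two document likelihoods of Equations (13.13) and (13.14) can be sketched side by side; the conditional probability tables here are toy numbers of our own:

```python
from math import prod

def multinomial_likelihood(doc_tokens, cond_prob):
    # Equation (13.13): product over token positions of P(X_k = t_k | c)
    return prod(cond_prob[t] for t in doc_tokens)

def bernoulli_likelihood(doc_terms, vocab, cond_prob):
    # Equation (13.14): product over the whole vocabulary of
    # P(U_i = 1 | c) for present terms and P(U_i = 0 | c) for absent ones
    p = 1.0
    for t in vocab:
        p *= cond_prob[t] if t in doc_terms else (1 - cond_prob[t])
    return p
```

Note the key difference: the multinomial model multiplies one factor per token occurrence, while the Bernoulli model multiplies one factor per vocabulary term, including the absent ones.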
***rank->
|------------
Exercise 17.12  For N points, there are ≤ K^N different flat clusterings into K clusters (Section 16.2, page 356). What is the number of different hierarchical clusterings (or dendrograms) of N documents? Are there more flat clusterings or more hierarchical clusterings for given K and N?

18 Matrix decompositions and latent semantic indexing

On page 123 we introduced the notion of a term-document matrix: an M × N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection. Even for a collection of modest size, the term-document matrix C is likely to have several tens of thousands of rows and columns. In Section 18.1.1 we first develop a class of operations from linear algebra, known as matrix decomposition. In Section 18.2 we use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix. In Section 18.3 we examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing. While latent semantic indexing has not been established as a significant force in scoring and ranking for information retrieval, it remains an intriguing approach to clustering in a number of domains including for collections of text documents (Section 16.6, page 372). Understanding its full potential remains an area of active research.
|------------
18.1 Linear algebra review

We briefly review some necessary background in linear algebra. Let C be an M × N matrix with real-valued entries; for a term-document matrix, all entries are in fact non-negative. The rank of a matrix is the number of linearly independent rows (or columns) in it; thus, rank(C) ≤ min{M, N}. A square r × r matrix all of whose off-diagonal entries are zero is called a diagonal matrix; its rank is equal to the number of non-zero diagonal entries. If all r diagonal entries of such a diagonal matrix are 1, it is called the identity matrix of dimension r and represented by Ir.
|------------
For a square M × M matrix C and a vector ~x that is not all zeros, the values of λ satisfying

C~x = λ~x    (18.1)

are called the eigenvalues of C. The M-vector ~x satisfying Equation (18.1) for an eigenvalue λ is the corresponding right eigenvector. The eigenvector corresponding to the eigenvalue of largest magnitude is called the principal eigenvector. In a similar fashion, the left eigenvectors of C are the M-vectors ~y such that

~y^T C = λ~y^T    (18.2)

The number of non-zero eigenvalues of C is at most rank(C).
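The rank definition above can be sketched with Gaussian elimination in pure Python. This is a toy routine for small examples, not a numerically robust implementation:

```python
def matrix_rank(rows, eps=1e-10):
    """Rank = number of linearly independent rows, found by Gaussian
    elimination: repeatedly pick a pivot column, swap a nonzero row up,
    and eliminate that column from every other row."""
    m = [list(row) for row in rows]
    if not m:
        return 0
    rank, n_rows, n_cols = 0, len(m), len(m[0])
    for col in range(n_cols):
        pivot = next((r for r in range(rank, n_rows) if abs(m[r][col]) > eps), None)
        if pivot is None:
            continue                       # no independent direction in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(n_rows):
            if r != rank and abs(m[r][col]) > eps:
                f = m[r][col] / m[rank][col]
                for c in range(col, n_cols):
                    m[r][c] -= f * m[rank][c]
        rank += 1
    return rank
```

On a diagonal matrix this reproduces the claim in the text: the rank equals the number of non-zero diagonal entries.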
***proximity operator->
|------------
1.4 The extended Boolean model versus ranked retrieval

The Boolean retrieval model contrasts with ranked retrieval models such as the vector space model (Section 6.3), in which users largely use free text queries, that is, just typing one or more words rather than using a precise language with operators for building up query expressions, and the system decides which documents best satisfy the query. Despite decades of academic research on the advantages of ranked retrieval, systems implementing the Boolean retrieval model were the main or only search option provided by large commercial information providers for three decades until the early 1990s (approximately the date of arrival of the World Wide Web). However, these systems did not have just the basic Boolean operations (AND, OR, and NOT) which we have presented so far. A strict Boolean expression over terms with an unordered results set is too limited for many of the information needs that people have, and these systems implemented extended Boolean retrieval models by incorporating additional operators such as term proximity operators. A proximity operator is a way of specifying that two terms in a query must occur close to each other in a document, where closeness may be measured by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.
***merge algorithm->
|------------
The intersection operation is the crucial one: we need to efficiently intersect postings lists so as to be able to quickly find documents that contain both terms. (This operation is sometimes referred to as merging postings lists: this slightly counterintuitive name reflects using the term merge algorithm for a general family of algorithms that combine multiple sorted lists by interleaved advancing of pointers through each; here we are merging the lists with a logical AND operation.) There is a simple and effective method of intersecting postings lists using the merge algorithm (see Figure 1.6): we maintain pointers into both lists

INTERSECT(p1, p2)
 1  answer ← 〈 〉
 2  while p1 ≠ NIL and p2 ≠ NIL
 3  do if docID(p1) = docID(p2)
 4       then ADD(answer, docID(p1))
 5            p1 ← next(p1)
 6            p2 ← next(p2)
 7       else if docID(p1) < docID(p2)
 8         then p1 ← next(p1)
 9         else p2 ← next(p2)
10  return answer

◮ Figure 1.6  Algorithm for the intersection of two postings lists p1 and p2.
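The INTERSECT pseudocode of Figure 1.6 translates directly to Python, with index variables standing in for the list pointers:

```python
def intersect(p1, p2):
    """Intersection of two sorted postings lists by interleaved pointer
    advancing, mirroring the INTERSECT pseudocode of Figure 1.6."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID appears in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer
```

Because each step advances at least one pointer, the running time is linear in the combined length of the two lists.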
***informational queries->
|------------
Informational queries seek general information on a broad topic, such as leukemia or Provence. There is typically not a single web page that contains all the information sought; indeed, users with informational queries typically try to assimilate information from multiple web pages.
***impact->
|------------
sion – instead of docIDs we can compress smaller gaps between IDs, thus reducing space requirements for the index. However, this structure for the index is not optimal when we build ranked (Chapters 6 and 7) – as opposed to Boolean – retrieval systems. In ranked retrieval, postings are often ordered according to weight or impact, with the highest-weighted postings occurring first. With this organization, scanning of long postings lists during query processing can usually be terminated early when weights have become so small that any further documents can be predicted to be of low similarity to the query (see Chapter 6). In a docID-sorted index, new documents are always inserted at the end of postings lists. In an impact-sorted index (Section 7.1.5, page 140), the insertion can occur anywhere, thus complicating the update of the inverted index.
***in-links->
|------------
Figure 19.2 shows two nodes A and B from the web graph, each corresponding to a web page, with a hyperlink from A to B. We refer to the set of all such nodes and directed edges as the web graph. Figure 19.2 also shows that (as is the case with most links on web pages) there is some text surrounding the origin of the hyperlink on page A. This text is generally encapsulated in the href attribute of the <a> (for anchor) tag that encodes the hyperlink in the HTML code of page A, and is referred to as anchor text. As one might suspect, this directed graph is not strongly connected: there are pairs of pages such that one cannot proceed from one page of the pair to the other by following hyperlinks. We refer to the hyperlinks into a page as in-links and those out of a page as out-links. The number of in-links to a page (also known as its in-degree) has averaged from roughly 8 to 15, in a range of studies. We similarly define the out-degree of a web page to be the number of links out of it.

◮ Figure 19.3  A sample small web graph. In this example we have six pages labeled A-F. Page B has in-degree 3 and out-degree 1. This example graph is not strongly connected: there is no path from any of pages B-F to page A.
|------------
Clearly, not every citation or hyperlink implies such authority conferral; for this reason, simply measuring the quality of a web page by the number of in-links (citations from other pages) is not robust enough. For instance, one may contrive to set up multiple web pages pointing to a target web page, with the intent of artificially boosting the latter's tally of in-links. This phenomenon is referred to as link spam. Nevertheless, the phenomenon of citation is prevalent and dependable enough that it is feasible for web search engines to derive useful signals for ranking from more sophisticated link analysis. Link analysis also proves to be a useful indicator of what page(s) to crawl next while crawling the web; this is done by using link analysis to guide the priority assignment in the front queues of Chapter 20.
***semistructured retrieval->
|------------
We call XML retrieval structured retrieval in this chapter. Some researchers prefer the term semistructured retrieval to distinguish XML retrieval from database querying. We have adopted the terminology that is widespread in the XML retrieval community. For instance, the standard way of referring to XML queries is structured queries, not semistructured queries. The term structured retrieval is rarely used for database querying and it always refers to XML retrieval in this book.
***linear classifier->
|------------
14.4 Linear versus nonlinear classifiers

In this section, we show that the two learning methods Naive Bayes and Rocchio are instances of linear classifiers, perhaps the most important group of text classifiers, and contrast them with nonlinear classifiers. To simplify the discussion, we will only consider two-class classifiers in this section and define a linear classifier as a two-class classifier that decides class membership by comparing a linear combination of the features to a threshold.
|------------
In two dimensions, a linear classifier is a line. Five examples are shown in Figure 14.8. These lines have the functional form w1x1 + w2x2 = b. The classification rule of a linear classifier is to assign a document to c if w1x1 + w2x2 > b and to c̄ if w1x1 + w2x2 ≤ b. Here, (x1, x2)^T is the two-dimensional vector representation of the document and (w1, w2)^T is the parameter vector.

APPLYLINEARCLASSIFIER(~w, b, ~x)
1  score ← ∑_{i=1}^{M} wi xi
2  if score > b
3    then return 1
4    else return 0

◮ Figure 14.9  Linear classification algorithm.
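The algorithm of Figure 14.9 is a one-liner in Python:

```python
def apply_linear_classifier(w, b, x):
    """Return 1 if the linear combination of features exceeds the
    threshold b, else 0, as in Figure 14.9."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > b else 0
```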
|------------
Without loss of generality, a linear classifier will use a linear combination of features of the form

Score(d, q) = Score(α, ω) = aα + bω + c,    (15.17)

with the coefficients a, b, c to be learned from the training data. While it is

◮ Figure 15.7  A collection of training examples. Each R denotes a training example labeled relevant, while each N is a training example labeled nonrelevant. [Plot axes: cosine score vs. term proximity.]
|------------
In this setting, the function Score(α, ω) from Equation (15.17) represents a plane "hanging above" Figure 15.7. Ideally this plane (in the direction perpendicular to the page containing Figure 15.7) assumes values close to 1 above the points marked R, and values close to 0 above the points marked N. Since a plane is unlikely to assume only values close to 0 or 1 above the training sample points, we make use of thresholding: given any query and document for which we wish to determine relevance, we pick a value θ and if Score(α, ω) > θ we declare the document to be relevant, else we declare the document to be nonrelevant. As we know from Figure 14.8 (page 301), all points that satisfy Score(α, ω) = θ form a line (shown as a dashed line in Figure 15.7) and we thus have a linear classifier that separates relevant from nonrelevant instances. Geometrically, we can find the separating line as follows. Consider the line passing through the plane Score(α, ω) whose height is θ above the page containing Figure 15.7. Project this line down onto Figure 15.7; this will be the dashed line in Figure 15.7. Then, any subsequent query/document pair that falls below the dashed line in Figure 15.7 is deemed nonrelevant; above the dashed line, relevant.
***Reuters-RCV1->
|------------
We work with the Reuters-RCV1 collection as our model collection in this chapter, a collection with roughly 1 GB of text. It consists of about 800,000 documents that were sent over the Reuters newswire during a 1-year period between August 20, 1996, and August 19, 1997. A typical document is shown in Figure 4.1, but note that we ignore multimedia information like images in this book and are only concerned with text. Reuters-RCV1 covers a wide range of international topics, including politics, business, sports, and (as in this example) science. Some key statistics of the collection are shown in Table 4.2.
|------------
See: http://www.clef-campaign.org/

Reuters-21578 and Reuters-RCV1. For text classification, the most used test collection has been the Reuters-21578 collection of 21578 newswire articles; see Chapter 13, page 279. More recently, Reuters released the much larger Reuters Corpus Volume 1 (RCV1), consisting of 806,791 documents; see Chapter 4, page 69. Its scale and rich annotation make it a better basis for future research.
***schema diversity->
|------------
In many cases, several different XML schemas occur in a collection since the XML documents in an IR application often come from more than one source. This phenomenon is called schema heterogeneity or schema diversity and presents yet another challenge. As illustrated in Figure 10.6, comparable elements may have different names: creator in d2 vs. author in d3. In other cases, the structural organization of the schemas may be different: Author names are direct descendants of the node author in q3, but there are the intervening nodes firstname and lastname in d3. If we employ strict matching of trees, then q3 will retrieve neither d2 nor d3 although both documents are relevant. Some form of approximate matching of element names in combination with semi-automatic matching of different document structures can help here. Human editing of correspondences of elements in different schemas will usually do better than automatic methods.
***unsupervised learning->
|------------
Clustering is the most common form of unsupervised learning. No supervision means that there is no human expert who has assigned documents to classes. In clustering, it is the distribution and makeup of the data that will determine cluster membership. A simple example is Figure 16.1. It is visually clear that there are three distinct clusters of points. This chapter and Chapter 17 introduce algorithms that find such clusters in an unsupervised fashion.
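One algorithm that finds such clusters, K-means, is introduced in Chapter 16; a bare-bones sketch over made-up 2D points follows. The naive seeding (first k points) is an assumption for illustration only and works here because the first three points happen to lie in different groups; real implementations choose seeds more carefully.

```python
def kmeans(points, k, iters=20):
    """Plain K-means: repeatedly assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    centroids = [tuple(p) for p in points[:k]]  # naive seeding with the first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[i] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, clusters

# Three visually distinct groups of points, in the spirit of Figure 16.1:
points = [(0.0, 0.0), (5.0, 5.0), (0.1, 9.9), (0.2, 0.1), (5.1, 4.9), (0.0, 10.1)]
_, clusters = kmeans(points, 3)
print(sorted(len(c) for c in clusters))  # [2, 2, 2]
```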
|------------
The difference between clustering and classification may not seem great at first. After all, in both cases we have a partition of a set of documents into groups. But as we will see, the two problems are fundamentally different. Classification is a form of supervised learning (Chapter 13, page 256): our goal is to replicate a categorical distinction that a human supervisor imposes on the data. In unsupervised learning, of which clustering is the most important example, we have no such teacher to guide us.
***sponsored search->
|------------
Several aspects of Goto’s model are worth highlighting. First, a user typing the query q into Goto’s search interface was actively expressing an interest and intent related to the query q. For instance, a user typing golf clubs is more likely to be imminently purchasing a set than one who is simply browsing news on golf. Second, Goto only got compensated when a user actually expressed interest in an advertisement – as evinced by the user clicking the advertisement. Taken together, these created a powerful mechanism by which to connect advertisers to consumers, quickly raising the annual revenues of Goto/Overture into hundreds of millions of dollars. This style of search engine came to be known variously as sponsored search or search advertising. Given these two kinds of search engines – the “pure” search engines such as Google and Altavista, versus the sponsored search engines – the logical next step was to combine them into a single user experience. Current search engines follow precisely this model: they provide pure search results (generally known as algorithmic search results) as the primary response to a user’s search, together with sponsored search results displayed separately and distinctively to the right of the algorithmic results. This is shown in Figure 19.6. Retrieving sponsored search results and ranking them in response to a query has now become considerably more sophisticated than the simple Goto scheme; the process entails a blending of ideas from information

◮ Figure 19.6 Search advertising triggered by query keywords. Here the query A320 returns algorithmic search results about the Airbus aircraft, together with advertisements for various non-aircraft goods numbered A320, that advertisers seek to market to those querying on this query. The lack of advertisements for the aircraft reflects the fact that few marketers attempt to sell A320 aircraft on the web.
***support vector->
|------------
◮ Figure 15.1 The support vectors are the 5 points right up against the margin of the classifier. (The plot shows two classes of points separated by the maximum margin decision hyperplane; the support vectors lie on the margin, which is maximized.)
|------------
15.1 Support vector machines: The linearly separable case

For two-class, separable training data sets, such as the one in Figure 14.8 (page 301), there are lots of possible linear separators. Intuitively, a decision boundary drawn in the middle of the void between data items of the two classes seems better than one which approaches very close to examples of one or both classes. While some learning methods such as the perceptron algorithm (see references in Section 14.7, page 314) find just any linear separator, others, like Naive Bayes, search for the best linear separator according to some criterion. The SVM in particular defines the criterion to be looking for a decision surface that is maximally far away from any data point. This distance from the decision surface to the closest data point determines the margin of the classifier. This method of construction necessarily means that the decision function for an SVM is fully specified by a (usually small) subset of the data which defines the position of the separator. These points are referred to as the support vectors (in a vector space, a point can be thought of as a vector between the origin and that point). Figure 15.1 shows the margin and support vectors for a sample problem. Other data points play no part in determining the decision surface that is chosen.
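The margin of a candidate separator can be computed directly: the geometric distance of a point x from the hyperplane w·x = b is |w·x − b| / ‖w‖, and the distance to the closest data point is what the SVM maximizes. A small sketch (the hyperplane and points are made-up illustrative values, not from the text):

```python
import math

def distance_to_closest_point(w, b, points):
    """Distance from the hyperplane w . x = b to the nearest point;
    the margin of the separator is twice this value."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, p)) - b) for p in points) / norm

# Hyperplane x1 + x2 = 2; the closest point (2, 1) lies at distance 1/sqrt(2).
pts = [(2.0, 1.0), (3.0, 3.0), (0.0, 0.0)]
print(distance_to_closest_point([1.0, 1.0], 2.0, pts))
```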
***document likelihood model->
|------------
There are other ways to think of using the language modeling idea in IR settings, and many of them have been tried in subsequent work. Rather than looking at the probability of a document language model Md generating the query, you can look at the probability of a query language model Mq generating the document. The main reason that doing things in this direction and creating a document likelihood model is less appealing is that there is much less text available to estimate a language model based on the query text, and so the model will be worse estimated, and will have to depend more on being smoothed with some other language model. On the other hand, it is easy to see how to incorporate relevance feedback into such a model: you can expand the query with terms taken from relevant documents in the usual way and hence update the language model Mq (Zhai and Lafferty 2001a). Indeed, with appropriate modeling choices, this approach leads to the BIM model of Chapter 11. The relevance model of Lavrenko and Croft (2001) is an instance of a document likelihood model, which incorporates pseudo-relevance feedback into a language modeling approach. It achieves very strong empirical results.
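For contrast, the standard direction — scoring by P(q|Md) with the document model smoothed against a collection model (linear interpolation, as in Chapter 12) — can be sketched as follows. The toy documents and the value of lambda are illustrative assumptions only; the sketch assumes query terms occur somewhere in the collection.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """log P(q | Md): each query term's probability mixes the document
    language model with the collection language model, weighted by lam."""
    d, c = Counter(doc), Counter(collection)
    score = 0.0
    for t in query:
        p_doc = d[t] / len(doc)
        p_col = c[t] / len(collection)  # smoothing term; keeps unseen terms nonzero
        score += math.log(lam * p_doc + (1 - lam) * p_col)
    return score

docs = [["click", "shears"], ["metal", "shears", "shears"]]
collection = [t for d in docs for t in d]
scores = [query_likelihood(["shears"], d, collection) for d in docs]
print(scores)  # the second document, with more occurrences of "shears", scores higher
```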
***standing query->
|------------
13 Text classification and Naive Bayes

Thus far, this book has mainly discussed the process of ad hoc retrieval, where users have transient information needs that they try to address by posing one or more queries to a search engine. However, many users have ongoing information needs. For example, you might need to track developments in multicore computer chips. One way of doing this is to issue the query multicore AND computer AND chip against an index of recent newswire articles each morning. In this and the following two chapters we examine the question: How can this repetitive task be automated? To this end, many systems support standing queries. A standing query is like any other query except that it is periodically executed on a collection to which new documents are incrementally added over time.
|------------
If your standing query is just multicore AND computer AND chip, you will tend to miss many relevant new articles which use other terms such as multicore processors. To achieve good recall, standing queries thus have to be refined over time and can gradually become quite complex. In this example, using a Boolean search engine with stemming, you might end up with a query like (multicore OR multi-core) AND (chip OR processor OR microprocessor).
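The refined Boolean standing query from the example amounts to a predicate over a document's term set; a minimal sketch (the terms are just those from the text):

```python
def matches_standing_query(doc_terms):
    """(multicore OR multi-core) AND (chip OR processor OR microprocessor)"""
    t = set(doc_terms)
    return bool(t & {"multicore", "multi-core"}) and \
           bool(t & {"chip", "processor", "microprocessor"})

print(matches_standing_query(["new", "multicore", "processor", "announced"]))  # True
print(matches_standing_query(["multicore", "architecture"]))                   # False
```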
|------------
To capture the generality and scope of the problem space to which standing queries belong, we now introduce the general notion of a classification problem. Given a set of classes, we seek to determine which class(es) a given object belongs to. In the example, the standing query serves to divide new newswire articles into the two classes: documents about multicore computer chips and documents not about multicore computer chips. We refer to this as two-class classification. Classification using standing queries is also called routing or filtering and will be discussed further in Section 15.3.1 (page 335). A class need not be as narrowly focused as the standing query multicore computer chips. Often, a class is a more general subject area like China or coffee.
***generative model->
|------------
12.1 Language models

12.1.1 Finite automata and language models

What do we mean by a document model generating a query? A traditional generative model of a language, of the kind familiar from formal language theory, can be used either to recognize or to generate strings. For example, the finite automaton shown in Figure 12.1 can generate strings that include the examples shown: I wish, I wish I wish, I wish I wish I wish, and so on. The full set of strings that can be generated is called the language of the automaton.1
|------------
We first need to state our objective in text classification more precisely. In Section 13.1 (page 256), we said that we want to minimize classification error on the test set. The implicit assumption was that training documents and test documents are generated according to the same underlying distribution. We will denote this distribution P(〈d, c〉) where d is the document and c its label or class. Figures 13.4 and 13.5 were examples of generative models that decompose P(〈d, c〉) into the product of P(c) and P(d|c). Figures 14.10 and 14.11 depict generative models for 〈d, c〉 with d ∈ R2 and c ∈ {square, solid circle}.
|------------
Linear methods like Rocchio and Naive Bayes have a high bias for nonlinear problems because they can only model one type of class boundary, a linear hyperplane. If the generative model P(〈d, c〉) has a complex nonlinear class boundary, the bias term in Equation (14.11) will be high because a large number of points will be consistently misclassified. For example, the circular enclave in Figure 14.11 does not fit a linear model and will be misclassified consistently by linear classifiers.
***BM25 weights->
|------------
11.4.3 Okapi BM25: a non-binary model

The BIM was originally designed for short catalog records and abstracts of fairly consistent length, and it works reasonably in these contexts, but for modern full-text search collections, it seems clear that a model should pay attention to term frequency and document length, as in Chapter 6. The BM25 weighting scheme, often called Okapi weighting, after the system in which it was first implemented, was developed as a way of building a probabilistic model sensitive to these quantities while not introducing too many additional parameters into the model (Spärck Jones et al. 2000). We will not develop the full theory behind the model here, but just present a series of forms that build up to the standard form now used for document scoring. The simplest score for document d is just idf weighting of the query terms present, as in Equation (11.22):

RSVd = ∑ t∈q log (N / dft)    (11.30)

Sometimes, an alternative version of idf is used. If we start with the formula in Equation (11.21) but in the absence of relevance feedback information we estimate that S = s = 0, then we get an alternative idf formulation as follows:

RSVd = ∑ t∈q log ((N − dft + 1/2) / (dft + 1/2))    (11.31)

This variant behaves slightly strangely: if a term occurs in over half the documents in the collection then this model gives a negative term weight, which is presumably undesirable. But, assuming the use of a stop list, this normally doesn’t happen, and the value for each summand can be given a floor of 0.
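Equations (11.30) and (11.31) translate directly into code; N and the document frequencies below are made-up values for illustration, and (11.31) is given the floor of 0 that the text suggests:

```python
import math

def rsv_idf(N, dfs):
    """Equation (11.30): sum of plain idf weights over the query terms present."""
    return sum(math.log(N / df) for df in dfs)

def rsv_alt_idf(N, dfs):
    """Equation (11.31): RSJ-style idf, negative for df > N/2, so floored at 0."""
    return sum(max(0.0, math.log((N - df + 0.5) / (df + 0.5))) for df in dfs)

N = 1000
print(rsv_idf(N, [10, 100]))   # rare terms contribute large weights
print(rsv_alt_idf(N, [600]))   # df > N/2 would go negative; floored to 0.0
```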
***schema->
|------------
We also need the concept of schema in this chapter. A schema puts constraints on the structure of allowable XML documents for a particular application. A schema for Shakespeare’s plays may stipulate that scenes can only occur as children of acts and that only acts and scenes have the number attribute. Two standards for schemas for XML documents are XML DTD (document type definition) and XML Schema. Users can only write structured queries for an XML retrieval system if they have some minimal knowledge about the schema of the collection.
***BSBI->
|------------
With main memory insufficient, we need to use an external sorting algorithm, that is, one that uses disk. For acceptable speed, the central requirement of such an algorithm is that it minimize the number of random disk seeks during sorting – sequential disk reads are far faster than seeks as we explained in Section 4.1. One solution is the blocked sort-based indexing algorithm or BSBI in Figure 4.2. BSBI (i) segments the collection into parts of equal size, (ii) sorts the termID–docID pairs of each part in memory, (iii) stores intermediate sorted results on disk, and (iv) merges all intermediate results into the final index.

BSBINDEXCONSTRUCTION()
1 n ← 0
2 while (all documents have not been processed)
3 do n ← n + 1
4    block ← PARSENEXTBLOCK()
5    BSBI-INVERT(block)
6    WRITEBLOCKTODISK(block, fn)
7 MERGEBLOCKS(f1, . . . , fn; fmerged)

◮ Figure 4.2 Blocked sort-based indexing. The algorithm stores inverted blocks in files f1, . . . , fn and the merged index in fmerged.
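A toy in-memory version of BSBI can make the block/merge structure concrete. This is a sketch of the scheme, not the book's implementation: sorted lists stand in for the on-disk block files, and the documents are made up.

```python
import heapq
from itertools import groupby

def bsbi_index(docs, block_size=2):
    """Toy BSBI: invert fixed-size blocks of (docID, text) items, keep each
    block sorted, then k-way merge the blocks into one postings dictionary."""
    blocks = []
    items = list(docs.items())
    for i in range(0, len(items), block_size):
        pairs = sorted((term, doc_id)                       # BSBI-INVERT: sort pairs
                       for doc_id, text in items[i:i + block_size]
                       for term in set(text.split()))
        blocks.append(pairs)                                # WRITEBLOCKTODISK, in spirit
    merged = heapq.merge(*blocks)                           # MERGEBLOCKS
    return {term: [d for _, d in group]
            for term, group in groupby(merged, key=lambda p: p[0])}

docs = {1: "brutus killed caesar", 2: "caesar was noble", 3: "brutus was with caesar"}
index = bsbi_index(docs)
print(index["caesar"])  # [1, 2, 3]
```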
|------------
How expensive is BSBI? Its time complexity is Θ(T log T) because the step with the highest time complexity is sorting, and T is an upper bound for the number of items we must sort (i.e., the number of termID–docID pairs).

◮ Figure 4.3 Merging in blocked sort-based indexing. Two blocks (“postings lists to be merged”) are loaded from disk into memory, merged in memory (“merged postings lists”) and written back to disk. We show terms instead of termIDs for better readability.
  Postings lists to be merged:
    brutus: d1,d3            brutus: d6,d7
    caesar: d1,d2,d4         caesar: d8,d9
    noble:  d5               julius: d10
    with:   d1,d2,d3,d5      killed: d8
  Merged postings lists:
    brutus: d1,d3,d6,d7
    caesar: d1,d2,d4,d8,d9
    julius: d10
    killed: d8
    noble:  d5
    with:   d1,d2,d3,d5
***zone search->
|------------
There is a second type of information retrieval problem that is intermediate between unstructured retrieval and querying a relational database: parametric and zone search, which we discussed in Section 6.1 (page 110). In the data model of parametric and zone search, there are parametric fields (relational attributes like date or file-size) and zones – text attributes that each take a chunk of unstructured text as value, e.g., author and title in Figure 6.1 (page 111). The data model is flat, that is, there is no nesting of attributes.
|------------
The number of attributes is small. In contrast, XML documents have the more complex tree structure that we see in Figure 10.2 in which attributes are nested. The number of attributes and nodes is greater than in parametric and zone search.
***CAS topics->
|------------
Exercise 10.3 How many structural terms does the document in Figure 10.1 yield?

10.4 Evaluation of XML retrieval

The premier venue for research on XML retrieval is the INEX (INitiative for the Evaluation of XML retrieval) program, a collaborative effort that has produced reference collections, sets of queries, and relevance judgments. A yearly INEX meeting is held to present and discuss research results.

◮ Table 10.2 INEX 2002 collection statistics.
  12,107      number of documents
  494 MB      size
  1995–2002   time of publication of articles
  1,532       average number of XML nodes per document
  6.9         average depth of a node
  30          number of CAS topics
  30          number of CO topics
|------------
Two types of information needs or topics in INEX are content-only or CO topics and content-and-structure (CAS) topics. CO topics are regular keyword queries as in unstructured information retrieval. CAS topics have structural constraints in addition to keywords. We already encountered an example of a CAS topic in Figure 10.3. The keywords in this case are summer and holidays and the structural constraints specify that the keywords occur in a section that in turn is part of an article and that this article has an embedded year attribute with value 2001 or 2002.
***learning error->
|------------
For learning methods, we adopt as our goal to find a Γ that, averaged over training sets, learns classifiers γ with minimal MSE. We can formalize this as minimizing learning error:

learning-error(Γ) = ED[MSE(Γ(D))]    (14.7)

where ED is the expectation over labeled training sets. To keep things simple, we can assume that training sets have a fixed size – the distribution P(〈d, c〉) then defines a distribution P(D) over training sets.
|------------
We can use learning error as a criterion for selecting a learning method in statistical text classification. A learning method Γ is optimal for a distribution P(D) if it minimizes the learning error.
***master node->
|------------
The distributed index construction method we describe in this section is an application of MapReduce, a general architecture for distributed computing. MapReduce is designed for large computer clusters. The point of a cluster is to solve large computing problems on cheap commodity machines or nodes that are built from standard parts (processor, memory, disk) as opposed to on a supercomputer with specialized hardware. Although hundreds or thousands of machines are available in such clusters, individual machines can fail at any time. One requirement for robust distributed indexing is, therefore, that we divide the work up into chunks that we can easily assign and – in case of failure – reassign. A master node directs the process of assigning and reassigning tasks to individual worker nodes.
|------------
The map and reduce phases of MapReduce split up the computing job into chunks that standard machines can process in a short time. The various steps of MapReduce are shown in Figure 4.5 and an example on a collection consisting of two documents is shown in Figure 4.6. First, the input data, in our case a collection of web pages, are split into n splits where the size of the split is chosen to ensure that the work can be distributed evenly (chunks should not be too large) and efficiently (the total number of chunks we need to manage should not be too large); 16 or 64 MB are good sizes in distributed indexing. Splits are not preassigned to machines, but are instead assigned by the master node on an ongoing basis: As a machine finishes processing one split, it is assigned the next one. If a machine dies or becomes a laggard due to hardware problems, the split it is working on is simply reassigned to another machine.
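The map and reduce phases for index construction can be sketched in-process; this toy stand-in for the cluster machinery (real splits live on different machines, and the example documents are made up) shows the shape of the data flowing between phases:

```python
from collections import defaultdict

def map_phase(split):
    """Parser: emit one (term, docID) pair per term occurrence in the split."""
    return [(term, doc_id) for doc_id, text in split for term in text.split()]

def reduce_phase(pairs):
    """Inverter: collect all pairs for the same term into one postings list."""
    postings = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)
    return dict(postings)

splits = [[(1, "caesar died"), (2, "caesar was killed")],  # split for machine A
          [(3, "brutus killed caesar")]]                   # split for machine B
pairs = [p for s in splits for p in map_phase(s)]          # map outputs, collected
index = reduce_phase(pairs)
print(index["caesar"])  # [1, 2, 3]
```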
***test data->
|------------
Once we have learned γ, we can apply it to the test set (or test data), for example, the new document first private Chinese airline whose class is unknown.
***transductive SVMs->
|------------
Here, the theoretically interesting answer is to try to apply semi-supervised training methods. This includes methods such as bootstrapping or the EM algorithm, which we will introduce in Section 16.5 (page 368). In these methods, the system gets some labeled documents, and a further large supply of unlabeled documents over which it can attempt to learn. One of the big advantages of Naive Bayes is that it can be straightforwardly extended to be a semi-supervised learning algorithm, but for SVMs, there is also semi-supervised learning work which goes under the title of transductive SVMs. See the references for pointers.
***balanced F measure->
|------------
A single measure that trades off precision versus recall is the F measure, which is the weighted harmonic mean of precision and recall:

F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R)   where β² = (1 − α)/α    (8.5)

where α ∈ [0, 1] and thus β² ∈ [0, ∞]. The default balanced F measure equally weights precision and recall, which means making α = 1/2 or β = 1. It is commonly written as F1, which is short for Fβ=1, even though the formulation in terms of α more transparently exhibits the F measure as a weighted harmonic mean. When using β = 1, the formula on the right simplifies to:

Fβ=1 = 2PR / (P + R)    (8.6)

However, using an even weighting is not the only choice. Values of β < 1 emphasize precision, while values of β > 1 emphasize recall. For example, a value of β = 3 or β = 5 might be used if recall is to be emphasized. Recall, precision, and the F measure are inherently measures between 0 and 1, but they are also very commonly written as percentages, on a scale between 0 and 100.
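Equation (8.5) in its beta form is a one-liner; the precision and recall values below are made up for illustration:

```python
def f_measure(precision, recall, beta=1.0):
    """Equation (8.5): (beta^2 + 1) P R / (beta^2 P + R); beta=1 gives F1."""
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

P, R = 0.6, 0.3
print(f_measure(P, R))          # balanced F1 = 2PR/(P+R) = 0.4
print(f_measure(P, R, beta=3))  # beta > 1 pulls the score toward recall
```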
***stop list->
|------------
2.2.2 Dropping common terms: stop words

Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing. An example of a stop list is shown in Figure 2.5. Using a stop list significantly reduces the number of postings that a system has to store; we will present some statistics on this in Chapter 5 (see Table 5.1, page 87). And a lot of the time not indexing stop words does little harm: keyword searches with terms like the and by don’t seem very useful. However, this is not true for phrase searches. The phrase query “President of the United States”, which contains two stop words, is more precise than President AND “United States”. The meaning of flights to London is likely to be lost if the word to is stopped out. A search for Vannevar Bush’s article As we may think will be difficult if the first three words are stopped out, and the system searches simply for documents containing the word think. Some special query types are disproportionately affected. Some song titles and well known pieces of verse consist entirely of words that are commonly on stop lists (To be or not to be, Let It Be, I don’t want to be, . . . ).
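The general strategy above — sort terms by collection frequency and take the most frequent as candidates — can be sketched in a few lines (the toy collection and the cutoff n are illustrative; the hand-filtering step stays manual):

```python
from collections import Counter

def stop_list_candidates(docs, n=3):
    """Return the n terms with highest collection frequency (total number of
    occurrences across all documents) as stop-list candidates for hand-filtering."""
    freq = Counter(term for doc in docs for term in doc.split())
    return [term for term, _ in freq.most_common(n)]

docs = ["the cat sat on the mat", "the dog and the cat", "a cat and a hat"]
print(stop_list_candidates(docs))  # 'the' (4 occurrences) and 'cat' (3) lead
```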
|------------
The general trend in IR systems over time has been from standard use of quite large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list whatsoever. Web search engines generally do not use stop lists. Some of the design of modern IR systems has focused precisely on how we can exploit the statistics of language so as to be able to cope with common words in better ways. We will show in Section 5.3 (page 95) how good compression techniques greatly reduce the cost of storing the postings for common words. Section 6.2.1 (page 117) then discusses how standard term weighting leads to very common words having little impact on document rankings. Finally, Section 7.1.5 (page 140) shows how an IR system with impact-sorted indexes can terminate scanning a postings list early when weights get small, and hence common words do not cause a large additional processing cost for the average query, even though postings lists for stop words are very long. So for most modern IR systems, the additional cost of including stop words is not that big – neither in terms of index size nor in terms of query processing time.
***vocabulary->
|------------
We keep a dictionary of terms (sometimes also referred to as a vocabulary or lexicon; in this book, we use dictionary for the data structure and vocabulary for the set of terms). Then for each term, we have a list that records which documents the term occurs in. Each item in the list – which records that a term appeared in a document (and, later, often, the positions in the document) – is conventionally called a posting.4 The list is then called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings. The dictionary in Figure 1.3 has been sorted alphabetically and each postings list is sorted by document ID. We will see why this is useful in Section 1.3, below, but later we will also consider alternatives to doing this (Section 7.1.5).
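The dictionary-plus-postings structure can be built in a few lines; a sketch over a made-up toy collection (a Python dict plays the dictionary, lists of docIDs play the postings lists):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings list of docIDs sorted in increasing order;
    docs is {docID: text}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                     # visit docIDs in order so
        for term in set(docs[doc_id].split()):      # each postings list is sorted
            index[term].append(doc_id)
    return dict(index)

docs = {1: "brutus killed caesar", 2: "caesar the noble", 4: "brutus and caesar"}
index = build_inverted_index(docs)
print(index["brutus"])  # [1, 4]
```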
***document partitioning->
|------------
Two obvious alternative index implementations suggest themselves: partitioning by terms, also known as global index organization, and partitioning by documents, also known as local index organization. In the former, the dictionary of index terms is partitioned into subsets, each subset residing at a node.
***indexing granularity->
|------------
More generally, for very long documents, the issue of indexing granularity arises. For a collection of books, it would usually be a bad idea to index an entire book as a document. A search for Chinese toys might bring up a book that mentions China in the first chapter and toys in the last chapter, but this does not make it relevant to the query. Instead, we may well wish to index each chapter or paragraph as a mini-document. Matches are then more likely to be relevant, and since the documents are smaller it will be much easier for the user to find the relevant passages in the document. But why stop there? We could treat individual sentences as mini-documents. It becomes clear that there is a precision/recall tradeoff here. If the units get too small, we are likely to miss important passages because terms were distributed over several mini-documents, while if units are too large we tend to get spurious matches and the relevant information is hard for the user to find.
***postings->
|------------
Brutus      −→ 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia   −→ 2 → 31 → 54 → 101
Intersection =⇒ 2 → 31

◮ Figure 1.5 Intersecting the postings lists for Brutus and Calpurnia from Figure 1.3.
|------------
Exercise 1.3 [⋆] For the document collection shown in Exercise 1.2, what are the returned results for these queries:
a. schizophrenia AND drug
b. for AND NOT (drug OR approach)

1.3 Processing Boolean queries

How do we process a query using an inverted index and the basic Boolean retrieval model? Consider processing the simple conjunctive query:

(1.1) Brutus AND Calpurnia

over the inverted index partially shown in Figure 1.3 (page 7). We:
1. Locate Brutus in the Dictionary.
2. Retrieve its postings.
3. Locate Calpurnia in the Dictionary.
4. Retrieve its postings.
5. Intersect the two postings lists, as shown in Figure 1.5.
|------------
The intersection operation is the crucial one: we need to efficiently intersect postings lists so as to be able to quickly find documents that contain both terms. (This operation is sometimes referred to as merging postings lists: this slightly counterintuitive name reflects using the term merge algorithm for a general family of algorithms that combine multiple sorted lists by interleaved advancing of pointers through each; here we are merging the lists with a logical AND operation.) There is a simple and effective method of intersecting postings lists using the merge algorithm (see Figure 1.6): we maintain pointers into both lists.

INTERSECT(p1, p2)
1  answer ← 〈 〉
2  while p1 ≠ NIL and p2 ≠ NIL
3  do if docID(p1) = docID(p2)
4       then ADD(answer, docID(p1))
5            p1 ← next(p1)
6            p2 ← next(p2)
7       else if docID(p1) < docID(p2)
8            then p1 ← next(p1)
9            else p2 ← next(p2)
10 return answer

◮ Figure 1.6 Algorithm for the intersection of two postings lists p1 and p2.
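INTERSECT translates directly to Python over sorted lists of docIDs; the example lists are the Brutus and Calpurnia postings from Figure 1.5:

```python
def intersect(p1, p2):
    """Merge-style intersection of two postings lists sorted by docID;
    O(len(p1) + len(p2)) by advancing a pointer into each list."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```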
***test set->
|------------
Once we have learned γ, we can apply it to the test set (or test data), for example, the new document first private Chinese airline whose class is unknown.
|------------
Comparing parts (a) and (b) of the table, one is struck by the degree to which the cited papers’ results differ. This is partly due to the fact that the numbers in (b) are break-even scores (cf. page 161) averaged over 118 classes, whereas the numbers in (a) are true F1 scores (computed without any knowledge of the test set) averaged over ninety classes. This is unfortunately typical of what happens when comparing different results in text classification: There are often differences in the experimental setup or the evaluation that complicate the interpretation of the results.
|------------
When performing evaluations like the one in Table 13.9, it is important to maintain a strict separation between the training set and the test set. We can easily make correct classification decisions on the test set by using information we have gleaned from the test set, such as the fact that a particular term is a good predictor in the test set (even though this is not the case in the training set). A more subtle example of using knowledge about the test set is to try a large number of values of a parameter (e.g., the number of selected features) and select the value that is best for the test set. As a rule, accuracy on new data – the type of data we will encounter when we use the classifier in an application – will be much lower than accuracy on a test set that the classifier has been tuned for. We discussed the same problem in ad hoc retrieval in Section 8.1 (page 153).
|------------
In a clean statistical text classification experiment, you should never run any program on or even look at the test set while developing a text classification system. Instead, set aside a development set for testing while you develop your method. When such a set serves the primary purpose of finding a good value for a parameter, for example, the number of selected features, then it is also called held-out data. Train the classifier on the rest of the training set with different parameter values, and then select the value that gives best results on the held-out part of the training set. Ideally, at the very end, when all parameters have been set and the method is fully specified, you run one final experiment on the test set and publish the results.

◮ Table 13.10 Data for parameter estimation exercise.
***chain rule->
|------------
The probability of an event A, written P(A), satisfies 0 ≤ P(A) ≤ 1. For two events A and B, the joint event of both events occurring is described by the joint probability P(A, B). The conditional probability P(A|B) expresses the probability of event A given that event B occurred. The fundamental relationship between joint and conditional probabilities is given by the chain rule:

P(A, B) = P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)    (11.1)

Without making any assumptions, the probability of a joint event equals the probability of one of the events multiplied by the probability of the other event conditioned on knowing the first event happened.
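As a quick numeric sketch of the chain rule, we can check both factorizations against a tiny joint distribution over two binary events. The probability values below are made up for illustration; they are not from the text.

```python
# Hypothetical joint distribution P(A=a, B=b) over two binary events.
joint = {
    (True, True): 0.2,
    (True, False): 0.1,
    (False, True): 0.3,
    (False, False): 0.4,
}

def marginal_A(a):
    return sum(p for (x, y), p in joint.items() if x == a)

def marginal_B(b):
    return sum(p for (x, y), p in joint.items() if y == b)

def cond_A_given_B(a, b):
    # P(A=a | B=b) = P(A=a, B=b) / P(B=b)
    return joint[(a, b)] / marginal_B(b)

def cond_B_given_A(b, a):
    return joint[(a, b)] / marginal_A(a)

# Both factorizations of the chain rule recover the joint probability.
p_joint = joint[(True, True)]
assert abs(cond_A_given_B(True, True) * marginal_B(True) - p_joint) < 1e-12
assert abs(cond_B_given_A(True, True) * marginal_A(True) - p_joint) < 1e-12
```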
***macroaveraging->
|------------
When we process a collection with several two-class classifiers (such as Reuters-21578 with its 118 classes), we often want to compute a single aggregate measure that combines the measures for individual classifiers. There are two methods for doing this. Macroaveraging computes a simple average over classes. Microaveraging pools per-document decisions across classes, and then computes an effectiveness measure on the pooled contingency table. Table 13.8 gives an example.
|------------
The differences between the two methods can be large. Macroaveraging gives equal weight to each class, whereas microaveraging gives equal weight to each per-document classification decision. Because the F1 measure ignores true negatives and its magnitude is mostly determined by the number of true positives, large classes dominate small classes in microaveraging. In the example, microaveraged precision (0.83) is much closer to the precision of the larger class.
◮ Figure: sample Reuters-21578 document, “AMERICAN PORK CONGRESS KICKS OFF TOMORROW” (CHICAGO, 2-MAR-1987; topics: livestock, hog).
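The two averages can be sketched directly from contingency counts. The counts below are illustrative (chosen so that the pooled, microaveraged precision comes out near the 0.83 quoted above), not taken from Table 13.8 itself.

```python
# Macro- vs. microaveraged precision for two binary classifiers.
classes = [
    {"tp": 10, "fp": 10, "fn": 10},  # small class
    {"tp": 90, "fp": 10, "fn": 10},  # large class
]

def precision(tp, fp):
    return tp / (tp + fp)

# Macroaveraging: average the per-class precisions.
macro_p = sum(precision(c["tp"], c["fp"]) for c in classes) / len(classes)

# Microaveraging: pool the counts across classes, then compute precision once.
tp = sum(c["tp"] for c in classes)
fp = sum(c["fp"] for c in classes)
micro_p = precision(tp, fp)

print(round(macro_p, 2))  # 0.7  -- dominated equally by both classes
print(round(micro_p, 2))  # 0.83 -- dominated by the large class
```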
***term->
|------------
In this chapter we begin with a very simple example of an information retrieval problem, and introduce the idea of a term-document matrix (Section 1.1) and the central inverted index data structure (Section 1.2). We will then examine the Boolean retrieval model and how Boolean queries are processed (Sections 1.3 and 1.4).
|------------
1.1 An example information retrieval problem
A fat book which many people own is Shakespeare’s Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia. One way to do that is to start at the beginning and to read through all the text, noting for each play whether it contains Brutus and Caesar and excluding it from consideration if it contains Calpurnia. The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly referred to as grepping through text, after the Unix command grep, which performs this process. Grepping through text can be a very effective process, especially given the speed of modern computers, and often allows useful possibilities for wildcard pattern matching through the use of regular expressions. With modern computers, for simple querying of modest collections (the size of Shakespeare’s Collected Works is a bit under one million words of text in total), you really need nothing more.
|------------
The way to avoid linearly scanning the texts for each query is to index the documents in advance. Let us stick with Shakespeare’s Collected Works, and use it to introduce the basics of the Boolean retrieval model. Suppose we record for each document – here a play of Shakespeare’s – whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 different words). The result is a binary term-document incidence matrix, as in Figure 1.1. Terms are the indexed units (further discussed in Section 2.2); they are usually words, and for the moment you can think of them as words.
◮ Figure 1.1 (term-document incidence matrix), with columns for the plays Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth, . . .
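The incidence matrix and the Boolean query above can be sketched in a few lines. The toy term occurrences below are made up for illustration (they are not the actual contents of the plays), and the query Brutus AND Caesar AND NOT Calpurnia is answered with bitwise operations on the term rows, as the text describes.

```python
# A toy term-document incidence matrix, with plays as documents.
docs = {
    "Antony and Cleopatra": {"antony", "brutus", "caesar", "cleopatra"},
    "Julius Caesar":        {"antony", "brutus", "caesar", "calpurnia"},
    "Hamlet":               {"brutus", "caesar"},
    "Othello":              {"caesar"},
}
titles = sorted(docs)

def incidence(term):
    """Binary row vector: 1 if the term occurs in the document, else 0."""
    return [1 if term in docs[t] else 0 for t in titles]

# Brutus AND Caesar AND NOT Calpurnia, as bitwise operations on the rows.
answer = [b & c & (1 - p) for b, c, p in
          zip(incidence("brutus"), incidence("caesar"), incidence("calpurnia"))]
hits = [t for t, bit in zip(titles, answer) if bit]
print(hits)  # ['Antony and Cleopatra', 'Hamlet']
```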
|------------
2 The term vocabulary and postings lists
Recall the major steps in inverted index construction:
1. Collect the documents to be indexed.
|------------
4. Index the documents that each term occurs in.
|------------
In this chapter we first briefly mention how the basic unit of a document can be defined and how the character sequence that it comprises is determined (Section 2.1). We then examine in detail some of the substantive linguistic issues of tokenization and linguistic preprocessing, which determine the vocabulary of terms which a system uses (Section 2.2). Tokenization is the process of chopping character streams into tokens, while linguistic preprocessing then deals with building equivalence classes of tokens which are the set of terms that are indexed. Indexing itself is covered in Chapters 1 and 4.
|------------
2.1 Document delineation and character sequence decoding
2.1.1 Obtaining the character sequence in a document
Digital documents that are the input to an indexing process are typically bytes in a file or on a web server. The first step of processing is to convert this byte sequence into a linear sequence of characters. For the case of plain English text in ASCII encoding, this is trivial. But often things get much more complex. The sequence of characters may be encoded by one of various single byte or multibyte encoding schemes, such as Unicode UTF-8, or various national or vendor-specific standards. We need to determine the correct encoding. This can be regarded as a machine learning classification problem, as discussed in Chapter 13, but is often handled by heuristic methods, user selection, or by using provided document metadata. Once the encoding is determined, we decode the byte sequence to a character sequence. We might save the choice of encoding because it gives some evidence about what language the document is written in.
|------------
More generally, for very long documents, the issue of indexing granularity arises. For a collection of books, it would usually be a bad idea to index an entire book as a document. A search for Chinese toys might bring up a book that mentions China in the first chapter and toys in the last chapter, but this does not make it relevant to the query. Instead, we may well wish to index each chapter or paragraph as a mini-document. Matches are then more likely to be relevant, and since the documents are smaller it will be much easier for the user to find the relevant passages in the document. But why stop there? We could treat individual sentences as mini-documents. It becomes clear that there is a precision/recall tradeoff here. If the units get too small, we are likely to miss important passages because terms were distributed over several mini-documents, while if units are too large we tend to get spurious matches and the relevant information is hard for the user to find.
|------------
2.2 Determining the vocabulary of terms
2.2.1 Tokenization
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization:
Input: Friends, Romans, Countrymen, lend me your ears;
Output: Friends Romans Countrymen lend me your ears
These tokens are often loosely referred to as terms or words, but it is sometimes important to make a type/token distinction. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A type is the class of all tokens containing the same character sequence. A term is a (perhaps normalized) type that is included in the IR system’s dictionary. The set of index terms could be entirely distinct from the tokens, for instance, they could be semantic identifiers in a taxonomy, but in practice in modern IR systems they are strongly related to the tokens in the document. However, rather than being exactly the tokens that appear in the document, they are usually derived from them by various normalization processes which are discussed in Section 2.2.3. For example, if the document to be indexed is to sleep perchance to dream, then there are 5 tokens, but only 4 types (since there are 2 instances of to). However, if to is omitted from the index (as a stop word, see Section 2.2.2, page 27), then there will be only 3 terms: sleep, perchance, and dream.
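The token/type/term distinction for the example sentence can be checked mechanically. This is a minimal sketch: whitespace splitting stands in for real tokenization, and the stop-word list is the single word from the example.

```python
# Tokens, types, and terms for "to sleep perchance to dream".
text = "to sleep perchance to dream"
stop_words = {"to"}          # illustrative one-element stop list

tokens = text.split()        # instances of character sequences, in order
types = set(tokens)          # distinct character sequences
terms = types - stop_words   # (normalized) types kept in the dictionary

print(len(tokens), len(types), len(terms))  # 5 4 3
print(sorted(terms))  # ['dream', 'perchance', 'sleep']
```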
***inner product->
***nibble->
|------------
The idea of VB encoding can also be applied to larger or smaller units than bytes: 32-bit words, 16-bit words, and 4-bit words or nibbles. Larger words further decrease the amount of bit manipulation necessary at the cost of less effective (or no) compression. Word sizes smaller than bytes get even better compression ratios at the cost of more bit manipulation. In general, bytes offer a good compromise between compression ratio and speed of decompression.
***kNN classification->
|------------
14.3 k nearest neighbor
Unlike Rocchio, k nearest neighbor or kNN classification determines the decision boundary locally. For 1NN we assign each document to the class of its closest neighbor. For kNN we assign each document to the majority class of its k closest neighbors, where k is a parameter. The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test document d to have the same label as the training documents located in the local region surrounding d.
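A minimal kNN sketch follows. The training vectors and class labels are made up for illustration; Euclidean distance stands in for whatever document similarity a real system would use.

```python
from collections import Counter
import math

# Toy training set: (vector, class label). Requires Python 3.8+ for math.dist.
train = [([1.0, 1.0], "china"), ([1.2, 0.8], "china"),
         ([5.0, 5.0], "uk"), ([5.2, 4.8], "uk"), ([4.9, 5.3], "uk")]

def knn(x, k):
    """Assign x to the majority class of its k nearest training points."""
    nearest = sorted(train, key=lambda dv: math.dist(x, dv[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn([1.1, 0.9], 3))  # china (two of the three nearest neighbors)
print(knn([5.0, 5.1], 1))  # uk    (1NN: the single closest neighbor)
```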
***XPath->
|------------
XPath is a standard for enumerating paths in an XML document collection. We will also refer to paths as XML contexts or simply contexts in this chapter. Only a small subset of XPath is needed for our purposes. The XPath expression node selects all nodes of that name. Successive elements of a path are separated by slashes, so act/scene selects all scene elements whose parent is an act element. Double slashes indicate that an arbitrary number of elements can intervene on a path: play//scene selects all scene elements occurring in a play element. In Figure 10.2 this set consists of a single scene element, which is accessible via the path play, act, scene from the top. An initial slash starts the path at the root element. /play/title selects the play’s title in Figure 10.1, /play//title selects a set with two members (the play’s title and the scene’s title), and /scene/title selects no elements. For notational convenience, we allow the final element of a path to be a vocabulary term and separate it from the element path by the symbol #, even though this does not conform to the XPath standard. For example, title#"Macbeth" selects all titles containing the term Macbeth.
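The small XPath subset used here can be tried with Python's standard-library ElementTree, which supports a limited XPath dialect in `findall`. The toy document below is a sketch loosely modeled on the play structure discussed in the text, not the actual Figure 10.1.

```python
import xml.etree.ElementTree as ET

# A toy play document: a title directly under play, and one under a scene.
xml = """<play>
  <title>Macbeth</title>
  <act>
    <scene><title>Macbeth's castle</title></scene>
  </act>
</play>"""
root = ET.fromstring(xml)

# act/scene: scene elements whose parent is an act element.
print(len(root.findall("act/scene")))   # 1
# play//scene, written .//scene relative to the play root: any depth.
print(len(root.findall(".//scene")))    # 1
# /play//title as .//title: the play's title and the scene's title.
print(len(root.findall(".//title")))    # 2
```

Note that ElementTree paths are relative to the element they are called on, so the absolute `/play/...` forms from the text become `./...` or `.//...` on the root element.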
***Mercer kernel->
|------------
✎ Example 15.2: The quadratic kernel in two dimensions. For 2-dimensional vectors ~u = (u1, u2), ~v = (v1, v2), consider K(~u,~v) = (1 + ~u^T~v)^2. We wish to show that this is a kernel, i.e., that K(~u,~v) = φ(~u)^T φ(~v) for some φ. Consider φ(~u) = (1, u1^2, √2 u1u2, u2^2, √2 u1, √2 u2). Then:

K(~u,~v) = (1 + ~u^T~v)^2    (15.14)
         = 1 + u1^2 v1^2 + 2 u1v1u2v2 + u2^2 v2^2 + 2 u1v1 + 2 u2v2
         = (1, u1^2, √2 u1u2, u2^2, √2 u1, √2 u2)^T (1, v1^2, √2 v1v2, v2^2, √2 v1, √2 v2)
         = φ(~u)^T φ(~v)

In the language of functional analysis, what kinds of functions are valid kernel functions? Kernel functions are sometimes more precisely referred to as Mercer kernels, because they must satisfy Mercer’s condition: for any g(~x) such that ∫ g(~x)^2 d~x is finite, we must have that:

∫ K(~x,~z) g(~x) g(~z) d~x d~z ≥ 0.    (15.15)

A kernel function K must be continuous, symmetric, and have a positive definite Gram matrix. Such a K means that there exists a mapping to a reproducing kernel Hilbert space (a Hilbert space is a vector space closed under dot products) such that the dot product there gives the same value as the function K. If a kernel does not satisfy Mercer’s condition, then the corresponding QP may have no solution. If you would like to better understand these issues, you should consult the books on SVMs mentioned in Section 15.5. Otherwise, you can content yourself with knowing that 90% of work with kernels uses one of two straightforward families of functions of two vectors, which we define below, and which define valid kernels.
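The identity in Example 15.2 can be checked numerically: the quadratic kernel (1 + ~u^T~v)^2 should agree with the explicit dot product φ(~u)^T φ(~v) in the expanded 6-dimensional space for any pair of 2-dimensional vectors.

```python
import math
import random

def K(u, v):
    """Quadratic kernel (1 + u.v)^2 for 2-dimensional vectors."""
    return (1 + u[0] * v[0] + u[1] * v[1]) ** 2

def phi(u):
    """Explicit feature map from Example 15.2."""
    r2 = math.sqrt(2)
    u1, u2 = u
    return [1, u1 * u1, r2 * u1 * u2, u2 * u2, r2 * u1, r2 * u2]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
for _ in range(100):
    u = [random.uniform(-3, 3), random.uniform(-3, 3)]
    v = [random.uniform(-3, 3), random.uniform(-3, 3)]
    assert abs(K(u, v) - dot(phi(u), phi(v))) < 1e-9
print("kernel identity holds")
```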
***lossy compression->
|------------
The compression techniques we describe in the remainder of this chapter are lossless, that is, all information is preserved. Better compression ratios can be achieved with lossy compression, which discards some information. Case folding, stemming, and stop word elimination are forms of lossy compression. Similarly, the vector space model (Chapter 6) and dimensionality reduction techniques like latent semantic indexing (Chapter 18) create compact representations from which we cannot fully restore the original collection. Lossy compression makes sense when the “lost” information is unlikely ever to be used by the search system. For example, web search is characterized by a large number of documents, short queries, and users who only look at the first few pages of results. As a consequence, we can discard postings of documents that would only be used for hits far down the list. Thus, there are retrieval scenarios where lossy methods can be used for compression without any reduction in effectiveness.
***steady-state->
|------------
If a Markov chain is allowed to run for many time steps, each state is visited at a (different) frequency that depends on the structure of the Markov chain. In our running analogy, the surfer visits certain web pages (say, popular news home pages) more often than other pages. We now make this intuition precise, establishing conditions under which the visit frequency converges to a fixed, steady-state quantity. Following this, we set the PageRank of each node v to this steady-state visit frequency and show how it can be computed.
|------------
Theorem 21.1. For any ergodic Markov chain, there is a unique steady-state probability vector ~π that is the principal left eigenvector of P, such that if η(i, t) is the number of visits to state i in t steps, then

lim_{t→∞} η(i, t)/t = π(i),

where π(i) > 0 is the steady-state probability for state i.
|------------
It follows from Theorem 21.1 that the random walk with teleporting results in a unique distribution of steady-state probabilities over the states of the induced Markov chain. This steady-state probability for a state is the PageRank of the corresponding web page.
|------------
21.2.2 The PageRank computation
How do we compute PageRank values? Recall the definition of a left eigenvector from Equation 18.2; the left eigenvectors of the transition probability matrix P are N-vectors ~π such that

~π P = λ~π.    (21.2)

The N entries in the principal eigenvector ~π are the steady-state probabilities of the random walk with teleporting, and thus the PageRank values for the corresponding web pages. We may interpret Equation (21.2) as follows: if ~π is the probability distribution of the surfer across the web pages, he remains in the steady-state distribution ~π. Given that ~π is the steady-state distribution, we have that ~π P = 1~π, so 1 is an eigenvalue of P. Thus if we were to compute the principal left eigenvector of the matrix P — the one with eigenvalue 1 — we would have computed the PageRank values.
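A common way to find this principal left eigenvector is power iteration: start from any probability distribution and repeatedly multiply by P. The 3-state transition matrix below is illustrative (each row sums to 1), not an example from the text.

```python
# Illustrative row-stochastic transition matrix for a 3-state chain.
P = [
    [0.10, 0.80, 0.10],
    [0.45, 0.10, 0.45],
    [0.10, 0.80, 0.10],
]

def pagerank(P, iters=100):
    """Power iteration: pi <- pi P, starting from the uniform distribution."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

pi = pagerank(P)
# At the steady state, pi P = pi (eigenvalue 1), up to numerical error.
check = [sum(pi[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]
assert all(abs(a - b) < 1e-9 for a, b in zip(pi, check))
print([round(x, 2) for x in pi])  # [0.26, 0.47, 0.26]
```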
***decision hyperplane->
|------------
A large number of text classifiers can be viewed as linear classifiers – classifiers that classify based on a simple linear combination of the features (Section 14.4). Such classifiers partition the space of features into regions separated by linear decision hyperplanes, in a manner to be detailed below. Because of the bias-variance tradeoff (Section 14.6) more complex nonlinear models …
◮ Figure 14.2 Projections of small areas of the unit sphere preserve distances. Left: A projection of the 2D semicircle to 1D. For the points x1, x2, x3, x4, x5 at x coordinates −0.9, −0.2, 0, 0.2, 0.9, the distance |x2 x3| ≈ 0.201 differs by only 0.5% from |x′2 x′3| = 0.2; but |x1 x3|/|x′1 x′3| = d_true/d_projected ≈ 1.06/0.9 ≈ 1.18 is an example of a large distortion (18%) when projecting a large area. Right: The corresponding projection of the 3D hemisphere to 2D.
|------------
We call a hyperplane that we use as a linear classifier a decision hyperplane. The corresponding algorithm for linear classification in M dimensions is shown in Figure 14.9. Linear classification at first seems trivial given the simplicity of this algorithm. However, the difficulty is in training the linear classifier, that is, in determining the parameters ~w and b based on the training set. In general, some learning methods compute much better parameters than others, where our criterion for evaluating the quality of a learning method is the effectiveness of the learned linear classifier on new data.
***kernel->
|------------
SVMs, and also a number of other linear classifiers, provide an easy and efficient way of doing this mapping to a higher dimensional space, which is referred to as “the kernel trick”. It’s not really a trick: it just exploits the math that we have seen. The SVM linear classifier relies on a dot product between data point vectors. Let K(~xi,~xj) = ~xi^T ~xj. Then the classifier we have seen so far is:

f(~x) = sign(∑_i αi yi K(~xi,~x) + b)    (15.13)

Now suppose we decide to map every data point into a higher dimensional space via some transformation Φ: ~x 7→ φ(~x). Then the dot product becomes φ(~xi)^T φ(~xj). If it turned out that this dot product (which is just a real number) could be computed simply and efficiently in terms of the original data points, then we wouldn’t have to actually map from ~x 7→ φ(~x). Rather, we could simply compute the quantity K(~xi,~xj) = φ(~xi)^T φ(~xj), and then use the function’s value in Equation (15.13). A kernel function K is such a function that corresponds to a dot product in some expanded feature space.
|------------
✎ Example 15.2: The quadratic kernel in two dimensions. For 2-dimensional vectors ~u = (u1, u2), ~v = (v1, v2), consider K(~u,~v) = (1 + ~u^T~v)^2. We wish to show that this is a kernel, i.e., that K(~u,~v) = φ(~u)^T φ(~v) for some φ. Consider φ(~u) = (1, u1^2, √2 u1u2, u2^2, √2 u1, √2 u2). Then:

K(~u,~v) = (1 + ~u^T~v)^2    (15.14)
         = 1 + u1^2 v1^2 + 2 u1v1u2v2 + u2^2 v2^2 + 2 u1v1 + 2 u2v2
         = (1, u1^2, √2 u1u2, u2^2, √2 u1, √2 u2)^T (1, v1^2, √2 v1v2, v2^2, √2 v1, √2 v2)
         = φ(~u)^T φ(~v)

In the language of functional analysis, what kinds of functions are valid kernel functions? Kernel functions are sometimes more precisely referred to as Mercer kernels, because they must satisfy Mercer’s condition: for any g(~x) such that ∫ g(~x)^2 d~x is finite, we must have that:

∫ K(~x,~z) g(~x) g(~z) d~x d~z ≥ 0.    (15.15)

A kernel function K must be continuous, symmetric, and have a positive definite Gram matrix. Such a K means that there exists a mapping to a reproducing kernel Hilbert space (a Hilbert space is a vector space closed under dot products) such that the dot product there gives the same value as the function K. If a kernel does not satisfy Mercer’s condition, then the corresponding QP may have no solution. If you would like to better understand these issues, you should consult the books on SVMs mentioned in Section 15.5. Otherwise, you can content yourself with knowing that 90% of work with kernels uses one of two straightforward families of functions of two vectors, which we define below, and which define valid kernels.
|------------
The two commonly used families of kernels are polynomial kernels and radial basis functions. Polynomial kernels are of the form K(~x,~z) = (1 + ~x^T~z)^d. The case of d = 1 is a linear kernel, which is what we had before the start of this section (the constant 1 just changing the threshold). The case of d = 2 gives a quadratic kernel, and is very commonly used. We illustrated the quadratic kernel in Example 15.2.
***PageRank->
|------------
Exercise 21.4 Does your heuristic in the previous exercise take into account a single domain D repeating anchor text for x from multiple pages in D?
21.2 PageRank
We now focus on scoring and ranking measures derived from the link structure alone. Our first technique for link analysis assigns to every node in the web graph a numerical score between 0 and 1, known as its PageRank. The PageRank of a node will depend on the link structure of the web graph.
|------------
Given a query, a web search engine computes a composite score for each web page that combines hundreds of features such as cosine similarity (Section 6.3) and term proximity (Section 7.2.2), together with the PageRank score.
***next word index->
|------------
Williams et al. (2004) evaluate an even more sophisticated scheme which employs indexes of both these sorts and additionally a partial next word index as a halfway house between the first two strategies. For each term, a next word index records terms that follow it in a document. They conclude that such a strategy allows a typical mixture of web phrase queries to be completed in one quarter of the time taken by use of a positional index alone, while taking up 26% more space than use of a positional index alone.
***topic spotting->
|------------
Such more general classes are usually referred to as topics, and the classification task is then called text classification, text categorization, topic classification, or topic spotting. An example for China appears in Figure 13.1. Standing queries and topics differ in their degree of specificity, but the methods for solving routing, filtering, and text classification are essentially the same. We therefore include routing and filtering under the rubric of text classification in this and the following chapters.
***δ codes->
|------------
Exercise 5.9 γ codes are relatively inefficient for large numbers (e.g., 1025 in Table 5.5) as they encode the length of the offset in inefficient unary code. δ codes differ from γ codes in that they encode the first part of the code (length) in γ code instead of unary code.
|------------
The encoding of offset is the same. For example, the δ code of 7 is 10,0,11 (again, we add commas for readability). 10,0 is the γ code for length (2 in this case) and the encoding of offset (11) is unchanged. (i) Compute the δ codes for the other numbers in Table 5.5. For what range of numbers is the δ code shorter than the γ code? (ii) γ code beats variable byte code in Table 5.6 because the index contains stop words and thus many small gaps. Show that variable byte code is more compact if larger gaps dominate. (iii) Compare the compression ratios of δ code and variable byte code for a distribution of gaps dominated by large gaps.
◮ Table 5.7 Two gap sequences to be merged in blocked sort-based indexing
γ encoded gap sequence of run 1: 1110110111111001011111111110100011111001
γ encoded gap sequence of run 2: 11111010000111111000100011111110010000011111010101
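The γ and δ encoders can be sketched directly from these definitions. Note the δ code here follows the exercise's variant (the length of the offset is itself γ-encoded), which assumes n ≥ 2 so that the offset is non-empty.

```python
def unary(n):
    """Unary code: n ones followed by a zero."""
    return "1" * n + "0"

def gamma(n):
    """Gamma code: length of offset in unary, then offset (n >= 1)."""
    offset = bin(n)[3:]  # binary representation with the leading 1 removed
    return unary(len(offset)) + offset

def delta(n):
    """Delta code per the exercise: length of offset in gamma code (n >= 2)."""
    offset = bin(n)[3:]
    return gamma(len(offset)) + offset

print(gamma(13))  # 1110101  (unary 1110 for length 3, then offset 101)
print(delta(7))   # 10011    (gamma 100 for length 2, then offset 11)
print(gamma(1))   # 0        (empty offset, unary code for length 0)
```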
***kernel trick->
|------------
SVMs, and also a number of other linear classifiers, provide an easy and efficient way of doing this mapping to a higher dimensional space, which is referred to as “the kernel trick”. It’s not really a trick: it just exploits the math that we have seen. The SVM linear classifier relies on a dot product between data point vectors. Let K(~xi,~xj) = ~xi^T ~xj. Then the classifier we have seen so far is:

f(~x) = sign(∑_i αi yi K(~xi,~x) + b)    (15.13)

Now suppose we decide to map every data point into a higher dimensional space via some transformation Φ: ~x 7→ φ(~x). Then the dot product becomes φ(~xi)^T φ(~xj). If it turned out that this dot product (which is just a real number) could be computed simply and efficiently in terms of the original data points, then we wouldn’t have to actually map from ~x 7→ φ(~x). Rather, we could simply compute the quantity K(~xi,~xj) = φ(~xi)^T φ(~xj), and then use the function’s value in Equation (15.13). A kernel function K is such a function that corresponds to a dot product in some expanded feature space.
***continuation bit->
|------------
5.3.1 Variable byte codes
Variable byte (VB) encoding uses an integral number of bytes to encode a gap. The last 7 bits of a byte are “payload” and encode part of the gap. The first bit of the byte is a continuation bit. It is set to 1 for the last byte of the encoded gap and to 0 otherwise. To decode a variable byte code, we read a sequence of bytes with continuation bit 0 terminated by a byte with continuation bit 1.
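The scheme just described can be sketched as follows: 7 payload bits per byte, with the high (continuation) bit set to 1 only on the final byte of each gap. The example gap values are arbitrary.

```python
def vb_encode(n):
    """Variable byte code for one nonnegative gap."""
    out = []
    while True:
        out.insert(0, n % 128)   # prepend the low 7 bits
        if n < 128:
            break
        n //= 128
    out[-1] += 128               # set the continuation bit on the last byte
    return bytes(out)

def vb_decode(data):
    """Decode a concatenation of VB-encoded gaps."""
    nums, n = [], 0
    for b in data:
        if b < 128:              # continuation bit 0: more bytes follow
            n = 128 * n + b
        else:                    # continuation bit 1: final byte of this gap
            nums.append(128 * n + (b - 128))
            n = 0
    return nums

encoded = b"".join(vb_encode(g) for g in [824, 5, 214577])
print(vb_decode(encoded))  # [824, 5, 214577]
print(vb_encode(5))        # b'\x85'  (one byte: 10000101)
```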
***email sorting->
|------------
• Personal email sorting. A user may have folders like talk announcements, electronic bills, email from family and friends, and so on, and may want a classifier to classify each incoming email and automatically move it to the appropriate folder. It is easier to find messages in sorted folders than in a very large inbox. The most common case of this application is a spam folder that holds all suspected spam messages.
***concept drift->
|------------
Even if it is not the method with the highest accuracy for text, NB has many virtues that make it a strong contender for text classification. It excels if there are many equally important features that jointly contribute to the classification decision. It is also somewhat robust to noise features (as defined in the next section) and concept drift – the gradual change over time of the concept underlying a class like US president from Bill Clinton to George W. Bush (see Section 13.7). Classifiers like kNN (Section 14.3, page 297) can be carefully tuned to idiosyncratic properties of a particular time period. This will then hurt them when documents in the following time period have slightly different properties.
◮ Table 13.5 A set of documents for which the NB independence assumptions are problematic.
|------------
These and other results have shown that the average effectiveness of NB is uncompetitive with classifiers like SVMs when trained and tested on independent and identically distributed (i.i.d.) data, that is, uniform data with all the good properties of statistical sampling. However, these differences may often be invisible or even reverse themselves when working in the real world where, usually, the training sample is drawn from a subset of the data to which the classifier will be applied, the nature of the data drifts over time rather than being stationary (the problem of concept drift we mentioned on page 269), and there may well be errors in the data (among other problems).
|------------
Maron and Kuhns (1960) described one of the first NB text classifiers. Lewis (1998) focuses on the history of NB classification. Bernoulli and multinomial models and their accuracy for different collections are discussed by McCallum and Nigam (1998). Eyheramendy et al. (2003) present additional NB models. Domingos and Pazzani (1997), Friedman (1997), and Hand and Yu (2001) analyze why NB performs well although its probability estimates are poor. The first paper also discusses NB’s optimality when the independence assumptions are true of the data. Pavlov et al. (2004) propose a modified document representation that partially addresses the inappropriateness of the independence assumptions. Bennett (2000) attributes the tendency of NB probability estimates to be close to either 0 or 1 to the effect of document length. Ng and Jordan (2001) show that NB is sometimes (although rarely) superior to discriminative methods because it more quickly reaches its optimal error rate. The basic NB model presented in this chapter can be tuned for better effectiveness (Rennie et al. 2003; Kołcz and Yih 2007). The problem of concept drift and other reasons why state-of-the-art classifiers do not always excel in practice are discussed by Forman (2006) and Hand (2006).
***Ergodic Markov Chain->
|------------
Definition: A Markov chain is said to be ergodic if there exists a positive integer T0 such that for all pairs of states i, j in the Markov chain, if it is started at time 0 in state i then for all t > T0, the probability of being in state j at time t is greater than 0.
***hierarchic clustering->
|------------
17 Hierarchical clustering
Flat clustering is efficient and conceptually simple, but as we saw in Chapter 16 it has a number of drawbacks. The algorithms introduced in Chapter 16 return a flat unstructured set of clusters, require a prespecified number of clusters as input, and are nondeterministic. Hierarchical clustering (or hierarchic clustering) outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by flat clustering. Hierarchical clustering does not require us to prespecify the number of clusters and most hierarchical algorithms that have been used in IR are deterministic. These advantages of hierarchical clustering come at the cost of lower efficiency. The most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents, compared to the linear complexity of K-means and EM (cf. Section 16.4, page 364).
***microaveraging->
|------------
When we process a collection with several two-class classifiers (such as Reuters-21578 with its 118 classes), we often want to compute a single aggregate measure that combines the measures for individual classifiers. There are two methods for doing this. Macroaveraging computes a simple average over classes. Microaveraging pools per-document decisions across classes, and then computes an effectiveness measure on the pooled contingency table. Table 13.8 gives an example.
|------------
The differences between the two methods can be large. Macroaveraging gives equal weight to each class, whereas microaveraging gives equal weight to each per-document classification decision. Because the F1 measure ignores true negatives and its magnitude is mostly determined by the number of true positives, large classes dominate small classes in microaveraging. In the example, microaveraged precision (0.83) is much closer to the precision of the larger class.
◮ Figure: sample Reuters-21578 document, “AMERICAN PORK CONGRESS KICKS OFF TOMORROW” (CHICAGO, 2-MAR-1987; topics: livestock, hog).
***spider->
|------------
20 Web crawling and indexes
20.1 Overview
Web crawling is the process by which we gather pages from the Web, in order to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them. In Chapter 19 we studied the complexities of the Web stemming from its creation by millions of uncoordinated individuals. In this chapter we study the resulting difficulties for crawling the Web. The focus of this chapter is the component shown in Figure 19.7 as web crawler; it is sometimes referred to as a spider.
The goal of this chapter is not to describe how to build the crawler for a full-scale commercial web search engine. We focus instead on a range of issues that are generic to crawling from the student project scale to substantial research projects. We begin (Section 20.1.1) by listing desiderata for web crawlers, and then discuss in Section 20.2 how each of these issues is addressed. The remainder of this chapter describes the architecture and some implementation details for a distributed web crawler that satisfies these features. Section 20.3 discusses distributing indexes across many machines for a web-scale implementation.
|------------
Robustness: The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty website development.
***Search Engine Optimizers->
|------------
A doorway page contains text and metadata carefully chosen to rank highly on selected search keywords. When a browser requests the doorway page, it is redirected to a page containing content of a more commercial nature. More complex spamming techniques involve manipulation of the metadata related to a page, including (for reasons we will see in Chapter 21) the links into a web page. Given that spamming is inherently an economically motivated activity, there has sprung up around it an industry of Search Engine Optimizers, or SEOs, who provide consultancy services for clients seeking to have their web pages rank highly on selected keywords. Web search engines frown on this business of attempting to decipher and adapt to their proprietary ranking techniques, and indeed announce policies on forms of SEO behavior they do not tolerate (and have been known to shut down search requests from certain SEOs for violations of these). Inevitably, the parrying between such SEOs (who gradually infer features of each web search engine's ranking methods) and the web search engines (who adapt in response) is an unending struggle; indeed, the research sub-area of adversarial information retrieval has sprung up around this battle. One defense against spammers who manipulate the text of their web pages is the exploitation of the link structure of the Web – a technique known as link analysis. The first web search engine known to apply link analysis on a large scale (to be detailed in Chapter 21) was Google, although all web search engines currently make use of it (and correspondingly, spammers now invest considerable effort in subverting it – this is known as link spam).
***held-out->
|------------
The parameter k in kNN is often chosen based on experience or knowledge about the classification problem at hand. It is desirable for k to be odd to make ties less likely. k = 3 and k = 5 are common choices, but much larger values between 50 and 100 are also used. An alternative way of setting the parameter is to select the k that gives best results on a held-out portion of the training set.
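As a concrete (and purely illustrative) sketch of held-out selection, the toy code below picks k for a 1-D nearest-neighbor classifier by accuracy on a held-out portion of the training data; the data points and candidate values are invented for the example:

```python
from collections import Counter

def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training points.
    train is a list of (feature, label) pairs with 1-D numeric features."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def select_k(train, held_out, candidates=(1, 3, 5)):
    """Pick the odd k with the best accuracy on the held-out portion."""
    def accuracy(k):
        return sum(knn_predict(train, x, k) == y for x, y in held_out) / len(held_out)
    return max(candidates, key=accuracy)

# Toy data: class 'a' clusters near 0, class 'b' near 10, plus one noisy 'b'.
train = [(0.0, 'a'), (0.5, 'a'), (1.0, 'a'),
         (9.0, 'b'), (9.5, 'b'), (10.0, 'b'), (1.2, 'b')]
held_out = [(0.8, 'a'), (9.8, 'b'), (1.15, 'a')]
best_k = select_k(train, held_out)
```

Here k = 1 misclassifies the held-out point at 1.15 (its single nearest neighbor is the noisy point), so the held-out procedure prefers k = 3.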
***web crawler->
|------------
20 Web crawling and indexes 20.1 Overview Web crawling is the process by which we gather pages from the Web, in order to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them. In Chapter 19 we studied the complexities of the Web stemming from its creation by millions of uncoordinated individuals. In this chapter we study the resulting difficulties for crawling the Web. The focus of this chapter is the component shown in Figure 19.7 as web crawler; it is sometimes referred to as a spider. The goal of this chapter is not to describe how to build the crawler for a full-scale commercial web search engine. We focus instead on a range of issues that are generic to crawling from the student project scale to substantial research projects. We begin (Section 20.1.1) by listing desiderata for web crawlers, and then discuss in Section 20.2 how each of these issues is addressed. The remainder of this chapter describes the architecture and some implementation details for a distributed web crawler that satisfies these features. Section 20.3 discusses distributing indexes across many machines for a web-scale implementation.
|------------
20.1.1 Features a crawler must provide We list the desiderata for web crawlers in two categories: features that web crawlers must provide, followed by features they should provide.
***corpus->
|------------
Let us now consider a more realistic scenario, simultaneously using the opportunity to introduce some terminology and notation. Suppose we have N = 1 million documents. By documents we mean whatever units we have decided to build a retrieval system over. They might be individual memos or chapters of a book (see Section 2.1.2 (page 20) for further discussion). We will refer to the group of documents over which we perform retrieval as the (document) collection. It is sometimes also referred to as a corpus (a body of texts). Suppose each document is about 1000 words long (2–3 book pages).
2. Formally, we take the transpose of the matrix to be able to get the terms as column vectors.
***lemma->
|------------
2.2.4 Stemming and lemmatization For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
|------------
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance: am, are, is ⇒ be; car, cars, car's, cars' ⇒ car. The result of this mapping of text will be something like: the boy's cars are different colors ⇒ the boy car be differ color. However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma.
***language->
|------------
12 Language models for information retrieval A common suggestion to users for coming up with good queries is to think of words that would likely appear in a relevant document, and to use those words as the query. The language modeling approach to IR directly models that idea: a document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often. This approach thus provides a different realization of some of the basic ideas for document ranking which we saw in Section 6.2 (page 117). Instead of overtly modeling the probability P(R = 1|q, d) of relevance of a document d to a query q, as in the traditional probabilistic approach to IR (Chapter 11), the basic language modeling approach instead builds a probabilistic language model Md from each document d, and ranks documents based on the probability of the model generating the query: P(q|Md).
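The ranking criterion P(q|Md) can be sketched in a few lines. The toy code below (not from the chapter; the example texts are invented and the linearly interpolated unigram smoothing is one possible choice) scores each document by the log probability of its unigram model generating the query:

```python
import math
from collections import Counter

def score(query, doc_tokens, collection_counts, collection_len, lam=0.5):
    """log P(q | Md) under a unigram model, interpolating the document
    model with the collection model: P(t|d) = lam*tf/|d| + (1-lam)*cf/T."""
    tf = Counter(doc_tokens)
    dlen = len(doc_tokens)
    total = 0.0
    for t in query:
        p = lam * tf[t] / dlen + (1 - lam) * collection_counts[t] / collection_len
        if p == 0:                      # term absent from the whole collection
            return float('-inf')
        total += math.log(p)
    return total

docs = {
    'd1': 'click go the shears boys click click click'.split(),
    'd2': 'click metal here'.split(),
}
coll = Counter(t for d in docs.values() for t in d)
T = sum(coll.values())
query = ['click', 'shears']
ranked = sorted(docs, key=lambda d: score(query, docs[d], coll, T), reverse=True)
```

Document d1 contains both query terms often, so its model is more likely to generate the query and it ranks first.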
|------------
In this chapter, we first introduce the concept of language models (Section 12.1) and then describe the basic and most commonly used language modeling approach to IR, the Query Likelihood Model (Section 12.2). After some comparisons between the language modeling approach and other approaches to IR (Section 12.3), we finish by briefly describing various extensions to the language modeling approach (Section 12.4).
|------------
12.1 Language models 12.1.1 Finite automata and language models What do we mean by a document model generating a query? A traditional generative model of a language, of the kind familiar from formal language theory, can be used either to recognize or to generate strings. For example, the finite automaton shown in Figure 12.1 can generate strings that include the examples shown. The full set of strings that can be generated is called the language of the automaton.1 Example generated strings: I wish; I wish I wish; I wish I wish I wish; . . .
***matrix decomposition->
|------------
18.1.1 Matrix decompositions In this section we examine ways in which a square matrix can be factored into the product of matrices derived from its eigenvectors; we refer to this process as matrix decomposition. Matrix decompositions similar to the ones in this section will form the basis of our principal text-analysis technique in Section 18.3, where we will look at decompositions of non-square term-document matrices. The square decompositions in this section are simpler and can be treated with sufficient mathematical rigor to help the reader understand how such decompositions work. The detailed mathematical derivations of the more complex decompositions in Section 18.2 are beyond the scope of this book.
***1/0 loss->
|------------
Writing P(Ā) for the complement of an event A, we similarly have:

P(Ā, B) = P(B|Ā)P(Ā)   (11.2)

Probability theory also has a partition rule, which says that if an event B can be divided into an exhaustive set of disjoint subcases, then the probability of B is the sum of the probabilities of the subcases. A special case of this rule gives that:

P(B) = P(A, B) + P(Ā, B)   (11.3)

From these we can derive Bayes' Rule for inverting conditional probabilities:

P(A|B) = P(B|A)P(A) / P(B) = [ P(B|A) / ∑X∈{A,Ā} P(B|X)P(X) ] P(A)   (11.4)

This equation can also be thought of as a way of updating probabilities. We start off with an initial estimate of how likely the event A is when we do not have any other information; this is the prior probability P(A). Bayes' rule lets us derive a posterior probability P(A|B) after having seen the evidence B, based on the likelihood of B occurring in the two cases that A does or does not hold.1 Finally, it is often useful to talk about the odds of an event, which provide a kind of multiplier for how probabilities change:

Odds: O(A) = P(A) / P(Ā) = P(A) / (1 − P(A))   (11.5)

11.2 The Probability Ranking Principle 11.2.1 The 1/0 loss case We assume a ranked retrieval setup as in Section 6.3, where there is a collection of documents, the user issues a query, and an ordered list of documents is returned. We also assume a binary notion of relevance as in Chapter 8. For a query q and a document d in the collection, let Rd,q be an indicator random variable that says whether d is relevant with respect to a given query q. That is, it takes on a value of 1 when the document is relevant and 0 otherwise. In context we will often write just R for Rd,q.
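A quick numeric check of these identities, with arbitrarily chosen probabilities (the values are purely illustrative):

```python
# Pick a prior and the two likelihoods arbitrarily.
p_a = 0.3                      # prior P(A)
p_b_given_a = 0.8              # likelihood P(B|A)
p_b_given_not_a = 0.1          # likelihood P(B|complement of A)

# Partition rule (11.3): P(B) = P(A,B) + P(not-A,B),
# expanding each joint with the chain rule.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' rule (11.4): posterior P(A|B).
p_a_given_b = p_b_given_a * p_a / p_b

# Odds (11.5): O(A) = P(A) / (1 - P(A)).
odds_a = p_a / (1 - p_a)
```

With these numbers the evidence B raises the probability of A from the prior 0.3 to a posterior of 0.24/0.31 ≈ 0.77.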
|------------
Using a probabilistic model, the obvious order in which to present documents to the user is to rank documents by their estimated probability of relevance with respect to the information need: P(R = 1|d, q). This is the basis of the Probability Ranking Principle (PRP) (van Rijsbergen 1979, 113–114): "If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data." In the simplest case of the PRP, there are no retrieval costs or other utility concerns that would differentially weight actions or errors. You lose a point for either returning a nonrelevant document or failing to return a relevant document (such a binary situation where you are evaluated on your accuracy is called 1/0 loss). The goal is to return the best possible results as the top k documents, for any value of k the user chooses to examine. The PRP then says to simply rank all documents in decreasing order of P(R = 1|d, q). If a set of retrieval results is to be returned, rather than an ordering, the Bayes optimal decision rule applies.
1. The term likelihood is just a synonym of probability. It is the probability of an event or data according to a model. The term is usually used when people are thinking of holding the data fixed, while varying the model.
***Ranked Boolean retrieval->
|------------
Given a Boolean query q and a document d, weighted zone scoring assigns to the pair (q, d) a score in the interval [0, 1], by computing a linear combination of zone scores, where each zone of the document contributes a Boolean value. More specifically, consider a set of documents each of which has ℓ zones. Let g1, . . . , gℓ ∈ [0, 1] such that ∑ℓi=1 gi = 1. For 1 ≤ i ≤ ℓ, let si be the Boolean score denoting a match (or absence thereof) between q and the ith zone. For instance, the Boolean score from a zone could be 1 if all the query term(s) occur in that zone, and zero otherwise; indeed, it could be any Boolean function that maps the presence of query terms in a zone to 0, 1. Then, the weighted zone score is defined to be ∑ℓi=1 gisi (6.1). Weighted zone scoring is sometimes also referred to as ranked Boolean retrieval.
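A minimal sketch of weighted zone scoring, using the all-query-terms-present Boolean function mentioned above (the document, zones, and weights are invented for illustration):

```python
def weighted_zone_score(query_terms, doc_zones, weights):
    """Weighted zone score: sum over zones i of g_i * s_i, where s_i is 1
    iff every query term occurs in zone i, and 0 otherwise."""
    score = 0.0
    for zone, g in weights.items():
        tokens = doc_zones[zone].lower().split()
        s = 1 if all(t in tokens for t in query_terms) else 0
        score += g * s
    return score

doc = {'title': 'Introduction to Information Retrieval',
       'body': 'information retrieval is the finding of material ...'}
weights = {'title': 0.6, 'body': 0.4}   # the g_i sum to 1
s = weighted_zone_score(['information', 'retrieval'], doc, weights)
```

A query matching both zones scores 0.6 + 0.4 = 1.0; one matching only the body scores 0.4.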
***document vector->
|------------
At this point, we may view each document as a vector with one component corresponding to each term in the dictionary, together with a weight for each component that is given by (6.8). For dictionary terms that do not occur in a document, this weight is zero. This vector form will prove to be crucial to scoring and ranking; we will develop these ideas in Section 6.3. As a first step, we introduce the overlap score measure: the score of a document d is the sum, over all query terms, of the number of times each of the query terms occurs in d. We can refine this idea so that we add up not the number of occurrences of each query term t in d, but instead the tf-idf weight of each term in d.
|------------
6.3 The vector space model for scoring In Section 6.2 (page 117) we developed the notion of a document vector that captures the relative importance of the terms in a document. The representation of a set of documents as vectors in a common vector space is known as the vector space model and is fundamental to a host of information retrieval operations ranging from scoring documents on a query to document classification and document clustering. We first develop the basic ideas underlying vector space scoring; a pivotal step in this development is the view (Section 6.3.2) of queries as vectors in the same vector space as the document collection.
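To make the vector view concrete, here is a toy sketch (the documents are invented, and the plain tf × log(N/df) weighting is one simple variant) that builds tf-idf vectors and ranks documents by cosine similarity with a query vector:

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, N):
    """Weight each term by tf * log(N / df), a simple tf-idf variant."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ['car insurance auto insurance'.split(),
        'best car auto'.split(),
        'worst hotel'.split()]
N = len(docs)
df = Counter(t for d in docs for t in set(d))      # document frequencies
vecs = [tfidf_vector(d, df, N) for d in docs]
qvec = tfidf_vector('car insurance'.split(), df, N)
best = max(range(N), key=lambda i: cosine(qvec, vecs[i]))
```

The first document, which repeats the rare query term insurance, gets the highest cosine score.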
***phrase search->
|------------
Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest Note the long, precise queries and the use of proximity operators, both uncommon in web search. Submitted queries average about ten words in length. Unlike web search conventions, a space between words represents disjunction (the tightest binding operator), & is AND and /s, /p, and /k ask for matches in the same sentence, same paragraph or within k words respectively. Double quotes give a phrase search (consecutive words); see Section 2.4 (page 39). The exclamation mark (!) gives a trailing wildcard query (see Section 3.2, page 51); thus liab! matches all words starting with liab. Additionally work-site matches any of worksite, work-site or work site; see Section 2.2.1 (page 22). Typical expert queries are usually carefully defined and incrementally developed until they obtain what look to be good results to the user.
***structural term->
|------------
We call such an XML-context/term pair a structural term and denote it by 〈c, t〉: a pair of XML-context c and vocabulary term t. The document in Figure 10.8 has nine structural terms. Seven are shown (e.g., "Bill" and Author#"Bill") and two are not shown: /Book/Author#"Bill" and /Book/Author#"Gates". The tree with the leaves Bill and Gates is a lexicalized subtree that is not a structural term. We use the previously introduced pseudo-XPath notation for structural terms.
***case-folding->
|------------
Capitalization/case-folding. A common strategy is to do case-folding by reducing all letters to lower case. Often this is a good idea: it will allow instances of Automobile at the beginning of a sentence to match with a query of automobile. It will also help on a web search engine when most of your users type in ferrari when they are interested in a Ferrari car. On the other hand, such case folding can equate words that might better be kept apart. Many proper nouns are derived from common nouns and so are distinguished only by case, including companies (General Motors, The Associated Press), government organizations (the Fed vs. fed) and person names (Bush, Black). We already mentioned an example of unintended query expansion with acronyms, which involved not only acronym normalization (C.A.T. → CAT) but also case-folding (CAT → cat).
|------------
For English, an alternative to making every token lowercase is to just make some tokens lowercase. The simplest heuristic is to convert to lowercase words at the beginning of a sentence and all words occurring in a title that is all uppercase or in which most or all words are capitalized. These words are usually ordinary words that have been capitalized. Mid-sentence capitalized words are left as capitalized (which is usually correct). This will mostly avoid case-folding in cases where distinctions should be kept apart. The same task can be done more accurately by a machine learning sequence model which uses more features to make the decision of when to case-fold. This is known as truecasing. However, trying to get capitalization right in this way probably doesn't help if your users usually use lowercase regardless of the correct case of words. Thus, lowercasing everything often remains the most practical solution.
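The sentence-initial part of the heuristic can be sketched as follows (a rough stand-in for illustration, not the machine-learned truecaser, and ignoring the all-uppercase-title case):

```python
import re

def heuristic_case_fold(text):
    """Lowercase only sentence-initial words; keep mid-sentence
    capitalization (e.g. proper nouns) intact."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    out = []
    for s in sentences:
        if s:
            s = s[0].lower() + s[1:]
        out.append(s)
    return ' '.join(out)

folded = heuristic_case_fold("The Fed raised rates. General Motors fell.")
```

Sentence-initial The and General are folded, while the mid-sentence proper nouns Fed and Motors keep their case.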
***lemmatizer->
|------------
The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective, is Porter's algorithm (Porter 1980). The entire algorithm is too long and intricate to present here, but we will indicate its general nature. Porter's algorithm consists of 5 phases of word reductions, applied sequentially. Within each phase there are various conventions to select rules, such as selecting the rule from each rule group that applies to the longest suffix. In the first phase, this convention is used with the following rule group (2.1):
Rule SSES → SS (example: caresses → caress); IES → I (ponies → poni); SS → SS (caress → caress); S → (cats → cat)
Many of the later rules use a concept of the measure of a word, which loosely checks the number of syllables to see whether a word is long enough that it is reasonable to regard the matching portion of a rule as a suffix rather than as part of the stem of a word. For example, the rule (m > 1) EMENT → would map replacement to replac, but not cement to c. The official site for the Porter Stemmer is: http://www.tartarus.org/˜martin/PorterStemmer/ Other stemmers exist, including the older, one-pass Lovins stemmer (Lovins 1968), and newer entrants like the Paice/Husk stemmer (Paice 1990); see: http://www.cs.waikato.ac.nz/˜eibe/stemmers/ http://www.comp.lancs.ac.uk/computing/research/stemming/ Figure 2.8 presents an informal comparison of the different behaviors of these stemmers. Stemmers use language-specific rules, but they require less knowledge than a lemmatizer, which needs a complete vocabulary and morphological analysis to correctly lemmatize words. Particular domains may also require special stemming rules. However, the exact stemmed form does not matter, only the equivalence classes it forms.
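The first-phase rule group can be rendered directly in code. This sketch implements only the four rules of (2.1) with the longest-matching-suffix convention, not the full five-phase Porter algorithm:

```python
# The phase-1 rule group (2.1), ordered so the longest suffix is tried first.
RULES = [('sses', 'ss'), ('ies', 'i'), ('ss', 'ss'), ('s', '')]

def porter_phase1(word):
    """Apply the first matching (i.e. longest-suffix) rule of group (2.1)."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word
```

The SS → SS rule looks like a no-op, but it matters: it stops words like caress from falling through to the S → (empty) rule, which would wrongly strip the final s.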
|------------
Rather than using a stemmer, you can use a lemmatizer, a tool from Natural Language Processing which does full morphological analysis to accurately identify the lemma for each word. Doing full morphological analysis produces at most very modest benefits for retrieval. It is hard to say more,
Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation
Lovins stemmer: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres
Porter stemmer: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret
Paice stemmer: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret
◮ Figure 2.8 A comparison of three stemming algorithms on a sample text.
***spectral clustering->
|------------
An example of an efficient divisive algorithm is bisecting K-means (Steinbach et al. 2000). Spectral clustering algorithms (Kannan et al. 2000, Dhillon 2001, Zha et al. 2001, Ng et al. 2001a), including principal direction divisive partitioning (PDDP) (whose bisecting decisions are based on SVD, see Chapter 18) (Boley 1998, Savaresi and Boley 2004), are computationally more expensive than bisecting K-means, but have the advantage of being deterministic.
***SPIMI->
|------------
Exercise 4.2 [⋆] How would you create the dictionary in blocked sort-based indexing on the fly to avoid an extra pass through the data?
SPIMI-INVERT(token_stream)
 1 output_file = NEWFILE()
 2 dictionary = NEWHASH()
 3 while (free memory available)
 4 do token ← next(token_stream)
 5    if term(token) ∉ dictionary
 6      then postings_list = ADDTODICTIONARY(dictionary, term(token))
 7      else postings_list = GETPOSTINGSLIST(dictionary, term(token))
 8    if full(postings_list)
 9      then postings_list = DOUBLEPOSTINGSLIST(dictionary, term(token))
10    ADDTOPOSTINGSLIST(postings_list, docID(token))
11 sorted_terms ← SORTTERMS(dictionary)
12 WRITEBLOCKTODISK(sorted_terms, dictionary, output_file)
13 return output_file
◮ Figure 4.4 Inversion of a block in single-pass in-memory indexing
4.3 Single-pass in-memory indexing Blocked sort-based indexing has excellent scaling properties, but it needs a data structure for mapping terms to termIDs. For very large collections, this data structure does not fit into memory. A more scalable alternative is single-pass in-memory indexing or SPIMI. SPIMI uses terms instead of termIDs, writes each block's dictionary to disk, and then starts a new dictionary for the next block. SPIMI can index collections of any size as long as there is enough disk space available.
|------------
The SPIMI algorithm is shown in Figure 4.4. The part of the algorithm that parses documents and turns them into a stream of term–docID pairs, which we call tokens here, has been omitted. SPIMI-INVERT is called repeatedly on the token stream until the entire collection has been processed.
|------------
Tokens are processed one by one (line 4) during each successive call of SPIMI-INVERT. When a term occurs for the first time, it is added to the dictionary (best implemented as a hash), and a new postings list is created (line 6). The call in line 7 returns this postings list for subsequent occurrences of the term.
|------------
A difference between BSBI and SPIMI is that SPIMI adds a posting directly to its postings list (line 10). Instead of first collecting all termID–docID pairs and then sorting them (as we did in BSBI), each postings list is dynamic (i.e., its size is adjusted as it grows) and it is immediately available to collect postings. This has two advantages: It is faster because there is no sorting required, and it saves memory because we keep track of the term a postings list belongs to, so the termIDs of postings need not be stored. As a result, the blocks that individual calls of SPIMI-INVERT can process are much larger and the index construction process as a whole is more efficient.
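An in-memory Python rendering of SPIMI-INVERT might look as follows. This is a sketch: real SPIMI monitors free memory and writes each block's sorted dictionary to disk, which is replaced here by a simple token budget and an in-memory return value:

```python
def spimi_invert(token_stream, block_size=1000):
    """Build one SPIMI block in memory from (term, docID) tokens.
    The dictionary maps each term directly to its growing postings list,
    so no termIDs are needed and no global sort of postings is done."""
    dictionary = {}
    for _, (term, doc_id) in zip(range(block_size), token_stream):
        # Add the term on first occurrence, else fetch its postings list.
        postings = dictionary.setdefault(term, [])
        postings.append(doc_id)          # postings lists grow dynamically
    # Sort the terms (as in lines 11-12 of Figure 4.4) before "writing".
    return sorted(dictionary.items())

tokens = [('brutus', 1), ('caesar', 1), ('caesar', 2), ('brutus', 4)]
block = spimi_invert(iter(tokens))
```

Because postings are appended in document order as tokens stream by, only the terms need sorting at block-write time, mirroring the efficiency argument above.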
***eigen decomposition->
|------------
Theorem 18.1. (Matrix diagonalization theorem) Let S be a square real-valued M × M matrix with M linearly independent eigenvectors. Then there exists an eigen decomposition

S = UΛU⁻¹   (18.5)

where the columns of U are the eigenvectors of S and Λ is a diagonal matrix whose diagonal entries are the eigenvalues of S in decreasing order,

λ1 ≥ λ2 ≥ · · · ≥ λM   (18.6)

If the eigenvalues are distinct, then this decomposition is unique.
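A hand-worked instance of the theorem, verified numerically: the symmetric matrix S = [[2, 1], [1, 2]] has eigenvalues 3 and 1 with eigenvectors (1, 1) and (1, −1), and multiplying UΛU⁻¹ back out recovers S (the matrix is chosen purely for illustration):

```python
S = [[2.0, 1.0], [1.0, 2.0]]
U = [[1.0, 1.0], [1.0, -1.0]]      # eigenvectors as columns
Lam = [[3.0, 0.0], [0.0, 1.0]]     # eigenvalues in decreasing order

# Inverse of the 2x2 matrix U via the adjugate formula.
det = U[0][0] * U[1][1] - U[0][1] * U[1][0]
Uinv = [[U[1][1] / det, -U[0][1] / det],
        [-U[1][0] / det, U[0][0] / det]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

reconstructed = matmul(matmul(U, Lam), Uinv)   # should equal S
```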
***minimum variance clustering->
***static->
|------------
The two basic kinds of summaries are static, which are always the same regardless of the query, and dynamic (or query-dependent), which are customized according to the user's information need as deduced from a query.
|------------
A static summary generally comprises a subset of the document, metadata associated with the document, or both. The simplest form of summary takes the first two sentences or 50 words of a document, or extracts particular zones of a document, such as the title and author. Instead of zones of a document, the summary can instead use metadata associated with the document. This may be an alternative way to provide an author or date, or may include elements which are designed to give a summary, such as the description metadata which can appear in the meta element of a web HTML page. This summary is typically extracted and cached at indexing time, in such a way that it can be retrieved and presented quickly when displaying search results, whereas having to access the actual document content might be a relatively expensive operation.
|------------
Dynamic summaries display one or more "windows" on the document, aiming to present the pieces that have the most utility to the user in evaluating the document with respect to their information need. Usually these windows contain one or several of the query terms, and so are often referred to as keyword-in-context (KWIC) snippets, though sometimes they may still be pieces of the text such as the title that are selected for their query-independent information value just as in the case of static summarization.
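A crude keyword-in-context window can be extracted as follows (the window width and the fall-back to the document start are illustrative choices, not a method from the chapter):

```python
def kwic_window(doc_tokens, query_terms, width=3):
    """Return a KWIC snippet: `width` tokens on either side of the first
    query-term occurrence; fall back to the document start if none occurs."""
    for i, tok in enumerate(doc_tokens):
        if tok.lower() in query_terms:
            lo = max(0, i - width)
            return ' '.join(doc_tokens[lo:i + width + 1])
    return ' '.join(doc_tokens[:2 * width + 1])

doc = "the quick brown fox jumps over the lazy dog near the river".split()
snippet = kwic_window(doc, {'jumps'})
```

A production snippet generator would instead score many candidate windows (e.g. by how many distinct query terms they contain) and stitch the best ones together.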
***flat clustering->
|------------
Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other. Hierarchical clustering creates a hierarchy of clusters and will be covered in Chapter 17. Chapter 17 also addresses the difficult problem of labeling clusters automatically.
|------------
This chapter motivates the use of clustering in information retrieval by introducing a number of applications (Section 16.1), defines the problem we are trying to solve in clustering (Section 16.2) and discusses measures for evaluating cluster quality (Section 16.3). It then describes two flat clustering algorithms, K-means (Section 16.4), a hard clustering algorithm, and the Expectation-Maximization (or EM) algorithm (Section 16.5), a soft clustering algorithm. K-means is perhaps the most widely used flat clustering algorithm due to its simplicity and efficiency. The EM algorithm is a generalization of K-means and can be applied to a large variety of document representations and distributions.
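A minimal hard K-means on 1-D points, as a sketch of the assign/recompute iteration described in Section 16.4 (the data and starting centroids are invented):

```python
def kmeans(points, centroids, iters=10):
    """Hard K-means: repeatedly assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Empty clusters keep their old centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centroids, clusters = kmeans(points, [0.0, 5.0])
```

On this toy data the centroids converge to the two group means, 1.0 and 8.0, after a single iteration.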
***inter-similarity->
|------------
◮ Figure 17.3 The different notions of cluster similarity used by the four HAC algorithms: (a) single-link: maximum similarity; (b) complete-link: minimum similarity; (c) centroid: average inter-similarity; (d) group-average: average of all similarities. An inter-similarity is a similarity between two documents from different clusters.
***front coding->
|------------
One source of redundancy in the dictionary we have not exploited yet is the fact that consecutive entries in an alphabetically sorted list share common prefixes. This observation leads to front coding (Figure 5.7).
(a) aid box den ex job ox pit win (b) aid box den ex job ox pit win
◮ Figure 5.6 Search of the uncompressed dictionary (a) and a dictionary compressed by blocking with k = 4 (b).
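The prefix-sharing idea behind front coding can be sketched as follows. Note this (prefix-length, suffix) encoding is a simplified stand-in for illustration; the byte-level layout shown in Figure 5.7 differs:

```python
def front_code(sorted_terms):
    """Encode each term as (length of shared prefix with the previous
    term, remaining suffix)."""
    out, prev = [], ''
    for term in sorted_terms:
        k = 0
        while k < min(len(prev), len(term)) and prev[k] == term[k]:
            k += 1
        out.append((k, term[k:]))
        prev = term
    return out

def front_decode(coded):
    """Invert front_code by rebuilding each term from its predecessor."""
    terms, prev = [], ''
    for k, suffix in coded:
        prev = prev[:k] + suffix
        terms.append(prev)
    return terms

coded = front_code(['automata', 'automate', 'automatic', 'automation'])
```

The long shared prefix automat is stored once; subsequent entries carry only a prefix length and a short suffix.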
***digital libraries->
|------------
However, many structured data sources containing text are best modeled as structured documents rather than relational data. We call the search over such structured documents structured retrieval. Queries in structured retrieval can be either structured or unstructured, but we will assume in this chapter that the collection consists only of structured documents. Applications of structured retrieval include digital libraries, patent databases, blogs, text in which entities like persons and locations have been tagged (in a process called named entity tagging) and output from office suites like OpenOffice that save documents as marked up text. In all of these applications, we want to be able to run queries that combine textual criteria with structural criteria.
|------------
Examples of such queries are give me a full-length article on fast fourier transforms (digital libraries), give me patents whose claims mention RSA public key encryption
1. In most modern database systems, one can enable full-text search for text columns. This usually means that an inverted index is created and Boolean or vector space search enabled, effectively combining core database with information retrieval technologies.
***low-rank approximation->
|------------
18.3 Low-rank approximations We next state a matrix approximation problem that at first seems to have little to do with information retrieval. We describe a solution to this matrix problem using singular-value decompositions, then develop its application to information retrieval.
|------------
Given an M × N matrix C and a positive integer k, we wish to find an M × N matrix Ck of rank at most k, so as to minimize the Frobenius norm of the matrix difference X = C − Ck, defined to be

‖X‖F = √( ∑Mi=1 ∑Nj=1 X²ij )   (18.15)

Thus, the Frobenius norm of X measures the discrepancy between Ck and C; our goal is to find a matrix Ck that minimizes this discrepancy, while constraining Ck to have rank at most k. If r is the rank of C, clearly Cr = C and the Frobenius norm of the discrepancy is zero in this case. When k is far smaller than r, we refer to Ck as a low-rank approximation. The singular value decomposition can be used to solve the low-rank matrix approximation problem. We then derive from it an application to approximating term-document matrices. We invoke the following three-step procedure to this end:
Ck = UΣkVT
◮ Figure 18.2 Illustration of low-rank approximation using the singular-value decomposition. The dashed boxes indicate the matrix entries affected by "zeroing out" the smallest singular values.
***spider traps->
|------------
19.5 Index size and estimation To a first approximation, comprehensiveness grows with index size, although it does matter which specific pages a search engine indexes – some pages are more informative than others. It is also difficult to reason about the fraction of the Web indexed by a search engine, because there is an infinite number of dynamic web pages; for instance, http://www.yahoo.com/any_string returns a valid HTML page rather than an error, politely informing the user that there is no such page at Yahoo! Such a "soft 404 error" is only one example of many ways in which web servers can generate an infinite number of valid web pages. Indeed, some of these are malicious spider traps devised to cause a search engine's crawler (the component that systematically gathers web pages for the search engine's index, described in Chapter 20) to stay within a spammer's website and index many pages from that site.
***distributed indexing->
|------------
4.4 Distributed indexing Collections are often so large that we cannot perform index construction efficiently on a single machine. This is particularly true of the World Wide Web for which we need large computer clusters1 to construct any reasonably sized web index. Web search engines, therefore, use distributed indexing algorithms for index construction. The result of the construction process is a distributed index that is partitioned across several machines – either according to term or according to document. In this section, we describe distributed indexing for a term-partitioned index. Most large search engines prefer a document-

1. A cluster in this chapter is a group of tightly coupled computers that work together closely.
***marginal statistic->
|------------
kappa = (P(A) − P(E)) / (1 − P(E))   (8.10)

where P(A) is the proportion of the times the judges agreed, and P(E) is the proportion of the times they would be expected to agree by chance. There are choices in how the latter is estimated: if we simply say we are making a two-class decision and assume nothing more, then the expected chance agreement rate is 0.5. However, normally the class distribution assigned is skewed, and it is usual to use marginal statistics to calculate expected agreement.2 There are still two ways to do it depending on whether one pools

2. For a contingency table, as in Table 8.2, a marginal statistic is formed by summing a row or column. The marginal a_i.k = ∑j a_ijk.
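As a concrete illustration, here is a minimal sketch of computing kappa with P(E) estimated from marginal statistics. The 2 × 2 contingency table of counts is made up for the example; it is not from the text.

```python
# Kappa for two judges making a binary relevance decision.
# table[i][j] = number of documents judge 1 labeled i and judge 2 labeled j.
# The numbers below are illustrative only.

def kappa(table):
    n = sum(sum(row) for row in table)
    # P(A): observed proportion of agreement (diagonal of the table).
    p_a = sum(table[i][i] for i in range(len(table))) / n
    # P(E) from marginal statistics: for each class, multiply the two
    # judges' marginal proportions and sum over classes.
    p_e = 0.0
    for k in range(len(table)):
        row_marginal = sum(table[k]) / n             # judge 1's rate for class k
        col_marginal = sum(r[k] for r in table) / n  # judge 2's rate for class k
        p_e += row_marginal * col_marginal
    return (p_a - p_e) / (1 - p_e)

table = [[300, 20],
         [10, 70]]
print(round(kappa(table), 3))  # 0.776
```

Here P(A) = 370/400 = 0.925 and the marginal estimate gives P(E) = 0.665, so kappa is well above chance agreement.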
***single-pass in-memory indexing->
|------------
Exercise 4.2 [⋆] How would you create the dictionary in blocked sort-based indexing on the fly to avoid an extra pass through the data?

SPIMI-INVERT(token_stream)
 1  output_file = NEWFILE()
 2  dictionary = NEWHASH()
 3  while (free memory available)
 4  do token ← next(token_stream)
 5     if term(token) ∉ dictionary
 6        then postings_list = ADDTODICTIONARY(dictionary, term(token))
 7        else postings_list = GETPOSTINGSLIST(dictionary, term(token))
 8     if full(postings_list)
 9        then postings_list = DOUBLEPOSTINGSLIST(dictionary, term(token))
10     ADDTOPOSTINGSLIST(postings_list, docID(token))
11  sorted_terms ← SORTTERMS(dictionary)
12  WRITEBLOCKTODISK(sorted_terms, dictionary, output_file)
13  return output_file

◮ Figure 4.4 Inversion of a block in single-pass in-memory indexing

4.3 Single-pass in-memory indexing Blocked sort-based indexing has excellent scaling properties, but it needs a data structure for mapping terms to termIDs. For very large collections, this data structure does not fit into memory. A more scalable alternative is single-pass in-memory indexing or SPIMI. SPIMI uses terms instead of termIDs, writes each block’s dictionary to disk, and then starts a new dictionary for the next block. SPIMI can index collections of any size as long as there is enough disk space available.
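The inversion step of SPIMI can be sketched in a few lines of Python. This is a simplified, in-memory rendering of SPIMI-INVERT: the memory check and the disk write are omitted, Python lists stand in for the growable postings lists, and the block is simply returned in sorted-term order. It assumes docIDs arrive in nondecreasing order per term, so duplicates can be skipped by looking at the last posting.

```python
# A minimal in-memory sketch of SPIMI-INVERT: terms (not termIDs) key a
# hash dictionary of growable postings lists; when the block is done,
# the terms are sorted so blocks can later be merged with a linear scan.

def spimi_invert(token_stream):
    """token_stream yields (term, docID) pairs; returns one sorted block."""
    dictionary = {}
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])  # add term on first sight
        if not postings or postings[-1] != doc_id:  # skip duplicate docIDs
            postings.append(doc_id)
    return {term: dictionary[term] for term in sorted(dictionary)}

tokens = [("caesar", 1), ("brutus", 1), ("caesar", 2), ("brutus", 1)]
block = spimi_invert(tokens)
print(block)  # {'brutus': [1], 'caesar': [1, 2]}
```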
***slack variables->
|------------
◮ Figure 15.5 Large margin classification with slack variables.
|------------
If the training set D is not linearly separable, the standard approach is to allow the fat decision margin to make a few mistakes (some points – outliers or noisy examples – are inside or on the wrong side of the margin). We then pay a cost for each misclassified example, which depends on how far it is from meeting the margin requirement given in Equation (15.5). To implement this, we introduce slack variables ξi. A non-zero value for ξi allows ~xi to not meet the margin requirement at a cost proportional to the value of ξi. See Figure 15.5.
|------------
The formulation of the SVM optimization problem with slack variables is:

(15.10) Find ~w, b, and ξi ≥ 0 such that:
• (1/2) ~w^T ~w + C ∑i ξi is minimized
• and for all {(~xi, yi)}, yi(~w^T ~xi + b) ≥ 1 − ξi

The optimization problem is then trading off how fat it can make the margin versus how many points have to be moved around to allow this margin.
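The tradeoff can be made concrete with a toy subgradient-descent sketch of the primal objective above. This is an illustration only, not a production SVM solver: the data, learning rate, and epoch count are all made up, and real systems use specialized optimizers.

```python
# Subgradient descent on (1/2) w.w + C * sum_i xi_i,
# where xi_i = max(0, 1 - y_i (w.x_i + b)) is the hinge/slack term.

def train_svm(data, C=1.0, epochs=200, lr=0.01):
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            for i in range(dim):
                grad = w[i]                 # from the (1/2) w.w term
                if margin < 1:              # slack is active: add hinge term
                    grad -= C * y * x[i]
                w[i] -= lr * grad
            if margin < 1:
                b += lr * C * y
    return w, b

# Linearly separable toy data in 2D.
data = [((2.0, 2.0), 1), ((3.0, 1.5), 1), ((-2.0, -1.0), -1), ((-1.5, -2.5), -1)]
w, b = train_svm(data)
correct = sum(1 for x, y in data
              if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0)
print(correct)  # 4
```

A larger C penalizes slack more heavily (narrower margin, fewer violations); a smaller C tolerates more points inside the margin.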
***parser->
|------------
The map phase of MapReduce consists of mapping splits of the input data to key-value pairs. This is the same parsing task we also encountered in BSBI and SPIMI, and we therefore call the machines that execute the map phase parsers. Each parser writes its output to local intermediate files, the segment files (shown as a-f, g-p, q-z in Figure 4.5).
|------------
For the reduce phase, we want all values for a given key to be stored close together, so that they can be read and processed quickly. This is achieved by

◮ Figure 4.5 An example of distributed indexing with MapReduce: a master assigns splits to parsers in the map phase and segment files (a-f, g-p, q-z) to inverters in the reduce phase, which produce postings. Adapted from Dean and Ghemawat (2004).
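The map/partition/reduce pipeline in Figure 4.5 can be simulated in a single process. This sketch stands in for the distributed machinery: parsers map document splits to (term, docID) pairs, pairs are partitioned into term-range segments (here by first letter), and inverters sort each segment and merge it into postings lists. The split contents and key ranges are illustrative.

```python
from collections import defaultdict

def map_phase(split):
    """Parse one split into (term, docID) pairs."""
    return [(term, doc_id) for doc_id, text in split for term in text.split()]

def partition(pairs, key_ranges):
    """Assign each pair to a term-range segment, e.g. a-f, g-p, q-z."""
    segments = defaultdict(list)
    for term, doc_id in pairs:
        for lo, hi in key_ranges:
            if lo <= term[0] <= hi:
                segments[(lo, hi)].append((term, doc_id))
    return segments

def reduce_phase(segment):
    """Sort a segment's pairs and merge them into postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(segment):
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return dict(postings)

split = [(1, "caesar died"), (2, "caesar came")]
pairs = map_phase(split)
segments = partition(pairs, [("a", "f"), ("g", "p"), ("q", "z")])
index = {}
for seg in segments.values():
    index.update(reduce_phase(seg))
print(index)  # {'caesar': [1, 2], 'came': [2], 'died': [1]}
```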
***query likelihood model->
|------------
Exercise 12.2 [⋆] If the stop probability is omitted from calculations, what will the sum of the scores assigned to strings in the language of length 1 be?

Exercise 12.3 [⋆] What is the likelihood ratio of the document according to M1 and M2 in Example 12.2?

Exercise 12.4 [⋆] No explicit STOP probability appeared in Example 12.2. Assuming that the STOP probability of each model is 0.1, does this change the likelihood ratio of a document according to the two models?

Exercise 12.5 [⋆⋆] How might a language model be used in a spelling correction system? In particular, consider the case of context-sensitive spelling correction, and correcting incorrect usages of words, such as their in Are you their? (See Section 3.5 (page 65) for pointers to some literature on this topic.)

12.2 The query likelihood model
12.2.1 Using query likelihood language models in IR
Language modeling is a quite general formal approach to IR, with many variant realizations. The original and basic method for using language models in IR is the query likelihood model. In it, we construct from each document d in the collection a language model Md. Our goal is to rank documents by P(d|q), where the probability of a document is interpreted as the likelihood that it is relevant to the query. Using Bayes rule (as introduced in Section 11.1, page 220), we have:

P(d|q) = P(q|d)P(d)/P(q)

P(q) is the same for all documents, and so can be ignored. The prior probability of a document P(d) is often treated as uniform across all d and so it can also be ignored, but we could implement a genuine prior which could include criteria like authority, length, genre, newness, and number of previous people who have read the document. But, given these simplifications, we return results ranked by simply P(q|d), the probability of the query q under the language model derived from d.
The Language Modeling approach thus attempts to model the query generation process: Documents are ranked by the probability that a query would be observed as a random sample from the respective document model.
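A minimal sketch of this ranking, under assumed simplifications: each document model Md is a unigram maximum-likelihood estimate, interpolated with the collection model to avoid zero probabilities (smoothing is covered in detail later; the mixing weight here is arbitrary). The documents and query are made up.

```python
import math

def score(query, doc, collection, lam=0.5):
    """log P(q|d) with P(t|d) interpolated with the collection model."""
    s = 0.0
    for t in query.split():
        p_doc = doc.count(t) / len(doc)
        p_coll = collection.count(t) / len(collection)
        s += math.log(lam * p_doc + (1 - lam) * p_coll)
    return s

docs = ["click go the shears boys click click click".split(),
        "metal shears click here to order".split()]
collection = [t for d in docs for t in d]  # collection model from all docs
query = "shears click"
ranked = sorted(range(len(docs)),
                key=lambda i: score(query, docs[i], collection),
                reverse=True)
print(ranked)  # the first document, with more 'click' mass, ranks first
```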
***reduced SVD->
|------------
as the reduced SVD or truncated SVD and we will encounter it again in Exercise 18.9. Henceforth, our numerical examples and exercises will use this reduced form.
|------------
Exercise 18.10 Exercise 18.9 can be generalized to rank-k approximations: we let U′k and V′k denote the “reduced” matrices formed by retaining only the first k columns of U and V, respectively. Thus U′k is an M × k matrix while V′^T_k is a k × N matrix. Then, we have

Ck = U′k Σ′k V′^T_k,   (18.20)

where Σ′k is the square k × k submatrix of Σk with the singular values σ1, . . . , σk on the diagonal. The primary advantage of using (18.20) is to eliminate a lot of redundant columns of zeros in U and V, thereby explicitly eliminating multiplication by columns that do not affect the low-rank approximation; this version of the SVD is sometimes known as the reduced SVD or truncated SVD and is a computationally simpler representation from which to compute the low-rank approximation.
***multilabel classification->
***hierarchical clustering->
|------------
Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other. Hierarchical clustering creates a hierarchy of clusters and will be covered in Chapter 17. Chapter 17 also addresses the difficult problem of labeling clusters automatically.
|------------
17 Hierarchical clustering Flat clustering is efficient and conceptually simple, but as we saw in Chapter 16 it has a number of drawbacks. The algorithms introduced in Chapter 16 return a flat unstructured set of clusters, require a prespecified number of clusters as input and are nondeterministic. Hierarchical clustering (or hierarchic clustering) outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by flat clustering.1 Hierarchical clustering does not require us to prespecify the number of clusters and most hierarchical algorithms that have been used in IR are deterministic. These advantages of hierarchical clustering come at the cost of lower efficiency. The most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents compared to the linear complexity of K-means and EM (cf. Section 16.4, page 364).
|------------
This chapter first introduces agglomerative hierarchical clustering (Section 17.1) and presents four different agglomerative algorithms, in Sections 17.2–17.4, which differ in the similarity measures they employ: single-link, complete-link, group-average, and centroid similarity. We then discuss the optimality conditions of hierarchical clustering in Section 17.5. Section 17.6 introduces top-down (or divisive) hierarchical clustering. Section 17.7 looks at labeling clusters automatically, a problem that must be solved whenever humans interact with the output of clustering. We discuss implementation issues in Section 17.8. Section 17.9 provides pointers to further reading, including references to soft hierarchical clustering, which we do not cover in this book.
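To make one of these similarity measures concrete, here is a naive single-link agglomerative sketch on 2D points (squared Euclidean distance stands in for document similarity). The points are made up, and the implementation is cubic-ish, meant only to show the merge order, not the book's efficient algorithms.

```python
# Single-link agglomerative clustering: start with every item in its own
# cluster and repeatedly merge the two clusters whose *closest* members
# are nearest, until the desired number of clusters remains.

def single_link(points, num_clusters):
    clusters = [[p] for p in points]

    def dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    def cluster_dist(c1, c2):
        # Single link: distance of the two closest members.
        return min(dist(a, b) for a in c1 for b in c2)

    while len(clusters) > num_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters = single_link(points, 2)
print(sorted(sorted(c) for c in clusters))
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```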
|------------
There are few differences between the applications of flat and hierarchical clustering in information retrieval. In particular, hierarchical clustering is appropriate for any of the applications shown in Table 16.1 (page 351; see also Section 16.6, page 372). In fact, the example we gave for collection clustering is hierarchical. In general, we select flat clustering when efficiency is important and hierarchical clustering when one of the potential problems

1. In this chapter, we only consider hierarchies that are binary trees like the one shown in Figure 17.1 – but hierarchical clustering can be easily extended to other types of trees.
***http->
|------------
The invention of hypertext, envisioned by Vannevar Bush in the 1940’s and first realized in working systems in the 1970’s, significantly precedes the formation of the World Wide Web (which we will simply refer to as the Web), in the 1990’s. Web usage has shown tremendous growth to the point where it now claims a good fraction of humanity as participants, by relying on a simple, open client-server design: (1) the server communicates with the client via a protocol (the http or hypertext transfer protocol) that is lightweight and simple, asynchronously carrying a variety of payloads (text, images and – over time – richer media such as audio and video files) encoded in a simple markup language called HTML (for hypertext markup language); (2) the client – generally a browser, an application within a graphical user environment – can ignore what it does not understand. Each of these seemingly innocuous features has contributed enormously to the growth of the Web, so it is worthwhile to examine them further.
***compounds->
|------------
Other languages make the problem harder in new ways. German writes compound nouns without spaces (e.g., Computerlinguistik ‘computational linguistics’; Lebensversicherungsgesellschaftsangestellter ‘life insurance company employee’). Retrieval systems for German greatly benefit from the use of a compound-splitter module, which is usually implemented by seeing if a word can be subdivided into multiple words that appear in a vocabulary. This phenomenon reaches its limit case with major East Asian languages (e.g., Chinese, Japanese, Korean, and Thai), where text is written without any spaces between words. An example is shown in Figure 2.3. One approach here is to perform word segmentation as prior linguistic processing. Methods of word segmentation vary from having a large vocabulary and taking the longest vocabulary match with some heuristics for unknown words to the use of machine learning sequence models, such as hidden Markov models or conditional random fields, trained over hand-segmented words (see the references

◮ Figure 2.3 The standard unsegmented form of Chinese text using the simplified characters of mainland China. There is no whitespace between words, not even between sentences – the apparent space after the Chinese period (◦) is just a typographical illusion caused by placing the character on the left side of its square box. The first sentence is just words in Chinese characters with no spaces between them. The second and third sentences include Arabic numerals and punctuation breaking up the Chinese characters.
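The longest-vocabulary-match heuristic mentioned above can be sketched directly; the same greedy idea underlies a simple compound splitter. The vocabulary below is a toy stand-in, real systems back off to unknown-word heuristics or sequence models, and here unmatched characters are simply emitted one at a time.

```python
# Greedy longest-match segmentation against a known vocabulary.

def segment(text, vocabulary):
    max_len = max(len(w) for w in vocabulary)
    words, i = [], 0
    while i < len(text):
        # Try the longest vocabulary match starting at position i.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocabulary:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # unknown single character
            i += 1
    return words

vocab = {"computer", "linguistik", "computerlinguistik"}
print(segment("computerlinguistik", vocab))   # ['computerlinguistik']
print(segment("computerxlinguistik", vocab))  # ['computer', 'x', 'linguistik']
```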
***connectivity queries->
|------------
20.4 Connectivity servers For reasons to become clearer in Chapter 21, web search engines require a connectivity server that supports fast connectivity queries on the web graph. Typical connectivity queries are which URLs link to a given URL? and which URLs does a given URL link to? To this end, we wish to store mappings in memory from URL to out-links, and from URL to in-links. Applications include crawl control, web graph analysis, sophisticated crawl optimization and link analysis (to be covered in Chapter 21).
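A toy in-memory version of these two mappings might look as follows; the URLs are illustrative, and none of the compression techniques that Chapter 20 goes on to describe appear here.

```python
from collections import defaultdict

# URL -> out-links and URL -> in-links, answering both connectivity
# query types over a small hard-coded web graph.

class ConnectivityServer:
    def __init__(self, edges):
        self.out_links = defaultdict(set)
        self.in_links = defaultdict(set)
        for src, dst in edges:
            self.out_links[src].add(dst)
            self.in_links[dst].add(src)

    def links_from(self, url):   # which URLs does a given URL link to?
        return sorted(self.out_links[url])

    def links_to(self, url):     # which URLs link to a given URL?
        return sorted(self.in_links[url])

edges = [("a.com", "b.com"), ("a.com", "c.com"), ("b.com", "c.com")]
server = ConnectivityServer(edges)
print(server.links_to("c.com"))    # ['a.com', 'b.com']
print(server.links_from("a.com"))  # ['b.com', 'c.com']
```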
***INEX->
|------------
Exercise 10.3 How many structural terms does the document in Figure 10.1 yield?

10.4 Evaluation of XML retrieval The premier venue for research on XML retrieval is the INEX (INitiative for the Evaluation of XML retrieval) program, a collaborative effort that has produced reference collections, sets of queries, and relevance judgments. A yearly INEX meeting is held to present and discuss research results.

12,107      number of documents
494 MB      size
1995–2002   time of publication of articles
1,532       average number of XML nodes per document
6.9         average depth of a node
30          number of CAS topics
30          number of CO topics

◮ Table 10.2 INEX 2002 collection statistics.
***precision at k->
|------------
The above measures factor in precision at all recall levels. For many prominent applications, particularly web search, this may not be germane to users.
|------------
What matters is rather how many good results there are on the first page or the first three pages. This leads to measuring precision at fixed low levels of retrieved results, such as 10 or 30 documents. This is referred to as “Precision at k”, for example “Precision at 10”. It has the advantage of not requiring any estimate of the size of the set of relevant documents but the disadvantages that it is the least stable of the commonly used evaluation measures and that it does not average well, since the total number of relevant documents for a query has a strong influence on precision at k.
|------------
An alternative, which alleviates this problem, is R-precision. It requires having a set of known relevant documents Rel, from which we calculate the precision of the top |Rel| documents returned. (The set Rel may be incomplete, such as when Rel is formed by creating relevance judgments for the pooled top k results of particular systems in a set of experiments.) R-precision adjusts for the size of the set of relevant documents: A perfect system could score 1 on this metric for each query, whereas, even a perfect system could only achieve a precision at 20 of 0.4 if there were only 8 documents in the collection relevant to an information need. Averaging this measure across queries thus makes more sense. This measure is harder to explain to naive users than Precision at k but easier to explain than MAP. If there are |Rel| relevant documents for a query and we examine the top |Rel| results of a system and find that r are relevant, then by definition, not only is the precision (and hence R-precision) r/|Rel|, but the recall of this result set is also r/|Rel|.
|------------
Thus, R-precision turns out to be identical to the break-even point, another measure which is sometimes used, defined in terms of this equality relationship holding. Like Precision at k, R-precision describes only one point on the precision-recall curve, rather than attempting to summarize effectiveness across the curve, and it is somewhat unclear why you should be interested in the break-even point rather than either the best point on the curve (the point with maximal F-measure) or a retrieval level of interest to a particular application (Precision at k). Nevertheless, R-precision turns out to be highly correlated with MAP empirically, despite measuring only a single point on

◮ Figure 8.4 The ROC curve (sensitivity (= recall) plotted against 1 − specificity) corresponding to the precision-recall curve in Figure 8.2.
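Both measures are simple to state in code. In this sketch `ranking` is a list of docIDs in retrieval order and `relevant` is the set Rel of known relevant docIDs; the example numbers are made up.

```python
# Precision at k and R-precision over a ranked result list.

def precision_at_k(ranking, relevant, k):
    return sum(1 for d in ranking[:k] if d in relevant) / k

def r_precision(ranking, relevant):
    # Precision of the top |Rel| results; by the identity in the text,
    # this equals the recall of that result set.
    r = len(relevant)
    return sum(1 for d in ranking[:r] if d in relevant) / r

ranking = [1, 5, 2, 8, 3, 9, 4, 7, 6, 10]
relevant = {1, 2, 3, 4}
print(precision_at_k(ranking, relevant, 10))  # 0.4: all 4 relevant in top 10
print(r_precision(ranking, relevant))         # 0.5: 2 of the top |Rel| = 4
```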
***EM algorithm->
|------------
A commonly used algorithm for model-based clustering is the Expectation-Maximization algorithm or EM algorithm. EM clustering is an iterative algorithm that maximizes L(D|Θ). EM can be applied to many different types of probabilistic modeling. We will work with a mixture of multivariate Bernoulli distributions here, the distribution we know from Section 11.3 (page 222) and Section 13.3 (page 263):

P(d|ωk; Θ) = ( ∏ tm∈d qmk ) ( ∏ tm∉d (1 − qmk) )   (16.14)

where Θ = {Θ1, . . . , ΘK}, Θk = (αk, q1k, . . . , qMk), and qmk = P(Um = 1|ωk) are the parameters of the model.3 P(Um = 1|ωk) is the probability that a document from cluster ωk contains term tm. The probability αk is the prior of cluster ωk: the probability that a document d is in ωk if we have no information about d.
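A bare-bones EM sketch for this Bernoulli mixture follows. It is deliberately minimal: smoothing, stopping criteria, and numerical safeguards are omitted, the tiny binary document vectors are made up, and initialization is fixed (slightly asymmetric) rather than random so the run is reproducible.

```python
import math

# EM for a mixture of multivariate Bernoullis over binary vectors
# (d[m] = 1 iff term t_m occurs in document d).

def em(docs, K=2, iters=20):
    M = len(docs[0])
    alpha = [1.0 / K] * K
    # Break symmetry with a fixed, slightly different start per cluster.
    q = [[0.4 + 0.2 * ((k + m) % 2) for m in range(M)] for k in range(K)]
    for _ in range(iters):
        # E step: responsibilities r[n][k] proportional to alpha_k * P(d_n|omega_k).
        r = []
        for d in docs:
            lik = [alpha[k] * math.prod(q[k][m] if d[m] else 1 - q[k][m]
                                        for m in range(M)) for k in range(K)]
            z = sum(lik)
            r.append([l / z for l in lik])
        # M step: re-estimate alpha_k and q_mk from the responsibilities.
        for k in range(K):
            nk = sum(rn[k] for rn in r)
            alpha[k] = nk / len(docs)
            for m in range(M):
                q[k][m] = sum(rn[k] * d[m] for rn, d in zip(r, docs)) / nk
    return alpha, q, r

docs = [(1, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (0, 1, 1, 1)]
alpha, q, r = em(docs)
assignments = [row.index(max(row)) for row in r]
# Documents 1-2 and documents 3-4 end up in different clusters.
print(assignments[0] == assignments[1], assignments[0] != assignments[2])
```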
***NDCG->
|------------
A final approach that has seen increasing adoption, especially when employed with machine learning approaches to ranking (see Section 15.4, page 341) is measures of cumulative gain, and in particular normalized discounted cumulative gain (NDCG). NDCG is designed for situations of non-binary notions of relevance (cf. Section 8.5.1). Like precision at k, it is evaluated over some number k of top search results. For a set of queries Q, let R(j, d) be the relevance score assessors gave to document d for query j. Then,

NDCG(Q, k) = (1/|Q|) ∑j=1..|Q| Zkj ∑m=1..k (2^R(j,m) − 1) / log2(1 + m),   (8.9)

where Zkj is a normalization factor calculated to make it so that a perfect ranking’s NDCG at k for query j is 1. For queries for which k′ < k documents are retrieved, the last summation is done up to k′.
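Equation (8.9) for a single query can be sketched as follows. The normalizer Zkj is computed as the inverse DCG of the ideal (perfectly sorted) ranking, so a perfect ranking scores exactly 1; the assessor grades in the example are made up.

```python
import math

# NDCG at k for one query. `rels` lists assessor scores R(j, m) in the
# order the system returned the documents.

def dcg_at_k(rels, k):
    return sum((2 ** r - 1) / math.log2(1 + m)
               for m, r in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

rels = [3, 2, 3, 0, 1, 2]  # assessor grades, in system order
print(round(ndcg_at_k(rels, 6), 3))
print(ndcg_at_k([3, 3, 2, 2, 1, 0], 6))  # 1.0 for the ideal ordering
```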
***training set->
|------------
13.1 The text classification problem In text classification, we are given a description d ∈ X of a document, where X is the document space; and a fixed set of classes C = {c1, c2, . . . , cJ}. Classes are also called categories or labels. Typically, the document space X is some type of high-dimensional space, and the classes are human defined for the needs of an application, as in the examples China and documents that talk about multicore computer chips above. We are given a training set D of labeled documents 〈d, c〉, where 〈d, c〉 ∈ X × C. For example:

〈d, c〉 = 〈Beijing joins the World Trade Organization, China〉

for the one-sentence document Beijing joins the World Trade Organization and the class (or label) China.
|------------
Using a learning method or learning algorithm, we then wish to learn a classifier or classification function γ that maps documents to classes:

γ : X → C   (13.1)

This type of learning is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = γ. The learning method Γ takes the training set D as input and returns the learned classification function γ.
|------------
Figure 13.1 shows an example of text classification from the Reuters-RCV1 collection, introduced in Section 4.2, page 69. There are six classes (UK, China, . . . , sports), each with three training documents. We show a few mnemonic words for each document’s content. The training set provides some typical examples for each class, so that we can learn the classification function γ.
|------------
When performing evaluations like the one in Table 13.9, it is important to maintain a strict separation between the training set and the test set. We can easily make correct classification decisions on the test set by using information we have gleaned from the test set, such as the fact that a particular term is a good predictor in the test set (even though this is not the case in the training set). A more subtle example of using knowledge about the test set is to try a large number of values of a parameter (e.g., the number of selected features) and select the value that is best for the test set. As a rule, accuracy on new data – the type of data we will encounter when we use the classifier in an application – will be much lower than accuracy on a test set that the classifier has been tuned for. We discussed the same problem in ad hoc retrieval in Section 8.1 (page 153).
|------------
In a clean statistical text classification experiment, you should never run any program on or even look at the test set while developing a text classification system. Instead, set aside a development set for testing while you develop your method. When such a set serves the primary purpose of finding a good value for a parameter, for example, the number of selected features, then it is also called held-out data. Train the classifier on the rest of the training set with different parameter values, and then select the value that gives best results on the held-out part of the training set. Ideally, at the very end, when all parameters have been set and the method is fully specified, you run one final experiment on the test set and publish the results. Because no informa-

◮ Table 13.10 Data for parameter estimation exercise.
***statistical significance->
|------------
We compute the other Eetec in the same way:

              e_poultry = 1               e_poultry = 0
e_export = 1  N11 = 49     E11 ≈ 6.6      N10 = 27,652   E10 ≈ 27,694.4
e_export = 0  N01 = 141    E01 ≈ 183.4    N00 = 774,106  E00 ≈ 774,063.6

Plugging these values into Equation (13.18), we get an X² value of 284:

X²(D, t, c) = ∑ et∈{0,1} ∑ ec∈{0,1} (Netec − Eetec)² / Eetec ≈ 284

X² is a measure of how much expected counts E and observed counts N deviate from each other. A high value of X² indicates that the hypothesis of independence, which implies that expected and observed counts are similar, is incorrect. In our example, X² ≈ 284 > 10.83. Based on Table 13.6, we can reject the hypothesis that poultry and export are independent with only a 0.001 chance of being wrong.8 Equivalently, we say that the outcome X² ≈ 284 > 10.83 is statistically significant at the 0.001 level. If the two events are dependent, then the occurrence of the term makes the occurrence of the class more likely (or less likely), so it should be helpful as a feature. This is the rationale of χ² feature selection.
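The X² computation above can be re-done in code. The observed counts N come from the worked example; the expected counts E are derived from the marginals under the independence assumption, exactly as in Equation (13.18).

```python
# Chi-square statistic from a 2x2 table of observed counts.
# Rows are e_export in {1, 0}, columns are e_poultry in {1, 0}.

def chi_square(n11, n10, n01, n00):
    n = n11 + n10 + n01 + n00
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    row = {1: n11 + n10, 0: n01 + n00}  # marginals for e_export
    col = {1: n11 + n01, 0: n10 + n00}  # marginals for e_poultry
    x2 = 0.0
    for et in (0, 1):
        for ec in (0, 1):
            expected = row[et] * col[ec] / n  # under independence
            x2 += (observed[(et, ec)] - expected) ** 2 / expected
    return x2

x2 = chi_square(49, 27652, 141, 774106)
print(round(x2))  # 284, significant at the 0.001 level (> 10.83)
```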
***XML DTD->
|------------
We also need the concept of schema in this chapter. A schema puts constraints on the structure of allowable XML documents for a particular application. A schema for Shakespeare’s plays may stipulate that scenes can only occur as children of acts and that only acts and scenes have the number attribute. Two standards for schemas for XML documents are XML DTD (document type definition) and XML Schema. Users can only write structured queries for an XML retrieval system if they have some minimal knowledge about the schema of the collection.
***Heaps’ law->
|------------
Before introducing techniques for compressing the dictionary, we want to estimate the number of distinct terms M in a collection. It is sometimes said that languages have a vocabulary of a certain size. The second edition of the Oxford English Dictionary (OED) defines more than 600,000 words. But the vocabulary of most large collections is much larger than the OED. The OED does not include most names of people, locations, products, or scientific

◮ Figure 5.1 Heaps’ law. Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1, plotted as log10 M against log10 T. For these data, the dashed line log10 M = 0.49 ∗ log10 T + 1.64 is the best least-squares fit. Thus, k = 10^1.64 ≈ 44 and b = 0.49.
|------------
5.1.1 Heaps’ law: Estimating the number of terms A better way of getting a handle on M is Heaps’ law, which estimates vocabulary size as a function of collection size:

M = kT^b   (5.1)

where T is the number of tokens in the collection. Typical values for the parameters k and b are: 30 ≤ k ≤ 100 and b ≈ 0.5. The motivation for Heaps’ law is that the simplest possible relationship between collection size and vocabulary size is linear in log–log space and the assumption of linearity is usually borne out in practice, as shown in Figure 5.1 for Reuters-RCV1. In this case, the fit is excellent for T > 10^5 = 100,000, for the parameter values b = 0.49 and k = 44. For example, for the first 1,000,020 tokens Heaps’ law predicts 38,323 terms: 44 × 1,000,020^0.49 ≈ 38,323.
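The worked number above is easy to reproduce, using the Reuters-RCV1 fit quoted in the text (k = 44, b = 0.49):

```python
# Heaps' law M = k * T^b: predicted vocabulary size for T tokens.

def heaps(T, k=44, b=0.49):
    return k * T ** b

prediction = heaps(1_000_020)
print(round(prediction))  # roughly 38,323 terms, matching the text
```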
***perceptron algorithm->
|------------
The bias-variance tradeoff was introduced by Geman et al. (1992). The derivation in Section 14.6 is for MSE(γ), but the tradeoff applies to many loss functions (cf. Friedman (1997), Domingos (2000)). Schütze et al. (1995) and Lewis et al. (1996) discuss linear classifiers for text and Hastie et al. (2001) linear classifiers in general. Readers interested in the algorithms mentioned, but not described in this chapter may wish to consult Bishop (2006) for neural networks, Hastie et al. (2001) for linear and logistic regression, and Minsky and Papert (1988) for the perceptron algorithm. Anagnostopoulos et al.
***query-by-example->
***text categorization->
|------------
Such more general classes are usually referred to as topics, and the classification task is then called text classification, text categorization, topic classification, or topic spotting. An example for China appears in Figure 13.1. Standing queries and topics differ in their degree of specificity, but the methods for solving routing, filtering, and text classification are essentially the same. We therefore include routing and filtering under the rubric of text classification in this and the following chapters.
***Wikipedia->
|------------
We give collection statistics in Table 10.2 and show part of the schema of the collection in Figure 10.11. The IEEE journal collection was expanded in 2005. Since 2006 INEX uses the much larger English Wikipedia as a test collection. The relevance of documents is judged by human assessors using the methodology introduced in Section 8.1 (page 152), appropriately modified for structured documents as we will discuss shortly.
***enterprise resource planning->
|------------
An effective indexer for enterprise search needs to be able to communicate efficiently with a number of applications that hold text data in corporations, including Microsoft Outlook, IBM’s Lotus software, databases like Oracle and MySQL, content management systems like Open Text, and enterprise resource planning software like SAP.
***top docs->
|------------
(1996). Cluster pruning is investigated by Singitham et al. (2004) and by Chierichetti et al. (2007); see also Section 16.6 (page 372). Champion lists are described in Persin (1994) and (under the name top docs) in Brown (1995), and further developed in Brin and Page (1998) and Long and Suel (2003). While these heuristics are well-suited to free text queries that can be viewed as vectors, they complicate phrase queries; see Anh and Moffat (2006c) for an index structure that supports both weighted and Boolean/phrase searches. Carmel et al. (2001), Clarke et al. (2000), and Song et al. (2005) treat the use of query term proximity in assessing relevance. Pioneering work on learning of ranking functions was done by Fuhr (1989), Fuhr and Pfeifer (1994), Cooper et al.
***prior probability->
|------------
Writing P(Ā) for the complement of an event A, we similarly have:

P(Ā, B) = P(B|Ā)P(Ā)   (11.2)

Probability theory also has a partition rule, which says that if an event B can be divided into an exhaustive set of disjoint subcases, then the probability of B is the sum of the probabilities of the subcases. A special case of this rule gives that:

P(B) = P(A, B) + P(Ā, B)   (11.3)

From these we can derive Bayes’ Rule for inverting conditional probabilities:

P(A|B) = P(B|A)P(A) / P(B) = [ P(B|A) / ∑X∈{A,Ā} P(B|X)P(X) ] P(A)   (11.4)

This equation can also be thought of as a way of updating probabilities. We start off with an initial estimate of how likely the event A is when we do not have any other information; this is the prior probability P(A). Bayes’ rule lets us derive a posterior probability P(A|B) after having seen the evidence B, based on the likelihood of B occurring in the two cases that A does or does not hold.1 Finally, it is often useful to talk about the odds of an event, which provide a kind of multiplier for how probabilities change:

Odds: O(A) = P(A) / P(Ā) = P(A) / (1 − P(A))   (11.5)

11.2 The Probability Ranking Principle
11.2.1 The 1/0 loss case We assume a ranked retrieval setup as in Section 6.3, where there is a collection of documents, the user issues a query, and an ordered list of documents is returned. We also assume a binary notion of relevance as in Chapter 8. For a query q and a document d in the collection, let Rd,q be an indicator random variable that says whether d is relevant with respect to a given query q. That is, it takes on a value of 1 when the document is relevant and 0 otherwise. In context we will often write just R for Rd,q.
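A quick numeric check of the partition rule, Bayes' rule, and the odds transform, with made-up probabilities (A = "document is relevant", B = "document contains the query term"):

```python
p_a = 0.3               # prior P(A)
p_b_given_a = 0.8       # likelihood P(B|A)
p_b_given_not_a = 0.2   # likelihood P(B|not A)

# Partition rule: P(B) = P(B|A)P(A) + P(B|not A)P(not A).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' rule: posterior P(A|B) = P(B|A)P(A) / P(B).
p_a_given_b = p_b_given_a * p_a / p_b

# Odds: O(A) = P(A) / (1 - P(A)).
prior_odds = p_a / (1 - p_a)
posterior_odds = p_a_given_b / (1 - p_a_given_b)

print(round(p_a_given_b, 3))                  # 0.632
print(round(posterior_odds / prior_odds, 1))  # 4.0 = P(B|A)/P(B|not A)
```

Seeing the evidence B multiplies the odds of A by the likelihood ratio, here 0.8/0.2 = 4.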
***Latent Dirichlet Allocation->
|------------
probabilistic latent variable model for dimensionality reduction is the Latent Dirichlet Allocation (LDA) model (Blei et al. 2003), which is generative and assigns probabilities to documents outside of the training set. This model is extended to a hierarchical clustering by Rosen-Zvi et al. (2004). Wei and Croft (2006) present the first large scale evaluation of LDA, finding it to significantly outperform the query likelihood model of Section 12.2 (page 242), but to not perform quite as well as the relevance model mentioned in Section 12.4 (page 250) – but the latter does additional per-query processing unlike LDA.
***text-centric XML->
|------------
10.5 Text-centric vs. data-centric XML retrieval In the type of structured retrieval we cover in this chapter, XML structure serves as a framework within which we match the text of the query with the text of the XML documents. This exemplifies a system that is optimized for text-centric XML. While both text and structure are important, we give higher priority to text. We do this by adapting unstructured retrieval methods to handling additional structural constraints. The premise of our approach is that XML document retrieval is characterized by (i) long text fields (e.g., sections of a document), (ii) inexact matching, and (iii) relevance-ranked results.
***free text->
|------------
1.4 The extended Boolean model versus ranked retrieval
The Boolean retrieval model contrasts with ranked retrieval models such as the vector space model (Section 6.3), in which users largely use free text queries, that is, just typing one or more words rather than using a precise language with operators for building up query expressions, and the system decides which documents best satisfy the query. Despite decades of academic research on the advantages of ranked retrieval, systems implementing the Boolean retrieval model were the main or only search option provided by large commercial information providers for three decades, until the early 1990s (approximately the date of arrival of the World Wide Web). However, these systems did not have just the basic Boolean operations (AND, OR, and NOT) which we have presented so far. A strict Boolean expression over terms with an unordered result set is too limited for many of the information needs that people have, and these systems implemented extended Boolean retrieval models by incorporating additional operators such as term proximity operators. A proximity operator is a way of specifying that two terms in a query must occur close to each other in a document, where closeness may be measured by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.
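A proximity check of the kind described above can be sketched in a few lines. This is a minimal illustration, not an implementation from the book; the function name and the token-list representation are my own assumptions:

```python
def within_proximity(doc_tokens, term1, term2, k):
    """True if some occurrence of term1 and some occurrence of term2
    have at most k intervening words between them."""
    pos1 = [i for i, t in enumerate(doc_tokens) if t == term1]
    pos2 = [i for i, t in enumerate(doc_tokens) if t == term2]
    # abs(i - j) - 1 is the number of words strictly between the two occurrences
    return any(abs(i - j) - 1 <= k for i in pos1 for j in pos2)

tokens = "the pipeline near the river ruptured".split()
```

Here `pipeline` and `ruptured` have three intervening words, so they match with k = 3 but not with k = 2. Real systems evaluate such operators against positional postings lists rather than raw token lists.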
|------------
With these additional ideas, we will have seen most of the basic technology that supports ad hoc searching over unstructured information. Ad hoc searching over documents has recently conquered the world, powering not only web search engines but the kind of unstructured search that lies behind large eCommerce websites. Although the main web search engines differ by emphasizing free text querying, most of the basic issues and technologies of indexing and querying remain the same, as we will see in later chapters.
|------------
Exercise 6.6 For the value of g estimated in Exercise 6.5, compute the weighted zone score for each (query, document) example. How do these scores relate to the relevance judgments in Figure 6.5 (quantized to 0/1)?
Exercise 6.7 Why does the expression for g in (6.6) not involve training examples in which sT(dt, qt) and sB(dt, qt) have the same value?
6.2 Term frequency and weighting
Thus far, scoring has hinged on whether or not a query term is present in a zone within a document. We take the next logical step: a document or zone that mentions a query term more often has more to do with that query and therefore should receive a higher score. To motivate this, we recall the notion of a free text query introduced in Section 1.4: a query in which the terms of the query are typed freeform into the search interface, without any connecting search operators (such as Boolean operators). This query style, which is extremely popular on the web, views the query as simply a set of words. A plausible scoring mechanism is then to compute a score that is the sum, over the query terms, of the match scores between each query term and the document.
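The summed-match-score idea, with raw term frequency as the per-term match score, can be sketched as follows (a toy illustration; the function name and whitespace tokenization are my own simplifications):

```python
def tf_score(query_terms, doc_tokens):
    # Sum, over the query terms, of each term's frequency in the document.
    return sum(doc_tokens.count(t) for t in query_terms)

doc = "a rose is a rose is a rose".split()
```

For the query {rose, is} on this document the score is 3 + 2 = 5. Chapter 6 goes on to replace raw frequency with tf-idf weights.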
***click spam->
|------------
This can take many forms, one of which is known as click spam. There is currently no universally accepted definition of click spam. It refers (as the name suggests) to clicks on sponsored search results that are not from bona fide search users. For instance, a devious advertiser may attempt to exhaust the advertising budget of a competitor by clicking repeatedly (through the use of a robotic click generator) on that competitor's sponsored search advertisements. Search engines face the challenge of discerning which of the clicks they observe are part of a pattern of click spam, to avoid charging their advertiser clients for such clicks.
***shingling->
|------------
The simplest approach to detecting duplicates is to compute, for each web page, a fingerprint that is a succinct (say 64-bit) digest of the characters on that page. Then, whenever the fingerprints of two web pages are equal, we test whether the pages themselves are equal and if so declare one of them to be a duplicate copy of the other. This simplistic approach fails to capture a crucial and widespread phenomenon on the Web: near duplication. In many cases, the contents of one web page are identical to those of another except for a few characters – say, a notation showing the date and time at which the page was last modified. Even in such cases, we want to be able to declare the two pages to be close enough that we only index one copy. Short of exhaustively comparing all pairs of web pages, an infeasible task at the scale of billions of pages, how can we detect and filter out such near duplicates? We now describe a solution to the problem of detecting near-duplicate web pages. The answer lies in a technique known as shingling. Given a positive integer k and a sequence of terms in a document d, define the k-shingles of d to be the set of all consecutive sequences of k terms in d. As an example, consider the following text: a rose is a rose is a rose. The 4-shingles for this text (k = 4 is a typical value used in the detection of near-duplicate web pages) are a rose is a, rose is a rose and is a rose is. The first two of these shingles each occur twice in the text. Intuitively, two documents are near duplicates if the sets of shingles generated from them are nearly the same. We now make this intuition precise, then develop a method for efficiently computing and comparing the sets of shingles for all web pages.
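The shingle set and the set-overlap intuition can be sketched directly (function names are my own; the book's precise resemblance measure uses sketches of hashed shingles rather than exact Jaccard over full sets):

```python
def k_shingles(text, k):
    """The set of all consecutive k-term sequences in the text."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(s1, s2):
    """Overlap between two shingle sets: |intersection| / |union|."""
    return len(s1 & s2) / len(s1 | s2)

rose = k_shingles("a rose is a rose is a rose", 4)
```

This reproduces the worked example: the text has five 4-term windows but only three distinct 4-shingles.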
***highlighting->
|------------
Thus, we may want to remove some elements in a postprocessing step to reduce redundancy. Alternatively, we can collapse several nested elements in the results list and use highlighting of query terms to draw the user's attention to the relevant passages. If query terms are highlighted, then scanning a medium-sized element (e.g., a section) takes little more time than scanning a small subelement (e.g., a paragraph). Thus, if the section and the paragraph both occur in the results list, it is sufficient to show the section. An additional advantage of this approach is that the paragraph is presented together with its context (i.e., the embedding section). This context may be helpful in interpreting the paragraph (e.g., the source of the information reported) even if the paragraph on its own satisfies the query.
***semistructured query->
***effectiveness->
|------------
Our goal is to develop a system to address the ad hoc retrieval task. This is the most standard IR task. In it, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query. An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need. Our example above was rather artificial in that the information need was defined in terms of particular words, whereas usually a user is interested in a topic like "pipeline leaks" and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as pipeline rupture. To assess the effectiveness of an IR system (i.e., the quality of its search results), a user will usually want to know two key statistics about the system's returned results for a query:
Precision: What fraction of the returned results are relevant to the information need?
Recall: What fraction of the relevant documents in the collection were returned by the system?
Detailed discussion of relevance and evaluation measures including precision and recall is found in Chapter 8.
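The two statistics translate directly into code. A minimal sketch, treating the retrieved results and the relevant documents as sets of document IDs (the representation and function names are my own):

```python
def precision(retrieved, relevant):
    # Fraction of the returned results that are relevant.
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    # Fraction of the relevant documents that were returned.
    return len(retrieved & relevant) / len(relevant)
```

For example, if the system returns documents {1, 2, 3, 4} and the relevant set is {2, 4, 5}, precision is 2/4 and recall is 2/3.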
|------------
We will use effectiveness as a generic term for measures that evaluate the quality of classification decisions, including precision, recall, F1, and accuracy. Performance refers to the computational efficiency of classification and IR systems in this book. However, many researchers mean the effectiveness, not the efficiency, of text classification when they use the term performance.
|------------
When we process a collection with several two-class classifiers (such as Reuters-21578 with its 118 classes), we often want to compute a single aggregate measure that combines the measures for individual classifiers. There are two methods for doing this. Macroaveraging computes a simple average over classes. Microaveraging pools per-document decisions across classes, and then computes an effectiveness measure on the pooled contingency table. Table 13.8 gives an example.
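The two averaging methods, illustrated for precision, can be sketched as follows. Each per-class contingency table is represented here as a (TP, FP, FN) triple; the representation and function names are my own:

```python
def macro_precision(tables):
    # Simple average of per-class precision values.
    return sum(tp / (tp + fp) for tp, fp, fn in tables) / len(tables)

def micro_precision(tables):
    # Pool decisions across classes, then compute precision once.
    tp_sum = sum(tp for tp, fp, fn in tables)
    fp_sum = sum(fp for tp, fp, fn in tables)
    return tp_sum / (tp_sum + fp_sum)

tables = [(10, 10, 0), (90, 10, 0)]  # a small class and a large class
```

On this example macroaveraging gives (0.5 + 0.9)/2 = 0.7, while microaveraging gives 100/120 ≈ 0.83: microaveraging is dominated by the large class.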
***R-precision->
|------------
An alternative, which alleviates this problem, is R-precision. It requires having a set of known relevant documents Rel, from which we calculate the precision of the top |Rel| documents returned. (The set Rel may be incomplete, such as when Rel is formed by creating relevance judgments for the pooled top k results of particular systems in a set of experiments.) R-precision adjusts for the size of the set of relevant documents: a perfect system could score 1 on this metric for each query, whereas even a perfect system could only achieve a precision at 20 of 0.4 if there were only 8 documents in the collection relevant to an information need. Averaging this measure across queries thus makes more sense. This measure is harder to explain to naive users than Precision at k but easier to explain than MAP. If there are |Rel| relevant documents for a query, we examine the top |Rel| results of a system, and find that r are relevant, then by definition, not only is the precision (and hence R-precision) r/|Rel|, but the recall of this result set is also r/|Rel|.
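A minimal sketch of the definition (names are mine): cut the ranking at |Rel| and measure precision there, which by the argument above also equals recall at that cutoff.

```python
def r_precision(ranked, relevant):
    """Precision of the top |Rel| results; also the recall at that cutoff."""
    cutoff = ranked[:len(relevant)]
    r = sum(1 for d in cutoff if d in relevant)
    return r / len(relevant)
```

With ranking [1, 2, 3, 4, 5] and Rel = {1, 3, 6}, we examine the top 3 results, find r = 2 relevant, and get R-precision 2/3 — which is simultaneously the precision and the recall of that result set.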
|------------
Thus, R-precision turns out to be identical to the break-even point, another measure which is sometimes used, defined in terms of this equality relationship holding. Like Precision at k, R-precision describes only one point on the precision-recall curve, rather than attempting to summarize effectiveness across the curve, and it is somewhat unclear why you should be interested in the break-even point rather than either the best point on the curve (the point with maximal F-measure) or a retrieval level of interest to a particular application (Precision at k). Nevertheless, R-precision turns out to be highly correlated with MAP empirically, despite measuring only a single point on the curve.
[Figure 8.4: The ROC curve (1 − specificity vs. sensitivity = recall) corresponding to the precision-recall curve in Figure 8.2.]
|------------
Buckley and Voorhees (2000) compare several evaluation measures, including precision at k, MAP, and R-precision, and evaluate the error rate of each measure. R-precision was adopted as the official evaluation metric in the TREC HARD track (Allan 2005). Aslam and Yilmaz (2005) examine its surprisingly close correlation to MAP, which had been noted in earlier studies (Tague-Sutcliffe and Blustein 1995, Buckley and Voorhees 2000). A standard program for evaluating IR systems, which computes many measures of ranked retrieval effectiveness, is Chris Buckley's trec_eval program used in the TREC evaluations. It can be downloaded from: http://trec.nist.gov/trec_eval/.
***overfitting->
|------------
Such an incorrect generalization from an accidental property of the training set is called overfitting. We can view feature selection as a method for replacing a complex classifier (using all features) with a simpler one (using a subset of the features).
|------------
High-variance learning methods are prone to overfitting the training data. The goal in classification is to fit the training data to the extent that we capture true properties of the underlying distribution P(〈d, c〉). In overfitting, the learning method also learns from noise. Overfitting increases MSE and frequently is a problem for high-variance learning methods.
***distributed index->
|------------
4.4 Distributed indexing
Collections are often so large that we cannot perform index construction efficiently on a single machine. This is particularly true of the World Wide Web, for which we need large computer clusters1 to construct any reasonably sized web index. Web search engines therefore use distributed indexing algorithms for index construction. The result of the construction process is a distributed index that is partitioned across several machines – either according to term or according to document. In this section, we describe distributed indexing for a term-partitioned index. Most large search engines prefer a document-partitioned index.
1. A cluster in this chapter is a group of tightly coupled computers that work together closely.
|------------
Tomasic and Garcia-Molina (1993) and Jeong and Omiecinski (1995) are key early papers evaluating term partitioning versus document partitioning for distributed indexes. Document partitioning is found to be superior, at least when the distribution of terms is skewed, as it typically is in practice.
***Bayes’ Rule->
|------------
Writing P(Ā) for the probability of the complement of an event A, we similarly have:

P(Ā, B) = P(B|Ā)P(Ā)   (11.2)

Probability theory also has a partition rule, which says that if an event B can be divided into an exhaustive set of disjoint subcases, then the probability of B is the sum of the probabilities of the subcases. A special case of this rule gives that:

P(B) = P(A, B) + P(Ā, B)   (11.3)

From these we can derive Bayes' Rule for inverting conditional probabilities:

P(A|B) = P(B|A)P(A) / P(B) = [ P(B|A) / ∑X∈{A,Ā} P(B|X)P(X) ] P(A)   (11.4)

This equation can also be thought of as a way of updating probabilities. We start off with an initial estimate of how likely the event A is when we do not have any other information; this is the prior probability P(A). Bayes' rule lets us derive a posterior probability P(A|B) after having seen the evidence B, based on the likelihood of B occurring in the two cases that A does or does not hold.1 Finally, it is often useful to talk about the odds of an event, which provide a kind of multiplier for how probabilities change:

Odds: O(A) = P(A) / P(Ā) = P(A) / (1 − P(A))   (11.5)

11.2 The Probability Ranking Principle
11.2.1 The 1/0 loss case
We assume a ranked retrieval setup as in Section 6.3, where there is a collection of documents, the user issues a query, and an ordered list of documents is returned. We also assume a binary notion of relevance as in Chapter 8. For a query q and a document d in the collection, let Rd,q be an indicator random variable that says whether d is relevant with respect to a given query q. That is, it takes on a value of 1 when the document is relevant and 0 otherwise. In context we will often write just R for Rd,q.
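Equations (11.2)–(11.5) can be checked numerically. A minimal sketch (function names and the example probabilities are mine) that computes the posterior via the partition rule in the denominator of (11.4), and the odds of (11.5):

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|~A)P(~A)]  (Eq. 11.4)."""
    num = p_b_given_a * prior_a
    den = num + p_b_given_not_a * (1 - prior_a)  # P(B) via the partition rule
    return num / den

def odds(p):
    """O(A) = P(A) / (1 - P(A))  (Eq. 11.5)."""
    return p / (1 - p)
```

For instance, with prior P(A) = 0.5, P(B|A) = 0.8 and P(B|Ā) = 0.2, seeing the evidence B updates the probability of A to 0.8.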
***postings list->
|------------
We keep a dictionary of terms (sometimes also referred to as a vocabulary or lexicon; in this book, we use dictionary for the data structure and vocabulary for the set of terms). Then for each term, we have a list that records which documents the term occurs in. Each item in the list – which records that a term appeared in a document (and, later, often, the positions in the document) – is conventionally called a posting.4 The list is then called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings. The dictionary in Figure 1.3 has been sorted alphabetically and each postings list is sorted by document ID. We will see why this is useful in Section 1.3, below, but later we will also consider alternatives to doing this (Section 7.1.5).
|------------
4. In a (non-positional) inverted index, a posting is just a document ID, but it is inherently associated with a term, via the postings list it is placed on; sometimes we will also talk of a (term, docID) pair as a posting.
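The dictionary-plus-postings structure can be sketched in a few lines of code. This is a toy illustration (the `build_index` name, the dict-of-lists representation, and whitespace tokenization are my own simplifications, not the book's algorithm):

```python
def build_index(docs):
    """docs: mapping docID -> text. Returns a dictionary mapping each
    term to its postings list, sorted by docID."""
    index = {}
    for doc_id in sorted(docs):          # visit documents in docID order
        for term in set(docs[doc_id].split()):
            index.setdefault(term, []).append(doc_id)
    return index

idx = build_index({1: "new home sales", 2: "home prices rise"})
```

Because documents are processed in increasing docID order, each postings list comes out sorted by document ID, exactly the invariant the text describes.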
***label->
|------------
13.1 The text classification problem
In text classification, we are given a description d ∈ X of a document, where X is the document space, and a fixed set of classes C = {c1, c2, . . . , cJ}. Classes are also called categories or labels. Typically, the document space X is some type of high-dimensional space, and the classes are human defined for the needs of an application, as in the examples China and documents that talk about multicore computer chips above. We are given a training set D of labeled documents 〈d, c〉, where 〈d, c〉 ∈ X × C. For example:
〈d, c〉 = 〈Beijing joins the World Trade Organization, China〉
for the one-sentence document Beijing joins the World Trade Organization and the class (or label) China.
|------------
Using a learning method or learning algorithm, we then wish to learn a classifier or classification function γ that maps documents to classes:

γ : X → C   (13.1)

This type of learning is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = γ. The learning method Γ takes the training set D as input and returns the learned classification function γ.
***structured retrieval->
|------------
However, many structured data sources containing text are best modeled as structured documents rather than relational data. We call the search over such structured documents structured retrieval. Queries in structured retrieval can be either structured or unstructured, but we will assume in this chapter that the collection consists only of structured documents. Applications of structured retrieval include digital libraries, patent databases, blogs, text in which entities like persons and locations have been tagged (in a process called named entity tagging), and output from office suites like OpenOffice that save documents as marked-up text. In all of these applications, we want to be able to run queries that combine textual criteria with structural criteria.
|------------
We call XML retrieval structured retrieval in this chapter. Some researchers prefer the term semistructured retrieval to distinguish XML retrieval from database querying. We have adopted the terminology that is widespread in the XML retrieval community. For instance, the standard way of referring to XML queries is structured queries, not semistructured queries. The term structured retrieval is rarely used for database querying, and in this book it always refers to XML retrieval.
|------------
There is a second type of information retrieval problem that is intermediate between unstructured retrieval and querying a relational database: parametric and zone search, which we discussed in Section 6.1 (page 110). In the data model of parametric and zone search, there are parametric fields (relational attributes like date or file-size) and zones – text attributes that each take a chunk of unstructured text as value, e.g., author and title in Figure 6.1 (page 111). The data model is flat, that is, there is no nesting of attributes.
***incidence matrix->
|------------
The way to avoid linearly scanning the texts for each query is to index the documents in advance. Let us stick with Shakespeare's Collected Works, and use it to introduce the basics of the Boolean retrieval model. Suppose we record for each document – here a play of Shakespeare's – whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 different words). The result is a binary term-document incidence matrix, as in Figure 1.1. Terms are the indexed units (further discussed in Section 2.2); they are usually words, and for the moment you can think of
[Figure 1.1: term-document incidence matrix, with columns for Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth, . . .]
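A term-document incidence matrix, and the bitwise AND that answers a Boolean conjunctive query over it, can be sketched as follows (a toy version; the function names, dict representation, and whitespace tokenization are my own assumptions):

```python
def incidence_matrix(docs):
    """docs: mapping docID -> text. Returns term -> 0/1 incidence row,
    with one column per document in docID order."""
    terms = sorted({t for text in docs.values() for t in text.split()})
    doc_ids = sorted(docs)
    return {t: [1 if t in docs[d].split() else 0 for d in doc_ids]
            for t in terms}

def boolean_and(matrix, term1, term2):
    # Bitwise AND of the two terms' incidence rows.
    return [a & b for a, b in zip(matrix[term1], matrix[term2])]

m = incidence_matrix({1: "brutus killed caesar", 2: "caesar was ambitious"})
```

Here the query brutus AND caesar yields the row [1, 0]: only document 1 contains both terms.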
|------------
By multiplying Equation (18.9) by its transposed version, we have

C C^T = (U Σ V^T)(V Σ U^T) = U Σ² U^T.   (18.10)

Note now that in Equation (18.10), the left-hand side is a square, symmetric, real-valued matrix, and the right-hand side represents its symmetric diagonal decomposition as in Theorem 18.2. What does the left-hand side C C^T represent? It is a square matrix with a row and a column corresponding to each of the M terms. The entry (i, j) in the matrix is a measure of the overlap between the ith and jth terms, based on their co-occurrence in documents. The precise mathematical meaning depends on the manner in which C is constructed based on term weighting. Consider the case where C is the term-document incidence matrix of page 3, illustrated in Figure 1.1. Then the entry (i, j) in C C^T is the number of documents in which both term i and term j occur.
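The claim about the entries of C C^T is easy to verify numerically. A minimal pure-Python sketch (representation and names are mine), with C as a list of 0/1 term rows over documents:

```python
def cc_transpose(C):
    """Compute C C^T for a terms-by-documents 0/1 incidence matrix C,
    given as a list of term rows."""
    M, N = len(C), len(C[0])
    return [[sum(C[i][d] * C[j][d] for d in range(N)) for j in range(M)]
            for i in range(M)]

# 3 terms x 3 documents: term 0 occurs in docs 0,1; term 1 in docs 0,2; etc.
C = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1]]
CC = cc_transpose(C)
```

Entry (0, 1) is 1 because exactly one document (document 0) contains both term 0 and term 1, and the diagonal entry (i, i) is the document frequency of term i.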
***centroid->
|------------
However, in addition to documents, centroids or averages of vectors also play an important role in vector space classification. Centroids are not length-normalized. For unnormalized vectors, dot product, cosine similarity, and Euclidean distance all have different behavior in general (Exercise 14.6). We will be mostly concerned with small local regions when computing the similarity between a document and a centroid, and the smaller the region, the more similar the behavior of the three measures is.
|------------
Perhaps the best-known way of computing good class boundaries is Rocchio classification, which uses centroids to define the boundaries. The centroid of a class c is computed as the vector average or center of mass of its members:

μ(c) = (1/|Dc|) ∑d∈Dc v(d)   (14.1)

where Dc is the set of documents in D whose class is c: Dc = {d : 〈d, c〉 ∈ D}.
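Equation (14.1) and the resulting classification rule (assign a document to the class of the nearest centroid) can be sketched as follows; the vector-as-list representation and function names are my own simplifications:

```python
def centroid(vectors):
    """Vector average (Eq. 14.1) of a class's member vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def rocchio_classify(doc, centroids):
    """Assign doc to the class whose centroid is nearest (Euclidean)."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    return min(centroids, key=lambda c: dist(doc, centroids[c]))

cents = {"China": centroid([[1.0, 0.0], [0.8, 0.2]]),
         "UK": centroid([[0.0, 1.0], [0.2, 0.8]])}
```

The decision boundary between two classes is then implicitly the set of points equidistant from the two centroids, as the text below describes.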
|------------
Three example centroids are shown as solid circles in Figure 14.3.
|------------
The boundary between two classes in Rocchio classification is the set of points with equal distance from the two centroids. For example, |a1| = |a2|, |b1| = |b2|, and |c1| = |c2| in Figure 14.3.
[Figure 14.3: Rocchio classification, showing the classes China, Kenya, and UK with boundary segments a1, a2, b1, b2, c1, c2.]
|------------
(i) Is it less difficult, equally difficult, or more difficult to cluster this set of 34 points as opposed to the 17 points in Figure 16.4? (ii) Compute purity, NMI, RI, and F5 for the clustering with 34 points. Which measures increase and which stay the same after doubling the number of points? (iii) Given your assessment in (i) and the results in (ii), which measures are best suited to compare the quality of the two clusterings?
16.4 K-means
K-means is the most important flat clustering algorithm. Its objective is to minimize the average squared Euclidean distance (Chapter 6, page 131) of documents from their cluster centers, where a cluster center is defined as the mean or centroid μ of the documents in a cluster ω:

μ(ω) = (1/|ω|) ∑x∈ω x

The definition assumes that documents are represented as length-normalized vectors in a real-valued space in the familiar way. We used centroids for Rocchio classification in Chapter 14 (page 292). They play a similar role here.
|------------
The ideal cluster in K-means is a sphere with the centroid as its center of gravity. Ideally, the clusters should not overlap. Our desiderata for classes in Rocchio classification were the same. The difference is that in clustering we have no labeled training set for which we know which documents should be in the same cluster.
|------------
A measure of how well the centroids represent the members of their clusters is the residual sum of squares or RSS, the squared distance of each vector from its centroid, summed over all vectors:

RSS_k = ∑x∈ωk |x − μ(ωk)|²
RSS = ∑k=1..K RSS_k   (16.7)

RSS is the objective function in K-means and our goal is to minimize it. Since N is fixed, minimizing RSS is equivalent to minimizing the average squared distance, a measure of how well centroids represent their documents.
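The centroid and RSS computations of (16.7) can be sketched directly (a toy version with vectors as lists; function names are mine):

```python
def mean(points):
    """Centroid of a cluster: the componentwise mean of its vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def rss(clusters, centroids):
    """Sum over clusters of squared distances of members to their centroid."""
    total = 0.0
    for points, mu in zip(clusters, centroids):
        for x in points:
            total += sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return total

cluster = [[0.0, 0.0], [2.0, 0.0]]
```

For this two-point cluster the centroid is [1, 0] and the RSS is 2.0; any other candidate center, such as [0, 0] with RSS 4.0, gives a larger value, illustrating that the mean minimizes RSS.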
***regression->
|------------
However, approaching IR result ranking like this is not necessarily the right way to think about the problem. Statisticians normally first divide problems into classification problems (where a categorical variable is predicted) versus regression problems (where a real number is predicted). In between is the specialized field of ordinal regression, where a ranking is predicted. Machine learning for ad hoc retrieval is most properly thought of as an ordinal regression problem, where the goal is to rank a set of documents for a query, given training data of the same sort. This formulation gives some additional power, since documents can be evaluated relative to other candidate documents for the same query, rather than having to be mapped to a global scale of goodness, while also weakening the problem space, since just a ranking is required rather than an absolute measure of relevance. Issues of ranking are especially germane in web search, where the ranking at the very top of the results list is exceedingly important, whereas decisions of relevance of a document to a query may be much less important. Such work can be, and has been, pursued using the structural SVM framework which we mentioned in Section 15.2.2, where the class being predicted is a ranking of results for a query, but here we will present the slightly simpler ranking SVM.
***decision trees->
|------------
(a)                          NB   Rocchio  kNN   SVM
micro-avg-L (90 classes)     80   85       86    89
macro-avg (90 classes)       47   59       60    60

(b)                          NB   Rocchio  kNN   trees  SVM
earn                         96   93       97    98     98
acq                          88   65       92    90     94
money-fx                     57   47       78    66     75
grain                        79   68       82    85     95
crude                        80   70       86    85     89
trade                        64   65       77    73     76
interest                     65   63       74    67     78
ship                         85   49       79    74     86
wheat                        70   69       77    93     92
corn                         65   48       78    92     90
micro-avg (top 10)           82   65       82    88     92
micro-avg-D (118 classes)    75   62       n/a   n/a    87

Rocchio and kNN. In addition, we give numbers for decision trees, an important classification method we do not cover. The bottom part of the table shows that there is considerable variation from class to class. For instance, NB beats kNN on ship, but is much worse on money-fx.
|------------
Exercise 13.16 χ² and mutual information do not distinguish between positively and negatively correlated features. Because most good text classification features are positively correlated (i.e., they occur more often in c than in c̄), one may want to explicitly rule out the selection of negative indicators. How would you do this?
13.7 References and further reading
General introductions to statistical classification and machine learning can be found in (Hastie et al. 2001), (Mitchell 1997), and (Duda et al. 2000), including many important methods (e.g., decision trees and boosting) that we do not cover. A comprehensive review of text classification methods and results is (Sebastiani 2002). Manning and Schütze (1999, Chapter 16) give an accessible introduction to text classification with coverage of decision trees, perceptrons, and maximum entropy models. More information on the superlinear time complexity of learning methods that are more accurate than Naive Bayes can be found in (Perkins et al. 2003) and (Joachims 2006a).
***noise document->
|------------
As is typical in text classification, there are some noise documents in Figure 14.10 (marked with arrows) that do not fit well into the overall distribution of the classes. In Section 13.5 (page 271), we defined a noise feature as a misleading feature that, when included in the document representation, on average increases the classification error. Analogously, a noise document is a document that, when included in the training set, misleads the learning method and increases classification error. Intuitively, the underlying distribution partitions the representation space into areas with mostly homogeneous class assignments.
[Figure 14.10: A linear problem with noise. In this hypothetical web page classification scenario, Chinese-only web pages are solid circles and mixed Chinese-English web pages are squares. The two classes are separated by a linear class boundary (dashed line, short dashes), except for three noise documents (marked with arrows).]
***schema heterogeneity->
|------------
[Figure 10.6: Schema heterogeneity: intervening nodes and mismatched names. Trees for queries q3 and q4 and documents d2 and d3, with element names book, author, creator, firstname, lastname and the values Gates and Bill.]
|------------
In many cases, several different XML schemas occur in a collection, since the XML documents in an IR application often come from more than one source. This phenomenon is called schema heterogeneity or schema diversity and presents yet another challenge. As illustrated in Figure 10.6, comparable elements may have different names: creator in d2 vs. author in d3. In other cases, the structural organization of the schemas may be different: author names are direct descendants of the node author in q3, but there are the intervening nodes firstname and lastname in d3. If we employ strict matching of trees, then q3 will retrieve neither d2 nor d3, although both documents are relevant. Some form of approximate matching of element names, in combination with semi-automatic matching of different document structures, can help here. Human editing of correspondences of elements in different schemas will usually do better than automatic methods.
***ranked retrieval->
|------------
sion – instead of docIDs we can compress smaller gaps between IDs, thus reducing space requirements for the index. However, this structure for the index is not optimal when we build ranked (Chapters 6 and 7) – as opposed to Boolean – retrieval systems. In ranked retrieval, postings are often ordered according to weight or impact, with the highest-weighted postings occurring first. With this organization, scanning of long postings lists during query processing can usually be terminated early when weights have become so small that any further documents can be predicted to be of low similarity to the query (see Chapter 6). In a docID-sorted index, new documents are always inserted at the end of postings lists. In an impact-sorted index (Section 7.1.5, page 140), the insertion can occur anywhere, thus complicating the update of the inverted index.
|------------
This chapter only looks at index compression for Boolean retrieval. For ranked retrieval (Chapter 6), it is advantageous to order postings according to term frequency instead of docID. During query processing, the scanning of many postings lists can then be terminated early because smaller weights do not change the ranking of the highest-ranked k documents found so far. It is not a good idea to precompute and store weights in the index (as opposed to frequencies) because they cannot be compressed as well as integers (see Section 7.1.5, page 140).
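The gap trick mentioned above — storing the differences between consecutive docIDs in a sorted postings list rather than the IDs themselves — can be sketched as follows (function names are mine; a real index would then feed the small gaps to a variable-length code such as γ encoding):

```python
def to_gaps(doc_ids):
    """Sorted docIDs -> first ID followed by successive differences."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Invert to_gaps by taking a running sum."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out
```

Large docIDs collapse into small integers: [283042, 283043, 283044, 283047] becomes [283042, 1, 1, 3], which compresses far better. Note this only works for docID-sorted lists, which is exactly why the impact-sorted organization discussed above complicates compression.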
***dot product->
|------------
6.3.1 Dot products
We denote by V(d) the vector derived from document d, with one component in the vector for each dictionary term. Unless otherwise specified, the reader may assume that the components are computed using the tf-idf weighting scheme, although the particular weighting scheme is immaterial to the discussion that follows. The set of documents in a collection then may be viewed as a set of vectors in a vector space, in which there is one axis for each term.
[Figure 6.10: Cosine similarity illustrated; vectors v(q), v(d1), v(d2), v(d3) on the axes jealous and gossip, with sim(d1, d2) = cos θ.]
|------------
To compensate for the effect of document length, the standard way of quantifying the similarity between two documents d1 and d2 is to compute the cosine similarity of their vector representations V(d1) and V(d2):

sim(d1, d2) = V(d1) · V(d2) / ( |V(d1)| |V(d2)| )   (6.10)

where the numerator represents the dot product (also known as the inner product) of the vectors V(d1) and V(d2), while the denominator is the product of their Euclidean lengths. The dot product x · y of two vectors is defined as ∑i=1..M x_i y_i. Let V(d) denote the document vector for d, with M components V_1(d) . . . V_M(d). The Euclidean length of d is defined to be √( ∑i=1..M V_i²(d) ).
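Equation (6.10) translates directly into code. A minimal sketch with vectors as plain lists (the function name is mine):

```python
def cosine(u, v):
    """Cosine similarity: dot product divided by the product of
    the Euclidean lengths (Eq. 6.10)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)
```

Because of the length normalization in the denominator, a vector and any positive scalar multiple of it have similarity 1, which is exactly the length-compensation property the text motivates.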
***semi-supervised learning->
|------------
Here, the theoretically interesting answer is to try to apply semi-supervised training methods. This includes methods such as bootstrapping or the EM algorithm, which we will introduce in Section 16.5 (page 368). In these methods, the system gets some labeled documents, and a further large supply of unlabeled documents over which it can attempt to learn. One of the big advantages of Naive Bayes is that it can be straightforwardly extended to be a semi-supervised learning algorithm, but for SVMs, there is also semi-supervised learning work which goes under the title of transductive SVMs. See the references for pointers.
***parameterized compression->
***ModApte split->
|------------
For each of these classifiers, we can measure recall, precision, and accuracy. In recent work, people almost invariably use the ModApte split, which includes only documents that were viewed and assessed by a human indexer.

◮ Table 13.7 The ten largest classes in the Reuters-21578 collection with number of documents in training and test sets.
|------------
David D. Lewis defines the ModApte split at www.daviddlewis.com/resources/testcollections/reuters21578/readme based on Apté et al. (1994). Lewis (1995) describes utility measures for the evaluation of text classification systems. Yang and Liu (1999) employ significance tests in the evaluation of text classification methods.
***algorithmic search->
|------------
Several aspects of Goto's model are worth highlighting. First, a user typing the query q into Goto's search interface was actively expressing an interest and intent related to the query q. For instance, a user typing golf clubs is more likely to be imminently purchasing a set than one who is simply browsing news on golf. Second, Goto only got compensated when a user actually expressed interest in an advertisement – as evinced by the user clicking the advertisement. Taken together, these created a powerful mechanism by which to connect advertisers to consumers, quickly raising the annual revenues of Goto/Overture into hundreds of millions of dollars. This style of search engine came to be known variously as sponsored search or search advertising. Given these two kinds of search engines – the "pure" search engines such as Google and Altavista, versus the sponsored search engines – the logical next step was to combine them into a single user experience. Current search engines follow precisely this model: they provide pure search results (generally known as algorithmic search results) as the primary response to a user's search, together with sponsored search results displayed separately and distinctively to the right of the algorithmic results. This is shown in Figure 19.6. Retrieving sponsored search results and ranking them in response to a query has now become considerably more sophisticated than the simple Goto scheme; the process entails a blending of ideas from information retrieval.

◮ Figure 19.6 Search advertising triggered by query keywords. Here the query A320 returns algorithmic search results about the Airbus aircraft, together with advertisements for various non-aircraft goods numbered A320, that advertisers seek to market to those querying on this query. The lack of advertisements for the aircraft reflects the fact that few marketers attempt to sell A320 aircraft on the web.
***soft clustering->
|------------
A second important distinction can be made between hard and soft clustering algorithms. Hard clustering computes a hard assignment – each document is a member of exactly one cluster. The assignment of soft clustering algorithms is soft – a document's assignment is a distribution over all clusters.
|------------
In a soft assignment, a document has fractional membership in several clus- ters. Latent semantic indexing, a form of dimensionality reduction, is a soft clustering algorithm (Chapter 18, page 417).
|------------
A note on terminology. An alternative definition of hard clustering is that a document can be a full member of more than one cluster. Partitional clustering always refers to a clustering where each document belongs to exactly one cluster. (But in a partitional hierarchical clustering (Chapter 17) all members of a cluster are of course also members of its parent.) On the definition of hard clustering that permits multiple membership, the difference between soft clustering and hard clustering is that membership values in hard clustering are either 0 or 1, whereas they can take on any non-negative value in soft clustering.
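The distinction can be made concrete with membership vectors. In this hypothetical sketch (document and cluster names are invented for illustration), a hard assignment puts all of a document's mass on one cluster, while a soft assignment spreads it as a distribution over clusters:

```python
# Hard assignment over three clusters: each document belongs to exactly one.
hard = {
    "d1": [1, 0, 0],
    "d2": [0, 0, 1],
}

# Soft assignment: each document's membership is a distribution over clusters.
soft = {
    "d1": [0.7, 0.2, 0.1],
    "d2": [0.1, 0.1, 0.8],
}

def is_hard(assignment):
    """True if every membership value is 0 or 1 and each row sums to 1,
    i.e. the assignment is hard in the exactly-one-cluster sense."""
    return all(
        set(row) <= {0, 1} and sum(row) == 1
        for row in assignment.values()
    )
```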
***connected component->
|------------
Both single-link and complete-link clustering have graph-theoretic interpretations. Define sk to be the combination similarity of the two clusters merged in step k, and G(sk) the graph that links all data points with a similarity of at least sk. Then the clusters after step k in single-link clustering are the connected components of G(sk), and the clusters after step k in complete-link clustering are maximal cliques of G(sk). A connected component is a maximal set of connected points such that there is a path connecting each pair. A clique is a set of points that are completely linked with each other.
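The single-link interpretation can be sketched directly: threshold the pairwise similarities at sk and read off the connected components of the resulting graph by breadth-first search. This is a minimal illustration assuming a dense similarity matrix; the function name and toy data are our own:

```python
from collections import deque

def connected_components(sim, threshold):
    """Clusters of single-link clustering at combination similarity
    `threshold`: the connected components of the graph G that links all
    points with pairwise similarity >= threshold."""
    n = len(sim)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        comp, queue = [], deque([start])
        seen[start] = True
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in range(n):
                if not seen[v] and sim[u][v] >= threshold:
                    seen[v] = True
                    queue.append(v)
        components.append(sorted(comp))
    return components

# Four points: 0-1 and 2-3 are similar; the two pairs are not.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
```

Lowering the threshold merges components, mirroring how single-link clustering merges clusters as the combination similarity decreases step by step.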
***odds ratio->
|------------
We can manipulate this expression by including the query terms found in the document into the right product, but simultaneously dividing through by them in the left product, so the value is unchanged. Then we have:

    O(R|~q, ~x) = O(R|~q) · ∏_{t: x_t = q_t = 1} [p_t(1 − u_t)] / [u_t(1 − p_t)] · ∏_{t: q_t = 1} (1 − p_t) / (1 − u_t)        (11.16)

The left product is still over query terms found in the document, but the right product is now over all query terms. That means that this right product is a constant for a particular query, just like the odds O(R|~q). So the only quantity that needs to be estimated to rank documents for relevance to a query is the left product. We can equally rank documents by the logarithm of this term, since log is a monotonic function. The resulting quantity used for ranking is called the Retrieval Status Value (RSV) in this model:

    RSV_d = log ∏_{t: x_t = q_t = 1} [p_t(1 − u_t)] / [u_t(1 − p_t)] = ∑_{t: x_t = q_t = 1} log [p_t(1 − u_t)] / [u_t(1 − p_t)]        (11.17)

So everything comes down to computing the RSV. Define c_t:

    c_t = log [p_t(1 − u_t)] / [u_t(1 − p_t)] = log [p_t / (1 − p_t)] + log [(1 − u_t) / u_t]        (11.18)

The c_t terms are log odds ratios for the terms in the query. We have the odds of the term appearing if the document is relevant (p_t/(1 − p_t)) and the odds of the term appearing if the document is nonrelevant (u_t/(1 − u_t)). The odds ratio is the ratio of two such odds, and then we finally take the log of that quantity. The value will be 0 if a term has equal odds of appearing in relevant and nonrelevant documents, and positive if it is more likely to appear in relevant documents. The c_t quantities function as term weights in the model, and the document score for a query is RSV_d = ∑_{x_t = q_t = 1} c_t. Operationally, we sum them in accumulators for query terms appearing in documents, just as for the vector space model calculations discussed in Section 7.1 (page 135).
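The accumulator view of Equations (11.17)–(11.18) is short enough to sketch directly; the per-term estimates p_t and u_t below are hypothetical values, not from the text:

```python
import math

def c_t(p, u):
    """Log odds ratio term weight (Equation 11.18)."""
    return math.log(p * (1 - u) / (u * (1 - p)))

def rsv(query_terms, doc_terms, p, u):
    """Retrieval Status Value (Equation 11.17): sum of c_t over the
    query terms that appear in the document (x_t = q_t = 1)."""
    return sum(c_t(p[t], u[t]) for t in query_terms if t in doc_terms)

# Hypothetical estimates: p_t = P(term present | relevant),
# u_t = P(term present | nonrelevant).
p = {"gold": 0.6, "mining": 0.5}
u = {"gold": 0.2, "mining": 0.5}
```

Note that "mining", with equal odds under relevance and nonrelevance, contributes a weight of exactly 0, as the text predicts.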
***biclustering->
|------------
The applications in Table 16.1 all cluster documents. Other information retrieval applications cluster words (e.g., Crouch 1988), contexts of words (e.g., Schütze and Pedersen 1995) or words and documents simultaneously (e.g., Tishby and Slonim 2000, Dhillon 2001, Zha et al. 2001). Simultaneous clustering of words and documents is an example of co-clustering or biclustering.

16.7 Exercises

? Exercise 16.7 Let Ω be a clustering that exactly reproduces a class structure C and Ω′ a clustering that further subdivides some clusters in Ω. Show that I(Ω; C) = I(Ω′; C).
***speech recognition->
|------------
12.1.2 Types of language models

How do we build probabilities over sequences of terms? We can always use the chain rule from Equation (11.1) to decompose the probability of a sequence of events into the probability of each successive event conditioned on earlier events:

    P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t1 t2) P(t4|t1 t2 t3)        (12.4)

The simplest form of language model simply throws away all conditioning context, and estimates each term independently. Such a model is called a unigram language model:

    P_uni(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)        (12.5)

There are many more complex kinds of language models, such as bigram language models, which condition on the previous term:

    P_bi(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)        (12.6)

and even more complex grammar-based language models such as probabilistic context-free grammars. Such models are vital for tasks like speech recognition, spelling correction, and machine translation, where you need the probability of a term conditioned on surrounding context. However, most language-modeling work in IR has used unigram language models.
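Equation (12.5) reduces to a product of independent term probabilities, which can be estimated by maximum likelihood from raw counts. A minimal sketch on an invented toy token stream (the estimation method and data are our own illustration, not prescribed by the text):

```python
from collections import Counter

def unigram_prob(sequence, counts, total):
    """P_uni (Equation 12.5): the product of per-term probabilities,
    with each P(t) estimated by maximum likelihood as count(t)/total."""
    prob = 1.0
    for t in sequence:
        prob *= counts[t] / total
    return prob

# Toy "collection" of terms (illustrative only).
tokens = ["the", "gold", "the", "mine", "the", "gold"]
counts = Counter(tokens)
total = len(tokens)
```

A bigram model (Equation 12.6) would instead condition each factor on the preceding term, requiring counts of adjacent term pairs.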
|------------
IR is not the place where you most immediately need complex language models, since IR does not directly depend on the structure of sentences to the extent that other tasks like speech recognition do. Unigram models are often sufficient to judge the topic of a text. Moreover, as we shall see, IR language models are frequently estimated from a single document and so it is questionable whether there is enough training data to do more. Losses from data sparseness (see the discussion on page 260) tend to outweigh any gains from richer models. This is an example of the bias-variance tradeoff (cf. Section 14.6, page 308): with limited training data, a more constrained model tends to perform better. In addition, unigram models are more efficient to estimate and apply than higher-order models. Nevertheless, the importance of phrase and proximity queries in IR in general suggests that future work should make use of more sophisticated language models, and some has begun to (see Section 12.5, page 252). Indeed, making this move parallels the model of van Rijsbergen in Chapter 11 (page 231).
***multiclass classification->
***paid inclusion->
|------------
You might argue that this is no different from a company that uses large fonts to list its phone numbers in the yellow pages; but this generally costs the company more and is thus a fairer mechanism. A more apt analogy, perhaps, is the use of company names beginning with a long string of A's to be listed early in a yellow pages category. In fact, the yellow pages' model of companies paying for larger/darker fonts has been replicated in web search: in many search engines, it is possible to pay to have one's web page included in the search engine's index – a model known as paid inclusion. Different search engines have different policies on whether to allow paid inclusion, and whether such a payment has any effect on ranking in search results.
***data-centric XML->
|------------
This type of application of XML is called data-centric because numerical and non-text attribute-value data dominate and text is usually a small fraction of the overall data. Most data-centric XML is stored in databases – in contrast to the inverted index-based methods for text-centric XML that we present in this chapter.
|------------
10.5 Text-centric vs. data-centric XML retrieval

In the type of structured retrieval we cover in this chapter, XML structure serves as a framework within which we match the text of the query with the text of the XML documents. This exemplifies a system that is optimized for text-centric XML. While both text and structure are important, we give higher priority to text. We do this by adapting unstructured retrieval methods to handling additional structural constraints. The premise of our approach is that XML document retrieval is characterized by (i) long text fields (e.g., sections of a document), (ii) inexact matching, and (iii) relevance-ranked results.
|------------
In contrast, data-centric XML mainly encodes numerical and non-text attribute-value data. When querying data-centric XML, we want to impose exact match conditions in most cases. This puts the emphasis on the structural aspects of XML documents and queries. An example is: Find employees whose salary is the same this month as it was 12 months ago.
***normalization->
|------------
✄ 6.4.4 Pivoted normalized document length

In Section 6.3.1 we normalized each document vector by the Euclidean length of the vector, so that all document vectors turned into unit vectors. In doing so, we eliminated all information on the length of the original document; this masks some subtleties about longer documents. First, longer documents will – as a result of containing more terms – have higher tf values. Second, longer documents contain more distinct terms. These factors can conspire to raise the scores of longer documents, which (at least for some information needs) is unnatural. Longer documents can broadly be lumped into two categories: (1) verbose documents that essentially repeat the same content – in these, the length of the document does not alter the relative weights of different terms; (2) documents covering multiple different topics, in which the search terms probably match small segments of the document but not all of it – in this case, the relative weights of terms are quite different from a single short document that matches the query terms. Compensating for this phenomenon is a form of document length normalization that is independent of term and document frequencies. To this end, we introduce a form of normalizing the vector representations of documents in the collection, so that the resulting "normalized" documents are not necessarily of unit length. Then, when we compute the dot product score between a (unit) query vector and such a normalized document, the score is skewed to account for the effect of document length on relevance. This form of compensation for document length is known as pivoted document length normalization.

Consider a document collection together with an ensemble of queries for that collection.
Suppose that we were given, for each query q and for each document d, a Boolean judgment of whether or not d is relevant to the query q; in Chapter 8 we will see how to procure such a set of relevance judgments for a query ensemble and a document collection. Given this set of relevance judgments, we may compute a probability of relevance as a function of document length, averaged over all queries in the ensemble. The resulting plot may look like the curve drawn in thick lines in Figure 6.16. To compute this curve, we bucket documents by length and compute the fraction of relevant documents in each bucket, then plot this fraction against the median document length of each bucket. (Thus even though the "curve" in Figure 6.16 appears to be continuous, it is in fact a histogram of discrete buckets of document length.)

On the other hand, the curve in thin lines shows what might happen with the same documents and query ensemble if we were to use relevance as prescribed by cosine normalization Equation (6.12) – thus, cosine normalization has a tendency to distort the computed relevance vis-à-vis the true relevance, at the expense of longer documents. The thin and thick curves cross over at a point p corresponding to document length ℓp, which we refer to as the pivot.

◮ Figure 6.16 Pivoted document length normalization.
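The excerpt above describes the pivot but not a concrete normalization formula. A common parametric form from the literature – assumed here, not given in this passage – linearly blends the document's own norm with the fixed pivot, so that documents longer than the pivot are penalized less than under pure cosine normalization and shorter ones slightly more; the pivot and slope values below are illustrative:

```python
def pivoted_norm(doc_norm, pivot, slope):
    """Pivoted normalization factor: a linear blend of the document's
    norm with a fixed pivot. slope = 1 recovers plain cosine/length
    normalization; slope < 1 tilts the normalization about the pivot.
    This parametric form is a common choice, assumed for illustration."""
    return (1.0 - slope) * pivot + slope * doc_norm

def pivoted_score(raw_dot, doc_norm, pivot=100.0, slope=0.75):
    """Query-document score: raw dot product divided by the pivoted norm."""
    return raw_dot / pivoted_norm(doc_norm, pivot, slope)
```

At the pivot itself the factor equals the document's own norm, so scores there are unchanged – matching the crossover point p in Figure 6.16.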
***distributed crawling->
***odds->
|------------
Writing P(Ā) for the complement of an event A, we similarly have:

    P(Ā, B) = P(B|Ā) P(Ā)        (11.2)

Probability theory also has a partition rule, which says that if an event B can be divided into an exhaustive set of disjoint subcases, then the probability of B is the sum of the probabilities of the subcases. A special case of this rule gives that:

    P(B) = P(A, B) + P(Ā, B)        (11.3)

From these we can derive Bayes' Rule for inverting conditional probabilities:

    P(A|B) = P(B|A) P(A) / P(B) = [ P(B|A) / ∑_{X ∈ {A, Ā}} P(B|X) P(X) ] P(A)        (11.4)

This equation can also be thought of as a way of updating probabilities. We start off with an initial estimate of how likely the event A is when we do not have any other information; this is the prior probability P(A). Bayes' rule lets us derive a posterior probability P(A|B) after having seen the evidence B, based on the likelihood of B occurring in the two cases that A does or does not hold.1

Finally, it is often useful to talk about the odds of an event, which provide a kind of multiplier for how probabilities change:

    Odds: O(A) = P(A) / P(Ā) = P(A) / (1 − P(A))        (11.5)

11.2 The Probability Ranking Principle

11.2.1 The 1/0 loss case

We assume a ranked retrieval setup as in Section 6.3, where there is a collection of documents, the user issues a query, and an ordered list of documents is returned. We also assume a binary notion of relevance as in Chapter 8. For a query q and a document d in the collection, let Rd,q be an indicator random variable that says whether d is relevant with respect to a given query q. That is, it takes on a value of 1 when the document is relevant and 0 otherwise. In context we will often write just R for Rd,q.
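Equations (11.4) and (11.5) are one-liners in code; a minimal sketch with the partition rule supplying the denominator of Bayes' Rule:

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Bayes' Rule (Equation 11.4): posterior P(A|B) from the prior P(A)
    and the likelihoods of B under A and its complement. The denominator
    P(B) is expanded via the partition rule (Equation 11.3)."""
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / p_b

def odds(p):
    """Odds of an event (Equation 11.5): P(A) / (1 - P(A))."""
    return p / (1 - p)
```

For example, with a prior of 0.5 and likelihoods 0.8 vs. 0.2, the evidence multiplies the odds by 4, moving the posterior to 0.8.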
***classifier->
|------------
✄ 9.1.2 Probabilistic relevance feedback

Rather than reweighting the query in a vector space, if a user has told us some relevant and nonrelevant documents, then we can proceed to build a classifier. One way of doing this is with a Naive Bayes probabilistic model.
***B-tree->
|------------
Search trees overcome many of these issues – for instance, they permit us to enumerate all vocabulary terms beginning with automat. The best-known search tree is the binary tree, in which each internal node has two children. The search for a term begins at the root of the tree. Each internal node (including the root) represents a binary test, based on whose outcome the search proceeds to one of the two sub-trees below that node. Figure 3.1 gives an example of a binary search tree used for a dictionary. Efficient search (with a number of comparisons that is O(log M)) hinges on the tree being balanced: the numbers of terms under the two sub-trees of any node are either equal or differ by one. The principal issue here is that of rebalancing: as terms are inserted into or deleted from the binary search tree, it needs to be rebalanced so that the balance property is maintained.
|------------
To mitigate rebalancing, one approach is to allow the number of sub-trees under an internal node to vary in a fixed interval. A search tree commonly used for a dictionary is the B-tree – a search tree in which every internal node has a number of children in the interval [a, b], where a and b are appropriate positive integers; Figure 3.2 shows an example with a = 2 and b = 4. Each branch under an internal node again represents a test for a range of characters.

1. So-called perfect hash functions are designed to preclude collisions, but are rather more complicated both to implement and to compute.
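The prefix enumeration that search trees support is a range query over the sorted vocabulary. As a minimal stand-in for a balanced tree, the same O(log M) range lookup can be sketched with two binary searches over a sorted term list (the sentinel-string trick and toy vocabulary are our own illustration):

```python
import bisect

def terms_with_prefix(sorted_vocab, prefix):
    """Enumerate all vocabulary terms beginning with `prefix` using two
    binary searches over a sorted term list -- the same range query a
    balanced search tree over the dictionary supports in O(log M)."""
    lo = bisect.bisect_left(sorted_vocab, prefix)
    # "\uffff" acts as a sentinel greater than any character that can
    # follow the prefix, closing the range.
    hi = bisect.bisect_right(sorted_vocab, prefix + "\uffff")
    return sorted_vocab[lo:hi]

vocab = sorted(["automat", "automata", "automatic", "automation",
                "auto", "boat"])
```

Unlike a tree, a sorted array makes insertions costly – which is exactly the rebalancing-vs-update tradeoff the text discusses.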
***γ encoding->
|------------
A method that is within a factor of optimal is γ encoding. γ codes implement variable-length encoding by splitting the representation of a gap G into a pair of length and offset. Offset is G in binary, but with the leading 1 removed.2 For example, for 13 (binary 1101) offset is 101. Length encodes the length of offset in unary code. For 13, the length of offset is 3 bits, which is 1110 in unary. The γ code of 13 is therefore 1110101, the concatenation of length 1110 and offset 101. The right hand column of Table 5.5 gives additional examples of γ codes.
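The encoding and the decoding procedure described here (read the unary length up to its terminating 0, then read that many offset bits and prepend the chopped-off 1) can be sketched directly on bit strings:

```python
def gamma_encode(g):
    """γ code of a positive gap G: unary code of the offset length,
    followed by the offset (G in binary with the leading 1 removed)."""
    binary = bin(g)[2:]                 # e.g. 13 -> "1101"
    offset = binary[1:]                 # drop the leading 1 -> "101"
    length = "1" * len(offset) + "0"    # unary code of len(offset) -> "1110"
    return length + offset

def gamma_decode(bits):
    """Read the unary code up to the terminating 0 to learn the offset
    length, read that many bits, and prepend the chopped-off leading 1."""
    n = bits.index("0")                 # number of leading 1s = offset length
    offset = bits[n + 1 : n + 1 + n]
    return int("1" + offset, 2)
```

Note the edge case G = 1: its binary form is "1", the offset is empty, and the γ code is the single bit "0".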
***partition rule->
|------------
Writing P(Ā) for the complement of an event A, we similarly have:

    P(Ā, B) = P(B|Ā) P(Ā)        (11.2)

Probability theory also has a partition rule, which says that if an event B can be divided into an exhaustive set of disjoint subcases, then the probability of B is the sum of the probabilities of the subcases. A special case of this rule gives that:

    P(B) = P(A, B) + P(Ā, B)        (11.3)

From these we can derive Bayes' Rule for inverting conditional probabilities:

    P(A|B) = P(B|A) P(A) / P(B) = [ P(B|A) / ∑_{X ∈ {A, Ā}} P(B|X) P(X) ] P(A)        (11.4)

This equation can also be thought of as a way of updating probabilities. We start off with an initial estimate of how likely the event A is when we do not have any other information; this is the prior probability P(A). Bayes' rule lets us derive a posterior probability P(A|B) after having seen the evidence B, based on the likelihood of B occurring in the two cases that A does or does not hold.1

Finally, it is often useful to talk about the odds of an event, which provide a kind of multiplier for how probabilities change:

    Odds: O(A) = P(A) / P(Ā) = P(A) / (1 − P(A))        (11.5)

11.2 The Probability Ranking Principle

11.2.1 The 1/0 loss case

We assume a ranked retrieval setup as in Section 6.3, where there is a collection of documents, the user issues a query, and an ordered list of documents is returned. We also assume a binary notion of relevance as in Chapter 8. For a query q and a document d in the collection, let Rd,q be an indicator random variable that says whether d is relevant with respect to a given query q. That is, it takes on a value of 1 when the document is relevant and 0 otherwise. In context we will often write just R for Rd,q.
***model->
|------------
1.4 The extended Boolean model versus ranked retrieval

The Boolean retrieval model contrasts with ranked retrieval models such as the vector space model (Section 6.3), in which users largely use free text queries, that is, just typing one or more words rather than using a precise language with operators for building up query expressions, and the system decides which documents best satisfy the query. Despite decades of academic research on the advantages of ranked retrieval, systems implementing the Boolean retrieval model were the main or only search option provided by large commercial information providers for three decades until the early 1990s (approximately the date of arrival of the World Wide Web). However, these systems did not have just the basic Boolean operations (AND, OR, and NOT) which we have presented so far. A strict Boolean expression over terms with an unordered results set is too limited for many of the information needs that people have, and these systems implemented extended Boolean retrieval models by incorporating additional operators such as term proximity operators. A proximity operator is a way of specifying that two terms in a query must occur close to each other in a document, where closeness may be measured by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.
***ROC curve->
|------------
Thus, R-precision turns out to be identical to the break-even point, another measure which is sometimes used, defined in terms of this equality relationship holding. Like Precision at k, R-precision describes only one point on the precision-recall curve, rather than attempting to summarize effectiveness across the curve, and it is somewhat unclear why you should be interested in the break-even point rather than either the best point on the curve (the point with maximal F-measure) or a retrieval level of interest to a particular application (Precision at k). Nevertheless, R-precision turns out to be highly correlated with MAP empirically, despite measuring only a single point on the curve.

◮ Figure 8.4 The ROC curve corresponding to the precision-recall curve in Figure 8.2 (sensitivity (= recall) plotted against 1 − specificity).
|------------
Another concept sometimes used in evaluation is an ROC curve. ("ROC" stands for "Receiver Operating Characteristics", but knowing that doesn't help most people.) An ROC curve plots the true positive rate or sensitivity against the false positive rate or (1 − specificity). Here, sensitivity is just another term for recall. The false positive rate is given by fp/(fp + tn). Figure 8.4 shows the ROC curve corresponding to the precision-recall curve in Figure 8.2. An ROC curve always goes from the bottom left to the top right of the graph. For a good system, the graph climbs steeply on the left side. For unranked result sets, specificity, given by tn/(fp + tn), was not seen as a very useful notion. Because the set of true negatives is always so large, its value would be almost 1 for all information needs (and, correspondingly, the value of the false positive rate would be almost 0). That is, the "interesting" part of Figure 8.2 is 0 < recall < 0.4, a part which is compressed to a small corner of Figure 8.4. But an ROC curve could make sense when looking over the full retrieval spectrum, and it provides another way of looking at the data.
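For a ranked result list with binary relevance judgments, the ROC curve can be traced by sweeping the rank cutoff and recording (fp/(fp + tn), tp/(tp + fn)) at each step; a minimal sketch:

```python
def roc_points(ranked_relevance):
    """(false positive rate, true positive rate) after each rank cutoff,
    for a ranked list of binary relevance judgments (True = relevant).
    TPR = tp / (tp + fn) is recall; FPR = fp / (fp + tn) = 1 - specificity."""
    total_pos = sum(ranked_relevance)
    total_neg = len(ranked_relevance) - total_pos
    tp = fp = 0
    points = [(0.0, 0.0)]                 # curve starts at the bottom left
    for rel in ranked_relevance:
        if rel:
            tp += 1
        else:
            fp += 1
        points.append((fp / total_neg, tp / total_pos))
    return points                         # and ends at the top right (1, 1)
```

Relevant documents move the curve up; nonrelevant ones move it right, so a good system climbs steeply on the left.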
|------------
In many fields, a common aggregate measure is to report the area under the ROC curve, which is the ROC analog of MAP. Precision-recall curves are sometimes loosely referred to as ROC curves. This is understandable, but not accurate.
***SMART->
|------------
nism introduced in and popularized by Salton’s SMART system around 1970.
***eigenvalue->
|------------
For a square M × M matrix C and a vector ~x that is not all zeros, the values of λ satisfying

    C~x = λ~x        (18.1)

are called the eigenvalues of C. The N-vector ~x satisfying Equation (18.1) for an eigenvalue λ is the corresponding right eigenvector. The eigenvector corresponding to the eigenvalue of largest magnitude is called the principal eigenvector. In a similar fashion, the left eigenvectors of C are the M-vectors ~y such that

    ~y^T C = λ~y^T        (18.2)

The number of non-zero eigenvalues of C is at most rank(C).
|------------
The eigenvalues of a matrix are found by solving the characteristic equation, which is obtained by rewriting Equation (18.1) in the form (C− λIM)~x = 0.
|------------
The eigenvalues of C are then the solutions of |(C − λIM)| = 0, where |S| denotes the determinant of a square matrix S. The equation |(C− λIM)| = 0 is an Mth order polynomial equation in λ and can have at most M roots, which are the eigenvalues of C. These eigenvalues can in general be complex, even if all entries of C are real.
|------------
We now examine some further properties of eigenvalues and eigenvectors, to set up the central idea of singular value decompositions in Section 18.2 be- low. First, we look at the relationship between matrix-vector multiplication and eigenvalues.
|------------
Clearly the matrix has rank 3, and has 3 non-zero eigenvalues λ1 = 30, λ2 = 20 and λ3 = 1, with the three corresponding eigenvectors

    ~x1 = (1, 0, 0)^T,  ~x2 = (0, 1, 0)^T  and  ~x3 = (0, 0, 1)^T.
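The principal eigenvector of the example above can be recovered numerically by power iteration – repeated matrix-vector multiplication with normalization, under which the component along the largest-magnitude eigenvalue dominates. This pure-Python sketch is our own illustration (the book does not prescribe this method here):

```python
def mat_vec(C, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(c * xi for c, xi in zip(row, x)) for row in C]

def power_iteration(C, iters=100):
    """Estimate the principal eigenvalue and eigenvector of C by
    repeatedly applying C and rescaling."""
    x = [1.0] * len(C)
    for _ in range(iters):
        y = mat_vec(C, x)
        norm = max(abs(v) for v in y)
        x = [v / norm for v in y]
    # Rayleigh-quotient estimate of the eigenvalue.
    y = mat_vec(C, x)
    eigval = sum(a * b for a, b in zip(y, x)) / sum(a * a for a in x)
    return eigval, x

# The diagonal example from the text: eigenvalues 30, 20, 1.
C = [[30.0, 0.0, 0.0],
     [0.0, 20.0, 0.0],
     [0.0, 0.0, 1.0]]
```

The iterate converges to ~x1 = (1, 0, 0)^T with eigenvalue 30, the eigenvalue of largest magnitude.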
***algorithm->
|------------
With main memory insufficient, we need to use an external sorting algorithm, that is, one that uses disk. For acceptable speed, the central requirement of such an algorithm is that it minimize the number of random disk seeks during sorting – sequential disk reads are far faster than seeks, as we explained in Section 4.1. One solution is the blocked sort-based indexing algorithm or BSBI in Figure 4.2. BSBI (i) segments the collection into parts of equal size, (ii) sorts the termID–docID pairs of each part in memory, (iii) stores intermediate sorted results on disk, and (iv) merges all intermediate results into the final index.

    BSBINDEXCONSTRUCTION()
    1  n ← 0
    2  while (all documents have not been processed)
    3  do n ← n + 1
    4     block ← PARSENEXTBLOCK()
    5     BSBI-INVERT(block)
    6     WRITEBLOCKTODISK(block, fn)
    7  MERGEBLOCKS(f1, . . . , fn; fmerged)

◮ Figure 4.2 Blocked sort-based indexing. The algorithm stores inverted blocks in files f1, . . . , fn and the merged index in fmerged.
|------------
The algorithm parses documents into termID–docID pairs and accumulates the pairs in memory until a block of a fixed size is full (PARSENEXTBLOCK in Figure 4.2). We choose the block size to fit comfortably into memory to permit a fast in-memory sort. The block is then inverted and written to disk.
|------------
In the final step, the algorithm simultaneously merges the ten blocks into one large merged index. An example with two blocks is shown in Figure 4.3, where we use di to denote the ith document of the collection. To do the merging, we open all block files simultaneously, and maintain small read buffers for the ten blocks we are reading and a write buffer for the final merged index we are writing. In each iteration, we select the lowest termID that has not been processed yet using a priority queue or a similar data structure. All postings lists for this termID are read and merged, and the merged list is written back to disk. Each read buffer is refilled from its file when necessary.
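The BSBI-INVERT and MERGEBLOCKS steps can be sketched in memory, with Python lists standing in for the on-disk block files and `heapq.merge` playing the role of the priority queue that selects the lowest unprocessed termID (the data and helper names are illustrative):

```python
import heapq
from collections import defaultdict

def invert_block(pairs):
    """BSBI-INVERT sketch: turn a block of (termID, docID) pairs into a
    sorted list of (termID, postings list) entries with sorted postings."""
    postings = defaultdict(list)
    for term_id, doc_id in sorted(pairs):
        if not postings[term_id] or postings[term_id][-1] != doc_id:
            postings[term_id].append(doc_id)
    return sorted(postings.items())

def merge_blocks(blocks):
    """MERGEBLOCKS sketch: merge sorted inverted blocks by repeatedly
    taking the lowest termID, concatenating its postings lists."""
    merged = defaultdict(list)
    for term_id, plist in heapq.merge(*blocks):
        merged[term_id].extend(plist)
    return sorted((t, sorted(set(p))) for t, p in merged.items())

block1 = invert_block([(2, 1), (1, 1), (2, 3)])
block2 = invert_block([(1, 2), (3, 2)])
```

On disk, the same merge streams through small read buffers sequentially, which is what keeps the number of random seeks low.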
***bias-variance tradeoff->
|------------
IR is not the place where you most immediately need complex language models, since IR does not directly depend on the structure of sentences to the extent that other tasks like speech recognition do. Unigram models are often sufficient to judge the topic of a text. Moreover, as we shall see, IR language models are frequently estimated from a single document and so it is questionable whether there is enough training data to do more. Losses from data sparseness (see the discussion on page 260) tend to outweigh any gains from richer models. This is an example of the bias-variance tradeoff (cf. Section 14.6, page 308): with limited training data, a more constrained model tends to perform better. In addition, unigram models are more efficient to estimate and apply than higher-order models. Nevertheless, the importance of phrase and proximity queries in IR in general suggests that future work should make use of more sophisticated language models, and some has begun to (see Section 12.5, page 252). Indeed, making this move parallels the model of van Rijsbergen in Chapter 11 (page 231).
|------------
The decision for one learning method vs. another is then not simply a matter of selecting the one that reliably produces good classifiers across training sets (small variance) or the one that can learn classification problems with very difficult decision boundaries (small bias). Instead, we have to weigh the respective merits of bias and variance in our application and choose accordingly. This tradeoff is called the bias-variance tradeoff.

Figure 14.10 provides an illustration, which is somewhat contrived, but will be useful as an example for the tradeoff. Some Chinese text contains English words written in the Roman alphabet like CPU, ONLINE, and GPS.
|------------
Maximizing the margin seems good because points near the decision surface represent very uncertain classification decisions: there is almost a 50% chance of the classifier deciding either way. A classifier with a large margin makes no low certainty classification decisions. This gives you a classification safety margin: a slight error in measurement or a slight document variation will not cause a misclassification. Another intuition motivating SVMs is shown in Figure 15.2. By construction, an SVM classifier insists on a large margin around the decision boundary. Compared to a decision hyperplane, if you have to place a fat separator between classes, you have fewer choices of where it can be put. As a result of this, the memory capacity of the model has been decreased, and hence we expect that its ability to correctly generalize to test data is increased (cf. the discussion of the bias-variance tradeoff in Chapter 14, page 312).
***outlier->
|------------
This is a particular problem if a document set contains many outliers, documents that are far from any other documents and therefore do not fit well into any cluster. Frequently, if an outlier is chosen as an initial seed, then no other vector is assigned to it during subsequent iterations. Thus, we end up with a singleton cluster (a cluster with only one document) even though there is probably a clustering with lower RSS. Figure 16.7 shows an example of a suboptimal clustering resulting from a bad choice of initial seeds.
***trec_eval->
|------------
Buckley and Voorhees (2000) compare several evaluation measures, including precision at k, MAP, and R-precision, and evaluate the error rate of each measure. R-precision was adopted as the official evaluation metric in the TREC HARD track (Allan 2005). Aslam and Yilmaz (2005) examine its surprisingly close correlation to MAP, which had been noted in earlier studies (Tague-Sutcliffe and Blustein 1995, Buckley and Voorhees 2000). A standard program for evaluating IR systems which computes many measures of ranked retrieval effectiveness is Chris Buckley’s trec_eval program used in the TREC evaluations. It can be downloaded from: http://trec.nist.gov/trec_eval/.
***distributed->
|------------
These and other results have shown that the average effectiveness of NB is uncompetitive with classifiers like SVMs when trained and tested on independent and identically distributed (i.i.d.) data, that is, uniform data with all the good properties of statistical sampling. However, these differences may often be invisible or even reverse themselves when working in the real world where, usually, the training sample is drawn from a subset of the data to which the classifier will be applied, the nature of the data drifts over time rather than being stationary (the problem of concept drift we mentioned on page 269), and there may well be errors in the data (among other problems).
***classification->
|------------
13 Text classification and Naive Bayes

Thus far, this book has mainly discussed the process of ad hoc retrieval, where users have transient information needs that they try to address by posing one or more queries to a search engine. However, many users have ongoing information needs. For example, you might need to track developments in multicore computer chips. One way of doing this is to issue the query multicore AND computer AND chip against an index of recent newswire articles each morning. In this and the following two chapters we examine the question: How can this repetitive task be automated? To this end, many systems support standing queries. A standing query is like any other query except that it is periodically executed on a collection to which new documents are incrementally added over time.
|------------
To capture the generality and scope of the problem space to which standing queries belong, we now introduce the general notion of a classification problem. Given a set of classes, we seek to determine which class(es) a given object belongs to. In the example, the standing query serves to divide new newswire articles into the two classes: documents about multicore computer chips and documents not about multicore computer chips. We refer to this as two-class classification. Classification using standing queries is also called routing or filtering and will be discussed further in Section 15.3.1 (page 335). A class need not be as narrowly focused as the standing query multicore computer chips. Often, a class is a more general subject area like China or coffee.
|------------
Such more general classes are usually referred to as topics, and the classification task is then called text classification, text categorization, topic classification, or topic spotting. An example for China appears in Figure 13.1. Standing queries and topics differ in their degree of specificity, but the methods for solving routing, filtering, and text classification are essentially the same. We therefore include routing and filtering under the rubric of text classification in this and the following chapters.
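A standing query amounts to a trivial two-class classifier. The sketch below makes this concrete; the function name, class labels, and sample documents are our own illustration, not from the text.

```python
# A standing query viewed as a rule-based two-class classifier (toy sketch).

def standing_query(doc: str) -> str:
    """Classify one newswire article under the query multicore AND computer AND chip."""
    tokens = set(doc.lower().split())
    if {"multicore", "computer", "chip"} <= tokens:   # all three terms present
        return "about multicore computer chips"
    return "not about multicore computer chips"

docs = [
    "new multicore computer chip announced today",
    "coffee prices rise in brazil",
]
labels = [standing_query(d) for d in docs]
```

Running the standing query each morning over newly added documents is then just re-applying this function to the incremental additions.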
|------------
Thus, the problem of making a binary relevant/nonrelevant judgment given training examples as above turns into one of learning the dashed line in Figure 15.7 separating relevant training examples from the nonrelevant ones. Being in the α-ω plane, this line can be written as a linear equation involving α and ω, with two parameters (slope and intercept). The methods of linear classification that we have already looked at in Chapters 13–15 provide methods for choosing this line. Provided we can build a sufficiently rich collection of training samples, we can thus altogether avoid hand-tuning score functions as in Section 7.2.3 (page 145). The bottleneck of course is the ability to maintain a suitably representative set of training examples, whose relevance assessments must be made by experts.
|------------
However, approaching IR result ranking like this is not necessarily the right way to think about the problem. Statisticians normally first divide problems into classification problems (where a categorical variable is predicted) versus regression problems (where a real number is predicted). In between is the specialized field of ordinal regression where a ranking is predicted. Machine learning for ad hoc retrieval is most properly thought of as an ordinal regression problem, where the goal is to rank a set of documents for a query, given training data of the same sort. This formulation gives some additional power, since documents can be evaluated relative to other candidate documents for the same query, rather than having to be mapped to a global scale of goodness, while also weakening the problem space, since just a ranking is required rather than an absolute measure of relevance. Issues of ranking are especially germane in web search, where the ranking at the very top of the results list is exceedingly important, whereas decisions of relevance of a document to a query may be much less important. Such work can and has been pursued using the structural SVM framework which we mentioned in Section 15.2.2, where the class being predicted is a ranking of results for a query, but here we will present the slightly simpler ranking SVM.
***champion lists->
|------------
7.2.1 Tiered indexes

We mentioned in Section 7.1.2 that when using heuristics such as index elimination for inexact top-K retrieval, we may occasionally find ourselves with a set A of contenders that has fewer than K documents. A common solution to this issue is the use of tiered indexes, which may be viewed as a generalization of champion lists. We illustrate this idea in Figure 7.4, where we represent the documents and terms of Figure 6.9. In this example we set a tf threshold of 20 for tier 1 and 10 for tier 2, meaning that the tier 1 index only has postings entries with tf values exceeding 20, while the tier 2 index only has postings entries with tf values exceeding 10.

◮ Figure 7.4 Tiered indexes. If we fail to get K results from tier 1, query processing “falls back” to tier 2, and so on. Within each tier, postings are ordered by document ID.
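The fall-back logic can be sketched in a few lines. The tf thresholds (20 and 10) follow the example above; the postings data and function names are invented for illustration.

```python
# Two-tier index with fall-back to lower tiers when fewer than K candidates
# are found (toy sketch; postings below are invented).

def build_tiers(postings, t1=20, t2=10):
    """postings: term -> {docID: tf}. Returns term -> [tier1, tier2, tier3] docID lists."""
    tiers = {}
    for term, plist in postings.items():
        tier1 = sorted(d for d, tf in plist.items() if tf > t1)
        tier2 = sorted(d for d, tf in plist.items() if t2 < tf <= t1)
        tier3 = sorted(d for d, tf in plist.items() if tf <= t2)
        tiers[term] = [tier1, tier2, tier3]
    return tiers

def top_k_candidates(tiers, terms, k):
    """Collect candidates tier by tier, falling back until at least k docs are found."""
    seen = []
    for level in range(3):
        for t in terms:
            for d in tiers.get(t, [[], [], []])[level]:
                if d not in seen:
                    seen.append(d)
        if len(seen) >= k:
            break
    return seen[:k]

postings = {"auto": {1: 25, 2: 12, 3: 4}, "car": {2: 30, 4: 8}}
tiers = build_tiers(postings)
```

With k = 2 the tier 1 postings alone suffice; with k = 4 processing falls back through tiers 2 and 3.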
***positional index->
|------------
to, 993427: 〈 1, 6: 〈7, 18, 33, 72, 86, 231〉; 2, 5: 〈1, 17, 74, 222, 255〉; 4, 5: 〈8, 16, 190, 429, 433〉; 5, 2: 〈363, 367〉; 7, 3: 〈13, 23, 191〉; . . . 〉
be, 178239: 〈 1, 2: 〈17, 25〉; 4, 5: 〈17, 191, 291, 430, 434〉; 5, 3: 〈14, 19, 101〉; . . . 〉

◮ Figure 2.11 Positional index example. The word to has a document frequency 993,427, and occurs 6 times in document 1 at positions 7, 18, 33, etc.
|------------
2.4.2 Positional indexes

For the reasons given, a biword index is not the standard solution. Rather, a positional index is most commonly employed. Here, for each term in the vocabulary, we store postings of the form docID: 〈position1, position2, . . . 〉, as shown in Figure 2.11, where each position is a token index in the document. Each posting will also usually record the term frequency, for reasons discussed in Chapter 6.
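A minimal positional index, and the way its position lists support a two-word phrase query, can be sketched as follows (the documents and all names are our own toy example):

```python
# Positional index: term -> {docID: [positions]}, plus a two-word phrase check.
from collections import defaultdict

def index_docs(docs):
    """docs: docID -> text. Positions are token indexes within each document."""
    idx = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            idx[token][doc_id].append(pos)
    return idx

def phrase_match(idx, w1, w2):
    """docIDs in which w2 occurs at the position immediately after w1."""
    hits = []
    for d, pos1 in idx.get(w1, {}).items():
        pos2 = idx.get(w2, {}).get(d, [])
        if any(p + 1 in pos2 for p in pos1):
            hits.append(d)
    return sorted(hits)

docs = {1: "to be or not to be", 2: "to do or to die"}
idx = index_docs(docs)
```

The phrase query "to be" matches document 1 only, because only there does a position of be immediately follow a position of to.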
***relevance->
|------------
Our goal is to develop a system to address the ad hoc retrieval task. This is the most standard IR task. In it, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query. An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need. Our example above was rather artificial in that the information need was defined in terms of particular words, whereas usually a user is interested in a topic like “pipeline leaks” and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as pipeline rupture. To assess the effectiveness of an IR system (i.e., the quality of its search results), a user will usually want to know two key statistics about the system’s returned results for a query:

Precision: What fraction of the returned results are relevant to the information need?

Recall: What fraction of the relevant documents in the collection were returned by the system?

Detailed discussion of relevance and evaluation measures including precision and recall is found in Chapter 8.
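The two statistics are straightforward set ratios; a small sketch (the document sets are invented for illustration):

```python
# Precision and recall as set ratios over returned and relevant documents.

def precision(returned, relevant):
    """Fraction of returned results that are relevant."""
    return len(returned & relevant) / len(returned)

def recall(returned, relevant):
    """Fraction of relevant documents that were returned."""
    return len(returned & relevant) / len(relevant)

relevant = {1, 2, 3, 4}          # relevant docs in the collection
returned = {2, 3, 5, 6, 7, 8}    # docs the system returned

p = precision(returned, relevant)   # 2 of 6 returned are relevant
r = recall(returned, relevant)      # 2 of 4 relevant were returned
```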
|------------
8.1 Information retrieval system evaluation

To measure ad hoc information retrieval effectiveness in the standard way, we need a test collection consisting of three things:
1. A document collection
2. A test suite of information needs, expressible as queries
3. A set of relevance judgments, standardly a binary assessment of either relevant or nonrelevant for each query-document pair.
|------------
The standard approach to information retrieval system evaluation revolves around the notion of relevant and nonrelevant documents. With respect to a user information need, a document in the test collection is given a binary classification as either relevant or nonrelevant. This decision is referred to as the gold standard or ground truth judgment of relevance. The test document collection and suite of information needs have to be of a reasonable size: you need to average performance over fairly large test sets, as results are highly variable over different documents and information needs. As a rule of thumb, 50 information needs has usually been found to be a sufficient minimum.
|------------
Relevance is assessed relative to an information need, not a query. For example, an information need might be: Information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
|------------
But, nevertheless, an information need is present. If a user types python into a web search engine, they might be wanting to know where they can purchase a pet python. Or they might be wanting information on the programming language Python. From a one word query, it is very difficult for a system to know what the information need is. But, nevertheless, the user has one, and can judge the returned results on the basis of their relevance to it. To evaluate a system, we require an overt expression of an information need, which can be used for judging returned documents as relevant or nonrelevant. At this point, we make a simplification: relevance can reasonably be thought of as a scale, with some documents highly relevant and others marginally so. But for the moment, we will use just a binary decision of relevance. We discuss the reasons for using binary relevance judgments and alternatives in Section 8.5.1.
***hard clustering->
|------------
A second important distinction can be made between hard and soft clustering algorithms. Hard clustering computes a hard assignment – each document is a member of exactly one cluster. The assignment of soft clustering algorithms is soft – a document’s assignment is a distribution over all clusters.
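The two kinds of assignment can be contrasted in a couple of lines; the distribution values and function name below are invented for illustration.

```python
# Hard vs. soft cluster assignment for one document over K = 3 clusters.

soft_assignment = [0.7, 0.2, 0.1]   # soft: a distribution over all clusters

def harden(dist):
    """Hard assignment: all mass on the single most likely cluster."""
    best = max(range(len(dist)), key=lambda k: dist[k])
    return [1.0 if k == best else 0.0 for k in range(len(dist))]

hard_assignment = harden(soft_assignment)   # membership is now exactly one cluster
```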
|------------
This chapter motivates the use of clustering in information retrieval by introducing a number of applications (Section 16.1), defines the problem we are trying to solve in clustering (Section 16.2) and discusses measures for evaluating cluster quality (Section 16.3). It then describes two flat clustering algorithms, K-means (Section 16.4), a hard clustering algorithm, and the Expectation-Maximization (or EM) algorithm (Section 16.5), a soft clustering algorithm. K-means is perhaps the most widely used flat clustering algorithm due to its simplicity and efficiency. The EM algorithm is a generalization of K-means and can be applied to a large variety of document representations and distributions.
|------------
A note on terminology. An alternative definition of hard clustering is that a document can be a full member of more than one cluster. Partitional clustering always refers to a clustering where each document belongs to exactly one cluster. (But in a partitional hierarchical clustering (Chapter 17) all members of a cluster are of course also members of its parent.) On the definition of hard clustering that permits multiple membership, the difference between soft clustering and hard clustering is that membership values in hard clustering are either 0 or 1, whereas they can take on any non-negative value in soft clustering.
***Reuters-21578->
|------------
See: http://www.clef-campaign.org/

Reuters-21578 and Reuters-RCV1. For text classification, the most used test collection has been the Reuters-21578 collection of 21578 newswire articles; see Chapter 13, page 279. More recently, Reuters released the much larger Reuters Corpus Volume 1 (RCV1), consisting of 806,791 documents; see Chapter 4, page 69. Its scale and rich annotation make it a better basis for future research.
***termID->
|------------
To make index construction more efficient, we represent terms as termIDs (instead of strings as we did in Figure 1.4), where each termID is a unique serial number. We can build the mapping from terms to termIDs on the fly while we are processing the collection; or, in a two-pass approach, we compile the vocabulary in the first pass and construct the inverted index in the second pass. The index construction algorithms described in this chapter all do a single pass through the data. Section 4.7 gives references to multipass algorithms that are preferable in certain applications, for example, when disk space is scarce.
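The on-the-fly mapping is just a dictionary that hands out the next serial number on first sight of a term; a sketch (names are ours):

```python
# On-the-fly term -> termID mapping: each new term gets the next serial number.

term_to_id = {}

def term_id(term: str) -> int:
    """Return the termID for term, assigning a fresh serial number if unseen."""
    if term not in term_to_id:
        term_to_id[term] = len(term_to_id)
    return term_to_id[term]

stream = "the quick fox jumps over the lazy fox".split()
ids = [term_id(t) for t in stream]   # repeated terms reuse their termID
```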
***query optimization->
|------------
We can extend the intersection operation to process more complicated queries like:

(1.2) (Brutus OR Caesar) AND NOT Calpurnia

Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system. A major element of this for Boolean queries is the order in which postings lists are accessed. What is the best order for query processing? Consider a query that is an AND of t terms, for instance:

(1.3) Brutus AND Caesar AND Calpurnia

For each of the t terms, we need to get its postings, then AND them together.
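The standard heuristic is to intersect the rarest postings lists first, so that intermediate results stay small. A sketch of query (1.3), with invented postings lists:

```python
# AND-query processing: intersect sorted postings lists, rarest term first.

postings = {
    "brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}

def intersect(p1, p2):
    """Linear merge of two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

terms = sorted(postings, key=lambda t: len(postings[t]))  # increasing frequency
result = postings[terms[0]]
for t in terms[1:]:
    result = intersect(result, postings[t])
```

Here calpurnia (4 postings) is processed first, so the later merges operate on a list of at most 4 docIDs rather than 8.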
***field->
|------------
6.1 Parametric and zone indexes

We have thus far viewed a document as a sequence of terms. In fact, most documents have additional structure. Digital documents generally encode, in machine-recognizable form, certain metadata associated with each document. By metadata, we mean specific forms of data about a document, such as its author(s), title and date of publication. This metadata would generally include fields such as the date of creation and the format of the document, as well as the author and possibly the title of the document. The possible values of a field should be thought of as finite – for instance, the set of all dates of authorship.
|------------
Consider queries of the form “find documents authored by William Shakespeare in 1601, containing the phrase alas poor Yorick”. Query processing then consists as usual of postings intersections, except that we may merge postings from standard inverted as well as parametric indexes. There is one parametric index for each field (say, date of creation); it allows us to select only the documents matching a date specified in the query. Figure 6.1 illustrates the user’s view of such a parametric search. Some of the fields may assume ordered values, such as dates; in the example query above, the year 1601 is one such field value. The search engine may support querying ranges on such ordered values; to this end, a structure like a B-tree may be used for the field’s dictionary.
|------------
Zones are similar to fields, except the contents of a zone can be arbitrary free text. Whereas a field may take on a relatively small set of values, a zone can be thought of as an arbitrary, unbounded amount of text. For instance, document titles and abstracts are generally treated as zones. We may build a separate inverted index for each zone of a document, to support queries such as “find documents with merchant in the title and william in the author list and the phrase gentle rain in the body”. This has the effect of building an index that looks like Figure 6.2. Whereas the dictionary for a parametric index comes from a fixed vocabulary (the set of languages, or the set of dates), the dictionary for a zone index must structure whatever vocabulary stems from the text of that zone.
***likelihood->
|------------
Writing Ā for the complement of an event A, we similarly have:

(11.2)  P(Ā, B) = P(B|Ā)P(Ā)

Probability theory also has a partition rule, which says that if an event B can be divided into an exhaustive set of disjoint subcases, then the probability of B is the sum of the probabilities of the subcases. A special case of this rule gives that:

(11.3)  P(B) = P(A, B) + P(Ā, B)

From these we can derive Bayes’ Rule for inverting conditional probabilities:

(11.4)  P(A|B) = P(B|A)P(A) / P(B) = [ P(B|A) / ∑_{X∈{A,Ā}} P(B|X)P(X) ] P(A)

This equation can also be thought of as a way of updating probabilities. We start off with an initial estimate of how likely the event A is when we do not have any other information; this is the prior probability P(A). Bayes’ rule lets us derive a posterior probability P(A|B) after having seen the evidence B, based on the likelihood of B occurring in the two cases that A does or does not hold.1 Finally, it is often useful to talk about the odds of an event, which provide a kind of multiplier for how probabilities change:

(11.5)  Odds: O(A) = P(A)/P(Ā) = P(A)/(1 − P(A))

11.2 The Probability Ranking Principle

11.2.1 The 1/0 loss case

We assume a ranked retrieval setup as in Section 6.3, where there is a collection of documents, the user issues a query, and an ordered list of documents is returned. We also assume a binary notion of relevance as in Chapter 8. For a query q and a document d in the collection, let Rd,q be an indicator random variable that says whether d is relevant with respect to a given query q. That is, it takes on a value of 1 when the document is relevant and 0 otherwise. In context we will often write just R for Rd,q.
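A quick numeric check of the partition rule, Bayes’ rule, and odds; the probabilities below are invented for illustration.

```python
# Partition rule, Bayes' rule (Equations 11.3-11.5) with made-up numbers.

p_a = 0.3                # prior P(A)
p_b_given_a = 0.8        # likelihood P(B|A)
p_b_given_not_a = 0.2    # P(B|complement of A)

# Partition rule: P(B) = P(A,B) + P(not-A,B), each term via the chain rule.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' rule: posterior P(A|B).
p_a_given_b = p_b_given_a * p_a / p_b

# Odds of A.
odds_a = p_a / (1 - p_a)
```

Seeing the evidence B raises the probability of A from the prior 0.3 to a posterior of roughly 0.63.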
|------------
Using a probabilistic model, the obvious order in which to present documents to the user is to rank documents by their estimated probability of relevance with respect to the information need: P(R = 1|d, q). This is the basis of the Probability Ranking Principle (PRP) (van Rijsbergen 1979, 113–114):

“If a reference retrieval system’s response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.”

In the simplest case of the PRP, there are no retrieval costs or other utility concerns that would differentially weight actions or errors. You lose a point for either returning a nonrelevant document or failing to return a relevant document (such a binary situation where you are evaluated on your accuracy is called 1/0 loss). The goal is to return the best possible results as the top k documents, for any value of k the user chooses to examine. The PRP then says to simply rank all documents in decreasing order of P(R = 1|d, q). If a set of retrieval results is to be returned, rather than an ordering, the Bayes optimal decision rule

1. The term likelihood is just a synonym of probability. It is the probability of an event or data according to a model. The term is usually used when people are thinking of holding the data fixed, while varying the model.
***add 12->
|------------
11.3.2 Probability estimates in theory

For each term t, what would these ct numbers look like for the whole collection? (11.19) gives a contingency table of counts of documents in the collection, where dft is the number of documents that contain term t:

(11.19)
                          relevant    nonrelevant              Total
Term present (xt = 1)     s           dft − s                  dft
Term absent  (xt = 0)     S − s       (N − dft) − (S − s)      N − dft
Total                     S           N − S                    N

Using this, pt = s/S and ut = (dft − s)/(N − S) and

(11.20)  ct = K(N, dft, S, s) = log [ (s/(S − s)) / ((dft − s)/((N − dft) − (S − s))) ]

To avoid the possibility of zeroes (such as if every or no relevant document has a particular term) it is fairly standard to add ½ to each of the quantities in the center 4 terms of (11.19), and then to adjust the marginal counts (the totals) accordingly (so, the bottom right cell totals N + 2). Then we have:

(11.21)  ĉt = K(N, dft, S, s) = log [ ((s + ½)/(S − s + ½)) / ((dft − s + ½)/(N − dft − S + s + ½)) ]

Adding ½ in this way is a simple form of smoothing. For trials with categorical outcomes (such as noting the presence or absence of a term), one way to estimate the probability of an event from data is simply to count the number of times an event occurred divided by the total number of trials.
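Equation (11.21) translates directly into code. The counts below are invented; the function name is ours.

```python
# Smoothed term weight c-hat from Equation (11.21): a log odds ratio with
# 1/2 added to each cell of the contingency table (counts are invented).
import math

def c_hat(N, dft, S, s):
    """N docs total, dft containing t, S relevant, s relevant containing t."""
    p_ratio = (s + 0.5) / (S - s + 0.5)                 # relevant-docs odds
    u_ratio = (dft - s + 0.5) / (N - dft - S + s + 0.5) # nonrelevant-docs odds
    return math.log(p_ratio / u_ratio)

# N=1000 docs, term occurs in dft=100, S=10 are relevant, s=5 of those contain it.
w = c_hat(1000, 100, 10, 5)
```

Without the ½ terms, s = S (every relevant document contains the term) would make the denominator S − s zero; with them, every cell stays strictly positive.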
|------------
3. Improve our guesses for pt and ut. We choose from the methods of Equations (11.23) and (11.25) for re-estimating pt, except now based on the set V instead of VR. If we let Vt be the subset of documents in V containing xt and use add-½ smoothing, we get:

(11.26)  pt = (|Vt| + ½) / (|V| + 1)

and if we assume that documents that are not retrieved are nonrelevant then we can update our ut estimates as:

(11.27)  ut = (dft − |Vt| + ½) / (N − |V| + 1)

4. Go to step 2 until the ranking of the returned results converges.
***link farms->
|------------
Ng et al. (2001b) suggests that the PageRank score assignment is more robust than HITS in the sense that scores are less sensitive to small changes in graph topology. However, it has also been noted that the teleport operation contributes significantly to PageRank’s robustness in this sense. Both PageRank and HITS can be “spammed” by the orchestrated insertion of links into the web graph; indeed, the Web is known to have such link farms that collude to increase the score assigned to certain pages by various link analysis algorithms.
***marginal relevance->
|------------
One clear problem with the relevance-based assessment that we have presented is the distinction between relevance and marginal relevance: whether a document still has distinctive usefulness after the user has looked at certain other documents (Carbonell and Goldstein 1998). Even if a document is highly relevant, its information can be completely redundant with other documents which have already been examined. The most extreme case of this is documents that are duplicates – a phenomenon that is actually very common on the World Wide Web – but it can also easily occur when several documents provide a similar precis of an event. In such circumstances, marginal relevance is clearly a better measure of utility to the user. Maximizing marginal relevance requires returning documents that exhibit diversity and novelty. One way to approach measuring this is by using distinct facts or entities as evaluation units. This perhaps more directly measures true utility to the user but doing this makes it harder to create a test collection.
***XML tag->
***lemmatization->
|------------
2.2.4 Stemming and lemmatization

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
|------------
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

am, are, is ⇒ be
car, cars, car’s, cars’ ⇒ car

The result of this mapping of text will be something like:

the boy’s cars are different colors ⇒ the boy car be differ color

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma.
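The contrast can be sketched with two toy functions: a crude suffix stripper with no vocabulary, and a lemmatizer backed by a (here tiny, hand-made) dictionary. Both are our own illustrations; a real stemmer such as Porter’s is far more careful.

```python
# Toy stemmer (suffix chopping, no vocabulary) vs. toy dictionary lemmatizer.

def crude_stem(word: str) -> str:
    """Chop a common ending; the result need not be a real word."""
    for suffix in ("ational", "izing", "izes", "ize", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hand-made lemma dictionary standing in for real morphological analysis.
LEMMAS = {"am": "be", "are": "be", "is": "be", "cars": "car", "car's": "car"}

def lemmatize(word: str) -> str:
    return LEMMAS.get(word, word)

stems = [crude_stem(w) for w in ["organize", "organizes", "organizing"]]
lemmas = [lemmatize(w) for w in ["am", "are", "is"]]
```

The stemmer maps all three organize forms to the string organ by pure suffix chopping; the lemmatizer maps am, are, is to the dictionary form be, which no suffix rule could do.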
***token normalization->
|------------
Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.4 The most standard way to normalize is to implicitly create equivalence classes, which are normally named after one member of the set. For instance, if the tokens anti-discriminatory and antidiscriminatory are both mapped onto the term antidiscriminatory, in both the document text and queries, then searches for one term will retrieve documents that contain either.
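One simple way to create such equivalence classes is to apply the same mapping rule at index time and at query time; deleting hyphens and periods, as below, is one such rule (the example terms follow the text; everything else is our own sketch).

```python
# Equivalence classing by applying one normalization rule on both the
# indexing side and the query side.

def normalize(token: str) -> str:
    """Lowercase and delete hyphens and periods."""
    return token.lower().replace("-", "").replace(".", "")

index_terms = {normalize(t) for t in ["anti-discriminatory", "U.S.A.", "Windows"]}

def matches(query_token: str) -> bool:
    """A query token matches if its normalized form is an indexed term."""
    return normalize(query_token) in index_terms

ok = matches("antidiscriminatory") and matches("USA")
```

Because both sides pass through normalize, a query for antidiscriminatory finds documents that wrote anti-discriminatory, and vice versa.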
***MAP->
|------------
In recent years, other measures have become more common. Most standard among the TREC community is Mean Average Precision (MAP), which provides a single-figure measure of quality across recall levels. Among evaluation measures, MAP has been shown to have especially good discrimination and stability. For a single information need, Average Precision is the

◮ Figure 8.3 Averaged 11-point precision/recall graph across 50 queries for a representative TREC system. The Mean Average Precision for this system is 0.2553.
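Average Precision for one query, and MAP over a set of queries, can be computed as below (the rankings and relevant sets are invented; unretrieved relevant documents contribute a precision of zero).

```python
# Average Precision for one ranked list, and MAP over several queries.

def average_precision(ranking, relevant):
    """Mean over all relevant docs of precision@k at each rank k where a
    relevant doc appears; relevant docs never retrieved contribute 0."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranking, relevant_set) pairs, one per information need."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ap = average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2"})
```

Here the relevant documents appear at ranks 2 and 4, giving precisions 1/2 and 2/4, so AP = 0.5.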
|------------
This is referred to as the relative frequency of the event. Estimating the probability as the relative frequency is the maximum likelihood estimate (or MLE), because this value makes the observed data maximally likely. However, if we simply use the MLE, then the probability given to events we happened to see is usually too high, whereas other events may be completely unseen and giving them as a probability estimate their relative frequency of 0 is both an underestimate, and normally breaks our models, since anything multiplied by 0 is 0. Simultaneously decreasing the estimated probability of seen events and increasing the probability of unseen events is referred to as smoothing. One simple way of smoothing is to add a number α to each of the observed counts. These pseudocounts correspond to the use of a uniform distribution over the vocabulary as a Bayesian prior, following Equation (11.4). We initially assume a uniform distribution over events, where the size of α denotes the strength of our belief in uniformity, and we then update the probability based on observed events. Since our belief in uniformity is weak, we use α = ½. This is a form of maximum a posteriori (MAP) estimation, where we choose the most likely point value for probabilities based on the prior and the observed evidence, following Equation (11.4). We will further discuss methods of smoothing estimated counts to give probability models in Section 12.2.2 (page 243); the simple method of adding ½ to each observed count will do for now.
|------------
In text classification, our goal is to find the best class for the document. The best class in NB classification is the most likely or maximum a posteriori (MAP) class cmap:

(13.3)  cmap = arg max_{c∈C} P̂(c|d) = arg max_{c∈C} P̂(c) ∏_{1≤k≤nd} P̂(tk|c)

We write P̂ for P because we do not know the true values of the parameters P(c) and P(tk|c), but estimate them from the training set as we will see in a moment.
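Equation (13.3) can be sketched directly, using log probabilities to avoid floating-point underflow on long documents; the classes and estimated parameters below are invented for illustration.

```python
# Choosing the maximum a posteriori class c_map (Equation 13.3), computed in
# log space; the priors and conditional estimates are made-up numbers.
import math

prior = {"china": 0.6, "uk": 0.4}                       # estimated P(c)
cond = {                                                # estimated P(t|c)
    "china": {"beijing": 0.4, "london": 0.1, "trade": 0.5},
    "uk":    {"beijing": 0.1, "london": 0.6, "trade": 0.3},
}

def c_map(doc_tokens):
    """argmax over classes of log P(c) + sum of log P(t|c) over doc tokens."""
    def score(c):
        return math.log(prior[c]) + sum(math.log(cond[c][t]) for t in doc_tokens)
    return max(prior, key=score)

cls = c_map(["beijing", "trade", "trade"])
```

Taking logs leaves the argmax unchanged because log is monotonic, which is why the product in (13.3) becomes a sum here.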
***symmetric diagonal decomposition,->
***exhaustive clustering->
|------------
Some researchers distinguish between exhaustive clusterings that assign each document to a cluster and non-exhaustive clusterings, in which some documents will be assigned to no cluster. Non-exhaustive clusterings in which each document is a member of either no cluster or one cluster are called exclusive. We define clustering to be exhaustive in this book.

16.2.1 Cardinality – the number of clusters

A difficult issue in clustering is determining the number of clusters or cardinality of a clustering, which we denote by K. Often K is nothing more than a good guess based on experience or domain knowledge. But for K-means, we will also introduce a heuristic method for choosing K and an attempt to incorporate the selection of K into the objective function. Sometimes the application puts constraints on the range of K. For example, the Scatter-Gather interface in Figure 16.3 could not display more than about K = 10 clusters per layer because of the size and resolution of computer monitors in the early 1990s.
***ordinal regression->
|------------
However, approaching IR result ranking like this is not necessarily the right way to think about the problem. Statisticians normally first divide problems into classification problems (where a categorical variable is predicted) versus regression problems (where a real number is predicted). In between is the specialized field of ordinal regression where a ranking is predicted. Machine learning for ad hoc retrieval is most properly thought of as an ordinal regression problem, where the goal is to rank a set of documents for a query, given training data of the same sort. This formulation gives some additional power, since documents can be evaluated relative to other candidate documents for the same query, rather than having to be mapped to a global scale of goodness, while also weakening the problem space, since just a ranking is required rather than an absolute measure of relevance. Issues of ranking are especially germane in web search, where the ranking at the very top of the results list is exceedingly important, whereas decisions of relevance of a document to a query may be much less important. Such work can and has been pursued using the structural SVM framework which we mentioned in Section 15.2.2, where the class being predicted is a ranking of results for a query, but here we will present the slightly simpler ranking SVM.
***push model->
|------------
14.7 References and further reading

As discussed in Chapter 9, Rocchio relevance feedback is due to Rocchio (1971). Joachims (1997) presents a probabilistic analysis of the method. Rocchio classification was widely used as a classification method in TREC in the 1990s (Buckley et al. 1994a;b, Voorhees and Harman 2005). Initially, it was used as a form of routing. Routing merely ranks documents according to relevance to a class without assigning them. Early work on filtering, a true classification approach that makes an assignment decision on each document, was published by Ittner et al. (1995) and Schapire et al. (1998). The definition of routing we use here should not be confused with another sense: routing can also refer to the electronic distribution of documents to subscribers, the so-called push model of document distribution. In a pull model, each transfer of a document to the user is initiated by the user – for example, by means of search or by selecting it from a list of documents on a news aggregation website.
***supervised learning->
|------------
Using a learning method or learning algorithm, we then wish to learn a classifier or classification function γ that maps documents to classes:

γ : X → C    (13.1)

This type of learning is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = γ. The learning method Γ takes the training set D as input and returns the learned classification function γ.
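The shape of Γ(D) = γ can be sketched in a few lines. This is a toy learner of our own invention, not one from the text: Γ (`learn` below) counts which class each word appears with in the labeled training documents, and the returned γ classifies a new document by a weighted vote of its words. The example documents and the tie-free vote rule are assumptions for illustration only.

```python
# A minimal sketch of supervised learning as Gamma(D) = gamma: the
# learning method takes labeled training documents and returns a
# classification function. The toy learner here is ours, not the book's.
from collections import Counter, defaultdict

def learn(training_set):
    """Gamma: takes (document, class) pairs, returns a classifier gamma."""
    word_class_counts = defaultdict(Counter)
    for doc, cls in training_set:
        for word in doc.split():
            word_class_counts[word][cls] += 1

    def gamma(doc):
        """The learned classification function: documents -> classes."""
        votes = Counter()
        for word in doc.split():
            for cls, n in word_class_counts[word].items():
                votes[cls] += n
        return votes.most_common(1)[0][0] if votes else None

    return gamma

training = [
    ("beijing shanghai trade", "China"),
    ("london parliament trade", "UK"),
    ("beijing delegation visit", "China"),
]
gamma = learn(training)
```

Note the two-stage structure mirrors the notation: `learn` plays the role of Γ, and the closure it returns plays the role of γ.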
***search advertising->
|------------
Several aspects of Goto’s model are worth highlighting. First, a user typing the query q into Goto’s search interface was actively expressing an interest and intent related to q. For instance, a user typing golf clubs is more likely to be imminently purchasing a set than one who is simply browsing news on golf. Second, Goto was only compensated when a user actually expressed interest in an advertisement – as evinced by the user clicking the advertisement. Taken together, these created a powerful mechanism by which to connect advertisers to consumers, quickly raising the annual revenues of Goto/Overture into hundreds of millions of dollars. This style of search engine came to be known variously as sponsored search or search advertising.

Given these two kinds of search engines – the “pure” search engines such as Google and Altavista, versus the sponsored search engines – the logical next step was to combine them into a single user experience. Current search engines follow precisely this model: they provide pure search results (generally known as algorithmic search results) as the primary response to a user’s search, together with sponsored search results displayed separately and distinctively to the right of the algorithmic results. This is shown in Figure 19.6. Retrieving sponsored search results and ranking them in response to a query has now become considerably more sophisticated than the simple Goto scheme; the process entails a blending of ideas from information retrieval.

◮ Figure 19.6 Search advertising triggered by query keywords. Here the query A320 returns algorithmic search results about the Airbus aircraft, together with advertisements for various non-aircraft goods numbered A320 that advertisers seek to market to those issuing this query. The lack of advertisements for the aircraft reflects the fact that few marketers attempt to sell A320 aircraft on the web.
***proximity weighting->
|------------
How can we design such a proximity-weighted scoring function to depend on ω? The simplest answer relies on a “hand coding” technique we introduce below in Section 7.2.3. A more scalable approach goes back to Section 6.1.2: we treat the integer ω as yet another feature in the scoring function, whose importance is assigned by machine learning, as will be developed further in Section 15.4.1.
***term-at-a-time->
|------------
The outermost loop beginning at Step 3 repeats the updating of Scores, iterating over each query term t in turn. In Step 5 we calculate the weight in the query vector for term t. Steps 6–8 update the score of each document by adding in the contribution from term t. This process of adding in contributions one query term at a time is sometimes known as term-at-a-time scoring or accumulation, and the N elements of the array Scores are therefore known as accumulators. For this purpose, it would appear necessary to store, with each postings entry, the weight wf_{t,d} of term t in document d (we have thus far used either tf or tf-idf for this weight, but leave open the possibility of other functions to be developed in Section 6.4). In fact this is wasteful, since storing this weight may require a floating point number. Two ideas help alleviate this space problem. First, if we are using inverse document frequency, we need not precompute idf_t; it suffices to store N/df_t at the head of the postings for t. Second, we store the term frequency tf_{t,d} for each postings entry. Finally, Step 12 extracts the top K scores – this requires a priority queue data structure, often implemented using a heap. Such a heap takes no more than 2N comparisons to construct, after which each of the K top scores can be extracted from the heap at a cost of O(log N) comparisons.
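The loop structure described above can be sketched directly. This is a minimal, assumption-laden version: the toy postings data and the tf·idf weighting without length normalization are ours, but the mechanics match the description, with one accumulator per document, N/df_t stored at the head of each postings list, raw tf in each entry, and a heap for the final top-K extraction.

```python
# Term-at-a-time scoring with accumulators (illustrative sketch).
import heapq
import math

N = 4  # number of documents
# postings[t] = (N/df_t stored at the head, [(doc_id, tf_{t,d}), ...])
postings = {
    "car":       (N / 2, [(0, 3), (2, 1)]),
    "insurance": (N / 1, [(2, 2)]),
}

def term_at_a_time_score(query, k=2):
    scores = [0.0] * N                 # one accumulator per document
    for t in query:                    # outer loop: one query term at a time
        if t not in postings:
            continue
        n_over_df, plist = postings[t]
        idf = math.log10(n_over_df)    # idf computed from stored N/df_t
        for doc_id, tf in plist:       # add this term's contribution
            scores[doc_id] += tf * idf
    # (length normalization omitted for brevity)
    # extract the top k (doc_id, score) pairs via a heap
    return heapq.nlargest(k, enumerate(scores), key=lambda p: p[1])

top = term_at_a_time_score(["car", "insurance"])
```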
|------------
Computing scores in this manner is sometimes referred to as document-at-a-time scoring. We will now introduce a technique for inexact top-K retrieval in which the postings are not all ordered by a common ordering, thereby precluding such a concurrent traversal. We will therefore require scores to be “accumulated” one term at a time as in the scheme of Figure 6.14, so that we have term-at-a-time scoring.
***monotonicity->
|------------
A fundamental assumption in HAC is that the merge operation is monotonic. Monotonic means that if s1, s2, . . . , sK−1 are the combination similarities of the successive merges of an HAC, then s1 ≥ s2 ≥ . . . ≥ sK−1 holds. A non-monotonic hierarchical clustering contains at least one inversion si < si+1 and contradicts the fundamental assumption that we chose the best merge available at each step. We will see an example of an inversion in Figure 17.12.
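The monotonicity condition is easy to check mechanically. A small sketch (the function name and the example similarity sequences are ours): given the combination similarities s1, …, sK−1 of successive merges, report every index where si < si+1, i.e., every inversion.

```python
# Check the HAC monotonicity assumption: s1 >= s2 >= ... >= s_{K-1}.
def find_inversions(similarities):
    """Return 0-based indices i where s_i < s_{i+1} (inversions)."""
    return [i for i in range(len(similarities) - 1)
            if similarities[i] < similarities[i + 1]]

monotonic = [0.9, 0.7, 0.7, 0.4]   # non-increasing: no inversion
non_monotonic = [0.9, 0.3, 0.5]    # 0.3 < 0.5 is an inversion
```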
***combination similarity->
|------------
An HAC clustering is typically visualized as a dendrogram, as shown in Figure 17.1. Each merge is represented by a horizontal line. The y-coordinate of the horizontal line is the similarity of the two clusters that were merged, where documents are viewed as singleton clusters. We call this similarity the combination similarity of the merged cluster. For example, the combination similarity of the cluster consisting of Lloyd’s CEO questioned and Lloyd’s chief / U.S. grilling in Figure 17.1 is ≈ 0.56. We define the combination similarity of a singleton cluster as its document’s self-similarity (which is 1.0 for cosine similarity).
|------------
Both single-link and complete-link clustering have graph-theoretic interpretations. Define sk to be the combination similarity of the two clusters merged in step k, and G(sk) the graph that links all data points with a similarity of at least sk. Then the clusters after step k in single-link clustering are the connected components of G(sk), and the clusters after step k in complete-link clustering are maximal cliques of G(sk). A connected component is a maximal set of connected points such that there is a path connecting each pair. A clique is a set of points that are completely linked with each other.
|------------
? Exercise 17.3 For a fixed set of N documents there are up to N^2 distinct similarities between clusters in single-link and complete-link clustering. How many distinct cluster similarities are there in GAAC and centroid clustering?

✄ 17.5 Optimality of HAC

To state the optimality conditions of hierarchical clustering precisely, we first define the combination similarity COMB-SIM of a clustering Ω = {ω1, . . . , ωK} as the smallest combination similarity of any of its K clusters:

COMB-SIM({ω1, . . . , ωK}) = min_k COMB-SIM(ωk)

Recall that the combination similarity of a cluster ω that was created as the merge of ω1 and ω2 is the similarity of ω1 and ω2 (page 378).
|------------
We then define Ω = {ω1, . . . , ωK} to be optimal if all clusterings Ω′ with k clusters, k ≤ K, have combination similarities no higher than that of Ω:

|Ω′| ≤ |Ω| ⇒ COMB-SIM(Ω′) ≤ COMB-SIM(Ω)

Figure 17.12 shows that centroid clustering is not optimal. The clustering {{d1, d2}, {d3}} (for K = 2) has combination similarity −(4 − ε), and {{d1, d2, d3}} (for K = 1) has combination similarity −3.46. So the clustering {{d1, d2}, {d3}} produced in the first merge is not optimal, since there is a clustering with fewer clusters ({{d1, d2, d3}}) that has higher combination similarity. Centroid clustering is not optimal because inversions can occur.
|------------
The above definition of optimality would be of limited use if it were only applicable to a clustering together with its merge history. However, we can show (Exercise 17.4) that combination similarity for the three non-inversion algorithms can be read off from a cluster without knowing its history. These direct definitions of combination similarity are as follows.
|------------
single-link The combination similarity of a cluster ω is the smallest similarity of any bipartition of the cluster, where the similarity of a bipartition is the largest similarity between any two documents from the two parts:

COMB-SIM(ω) = min_{ω′: ω′ ⊂ ω} max_{di ∈ ω′} max_{dj ∈ ω−ω′} SIM(di, dj)

where each 〈ω′, ω − ω′〉 is a bipartition of ω.
|------------
complete-link The combination similarity of a cluster ω is the smallest similarity of any two points in ω: min_{di ∈ ω} min_{dj ∈ ω} SIM(di, dj).
|------------
GAAC The combination similarity of a cluster ω is the average of all pairwise similarities in ω (where self-similarities are not included in the average): Equation (17.3).
|------------
If we use these definitions of combination similarity, then optimality is a property of a set of clusters and not of a process that produces a set of clusters.
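The three direct definitions can be sketched in code, reading the combination similarity off a cluster with no merge history. This is a brute-force illustration over a tiny cluster (the similarity values are invented); the single-link case enumerates every bipartition, which is exponential and only sensible for small clusters.

```python
# Direct combination-similarity definitions (illustrative, brute force).
from itertools import combinations

def comb_sim_complete_link(cluster, sim):
    # smallest similarity of any two points in the cluster
    return min(sim(a, b) for a, b in combinations(cluster, 2))

def comb_sim_gaac(cluster, sim):
    # average of all pairwise similarities (self-similarities excluded)
    pairs = list(combinations(cluster, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

def comb_sim_single_link(cluster, sim):
    # min over bipartitions of the max similarity across the split
    cluster = list(cluster)
    best = float("inf")
    rest = cluster[1:]
    # fix cluster[0] on one side so each bipartition is enumerated once
    for r in range(len(rest) + 1):
        for subset in combinations(rest, r):
            side_a = set(subset) | {cluster[0]}
            side_b = [d for d in cluster if d not in side_a]
            if not side_b:
                continue
            cross = max(sim(a, b) for a in side_a for b in side_b)
            best = min(best, cross)
    return best

_sims = {frozenset(("a", "b")): 0.9,
         frozenset(("a", "c")): 0.2,
         frozenset(("b", "c")): 0.4}
def sim(x, y):
    return _sims[frozenset((x, y))]

cluster = ["a", "b", "c"]
```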
***decision boundary->
|------------
? Exercise 14.1 For small areas, distances on the surface of the hypersphere are approximated well by distances on its projection (Figure 14.2) because α ≈ sin α for small angles. For what size angle is the distortion α/sin(α) (i) 1.01, (ii) 1.05 and (iii) 1.1?

14.2 Rocchio classification

Figure 14.1 shows three classes, China, UK and Kenya, in a two-dimensional (2D) space. Documents are shown as circles, diamonds and X’s. The boundaries in the figure, which we call decision boundaries, are chosen to separate the three classes, but are otherwise arbitrary. To classify a new document, depicted as a star in the figure, we determine the region it occurs in and assign it the class of that region – China in this case. Our task in vector space classification is to devise algorithms that compute good boundaries, where “good” means high classification accuracy on data unseen during training.
***F measure->
|------------
A single measure that trades off precision versus recall is the F measure, which is the weighted harmonic mean of precision and recall:

F = 1 / (α(1/P) + (1 − α)(1/R)) = (β^2 + 1)PR / (β^2 P + R)   where β^2 = (1 − α)/α    (8.5)

where α ∈ [0, 1] and thus β^2 ∈ [0, ∞]. The default balanced F measure equally weights precision and recall, which means setting α = 1/2 or β = 1. It is commonly written as F1, which is short for Fβ=1, even though the formulation in terms of α more transparently exhibits the F measure as a weighted harmonic mean. When using β = 1, the formula on the right simplifies to:

Fβ=1 = 2PR / (P + R)    (8.6)

However, using an even weighting is not the only choice. Values of β < 1 emphasize precision, while values of β > 1 emphasize recall. For example, a value of β = 3 or β = 5 might be used if recall is to be emphasized. Recall, precision, and the F measure are inherently measures between 0 and 1, but they are also very commonly written as percentages, on a scale between 0 and 100.
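Equation (8.5) in its β form is a one-liner; the example values below are our own:

```python
# F measure from Equation (8.5): (beta^2 + 1) P R / (beta^2 P + R).
# beta = 1 gives the balanced F1 = 2PR / (P + R).
def f_measure(precision, recall, beta=1.0):
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

f1 = f_measure(0.5, 0.5)            # balanced: harmonic mean of P and R
f3 = f_measure(0.5, 0.9, beta=3.0)  # beta > 1 emphasizes recall
```

With P = 0.5 and R = 0.9, the recall-emphasizing F3 exceeds the balanced F1 for the same pair, as expected.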
|------------
The notions of recall and precision were first used by Kent et al. (1955), although the term precision did not appear until later. The F measure (or, rather, its complement E = 1 − F) was introduced by van Rijsbergen (1979).
***URL->
|------------
The basic operation is as follows: a client (such as a browser) sends an http request to a web server. The browser specifies a URL (for Uniform Resource Locator) such as http://www.stanford.edu/home/atoz/contact.html.
|------------
In this example URL, the string http refers to the protocol to be used for transmitting the data. The string www.stanford.edu is known as the domain and specifies the root of a hierarchy of web pages (typically mirroring a filesystem hierarchy underlying the web server). In this example, /home/atoz/contact.html is a path in this hierarchy, leading to a file contact.html that contains the information to be returned by the web server at www.stanford.edu in response to this request. The HTML-encoded file contact.html holds the hyperlinks and the content (in this instance, contact information for Stanford University), as well as formatting rules for rendering this content in a browser.
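The decomposition described above can be checked with Python's standard library, splitting the example URL into protocol, domain, and path:

```python
# Split the example URL into the components discussed in the text.
from urllib.parse import urlparse

parts = urlparse("http://www.stanford.edu/home/atoz/contact.html")
scheme = parts.scheme   # the protocol: "http"
domain = parts.netloc   # the domain: "www.stanford.edu"
path = parts.path       # the path: "/home/atoz/contact.html"
```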
|------------
The designers of the first browsers made it easy to view the HTML markup tags on the content of a URL. This simple convenience allowed new users to create their own HTML content without extensive training or experience; rather, they learned from example content that they liked. As they did so, a second feature of browsers supported the rapid proliferation of web content creation and usage: browsers ignored what they did not understand. This did not, as one might fear, lead to the creation of numerous incompatible dialects of HTML. What it did promote was amateur content creators who could freely experiment with and learn from their newly created web pages without fear that a simple syntax error would “bring the system down.” Publishing on the Web became a mass activity that was not limited to a few trained programmers, but rather open to tens and eventually hundreds of millions of individuals. For most users and for most information needs, the Web quickly became the best way to supply and consume information on everything from rare ailments to subway schedules.
***hub score->
|------------
21.3 Hubs and Authorities

We now develop a scheme in which, given a query, every web page is assigned two scores. One is called its hub score and the other its authority score. For any query, we compute two ranked lists of results rather than one. The ranking of one list is induced by the hub scores and that of the other by the authority scores.
***static quality scores->
|------------
7.1.4 Static quality scores and ordering

We now further develop the idea of champion lists, in the somewhat more general setting of static quality scores. In many search engines, we have available a measure of quality g(d) for each document d that is query-independent and thus static. This quality measure may be viewed as a number between zero and one. For instance, in the context of news stories on the web, g(d) may be derived from the number of favorable reviews of the story by web surfers.

◮ Figure 7.2 A static quality-ordered index. In this example we assume that Doc1, Doc2 and Doc3 respectively have static quality scores g(1) = 0.25, g(2) = 0.5, g(3) = 1.
***entropy->
|------------
The characteristic of a discrete probability distribution P that determines its coding properties (including whether a code is optimal) is its entropy H(P), which is defined as follows:

H(P) = − ∑_{x∈X} P(x) log2 P(x)

where X is the set of all possible numbers we need to be able to encode (and therefore ∑_{x∈X} P(x) = 1.0). Entropy is a measure of uncertainty, as shown in Figure 5.9 for a probability distribution P over two possible outcomes, namely X = {x1, x2}. Entropy is maximized (H(P) = 1) for P(x1) = P(x2) = 0.5, when uncertainty about which xi will appear next is largest.

2. We assume here that G has no leading 0s. If there are any, they are removed before deleting the leading 1.
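The definition translates directly into code. A small sketch (with the usual convention that a term with P(x) = 0 contributes 0 to the sum):

```python
# Entropy H(P) = -sum_x P(x) log2 P(x) over a discrete distribution,
# given as a list of probabilities summing to 1.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

For two outcomes, `entropy([0.5, 0.5])` gives the maximum of 1 bit, and a certain outcome gives 0, matching the shape of Figure 5.9.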
|------------
δ codes (Exercise 5.9) and γ codes were introduced by Elias (1975), who proved that both codes are universal. In addition, δ codes are asymptotically optimal for H(P) → ∞. δ codes perform better than γ codes if large numbers (greater than 15) dominate. A good introduction to information theory, including the concept of entropy, is (Cover and Thomas 1991). While Elias codes are only asymptotically optimal, arithmetic codes (Witten et al. 1999, Section 2.4) can be constructed to be arbitrarily close to the optimum H(P) for any P.
|------------
H is entropy as defined in Chapter 5 (page 99):

H(Ω) = − ∑_k P(ωk) log P(ωk)    (16.5)
     = − ∑_k (|ωk|/N) log(|ωk|/N)    (16.6)

where, again, the second equation is based on maximum likelihood estimates of the probabilities.
|------------
The normalization by the denominator [H(Ω) + H(C)]/2 in Equation (16.2) fixes this problem, since entropy tends to increase with the number of clusters. For example, H(Ω) reaches its maximum log N for K = N, which ensures that NMI is low for K = N. Because NMI is normalized, we can use it to compare clusterings with different numbers of clusters. The particular form of the denominator is chosen because [H(Ω) + H(C)]/2 is a tight upper bound on I(Ω; C) (Exercise 16.8). Thus, NMI is always a number between 0 and 1.
***NTCIR->
|------------
NII Test Collections for IR Systems (NTCIR). The NTCIR project has built various test collections of similar sizes to the TREC collections, focusing on East Asian language and cross-language information retrieval, where queries are made in one language over a document collection containing documents in one or more other languages. See: http://research.nii.ac.jp/ntcir/data/data-en.html

Cross Language Evaluation Forum (CLEF). This evaluation series has concentrated on European languages and cross-language information retrieval.
|------------
Kekäläinen and Järvelin (2002) argue for the superiority of graded relevance judgments when dealing with very large document collections, and Järvelin and Kekäläinen (2002) introduce cumulated gain-based methods for IR system evaluation in this context. Sakai (2007) does a study of the stability and sensitivity of evaluation measures based on graded relevance judgments from NTCIR tasks, and concludes that NDCG is best for evaluating document ranking.
***divisive clustering->
|------------
17.6 Divisive clustering

So far we have only looked at agglomerative clustering, but a cluster hierarchy can also be generated top-down. This variant of hierarchical clustering is called top-down clustering or divisive clustering. We start at the top with all documents in one cluster. The cluster is split using a flat clustering algorithm. This procedure is applied recursively until each document is in its own singleton cluster.
***sort-based->
|------------
We will define and discuss the earlier stages of processing, that is, steps 1–3, in Section 2.2 (page 22). Until then you can think of tokens and normalized tokens as also loosely equivalent to words. Here, we assume that the first 3 steps have already been done, and we examine building a basic inverted index by sort-based indexing.
***class boundary->
|------------
Figure 14.10 is a graphical example of a linear problem, which we define to mean that the underlying distributions P(d|c) and P(d|c̄) of the two classes are separated by a line. We call this separating line the class boundary. It is the “true” boundary of the two classes, and we distinguish it from the decision boundary that the learning method computes to approximate the class boundary.
|------------
As is typical in text classification, there are some noise documents in Figure 14.10 (marked with arrows) that do not fit well into the overall distribution of the classes. In Section 13.5 (page 271), we defined a noise feature as a misleading feature that, when included in the document representation, on average increases the classification error. Analogously, a noise document is a document that, when included in the training set, misleads the learning method and increases classification error. Intuitively, the underlying distribution partitions the representation space into areas with mostly homogeneous class assignments.

◮ Figure 14.10 A linear problem with noise. In this hypothetical web page classification scenario, Chinese-only web pages are solid circles and mixed Chinese-English web pages are squares. The two classes are separated by a linear class boundary (dashed line, short dashes), except for three noise documents (marked with arrows).
***utility measure->
|------------
David D. Lewis defines the ModApte split at www.daviddlewis.com/resources/testcollections/reuters21578/readme based on Apté et al. (1994). Lewis (1995) describes utility measures for the evaluation of text classification systems. Yang and Liu (1999) employ significance tests in the evaluation of text classification methods.
***Akaike Information Criterion->
|------------
A theoretical justification for Equation (16.11) is the Akaike Information Criterion or AIC, an information-theoretic measure that trades off distortion against model complexity. The general form of AIC is:

AIC: K = arg min_K [−2L(K) + 2q(K)]    (16.12)

where −L(K), the negative maximum log-likelihood of the data for K clusters, is a measure of distortion and q(K), the number of parameters of a model with K clusters, is a measure of model complexity. We will not attempt to derive the AIC here, but it is easy to understand intuitively. The first property of a good model of the data is that each data point is modeled well by the model. This is the goal of low distortion. But models should also be small (i.e., have low model complexity), since a model that merely describes the data (and therefore has zero distortion) is worthless. AIC provides a theoretical justification for one particular way of weighting these two factors, distortion and model complexity, when selecting a model.
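The selection rule in Equation (16.12) can be sketched as follows. The log-likelihood and parameter-count values below are invented for illustration; in practice L(K) comes from actually fitting a K-cluster model to the data.

```python
# Choose K by AIC: K = argmin_K [-2 L(K) + 2 q(K)].
def choose_k_by_aic(log_likelihood_by_k, num_params_by_k):
    """Both arguments are dicts keyed by K. Returns (best K, AIC scores)."""
    aic = {k: -2 * log_likelihood_by_k[k] + 2 * num_params_by_k[k]
           for k in log_likelihood_by_k}
    return min(aic, key=aic.get), aic

# Made-up values: distortion falls as K grows, but complexity rises.
L = {1: -1000.0, 2: -600.0, 3: -560.0, 4: -555.0}
q = {1: 10, 2: 20, 3: 30, 4: 40}
best_k, aic_scores = choose_k_by_aic(L, q)
```

Here the likelihood gain from K = 3 to K = 4 is too small to pay for the extra parameters, so AIC picks K = 3.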
***objective function->
|------------
16.2 Problem statement

We can define the goal in hard flat clustering as follows. Given (i) a set of documents D = {d1, . . . , dN}, (ii) a desired number of clusters K, and (iii) an objective function that evaluates the quality of a clustering, we want to compute an assignment γ : D → {1, . . . , K} that minimizes (or, in other cases, maximizes) the objective function. In most cases, we also demand that γ is surjective, i.e., that none of the K clusters is empty.
|------------
The objective function is often defined in terms of similarity or distance between documents. Below, we will see that the objective in K-means clustering is to minimize the average distance between documents and their centroids or, equivalently, to maximize the similarity between documents and their centroids. The discussion of similarity measures and distance metrics in Chapter 14 (page 291) also applies to this chapter. As in Chapter 14, we use both similarity and distance to talk about relatedness between documents.
|------------
A measure of how well the centroids represent the members of their clusters is the residual sum of squares or RSS, the squared distance of each vector from its centroid, summed over all vectors:

RSS_k = ∑_{~x∈ωk} |~x − ~µ(ωk)|^2

RSS = ∑_{k=1}^{K} RSS_k    (16.7)

RSS is the objective function in K-means and our goal is to minimize it. Since N is fixed, minimizing RSS is equivalent to minimizing the average squared distance, a measure of how well centroids represent their documents.
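Equation (16.7) can be computed directly. In this sketch the clustering is given rather than computed, vectors are plain lists, and the two toy clusters are our own:

```python
# RSS, the K-means objective: squared distance of each vector from its
# cluster's centroid, summed over all clusters and vectors.
def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def rss(clusters):
    total = 0.0
    for vectors in clusters:          # sum over clusters k: RSS_k
        mu = centroid(vectors)
        for v in vectors:             # squared distance to the centroid
            total += sum((x - m) ** 2 for x, m in zip(v, mu))
    return total

clusters = [
    [[0.0, 0.0], [2.0, 0.0]],   # centroid (1, 0): RSS_1 = 1 + 1 = 2
    [[5.0, 5.0]],               # singleton cluster: RSS_2 = 0
]
```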
***vector space model->
|------------
6.3 The vector space model for scoring

In Section 6.2 (page 117) we developed the notion of a document vector that captures the relative importance of the terms in a document. The representation of a set of documents as vectors in a common vector space is known as the vector space model and is fundamental to a host of information retrieval operations, ranging from scoring documents on a query to document classification and document clustering. We first develop the basic ideas underlying vector space scoring; a pivotal step in this development is the view (Section 6.3.2) of queries as vectors in the same vector space as the document collection.
***vertical search engine->
|------------
• Topic-specific or vertical search. Vertical search engines restrict searches to a particular topic. For example, the query computer science on a vertical search engine for the topic China will return a list of Chinese computer science departments with higher precision and recall than the query computer science China on a general purpose search engine. This is because the vertical search engine does not include web pages in its index that contain the term china in a different sense (e.g., referring to a hard white ceramic), but does include relevant pages even if they do not explicitly mention the term China.
***k nearest neighbor classification->
|------------
14.3 k nearest neighbor

Unlike Rocchio, k nearest neighbor or kNN classification determines the decision boundary locally. For 1NN we assign each document to the class of its closest neighbor. For kNN we assign each document to the majority class of its k closest neighbors, where k is a parameter. The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test document d to have the same label as the training documents located in the local region surrounding d.
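The decision rule is short enough to sketch directly. The toy 2-D vectors and Euclidean distance are assumptions for illustration (in text classification one would typically use cosine similarity over term vectors):

```python
# kNN classification: assign the majority class of the k closest
# training documents (toy 2-D example with Euclidean distance).
from collections import Counter
import math

def knn_classify(train, test_vec, k):
    """train: list of (vector, label) pairs. Returns the majority label
    among the k training vectors nearest to test_vec."""
    by_dist = sorted(train, key=lambda item: math.dist(item[0], test_vec))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

train = [
    ((0.0, 0.0), "UK"), ((0.1, 0.2), "UK"),
    ((1.0, 1.0), "China"), ((1.1, 0.9), "China"), ((0.9, 1.1), "China"),
]
```

With k = 1 this is the 1NN rule; larger k smooths the locally determined decision boundary.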
***external sorting algorithm->
|------------
With main memory insufficient, we need to use an external sorting algorithm, that is, one that uses disk. For acceptable speed, the central requirement of such an algorithm is that it minimize the number of random disk seeks during sorting.

BSBINDEXCONSTRUCTION()
1  n ← 0
2  while (all documents have not been processed)
3  do n ← n + 1
4     block ← PARSENEXTBLOCK()
5     BSBI-INVERT(block)
6     WRITEBLOCKTODISK(block, fn)
7  MERGEBLOCKS(f1, . . . , fn; fmerged)

◮ Figure 4.2 Blocked sort-based indexing. The algorithm stores inverted blocks in files f1, . . . , fn and the merged index in fmerged.
***prototype->
|------------
We introduce two vector space classification methods in this chapter, Rocchio and kNN. Rocchio classification (Section 14.2) divides the vector space into regions centered on centroids or prototypes, one for each class, computed as the center of mass of all documents in the class. Rocchio classification is simple and efficient, but inaccurate if classes are not approximately spheres with similar radii.
***reduce phase->
|------------
The map and reduce phases of MapReduce split up the computing job into chunks that standard machines can process in a short time. The various steps of MapReduce are shown in Figure 4.5 and an example on a collection consisting of two documents is shown in Figure 4.6. First, the input data, in our case a collection of web pages, are split into n splits, where the size of the split is chosen to ensure that the work can be distributed evenly (chunks should not be too large) and efficiently (the total number of chunks we need to manage should not be too large); 16 or 64 MB are good sizes in distributed indexing. Splits are not preassigned to machines, but are instead assigned by the master node on an ongoing basis: as a machine finishes processing one split, it is assigned the next one. If a machine dies or becomes a laggard due to hardware problems, the split it is working on is simply reassigned to another machine.
|------------
For the reduce phase, we want all values for a given key to be stored close together, so that they can be read and processed quickly. This is achieved by partitioning the keys into term partitions (e.g., a–f, g–p, q–z) and having the parsers write the postings for each term partition into a separate segment file.

◮ Figure 4.5 An example of distributed indexing with MapReduce. Adapted from Dean and Ghemawat (2004).
***ranking SVM->
|------------
However, approaching IR result ranking like this is not necessarily the right way to think about the problem. Statisticians normally first divide problems into classification problems (where a categorical variable is predicted) versus regression problems (where a real number is predicted). In between is the specialized field of ordinal regression, where a ranking is predicted. Machine learning for ad hoc retrieval is most properly thought of as an ordinal regression problem: the goal is to rank a set of documents for a query, given training data of the same sort. This formulation gives some additional power, since documents can be evaluated relative to other candidate documents for the same query rather than having to be mapped to a global scale of goodness, while also weakening the problem space, since just a ranking is required rather than an absolute measure of relevance. Issues of ranking are especially germane in web search, where the ranking at the very top of the results list is exceedingly important, whereas decisions of relevance of a document to a query may be much less important. Such work can and has been pursued using the structural SVM framework which we mentioned in Section 15.2.2, where the class being predicted is a ranking of results for a query, but here we will present the slightly simpler ranking SVM.
|------------
The construction of a ranking SVM proceeds as follows. We begin with a set of judged queries. For each training query q, we have a set of documents returned in response to the query, which have been totally ordered by a person for relevance to the query. We construct a vector of features ψj = ψ(dj, q) for each document/query pair, using features such as those discussed in Section 15.4.1, and many more. For two documents di and dj, we then form the vector of feature differences:

Φ(di, dj, q) = ψ(di, q) − ψ(dj, q)    (15.18)

By hypothesis, one of di and dj has been judged more relevant. If di is judged more relevant than dj, denoted di ≺ dj (di should precede dj in the results ordering), then we will assign the vector Φ(di, dj, q) the class yijq = +1; otherwise −1. The goal then is to build a classifier which will return

~wTΦ(di, dj, q) > 0 iff di ≺ dj    (15.19)

This SVM learning task is formalized in a manner much like the other examples that we saw before:

(15.20) Find ~w, and ξi,j ≥ 0 such that:
• (1/2)~wT~w + C ∑i,j ξi,j is minimized
• and for all {Φ(di, dj, q) : di ≺ dj}, ~wTΦ(di, dj, q) ≥ 1 − ξi,j

We can leave out yijq in the statement of the constraint, since we only need to consider the constraint for document pairs ordered in one direction, since ≺ is antisymmetric. These constraints are then solved, as before, to give a linear classifier which can rank pairs of documents. This approach has been used to build ranking functions which outperform standard hand-built ranking functions in IR evaluations on standard data sets; see the references for papers that present such results.
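The pairwise transformation in Equation (15.18) is the heart of the construction and can be sketched on its own, leaving the quadratic optimization to an off-the-shelf SVM solver. The two-dimensional feature vectors below are invented; we only generate the difference vectors Φ with label +1 for each correctly ordered pair.

```python
# Pairwise transformation behind the ranking SVM: from per-document
# feature vectors psi(d, q) and a judged ordering, build difference
# vectors Phi(di, dj, q) = psi(di, q) - psi(dj, q), labeled +1 when
# di should precede dj. (Training the SVM itself is not shown.)
def pairwise_examples(psi_by_doc, ordering):
    """ordering: doc ids from most to least relevant for one query.
    Returns a list of (Phi, +1) for each pair with di preceding dj."""
    examples = []
    for i in range(len(ordering)):
        for j in range(i + 1, len(ordering)):
            pi = psi_by_doc[ordering[i]]
            pj = psi_by_doc[ordering[j]]
            phi = [a - b for a, b in zip(pi, pj)]
            examples.append((phi, +1))
    return examples

# invented features psi(d, q) for three judged documents, d1 < d2 < d3
psi = {"d1": [0.9, 0.4], "d2": [0.5, 0.4], "d3": [0.1, 0.0]}
pairs = pairwise_examples(psi, ["d1", "d2", "d3"])
```

For this toy data, a weight vector ~w = (1, 0) already satisfies ~wTΦ > 0 for every pair, illustrating constraint (15.19).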
***multiclass->
|------------
being satisfied with approximate solutions. Standardly, empirical complexity is about O(|D|^1.7) (Joachims 2006a). Nevertheless, the super-linear training time of traditional SVM algorithms makes them difficult or impossible to use on very large training data sets. Alternative traditional SVM solution algorithms which are linear in the number of training examples scale badly with a large number of features, which is another standard attribute of text problems. However, a new training algorithm based on cutting plane techniques gives a promising answer to this issue by having running time linear in the number of training examples and the number of non-zero features in examples (Joachims 2006a). Nevertheless, the actual speed of doing quadratic optimization remains much slower than simply counting terms as is done in a Naive Bayes model. Extending SVM algorithms to nonlinear SVMs, as in the next section, standardly increases training complexity by a factor of |D| (since dot products between examples need to be calculated), making them impractical. In practice it can often be cheaper to materialize the higher-order features and to train a linear SVM.

15.2.2 Multiclass SVMs

SVMs are inherently two-class classifiers. The traditional way to do multiclass classification with SVMs is to use one of the methods discussed in Section 14.5 (page 306). In particular, the most common technique in practice has been to build |C| one-versus-rest classifiers (commonly referred to as “one-versus-all” or OVA classification), and to choose the class which classifies the test datum with greatest margin. Another strategy is to build a set of one-versus-one classifiers, and to choose the class that is selected by the most classifiers. While this involves building |C|(|C| − 1)/2 classifiers, the time for training classifiers may actually decrease, since the training data set for each classifier is much smaller.
|------------
However, these are not very elegant approaches to solving multiclass problems. A better alternative is provided by the construction of multiclass SVMs, where we build a two-class classifier over a feature vector Φ(~x, y) derived from the pair consisting of the input features and the class of the datum. At test time, the classifier chooses the class y = arg max_y′ ~wT Φ(~x, y′). The margin during training is the gap between this value for the correct class and for the nearest other class, and so the quadratic program formulation will require that ∀i ∀y ≠ yi, ~wT Φ(~xi, yi) − ~wT Φ(~xi, y) ≥ 1 − ξi. This general method can be extended to give a multiclass formulation of various kinds of linear classifiers. It is also a simple instance of a generalization of classification where the classes are not just a set of independent, categorical labels, but may be arbitrary structured objects with relationships defined between them.
***component coverage->
|------------
Since CAS queries have both structural and content criteria, relevance assessments are more complicated than in unstructured retrieval. INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance. The component coverage dimension evaluates whether the element retrieved is “structurally” correct, i.e., neither too low nor too high in the tree. We distinguish four cases:

• Exact coverage (E). The information sought is the main topic of the component and the component is a meaningful unit of information.
***simple conjunctive->
|------------
Exercise 1.3 [⋆] For the document collection shown in Exercise 1.2, what are the returned results for these queries:
a. schizophrenia AND drug
b. for AND NOT (drug OR approach)

1.3 Processing Boolean queries

How do we process a query using an inverted index and the basic Boolean retrieval model? Consider processing the simple conjunctive query:

(1.1) Brutus AND Calpurnia

over the inverted index partially shown in Figure 1.3 (page 7). We:
1. Locate Brutus in the Dictionary
2. Retrieve its postings
3. Locate Calpurnia in the Dictionary
4. Retrieve its postings
5. Intersect the two postings lists, as shown in Figure 1.5.
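Step 5, the intersection of two sorted postings lists, can be sketched as the standard linear merge (the postings values below are illustrative docIDs for the two terms):

```python
# A minimal sketch of postings-list intersection: walk both sorted lists
# of docIDs in lockstep, keeping the docIDs that appear in both.
def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID in both postings lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller current docID
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
intersect(brutus, calpurnia)  # -> [2, 31]
```

This merge runs in time linear in the sum of the postings-list lengths, which is why postings are kept sorted by docID.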
***evidence accumulation->
|------------
Each of these steps (if invoked) may yield a list of scored documents, for each of which we compute a score. This score must combine contributions from vector space scoring, static quality, proximity weighting and potentially other factors, particularly since a document may appear in the lists from multiple steps. This demands an aggregate scoring function that accumulates evidence of a document’s relevance from multiple sources. How do we devise a query parser and how do we devise the aggregate scoring function? The answer depends on the setting. In many enterprise settings we have application builders who make use of a toolkit of available scoring operators, along with a query parsing layer, with which to manually configure the scoring function as well as the query parser. Such application builders make use of the available zones, metadata and knowledge of typical documents and queries to tune the parsing and scoring. This hand tuning is viable in collections whose characteristics change infrequently (in an enterprise application, significant changes in collection and query characteristics typically happen with infrequent events such as the introduction of new document formats or document management systems, or a merger with another company). Web search on the other hand is faced with a constantly changing document collection with new characteristics being introduced all the time. It is also a setting in which the number of scoring factors can run into the hundreds, making hand-tuned scoring a difficult exercise. To address this, it is becoming increasingly common to use machine-learned scoring, extending the ideas we introduced in Section 6.1.2, as will be discussed further in Section 15.4.1.
***centroid-based classification->
|------------
Some authors restrict the name Rocchio classification to two-class problems and use the terms cluster-based (Iwayama and Tokunaga 1995) and centroid-based classification (Han and Karypis 2000, Tan and Cheng 2007) for Rocchio classification with J > 2.
***CPC->
|------------
? Exercise 19.1 If the number of pages with in-degree i is proportional to 1/i^2.1, what is the probability that a randomly chosen web page has in-degree 1?
Exercise 19.2 If the number of pages with in-degree i is proportional to 1/i^2.1, what is the average in-degree of a web page?
Exercise 19.3 If the number of pages with in-degree i is proportional to 1/i^2.1, then as the largest in-degree goes to infinity, does the fraction of pages with in-degree i grow, stay the same, or diminish? How would your answer change for values of the exponent other than 2.1?
Exercise 19.4 The average in-degree of all nodes in a snapshot of the web graph is 9. What can we say about the average out-degree of all nodes in this snapshot?

19.3 Advertising as the economic model

Early in the history of the Web, companies used graphical banner advertisements on web pages at popular websites (news and entertainment sites such as MSN, America Online, Yahoo! and CNN). The primary purpose of these advertisements was branding: to convey to the viewer a positive feeling about the brand of the company placing the advertisement. Typically these advertisements are priced on a cost per mil (CPM) basis: the cost to the company of having its banner advertisement displayed 1000 times. Some websites struck contracts with their advertisers in which an advertisement was priced not by the number of times it is displayed (also known as impressions), but rather by the number of times it was clicked on by the user. This pricing model is known as the cost per click (CPC) model. In such cases, clicking on the advertisement leads the user to a web page set up by the advertiser, where the user is induced to make a purchase. Here the goal of the advertisement is not so much brand promotion as to induce a transaction. This distinction between brand and transaction-oriented advertising was already widely recognized in the context of conventional media such as broadcast and print. The interactivity of the web allowed the CPC billing model – clicks could be metered and monitored by the website and billed to the advertiser.
***in clustering->
***feature engineering->
|------------
Features for text

The default in both ad hoc retrieval and text classification is to use terms as features. However, for text classification, a great deal of mileage can be achieved by designing additional features which are suited to a specific problem. Unlike the case of IR query languages, since these features are internal to the classifier, there is no problem of communicating these features to an end user. This process is generally referred to as feature engineering. At present, feature engineering remains a human craft, rather than something done by machine learning. Good feature engineering can often markedly improve the performance of a text classifier. It is especially beneficial in some of the most important applications of text classification, like spam and porn filtering.
***soundex->
|------------
Algorithms for such phonetic hashing are commonly collectively known as soundex algorithms. However, there is an original soundex algorithm, with various variants, built on the following scheme:

1. Turn every term to be indexed into a 4-character reduced form. Build an inverted index from these reduced forms to the original terms; call this the soundex index.
|------------
3. When the query calls for a soundex match, search this soundex index.
|------------
The variations in different soundex algorithms have to do with the conversion of terms to 4-character forms. A commonly used conversion results in a 4-character code, with the first character being a letter of the alphabet and the other three being digits between 0 and 9.
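One such conversion can be sketched as follows. This is a simplified version of the widely used “American Soundex” rules, not a definitive implementation, and it omits some edge cases (e.g. the special treatment of h and w separating consonants in some variants):

```python
# Sketch of one common soundex variant: first letter kept, remaining
# letters mapped to digits, runs of equal digits collapsed, zeros (vowels
# and similar) dropped, then padded/truncated to 4 characters.
DIGIT = {}
for letters, d in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                   ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        DIGIT[ch] = d

def soundex(term):
    term = term.lower()
    first = term[0].upper()
    # Encode every letter (vowels, h, w, y become "0"), collapse adjacent
    # duplicates, then drop the zeros and keep three digits.
    encoded = [DIGIT.get(ch, "0") for ch in term]
    collapsed = [d for i, d in enumerate(encoded)
                 if i == 0 or d != encoded[i - 1]]
    digits = [d for d in collapsed[1:] if d != "0"]
    return (first + "".join(digits) + "000")[:4]

soundex("Herman")  # -> "H655"
```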
***linear problem->
|------------
Figure 14.10 is a graphical example of a linear problem, which we define to mean that the underlying distributions P(d|c1) and P(d|c2) of the two classes are separated by a line. We call this separating line the class boundary. It is the “true” boundary of the two classes and we distinguish it from the decision boundary that the learning method computes to approximate the class boundary.
|------------
As is typical in text classification, there are some noise documents in Figure 14.10 (marked with arrows) that do not fit well into the overall distribution of the classes. In Section 13.5 (page 271), we defined a noise feature as a misleading feature that, when included in the document representation, on average increases the classification error. Analogously, a noise document is a document that, when included in the training set, misleads the learning method and increases classification error. Intuitively, the underlying distribution partitions the representation space into areas with mostly homogeneous class assignments.

◮ Figure 14.10 A linear problem with noise. In this hypothetical web page classification scenario, Chinese-only web pages are solid circles and mixed Chinese-English web pages are squares. The two classes are separated by a linear class boundary (dashed line, short dashes), except for three noise documents (marked with arrows).
***term-partitioned index->
|------------
4.4 Distributed indexing

Collections are often so large that we cannot perform index construction efficiently on a single machine. This is particularly true of the World Wide Web, for which we need large computer clusters1 to construct any reasonably sized web index. Web search engines, therefore, use distributed indexing algorithms for index construction. The result of the construction process is a distributed index that is partitioned across several machines, either according to term or according to document. In this section, we describe distributed indexing for a term-partitioned index. Most large search engines prefer a document-
1. A cluster in this chapter is a group of tightly coupled computers that work together closely.
***equivalence classes->
|------------
Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.4 The most standard way to normalize is to implicitly create equivalence classes, which are normally named after one member of the set. For instance, if the tokens anti-discriminatory and antidiscriminatory are both mapped onto the term antidiscriminatory, in both the document text and queries, then searches for one term will retrieve documents that contain either.
|------------
The advantage of just using mapping rules that remove characters like hyphens is that the equivalence classing to be done is implicit, rather than being fully calculated in advance: the terms that happen to become identical as the result of these rules are the equivalence classes. It is only easy to write rules of this sort that remove characters. Since the equivalence classes are implicit, it is not obvious when you might want to add characters. For instance, it would be hard to know to turn antidiscriminatory into anti-discriminatory.
|------------
An alternative to creating equivalence classes is to maintain relations between unnormalized tokens. This method can be extended to hand-constructed lists of synonyms such as car and automobile, a topic we discuss further in Chapter 9. These term relationships can be achieved in two ways. The usual way is to index unnormalized tokens and to maintain a query expansion list of multiple vocabulary entries to consider for a certain query term. A query term is then effectively a disjunction of several postings lists. The alternative is to perform the expansion during index construction. When the document contains automobile, we index it under car as well (and, usually, also vice-versa). Use of either of these methods is considerably less efficient than equivalence classing, as there are more postings to store and merge. The first
4. It is also often referred to as term normalization, but we prefer to reserve the name term for the output of the normalization process.
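The query-time expansion strategy described above can be sketched with a toy in-memory index; the synonym list, terms, and docIDs here are all made-up illustrative values:

```python
# Sketch of query-time expansion: index unnormalized tokens, and at query
# time take the disjunction (union) of the postings of the query term and
# its listed synonyms. Synonyms and postings below are hypothetical.
synonyms = {"car": {"automobile"}, "automobile": {"car"}}
index = {"automobile": [3, 7], "car": [1, 3]}

def query_expanded(term):
    """Union of the postings lists of the term and its synonyms."""
    terms = {term} | synonyms.get(term, set())
    return sorted({doc for t in terms for doc in index.get(t, [])})

query_expanded("car")  # -> [1, 3, 7]
```

Index-time expansion would instead add each document containing automobile to the postings list of car (and usually vice-versa) at construction time, trading a larger index for cheaper queries.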
***information gain->
|------------
Exercise 13.11 What are the values of I(Ut; Cc) and X2(D, t, c) if term and class are completely independent? What are the values if they are completely dependent?

Exercise 13.12 The feature selection method in Equation (13.16) is most appropriate for the Bernoulli model. Why? How could one modify it for the multinomial model?

Exercise 13.13 Features can also be selected according to information gain (IG), which is defined as:

IG(D, t, c) = H(pD) − ∑x∈{Dt+, Dt−} (|x|/|D|) H(px)

where H is entropy, D is the training set, and Dt+ and Dt− are the subset of D with term t and the subset of D without term t, respectively. pA is the class distribution in (sub)collection A, e.g., pA(c) = 0.25, pA(c̄) = 0.75 if a quarter of the documents in A are in class c.
|------------
Show that mutual information and information gain are equivalent.
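The information-gain formula from Exercise 13.13 can be sketched numerically; documents here are represented as made-up (has_term, in_class) boolean pairs:

```python
# Sketch of information gain for one term/class pair: the entropy of the
# class distribution over D, minus the size-weighted entropies of the
# subsets with and without the term. The toy documents are hypothetical.
from math import log2

def entropy(docs):
    """Binary entropy of the class distribution in a (sub)collection."""
    if not docs:
        return 0.0
    p = sum(in_class for _, in_class in docs) / len(docs)
    return sum(-q * log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(docs):
    d_plus = [d for d in docs if d[0]]       # documents containing t
    d_minus = [d for d in docs if not d[0]]  # documents without t
    weighted = sum(len(x) / len(docs) * entropy(x)
                   for x in (d_plus, d_minus))
    return entropy(docs) - weighted

# The term perfectly predicts the class, so IG equals the full entropy H = 1.
docs = [(True, True), (True, True), (False, False), (False, False)]
information_gain(docs)  # -> 1.0
```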
***CPM->
|------------
? Exercise 19.1 If the number of pages with in-degree i is proportional to 1/i^2.1, what is the probability that a randomly chosen web page has in-degree 1?
Exercise 19.2 If the number of pages with in-degree i is proportional to 1/i^2.1, what is the average in-degree of a web page?
Exercise 19.3 If the number of pages with in-degree i is proportional to 1/i^2.1, then as the largest in-degree goes to infinity, does the fraction of pages with in-degree i grow, stay the same, or diminish? How would your answer change for values of the exponent other than 2.1?
Exercise 19.4 The average in-degree of all nodes in a snapshot of the web graph is 9. What can we say about the average out-degree of all nodes in this snapshot?

19.3 Advertising as the economic model

Early in the history of the Web, companies used graphical banner advertisements on web pages at popular websites (news and entertainment sites such as MSN, America Online, Yahoo! and CNN). The primary purpose of these advertisements was branding: to convey to the viewer a positive feeling about the brand of the company placing the advertisement. Typically these advertisements are priced on a cost per mil (CPM) basis: the cost to the company of having its banner advertisement displayed 1000 times. Some websites struck contracts with their advertisers in which an advertisement was priced not by the number of times it is displayed (also known as impressions), but rather by the number of times it was clicked on by the user. This pricing model is known as the cost per click (CPC) model. In such cases, clicking on the advertisement leads the user to a web page set up by the advertiser, where the user is induced to make a purchase. Here the goal of the advertisement is not so much brand promotion as to induce a transaction. This distinction between brand and transaction-oriented advertising was already widely recognized in the context of conventional media such as broadcast and print. The interactivity of the web allowed the CPC billing model – clicks could be metered and monitored by the website and billed to the advertiser.
***LM->
|------------
For retrieval based on a language model (henceforth LM), we treat the generation of queries as a random process. The approach is to:

1. Infer a LM for each document.
|------------
12.2.2 Estimating the query generation probability

In this section we describe how to estimate P(q|Md). The probability of producing the query given the LM Md of document d using maximum likelihood
3. Of course, in other cases, they do not. The answer to this within the language modeling approach is translation language models, as briefly discussed in Section 12.4.
***XML fragment->
|------------
The vector-space based XML retrieval method in Section 10.3 is essentially IBM Haifa’s JuruXML system as presented by Mass et al. (2003) and Carmel et al. (2003). Schlieder and Meuss (2002) and Grabs and Schek (2002) describe similar approaches. Carmel et al. (2003) represent queries as XML fragments. The trees that represent XML queries in this chapter are all XML fragments, but XML fragments also permit the operators +, − and phrase on content nodes.
***Zipf’s law->
|------------
5.1.2 Zipf’s law: Modeling the distribution of terms

We also want to understand how terms are distributed across documents.
|------------
A commonly used model of the distribution of terms in a collection is Zipf’s law. It states that, if t1 is the most common term in the collection, t2 is the next most common, and so on, then the collection frequency cfi of the ith most common term is proportional to 1/i:

(5.2) cfi ∝ 1/i

So if the most frequent term occurs cf1 times, then the second most frequent term has half as many occurrences, the third most frequent term a third as many occurrences, and so on. The intuition is that frequency decreases very rapidly with rank. Equation (5.2) is one of the simplest ways of formalizing such a rapid decrease and it has been found to be a reasonably good model.
|------------
Equivalently, we can write Zipf’s law as cfi = c i^k or as log cfi = log c + k log i where k = −1 and c is a constant to be defined in Section 5.3.2. It is therefore a power law with exponent k = −1. See Chapter 19, page 426, for another power law, a law characterizing the distribution of links on web pages.
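The power-law form cfi = c i^k can be illustrated numerically; the most-frequent count cf1 = 1000 below is a made-up value:

```python
# A quick numeric illustration of Zipf's law (Equation 5.2): with k = -1
# and c = cf_1, the i-th most frequent term is predicted to occur cf_1/i
# times. The value cf_1 = 1000 is hypothetical.
def zipf_cf(cf1, i, k=-1.0):
    """Collection frequency predicted for rank i: cf_i = c * i^k, c = cf_1."""
    return cf1 * i ** k

[round(zipf_cf(1000, i)) for i in (1, 2, 3, 4)]  # -> [1000, 500, 333, 250]
```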
***zone index->
|------------
6.1 Parametric and zone indexes

We have thus far viewed a document as a sequence of terms. In fact, most documents have additional structure. Digital documents generally encode, in machine-recognizable form, certain metadata associated with each document. By metadata, we mean specific forms of data about a document, such as its author(s), title and date of publication. This metadata would generally include fields such as the date of creation and the format of the document, as well as the author and possibly the title of the document. The possible values of a field should be thought of as finite, for instance, the set of all dates of authorship.
|------------
Zones are similar to fields, except the contents of a zone can be arbitrary free text. Whereas a field may take on a relatively small set of values, a zone can be thought of as an arbitrary, unbounded amount of text. For instance, document titles and abstracts are generally treated as zones. We may build a separate inverted index for each zone of a document, to support queries such as “find documents with merchant in the title and william in the author list and the phrase gentle rain in the body”. This has the effect of building an index that looks like Figure 6.2. Whereas the dictionary for a parametric index comes from a fixed vocabulary (the set of languages, or the set of dates), the dictionary for a zone index must structure whatever vocabulary stems from the text of that zone.
***precision->
|------------
Our goal is to develop a system to address the ad hoc retrieval task. This is the most standard IR task. In it, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query. An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need. Our example above was rather artificial in that the information need was defined in terms of particular words, whereas usually a user is interested in a topic like “pipeline leaks” and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as pipeline rupture. To assess the effectiveness of an IR system (i.e., the quality of its search results), a user will usually want to know two key statistics about the system’s returned results for a query:

Precision: What fraction of the returned results are relevant to the information need?

Recall: What fraction of the relevant documents in the collection were returned by the system?

Detailed discussion of relevance and evaluation measures including precision and recall is found in Chapter 8.
|------------
8.3 Evaluation of unranked retrieval sets Given these ingredients, how is system effectiveness measured? The two most frequent and basic measures for information retrieval effectiveness are precision and recall. These are first defined for the simple case where an IR system returns a set of documents for a query. We will see later how to extend these notions to ranked retrieval situations.
|------------
Precision (P) is the fraction of retrieved documents that are relevant:

(8.1) Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant|retrieved)

Recall (R) is the fraction of relevant documents that are retrieved:

(8.2) Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved|relevant)

These notions can be made clear by examining the following contingency table:

(8.3)
                Relevant               Nonrelevant
Retrieved       true positives (tp)    false positives (fp)
Not retrieved   false negatives (fn)   true negatives (tn)

Then:

(8.4) P = tp/(tp + fp)
      R = tp/(tp + fn)

An obvious alternative that may occur to the reader is to judge an information retrieval system by its accuracy, that is, the fraction of its classifications that are correct. In terms of the contingency table above, accuracy = (tp + tn)/(tp + fp + fn + tn). This seems plausible, since there are two actual classes, relevant and nonrelevant, and an information retrieval system can be thought of as a two-class classifier which attempts to label them as such (it retrieves the subset of documents which it believes to be relevant).
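Equations (8.1), (8.2) and (8.4) can be sketched directly from the contingency counts; the counts below are made-up illustrative values:

```python
# A minimal sketch of precision and recall (Equation 8.4) computed from
# the contingency-table counts. The tp/fp/fn values used are hypothetical.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # fraction of retrieved that are relevant
    recall = tp / (tp + fn)     # fraction of relevant that are retrieved
    return precision, recall

precision_recall(tp=40, fp=10, fn=60)  # -> (0.8, 0.4)
```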
|------------
There is a good reason why accuracy is not an appropriate measure for information retrieval problems. In almost all circumstances, the data is extremely skewed: normally over 99.9% of the documents are in the nonrelevant category. A system tuned to maximize accuracy can appear to perform well by simply deeming all documents nonrelevant to all queries. Even if the system is quite good, trying to label some documents as relevant will almost always lead to a high rate of false positives. However, labeling all documents as nonrelevant is completely unsatisfying to an information retrieval system user. Users are always going to want to see some documents, and can be assumed to have a certain tolerance for seeing some false positives providing that they get some useful information. The measures of precision and recall concentrate the evaluation on the return of true positives, asking what percentage of the relevant documents have been found and how many false positives have also been returned.
***text classification->
|------------
13 Text classification and Naive Bayes

Thus far, this book has mainly discussed the process of ad hoc retrieval, where users have transient information needs that they try to address by posing one or more queries to a search engine. However, many users have ongoing information needs. For example, you might need to track developments in multicore computer chips. One way of doing this is to issue the query multicore AND computer AND chip against an index of recent newswire articles each morning. In this and the following two chapters we examine the question: How can this repetitive task be automated? To this end, many systems support standing queries. A standing query is like any other query except that it is periodically executed on a collection to which new documents are incrementally added over time.
|------------
Such more general classes are usually referred to as topics, and the classification task is then called text classification, text categorization, topic classification, or topic spotting. An example for China appears in Figure 13.1. Standing queries and topics differ in their degree of specificity, but the methods for solving routing, filtering, and text classification are essentially the same. We therefore include routing and filtering under the rubric of text classification in this and the following chapters.
***passage retrieval->
|------------
2006). Focused retrieval requires evaluation measures that penalize redundant results lists (Kazai and Lalmas 2006, Lalmas et al. 2007). Trotman and Geva (2006) argue that XML retrieval is a form of passage retrieval. In passage retrieval (Salton et al. 1993, Hearst and Plaunt 1993, Zobel et al. 1995, Hearst 1997, Kaszkiel and Zobel 1997), the retrieval system returns short passages instead of documents in response to a user query. While element boundaries in XML documents are cues for identifying good segment boundaries between passages, the most relevant passage often does not coincide with an XML element.
***topic-specific PageRank->
|------------
21.2.3 Topic-specific PageRank

Thus far we have discussed the PageRank computation with a teleport operation in which the surfer jumps to a random web page chosen uniformly at random. We now consider teleporting to a random web page chosen non-uniformly. In doing so, we are able to derive PageRank values tailored to particular interests. For instance, a sports aficionado might wish that pages on sports be ranked higher than non-sports pages. Suppose that web pages on sports are “near” one another in the web graph. Then, a random surfer who frequently finds himself on random sports pages is likely (in the course of the random walk) to spend most of his time at sports pages, so that the steady-state distribution of sports pages is boosted.
|------------
Provided the set S of sports-related pages is non-empty, it follows that there is a non-empty set of web pages Y ⊇ S over which the random walk has a steady-state distribution; let us denote this sports PageRank distribution by ~πs. For web pages not in Y, we set the PageRank values to zero. We call ~πs the topic-specific PageRank for sports. We do not demand that teleporting takes the random surfer to a uniformly chosen sports page; the distribution over teleporting targets S could in fact be arbitrary.
|------------
In like manner we can envision topic-specific PageRank distributions for each of several topics such as science, religion, politics and so on. Each of these distributions assigns to each web page a PageRank value in the interval [0, 1). For a user interested in only a single topic from among these topics, we may invoke the corresponding PageRank distribution when scoring and ranking search results. This gives us the potential of considering settings in which the search engine knows what topic a user is interested in. This may happen because users either explicitly register their interests, or because the system learns by observing each user’s behavior over time.
|------------
But what if a user is known to have a mixture of interests from multiple topics? For instance, a user may have an interest mixture (or profile) that is 60% sports and 40% politics; can we compute a personalized PageRank for this user? At first glance, this appears daunting: how could we possibly compute a different PageRank distribution for each user profile (with, potentially, infinitely many possible profiles)? We can in fact address this provided we assume that an individual’s interests can be well-approximated as a linear combination of a small number of topic-specific distributions.

◮ Figure 21.5 Topic-specific PageRank. In this example we consider a user whose interests are 60% sports and 40% politics. If the teleportation probability is 10%, this user is modeled as teleporting 6% to sports pages and 4% to politics pages.
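The linear-combination idea can be sketched as follows: if topic-specific PageRank vectors are available, a 60/40 sports/politics profile is scored with their convex combination. The three-page PageRank vectors below are made-up toy values:

```python
# Sketch of personalized PageRank as a convex combination of precomputed
# topic-specific PageRank vectors (sum_j w_j * pi_j). The distributions
# pi below are hypothetical 3-page toy vectors.
def combine(distributions, weights):
    """Linear combination of topic PageRank vectors with profile weights."""
    n = len(next(iter(distributions.values())))
    combo = [0.0] * n
    for topic, w in weights.items():
        for i, v in enumerate(distributions[topic]):
            combo[i] += w * v
    return combo

pi = {"sports": [0.5, 0.3, 0.2], "politics": [0.1, 0.2, 0.7]}
combine(pi, {"sports": 0.6, "politics": 0.4})
# -> approximately [0.34, 0.26, 0.40]
```

The appeal of this formulation is that only a small number of topic vectors need to be precomputed, yet any profile over those topics can be served.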
***Cranfield->
|------------
The Cranfield collection. This was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness, but is nowadays too small for anything but the most elementary pilot experiments. Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all (query, document) pairs.
***normalized mutual information->
***likelihood ratio->
|------------
✎ Example 12.1: To find the probability of a word sequence, we just multiply the probabilities which the model gives to each word in the sequence, together with the probability of continuing or stopping after producing each word. For example,

(12.2) P(frog said that toad likes frog) = (0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.01)
× (0.8 × 0.8 × 0.8 × 0.8 × 0.8 × 0.2)
≈ 0.000000000001573

As you can see, the probability of a particular string/document is usually a very small number! Here we stopped after generating frog the second time. The first line of numbers are the term emission probabilities, and the second line gives the probability of continuing or stopping after generating each word. An explicit stop probability is needed for a finite automaton to be a well-formed language model according to Equation (12.1). Nevertheless, most of the time, we will omit to include STOP and (1 − STOP) probabilities (as do most other authors). To compare two models for a data set, we can calculate their likelihood ratio, which results from simply dividing the probability of the data according to one model by the probability of the data according to the other model. Providing that the stop probability is fixed, its inclusion will not alter the likelihood ratio that results from comparing the likelihood of two language models generating a string. Hence, it will not alter the ranking of documents.2 Nevertheless, formally, the numbers will no longer truly be probabilities, but only proportional to probabilities. See Exercise 12.4.
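The arithmetic of Example 12.1 can be reproduced directly: six emission probabilities, a continue probability of 0.8 after each of the first five words, and a stop probability of 0.2 after the last:

```python
# Reproducing the product in Equation (12.2): term emission probabilities
# times continue/stop probabilities for the six-word sequence.
from math import prod

emissions = [0.01, 0.03, 0.04, 0.01, 0.02, 0.01]
p = prod(emissions) * 0.8 ** 5 * 0.2
# p is approximately 1.573e-12, matching the value in the example
```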
***rule of 30->
|------------
In general, the statistics in Table 5.1 show that preprocessing affects the size of the dictionary and the number of nonpositional postings greatly. Stemming and case folding reduce the number of (distinct) terms by 17% each and the number of nonpositional postings by 4% and 3%, respectively. The treatment of the most frequent words is also important. The rule of 30 states that the 30 most common words account for 30% of the tokens in written text (31% in the table). Eliminating the 150 most common words from indexing (as stop words; cf. Section 2.2.2, page 27) cuts 25% to 30% of the nonpositional postings. But, although a stop list of 150 words reduces the number of postings by a quarter or more, this size reduction does not carry over to the size of the compressed index. As we will see later in this chapter, the postings lists of frequent words require only a few bits per posting after compression.
***20 Newsgroups->
|------------
20 Newsgroups. This is another widely used text classification collection, collected by Ken Lang. It consists of 1000 articles from each of 20 Usenet newsgroups (the newsgroup name being regarded as the category). After the removal of duplicate articles, as it is usually used, it contains 18941 articles.
***buffer->
|------------
Block sizes of 8, 16, 32, and 64 kilobytes (KB) are common. We call the part of main memory where a block being read or written is stored a buffer.

• Data transfers from disk to memory are handled by the system bus, not by the processor. This means that the processor is available to process data during disk I/O. We can exploit this fact to speed up data transfers by storing compressed data on disk. Assuming an efficient decompression algorithm, the total time of reading and then decompressing compressed data is usually less than reading uncompressed data.
***structural SVMs->
|------------
In the SVM world, such work comes under the label of structural SVMs. We mention them again in Section 15.4.2.
***χ2 feature selection->
|------------
13.5.2 χ2 Feature selection

Another popular feature selection method is χ2. In statistics, the χ2 test is applied to test the independence of two events, where two events A and B are defined to be independent if P(AB) = P(A)P(B) or, equivalently, P(A|B) = P(A) and P(B|A) = P(B). In feature selection, the two events are occurrence of the term and occurrence of the class. We then rank terms with respect to the following quantity:

X²(D, t, c) = ∑_{e_t∈{0,1}} ∑_{e_c∈{0,1}} (N_{e_t e_c} − E_{e_t e_c})² / E_{e_t e_c}    (13.18)

where e_t and e_c are defined as in Equation (13.16). N is the observed frequency in D and E the expected frequency. For example, E_11 is the expected frequency of t and c occurring together in a document assuming that term and class are independent.
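Equation (13.18) can be sketched directly from a 2×2 contingency table of document counts. This is an illustrative sketch, not code from the book; the counts passed in below are hypothetical, and the expected frequencies are computed from the marginals under the independence assumption.

```python
def chi2(n11, n10, n01, n00):
    """X^2 statistic for one (term, class) pair.

    n11: docs containing the term and in the class
    n10: docs containing the term, not in the class
    n01: docs without the term, in the class
    n00: docs without the term, not in the class
    """
    N = n11 + n10 + n01 + n00
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    row = {1: n11 + n10, 0: n01 + n00}   # marginal: term present / absent
    col = {1: n11 + n01, 0: n10 + n00}   # marginal: in class / not in class
    x2 = 0.0
    for et in (0, 1):
        for ec in (0, 1):
            expected = row[et] * col[ec] / N   # E under independence
            x2 += (observed[(et, ec)] - expected) ** 2 / expected
    return x2
```

Terms would then be ranked by this value, highest first.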
***false positive->
|------------
An alternative to this information-theoretic interpretation of clustering is to view it as a series of decisions, one for each of the N(N − 1)/2 pairs of documents in the collection. We want to assign two documents to the same cluster if and only if they are similar. A true positive (TP) decision assigns two similar documents to the same cluster; a true negative (TN) decision assigns two dissimilar documents to different clusters. There are two types of errors we can commit. A false positive (FP) decision assigns two dissimilar documents to the same cluster. A false negative (FN) decision assigns two similar documents to different clusters. The Rand index (RI) measures the percentage of decisions that are correct. That is, it is simply accuracy (Section 8.3, page 155).
|------------
The Rand index gives equal weight to false positives and false negatives.
|------------
Separating similar documents is sometimes worse than putting pairs of dissimilar documents in the same cluster. We can use the F measure (Section 8.3, page 154) to penalize false negatives more strongly than false positives by selecting a value β > 1, thus giving more weight to recall.
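The pairwise-decision view above can be sketched as follows. This is a minimal illustration with hypothetical cluster and class labelings, not code from the book: it counts TP/FP/FN/TN over all document pairs, then computes RI and the F measure with adjustable β.

```python
from itertools import combinations

def pair_counts(clusters, classes):
    """TP, FP, FN, TN over all N(N-1)/2 document pairs."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1  # dissimilar pair put in the same cluster
        elif same_class:
            fn += 1  # similar pair split across clusters
        else:
            tn += 1
    return tp, fp, fn, tn

def rand_index(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f_beta(tp, fp, fn, beta=1.0):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return (beta**2 + 1) * p * r / (beta**2 * p + r)
```

Choosing β > 1 in `f_beta` weights recall more heavily, penalizing false negatives as described above.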
***adversarial information retrieval->
|------------
A doorway page contains text and metadata carefully chosen to rank highly on selected search keywords. When a browser requests the doorway page, it is redirected to a page containing content of a more commercial nature. More complex spamming techniques involve manipulation of the metadata related to a page including (for reasons we will see in Chapter 21) the links into a web page. Given that spamming is inherently an economically motivated activity, there has sprung around it an industry of Search Engine Optimizers, or SEOs, who provide consultancy services for clients seeking to have their web pages rank highly on selected keywords. Web search engines frown on this business of attempting to decipher and adapt to their proprietary ranking techniques and indeed announce policies on forms of SEO behavior they do not tolerate (and have been known to shut down search requests from certain SEOs for violation of these). Inevitably, the parrying between such SEOs (who gradually infer features of each web search engine’s ranking methods) and the web search engines (who adapt in response) is an unending struggle; indeed, the research sub-area of adversarial information retrieval has sprung up around this battle. One way to combat spammers who manipulate the text of their web pages is to exploit the link structure of the Web – a technique known as link analysis. The first web search engine known to apply link analysis on a large scale (to be detailed in Chapter 21) was Google, although all web search engines currently make use of it (and correspondingly, spammers now invest considerable effort in subverting it – this is known as link spam).
***memory-based learning->
|------------
In kNN classification, we do not perform any estimation of parameters as we do in Rocchio classification (centroids) or in Naive Bayes (priors and conditional probabilities). kNN simply memorizes all examples in the training set and then compares the test document to them. For this reason, kNN is also called memory-based learning or instance-based learning. It is usually desirable to have as much training data as possible in machine learning. But in kNN large training sets come with a severe efficiency penalty in classification.
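A minimal memory-based sketch (illustrative only; the vectors, similarity function, and tie-breaking are hypothetical choices): "training" just stores the examples, and classification compares the test vector to every stored one, which is exactly where the efficiency penalty for large training sets comes from.

```python
from collections import Counter

def dot(a, b):
    """Inner-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def knn_classify(train, test_vec, k, sim=dot):
    """train: list of (vector, label) pairs; returns majority label of
    the k most similar training examples (a full scan of `train`)."""
    nearest = sorted(train, key=lambda ex: sim(ex[0], test_vec),
                     reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```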
***GOV2->
|------------
In more recent years, NIST has done evaluations on larger document collections, including the 25 million page GOV2 web page collection. From the beginning, the NIST test document collections were orders of magnitude larger than anything available to researchers previously and GOV2 is now the largest Web collection easily available for research purposes.
|------------
Nevertheless, the size of GOV2 is still more than 2 orders of magnitude smaller than the current size of the document collections indexed by the large web search companies.
***universal code->
|------------
What is remarkable about this result is that it holds for any probability distribution P. So without knowing anything about the properties of the distribution of gaps, we can apply γ codes and be certain that they are within a factor of ≈ 2 of the optimal code for distributions of large entropy. A code like γ code with the property of being within a factor of optimal for an arbitrary distribution P is called universal. In addition to universality, γ codes have two other properties that are useful for index compression. First, they are prefix free, namely, no γ code is the prefix of another. This means that there is always a unique decoding of a sequence of γ codes – and we do not need delimiters between them, which would decrease the efficiency of the code. The second property is that γ codes are parameter free. For many other efficient codes, we have to fit the parameters of a model (e.g., the binomial distribution) to the distribution of gaps in the index. This complicates the implementation of compression and decompression. For instance, the parameters need to be stored and retrieved. And in dynamic indexing, the distribution of gaps can change, so that the original parameters are no longer appropriate. These problems are avoided with a parameter-free code.
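The γ code and its prefix-free decoding can be sketched directly from the definitions given earlier (offset = binary representation with the leading 1 removed; length = length of offset in unary). This is an illustrative bit-string implementation, not production index code:

```python
def gamma_encode(g):
    """Gamma code of gap g >= 1 as a string of '0'/'1' characters."""
    offset = bin(g)[3:]                # bin(13) == '0b1101' -> offset '101'
    return "1" * len(offset) + "0" + offset   # unary length, then offset

def gamma_decode_stream(bits):
    """Decode a delimiter-free concatenation of gamma codes.
    The prefix-free property guarantees a unique decoding."""
    gaps, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":          # read the unary part up to its 0
            length += 1
            i += 1
        i += 1                         # skip the terminating 0
        offset = bits[i:i + length]
        i += length
        gaps.append(int("1" + offset, 2))  # prepend the chopped-off 1
    return gaps

print(gamma_encode(13))                # 1110101, as in Table 5.5
```

Note that decoding a concatenation such as `gamma_encode(9) + gamma_encode(13)` needs no delimiters, which is the prefix-free property in action.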
***adjusted->
|------------
The discussion of external evaluation measures is partially based on Strehl (2002). Dom (2002) proposes a measure Q0 that is better motivated theoretically than NMI. Q0 is the number of bits needed to transmit class memberships assuming cluster memberships are known. The Rand index is due to Rand (1971). Hubert and Arabie (1985) propose an adjusted Rand index that ranges between −1 and 1 and is 0 if there is only chance agreement between clusters and classes (similar to κ in Chapter 8, page 165). Basu et al. (2004) argue that the three evaluation measures NMI, Rand index and F measure give very similar results. Stein et al. (2003) propose expected edge density as an internal measure and give evidence that it is a good predictor of the quality of a clustering. Kleinberg (2002) and Meilă (2005) present axiomatic frameworks for comparing clusterings.
***indexing unit->
|------------
Parallel to the issue of which parts of a document to return to the user is the issue of which parts of a document to index. In Section 2.1.2 (page 20), we discussed the need for a document unit or indexing unit in indexing and retrieval. In unstructured retrieval, it is usually clear what the right document

3. To represent the semantics of NEXI queries fully we would also need to designate one node in the tree as a “target node”, for example, the section in the tree in Figure 10.3. Without the designation of a target node, the tree in Figure 10.3 is not a search for sections embedded in articles (as specified by NEXI), but a search for articles that contain sections.
***Ide dec-hi->
|------------
Relevance feedback can improve both recall and precision. But, in practice, it has been shown to be most useful for increasing recall in situations where recall is important. This is partly because the technique expands the query, but it is also partly an effect of the use case: when they want high recall, users can be expected to take time to review results and to iterate on the search. Positive feedback also turns out to be much more valuable than negative feedback, and so most IR systems set γ < β. Reasonable values might be α = 1, β = 0.75, and γ = 0.15. In fact, many systems, such as the image search system in Figure 9.1, allow only positive feedback, which is equivalent to setting γ = 0. Another alternative is to use only the marked nonrelevant document which received the highest ranking from the IR system as negative feedback (here, |Dnr| = 1 in Equation (9.3)). While many of the experimental results comparing various relevance feedback variants are rather inconclusive, some studies have suggested that this variant, called Ide dec-hi, is the most effective or at least the most consistent performer.
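The Rocchio update referenced here (Equation 9.3) can be sketched with the suggested weights. This is a simplified illustration, not the book's code: vectors are plain lists, and negative components of the modified query are clipped to zero, as is standard.

```python
def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr),
    with negative weights set to 0."""
    def centroid(docs):
        if not docs:
            return [0.0] * len(q0)
        return [sum(d[i] for d in docs) / len(docs) for i in range(len(q0))]
    cr, cnr = centroid(relevant), centroid(nonrelevant)
    qm = [alpha * q0[i] + beta * cr[i] - gamma * cnr[i]
          for i in range(len(q0))]
    return [max(0.0, w) for w in qm]
```

Setting `gamma=0` gives the positive-feedback-only variant; passing only the single highest-ranked nonrelevant document as `nonrelevant` gives the Ide dec-hi variant.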
***length-normalization->
***stop words->
|------------
For O’Neill, which of the following is the desired tokenization?

neill   oneill   o’neill   o’ neill   o neill ?

And for aren’t, is it:

aren’t   arent   are n’t   aren t ?

A simple strategy is to just split on all non-alphanumeric characters, but while o neill looks okay, aren t looks intuitively bad. For all of them, the choices determine which Boolean queries will match. A query of neill AND capital will match in three cases but not the other two. In how many cases would a query of o’neill AND capital match? If no preprocessing of a query is done, then it would match in only one of the five cases. For either

2. That is, as defined here, tokens that are not indexed (stop words) are not terms, and if multiple tokens are collapsed together via normalization, they are indexed as one term, under the normalized form. However, we later relax this definition when discussing classification and clustering in Chapters 13–18, where there is no index. In these chapters, we drop the requirement of inclusion in the dictionary. A term means a normalized word.
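The "split on all non-alphanumeric characters" strategy mentioned above is a one-liner; this sketch applies it to the two examples discussed, showing why o neill looks acceptable while aren t does not:

```python
import re

def naive_tokenize(text):
    """Lowercase, then split on every run of non-alphanumeric characters."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]

print(naive_tokenize("O'Neill"))   # ['o', 'neill']
print(naive_tokenize("aren't"))    # ['aren', 't'] -- intuitively bad
```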
|------------
2.2.2 Dropping common terms: stop words

Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing. An example of a stop list is shown in Figure 2.5. Using a stop list significantly reduces the number of postings that a system has to store; we will present some statistics on this in Chapter 5 (see Table 5.1, page 87). And a lot of the time not indexing stop words does little harm: keyword searches with terms like the and by don’t seem very useful. However, this is not true for phrase searches. The phrase query “President of the United States”, which contains two stop words, is more precise than President AND “United States”. The meaning of flights to London is likely to be lost if the word to is stopped out. A search for Vannevar Bush’s article As we may think will be difficult if the first three words are stopped out, and the system searches simply for documents containing the word think. Some special query types are disproportionately affected. Some song titles and well known pieces of verse consist entirely of words that are commonly on stop lists (To be or not to be, Let It Be, I don’t want to be, . . . ).
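The general strategy described above (sort terms by collection frequency, take the most frequent as candidates) can be sketched on a toy collection; in practice the output would then be hand-filtered, as the text notes. The documents below are hypothetical:

```python
from collections import Counter

def stop_list_candidates(docs, k):
    """Top-k terms by collection frequency across tokenized documents."""
    cf = Counter(term for doc in docs for term in doc)
    return [term for term, _ in cf.most_common(k)]

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "and", "the", "cat"]]
print(stop_list_candidates(docs, 2))  # ['the', 'cat']
```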
|------------
The general trend in IR systems over time has been from standard use of quite large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list whatsoever. Web search engines generally do not use stop lists. Some of the design of modern IR systems has focused precisely on how we can exploit the statistics of language so as to be able to cope with common words in better ways. We will show in Section 5.3 (page 95) how good compression techniques greatly reduce the cost of storing the postings for common words. Section 6.2.1 (page 117) then discusses how standard term weighting leads to very common words having little impact on document rankings. Finally, Section 7.1.5 (page 140) shows how an IR system with impact-sorted indexes can terminate scanning a postings list early when weights get small, and hence common words do not cause a large additional processing cost for the average query, even though postings lists for stop words are very long. So for most modern IR systems, the additional cost of including stop words is not that big – neither in terms of index size nor in terms of query processing time.
|------------
Exercise 2.14 [⋆⋆]
How could an IR system combine use of a positional index and use of stop words? What is the potential problem, and how could it be handled?

2.5 References and further reading

Exhaustive discussion of the character-level processing of East Asian languages can be found in Lunde (1998). Character bigram indexes are perhaps the most standard approach to indexing Chinese, although some systems use word segmentation. Due to differences in the language and writing system, word segmentation is most usual for Japanese (Luk and Kwok 2002, Kishida et al. 2005). The structure of a character k-gram index over unsegmented text differs from that in Section 3.2.2 (page 54): there the k-gram dictionary points to postings lists of entries in the regular dictionary, whereas here it points directly to document postings lists. For further discussion of Chinese word segmentation, see Sproat et al. (1996), Sproat and Emerson (2003), Tseng et al.
***unigram language model->
|------------
12.1.2 Types of language models

How do we build probabilities over sequences of terms? We can always use the chain rule from Equation (11.1) to decompose the probability of a sequence of events into the probability of each successive event conditioned on earlier events:

P(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t1t2)P(t4|t1t2t3)    (12.4)

The simplest form of language model simply throws away all conditioning context, and estimates each term independently. Such a model is called a unigram language model:

Puni(t1t2t3t4) = P(t1)P(t2)P(t3)P(t4)    (12.5)

There are many more complex kinds of language models, such as bigram language models, which condition on the previous term,

Pbi(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t2)P(t4|t3)    (12.6)

and even more complex grammar-based language models such as probabilistic context-free grammars. Such models are vital for tasks like speech recognition, spelling correction, and machine translation, where you need the probability of a term conditioned on surrounding context. However, most language-modeling work in IR has used unigram language models.
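Equations (12.5) and (12.6) contrast as follows. This sketch uses hypothetical probability tables (`p` for unigram probabilities, `p_cond` for bigram conditionals); the point is only the difference in conditioning:

```python
def p_unigram(terms, p):
    """Equation (12.5): every term estimated independently."""
    prob = 1.0
    for t in terms:
        prob *= p[t]
    return prob

def p_bigram(terms, p, p_cond):
    """Equation (12.6): each term conditioned on its predecessor."""
    prob = p[terms[0]]
    for prev, t in zip(terms, terms[1:]):
        prob *= p_cond[(prev, t)]
    return prob
```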
***relevance feedback->
|------------
9.1 Relevance feedback and pseudo relevance feedback

The idea of relevance feedback (RF) is to involve the user in the retrieval process so as to improve the final result set. In particular, the user gives feedback on the relevance of documents in an initial set of results. The basic procedure is:

• The user issues a (short, simple) query.
|------------
Relevance feedback can go through one or more iterations of this sort. The process exploits the idea that it may be difficult to formulate a good query when you don’t know the collection well, but it is easy to judge particular documents, and so it makes sense to engage in iterative query refinement of this sort. In such a scenario, relevance feedback can also be effective in tracking a user’s evolving information need: seeing some documents may lead users to refine their understanding of the information they are seeking.
|------------
Image search provides a good example of relevance feedback. Not only is it easy to see the results at work, but this is a domain where a user can easily have difficulty formulating what they want in words, but can easily indicate relevant or nonrelevant images. After the user enters an initial query for bike on the demonstration system at: http://nayana.ece.ucsb.edu/imsearch/imsearch.html the initial results (in this case, images) are returned. In Figure 9.1 (a), the user has selected some of them as relevant. These will be used to refine the query, while other displayed results have no effect on the reformulation. Figure 9.1 (b) then shows the new top-ranked results calculated after this round of relevance feedback.
|------------
9.1.1 The Rocchio algorithm for relevance feedback The Rocchio Algorithm is the classic algorithm for implementing relevance feedback. It models a way of incorporating relevance feedback information into the vector space model of Section 6.3.
***harmonic number->
|------------
Now we have derived term statistics that characterize the distribution of terms in the collection and, by extension, the distribution of gaps in the postings lists. From these statistics, we can calculate the space requirements for an inverted index compressed with γ encoding. We first stratify the vocabulary into blocks of size Lc = 15. On average, term i occurs 15/i times per

4. Note that, unfortunately, the conventional symbol for both entropy and harmonic number is H. Context should make clear which is meant in this chapter.
***multimodal class->
|------------
not represent the “a” class well with a single prototype because it has two clusters. Rocchio often misclassifies this type of multimodal class. A text classification example for multimodality is a country like Burma, which changed its name to Myanmar in 1989. The two clusters before and after the name change need not be close to each other in space. We also encountered the problem of multimodality in relevance feedback (Section 9.1.2, page 184).
***Naive Bayes assumption->
|------------
Again, if we knew the percentage of relevant documents in the collection, then we could use this number to estimate P(R = 1|~q) and P(R = 0|~q). Since a document is either relevant or nonrelevant to a query, we must have that:

P(R = 1|~x,~q) + P(R = 0|~x,~q) = 1    (11.9)

11.3.1 Deriving a ranking function for query terms

Given a query q, we wish to order returned documents by descending P(R = 1|d, q). Under the BIM, this is modeled as ordering by P(R = 1|~x,~q). Rather than estimating this probability directly, because we are interested only in the ranking of documents, we work with some other quantities which are easier to compute and which give the same ordering of documents. In particular, we can rank documents by their odds of relevance (as the odds of relevance is monotonic with the probability of relevance). This makes things easier, because we can ignore the common denominator in (11.8), giving:

O(R|~x,~q) = P(R = 1|~x,~q) / P(R = 0|~x,~q)
          = [P(R = 1|~q)P(~x|R = 1,~q) / P(~x|~q)] / [P(R = 0|~q)P(~x|R = 0,~q) / P(~x|~q)]
          = [P(R = 1|~q) / P(R = 0|~q)] · [P(~x|R = 1,~q) / P(~x|R = 0,~q)]    (11.10)

The left term in the rightmost expression of Equation (11.10) is a constant for a given query. Since we are only ranking documents, there is thus no need for us to estimate it. The right-hand term does, however, require estimation, and this initially appears to be difficult: How can we accurately estimate the probability of an entire term incidence vector occurring?
It is at this point that we make the Naive Bayes conditional independence assumption that the presence or absence of a word in a document is independent of the presence or absence of any other word (given the query):

P(~x|R = 1,~q) / P(~x|R = 0,~q) = ∏_{t=1}^{M} P(xt|R = 1,~q) / P(xt|R = 0,~q)    (11.11)

So:

O(R|~x,~q) = O(R|~q) · ∏_{t=1}^{M} P(xt|R = 1,~q) / P(xt|R = 0,~q)    (11.12)

Since each xt is either 0 or 1, we can separate the terms to give:

O(R|~x,~q) = O(R|~q) · ∏_{t: xt=1} [P(xt = 1|R = 1,~q) / P(xt = 1|R = 0,~q)] · ∏_{t: xt=0} [P(xt = 0|R = 1,~q) / P(xt = 0|R = 0,~q)]    (11.13)

Henceforth, let pt = P(xt = 1|R = 1,~q) be the probability of a term appearing in a document relevant to the query, and ut = P(xt = 1|R = 0,~q) be the probability of a term appearing in a nonrelevant document. These quantities can be visualized in the following contingency table, where the columns add to 1:

(11.14)
                          document
                    relevant (R = 1)    nonrelevant (R = 0)
Term present (xt = 1)    pt                  ut
Term absent  (xt = 0)    1 − pt              1 − ut

Let us make an additional simplifying assumption that terms not occurring in the query are equally likely to occur in relevant and nonrelevant documents: that is, if qt = 0 then pt = ut. (This assumption can be changed, as when doing relevance feedback in Section 11.3.4.) Then we need only consider terms in the products that appear in the query, and so,

O(R|~q,~x) = O(R|~q) · ∏_{t: xt=qt=1} (pt/ut) · ∏_{t: xt=0, qt=1} [(1 − pt)/(1 − ut)]    (11.15)

The left product is over query terms found in the document and the right product is over query terms not found in the document.
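The document-dependent part of Equation (11.15) can be sketched as follows. This is an illustration only: `p` and `u` hold hypothetical per-term estimates of pt and ut, and the query-constant factor O(R|~q) is dropped since it does not affect the ranking.

```python
def bim_odds_ratio(query_terms, doc_terms, p, u):
    """Document-dependent factor of Equation (11.15).

    For each query term t: multiply in p[t]/u[t] if t occurs in the
    document, and (1-p[t])/(1-u[t]) if it does not.
    """
    score = 1.0
    for t in query_terms:
        if t in doc_terms:
            score *= p[t] / u[t]
        else:
            score *= (1 - p[t]) / (1 - u[t])
    return score
```

Documents would then be ranked by this score in decreasing order.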
***contiguity hypothesis->
|------------
The basic hypothesis in using the vector space model for classification is the contiguity hypothesis.

Contiguity hypothesis. Documents in the same class form a contiguous region and regions of different classes do not overlap.
|------------
Whether or not a set of documents is mapped into a contiguous region depends on the particular choices we make for the document representation: type of weighting, stop list etc. To see that the document representation is crucial, consider the two classes written by a group vs. written by a single person. Frequent occurrence of the first person pronoun I is evidence for the single-person class. But that information is likely deleted from the document representation if we use a stop list. If the document representation chosen is unfavorable, the contiguity hypothesis will not hold and successful vector space classification is not possible.
***confusion matrix->
|------------
An important tool for analyzing the performance of a classifier for J > 2 classes is the confusion matrix. The confusion matrix shows for each pair of classes 〈c1, c2〉, how many documents from c1 were incorrectly assigned to c2.
|------------
In Table 14.5, the classifier manages to distinguish the three financial classes money-fx, trade, and interest from the three agricultural classes wheat, corn, and grain, but makes many errors within these two groups. The confusion matrix can help pinpoint opportunities for improving the accuracy of the system.

                        assigned class
true class    money-fx  trade  interest  wheat  corn  grain
money-fx            95      0        10      0     0      0
trade                1      1        90      0     1      0
interest            13      0         0      0     0      0
wheat                0      0         1     34     3      7
corn                 1      0         2     13    26      5
grain                0      0         2     14     5     10

◮ Table 14.5 A confusion matrix for Reuters-21578. For example, 14 documents from grain were incorrectly assigned to wheat. Adapted from Picca et al. (2006).
***ground truth->
|------------
The standard approach to information retrieval system evaluation revolves around the notion of relevant and nonrelevant documents. With respect to a user information need, a document in the test collection is given a binary classification as either relevant or nonrelevant. This decision is referred to as the gold standard or ground truth judgment of relevance. The test document collection and suite of information needs have to be of a reasonable size: you need to average performance over fairly large test sets, as results are highly variable over different documents and information needs. As a rule of thumb, 50 information needs has usually been found to be a sufficient minimum.
***multinomial Naive Bayes->
|------------
13.2 Naive Bayes text classification

The first supervised learning method we introduce is the multinomial Naive Bayes or multinomial NB model, a probabilistic learning method. The probability of a document d being in class c is computed as

P(c|d) ∝ P(c) ∏_{1≤k≤nd} P(tk|c)    (13.2)

where P(tk|c) is the conditional probability of term tk occurring in a document of class c.1 We interpret P(tk|c) as a measure of how much evidence tk contributes that c is the correct class. P(c) is the prior probability of a document occurring in class c. If a document’s terms do not provide clear evidence for one class versus another, we choose the one that has a higher prior probability. 〈t1, t2, . . . , tnd〉 are the tokens in d that are part of the vocabulary we use for classification and nd is the number of such tokens in d. For example, 〈t1, t2, . . . , tnd〉 for the one-sentence document Beijing and Taipei join the WTO might be 〈Beijing, Taipei, join, WTO〉, with nd = 4, if we treat the terms and and the as stop words.
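Classification with Equation (13.2) can be sketched as below. This is an illustrative sketch, not the book's implementation: `priors` and `cond_prob` are assumed to have been estimated from training data (with smoothing, so no zero probabilities), and log probabilities are used so the product of many small factors does not underflow.

```python
import math

def nb_classify(tokens, priors, cond_prob):
    """Return argmax_c [log P(c) + sum_k log P(t_k|c)]."""
    best_class, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c])
        for t in tokens:
            score += math.log(cond_prob[c][t])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```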
***lexicalized subtree->
|------------
Next we define the dimensions of the vector space to be lexicalized subtrees of documents – subtrees that contain at least one vocabulary term. A subset of these possible lexicalized subtrees is shown in the figure, but there are others – e.g., the subtree corresponding to the whole document with the leaf node Gates removed. We can now represent queries and documents as vectors in this space of lexicalized subtrees and compute matches between them.
|------------
This means that we can use the vector space formalism from Chapter 6 for XML retrieval. The main difference is that the dimensions of vector space

◮ Figure 10.8 A mapping of an XML document (left) to a set of lexicalized subtrees (right).
***internal criterion of quality->
|------------
16.3 Evaluation of clustering

Typical objective functions in clustering formalize the goal of attaining high intra-cluster similarity (documents within a cluster are similar) and low inter-cluster similarity (documents from different clusters are dissimilar). This is an internal criterion for the quality of a clustering. But good scores on an internal criterion do not necessarily translate into good effectiveness in an application. An alternative to internal criteria is direct evaluation in the application of interest. For search result clustering, we may want to measure the time it takes users to find an answer with different clustering algorithms.
***cumulative gain->
|------------
A final approach that has seen increasing adoption, especially when employed with machine learning approaches to ranking (see Section 15.4, page 341) is measures of cumulative gain, and in particular normalized discounted cumulative gain (NDCG). NDCG is designed for situations of non-binary notions of relevance (cf. Section 8.5.1). Like precision at k, it is evaluated over some number k of top search results. For a set of queries Q, let R(j, d) be the relevance score assessors gave to document d for query j. Then,

NDCG(Q, k) = (1/|Q|) ∑_{j=1}^{|Q|} Z_kj ∑_{m=1}^{k} (2^{R(j,m)} − 1) / log2(1 + m),    (8.9)

where Z_kj is a normalization factor calculated to make it so that a perfect ranking’s NDCG at k for query j is 1. For queries for which k′ < k documents are retrieved, the last summation is done up to k′.
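Equation (8.9) for a single query can be sketched as follows. This is an illustration under one common reading of Z_kj: the reciprocal of the DCG of the ideal (relevance-sorted) ranking, so a perfect ranking scores 1. The relevance lists passed in are hypothetical assessor scores.

```python
import math

def dcg_at_k(rels, k):
    """Inner sum of (8.9): gain 2^rel - 1, discount log2(1 + rank)."""
    return sum((2 ** r - 1) / math.log2(1 + m)
               for m, r in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    """DCG normalized by the ideal (descending-relevance) DCG."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0
```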
***navigational queries->
|------------
Navigational queries seek the website or home page of a single entity that the user has in mind, say Lufthansa airlines. In such cases, the user’s expectation is that the very first search result should be the home page of Lufthansa.
***teleport->
|------------
What if the current location of the surfer, the node A, has no out-links? To address this we introduce an additional operation for our random surfer: the teleport operation. In the teleport operation the surfer jumps from a node to any other node in the web graph. This could happen because he types an address into the URL bar of his browser. The destination of a teleport operation is modeled as being chosen uniformly at random from all web pages. In other words, if N is the total number of nodes in the web graph,1 the teleport operation takes the surfer to each node with probability 1/N.
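A toy simulation of the random surfer with teleport can make this concrete. This is a hedged sketch, not the book's algorithm: the graph, teleport probability `alpha`, and step count are hypothetical choices, and visit frequencies over a long walk approximate the surfer's long-run distribution.

```python
import random

def surf(out_links, n_nodes, steps, alpha=0.1, seed=0):
    """Simulate the random surfer; return visit frequencies per node.

    With probability alpha (or always, from a node with no out-links)
    the surfer teleports to a uniformly random node; otherwise he
    follows a uniformly chosen out-link.
    """
    rng = random.Random(seed)
    visits = [0] * n_nodes
    node = 0
    for _ in range(steps):
        links = out_links.get(node, [])
        if not links or rng.random() < alpha:
            node = rng.randrange(n_nodes)   # teleport: uniform over nodes
        else:
            node = rng.choice(links)        # follow a random out-link
        visits[node] += 1
    return [v / steps for v in visits]
```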
***LDA->
|------------
probabilistic latent variable model for dimensionality reduction is the Latent Dirichlet Allocation (LDA) model (Blei et al. 2003), which is generative and assigns probabilities to documents outside of the training set. This model is extended to a hierarchical clustering by Rosen-Zvi et al. (2004). Wei and Croft (2006) present the first large scale evaluation of LDA, finding it to significantly outperform the query likelihood model of Section 12.2 (page 242), but to not perform quite as well as the relevance model mentioned in Section 12.4 (page 250) – but the latter does additional per-query processing unlike LDA.
***phrase queries->
|------------
a. How often is a skip pointer followed (i.e., p1 is advanced to skip(p1))?
b. How many postings comparisons will be made by this algorithm while intersecting the two lists?
c. How many postings comparisons would be made if the postings lists are intersected without the use of skip pointers?

2.4 Positional postings and phrase queries

Many complex or technical concepts and many organization and product names are multiword compounds or phrases. We would like to be able to pose a query such as Stanford University by treating it as a phrase so that a sentence in a document like The inventor Stanford Ovshinsky never went to university. is not a match. Most recent search engines support a double quotes syntax (“stanford university”) for phrase queries, which has proven to be very easily understood and successfully used by users. As many as 10% of web queries are phrase queries, and many more are implicit phrase queries (such as person names), entered without use of double quotes. To be able to support such queries, it is no longer sufficient for postings lists to be simply lists of documents that contain individual terms. In this section we consider two approaches to supporting phrase queries and their combination. A search engine should not only support phrase queries, but implement them efficiently. A related but distinct concept is term proximity weighting, where a document is preferred to the extent that the query terms appear close to each other in the text. This technique is covered in Section 7.2.2 (page 144) in the context of ranked retrieval.
|------------
2.4.1 Biword indexes

One approach to handling phrases is to consider every pair of consecutive terms in a document as a phrase. For example, the text Friends, Romans, Countrymen would generate the biwords:

friends romans
romans countrymen

In this model, we treat each of these biwords as a vocabulary term. Being able to process two-word phrase queries is immediate. Longer phrases can be processed by breaking them down. The query stanford university palo alto can be broken into the Boolean query on biwords:

“stanford university” AND “university palo” AND “palo alto”

This query could be expected to work fairly well in practice, but there can and will be occasional false positives. Without examining the documents, we cannot verify that the documents matching the above Boolean query do actually contain the original 4 word phrase.
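Biword generation and the decomposition of a longer phrase query into a Boolean AND of biwords, as described above, can be sketched in a few lines:

```python
def biwords(tokens):
    """Every pair of consecutive terms, each treated as one vocabulary term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords(["friends", "romans", "countrymen"]))
# ['friends romans', 'romans countrymen']

query = biwords(["stanford", "university", "palo", "alto"])
print(" AND ".join(f'"{b}"' for b in query))
# "stanford university" AND "university palo" AND "palo alto"
```

As the text notes, a match on this Boolean query does not guarantee the document contains the original four-word phrase.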
|------------
Johnson et al. (2006) report that 11.7% of all queries in two 2002 web query logs contained phrase queries, though Kammenhuber et al. (2006) report only 3% phrase queries for a different data set. Silverstein et al. (1999) note that many queries without explicit phrase operators are actually implicit phrase searches.
***differential cluster labeling->
|------------
Differential cluster labeling selects cluster labels by comparing the distribution of terms in one cluster with that of other clusters. The feature selection methods we introduced in Section 13.5 (page 271) can all be used for differential cluster labeling. In particular, mutual information (MI) (Section 13.5.1, page 272) or, equivalently, information gain and the χ2 test (Section 13.5.2, page 275) will identify cluster labels that characterize one cluster in contrast to other clusters. A combination of a differential test with a penalty for rare terms often gives the best labeling results because rare terms are not necessarily representative of the cluster as a whole.
***nonlinear problem->
|------------
◮ Figure 14.11 A nonlinear problem.
|------------
Figure 14.11 is another example of a nonlinear problem: there is no good linear separator between the distributions P(d|c) and P(d|c̄) because of the circular “enclave” in the upper left part of the graph. Linear classifiers misclassify the enclave, whereas a nonlinear classifier like kNN will be highly accurate for this type of problem if the training set is large enough.
***precision-recall curve->
|------------
8.4 Evaluation of ranked retrieval results

Precision, recall, and the F measure are set-based measures. They are computed using unordered sets of documents. We need to extend these measures (or to define new measures) if we are to evaluate the ranked retrieval results that are now standard with search engines. In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents. For each such set, precision and recall values can be plotted to give a precision-recall curve, such as the one shown in Figure 8.2. Precision-recall curves have a distinctive saw-tooth shape: if the (k + 1)th document retrieved is nonrelevant then recall is the same as for the top k documents, but precision has dropped. If it is relevant, then both precision and recall increase, and the curve jags up and to the right. It is often useful to remove these jiggles and the standard way to do this is with an interpolated precision: the interpolated precision p_interp at a certain recall level r is defined as the highest precision found for any recall level r′ ≥ r:

p_interp(r) = max_{r′ ≥ r} p(r′)
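Computing the raw precision-recall points and interpolating them can be sketched as follows. This is a toy implementation under the standard definition p_interp(r) = max over r′ ≥ r of p(r′); the function names are ours.

```python
def precision_recall_points(relevance, total_relevant):
    """relevance: list of 0/1 judgments for the ranked result list.
    Returns (recall, precision) at each cutoff k = 1, 2, ..."""
    points, hits = [], 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / k))
    return points

def interpolated_precision(points, r):
    """Highest precision at any recall level r' >= r (0.0 if none)."""
    return max((p for rec, p in points if rec >= r), default=0.0)
```

On the ranking relevant, nonrelevant, relevant, nonrelevant with two relevant documents in total, the raw points are (0.5, 1.0), (0.5, 0.5), (1.0, 2/3), (1.0, 0.5); interpolation smooths the dip at the second point back up to 1.0.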
***pull model->
|------------
14.7 References and further reading

As discussed in Chapter 9, Rocchio relevance feedback is due to Rocchio (1971). Joachims (1997) presents a probabilistic analysis of the method. Rocchio classification was widely used as a classification method in TREC in the 1990s (Buckley et al. 1994a;b, Voorhees and Harman 2005). Initially, it was used as a form of routing. Routing merely ranks documents according to relevance to a class without assigning them. Early work on filtering, a true classification approach that makes an assignment decision on each document, was published by Ittner et al. (1995) and Schapire et al. (1998). The definition of routing we use here should not be confused with another sense. Routing can also refer to the electronic distribution of documents to subscribers, the so-called push model of document distribution. In a pull model, each transfer of a document to the user is initiated by the user – for example, by means of search or by selecting it from a list of documents on a news aggregation website.
***mutual information->
|------------
The basic feature selection algorithm is shown in Figure 13.6. For a given class c, we compute a utility measure A(t, c) for each term of the vocabulary and select the k terms that have the highest values of A(t, c). All other terms are discarded and not used in classification. We will introduce three different utility measures in this section: mutual information, A(t, c) = I(Ut;Cc); the χ2 test, A(t, c) = X2(t, c); and frequency, A(t, c) = N(t, c).
|------------
13.5.1 Mutual information

A common feature selection method is to compute A(t, c) as the expected mutual information (MI) of term t and class c. MI measures how much information the presence/absence of a term contributes to making the correct classification decision on c. Formally:

I(U; C) = ∑_{e_t ∈ {1,0}} ∑_{e_c ∈ {1,0}} P(U = e_t, C = e_c) log2 [ P(U = e_t, C = e_c) / (P(U = e_t) P(C = e_c)) ]    (13.16)

where U is a random variable that takes values e_t = 1 (the document contains term t) and e_t = 0 (the document does not contain t), as defined on page 266, and C is a random variable that takes values e_c = 1 (the document is in class c) and e_c = 0 (the document is not in class c). We write U_t and C_c if it is not clear from context which term t and class c we are referring to.
|------------
For MLEs of the probabilities, Equation (13.16) is equivalent to Equation (13.17):

I(U; C) = (N_11 / N) log2 [ N N_11 / (N_1. N_.1) ] + (N_01 / N) log2 [ N N_01 / (N_0. N_.1) ]
        + (N_10 / N) log2 [ N N_10 / (N_1. N_.0) ] + (N_00 / N) log2 [ N N_00 / (N_0. N_.0) ]    (13.17)

where the Ns are counts of documents that have the values of e_t and e_c that are indicated by the two subscripts. For example, N_10 is the number of documents that contain t (e_t = 1) and are not in c (e_c = 0).

5. Take care not to confuse expected mutual information with pointwise mutual information, which is defined as log N_11/E_11 where N_11 and E_11 are defined as in Equation (13.18). The two measures have different properties. See Section 13.7.
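Equation (13.17) is straightforward to compute from the four document counts. A small sketch (our function name; the convention 0 · log 0 = 0 handles empty cells):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Expected MI of a term and a class from the 2x2 table of
    document counts, per Equation (13.17). First subscript: term
    present (1) or absent (0); second: in class (1) or not (0)."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # term present / absent marginals
    n_1, n_0 = n11 + n01, n10 + n00   # in class / not in class marginals

    def term(nij, row, col):
        # one summand of (13.17); 0 * log 0 is taken to be 0
        if nij == 0:
            return 0.0
        return (nij / n) * math.log2(n * nij / (row * col))

    return (term(n11, n1_, n_1) + term(n01, n0_, n_1)
            + term(n10, n1_, n_0) + term(n00, n0_, n_0))
```

When term and class are independent (all four cells equal), MI is 0; when the term occurs in exactly the class documents and nowhere else, MI reaches 1 bit.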
|------------
normalized mutual information or NMI:

NMI(Ω, C) = I(Ω; C) / ( [H(Ω) + H(C)] / 2 )    (16.2)

where I is mutual information (cf. Chapter 13, page 272):

I(Ω; C) = ∑_k ∑_j P(ω_k ∩ c_j) log [ P(ω_k ∩ c_j) / (P(ω_k) P(c_j)) ]    (16.3)
        = ∑_k ∑_j (|ω_k ∩ c_j| / N) log [ N |ω_k ∩ c_j| / (|ω_k| |c_j|) ]    (16.4)

where P(ω_k), P(c_j), and P(ω_k ∩ c_j) are the probabilities of a document being in cluster ω_k, class c_j, and in the intersection of ω_k and c_j, respectively. Equation (16.4) is equivalent to Equation (16.3) for maximum likelihood estimates of the probabilities (i.e., the estimate of each probability is the corresponding relative frequency).
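Equations (16.2)–(16.4) can be computed directly from two parallel label lists, one entry per document (a minimal sketch; function names are ours, and we assume the entropies are not both zero):

```python
import math
from collections import Counter

def entropy(labels):
    """H of a labeling, using maximum likelihood probability estimates."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def nmi(clusters, classes):
    """NMI per Equations (16.2)-(16.4). clusters[i] and classes[i]
    are the cluster and class labels of document i."""
    n = len(clusters)
    wk, cj = Counter(clusters), Counter(classes)
    joint = Counter(zip(clusters, classes))
    # Equation (16.4): MLE form of the mutual information
    i_wc = sum(c / n * math.log(n * c / (wk[w] * cj[x]))
               for (w, x), c in joint.items())
    return i_wc / ((entropy(clusters) + entropy(classes)) / 2)
```

A clustering that exactly recreates the classes gets NMI 1.0, while splitting those clusters further lowers NMI even though raw MI stays maximal, which is precisely the normalization the text motivates.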
|------------
Maximum mutual information is reached for a clustering Ω_exact that perfectly recreates the classes – but also if clusters in Ω_exact are further subdivided into smaller clusters (Exercise 16.7). In particular, a clustering with K = N one-document clusters has maximum MI. So MI has the same problem as purity: it does not penalize large cardinalities and thus does not formalize our bias that, other things being equal, fewer clusters are better.
***Euclidean length->
|------------
To compensate for the effect of document length, the standard way of quantifying the similarity between two documents d1 and d2 is to compute the cosine similarity of their vector representations ~V(d1) and ~V(d2):

sim(d1, d2) = ~V(d1) · ~V(d2) / ( |~V(d1)| |~V(d2)| )    (6.10)

where the numerator represents the dot product (also known as the inner product) of the vectors ~V(d1) and ~V(d2), while the denominator is the product of their Euclidean lengths. The dot product ~x · ~y of two vectors is defined as ∑_{i=1}^{M} x_i y_i. Let ~V(d) denote the document vector for d, with M components ~V_1(d) . . . ~V_M(d). The Euclidean length of d is defined to be √( ∑_{i=1}^{M} ~V_i^2(d) ).
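Equation (6.10) and the two definitions it relies on translate directly into code (a minimal sketch with our function names, operating on plain lists of term weights):

```python
import math

def dot(x, y):
    """Dot product: sum over i of x_i * y_i."""
    return sum(xi * yi for xi, yi in zip(x, y))

def euclidean_length(v):
    """Euclidean length: sqrt of the sum of squared components."""
    return math.sqrt(sum(vi * vi for vi in v))

def cosine_similarity(v1, v2):
    """Equation (6.10): dot product over product of Euclidean lengths."""
    return dot(v1, v2) / (euclidean_length(v1) * euclidean_length(v2))
```

Note that scaling a vector (e.g., concatenating a document with itself) leaves the cosine similarity unchanged, which is exactly the length compensation the text is after.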
***XML element->
|------------
10.1 Basic XML concepts

An XML document is an ordered, labeled tree. Each node of the tree is an XML element and is written with an opening and closing tag. An element can have one or more XML attributes. In the XML document in Figure 10.1, the scene element is enclosed by the two tags <scene number="vii"> and </scene>. It has an attribute number with value vii and two child elements, title and verse.
***Okapi weighting->
|------------
11.4.3 Okapi BM25: a non-binary model

The BIM was originally designed for short catalog records and abstracts of fairly consistent length, and it works reasonably in these contexts, but for modern full-text search collections, it seems clear that a model should pay attention to term frequency and document length, as in Chapter 6. The BM25 weighting scheme, often called Okapi weighting, after the system in which it was first implemented, was developed as a way of building a probabilistic model sensitive to these quantities while not introducing too many additional parameters into the model (Spärck Jones et al. 2000). We will not develop the full theory behind the model here, but just present a series of forms that build up to the standard form now used for document scoring. The simplest score for document d is just idf weighting of the query terms present, as in Equation (11.22):

RSV_d = ∑_{t ∈ q} log ( N / df_t )    (11.30)

Sometimes, an alternative version of idf is used. If we start with the formula in Equation (11.21) but in the absence of relevance feedback information we estimate that S = s = 0, then we get an alternative idf formulation as follows:

RSV_d = ∑_{t ∈ q} log [ (N − df_t + 1/2) / (df_t + 1/2) ]    (11.31)

This variant behaves slightly strangely: if a term occurs in over half the documents in the collection then this model gives a negative term weight, which is presumably undesirable. But, assuming the use of a stop list, this normally doesn’t happen, and the value for each summand can be given a floor of 0.
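The two idf-only scoring forms, Equations (11.30) and (11.31), can be sketched as follows (our function names; the floor-at-0 fix for the second variant is the one the text suggests):

```python
import math

def rsv_idf(query_terms, df, n):
    """Equation (11.30): RSV_d = sum over query terms of log(N / df_t).
    df maps term -> document frequency; terms absent from df are skipped."""
    return sum(math.log(n / df[t]) for t in query_terms if t in df)

def rsv_idf_alt(query_terms, df, n):
    """Equation (11.31), with each summand floored at 0 so that terms
    occurring in over half the collection cannot contribute
    negative weights."""
    return sum(max(0.0, math.log((n - df[t] + 0.5) / (df[t] + 0.5)))
               for t in query_terms if t in df)
```

With N = 10, a term of document frequency 9 gets a negative raw weight under (11.31), which the floor clips to 0, while a rare term (df = 1) keeps a large positive weight under both variants.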
***singleton cluster->
|------------
This is a particular problem if a document set contains many outliers, documents that are far from any other documents and therefore do not fit well into any cluster. Frequently, if an outlier is chosen as an initial seed, then no other vector is assigned to it during subsequent iterations. Thus, we end up with a singleton cluster (a cluster with only one document) even though there is probably a clustering with lower RSS. Figure 16.7 shows an example of a suboptimal clustering resulting from a bad choice of initial seeds.
***docID->
|------------
Within a document collection, we assume that each document has a unique serial number, known as the document identifier (docID). During index construction, we can simply assign successive integers to each new document when it is first encountered. The input to indexing is a list of normalized tokens for each document, which we can equally think of as a list of pairs of term and docID, as in Figure 1.4. The core indexing step is sorting this list so that the terms are alphabetical, giving us the representation in the middle column of Figure 1.4. Multiple occurrences of the same term from the same document are then merged.5 Instances of the same term are then grouped, and the result is split into a dictionary and postings, as shown in the right column of Figure 1.4. Since a term generally occurs in a number of documents, this data organization already reduces the storage requirements of the index. The dictionary also records some statistics, such as the number of documents which contain each term (the document frequency, which is here also the length of each postings list). This information is not vital for a basic Boolean search engine, but it allows us to improve the efficiency of the search engine at query time.

5. Unix users can note that these steps are similar to use of the sort and then uniq commands.
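The sort-merge-group construction just described can be sketched in a few lines (a toy illustration of the steps behind Figure 1.4; the function names are ours):

```python
def build_index(docs):
    """docs maps docID -> list of normalized tokens. Build the set of
    (term, docID) pairs (merging duplicate occurrences), sort it, and
    group by term into postings lists sorted by docID."""
    pairs = sorted({(term, doc_id)
                    for doc_id, tokens in docs.items()
                    for term in tokens})
    index = {}
    for term, doc_id in pairs:
        index.setdefault(term, []).append(doc_id)
    return index

def document_frequency(index, term):
    """df_t: here simply the length of the term's postings list."""
    return len(index.get(term, []))
```

Note how the duplicate occurrence of a term within one document is merged away by the set construction, mirroring the sort-then-uniq analogy in the footnote.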
***connectivity server->
|------------
20.4 Connectivity servers

For reasons to become clearer in Chapter 21, web search engines require a connectivity server that supports fast connectivity queries on the web graph. Typical connectivity queries are which URLs link to a given URL? and which URLs does a given URL link to? To this end, we wish to store mappings in memory from URL to out-links, and from URL to in-links. Applications include crawl control, web graph analysis, sophisticated crawl optimization and link analysis (to be covered in Chapter 21).
***break-even point->
|------------
Thus, R-precision turns out to be identical to the break-even point, another measure which is sometimes used, defined in terms of this equality relationship holding. Like Precision at k, R-precision describes only one point on the precision-recall curve, rather than attempting to summarize effectiveness across the curve, and it is somewhat unclear why you should be interested in the break-even point rather than either the best point on the curve (the point with maximal F-measure) or a retrieval level of interest to a particular application (Precision at k). Nevertheless, R-precision turns out to be highly correlated with MAP empirically, despite measuring only a single point on the curve.

◮ Figure 8.4 The ROC curve corresponding to the precision-recall curve in Figure 8.2 (axes: 1 − specificity against sensitivity = recall).
***functional margin->
|------------
The math works out much more cleanly if you do things this way, as we will see almost immediately in the definition of functional margin. The linear classifier is then:

f(~x) = sign(~w^T ~x + b)    (15.1)

A value of −1 indicates one class, and a value of +1 the other class.
|------------
We are confident in the classification of a point if it is far away from the decision boundary. For a given data set and decision hyperplane, we define the functional margin of the ith example ~x_i with respect to a hyperplane 〈~w, b〉 as the quantity y_i(~w^T ~x_i + b). The functional margin of a data set with respect to a decision surface is then twice the functional margin of any of the points in the data set with minimal functional margin (the factor of 2 comes from measuring across the whole width of the margin, as in Figure 15.3).
|------------
However, there is a problem with using this definition as is: the value is underconstrained, because we can always make the functional margin as big as we wish by simply scaling up ~w and b. For example, if we replace ~w by 5~w and b by 5b then the functional margin y_i(5~w^T ~x_i + 5b) is five times as large. This suggests that we need to place some constraint on the size of the ~w vector. To get a sense of how to do that, let us look at the actual geometry.
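The definition and the scaling problem are easy to verify numerically (a minimal sketch; the function names are ours):

```python
def functional_margin(w, b, x, y):
    """Functional margin y_i (w^T x_i + b) of one labeled point,
    with label y in {-1, +1}."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def dataset_functional_margin(w, b, points):
    """Twice the minimal per-point functional margin over (x, y) pairs,
    per the definition above."""
    return 2 * min(functional_margin(w, b, x, y) for x, y in points)
```

Scaling ~w and b by 5 multiplies every functional margin by exactly 5, confirming that the quantity is underconstrained until we fix the scale of ~w.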
***transactional query->
|------------
A transactional query is one that is a prelude to the user performing a transaction on the Web – such as purchasing a product, downloading a file or making a reservation. In such cases, the search engine should return results listing services that provide form interfaces for such transactions.
***search result clustering->
|------------
The hypothesis states that if there is a document from a cluster that is relevant to a search request, then it is likely that other documents from the same cluster are also relevant. This is because clustering puts together documents that share many terms. The cluster hypothesis essentially is the contiguity hypothesis of Chapter 14.

Application                What is clustered?        Benefit                                                        Example
Search result clustering   search results            more effective information presentation to user                Figure 16.2
Scatter-Gather             (subsets of) collection   alternative user interface: “search without typing”            Figure 16.3
Collection clustering      collection                effective information presentation for exploratory browsing    McKeown et al. (2002), http://news.google.com
Language modeling          collection                increased precision and/or recall                              Liu and Croft (2004)
Cluster-based retrieval    collection                higher efficiency: faster search                               Salton (1971a)

◮ Table 16.1 Some applications of clustering in information retrieval.
|------------
The first application mentioned in Table 16.1 is search result clustering, where by search results we mean the documents that were returned in response to a query. The default presentation of search results in information retrieval is a simple list. Users scan the list from top to bottom until they have found the information they are looking for. Instead, search result clustering clusters the search results, so that similar documents appear together. It is often easier to scan a few coherent groups than many individual documents. This is particularly useful if a search term has different word senses. The example in Figure 16.2 is jaguar. Three frequent senses on the web refer to the car, the animal and an Apple operating system. The Clustered Results panel returned by the Vivísimo search engine (http://vivisimo.com) can be a more effective user interface for understanding what is in the search results than a simple list of documents.
***partitioning->
|------------
An example of an efficient divisive algorithm is bisecting K-means (Steinbach et al. 2000). Spectral clustering algorithms (Kannan et al. 2000, Dhillon 2001, Zha et al. 2001, Ng et al. 2001a), including principal direction divisive partitioning (PDDP) (whose bisecting decisions are based on SVD, see Chapter 18) (Boley 1998, Savaresi and Boley 2004), are computationally more expensive than bisecting K-means, but have the advantage of being deterministic.
***clustering->
|------------
◮ Figure 17.10 Complete-link clustering is not best-merge persistent (points d1–d4 on a line). At first, d2 is the best-merge cluster for d3. But after merging d1 and d2, d4 becomes d3’s best-merge candidate. In a best-merge persistent algorithm like single-link, d3’s best-merge cluster would be {d1, d2}.
|------------
Can we also speed up the other three HAC algorithms with an NBM array? We cannot because only single-link clustering is best-merge persistent. Suppose that the best merge cluster for ω_k is ω_j in single-link clustering.
|------------
Then after merging ω_j with a third cluster ω_i ≠ ω_k, the merge of ω_i and ω_j will be ω_k’s best merge cluster (Exercise 17.6). In other words, the best-merge candidate for the merged cluster is one of the two best-merge candidates of its components in single-link clustering. This means that C can be updated in Θ(N) in each iteration – by taking a simple max of two values on line 14 in Figure 17.9 for each of the remaining ≤ N clusters.
|------------
Figure 17.10 demonstrates that best-merge persistence does not hold for complete-link clustering, which means that we cannot use an NBM array to speed up clustering. After merging d3’s best merge candidate d2 with cluster d1, an unrelated cluster d4 becomes the best merge candidate for d3. This is because the complete-link merge criterion is non-local and can be affected by points at a great distance from the area where two merge candidates meet.
|------------
Exercise 17.1 Show that complete-link clustering creates the two-cluster clustering depicted in Figure 17.7.
|------------
17.3 Group-average agglomerative clustering

Group-average agglomerative clustering or GAAC (see Figure 17.3, (d)) evaluates cluster quality based on all similarities between documents, thus avoiding the pitfalls of the single-link and complete-link criteria, which equate cluster similarity with the similarity of a single pair of documents. GAAC is also called group-average clustering and average-link clustering. GAAC computes the average similarity SIM-GA of all pairs of documents, including pairs from the same cluster. But self-similarities are not included in the average:

SIM-GA(ω_i, ω_j) = [ 1 / ((N_i + N_j)(N_i + N_j − 1)) ] ∑_{d_m ∈ ω_i ∪ ω_j} ∑_{d_n ∈ ω_i ∪ ω_j, d_n ≠ d_m} ~d_m · ~d_n    (17.1)

where ~d is the length-normalized vector of document d, · denotes the dot product, and N_i and N_j are the number of documents in ω_i and ω_j, respectively.
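Equation (17.1) can be computed directly over the merged cluster (a minimal sketch with our function names; clusters are lists of raw document vectors, which we length-normalize first as the text assumes):

```python
import math

def _normalized(v):
    """Length-normalize a vector so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def sim_ga(omega_i, omega_j):
    """Equation (17.1): average dot product over all ordered pairs of
    distinct documents in omega_i union omega_j; self-similarities
    (m == n) are excluded from the sum."""
    docs = [_normalized(d) for d in omega_i + omega_j]
    n = len(docs)
    total = sum(sum(a * b for a, b in zip(docs[m], docs[k]))
                for m in range(n) for k in range(n) if k != m)
    return total / (n * (n - 1))
```

Two clusters of identically oriented vectors give SIM-GA = 1.0 regardless of vector lengths, while orthogonal singleton clusters give 0.0.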
***East Asian languages->
|------------
Exercise 2.14 [⋆⋆] How could an IR system combine use of a positional index and use of stop words? What is the potential problem, and how could it be handled?

2.5 References and further reading

Exhaustive discussion of the character-level processing of East Asian languages can be found in Lunde (1998). Character bigram indexes are perhaps the most standard approach to indexing Chinese, although some systems use word segmentation. Due to differences in the language and writing system, word segmentation is most usual for Japanese (Luk and Kwok 2002, Kishida et al. 2005). The structure of a character k-gram index over unsegmented text differs from that in Section 3.2.2 (page 54): there the k-gram dictionary points to postings lists of entries in the regular dictionary, whereas here it points directly to document postings lists. For further discussion of Chinese word segmentation, see Sproat et al. (1996), Sproat and Emerson (2003), Tseng et al.
***cluster->
|------------
4.4 Distributed indexing

Collections are often so large that we cannot perform index construction efficiently on a single machine. This is particularly true of the World Wide Web for which we need large computer clusters1 to construct any reasonably sized web index. Web search engines, therefore, use distributed indexing algorithms for index construction. The result of the construction process is a distributed index that is partitioned across several machines – either according to term or according to document. In this section, we describe distributed indexing for a term-partitioned index. Most large search engines prefer a document-partitioned index.

1. A cluster in this chapter is a group of tightly coupled computers that work together closely.
|------------
This sense of the word is different from the use of cluster as a group of documents that are semantically similar in Chapters 16–18.
|------------
16 Flat clustering

Clustering algorithms group a set of documents into subsets or clusters. The algorithms’ goal is to create clusters that are coherent internally, but clearly different from each other. In other words, documents within a cluster should be as similar as possible; and documents in one cluster should be as dissimilar as possible from documents in other clusters.
|------------
◮ Figure 16.1 An example of a data set with a clear cluster structure.
|------------
Clustering is the most common form of unsupervised learning. No supervision means that there is no human expert who has assigned documents to classes. In clustering, it is the distribution and makeup of the data that will determine cluster membership. A simple example is Figure 16.1. It is visually clear that there are three distinct clusters of points. This chapter and Chapter 17 introduce algorithms that find such clusters in an unsupervised fashion.
|------------
The difference between clustering and classification may not seem great at first. After all, in both cases we have a partition of a set of documents into groups. But as we will see the two problems are fundamentally different. Classification is a form of supervised learning (Chapter 13, page 256): our goal is to replicate a categorical distinction that a human supervisor imposes on the data. In unsupervised learning, of which clustering is the most important example, we have no such teacher to guide us.
***linear separability->
|------------
If there exists a hyperplane that perfectly separates the two classes, then we call the two classes linearly separable. In fact, if linear separability holds, then there is an infinite number of linear separators (Exercise 14.4) as illustrated by Figure 14.8, where the number of possible separating hyperplanes is infinite.
***binary tree->
|------------
Search trees overcome many of these issues – for instance, they permit us to enumerate all vocabulary terms beginning with automat. The best-known search tree is the binary tree, in which each internal node has two children. The search for a term begins at the root of the tree. Each internal node (including the root) represents a binary test, based on whose outcome the search proceeds to one of the two sub-trees below that node. Figure 3.1 gives an example of a binary search tree used for a dictionary. Efficient search (with a number of comparisons that is O(log M)) hinges on the tree being balanced: the numbers of terms under the two sub-trees of any node are either equal or differ by one. The principal issue here is that of rebalancing: as terms are inserted into or deleted from the binary search tree, it needs to be rebalanced so that the balance property is maintained.
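The prefix enumeration that motivates search trees can be illustrated without building an explicit tree: a sorted array of terms plus two binary searches finds the same contiguous range of matches. This is a stand-in sketch, not a balanced-tree implementation; the function name and the high-codepoint sentinel trick are ours.

```python
import bisect

def terms_with_prefix(sorted_vocab, prefix):
    """All vocabulary terms beginning with prefix, located with two
    O(log M) binary searches over a sorted term array. The sentinel
    '\uffff' sorts after any character that follows the prefix."""
    lo = bisect.bisect_left(sorted_vocab, prefix)
    hi = bisect.bisect_left(sorted_vocab, prefix + "\uffff")
    return sorted_vocab[lo:hi]
```

Unlike a hash-based dictionary, this ordered structure returns the matches for automat as one contiguous slice, which is the property the text highlights.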
|------------
There are few differences between the applications of flat and hierarchical clustering in information retrieval. In particular, hierarchical clustering is appropriate for any of the applications shown in Table 16.1 (page 351; see also Section 16.6, page 372). In fact, the example we gave for collection clustering is hierarchical. In general, we select flat clustering when efficiency is important and hierarchical clustering when one of the potential problems

1. In this chapter, we only consider hierarchies that are binary trees like the one shown in Figure 17.1 – but hierarchical clustering can be easily extended to other types of trees.
***metadata->
|------------
For most languages and particular domains within them there are unusual specific tokens that we wish to recognize as terms, such as the programming languages C++ and C#, aircraft names like B-52, or a T.V. show name such as M*A*S*H – which is sufficiently integrated into popular culture that you find usages such as M*A*S*H-style hospitals. Computer technology has introduced new types of character sequences that a tokenizer should probably tokenize as a single token, including email addresses (jblack@mail.yahoo.com), web URLs (http://stuff.big.com/new/specials.html), numeric IP addresses (142.32.48.231), package tracking numbers (1Z9999W99845399981), and more. One possible solution is to omit from indexing tokens such as monetary amounts, numbers, and URLs, since their presence greatly expands the size of the vocabulary. However, this comes at a large cost in restricting what people can search for. For instance, people might want to search in a bug database for the line number where an error occurs. Items such as the date of an email, which have a clear semantic type, are often indexed separately as document metadata (see Section 6.1, page 110).
|------------
6.1 Parametric and zone indexes

We have thus far viewed a document as a sequence of terms. In fact, most documents have additional structure. Digital documents generally encode, in machine-recognizable form, certain metadata associated with each document. By metadata, we mean specific forms of data about a document, such as its author(s), title and date of publication. This metadata would generally include fields such as the date of creation and the format of the document, as well as the author and possibly the title of the document. The possible values of a field should be thought of as finite – for instance, the set of all dates of authorship.
|------------
A static summary is generally comprised of either or both a subset of the document and metadata associated with the document. The simplest form of summary takes the first two sentences or 50 words of a document, or extracts particular zones of a document, such as the title and author. Instead of zones of a document, the summary can instead use metadata associated with the document. This may be an alternative way to provide an author or date, or may include elements which are designed to give a summary, such as the description metadata which can appear in the meta element of a web HTML page. This summary is typically extracted and cached at indexing time, in such a way that it can be retrieved and presented quickly when displaying search results, whereas having to access the actual document content might be a relatively expensive operation.
|------------
Figure 10.2 shows Figure 10.1 as a tree. The leaf nodes of the tree consist of text, e.g., Shakespeare, Macbeth, and Macbeth’s castle. The tree’s internal nodes encode either the structure of the document (title, act, and scene) or metadata functions (author).
|------------
The Columbia NewsBlaster system (McKeown et al. 2002), a forerunner to the now much more famous and refined Google News (http://news.google.com), used hierarchical clustering (Chapter 17) to give two levels of news topic granularity. See Hatzivassiloglou et al. (2000) for details, and Chen and Lin (2000) and Radev et al. (2001) for related systems. Other applications of clustering in information retrieval are duplicate detection (Yang and Callan (2006), Section 19.6, page 438), novelty detection (see references in Section 17.9, page 399) and metadata discovery on the semantic web (Alonso et al. 2006).
|------------
A doorway page contains text and metadata carefully chosen to rank highly on selected search keywords. When a browser requests the doorway page, it is redirected to a page containing content of a more commercial nature. More complex spamming techniques involve manipulation of the metadata related to a page including (for reasons we will see in Chapter 21) the links into a web page. Given that spamming is inherently an economically motivated activity, there has sprung around it an industry of Search Engine Optimizers, or SEOs, to provide consultancy services for clients who seek to have their web pages rank highly on selected keywords. Web search engines frown on this business of attempting to decipher and adapt to their proprietary ranking techniques and indeed announce policies on forms of SEO behavior they do not tolerate (and have been known to shut down search requests from certain SEOs for violation of these). Inevitably, the parrying between such SEOs (who gradually infer features of each web search engine’s ranking methods) and the web search engines (who adapt in response) is an unending struggle; indeed, the research sub-area of adversarial information retrieval has sprung up around this battle. One way to combat spammers who manipulate the text of their web pages is the exploitation of the link structure of the Web – a technique known as link analysis. The first web search engine known to apply link analysis on a large scale (to be detailed in Chapter 21) was Google, although all web search engines currently make use of it (and correspondingly, spammers now invest considerable effort in subverting it – this is known as link spam).
***k-gram index->
|------------
3.2.2 k-gram indexes for wildcard queries

Whereas the permuterm index is simple, it can lead to a considerable blowup from the number of rotations per term; for a dictionary of English terms, this can represent an almost ten-fold space increase. We now present a second technique, known as the k-gram index, for processing wildcard queries. We will also use k-gram indexes in Section 3.3.4. A k-gram is a sequence of k characters. Thus cas, ast and stl are all 3-grams occurring in the term castle.
|------------
In a k-gram index, the dictionary contains all k-grams that occur in any term in the vocabulary. Each postings list points from a k-gram to all vocabulary terms containing that k-gram.

◮ Figure 3.4 Example of a postings list in a 3-gram index. Here the 3-gram etr is illustrated, with postings beetroot, metric, petrify, retrieval. Matching vocabulary terms are lexicographically ordered in the postings.
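Constructing such an index over a small vocabulary can be sketched as follows (our function names; for simplicity we use raw terms without the $ boundary markers that some k-gram indexes add):

```python
def kgrams(term, k=3):
    """The set of k-character substrings of a term."""
    return {term[i:i + k] for i in range(len(term) - k + 1)}

def build_kgram_index(vocabulary, k=3):
    """Map each k-gram to the lexicographically ordered list of
    vocabulary terms containing it, as in Figure 3.4."""
    index = {}
    for term in sorted(vocabulary):
        for g in kgrams(term, k):
            index.setdefault(g, []).append(term)
    return index
```

Over the vocabulary of Figure 3.4, the postings list for etr comes out as beetroot, metric, petrify, retrieval, in lexicographic order.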
|------------
3.3.4 k-gram indexes for spelling correction

To further limit the set of vocabulary terms for which we compute edit distances to the query term, we now show how to invoke the k-gram index of Section 3.2.2 (page 54) to assist with retrieving vocabulary terms with low edit distance to the query q. Once we retrieve such terms, we can then find the ones of least edit distance from q.
|------------
In fact, we will use the k-gram index to retrieve vocabulary terms that have many k-grams in common with the query. We will argue that for reasonable definitions of “many k-grams in common,” the retrieval process is essentially that of a single scan through the postings for the k-grams in the query string q.
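The candidate-retrieval step can be sketched as follows. This is a simplified stand-in: it scans the vocabulary directly and counts shared k-grams, where a real implementation would instead merge the postings lists of the query's k-grams; the overlap threshold and function name are ours.

```python
def kgram_candidates(query, vocabulary, k=3, min_overlap=2):
    """Return, in sorted order, the vocabulary terms sharing at least
    min_overlap k-grams with the query string."""
    def kgrams(term):
        return {term[i:i + k] for i in range(len(term) - k + 1)}

    qgrams = kgrams(query)
    return sorted(term for term in vocabulary
                  if len(qgrams & kgrams(term)) >= min_overlap)
```

For the misspelling bord (3-grams bor, ord), border shares two 3-grams and survives the threshold, while lord (one shared 3-gram) and aboard (none) are filtered out before any edit distance is computed.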
***splits->
|------------
The map and reduce phases of MapReduce split up the computing job into chunks that standard machines can process in a short time. The various steps of MapReduce are shown in Figure 4.5 and an example on a collection consisting of two documents is shown in Figure 4.6. First, the input data, in our case a collection of web pages, are split into n splits where the size of the split is chosen to ensure that the work can be distributed evenly (chunks should not be too large) and efficiently (the total number of chunks we need to manage should not be too large); 16 or 64 MB are good sizes in distributed indexing. Splits are not preassigned to machines, but are instead assigned by the master node on an ongoing basis: As a machine finishes processing one split, it is assigned the next one. If a machine dies or becomes a laggard due to hardware problems, the split it is working on is simply reassigned to another machine.
|------------
The map phase of MapReduce consists of mapping splits of the input data to key-value pairs. This is the same parsing task we also encountered in BSBI and SPIMI, and we therefore call the machines that execute the map phase parsers. Each parser writes its output to local intermediate files, the segment files (shown as a-f, g-p, and q-z in Figure 4.5).
|------------
For the reduce phase, we want all values for a given key to be stored close together, so that they can be read and processed quickly. This is achieved by partitioning the keys into term ranges (a-f, g-p, and q-z in Figure 4.5) and having the parsers write the key-value pairs for each range into a separate segment file.
◮ Figure 4.5 An example of distributed indexing with MapReduce. Adapted from Dean and Ghemawat (2004).
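The map, partition, and reduce steps can be sketched in-process as follows. This is a toy single-machine illustration of the data flow only, not distributed code; the three key ranges, the whitespace tokenizer, and the function names are assumptions for the sketch.

```python
from collections import defaultdict

def map_phase(split):
    # Parse each document in the split into (term, docID) pairs.
    pairs = []
    for doc_id, text in split:
        for term in text.lower().split():
            pairs.append((term, doc_id))
    return pairs

def partition(pairs, ranges=("af", "gp", "qz")):
    # Write each pair to the segment file for its term's key range,
    # so all values for a given key end up close together.
    segments = {r: [] for r in ranges}
    for term, doc_id in pairs:
        for r in ranges:
            if r[0] <= term[0] <= r[1]:
                segments[r].append((term, doc_id))
                break
    return segments

def reduce_phase(segment):
    # Invert one key range: collect the sorted docIDs for each term.
    postings = defaultdict(list)
    for term, doc_id in sorted(segment):
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return dict(postings)
```

In real MapReduce each segment file would be processed by a separate inverter machine; here the three ranges simply become three dictionary entries.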
***exclusive clustering->
***URL normalization->
|------------
Next, a URL should be normalized in the following sense: often the HTML encoding of a link from a web page p indicates the target of that link relative to the page p. Thus, the relative link Disclaimers encoded in the HTML of the page en.wikipedia.org/wiki/Main_Page points to the URL http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer.
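One common way to implement this kind of normalization is to resolve the relative reference against the page's URL and then canonicalize the result. A minimal sketch using Python's standard library; the function name `normalize_url` and the particular canonicalization choices (lowercasing scheme and host, dropping fragments, defaulting an empty path to "/") are assumptions, not a complete normalization policy.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def normalize_url(base, href):
    # Resolve a (possibly relative) link against the page it appears on,
    # then lowercase scheme/host and drop any fragment.
    absolute = urljoin(base, href)
    parts = urlsplit(absolute)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))
```

A crawler would apply this to every extracted link before deciding whether the target URL has been seen before.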
***named entity tagging->
|------------
However, many structured data sources containing text are best modeled as structured documents rather than relational data. We call the search over such structured documents structured retrieval. Queries in structured retrieval can be either structured or unstructured, but we will assume in this chapter that the collection consists only of structured documents. Applications of structured retrieval include digital libraries, patent databases, blogs, text in which entities like persons and locations have been tagged (in a process called named entity tagging) and output from office suites like OpenOffice that save documents as marked up text. In all of these applications, we want to be able to run queries that combine textual criteria with structural criteria.
***linear interpolation->
|------------
Thus, we need to smooth probabilities in our document language models: to discount non-zero probabilities and to give some probability mass to unseen words. There is a wide space of approaches to smoothing probability distributions to deal with this problem. In Section 11.3.2 (page 226), we already discussed adding a number (1, 1/2, or a small α) to the observed counts and renormalizing to give a probability distribution. In this section we will mention a couple of other smoothing methods, which involve combining observed counts with a more general reference probability distribution. The general approach is that a non-occurring term should be possible in a query, but its probability should be somewhat close to, but no more likely than, would be expected by chance from the whole collection. That is, if tf_{t,d} = 0 then

P̂(t|M_d) ≤ cf_t / T

where cf_t is the raw count of the term in the collection, and T is the raw size (number of tokens) of the entire collection. A simple idea that works well in practice is to use a mixture between a document-specific multinomial distribution and a multinomial distribution estimated from the entire collection:

P̂(t|d) = λ P̂_mle(t|M_d) + (1 − λ) P̂_mle(t|M_c)    (12.10)

where 0 < λ < 1 and M_c is a language model built from the entire document collection. This mixes the probability from the document with the general collection frequency of the word. Such a model is referred to as a linear interpolation language model. Correctly setting λ is important to the good performance of this model.
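Equation (12.10) can be turned into a query-likelihood scorer in a few lines. A minimal sketch; the function name, argument layout (term-frequency dictionaries plus lengths), and the default λ = 0.5 are assumptions for illustration.

```python
def interpolated_lm_score(query_terms, doc_tf, doc_len, coll_cf, coll_len, lam=0.5):
    # Query likelihood under the linearly interpolated model (Eq. 12.10):
    # P(t|d) = lam * tf_{t,d}/L_d + (1 - lam) * cf_t/T,
    # multiplied over the query terms.
    score = 1.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len
        p_coll = coll_cf.get(t, 0) / coll_len
        score *= lam * p_doc + (1 - lam) * p_coll
    return score
```

Note that a term absent from the document still contributes a non-zero factor (1 − λ) cf_t/T, which is exactly the point of the interpolation.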
|------------
An alternative is to use a language model built from the whole collection as a prior distribution in a Bayesian updating process (rather than a uniform distribution, as we saw in Section 11.3.2). We then get the following equation:

P̂(t|d) = (tf_{t,d} + α P̂(t|M_c)) / (L_d + α)    (12.11)

Both of these smoothing methods have been shown to perform well in IR experiments; we will stick with the linear interpolation smoothing method for the rest of this section. While different in detail, they are both conceptually similar: in both cases the probability estimate for a word present in the document combines a discounted MLE and a fraction of the estimate of its prevalence in the whole collection, while for words not present in a document, the estimate is just a fraction of the estimate of the prevalence of the word in the whole collection.
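Equation (12.11) is a one-liner in code. A sketch only; the function name and the choice of α are assumptions (in practice α is tuned, and values in the hundreds or thousands are common for this style of smoothing).

```python
def bayesian_smoothed_prob(tf_td, doc_len, p_coll, alpha=2000):
    # Eq. (12.11): the collection model acts as a Bayesian prior
    # with pseudocount mass alpha; larger alpha trusts the collection more.
    return (tf_td + alpha * p_coll) / (doc_len + alpha)
```

For an unseen term (tf = 0) this reduces to α P̂(t|M_c)/(L_d + α), a fraction of the collection estimate, matching the conceptual description above.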
***single-link clustering->
|------------
17.2 Single-link and complete-link clustering
In single-link clustering or single-linkage clustering, the similarity of two clusters is the similarity of their most similar members (see Figure 17.3, (a)).3 This single-link merge criterion is local. We pay attention solely to the area where the two clusters come closest to each other. Other, more distant parts of the cluster and the clusters' overall structure are not taken into account.
|------------
Figure 17.4 depicts a single-link and a complete-link clustering of eight documents. The first four steps, each producing a cluster consisting of a pair of two documents, are identical. Then single-link clustering joins the upper two pairs (and after that the lower two pairs) because on the maximum-similarity definition of cluster similarity, those two clusters are closest. Complete-
3. Throughout this chapter, we equate similarity with proximity in 2D depictions of clustering.
***cluster hypothesis->
|------------
16.1 Clustering in information retrieval
The cluster hypothesis states the fundamental assumption we make when using clustering in information retrieval.
|------------
Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs.
|------------
The hypothesis states that if there is a document from a cluster that is relevant to a search request, then it is likely that other documents from the same cluster are also relevant. This is because clustering puts together documents that share many terms. The cluster hypothesis essentially is the contiguity hypothesis of Chapter 14 applied to retrieval.

Application | What is clustered? | Benefit | Example
Search result clustering | search results | more effective information presentation to user | Figure 16.2
Scatter-Gather | (subsets of) collection | alternative user interface: “search without typing” | Figure 16.3
Collection clustering | collection | effective information presentation for exploratory browsing | McKeown et al. (2002), http://news.google.com
Language modeling | collection | increased precision and/or recall | Liu and Croft (2004)
Cluster-based retrieval | collection | higher efficiency: faster search | Salton (1971a)
◮ Table 16.1 Some applications of clustering in information retrieval.
***active learning->
|------------
Often, the practical answer is to work out how to get more labeled data as quickly as you can. The best way to do this is to insert yourself into a process where humans will be willing to label data for you as part of their natural tasks. For example, in many cases humans will sort or route email for their own purposes, and these actions give information about classes. The alternative of getting human labelers expressly for the task of training classifiers is often difficult to organize, and the labeling is often of lower quality, because the labels are not embedded in a realistic task context. Rather than getting people to label all or a random sample of documents, there has also been considerable research on active learning, where a system is built which decides which documents a human should label. Usually these are the ones on which a classifier is uncertain of the correct classification. This can be effective in reducing annotation costs by a factor of 2–4, but has the problem that the good documents to label to train one type of classifier often are not the good documents to label to train a different type of classifier.
***Bayes Optimal Decision Rule->
***Bayesian prior->
|------------
This is referred to as the relative frequency of the event. Estimating the probability as the relative frequency is the maximum likelihood estimate (or MLE), because this value makes the observed data maximally likely. However, if we simply use the MLE, then the probability given to events we happened to see is usually too high, whereas other events may be completely unseen and giving them as a probability estimate their relative frequency of 0 is both an underestimate, and normally breaks our models, since anything multiplied by 0 is 0. Simultaneously decreasing the estimated probability of seen events and increasing the probability of unseen events is referred to as smoothing. One simple way of smoothing is to add a number α to each of the observed counts. These pseudocounts correspond to the use of a uniform distribution over the vocabulary as a Bayesian prior, following Equation (11.4). We initially assume a uniform distribution over events, where the size of α denotes the strength of our belief in uniformity, and we then update the probability based on observed events. Since our belief in uniformity is weak, we use α = 1/2. This is a form of maximum a posteriori (MAP) estimation, where we choose the most likely point value for probabilities based on the prior and the observed evidence, following Equation (11.4). We will further discuss methods of smoothing estimated counts to give probability models in Section 12.2.2 (page 243); the simple method of adding 1/2 to each observed count will do for now.
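Add-α smoothing as described above can be sketched directly. The function name `map_estimate` and the dictionary-based interface are assumptions; the arithmetic follows the text (add α to each count, renormalize over the vocabulary).

```python
def map_estimate(counts, alpha=0.5):
    # Add-alpha smoothing: pseudocounts act as a uniform Bayesian prior.
    # counts maps each vocabulary item to its observed count (possibly 0).
    vocab_size = len(counts)
    total = sum(counts.values()) + alpha * vocab_size
    return {t: (c + alpha) / total for t, c in counts.items()}
```

With α = 1/2 an unseen event gets probability (1/2)/total rather than 0, while the probabilities still sum to 1.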
|------------
4. We reestimate pt and ut on the basis of known relevant and nonrelevant documents. If the sets VR and VNR are large enough, we may be able to estimate these quantities directly from these documents as maximum likelihood estimates:

p_t = |VR_t| / |VR|    (11.23)

(where VR_t is the set of documents in VR containing x_t). In practice, we usually need to smooth these estimates. We can do this by adding 1/2 to both the count |VR_t| and to the number of relevant documents not containing the term, giving:

p_t = (|VR_t| + 1/2) / (|VR| + 1)    (11.24)

However, the set of documents judged by the user (V) is usually very small, and so the resulting statistical estimate is quite unreliable (noisy), even if the estimate is smoothed. So it is often better to combine the new information with the original guess in a process of Bayesian updating. In this case we have:

p_t^(k+1) = (|VR_t| + κ p_t^(k)) / (|VR| + κ)    (11.25)

Here p_t^(k) is the k-th estimate for p_t in an iterative updating process and is used as a Bayesian prior in the next iteration with a weighting of κ.
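The update in (11.25) is a single weighted average per iteration. A sketch; the function name `update_pt` and the choice of κ are assumptions (κ controls how strongly the previous estimate acts as a prior).

```python
def update_pt(prev_pt, vr_t, vr, kappa=5.0):
    # Eq. (11.25): blend new relevance evidence (vr_t out of vr judged
    # relevant documents contain the term) with the previous estimate,
    # which acts as a prior weighted by kappa.
    return (vr_t + kappa * prev_pt) / (vr + kappa)
```

When |VR| is small the prior dominates; as more documents are judged, the observed fraction |VR_t|/|VR| takes over.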
***noise feature->
|------------
13.5 Feature selection
Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification. Feature selection serves two main purposes. First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary. This is of particular importance for classifiers that, unlike NB, are expensive to train. Second, feature selection often increases classification accuracy by eliminating noise features. A noise feature is one that, when added to the document representation, increases the classification error on new data. Suppose a rare term, say arachnocentric, has no information about a class, say China, but all instances of arachnocentric happen to occur in China documents in our training set. Then the learning method might produce a classifier that misassigns test documents containing arachnocentric to China.
***topical relevance->
|------------
Since CAS queries have both structural and content criteria, relevance assessments are more complicated than in unstructured retrieval. INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance. The component coverage dimension evaluates whether the element retrieved is “structurally” correct, i.e., neither too low nor too high in the tree. We distinguish four cases: • Exact coverage (E). The information sought is the main topic of the component and the component is a meaningful unit of information.
|------------
The topical relevance dimension also has four levels: highly relevant (3), fairly relevant (2), marginally relevant (1) and nonrelevant (0). Components are judged on both dimensions and the judgments are then combined into a digit-letter code. 2S is a fairly relevant component that is too small and 3E is a highly relevant component that has exact coverage. In theory, there are 16 combinations of coverage and relevance, but many cannot occur. For example, a nonrelevant component cannot have exact coverage, so the combination 0E is not possible.
***smoothing->
|------------
6.4.2 Maximum tf normalization
One well-studied technique is to normalize the tf weights of all terms occurring in a document by the maximum tf in that document. For each document d, let tf_max(d) = max_{τ∈d} tf_{τ,d}, where τ ranges over all terms in d. Then, we compute a normalized term frequency for each term t in document d by

ntf_{t,d} = a + (1 − a) tf_{t,d} / tf_max(d),    (6.15)

where a is a value between 0 and 1 and is generally set to 0.4, although some early work used the value 0.5. The term a in (6.15) is a smoothing term whose role is to damp the contribution of the second term, which may be viewed as a scaling down of tf by the largest tf value in d. We will encounter smoothing further in Chapter 13 when discussing classification; the basic idea is to avoid a large swing in ntf_{t,d} from modest changes in tf_{t,d} (say from 1 to 2). The main idea of maximum tf normalization is to mitigate the following anomaly: we observe higher term frequencies in longer documents, merely because longer documents tend to repeat the same words over and over again. To appreciate this, consider the following extreme example: suppose we were to take a document d and create a new document d′ by simply appending a copy of d to itself. While d′ should be no more relevant to any query than d is, the use of (6.9) would assign it twice as high a score as d. Replacing tf-idf_{t,d} in (6.9) by ntf-idf_{t,d} eliminates the anomaly in this example. Maximum tf normalization does suffer from the following issues:
1. The method is unstable in the following sense: a change in the stop word list can dramatically alter term weightings (and therefore ranking). Thus, it is hard to tune.
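Equation (6.15) and the doubled-document anomaly it fixes are easy to check in code. A minimal sketch; the function name `ntf` is an assumption.

```python
def ntf(tf, tf_max, a=0.4):
    # Eq. (6.15): normalize a term's tf by the largest tf in the document,
    # with smoothing term a damping the contribution of the ratio.
    return a + (1 - a) * tf / tf_max
```

Appending a copy of d to itself doubles every tf and also doubles tf_max, so every ntf value is unchanged, which is exactly the behavior the anomaly discussion calls for.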
|------------
11.3.2 Probability estimates in theory
For each term t, what would these c_t numbers look like for the whole collection? (11.19) gives a contingency table of counts of documents in the collection, where df_t is the number of documents that contain term t:

(11.19)
                        | relevant | nonrelevant             | Total
Term present (x_t = 1)  | s        | df_t − s                | df_t
Term absent (x_t = 0)   | S − s    | (N − df_t) − (S − s)    | N − df_t
Total                   | S        | N − S                   | N

Using this, p_t = s/S and u_t = (df_t − s)/(N − S), and

c_t = K(N, df_t, S, s) = log [ (s/(S − s)) / ((df_t − s)/((N − df_t) − (S − s))) ]    (11.20)

To avoid the possibility of zeroes (such as if every or no relevant document has a particular term) it is fairly standard to add 1/2 to each of the quantities in the center 4 terms of (11.19), and then to adjust the marginal counts (the totals) accordingly (so, the bottom right cell totals N + 2). Then we have:

ĉ_t = K(N, df_t, S, s) = log [ ((s + 1/2)/(S − s + 1/2)) / ((df_t − s + 1/2)/(N − df_t − S + s + 1/2)) ]    (11.21)

Adding 1/2 in this way is a simple form of smoothing. For trials with categorical outcomes (such as noting the presence or absence of a term), one way to estimate the probability of an event from data is simply to count the number of times an event occurred divided by the total number of trials.
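The smoothed weight (11.21) translates directly into code. A sketch; the function name `rsj_weight` is an assumption, and natural log is used (any fixed log base only rescales the ranking).

```python
import math

def rsj_weight(N, df, S, s):
    # Eq. (11.21): log odds ratio of term presence in relevant vs.
    # nonrelevant documents, with 1/2 added to each cell for smoothing.
    # N: collection size, df: document frequency of the term,
    # S: number of relevant documents, s: relevant documents containing the term.
    return math.log(((s + 0.5) / (S - s + 0.5)) /
                    ((df - s + 0.5) / (N - df - S + s + 0.5)))
```

The +1/2 terms guarantee the weight is finite even when s = 0 or s = S.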
|------------
This is referred to as the relative frequency of the event. Estimating the probability as the relative frequency is the maximum likelihood estimate (or MLE), because this value makes the observed data maximally likely. However, if we simply use the MLE, then the probability given to events we happened to see is usually too high, whereas other events may be completely unseen and giving them as a probability estimate their relative frequency of 0 is both an underestimate, and normally breaks our models, since anything multiplied by 0 is 0. Simultaneously decreasing the estimated probability of seen events and increasing the probability of unseen events is referred to as smoothing. One simple way of smoothing is to add a number α to each of the observed counts. These pseudocounts correspond to the use of a uniform distribution over the vocabulary as a Bayesian prior, following Equation (11.4). We initially assume a uniform distribution over events, where the size of α denotes the strength of our belief in uniformity, and we then update the probability based on observed events. Since our belief in uniformity is weak, we use α = 1/2. This is a form of maximum a posteriori (MAP) estimation, where we choose the most likely point value for probabilities based on the prior and the observed evidence, following Equation (11.4). We will further discuss methods of smoothing estimated counts to give probability models in Section 12.2.2 (page 243); the simple method of adding 1/2 to each observed count will do for now.
***cross-language information retrieval,->
***filtering->
|------------
To capture the generality and scope of the problem space to which standing queries belong, we now introduce the general notion of a classification problem. Given a set of classes, we seek to determine which class(es) a given object belongs to. In the example, the standing query serves to divide new newswire articles into the two classes: documents about multicore computer chips and documents not about multicore computer chips. We refer to this as two-class classification. Classification using standing queries is also called routing or filtering and will be discussed further in Section 15.3.1 (page 335). A class need not be as narrowly focused as the standing query multicore computer chips. Often, a class is a more general subject area like China or coffee.
|------------
Such more general classes are usually referred to as topics, and the classification task is then called text classification, text categorization, topic classification, or topic spotting. An example for China appears in Figure 13.1. Standing queries and topics differ in their degree of specificity, but the methods for solving routing, filtering, and text classification are essentially the same. We therefore include routing and filtering under the rubric of text classification in this and the following chapters.
|------------
14.7 References and further reading
As discussed in Chapter 9, Rocchio relevance feedback is due to Rocchio (1971). Joachims (1997) presents a probabilistic analysis of the method. Rocchio classification was widely used as a classification method in TREC in the 1990s (Buckley et al. 1994a;b, Voorhees and Harman 2005). Initially, it was used as a form of routing. Routing merely ranks documents according to relevance to a class without assigning them. Early work on filtering, a true classification approach that makes an assignment decision on each document, was published by Ittner et al. (1995) and Schapire et al. (1998). The definition of routing we use here should not be confused with another sense. Routing can also refer to the electronic distribution of documents to subscribers, the so-called push model of document distribution. In a pull model, each transfer of a document to the user is initiated by the user – for example, by means of search or by selecting it from a list of documents on a news aggregation website.
***feature selection->
|------------
SELECTFEATURES(D, c, k)
1  V ← EXTRACTVOCABULARY(D)
2  L ← []
3  for each t ∈ V
4  do A(t, c) ← COMPUTEFEATUREUTILITY(D, t, c)
5     APPEND(L, 〈A(t, c), t〉)
6  return FEATURESWITHLARGESTVALUES(L, k)
◮ Figure 13.6 Basic feature selection algorithm for selecting the k best features.
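The pseudocode above can be rendered as a short Python sketch with a pluggable utility measure A(t, c). The interface (documents as (label, tokens) pairs) and the simple frequency-based utility are assumptions for illustration; Chapter 13 discusses principled utilities such as mutual information and χ².

```python
def select_features(docs, c, k, utility):
    # docs: list of (class_label, token_list) pairs; c: target class.
    # Score every vocabulary term and keep the k highest-utility terms.
    vocabulary = {t for _, tokens in docs for t in tokens}
    scored = [(utility(docs, t, c), t) for t in vocabulary]
    scored.sort(reverse=True)
    return [t for _, t in scored[:k]]

def frequency_utility(docs, t, c):
    # A toy A(t, c): how often t occurs in documents of class c.
    return sum(tokens.count(t) for label, tokens in docs if label == c)
```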
|------------
13.5 Feature selection
Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification. Feature selection serves two main purposes. First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary. This is of particular importance for classifiers that, unlike NB, are expensive to train. Second, feature selection often increases classification accuracy by eliminating noise features. A noise feature is one that, when added to the document representation, increases the classification error on new data. Suppose a rare term, say arachnocentric, has no information about a class, say China, but all instances of arachnocentric happen to occur in China documents in our training set. Then the learning method might produce a classifier that misassigns test documents containing arachnocentric to China.
|------------
Such an incorrect generalization from an accidental property of the training set is called overfitting. We can view feature selection as a method for replacing a complex classifier (using all features) with a simpler one (using a subset of the features).
***average-link clustering->
|------------
17.3 Group-average agglomerative clustering
Group-average agglomerative clustering or GAAC (see Figure 17.3, (d)) evaluates cluster quality based on all similarities between documents, thus avoiding the pitfalls of the single-link and complete-link criteria, which equate cluster similarity with the similarity of a single pair of documents. GAAC is also called group-average clustering and average-link clustering. GAAC computes the average similarity SIM-GA of all pairs of documents, including pairs from the same cluster. But self-similarities are not included in the average:

SIM-GA(ω_i, ω_j) = [1 / ((N_i + N_j)(N_i + N_j − 1))] Σ_{d_m ∈ ω_i ∪ ω_j} Σ_{d_n ∈ ω_i ∪ ω_j, d_n ≠ d_m} d⃗_m · d⃗_n    (17.1)

where d⃗ is the length-normalized vector of document d, · denotes the dot product, and N_i and N_j are the number of documents in ω_i and ω_j, respectively.
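Equation (17.1) can be computed naively as a double loop over the pooled documents. A sketch only (the efficient form using cluster vector sums is what one would use in practice); documents are assumed to be length-normalized vectors given as tuples, and the function name is an assumption.

```python
def sim_ga(cluster_i, cluster_j):
    # Eq. (17.1): average dot product over all ordered pairs of distinct
    # documents in the union of the two clusters (self-similarities excluded).
    docs = list(cluster_i) + list(cluster_j)
    n = len(docs)
    total = 0.0
    for m in range(n):
        for l in range(n):
            if l != m:
                total += sum(a * b for a, b in zip(docs[m], docs[l]))
    return total / (n * (n - 1))
```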
***one-of classification->
|------------
This setup is called one-of classification (Section 14.5, page 306). Show that in one-of classification (i) the total number of false positive decisions equals the total number of false negative decisions and (ii) microaveraged F1 and accuracy are identical.
|------------
The second type of classification with more than two classes is one-of classification. Here, the classes are mutually exclusive. Each document must belong to exactly one of the classes. One-of classification is also called multinomial, polytomous4, multiclass, or single-label classification. Formally, there is a single classification function γ in one-of classification whose range is C, i.e., γ(d) ∈ {c1, . . . , cJ}. kNN is a (nonlinear) one-of classifier.
|------------
J hyperplanes do not divide R^|V| into J distinct regions as illustrated in Figure 14.12. Thus, we must use a combination method when using two-class linear classifiers for one-of classification. The simplest method is to
4. A synonym of polytomous is polychotomous.
***margin->
|------------
◮ Figure 15.1 The support vectors are the 5 points right up against the margin of the classifier. (The figure shows two classes of points, the maximum margin decision hyperplane, and the maximized margin.)
|------------
15.1 Support vector machines: The linearly separable case
For two-class, separable training data sets, such as the one in Figure 14.8 (page 301), there are lots of possible linear separators. Intuitively, a decision boundary drawn in the middle of the void between data items of the two classes seems better than one which approaches very close to examples of one or both classes. While some learning methods such as the perceptron algorithm (see references in Section 14.7, page 314) find just any linear separator, others, like Naive Bayes, search for the best linear separator according to some criterion. The SVM in particular defines the criterion to be looking for a decision surface that is maximally far away from any data point. This distance from the decision surface to the closest data point determines the margin of the classifier. This method of construction necessarily means that the decision function for an SVM is fully specified by a (usually small) subset of the data which defines the position of the separator. These points are referred to as the support vectors (in a vector space, a point can be thought of as a vector between the origin and that point). Figure 15.1 shows the margin and support vectors for a sample problem. Other data points play no part in determining the decision surface that is chosen.
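The margin of a given separator can be computed directly: it is the smallest distance |w⃗·x⃗ + b|/|w⃗| over the training points. A sketch for checking candidate separators, not an SVM trainer; the function name is an assumption.

```python
def geometric_margin(w, b, points):
    # Distance from the separating hyperplane w.x + b = 0 to the closest
    # point; the SVM chooses w, b to maximize this quantity.
    norm = sum(wi * wi for wi in w) ** 0.5
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
               for x in points) / norm
```

The points achieving this minimum are exactly the support vectors of that separator.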
***compound-splitter->
|------------
Other languages make the problem harder in new ways. German writes compound nouns without spaces (e.g., Computerlinguistik ‘computational linguistics’; Lebensversicherungsgesellschaftsangestellter ‘life insurance company employee’). Retrieval systems for German greatly benefit from the use of a compound-splitter module, which is usually implemented by seeing if a word can be subdivided into multiple words that appear in a vocabulary. This phenomenon reaches its limit case with major East Asian languages (e.g., Chinese, Japanese, Korean, and Thai), where text is written without any spaces between words. An example is shown in Figure 2.3. One approach here is to perform word segmentation as prior linguistic processing. Methods of word segmentation vary from having a large vocabulary and taking the longest vocabulary match with some heuristics for unknown words to the use of machine learning sequence models, such as hidden Markov models or conditional random fields, trained over hand-segmented words (see the references).
◮ Figure 2.3 The standard unsegmented form of Chinese text using the simplified characters of mainland China. There is no whitespace between words, not even between sentences; the apparent space after the Chinese period (◦) is just a typographical illusion caused by placing the character on the left side of its square box. The first sentence is just words in Chinese characters with no spaces between them. The second and third sentences include Arabic numerals and punctuation breaking up the Chinese characters.
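The greedy longest-vocabulary-match approach mentioned above can be sketched as follows. This is a toy illustration only; the fallback of emitting a single character for unknown material stands in for the "heuristics for unknown words", and the function name and `max_len` cap are assumptions.

```python
def segment_longest_match(text, vocabulary, max_len=6):
    # Greedy left-to-right segmentation: at each position take the longest
    # substring (up to max_len) found in the vocabulary; if nothing matches,
    # fall back to a single-character token.
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocabulary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens
```

Greedy matching is fast but can commit to a wrong split, which is why sequence models trained on hand-segmented text tend to perform better.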
***hierarchical classification->
|------------
Most large sets of categories have a hierarchical structure, and attempting to exploit the hierarchy by doing hierarchical classification is a promising approach. However, at present the effectiveness gains from doing this rather than just working with the classes that are the leaves of the hierarchy remain modest. But the technique can be very useful simply to improve the scalability of building classifiers over large hierarchies. Another simple way to improve the scalability of classifiers over large hierarchies is the use of aggressive feature selection. We provide references to some work on hierarchical classification in Section 15.5.
|------------
A number of approaches to hierarchical classification have been developed in order to deal with the common situation where the classes to be assigned have a natural hierarchical organization (Koller and Sahami 1997, McCallum et al. 1998, Weigend et al. 1999, Dumais and Chen 2000). In a recent large study on scaling SVMs to the entire Yahoo! directory, Liu et al. (2005) conclude that hierarchical classification noticeably if still modestly outperforms flat classification. Classifier effectiveness remains limited by the very small number of training documents for many classes. For a more general approach that can be applied to modeling relations between classes, which may be arbitrary rather than simply the case of a hierarchy, see Tsochantaridis et al. (2005).
***results snippets->
|------------
The resulting stream of tokens feeds into two modules. First, we retain a copy of each parsed document in a document cache. This will enable us to generate results snippets: snippets of text accompanying each document in the results list for a query. This snippet tries to give a succinct explanation to the user of why the document matches the query. The automatic generation of such snippets is the subject of Section 8.7. A second copy of the tokens is fed to a bank of indexers that create a bank of indexes including zone and field indexes that store the metadata for each document,
◮ Figure 7.5 A complete search system. Data paths are shown primarily for a free text query.
***Rocchio algorithm->
|------------
The Rocchio (1971) algorithm. This was the relevance feedback mechanism
1. In the equation, arg max_x f(x) returns a value of x which maximizes the value of the function f(x). Similarly, arg min_x f(x) returns a value of x which minimizes the value of the function f(x).
***CO topics->
|------------
Exercise 10.3 How many structural terms does the document in Figure 10.1 yield?

10.4 Evaluation of XML retrieval
The premier venue for research on XML retrieval is the INEX (INitiative for the Evaluation of XML retrieval) program, a collaborative effort that has produced reference collections, sets of queries, and relevance judgments. A yearly INEX meeting is held to present and discuss research results.

12,107     number of documents
494 MB     size
1995–2002  time of publication of articles
1,532      average number of XML nodes per document
6.9        average depth of a node
30         number of CAS topics
30         number of CO topics
◮ Table 10.2 INEX 2002 collection statistics.
|------------
Two types of information needs or topics in INEX are content-only or CO topics and content-and-structure (CAS) topics. CO topics are regular keyword queries as in unstructured information retrieval. CAS topics have structural constraints in addition to keywords. We already encountered an example of a CAS topic in Figure 10.3. The keywords in this case are summer and holidays and the structural constraints specify that the keywords occur in a section that in turn is part of an article and that this article has an embedded year attribute with value 2001 or 2002.
***multinomial distribution->
|------------
12.1.3 Multinomial distributions over words
Under the unigram language model the order of words is irrelevant, and so such models are often called “bag of words” models, as discussed in Chapter 6 (page 117). Even though there is no conditioning on preceding context, this model nevertheless still gives the probability of a particular ordering of terms. However, any other ordering of this bag of terms will have the same probability. So, really, we have a multinomial distribution over words. So long as we stick to unigram models, the language model name and motivation could be viewed as historical rather than necessary. We could instead just refer to the model as a multinomial model. From this perspective, the equations presented above do not present the multinomial probability of a bag of words, since they do not sum over all possible orderings of those words, as is done by the multinomial coefficient (the first term on the right-hand side) in the standard presentation of a multinomial model:

P(d) = [L_d! / (tf_{t1,d}! tf_{t2,d}! ··· tf_{tM,d}!)] P(t1)^{tf_{t1,d}} P(t2)^{tf_{t2,d}} ··· P(tM)^{tf_{tM,d}}    (12.7)

Here, L_d = Σ_{1≤i≤M} tf_{ti,d} is the length of document d, M is the size of the term vocabulary, and the products are now over the terms in the vocabulary, not the positions in the document. However, just as with STOP probabilities, in practice we can also leave out the multinomial coefficient in our calculations, since, for a particular bag of words, it will be a constant, and so it has no effect on the likelihood ratio of two different models generating a particular bag of words. Multinomial distributions also appear in Section 13.2 (page 258).
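Equation (12.7) can be evaluated directly from a bag of term frequencies. A sketch; the function name and dictionary interface are assumptions.

```python
from math import factorial

def multinomial_prob(tf, probs):
    # Eq. (12.7): P(d) = L_d! / (prod_t tf_t!) * prod_t P(t)^tf_t
    # tf maps each term to its count in the bag; probs maps terms to P(t).
    ld = sum(tf.values())
    coeff = factorial(ld)
    p = 1.0
    for t, f in tf.items():
        coeff //= factorial(f)   # build the multinomial coefficient
        p *= probs[t] ** f
    return coeff * p
```

Dropping `coeff` gives the ordering-specific probability used in the rest of the chapter; as the text notes, the coefficient is constant for a fixed bag and cancels in likelihood ratios.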
***term normalization->
|------------
An alternative to creating equivalence classes is to maintain relations between unnormalized tokens. This method can be extended to hand-constructed lists of synonyms such as car and automobile, a topic we discuss further in Chapter 9. These term relationships can be achieved in two ways. The usual way is to index unnormalized tokens and to maintain a query expansion list of multiple vocabulary entries to consider for a certain query term. A query term is then effectively a disjunction of several postings lists. The alternative is to perform the expansion during index construction. When the document contains automobile, we index it under car as well (and, usually, also vice-versa). Use of either of these methods is considerably less efficient than equivalence classing, as there are more postings to store and merge. The first

4. It is also often referred to as term normalization, but we prefer to reserve the name term for the output of the normalization process.
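A minimal sketch of the query-time approach: a query term is treated as a disjunction (union) of the postings lists of its expansion set. The synonym list and postings below are invented for illustration.

```python
# Hypothetical postings lists (sorted docIDs) and a hand-constructed synonym list.
postings = {
    "car": [1, 4, 7],
    "automobile": [2, 4, 9],
}
expansions = {"car": ["car", "automobile"]}

def expanded_postings(term):
    """Union of the postings lists for the term and its expansions."""
    docs = set()
    for t in expansions.get(term, [term]):
        docs.update(postings.get(t, []))
    return sorted(docs)

print(expanded_postings("car"))   # [1, 2, 4, 7, 9]
```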
***kappa statistic->
|------------
                    Judge 2 relevance
                    Yes    No    Total
Judge 1      Yes    300    20     320
relevance    No      10    70      80
             Total  310    90     400

Observed proportion of the times the judges agreed:
P(A) = (300 + 70)/400 = 370/400 = 0.925
Pooled marginals:
P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125
P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875
Probability that the two judges agreed by chance:
P(E) = P(nonrelevant)^2 + P(relevant)^2 = 0.2125^2 + 0.7875^2 = 0.665
Kappa statistic:
κ = (P(A) − P(E))/(1 − P(E)) = (0.925 − 0.665)/(1 − 0.665) = 0.776

◮ Table 8.2 Calculating the kappa statistic.
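The calculation in Table 8.2, expressed as a small Python function (the contingency counts are taken directly from the table):

```python
def kappa(yes_yes, yes_no, no_yes, no_no):
    """Kappa statistic for two judges making yes/no relevance judgments,
    using pooled marginals as in Table 8.2."""
    n = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / n                      # P(A)
    # Pooled marginals: each judge contributes n judgments.
    p_rel = ((yes_yes + yes_no) + (yes_yes + no_yes)) / (2 * n)
    p_nonrel = 1 - p_rel
    p_chance = p_rel ** 2 + p_nonrel ** 2                # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

print(round(kappa(300, 20, 10, 70), 3))   # 0.776
```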
|------------
Nevertheless, it is interesting to consider and measure how much agreement between judges there is on relevance judgments. In the social sciences, a common measure for agreement between judges is the kappa statistic. It is designed for categorical judgments and corrects a simple agreement rate for the rate of chance agreement.
|------------
The kappa statistic and its use for language-related purposes is discussed by Carletta (1996). Many standard sources (e.g., Siegel and Castellan 1988) present pooled calculation of the expected agreement, but Di Eugenio and Glass (2004) argue for preferring the unpooled agreement (though perhaps presenting multiple measures). For further discussion of alternative measures of agreement, which may in fact be better, see Lombard et al. (2002) and Krippendorff (2003).
***segment file->
|------------
The map phase of MapReduce consists of mapping splits of the input data to key-value pairs. This is the same parsing task we also encountered in BSBI and SPIMI, and we therefore call the machines that execute the map phase parsers. Each parser writes its output to local intermediate files, the segment files (shown as a-f, g-p, and q-z in Figure 4.5).
|------------
For the reduce phase, we want all values for a given key to be stored close together, so that they can be read and processed quickly. This is achieved by partitioning the keys into term partitions and having each parser write the key-value pairs for each term partition into a separate segment file (the partitions a-f, g-p, and q-z in Figure 4.5).

◮ Figure 4.5 An example of distributed indexing with MapReduce. Adapted from Dean and Ghemawat (2004).
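A toy single-machine sketch of the two phases (the documents, the three-way partition, and the helper names are invented for illustration; real MapReduce distributes the segment files and inverters across machines):

```python
from collections import defaultdict

# Hypothetical document collection: docID -> text.
docs = {1: "caesar died", 2: "brutus killed caesar"}

def partition(term):
    """Assign a term to one of three segment files by its first letter."""
    if term[0] <= "f":
        return "a-f"
    if term[0] <= "p":
        return "g-p"
    return "q-z"

# Map phase: parsers emit (term, docID) pairs into segment files.
segments = defaultdict(list)
for doc_id, text in docs.items():
    for term in text.split():
        segments[partition(term)].append((term, doc_id))

# Reduce phase: inverters collect all values for each key into a postings list.
index = {}
for seg in segments.values():
    for term, doc_id in sorted(seg):
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)

print(index["caesar"])   # [1, 2]
```

Because each term partition lives in its own set of segment files, a single inverter can build the complete postings lists for its key range without communicating with the others.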
***clickthrough log analysis->
|------------
The most common version of this is A/B testing, a term borrowed from the advertising industry. For such a test, precisely one thing is changed between the current system and a proposed system, and a small proportion of traffic (say, 1-10% of users) is randomly directed to the variant system, while most users use the current system. For example, if we wish to investigate a change to the ranking algorithm, we redirect a random sample of users to a variant system and evaluate measures such as the frequency with which people click on the top result, or any result on the first page. (This particular analysis method is referred to as clickthrough log analysis or clickstream mining. It is further discussed as a method of implicit feedback in Section 9.1.7 (page 187).) The basis of A/B testing is running a series of single-variable tests (either in sequence or in parallel): for each test only one parameter is varied from the control (the current live system). It is therefore easy to see whether varying each parameter has a positive or negative effect. Such testing of a live system can easily and cheaply gauge the effect of a change on users, and, with a large enough user base, it is practical to measure even very small positive and negative effects. In principle, more analytic power can be achieved by varying multiple things at once in an uncorrelated (random) way, and doing standard multivariate statistical analysis, such as multiple linear regression.
***language model->
|------------
12.1 Language models

12.1.1 Finite automata and language models

What do we mean by a document model generating a query? A traditional generative model of a language, of the kind familiar from formal language theory, can be used either to recognize or to generate strings. For example, the finite automaton shown in Figure 12.1 can generate strings that include the examples shown. The full set of strings that can be generated is called the language of the automaton.1 Example strings generated by the automaton of Figure 12.1 include: I wish, I wish I wish, I wish I wish I wish, and so on.
|------------
◮ Figure 12.2 A one-state finite automaton that acts as a unigram language model.
|------------
If instead each node has a probability distribution over generating different terms, we have a language model. The notion of a language model is inherently probabilistic. A language model is a function that puts a probability measure over strings drawn from some vocabulary. That is, for a language model M over an alphabet Σ:

\sum_{s \in \Sigma^*} P(s) = 1   (12.1)

One simple kind of language model is equivalent to a probabilistic finite automaton consisting of just a single node with a single probability distribution over producing different terms, so that \sum_{t \in V} P(t) = 1, as shown in Figure 12.2. After generating each word, we decide whether to stop or to loop around and then produce another word, and so the model also requires a probability of stopping in the finishing state. Such a model places a probability distribution over any sequence of words. By construction, it also provides a model for generating text according to its distribution.
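A sketch of the one-state model of Figure 12.2, under the assumption that after each word the model stops with probability p_stop and otherwise continues (the two-term vocabulary and all probabilities are invented):

```python
from itertools import product

def string_prob(words, p_term, p_stop):
    """P(s) for the one-state unigram model: emit each word, continue
    with probability (1 - p_stop) between words, then stop at the end."""
    prob = p_stop * (1 - p_stop) ** (len(words) - 1)
    for w in words:
        prob *= p_term[w]
    return prob

p_term = {"frog": 0.5, "toad": 0.5}   # hypothetical one-state distribution
p_stop = 0.2

# Summing P(s) over all strings of length 1..3 gives 1 - (1 - p_stop)^3,
# which approaches 1 as the length bound grows, as Equation (12.1) requires.
total = sum(string_prob(s, p_term, p_stop)
            for n in (1, 2, 3)
            for s in product(p_term, repeat=n))
print(round(total, 6))   # 0.488
```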
***parameter tuning->
|------------
The effectiveness of Rocchio classification and kNN is highly dependent on careful parameter tuning (in particular, the parameters b′ for Rocchio on page 296 and k for kNN), feature engineering (Section 15.3, page 334) and feature selection (Section 13.5, page 271). Buckley and Salton (1995), Schapire et al. (1998), Yang and Kisiel (2003) and Moschitti (2003) address these issues for Rocchio and Yang (2001) and Ault and Yang (2002) for kNN. Zavrel et al.
***in relevance feedback->
***development set->
|------------
In a clean statistical text classification experiment, you should never run any program on or even look at the test set while developing a text classification system. Instead, set aside a development set for testing while you develop your method. When such a set serves the primary purpose of finding a good value for a parameter, for example, the number of selected features, then it is also called held-out data. Train the classifier on the rest of the training set with different parameter values, and then select the value that gives best results on the held-out part of the training set. Ideally, at the very end, when all parameters have been set and the method is fully specified, you run one final experiment on the test set and publish the results. Because no informa-

◮ Table 13.10 Data for parameter estimation exercise.
***novelty detection->
|------------
Second, in some applications the purpose of clustering is not to create a complete hierarchy or exhaustive partition of the entire document set. For instance, first story detection or novelty detection is the task of detecting the first occurrence of an event in a stream of news stories. One approach to this task is to find a tight cluster within the documents that were sent across the wire in a short period of time and are dissimilar from all previous documents. For example, the documents sent over the wire in the minutes after the World Trade Center attack on September 11, 2001 form such a cluster. Variations of single-link clustering can do well on this task since it is the structure of small parts of the vector space – and not global structure – that is important in this case.
***dictionary->
|------------
We keep a dictionary of terms (sometimes also referred to as a vocabulary or lexicon; in this book, we use dictionary for the data structure and vocabulary for the set of terms). Then for each term, we have a list that records which documents the term occurs in. Each item in the list – which records that a term appeared in a document (and, later, often, the positions in the document) – is conventionally called a posting.4 The list is then called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings. The dictionary in Figure 1.3 has been sorted alphabetically and each postings list is sorted by document ID. We will see why this is useful in Section 1.3, below, but later we will also consider alternatives to doing this (Section 7.1.5).
|------------
◮ Figure 1.3 The two parts of an inverted index. The dictionary is commonly kept in memory, with pointers to each postings list, which is stored on disk.
|------------
4. Index the documents that each term occurs in by creating an inverted in- dex, consisting of a dictionary and postings.
|------------
Within a document collection, we assume that each document has a unique serial number, known as the document identifier (docID). During index construction, we can simply assign successive integers to each new document when it is first encountered. The input to indexing is a list of normalized tokens for each document, which we can equally think of as a list of pairs of term and docID, as in Figure 1.4. The core indexing step is sorting this list so that the terms are alphabetical, giving us the representation in the middle column of Figure 1.4. Multiple occurrences of the same term from the same document are then merged.5 Instances of the same term are then grouped, and the result is split into a dictionary and postings, as shown in the right column of Figure 1.4. Since a term generally occurs in a number of documents, this data organization already reduces the storage requirements of the index. The dictionary also records some statistics, such as the number of documents which contain each term (the document frequency, which is here also the length of each postings list). This information is not vital for a basic Boolean search engine, but it allows us to improve the efficiency of the

5. Unix users can note that these steps are similar to use of the sort and then uniq commands.
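The sort-then-group construction can be sketched in a few lines (the documents are invented; a real indexer would use external sorting for collections that do not fit in memory):

```python
# Hypothetical documents: docID -> normalized text.
docs = {1: "i did enact julius caesar", 2: "so let it be with caesar"}

# 1. Collect (term, docID) pairs.
pairs = [(term, doc_id) for doc_id, text in docs.items()
         for term in text.split()]

# 2. Sort by term, then docID (the core indexing step).
pairs.sort()

# 3. Merge duplicates and group into a dictionary with postings lists.
index = {}
for term, doc_id in pairs:
    postings = index.setdefault(term, [])
    if not postings or postings[-1] != doc_id:
        postings.append(doc_id)

# Document frequency is just the length of each postings list.
print(index["caesar"], len(index["caesar"]))   # [1, 2] 2
```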
***RSS->
|------------
A measure of how well the centroids represent the members of their clusters is the residual sum of squares or RSS, the squared distance of each vector from its centroid summed over all vectors:

RSS_k = \sum_{\vec{x} \in \omega_k} |\vec{x} - \vec{\mu}(\omega_k)|^2

RSS = \sum_{k=1}^{K} RSS_k   (16.7)

RSS is the objective function in K-means and our goal is to minimize it. Since N is fixed, minimizing RSS is equivalent to minimizing the average squared distance, a measure of how well centroids represent their documents.
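Equation (16.7) in code, with the clusters and points invented for illustration (vectors are plain tuples):

```python
def centroid(cluster):
    """Component-wise mean of a list of vectors."""
    n = len(cluster)
    return tuple(sum(x[i] for x in cluster) / n
                 for i in range(len(cluster[0])))

def rss(clusters):
    """Residual sum of squares over all clusters (Equation 16.7)."""
    total = 0.0
    for cluster in clusters:
        mu = centroid(cluster)
        for x in cluster:
            total += sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return total

clusters = [[(0.0, 0.0), (2.0, 0.0)],    # centroid (1, 0): contributes 2
            [(4.0, 4.0), (4.0, 6.0)]]    # centroid (4, 5): contributes 2
print(rss(clusters))   # 4.0
```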
***A/B test->
|------------
The most common version of this is A/B testing, a term borrowed from the advertising industry. For such a test, precisely one thing is changed between the current system and a proposed system, and a small proportion of traffic (say, 1-10% of users) is randomly directed to the variant system, while most users use the current system. For example, if we wish to investigate a change to the ranking algorithm, we redirect a random sample of users to a variant system and evaluate measures such as the frequency with which people click on the top result, or any result on the first page. (This particular analysis method is referred to as clickthrough log analysis or clickstream mining. It is further discussed as a method of implicit feedback in Section 9.1.7 (page 187).) The basis of A/B testing is running a series of single-variable tests (either in sequence or in parallel): for each test only one parameter is varied from the control (the current live system). It is therefore easy to see whether varying each parameter has a positive or negative effect. Such testing of a live system can easily and cheaply gauge the effect of a change on users, and, with a large enough user base, it is practical to measure even very small positive and negative effects. In principle, more analytic power can be achieved by varying multiple things at once in an uncorrelated (random) way, and doing standard multivariate statistical analysis, such as multiple linear regression.
|------------
In practice, though, A/B testing is widely used, because A/B tests are easy to deploy, easy to understand, and easy to explain to management.
***pooling->
|------------
Given information needs and documents, you need to collect relevance assessments. This is a time-consuming and expensive process involving human beings. For tiny collections like Cranfield, exhaustive judgments of relevance for each query and document pair were obtained. For large modern collections, it is usual for relevance to be assessed only for a subset of the documents for each query. The most standard approach is pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems (usually the ones to be evaluated), and perhaps other sources such as the results of Boolean keyword searches or documents found by expert searchers in an interactive process.
|------------
Schamber et al. (1990) examine the concept of relevance, stressing its multidimensional and context-specific nature, but also arguing that it can be measured effectively. Voorhees (2000) is the standard article for examining variation in relevance judgments and their effects on retrieval system scores and ranking for the TREC Ad Hoc task. Voorhees concludes that although the numbers change, the rankings are quite stable. Hersh et al. (1994) present similar analysis for a medical IR collection. In contrast, Kekäläinen (2005) analyzes some of the later TRECs, exploring a 4-way relevance judgment and the notion of cumulative gain, arguing that the relevance measure used does substantially affect system rankings. See also Harter (1998). Zobel (1998) studies whether the pooling method used by TREC to collect a subset of documents that will be evaluated for relevance is reliable and fair, and concludes that it is.
***RF->
|------------
9.1 Relevance feedback and pseudo relevance feedback

The idea of relevance feedback (RF) is to involve the user in the retrieval process so as to improve the final result set. In particular, the user gives feedback on the relevance of documents in an initial set of results. The basic procedure is:

• The user issues a (short, simple) query.
***Euclidean distance->
|------------
? Exercise 6.18
One measure of the similarity of two vectors is the Euclidean distance (or L2 distance) between them:

|\vec{x} - \vec{y}| = \sqrt{\sum_{i=1}^{M} (x_i - y_i)^2}

                 query                                document
word       tf   wf   df        idf   q_i = wf-idf    tf   wf   d_i = normalized wf    q_i · d_i
digital              10,000
video               100,000
cameras              50,000

◮ Table 6.1 Cosine computation for Exercise 6.19.
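The distance formula in code (the example vectors are invented):

```python
from math import sqrt

def euclidean_distance(x, y):
    """L2 distance between two equal-length vectors."""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance((0.0, 3.0), (4.0, 0.0)))   # 5.0
```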
|------------
both contain sweet. As a result, it takes 25 iterations for the term to be unambiguously associated with cluster 2. (q_{sweet,1} = 0 in iteration 25.) Finding good seeds is even more critical for EM than for K-means. EM is prone to get stuck in local optima if the seeds are not chosen well. This is a general problem that also occurs in other applications of EM.4 Therefore, as with K-means, the initial assignment of documents to clusters is often computed by a different algorithm. For example, a hard K-means clustering may provide the initial assignment, which EM can then "soften up."

? Exercise 16.6
We saw above that the time complexity of K-means is Θ(IKNM). What is the time complexity of EM?

16.6 References and further reading

Berkhin (2006b) gives a general up-to-date survey of clustering methods with special attention to scalability. The classic reference for clustering in pattern recognition, covering both K-means and EM, is (Duda et al. 2000). Rasmussen (1992) introduces clustering from an information retrieval perspective. Anderberg (1973) provides a general introduction to clustering for applications. In addition to Euclidean distance and cosine similarity, Kullback-Leibler divergence is often used in clustering as a measure of how (dis)similar documents and clusters are (Xu and Croft 1999, Muresan and Harper 2004, Kurland and Lee 2004).
***query expansion->
|------------
9.2.2 Query expansion

In relevance feedback, users give additional input on documents (by marking documents in the results set as relevant or not), and this input is used to reweight the terms in the query for documents. In query expansion on the other hand, users give additional input on query words or phrases, possibly suggesting additional query terms. Some search engines (especially on the

[Screenshot: web search results for the query palm, showing sponsored results for handheld PCs and Palm accessories, and the header "1-10 of about 534,000,000 for palm (0.11 sec.)"]
***Golomb codes->
|------------
Several additional index compression techniques are covered by Witten et al. (1999; Sections 3.3 and 3.4 and Chapter 5). They recommend using parameterized codes for index compression, codes that explicitly model the probability distribution of gaps for each term. For example, they show that Golomb codes achieve better compression ratios than γ codes for large collections.
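For reference, the γ encoding and decoding described earlier (unary length followed by the binary offset with its leading 1 removed) can be sketched as follows; this works on bit strings for clarity rather than packing bits, so it is an illustration, not a production implementation.

```python
def gamma_encode(n):
    """Gamma code of n >= 1: unary code for the offset length, then the
    offset (n in binary with the leading 1 removed)."""
    offset = bin(n)[3:]                   # e.g. 13 -> '0b1101' -> '101'
    length = "1" * len(offset) + "0"      # e.g. '1110' for a 3-bit offset
    return length + offset

def gamma_decode(bits):
    """Inverse: read the unary length, then prepend the chopped-off 1."""
    k = bits.index("0")                   # number of leading 1s = offset length
    offset = bits[k + 1 : k + 1 + k]
    return int("1" + offset, 2)

print(gamma_encode(13))                   # '1110101'
print(gamma_decode("1110101"))            # 13
```

Note the edge case n = 1: the offset is empty, so the code is just the single bit '0', matching Table 5.5.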
***probability vector->
|------------
In a Markov chain, the probability distribution of next states depends only on the current state, and not on how the Markov chain arrived at the current state. Figure 21.2 shows a simple Markov chain with three states. From the middle state A, we proceed with (equal) probabilities of 0.5 to either B or C. From either B or C, we proceed with probability 1 to A. The transition probability matrix of this Markov chain is then

( 0   0.5  0.5 )
( 1   0    0   )
( 1   0    0   )

A Markov chain's probability distribution over its states may be viewed as a probability vector: a vector all of whose entries are in the interval [0, 1], and whose entries add up to 1. An N-dimensional probability vector each of whose components corresponds to one of the N states of a Markov chain can be viewed as a probability distribution over its states. For our simple Markov chain of Figure 21.2, the probability vector would have 3 components that sum to 1.
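A quick check with the transition matrix above hard-coded: one step of the chain maps a probability vector to another probability vector, i.e. the entries still lie in [0, 1] and sum to 1.

```python
# Transition probability matrix of the three-state chain (states A, B, C).
P = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]

def step(x, P):
    """One step of the chain: x' = x P (row vector times matrix)."""
    n = len(x)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

x = [1.0, 0.0, 0.0]          # start in state A with certainty
x = step(x, P)
print(x)                      # [0.0, 0.5, 0.5]
print(sum(x))                 # 1.0 -- still a probability vector
```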
***nonlinear classifier->
|------------
An example of a nonlinear classifier is kNN. The nonlinearity of kNN is intuitively clear when looking at examples like Figure 14.6. The decision boundaries of kNN (the double lines in Figure 14.6) are locally linear segments, but in general have a complex shape that is not equivalent to a line in 2D or a hyperplane in higher dimensions.
|------------
Figure 14.11 is another example of a nonlinear problem: there is no good linear separator between the distributions P(d|c) and P(d|c̄) because of the circular "enclave" in the upper left part of the graph. Linear classifiers misclassify the enclave, whereas a nonlinear classifier like kNN will be highly accurate for this type of problem if the training set is large enough.
|------------
If a problem is nonlinear and its class boundaries cannot be approximated well with linear hyperplanes, then nonlinear classifiers are often more accu- rate than linear classifiers. If a problem is linear, it is best to use a simpler linear classifier.
***SVD->
|------------
Authors that are often credited with the invention of the K-means algo- rithm include Lloyd (1982) (first distributed in 1957), Ball (1965), MacQueen (1967), and Hartigan and Wong (1979). Arthur and Vassilvitskii (2006) in- vestigate the worst-case complexity of K-means. Bradley and Fayyad (1998), Pelleg and Moore (1999) and Davidson and Satyanarayana (2003) investi- gate the convergence properties of K-means empirically and how it depends on initial seed selection. Dhillon and Modha (2001) compare K-means clus- ters with SVD-based clusters (Chapter 18). The K-medoid algorithm was presented by Kaufman and Rousseeuw (1990). The EM algorithm was orig- inally introduced by Dempster et al. (1977). An in-depth treatment of EM is (McLachlan and Krishnan 1996). See Section 18.5 (page 417) for publications on latent analysis, which can also be viewed as soft clustering.
|------------
An example of an efficient divisive algorithm is bisecting K-means (Steinbach et al. 2000). Spectral clustering algorithms (Kannan et al. 2000, Dhillon 2001, Zha et al. 2001, Ng et al. 2001a), including principal direction divisive partitioning (PDDP) (whose bisecting decisions are based on SVD, see Chapter 18) (Boley 1998, Savaresi and Boley 2004), are computationally more expensive than bisecting K-means, but have the advantage of being deterministic.
|------------
Theorem 18.3. Let r be the rank of the M × N matrix C. Then, there is a singular-value decomposition (SVD for short) of C of the form

C = U Σ V^T   (18.9)

where

1. The eigenvalues λ_1, . . . , λ_r of C C^T are the same as the eigenvalues of C^T C;
2. For 1 ≤ i ≤ r, let σ_i = √λ_i, with λ_i ≥ λ_{i+1}. Then the M × N matrix Σ is composed by setting Σ_ii = σ_i for 1 ≤ i ≤ r, and zero otherwise.
|------------
When writing down the numerical values of the SVD, it is conventional to represent Σ as an r × r matrix with the singular values on the diagonals, since all its entries outside this sub-matrix are zeros. Accordingly, it is conventional to omit the rightmost M − r columns of U corresponding to these omitted rows of Σ; likewise the rightmost N − r columns of V are omitted since they correspond in V^T to the rows that will be multiplied by the N − r columns of zeros in Σ. This written form of the SVD is sometimes known as the reduced SVD.

◮ Figure 18.1 Illustration of the singular-value decomposition. In this schematic illustration of (18.9), we see two cases illustrated. In the top half of the figure, we have a matrix C for which M > N. The lower half illustrates the case M < N.
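A sketch using NumPy (assuming NumPy is available; the small matrix is invented) to compute the reduced decomposition and check that it reconstructs C:

```python
import numpy as np

# A small M x N matrix with M > N, invented for illustration.
C = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# full_matrices=False gives the reduced form described above.
U, sigma, Vt = np.linalg.svd(C, full_matrices=False)

# Singular values come out in non-increasing order...
print(sigma)
# ...and U diag(sigma) V^T reconstructs C.
print(np.allclose(U @ np.diag(sigma) @ Vt, C))   # True
```

For this C, C^T C = [[2, 1], [1, 2]] has eigenvalues 3 and 1, so the singular values are √3 and 1, consistent with Theorem 18.3.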
***seed->
|------------
K-MEANS({~x_1, . . . , ~x_N}, K)
 1  (~s_1, ~s_2, . . . , ~s_K) ← SELECTRANDOMSEEDS({~x_1, . . . , ~x_N}, K)
 2  for k ← 1 to K
 3  do ~µ_k ← ~s_k
 4  while stopping criterion has not been met
 5  do for k ← 1 to K
 6     do ω_k ← {}
 7     for n ← 1 to N
 8     do j ← arg min_{j'} |~µ_{j'} − ~x_n|
 9        ω_j ← ω_j ∪ {~x_n}   (reassignment of vectors)
10     for k ← 1 to K
11     do ~µ_k ← (1/|ω_k|) ∑_{~x ∈ ω_k} ~x   (recomputation of centroids)
12  return {~µ_1, . . . , ~µ_K}

◮ Figure 16.5 The K-means algorithm. For most IR applications, the vectors ~x_n ∈ R^M should be length-normalized. Alternative methods of seed selection and initialization are discussed on page 364.
|------------
The first step of K-means is to select as initial cluster centers K randomly selected documents, the seeds. The algorithm then moves the cluster centers around in space in order to minimize RSS. As shown in Figure 16.5, this is done iteratively by repeating two steps until a stopping criterion is met: reassigning documents to the cluster with the closest centroid; and recomputing each centroid based on the current members of its cluster. Figure 16.6 shows snapshots from nine iterations of the K-means algorithm for a set of points.
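The two alternating steps of Figure 16.5 can be sketched in plain Python (2-D points as tuples, fixed seeds instead of random selection, and a fixed iteration count standing in for the stopping criterion):

```python
def closest(x, centroids):
    """Index of the centroid nearest to x (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((xi - ci) ** 2
                                 for xi, ci in zip(x, centroids[k])))

def kmeans(points, seeds, iterations=10):
    centroids = list(seeds)
    for _ in range(iterations):
        # Reassignment of vectors to the cluster with the closest centroid.
        clusters = [[] for _ in centroids]
        for x in points:
            clusters[closest(x, centroids)].append(x)
        # Recomputation of centroids (empty clusters are simply dropped here).
        centroids = [tuple(sum(c) / len(cluster) for c in zip(*cluster))
                     for cluster in clusters if cluster]
    return centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(kmeans(points, seeds=[(0.0, 0.0), (10.0, 10.0)], iterations=5))
# [(0.0, 0.5), (10.0, 10.5)]
```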
|------------
• Terminate when RSS falls below a threshold. This criterion ensures that the clustering is of a desired quality after termination. In practice, we

◮ Figure 16.6 [Snapshots of K-means on a set of points: selection of seeds; assignment of documents (iter. 1); recomputation/movement of ~µ's (iter. 1); ~µ's after convergence (iter. 9).]
***bag of words->
|------------
For a document d, the set of weights determined by the tf weights above (or indeed any weighting function that maps the number of occurrences of t in d to a positive real value) may be viewed as a quantitative digest of that document. In this view of a document, known in the literature as the bag of words model, the exact ordering of the terms in a document is ignored but the number of occurrences of each term is material (in contrast to Boolean retrieval). We only retain information on the number of occurrences of each term. Thus, the document "Mary is quicker than John" is, in this view, identical to the document "John is quicker than Mary". Nevertheless, it seems intuitive that two documents with similar bag of words representations are similar in content. We will develop this intuition further in Section 6.3.
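The example above, in code: under a bag-of-words representation the two orderings are indistinguishable.

```python
from collections import Counter

def bag_of_words(text):
    """Term -> count mapping, ignoring word order (a bag of words)."""
    return Counter(text.lower().split())

d1 = bag_of_words("Mary is quicker than John")
d2 = bag_of_words("John is quicker than Mary")
print(d1 == d2)   # True: identical bags, despite different word order
```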
|------------
P(X_{k_1} = t | c) = P(X_{k_2} = t | c)

for all positions k_1, k_2, terms t and classes c. Thus, we have a single distribution of terms that is valid for all positions k_i and we can use X as its symbol.4 Positional independence is equivalent to adopting the bag of words model, which we introduced in the context of ad hoc retrieval in Chapter 6 (page 117).
***Bayes error rate->
|------------
As we will see in the next chapter, kNN's effectiveness is close to that of the most accurate learning methods in text classification (Table 15.2, page 334). A measure of the quality of a learning method is its Bayes error rate, the average error rate of classifiers learned by it for a particular problem. kNN is not optimal for problems with a non-zero Bayes error rate – that is, for problems where even the best possible classifier has a non-zero classification error. The error of 1NN is asymptotically (as the training set increases) bounded by twice the Bayes error rate.

◮ Figure 14.8 There are an infinite number of hyperplanes that separate two linearly separable classes.
***document->
|------------
The way to avoid linearly scanning the texts for each query is to index the documents in advance. Let us stick with Shakespeare's Collected Works, and use it to introduce the basics of the Boolean retrieval model. Suppose we record for each document – here a play of Shakespeare's – whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 different words). The result is a binary term-document incidence matrix, as in Figure 1.1. Terms are the indexed units (further discussed in Section 2.2); they are usually words, and for the moment you can think of
|------------
◮ Figure 1.1 A term-document incidence matrix. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise.
|------------
them as words, but the information retrieval literature normally speaks of terms because some of them, such as perhaps I-9 or Hong Kong, are not usually thought of as words. Now, depending on whether we look at the matrix rows or columns, we can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it.2 To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND:

110100 AND 110111 AND 101111 = 100100

The answers for this query are thus Antony and Cleopatra and Hamlet (Figure 1.2).
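The bitwise computation can be sketched using Python integers as bit vectors (the play names and incidence bits follow the example above, with the leftmost bit corresponding to Antony and Cleopatra):

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Incidence vectors from the example, one bit per play.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

mask = (1 << len(plays)) - 1                      # six plays -> 0b111111
result = brutus & caesar & (~calpurnia & mask)    # AND with the complement

print(format(result, "06b"))                      # 100100
answers = [p for i, p in enumerate(plays)
           if result >> (len(plays) - 1 - i) & 1]
print(answers)   # ['Antony and Cleopatra', 'Hamlet']
```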
|------------
The model views each document as just a set of words.
|------------
Let us now consider a more realistic scenario, simultaneously using the opportunity to introduce some terminology and notation. Suppose we have N = 1 million documents. By documents we mean whatever units we have decided to build a retrieval system over. They might be individual memos or chapters of a book (see Section 2.1.2 (page 20) for further discussion). We will refer to the group of documents over which we perform retrieval as the (document) collection. It is sometimes also referred to as a corpus (a body of texts). Suppose each document is about 1000 words long (2-3 book pages). If

2. Formally, we take the transpose of the matrix to be able to get the terms as column vectors.
|------------
2.1 Document delineation and character sequence decoding 2.1.1 Obtaining the character sequence in a document Digital documents that are the input to an indexing process are typically bytes in a file or on a web server. The first step of processing is to convert this byte sequence into a linear sequence of characters. For the case of plain En- glish text in ASCII encoding, this is trivial. But often things get much more complex. The sequence of characters may be encoded by one of various sin- gle byte or multibyte encoding schemes, such as Unicode UTF-8, or various national or vendor-specific standards. We need to determine the correct en- coding. This can be regarded as a machine learning classification problem, as discussed in Chapter 13,1 but is often handled by heuristic methods, user selection, or by using provided document metadata. Once the encoding is determined, we decode the byte sequence to a character sequence. We might save the choice of encoding because it gives some evidence about what lan- guage the document is written in.
|------------
Again, we must determine the document format, and then an appropriate decoder has to be used. Even for plain text documents, additional decoding may need to be done. In XML documents (Section 10.1, page 197), character entities, such as &amp;, need to be decoded to give the correct character, namely & for &amp;. Finally, the textual part of the document may need to be extracted out of other material that will not be processed. This might be the desired handling for XML files, if the markup is going to be ignored; we would almost certainly want to do this with postscript or PDF files. We will not deal further with these issues in this book, and will assume henceforth that our documents are a list of characters. Commercial products usually need to support a broad range of document types and encodings, since users want things to just work with their data as is. Often, they just think of documents as text inside applications and are not even aware of how it is encoded on disk. This problem is usually solved by licensing a software library that handles decoding document formats and character encodings.
|------------
2.1.2 Choosing a document unit

The next phase is to determine what the document unit for indexing is. Thus far we have assumed that documents are fixed units for the purposes of indexing. For example, we take each file in a folder as a document. But there

1. A classifier is a function that takes objects of some sort and assigns them to one of a number of distinct classes (see Chapter 13). Usually classification is done by machine learning methods such as probabilistic models, but it can also be done by hand-written rules.
***permuterm index->
|------------
Permuterm indexes

Our first special index for general wildcard queries is the permuterm index, a form of inverted index. First, we introduce a special symbol $ into our character set, to mark the end of a term. Thus, the term hello is shown here as the augmented term hello$. Next, we construct a permuterm index, in which the various rotations of each term (augmented with $) all link to the original vocabulary term. Figure 3.3 gives an example of such a permuterm index entry for the term hello.
|------------
We refer to the set of rotated terms in the permuterm index as the permuterm vocabulary.
|------------
How does this index help us with wildcard queries? Consider the wildcard query m*n. The key is to rotate such a wildcard query so that the * symbol appears at the end of the string – thus the rotated wildcard query becomes n$m*. Next, we look up this string in the permuterm index, where seeking n$m* (via a search tree) leads to rotations of (among others) the terms man and moron.
|------------
Now that the permuterm index enables us to identify the original vocabulary terms matching a wildcard query, we look up these terms in the standard inverted index to retrieve matching documents. We can thus handle any wildcard query with a single * symbol. But what about a query such as fi*mo*er? In this case we first enumerate the terms in the dictionary that are in the permuterm index of er$fi*. Not all such dictionary terms will have the string mo in the middle – we filter these out by exhaustive enumeration, checking each candidate to see if it contains mo. In this example, the term fishmonger would survive this filtering but filibuster would not. We then

◮ Figure 3.3 A portion of a permuterm index.
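The rotation-and-prefix-lookup scheme can be sketched in a few lines. This is a toy illustration with an invented four-term vocabulary; a real implementation would look up the rotated query prefix in a search tree rather than scanning a dictionary.

```python
def rotations(term):
    """All rotations of term augmented with the end marker $."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(vocab):
    # each rotation links back to the original vocabulary term
    return {rot: term for term in vocab for rot in rotations(term)}

def wildcard_lookup(query, index):
    """Handle a single-* query X*Y by rotating it to Y$X* and
    matching rotations by prefix."""
    x, y = query.split("*")
    prefix = y + "$" + x
    return sorted({t for rot, t in index.items() if rot.startswith(prefix)})

idx = build_permuterm(["man", "moron", "moon", "hello"])
wildcard_lookup("m*n", idx)  # ['man', 'moon', 'moron']
```

For the query m*n, the rotated form n$m* matches the rotations n$ma (man), n$moo (moon), and n$moro (moron), exactly as the text describes.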
***posterior probability->
|------------
Writing P(Ā) for the complement of an event A, we similarly have:

P(Ā, B) = P(B|Ā)P(Ā)    (11.2)

Probability theory also has a partition rule, which says that if an event B can be divided into an exhaustive set of disjoint subcases, then the probability of B is the sum of the probabilities of the subcases. A special case of this rule gives that:

P(B) = P(A, B) + P(Ā, B)    (11.3)

From these we can derive Bayes' Rule for inverting conditional probabilities:

P(A|B) = P(B|A)P(A) / P(B) = [ P(B|A) / ∑_{X∈{A,Ā}} P(B|X)P(X) ] P(A)    (11.4)

This equation can also be thought of as a way of updating probabilities. We start off with an initial estimate of how likely the event A is when we do not have any other information; this is the prior probability P(A). Bayes' rule lets us derive a posterior probability P(A|B) after having seen the evidence B, based on the likelihood of B occurring in the two cases that A does or does not hold.1 Finally, it is often useful to talk about the odds of an event, which provide a kind of multiplier for how probabilities change:

Odds: O(A) = P(A) / P(Ā) = P(A) / (1 − P(A))    (11.5)

11.2 The Probability Ranking Principle

11.2.1 The 1/0 loss case

We assume a ranked retrieval setup as in Section 6.3, where there is a collection of documents, the user issues a query, and an ordered list of documents is returned. We also assume a binary notion of relevance as in Chapter 8. For a query q and a document d in the collection, let R_{d,q} be an indicator random variable that says whether d is relevant with respect to a given query q. That is, it takes on a value of 1 when the document is relevant and 0 otherwise. In context we will often write just R for R_{d,q}.
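Equations (11.3)–(11.5) translate directly into code; a small numeric sketch, where the probability values are made up for illustration:

```python
def posterior(p_a, p_b_given_a, p_b_given_not_a):
    """Bayes' rule (11.4): posterior P(A|B) from the prior P(A)
    and the likelihoods of B under A and its complement."""
    # partition rule (11.3): P(B) = P(B|A)P(A) + P(B|~A)P(~A)
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

def odds(p):
    """O(A) = P(A) / (1 - P(A)), Equation (11.5)."""
    return p / (1 - p)

# a rare event with informative evidence: the prior 0.01 is
# updated to a much larger posterior
p = posterior(0.01, 0.9, 0.1)   # = 0.009 / 0.108 = 1/12
```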
***multivariate Bernoulli model->
|------------
An alternative to the multinomial model is the multivariate Bernoulli model or Bernoulli model. It is equivalent to the binary independence model of Section 11.3 (page 222), which generates an indicator for each term of the vocabulary, either 1 indicating presence of the term in the document or 0 indicating absence. Figure 13.3 presents training and testing algorithms for the Bernoulli model. The Bernoulli model has the same time complexity as the multinomial model.
***power law->
|------------
Equivalently, we can write Zipf's law as cf_i = c·i^k or as log cf_i = log c + k log i where k = −1 and c is a constant to be defined in Section 5.3.2. It is therefore a power law with exponent k = −1. See Chapter 19, page 426, for another power law, a law characterizing the distribution of links on web pages.
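A quick numeric check of this log-log linearity; the constant c is arbitrary here, chosen only for illustration:

```python
import math

# Under Zipf's law cf_i = c * i^k with k = -1, log cf_i is linear
# in log i, so successive log-log slopes all recover k.
c, k = 1_000_000, -1
cf = [c * i**k for i in range(1, 6)]   # collection frequencies of ranks 1..5
slopes = [
    (math.log(cf[i + 1]) - math.log(cf[i]))
    / (math.log(i + 2) - math.log(i + 1))
    for i in range(4)
]
```

Every slope comes out as exactly −1, the power-law exponent.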
|------------
There is ample evidence that these links are not randomly distributed; for one thing, the distribution of the number of links into a web page does not follow the Poisson distribution one would expect if every web page were to pick the destinations of its links uniformly at random. Rather, this distribution is widely reported to be a power law, in which the total number of web pages with in-degree i is proportional to 1/i^α; the value of α typically reported by studies is 2.1.1 Furthermore, several studies have suggested that the directed graph connecting web pages has a bowtie shape: there are three major categories of web pages that are sometimes referred to as IN, OUT and SCC. A web surfer can pass from any page in IN to any page in SCC, by following hyperlinks. Likewise, a surfer can pass from any page in SCC to any page in OUT. Finally, the surfer can surf from any page in SCC to any other page in SCC. However, it is not possible to pass from a page in SCC to any page in IN, or from a page in OUT to a page in SCC (or, consequently, IN).
|------------
Notably, in several studies IN and OUT are roughly equal in size, whereas

1. Cf. Zipf's law of the distribution of words in text in Chapter 5 (page 90), which is a power law with α = 1.
***routing->
|------------
To capture the generality and scope of the problem space to which standing queries belong, we now introduce the general notion of a classification problem. Given a set of classes, we seek to determine which class(es) a given object belongs to. In the example, the standing query serves to divide new newswire articles into the two classes: documents about multicore computer chips and documents not about multicore computer chips. We refer to this as two-class classification. Classification using standing queries is also called routing or filtering and will be discussed further in Section 15.3.1 (page 335). A class need not be as narrowly focused as the standing query multicore computer chips. Often, a class is a more general subject area like China or coffee.
|------------
Such more general classes are usually referred to as topics, and the classification task is then called text classification, text categorization, topic classification, or topic spotting. An example for China appears in Figure 13.1. Standing queries and topics differ in their degree of specificity, but the methods for solving routing, filtering, and text classification are essentially the same. We therefore include routing and filtering under the rubric of text classification in this and the following chapters.
|------------
14.7 References and further reading

As discussed in Chapter 9, Rocchio relevance feedback is due to Rocchio (1971). Joachims (1997) presents a probabilistic analysis of the method. Rocchio classification was widely used as a classification method in TREC in the 1990s (Buckley et al. 1994a;b, Voorhees and Harman 2005). Initially, it was used as a form of routing. Routing merely ranks documents according to relevance to a class without assigning them. Early work on filtering, a true classification approach that makes an assignment decision on each document, was published by Ittner et al. (1995) and Schapire et al. (1998). The definition of routing we use here should not be confused with another sense. Routing can also refer to the electronic distribution of documents to subscribers, the so-called push model of document distribution. In a pull model, each transfer of a document to the user is initiated by the user – for example, by means of search or by selecting it from a list of documents on a news aggregation website.
***term-document matrix->
|------------
Viewing a collection of N documents as a collection of vectors leads to a natural view of a collection as a term-document matrix: this is an M×N matrix whose rows represent the M terms (dimensions) of the N columns, each of which corresponds to a document. As always, the terms being indexed could be stemmed before indexing; for instance, jealous and jealousy would under stemming be considered as a single dimension. This matrix view will prove to be useful in Chapter 18.
***language identification->
|------------
Boolean or free text queries, you always want to do the exact same tokenization of document and query words, generally by processing queries with the same tokenizer. This guarantees that a sequence of characters in a text will always match the same sequence typed in a query.3 These issues of tokenization are language-specific, and so require the language of the document to be known. Language identification based on classifiers that use short character subsequences as features is highly effective; most languages have distinctive signature patterns (see page 46 for references).
|------------
Language identification was perhaps first explored in cryptography; for example, Konheim (1981) presents a character-level k-gram language identification algorithm. While other methods such as looking for particular distinctive function words and letter combinations have been used, with the advent of widespread digital text, many people have explored the character n-gram technique, and found it to be highly successful (Beesley 1998, Dunning 1994, Cavnar and Trenkle 1994). Written language identification is regarded as a fairly easy problem, while spoken language identification remains more difficult; see Hughes et al. (2006) for a recent survey.
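A bare-bones character n-gram identifier in the spirit of this technique can be sketched as follows. The two tiny training sentences are invented stand-ins for real training corpora, and cosine similarity over raw trigram counts is just one of several reasonable profile comparisons.

```python
from collections import Counter

def ngram_profile(text, n=3):
    """Character n-gram counts, with padding so word edges count too."""
    text = " " + text.lower() + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(p, q):
    # cosine similarity between two n-gram count profiles
    dot = sum(c * q[g] for g, c in p.items())
    norm = lambda v: sum(c * c for c in v.values()) ** 0.5
    return dot / (norm(p) * norm(q))

profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog"),
    "de": ngram_profile("der schnelle braune fuchs springt ueber den hund"),
}

def identify(text):
    """Assign the language whose profile is closest to the text's."""
    return max(profiles, key=lambda lang: similarity(ngram_profile(text), profiles[lang]))
```

Even with such minimal training data, distinctive trigrams like "the" versus "der" separate the two languages.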
***static web pages->
|------------
While the question “how big is the Web?” has no easy answer (see Section 19.5), the question “how many web pages are in a search engine’s index” is more precise, although even this question has issues. By the end of 1995, Altavista reported that it had crawled and indexed approximately 30 million static web pages. Static web pages are those whose content does not vary from one request for that page to the next. For this purpose, a professor who manually updates his home page every week is considered to have a static web page, but an airport’s flight status page is considered to be dynamic. Dynamic pages are typically mechanically generated by an application server in response to a query to a database, as shown in Figure 19.1. One sign of such a page is that the URL has the character "?" in it. Since the number of static web pages was believed to be doubling every few months in 1995, early web search engines such as Altavista had to constantly add hardware and bandwidth for crawling and indexing web pages.
***Quadratic Programming->
|------------
Quadratic optimization problems are a standard, well-known class of mathematical optimization problems, and many algorithms exist for solving them.
|------------
We could in principle build our SVM using standard quadratic programming (QP) libraries, but there has been much recent research in this area aiming to exploit the structure of the kind of QP that emerges from an SVM. As a result, there are more intricate but much faster and more scalable libraries available especially for building SVMs, which almost everyone uses to build models.
***sequence model->
|------------
With conditional and positional independence assumptions, we only need to estimate Θ(M|C|) parameters P(t_k|c) (multinomial model) or P(e_i|c) (Bernoulli

4. Our terminology is nonstandard. The random variable X is a categorical variable, not a multinomial variable, and the corresponding NB model should perhaps be called a sequence model. We have chosen to present this sequence model and the multinomial model in Section 13.4.1 as the same model because they are computationally identical.
***regular expressions->
|------------
Friedl (2006) covers the practical usage of regular expressions for searching. The underlying computer science appears in (Hopcroft et al. 2000).
***specificity->
|------------
Thus, R-precision turns out to be identical to the break-even point, another measure which is sometimes used, defined in terms of this equality relationship holding. Like Precision at k, R-precision describes only one point on the precision-recall curve, rather than attempting to summarize effectiveness across the curve, and it is somewhat unclear why you should be interested in the break-even point rather than either the best point on the curve (the point with maximal F-measure) or a retrieval level of interest to a particular application (Precision at k). Nevertheless, R-precision turns out to be highly correlated with MAP empirically, despite measuring only a single point on the curve.

◮ Figure 8.4 The ROC curve corresponding to the precision-recall curve in Figure 8.2. (Axes: 1 − specificity vs. sensitivity (= recall).)
|------------
Another concept sometimes used in evaluation is an ROC curve. (“ROC” stands for “Receiver Operating Characteristics”, but knowing that doesn’t help most people.) An ROC curve plots the true positive rate or sensitivity against the false positive rate or (1 − specificity). Here, sensitivity is just another term for recall. The false positive rate is given by fp/(fp + tn). Figure 8.4 shows the ROC curve corresponding to the precision-recall curve in Figure 8.2. An ROC curve always goes from the bottom left to the top right of the graph. For a good system, the graph climbs steeply on the left side. For unranked result sets, specificity, given by tn/(fp + tn), was not seen as a very useful notion. Because the set of true negatives is always so large, its value would be almost 1 for all information needs (and, correspondingly, the value of the false positive rate would be almost 0). That is, the “interesting” part of Figure 8.2 is 0 < recall < 0.4, a part which is compressed to a small corner of Figure 8.4. But an ROC curve could make sense when looking over the full retrieval spectrum, and it provides another way of looking at the data.
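The quantities in this paragraph are easy to compute directly; the counts below are invented to mimic the huge true-negative set of an ad hoc retrieval collection:

```python
def roc_point(tp, fp, fn, tn):
    """One point in ROC space from a confusion matrix."""
    sensitivity = tp / (tp + fn)   # true positive rate = recall
    fpr = fp / (fp + tn)           # false positive rate = 1 - specificity
    specificity = tn / (fp + tn)
    return sensitivity, fpr, specificity

# a query with ~a million irrelevant (true negative) documents:
sens, fpr, spec = roc_point(tp=40, fp=10, fn=60, tn=999_890)
# specificity is almost 1 and the false positive rate almost 0,
# as the text predicts for unranked result sets
```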
***clickstream mining->
|------------
The most common version of this is A/B testing, a term borrowed from the advertising industry. For such a test, precisely one thing is changed between the current system and a proposed system, and a small proportion of traffic (say, 1–10% of users) is randomly directed to the variant system, while most users use the current system. For example, if we wish to investigate a change to the ranking algorithm, we redirect a random sample of users to a variant system and evaluate measures such as the frequency with which people click on the top result, or any result on the first page. (This particular analysis method is referred to as clickthrough log analysis or clickstream mining. It is further discussed as a method of implicit feedback in Section 9.1.7 (page 187).) The basis of A/B testing is running a bunch of single variable tests (either in sequence or in parallel): for each test only one parameter is varied from the control (the current live system). It is therefore easy to see whether varying each parameter has a positive or negative effect. Such testing of a live system can easily and cheaply gauge the effect of a change on users, and, with a large enough user base, it is practical to measure even very small positive and negative effects. In principle, more analytic power can be achieved by varying multiple things at once in an uncorrelated (random) way, and doing standard multivariate statistical analysis, such as multiple linear regression.
|------------
On the web, DirectHit introduced the idea of ranking more highly documents that users chose to look at more often. In other words, clicks on links were assumed to indicate that the page was likely relevant to the query. This approach makes various assumptions, such as that the document summaries displayed in results lists (on whose basis users choose which documents to click on) are indicative of the relevance of these documents. In the original DirectHit search engine, the data about the click rates on pages was gathered globally, rather than being user or query specific. This is one form of the general area of clickstream mining. Today, a closely related approach is used in ranking the advertisements that match a web search query (Chapter 19).
***GAAC->
|------------
17.3 Group-average agglomerative clustering

Group-average agglomerative clustering or GAAC (see Figure 17.3, (d)) evaluates cluster quality based on all similarities between documents, thus avoiding the pitfalls of the single-link and complete-link criteria, which equate cluster similarity with the similarity of a single pair of documents. GAAC is also called group-average clustering and average-link clustering. GAAC computes the average similarity SIM-GA of all pairs of documents, including pairs from the same cluster. But self-similarities are not included in the average:

SIM-GA(ω_i, ω_j) = [1 / ((N_i + N_j)(N_i + N_j − 1))] ∑_{d_m ∈ ω_i ∪ ω_j} ∑_{d_n ∈ ω_i ∪ ω_j, d_n ≠ d_m} d⃗_m · d⃗_n    (17.1)

where d⃗ is the length-normalized vector of document d, · denotes the dot product, and N_i and N_j are the number of documents in ω_i and ω_j, respectively.
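Equation (17.1) can be computed naively as follows. This is a sketch for small clusters; real HAC implementations maintain vector sums and update them incrementally rather than re-averaging all pairs at every merge.

```python
import math

def normalize(v):
    """Length-normalize a document vector."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def sim_ga(cluster_i, cluster_j):
    """Average dot product over all ordered pairs of distinct documents
    in the union of the two clusters (Equation 17.1)."""
    docs = [normalize(d) for d in cluster_i + cluster_j]
    n = len(docs)
    total = sum(
        sum(a * b for a, b in zip(docs[m], docs[k]))
        for m in range(n) for k in range(n) if k != m
    )
    # (N_i + N_j)(N_i + N_j - 1) ordered pairs, self-pairs excluded
    return total / (n * (n - 1))
```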
***partitional clustering->
|------------
A note on terminology. An alternative definition of hard clustering is that a document can be a full member of more than one cluster. Partitional clustering always refers to a clustering where each document belongs to exactly one cluster. (But in a partitional hierarchical clustering (Chapter 17) all members of a cluster are of course also members of its parent.) On the definition of hard clustering that permits multiple membership, the difference between soft clustering and hard clustering is that membership values in hard clustering are either 0 or 1, whereas they can take on any non-negative value in soft clustering.
***implicit relevance feedback->
|------------
9.1.7 Indirect relevance feedback

We can also use indirect sources of evidence rather than explicit feedback on relevance as the basis for relevance feedback. This is often called implicit (relevance) feedback. Implicit feedback is less reliable than explicit feedback, but is more useful than pseudo relevance feedback, which contains no evidence of user judgments. Moreover, while users are often reluctant to provide explicit feedback, it is easy to collect implicit feedback in large quantities for a high volume system, such as a web search engine.
***email->
|------------
• Personal email sorting. A user may have folders like talk announcements, electronic bills, email from family and friends, and so on, and may want a classifier to classify each incoming email and automatically move it to the appropriate folder. It is easier to find messages in sorted folders than in a very large inbox. The most common case of this application is a spam folder that holds all suspected spam messages.
***patent databases->
|------------
However, many structured data sources containing text are best modeled as structured documents rather than relational data. We call the search over such structured documents structured retrieval. Queries in structured retrieval can be either structured or unstructured, but we will assume in this chapter that the collection consists only of structured documents. Applications of structured retrieval include digital libraries, patent databases, blogs, text in which entities like persons and locations have been tagged (in a process called named entity tagging) and output from office suites like OpenOffice that save documents as marked up text. In all of these applications, we want to be able to run queries that combine textual criteria with structural criteria.
***positional independence->
|------------
For these reasons, we make a second independence assumption for the multinomial model, positional independence: The conditional probabilities for a term are the same independent of position in the document.
|------------
P(X_{k1} = t|c) = P(X_{k2} = t|c)

for all positions k1, k2, terms t and classes c. Thus, we have a single distribution of terms that is valid for all positions k_i and we can use X as its symbol.4 Positional independence is equivalent to adopting the bag of words model, which we introduced in the context of ad hoc retrieval in Chapter 6 (page 117).
|------------
With conditional and positional independence assumptions, we only need to estimate Θ(M|C|) parameters P(t_k|c) (multinomial model) or P(e_i|c) (Bernoulli

4. Our terminology is nonstandard. The random variable X is a categorical variable, not a multinomial variable, and the corresponding NB model should perhaps be called a sequence model. We have chosen to present this sequence model and the multinomial model in Section 13.4.1 as the same model because they are computationally identical.
***Voronoi tessellation->
|------------
◮ Figure 14.6 Voronoi tessellation and decision boundaries (double lines) in 1NN classification. The three classes are: X, circle and diamond.
|------------
Decision boundaries in 1NN are concatenated segments of the Voronoi tessellation as shown in Figure 14.6. The Voronoi tessellation of a set of objects decomposes space into Voronoi cells, where each object’s cell consists of all points that are closer to the object than to other objects. In our case, the objects are documents. The Voronoi tessellation then partitions the plane into |D| convex polygons, each containing its corresponding document (and no other) as shown in Figure 14.6, where a convex polygon is a convex region in 2-dimensional space bounded by lines.
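1NN classification itself is short to state in code; the three training points below are an invented stand-in for the documents of Figure 14.6:

```python
def nearest_neighbor_classify(train, query):
    """train: list of (vector, label) pairs. Assigns the label of the
    closest training point under squared Euclidean distance (1NN),
    i.e. the label of the Voronoi cell the query falls into."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda pair: dist2(pair[0], query))
    return label

train = [((0.0, 0.0), "X"), ((1.0, 0.0), "circle"), ((0.0, 1.0), "diamond")]
nearest_neighbor_classify(train, (0.9, 0.2))  # "circle"
```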
***information retrieval->
|------------
1 Boolean retrieval

The meaning of the term information retrieval can be very broad. Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval. However, as an academic field of study, information retrieval might be defined thus:

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
|------------
As defined in this way, information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar professional searchers. Now the world has changed, and hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email.1 Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style searching (the sort that is going on when a clerk says to you: “I’m sorry, I can only look up your order if you can give me your Order ID”).
***group-average clustering->
|------------
17.3 Group-average agglomerative clustering

Group-average agglomerative clustering or GAAC (see Figure 17.3, (d)) evaluates cluster quality based on all similarities between documents, thus avoiding the pitfalls of the single-link and complete-link criteria, which equate cluster similarity with the similarity of a single pair of documents. GAAC is also called group-average clustering and average-link clustering. GAAC computes the average similarity SIM-GA of all pairs of documents, including pairs from the same cluster. But self-similarities are not included in the average:

SIM-GA(ω_i, ω_j) = [1 / ((N_i + N_j)(N_i + N_j − 1))] ∑_{d_m ∈ ω_i ∪ ω_j} ∑_{d_n ∈ ω_i ∪ ω_j, d_n ≠ d_m} d⃗_m · d⃗_n    (17.1)

where d⃗ is the length-normalized vector of document d, · denotes the dot product, and N_i and N_j are the number of documents in ω_i and ω_j, respectively.
***inverter->
|------------
For the reduce phase, we want all values for a given key to be stored close together, so that they can be read and processed quickly. This is achieved by

◮ Figure 4.5 An example of distributed indexing with MapReduce. Adapted from Dean and Ghemawat (2004).
|------------
Collecting all values (here: docIDs) for a given key (here: termID) into one list is the task of the inverters in the reduce phase. The master assigns each term partition to a different inverter – and, as in the case of parsers, reassigns term partitions in case of failing or slow inverters. Each term partition (corresponding to r segment files, one on each parser) is processed by one inverter. We assume here that segment files are of a size that a single machine can handle (Exercise 4.9). Finally, the list of values is sorted for each key and written to the final sorted postings list (“postings” in the figure). (Note that postings in Figure 4.6 include term frequencies, whereas each posting in the other sections of this chapter is simply a docID without term frequency information.) The data flow is shown for a–f in Figure 4.5. This completes the construction of the inverted index.
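The parser/inverter division of labor can be mimicked in a few lines of single-machine code. This is a toy sketch: a real MapReduce run distributes the (term, docID) pairs across machines and partitions keys by term range, as in Figure 4.5.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Parser: emit (term, doc_id) pairs for one document split."""
    return [(term, doc_id) for term in text.split()]

def reduce_phase(pairs):
    """Inverter: collect all doc_ids for each term into a sorted,
    duplicate-free postings list."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return dict(postings)

pairs = map_phase(1, "caesar died") + map_phase(2, "caesar lived")
index = reduce_phase(pairs)
# {'caesar': [1, 2], 'died': [1], 'lived': [2]}
```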
***Kruskal’s algorithm->
|------------
17.9 References and further reading

An excellent general review of clustering is (Jain et al. 1999). Early references for specific HAC algorithms are (King 1967) (single-link), (Sneath and Sokal 1973) (complete-link, GAAC) and (Lance and Williams 1967) (discussing a large variety of hierarchical clustering algorithms). The single-link algorithm in Figure 17.9 is similar to Kruskal’s algorithm for constructing a minimum spanning tree. A graph-theoretical proof of the correctness of Kruskal’s algorithm (which is analogous to the proof in Section 17.5) is provided by Cormen et al. (1990, Theorem 23.1). See Exercise 17.5 for the connection between minimum spanning trees and single-link clusterings.
***HAC->
|------------
17.1 Hierarchical agglomerative clustering

Hierarchical clustering algorithms are either top-down or bottom-up. Bottom-up algorithms treat each document as a singleton cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all documents. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering or HAC. Top-down clustering requires a method for splitting a cluster.
|------------
It proceeds by splitting clusters recursively until individual documents are reached. See Section 17.6. HAC is more frequently used in IR than top-down clustering and is the main subject of this chapter.
|------------
Before looking at specific similarity measures used in HAC in Sections 17.2–17.4, we first introduce a method for depicting hierarchical clusterings graphically, discuss a few key properties of HACs and present a simple algorithm for computing an HAC.
|------------
An HAC clustering is typically visualized as a dendrogram as shown in Figure 17.1. Each merge is represented by a horizontal line. The y-coordinate of the horizontal line is the similarity of the two clusters that were merged, where documents are viewed as singleton clusters. We call this similarity the combination similarity of the merged cluster. For example, the combination similarity of the cluster consisting of Lloyd’s CEO questioned and Lloyd’s chief / U.S. grilling in Figure 17.1 is ≈ 0.56. We define the combination similarity of a singleton cluster as its document’s self-similarity (which is 1.0 for cosine similarity).
|------------
A fundamental assumption in HAC is that the merge operation is monotonic. Monotonic means that if s_1, s_2, . . . , s_{K−1} are the combination similarities of the successive merges of an HAC, then s_1 ≥ s_2 ≥ . . . ≥ s_{K−1} holds. A non-monotonic hierarchical clustering contains at least one inversion s_i < s_{i+1} and contradicts the fundamental assumption that we chose the best merge available at each step. We will see an example of an inversion in Figure 17.12.
***NMI->
|------------
normalized mutual information or NMI:

NMI(Ω, C) = I(Ω; C) / ([H(Ω) + H(C)]/2)    (16.2)

I is mutual information (cf. Chapter 13, page 272):

I(Ω; C) = ∑_k ∑_j P(ω_k ∩ c_j) log [ P(ω_k ∩ c_j) / (P(ω_k)P(c_j)) ]    (16.3)
        = ∑_k ∑_j (|ω_k ∩ c_j| / N) log [ N|ω_k ∩ c_j| / (|ω_k||c_j|) ]    (16.4)

where P(ω_k), P(c_j), and P(ω_k ∩ c_j) are the probabilities of a document being in cluster ω_k, class c_j, and in the intersection of ω_k and c_j, respectively. Equation (16.4) is equivalent to Equation (16.3) for maximum likelihood estimates of the probabilities (i.e., the estimate of each probability is the corresponding relative frequency).
|------------
The normalization by the denominator [H(Ω) + H(C)]/2 in Equation (16.2) fixes this problem since entropy tends to increase with the number of clusters. For example, H(Ω) reaches its maximum log N for K = N, which ensures that NMI is low for K = N. Because NMI is normalized, we can use it to compare clusterings with different numbers of clusters. The particular form of the denominator is chosen because [H(Ω) + H(C)]/2 is a tight upper bound on I(Ω; C) (Exercise 16.8). Thus, NMI is always a number between 0 and 1.
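Equations (16.2)–(16.4), with clusters and classes represented as sets of document IDs, can be sketched as follows (maximum likelihood estimates, natural logarithms; the two-cluster example is invented):

```python
import math

def entropy(parts, n):
    """H of a partition given as a list of document-ID sets."""
    return -sum(len(p) / n * math.log(len(p) / n) for p in parts)

def mutual_info(clusters, classes, n):
    """I(Omega; C), Equation (16.4)."""
    total = 0.0
    for w in clusters:
        for c in classes:
            joint = len(w & c)
            if joint:  # 0 * log 0 = 0 by convention
                total += joint / n * math.log(n * joint / (len(w) * len(c)))
    return total

def nmi(clusters, classes):
    """Equation (16.2)."""
    n = sum(len(w) for w in clusters)
    denom = (entropy(clusters, n) + entropy(classes, n)) / 2
    return mutual_info(clusters, classes, n) / denom

# a perfect clustering reproduces the classes exactly -> NMI = 1
clusters = [{1, 2}, {3, 4}]
classes = [{1, 2}, {3, 4}]
```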
***in index construction->
***support vector machine->
|------------
15 Support vector machines and machine learning on documents

Improving classifier effectiveness has been an area of intensive machine-learning research over the last two decades, and this work has led to a new generation of state-of-the-art classifiers, such as support vector machines, boosted decision trees, regularized logistic regression, neural networks, and random forests. Many of these methods, including support vector machines (SVMs), the main topic of this chapter, have been applied with success to information retrieval problems, particularly text classification. An SVM is a kind of large-margin classifier: it is a vector space based machine learning method where the goal is to find a decision boundary between two classes that is maximally far from any point in the training data (possibly discounting some points as outliers or noise).
|------------
We will initially motivate and develop SVMs for the case of two-class data sets that are separable by a linear classifier (Section 15.1), and then extend the model in Section 15.2 to non-separable data, multi-class problems, and non-linear models, and also present some additional discussion of SVM performance. The chapter then moves to consider the practical deployment of text classifiers in Section 15.3: what sorts of classifiers are appropriate when, and how can you exploit domain-specific text features in classification? Finally, we will consider how the machine learning technology that we have been building for text classification can be applied back to the problem of learning how to rank documents in ad hoc retrieval (Section 15.4). While several machine learning methods have been applied to this task, use of SVMs has been prominent. Support vector machines are not necessarily better than other machine learning methods (except perhaps in situations with little training data), but they perform at the state-of-the-art level and have much current theoretical and empirical appeal.
|------------
15.5 References and further reading The somewhat quirky name support vector machine originates in the neural networks literature, where learning algorithms were thought of as architectures, and often referred to as “machines”. The distinctive element of this model is that the decision boundary to use is completely decided (“supported”) by a few training data points, the support vectors.
***focused retrieval->
|------------
An active area of XML retrieval research is focused retrieval (Trotman et al. 2007), which aims to avoid returning nested elements that share one or more common subelements (cf. discussion in Section 10.2, page 203). There is evidence that users dislike redundancy caused by nested elements (Betsi et al. 2006). Focused retrieval requires evaluation measures that penalize redundant results lists (Kazai and Lalmas 2006, Lalmas et al. 2007). Trotman and Geva (2006) argue that XML retrieval is a form of passage retrieval. In passage retrieval (Salton et al. 1993, Hearst and Plaunt 1993, Zobel et al. 1995, Hearst 1997, Kaszkiel and Zobel 1997), the retrieval system returns short passages instead of documents in response to a user query. While element boundaries in XML documents are cues for identifying good segment boundaries between passages, the most relevant passage often does not coincide with an XML element.
***geometric margin->
|------------
◮ Figure 15.3 The geometric margin of a point (r) and a decision boundary (ρ). [Figure: a 2-D plot of points from two classes, showing the weight vector ~w, the distance r from a point to the decision boundary, and the margin ρ; plot data omitted.]
|------------
The geometric margin of the classifier is the maximum width of the band that can be drawn separating the support vectors of the two classes. That is, it is twice the minimum value over data points for r given in Equation (15.4), or, equivalently, the maximal width of one of the fat separators shown in Figure 15.2. The geometric margin is clearly invariant to scaling of parameters: if we replace ~w by 5~w and b by 5b, then the geometric margin is the same, because it is inherently normalized by the length of ~w. This means that we can impose any scaling constraint we wish on ~w without affecting the geometric margin. Among other choices, we could use unit vectors, as in Chapter 6, by
2. Recall that |~w| = √(~wᵀ~w).
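The scaling invariance can be checked numerically. This is a minimal sketch on a hypothetical 2-D data set (the points and the (~w, b) values are made up for illustration); it computes the minimum of r from Equation (15.4) and verifies that rescaling (~w, b) by 5 leaves it unchanged:

```python
import math

def geometric_margin(w, b, points):
    """Minimum over labeled points (x, y), y in {-1, +1}, of
    r = y (w·x + b) / |w|, as in Equation (15.4)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    dot = lambda u, v: sum(a * c for a, c in zip(u, v))
    return min(y * (dot(w, x) + b) / norm for x, y in points)

# Toy separable data (hypothetical): two positives, two negatives.
points = [((3.0, 3.0), 1), ((4.0, 4.0), 1), ((1.0, 0.0), -1), ((0.0, 1.0), -1)]
w, b = (1.0, 1.0), -3.5

m1 = geometric_margin(w, b, points)
m2 = geometric_margin((5.0, 5.0), 5 * b, points)  # rescale (w, b) by 5
assert abs(m1 - m2) < 1e-12  # the geometric margin is unchanged
```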
***add α->
***phrase index->
|------------
The concept of a biword index can be extended to longer sequences of words, and if the index includes variable length word sequences, it is generally referred to as a phrase index. Indeed, searches for a single term are not naturally handled in a biword index (you would need to scan the dictionary for all biwords containing the term), and so we also need to have an index of single-word terms. While there is always a chance of false positive matches, the chance of a false positive match on indexed phrases of length 3 or more becomes very small indeed. But on the other hand, storing longer phrases has the potential to greatly expand the vocabulary size. Maintaining exhaustive phrase indexes for phrases of length greater than two is a daunting prospect, and even use of an exhaustive biword dictionary greatly expands the size of the vocabulary. However, towards the end of this section we discuss the utility of the strategy of using a partial phrase index in a compound indexing scheme.
***pointwise mutual information->
|------------
Early uses of mutual information and χ² for feature selection in text classification are Lewis and Ringuette (1994) and Schütze et al. (1995), respectively. Yang and Pedersen (1997) review feature selection methods and their impact on classification effectiveness. They find that pointwise mutual information is not competitive with other methods. Yang and Pedersen refer to expected mutual information (Equation (13.16)) as information gain (see Exercise 13.13, page 285). Snedecor and Cochran (1989) is a good reference for the χ² test in statistics, including the Yates’ correction for continuity for 2 × 2 tables. Dunning (1993) discusses problems of the χ² test when counts are small. Nongreedy feature selection techniques are described by Hastie et al.
***maximum a posteriori->
|------------
This is referred to as the relative frequency of the event. Estimating the probability as the relative frequency is the maximum likelihood estimate (or MLE), because this value makes the observed data maximally likely. However, if we simply use the MLE, then the probability given to events we happened to see is usually too high, whereas other events may be completely unseen and giving them as a probability estimate their relative frequency of 0 is both an underestimate, and normally breaks our models, since anything multiplied by 0 is 0. Simultaneously decreasing the estimated probability of seen events and increasing the probability of unseen events is referred to as smoothing. One simple way of smoothing is to add a number α to each of the observed counts. These pseudocounts correspond to the use of a uniform distribution over the vocabulary as a Bayesian prior, following Equation (11.4). We initially assume a uniform distribution over events, where the size of α denotes the strength of our belief in uniformity, and we then update the probability based on observed events. Since our belief in uniformity is weak, we use α = 1/2. This is a form of maximum a posteriori (MAP) estimation, where we choose the most likely point value for probabilities based on the prior and the observed evidence, following Equation (11.4). We will further discuss methods of smoothing estimated counts to give probability models in Section 12.2.2 (page 243); the simple method of adding 1/2 to each observed count will do for now.
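The add-α scheme can be sketched directly: add α to every count and renormalize, so that the estimates sum to 1 and no seen or unseen event gets probability 0. The event names in the example are hypothetical; α defaults to the text’s 1/2.

```python
from fractions import Fraction

def map_estimate(counts, alpha=Fraction(1, 2)):
    """Add-alpha smoothed (MAP) probability estimates.

    counts: dict mapping each event in the known vocabulary to its
    observed count. With alpha = 1/2 this is the smoothing used in the
    text; alpha -> 0 recovers the MLE (the relative frequency)."""
    total = sum(counts.values()) + alpha * len(counts)
    return {e: (c + alpha) / total for e, c in counts.items()}

# Hypothetical counts over a 3-word vocabulary; "unseen" never occurred.
probs = map_estimate({"the": 3, "a": 1, "unseen": 0})
assert probs["unseen"] > 0                 # no event gets probability zero
assert sum(probs.values()) == 1            # estimates still normalize
```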
|------------
13.4 Properties of Naive Bayes To gain a better understanding of the two models and the assumptions they make, let us go back and examine how we derived their classification rules in Chapters 11 and 12. We decide class membership of a document by assigning it to the class with the maximum a posteriori probability (cf. Section 11.3.2, page 226), which we compute as follows:

c_map = argmax_{c∈C} P(c|d) = argmax_{c∈C} [P(d|c) P(c) / P(d)]   (13.9)
      = argmax_{c∈C} P(d|c) P(c),   (13.10)

where Bayes’ rule (Equation (11.4), page 220) is applied in (13.9) and we drop the denominator in the last step because P(d) is the same for all classes and does not affect the argmax.
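This decision rule can be sketched in a few lines. The sketch works in log space (a standard implementation choice, to avoid floating-point underflow on long documents) and, as in the derivation, never computes the denominator P(d). The classes, priors, and conditional probabilities below are hypothetical.

```python
import math

def nb_classify(doc_terms, priors, cond_probs):
    """Return c_map = argmax_c P(c) * prod_t P(t|c), computed in log space.
    priors: {class: P(c)}; cond_probs: {class: {term: P(t|c)}}.
    P(d) is omitted: it is constant across classes."""
    def score(c):
        return math.log(priors[c]) + sum(math.log(cond_probs[c][t])
                                         for t in doc_terms)
    return max(priors, key=score)

# Hypothetical two-class example.
priors = {"china": 0.75, "uk": 0.25}
cond = {"china": {"beijing": 0.5, "london": 0.1},
        "uk":    {"beijing": 0.1, "london": 0.6}}
assert nb_classify(["beijing", "beijing"], priors, cond) == "china"
assert nb_classify(["london", "london", "london"], priors, cond) == "uk"
```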
***stochastic matrix->
|------------
A Markov chain is characterized by an N×N transition probability matrix P each of whose entries is in the interval [0, 1]; the entries in each row of P add up to 1. The Markov chain can be in one of the N states at any given time-step; then, the entry P_ij tells us the probability that the state at the next time-step is j, conditioned on the current state being i. Each entry P_ij is known as a transition probability and depends only on the current state i; this is known as the Markov property. Thus, by the Markov property,

∀i, j, P_ij ∈ [0, 1] and ∀i, Σ_{j=1}^{N} P_ij = 1.   (21.1)

A matrix with non-negative entries that satisfies Equation (21.1) is known as a stochastic matrix. A key property of a stochastic matrix is that it has a principal left eigenvector corresponding to its largest eigenvalue, which is 1.
1. This is consistent with our usage of N for the number of documents in the collection.
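Both properties can be illustrated concretely: a check of Equation (21.1), and power iteration x ← xP, which for well-behaved chains converges to the principal left eigenvector (the stationary distribution). The 2-state matrix below is a made-up example.

```python
def is_stochastic(P, tol=1e-12):
    """Check Equation (21.1): entries in [0, 1] and each row sums to 1."""
    return (all(0 <= p <= 1 for row in P for p in row)
            and all(abs(sum(row) - 1) <= tol for row in P))

def principal_left_eigenvector(P, iterations=1000):
    """Power iteration x <- xP; converges (for well-behaved chains) to the
    left eigenvector with eigenvalue 1, i.e. the stationary distribution."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(iterations):
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
    return x

# Hypothetical 2-state chain.
P = [[0.9, 0.1],
     [0.5, 0.5]]
assert is_stochastic(P)
pi = principal_left_eigenvector(P)
# The stationary distribution satisfies pi P = pi; here pi = (5/6, 1/6).
assert abs(pi[0] - 5/6) < 1e-9 and abs(pi[1] - 1/6) < 1e-9
```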
***principle->
|------------
One criterion for selecting the most appropriate part of a document is the structured document retrieval principle:
Structured document retrieval principle. A system should always retrieve the most specific part of a document answering the query.
|------------
This principle motivates a retrieval strategy that returns the smallest unit that contains the information sought, but does not go below this level. However, it can be hard to implement this principle algorithmically. Consider the query title#"Macbeth" applied to Figure 10.2. The title of the tragedy, Macbeth, and the title of Act I, Scene vii, Macbeth’s castle, are both good hits because they contain the matching term Macbeth. But in this case, the title of the tragedy, the higher node, is preferred. Deciding which level of the tree is right for answering a query is difficult.
***K-medoids->
|------------
The same efficiency problem is addressed by K-medoids, a variant of K-means that computes medoids instead of centroids as cluster centers. We define the medoid of a cluster as the document vector that is closest to the centroid. Since medoids are sparse document vectors, distance computations are fast.
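The definition can be sketched in a few lines: compute the centroid, then pick the cluster member nearest to it. The tiny 2-D “document vectors” below are made up for illustration.

```python
import math

def medoid(vectors):
    """Return the vector in the cluster closest (Euclidean) to the centroid,
    i.e. the medoid in the sense defined in the text."""
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    dist = lambda u, v: math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(vectors, key=lambda v: dist(v, centroid))

# Hypothetical cluster of three documents in 2-D.
docs = [(0.0, 2.0), (1.0, 1.0), (4.0, 0.0)]
# The centroid is (5/3, 1); (1, 1) lies nearest to it.
assert medoid(docs) == (1.0, 1.0)
```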
***inverse document frequency->
|------------
6.2.1 Inverse document frequency Raw term frequency as above suffers from a critical problem: all terms are considered equally important when it comes to assessing relevancy on a query. In fact certain terms have little or no discriminating power in determining relevance. For instance, a collection of documents on the auto industry is likely to have the term auto in almost every document. To this

Word        cf      df
try         10422   8760
insurance   10440   3997

◮ Figure 6.7 Collection frequency (cf) and document frequency (df) behave differently, as in this example from the Reuters collection.
|------------
How is the document frequency df of a term used to scale its weight? Denoting as usual the total number of documents in a collection by N, we define the inverse document frequency (idf) of a term t as follows:

idf_t = log(N / df_t).   (6.7)

Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low. Figure 6.8 gives an example of idf’s in the Reuters collection of 806,791 documents; in this example logarithms are to the base 10. In fact, as we will see in Exercise 6.12, the precise base of the logarithm is not material to ranking. We will give on page 227 a justification of the particular form in Equation (6.7).
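Equation (6.7) is a one-liner; the check below uses the Reuters collection size quoted in the text and base-10 logarithms, as in Figure 6.8.

```python
import math

def idf(N, df):
    """Inverse document frequency, Equation (6.7): idf_t = log10(N / df_t)."""
    return math.log10(N / df)

N = 806791  # the Reuters collection size used in the text
# A term in every document gets idf 0; rarer terms weigh more.
assert idf(N, 806791) == 0.0
assert idf(N, 10) > idf(N, 100000)
```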
|------------
6.2.2 Tf-idf weighting We now combine the definitions of term frequency and inverse document frequency, to produce a composite weight for each term in each document.
|------------
The outermost loop beginning Step 3 repeats the updating of Scores, iterating over each query term t in turn. In Step 5 we calculate the weight in the query vector for term t. Steps 6-8 update the score of each document by adding in the contribution from term t. This process of adding in contributions one query term at a time is sometimes known as term-at-a-time scoring or accumulation, and the N elements of the array Scores are therefore known as accumulators. For this purpose, it would appear necessary to store, with each postings entry, the weight wf_{t,d} of term t in document d (we have thus far used either tf or tf-idf for this weight, but leave open the possibility of other functions to be developed in Section 6.4). In fact this is wasteful, since storing this weight may require a floating point number. Two ideas help alleviate this space problem. First, if we are using inverse document frequency, we need not precompute idf_t; it suffices to store N/df_t at the head of the postings for t. Second, we store the term frequency tf_{t,d} for each postings entry. Finally, Step 12 extracts the top K scores – this requires a priority queue data structure, often implemented using a heap. Such a heap takes no more than 2N comparisons to construct, following which each of the K top scores can be extracted from the heap at a cost of O(log N) comparisons.
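The loop structure just described can be sketched as follows. The accumulators are a dict, idf is derived from df at the head of each postings list, and the top K scores come from a heap. The particular document weight (a sublinear tf times idf) is one choice among those the text leaves open, and the tiny postings data is hypothetical.

```python
import heapq
import math
from collections import defaultdict

def term_at_a_time(query_terms, postings, N, K):
    """Term-at-a-time scoring with accumulators: for each query term t,
    add w(t) * wf(t, d) into Scores[d]; then extract the top K via a heap.
    postings: {term: {docID: tf}}."""
    scores = defaultdict(float)              # the accumulators
    for t in query_terms:
        plist = postings.get(t, {})
        if not plist:
            continue
        idf_t = math.log10(N / len(plist))   # from N/df_t at the list head
        for doc, tf in plist.items():
            scores[doc] += idf_t * (1 + math.log10(tf))  # sublinear tf-idf
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])

# Hypothetical three-document collection.
postings = {"auto": {1: 2, 2: 1, 3: 1}, "insurance": {2: 3}}
top = term_at_a_time(["auto", "insurance"], postings, N=3, K=2)
assert top[0][0] == 2  # only document 2 matches both terms
```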
***DNS resolution->
|------------
20.2.2 DNS resolution Each web server (and indeed any host connected to the internet) has a unique IP address: a sequence of four bytes generally represented as four integers separated by dots; for instance 207.142.131.248 is the numerical IP address associated with the host www.wikipedia.org. Given a URL such as www.wikipedia.org in textual form, translating it to an IP address (in this case, 207.142.131.248) is a process known as DNS resolution or DNS lookup; here DNS stands for Domain Name Service. During DNS resolution, the program that wishes to perform this translation (in our case, a component of the web crawler) contacts a DNS server that returns the translated IP address. (In practice the entire translation may not occur at a single DNS server; rather, the DNS server contacted initially may recursively call upon other DNS servers to complete the translation.) For a more complex URL such as en.wikipedia.org/wiki/Domain_Name_System, the crawler component responsible for DNS resolution extracts the host name – in this case en.wikipedia.org – and looks up the IP address for the host en.wikipedia.org.
|------------
DNS resolution is a well-known bottleneck in web crawling. Due to the distributed nature of the Domain Name Service, DNS resolution may entail multiple requests and round-trips across the internet, requiring seconds and sometimes even longer. Right away, this puts in jeopardy our goal of fetching several hundred documents a second. A standard remedy is to introduce caching: URLs for which we have recently performed DNS lookups are likely to be found in the DNS cache, avoiding the need to go to the DNS servers on the internet. However, obeying politeness constraints (see Section 20.2.3) limits the cache hit rate.
|------------
There is another important difficulty in DNS resolution; the lookup implementations in standard libraries (likely to be used by anyone developing a crawler) are generally synchronous. This means that once a request is made to the Domain Name Service, other crawler threads at that node are blocked until the first request is completed. To circumvent this, most web crawlers implement their own DNS resolver as a component of the crawler. Thread i executing the resolver code sends a message to the DNS server and then performs a timed wait: it resumes either when being signaled by another thread or when a set time quantum expires. A single, separate DNS thread listens on the standard DNS port (port 53) for incoming response packets from the name service. Upon receiving a response, it signals the appropriate crawler thread (in this case, i) and hands it the response packet if i has not yet resumed because its time quantum has expired. A crawler thread that resumes because its wait time quantum has expired retries for a fixed number of attempts, sending out a new message to the DNS server and performing a timed wait each time; the designers of Mercator recommend of the order of five attempts. The time quantum of the wait increases exponentially with each of these attempts; Mercator started with one second and ended with roughly 90 seconds, in consideration of the fact that there are host names that take tens of seconds to resolve.
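The retry-with-growing-quantum part of this scheme can be sketched without any real networking. Here send_request is a hypothetical callable standing in for “send a message and perform a timed wait” (returning an IP string, or None on timeout); the ×3 growth factor is one way to span roughly the 1 s to 90 s range the text cites over five attempts, not Mercator’s exact schedule.

```python
def resolve_with_backoff(host, send_request, attempts=5, first_quantum=1.0):
    """Mercator-style retry sketch: try a lookup with a bounded wait, and on
    timeout retry with an exponentially growing time quantum."""
    quantum = first_quantum
    for _ in range(attempts):
        ip = send_request(host, timeout=quantum)
        if ip is not None:
            return ip
        quantum *= 3.0  # quanta 1, 3, 9, 27, 81 s: roughly 1 s -> 90 s
    raise TimeoutError(f"DNS resolution failed for {host}")

# Simulated resolver that only answers once the quantum is long enough.
def fake_dns(host, timeout):
    return "207.142.131.248" if timeout >= 9.0 else None

assert resolve_with_backoff("www.wikipedia.org", fake_dns) == "207.142.131.248"
```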
***authority score->
|------------
21.3 Hubs and Authorities We now develop a scheme in which, given a query, every web page is assigned two scores. One is called its hub score and the other its authority score. For any query, we compute two ranked lists of results rather than one. The ranking of one list is induced by the hub scores and that of the other by the authority scores.
|------------
This approach stems from a particular insight into the creation of web pages, that there are two primary kinds of web pages useful as results for broad-topic searches. By a broad topic search we mean an informational query such as "I wish to learn about leukemia". There are authoritative sources of information on the topic; in this case, the National Cancer Institute’s page on leukemia would be such a page. We will call such pages authorities; in the computation we are about to describe, they are the pages that will emerge with high authority scores.
***hard assignment->
|------------
A second important distinction can be made between hard and soft clustering algorithms. Hard clustering computes a hard assignment – each document is a member of exactly one cluster. The assignment of soft clustering algorithms is soft – a document’s assignment is a distribution over all clusters.
***independence->
|------------
13.5.2 χ² Feature selection Another popular feature selection method is χ². In statistics, the χ² test is applied to test the independence of two events, where two events A and B are defined to be independent if P(AB) = P(A)P(B) or, equivalently, P(A|B) = P(A) and P(B|A) = P(B). In feature selection, the two events are occurrence of the term and occurrence of the class. We then rank terms with respect to the following quantity:

X²(D, t, c) = Σ_{e_t∈{0,1}} Σ_{e_c∈{0,1}} (N_{e_t e_c} − E_{e_t e_c})² / E_{e_t e_c}   (13.18)

where e_t and e_c are defined as in Equation (13.16). N is the observed frequency in D and E the expected frequency. For example, E₁₁ is the expected frequency of t and c occurring together in a document assuming that term and class are independent.
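Equation (13.18) can be computed directly from the 2×2 table of document counts, with the expected counts derived from the row and column marginals under independence. The counts in the example are made up.

```python
def chi_square(n11, n10, n01, n00):
    """X^2 statistic of Equation (13.18) from a 2x2 document-count table:
    n11 = docs containing t and in class c, n10 = containing t but not in c,
    n01 = not containing t but in c, n00 = neither. Expected counts assume
    term and class are independent."""
    N = n11 + n10 + n01 + n00
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    row = {1: n11 + n10, 0: n01 + n00}   # term present / absent
    col = {1: n11 + n01, 0: n10 + n00}   # in class / not in class
    return sum((observed[(et, ec)] - row[et] * col[ec] / N) ** 2
               / (row[et] * col[ec] / N)
               for et in (0, 1) for ec in (0, 1))

# Perfect independence gives X^2 = 0; strong association a large value.
assert chi_square(10, 10, 10, 10) == 0.0
assert chi_square(40, 10, 10, 40) > 10.83  # above the 0.001 critical value
```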
***multivalue classification->
|------------
Classification for classes that are not mutually exclusive is called any-of, multilabel, or multivalue classification. In this case, a document can belong to several classes simultaneously, or to a single class, or to none of the classes.
***seek time->
|------------
◮ Table 4.1 Typical system parameters in 2007. The seek time is the time needed to position the disk head in a new position. The transfer time per byte is the rate of transfer from disk to memory when the head is in the right position.
|------------
Symbol   Statistic                                           Value
s        average seek time                                   5 ms = 5 × 10⁻³ s
b        transfer time per byte                              0.02 µs = 2 × 10⁻⁸ s
         processor’s clock rate                              10⁹ s⁻¹
p        lowlevel operation (e.g., compare & swap a word)    0.01 µs = 10⁻⁸ s
         size of main memory                                 several GB
         size of disk space                                  1 TB or more

4.1 Hardware basics When building an information retrieval (IR) system, many decisions are based on the characteristics of the computer hardware on which the system runs.
|------------
• Access to data in memory is much faster than access to data on disk. It takes a few clock cycles (perhaps 5 × 10⁻⁹ seconds) to access a byte in memory, but much longer to transfer it from disk (about 2 × 10⁻⁸ seconds). Consequently, we want to keep as much data as possible in memory, especially those data that we need to access frequently. We call the technique of keeping frequently used disk data in main memory caching.
• When doing a disk read or write, it takes a while for the disk head to move to the part of the disk where the data are located. This time is called the seek time and it averages 5 ms for typical disks. No data are being transferred during the seek. To maximize data transfer rates, chunks of data that will be read together should therefore be stored contiguously on disk. For example, using the numbers in Table 4.1 it may take as little as 0.2 seconds to transfer 10 megabytes (MB) from disk to memory if it is stored as one chunk, but up to 0.2 + 100 × (5 × 10⁻³) = 0.7 seconds if it is stored in 100 noncontiguous chunks because we need to move the disk head up to 100 times.
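The arithmetic in the second bullet can be packaged as a tiny cost model using the Table 4.1 parameters: one seek per contiguous chunk, plus the raw byte-transfer time.

```python
def transfer_time(bytes_total, chunks, seek_s=5e-3, byte_s=2e-8):
    """Time to read bytes_total split into `chunks` contiguous pieces,
    using the Table 4.1 numbers: one seek per chunk + raw transfer time."""
    return chunks * seek_s + bytes_total * byte_s

ten_mb = 10 * 10**6
assert abs(transfer_time(ten_mb, 1) - 0.205) < 1e-9    # one chunk: ~0.2 s
assert abs(transfer_time(ten_mb, 100) - 0.7) < 1e-9    # 100 chunks: 0.7 s
```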
***MapReduce->
|------------
The distributed index construction method we describe in this section is an application of MapReduce, a general architecture for distributed computing. MapReduce is designed for large computer clusters. The point of a cluster is to solve large computing problems on cheap commodity machines or nodes that are built from standard parts (processor, memory, disk) as opposed to on a supercomputer with specialized hardware. Although hundreds or thousands of machines are available in such clusters, individual machines can fail at any time. One requirement for robust distributed indexing is, therefore, that we divide the work up into chunks that we can easily assign and – in case of failure – reassign. A master node directs the process of assigning and reassigning tasks to individual worker nodes.
|------------
The map and reduce phases of MapReduce split up the computing job into chunks that standard machines can process in a short time. The various steps of MapReduce are shown in Figure 4.5 and an example on a collection consisting of two documents is shown in Figure 4.6. First, the input data, in our case a collection of web pages, are split into n splits where the size of the split is chosen to ensure that the work can be distributed evenly (chunks should not be too large) and efficiently (the total number of chunks we need to manage should not be too large); 16 or 64 MB are good sizes in distributed indexing. Splits are not preassigned to machines, but are instead assigned by the master node on an ongoing basis: As a machine finishes processing one split, it is assigned the next one. If a machine dies or becomes a laggard due to hardware problems, the split it is working on is simply reassigned to another machine.
|------------
In general, MapReduce breaks a large computing problem into smaller parts by recasting it in terms of manipulation of key-value pairs. For indexing, a key-value pair has the form (termID, docID). In distributed indexing, the mapping from terms to termIDs is also distributed and therefore more complex than in single-machine indexing. A simple solution is to maintain a (perhaps precomputed) mapping for frequent terms that is copied to all nodes and to use terms directly (instead of termIDs) for infrequent terms.
|------------
The map phase of MapReduce consists of mapping splits of the input data to key-value pairs. This is the same parsing task we also encountered in BSBI and SPIMI, and we therefore call the machines that execute the map phase parsers. Each parser writes its output to local intermediate files, the segment files (shown as a-f, g-p, q-z in Figure 4.5).
|------------
For the reduce phase, we want all values for a given key to be stored close together, so that they can be read and processed quickly. This is achieved by
◮ Figure 4.5 An example of distributed indexing with MapReduce. Adapted from Dean and Ghemawat (2004). [Figure: a master assigns splits to parsers in the map phase, and term partitions (a-f, g-p, q-z) of the segment files to inverters in the reduce phase, which produce the postings.]
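The two phases can be simulated in-process on a toy scale: the parser (map) emits one (term, docID) pair per token, and the inverter (reduce) sorts pairs so equal keys are adjacent and collects each key’s values into a postings list. The two example “documents” are hypothetical stand-ins for the Figure 4.6 collection.

```python
from itertools import groupby

def map_phase(doc_id, text):
    """Parser: emit one (term, docID) key-value pair per token."""
    return [(term, doc_id) for term in text.split()]

def reduce_phase(pairs):
    """Inverter: bring equal keys together, then collect each key's
    docIDs into a sorted postings list."""
    pairs = sorted(pairs)
    return {term: sorted({d for _, d in group})
            for term, group in groupby(pairs, key=lambda kv: kv[0])}

# Two-document example (texts are hypothetical).
pairs = map_phase(1, "caesar came caesar conquered") + map_phase(2, "caesar died")
index = reduce_phase(pairs)
assert index["caesar"] == [1, 2]
assert index["died"] == [2]
```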
***Bayes risk->
|------------
Optimal Decision Rule, the decision which minimizes the risk of loss, is to simply return documents that are more likely relevant than nonrelevant:

d is relevant iff P(R = 1|d, q) > P(R = 0|d, q)   (11.6)

Theorem 11.1. The PRP is optimal, in the sense that it minimizes the expected loss (also known as the Bayes risk) under 1/0 loss.
The proof can be found in Ripley (1996). However, it requires that all probabilities are known correctly. This is never the case in practice. Nevertheless, the PRP still provides a very useful foundation for developing models of IR.
***co-clustering->
|------------
The applications in Table 16.1 all cluster documents. Other information retrieval applications cluster words (e.g., Crouch 1988), contexts of words (e.g., Schütze and Pedersen 1995) or words and documents simultaneously (e.g., Tishby and Slonim 2000, Dhillon 2001, Zha et al. 2001). Simultaneous clustering of words and documents is an example of co-clustering or biclustering.
16.7 Exercises
? Exercise 16.7 Let Ω be a clustering that exactly reproduces a class structure C and Ω′ a clustering that further subdivides some clusters in Ω. Show that I(Ω; C) = I(Ω′; C).
***map phase->
|------------
The map phase of MapReduce consists of mapping splits of the input data to key-value pairs. This is the same parsing task we also encountered in BSBI and SPIMI, and we therefore call the machines that execute the map phase parsers. Each parser writes its output to local intermediate files, the segment files (shown as a-f, g-p, q-z in Figure 4.5).
|------------
For the reduce phase, we want all values for a given key to be stored close together, so that they can be read and processed quickly. This is achieved by
◮ Figure 4.5 An example of distributed indexing with MapReduce. Adapted from Dean and Ghemawat (2004). [Figure: a master assigns splits to parsers in the map phase, and term partitions (a-f, g-p, q-z) of the segment files to inverters in the reduce phase, which produce the postings.]
***L2 distance->
|------------
? Exercise 6.18 One measure of the similarity of two vectors is the Euclidean distance (or L₂ distance) between them:

|~x − ~y| = √( Σ_{i=1}^{M} (x_i − y_i)² )

                 query                                   document
word       tf   wf   df        idf   q_i = wf-idf   tf   wf   d_i = normalized wf   q_i · d_i
digital              10,000
video                100,000
cameras              50,000

◮ Table 6.1 Cosine computation for Exercise 6.19.
***statistical text classification->
|------------
Apart from manual classification and hand-crafted rules, there is a third approach to text classification, namely, machine learning-based text classification. It is the approach that we focus on in the next several chapters. In machine learning, the set of rules or, more generally, the decision criterion of the text classifier, is learned automatically from training data. This approach is also called statistical text classification if the learning method is statistical. In statistical text classification, we require a number of good example documents (or training documents) for each class. The need for manual classification is not eliminated because the training documents come from a person who has labeled them – where labeling refers to the process of annotating each document with its class. But labeling is arguably an easier task than writing rules. Almost anybody can look at a document and decide whether or not it is related to China. Sometimes such labeling is already implicitly part of an existing workflow. For instance, you may go through the news articles returned by a standing query each morning and give relevance feedback (cf. Chapter 9) by moving the relevant articles to a special folder like multicore-processors.
***bias->
|------------
Writing Γ_D for Γ(D) for better readability, we can transform Equation (14.7) as follows:

learning-error(Γ) = E_D[MSE(Γ_D)]
                  = E_D E_d [Γ_D(d) − P(c|d)]²          (14.10)
                  = E_d [bias(Γ, d) + variance(Γ, d)]   (14.11)
bias(Γ, d)     = [P(c|d) − E_D Γ_D(d)]²                 (14.12)
variance(Γ, d) = E_D [Γ_D(d) − E_D Γ_D(d)]²             (14.13)

where the equivalence between Equations (14.10) and (14.11) is shown in Equation (14.9) in Figure 14.13. Note that d and D are independent of each other. In general, for a random document d and a random training set D, D does not contain a labeled instance of d.
|------------
Bias is the squared difference between P(c|d), the true conditional probability of d being in c, and Γ_D(d), the prediction of the learned classifier, averaged over training sets. Bias is large if the learning method produces classifiers that are consistently wrong. Bias is small if (i) the classifiers are consistently right or (ii) different training sets cause errors on different documents or (iii) different training sets cause positive and negative errors on the same documents, but that average out to close to 0. If one of these three conditions holds, then E_D Γ_D(d), the expectation over all training sets, is close to P(c|d).
|------------
Linear methods like Rocchio and Naive Bayes have a high bias for nonlinear problems because they can only model one type of class boundary, a linear hyperplane. If the generative model P(〈d, c〉) has a complex nonlinear class boundary, the bias term in Equation (14.11) will be high because a large number of points will be consistently misclassified. For example, the circular enclave in Figure 14.11 does not fit a linear model and will be misclassified consistently by linear classifiers.
|------------
We can think of bias as resulting from our domain knowledge (or lack thereof) that we build into the classifier. If we know that the true boundary between the two classes is linear, then a learning method that produces linear classifiers is more likely to succeed than a nonlinear method. But if the true class boundary is not linear and we incorrectly bias the classifier to be linear, then classification accuracy will be low on average.
|------------
Nonlinear methods like kNN have low bias. We can see in Figure 14.6 that the decision boundaries of kNN are variable – depending on the distribution of documents in the training set, learned decision boundaries can vary greatly. As a result, each document has a chance of being classified correctly for some training sets. The average prediction E_D Γ_D(d) is therefore closer to P(c|d) and bias is smaller than for a linear learning method.
***pseudo relevance feedback->
|------------
Perhaps the best evaluation of the utility of relevance feedback is to do user studies of its effectiveness, in particular by doing a time-based comparison:

                    Precision at k = 50
Term weighting      no RF     pseudo RF
lnc.ltc             64.2%     72.7%
Lnu.ltu             74.2%     87.0%

◮ Figure 9.5 Results showing pseudo relevance feedback greatly improving performance. These results are taken from the Cornell SMART system at TREC 4 (Buckley et al. 1995), and also contrast the use of two different length normalization schemes (L vs. l); cf. Figure 6.15 (page 128). Pseudo relevance feedback consisted of adding 20 terms to each query.
|------------
9.1.6 Pseudo relevance feedback Pseudo relevance feedback, also known as blind relevance feedback, provides a method for automatic local analysis. It automates the manual part of relevance feedback, so that the user gets improved retrieval performance without an extended interaction. The method is to do normal retrieval to find an initial set of most relevant documents, to then assume that the top k ranked documents are relevant, and finally to do relevance feedback as before under this assumption.
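The three steps (retrieve, assume top k relevant, feed back) can be sketched as below. The feedback step uses a Rocchio-style positive update, since blind feedback yields no labeled nonrelevant set; the alpha/beta constants, the retrieve callable, and the toy documents are all hypothetical.

```python
def pseudo_relevance_feedback(query_vec, retrieve, k=10, alpha=1.0, beta=0.75):
    """Blind feedback sketch: retrieve, assume the top k documents are
    relevant, and re-weight the query Rocchio-style (positive term only).
    retrieve(q) returns ranked document vectors as dicts {term: weight}."""
    top_k = retrieve(query_vec)[:k]
    expanded = {t: alpha * w for t, w in query_vec.items()}
    for doc in top_k:
        for t, w in doc.items():
            expanded[t] = expanded.get(t, 0.0) + beta * w / len(top_k)
    return expanded

# Toy retrieval function returning two fixed top-ranked documents.
fake_retrieve = lambda q: [{"jaguar": 0.8, "cat": 0.6}, {"jaguar": 0.7, "zoo": 0.5}]
q2 = pseudo_relevance_feedback({"jaguar": 1.0}, fake_retrieve, k=2)
assert q2["cat"] > 0  # terms from the assumed-relevant docs enter the query
```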
|------------
9.1.7 Indirect relevance feedback We can also use indirect sources of evidence rather than explicit feedback on relevance as the basis for relevance feedback. This is often called implicit (relevance) feedback. Implicit feedback is less reliable than explicit feedback, but is more useful than pseudo relevance feedback, which contains no evidence of user judgments. Moreover, while users are often reluctant to provide explicit feedback, it is easy to collect implicit feedback in large quantities for a high volume system, such as a web search engine.
***Robots Exclusion Protocol->
|------------
A similar test could be inclusive rather than exclusive. Many hosts on the Web place certain portions of their websites off-limits to crawling, under a standard known as the Robots Exclusion Protocol. This is done by placing a file with the name robots.txt at the root of the URL hierarchy at the site. Here is an example robots.txt file that specifies that no robot should visit any URL whose position in the file hierarchy starts with /yoursite/temp/, except for the robot called “searchengine”.
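The original listing did not survive in this excerpt; in standard robots.txt syntax, a file expressing that policy (all robots barred from /yoursite/temp/, with “searchengine” exempt) would look like:

```
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
```

An empty Disallow line means no URL is off-limits to that agent.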
***document-at-a-time->
|------------
Note that the general algorithm of Figure 6.14 does not prescribe a specific implementation of how we traverse the postings lists of the various query terms; we may traverse them one term at a time as in the loop beginning at Step 3, or we could in fact traverse them concurrently as in Figure 1.6. In such a concurrent postings traversal we compute the scores of one document at a time, so that it is sometimes called document-at-a-time scoring. We will say more about this in Section 7.1.5.
***parameter tying->
|------------
Separate feature spaces for document zones. There are two strategies that can be used for document zones. Above we upweighted words that appear in certain zones. This means that we are using the same features (that is, parameters are “tied” across different zones), but we pay more attention to the occurrence of terms in particular zones. An alternative strategy is to have a completely separate set of features and corresponding parameters for words occurring in different zones. This is in principle more powerful: a word could usually indicate the topic Middle East when in the title but Commodities when in the body of a document. But, in practice, tying parameters is usually more successful. Having separate feature sets means having two or more times as many parameters, many of which will be much more sparsely seen in the training data, and hence with worse estimates, whereas upweighting has no bad effects of this sort. Moreover, it is quite uncommon for words to have different preferences when appearing in different zones; it is mainly the strength of their vote that should be adjusted. Nevertheless, ultimately this is a contingent result, depending on the nature and quantity of the training data.
***dynamic->
|------------
The two basic kinds of summaries are static, which are always the same regardless of the query, and dynamic (or query-dependent), which are customized according to the user’s information need as deduced from a query.
|------------
Dynamic summaries attempt to explain why a particular document was retrieved for the query at hand.
|------------
Dynamic summaries display one or more “windows” on the document, aiming to present the pieces that have the most utility to the user in evaluating the document with respect to their information need. Usually these windows contain one or several of the query terms, and so are often referred to as keyword-in-context (KWIC) snippets, though sometimes they may still be pieces of the text such as the title that are selected for their query-independent information value just as in the case of static summarization.
|------------
Dynamic summaries are generated in conjunction with scoring. If the query is found as a phrase, occurrences of the phrase in the document will be . . .

. . . In recent years, Papua New Guinea has faced severe economic difficulties and economic growth has slowed, partly as a result of weak governance and civil war, and partly as a result of external factors such as the Bougainville civil war which led to the closure in 1989 of the Panguna mine (at that time the most important foreign exchange earner and contributor to Government finances), the Asian financial crisis, a decline in the prices of gold and copper, and a fall in the production of oil. PNG’s economic development record over the past few years is evidence that governance issues underly many of the country’s problems. Good governance, which may be defined as the transparent and accountable management of human, natural, economic and financial resources for the purposes of equitable and sustainable development, flows from proper public sector management, efficient fiscal and accounting mechanisms, and a willingness to make service delivery a priority in practice. . . .
***extended query->
|------------
We can also support the user by interpreting all parent-child relationships in queries as descendant relationships with any number of intervening nodes allowed. We call such queries extended queries. The tree in Figure 10.3 and q4 in Figure 10.6 are examples of extended queries. We show edges that are interpreted as descendant relationships as dashed arrows. In q4, a dashed arrow connects book and Gates. As a pseudo-XPath notation for q4, we adopt book//#"Gates": a book that somewhere in its structure contains the word Gates where the path from the book node to Gates can be arbitrarily long.
|------------
The pseudo-XPath notation for the extended query that in addition specifies that Gates occurs in a section of the book is book//section//#"Gates".
|------------
In Figure 10.7, the user is looking for a chapter entitled FFT (q5). Suppose there is no such chapter in the collection, but that there are references to books on FFT (d4). A reference to a book on FFT is not exactly what the user is looking for, but it is better than returning nothing. Extended queries do not help here. The extended query q6 also returns nothing. This is a case where we may want to interpret the structural constraints specified in the query as hints rather than as strict conditions. As we will discuss in Section 10.4, users prefer a relaxed interpretation of structural constraints: elements that do not meet structural constraints perfectly should be ranked lower, but they should not be omitted from search results.
***multinomial classification->
***optimal clustering->
|------------
We then define Ω = {ω1, . . . , ωK} to be optimal if all clusterings Ω′ with k clusters, k ≤ K, have lower combination similarities:

|Ω′| ≤ |Ω| ⇒ COMB-SIM(Ω′) ≤ COMB-SIM(Ω)

Figure 17.12 shows that centroid clustering is not optimal. The clustering {{d1, d2}, {d3}} (for K = 2) has combination similarity −(4 − ε) and {{d1, d2, d3}} (for K = 1) has combination similarity −3.46. So the clustering {{d1, d2}, {d3}} produced in the first merge is not optimal since there is a clustering with fewer clusters ({{d1, d2, d3}}) that has higher combination similarity. Centroid clustering is not optimal because inversions can occur.
***edit distance->
|------------
We begin by examining two techniques for addressing isolated-term correction: edit distance, and k-gram overlap. We then proceed to context-sensitive correction.
|------------
3.3.3 Edit distance

Given two character strings s1 and s2, the edit distance between them is the minimum number of edit operations required to transform s1 into s2. Most commonly, the edit operations allowed for this purpose are: (i) insert a character into a string; (ii) delete a character from a string; and (iii) replace a character of a string by another character. For these operations, edit distance is sometimes known as Levenshtein distance. For example, the edit distance between cat and dog is 3. In fact, the notion of edit distance can be generalized to allowing different weights for different kinds of edit operations, for instance a higher weight may be placed on replacing the character s by the character p, than on replacing it by the character a (the latter being closer to s on the keyboard). Setting weights in this way depending on the likelihood of letters substituting for each other is very effective in practice (see Section 3.4 for the separate issue of phonetic similarity). However, the remainder of our treatment here will focus on the case in which all edit operations have the same weight.
|------------
It is well-known how to compute the (weighted) edit distance between two strings in time O(|s1| × |s2|), where |si| denotes the length of a string si.
|------------
The idea is to use the dynamic programming algorithm in Figure 3.5, where the characters in s1 and s2 are given in array form. The algorithm fills the (integer) entries in a matrix m whose two dimensions equal the lengths of the two strings whose edit distance is being computed; the (i, j) entry of the matrix will hold (after the algorithm is executed) the edit distance between the strings consisting of the first i characters of s1 and the first j characters of s2. The central dynamic programming step is depicted in Lines 8-10 of Figure 3.5, where the three quantities whose minimum is taken correspond to substituting a character in s1, inserting a character in s1 and inserting a character in s2.
|------------
Figure 3.6 shows an example Levenshtein distance computation of Figure 3.5. The typical cell [i, j] has four entries formatted as a 2 × 2 cell. The lower right entry in each cell is the min of the other three, corresponding to the main dynamic programming step in Figure 3.5. The other three entries are the three entries m[i − 1, j − 1] + 0 or 1 (depending on whether s1[i] = s2[j]), m[i − 1, j] + 1 and m[i, j − 1] + 1.

EDITDISTANCE(s1, s2)
 1  int m[i, j] = 0
 2  for i ← 1 to |s1|
 3  do m[i, 0] = i
 4  for j ← 1 to |s2|
 5  do m[0, j] = j
 6  for i ← 1 to |s1|
 7  do for j ← 1 to |s2|
 8     do m[i, j] = min{m[i − 1, j − 1] + if (s1[i] = s2[j]) then 0 else 1 fi,
 9                      m[i − 1, j] + 1,
10                      m[i, j − 1] + 1}
11  return m[|s1|, |s2|]

◮ Figure 3.5 Dynamic programming algorithm for computing the edit distance between strings s1 and s2.
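A direct Python transcription of the algorithm in Figure 3.5 is straightforward; the only adjustment is Python's 0-based string indexing.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance via the dynamic program of Figure 3.5.
    m[i][j] holds the edit distance between the first i characters
    of s1 and the first j characters of s2."""
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(1, len(s1) + 1):
        m[i][0] = i            # delete all of s1[:i]
    for j in range(1, len(s2) + 1):
        m[0][j] = j            # insert all of s2[:j]
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            m[i][j] = min(
                m[i-1][j-1] + (0 if s1[i-1] == s2[j-1] else 1),  # substitute
                m[i-1][j] + 1,                                   # delete from s1
                m[i][j-1] + 1)                                   # insert into s1
    return m[len(s1)][len(s2)]
```

For example, edit_distance("cat", "dog") is 3, matching the text: three substitutions and no cheaper path.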
***normal vector->
|------------
The generalization of a line in M-dimensional space is a hyperplane, which we define as the set of points ~x that satisfy:

~wT~x = b    (14.2)

where ~w is the M-dimensional normal vector1 of the hyperplane and b is a constant. This definition of hyperplanes includes lines (any line in 2D can be defined by w1x1 + w2x2 = b) and 2-dimensional planes (any plane in 3D can be defined by w1x1 + w2x2 + w3x3 = b). A line divides a plane in two, a plane divides 3-dimensional space in two, and hyperplanes divide higher-dimensional spaces in two.
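The two-sided division the paragraph describes amounts to checking the sign of ~wT~x − b. A minimal sketch (the function name is ours, not the book's):

```python
def hyperplane_side(w, x, b):
    """Which side of the hyperplane w.x = b does point x fall on?
    Returns +1 if w.x > b, -1 if w.x < b, and 0 if x lies on the plane."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - b
    return (s > 0) - (s < 0)
```

For the 2D line x1 + x2 = 1 (w = (1, 1), b = 1), the point (1, 1) falls on one side, (0, 0) on the other, and (0.5, 0.5) on the line itself, illustrating how a single hyperplane splits the space in two.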
***ad hoc retrieval->
|------------
Our goal is to develop a system to address the ad hoc retrieval task. This is the most standard IR task. In it, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query. An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need. Our example above was rather artificial in that the information need was defined in terms of particular words, whereas usually a user is interested in a topic like “pipeline leaks” and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as pipeline rupture. To assess the effectiveness of an IR system (i.e., the quality of its search results), a user will usually want to know two key statistics about the system’s returned results for a query:

Precision: What fraction of the returned results are relevant to the information need?

Recall: What fraction of the relevant documents in the collection were returned by the system?

Detailed discussion of relevance and evaluation measures including precision and recall is found in Chapter 8.
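The two definitions translate directly into code. A minimal sketch over sets of document IDs:

```python
def precision_recall(returned, relevant):
    """Precision: fraction of returned results that are relevant.
    Recall: fraction of all relevant documents that were returned."""
    returned, relevant = set(returned), set(relevant)
    tp = len(returned & relevant)          # relevant documents we returned
    precision = tp / len(returned) if returned else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, returning documents {1, 2, 3, 4} when {2, 4, 5} are relevant gives precision 2/4 and recall 2/3: half of what came back was useful, but one relevant document was missed.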
|------------
13 Text classification and Naive Bayes

Thus far, this book has mainly discussed the process of ad hoc retrieval, where users have transient information needs that they try to address by posing one or more queries to a search engine. However, many users have ongoing information needs. For example, you might need to track developments in multicore computer chips. One way of doing this is to issue the query multicore AND computer AND chip against an index of recent newswire articles each morning. In this and the following two chapters we examine the question: How can this repetitive task be automated? To this end, many systems support standing queries. A standing query is like any other query except that it is periodically executed on a collection to which new documents are incrementally added over time.
***boosting->
|------------
Exercise 13.16 χ2 and mutual information do not distinguish between positively and negatively correlated features. Because most good text classification features are positively correlated (i.e., they occur more often in c than in c̄), one may want to explicitly rule out the selection of negative indicators. How would you do this?

13.7 References and further reading

General introductions to statistical classification and machine learning can be found in (Hastie et al. 2001), (Mitchell 1997), and (Duda et al. 2000), including many important methods (e.g., decision trees and boosting) that we do not cover. A comprehensive review of text classification methods and results is (Sebastiani 2002). Manning and Schütze (1999, Chapter 16) give an accessible introduction to text classification with coverage of decision trees, perceptrons and maximum entropy models. More information on the superlinear time complexity of learning methods that are more accurate than Naive Bayes can be found in (Perkins et al. 2003) and (Joachims 2006a).
***query->
|------------
◮ Figure 1.2 Results from Shakespeare for the query Brutus AND Caesar AND NOT Calpurnia.
***greedy feature selection->
|------------
All three methods – MI, χ2 and frequency based – are greedy methods. They may select features that contribute no incremental information over previously selected features. In Figure 13.7, kong is selected as the seventh term even though it is highly correlated with previously selected hong and therefore redundant. Although such redundancy can negatively impact accuracy, non-greedy methods (see Section 13.7 for references) are rarely used in text classification due to their computational cost.
***idf->
|------------
Exercise 4.12 We claimed (on page 80) that an auxiliary index can impair the quality of collection statistics. An example is the term weighting method idf, which is defined as log(N/dfi) where N is the total number of documents and dfi is the number of documents that term i occurs in (Section 6.2.1, page 117). Show that even a small auxiliary index can cause significant error in idf when it is computed on the main index only.
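The effect the exercise asks about is easy to demonstrate numerically. The sketch below uses hypothetical counts of our own choosing: a term appearing once in a large main index but frequently among the newer documents held in a small auxiliary index.

```python
import math

def idf(N, df):
    """idf weight log10(N / df), as defined in Section 6.2.1."""
    return math.log10(N / df)

# Hypothetical numbers for illustration: the term is rare in the main
# index but common among recently added (auxiliary-index) documents.
N_main, N_aux = 1_000_000, 10_000     # document counts of the two indexes
df_main, df_aux = 1, 99               # document frequencies of the term

idf_main_only = idf(N_main, df_main)                   # ignores the auxiliary index
idf_correct = idf(N_main + N_aux, df_main + df_aux)    # full-collection value
```

Here the main-index-only value is 6.0 while the correct value is about 4.0, so the term's weight is overstated by roughly 50% even though the auxiliary index holds only 1% as many documents as the main index.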
|------------
A challenge in XML retrieval related to nesting is that we may need to distinguish different contexts of a term when we compute term statistics for ranking, in particular inverse document frequency (idf) statistics as defined in Section 6.2.1 (page 117). For example, the term Gates under the node author is unrelated to an occurrence under a content node like section if used to refer to the plural of gate. It makes little sense to compute a single document frequency for Gates in this example.
|------------
One solution is to compute idf for XML-context/term pairs, e.g., to com- pute different idf weights for author#"Gates" and section#"Gates".
|------------
11.3.3 Probability estimates in practice

Under the assumption that relevant documents are a very small percentage of the collection, it is plausible to approximate statistics for nonrelevant documents by statistics from the whole collection. Under this assumption, ut (the probability of term occurrence in nonrelevant documents for a query) is dft/N and

log[(1 − ut)/ut] = log[(N − dft)/dft] ≈ log N/dft    (11.22)

In other words, we can provide a theoretical justification for the most frequently used form of idf weighting, which we saw in Section 6.2.1.
|------------
2. Croft and Harper (1979) proposed using a constant in their combination match model. For instance, we might assume that pt is constant over all terms xt in the query and that pt = 0.5. This means that each term has even odds of appearing in a relevant document, and so the pt and (1− pt) factors cancel out in the expression for RSV. Such an estimate is weak, but doesn’t disagree violently with our hopes for the search terms appearing in many but not all relevant documents. Combining this method with our earlier approximation for ut, the document ranking is determined simply by which query terms occur in documents scaled by their idf weighting.
|------------
11.4.3 Okapi BM25: a non-binary model

The BIM was originally designed for short catalog records and abstracts of fairly consistent length, and it works reasonably in these contexts, but for modern full-text search collections, it seems clear that a model should pay attention to term frequency and document length, as in Chapter 6. The BM25 weighting scheme, often called Okapi weighting, after the system in which it was first implemented, was developed as a way of building a probabilistic model sensitive to these quantities while not introducing too many additional parameters into the model (Spärck Jones et al. 2000). We will not develop the full theory behind the model here, but just present a series of forms that build up to the standard form now used for document scoring. The simplest score for document d is just idf weighting of the query terms present, as in Equation (11.22):

RSVd = ∑t∈q log(N/dft)    (11.30)

Sometimes, an alternative version of idf is used. If we start with the formula in Equation (11.21) but in the absence of relevance feedback information we estimate that S = s = 0, then we get an alternative idf formulation as follows:

RSVd = ∑t∈q log[(N − dft + 1/2)/(dft + 1/2)]    (11.31)

This variant behaves slightly strangely: if a term occurs in over half the documents in the collection then this model gives a negative term weight, which is presumably undesirable. But, assuming the use of a stop list, this normally doesn’t happen, and the value for each summand can be given a floor of 0.
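The two scoring forms above, Equations (11.30) and (11.31), can be sketched directly, including the floor of 0 on each summand of the alternative variant:

```python
import math

def rsv_idf(query_terms, doc_terms, N, df):
    """Equation (11.30): sum the plain idf weights of the
    query terms that are present in the document."""
    return sum(math.log(N / df[t])
               for t in query_terms if t in doc_terms)

def rsv_idf_alt(query_terms, doc_terms, N, df):
    """Equation (11.31): the alternative idf, with each summand
    floored at 0 so a term occurring in over half the collection
    cannot contribute a negative weight."""
    return sum(max(0.0, math.log((N - df[t] + 0.5) / (df[t] + 0.5)))
               for t in query_terms if t in doc_terms)
```

With N = 1000 and a term in 600 documents, the ratio (N − dft + 1/2)/(dft + 1/2) is below 1, so the unfloored log would be negative; the floor clamps that summand to 0, exactly the fix the text suggests.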
***optimal learning method->
|------------
We can use learning error as a criterion for selecting a learning method in statistical text classification. A learning method Γ is optimal for a distribution P(D) if it minimizes the learning error.
***text summarization->
|------------
There has been extensive work within natural language processing (NLP) on better ways to do text summarization. Most such work still aims only to choose sentences from the original document to present and concentrates on how to select good sentences. The models typically combine positional factors, favoring the first and last paragraphs of documents and the first and last sentences of paragraphs, with content factors, emphasizing sentences with key terms, which have low document frequency in the collection as a whole, but high frequency and good distribution across the particular document being returned. In sophisticated NLP approaches, the system synthesizes sentences for a summary, either by doing full text generation or by editing and perhaps combining sentences used in the document. For example, it might delete a relative clause or replace a pronoun with the noun phrase that it refers to. This last class of methods remains in the realm of research and is seldom used for search results: it is easier, safer, and often even better to just use sentences from the original document.
***collection->
|------------
Let us now consider a more realistic scenario, simultaneously using the opportunity to introduce some terminology and notation. Suppose we have N = 1 million documents. By documents we mean whatever units we have decided to build a retrieval system over. They might be individual memos or chapters of a book (see Section 2.1.2 (page 20) for further discussion). We will refer to the group of documents over which we perform retrieval as the (document) collection. It is sometimes also referred to as a corpus (a body of texts). Suppose each document is about 1000 words long (2–3 book pages). If

2. Formally, we take the transpose of the matrix to be able to get the terms as column vectors.
***lossless->
|------------
The compression techniques we describe in the remainder of this chapter are lossless, that is, all information is preserved. Better compression ratios can be achieved with lossy compression, which discards some information. Case folding, stemming, and stop word elimination are forms of lossy compression. Similarly, the vector space model (Chapter 6) and dimensionality reduction techniques like latent semantic indexing (Chapter 18) create compact representations from which we cannot fully restore the original collection. Lossy compression makes sense when the “lost” information is unlikely ever to be used by the search system. For example, web search is characterized by a large number of documents, short queries, and users who only look at the first few pages of results. As a consequence, we can discard postings of documents that would only be used for hits far down the list. Thus, there are retrieval scenarios where lossy methods can be used for compression without any reduction in effectiveness.
***out-links->
|------------
Figure 19.2 shows two nodes A and B from the web graph, each corresponding to a web page, with a hyperlink from A to B. We refer to the set of all such nodes and directed edges as the web graph. Figure 19.2 also shows that (as is the case with most links on web pages) there is some text surrounding the origin of the hyperlink on page A. This text is generally encapsulated in the href attribute of the <a> (for anchor) tag that encodes the hyperlink in the HTML code of page A, and is referred to as anchor text. As one might suspect, this directed graph is not strongly connected: there are pairs of pages such that one cannot proceed from one page of the pair to the other by following hyperlinks. We refer to the hyperlinks into a page as in-links and those out of a page as out-links. The number of in-links to a page (also known as its in-degree) has averaged from roughly 8 to 15, in a range of studies. We similarly define the out-degree of a web page to be the number of links out of it.

◮ Figure 19.3 A sample small web graph. In this example we have six pages labeled A-F. Page B has in-degree 3 and out-degree 1. This example graph is not strongly connected: there is no path from any of pages B-F to page A.
***weighted zone scoring->
|------------
In fact, we can reduce the size of the dictionary by encoding the zone in which a term occurs in the postings. In Figure 6.3 for instance, we show how occurrences of william in the title and author zones of various documents are encoded. Such an encoding is useful when the size of the dictionary is a concern (because we require the dictionary to fit in main memory). But there is another important reason why the encoding of Figure 6.3 is useful: the efficient computation of scores using a technique we will call weighted zone scoring.
***wildcard query->
|------------
3 Dictionaries and tolerant retrieval

In Chapters 1 and 2 we developed the ideas underlying inverted indexes for handling Boolean and proximity queries. Here, we develop techniques that are robust to typographical errors in the query, as well as alternative spellings. In Section 3.1 we develop data structures that help the search for terms in the vocabulary in an inverted index. In Section 3.2 we study the idea of a wildcard query: a query such as *a*e*i*o*u*, which seeks documents containing any term that includes all the five vowels in sequence.
|------------
leads to the wildcard query S*dney); (2) the user is aware of multiple variants of spelling a term and (consciously) seeks documents containing any of the variants (e.g., color vs. colour); (3) the user seeks documents containing variants of a term that would be caught by stemming, but is unsure whether the search engine performs stemming (e.g., judicial vs. judiciary, leading to the wildcard query judicia*); (4) the user is uncertain of the correct rendition of a foreign word or phrase (e.g., the query Universit* Stuttgart).
|------------
A query such as mon* is known as a trailing wildcard query, because the * symbol occurs only once, at the end of the search string. A search tree on the dictionary is a convenient way of handling trailing wildcard queries: we walk down the tree following the symbols m, o and n in turn, at which point we can enumerate the set W of terms in the dictionary with the prefix mon.
***XML Schema->
|------------
We also need the concept of schema in this chapter. A schema puts constraints on the structure of allowable XML documents for a particular application. A schema for Shakespeare’s plays may stipulate that scenes can only occur as children of acts and that only acts and scenes have the number attribute. Two standards for schemas for XML documents are XML DTD (document type definition) and XML Schema. Users can only write structured queries for an XML retrieval system if they have some minimal knowledge about the schema of the collection.
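As a concrete illustration of the constraints just described, a minimal DTD fragment (with hypothetical element names, not taken from the book's figures) that allows scenes only as children of acts and gives the number attribute only to acts and scenes might read:

```
<!ELEMENT play   (act+)>
<!ELEMENT act    (scene+)>
<!ELEMENT scene  (#PCDATA)>
<!ATTLIST act    number CDATA #REQUIRED>
<!ATTLIST scene  number CDATA #REQUIRED>
```

A validating parser would then reject a document that, say, nested a scene directly under play or put a number attribute on play.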
***clique->
|------------
Both single-link and complete-link clustering have graph-theoretic interpretations. Define sk to be the combination similarity of the two clusters merged in step k, and G(sk) the graph that links all data points with a similarity of at least sk. Then the clusters after step k in single-link clustering are the connected components of G(sk) and the clusters after step k in complete-link clustering are maximal cliques of G(sk). A connected component is a maximal set of connected points such that there is a path connecting each pair. A clique is a set of points that are completely linked with each other.
***false negative->
|------------
An alternative to this information-theoretic interpretation of clustering is to view it as a series of decisions, one for each of the N(N − 1)/2 pairs of documents in the collection. We want to assign two documents to the same cluster if and only if they are similar. A true positive (TP) decision assigns two similar documents to the same cluster, a true negative (TN) decision assigns two dissimilar documents to different clusters. There are two types of errors we can commit. A false positive (FP) decision assigns two dissimilar documents to the same cluster. A false negative (FN) decision assigns two similar documents to different clusters. The Rand index (RI) measures the percentage of decisions that are correct. That is, it is simply accuracy (Section 8.3, page 155).
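The pairwise-decision view translates directly into code. A minimal sketch, where similarity is taken to mean "same gold-standard class":

```python
from itertools import combinations

def rand_index(clustering, gold):
    """Rand index: the fraction of the N(N-1)/2 pairwise decisions
    that are correct.  `clustering` maps each document to its cluster,
    `gold` maps each document to its gold-standard class."""
    tp = tn = fp = fn = 0
    for d1, d2 in combinations(sorted(clustering), 2):
        same_cluster = clustering[d1] == clustering[d2]
        same_class = gold[d1] == gold[d2]
        if same_cluster and same_class:
            tp += 1        # similar pair, same cluster: correct
        elif not same_cluster and not same_class:
            tn += 1        # dissimilar pair, different clusters: correct
        elif same_cluster:
            fp += 1        # dissimilar pair forced together
        else:
            fn += 1        # similar pair split apart
    return (tp + tn) / (tp + tn + fp + fn)
```

A perfect clustering gets RI = 1.0; merging everything into one cluster is penalized by the FP pairs it creates.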
|------------
The Rand index gives equal weight to false positives and false negatives.
|------------
Separating similar documents is sometimes worse than putting pairs of dissimilar documents in the same cluster. We can use the F measure (Section 8.3, page 154) to penalize false negatives more strongly than false positives by selecting a value β > 1, thus giving more weight to recall.
***first story detection->
|------------
Second, in some applications the purpose of clustering is not to create a complete hierarchy or exhaustive partition of the entire document set. For instance, first story detection or novelty detection is the task of detecting the first occurrence of an event in a stream of news stories. One approach to this task is to find a tight cluster within the documents that were sent across the wire in a short period of time and are dissimilar from all previous documents. For example, the documents sent over the wire in the minutes after the World Trade Center attack on September 11, 2001 form such a cluster. Variations of single-link clustering can do well on this task since it is the structure of small parts of the vector space – and not global structure – that is important in this case.
|------------
The centroid algorithm described here is due to Voorhees (1985b). Voorhees recommends complete-link and centroid clustering over single-link for a retrieval application. The Buckshot algorithm was originally published by Cutting et al. (1993). Allan et al. (1998) apply single-link clustering to first story detection.
***best-merge persistence->
|------------
Can we also speed up the other three HAC algorithms with an NBM array? We cannot because only single-link clustering is best-merge persistent. Suppose that the best merge cluster for ωk is ωj in single-link clustering.
|------------
Figure 17.10 demonstrates that best-merge persistence does not hold for complete-link clustering, which means that we cannot use an NBM array to speed up clustering. After merging d3’s best merge candidate d2 with cluster d1, an unrelated cluster d4 becomes the best merge candidate for d3. This is because the complete-link merge criterion is non-local and can be affected by points at a great distance from the area where two merge candidates meet.
***Probability Ranking Principle->
|------------
Writing P(Ā) for the complement of an event A, we similarly have:

P(Ā, B) = P(B|Ā)P(Ā)    (11.2)

Probability theory also has a partition rule, which says that if an event B can be divided into an exhaustive set of disjoint subcases, then the probability of B is the sum of the probabilities of the subcases. A special case of this rule gives that:

P(B) = P(A, B) + P(Ā, B)    (11.3)

From these we can derive Bayes’ Rule for inverting conditional probabilities:

P(A|B) = P(B|A)P(A)/P(B) = [P(B|A) / ∑X∈{A,Ā} P(B|X)P(X)] P(A)    (11.4)

This equation can also be thought of as a way of updating probabilities. We start off with an initial estimate of how likely the event A is when we do not have any other information; this is the prior probability P(A). Bayes’ rule lets us derive a posterior probability P(A|B) after having seen the evidence B, based on the likelihood of B occurring in the two cases that A does or does not hold.1 Finally, it is often useful to talk about the odds of an event, which provide a kind of multiplier for how probabilities change:

Odds: O(A) = P(A)/P(Ā) = P(A)/(1 − P(A))    (11.5)

11.2 The Probability Ranking Principle

11.2.1 The 1/0 loss case

We assume a ranked retrieval setup as in Section 6.3, where there is a collection of documents, the user issues a query, and an ordered list of documents is returned. We also assume a binary notion of relevance as in Chapter 8. For a query q and a document d in the collection, let Rd,q be an indicator random variable that says whether d is relevant with respect to a given query q. That is, it takes on a value of 1 when the document is relevant and 0 otherwise. In context we will often write just R for Rd,q.
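Equations (11.3)–(11.5) can be checked numerically. A minimal sketch, with the conditional probabilities passed in as plain numbers:

```python
def bayes_posterior(p_a, p_b_given_a, p_b_given_not_a):
    """Bayes' rule (11.4): update the prior P(A) to the posterior P(A|B),
    using the partition rule (11.3) to compute P(B) in the denominator."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

def odds(p):
    """Odds (11.5): O(A) = P(A) / (1 - P(A))."""
    return p / (1 - p)
```

For example, with prior P(A) = 0.3, P(B|A) = 0.8 and P(B|Ā) = 0.2, the partition rule gives P(B) = 0.38 and the posterior P(A|B) = 0.24/0.38 ≈ 0.63: observing B raises our belief in A, and the odds move accordingly.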
|------------
Using a probabilistic model, the obvious order in which to present documents to the user is to rank documents by their estimated probability of relevance with respect to the information need: P(R = 1|d, q). This is the basis of the Probability Ranking Principle (PRP) (van Rijsbergen 1979, 113–114):

“If a reference retrieval system’s response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.”

In the simplest case of the PRP, there are no retrieval costs or other utility concerns that would differentially weight actions or errors. You lose a point for either returning a nonrelevant document or failing to return a relevant document (such a binary situation where you are evaluated on your accuracy is called 1/0 loss). The goal is to return the best possible results as the top k documents, for any value of k the user chooses to examine. The PRP then says to simply rank all documents in decreasing order of P(R = 1|d, q). If a set of retrieval results is to be returned, rather than an ordering, the Bayes Optimal Decision Rule

1. The term likelihood is just a synonym of probability. It is the probability of an event or data according to a model. The term is usually used when people are thinking of holding the data fixed, while varying the model.
***gold standard->
|------------
The standard approach to information retrieval system evaluation revolves around the notion of relevant and nonrelevant documents. With respect to a user information need, a document in the test collection is given a binary classification as either relevant or nonrelevant. This decision is referred to as the gold standard or ground truth judgment of relevance. The test document collection and suite of information needs have to be of a reasonable size: you need to average performance over fairly large test sets, as results are highly variable over different documents and information needs. As a rule of thumb, 50 information needs has usually been found to be a sufficient minimum.
***Porter stemmer->
|------------
The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective, is Porter’s algorithm (Porter 1980). The entire algorithm is too long and intricate to present here, but we will indicate its general nature. Porter’s algorithm consists of 5 phases of word reductions, applied sequentially. Within each phase there are various conventions to select rules, such as selecting the rule from each rule group that applies to the longest suffix. In the first phase, this convention is used with the following rule group:

(2.1)   Rule          Example
        SSES → SS     caresses → caress
        IES  → I      ponies   → poni
        SS   → SS     caress   → caress
        S    →        cats     → cat

Many of the later rules use a concept of the measure of a word, which loosely checks the number of syllables to see whether a word is long enough that it is reasonable to regard the matching portion of a rule as a suffix rather than as part of the stem of a word. For example, the rule:

        (m > 1) EMENT →

would map replacement to replac, but not cement to c. The official site for the Porter Stemmer is:

        http://www.tartarus.org/˜martin/PorterStemmer/

Other stemmers exist, including the older, one-pass Lovins stemmer (Lovins 1968), and newer entrants like the Paice/Husk stemmer (Paice 1990); see:

        http://www.cs.waikato.ac.nz/˜eibe/stemmers/
        http://www.comp.lancs.ac.uk/computing/research/stemming/

Figure 2.8 presents an informal comparison of the different behaviors of these stemmers. Stemmers use language-specific rules, but they require less knowledge than a lemmatizer, which needs a complete vocabulary and morphological analysis to correctly lemmatize words. Particular domains may also require special stemming rules. However, the exact stemmed form does not matter, only the equivalence classes it forms.
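The first-phase rule group (2.1) and its longest-suffix convention can be sketched in a few lines. This is only an illustrative fragment, not the full five-phase Porter stemmer:

```python
# Phase-1 rule group (2.1) of Porter's algorithm, applying the rule
# whose suffix is the longest match (rules listed longest first).
# Only this fragment is implemented; the real stemmer has 5 phases.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def phase1(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(phase1("caresses"), phase1("ponies"),
      phase1("caress"), phase1("cats"))
# caress poni caress cat
```

The seemingly redundant SS → SS rule is what stops "caress" from falling through to the S → rule and losing its final s.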
|------------
Rather than using a stemmer, you can use a lemmatizer, a tool from Natural Language Processing which does full morphological analysis to accurately identify the lemma for each word. Doing full morphological analysis produces at most very modest benefits for retrieval. It is hard to say more,

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation
Lovins stemmer: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres
Porter stemmer: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret
Paice stemmer: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

◮ Figure 2.8 A comparison of three stemming algorithms on a sample text.
***NEXI->
|------------
A common format for XML queries is NEXI (Narrowed Extended XPath I). We give an example in Figure 10.3. We display the query on four lines for typographical convenience, but it is intended to be read as one unit without line breaks. In particular, //section is embedded under //article.
|------------
We usually handle relational attribute constraints by prefiltering or postfiltering: We simply exclude all elements from the result set that do not meet the relational attribute constraints. In this chapter, we will not address how to do this efficiently and instead focus on the core information retrieval problem in XML retrieval, namely how to rank documents according to the relevance criteria expressed in the about conditions of the NEXI query.
***retrieval systems->
|------------
compression – instead of docIDs we can compress smaller gaps between IDs, thus reducing space requirements for the index. However, this structure for the index is not optimal when we build ranked (Chapters 6 and 7) – as opposed to Boolean – retrieval systems. In ranked retrieval, postings are often ordered according to weight or impact, with the highest-weighted postings occurring first. With this organization, scanning of long postings lists during query processing can usually be terminated early when weights have become so small that any further documents can be predicted to be of low similarity to the query (see Chapter 6). In a docID-sorted index, new documents are always inserted at the end of postings lists. In an impact-sorted index (Section 7.1.5, page 140), the insertion can occur anywhere, thus complicating the update of the inverted index.
|------------
Security is an important consideration for retrieval systems in corporations. A low-level employee should not be able to find the salary roster of the corporation, but authorized managers need to be able to search for it. Users’ results lists must not contain documents they are barred from opening; the very existence of a document can be sensitive information.
***CLEF->
|------------
NII Test Collections for IR Systems (NTCIR). The NTCIR project has built various test collections of similar sizes to the TREC collections, focusing on East Asian language and cross-language information retrieval, where queries are made in one language over a document collection containing documents in one or more other languages. See: http://research.nii.ac.jp/ntcir/data/data-en.html

Cross Language Evaluation Forum (CLEF). This evaluation series has concentrated on European languages and cross-language information retrieval.
|------------
See: http://www.clef-campaign.org/

Reuters-21578 and Reuters-RCV1. For text classification, the most used test collection has been the Reuters-21578 collection of 21578 newswire articles; see Chapter 13, page 279. More recently, Reuters released the much larger Reuters Corpus Volume 1 (RCV1), consisting of 806,791 documents; see Chapter 4, page 69. Its scale and rich annotation make it a better basis for future research.
***Buckshot algorithm->
|------------
Even with these optimizations, HAC algorithms are all Θ(N²) or Θ(N² log N) and therefore infeasible for large sets of 1,000,000 or more documents. For such large sets, HAC can only be used in combination with a flat clustering algorithm like K-means. Recall that K-means requires a set of seeds as initialization (Figure 16.5, page 361). If these seeds are badly chosen, then the resulting clustering will be of poor quality. We can employ an HAC algorithm to compute seeds of high quality. If the HAC algorithm is applied to a document subset of size √N, then the overall runtime of K-means cum HAC seed generation is Θ(N). This is because the application of a quadratic algorithm to a sample of size √N has an overall complexity of Θ(N). An appropriate adjustment can be made for an Θ(N² log N) algorithm to guarantee linearity. This algorithm is referred to as the Buckshot algorithm. It combines the determinism and higher reliability of HAC with the efficiency of K-means.
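A rough sketch of the Buckshot idea follows. Real Buckshot would run an actual HAC algorithm on the √N sample and cut the resulting hierarchy into K groups; the round-robin grouping below is only a stand-in for that step, and the 1-D "documents" are toy data:

```python
# Buckshot sketch: sample sqrt(N) documents, cluster the sample, and
# use the resulting centroids as K-means seeds. The "HAC" step here is
# a round-robin placeholder, not a real hierarchical clustering.
import math
import random

def buckshot_seeds(docs, k, rng=random.Random(0)):
    # Quadratic clustering on a sample of size sqrt(N) costs Θ(N).
    sample = rng.sample(docs, int(math.isqrt(len(docs))))
    # Placeholder for HAC: partition the sample into k groups.
    groups = [sample[i::k] for i in range(k)]
    # Return the group centroids as K-means seeds (1-D docs here).
    return [sum(g) / len(g) for g in groups if g]

docs = [float(i) for i in range(100)]
seeds = buckshot_seeds(docs, 3)
print(len(seeds))  # 3 seeds for K = 3
```

The point of the sketch is the cost accounting: the expensive clustering only ever touches √N items, so the seeding stays linear in N overall.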
|------------
The centroid algorithm described here is due to Voorhees (1985b). Voorhees recommends complete-link and centroid clustering over single-link for a retrieval application. The Buckshot algorithm was originally published by Cutting et al. (1993). Allan et al. (1998) apply single-link clustering to first story detection.
***singular value decomposition->
|------------
We next state a closely related decomposition of a symmetric square matrix into the product of matrices derived from its eigenvectors. This will pave the way for the development of our main tool for text analysis, the singular value decomposition (Section 18.2).
|------------
18.2 Term-document matrices and singular value decompositions

The decompositions we have been studying thus far apply to square matrices. However, the matrix we are interested in is the M × N term-document matrix C where (barring a rare coincidence) M ≠ N; furthermore, C is very unlikely to be symmetric. To this end we first describe an extension of the symmetric diagonal decomposition known as the singular value decomposition. We then show in Section 18.3 how this can be used to construct an approximate version of C. It is beyond the scope of this book to develop a full treatment of the mathematics underlying singular value decompositions; following the statement of Theorem 18.3 we relate the singular value decomposition to the symmetric diagonal decompositions from Section 18.1.1. Given C, let U be the M × M matrix whose columns are the orthogonal eigenvectors of CCᵀ, and V be the N × N matrix whose columns are the orthogonal eigenvectors of CᵀC. Denote by Cᵀ the transpose of a matrix C.
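The shapes involved are easy to get wrong, so here is a small pure-Python check on a toy 2 × 3 term-document matrix: CCᵀ is M × M and symmetric (its eigenvectors give U), while CᵀC is N × N and symmetric (its eigenvectors give V). The counts are invented:

```python
# Shape and symmetry check for the SVD building blocks CC^T and C^T C,
# using toy counts and minimal pure-Python matrix helpers.
def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

C = [[1, 0, 1],   # M = 2 terms, N = 3 documents (toy counts)
     [0, 1, 1]]

CCt = matmul(C, transpose(C))   # 2 x 2, symmetric -> eigenvectors give U
CtC = matmul(transpose(C), C)   # 3 x 3, symmetric -> eigenvectors give V
print(len(CCt), len(CCt[0]), len(CtC), len(CtC[0]))  # 2 2 3 3
```

Both products are symmetric by construction, which is why the symmetric diagonal decomposition of the previous section applies to each of them.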
***singleton->
|------------
17.1 Hierarchical agglomerative clustering

Hierarchical clustering algorithms are either top-down or bottom-up. Bottom-up algorithms treat each document as a singleton cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all documents. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering or HAC. Top-down clustering requires a method for splitting a cluster.
|------------
An HAC clustering is typically visualized as a dendrogram as shown in Figure 17.1. Each merge is represented by a horizontal line. The y-coordinate of the horizontal line is the similarity of the two clusters that were merged, where documents are viewed as singleton clusters. We call this similarity the combination similarity of the merged cluster. For example, the combination similarity of the cluster consisting of Lloyd’s CEO questioned and Lloyd’s chief / U.S. grilling in Figure 17.1 is ≈ 0.56. We define the combination similarity of a singleton cluster as its document’s self-similarity (which is 1.0 for cosine similarity).
***XML attribute->
|------------
10.1 Basic XML concepts

An XML document is an ordered, labeled tree. Each node of the tree is an XML element and is written with an opening and closing tag. An element can have one or more XML attributes. In the XML document in Figure 10.1, the scene element is enclosed by the two tags <scene number="vii"> and </scene>. It has an attribute number with value vii and two child elements, title and verse.
***cluster-based classification->
***weight vector->
|------------
Let us formalize an SVM with algebra. A decision hyperplane (page 302) can be defined by an intercept term b and a decision hyperplane normal vector ~w which is perpendicular to the hyperplane. This vector is commonly referred to in the machine learning literature as the weight vector. To choose among all the hyperplanes that are perpendicular to the normal vector, we specify the intercept term b. Because the hyperplane is perpendicular to the normal vector, all points ~x on the hyperplane satisfy ~wᵀ~x = −b. Now suppose that we have a set of training data points D = {(~xᵢ, yᵢ)}, where each member is a pair of a point ~xᵢ and a class label yᵢ corresponding to it.¹ For SVMs, the two data classes are always named +1 and −1 (rather than 1 and 0), and the intercept term is always explicitly represented as b (rather than being folded into the weight vector ~w by adding an extra always-on feature).
***grep->
|------------
1.1 An example information retrieval problem

A fat book which many people own is Shakespeare’s Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia. One way to do that is to start at the beginning and to read through all the text, noting for each play whether it contains Brutus and Caesar and excluding it from consideration if it contains Calpurnia. The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly referred to as grepping through text, after the Unix command grep, which performs this process. Grepping through text can be a very effective process, especially given the speed of modern computers, and often allows useful possibilities for wildcard pattern matching through the use of regular expressions. With modern computers, for simple querying of modest collections (the size of Shakespeare’s Collected Works is a bit under one million words of text in total), you really need nothing more.
|------------
2. To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as “within 5 words” or “within the same sentence”.
***relational database->
|------------
10 XML retrieval

Information retrieval systems are often contrasted with relational databases.
|------------
Traditionally, IR systems have retrieved information from unstructured text – by which we mean “raw” text without markup. Databases are designed for querying relational data: sets of records that have values for predefined attributes such as employee number, title and salary. There are fundamental differences between information retrieval and database systems in terms of retrieval model, data structures and query language as shown in Table 10.1. Some highly structured text search problems are most efficiently handled by a relational database, for example, if the employee table contains an attribute for short textual job descriptions and you want to find all employees who are involved with invoicing. In this case, the SQL query:

    select lastname from employees where job_desc like ’invoic%’;

may be sufficient to satisfy your information need with high precision and recall.
|------------
Relational databases do not deal well with this use case.
***model-based clustering->
|------------
? Exercise 16.4  Why are documents that do not use the same term for the concept car likely to end up in the same cluster in K-means clustering?

Exercise 16.5  Two of the possible termination conditions for K-means were (1) assignment does not change, (2) centroids do not change (page 361). Do these two conditions imply each other?

✄ 16.5 Model-based clustering

In this section, we describe a generalization of K-means, the EM algorithm.
|------------
In K-means, we attempt to find centroids that are good representatives. We can view the set of K centroids as a model that generates the data. Generating a document in this model consists of first picking a centroid at random and then adding some noise. If the noise is normally distributed, this procedure will result in clusters of spherical shape. Model-based clustering assumes that the data were generated by a model and tries to recover the original model from the data. The model that we recover from the data then defines clusters and an assignment of documents to clusters.
***access control lists->
|------------
In the indexes we have considered so far, postings lists are ordered with respect to docID. As we see in Chapter 5, this is advantageous for compression.

◮ Figure 4.8 A user-document matrix for access control lists. Element (i, j) is 1 if user i has access to document j and 0 otherwise. During query processing, a user’s access postings list is intersected with the results list returned by the text part of the index.
|------------
User authorization is often mediated through access control lists or ACLs. ACLs can be dealt with in an information retrieval system by representing each document as the set of users that can access them (Figure 4.8) and then inverting the resulting user-document matrix. The inverted ACL index has, for each user, a “postings list” of documents they can access – the user’s access list. Search results are then intersected with this list. However, such an index is difficult to maintain when access permissions change – we discussed these difficulties in the context of incremental indexing for regular postings lists in Section 4.5. It also requires the processing of very long postings lists for users with access to large document subsets. User membership is therefore often verified by retrieving access information directly from the file system at query time – even though this slows down retrieval.
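Inverting the user-document matrix and intersecting with a results list can be sketched as follows; the users, documents, and matrix are toy data:

```python
# Invert a user-document access matrix (Figure 4.8) into per-user
# access lists, then intersect with a text-index result list.
# Users and the 0/1 matrix below are toy data.
users = ["alice", "bob"]
matrix = [[1, 1, 1],   # alice can read docs 0, 1, 2
          [0, 1, 0]]   # bob can read only doc 1

# The inverted ACL index: one "postings list" (set) of docIDs per user.
acl_index = {u: {d for d, bit in enumerate(row) if bit}
             for u, row in zip(users, matrix)}

def filter_results(user, result_docids):
    # Intersect the text index's results with the user's access list,
    # preserving the ranking order of the results.
    return [d for d in result_docids if d in acl_index[user]]

print(filter_results("bob", [0, 1, 2]))  # [1]
```

Filtering preserves the original ranking, which is why the intersection is written as a scan over the results list rather than over the access list.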
***relational->
|------------
IR can also cover other kinds of data and information problems beyond that specified in the core definition above. The term “unstructured data” refers to data which does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to maintain product inventories and personnel records. In reality, almost no data are truly “unstructured”. This is definitely true of all text data if you count the latent linguistic structure of human languages. But even accepting that the intended notion of structure is overt structure, most text has structure, such as headings and paragraphs and footnotes, which is commonly represented in documents by explicit markup (such as the coding underlying web pages).

1. In modern parlance, the word “search” has tended to replace “(information) retrieval”; the term “search” is quite ambiguous, but in context we use the two synonymously.
|------------
10 XML retrieval

Information retrieval systems are often contrasted with relational databases.
|------------
Traditionally, IR systems have retrieved information from unstructured text – by which we mean “raw” text without markup. Databases are designed for querying relational data: sets of records that have values for predefined attributes such as employee number, title and salary. There are fundamental differences between information retrieval and database systems in terms of retrieval model, data structures and query language as shown in Table 10.1. Some highly structured text search problems are most efficiently handled by a relational database, for example, if the employee table contains an attribute for short textual job descriptions and you want to find all employees who are involved with invoicing. In this case, the SQL query:

    select lastname from employees where job_desc like ’invoic%’;

may be sufficient to satisfy your information need with high precision and recall.
|------------
However, many structured data sources containing text are best modeled as structured documents rather than relational data. We call the search over such structured documents structured retrieval. Queries in structured retrieval can be either structured or unstructured, but we will assume in this chapter that the collection consists only of structured documents. Applications of structured retrieval include digital libraries, patent databases, blogs, text in which entities like persons and locations have been tagged (in a process called named entity tagging) and output from office suites like OpenOffice that save documents as marked up text. In all of these applications, we want to be able to run queries that combine textual criteria with structural criteria.
|------------
Relational databases do not deal well with this use case.
***logarithmic merging->
|------------
LMERGEADDTOKEN(indexes, Z0, token)
 1  Z0 ← MERGE(Z0, {token})
 2  if |Z0| = n
 3    then for i ← 0 to ∞
 4      do if Ii ∈ indexes
 5        then Zi+1 ← MERGE(Ii, Zi)
 6             (Zi+1 is a temporary index on disk.)
 7             indexes ← indexes − {Ii}
 8        else Ii ← Zi (Zi becomes the permanent index Ii.)
 9             indexes ← indexes ∪ {Ii}
10             BREAK
11         Z0 ← ∅

LOGARITHMICMERGE()
 1  Z0 ← ∅ (Z0 is the in-memory index.)
 2  indexes ← ∅
 3  while true
 4    do LMERGEADDTOKEN(indexes, Z0, GETNEXTTOKEN())

◮ Figure 4.7 Logarithmic merging. Each token (termID, docID) is initially added to in-memory index Z0 by LMERGEADDTOKEN. LOGARITHMICMERGE initializes Z0 and indexes.
|------------
For the purpose of time complexity, a postings list is simply a list of docIDs.) We can do better than Θ(T²/n) by introducing log₂(T/n) indexes I0, I1, I2, . . . of size 2⁰ × n, 2¹ × n, 2² × n, . . . . Postings percolate up this sequence of indexes and are processed only once on each level. This scheme is called logarithmic merging (Figure 4.7). As before, up to n postings are accumulated in an in-memory auxiliary index, which we call Z0. When the limit n is reached, the 2⁰ × n postings in Z0 are transferred to a new index I0 that is created on disk. The next time Z0 is full, it is merged with I0 to create an index Z1 of size 2¹ × n. Then Z1 is either stored as I1 (if there isn’t already an I1) or merged with I1 into Z2 (if I1 exists); and so on. We service search requests by querying in-memory Z0 and all currently valid indexes Ii on disk and merging the results. Readers familiar with the binomial heap data structure² will recognize this scheme.

2. See, for example, (Cormen et al. 1990, Chapter 19).
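The percolation in Figure 4.7 can be made concrete with a small runnable sketch. Indexes here are modeled as sorted lists of integer postings, and the capacity n and token stream are toy values:

```python
# Runnable sketch of logarithmic merging (Figure 4.7). An "index" is a
# sorted list of postings; levels i = 0, 1, 2, ... hold 2^i * n postings.
n = 4  # in-memory capacity (toy value)

def lmerge_add_token(indexes, z0, token):
    z0.append(token)                      # Z0 <- MERGE(Z0, {token})
    if len(z0) < n:
        return z0
    z = sorted(z0)                        # spill the full Z0
    i = 0
    while i in indexes:                   # merge upward while level i is taken
        z = sorted(indexes.pop(i) + z)    # Z_{i+1} <- MERGE(I_i, Z_i)
        i += 1
    indexes[i] = z                        # Z_i becomes permanent I_i
    return []                             # fresh empty Z0

indexes, z0 = {}, []
for t in range(10):                       # toy stream of 10 postings
    z0 = lmerge_add_token(indexes, z0, t)
print(sorted(len(v) for v in indexes.values()), z0)  # [8] [8, 9]
```

After 10 tokens with n = 4, two spills of size 2⁰ × n have cascaded into a single level-1 index of size 2¹ × n, with the last two postings still in memory, mirroring the binary-counter behavior the text describes.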
***polytope->
|------------
We can also weight the “votes” of the k nearest neighbors by their cosine

3. The generalization of a polygon to higher dimensions is a polytope. A polytope is a region in M-dimensional space bounded by (M − 1)-dimensional hyperplanes. In M dimensions, the decision boundaries for kNN consist of segments of (M − 1)-dimensional hyperplanes that form the Voronoi tessellation into convex polytopes for the training set of documents. The decision criterion of assigning a document to the majority class of its k nearest neighbors applies equally to M = 2 (tessellation into polygons) and M > 2 (tessellation into polytopes).
***optimal classifier->
|------------
In this book, we discuss NB as a classifier for text. The independence assumptions do not hold for text. However, it can be shown that NB is an optimal classifier (in the sense of minimal error rate on new data) for data where the independence assumptions do hold.
|------------
We define a classifier γ to be optimal for a distribution P(〈d, c〉) if it minimizes MSE(γ).
***residual sum of squares->
|------------
A measure of how well the centroids represent the members of their clusters is the residual sum of squares or RSS, the squared distance of each vector from its centroid summed over all vectors:

    RSS_k = ∑_{~x∈ω_k} |~x − ~µ(ω_k)|²

    RSS = ∑_{k=1}^{K} RSS_k    (16.7)

RSS is the objective function in K-means and our goal is to minimize it. Since N is fixed, minimizing RSS is equivalent to minimizing the average squared distance, a measure of how well centroids represent their documents.
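Equation (16.7) can be traced directly on a toy 1-D clustering (the points below are invented):

```python
# Computing RSS (16.7): squared distance of each vector from its
# cluster's centroid, summed within clusters and then over clusters.
# Toy 1-D "documents".
clusters = [[1.0, 2.0, 3.0], [10.0, 12.0]]

def centroid(points):
    return sum(points) / len(points)

def rss(clusters):
    total = 0.0
    for points in clusters:
        mu = centroid(points)
        total += sum((x - mu) ** 2 for x in points)  # RSS_k
    return total

print(rss(clusters))  # 2.0 + 2.0 = 4.0
```

Merging the two clusters into one would raise the total, which is the intuition behind using RSS as the K-means objective.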
***single-label classification->
|------------
The second type of classification with more than two classes is one-of classification. Here, the classes are mutually exclusive. Each document must belong to exactly one of the classes. One-of classification is also called multinomial, polytomous⁴, multiclass, or single-label classification. Formally, there is a single classification function γ in one-of classification whose range is C, i.e., γ(d) ∈ {c1, . . . , cJ}. kNN is a (nonlinear) one-of classifier.
***Mercator->
|------------
This seemingly simple recursive traversal of the web graph is complicated by the many demands on a practical web crawling system: the crawler has to be distributed, scalable, efficient, polite, robust and extensible while fetching pages of high quality. We examine the effects of each of these issues. Our treatment follows the design of the Mercator crawler that has formed the basis of a number of research and commercial crawlers. As a reference point, fetching a billion pages (a small fraction of the static Web at present) in a month-long crawl requires fetching several hundred pages each second. We will see how to use a multi-threaded design to address several bottlenecks in the overall crawler system in order to attain this fetch rate.
***minimum spanning tree->
|------------
17.9 References and further reading

An excellent general review of clustering is (Jain et al. 1999). Early references for specific HAC algorithms are (King 1967) (single-link), (Sneath and Sokal 1973) (complete-link, GAAC) and (Lance and Williams 1967) (discussing a large variety of hierarchical clustering algorithms). The single-link algorithm in Figure 17.9 is similar to Kruskal’s algorithm for constructing a minimum spanning tree. A graph-theoretical proof of the correctness of Kruskal’s algorithm (which is analogous to the proof in Section 17.5) is provided by Cormen et al. (1990, Theorem 23.1). See Exercise 17.5 for the connection between minimum spanning trees and single-link clusterings.
|------------
17.10 Exercises

? Exercise 17.5  A single-link clustering can also be computed from the minimum spanning tree of a graph. The minimum spanning tree connects the vertices of a graph at the smallest possible cost, where cost is defined as the sum over all edges of the graph. In our case the cost of an edge is the distance between two documents. Show that if ∆_{k−1} > ∆_k > . . . > ∆_1 are the costs of the edges of a minimum spanning tree, then these edges correspond to the k − 1 merges in constructing a single-link clustering.
***term frequency->
|------------
3. A Boolean model only records term presence or absence, but often we would like to accumulate evidence, giving more weight to documents that have a term several times as opposed to ones that contain it only once. To be able to do this we need term frequency information (the number of times a term occurs in a document) in postings lists.
|------------
Exercise 6.6  For the value of g estimated in Exercise 6.5, compute the weighted zone score for each (query, document) example. How do these scores relate to the relevance judgments in Figure 6.5 (quantized to 0/1)?

Exercise 6.7  Why does the expression for g in (6.6) not involve training examples in which sT(dt, qt) and sB(dt, qt) have the same value?

6.2 Term frequency and weighting

Thus far, scoring has hinged on whether or not a query term is present in a zone within a document. We take the next logical step: a document or zone that mentions a query term more often has more to do with that query and therefore should receive a higher score. To motivate this, we recall the notion of a free text query introduced in Section 1.4: a query in which the terms of the query are typed freeform into the search interface, without any connecting search operators (such as Boolean operators). This query style, which is extremely popular on the web, views the query as simply a set of words. A plausible scoring mechanism then is to compute a score that is the sum, over the query terms, of the match scores between each query term and the document.
|------------
This weighting scheme is referred to as term frequency and is denoted tf_{t,d}, with the subscripts denoting the term and the document in order.
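The summed-match-score idea for a free text query can be sketched as follows, using raw tf_{t,d} as the match score (the document and query are toy data, and this is the scheme before idf weighting is introduced):

```python
# Free-text scoring by summed term frequencies:
# score(q, d) = sum over query terms t of tf_{t,d}.
# Toy document and query; raw tf only, no idf weighting yet.
from collections import Counter

def term_frequencies(doc_tokens):
    return Counter(doc_tokens)  # tf_{t,d} for every term t in d

def score(query_terms, doc_tokens):
    tf = term_frequencies(doc_tokens)
    return sum(tf[t] for t in query_terms)  # missing terms count 0

doc = "to sleep perchance to dream".split()
print(score(["to", "dream"], doc))  # tf(to)=2 + tf(dream)=1 = 3
```

A document repeating a query term contributes once per occurrence, which is exactly the "more mentions, higher score" step this section takes beyond Boolean presence/absence.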
|------------
6.2.1 Inverse document frequency

Raw term frequency as above suffers from a critical problem: all terms are considered equally important when it comes to assessing relevancy on a query. In fact certain terms have little or no discriminating power in determining relevance. For instance, a collection of documents on the auto industry is likely to have the term auto in almost every document. To this

    Word        cf       df
    try         10422    8760
    insurance   10440    3997

◮ Figure 6.7 Collection frequency (cf) and document frequency (df) behave differently, as in this example from the Reuters collection.
***any-of classification->
|------------
Classification for classes that are not mutually exclusive is called any-of, multilabel, or multivalue classification. In this case, a document can belong to several classes simultaneously, or to a single class, or to none of the classes.
|------------
Solving an any-of classification task with linear classifiers is straightforward:

1. Build a classifier for each class, where the training set consists of the set of documents in the class (positive labels) and its complement (negative labels).
***auxiliary index->
|------------
If there is a requirement that new documents be included quickly, one solution is to maintain two indexes: a large main index and a small auxiliary index that stores new documents. The auxiliary index is kept in memory. Searches are run across both indexes and results merged. Deletions are stored in an invalidation bit vector. We can then filter out deleted documents before returning the search result. Documents are updated by deleting and reinserting them.
|------------
Each time the auxiliary index becomes too large, we merge it into the main index. The cost of this merging operation depends on how we store the index in the file system. If we store each postings list as a separate file, then the merge simply consists of extending each postings list of the main index by the corresponding postings list of the auxiliary index. In this scheme, the reason for keeping the auxiliary index is to reduce the number of disk seeks required over time. Updating each document separately requires up to M_ave disk seeks, where M_ave is the average size of the vocabulary of documents in the collection. With an auxiliary index, we only put additional load on the disk when we merge auxiliary and main indexes.
***add-one smoothing->
|------------
To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count (cf. Section 11.3.2):

    P̂(t|c) = (T_ct + 1) / ∑_{t′∈V} (T_ct′ + 1) = (T_ct + 1) / ((∑_{t′∈V} T_ct′) + B),    (13.7)

where B = |V| is the number of terms in the vocabulary. Add-one smoothing can be interpreted as a uniform prior (each term occurs once for each class) that is then updated as evidence from the training data comes in. Note that this is a prior probability for the occurrence of a term as opposed to the prior probability of a class which we estimate in Equation (13.5) on the document level.
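Equation (13.7) can be checked on toy counts; the vocabulary and the T_ct values below are invented:

```python
# Add-one (Laplace) smoothing as in (13.7). T_ct are raw counts of
# term t in training documents of class c; B = |V|. Toy counts.
vocab = ["beijing", "shanghai", "tokyo"]
counts_c = {"beijing": 3, "shanghai": 1, "tokyo": 0}  # T_ct for one class

B = len(vocab)                     # vocabulary size
total = sum(counts_c.values())     # sum over t' of T_ct'
p_hat = {t: (counts_c[t] + 1) / (total + B) for t in vocab}

print(p_hat["tokyo"])  # (0 + 1) / (4 + 3) = 1/7, no longer zero
```

The unseen term "tokyo" gets probability 1/7 instead of 0, and the smoothed estimates still sum to one over the vocabulary.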
***bowtie->
|------------
There is ample evidence that these links are not randomly distributed; for one thing, the distribution of the number of links into a web page does not follow the Poisson distribution one would expect if every web page were to pick the destinations of its links uniformly at random. Rather, this distribution is widely reported to be a power law, in which the total number of web pages with in-degree i is proportional to 1/i^α; the value of α typically reported by studies is 2.1.¹ Furthermore, several studies have suggested that the directed graph connecting web pages has a bowtie shape: there are three major categories of web pages that are sometimes referred to as IN, OUT and SCC. A web surfer can pass from any page in IN to any page in SCC, by following hyperlinks. Likewise, a surfer can pass from a page in SCC to any page in OUT. Finally, the surfer can surf from any page in SCC to any other page in SCC. However, it is not possible to pass from a page in SCC to any page in IN, or from a page in OUT to a page in SCC (or, consequently, IN).
***polychotomous->
|------------
J hyperplanes do not divide R^|V| into J distinct regions as illustrated in Figure 14.12. Thus, we must use a combination method when using two-class linear classifiers for one-of classification. The simplest method is to

4. A synonym of polytomous is polychotomous.
***type->
|------------
2.2 Determining the vocabulary of terms

2.2.1 Tokenization

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization:

Input: Friends, Romans, Countrymen, lend me your ears;
Output: Friends Romans Countrymen lend me your ears

These tokens are often loosely referred to as terms or words, but it is sometimes important to make a type/token distinction. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A type is the class of all tokens containing the same character sequence. A term is a (perhaps normalized) type that is included in the IR system’s dictionary. The set of index terms could be entirely distinct from the tokens, for instance, they could be semantic identifiers in a taxonomy, but in practice in modern IR systems they are strongly related to the tokens in the document. However, rather than being exactly the tokens that appear in the document, they are usually derived from them by various normalization processes which are discussed in Section 2.2.3. For example, if the document to be indexed is to sleep perchance to dream, then there are 5 tokens, but only 4 types (since there are 2 instances of to). However, if to is omitted from the index (as a stop word, see Section 2.2.2 (page 27)), then there will be only 3 terms: sleep, perchance, and dream.
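The to sleep perchance to dream example can be checked directly; whitespace tokenization and a one-word stop list stand in for a real tokenizer here:

```python
text = "to sleep perchance to dream"

tokens = text.split()    # each occurrence is a token
types = set(tokens)      # distinct character sequences
terms = types - {"to"}   # drop the stop word "to"

print(len(tokens), len(types), sorted(terms))
# → 5 4 ['dream', 'perchance', 'sleep']
```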
***external criterion of quality->
|------------
As a surrogate for user judgments, we can use a set of classes in an evaluation benchmark or gold standard (see Section 8.5, page 164, and Section 13.6, page 279). The gold standard is ideally produced by human judges with a good level of inter-judge agreement (see Chapter 8, page 152). We can then compute an external criterion that evaluates how well the clustering matches the gold standard classes. For example, we may want to say that the optimal clustering of the search results for jaguar in Figure 16.2 consists of three classes corresponding to the three senses car, animal, and operating system.
***word segmentation->
|------------
Other languages make the problem harder in new ways. German writes compound nouns without spaces (e.g., Computerlinguistik ‘computational linguistics’; Lebensversicherungsgesellschaftsangestellter ‘life insurance company employee’). Retrieval systems for German greatly benefit from the use of a compound-splitter module, which is usually implemented by seeing if a word can be subdivided into multiple words that appear in a vocabulary. This phenomenon reaches its limit case with major East Asian languages (e.g., Chinese, Japanese, Korean, and Thai), where text is written without any spaces between words. An example is shown in Figure 2.3. One approach here is to perform word segmentation as prior linguistic processing. Methods of word segmentation vary from having a large vocabulary and taking the longest vocabulary match with some heuristics for unknown words to the use of machine learning sequence models, such as hidden Markov models or conditional random fields, trained over hand-segmented words (see the references

◮ Figure 2.3 The standard unsegmented form of Chinese text using the simplified characters of mainland China. There is no whitespace between words, not even between sentences – the apparent space after the Chinese period (◦) is just a typographical illusion caused by placing the character on the left side of its square box. The first sentence is just words in Chinese characters with no spaces between them. The second and third sentences include Arabic numerals and punctuation breaking up the Chinese characters.
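The longest-vocabulary-match heuristic mentioned above can be sketched as a greedy left-to-right scan (the function name, the single-character fallback for unknown material, and the toy vocabulary in the usage below are our own choices, not from the text):

```python
def max_match(s, vocab):
    """Greedy longest-match segmentation: repeatedly take the longest
    prefix of the remaining string that is in the vocabulary; fall back
    to a single character for unknown material (a common heuristic)."""
    max_len = max(map(len, vocab))
    out = []
    i = 0
    while i < len(s):
        for j in range(min(len(s), i + max_len), i, -1):
            if s[i:j] in vocab:
                out.append(s[i:j])
                i = j
                break
        else:
            out.append(s[i])  # unknown single character
            i += 1
    return out
```

For example, with the toy vocabulary {"欢迎", "市民"}, `max_match("欢迎市民", ...)` yields the two-word segmentation; greedy matching can of course go wrong on genuinely ambiguous strings, which is why the text also mentions trained sequence models.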
***Laplace smoothing->
|------------
To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count (cf. Section 11.3.2):

P̂(t|c) = (T_ct + 1) / ∑_{t′∈V} (T_ct′ + 1) = (T_ct + 1) / ((∑_{t′∈V} T_ct′) + B)    (13.7)

where B = |V| is the number of terms in the vocabulary. Add-one smoothing can be interpreted as a uniform prior (each term occurs once for each class) that is then updated as evidence from the training data comes in. Note that this is a prior probability for the occurrence of a term as opposed to the prior probability of a class which we estimate in Equation (13.5) on the document level.
***Scatter-Gather->
|------------
The hypothesis states that if there is a document from a cluster that is relevant to a search request, then it is likely that other documents from the same cluster are also relevant. This is because clustering puts together documents that share many terms. The cluster hypothesis essentially is the contiguity

◮ Table 16.1 Some applications of clustering in information retrieval.

Application | What is clustered? | Benefit | Example
Search result clustering | search results | more effective information presentation to user | Figure 16.2
Scatter-Gather | (subsets of) collection | alternative user interface: “search without typing” | Figure 16.3
Collection clustering | collection | effective information presentation for exploratory browsing | McKeown et al. (2002), http://news.google.com
Language modeling | collection | increased precision and/or recall | Liu and Croft (2004)
Cluster-based retrieval | collection | higher efficiency: faster search | Salton (1971a)
|------------
A better user interface is also the goal of Scatter-Gather, the second application in Table 16.1. Scatter-Gather clusters the whole collection to get groups of documents that the user can select or gather. The selected groups are merged and the resulting set is again clustered. This process is repeated until a cluster of interest is found. An example is shown in Figure 16.3.
***expectation step->
|------------
How do we use EM to infer the parameters of the clustering from the data? That is, how do we choose parameters Θ that maximize L(D|Θ)? EM is similar to K-means in that it alternates between an expectation step, corresponding to reassignment, and a maximization step, corresponding to recomputation of the parameters of the model. The parameters of K-means are the centroids, the parameters of the instance of EM in this section are the α_k and q_mk.
|------------
The expectation step computes the soft assignment of documents to clusters given the current parameters q_mk and α_k:

Expectation step:  r_nk = α_k (∏_{t_m ∈ d_n} q_mk)(∏_{t_m ∉ d_n} (1 − q_mk)) / ∑_{k=1}^{K} α_k (∏_{t_m ∈ d_n} q_mk)(∏_{t_m ∉ d_n} (1 − q_mk))    (16.17)

This expectation step applies Equations (16.14) and (16.15) to computing the likelihood that ω_k generated document d_n. It is the classification procedure for the multivariate Bernoulli in Table 13.3. Thus, the expectation step is nothing else but Bernoulli Naive Bayes classification (including normalization, i.e. dividing by the denominator, to get a probability distribution over clusters).
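Equation (16.17) transcribes directly into code for a single document, under the convention that q[k][t] holds q_mk for every vocabulary term (the function name and data layout are illustrative):

```python
def e_step(doc_terms, alphas, q):
    """Soft assignment r_nk of one document to K clusters (Eq. 16.17).
    doc_terms: set of terms occurring in the document d_n
    alphas: list of cluster priors alpha_k
    q: list of dicts, q[k][t] = P(term t present | cluster k),
       all over the same fixed vocabulary."""
    vocab = set(q[0])
    scores = []
    for k, alpha in enumerate(alphas):
        s = alpha
        for t in vocab:
            # present terms contribute q_mk, absent terms (1 - q_mk)
            s *= q[k][t] if t in doc_terms else (1 - q[k][t])
        scores.append(s)
    z = sum(scores)  # the denominator of Eq. (16.17)
    return [s / z for s in scores]
```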
***Expectation-Maximization algorithm->
***nested elements->
|------------
The leaf node would then occur four times in the result set, once directly and three times as part of other elements. We call elements that are contained within each other nested. Returning redundant nested elements in a list of returned hits is not very user-friendly.
|------------
Because of the redundancy caused by nested elements it is common to restrict the set of elements that are eligible to be returned. Restriction strategies include:

• discard all small elements
• discard all element types that users do not look at (this requires a working XML retrieval system that logs this information)
• discard all element types that assessors generally do not judge to be relevant (if relevance assessments are available)
• only keep element types that a system designer or librarian has deemed to be useful search results

In most of these approaches, result sets will still contain nested elements.
|------------
Thus, we may want to remove some elements in a postprocessing step to reduce redundancy. Alternatively, we can collapse several nested elements in the results list and use highlighting of query terms to draw the user’s attention to the relevant passages. If query terms are highlighted, then scanning a medium-sized element (e.g., a section) takes little more time than scanning a small subelement (e.g., a paragraph). Thus, if the section and the paragraph both occur in the results list, it is sufficient to show the section. An additional advantage of this approach is that the paragraph is presented together with its context (i.e., the embedding section). This context may be helpful in interpreting the paragraph (e.g., the source of the information reported) even if the paragraph on its own satisfies the query.
|------------
If the user knows the schema of the collection and is able to specify the desired type of element, then the problem of redundancy is alleviated as few nested elements have the same type. But as we discussed in the introduction, users often don’t know what the name of an element in the collection is (Is the Vatican a country or a city?) or they may not know how to compose structured queries at all.
***Boolean->
|------------
The way to avoid linearly scanning the texts for each query is to index the documents in advance. Let us stick with Shakespeare’s Collected Works, and use it to introduce the basics of the Boolean retrieval model. Suppose we record for each document – here a play of Shakespeare’s – whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 different words). The result is a binary term-document incidence matrix, as in Figure 1.1. Terms are the indexed units (further discussed in Section 2.2); they are usually words, and for the moment you can think of
|------------
The Boolean retrieval model is a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT.
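A toy version of the term-document incidence matrix makes the model concrete; the query is in the spirit of the book's Brutus AND Caesar AND NOT Calpurnia example, but the per-play term sets below are made-up stand-ins, not the real texts:

```python
docs = {
    "Antony and Cleopatra": {"antony", "brutus", "caesar", "cleopatra"},
    "Julius Caesar":        {"antony", "brutus", "caesar", "calpurnia"},
    "Hamlet":               {"brutus", "caesar"},
}
vocab = sorted(set.union(*docs.values()))

# One 0/1 row per term, one column per document (Figure 1.1 in miniature).
incidence = {t: [int(t in terms) for terms in docs.values()] for t in vocab}

# Brutus AND Caesar AND NOT Calpurnia:
# AND is a bitwise and of rows; NOT complements a row.
answer = [b & c & (1 - k)
          for b, c, k in zip(incidence["brutus"],
                             incidence["caesar"],
                             incidence["calpurnia"])]
```

The 1s in `answer` mark the documents (here the first and third columns) that satisfy the Boolean expression.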
***DNS server->
|------------
20.2.2 DNS resolution

Each web server (and indeed any host connected to the internet) has a unique IP address: a sequence of four bytes generally represented as four integers separated by dots; for instance 207.142.131.248 is the numerical IP address associated with the host www.wikipedia.org. Given a URL such as www.wikipedia.org in textual form, translating it to an IP address (in this case, 207.142.131.248) is a process known as DNS resolution or DNS lookup; here DNS stands for Domain Name Service. During DNS resolution, the program that wishes to perform this translation (in our case, a component of the web crawler) contacts a DNS server that returns the translated IP address. (In practice the entire translation may not occur at a single DNS server; rather, the DNS server contacted initially may recursively call upon other DNS servers to complete the translation.) For a more complex URL such as en.wikipedia.org/wiki/Domain_Name_System, the crawler component responsible for DNS resolution extracts the host name – in this case en.wikipedia.org – and looks up the IP address for the host en.wikipedia.org.
|------------
DNS resolution is a well-known bottleneck in web crawling. Due to the distributed nature of the Domain Name Service, DNS resolution may entail multiple requests and round-trips across the internet, requiring seconds and sometimes even longer. Right away, this puts in jeopardy our goal of fetching several hundred documents a second. A standard remedy is to introduce caching: URLs for which we have recently performed DNS lookups are likely to be found in the DNS cache, avoiding the need to go to the DNS servers on the internet. However, obeying politeness constraints (see Section 20.2.3) limits the cache hit rate.
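The caching remedy can be sketched by wrapping any resolver callable with a dictionary cache (`make_cached_resolver` is a hypothetical helper of ours; a production crawler would also expire entries, since DNS mappings change over time):

```python
def make_cached_resolver(resolve, cache=None):
    """Wrap a resolver (any callable host -> IP string) with a simple
    cache, so repeated lookups of the same host cost a dictionary probe
    instead of a network round-trip."""
    if cache is None:
        cache = {}

    def cached(host):
        if host not in cache:
            cache[host] = resolve(host)  # the expensive DNS round-trip
        return cache[host]

    return cached, cache
```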
|------------
There is another important difficulty in DNS resolution: the lookup implementations in standard libraries (likely to be used by anyone developing a crawler) are generally synchronous. This means that once a request is made to the Domain Name Service, other crawler threads at that node are blocked until the first request is completed. To circumvent this, most web crawlers implement their own DNS resolver as a component of the crawler. Thread i executing the resolver code sends a message to the DNS server and then performs a timed wait: it resumes either when being signaled by another thread or when a set time quantum expires. A single, separate DNS thread listens on the standard DNS port (port 53) for incoming response packets from the name service. Upon receiving a response, it signals the appropriate crawler thread (in this case, i) and hands it the response packet if i has not yet resumed because its time quantum has expired. A crawler thread that resumes because its wait time quantum has expired retries for a fixed number of attempts, sending out a new message to the DNS server and performing a timed wait each time; the designers of Mercator recommend of the order of five attempts. The time quantum of the wait increases exponentially with each of these attempts; Mercator started with one second and ended with roughly 90 seconds, in consideration of the fact that there are host names that take tens of seconds to resolve.
***top-down clustering->
|------------
17.6 Divisive clustering

So far we have only looked at agglomerative clustering, but a cluster hierarchy can also be generated top-down. This variant of hierarchical clustering is called top-down clustering or divisive clustering. We start at the top with all documents in one cluster. The cluster is split using a flat clustering algorithm. This procedure is applied recursively until each document is in its own singleton cluster.
***machine-learned relevance->
|------------
6.1.2 Learning weights

How do we determine the weights g_i for weighted zone scoring? These weights could be specified by an expert (or, in principle, the user); but increasingly, these weights are “learned” using training examples that have been judged editorially. This latter methodology falls under a general class of approaches to scoring and ranking in information retrieval, known as machine-learned relevance. We provide a brief introduction to this topic here because weighted zone scoring presents a clean setting for introducing it; a complete development demands an understanding of machine learning and is deferred to Chapter 15.
***synonymy->
|------------
9 Relevance feedback and query expansion

In most collections, the same concept may be referred to using different words. This issue, known as synonymy, has an impact on the recall of most information retrieval systems. For example, you would want a search for aircraft to match plane (but only for references to an airplane, not a woodworking plane), and for a search on thermodynamics to match references to heat in appropriate discussions. Users often attempt to address this problem themselves by manually refining a query, as was discussed in Section 1.4; in this chapter we discuss ways in which a system can help with query refinement, either fully automatically or with the user in the loop.
***multinomial model->
|------------
13.3 The Bernoulli model

There are two different ways we can set up an NB classifier. The model we introduced in the previous section is the multinomial model. It generates one term from the vocabulary in each position of the document, where we assume a generative model that will be discussed in more detail in Section 13.4 (see also page 237).
|------------
An alternative to the multinomial model is the multivariate Bernoulli model or Bernoulli model. It is equivalent to the binary independence model of Section 11.3 (page 222), which generates an indicator for each term of the vocabulary, either 1 indicating presence of the term in the document or 0 indicating absence. Figure 13.3 presents training and testing algorithms for the Bernoulli model. The Bernoulli model has the same time complexity as the multinomial model.
|------------
The different generation models imply different estimation strategies and different classification rules. The Bernoulli model estimates P̂(t|c) as the fraction of documents of class c that contain term t (Figure 13.3, TRAINBERNOULLINB, line 8). In contrast, the multinomial model estimates P̂(t|c) as the fraction of tokens or fraction of positions in documents of class c that contain term t (Equation (13.7)). When classifying a test document, the Bernoulli model uses binary occurrence information, ignoring the number of occurrences, whereas the multinomial model keeps track of multiple occurrences. As a result, the Bernoulli model typically makes many mistakes when classifying long documents. For example, it may assign an entire book to the class China because of a single occurrence of the term China.
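The contrast between the two estimation strategies is easy to see on a two-document toy class (documents and function names are our own; both estimates are shown unsmoothed for clarity):

```python
class_docs = [["china", "beijing", "china"], ["china", "shanghai"]]

def bernoulli_phat(t, docs):
    """Bernoulli estimate: fraction of documents of the class containing t."""
    return sum(t in d for d in docs) / len(docs)

def multinomial_phat(t, docs):
    """Multinomial estimate: fraction of token positions occupied by t."""
    tokens = [tok for d in docs for tok in d]
    return tokens.count(t) / len(tokens)
```

For china the Bernoulli estimate is 2/2 = 1.0 while the multinomial estimate is 3/5 = 0.6: the Bernoulli model sees only presence or absence per document, the multinomial model counts every occurrence.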
|------------
13.4.1 A variant of the multinomial model

An alternative formalization of the multinomial model represents each document d as an M-dimensional vector of counts ⟨tf_{t1,d}, . . . , tf_{tM,d}⟩ where tf_{ti,d} is the term frequency of t_i in d. P(d|c) is then computed as follows (cf. Equation (12.8), page 243):

P(d|c) = P(⟨tf_{t1,d}, . . . , tf_{tM,d}⟩|c) ∝ ∏_{1≤i≤M} P(X = t_i|c)^{tf_{ti,d}}    (13.15)

Note that we have omitted the multinomial factor. See Equation (12.8) (page 243).
***indexer->
|------------
4 Index construction

In this chapter, we look at how to construct an inverted index. We call this process index construction or indexing; the process or machine that performs it the indexer. The design of indexing algorithms is governed by hardware constraints. We therefore begin this chapter with a review of the basics of computer hardware that are relevant for indexing. We then introduce blocked sort-based indexing (Section 4.2), an efficient single-machine algorithm designed for static collections that can be viewed as a more scalable version of the basic sort-based indexing algorithm we introduced in Chapter 1. Section 4.3 describes single-pass in-memory indexing, an algorithm that has even better scaling properties because it does not hold the vocabulary in memory. For very large collections like the web, indexing has to be distributed over computer clusters with hundreds or thousands of machines.
|------------
The indexer needs raw text, but documents are encoded in many ways (see Chapter 2). Indexers compress and decompress intermediate files and the final index (see Chapter 5). In web search, documents are not on a local file system, but have to be spidered or crawled (see Chapter 20). In enterprise search, most documents are encapsulated in varied content management systems, email applications, and databases. We give some examples in Section 4.7. Although most of these applications can be accessed via http, native Application Programming Interfaces (APIs) are usually more efficient.
***adjacency table->
|------------
We assume that each web page is represented by a unique integer; the specific scheme used to assign these integers is described below. We build an adjacency table that resembles an inverted index: it has a row for each web page, with the rows ordered by the corresponding integers. The row for any page p contains a sorted list of integers, each corresponding to a web page that links to p. This table permits us to respond to queries of the form which pages link to p? In similar fashion we build a table whose entries are the pages linked to by p.
***frequency-based feature selection->
|------------
13.5.3 Frequency-based feature selection A third feature selection method is frequency-based feature selection, that is, selecting the terms that are most common in the class. Frequency can be either defined as document frequency (the number of documents in the class c that contain the term t) or as collection frequency (the number of tokens of t that occur in documents in c). Document frequency is more appropriate for the Bernoulli model, collection frequency for the multinomial model.
|------------
Frequency-based feature selection selects some frequent terms that have no specific information about the class, for example, the days of the week (Monday, Tuesday, . . . ), which are frequent across classes in newswire text.
|------------
When many thousands of features are selected, then frequency-based feature selection often does well. Thus, if somewhat suboptimal accuracy is acceptable, then frequency-based feature selection can be a good alternative to more complex methods. However, Figure 13.8 is a case where frequency-based feature selection performs a lot worse than MI and χ2 and should not be used.
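The document-frequency variant of this selection method is only a few lines (a sketch; the function name is ours):

```python
from collections import Counter

def select_by_document_frequency(class_docs, k):
    """Return the k terms occurring in the most documents of the class
    (the document-frequency variant, appropriate for the Bernoulli
    model; counting tokens instead of documents would give the
    collection-frequency variant for the multinomial model)."""
    df = Counter()
    for doc in class_docs:
        df.update(set(doc))  # each document counts a term at most once
    return [t for t, _ in df.most_common(k)]
```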
***pornography filtering->
***development test collection->
|------------
Many systems contain various weights (often known as parameters) that can be adjusted to tune system performance. It is wrong to report results on a test collection which were obtained by tuning these parameters to maximize performance on that collection. That is because such tuning overstates the expected performance of the system, because the weights will be set to maximize performance on one particular set of queries rather than for a random sample of queries. In such cases, the correct procedure is to have one or more development test collections, and to tune the parameters on the development test collection. The tester then runs the system with those weights on the test collection and reports the results on that collection as an unbiased estimate of performance.
***variable byte encoding->
|------------
5.3.1 Variable byte codes

Variable byte (VB) encoding uses an integral number of bytes to encode a gap. The last 7 bits of a byte are “payload” and encode part of the gap. The first bit of the byte is a continuation bit. It is set to 1 for the last byte of the encoded gap and to 0 otherwise. To decode a variable byte code, we read a sequence of bytes with continuation bit 0 terminated by a byte with continuation bit 1.
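Following the convention in the text (continuation bit 1 marks the last byte of a gap), encoding and decoding look like this (the function names are ours):

```python
def vb_encode_number(n):
    """Variable byte code of one gap: 7 payload bits per byte,
    high (continuation) bit set to 1 only on the last byte."""
    out = []
    while True:
        out.insert(0, n % 128)  # prepend the next 7 payload bits
        if n < 128:
            break
        n //= 128
    out[-1] += 128              # set the continuation bit on the last byte
    return bytes(out)

def vb_decode(bytestream):
    """Decode a stream of VB-encoded gaps back into integers."""
    numbers, n = [], 0
    for b in bytestream:
        if b < 128:             # continuation bit 0: more bytes follow
            n = 128 * n + b
        else:                   # continuation bit 1: last byte of this gap
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers
```

For example, the gap 824 = 6 · 128 + 56 encodes as the two bytes 00000110 10111000.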
***Retrieval Status Value->
|------------
We can manipulate this expression by including the query terms found in the document into the right product, but simultaneously dividing through by them in the left product, so the value is unchanged. Then we have:

O(R|q⃗, x⃗) = O(R|q⃗) · ∏_{t: x_t=q_t=1} [p_t(1 − u_t)] / [u_t(1 − p_t)] · ∏_{t: q_t=1} (1 − p_t) / (1 − u_t)    (11.16)

The left product is still over query terms found in the document, but the right product is now over all query terms. That means that this right product is a constant for a particular query, just like the odds O(R|q⃗). So the only quantity that needs to be estimated to rank documents for relevance to a query is the left product. We can equally rank documents by the logarithm of this term, since log is a monotonic function. The resulting quantity used for ranking is called the Retrieval Status Value (RSV) in this model:

RSV_d = log ∏_{t: x_t=q_t=1} [p_t(1 − u_t)] / [u_t(1 − p_t)] = ∑_{t: x_t=q_t=1} log [p_t(1 − u_t)] / [u_t(1 − p_t)]    (11.17)

So everything comes down to computing the RSV. Define c_t:

c_t = log [p_t(1 − u_t)] / [u_t(1 − p_t)] = log [p_t / (1 − p_t)] + log [(1 − u_t) / u_t]    (11.18)

The c_t terms are log odds ratios for the terms in the query. We have the odds of the term appearing if the document is relevant (p_t/(1 − p_t)) and the odds of the term appearing if the document is nonrelevant (u_t/(1 − u_t)). The odds ratio is the ratio of two such odds, and then we finally take the log of that quantity. The value will be 0 if a term has equal odds of appearing in relevant and nonrelevant documents, and positive if it is more likely to appear in relevant documents. The c_t quantities function as term weights in the model, and the document score for a query is RSV_d = ∑_{x_t=q_t=1} c_t. Operationally, we sum them in accumulators for query terms appearing in documents, just as for the vector space model calculations discussed in Section 7.1 (page 135).
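Equations (11.17) and (11.18) translate directly into code (the function names, and the toy p_t/u_t values in the test, are ours):

```python
from math import log

def c_t(p, u):
    """Log odds ratio weight c_t = log[p_t(1-u_t) / (u_t(1-p_t))] (Eq. 11.18)."""
    return log(p * (1 - u) / (u * (1 - p)))

def rsv(query_terms, doc_terms, p, u):
    """RSV_d: sum of c_t over query terms that occur in the document (Eq. 11.17).
    p[t], u[t]: probabilities of t appearing in relevant / nonrelevant docs."""
    return sum(c_t(p[t], u[t]) for t in query_terms if t in doc_terms)
```

A term with equal odds in relevant and nonrelevant documents contributes 0 to the score; a term more likely in relevant documents contributes positively.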
***spelling correction->
|------------
(tiered) positional indexes, indexes for spelling correction and other tolerant retrieval, and structures for accelerating inexact top-K retrieval. A free text user query (top center) is sent down to the indexes both directly and through a module for generating spelling-correction candidates. As noted in Chapter 3 the latter may optionally be invoked only when the original query fails to retrieve enough results. Retrieved documents (dark arrow) are passed to a scoring module that computes scores based on machine-learned ranking (MLR), a technique that builds on Section 6.1.2 (to be further developed in Section 15.4.1) for scoring and ranking documents. Finally, these ranked documents are rendered as a results page.
|------------
12.1.2 Types of language models

How do we build probabilities over sequences of terms? We can always use the chain rule from Equation (11.1) to decompose the probability of a sequence of events into the probability of each successive event conditioned on earlier events:

P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t1 t2) P(t4|t1 t2 t3)    (12.4)

The simplest form of language model simply throws away all conditioning context, and estimates each term independently. Such a model is called a unigram language model:

P_uni(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)    (12.5)

There are many more complex kinds of language models, such as bigram language models, which condition on the previous term,

P_bi(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)    (12.6)

and even more complex grammar-based language models such as probabilistic context-free grammars. Such models are vital for tasks like speech recognition, spelling correction, and machine translation, where you need the probability of a term conditioned on surrounding context. However, most language-modeling work in IR has used unigram language models.
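Equations (12.5) and (12.6) as code, with term probabilities supplied as dicts (the toy numbers in the usage below are invented for illustration):

```python
def unigram_prob(seq, p):
    """P_uni(t1 ... tn) = product of P(ti) (Equation 12.5)."""
    prob = 1.0
    for t in seq:
        prob *= p[t]
    return prob

def bigram_prob(seq, p_first, p_cond):
    """P_bi(t1 ... tn) = P(t1) * product of P(ti | t_{i-1}) (Equation 12.6).
    p_cond maps (previous term, term) pairs to conditional probabilities."""
    prob = p_first[seq[0]]
    for prev, t in zip(seq, seq[1:]):
        prob *= p_cond[(prev, t)]
    return prob
```

With P(frog) = 0.2 and P(said) = 0.1, the unigram probability of frog said frog is 0.2 · 0.1 · 0.2 = 0.004.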
|------------
Exercise 12.2 [⋆] If the stop probability is omitted from calculations, what will the sum of the scores assigned to strings in the language of length 1 be?

Exercise 12.3 [⋆] What is the likelihood ratio of the document according to M1 and M2 in Example 12.2?

Exercise 12.4 [⋆] No explicit STOP probability appeared in Example 12.2. Assuming that the STOP probability of each model is 0.1, does this change the likelihood ratio of a document according to the two models?

Exercise 12.5 [⋆⋆] How might a language model be used in a spelling correction system? In particular, consider the case of context-sensitive spelling correction, and correcting incorrect usages of words, such as their in Are you their? (See Section 3.5 (page 65) for pointers to some literature on this topic.)

12.2 The query likelihood model

12.2.1 Using query likelihood language models in IR

Language modeling is a quite general formal approach to IR, with many variant realizations. The original and basic method for using language models in IR is the query likelihood model. In it, we construct from each document d in the collection a language model M_d. Our goal is to rank documents by P(d|q), where the probability of a document is interpreted as the likelihood that it is relevant to the query. Using Bayes rule (as introduced in Section 11.1, page 220), we have:

P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, and so can be ignored. The prior probability of a document P(d) is often treated as uniform across all d and so it can also be ignored, but we could implement a genuine prior which could include criteria like authority, length, genre, newness, and number of previous people who have read the document. But, given these simplifications, we return results ranked by simply P(q|d), the probability of the query q under the language model derived from d.
The Language Modeling approach thus attempts to model the query generation process: Documents are ranked by the probability that a query would be observed as a random sample from the respective document model.
***memory capacity->
|------------
We can also think of variance as the model complexity or, equivalently, memory capacity of the learning method – how detailed a characterization of the training set it can remember and then apply to new data. This capacity corresponds to the number of independent parameters available to fit the training set. Each kNN neighborhood S_k makes an independent classification decision. The parameter in this case is the estimate P̂(c|S_k) from Figure 14.7.
***tiered indexes->
|------------
7.2.1 Tiered indexes

We mentioned in Section 7.1.2 that when using heuristics such as index elimination for inexact top-K retrieval, we may occasionally find ourselves with a set A of contenders that has fewer than K documents. A common solution to this issue is the use of tiered indexes, which may be viewed as a generalization of champion lists. We illustrate this idea in Figure 7.4, where we represent the documents and terms of Figure 6.9. In this example we set a tf threshold of 20 for tier 1 and 10 for tier 2, meaning that the tier 1 index only has postings entries with tf values exceeding 20, while the tier 2 index only

◮ Figure 7.4 Tiered indexes. If we fail to get K results from tier 1, query processing “falls back” to tier 2, and so on. Within each tier, postings are ordered by document ID.
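The fall-back logic of Figure 7.4 can be sketched with sets standing in for tf-thresholded postings lists. The names and structure here are ours, and for simplicity each lower tier is assumed to contain the postings of the tiers above it:

```python
def tiered_search(query_terms, tiers, K):
    """Answer a conjunctive query from tier 1; if it yields fewer than K
    results, fall back to tier 2, and so on.
    Each tier maps term -> set of docIDs."""
    results = set()
    for tier in tiers:
        postings = [tier.get(t, set()) for t in query_terms]
        results = set.intersection(*postings) if postings else set()
        if len(results) >= K:
            break  # enough contenders; no need to fall back further
    return sorted(results)
```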
***sensitivity->
|------------
Another concept sometimes used in evaluation is an ROC curve. (“ROC” stands for “Receiver Operating Characteristics”, but knowing that doesn’t help most people.) An ROC curve plots the true positive rate or sensitivity against the false positive rate or (1 − specificity). Here, sensitivity is just another term for recall. The false positive rate is given by fp/(fp + tn). Figure 8.4 shows the ROC curve corresponding to the precision-recall curve in Figure 8.2. An ROC curve always goes from the bottom left to the top right of the graph. For a good system, the graph climbs steeply on the left side. For unranked result sets, specificity, given by tn/(fp + tn), was not seen as a very useful notion. Because the set of true negatives is always so large, its value would be almost 1 for all information needs (and, correspondingly, the value of the false positive rate would be almost 0). That is, the “interesting” part of Figure 8.2 is 0 < recall < 0.4, a part which is compressed to a small corner of Figure 8.4. But an ROC curve could make sense when looking over the full retrieval spectrum, and it provides another way of looking at the data.
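The three rates can be computed directly from the counts of the contingency table (the function name and the example counts are ours):

```python
def rates(tp, fp, fn, tn):
    """Sensitivity (= recall), false positive rate, and specificity
    from a 2x2 contingency table of true/false positives/negatives."""
    sensitivity = tp / (tp + fn)
    fpr = fp / (fp + tn)
    specificity = tn / (fp + tn)
    return sensitivity, fpr, specificity
```

With a large tn, as is typical in IR (say tn = 980 for 1000 documents), specificity is nearly 1 and the false positive rate nearly 0, as the text observes; by construction, fpr + specificity = 1.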
***content management system->
|------------
An effective indexer for enterprise search needs to be able to communicate efficiently with a number of applications that hold text data in corporations, including Microsoft Outlook, IBM’s Lotus software, databases like Oracle and MySQL, content management systems like Open Text, and enterprise resource planning software like SAP.
***cluster-internal labeling->
|------------
Cluster-internal labeling computes a label that solely depends on the cluster itself, not on other clusters. Labeling a cluster with the title of the document closest to the centroid is one cluster-internal method. Titles are easier to read than a list of terms. A full title can also contain important context that didn’t make it into the top 10 terms selected by MI. On the web, anchor text can

5. Selecting the most frequent terms is a non-differential feature selection technique we discussed in Section 13.5. It can also be used for labeling clusters.
***rules in text classification->
|------------
A computer is not essential for classification. Many classification tasks have traditionally been solved manually. Books in a library are assigned Library of Congress categories by a librarian. But manual classification is expensive to scale. The multicore computer chips example illustrates one alternative approach: classification by the use of standing queries – which can be thought of as rules – most commonly written by hand. As in our example (multicore OR multi-core) AND (chip OR processor OR microprocessor), rules are sometimes equivalent to Boolean expressions.
***search results->
|------------
The hypothesis states that if there is a document from a cluster that is relevant to a search request, then it is likely that other documents from the same cluster are also relevant. This is because clustering puts together documents that share many terms. The cluster hypothesis essentially is the contiguity

Application               What is clustered?       Benefit                                                        Example
Search result clustering  search results           more effective information presentation to user                Figure 16.2
Scatter-Gather            (subsets of) collection  alternative user interface: "search without typing"            Figure 16.3
Collection clustering     collection               effective information presentation for exploratory browsing    McKeown et al. (2002), http://news.google.com
Language modeling         collection               increased precision and/or recall                              Liu and Croft (2004)
Cluster-based retrieval   collection               higher efficiency: faster search                               Salton (1971a)

◮ Table 16.1 Some applications of clustering in information retrieval.
|------------
Table 16.1 shows some of the main applications of clustering in informa- tion retrieval. They differ in the set of documents that they cluster – search results, collection or subsets of the collection – and the aspect of an informa- tion retrieval system they try to improve – user experience, user interface, effectiveness or efficiency of the search system. But they are all based on the basic assumption stated by the cluster hypothesis.
|------------
The first application mentioned in Table 16.1 is search result clustering, where by search results we mean the documents that were returned in response to a query. The default presentation of search results in information retrieval is a simple list. Users scan the list from top to bottom until they have found the information they are looking for. Instead, search result clustering clusters the search results, so that similar documents appear together. It is often easier to scan a few coherent groups than many individual documents. This is particularly useful if a search term has different word senses. The example in Figure 16.2 is jaguar. Three frequent senses on the web refer to the car, the animal and an Apple operating system. The Clustered Results panel returned by the Vivísimo search engine (http://vivisimo.com) can be a more effective user interface for understanding what is in the search results than a simple list of documents.
***pseudocounts->
|------------
This is referred to as the relative frequency of the event. Estimating the probability as the relative frequency is the maximum likelihood estimate (or MLE), because this value makes the observed data maximally likely. However, if we simply use the MLE, then the probability given to events we happened to see is usually too high, whereas other events may be completely unseen and giving them as a probability estimate their relative frequency of 0 is both an underestimate, and normally breaks our models, since anything multiplied by 0 is 0. Simultaneously decreasing the estimated probability of seen events and increasing the probability of unseen events is referred to as smoothing. One simple way of smoothing is to add a number α to each of the observed counts. These pseudocounts correspond to the use of a uniform distribution over the vocabulary as a Bayesian prior, following Equation (11.4). We initially assume a uniform distribution over events, where the size of α denotes the strength of our belief in uniformity, and we then update the probability based on observed events. Since our belief in uniformity is weak, we use α = 1/2. This is a form of maximum a posteriori (MAP) estimation, where we choose the most likely point value for probabilities based on the prior and the observed evidence, following Equation (11.4). We will further discuss methods of smoothing estimated counts to give probability models in Section 12.2.2 (page 243); the simple method of adding 1/2 to each observed count will do for now.
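As a concrete sketch of this kind of smoothing (Python; the three-word vocabulary, counts, and function name are invented for illustration):

```python
from collections import Counter

def smoothed_probs(counts, vocab, alpha=0.5):
    """MAP estimate: add pseudocount alpha to every observed count,
    then renormalize over the whole vocabulary."""
    total = sum(counts.values()) + alpha * len(vocab)
    return {t: (counts.get(t, 0) + alpha) / total for t in vocab}

vocab = ["a", "b", "c"]
counts = Counter(["a", "a", "b"])   # "c" is completely unseen
probs = smoothed_probs(counts, vocab)
# the MLE would give P(c) = 0; smoothing gives it a small nonzero probability
```

With α = 1/2 and three observed tokens, the unseen event "c" gets probability 0.5/4.5 rather than the model-breaking 0 of the MLE.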
***latent semantic indexing->
|------------
Thesaurus-based query expansion has the advantage of not requiring any user input. Use of query expansion generally increases recall and is widely used in many science and engineering fields. As well as such global analysis techniques, it is also possible to do query expansion by local analysis, for instance, by analyzing the documents in the result set. User input is now

Word          Nearest neighbors
absolutely    absurd, whatsoever, totally, exactly, nothing
bottomed      dip, copper, drops, topped, slide, trimmed
captivating   shimmer, stunningly, superbly, plucky, witty
doghouse      dog, porch, crawling, beside, downstairs
makeup        repellent, lotion, glossy, sunscreen, skin, gel
mediating     reconciliation, negotiate, case, conciliation
keeping       hoping, bring, wiping, could, some, would
lithographs   drawings, Picasso, Dali, sculptures, Gauguin
pathogens     toxins, bacteria, organisms, bacterial, parasite
senses        grasp, psyche, truly, clumsy, naive, innate

◮ Figure 9.8 An example of an automatically generated thesaurus. This example is based on the work in Schütze (1998), which employs latent semantic indexing (see Chapter 18).
|------------
18.4 Latent semantic indexing

We now discuss the approximation of a term-document matrix C by one of lower rank using the SVD. The low-rank approximation to C yields a new representation for each document in the collection. We will cast queries into this low-rank representation as well, enabling us to compute query-document similarity scores in this low-rank representation. This process is known as latent semantic indexing (generally abbreviated LSI). But first, we motivate such an approximation. Recall the vector space representation of documents and queries introduced in Section 6.3 (page 120).
|------------
Polysemy on the other hand refers to the case where a term such as charge has multiple meanings, so that the computed similarity ~q · ~d overestimates the similarity that a user would perceive. Could we use the co-occurrences of terms (whether, for instance, charge occurs in a document containing steed versus in a document containing electron) to capture the latent semantic associations of terms and alleviate these problems? Even for a collection of modest size, the term-document matrix C is likely to have several tens of thousands of rows and columns, and a rank in the tens of thousands as well. In latent semantic indexing (sometimes referred to as latent semantic analysis (LSA)), we use the SVD to construct a low-rank approximation Ck to the term-document matrix, for a value of k that is far smaller than the original rank of C. In the experimental work cited later in this section, k is generally chosen to be in the low hundreds. We thus map each row/column (respectively corresponding to a term/document) to a k-dimensional space; this space is defined by the k principal eigenvectors (corresponding to the largest eigenvalues) of CC^T and C^T C. Note that the matrix Ck is itself still an M × N matrix, irrespective of k.
***purity->
|------------
This section introduces four external criteria of clustering quality. Purity is a simple and transparent evaluation measure. Normalized mutual information can be information-theoretically interpreted. The Rand index penalizes both false positive and false negative decisions during clustering. The F measure in addition supports differential weighting of these two types of errors.
|------------
To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by N.
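A minimal implementation of this computation might look as follows (Python; the tiny two-cluster example and the function name are invented):

```python
from collections import Counter

def purity(clusters, labels):
    """clusters: list of lists of doc ids; labels: doc id -> gold class.
    Assign each cluster its majority class, count the correctly assigned
    documents, and divide by the total number of documents N."""
    n = sum(len(c) for c in clusters)
    correct = sum(Counter(labels[d] for d in c).most_common(1)[0][1]
                  for c in clusters)
    return correct / n

clusters = [["d1", "d2", "d3"], ["d4", "d5"]]
labels = {"d1": "x", "d2": "x", "d3": "y", "d4": "y", "d5": "y"}
p = purity(clusters, labels)   # majorities: x (2 of 3) and y (2 of 2) -> 4/5
```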
***LSI as soft clustering->
***parametric search->
***random variable X->
|------------
To reduce the number of parameters, we make the Naive Bayes conditional independence assumption. We assume that attribute values are independent of each other given the class:

Multinomial: P(d|c) = P(⟨t1, . . . , tnd⟩|c) = ∏_{1≤k≤nd} P(Xk = tk|c)   (13.13)

Bernoulli: P(d|c) = P(⟨e1, . . . , eM⟩|c) = ∏_{1≤i≤M} P(Ui = ei|c)   (13.14)

We have introduced two random variables here to make the two different generative models explicit. Xk is the random variable for position k in the document and takes as values terms from the vocabulary. P(Xk = t|c) is the probability that in a document of class c the term t will occur in position k. Ui is the random variable for vocabulary term i and takes as values 0 (absence) and 1 (presence). P̂(Ui = 1|c) is the probability that in a document of class c the term ti will occur – in any position and possibly multiple times.
***parameter-free compression->
***truncated SVD->
|------------
as the reduced SVD or truncated SVD and we will encounter it again in Exercise 18.9. Henceforth, our numerical examples and exercises will use this reduced form.
|------------
Exercise 18.10
Exercise 18.9 can be generalized to rank k approximations: we let U′k and V′k denote the "reduced" matrices formed by retaining only the first k columns of U and V, respectively. Thus U′k is an M × k matrix while V′k^T is a k × N matrix. Then, we have

Ck = U′k Σ′k V′k^T,   (18.20)

where Σ′k is the square k × k submatrix of Σk with the singular values σ1, . . . , σk on the diagonal. The primary advantage of using (18.20) is to eliminate a lot of redundant columns of zeros in U and V, thereby explicitly eliminating multiplication by columns that do not affect the low-rank approximation; this version of the SVD is sometimes known as the reduced SVD or truncated SVD and is a computationally simpler representation from which to compute the low rank approximation.
|------------
Examination of C2 and Σ2 in Example 18.4 shows that the last 3 rows of each of these matrices are populated entirely by zeros. This suggests that the SVD product UΣV^T in Equation (18.18) can be carried out with only two rows in the representations of Σ2 and V^T; we may then replace these matrices by their truncated versions Σ′2 and (V′)^T. For instance, the truncated SVD document matrix (V′)^T in this example is:

      d1      d2      d3      d4      d5      d6
1   −1.62   −0.60   −0.44   −0.97   −0.70   −0.26
2   −0.46   −0.84   −0.30    1.00    0.35    0.65

Figure 18.3 illustrates the documents in (V′)^T in two dimensions. Note also that C2 is dense relative to C.
***Levenshtein distance->
|------------
3.3.3 Edit distance

Given two character strings s1 and s2, the edit distance between them is the minimum number of edit operations required to transform s1 into s2. Most commonly, the edit operations allowed for this purpose are: (i) insert a character into a string; (ii) delete a character from a string and (iii) replace a character of a string by another character; for these operations, edit distance is sometimes known as Levenshtein distance. For example, the edit distance between cat and dog is 3. In fact, the notion of edit distance can be generalized to allowing different weights for different kinds of edit operations, for instance a higher weight may be placed on replacing the character s by the character p, than on replacing it by the character a (the latter being closer to s on the keyboard). Setting weights in this way depending on the likelihood of letters substituting for each other is very effective in practice (see Section 3.4 for the separate issue of phonetic similarity). However, the remainder of our treatment here will focus on the case in which all edit operations have the same weight.
|------------
Figure 3.6 shows an example Levenshtein distance computation of Figure 3.5. The typical cell [i, j] has four entries formatted as a 2 × 2 cell. The lower right entry in each cell is the min of the other three, corresponding to the main dynamic programming step in Figure 3.5. The other three entries are the three entries m[i − 1, j − 1] + 0 or 1 depending on whether s1[i] = s2[j], m[i − 1, j] + 1, and m[i, j − 1] + 1.

EDITDISTANCE(s1, s2)
 1  int m[i, j] = 0
 2  for i ← 1 to |s1|
 3  do m[i, 0] = i
 4  for j ← 1 to |s2|
 5  do m[0, j] = j
 6  for i ← 1 to |s1|
 7  do for j ← 1 to |s2|
 8     do m[i, j] = min{m[i − 1, j − 1] + if (s1[i] = s2[j]) then 0 else 1 fi,
 9                     m[i − 1, j] + 1,
10                     m[i, j − 1] + 1}
11  return m[|s1|, |s2|]

◮ Figure 3.5 Dynamic programming algorithm for computing the edit distance between strings s1 and s2.
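The same algorithm in runnable form (a direct Python transcription of Figure 3.5, using 0-indexed strings rather than the 1-indexed pseudocode):

```python
def edit_distance(s1, s2):
    """Dynamic programming edit distance, mirroring Figure 3.5."""
    # m[i][j] holds the edit distance between s1[:i] and s2[:j]
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(1, len(s1) + 1):
        m[i][0] = i                      # delete all of s1[:i]
    for j in range(1, len(s2) + 1):
        m[0][j] = j                      # insert all of s2[:j]
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            m[i][j] = min(
                m[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1),
                m[i - 1][j] + 1,         # deletion
                m[i][j - 1] + 1)         # insertion
    return m[len(s1)][len(s2)]

assert edit_distance("cat", "dog") == 3  # the example from the text
```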
***gain->
|------------
A final approach that has seen increasing adoption, especially when employed with machine learning approaches to ranking (see Section 15.4, page 341) is measures of cumulative gain, and in particular normalized discounted cumulative gain (NDCG). NDCG is designed for situations of non-binary notions of relevance (cf. Section 8.5.1). Like precision at k, it is evaluated over some number k of top search results. For a set of queries Q, let R(j, d) be the relevance score assessors gave to document d for query j. Then,

NDCG(Q, k) = (1/|Q|) ∑_{j=1}^{|Q|} Zkj ∑_{m=1}^{k} (2^{R(j,m)} − 1) / log2(1 + m),   (8.9)

where Zkj is a normalization factor calculated to make it so that a perfect ranking's NDCG at k for query j is 1. For queries for which k′ < k documents are retrieved, the last summation is done up to k′.
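Equation (8.9) can be sketched as follows (Python; here Zkj is obtained as the reciprocal of the DCG of the ideal, sorted ranking, and the per-query relevance lists are invented):

```python
import math

def ndcg(rankings, k):
    """rankings: one list per query of assessor scores R(j, m), in ranked
    order. Normalizes each query's DCG by the DCG of a perfect ranking."""
    def dcg(scores, k):
        return sum((2 ** r - 1) / math.log2(1 + m)
                   for m, r in enumerate(scores[:k], start=1))
    total = 0.0
    for scores in rankings:
        ideal = dcg(sorted(scores, reverse=True), k)
        total += dcg(scores, k) / ideal if ideal > 0 else 0.0
    return total / len(rankings)

# a perfectly ordered result list scores 1.0; any inversion scores less
```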
***maximization step->
|------------
How do we use EM to infer the parameters of the clustering from the data? That is, how do we choose parameters Θ that maximize L(D|Θ)? EM is similar to K-means in that it alternates between an expectation step, corresponding to reassignment, and a maximization step, corresponding to recomputation of the parameters of the model. The parameters of K-means are the centroids, the parameters of the instance of EM in this section are the αk and qmk.
|------------
The maximization step recomputes the conditional parameters qmk and the priors αk as follows:

Maximization step:   qmk = ∑_{n=1}^{N} rnk I(tm ∈ dn) / ∑_{n=1}^{N} rnk        αk = ∑_{n=1}^{N} rnk / N   (16.16)

where I(tm ∈ dn) = 1 if tm ∈ dn and 0 otherwise and rnk is the soft assignment of document dn to cluster k as computed in the preceding iteration.
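The M-step of Equation (16.16) is just a pair of weighted counts; a vectorized sketch (Python/NumPy; the soft assignments and term-incidence matrix are made-up toy values):

```python
import numpy as np

# toy soft assignments r[n, k] and binary incidence I[n, m] = I(t_m in d_n)
r = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.7, 0.3]])      # N = 3 documents, K = 2 clusters
I = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]])       # M = 3 terms

# Equation (16.16): q[m, k] = sum_n r_nk I(t_m in d_n) / sum_n r_nk
q = (r.T @ I).T / r.sum(axis=0)
# priors: alpha_k = sum_n r_nk / N
alpha = r.sum(axis=0) / r.shape[0]
```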
***within-point scatter->
|------------
Exercise 16.20 [⋆ ⋆ ⋆]
The within-point scatter of a clustering is defined as ∑_k (1/2) ∑_{~xi∈ωk} ∑_{~xj∈ωk} |~xi − ~xj|². Show that minimizing RSS and minimizing within-point scatter are equivalent.
***web->
|------------
• Topic-specific or vertical search. Vertical search engines restrict searches to a particular topic. For example, the query computer science on a vertical search engine for the topic China will return a list of Chinese computer science departments with higher precision and recall than the query computer science China on a general purpose search engine. This is because the vertical search engine does not include web pages in its index that contain the term china in a different sense (e.g., referring to a hard white ceramic), but does include relevant pages even if they do not explicitly mention the term China.
***labeling->
|------------
Apart from manual classification and hand-crafted rules, there is a third approach to text classification, namely, machine learning-based text classification. It is the approach that we focus on in the next several chapters. In machine learning, the set of rules or, more generally, the decision criterion of the text classifier, is learned automatically from training data. This approach is also called statistical text classification if the learning method is statistical. In statistical text classification, we require a number of good example documents (or training documents) for each class. The need for manual classification is not eliminated because the training documents come from a person who has labeled them – where labeling refers to the process of annotating each document with its class. But labeling is arguably an easier task than writing rules. Almost anybody can look at a document and decide whether or not it is related to China. Sometimes such labeling is already implicitly part of an existing workflow. For instance, you may go through the news articles returned by a standing query each morning and give relevance feedback (cf. Chapter 9) by moving the relevant articles to a special folder like multicore-processors.
***random variable U->
|------------
To reduce the number of parameters, we make the Naive Bayes conditional independence assumption. We assume that attribute values are independent of each other given the class:

Multinomial: P(d|c) = P(⟨t1, . . . , tnd⟩|c) = ∏_{1≤k≤nd} P(Xk = tk|c)   (13.13)

Bernoulli: P(d|c) = P(⟨e1, . . . , eM⟩|c) = ∏_{1≤i≤M} P(Ui = ei|c)   (13.14)

We have introduced two random variables here to make the two different generative models explicit. Xk is the random variable for position k in the document and takes as values terms from the vocabulary. P(Xk = t|c) is the probability that in a document of class c the term t will occur in position k. Ui is the random variable for vocabulary term i and takes as values 0 (absence) and 1 (presence). P̂(Ui = 1|c) is the probability that in a document of class c the term ti will occur – in any position and possibly multiple times.
***recall->
|------------
Our goal is to develop a system to address the ad hoc retrieval task. This is the most standard IR task. In it, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query. An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need. Our example above was rather artificial in that the information need was defined in terms of particular words, whereas usually a user is interested in a topic like "pipeline leaks" and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as pipeline rupture. To assess the effectiveness of an IR system (i.e., the quality of its search results), a user will usually want to know two key statistics about the system's returned results for a query:

Precision: What fraction of the returned results are relevant to the information need?

Recall: What fraction of the relevant documents in the collection were returned by the system?

Detailed discussion of relevance and evaluation measures including precision and recall is found in Chapter 8.
|------------
8.3 Evaluation of unranked retrieval sets Given these ingredients, how is system effectiveness measured? The two most frequent and basic measures for information retrieval effectiveness are precision and recall. These are first defined for the simple case where an IR system returns a set of documents for a query. We will see later how to extend these notions to ranked retrieval situations.
|------------
Precision (P) is the fraction of retrieved documents that are relevant:

Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant|retrieved)   (8.1)

Recall (R) is the fraction of relevant documents that are retrieved:

Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved|relevant)   (8.2)

These notions can be made clear by examining the following contingency table:

(8.3)            Relevant              Nonrelevant
Retrieved        true positives (tp)   false positives (fp)
Not retrieved    false negatives (fn)  true negatives (tn)

Then:

P = tp/(tp + fp)   (8.4)
R = tp/(tp + fn)

An obvious alternative that may occur to the reader is to judge an information retrieval system by its accuracy, that is, the fraction of its classifications that are correct. In terms of the contingency table above, accuracy = (tp + tn)/(tp + fp + fn + tn). This seems plausible, since there are two actual classes, relevant and nonrelevant, and an information retrieval system can be thought of as a two-class classifier which attempts to label them as such (it retrieves the subset of documents which it believes to be relevant).
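Worked through on made-up counts (Python; the numbers are invented to show the skew typical of IR collections):

```python
def prf(tp, fp, fn, tn):
    """Precision, recall, and accuracy from the contingency table (8.3)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# a skewed collection: almost everything is nonrelevant, so accuracy is
# dominated by the huge tn count even when precision and recall are modest
p, r, a = prf(tp=20, fp=40, fn=60, tn=999_880)
```

Here precision is 1/3 and recall 1/4, yet accuracy is 0.9999, illustrating why accuracy is uninformative for retrieval.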
|------------
There is a good reason why accuracy is not an appropriate measure for information retrieval problems. In almost all circumstances, the data is extremely skewed: normally over 99.9% of the documents are in the nonrelevant category. A system tuned to maximize accuracy can appear to perform well by simply deeming all documents nonrelevant to all queries. Even if the system is quite good, trying to label some documents as relevant will almost always lead to a high rate of false positives. However, labeling all documents as nonrelevant is completely unsatisfying to an information retrieval system user. Users are always going to want to see some documents, and can be assumed to have a certain tolerance for seeing some false positives providing that they get some useful information. The measures of precision and recall concentrate the evaluation on the return of true positives, asking what percentage of the relevant documents have been found and how many false positives have also been returned.
***Bernoulli model->
|------------
TRAINBERNOULLINB(C, D)
 1  V ← EXTRACTVOCABULARY(D)
 2  N ← COUNTDOCS(D)
 3  for each c ∈ C
 4  do Nc ← COUNTDOCSINCLASS(D, c)
 5     prior[c] ← Nc/N
 6     for each t ∈ V
 7     do Nct ← COUNTDOCSINCLASSCONTAININGTERM(D, c, t)
 8        condprob[t][c] ← (Nct + 1)/(Nc + 2)
 9  return V, prior, condprob

APPLYBERNOULLINB(C, V, prior, condprob, d)
 1  Vd ← EXTRACTTERMSFROMDOC(V, d)
 2  for each c ∈ C
 3  do score[c] ← log prior[c]
 4     for each t ∈ V
 5     do if t ∈ Vd
 6        then score[c] += log condprob[t][c]
 7        else score[c] += log(1 − condprob[t][c])
 8  return arg max_{c∈C} score[c]

◮ Figure 13.3 NB algorithm (Bernoulli model): Training and testing. The add-one smoothing in Line 8 (top) is in analogy to Equation (13.7) with B = 2.
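A runnable Python rendering of the two procedures in Figure 13.3 (the toy training collection, loosely modeled on the chapter's China example, is invented for illustration):

```python
import math
from collections import defaultdict

def train_bernoulli_nb(classes, docs):
    """docs: list of (class, set_of_terms) pairs. Mirrors TrainBernoulliNB:
    condprob uses document frequency per class, with add-one smoothing (B=2)."""
    vocab = set().union(*(terms for _, terms in docs))
    n = len(docs)
    prior, condprob = {}, defaultdict(dict)
    for c in classes:
        class_docs = [terms for cls, terms in docs if cls == c]
        prior[c] = len(class_docs) / n
        for t in vocab:
            n_ct = sum(1 for terms in class_docs if t in terms)
            condprob[t][c] = (n_ct + 1) / (len(class_docs) + 2)
    return vocab, prior, condprob

def apply_bernoulli_nb(classes, vocab, prior, condprob, doc_terms):
    """Mirrors ApplyBernoulliNB: absent terms also contribute, via log(1 - p)."""
    scores = {}
    for c in classes:
        scores[c] = math.log(prior[c])
        for t in vocab:
            if t in doc_terms:
                scores[c] += math.log(condprob[t][c])
            else:
                scores[c] += math.log(1 - condprob[t][c])
    return max(scores, key=scores.get)

classes = ["china", "not-china"]
docs = [("china", {"chinese", "beijing"}),
        ("china", {"chinese", "shanghai"}),
        ("not-china", {"chinese", "tokyo", "japan"})]
V, prior, condprob = train_bernoulli_nb(classes, docs)
label = apply_bernoulli_nb(classes, V, prior, condprob,
                           {"chinese", "beijing", "shanghai"})
```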
|------------
13.3 The Bernoulli model

There are two different ways we can set up an NB classifier. The model we introduced in the previous section is the multinomial model. It generates one term from the vocabulary in each position of the document, where we assume a generative model that will be discussed in more detail in Section 13.4 (see also page 237).
|------------
An alternative to the multinomial model is the multivariate Bernoulli model or Bernoulli model. It is equivalent to the binary independence model of Section 11.3 (page 222), which generates an indicator for each term of the vocabulary, either 1 indicating presence of the term in the document or 0 indicating absence. Figure 13.3 presents training and testing algorithms for the Bernoulli model. The Bernoulli model has the same time complexity as the multinomial model.
|------------
The different generation models imply different estimation strategies and different classification rules. The Bernoulli model estimates P̂(t|c) as the fraction of documents of class c that contain term t (Figure 13.3, TRAINBERNOULLINB, line 8). In contrast, the multinomial model estimates P̂(t|c) as the fraction of tokens or fraction of positions in documents of class c that contain term t (Equation (13.7)). When classifying a test document, the Bernoulli model uses binary occurrence information, ignoring the number of occurrences, whereas the multinomial model keeps track of multiple occurrences. As a result, the Bernoulli model typically makes many mistakes when classifying long documents. For example, it may assign an entire book to the class China because of a single occurrence of the term China.
***Binary Independence Model->
|------------
11.3 The Binary Independence Model

The Binary Independence Model (BIM) we present in this section is the model that has traditionally been used with the PRP. It introduces some simple assumptions, which make estimating the probability function P(R|d, q) practical. Here, "binary" is equivalent to Boolean: documents and queries are both represented as binary term incidence vectors. That is, a document d is represented by the vector ~x = (x1, . . . , xM) where xt = 1 if term t is present in document d and xt = 0 if t is not present in d. With this representation, many possible documents have the same vector representation.
|------------
"Independence" means that terms are modeled as occurring in documents independently. The model recognizes no association between terms. This assumption is far from correct, but it nevertheless often gives satisfactory results in practice; it is the "naive" assumption of Naive Bayes models, discussed further in Section 13.4 (page 265). Indeed, the Binary Independence Model is exactly the same as the multivariate Bernoulli Naive Bayes model presented in Section 13.3 (page 263). In a sense this assumption is equivalent to an assumption of the vector space model, where each term is a dimension that is orthogonal to all other terms.
***lexicon->
|------------
We keep a dictionary of terms (sometimes also referred to as a vocabulary or lexicon; in this book, we use dictionary for the data structure and vocabulary for the set of terms). Then for each term, we have a list that records which documents the term occurs in. Each item in the list – which records that a term appeared in a document (and, later, often, the positions in the document) – is conventionally called a posting.4 The list is then called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings. The dictionary in Figure 1.3 has been sorted alphabetically and each postings list is sorted by document ID. We will see why this is useful in Section 1.3, below, but later we will also consider alternatives to doing this (Section 7.1.5).
***HITS->
|------------
This method of link analysis is known as HITS, which is an acronym for Hyperlink-Induced Topic Search.
***learning algorithm->
|------------
Using a learning method or learning algorithm, we then wish to learn a classifier or classification function γ that maps documents to classes:

γ : X → C   (13.1)

This type of learning is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = γ. The learning method Γ takes the training set D as input and returns the learned classification function γ.
***indexing->
|------------
4 Index construction

In this chapter, we look at how to construct an inverted index. We call this process index construction or indexing; the process or machine that performs it the indexer. The design of indexing algorithms is governed by hardware constraints. We therefore begin this chapter with a review of the basics of computer hardware that are relevant for indexing. We then introduce blocked sort-based indexing (Section 4.2), an efficient single-machine algorithm designed for static collections that can be viewed as a more scalable version of the basic sort-based indexing algorithm we introduced in Chapter 1. Section 4.3 describes single-pass in-memory indexing, an algorithm that has even better scaling properties because it does not hold the vocabulary in memory. For very large collections like the web, indexing has to be distributed over computer clusters with hundreds or thousands of machines.
|------------
We discuss this in Section 4.4. Collections with frequent changes require dynamic indexing, introduced in Section 4.5, so that changes in the collection are immediately reflected in the index. Finally, we cover some complicating issues that can arise in indexing – such as security and indexes for ranked retrieval – in Section 4.6.
|------------
The reader should be aware that building the subsystem that feeds raw text to the indexing process can in itself be a challenging problem.
***random variable C->
|------------
To summarize, we generate a document in the multinomial model (Figure 13.4) by first picking a class C = c with P(c), where C is a random variable taking its values from C. Next we generate term tk in position k with P(Xk = tk|c) for each of the nd positions of the document. The Xk all have the same distribution over terms for a given c. In the example in Figure 13.4, we show the generation of ⟨t1, t2, t3, t4, t5⟩ = ⟨Beijing, and, Taipei, join, WTO⟩, corresponding to the one-sentence document Beijing and Taipei join WTO.
***principal left eigenvector->
|------------
A Markov chain is characterized by an N × N transition probability matrix P each of whose entries is in the interval [0, 1]; the entries in each row of P add up to 1. The Markov chain can be in one of the N states at any given time-step; then, the entry Pij tells us the probability that the state at the next time-step is j, conditioned on the current state being i. Each entry Pij is known as a transition probability and depends only on the current state i; this is known as the Markov property. Thus, by the Markov property,

∀i, j, Pij ∈ [0, 1]  and  ∀i, ∑_{j=1}^{N} Pij = 1.   (21.1)

A matrix with non-negative entries that satisfies Equation (21.1) is known as a stochastic matrix. A key property of a stochastic matrix is that it has a principal left eigenvector corresponding to its largest eigenvalue, which is 1.

1. This is consistent with our usage of N for the number of documents in the collection.
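Numerically (Python/NumPy; the 2 × 2 transition matrix is invented), repeatedly left-multiplying a probability vector by a stochastic matrix converges to this principal left eigenvector for a well-behaved chain:

```python
import numpy as np

P = np.array([[0.1, 0.9],
              [0.5, 0.5]])        # a stochastic matrix: each row sums to 1

pi = np.ones(2) / 2               # start from the uniform distribution
for _ in range(100):
    pi = pi @ P                   # left multiplication: pi <- pi P
pi /= pi.sum()                    # keep it a probability distribution
# pi now (approximately) satisfies pi P = pi, i.e. it is the principal
# left eigenvector, with eigenvalue 1
```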
***Jaccard coefficient->
|------------
Consequently, we require more nuanced measures of the overlap in k-grams between a vocabulary term and q. The linear scan intersection can be adapted when the measure of overlap is the Jaccard coefficient for measuring the overlap between two sets A and B, defined to be |A ∩ B|/|A ∪ B|. The two sets we consider are the set of k-grams in the query q, and the set of k-grams in a vocabulary term. As the scan proceeds, we proceed from one vocabulary term t to the next, computing on the fly the Jaccard coefficient between q and t. If the coefficient exceeds a preset threshold, we add t to the output; if not, we move on to the next term in the postings. To compute the Jaccard coefficient, we need the set of k-grams in q and t.
|------------
Since we are scanning the postings for all k-grams in q, we immediately have these k-grams on hand. What about the k-grams of t? In principle, we could enumerate these on the fly from t; in practice this is not only slow but potentially infeasible since, in all likelihood, the postings entries themselves do not contain the complete string t but rather some encoding of t. The crucial observation is that to compute the Jaccard coefficient, we only need the length of the string t. To see this, recall the example of Figure 3.7 and consider the point when the postings scan for query q = bord reaches term t = boardroom. We know that two bigrams match. If the postings stored the (pre-computed) number of bigrams in boardroom (namely, 8), we have all the information we require to compute the Jaccard coefficient to be 2/(8 + 3 − 2); the numerator is obtained from the number of postings hits (2, from bo and rd) while the denominator is the sum of the number of bigrams in bord and boardroom, less the number of postings hits.
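This computation is easy to express (Python; the function names are my own):

```python
def bigrams(s):
    """The set of character bigrams of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard_from_postings(query_kgrams, hits, term_kgram_count):
    """Jaccard via |A ∩ B| = hits and |A ∪ B| = |A| + |B| - hits, using only
    the pre-stored k-gram count of the vocabulary term, not the term itself."""
    return hits / (len(query_kgrams) + term_kgram_count - hits)

# q = bord vs t = boardroom: 2 postings hits (bo, rd), boardroom has 8 bigrams
j = jaccard_from_postings(bigrams("bord"), hits=2, term_kgram_count=8)
# j = 2 / (3 + 8 - 2) = 2/9, as in the text
```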
|------------
We could replace the Jaccard coefficient by other measures that allow efficient on-the-fly computation during postings scans. How do we use these for spelling correction? One method that has some empirical support is to first use the k-gram index to enumerate a set of candidate vocabulary terms that are potential corrections of q. We then compute the edit distance from q to each term in this set, selecting terms from the set with small edit distance to q.
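The second step, computing edit distance to each candidate, can be sketched with the standard dynamic program; this is a generic Levenshtein implementation, not code from the text.

```python
# Standard dynamic-programming edit (Levenshtein) distance; the candidate
# set it would be applied to comes from the k-gram index scan above.

def edit_distance(s, t):
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n]
```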
|------------
Let S(dj) denote the set of shingles of document dj. Recall the Jaccard coefficient from page 61, which measures the degree of overlap between the sets S(d1) and S(d2) as |S(d1) ∩ S(d2)|/|S(d1) ∪ S(d2)|; denote this by J(S(d1), S(d2)). Our test for near duplication between d1 and d2 is to compute this Jaccard coefficient; if it exceeds a preset threshold (say, 0.9), we declare them near duplicates and eliminate one from indexing. However, this does not appear to have simplified matters: we still have to compute Jaccard coefficients pairwise.
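A small sketch of the near-duplicate test, using word-level 4-shingles; the documents and the threshold are illustrative.

```python
# Near-duplicate test: k-word shingles plus the Jaccard coefficient
# against a preset threshold. Documents here are illustrative.

def shingles(text, k=4):
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def near_duplicates(d1, d2, k=4, threshold=0.9):
    return jaccard(shingles(d1, k), shingles(d2, k)) >= threshold

d1 = "a rose is a rose is a rose"
d2 = "a rose is a rose is a flower"
```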
***token->
|------------
2. Tokenize the text.
|------------
3. Do linguistic preprocessing of tokens.
|------------
In this chapter we first briefly mention how the basic unit of a document can be defined and how the character sequence that it comprises is determined (Section 2.1). We then examine in detail some of the substantive linguistic issues of tokenization and linguistic preprocessing, which determine the vocabulary of terms which a system uses (Section 2.2). Tokenization is the process of chopping character streams into tokens, while linguistic preprocessing then deals with building equivalence classes of tokens, which are the set of terms that are indexed. Indexing itself is covered in Chapters 1 and 4.
|------------
2.2 Determining the vocabulary of terms

2.2.1 Tokenization

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization:

Input: Friends, Romans, Countrymen, lend me your ears;
Output: Friends Romans Countrymen lend me your ears

These tokens are often loosely referred to as terms or words, but it is sometimes important to make a type/token distinction. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A type is the class of all tokens containing the same character sequence. A term is a (perhaps normalized) type that is included in the IR system’s dictionary. The set of index terms could be entirely distinct from the tokens; for instance, they could be semantic identifiers in a taxonomy, but in practice in modern IR systems they are strongly related to the tokens in the document. However, rather than being exactly the tokens that appear in the document, they are usually derived from them by various normalization processes which are discussed in Section 2.2.3. For example, if the document to be indexed is to sleep perchance to dream, then there are 5 tokens, but only 4 types (since there are 2 instances of to). However, if to is omitted from the index (as a stop word, see Section 2.2.2 (page 27)), then there will be only 3 terms: sleep, perchance, and dream.
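The token/type/term counts in the example can be reproduced in a few lines; the stop-word list here is just {to}, matching the example.

```python
# Counting tokens, types, and terms for the example above.

text = "to sleep perchance to dream"
tokens = text.split()          # token = each occurrence
types = set(tokens)            # type = distinct character sequence
stop_words = {"to"}
terms = types - stop_words     # term = type kept in the dictionary

# len(tokens) == 5, len(types) == 4, len(terms) == 3
```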
***translation model->
|------------
Basic LMs do not address issues of alternate expression, that is, synonymy, or any deviation in use of language between queries and documents. Berger and Lafferty (1999) introduce translation models to bridge this query-document gap. A translation model lets you generate query words not in a document by translation to alternate terms with similar meaning. This also provides a basis for performing cross-language IR. We assume that the translation model can be represented by a conditional probability distribution T(·|·) between vocabulary terms. The form of the translation query generation model is then:

P(q|Md) = ∏_{t∈q} ∑_{v∈V} P(v|Md) T(t|v)    (12.15)

The term P(v|Md) is the basic document language model, and the term T(t|v) performs translation. This model is clearly more computationally intensive and we need to build a translation model. The translation model is usually built using separate resources (such as a traditional thesaurus or bilingual dictionary or a statistical machine translation system’s translation dictionary), but can be built using the document collection if there are pieces of text that naturally paraphrase or summarize other pieces of text. Candidate examples are documents and their titles or abstracts, or documents and anchor-text pointing to them in a hypertext environment.
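Equation (12.15) can be sketched on toy data; all probabilities below are made-up illustrative values, not estimates from any real model.

```python
# Toy instance of Equation (12.15): the query likelihood is a product over
# query terms t of the translation-smoothed probability sum_v P(v|Md) T(t|v).

p_doc = {"automobile": 0.6, "repair": 0.4}            # P(v|Md), illustrative
T = {                                                 # T(t|v), illustrative
    "car": {"automobile": 0.5, "repair": 0.0},
    "fix": {"automobile": 0.0, "repair": 0.7},
}

def query_likelihood(query_terms):
    prob = 1.0
    for t in query_terms:
        prob *= sum(p_doc[v] * T[t].get(v, 0.0) for v in p_doc)
    return prob

p = query_likelihood(["car", "fix"])
# p == (0.6*0.5 + 0.4*0.0) * (0.6*0.0 + 0.4*0.7)
```

Note how the document need not contain "car" at all: the translation table routes probability mass from "automobile" to it.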
|------------
Building extended LM approaches remains an active area of research. In general, translation models, relevance feedback models, and model comparison approaches have all been demonstrated to improve performance over the basic query likelihood LM.
***blocked storage->
|------------
5.2.2 Blocked storage

We can further compress the dictionary by grouping terms in the string into blocks of size k and keeping a term pointer only for the first term of each block (Figure 5.5). We store the length of the term in the string as an additional byte at the beginning of the term. We thus eliminate k − 1 term pointers, but need an additional k bytes for storing the length of each term.
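The trade-off can be made concrete with a little arithmetic; the 3-byte term-pointer size below is an assumption for illustration, not stated in this excerpt.

```python
# Net space saved per block by blocked storage: we drop k - 1 term pointers
# but add one length byte per term. The 3-byte pointer size is an assumed,
# illustrative figure.

def net_saving_per_block(k, pointer_bytes=3, length_bytes=1):
    saved = (k - 1) * pointer_bytes  # pointers eliminated
    cost = k * length_bytes          # one length byte per term
    return saved - cost

# With k = 4 and 3-byte pointers: (4 - 1) * 3 - 4 = 5 bytes per block.
```

Larger k saves more pointer space, at the cost of slower lookup within a block.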
***tf-idf->
|------------
The tf-idf weighting scheme assigns to term t a weight in document d given by

tf-idf_{t,d} = tf_{t,d} × idf_t.    (6.8)

In other words, tf-idf_{t,d} assigns to term t a weight in document d that is

1. highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
3. lowest when the term occurs in virtually all documents.
|------------
At this point, we may view each document as a vector with one component corresponding to each term in the dictionary, together with a weight for each component that is given by (6.8). For dictionary terms that do not occur in a document, this weight is zero. This vector form will prove to be crucial to scoring and ranking; we will develop these ideas in Section 6.3. As a first step, we introduce the overlap score measure: the score of a document d is the sum, over all query terms, of the number of times each of the query terms occurs in d. We can refine this idea so that we add up not the number of occurrences of each query term t in d, but instead the tf-idf weight of each term in d.
|------------
Score(q, d) = ∑_{t∈q} tf-idf_{t,d}.    (6.9)

In Section 6.3 we will develop a more rigorous form of Equation (6.9).
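Equations (6.8) and (6.9) can be sketched on a toy collection; the base-10 logarithm for idf is one common choice, assumed here, and all counts are illustrative.

```python
# Toy tf-idf scorer: tf-idf(t, d) = tf(t, d) * log10(N / df(t)), and the
# document score is the sum over query terms (Equation (6.9)).

import math

N = 4                                    # documents in the toy collection
df = {"car": 2, "insurance": 1}          # document frequencies, illustrative
tf = {                                   # term frequencies per document
    "d1": {"car": 3, "insurance": 1},
    "d2": {"car": 1},
}

def tf_idf(t, d):
    return tf[d].get(t, 0) * math.log10(N / df[t])

def score(query, d):
    return sum(tf_idf(t, d) for t in query)

s1 = score(["car", "insurance"], "d1")   # 3*log10(2) + 1*log10(4)
```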
***prefix-free code->
***maximum a posteriori class->
|------------
In text classification, our goal is to find the best class for the document. The best class in NB classification is the most likely or maximum a posteriori (MAP) class c_map:

c_map = arg max_{c∈C} P̂(c|d) = arg max_{c∈C} P̂(c) ∏_{1≤k≤n_d} P̂(t_k|c).    (13.3)

We write P̂ for P because we do not know the true values of the parameters P(c) and P(t_k|c), but estimate them from the training set, as we will see in a moment.
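Equation (13.3) can be sketched as a MAP decision rule; working in log space is a standard trick to avoid floating-point underflow on long documents, and the estimates below are illustrative stand-ins for quantities learned from a training set.

```python
# MAP class decision (Equation (13.3)) in log space. Priors and conditional
# probabilities are illustrative, not trained estimates.

import math

priors = {"china": 0.6, "uk": 0.4}                  # P^(c)
cond = {                                            # P^(t|c)
    "china": {"beijing": 0.4, "london": 0.1},
    "uk":    {"beijing": 0.05, "london": 0.5},
}

def map_class(doc_tokens):
    def log_score(c):
        # log P^(c) + sum_k log P^(t_k|c); the argmax is unchanged by log
        return math.log(priors[c]) + sum(math.log(cond[c][t]) for t in doc_tokens)
    return max(priors, key=log_score)

c = map_class(["beijing", "beijing", "london"])
```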
***information need->
|------------
Our goal is to develop a system to address the ad hoc retrieval task. This is the most standard IR task. In it, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query. An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need. Our example above was rather artificial in that the information need was defined in terms of particular words, whereas usually a user is interested in a topic like “pipeline leaks” and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as pipeline rupture. To assess the effectiveness of an IR system (i.e., the quality of its search results), a user will usually want to know two key statistics about the system’s returned results for a query:

Precision: What fraction of the returned results are relevant to the information need?

Recall: What fraction of the relevant documents in the collection were returned by the system?

Detailed discussion of relevance and evaluation measures including precision and recall is found in Chapter 8.
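The two statistics can be computed directly from the returned and relevant document sets; the sets below are illustrative.

```python
# Precision and recall from returned vs. relevant document sets.

def precision(returned, relevant):
    return len(returned & relevant) / len(returned)

def recall(returned, relevant):
    return len(returned & relevant) / len(relevant)

returned = {"d1", "d2", "d3", "d4"}   # illustrative system output
relevant = {"d2", "d4", "d5"}         # illustrative gold judgments
# precision = 2/4, recall = 2/3
```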
|------------
8.1 Information retrieval system evaluation To measure ad hoc information retrieval effectiveness in the standard way, we need a test collection consisting of three things: 1. A document collection 2. A test suite of information needs, expressible as queries 3. A set of relevance judgments, standardly a binary assessment of either relevant or nonrelevant for each query-document pair.
|------------
The standard approach to information retrieval system evaluation revolves around the notion of relevant and nonrelevant documents. With respect to a user information need, a document in the test collection is given a binary classification as either relevant or nonrelevant. This decision is referred to as the gold standard or ground truth judgment of relevance. The test document collection and suite of information needs have to be of a reasonable size: you need to average performance over fairly large test sets, as results are highly variable over different documents and information needs. As a rule of thumb, 50 information needs has usually been found to be a sufficient minimum.
|------------
Relevance is assessed relative to an information need, not a query. For example, an information need might be:

Information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
|------------
This might be translated into a query such as: wine AND red AND white AND heart AND attack AND effective A document is relevant if it addresses the stated information need, not be- cause it just happens to contain all the words in the query. This distinction is often misunderstood in practice, because the information need is not overt.
|------------
But, nevertheless, an information need is present. If a user types python into a web search engine, they might be wanting to know where they can purchase a pet python. Or they might be wanting information on the programming language Python. From a one word query, it is very difficult for a system to know what the information need is. But, nevertheless, the user has one, and can judge the returned results on the basis of their relevance to it. To evaluate a system, we require an overt expression of an information need, which can be used for judging returned documents as relevant or nonrelevant. At this point, we make a simplification: relevance can reasonably be thought of as a scale, with some documents highly relevant and others marginally so. But for the moment, we will use just a binary decision of relevance. We discuss the reasons for using binary relevance judgments and alternatives in Section 8.5.1.
***spam->
|------------
Features for text

The default in both ad hoc retrieval and text classification is to use terms as features. However, for text classification, a great deal of mileage can be achieved by designing additional features which are suited to a specific problem. Unlike the case of IR query languages, since these features are internal to the classifier, there is no problem of communicating these features to an end user. This process is generally referred to as feature engineering. At present, feature engineering remains a human craft, rather than something done by machine learning. Good feature engineering can often markedly improve the performance of a text classifier. It is especially beneficial in some of the most important applications of text classification, like spam and porn filtering.
|------------
19.2.2 Spam Early in the history of web search, it became clear that web search engines were an important means for connecting advertisers to prospective buyers.
|------------
A user searching for maui golf real estate is not merely seeking news or entertainment on the subject of housing on golf courses on the island of Maui, but is instead likely to be seeking to purchase such a property. Sellers of such property and their agents, therefore, have a strong incentive to create web pages that rank highly on this query. In a search engine whose scoring was based on term frequencies, a web page with numerous repetitions of maui golf real estate would rank highly. This led to the first generation of spam, which (in the context of web search) is the manipulation of web page content for the purpose of appearing high up in search results for selected keywords.
|------------
To avoid irritating users with these repetitions, sophisticated spammers resorted to such tricks as rendering these repeated terms in the same color as the background. Despite these words being consequently invisible to the human user, a search engine indexer would parse the invisible words out of

◮ Figure 19.5 Cloaking as used by spammers.
***bag-of-words->
|------------
◮ Table 13.4

                                               c1        c2        class selected
true probability P(c|d)                        0.6       0.4       c1
P̂(c) ∏_{1≤k≤n_d} P̂(t_k|c) (Equation (13.13))  0.00099   0.00001
NB estimate P̂(c|d)                             0.99      0.01      c1

glish in Figure 13.7 are examples of highly dependent terms. In addition, the multinomial model makes an assumption of positional independence. The Bernoulli model ignores positions in documents altogether because it only cares about absence or presence. This bag-of-words model discards all information that is communicated by the order of words in natural language sentences. How can NB be a good text classifier when its model of natural language is so oversimplified? The answer is that even though the probability estimates of NB are of low quality, its classification decisions are surprisingly good. Consider a document d with true probabilities P(c1|d) = 0.6 and P(c2|d) = 0.4, as shown in Table 13.4. Assume that d contains many terms that are positive indicators for c1 and many terms that are negative indicators for c2. Thus, when using the multinomial model in Equation (13.13), P̂(c1) ∏_{1≤k≤n_d} P̂(t_k|c1) will be much larger than P̂(c2) ∏_{1≤k≤n_d} P̂(t_k|c2) (0.00099 vs. 0.00001 in the table). After division by 0.001 to get well-formed probabilities for P(c|d), we end up with one estimate that is close to 1.0 and one that is close to 0.0. This is common: the winning class in NB classification usually has a much larger probability than the other classes, and the estimates diverge very significantly from the true probabilities. But the classification decision is based on which class gets the highest score. It does not matter how accurate the estimates are. Despite the bad estimates, NB estimates a higher probability for c1 and therefore assigns d to the correct class in Table 13.4. Correct estimation implies accurate prediction, but accurate prediction does not imply correct estimation. NB classifiers estimate badly, but often classify well.
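The renormalization step from the table can be reproduced directly: dividing the two unnormalized scores by their sum pushes the winning class toward 1.0 even though both scores are far from the true probabilities.

```python
# Renormalizing the unnormalized NB scores from the example above.

scores = {"c1": 0.00099, "c2": 0.00001}
total = sum(scores.values())
estimates = {c: s / total for c, s in scores.items()}
winner = max(estimates, key=estimates.get)
# estimates are about {"c1": 0.99, "c2": 0.01}; winner is "c1"
```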
***structural SVM->
|------------
However, approaching IR result ranking like this is not necessarily the right way to think about the problem. Statisticians normally first divide problems into classification problems (where a categorical variable is predicted) versus regression problems (where a real number is predicted). In between is the specialized field of ordinal regression, where a ranking is predicted. Machine learning for ad hoc retrieval is most properly thought of as an ordinal regression problem, where the goal is to rank a set of documents for a query, given training data of the same sort. This formulation gives some additional power, since documents can be evaluated relative to other candidate documents for the same query, rather than having to be mapped to a global scale of goodness, while also weakening the problem space, since just a ranking is required rather than an absolute measure of relevance. Issues of ranking are especially germane in web search, where the ranking at the very top of the results list is exceedingly important, whereas decisions of relevance of a document to a query may be much less important. Such work can and has been pursued using the structural SVM framework which we mentioned in Section 15.2.2, where the class being predicted is a ranking of results for a query, but here we will present the slightly simpler ranking SVM.
***cross-entropy->
***regularization->
|------------
The margin can be less than 1 for a point ~xi by setting ξi > 0, but then one pays a penalty of Cξi in the minimization for having done that. The sum of the ξi gives an upper bound on the number of training errors. Soft-margin SVMs minimize training error traded off against margin. The parameter C is a regularization term, which provides a way to control overfitting: as C becomes large, it is unattractive to not respect the data at the cost of reducing the geometric margin; when it is small, it is easy to account for some data points with the use of slack variables and to have a fat margin placed so it models the bulk of the data.
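The objective described here, a margin term plus C times the summed slacks, can be sketched numerically; the data and parameter values below are illustrative.

```python
# Soft-margin SVM objective: (1/2)||w||^2 + C * sum_i xi_i, where each
# slack is xi_i = max(0, 1 - y_i (w . x_i + b)) (the hinge loss).

def soft_margin_objective(w, b, data, C):
    margin_term = 0.5 * sum(wj * wj for wj in w)
    slack = sum(max(0.0, 1.0 - y * (sum(wj * xj for wj, xj in zip(w, x)) + b))
                for x, y in data)
    return margin_term + C * slack

# Illustrative 2-D data: the third point violates the margin (slack 1.5).
data = [([2.0, 0.0], 1), ([-1.0, 0.0], -1), ([0.5, 0.0], -1)]
obj = soft_margin_objective([1.0, 0.0], 0.0, data, C=1.0)
# obj == 0.5 (margin term) + 1.0 * 1.5 (slack) == 2.0
```

Raising C makes the slack term dominate, penalizing margin violations more heavily, exactly the overfitting control described above.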
***Ward’s method->
|------------
An important HAC technique not discussed here is Ward’s method (Ward Jr. 1963, El-Hamdouchi and Willett 1986), also called minimum variance clustering. In each step, it selects the merge with the smallest RSS (Chapter 16, page 360). The merge criterion in Ward’s method (a function of all individual distances from the centroid) is closely related to the merge criterion in GAAC (a function of all individual similarities to the centroid).
***HTML->
|------------
The invention of hypertext, envisioned by Vannevar Bush in the 1940’s and first realized in working systems in the 1970’s, significantly precedes the formation of the World Wide Web (which we will simply refer to as the Web) in the 1990’s. Web usage has shown tremendous growth to the point where it now claims a good fraction of humanity as participants, by relying on a simple, open client-server design: (1) the server communicates with the client via a protocol (the http or hypertext transfer protocol) that is lightweight and simple, asynchronously carrying a variety of payloads (text, images and – over time – richer media such as audio and video files) encoded in a simple markup language called HTML (for hypertext markup language); (2) the client – generally a browser, an application within a graphical user environment – can ignore what it does not understand. Each of these seemingly innocuous features has contributed enormously to the growth of the Web, so it is worthwhile to examine them further.
***blog->
|------------
However, many structured data sources containing text are best modeled as structured documents rather than relational data. We call the search over such structured documents structured retrieval. Queries in structured retrieval can be either structured or unstructured, but we will assume in this chapter that the collection consists only of structured documents. Applications of structured retrieval include digital libraries, patent databases, blogs, text in which entities like persons and locations have been tagged (in a process called named entity tagging) and output from office suites like OpenOffice that save documents as marked up text. In all of these applications, we want to be able to run queries that combine textual criteria with structural criteria.
***caching->
|------------
• Access to data in memory is much faster than access to data on disk. It takes a few clock cycles (perhaps 5 × 10^−9 seconds) to access a byte in memory, but much longer to transfer it from disk (about 2 × 10^−8 seconds). Consequently, we want to keep as much data as possible in memory, especially those data that we need to access frequently. We call the technique of keeping frequently used disk data in main memory caching.

• When doing a disk read or write, it takes a while for the disk head to move to the part of the disk where the data are located. This time is called the seek time and it averages 5 ms for typical disks. No data are being transferred during the seek. To maximize data transfer rates, chunks of data that will be read together should therefore be stored contiguously on disk. For example, using the numbers in Table 4.1 it may take as little as 0.2 seconds to transfer 10 megabytes (MB) from disk to memory if it is stored as one chunk, but up to 0.2 + 100 × (5 × 10^−3) = 0.7 seconds if it is stored in 100 noncontiguous chunks, because we need to move the disk head up to 100 times.
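The transfer-time arithmetic above can be checked directly with the same figures:

```python
# 10 MB read contiguously versus in 100 chunks with a 5 ms seek before
# each chunk, using the per-byte transfer time from the example.

transfer_per_byte = 2e-8   # seconds per byte from disk
seek_time = 5e-3           # seconds per seek
size = 10 * 10**6          # 10 MB

contiguous = size * transfer_per_byte     # 0.2 s
chunked = contiguous + 100 * seek_time    # 0.2 + 100 * 0.005 = 0.7 s
```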
|------------
DNS resolution is a well-known bottleneck in web crawling. Due to the distributed nature of the Domain Name Service, DNS resolution may entail multiple requests and round-trips across the internet, requiring seconds and sometimes even longer. Right away, this puts in jeopardy our goal of fetching several hundred documents a second. A standard remedy is to introduce caching: URLs for which we have recently performed DNS lookups are likely to be found in the DNS cache, avoiding the need to go to the DNS servers on the internet. However, obeying politeness constraints (see Section 20.2.3) limits the cache hit rate.
***Rand index->
|------------
An alternative to this information-theoretic interpretation of clustering is to view it as a series of decisions, one for each of the N(N − 1)/2 pairs of documents in the collection. We want to assign two documents to the same cluster if and only if they are similar. A true positive (TP) decision assigns two similar documents to the same cluster; a true negative (TN) decision assigns two dissimilar documents to different clusters. There are two types of errors we can commit. A false positive (FP) decision assigns two dissimilar documents to the same cluster. A false negative (FN) decision assigns two similar documents to different clusters. The Rand index (RI) measures the percentage of decisions that are correct. That is, it is simply accuracy (Section 8.3, page 155).
|------------
The Rand index gives equal weight to false positives and false negatives.
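A direct pairwise computation of RI, on illustrative labels: a pair counts as correct (TP or TN) when the clustering and the gold classes agree on whether its two documents belong together.

```python
# Rand index over all N(N-1)/2 pairs of documents.

from itertools import combinations

def rand_index(clusters, gold):
    correct = total = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = gold[i] == gold[j]
        correct += same_cluster == same_class  # TP or TN
        total += 1
    return correct / total

clusters = [0, 0, 1, 1]          # illustrative clustering
gold     = ["a", "a", "b", "a"]  # illustrative gold classes
ri = rand_index(clusters, gold)  # 3 of the 6 pair decisions are correct
```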
***model complexity->
|------------
We can also think of variance as the model complexity or, equivalently, memory capacity of the learning method – how detailed a characterization of the training set it can remember and then apply to new data. This capacity corresponds to the number of independent parameters available to fit the training set. Each kNN neighborhood Sk makes an independent classification decision. The parameter in this case is the estimate P̂(c|Sk) from Figure 14.7.
|------------
A second type of criterion for cluster cardinality imposes a penalty for each new cluster – where conceptually we start with a single cluster containing all documents and then search for the optimal number of clusters K by successively incrementing K by one. To determine the cluster cardinality in this way, we create a generalized objective function that combines two elements: distortion, a measure of how much documents deviate from the prototype of their clusters (e.g., RSS for K-means); and a measure of model complexity. We interpret a clustering here as a model of the data. Model complexity in clustering is usually the number of clusters or a function thereof. For K-means, we then get this selection criterion for K:

K = arg min_K [RSS_min(K) + λK]    (16.11)

where λ is a weighting factor. A large value of λ favors solutions with few clusters. For λ = 0, there is no penalty for more clusters and K = N is the best solution.
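Equation (16.11) as a selection procedure, with a made-up decreasing RSS curve; a larger λ trades fit for fewer clusters.

```python
# Selecting K by minimizing the penalized criterion RSS_min(K) + lambda * K.
# The RSS values are illustrative (RSS decreases as K grows).

def choose_k(rss_by_k, lam):
    return min(rss_by_k, key=lambda k: rss_by_k[k] + lam * k)

rss_by_k = {1: 100.0, 2: 40.0, 3: 30.0, 4: 28.0, 5: 27.5}

k_small_penalty = choose_k(rss_by_k, lam=0.1)   # near-zero penalty -> large K
k_big_penalty = choose_k(rss_by_k, lam=20.0)    # heavy penalty -> small K
```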
***inverted index->
|------------
This idea is central to the first major concept in information retrieval, the inverted index. The name is actually redundant: an index always maps back from terms to the parts of a document where they occur. Nevertheless, inverted index, or sometimes inverted file, has become the standard term in information retrieval.3 The basic idea of an inverted index is shown in Figure 1.3.
|------------
1.2 A first take at building an inverted index To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. The major steps in this are: 1. Collect the documents to be indexed: Friends, Romans, countrymen. So let it be with Caesar . . .
|------------
4. In a (non-positional) inverted index, a posting is just a document ID, but it is inherently associated with a term, via the postings list it is placed on; sometimes we will also talk of a (term, docID) pair as a posting.
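A minimal (non-positional) inverted index along these lines can be sketched in a few lines: each term maps to a sorted postings list of docIDs.

```python
# Minimal in-memory inverted index over tokenized documents.

def build_index(docs):
    index = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            # docIDs are appended in increasing order, so lists stay sorted
            index.setdefault(term, []).append(doc_id)
    return index

docs = ["Friends Romans countrymen",
        "So let it be with Caesar",
        "Romans go home"]
index = build_index(docs)
# index["romans"] is the postings list [0, 2]
```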
***Technology->
|------------
Text Retrieval Conference (TREC). The U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation series since 1992. Within this framework, there have been many tracks over a range of different test collections, but the best known test collections are the ones used for the TREC Ad Hoc track during the first 8 TREC evaluations between 1992 and 1999. In total, these test collections comprise 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles) and relevance judgments for 450 information needs, which are called topics and specified in detailed text passages. Individual test collections are defined over different subsets of this data. The early TRECs each consisted of 50 information needs, evaluated over different but overlapping sets of documents. TRECs 6–8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles. This is probably the best subcollection to use in future work, because it is the largest and the topics are more consistent. Because the test document collections are so large, there are no exhaustive relevance judgments. Rather, NIST assessors’ relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation for which the information need was developed.
***zone->
|------------
6.1 Parametric and zone indexes

We have thus far viewed a document as a sequence of terms. In fact, most documents have additional structure. Digital documents generally encode, in machine-recognizable form, certain metadata associated with each document. By metadata, we mean specific forms of data about a document, such as its author(s), title and date of publication. This metadata would generally include fields such as the date of creation and the format of the document, as well as the author and possibly the title of the document. The possible values of a field should be thought of as finite – for instance, the set of all dates of authorship.
|------------
Zones are similar to fields, except the contents of a zone can be arbitrary free text. Whereas a field may take on a relatively small set of values, a zone can be thought of as an arbitrary, unbounded amount of text. For instance, document titles and abstracts are generally treated as zones. We may build a separate inverted index for each zone of a document, to support queries such as “find documents with merchant in the title and william in the author list and the phrase gentle rain in the body”. This has the effect of building an index that looks like Figure 6.2. Whereas the dictionary for a parametric index comes from a fixed vocabulary (the set of languages, or the set of dates), the dictionary for a zone index must structure whatever vocabulary stems from the text of that zone.
|------------
In fact, we can reduce the size of the dictionary by encoding the zone in which a term occurs in the postings. In Figure 6.3 for instance, we show how occurrences of william in the title and author zones of various documents are encoded. Such an encoding is useful when the size of the dictionary is a concern (because we require the dictionary to fit in main memory). But there is another important reason why the encoding of Figure 6.3 is useful: the efficient computation of scores using a technique we will call weighted zone scoring.
|------------
15.3.2 Improving classifier performance

For any particular application, there is usually significant room for improving classifier effectiveness through exploiting features specific to the domain or document collection. Often documents will contain zones which are especially useful for classification. Often there will be particular subvocabularies which demand special treatment for optimal classification effectiveness.
|------------
Document zones in text classification As already discussed in Section 6.1, documents usually have zones, such as mail message headers like the subject and author, or the title and keywords of a research article. Text classifiers can usually gain from making use of these zones during training and classification.
|------------
Upweighting document zones. In text classification problems, you can frequently get a nice boost to effectiveness by differentially weighting contributions from different document zones. Often, upweighting title words is particularly effective (Cohen and Singer 1999, p. 163). As a rule of thumb, it is often effective to double the weight of title words in text classification problems. You can also get value from upweighting words from pieces of text that are not so much clearly defined zones, but where nevertheless evidence from document structure or content suggests that they are important.
|------------
Separate feature spaces for document zones. There are two strategies that can be used for document zones. Above we upweighted words that appear in certain zones. This means that we are using the same features (that is, parameters are “tied” across different zones), but we pay more attention to the occurrence of terms in particular zones. An alternative strategy is to have a completely separate set of features and corresponding parameters for words occurring in different zones. This is in principle more powerful: a word could usually indicate the topic Middle East when in the title but Commodities when in the body of a document. But, in practice, tying parameters is usually more successful. Having separate feature sets means having two or more times as many parameters, many of which will be much more sparsely seen in the training data, and hence with worse estimates, whereas upweighting has no bad effects of this sort. Moreover, it is quite uncommon for words to have different preferences when appearing in different zones; it is mainly the strength of their vote that should be adjusted. Nevertheless, ultimately this is a contingent result, depending on the nature and quantity of the training data.
|------------
Connections to text summarization. In Section 8.7, we mentioned the field of text summarization, and how most work in that field has adopted the limited goal of extracting and assembling pieces of the original text that are judged to be central based on features of sentences that consider the sentence’s position and content. Much of this work can be used to suggest zones that may be distinctively useful for text classification. For example, Kołcz et al. (2000) consider a form of feature selection where you classify documents based only on words in certain zones. Based on text summarization research, they consider using (i) only the title, (ii) only the first paragraph, (iii) only the paragraph with the most title words or keywords, (iv) the first two paragraphs or the first and last paragraph, or (v) all sentences with a minimum number of title words or keywords. In general, these positional feature selection methods produced results as good as mutual information (Section 13.5.1), and resulted in quite competitive classifiers. Ko et al. (2004) also took inspiration from text summarization research to upweight sentences with either words from the title or words that are central to the document’s content, leading to classification accuracy gains of almost 1%. This presumably works because most such sentences are somehow more central to the concerns of the document.
***held-out data->
|------------
In a clean statistical text classification experiment, you should never run any program on or even look at the test set while developing a text classification system. Instead, set aside a development set for testing while you develop your method. When such a set serves the primary purpose of finding a good value for a parameter, for example, the number of selected features, then it is also called held-out data. Train the classifier on the rest of the training set with different parameter values, and then select the value that gives best results on the held-out part of the training set. Ideally, at the very end, when all parameters have been set and the method is fully specified, you run one final experiment on the test set and publish the results. Because no information about the test set was used in developing the classifier, the results of this final experiment should be indicative of actual performance on new data.
◮ Table 13.10 Data for parameter estimation exercise.
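The select-on-held-out procedure can be sketched as a small grid search. This is a toy illustration under stated assumptions: `fit` and `accuracy` stand in for whatever training and evaluation routines your classifier provides, and the threshold "model" below is purely hypothetical.

```python
def pick_parameter(train, heldout, candidates, fit, accuracy):
    """For each candidate parameter value, train on the training part and
    score on the held-out part; return the best-scoring value. The test
    set is never touched here."""
    best_value, best_acc = None, -1.0
    for value in candidates:
        model = fit(train, value)
        acc = accuracy(model, heldout)
        if acc > best_acc:
            best_value, best_acc = value, acc
    return best_value

# Toy demo: the "parameter" is a decision threshold and the "model" is
# just that threshold; heldout is (score, label) pairs.
train = None
heldout = [(0.2, 0), (0.7, 1), (0.9, 1)]
fit = lambda data, t: t
accuracy = lambda t, held: sum((x > t) == y for x, y in held) / len(held)
best = pick_parameter(train, heldout, [0.1, 0.5, 0.8], fit, accuracy)
```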
***stemming->
|------------
2.2.4 Stemming and lemmatization
For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
|------------
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:
am, are, is ⇒ be
car, cars, car’s, cars’ ⇒ car
The result of this mapping of text will be something like:
the boy’s cars are different colors ⇒ the boy car be differ color
However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma.
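To make the "crude heuristic process that chops off the ends of words" concrete, here is a deliberately naive suffix stripper. It is not the Porter stemmer or any real algorithm; the suffix list and length guard are made up, and its failure to conflate organize with organizing illustrates exactly why real stemmers use ordered rule sets.

```python
def crude_stem(token):
    """Strip the first matching suffix, keeping at least a 3-letter stem.
    A toy stand-in for real stemmers (e.g. Porter's), whose rules are far
    more careful; the suffix list here is purely illustrative."""
    for suffix in ("ational", "izing", "ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)]
    return token
```

Note that `crude_stem` maps organizing to organ but leaves organize untouched, so the two fail to conflate, the kind of error a measured, rule-ordered stemmer is designed to avoid.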
|------------
Experiments on and discussion of the positive and negative impact of stemming in English can be found in the following works: Salton (1989), Harman (1991), Krovetz (1995), Hull (1996). Hollink et al. (2004) provide detailed results for the effectiveness of language-specific methods on 8 European languages. In terms of percent change in mean average precision (see page 159) over a baseline system, diacritic removal gains up to 23% (being especially helpful for Finnish, French, and Swedish). Stemming helped markedly for Finnish (30% improvement) and Spanish (10% improvement), but for most languages, including English, the gain from stemming was in the range 0–5%, and results from a lemmatizer were poorer still. Compound splitting gained 25% for Swedish and 15% for German, but only 4% for Dutch. Rather than language-particular methods, indexing character k-grams (as we suggested for Chinese) could often give as good or better results: using within-word character 4-grams rather than words gave gains of 37% in Finnish, 27% in Swedish, and 20% in German, while even being slightly positive for other languages, such as Dutch, Spanish, and English. Tomlinson (2003) presents broadly similar results. Bar-Ilan and Gutman (2005) suggest that, at the time of their study (2003), the major commercial web search engines suffered from lacking decent language-particular processing; for example, a query on www.google.fr for l’électricité did not separate off the article l’ but only matched pages with precisely this string of article+noun.
***LSA->
|------------
Polysemy on the other hand refers to the case where a term such as charge has multiple meanings, so that the computed similarity ~q · ~d overestimates the similarity that a user would perceive. Could we use the co-occurrences of terms (whether, for instance, charge occurs in a document containing steed versus in a document containing electron) to capture the latent semantic associations of terms and alleviate these problems? Even for a collection of modest size, the term-document matrix C is likely to have several tens of thousands of rows and columns, and a rank in the tens of thousands as well. In latent semantic indexing (sometimes referred to as latent semantic analysis (LSA)), we use the SVD to construct a low-rank approximation Ck to the term-document matrix, for a value of k that is far smaller than the original rank of C. In the experimental work cited later in this section, k is generally chosen to be in the low hundreds. We thus map each row/column (respectively corresponding to a term/document) to a k-dimensional space; this space is defined by the k principal eigenvectors (corresponding to the largest eigenvalues) of CC^T and C^TC. Note that the matrix Ck is itself still an M × N matrix, irrespective of k.
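The truncated-SVD construction can be sketched in a few lines of numerical code. The 4 × 4 term-document matrix below is a toy example (in practice C has tens of thousands of rows and columns and k is in the low hundreds).

```python
import numpy as np

# Toy term-document matrix C (terms x documents); entries are counts.
C = np.array([[1.0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]])

# Full SVD; singular values in s come back sorted in decreasing order.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2                                         # target rank, k << rank(C)
Ck = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation, still M x N
doc_coords = np.diag(s[:k]) @ Vt[:k, :]       # each document as a k-dim vector
```

Note that `Ck` keeps the original M × N shape even though its rank is at most k, exactly as the text observes, while `doc_coords` gives the k-dimensional representation used for retrieval in the reduced space.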
***topic classification->
|------------
Such more general classes are usually referred to as topics, and the classification task is then called text classification, text categorization, topic classification, or topic spotting. An example for China appears in Figure 13.1. Standing queries and topics differ in their degree of specificity, but the methods for solving routing, filtering, and text classification are essentially the same. We therefore include routing and filtering under the rubric of text classification in this and the following chapters.
***Frobenius norm->
|------------
Given an M × N matrix C and a positive integer k, we wish to find an M × N matrix Ck of rank at most k, so as to minimize the Frobenius norm of the matrix difference X = C − Ck, defined to be
‖X‖F = √( ∑_{i=1}^{M} ∑_{j=1}^{N} X_{ij}² ).   (18.15)
Thus, the Frobenius norm of X measures the discrepancy between Ck and C; our goal is to find a matrix Ck that minimizes this discrepancy, while constraining Ck to have rank at most k. If r is the rank of C, clearly Cr = C and the Frobenius norm of the discrepancy is zero in this case. When k is far smaller than r, we refer to Ck as a low-rank approximation.
The singular value decomposition can be used to solve the low-rank matrix approximation problem. We then derive from it an application to approximating term-document matrices. We invoke the following three-step procedure to this end:
1. Given C, construct its SVD C = UΣV^T.
2. Derive from Σ the matrix Σk formed by replacing by zeros the r − k smallest singular values on the diagonal of Σ.
3. Compute and output Ck = UΣkV^T as the rank-k approximation to C.
◮ Figure 18.2 Illustration of low-rank approximation using the singular value decomposition. The dashed boxes indicate the matrix entries affected by “zeroing out” the smallest singular values.
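The three-step procedure above can be checked numerically: the Frobenius norm of the discrepancy C − Ck equals the square root of the sum of the squared zeroed-out singular values. The random matrix below is just a stand-in for a term-document matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.standard_normal((6, 5))              # placeholder for a term-doc matrix

# Step 1: SVD of C.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Steps 2-3: zero out all but the k largest singular values, recompose.
k = 2
Ck = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err = np.linalg.norm(C - Ck, "fro")          # Frobenius norm of X = C - Ck
expected = np.sqrt((s[k:] ** 2).sum())       # sqrt of the dropped sigma_i^2
```

The equality of `err` and `expected` (up to floating-point noise) reflects the fact that the SVD truncation attains the minimum possible Frobenius discrepancy among rank-k matrices.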
***term partitioning->
|------------
Two obvious alternative index implementations suggest themselves: partitioning by terms, also known as global index organization, and partitioning by documents, also known as local index organization. In the former, the dictionary of index terms is partitioned into subsets, each subset residing at a node.
***biword index->
|------------
2.4.1 Biword indexes
One approach to handling phrases is to consider every pair of consecutive terms in a document as a phrase. For example, the text Friends, Romans, Countrymen would generate the biwords:
friends romans
romans countrymen
In this model, we treat each of these biwords as a vocabulary term. Being able to process two-word phrase queries is immediate. Longer phrases can be processed by breaking them down. The query stanford university palo alto can be broken into the Boolean query on biwords:
“stanford university” AND “university palo” AND “palo alto”
This query could be expected to work fairly well in practice, but there can and will be occasional false positives. Without examining the documents, we cannot verify that the documents matching the above Boolean query do actually contain the original 4-word phrase.
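Building such an index amounts to emitting every consecutive token pair as a dictionary term. A minimal sketch (function name and input format are illustrative):

```python
def biword_index(docs):
    """Map each consecutive term pair (biword) to the sorted list of
    docIDs containing it, treating every biword as a vocabulary term."""
    index = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for a, b in zip(tokens, tokens[1:]):
            index.setdefault((a, b), set()).add(doc_id)
    return {bw: sorted(ids) for bw, ids in index.items()}

idx = biword_index({1: "friends romans countrymen",
                    2: "romans countrymen lend"})
```

A longer phrase query is then answered by intersecting the postings of its component biwords, which is exactly where the false positives described above can arise.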
|------------
Among possible queries, nouns and noun phrases have a special status in describing the concepts people are interested in searching for. But related nouns can often be divided from each other by various function words, in phrases such as the abolition of slavery or renegotiation of the constitution. These needs can be incorporated into the biword indexing model in the following way. First, we tokenize the text and perform part-of-speech tagging. We can then group terms into nouns, including proper nouns, (N) and function words, including articles and prepositions, (X), among other classes. Now deem any string of terms of the form NX*N to be an extended biword. Each such extended biword is made a term in the vocabulary. For example:
renegotiation of the constitution
N             X  X   N
To process a query using such an extended biword index, we need to also parse it into N’s and X’s, and then segment the query into extended biwords, which can be looked up in the index.
|------------
2.4.3 Combination schemes
The strategies of biword indexes and positional indexes can be fruitfully combined. If users commonly query on particular phrases, such as Michael Jackson, it is quite inefficient to keep merging positional postings lists. A combination strategy uses a phrase index, or just a biword index, for certain queries and uses a positional index for other phrase queries. Good queries to include in the phrase index are ones known to be common based on recent querying behavior. But this is not the only criterion: the most expensive phrase queries to evaluate are ones where the individual words are common but the desired phrase is comparatively rare. Adding Britney Spears as a phrase index entry may only speed up that query by a factor of about 3, since most documents that mention either word are valid results, whereas adding The Who as a phrase index entry may speed up that query by a factor of 1,000. Hence, having the latter is more desirable, even if it is a relatively less common query.
***Dice coefficient->
|------------
Exercise 8.6 [⋆⋆]
What is the relationship between the value of F1 and the break-even point?
Exercise 8.7 [⋆⋆]
The Dice coefficient of two sets is a measure of their intersection scaled by their size (giving a value in the range 0 to 1):
Dice(X, Y) = 2|X ∩ Y| / (|X| + |Y|)
Show that the balanced F-measure (F1) is equal to the Dice coefficient of the retrieved and relevant document sets.
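The identity asked for in Exercise 8.7 is easy to check numerically: with X the retrieved set and Y the relevant set, precision is |X ∩ Y|/|X|, recall is |X ∩ Y|/|Y|, and F1 works out to exactly Dice(X, Y). A small sanity check (the example sets are arbitrary):

```python
def dice(x, y):
    """Dice coefficient of two sets: 2|X∩Y| / (|X| + |Y|)."""
    return 2 * len(x & y) / (len(x) + len(y))

def f1(retrieved, relevant):
    """Balanced F-measure from precision and recall."""
    tp = len(retrieved & relevant)
    p = tp / len(retrieved)
    r = tp / len(relevant)
    return 2 * p * r / (p + r)

retrieved = {1, 2, 3, 4}
relevant = {3, 4, 5}
```

Here both quantities equal 4/7, as the exercise predicts for any pair of sets with nonempty intersection.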
***kernel function->
|------------
SVMs, and also a number of other linear classifiers, provide an easy and efficient way of doing this mapping to a higher dimensional space, which is referred to as “the kernel trick”. It’s not really a trick: it just exploits the math that we have seen. The SVM linear classifier relies on a dot product between data point vectors. Let K(~xi, ~xj) = ~xi^T ~xj. Then the classifier we have seen so far is:
f(~x) = sign(∑i αi yi K(~xi, ~x) + b)   (15.13)
Now suppose we decide to map every data point into a higher dimensional space via some transformation Φ: ~x 7→ φ(~x). Then the dot product becomes φ(~xi)^T φ(~xj). If it turned out that this dot product (which is just a real number) could be computed simply and efficiently in terms of the original data points, then we wouldn’t have to actually map from ~x 7→ φ(~x). Rather, we could simply compute the quantity K(~xi, ~xj) = φ(~xi)^T φ(~xj), and then use the function’s value in Equation (15.13). A kernel function K is such a function that corresponds to a dot product in some expanded feature space.
|------------
✎ Example 15.2: The quadratic kernel in two dimensions. For 2-dimensional vectors ~u = (u1, u2), ~v = (v1, v2), consider K(~u, ~v) = (1 + ~u^T~v)². We wish to show that this is a kernel, i.e., that K(~u, ~v) = φ(~u)^T φ(~v) for some φ. Consider φ(~u) = (1, u1², √2 u1u2, u2², √2 u1, √2 u2). Then:
K(~u, ~v) = (1 + ~u^T~v)²   (15.14)
         = 1 + u1²v1² + 2u1v1u2v2 + u2²v2² + 2u1v1 + 2u2v2
         = (1, u1², √2 u1u2, u2², √2 u1, √2 u2)^T (1, v1², √2 v1v2, v2², √2 v1, √2 v2)
         = φ(~u)^T φ(~v)
In the language of functional analysis, what kinds of functions are valid kernel functions? Kernel functions are sometimes more precisely referred to as Mercer kernels, because they must satisfy Mercer’s condition: for any g(~x) such that ∫ g(~x)² d~x is finite, we must have that:
∫ K(~x, ~z) g(~x) g(~z) d~x d~z ≥ 0.   (15.15)
A kernel function K must be continuous, symmetric, and have a positive definite Gram matrix. Such a K means that there exists a mapping to a reproducing kernel Hilbert space (a Hilbert space is a vector space closed under dot products) such that the dot product there gives the same value as the function K. If a kernel does not satisfy Mercer’s condition, then the corresponding QP may have no solution. If you would like to better understand these issues, you should consult the books on SVMs mentioned in Section 15.5. Otherwise, you can content yourself with knowing that 90% of work with kernels uses one of two straightforward families of functions of two vectors, which we define below, and which define valid kernels.
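The derivation in Example 15.2 can be verified numerically: evaluating (1 + u·v)² and the explicit dot product φ(u)·φ(v) on the same pair of vectors gives the same number. The sample vectors are arbitrary.

```python
from math import sqrt

def K(u, v):
    """Quadratic kernel K(u, v) = (1 + u.v)^2 for 2-d vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    return (1 + dot) ** 2

def phi(u):
    """Explicit feature map from Example 15.2."""
    u1, u2 = u
    return (1, u1 ** 2, sqrt(2) * u1 * u2, u2 ** 2, sqrt(2) * u1, sqrt(2) * u2)

u, v = (3.0, 1.0), (2.0, -1.0)
lhs = K(u, v)                                   # kernel value in 2-d
rhs = sum(a * b for a, b in zip(phi(u), phi(v)))  # dot product in 6-d
```

This is the kernel trick in miniature: `lhs` costs one 2-dimensional dot product, while `rhs` requires materializing the 6-dimensional feature vectors, yet both are equal.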
***instance-based learning->
|------------
In kNN classification, we do not perform any estimation of parameters as we do in Rocchio classification (centroids) or in Naive Bayes (priors and conditional probabilities). kNN simply memorizes all examples in the training set and then compares the test document to them. For this reason, kNN is also called memory-based learning or instance-based learning. It is usually desirable to have as much training data as possible in machine learning. But in kNN large training sets come with a severe efficiency penalty in classification.
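The memorize-then-compare behavior is visible in a minimal implementation: "training" is just storing the examples, and all the work happens at classification time, which is exactly where the efficiency penalty of large training sets bites. Names and the toy data are illustrative.

```python
from collections import Counter

def knn_classify(train, x, k=3):
    """Memory-based classification: train is simply a stored list of
    (vector, label) pairs; at query time, vote among the k nearest
    examples by (squared) Euclidean distance."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda ex: dist2(ex[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B")]
label = knn_classify(train, (0.5, 0.5), k=3)
```

Note the cost structure: each classification scans all of `train`, so doubling the training set doubles query time, unlike parametric classifiers whose per-query cost is independent of training-set size.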
***sparseness->
|------------
IR is not the place where you most immediately need complex language models, since IR does not directly depend on the structure of sentences to the extent that other tasks like speech recognition do. Unigram models are often sufficient to judge the topic of a text. Moreover, as we shall see, IR language models are frequently estimated from a single document and so it is questionable whether there is enough training data to do more. Losses from data sparseness (see the discussion on page 260) tend to outweigh any gains from richer models. This is an example of the bias-variance tradeoff (cf. Section 14.6, page 308): With limited training data, a more constrained model tends to perform better. In addition, unigram models are more efficient to estimate and apply than higher-order models. Nevertheless, the importance of phrase and proximity queries in IR in general suggests that future work should make use of more sophisticated language models, and some has begun to (see Section 12.5, page 252). Indeed, making this move parallels the model of van Rijsbergen in Chapter 11 (page 231).
|------------
assign a high probability to the UK class because the term Britain occurs. The problem is that the zero probability for WTO cannot be “conditioned away,” no matter how strong the evidence for the class UK from other features. The estimate is 0 because of sparseness: The training data are never large enough to represent the frequency of rare events adequately, for example, the frequency of WTO occurring in UK documents.
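The standard remedy for such zero estimates is smoothing; a minimal sketch of add-one (Laplace) smoothing, the simplest variant, is below. The counts in the example are made up.

```python
def smoothed_prob(count_t_c, total_tokens_c, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(t | c). The +1 in the
    numerator guarantees the estimate is never zero, so a single unseen
    term cannot veto an otherwise well-supported class."""
    return (count_t_c + 1) / (total_tokens_c + vocab_size)

# Hypothetical numbers: 'wto' never seen among 1,000 training tokens of
# the UK class, with a vocabulary of 50,000 terms.
p = smoothed_prob(0, 1000, 50000)
```

With smoothing, WTO contributes a small but nonzero factor, so evidence from terms like Britain can still dominate the class decision.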
***XML DOM->
|------------
The standard for accessing and processing XML documents is the XML Document Object Model or DOM. The DOM represents elements, attributes and text within elements as nodes in a tree. Figure 10.2 is a simplified DOM representation of the XML document in Figure 10.1. With a DOM API, we can process an XML document by starting at the root element and then descending the tree from parents to children.
2. The representation is simplified in a number of respects.
◮ Figure 10.1 An XML document.
***soft assignment->
|------------
In a soft assignment, a document has fractional membership in several clusters. Latent semantic indexing, a form of dimensionality reduction, is a soft clustering algorithm (Chapter 18, page 417).
***parametric index->
|------------
Consider queries of the form “find documents authored by William Shakespeare in 1601, containing the phrase alas poor Yorick”. Query processing then consists as usual of postings intersections, except that we may merge postings from standard inverted as well as parametric indexes. There is one parametric index for each field (say, date of creation); it allows us to select only the documents matching a date specified in the query. Figure 6.1 illustrates the user’s view of such a parametric search. Some of the fields may assume ordered values, such as dates; in the example query above, the year 1601 is one such field value. The search engine may support querying ranges on such ordered values; to this end, a structure like a B-tree may be used for the field’s dictionary.
|------------
Zones are similar to fields, except the contents of a zone can be arbitrary free text. Whereas a field may take on a relatively small set of values, a zone can be thought of as an arbitrary, unbounded amount of text. For instance, document titles and abstracts are generally treated as zones. We may build a separate inverted index for each zone of a document, to support queries such as “find documents with merchant in the title and william in the author list and the phrase gentle rain in the body”. This has the effect of building an index that looks like Figure 6.2. Whereas the dictionary for a parametric index comes from a fixed vocabulary (the set of languages, or the set of dates), the dictionary for a zone index must structure whatever vocabulary stems from the text of that zone.
***keyword-in-context->
|------------
Dynamic summaries display one or more “windows” on the document, aiming to present the pieces that have the most utility to the user in evaluating the document with respect to their information need. Usually these windows contain one or several of the query terms, and so are often referred to as keyword-in-context (KWIC) snippets, though sometimes they may still be pieces of the text such as the title that are selected for their query-independent information value just as in the case of static summarization.
***hierarchical agglomerative clustering,->
***topic->
|------------
Text Retrieval Conference (TREC). The U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation series since 1992. Within this framework, there have been many tracks over a range of different test collections, but the best known test collections are the ones used for the TREC Ad Hoc track during the first 8 TREC evaluations between 1992 and 1999. In total, these test collections comprise 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles) and relevance judgments for 450 information needs, which are called topics and specified in detailed text passages. Individual test collections are defined over different subsets of this data. The early TRECs each consisted of 50 information needs, evaluated over different but overlapping sets of documents. TRECs 6–8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles. This is probably the best subcollection to use in future work, because it is the largest and the topics are more consistent. Because the test document collections are so large, there are no exhaustive relevance judgments. Rather, NIST assessors’ relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation for which the information need was developed.
|------------
Such more general classes are usually referred to as topics, and the classification task is then called text classification, text categorization, topic classification, or topic spotting. An example for China appears in Figure 13.1. Standing queries and topics differ in their degree of specificity, but the methods for solving routing, filtering, and text classification are essentially the same. We therefore include routing and filtering under the rubric of text classification in this and the following chapters.
***accumulator->
|------------
Some literature refers to the array scores[] above as a set of accumulators. The reason for this will be clear as we consider more complex Boolean functions than the AND; thus we may assign a non-zero score to a document even if it does not contain all query terms.
|------------
The outermost loop beginning Step 3 repeats the updating of Scores, iterating over each query term t in turn. In Step 5 we calculate the weight in the query vector for term t. Steps 6–8 update the score of each document by adding in the contribution from term t. This process of adding in contributions one query term at a time is sometimes known as term-at-a-time scoring or accumulation, and the N elements of the array Scores are therefore known as accumulators. For this purpose, it would appear necessary to store, with each postings entry, the weight wft,d of term t in document d (we have thus far used either tf or tf-idf for this weight, but leave open the possibility of other functions to be developed in Section 6.4). In fact this is wasteful, since storing this weight may require a floating point number. Two ideas help alleviate this space problem. First, if we are using inverse document frequency, we need not precompute idft; it suffices to store N/dft at the head of the postings for t. Second, we store the term frequency tft,d for each postings entry. Finally, Step 12 extracts the top K scores – this requires a priority queue data structure, often implemented using a heap. Such a heap takes no more than 2N comparisons to construct, following which each of the K top scores can be extracted from the heap at a cost of O(log N) comparisons.
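The term-at-a-time pattern can be sketched compactly: one accumulator per document, one pass over each query term's postings, then a heap-based top-K extraction. The weights and postings below are made-up toy values; in a real system the per-posting weight would be derived from tf and N/df as just described.

```python
import heapq

def term_at_a_time(query_weights, postings, N, K):
    """Term-at-a-time scoring sketch. query_weights maps term -> w(t,q);
    postings maps term -> list of (docID, w(t,d)) pairs; N is the number
    of documents. Returns the top K (score, docID) pairs."""
    scores = [0.0] * N                       # the accumulators
    for t, wq in query_weights.items():      # one query term at a time
        for doc_id, wd in postings.get(t, []):
            scores[doc_id] += wq * wd        # add this term's contribution
    # heap-based selection of the K largest accumulated scores
    return heapq.nlargest(K, ((s, d) for d, s in enumerate(scores) if s > 0))

postings = {"gold": [(0, 1.0), (2, 2.0)], "mine": [(2, 1.0), (3, 3.0)]}
top = term_at_a_time({"gold": 1.0, "mine": 0.5}, postings, N=4, K=2)
```

Note that document 1 never appears in any traversed postings list and so keeps a zero accumulator, while document 2 receives contributions from both query terms across two separate outer-loop passes.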
***enterprise search->
***Bayesian networks->
|------------
11.4.4 Bayesian network approaches to IR
Turtle and Croft (1989; 1991) introduced into information retrieval the use of Bayesian networks (Jensen and Jensen 2001), a form of probabilistic graphical model. We skip the details because fully introducing the formalism of Bayesian networks would require much too much space, but conceptually, Bayesian networks use directed graphs to show probabilistic dependencies between variables, as in Figure 11.1, and have led to the development of sophisticated algorithms for propagating influence so as to allow learning and inference with arbitrary knowledge within arbitrary directed acyclic graphs.
|------------
The result is a flexible probabilistic network which can generalize various simpler Boolean and probabilistic models. Indeed, this is the primary case of a statistical ranked retrieval model that naturally supports structured query operators. The system allowed efficient large-scale retrieval, and was the basis of the InQuery text retrieval system, built at the University of Massachusetts. This system performed very well in TREC evaluations and for a time was sold commercially. On the other hand, the model still used various approximations and independence assumptions to make parameter estimation and computation possible. There has not been much follow-on work along these lines, but we would note that this model was actually built very early on in the modern era of using Bayesian networks, and there have been many subsequent developments in the theory, and the time is perhaps right for a new generation of Bayesian network-based information retrieval systems.
***capture-recapture method->
|------------
Thus, search engine indexes include multiple classes of indexed pages, so that there is no single measure of index size. These issues notwithstanding, a number of techniques have been devised for crude estimates of the ratio of the index sizes of two search engines, E1 and E2. The basic hypothesis underlying these techniques is that each search engine indexes a fraction of the Web chosen independently and uniformly at random. This involves some questionable assumptions: first, that there is a finite size for the Web from which each search engine chooses a subset, and second, that each engine chooses an independent, uniformly chosen subset. As will be clear from the discussion of crawling in Chapter 20, this is far from true. However, if we begin with these assumptions, then we can invoke a classical estimation technique known as the capture-recapture method. Suppose that we could pick a random page from the index of E1 and test whether it is in E2’s index and symmetrically, test whether a random page from E2 is in E1. These experiments give us fractions x and y such that our estimate is that a fraction x of the pages in E1 are in E2, while a fraction y of the pages in E2 are in E1. Then, letting |Ei| denote the size of the index of search engine Ei, we have x|E1| ≈ y|E2|, from which we have the form we will use
|E1| / |E2| ≈ y / x.   (19.1)
If our assumption about E1 and E2 being independent and uniform random subsets of the Web were true, and our sampling process unbiased, then Equation (19.1) should give us an unbiased estimator for |E1|/|E2|. We distinguish between two scenarios here. Either the measurement is performed by someone with access to the index of one of the search engines (say an employee of E1), or the measurement is performed by an independent party with no access to the innards of either search engine. In the former case, we can simply pick a random document from one index.
The latter case is more challenging: from outside the search engine, we must somehow pick a random page from one engine’s index and then verify whether that page is present in the other engine’s index.
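The arithmetic of Equation (19.1) is a one-liner; the sampled fractions below are hypothetical numbers for illustration.

```python
def index_size_ratio(x, y):
    """Capture-recapture estimate of |E1|/|E2|: x is the sampled fraction
    of E1's pages found in E2, y the fraction of E2's pages found in E1.
    Both x*|E1| and y*|E2| estimate the overlap, so |E1|/|E2| ~= y/x."""
    return y / x

# Hypothetical sample results: 40% of pages sampled from E1 are in E2,
# and 60% of pages sampled from E2 are in E1.
ratio = index_size_ratio(0.4, 0.6)
```

The estimate of 1.5 says E1's index is half again as large as E2's: the same overlap region constitutes a smaller fraction of the larger index.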
***expected edge density->
|------------
The discussion of external evaluation measures is partially based on Strehl (2002). Dom (2002) proposes a measure Q0 that is better motivated theoretically than NMI. Q0 is the number of bits needed to transmit class memberships assuming cluster memberships are known. The Rand index is due to Rand (1971). Hubert and Arabie (1985) propose an adjusted Rand index that ranges between −1 and 1 and is 0 if there is only chance agreement between clusters and classes (similar to κ in Chapter 8, page 165). Basu et al. (2004) argue that the three evaluation measures NMI, Rand index and F measure give very similar results. Stein et al. (2003) propose expected edge density as an internal measure and give evidence that it is a good predictor of the quality of a clustering. Kleinberg (2002) and Meilă (2005) present axiomatic frameworks for comparing clusterings.
***TREC->
|------------
Text Retrieval Conference (TREC). The U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation series since 1992. Within this framework, there have been many tracks over a range of different test collections, but the best known test collections are the ones used for the TREC Ad Hoc track during the first 8 TREC evaluations between 1992 and 1999. In total, these test collections comprise 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles) and relevance judgments for 450 information needs, which are called topics and specified in detailed text passages. Individual test collections are defined over different subsets of this data. The early TRECs each consisted of 50 information needs, evaluated over different but overlapping sets of documents. TRECs 6–8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles. This is probably the best subcollection to use in future work, because it is the largest and the topics are more consistent. Because the test document collections are so large, there are no exhaustive relevance judgments. Rather, NIST assessors’ relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation for which the information need was developed.
|------------
14.7 References and further reading
As discussed in Chapter 9, Rocchio relevance feedback is due to Rocchio (1971). Joachims (1997) presents a probabilistic analysis of the method. Rocchio classification was widely used as a classification method in TREC in the 1990s (Buckley et al. 1994a;b, Voorhees and Harman 2005). Initially, it was used as a form of routing. Routing merely ranks documents according to relevance to a class without assigning them. Early work on filtering, a true classification approach that makes an assignment decision on each document, was published by Ittner et al. (1995) and Schapire et al. (1998). The definition of routing we use here should not be confused with another sense. Routing can also refer to the electronic distribution of documents to subscribers, the so-called push model of document distribution. In a pull model, each transfer of a document to the user is initiated by the user – for example, by means of search or by selecting it from a list of documents on a news aggregation website.
***distortion->
|------------
A second type of criterion for cluster cardinality imposes a penalty for each new cluster – where conceptually we start with a single cluster containing all documents and then search for the optimal number of clusters K by successively incrementing K by one. To determine the cluster cardinality in this way, we create a generalized objective function that combines two elements: distortion, a measure of how much documents deviate from the prototype of their clusters (e.g., RSS for K-means); and a measure of model complexity. We interpret a clustering here as a model of the data. Model complexity in clustering is usually the number of clusters or a function thereof. For K-means, we then get this selection criterion for K:
K = argmin_K [RSSmin(K) + λK]   (16.11)
where λ is a weighting factor. A large value of λ favors solutions with few clusters. For λ = 0, there is no penalty for more clusters and K = N is the best solution.
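Equation (16.11) is easy to apply once RSSmin has been computed for each candidate K. The toy RSS curve below is invented for illustration: a large drop from one to two clusters, then diminishing returns.

```python
def choose_k(rss_min, lam):
    """Pick the cluster cardinality K minimizing RSS_min(K) + lambda*K,
    as in Equation (16.11). rss_min maps candidate K -> minimal RSS."""
    return min(rss_min, key=lambda k: rss_min[k] + lam * k)

# Hypothetical RSS curve over candidate K values.
rss = {1: 100.0, 2: 30.0, 3: 25.0, 4: 24.0}
best_k = choose_k(rss, lam=10.0)
```

With λ = 10 the penalized objective is minimized at K = 2 (30 + 20 = 50), while with λ = 0 the criterion degenerates to picking the largest K, matching the text's observation that no penalty drives K toward N.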
***medoid->
|------------
The same efficiency problem is addressed by K-medoids, a variant of K-means that computes medoids instead of centroids as cluster centers. We define the medoid of a cluster as the document vector that is closest to the centroid. Since medoids are sparse document vectors, distance computations are fast.
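A small sketch of the medoid definition on dense 2-D toy vectors (real document vectors are sparse and high-dimensional, which is exactly why medoids pay off):

```python
# Toy "documents" as 2-D points; the medoid is the document closest
# to the centroid of the cluster.
docs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
centroid = tuple(sum(coord) / len(docs) for coord in zip(*docs))
medoid = min(docs, key=lambda d: sum((x - c) ** 2 for x, c in zip(d, centroid)))
```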
***skip list->
|------------
One way to do this is to use a skip list by augmenting postings lists with skip pointers (at indexing time), as shown in Figure 2.9. Skip pointers are effectively shortcuts that allow us to avoid processing parts of the postings list that will not figure in the search results. The two questions are then where to place skip pointers and how to do efficient merging using skip pointers.
|------------
Consider first efficient merging, with Figure 2.9 as an example. Suppose we’ve stepped through the lists in the figure until we have matched 8 on each list and moved it to the results list. We advance both pointers, giving us 16 on the upper list and 41 on the lower list. The smallest item is then the element 16 on the top list. Rather than simply advancing the upper pointer, we first check the skip list pointer and note that 28 is also less than 41. Hence we can follow the skip list pointer, and then we advance the upper pointer to 28. We thus avoid stepping to 19 and 23 on the upper list. A number of variant versions of postings list intersection with skip pointers are possible depending on when exactly you check the skip pointer. One version is shown in Figure 2.10.

INTERSECTWITHSKIPS(p1, p2)
 1  answer ← 〈 〉
 2  while p1 ≠ NIL and p2 ≠ NIL
 3  do if docID(p1) = docID(p2)
 4       then ADD(answer, docID(p1))
 5            p1 ← next(p1)
 6            p2 ← next(p2)
 7       else if docID(p1) < docID(p2)
 8            then if hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
 9                   then while hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
10                        do p1 ← skip(p1)
11                   else p1 ← next(p1)
12            else if hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
13                   then while hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
14                        do p2 ← skip(p2)
15                   else p2 ← next(p2)
16  return answer

◮ Figure 2.10 Postings lists intersection with skip pointers.
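A runnable Python rendering of the pseudocode, representing each postings list as a sorted array and placing a skip pointer at every multiple of √n positions (the evenly spaced placement is a common heuristic; the example lists are made up):

```python
import math

def intersect_with_skips(l1, l2):
    """Intersect two sorted docID lists. A position i has a skip pointer
    iff i is a multiple of the skip length, and it points skip_len
    positions ahead."""
    def advance(lst, i, skip_len, target):
        # Follow skip pointers while they do not overshoot the target;
        # otherwise take a single step (the next pointer).
        moved = False
        while (i % skip_len == 0 and i + skip_len < len(lst)
               and lst[i + skip_len] <= target):
            i += skip_len
            moved = True
        return i if moved else i + 1

    s1 = max(1, int(math.sqrt(len(l1))))
    s2 = max(1, int(math.sqrt(len(l2))))
    i = j = 0
    answer = []
    while i < len(l1) and j < len(l2):
        if l1[i] == l2[j]:
            answer.append(l1[i])
            i += 1
            j += 1
        elif l1[i] < l2[j]:
            i = advance(l1, i, s1, l2[j])
        else:
            j = advance(l2, j, s2, l1[i])
    return answer
```

On lists such as [2, 4, 8, 16, 19, 23, 28, 43] and [1, 2, 3, 5, 8, 41, 51, 60, 71], the skip from 16 past 19 and 23 happens exactly as described in the prose.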
|------------
The classic presentation of skip pointers for IR can be found in Moffat and Zobel (1996). Extended techniques are discussed in Boldi and Vigna (2005).
***summarization->
***break-even->
|------------
15.2.4 Experimental results

We presented results in Section 13.6 showing that an SVM is a very effective text classifier. The results of Dumais et al. (1998) given in Table 13.9 show SVMs clearly performing the best. This was one of several pieces of work from this time that established the strong reputation of SVMs for text classification. Another pioneering work on scaling and evaluating SVMs for text classification was (Joachims 1998). We present some of his results in Table 15.2.

                                         linear SVM       rbf-SVM
             NB    Rocchio  Dec.   kNN   C = 0.5  C = 1.0  σ ≈ 7
                            Trees
  earn      96.0    96.1    96.1   97.8    98.0     98.2    98.1
  acq       90.7    92.1    85.3   91.8    95.5     95.6    94.7
  money-fx  59.6    67.6    69.4   75.4    78.8     78.5    74.3
  grain     69.8    79.5    89.1   82.6    91.9     93.1    93.4
  crude     81.2    81.5    75.5   85.8    89.4     89.4    88.7
  trade     52.2    77.4    59.2   77.9    79.2     79.2    76.6
  interest  57.6    72.5    49.1   76.7    75.6     74.8    69.1
  ship      80.9    83.1    80.9   79.8    87.4     86.5    85.8
  wheat     63.4    79.4    85.5   72.9    86.6     86.8    82.4
  corn      45.2    62.2    87.7   71.4    87.5     87.8    84.6
  microavg. 72.3    79.9    79.4   82.6    86.7     87.5    86.4

◮ Table 15.2 SVM classifier break-even F1 from (Joachims 2002a, p. 114). Results are shown for the 10 largest categories and for microaveraged performance over all 90 categories on the Reuters-21578 data set.
|------------
15.3 Issues in the classification of text documents

There are lots of applications of text classification in the commercial world; email spam filtering is perhaps now the most ubiquitous. Jackson and Moulinier (2002) write: “There is no question concerning the commercial value of being able to classify documents automatically by content. There are myriad

5. These results are in terms of the break-even F1 (see Section 8.4). Many researchers disprefer this measure for text classification evaluation, since its calculation may involve interpolation rather than an actual parameter setting of the system, and it is not clear why this value should be reported rather than maximal F1 or another point on the precision/recall curve motivated by the task at hand. While earlier results in (Joachims 1998) suggested notable gains on this task from the use of higher order polynomial or rbf kernels, this was with hard-margin SVMs. With soft-margin SVMs, a simple linear SVM with the default C = 1 performs best.
***anchor text->
|------------
Figure 19.2 shows two nodes A and B from the web graph, each corresponding to a web page, with a hyperlink from A to B. We refer to the set of all such nodes and directed edges as the web graph. Figure 19.2 also shows that (as is the case with most links on web pages) there is some text surrounding the origin of the hyperlink on page A. This text is generally encapsulated in the href attribute of the <a> (for anchor) tag that encodes the hyperlink in the HTML code of page A, and is referred to as anchor text. As one might suspect, this directed graph is not strongly connected: there are pairs of pages such that one cannot proceed from one page of the pair to the other by following hyperlinks. We refer to the hyperlinks into a page as in-links and those out of a page as out-links. The number of in-links to a page (also known as its in-degree) has averaged from roughly 8 to 15, in a range of studies. We similarly define the out-degree of a web page to be the number of links out of it.

◮ Figure 19.3 A sample small web graph. In this example we have six pages labeled A-F. Page B has in-degree 3 and out-degree 1. This example graph is not strongly connected: there is no path from any of pages B-F to page A.
***collection frequency->
|------------
2.2.2 Dropping common terms: stop words

Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing. An example of a stop list is shown in Figure 2.5. Using a stop list significantly reduces the number of postings that a system has to store; we will present some statistics on this in Chapter 5 (see Table 5.1, page 87). And a lot of the time not indexing stop words does little harm: keyword searches with terms like the and by don’t seem very useful. However, this is not true for phrase searches. The phrase query “President of the United States”, which contains two stop words, is more precise than President AND “United States”. The meaning of flights to London is likely to be lost if the word to is stopped out. A search for Vannevar Bush’s article As we may think will be difficult if the first three words are stopped out, and the system searches simply for documents containing the word think. Some special query types are disproportionately affected. Some song titles and well known pieces of verse consist entirely of words that are commonly on stop lists (To be or not to be, Let It Be, I don’t want to be, . . . ).
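The collection-frequency strategy can be sketched in a few lines; the toy collection and the frequency cutoff are illustrative (real stop lists are built over the full collection and then hand-filtered):

```python
from collections import Counter

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the president of the united states",
    "flights to london from new york",
]
cf = Counter()                 # collection frequency of each term
for doc in docs:
    cf.update(doc.split())

# Take terms whose collection frequency exceeds a cutoff as the stop list.
stop_list = {term for term, count in cf.items() if count >= 3}
```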
***random variable->
|------------
We will give a very quick review; some references for further reading appear at the end of the chapter. A variable A represents an event (a subset of the space of possible outcomes). Equivalently, we can represent the subset via a random variable, which is a function from outcomes to real numbers; the subset is the domain over which the random variable A has a particular value.
|------------
Writing Ā for the complement of an event A, we similarly have:

    P(Ā, B) = P(B|Ā)P(Ā)    (11.2)

Probability theory also has a partition rule, which says that if an event B can be divided into an exhaustive set of disjoint subcases, then the probability of B is the sum of the probabilities of the subcases. A special case of this rule gives that:

    P(B) = P(A, B) + P(Ā, B)    (11.3)

From these we can derive Bayes’ Rule for inverting conditional probabilities:

    P(A|B) = P(B|A)P(A) / P(B) = [ P(B|A) / Σ_{X∈{A,Ā}} P(B|X)P(X) ] P(A)    (11.4)

This equation can also be thought of as a way of updating probabilities. We start off with an initial estimate of how likely the event A is when we do not have any other information; this is the prior probability P(A). Bayes’ rule lets us derive a posterior probability P(A|B) after having seen the evidence B, based on the likelihood of B occurring in the two cases that A does or does not hold.1 Finally, it is often useful to talk about the odds of an event, which provide a kind of multiplier for how probabilities change:

    Odds: O(A) = P(A)/P(Ā) = P(A)/(1 − P(A))    (11.5)

11.2 The Probability Ranking Principle

11.2.1 The 1/0 loss case

We assume a ranked retrieval setup as in Section 6.3, where there is a collection of documents, the user issues a query, and an ordered list of documents is returned. We also assume a binary notion of relevance as in Chapter 8. For a query q and a document d in the collection, let Rd,q be an indicator random variable that says whether d is relevant with respect to a given query q. That is, it takes on a value of 1 when the document is relevant and 0 otherwise. In context we will often write just R for Rd,q.
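The partition rule, Bayes’ rule, and the odds in (11.3)-(11.5) can be checked numerically; a small sketch with hypothetical probabilities:

```python
# Hypothetical numbers: prior P(A) = 0.3, likelihoods P(B|A) = 0.8
# and P(B|Ā) = 0.2.
p_a = 0.3
p_b_given_a = 0.8
p_b_given_not_a = 0.2

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)  # partition rule (11.3)
posterior = p_b_given_a * p_a / p_b                    # Bayes' rule (11.4)
odds = p_a / (1 - p_a)                                 # odds (11.5)
```

Seeing the evidence B raises the probability of A from the prior 0.3 to a posterior of about 0.63.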
***union-find algorithm->
|------------
Similarly, we will describe an approach to duplicate detection on the web in Section 19.6 (page 440) where single-link clustering is used in the guise of the union-find algorithm. Again, the decision whether a group of documents are duplicates of each other is not influenced by documents that are located far away and single-link clustering is a good choice for duplicate detection.
|------------
How can we quickly compute |ψi ∩ ψj|/200 for all pairs i, j? Indeed, how do we represent all pairs of documents that are similar, without incurring a blowup that is quadratic in the number of documents? First, we use fingerprints to remove all but one copy of identical documents. We may also remove common HTML tags and integers from the shingle computation, to eliminate shingles that occur very commonly in documents without telling us anything about duplication. Next we use a union-find algorithm to create clusters that contain documents that are similar. To do this, we must accomplish a crucial step: going from the set of sketches to the set of pairs i, j such that di and dj are similar.
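Once the similar pairs are in hand, the union-find step is small; a minimal sketch where `similar_pairs` stands in for the (hypothetical) pairs whose sketches overlap in enough shingles:

```python
from collections import defaultdict

# Minimal union-find ("disjoint set") over docIDs.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

similar_pairs = [(1, 2), (2, 3), (5, 6)]   # hypothetical near-duplicate pairs
for i, j in similar_pairs:
    union(i, j)

clusters = defaultdict(set)                # near-duplicate classes
for doc in [1, 2, 3, 4, 5, 6]:
    clusters[find(doc)].add(doc)
```

Note that 1 and 3 end up in the same cluster although they were never compared directly: union-find computes exactly the transitive closure that single-link clustering would.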
***complete-link clustering->
|------------
17.2 Single-link and complete-link clustering

In single-link clustering or single-linkage clustering, the similarity of two clusters is the similarity of their most similar members (see Figure 17.3, (a)).3 This single-link merge criterion is local. We pay attention solely to the area where the two clusters come closest to each other. Other, more distant parts of the cluster and the clusters’ overall structure are not taken into account.
|------------
In complete-link clustering or complete-linkage clustering, the similarity of two clusters is the similarity of their most dissimilar members (see Figure 17.3, (b)).
|------------
Figure 17.4 depicts a single-link and a complete-link clustering of eight documents. The first four steps, each producing a cluster consisting of a pair of two documents, are identical. Then single-link clustering joins the upper two pairs (and after that the lower two pairs) because on the maximum-similarity definition of cluster similarity, those two clusters are closest.

3. Throughout this chapter, we equate similarity with proximity in 2D depictions of clustering.
***cosine similarity->
|------------
6.3.1 Dot products

We denote by ~V(d) the vector derived from document d, with one component in the vector for each dictionary term. Unless otherwise specified, the reader may assume that the components are computed using the tf-idf weighting scheme, although the particular weighting scheme is immaterial to the discussion that follows. The set of documents in a collection then may be viewed as a set of vectors in a vector space, in which there is one axis for each term.

◮ Figure 6.10 Cosine similarity illustrated. sim(d1, d2) = cos θ.
|------------
To compensate for the effect of document length, the standard way of quantifying the similarity between two documents d1 and d2 is to compute the cosine similarity of their vector representations ~V(d1) and ~V(d2):

    sim(d1, d2) = ~V(d1) · ~V(d2) / (|~V(d1)| |~V(d2)|)    (6.10)

where the numerator represents the dot product (also known as the inner product) of the vectors ~V(d1) and ~V(d2), while the denominator is the product of their Euclidean lengths. The dot product ~x · ~y of two vectors is defined as Σ_{i=1}^{M} x_i y_i. Let ~V(d) denote the document vector for d, with M components ~V_1(d) . . . ~V_M(d). The Euclidean length of d is defined to be √(Σ_{i=1}^{M} ~V_i^2(d)).
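Equation (6.10) translates directly into code; a minimal dense-vector sketch (production systems would of course operate on sparse representations):

```python
import math

def cosine_similarity(v1, v2):
    """Equation (6.10): dot product over the product of Euclidean lengths."""
    dot = sum(x * y for x, y in zip(v1, v2))
    len1 = math.sqrt(sum(x * x for x in v1))
    len2 = math.sqrt(sum(y * y for y in v2))
    return dot / (len1 * len2)
```

Because of the length normalization, a document and a longer copy of itself (e.g. [1, 2] vs. [2, 4]) get similarity 1.0, which is the point of using the cosine rather than the raw dot product.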
|------------
both contain sweet. As a result, it takes 25 iterations for the term to be unambiguously associated with cluster 2. (qsweet,1 = 0 in iteration 25.) Finding good seeds is even more critical for EM than for K-means. EM is prone to get stuck in local optima if the seeds are not chosen well. This is a general problem that also occurs in other applications of EM.4 Therefore, as with K-means, the initial assignment of documents to clusters is often computed by a different algorithm. For example, a hard K-means clustering may provide the initial assignment, which EM can then “soften up.”

? Exercise 16.6  We saw above that the time complexity of K-means is Θ(IKNM). What is the time complexity of EM?

16.6 References and further reading

Berkhin (2006b) gives a general up-to-date survey of clustering methods with special attention to scalability. The classic reference for clustering in pattern recognition, covering both K-means and EM, is (Duda et al. 2000). Rasmussen (1992) introduces clustering from an information retrieval perspective. Anderberg (1973) provides a general introduction to clustering for applications. In addition to Euclidean distance and cosine similarity, Kullback-Leibler divergence is often used in clustering as a measure of how (dis)similar documents and clusters are (Xu and Croft 1999, Muresan and Harper 2004, Kurland and Lee 2004).
***maximum likelihood estimation->
***key-value pairs->
|------------
In general, MapReduce breaks a large computing problem into smaller parts by recasting it in terms of manipulation of key-value pairs. For indexing, a key-value pair has the form (termID, docID). In distributed indexing, the mapping from terms to termIDs is also distributed and therefore more complex than in single-machine indexing. A simple solution is to maintain a (perhaps precomputed) mapping for frequent terms that is copied to all nodes and to use terms directly (instead of termIDs) for infrequent terms.
|------------
The map phase of MapReduce consists of mapping splits of the input data to key-value pairs. This is the same parsing task we also encountered in BSBI and SPIMI, and we therefore call the machines that execute the map phase parsers. Each parser writes its output to local intermediate files, the segment files (shown as a-f, g-p, and q-z in Figure 4.5).
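A toy rendering of the map step, with parsing reduced to whitespace splitting and terms used directly as keys (function name and inputs are illustrative):

```python
# Each call to map_split mimics one parser emitting (term, docID)
# key-value pairs for one split of the input.
def map_split(doc_id, text):
    return [(term, doc_id) for term in text.lower().split()]

pairs = map_split(1, "Caesar was killed") + map_split(2, "Brutus killed Caesar")
```

In a real MapReduce run these pairs would be partitioned by key range into segment files rather than concatenated into one list.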
***security->
|------------
Security is an important consideration for retrieval systems in corporations. A low-level employee should not be able to find the salary roster of the corporation, but authorized managers need to be able to search for it. Users’ results lists must not contain documents they are barred from opening; the very existence of a document can be sensitive information.
***inversion->
|------------
Inversion involves two steps. First, we sort the termID–docID pairs. Next, we collect all termID–docID pairs with the same termID into a postings list, where a posting is simply a docID. The result, an inverted index for the block we have just read, is then written to disk. Applying this to Reuters-RCV1 and assuming we can fit 10 million termID–docID pairs into memory, we end up with ten blocks, each an inverted index of one part of the collection.
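The two inversion steps can be sketched on a handful of hypothetical termID–docID pairs:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (termID, docID) pairs accumulated for one block.
pairs = [(3, 2), (1, 1), (3, 1), (1, 2), (2, 5)]
pairs.sort()                                    # step 1: sort by termID, then docID
inverted = {
    term: sorted({doc for _, doc in group})     # step 2: collect postings lists
    for term, group in groupby(pairs, key=itemgetter(0))
}
```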
|------------
A fundamental assumption in HAC is that the merge operation is monotonic. Monotonic means that if s1, s2, . . . , sK−1 are the combination similarities of the successive merges of an HAC, then s1 ≥ s2 ≥ . . . ≥ sK−1 holds. A non-monotonic hierarchical clustering contains at least one inversion si < si+1 and contradicts the fundamental assumption that we chose the best merge available at each step. We will see an example of an inversion in Figure 17.12.
|------------
In contrast to the other three HAC algorithms, centroid clustering is not monotonic. So-called inversions can occur: similarity can increase during clustering.

◮ Figure 17.12 Centroid clustering is not monotonic. The documents d1 at (1 + ǫ, 1), d2 at (5, 1), and d3 at (3, 1 + 2√3) are almost equidistant, with d1 and d2 closer to each other than to d3. The non-monotonic inversion in the hierarchical clustering of the three points appears as an intersecting merge line in the dendrogram. The intersection is circled.
***Extensible Markup Language->
***hyphens->
|------------
In English, hyphenation is used for various purposes ranging from splitting up vowels in words (co-education) to joining nouns as names (Hewlett-Packard) to a copyediting device to show word grouping (the hold-him-back-and-drag-him-away maneuver). It is easy to feel that the first example should be regarded as one token (and is indeed more commonly written as just coeducation), the last should be separated into words, and that the middle case is unclear. Handling hyphens automatically can thus be complex: it can either be done as a classification problem, or more commonly by some heuristic rules, such as allowing short hyphenated prefixes on words, but not longer hyphenated forms.
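One such heuristic rule can be sketched in a few lines; the 3-character cutoff is an assumption for illustration, not a rule from the text:

```python
# Keep short hyphenated prefixes attached; split longer hyphenated forms.
def tokenize_hyphenated(word):
    parts = word.split("-")
    if len(parts) == 2 and len(parts[0]) <= 3:
        return [word]        # short prefix: "co-education" stays one token
    return parts             # otherwise split: "Hewlett-Packard" -> two tokens
```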
***multiclass SVM->
|------------
The construction of multiclass SVMs is discussed in (Weston and Watkins 1999), (Crammer and Singer 2001), and (Tsochantaridis et al. 2005). The last reference provides an introduction to the general framework of structural SVMs.
***index construction->
|------------
4 Index construction

In this chapter, we look at how to construct an inverted index. We call this process index construction or indexing; the process or machine that performs it the indexer. The design of indexing algorithms is governed by hardware constraints. We therefore begin this chapter with a review of the basics of computer hardware that are relevant for indexing. We then introduce blocked sort-based indexing (Section 4.2), an efficient single-machine algorithm designed for static collections that can be viewed as a more scalable version of the basic sort-based indexing algorithm we introduced in Chapter 1. Section 4.3 describes single-pass in-memory indexing, an algorithm that has even better scaling properties because it does not hold the vocabulary in memory. For very large collections like the web, indexing has to be distributed over computer clusters with hundreds or thousands of machines.
|------------
Index construction interacts with several topics covered in other chapters.
***personalized PageRank->
|------------
But what if a user is known to have a mixture of interests from multiple topics? For instance, a user may have an interest mixture (or profile) that is 60% sports and 40% politics; can we compute a personalized PageRank for this user? At first glance, this appears daunting: how could we possibly compute a different PageRank distribution for each user profile (with, potentially, infinitely many possible profiles)? We can in fact address this provided we assume that an individual’s interests can be well-approximated as a linear combination of a small number of topic-specific distributions.

◮ Figure 21.5 Topic-specific PageRank. In this example we consider a user whose interests are 60% sports and 40% politics. If the teleportation probability is 10%, this user is modeled as teleporting 6% to sports pages and 4% to politics pages.
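The linearity assumption is what makes this tractable: the personalized score vector is just a weighted combination of precomputed topic-specific PageRank vectors. A sketch with hypothetical values over three pages:

```python
# Hypothetical topic-specific PageRank vectors (each sums to 1).
sports = {"p1": 0.5, "p2": 0.3, "p3": 0.2}
politics = {"p1": 0.1, "p2": 0.6, "p3": 0.3}
profile = {"sports": 0.6, "politics": 0.4}   # the user's interest mixture

personalized = {
    page: profile["sports"] * sports[page] + profile["politics"] * politics[page]
    for page in sports
}
```

Only a handful of topic vectors need to be computed offline; any profile in between is obtained at query time by this cheap combination.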
***learning method->
|------------
Using a learning method or learning algorithm, we then wish to learn a classifier or classification function γ that maps documents to classes:

    γ : X → C    (13.1)

This type of learning is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = γ. The learning method Γ takes the training set D as input and returns the learned classification function γ.
|------------
Most names for learning methods Γ are also used for classifiers γ. We talk about the Naive Bayes (NB) learning method Γ when we say that “Naive Bayes is robust,” meaning that it can be applied to many different learning problems and is unlikely to produce classifiers that fail catastrophically. But when we say that “Naive Bayes had an error rate of 20%,” we are describing an experiment in which a particular NB classifier γ (which was produced by the NB learning method) had a 20% error rate in an application.
***document space->
|------------
13.1 The text classification problem

In text classification, we are given a description d ∈ X of a document, where X is the document space; and a fixed set of classes C = {c1, c2, . . . , cJ}. Classes are also called categories or labels. Typically, the document space X is some type of high-dimensional space, and the classes are human defined for the needs of an application, as in the examples China and documents that talk about multicore computer chips above. We are given a training set D of labeled documents 〈d, c〉, where 〈d, c〉 ∈ X × C. For example:

    〈d, c〉 = 〈Beijing joins the World Trade Organization, China〉

for the one-sentence document Beijing joins the World Trade Organization and the class (or label) China.
***IP address->
|------------
20.2.2 DNS resolution

Each web server (and indeed any host connected to the internet) has a unique IP address: a sequence of four bytes generally represented as four integers separated by dots; for instance 207.142.131.248 is the numerical IP address associated with the host www.wikipedia.org. Given a URL such as www.wikipedia.org in textual form, translating it to an IP address (in this case, 207.142.131.248) is a process known as DNS resolution or DNS lookup; here DNS stands for Domain Name Service. During DNS resolution, the program that wishes to perform this translation (in our case, a component of the web crawler) contacts a DNS server that returns the translated IP address. (In practice the entire translation may not occur at a single DNS server; rather, the DNS server contacted initially may recursively call upon other DNS servers to complete the translation.) For a more complex URL such as en.wikipedia.org/wiki/Domain_Name_System, the crawler component responsible for DNS resolution extracts the host name – in this case en.wikipedia.org – and looks up the IP address for the host en.wikipedia.org.
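The host-extraction step can be sketched with the standard library; a scheme prefix is added so the URL parses (an assumption about the crawler's input normalization), and the actual lookup line is left commented out since it needs network access:

```python
from urllib.parse import urlsplit

url = "http://en.wikipedia.org/wiki/Domain_Name_System"
host = urlsplit(url).hostname        # the crawler's DNS component starts here
# import socket
# ip = socket.gethostbyname(host)    # the actual DNS lookup
```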
***posting->
|------------
We keep a dictionary of terms (sometimes also referred to as a vocabulary or lexicon; in this book, we use dictionary for the data structure and vocabulary for the set of terms). Then for each term, we have a list that records which documents the term occurs in. Each item in the list – which records that a term appeared in a document (and, later, often, the positions in the document) – is conventionally called a posting.4 The list is then called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings. The dictionary in Figure 1.3 has been sorted alphabetically and each postings list is sorted by document ID. We will see why this is useful in Section 1.3, below, but later we will also consider alternatives to doing this (Section 7.1.5).
|------------
4. In a (non-positional) inverted index, a posting is just a document ID, but it is inherently associated with a term, via the postings list it is placed on; sometimes we will also talk of a (term, docID) pair as a posting.
|------------
◮ Figure 1.3 The two parts of an inverted index. The dictionary is commonly kept in memory, with pointers to each postings list, which is stored on disk.
|------------
4. Index the documents that each term occurs in by creating an inverted in- dex, consisting of a dictionary and postings.
|------------
Within a document collection, we assume that each document has a unique serial number, known as the document identifier (docID). During index construction, we can simply assign successive integers to each new document when it is first encountered. The input to indexing is a list of normalized tokens for each document, which we can equally think of as a list of pairs of term and docID, as in Figure 1.4. The core indexing step is sorting this list so that the terms are alphabetical, giving us the representation in the middle column of Figure 1.4. Multiple occurrences of the same term from the same document are then merged.5 Instances of the same term are then grouped, and the result is split into a dictionary and postings, as shown in the right column of Figure 1.4. Since a term generally occurs in a number of documents, this data organization already reduces the storage requirements of the index. The dictionary also records some statistics, such as the number of documents which contain each term (the document frequency, which is here also the length of each postings list). This information is not vital for a basic Boolean search engine, but it allows us to improve the efficiency of the search engine at query time.

5. Unix users can note that these steps are similar to use of the sort and then uniq commands.
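These steps can be compressed into a short sketch on a three-document toy collection (the documents are made up):

```python
from collections import defaultdict

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "new home construction"}

# Core indexing step: sort the (term, docID) pairs alphabetically by term.
pairs = sorted((term, doc_id)
               for doc_id, text in docs.items()
               for term in text.split())

index = defaultdict(list)
for term, doc_id in pairs:
    if not index[term] or index[term][-1] != doc_id:   # merge duplicates
        index[term].append(doc_id)

df_home = len(index["home"])   # document frequency = postings list length
```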
|------------
Inversion involves two steps. First, we sort the termID–docID pairs. Next, we collect all termID–docID pairs with the same termID into a postings list, where a posting is simply a docID. The result, an inverted index for the block we have just read, is then written to disk. Applying this to Reuters-RCV1 and assuming we can fit 10 million termID–docID pairs into memory, we end up with ten blocks, each an inverted index of one part of the collection.
|------------
In the final step, the algorithm simultaneously merges the ten blocks into one large merged index. An example with two blocks is shown in Figure 4.3, where we use di to denote the ith document of the collection. To do the merg- ing, we open all block files simultaneously, and maintain small read buffers for the ten blocks we are reading and a write buffer for the final merged in- dex we are writing. In each iteration, we select the lowest termID that has not been processed yet using a priority queue or a similar data structure. All postings lists for this termID are read and merged, and the merged list is written back to disk. Each read buffer is refilled from its file when necessary.
|------------
How expensive is BSBI? Its time complexity is Θ(T log T) because the step with the highest time complexity is sorting and T is an upper bound for the number of items we must sort (i.e., the number of termID–docID pairs).

    postings lists to be merged:        merged postings lists:
    brutus  → d1, d3                    brutus  → d1, d3, d6, d7
    caesar  → d1, d2, d4                caesar  → d1, d2, d4, d8, d9
    noble   → d5                        julius  → d10
    with    → d1, d2, d3, d5            killed  → d8
    brutus  → d6, d7                    noble   → d5
    caesar  → d8, d9                    with    → d1, d2, d3, d5
    julius  → d10
    killed  → d8

◮ Figure 4.3 Merging in blocked sort-based indexing. Two blocks (“postings lists to be merged”) are loaded from disk into memory, merged in memory (“merged postings lists”) and written back to disk. We show terms instead of termIDs for better readability.
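The block merge can be sketched with the standard library’s streaming merge, which is the in-memory analogue of maintaining small read buffers per block; the block contents follow Figure 4.3, with integer docIDs:

```python
import heapq

# Two blocks as sorted lists of (term, postings) entries.
block1 = [("brutus", [1, 3]), ("caesar", [1, 2, 4]),
          ("noble", [5]), ("with", [1, 2, 3, 5])]
block2 = [("brutus", [6, 7]), ("caesar", [8, 9]),
          ("julius", [10]), ("killed", [8])]

merged = {}
for term, postings in heapq.merge(block1, block2):   # streams both blocks
    merged.setdefault(term, []).extend(postings)
```

`heapq.merge` generalizes to any number of blocks, which mirrors the ten-way merge described in the text.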
|------------
In this chapter, we define a posting as a docID in a postings list. For example, the postings list (6; 20, 45, 100), where 6 is the termID of the list’s term, contains three postings. As discussed in Section 2.4.2 (page 41), postings in most search systems also contain frequency and position information; but we will only consider simple docID postings here. See Section 5.4 for references on compressing frequencies and positions.
|------------
This chapter first gives a statistical characterization of the distribution of the entities we want to compress – terms and postings in large collections (Section 5.1). We then look at compression of the dictionary, using the dictionary-as-a-string method and blocked storage (Section 5.2). Section 5.3 describes two techniques for compressing the postings file, variable byte encoding and γ encoding.
|------------
5.1 Statistical properties of terms in information retrieval

As in the last chapter, we use Reuters-RCV1 as our model collection (see Table 4.2, page 70). We give some term and postings statistics for the collection in Table 5.1. “∆%” indicates the reduction in size from the previous line.
|------------
The table shows the number of terms for different levels of preprocessing (column 2). The number of terms is the main factor in determining the size of the dictionary. The number of nonpositional postings (column 3) is an indicator of the expected size of the nonpositional index of the collection.
|------------
In general, the statistics in Table 5.1 show that preprocessing affects the size of the dictionary and the number of nonpositional postings greatly. Stemming and case folding reduce the number of (distinct) terms by 17% each and the number of nonpositional postings by 4% and 3%, respectively. The treatment of the most frequent words is also important. The rule of 30 states that the 30 most common words account for 30% of the tokens in written text (31% in the table). Eliminating the 150 most common words from indexing (as stop words; cf. Section 2.2.2, page 27) cuts 25% to 30% of the nonpositional postings. But, although a stop list of 150 words reduces the number of postings by a quarter or more, this size reduction does not carry over to the size of the compressed index. As we will see later in this chapter, the postings lists of frequent words require only a few bits per posting after compression.
***context resemblance->
|------------
But we still prefer documents that match the query structure closely by inserting fewer additional nodes. We ensure that retrieval results respect this preference by computing a weight for each match. A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

    CR(cq, cd) = (1 + |cq|) / (1 + |cd|)   if cq matches cd
    CR(cq, cd) = 0                         if cq does not match cd    (10.1)

where |cq| and |cd| are the number of nodes in the query path and document path, respectively, and cq matches cd iff we can transform cq into cd by inserting additional nodes. Two examples from Figure 10.6 are CR(cq4, cd2) = 3/4 = 0.75 and CR(cq4, cd3) = 3/5 = 0.6 where cq4, cd2 and cd3 are the relevant paths from top to leaf node in q4, d2 and d3, respectively. The value of CR(cq, cd) is 1.0 if q and d are identical.
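Since “transform cq into cd by inserting nodes” amounts to cq being a subsequence of cd, equation (10.1) can be sketched directly; paths are represented here as lists of node labels:

```python
def matches(cq, cd):
    # cq matches cd iff cq can be turned into cd by inserting nodes,
    # i.e. cq is an ordered subsequence of cd.
    it = iter(cd)
    return all(node in it for node in cq)

def context_resemblance(cq, cd):
    """CR from equation (10.1)."""
    if matches(cq, cd):
        return (1 + len(cq)) / (1 + len(cd))
    return 0.0
```

For example, a two-node query path matched against a three-node document path gets weight (1+2)/(1+3) = 0.75, mirroring the CR(cq4, cd2) example above.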
***document-partitioned index->
***performance->
|------------
We will use effectiveness as a generic term for measures that evaluate the quality of classification decisions, including precision, recall, F1, and accuracy. Performance refers to the computational efficiency of classification and IR systems in this book. However, many researchers mean effectiveness, not efficiency, of text classification when they use the term performance.
***variance->
|------------
Writing ΓD for Γ(D) for better readability, we can transform Equation (14.7) as follows:

    learning-error(Γ) = ED[MSE(ΓD)]
                      = ED Ed[ΓD(d) − P(c|d)]^2            (14.10)
                      = Ed[bias(Γ, d) + variance(Γ, d)]    (14.11)

    bias(Γ, d)     = [P(c|d) − ED ΓD(d)]^2                 (14.12)
    variance(Γ, d) = ED[ΓD(d) − ED ΓD(d)]^2                (14.13)

where the equivalence between Equations (14.10) and (14.11) is shown in Equation (14.9) in Figure 14.13. Note that d and D are independent of each other. In general, for a random document d and a random training set D, D does not contain a labeled instance of d.
|------------
Variance is the variation of the prediction of learned classifiers: the average squared difference between ΓD(d) and its average ED ΓD(d). Variance is large if different training sets D give rise to very different classifiers ΓD. It is small if the training set has a minor effect on the classification decisions ΓD makes, be they correct or incorrect. Variance measures how inconsistent the decisions are, not whether they are correct or incorrect.
|------------
Linear learning methods have low variance because most randomly drawn training sets produce similar decision hyperplanes. The decision lines produced by linear learning methods in Figures 14.10 and 14.11 will deviate slightly from the main class boundaries, depending on the training set, but the class assignment for the vast majority of documents (with the exception of those close to the main boundary) will not be affected. The circular enclave in Figure 14.11 will be consistently misclassified.
***XML->
|------------
Again, we must determine the document format, and then an appropriate decoder has to be used. Even for plain text documents, additional decoding may need to be done. In XML documents (Section 10.1, page 197), character entities, such as &amp;, need to be decoded to give the correct character, namely & for &amp;. Finally, the textual part of the document may need to be extracted out of other material that will not be processed. This might be the desired handling for XML files, if the markup is going to be ignored; we would almost certainly want to do this with postscript or PDF files. We will not deal further with these issues in this book, and will assume henceforth that our documents are a list of characters. Commercial products usually need to support a broad range of document types and encodings, since users want things to just work with their data as is. Often, they just think of documents as text inside applications and are not even aware of how it is encoded on disk. This problem is usually solved by licensing a software library that handles decoding document formats and character encodings.
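As a minimal illustration of entity decoding, Python's standard library can undo character entities such as &amp; and &lt;; the input string here is just an example:

```python
# Decoding character entities in extracted text. html.unescape handles
# named, decimal and hexadecimal entities from the standard library.
from html import unescape

raw = "AT&amp;T &lt;research&gt;"
print(unescape(raw))  # AT&T <research>
```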
|------------
We will only look at one standard for encoding structured documents: Extensible Markup Language or XML, which is currently the most widely used such standard. We will not cover the specifics that distinguish XML from other types of markup such as HTML and SGML. But most of what we say in this chapter is applicable to markup languages in general.
|------------
In the context of information retrieval, we are only interested in XML as a language for encoding text and documents. A perhaps more widespread use of XML is to encode non-text data. For example, we may want to export data in XML format from an enterprise resource planning system and then read them into an analytics program to produce graphs for a presentation.
|------------
This type of application of XML is called data-centric because numerical and non-text attribute-value data dominate and text is usually a small fraction of the overall data. Most data-centric XML is stored in databases – in contrast to the inverted index-based methods for text-centric XML that we present in this chapter.
***Hierarchical Dirichlet Processes->
***document frequency->
|------------
Within a document collection, we assume that each document has a unique serial number, known as the document identifier (docID). During index construction, we can simply assign successive integers to each new document when it is first encountered. The input to indexing is a list of normalized tokens for each document, which we can equally think of as a list of pairs of term and docID, as in Figure 1.4. The core indexing step is sorting this list so that the terms are alphabetical, giving us the representation in the middle column of Figure 1.4. Multiple occurrences of the same term from the same document are then merged.5 Instances of the same term are then grouped, and the result is split into a dictionary and postings, as shown in the right column of Figure 1.4. Since a term generally occurs in a number of documents, this data organization already reduces the storage requirements of the index. The dictionary also records some statistics, such as the number of documents which contain each term (the document frequency, which is here also the length of each postings list). This information is not vital for a basic Boolean search engine, but it allows us to improve the efficiency of the

5. Unix users can note that these steps are similar to use of the sort and then uniq commands.
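The sort-and-group step can be sketched as follows, assuming docIDs are assigned in order of first encounter and duplicate (term, docID) pairs are merged; the sample documents are illustrative:

```python
# Core indexing step: collect (term, docID) pairs, sort, and group them
# into a dictionary plus postings lists. Using a set merges duplicate
# pairs (like sort | uniq); sorting puts terms in alphabetical order.

def build_index(docs):  # docs: list of token lists; docID = list position
    pairs = sorted({(term, doc_id)
                    for doc_id, tokens in enumerate(docs)
                    for term in tokens})
    index = {}
    for term, doc_id in pairs:
        index.setdefault(term, []).append(doc_id)
    return index  # term -> sorted postings list; len() gives the df

docs = [["i", "did", "enact", "julius"], ["so", "let", "it", "be", "julius"]]
print(build_index(docs)["julius"])  # [0, 1]
```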
|------------
6.2.1 Inverse document frequency

Raw term frequency as above suffers from a critical problem: all terms are considered equally important when it comes to assessing relevancy on a query. In fact certain terms have little or no discriminating power in determining relevance. For instance, a collection of documents on the auto industry is likely to have the term auto in almost every document. To this

    Word        cf       df
    try         10422    8760
    insurance   10440    3997

◮ Figure 6.7 Collection frequency (cf) and document frequency (df) behave differently, as in this example from the Reuters collection.
|------------
Instead, it is more commonplace to use for this purpose the document frequency dft, defined to be the number of documents in the collection that contain a term t. This is because in trying to discriminate between documents for the purpose of scoring it is better to use a document-level statistic (such as the number of documents containing a term) than to use a collection-wide statistic for the term. The reason to prefer df to cf is illustrated in Figure 6.7, where a simple example shows that collection frequency (cf) and document frequency (df) can behave rather differently. In particular, the cf values for both try and insurance are roughly equal, but their df values differ significantly. Intuitively, we want the few documents that contain insurance to get a higher boost for a query on insurance than the many documents containing try get from a query on try.
|------------
How is the document frequency df of a term used to scale its weight? Denoting as usual the total number of documents in a collection by N, we define the inverse document frequency (idf) of a term t as follows:

    idf_t = log(N / df_t)    (6.7)

Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low. Figure 6.8 gives an example of idfs in the Reuters collection of 806,791 documents; in this example logarithms are to the base 10. In fact, as we will see in Exercise 6.12, the precise base of the logarithm is not material to ranking. We will give on page 227 a justification of the particular form in Equation (6.7).
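A short sketch of Equation (6.7) with base-10 logarithms, using N = 806,791 and the df values from Figure 6.7:

```python
# idf_t = log10(N / df_t), Equation (6.7), with N and df values taken
# from the Reuters example in Figures 6.7 and 6.8.
import math

N = 806_791
df = {"try": 8760, "insurance": 3997}

idf = {t: math.log10(N / d) for t, d in df.items()}
# The rarer term gets the higher idf: idf["insurance"] > idf["try"]
```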
|------------
6.2.2 Tf-idf weighting We now combine the definitions of term frequency and inverse document frequency, to produce a composite weight for each term in each document.
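The composite weight multiplies a term's frequency in a document by its inverse document frequency; the helper below is a hypothetical sketch of that combination (function name and log base 10 chosen for illustration):

```python
# Hypothetical tf-idf helper: composite weight = tf_{t,d} * idf_t,
# with base-10 logarithms as in Figure 6.8.
import math

def tf_idf(tf, N, df):
    """Weight of a term with raw frequency tf in one document."""
    return tf * math.log10(N / df) if tf > 0 else 0.0
```

A term absent from a document (tf = 0) contributes weight 0; a term present in every document (df = N) also contributes 0, since log10(1) = 0.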
***category->
***Rocchio classification->
|------------
Exercise 14.1  For small areas, distances on the surface of the hypersphere are approximated well by distances on its projection (Figure 14.2) because α ≈ sin α for small angles. For what size angle is the distortion α / sin(α) (i) 1.01, (ii) 1.05 and (iii) 1.1?

14.2 Rocchio classification

Figure 14.1 shows three classes, China, UK and Kenya, in a two-dimensional (2D) space. Documents are shown as circles, diamonds and X's. The boundaries in the figure, which we call decision boundaries, are chosen to separate the three classes, but are otherwise arbitrary. To classify a new document, depicted as a star in the figure, we determine the region it occurs in and assign it the class of that region – China in this case. Our task in vector space classification is to devise algorithms that compute good boundaries where "good" means high classification accuracy on data unseen during training.
|------------
Perhaps the best-known way of computing good class boundaries is Rocchio classification, which uses centroids to define the boundaries. The centroid of a class c is computed as the vector average or center of mass of its members:

    ~μ(c) = (1 / |Dc|) ∑_{d ∈ Dc} ~v(d)    (14.1)

where Dc is the set of documents in D whose class is c: Dc = {d : 〈d, c〉 ∈ D}.
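A minimal sketch of Rocchio classification under Equation (14.1): compute each class centroid and assign a new vector to the class of the nearest one. The 2D training vectors and class names below are illustrative:

```python
# Rocchio classification sketch: class centroids (Eq. 14.1) plus
# nearest-centroid assignment. Training data is a toy example.

def centroid(vectors):
    n = len(vectors)
    return [sum(xs) / n for xs in zip(*vectors)]  # componentwise average

def rocchio_classify(centroids, v):
    def dist2(a, b):  # squared Euclidean distance; argmin is the same
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: dist2(centroids[c], v))

train = {"China": [[1.0, 0.0], [0.9, 0.1]], "UK": [[0.0, 1.0], [0.1, 0.9]]}
cents = {c: centroid(vs) for c, vs in train.items()}
print(rocchio_classify(cents, [0.8, 0.2]))  # China
```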
|------------
The boundary between two classes in Rocchio classification is the set of points with equal distance from the two centroids. For example, |a1| = |a2|, |b1| = |b2|, and |c1| = |c2| in Figure 14.3.

◮ Figure 14.3 Rocchio classification.
***dendrogram->
|------------
An HAC clustering is typically visualized as a dendrogram as shown in Figure 17.1. Each merge is represented by a horizontal line. The y-coordinate of the horizontal line is the similarity of the two clusters that were merged, where documents are viewed as singleton clusters. We call this similarity the combination similarity of the merged cluster. For example, the combination similarity of the cluster consisting of Lloyd's CEO questioned and Lloyd's chief / U.S. grilling in Figure 17.1 is ≈ 0.56. We define the combination similarity of a singleton cluster as its document's self-similarity (which is 1.0 for cosine similarity).
|------------
By moving up from the bottom layer to the top node, a dendrogram allows us to reconstruct the history of merges that resulted in the depicted clustering. For example, we see that the two documents entitled War hero Colin Powell were merged first in Figure 17.1 and that the last merge added Ag trade reform to a cluster consisting of the other 29 documents.
***structured query->