An Introduction to Information Retrieval

  • Topic: Information retrieval, Cluster analysis, Vector space model
  • Published: April 6, 2012
Draft of April 1, 2009

Online edition (c) 2009 Cambridge UP

An Introduction to Information Retrieval

Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze

Cambridge University Press
Cambridge, England

DRAFT! DO NOT DISTRIBUTE WITHOUT PRIOR PERMISSION

© 2009 Cambridge University Press
By Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze
Printed on April 1, 2009

Website: http://www.informationretrieval.org/

Comments, corrections, and other feedback most welcome at: informationretrieval@yahoogroups.com

DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome.

Brief Contents

1 Boolean retrieval
2 The term vocabulary and postings lists
3 Dictionaries and tolerant retrieval
4 Index construction
5 Index compression
6 Scoring, term weighting and the vector space model
7 Computing scores in a complete search system
8 Evaluation in information retrieval
9 Relevance feedback and query expansion
10 XML retrieval
11 Probabilistic information retrieval
12 Language models for information retrieval
13 Text classification and Naive Bayes
14 Vector space classification
15 Support vector machines and machine learning on documents
16 Flat clustering
17 Hierarchical clustering
18 Matrix decompositions and latent semantic indexing
19 Web search basics
20 Web crawling and indexes
21 Link analysis

Contents

List of Tables
List of Figures
Table of Notation
Preface

1 Boolean retrieval
  1.1 An example information retrieval problem
  1.2 A first take at building an inverted index
  1.3 Processing Boolean queries
  1.4 The extended Boolean model versus ranked retrieval
  1.5 References and further reading

2 The term vocabulary and postings lists
  2.1 Document delineation and character sequence decoding
    2.1.1 Obtaining the character sequence in a document
    2.1.2 Choosing a document unit
  2.2 Determining the vocabulary of terms
    2.2.1 Tokenization
    2.2.2 Dropping common terms: stop words
    2.2.3 Normalization (equivalence classing of terms)
    2.2.4 Stemming and lemmatization
  2.3 Faster postings list intersection via skip pointers
  2.4 Positional postings and phrase queries
    2.4.1 Biword indexes
    2.4.2 Positional indexes
    2.4.3 Combination schemes
  2.5 References and further reading

3 Dictionaries and tolerant retrieval
  3.1 Search structures for dictionaries
  3.2 Wildcard queries
    3.2.1 General wildcard queries
    3.2.2 k-gram indexes for wildcard queries
  3.3 Spelling correction
    3.3.1 Implementing spelling correction
    3.3.2 Forms of spelling correction
    3.3.3 Edit distance
    3.3.4 k-gram indexes for spelling correction
    3.3.5 Context sensitive spelling correction
  3.4 Phonetic correction
  3.5 References and further reading

4 Index construction
  4.1 Hardware basics
  4.2 Blocked sort-based indexing
  4.3 Single-pass in-memory indexing
  4.4 Distributed indexing
  4.5 Dynamic indexing
  4.6 Other types of indexes
  4.7 References and further reading

5 Index compression
  5.1 Statistical properties of terms in information retrieval
    5.1.1 Heaps’ law: Estimating the number of terms
    5.1.2 Zipf’s law: Modeling the distribution of terms
  5.2 Dictionary compression
    5.2.1 Dictionary as a string
    5.2.2 Blocked storage
  5.3 Postings file compression
    5.3.1 Variable byte codes
    5.3.2 γ codes
  5.4 References and further reading

6 Scoring, term weighting and the vector space model
  6.1 Parametric and zone...