Inverted File Index
Solutions
1.Scan each document
This is simple but very slow: every query has to read every document again.
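To make the cost of this first solution concrete, here is a minimal sketch of a linear scan. The function name `scan_documents` and the sample documents are made up for illustration; the point is only that each query walks through the full text of every document.

```python
def scan_documents(docs, term):
    """Return the ids of the documents that contain `term`, by reading all of them."""
    term = term.lower()
    hits = []
    for doc_id, text in docs.items():
        # Every query re-reads and re-tokenizes the full text of every document.
        if term in text.lower().split():
            hits.append(doc_id)
    return hits


# Hypothetical sample collection: document id -> text.
docs = {
    1: "Gold silver truck",
    2: "Shipment of gold damaged in a fire",
    3: "Delivery of silver arrived in a silver truck",
    4: "Shipment of gold arrived in a truck",
}

print(scan_documents(docs, "silver"))
```

The work per query grows with the total size of the collection, which is what the next two solutions try to avoid.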
2.Term-Document Incidence Matrix
Here is an example of a term-document incidence matrix:
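As a rough illustration, the sketch below builds such a matrix in Python: one row per term, one column per document, with a 1 where the term occurs in the document. The helper name `incidence_matrix` and the three sample documents are assumptions made for this example only.

```python
def incidence_matrix(docs):
    """Build a term -> row mapping; row[i] is 1 if the term occurs in the i-th document."""
    doc_ids = sorted(docs)
    tokens = {d: set(docs[d].lower().split()) for d in doc_ids}
    vocab = sorted(set().union(*tokens.values()))
    matrix = {
        term: [1 if term in tokens[d] else 0 for d in doc_ids]
        for term in vocab
    }
    return doc_ids, matrix


# Hypothetical sample collection.
docs = {
    1: "gold silver truck",
    2: "shipment of gold",
    3: "silver truck arrived",
}

doc_ids, matrix = incidence_matrix(docs)
for term, row in matrix.items():
    print(f"{term:<10}", row)

# A Boolean AND query is an element-wise AND of two rows.
silver_and_truck = [a & b for a, b in zip(matrix["silver"], matrix["truck"])]
print(silver_and_truck)
```

Most entries of a realistic matrix are 0, so storing the full table wastes space. That is the motivation for the compact version.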
3.Compact Version - Inverted File Index
Definition
- An index is a mechanism for locating a given term in a text.
- An inverted file contains, for each term, a list of pointers (e.g. page numbers) to all occurrences of that term in the text.
Here is an example of an inverted file index:
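The structure can also be sketched in a few lines of Python. In this sketch each term maps to the documents it occurs in, and each document maps to the word positions of the occurrences (the "pointers" from the definition). The name `build_inverted_index`, the position-based pointers, and the sample documents are illustrative assumptions, not the only possible layout.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [word positions]} for every occurrence."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, word in enumerate(docs[doc_id].lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return dict(index)  # plain dict, so lookups of unknown terms do not create entries


# Hypothetical sample collection.
docs = {
    1: "gold silver truck",
    2: "shipment of gold",
    3: "silver truck arrived in a silver truck",
}

index = build_inverted_index(docs)
print(index["silver"])
print(sorted(index["gold"]))
```

Only the terms that actually occur in a document are stored for it, so the index stays small where the incidence matrix would be filled with zeros.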
Term reading
1.Word Stemming
Process a word so that only its stem or root form is left.
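A toy example may help here. The function below is a deliberately naive suffix stripper, not a real stemming algorithm such as Porter's; the suffix list and the minimum stem length of three characters are arbitrary choices made for this sketch.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, deliberately tiny rule set


def naive_stem(word):
    """Strip one common English suffix; a toy stand-in for a real stemmer."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Do not turn "process" into "proces".
        if suffix == "s" and word.endswith("ss"):
            continue
        # Keep at least three characters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["process", "processes", "processed", "processing"]:
    print(w, "->", naive_stem(w))
```

All four forms end up under the same index entry, so a query for any of them finds the others.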
2.Stop Words
Some words are so common that almost every document contains them, such as "a", "the", and "it". Indexing them is useless because they do not help tell documents apart. They are called stop words, and we can eliminate them from the original documents before indexing.
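A minimal sketch of stop-word removal follows. The set `STOP_WORDS` contains the three words mentioned above plus two more added only for illustration; real systems use much longer lists.

```python
# "a", "the", "it" come from the text; "of" and "in" are added for illustration.
STOP_WORDS = {"a", "the", "it", "of", "in"}


def remove_stop_words(text):
    """Tokenize on whitespace and drop every stop word."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


print(remove_stop_words("Shipment of gold damaged in a fire"))
```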
4.Distributed indexing
5.Thresholding
- Document: Retrieve only the top x documents, with the documents ranked by weight.
- Query: Sort the query terms by their frequency in ascending order, then search using only some percentage of the original query terms, starting with the rarest.
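Both kinds of thresholding can be sketched as follows. The function names, the `weights` and `term_frequency` dictionaries, and the rule of always keeping at least one query term are assumptions made for this example.

```python
import math


def top_documents(weights, x):
    """Document thresholding: keep the x documents with the highest weight."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:x]]


def threshold_query(terms, term_frequency, fraction):
    """Query thresholding: keep only the rarest `fraction` of the query terms."""
    ordered = sorted(terms, key=lambda t: term_frequency.get(t, 0))
    keep = max(1, math.ceil(len(ordered) * fraction))  # assumption: never drop all terms
    return ordered[:keep]


# Hypothetical document weights and collection-wide term frequencies.
weights = {1: 0.2, 2: 0.9, 3: 0.5}
term_frequency = {"the": 1000, "gold": 12, "truck": 40, "shipment": 7}

print(top_documents(weights, 2))
print(threshold_query(["the", "gold", "truck", "shipment"], term_frequency, 0.5))
```

Rare terms are kept first because they narrow the result set the most, while very frequent terms match almost everything.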
Performance of the index
| | Relevant | Irrelevant |
|---|---|---|
| Retrieved | n1 | n2 |
| Not Retrieved | n3 | n4 |
Precision $P=\frac{n1}{n1+n2}$
Recall $R=\frac{n1}{n1+n3}$
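The two formulas can be checked with a small Python sketch. Here n1 is the number of retrieved documents that are relevant, n2 the number retrieved but not relevant, and n3 the number relevant but not retrieved. The function name and the choice to return 0.0 when a denominator is zero are assumptions of this sketch.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from the retrieved and relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    n1 = len(retrieved & relevant)   # retrieved and relevant
    n2 = len(retrieved - relevant)   # retrieved but not relevant
    n3 = len(relevant - retrieved)   # relevant but not retrieved
    # Assumption: report 0.0 instead of dividing by zero.
    precision = n1 / (n1 + n2) if n1 + n2 else 0.0
    recall = n1 / (n1 + n3) if n1 + n3 else 0.0
    return precision, recall


# Hypothetical query result: 4 documents retrieved, 3 documents actually relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(f"precision={p:.2f} recall={r:.2f}")
```

In this example n1 = 2, n2 = 2 and n3 = 1, so precision is 2/4 and recall is 2/3.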