Advanced Data Structure Project

CSCI4117
Advanced Data Structure Project Proposal

Yejia Tong/B00537881
2012.11.5

1. Title of Project

Succinct data structure in top-k documents retrieval

2. Objective of Research

The main aim of this project is to find out how to efficiently retrieve the k documents in which a given pattern occurs most frequently. Although the problem has been discussed in many papers and solved in various ways, our research looks for novel algorithms and (succinct) data structures in the recent literature, with the goal of identifying the one that dominates almost all of the space/time tradeoffs.
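To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is only a baseline for comparison, not one of the succinct structures this project will study. The function names, the choice to count overlapping occurrences, and the tie-breaking by document number are all assumptions made for illustration.

```python
import heapq


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps (an assumption)."""
    if not pattern:
        return 0
    count, start = 0, 0
    while True:
        pos = text.find(pattern, start)
        if pos == -1:
            return count
        count += 1
        start = pos + 1  # advance by one so overlapping matches are counted


def top_k_documents(documents, pattern, k):
    """Return up to k (doc_id, count) pairs, most frequent first.

    Scans every document in full, so the cost grows with the total
    collection size rather than with the size of the answer.
    """
    scored = []
    for doc_id, text in enumerate(documents):
        count = count_occurrences(text, pattern)
        if count > 0:
            scored.append((count, doc_id))
    # Highest count first; ties broken by smaller document number.
    best = heapq.nsmallest(k, scored, key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, count) for count, doc_id in best]
```

Because this baseline rescans the whole collection for every query, it shows why an index that answers the query without reading every document is worth looking for.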

3. Background/History of the Study

Before we begin our search for such a succinct data structure, we review a number of fundamental works that underlie our approach. Among the many ideas in classic information retrieval, two are central: the inverted index and term frequency. (Angelos, Giannis, Epimeneidis, Euripides, & Evangelos, 2005) The inverted index, also referred to as a postings file, is an index data structure that stores a mapping from content, such as words, to its locations in a document collection. It is the most widely used data structure in the information retrieval domain and is deployed on a large scale, for example in search engines. Term frequency is a measure of how often a term occurs in a document. However, these ideas are efficient only under restrictive assumptions: the text must be easily tokenized into words, there must not be too many different words, and queries must be whole words or phrases. These assumptions make document retrieval difficult for many languages. On the other hand, one of the attractive properties of an inverted file is that it is easily compressible while still supporting fast queries. In practice, an inverted file occupies space close to that of a compressed document collection. (Niko & Veli, 2007) In later work, efficient data structures such as suffix arrays and suffix trees (full-text indexes) were found to offer good space/time efficiency as alternatives to inverted files. Recently, several
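The two classic ideas described above, the inverted index and term frequency, can be sketched together in a few lines of Python. This is a simplified illustration under stated assumptions: tokenization is by whitespace after lowercasing, and the function names and the in-memory dictionary layout are hypothetical rather than taken from any of the cited works.

```python
from collections import Counter, defaultdict


def build_inverted_index(documents):
    """Map each term to a postings dict {doc_id: term frequency in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents):
        # Assumption: terms are whitespace-separated, case-insensitive words.
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return index


def top_k_for_term(index, term, k):
    """Return up to k (doc_id, frequency) pairs for a whole-word query."""
    postings = index.get(term.lower(), {})
    # Highest frequency first; ties broken by smaller document number.
    return sorted(postings.items(), key=lambda p: (-p[1], p[0]))[:k]
```

The sketch also exposes the limitation noted in the text: a query such as "at" finds nothing in a collection that contains only "cat" and "mat", because the index knows whole words and not arbitrary substrings.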

References:

Angelos, H., Giannis, V., Epimeneidis, V., Euripides, P. G., & Evangelos, M. (2005). Information Retrieval by Semantic Similarity. Dalhousie University, Faculty of Computer Science. Halifax.
Bieganski, P. (1994). Generalized suffix trees for biological sequence data: applications and implementation. Minnesota University, Dept. of Comput. Sci. Minneapolis.
Gonzalo, N., & Daniel, V. (2012). Space-Efficient Top-k Document Retrieval. Univ. of Chile, Dept. of Computer Science. Valdivia.
Hon, W. K., Shah, R., & Wu, S. B. (2009). Efficient Index for Retrieving Top-k Most Frequent Documents. Springer, Heidelberg.
Niko, V., & Veli, M. (2007). Space-efficient Algorithms for Document Retrieval. University of Helsinki, Department of Computer Science. Finland.
Y., M., S., M., S., C. S., & J., Z. (1998). Augmenting suffix trees with applications. 6th Annual European Symposium on Algorithms (ESA 1998) (pp. 67-78). Springer-Verlag.
