LILAC Centre, School of Library Studies, Clyde College, Elgin, Australia

Abstract
Ranking techniques are used to evaluate natural-language queries on text databases. Text databases are an important component of digital libraries. These information retrieval systems must access large volumes of text, often divided into several collections that may be held on separate machines. Techniques for locating answers to queries must therefore consider identification of probable collections as well as identification of documents that are probable answers, to avoid the situation in which all queries must be answered in full by all servers. Effective ranking can be costly in memory and time: the database may contain millions of documents and queries can contain large numbers of terms. In many environments, such as current desktop computers, standard CPU speeds and volumes of memory are more than adequate to rapidly resolve queries, even on databases of many gigabytes of text. In other environments, however, both memory and time are limited: examples include Internet search engines, corporate data servers, online product databases, and, at the other extreme, handheld computers with PCMCIA-slot disk drives. In this paper we show that use of centralised blocked indexes, expressly designed for a multi-collection environment, can meet these objectives and simultaneously reduce overall query processing costs.

1 Introduction
The use of information retrieval systems for management of text data is widespread, and their use is likely to accelerate with the advent of the digital library. Newspaper archives, library catalogues, and legislation repositories all require access by record content if they are to be useful and effective. These text databases are also quite different from the more usual key-based databases, in that the access criteria for any given record cannot be reliably determined in advance. Instead, a record may be retrieved based upon any combination of words contained therein, so a relatively large amount of index information must be maintained for each record. This facility makes text databases ideal for managing the textual component of a digital library, since the rigid structure of a conventional catalogue system need no longer restrict the access path used to locate information. A consequence of this result is that two-level indexes may be appropriate for large collections even if all of the data is physically present on one machine.

Over the last decade, new ranking techniques have allowed the resources used to evaluate a ranked query to be greatly reduced. All of these techniques reduce the time or memory required to resolve a query, but they do not necessarily bound it. In a practical system, however, it is important to allocate a fixed quantity of each resource to each user, to avoid problems such as thrashing. In such contexts an important design goal may not be to seek maximum effectiveness, or the most efficient query evaluation algorithm that yields effectiveness close to that of the most effective algorithm; rather, the design goal is to maximise effectiveness once other constraints have been imposed. The new ranking techniques are largely based on inverted files, which for each distinct term include an inverted list of the documents containing that term. Queries are resolved by fetching and
processing the inverted list for each query term, and, for each document in the list, updating an accumulator holding a partially-computed similarity for that document. Advances in efficiency include compression of inverted lists, to reduce disk transfer costs and increase the likelihood that lists are cached in memory; internal structuring of inverted lists, to reduce decompression costs [MZ96]; heuristic limits to the number of accumulators, to reduce the memory required to resolve a query [MZ96,...