Performance Issues in Digital Information Systems: introduction to a special issue
Performance issues lie at the heart of computer science and these issues are pervasive in digital information systems. Digital libraries have tremendous potential to affect our access to information, but the management of information on such a huge scale is a daunting task. Adding information to digital libraries requires high performance incremental indexing and cataloging as well as tools for recognizing existing documents currently stored on analog media such as paper, microfilm and videotape. The growing use of multimedia in modern documents brings with it all the performance issues that arise with high data volume and real-time delivery requirements. The World Wide Web has transformed the way people use computers and their level of interest in them, but it has also spawned repeated complaints about delays in information delivery. Even document authoring and browsing systems, which are generally quite fast, must address performance issues because of their many features and because every end-user is affected by how they perform.
The paradox is that research on digital information systems has had little to say about efficiency. There is a sensible reason for this state of affairs. The effectiveness of digital information systems is not yet adequate, even though great strides have been made. Existing document search engines exhibit only mediocre precision and recall. Document recognition tools are inaccurate both at the high level of document structure and the low level of character recognition. The Web supports only a fraction of the functionality provided by earlier hypertext systems. Thus, most research has focused on improving accuracy and functionality rather than performance.
Two trends convince me that efficiency of digital information systems is a research area of increasing importance. The obvious trend is the growth of the Web into the largest information system ever created. The size, heterogeneity and distributed architecture of the Web present tremendous performance problems that search engines, directories and other information resources must address. The second and less obvious trend is the growing use of portable computing devices. PDAs and mobile phones are expected to provide functionality near that of desktop personal computers using processors that are about 20 times slower and memory configurations that are about 20 times smaller. It is precisely these kinds of resource constraints that force computer scientists to find efficient solutions to important problems, and I have every reason to believe that we will do so.
This special issue of the Journal of Digital Information presents three articles that, taken together, suggest the range and depth of research in performance issues for digital information systems.
In an invited paper, Frieder et al. survey performance issues for scalable information retrieval systems. They begin by examining techniques for compressing the inverted index and for performing incremental updates of its term weights. Query processing can be improved by modifying the index representation and by a variety of techniques for cutting off computation. The authors also discuss how parallel and distributed implementations can be used to construct scalable retrieval systems.
Miller et al. provide a more focused look at the same general domain in a paper that describes how their TELLTALE system was made suitable for use with a gigabyte-scale corpus. TELLTALE uses n-grams rather than words as the basis for its indexing. This design choice makes it suitable for multilingual corpora and tolerant of noise or errors, such as might come from optical character recognition. However, it also increases the size of the inverted index substantially. The authors show how gamma compression, along with a number of changes to data structures, gives TELLTALE sufficient performance to handle a gigabyte corpus when running on stock personal computer hardware.
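To make the two ideas in this summary concrete, the following minimal sketch shows character n-gram extraction and a gamma-coded posting list. It assumes that "gamma compression" refers to Elias gamma coding applied to the gaps between document identifiers, which is a common reading but is not confirmed here. The function names, the n-gram length and the bit-string representation are illustrative choices, not details of TELLTALE.

```python
def char_ngrams(text, n=5):
    """Return the overlapping character n-grams of text (n=5 is an arbitrary default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def gamma_encode(value):
    """Elias gamma code of a positive integer, returned as a string of '0'/'1'."""
    if value < 1:
        raise ValueError("gamma codes are defined only for positive integers")
    binary = bin(value)[2:]
    # A run of len(binary) - 1 zeros announces the length, then the binary digits follow.
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits):
    """Decode a concatenation of well-formed Elias gamma codes into a list of integers."""
    values, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        values.append(int(bits[pos:pos + zeros + 1], 2))
        pos += zeros + 1
    return values


def compress_postings(doc_ids):
    """Gamma-encode a strictly increasing list of positive document ids as gaps."""
    if not doc_ids:
        return ""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return "".join(gamma_encode(gap) for gap in gaps)


def decompress_postings(bits):
    """Invert compress_postings by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in gamma_decode(bits):
        total += gap
        doc_ids.append(total)
    return doc_ids
```

Small gaps receive short codes, so the frequent n-grams that inflate an n-gram index are also the ones whose posting lists compress best. A production implementation would pack the bits into bytes rather than use a character string.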
A concern with memory usage also underlies Cumaranatunge and Munson's paper describing a runtime system for constraint-based multimedia style sheets. The growth of the Web, with its dependence on the structured document paradigm, is increasing the importance of style sheets. While most widely used style sheet systems use the flow layout method, a constraint-based approach is better suited to non-textual data such as graphics or video. However, typical constraint system implementations will use unacceptable amounts of memory for large documents because each constraint is represented by a first-class object. Cumaranatunge and Munson show that memory usage can be reduced substantially with new data structures that treat constraints as second-class objects.
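One possible reading of the first-class versus second-class distinction is sketched below: instead of allocating one object per constraint, the constraints are stored as rows in shared parallel arrays and identified only by their index. This is an illustration of the general idea, not the data structure used by Cumaranatunge and Munson. The class name, the restriction to one-way constraints of the form target = source + offset, and the insertion-order solver are all assumptions made for the example.

```python
from array import array


class ConstraintTable:
    """Holds constraints 'values[target] = values[source] + offset' in parallel arrays.

    A constraint has no object of its own; it exists only as a row index,
    which avoids per-object overhead when a document has many constraints.
    """

    def __init__(self):
        self.targets = array("i")   # index of the constrained value
        self.sources = array("i")   # index of the value it depends on
        self.offsets = array("d")   # distance between the two values

    def add(self, target, source, offset):
        """Append a constraint and return its row index."""
        self.targets.append(target)
        self.sources.append(source)
        self.offsets.append(offset)
        return len(self.targets) - 1

    def solve(self, values):
        """Apply the constraints in insertion order, updating values in place."""
        for target, source, offset in zip(self.targets, self.sources, self.offsets):
            values[target] = values[source] + offset
        return values


# Hypothetical layout: three boxes stacked vertically, each below the previous one.
table = ConstraintTable()
table.add(target=1, source=0, offset=10.0)
table.add(target=2, source=1, offset=5.0)
positions = table.solve([0.0, 0.0, 0.0])
```

The trade-off is the usual one for this kind of layout: the arrays are compact, but an individual constraint can no longer carry its own methods or be passed around independently of the table.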
These papers show that performance issues are being actively addressed by researchers in digital information systems. It is my hope that this humble effort will stimulate others to pursue questions of information system performance and report those results in venues such as future issues of this journal.