Fault-tolerant Fulltext Search for Large Multilingual Scientific Text Corpora

Wolfram M. Esser

Abstract


In the work reported here, we present a new way of performing fault-tolerant fulltext retrieval on large text corpora, such as scientific encyclopedias. The weighted pattern morphing (WPM) technique introduced in this paper overcomes disadvantages of both the popular edit distance measure and the Soundex code approaches, yet keeping their flexibility. This algorithm handles phonetic similarities; common typing errors such as omission or transposition of letters, and inconsistent usage of abbreviations and hyphenation. After showing how WPM can be implemented efficiently, we present a novel method of how the weights of the internal penalty matrix can be automatically adjusted for even better results. Though the described technique can be applied without prior knowledge of actual user patterns, re-examination with a large number of online-user's patterns proves the portability of this fine-tuning approach. We further show how shifting the penalty matrix from one language to another can be accomplished. The described WPM technique is integrated into a large commercial pharmaceutical encyclopedia CDROM, an online dermatological encyclopedia, and an online-reference encyclopedia of parasitology research, thus also proving its "road capability".

Full Text: PDF