1 Project Manager's Report, by Charles Muller
1.1 Technical Review
Compilation of the Digital Dictionary of Buddhism (DDB) began with the realization of the dearth of adequate lexicographical and other reference works in the English language for the textual scholar of East Asian Buddhism in particular, and of East Asian philosophy and religion in general. The CJK-English Dictionary (CJK-E; covering Chinese, Japanese, and Korean) began soon after. I decided, during my first reading courses in Buddhist and Confucian/Taoist texts, to save everything I looked up, and have continued that practice to the present through the course of studying scores of classical texts. Although the content of these two lexicons is presently being supplemented by other interested parties, the terms that I have been compiling serve as the major portion of the work.
At the beginning I could not have dreamed of the Internet, or even thought of the possibility of having this material available as a digital database. I simply envisioned the eventual publication of a newer, larger and more useful printed work. As IT developments progressed, the potential gradually began to dawn. The first Web version was uploaded in the summer of 1995. It was not long after that Christian Wittern discovered the DDB and applied a basic SGML structure, which is the ancestor of the XML markup system used today.
Due to limitations of popular browsers, among other things, this framework languished for a couple of years, during which time new HTML versions of the dictionary were periodically regenerated with an array of Word macros. Even slight changes usually necessitated a complete re-tooling of the macro system. Also, access to the data in the dictionaries was limited to hyperlinking, through an array of index files also generated from Word macros.
The most important tool for making a dictionary really useful, a search engine, was lacking. What was needed was to keep the data in a stable, validated SGML/XML format and present it to users through a style sheet or some other database-retrieval technique. Once XML support had been included in MS IE5 for a year or so, I went back and experimented further, but found that many aspects of XSLT were still not adequately supported. Without XLink/XPointer support, not much could be done beyond the level of support provided by popular browsers.
The XML solution first appeared in the summer of 2000, when Christian Wittern developed an experimental version of the DDB using the Zope system. This was the first attempt to use the data in a form close to the original XML, and also the first time a search engine had been applied to either of the dictionaries. As the maintainer of the dictionaries, however, this system presented difficulties in the sense that the data needed to be converted into thousands of small files, which made the dictionary difficult for me to maintain locally. There was also the problem of a lack of full native support for XSLT. Nonetheless, Wittern's work marked the first time a version of the combined dictionaries had been generated more or less directly from the XML source.
A major turning point for the CJK-E/DDB came in January 2001. While browsing a Japanese magazine on Palm computing, I noticed that Jim Breen's Japanese dictionary was becoming a sort of standard for inclusion on DoCoMo portable telephones. It occurred to me that although my data had always been publicly available, since the time of finalization and validation of the XML format, no effort had been made to let IT people know these data were freely available to download, as Breen's were. The availability of the data files was announced on some major XML lists. Not long after, I was contacted by Michael Beddow.
1.2 Recent Evolution: XML Comes Alive
Michael Beddow, a scholar of German Studies, had a strong interest in using XML as a means of storage and delivery of literary and lexicographical documents. He was sure that he could add XSLT and XLink functionality to the latest versions of the standard browsers. Based on the markup structure of the CJK-E, he generated an array of indexes that used XPointers to call single-entry data units from large files, each of which contained hundreds of entries. This was a landmark event for the project: up to this time, calling up a single-entry-sized unit meant either creating the data files at that size in advance, or pointing HTML anchors to a location in a larger file. With the new system the data could be plugged in as is, and function like a real digital dictionary.
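The fragment-retrieval idea can be sketched in a few lines. This is a minimal illustration rather than the actual Perl implementation: the element names, IDs, and content below are hypothetical, and a real XPointer processor does considerably more.

```python
# Sketch: rather than pre-splitting the dictionary into thousands of small
# files, a single large XML file is parsed on demand and one <entry> is
# extracted by its identifier, much as an XPointer would address it.
# Element names, IDs, and content are hypothetical.
import xml.etree.ElementTree as ET

SAMPLE = """<dictionary>
  <entry id="b4e09">...</entry>
  <entry id="b4eba">Person; human being.</entry>
</dictionary>"""

def fetch_entry(xml_text: str, entry_id: str):
    """Return the single <entry> element with the given id, or None."""
    root = ET.fromstring(xml_text)
    # Equivalent in spirit to id()-based XPointer addressing.
    for entry in root.iter("entry"):
        if entry.get("id") == entry_id:
            return entry
    return None

fragment = fetch_entry(SAMPLE, "b4eba")
print(ET.tostring(fragment, encoding="unicode"))
```

The extracted fragment would then be handed to a style sheet for rendering, so the user receives exactly one entry's worth of HTML.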
It looked likely that the Perl-XLinking system could be applied to provide a search engine. Devising a search engine that can deal with mixed Western/CJK text in UTF-8 encoding had been difficult, since most software has trouble parsing the boundaries between character codes. A prototype CJK UTF-8 search engine was developed.
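The core difficulty can be illustrated briefly. The sketch below (in Python, not the Perl actually used) shows how a byte-oriented match can "hit" a byte pair that spans a character boundary and corresponds to no character at all, while matching on decoded Unicode strings cannot misfire this way:

```python
# Illustrative only: why UTF-8 CJK text must be searched as decoded
# characters, not raw bytes.
def cjk_search(entries, needle):
    """Match on decoded Unicode strings so multi-byte characters stay whole."""
    return [e for e in entries if needle in e]

text = "夏末"                    # "end of the summer retreat"
raw = text.encode("utf-8")       # six bytes for two characters
# A byte-oriented engine can "match" a fragment that is no character at all:
assert raw[2:4] in raw           # spans the boundary between 夏 and 末
# Searching decoded text avoids the problem entirely:
print(cjk_search(["夏末", "夏安居"], "夏"))   # → ['夏末', '夏安居']
```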
The present number of terms included in the DDB (15,000 at the time of writing) is not small, but it represents only a tiny fraction of the terms, names, places, temples, schools, texts, etc., that are included in the entire East Asian Buddhist corpus. Thus, a search for a term conducted by someone whose research interests are significantly different to those of the compilers is likely to draw a blank. A group of scholars of East Asian Buddhism has been developing a comprehensive, composite index drawn from the indexes of dozens of major East Asian Buddhist reference works, which now includes almost 300,000 entries (described in further detail below). The search engine was extended to cover this comprehensive index. In its present state, the DDB may be searched for a term and if not found the search continues on this comprehensive index. (Michael's view of these events is described below.)
This section has focused on developments in the DDB, but the same enhancements have been applied to the CJK-E, except for the search through a comprehensive index. Some concrete examples of Web page format and search functions are given below, but first consider some content developments.
1.3 Content Development
1.3.1 Digital Dictionary of Buddhism
By January 1999 content included 4,200 entries. That number (as of March 2002) has jumped to 15,000 and continues to increase rapidly. Grant support from the Japan Society for the Promotion of Science has enabled content to be built in a number of ways:
(1) Development of the comprehensive index (contents described in the Appendix): This project used the International Research Institute for Zen Buddhism (IRIZ) Zendicts.dat file as a starting point (containing around 56,000 entries). To this we at Toyo Gakuen University, in collaboration with teams at the Chung-Hwa Institute of Buddhist Studies and at IRIZ, added the indexes from a large number of major East Asian Buddhist reference works, bringing the total of entries to almost 300,000.
(2) Digitization of East Asian reference works: such lexicons as the Fo Kuang Shan dictionary and the Ding Fubao have already been formally and professionally digitized. We are adding to this by digitizing other valuable print works whose copyrights have expired, such as Soothill's Dictionary of Chinese Buddhist Terms, and works where we have permission to digitize from the copyright holder, such as Lancaster's Descriptive Catalog of the Korean Buddhist Canon. Students paid by our grants are scanning, OCRing, and correcting this data.
(3) Research Input from graduate student assistants: while the volume of these materials has not been especially great, this has been a good way to stimulate interest in the project. The students also benefitted from the chance to learn the computing techniques we are using for input, and to learn about XML.
(4) Automated input technology: based on a set of indexes and tables, most of the assistants are able to use our system of MS-Word macros to add new entries rapidly. The macros create a ready-made entry structure, along with suggested readings of the entries for Chinese, Korean and Japanese pronunciation. We are developing the necessary indexes to include Vietnamese as well. The system is limited to MS-Word, but since the indexes upon which the system is based are saved in Unicode text format, the development of an open platform input system which emulates our present Word system is feasible.
(5) Input from interested scholars: sizeable personal research glossaries have been received, and it is hoped that the continued increase in use of the DDB will encourage more scholars to contribute.
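The entry-generation idea behind point (4) can be sketched as follows. This is a hypothetical Python illustration rather than the actual Word macros; the reading tables and the entry shell are simplified stand-ins for the real Unicode-text indexes and DDB markup:

```python
# Hypothetical reading tables standing in for the Unicode-text indexes
# that the Word macros consult; the real tables cover the full character set.
READINGS = {
    "人": {"c": "rén", "k": "in", "j": "nin"},
    "間": {"c": "jiān", "k": "gan", "j": "ken"},
}

def make_entry_skeleton(headword: str) -> str:
    """Emit a ready-made XML entry shell with suggested Chinese, Korean,
    and Japanese readings, emulating the entry-creation macros."""
    def reading(lang):
        return "".join(READINGS.get(ch, {}).get(lang, "?") for ch in headword)
    return (
        f'<entry>\n'
        f'  <hdwd>{headword}</hdwd>\n'
        f'  <pron c="{reading("c")}" k="{reading("k")}" j="{reading("j")}"/>\n'
        f'  <sense></sense>\n'
        f'</entry>'
    )

print(make_entry_skeleton("人間"))
```

Because the underlying tables are plain Unicode text, the same generation step could be re-implemented on any platform, which is what makes the open-platform input system mentioned above feasible.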
1.3.2 CJK-E Dictionary
These efforts have, for the past few years, been directed primarily at the development of the DDB, somewhat to the neglect of the CJK-E. Nonetheless, that collection now has almost 6,000 compound words. Also, all 20,902 single character headwords in Unicode 2.0 have been made available for browsing, even though only about 8,000 of these contain complete phonetic and semantic information.
1.4 XML Browsing Environment of the Combined Dictionaries
The home of the dictionaries (since February 2001) offers a choice between entering the DDB and the CJK-E. Upon entering the DDB table of contents page, the user is presented with the entire menu for the dictionary, including (1) the search engine and the various topic indexes; (2) the front matter and other explanatory materials for the dictionary; and (3) a small list of seminal resources for the study of classical East Asian Buddhist texts (Figure 1).
By presenting the entire dictionary menu, plus the most important scholarly sites for those doing research in East Asian canonical texts, this page becomes a useful one-stop portal for specialists in our area. Also, all links to areas within the site use absolute, rather than relative, URLs. Thus, if you save this page to your desktop, you have ready access to all these materials via an Internet connection.
Most serious researchers and translators are likely to use the search engine for basic access. For those who are not sure what they are looking for, or who do not have a Unicode supporting browser, or who simply want to browse, the indexes remain useful.
The search engine interface is shown in Figure 2. When activated, the search will yield a menu of matches, containing headword hits and instances occurring in the explanatory body of other entries, as in Figure 3. Selecting the headword match, for instance, brings up the term in question for browsing (Figure 4).
For the user, this all looks and feels pretty much the same as it did in the earlier HTML versions of the DDB, but what is happening is fundamentally different, as this HTML text is being generated on the fly by Perl, XSLT, and XLink protocols.
The menu above provides standard links for returning to important places within the site, and also allows the user to view the XML source (Figure 5). This source view provides access to the names of those responsible for the various content areas.
Those who have been watching the development of the DDB over time may notice the addition of a new field at the top of the <sense> area, called <trans>. This tag is borrowed from the Text Encoding Initiative (TEI), where it means "translation", but here it refers strictly to the word or short phrase that translators would commonly use as a direct rendering of the term in English.
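A simplified illustration of where <trans> sits in an entry, and how a processor would pull it out on the way to rendering. The entry below is hypothetical and much flatter than real DDB markup:

```python
import xml.etree.ElementTree as ET

# Illustrative entry only; real DDB markup is considerably richer.
ENTRY = """<entry>
  <hdwd>夏末</hdwd>
  <sense>
    <trans>end of the summer retreat</trans>
    <def>The close of the summer retreat period.</def>
  </sense>
</entry>"""

root = ET.fromstring(ENTRY)
# <trans> holds the short, directly usable English rendering, as distinct
# from the fuller definition that follows it within <sense>.
print(root.findtext("./sense/trans"))   # → end of the summer retreat
```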
1.5 Inclusion of the Allindex Files
As mentioned above, one of the most important developments of the DDB is the integration of the comprehensive composite index of East Asian Buddhological reference works. When a user's search does not find the required term, the allindex files are searched, returning a list of sources which might contain information on the searched term. For example, at the time of writing, the term hamal 夏末 ("end of the summer retreat") was not contained within the DDB, but a search gives the information in Figure 6. The Allindex project is discussed in the Appendix.
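The fallback logic amounts to a two-stage lookup, sketched below. The dictionaries here are stand-ins: the glosses and source attributions are illustrative only, not actual DDB or allindex contents.

```python
# Two-stage lookup sketch: the DDB proper is searched first, and a miss
# falls through to the composite allindex, which returns the reference
# works covering the term. Data below is illustrative.
DDB = {"人間": "human realm"}
ALLINDEX = {"夏末": ["Fo Kuang Shan dictionary", "Zengaku daijiten"]}

def lookup(term):
    """Return (source, payload): a DDB entry, an allindex source list, or a miss."""
    if term in DDB:
        return ("ddb", DDB[term])
    if term in ALLINDEX:
        return ("allindex", ALLINDEX[term])
    return ("miss", None)

print(lookup("夏末"))   # → ('allindex', ['Fo Kuang Shan dictionary', 'Zengaku daijiten'])
```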
As can be seen, we are finally reaching a point where many of the impediments to full implementation have been overcome. Most importantly, we are starting to be able to handle Unicode-encoded documents and take direct advantage of XML.
2 Delivering CJK Dictionaries from Pure XML Sources: A Developer's Perspective, by Michael Beddow
Probably the most important thing to stress about the collaboration between Charles Muller and myself on an XML-based delivery platform for the DDB and the CJK-E is that no more than six weeks elapsed between our first contact and the announcement of a fully-functional system (and indeed one that had more functions than either of us had envisaged at the start). Perhaps even more noteworthy is that I had the core of the system up and running (in so far as individual entries were being retrieved from the larger files) within a single day of first downloading the data.
I say this not to praise myself as a lightning-speed programmer, but to bring out what it is that makes XML such a hugely important force for changing the way we in the Humanities work with digital data. Years of effort had gone into Charles Muller's collection and markup of the data, and months of work had gone into my development of the modules from which I built a delivery platform tailored to those data; but because data marked up in XML really does describe its own structure, and because software that follows the recommendations for processors issued by the W3C is intrinsically adaptable to any sort of well-formed XML, none of our earlier independent work had to be redone to get the data online. When recoding his data in XML, Charles had been focusing on the scholarly content and the abstract structure, with relatively few detailed ideas about how it would eventually be delivered to users (who in the meantime continued to access his work via the conventional HTML site). For my part, I had been working on techniques of retrieving fragments of larger XML documents and rendering them into HTML on demand, with no substantial experience either of handling CJK data or of the problems specific to lexicographical applications. Yet, the required retrieval, delivery and rendering system more or less sprang into life of its own accord. "Self-describing data" pretty much engendered a self-creating delivery system. It was an exciting, if slightly uncanny, experience.
2.1 The Old and the New
One of the chief benefits Charles had foreseen when moving to XML encoding of his material was the possibility of using XLink and XPointer[1] technologies to allow users to retrieve selected fragments of larger documents. In an HTML implementation, either the editors have to maintain a very large number of small documents, with all the version management problems that entails, or users have to accept that the results of their queries are large documents of which only a small portion may be relevant to what they were looking for. Anticipating the removal of this serious limitation implicit in HTML, Charles had been marking up internal and external links in the XML version of his materials using a basic form of XLink/XPointer notation, but had believed that these links would only be used as intended once browser (and server) support for XLinking was widely implemented.
I was able to show that by using some simple cgi scripts in combination with server-side XSLT transformations, it is possible to implement a small but useful subset of the (still not finalised) XPointer and XLink proposals that can be used with present day browsers and servers. I originally developed these techniques, based on freely-available open source models, to allow the online publication of a long monograph of mine from a single canonical and easily-maintained XML file, while enabling users to request and receive portions of this single file as small as a single (printed) page, transformed on demand from the TEI-conformant XML into HTML.[2]
Like the HTML-based system that preceded it, the XML-based platform involves the creation of many thousands of files, largely because of the caching and indexation facilities it uses to speed retrieval and delivery. But there is an immensely significant difference from the editorial point of view. The thousands of HTML files had to be maintained by the editors themselves; in my system, all the editors need concern themselves with are the core XML files into which they enter their data. There are many other supporting files; but they are invisible to both the end user and the resource authors, and are generated and maintained transparently by the underlying system. The authors create and maintain XML files of whatever size best suits their methods of working, and whose structure is determined by their scholarly analysis of the material. The system validates, partitions and indexes those files, allows users to locate the items within them that they need, and renders the retrieved items into Web pages for delivery, creating hyperlinks for any internal or external cross-references as it does so.
About half way through my work on automatically creating the existing indices, Charles asked whether it would be feasible to create a free-text search engine that would supplement these indices as a means of access, and for some classes of user maybe even replace them. This free-text search engine is only half-way usable. Some of the problems lie in my own coding, which needs, and will in due course receive, much more work. Other problems stem from aspects of the underlying system libraries which only come to light when complex regular expressions involving utf-8 encoded characters from across the entire Unicode range are let loose on multilingual texts. There is also an irritating bug, alluded to by Muller in section 1, which occurs only on the (FreeBSD) hosting server but cannot be reproduced on my (Linux) development system, and which causes the initial failure of some search attempts. I hope users will not find it distractingly flippant that, as a much-needed caveat-cum-apology, I have cited on the query form the remark I suspect the father of Anglophone lexicography might have made about my efforts, had he encountered them betwixt his observations of women preachers and dogs walking on their hind legs. Given Dr Johnson's place in scholarship, this seemed more appropriate than the other citation which also springs to mind in my defence, G. K. Chesterton's observation that "if a thing's worth doing, it's worth doing badly".
Aside from the search engine, the other thing the new system brings from a user's perspective is a more commodious display of the data. The layout, ordering and indeed the contents of the delivered HTML can easily be changed by editing a single controlling XSL style sheet, without touching the XML data, so it is easy to act on user comments (or editorial second thoughts) about the presentation of the material which previously might have required thousands of separate HTML pages to be recreated. In other words, the separation of visual design from logical structure that XML allows for is here given full scope.
The nature of XML markup has also allowed a significant extension of what the user, specifically of the Digital Dictionary of Buddhism, can be offered. As the very large set of references to Buddhist CJK terms in printed or other digital dictionaries which Muller and his associates have assembled was also marked up in XML, the DDB's facilities could be greatly expanded with little programming effort. If a user looks up a term in the search engine which is not in the DDB (or if s/he follows a provisional cross-reference in the DDB where the reference target has not yet been edited into place), a secondary lookup is performed on the external references data. If the term concerned is found there, the user is offered a listing of the locations in those external sources where the term is defined or explained. Given the very large number of entries in this secondary data collection (c.300,000 and rising), lookups are assisted by a Berkeley db database (itself automatically built from the core XML) interposed between the client and the XML sources: this is the only instance in the current implementation where information is not located by a direct parse of the core XML files.
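The role of the interposed database can be sketched with Python's standard dbm module standing in for Berkeley db. The terms and source attributions below are illustrative; in the real system the store is compiled automatically from the core XML, so each web lookup becomes a single key probe rather than a parse of the XML sources.

```python
import dbm
import os
import tempfile

# Illustrative records standing in for the ~300,000-entry composite index.
records = {
    "夏末": "Fo Kuang Shan dictionary|Ding Fubao",
    "人間": "Ding Fubao",
}

path = os.path.join(tempfile.mkdtemp(), "allindex")

# Build step: run once at indexing time, from the core XML.
with dbm.open(path, "c") as db:
    for term, sources in records.items():
        db[term.encode("utf-8")] = sources.encode("utf-8")

# Lookup step: run per web request; one hash probe, no XML parsing.
with dbm.open(path, "r") as db:
    hit = db["夏末".encode("utf-8")]
print(hit.decode("utf-8"))
```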
2.2 System in Operation
Each headword in the dictionary has a unique identifier as part of the markup. This ID is derived algorithmically from the name of the dictionary plus the Unicode numerical representation of the characters in the term. When a term is requested, either from one of the various user-accessible indices or as a result of a search engine query, the relevant ID is passed to a cgi script on the server. That script parses the appropriate xml file[3], locates the entry by its ID and extracts it, then passes the resulting XML fragment on to an XSLT processor[4], which converts it into HTML while building the necessary hyperlinks for any cross references the entry contains.
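The text above does not spell out the exact ID algorithm, so the sketch below assumes one plausible form: a dictionary prefix joined to the hexadecimal code point of each character in the headword. Any scheme of this shape is deterministic, so the same headword always yields the same ID without any central registry of identifiers.

```python
# Hypothetical reconstruction of the ID scheme: dictionary name plus the
# Unicode code point of each headword character, in hex.
def entry_id(dictionary: str, term: str) -> str:
    return dictionary + "-" + "-".join(f"{ord(ch):x}" for ch in term)

print(entry_id("ddb", "夏末"))   # → ddb-590f-672b
```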
In practice, this process is complicated (but also accelerated from a user perspective) by a system of caching, by which both XML fragments and the corresponding HTML version, once created by an initial request, are stored so that future requests can be met without further parsing or transforming, until the editors alter the items concerned in the XML (which automatically invalidates any cached copies of the altered material), or alter the XSLT style sheet that controls presentation (upon which all cached HTML is marked invalid so that it will be regenerated with the changed presentation next time the XML is retrieved).
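The invalidation rule reduces to a timestamp comparison: a cached rendering is valid only while it is at least as new as both the XML source and the style sheet that produced it. A minimal sketch, with illustrative file names:

```python
import os
import tempfile
import time

def cache_is_valid(cached, xml_source, stylesheet):
    """A cached HTML file is valid only if it is at least as new as both
    the XML it was extracted from and the XSLT that rendered it."""
    if not os.path.exists(cached):
        return False
    c = os.path.getmtime(cached)
    return c >= os.path.getmtime(xml_source) and c >= os.path.getmtime(stylesheet)

d = tempfile.mkdtemp()
xml, xsl, html = (os.path.join(d, n) for n in ("entry.xml", "ddb.xsl", "entry.html"))
for p in (xml, xsl, html):        # create sources, then the cached rendering
    open(p, "w").close()
print(cache_is_valid(html, xml, xsl))    # True: cache written after sources

# Editors change the style sheet: every cached rendering becomes stale.
os.utime(xsl, (time.time() + 10, time.time() + 10))
print(cache_is_valid(html, xml, xsl))    # False: will be regenerated on demand
```

Touching a single XML file invalidates only the fragments cached from it, whereas touching the style sheet invalidates everything, which matches the behaviour described above.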
2.3 Platform Requirements
Though earlier work on XML fragment retrieval and rendering in real time was done on University servers which I specified and managed, giving me complete control of the hardware and software, the programs that deliver these dictionaries can run on servers which offer only the limited configuration facilities found at the inexpensive end of the commercial hosting market. No privileged access to the machine is needed to install or maintain them. There is, of course, a performance penalty: the whole thing would run faster and handle more simultaneous users without performance deterioration if it could be moved "in process" with the Web server, so that the script handling system did not have to be loaded and initialized for every single request, as happens at the moment. But performance is broadly satisfactory for the present size of the datasets and should cope with their planned expansion. My modified methods mean that other scholars who would like to deploy a version of this system adapted to their particular data have the prospect of getting it to run without excessive dependence on the cooperation or expertise of their local server administrators. One indispensable requirement, for CJK applications at any rate, is the presence of up-to-date system libraries for handling Unicode, and experience suggests that these are more commonly found on commercial sector servers than on campus facilities.
2.4 Moral of this Tale
Charles Muller and I were, within a matter of days, able to pool our knowledge and interests and work together across nine time zones as effectively as if we had been in neighbouring offices. Humanities scholars who still insist that computers are no more than glorified but temperamental typewriters, and campus finance officers who believe only scientists need decent computer hardware or network connections, might like to consider revising their views. Anyone who thinks XML is either just a fad or tomorrow's technology can see its enabling power at work.