CHANT database of ancient Chinese texts: Ho: JoDI

Abstract

The CHinese ANcient Texts (CHANT) database is a long-term project, begun in 1988, to build a comprehensive database of all ancient Chinese texts up to the sixth century AD. The project is near completion and the entire database, which includes both traditional and excavated materials, will be released on the CHANT Web site (www.chant.org) in mid-2002. More than a decade of experience in establishing an electronic Chinese literary database has given us much insight that should be useful to the development of similar databases in the future. We made use of the best available versions of all texts, noting variant readings in footnotes. The biggest problem we encountered was the inclusion of rare and obsolete Chinese characters. For excavated materials, we also had to incorporate a considerable number of inscriptions in their original oracle bone and bronze forms. Since we started building the database, information technology has advanced so rapidly that we have had to upgrade the technology already in use in the database. Unifying the different sub-databases is also a daunting task. To maintain our competitive edge over free online Chinese databases, we need to continue developing new databases that build on the existing ones.

1 Introduction

The CHinese ANcient Texts (CHANT) database project was initiated by the Institute of Chinese Studies, Chinese University of Hong Kong in 1988, with the generous support of a grant from the University Grants Committee (UGC) of the Hong Kong Government. Its original scope was confined to building an electronic database of all pre-220 AD traditional Chinese texts. It has since grown into a long-term project that gathers all ancient Chinese texts from the two millennia between 1500 BC and about 600 AD into a single comprehensive database, which will become a major tool for the study of the entire field of ancient China.

2 CHANT Database

The database includes five components:

(1) pre-220 AD (the Pre-Han and Han period) traditional texts;
(2) 220-581 AD (the Weijin period) traditional texts;
(3) excavated texts on wood/bamboo strips and silk (Jianbo);
(4) excavated oracular inscriptions on tortoise shells and bones (Jiaguwen);
(5) traditional as well as excavated bronze inscriptions (Jinwen).

Components (1)-(4) have already been completed, and have thus far led to the publication of more than 80 concordances in print (Figure 1a-b) and more than 60 titles on electronic media (Figure 2). There are 103 titles (over 9,000,000 Chinese characters) in the Pre-Han and Han database, over 1,000 titles (over 21,000,000 Chinese characters) in the Weijin database, nine titles (over 300,000 Chinese characters) in the Jianbo database (Figure 3a-b) and over 1,000,000 Chinese characters in the Jiaguwen database. Parts of these databases have been available on the CHANT Web site since 1998, and we estimate that the whole CHANT database, with search functions, will be available in September 2002.

Figure 1a. Concordance texts with textual notes

Figure 1b. Sample of concordance

Figure 2. CD of the Pre-Han and Han traditional texts

Figure 3a. CD of Jianbo

Figure 3b. Display of original texts and orthographic translation

The original objective of the project was to publish paper concordances of ancient Chinese texts, because computers were not yet common among individual users and the software of the time could not handle multimedia data efficiently. Advances in information technology and its wider use have shifted the emphasis of the project to electronic versions. The high production costs of printed publications also explain the change of focus. Except for our ongoing publication of the Weijin traditional texts concordances and of materials not easily available on the market, we will mainly release our database via electronic media.

3 Difficulties

There are quite a number of hurdles on the way to a computerized system of ancient Chinese texts.

3.1 Accuracy

3.1.1 Texts

Establishing such an enormous database is a daunting task, and accuracy has to be maintained meticulously throughout. For traditional texts, we made use of the best available editions, mainly from the Sibucongkan library. It is well known that all extant editions of ancient Chinese texts are marred by corruptions. We therefore used reprints of editions which had not been tampered with, most of them compiled in the Song Dynasty (960-1279 AD). Since they were not amended, they preserve all the advantages and details of the original texts on which they were based. By contrast, a commonly used version of Huainanzi published in 1990 ¹ amended some of the original text. The following sentence appears in the first chapter of that edition of Huainanzi:

Text: ²yu hai zhi xin wang yu zhong, ze ji hu ke wei, he kuang gou ma zhi lei fu!

Translation: If one does not harbor a heart that jeopardizes, he can pull the tail of a hungry tiger. How much more the tails of animals like horses and dogs!

A prominent Sinologist of the Qing Dynasty, Wang Niansun (1744-1832 AD), drawing on his expertise in ancient Chinese texts, suggested that the word Hai (jeopardize) should be Rou ³ which is the variant form of Rou (covets its meat). Thus, the meaning of that sentence would be:

But where one does not harbor a heart that covets its meat, he can pull the tail of even a hungry tiger. How much more the tails of animals such as horses and dogs! ⁴

In fact, Hai is written as Rou in a small word edition ⁵ preserved in the Sibucongkan. It illustrates the importance of using accurate editions for ancient Chinese texts.

Textual comparisons across different versions were therefore carried out for all CHANT concordances, and every concordance was proofread carefully at least five times by our team of graduate-level editors, led by scholars of ancient Chinese texts such as Prof. D. C. Lau. Variant readings are given in footnotes, so that readers who disagree with our judgment can always refer back to the original wording. Since we based our texts not on modern punctuated versions but on the best ancient versions, and added punctuation and textual notes ourselves, our texts are unique and have come to be recognized and valued as such in the academy. For excavated texts, prominent experts in particular fields were invited to join the editorial team. This level of accuracy places heavy demands on human resources.

3.1.2 Data Input

Given the vast amount of data, part of the data input was inevitably outsourced to data entry companies in Mainland China. However, differences between Hong Kong and Mainland China in writing system (traditional vs simplified characters), coding (Big 5 vs GB) and input methods, together with related technical issues, produced problems which cannot be solved easily even with the most up-to-date word processor. Besides affecting accuracy, they also interfered with our computer programs, and a lot of manual effort is needed to tackle them. After the data entry companies have converted the texts from GB to Big 5 codes and proofread them to minimize errors, we still have to proofread the texts meticulously, because there are quite a number of Chinese characters which the data entry companies cannot handle. We have to locate those characters one by one and then either substitute equivalent characters for them or create new characters, as described in section 3.2.
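The step of locating characters that fall outside the standard character set can be partly automated. The following is a minimal sketch rather than the project's actual tooling: it assumes the text is available as a Python string and uses Python's standard `big5` codec as a stand-in for the character set of ordinary Chinese fonts. The function name `find_unencodable` is hypothetical.

```python
def find_unencodable(text, encoding="big5"):
    """Return (position, character) pairs that the target encoding cannot represent."""
    problems = []
    for position, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            problems.append((position, char))
    return problems


# U+20000 lies outside Big5; here it merely stands in for a rare character.
sample = "淮南子\U00020000原道"
for position, char in find_unencodable(sample):
    print(f"position {position}: U+{ord(char):04X} needs a substitute or a new glyph")
```

A report of this kind only narrows down where to look. Deciding whether a flagged character has an acceptable equivalent remains an editorial judgment.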

3.2 Rare and obsolete Chinese characters

3.2.1 Traditional Texts

A serious problem is the treatment of rare and obsolete Chinese characters. There are only 26 letters in the English alphabet, but there are around 60,000 Chinese characters and variant forms in ancient Chinese texts. Ordinary standard Chinese fonts, however, contain only 13,094 traditional characters. When we encounter a rare character, we need to go through the time-consuming process of locating it in sizeable dictionaries to see whether it is equivalent to another character which is already present in ordinary Chinese fonts. For example, one rare variant form is equivalent to 解 in both meaning and pronunciation, but it is not included in any ordinary Chinese font. We would substitute 解 for it in order to save font space. If a fully equivalent character cannot be found, or the context prevents us from using an equivalent, we need to create a new and unique character for that particular rare character. We have taken advantage of the font extension capabilities of Chinese Windows for the digitization of rare characters. We have already created over 5,500 new characters for this purpose and have collected more than 1,000 equivalent characters for traditional texts (Figure 4). We used a readily available Chinese font maker to create the rare characters. Since the size of the same radical varies between characters, painstaking effort is needed to get the proportions of the radicals right so that each created character has a balanced shape.

Figure 4. Self-created characters
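The two-way decision described in this section (substitute an equivalent where one exists, otherwise create a new character) can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the function name is hypothetical, created characters are given code points in the Unicode Private Use Area rather than in the font extension area of Chinese Windows, and the pairing in the usage example is a placeholder, not a real philological equivalence. The case where context rules out an otherwise valid equivalent is not modelled.

```python
PUA_START = 0xE000  # start of the Unicode Private Use Area (assumed home for new glyphs)


def resolve_rare_characters(text, equivalents, created):
    """Substitute known equivalents; give every other non-Big5 character its own code point.

    `equivalents` maps a rare character to a standard character.
    `created` maps a rare character to its assigned code point. It is extended
    in place and should only ever be filled by this function.
    """
    output = []
    for char in text:
        try:
            char.encode("big5")
            output.append(char)  # already in the ordinary character set
            continue
        except UnicodeEncodeError:
            pass
        if char in equivalents:
            output.append(equivalents[char])
        else:
            if char not in created:
                created[char] = PUA_START + len(created)
            output.append(chr(created[char]))
    return "".join(output)


equivalents = {"\U00020000": "解"}  # placeholder pairing for illustration only
created = {}
resolved = resolve_rare_characters("\U00020000\U00020001\U00020001", equivalents, created)
print([f"U+{ord(c):04X}" for c in resolved])
```

Keeping the equivalence table and the table of created characters separate mirrors the distinction drawn above between the equivalent characters collected and the new characters created.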

3.2.2 Excavated Texts

Figure 5. Original oracle bones

Excavated texts were even more complicated. Besides showing the orthographic translation of their characters, which involves a vast number of self-created characters, we also needed to incorporate the original forms. Since there is no existing font for characters in excavated texts, we needed to scan all images and graphics of the inscriptions on the original oracle bones and bronze vessels (Figure 5). This not only required repeated handling of data, but also used up font space rapidly. We therefore had to devise separate sets of fonts for the orthographic translation and for the original forms of Jiaguwen, Jinwen and Jianbo.

Figure 6a. New Compilation of Jiagu Characters

We have created 6,051 characters in original Jiaguwen form. Repeated cross-platform testing is essential for the accurate display of these fonts. We have recently published all the data in our Jiaguwen database in "A New Compilation of Jiagu Characters" ⁶ (Figure 6a-d). After extensive revisions and adjustments of all previously published data, the total number of Jiagu characters has been determined to be 6,051, of which 4,971 are distinct characters and 1,980 their variant forms. This represents a large increase over corresponding figures in previous works. On our Web site, all Jiaguwen characters are numbered according to previously published data and are hyperlinked to make the original forms readily searchable (Figure 6e-f).

Figure 6b. List of Jiagu radicals

Figure 6c. Supplementary information of Jiagu data

Figure 6d. Sample of Jiagu characters under the radical "Man"

Figure 6e. Jiagu characters with hyperlinks

Figure 6f. Display of search results

We are still establishing the electronic Jinwen database. We have published the Yinzhou Jinwen Zhicheng Shiwen ⁷ in which the orthographic translation of all characters on bronze vessels and their original images are juxtaposed with each other (Figure 7). We estimate that the number of individual Jinwen characters will exceed 100,000 in total. Since Jinwen characters, unlike Jiaguwen, do not have recognized and unique numbers, we cannot treat them the way we did for Jiaguwen. All Jinwen characters will be categorized according to different radicals in order to make them readily retrievable. With our experience in the Jiaguwen database, we are confident that the Jinwen database, besides displaying all Jinwen characters and their orthographic translation, can also provide powerful search functions for individual characters.

Figure 7. Original Jinwen and orthographic translation
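Categorizing characters by radical, as planned for Jinwen, amounts to building an index keyed on the radical. The sketch below shows one simple way to do this. The record fields, identifiers and image paths are hypothetical and do not reflect the structure of the CHANT Jinwen database.

```python
from collections import defaultdict


def build_radical_index(records):
    """Group character records by radical so each group can be browsed or searched."""
    index = defaultdict(list)
    for record in records:
        index[record["radical"]].append(record)
    return index


# Hypothetical records: identifiers and image paths are illustrative only.
records = [
    {"id": "jw-0001", "radical": "人", "transcription": "人", "image": "img/jw-0001.png"},
    {"id": "jw-0002", "radical": "水", "transcription": "河", "image": "img/jw-0002.png"},
    {"id": "jw-0003", "radical": "人", "transcription": "伐", "image": "img/jw-0003.png"},
]
index = build_radical_index(records)
print([record["id"] for record in index["人"]])
```

Because each record carries both a transcription and a reference to the original image, a lookup by radical can return the orthographic translation and the original form together.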

3.2.3 Limitations

Since we have different sets of self-created fonts for our database, users need to download our fonts before they can view the created characters correctly on our Web site or our CD-ROMs. After downloading our self-created TrueType fonts (.ttf), users can view those characters as if they were already in standard Chinese fonts, without altering their own font settings. Although browsers with multi-language support, e.g. IE 5 or above, allow users to view our data in standard Chinese fonts, the font extension capability is native only to Chinese Windows. Thus, only users with traditional Chinese Windows can install our .ttf files and view our created characters.

When we created the characters for both traditional texts and the orthographic translation of excavated materials, we also assigned each of them input codes for Changjie and the Quick Input Method, the two most commonly used Traditional Chinese input methods. To input the created characters, users need to download our True Type Extension (.tte) and the input method (.tbl) for each font and then associate the ones they need with their computers. Since only one .tbl can be associated with the Changjie Input Method at any one time, users who want to input created characters from different fonts need to change the Changjie association each time. Because this would be quite troublesome, we have not provided .tte and .tbl files for the time being. We are, however, planning to prepare some handy input methods which need not be associated with Changjie at all, and which will enable users to input our created characters without repeatedly altering their font settings.

3.3 Changing platforms

The CHANT database was initially built on the Eten Chinese system, which was the most popular DOS Chinese system for PCs in the 1980s and could handle Chinese texts well. We have published more than 60 concordances in print and 30 in electronic format for the Pre-Han and Han component using Eten. It is now outdated, however, because it does not readily support multi-tasking or a graphical interface, both of which are vital for working with the excavated materials. We have therefore moved to Windows for its wide usage and better functions. We have also had to adopt many new techniques and tools in our Web site development. The CHANT Web site is hosted on a Microsoft Internet Information Server 5.0 (IIS) in a Windows 2000 Server environment. We use Microsoft Index Server to index the database, and Active Server Pages (ASP) and Open DataBase Connectivity (ODBC) for user and security management. Cascading Style Sheets (CSS) and JavaScript are used for page formatting and user communication. A flexible search engine with various search functions was also incorporated into the system. Hence, the online version of the CHANT database is not a simple reformatting of the original design but a total reengineering effort.

Since all software ages rapidly, we need to keep upgrading the functionality of our database. We anticipate that once we have successfully migrated all our data to Windows, future upgrades will be less laborious than the conversion from Eten to Windows. We are now also using Extensible Markup Language (XML), for its versatility and better optimization, to mark up our new data. As well as upgrading our database, we have kept our raw data in plain text form as a backup. We therefore have two sets of data: raw data and output data.
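To illustrate what XML markup of a text with variant readings might look like, the sketch below encodes the two readings of the Huainanzi passage discussed in section 3.1.1 and reads them back. The element and attribute names are hypothetical and are not the schema used by CHANT. The readings are given in romanization only.

```python
import xml.etree.ElementTree as ET


def readings_by_source(xml_text):
    """Map each edition named in the markup to the reading it carries."""
    passage = ET.fromstring(xml_text)
    return {r.get("source"): r.text for r in passage.findall("reading")}


# Element and attribute names are assumptions made for this illustration.
passage = ET.Element("passage", {"work": "Huainanzi", "chapter": "1"})
for source, word in [("1990 edition", "hai"), ("Sibucongkan edition", "rou")]:
    reading = ET.SubElement(passage, "reading", {"source": source})
    reading.text = word

xml_text = ET.tostring(passage, encoding="unicode")
print(xml_text)
print(readings_by_source(xml_text))
```

Recording the source of each reading in the markup keeps the variant attached to the passage itself, so the same data can drive both a footnote in print and a search result online.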

All data in our existing database, which were developed over more than a decade, are in Big 5 code which was the only available Traditional Chinese code then. As mentioned before, because of the limited font space, we need to have different sets of fonts for different databases, and this would hinder our development. Unicode, with an enormous font space (over 60,000 characters), seems to be a promising solution to this problem. Unicode has become the standard for disseminating multi-lingual texts over the Web and for digitization of Asian languages. We will construct our new database, e.g. the Leishu Database (see section 3.4), using Unicode.
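Moving existing Big 5 data to Unicode is, for standard characters, a matter of transcoding. The sketch below uses Python's standard `big5` codec and counts the places where a byte sequence has no standard mapping. It is an assumption of this example that project-created characters would surface as such unmapped sequences. They would need an explicit mapping table, which is not shown here.

```python
def big5_to_utf8(data):
    """Transcode Big5 bytes to UTF-8 and count the replacement characters inserted."""
    text = data.decode("big5", errors="replace")
    unmapped = text.count("\ufffd")
    return text.encode("utf-8"), unmapped


# The trailing bytes mimic a code from a user-defined area with no standard mapping.
legacy = "中文".encode("big5") + b"\xfa\x40"
utf8_bytes, unmapped = big5_to_utf8(legacy)
print(utf8_bytes.decode("utf-8"), unmapped)
```

A non-zero count signals that the file contains codes the standard mapping does not cover, and so should be checked by hand before the converted text is relied on.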

3.4 Competitive online database

There are a few free online databases providing digital versions of ancient Chinese texts. To maintain our competitive edge over them, we need to keep a clear focus and continue to develop new databases on top of the existing ones. We aim to include all pre-600 AD ancient Chinese texts. The database of excavated materials is our primary advantage, because we are the first group in the world to have devised a digital database for excavated materials with search functions. The high accuracy of our data and the unique CHANT texts are further strengths. We have a lot of experience in dealing with parallel passages in ancient Chinese texts: a database of the "Parallel Passages Found in Pre-Han and Han Traditional Texts" is in progress. Since we have the entire body of ancient Chinese texts in the database, much can be gained by comparing the extant versions of these texts with the versions preserved in, for instance, Chinese Encyclopedias (Leishu). We are currently establishing a large database of Leishu which will, when completed, provide a powerful search tool for all parallel passages in ancient Chinese texts and Leishu alike.

Notes

¹ Chen, G. Z. (1990) Huainanzi Yizhu (Jilin: Wenshi Chubenshi)

² Chen, G. Z. (1990) Huainanzi Yizhu (Jilin: Wenshi Chubenshi), p. 15

³ Wang, N. S. (1985) (reprint) Dushu Zazhi (Nanjing: Jiangsu Guji Chubenshi), p. 765

⁴ Lau, D. C. and Ames, R. T. (1998) (trans.) YuanDao, Tracing Dao to its Source (New York: Ballantine Publishing Group), p. 79

⁵ Liu, A. (1974) (reprint of Sibucongkan) Huainanzi (Taipei: Yiwen Yinshuguang), p. 13

⁶ Shen, J. H. and Cao, J. Y. (2001) A New Compilation of Jiagu Characters (Hong Kong: The Chinese University Press)

⁷ Institute of Archaeology of the Chinese Academy of Social Sciences (2001) (eds) Yinzhou Jinwen Zhicheng Shiwen (Hong Kong: Chinese University Press)

CHANT (CHinese ANcient Texts): a comprehensive database of all ancient Chinese texts up to 600 AD