Vocabulary Mapping for Terminology Services

Diane Vizine-Goetz, Carol Hickey, Andrew Houghton and Roger Thompson
OCLC Research, OCLC Online Computer Library Center, Inc.
Email: vizine@oclc.org; hickeyc@oclc.org; houghton@oclc.org; thompson@oclc.org
Project Web Site: http://www.oclc.org/research/projects/termservices/

Abstract

The paper describes a project to add value to controlled vocabularies by making inter-vocabulary associations. A methodology for mapping terms from one vocabulary to another is presented in the form of a case study applying the approach to the Educational Resources Information Center (ERIC) Thesaurus and the Library of Congress Subject Headings (LCSH). Our approach to mapping involves encoding vocabularies according to Machine-Readable Cataloging (MARC) standards, machine matching of vocabulary terms, and categorizing candidate mappings by likelihood of valid mapping.  Mapping data is then stored as machine links. Vocabularies with associations to other schemes will be a key component of Web-based terminology services. The paper briefly describes how the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is used to provide access to a vocabulary with mappings.

1 Introduction

Most tools and features that provide access to the names, subjects, and classification categories assigned to content objects are themselves not easily accessible to people or computers. The knowledge organization schemes and the features found in cataloging and retrieval systems are often deeply embedded in proprietary formats and software. Even when knowledge organization resources are openly available, they are rarely linked with other compatible schemes or services. This paper describes a project to add value to controlled vocabularies through vocabulary mapping. The vocabulary associations are then made accessible through Web services.

In this paper, 'terminology services' is used to describe Web services involving various types of knowledge organization resources, including authority files, subject heading systems, thesauri, Web taxonomies, and classification schemes. The term 'vocabulary' is used to refer to these knowledge organization resources. Vocabularies with associations to other schemes will be a key component of Web-based terminology services. Web services are modular, Web-based, machine-to-machine applications that can be combined in various ways. For background information on Web services, see  Gardner (2001) and Tennant (2002). Web services can be accessed at various points in the metadata lifecycle, for example, when a work is authored or created, at the time an object is indexed or cataloged, or during search and retrieval. A Web service that provides mappings from a term in one vocabulary to one or more terms in another vocabulary is an example of a terminology service.

2 Vocabulary compatibility

Researchers have been interested in achieving compatibility among controlled vocabularies for many years. Lancaster and Smith (1983) published an overview of the issues involved in integrating vocabularies, which is still relevant today. They describe several factors that influence how successfully one vocabulary can be associated with another. Researchers involved in more recent efforts to integrate vocabularies have identified additional factors affecting vocabulary compatibility.

Zeng and Chan (2003) review the primary methodologies used to associate and integrate vocabularies. Of the approaches they describe, two are relevant to this paper: direct mapping and co-occurrence mapping. Co-occurrence mappings are considered to be more loosely mapped than direct mappings, which usually have an intellectual review component.

3 OCLC vocabulary projects

3.1 Dewey mappings

In 1994, OCLC staff began linking Library of Congress Subject Headings (LCSH) to the Dewey Decimal Classification (DDC) scheme. DDC/LCSH pairs were generated from OCLC WorldCat records that contained both DDC numbers and LCSH. Co-occurrence mappings were made for frequently occurring pairs. Later, an association measure was introduced in the co-occurrence mapping process to provide a better indicator of association than simple pair frequencies (Vizine-Goetz 1998). Approximately 90,000 co-occurrence mappings have been made in WebDewey, the electronic version of the DDC. An example of DDC/LCSH co-occurrence mappings is shown below for DDC class, 617.522 Oral region-surgery:

LC Subject Headings
Cleft lip
Cleft lip-Surgery
Cleft palate
Mouth-Diseases
Mouth-Microbiology
Mouth-Surgery
Oral medicine
Temporomandibular joint-Diseases

The mapped LCSH provide additional indexing vocabulary for the electronic version of the DDC and also assist catalogers in assigning subject headings. These terms are also included in versions of the DDC used in automated classification services.

3.2 Other mappings

The scope of OCLC's vocabulary mapping research projects has expanded to include additional classification schemes, subject heading systems, and thesauri. A list of OCLC vocabulary associations and the mapping approach used (direct, co-occurrence, or both) is shown in Table 1. In addition to DDC/LCSH co-occurrence mappings, direct mappings have been made between selected classes from the Library of Congress Classification (LCC) and the National Library of Medicine Classification (NLMC) and DDC. The LCC/DDC mappings and NLMC/DDC mappings are used to profile questions and expertise for virtual reference services. Project staff members have also made direct mappings of genre terms for fiction and drama (GSAFD) to LCSH and to LCSHac (headings for children's materials) using the procedures outlined in this paper. Because the GSAFD vocabulary is quite small - only 153 preferred terms - and based largely on LCSH, the GSAFD mapping effort was not considered a suitable test of our mapping approach. For these reasons, the approach was applied to another vocabulary.

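Table 1. OCLC vocabulary associations (rows: From vocabulary; columns: To vocabulary; "-" marks no mapping)
From \ To | DDC | ERIC | GSAFD | LCC | LCSH | LCSHac | MeSH | NLMC
DDC (Dewey Decimal Classification) | - | - | - | Direct | Direct & Co-occur | Direct & Co-occur | Direct | Direct
ERIC Thesaurus | - | - | - | - | Direct | - | - | -
GSAFD (Genre terms for fiction) | - | - | - | - | Direct | Direct | - | -
LCC (Library of Congress Classification) | Direct | - | - | - | - | - | - | -
LCSH (LC Subject Headings) | Direct & Co-occur | Direct | Direct | Co-occur | - | - | Direct | -
LCSHac (LC Children's Headings) | Direct & Co-occur | - | - | - | - | - | - | -
MeSH (Medical Subject Headings) | Direct | - | - | - | Direct | - | - | -
NLMC (National Library of Medicine Classification) | Direct | - | - | - | - | - | - | -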

The GSAFD vocabulary terms with mappings are accessible using the OAI-PMH. The OAI-PMH defines a simple HTTP-based protocol for automated sharing of metadata, but as the OAI-Cat effort has shown, the approach works equally well for sharing other XML content. The content of the GSAFD records is MARC in XML (MARC Standards). The records are accessible to users via a browser (http://alcme.oclc.org/gsafd/) and to machines through the OAI-PMH Web services mechanisms. See Van de Sompel et al. (2003) for a more complete description of how the file can be accessed using the OAI-PMH. The GSAFD/LCSH mapping file can also be downloaded from our project Web site. The file is encoded in MARC in XML and also according to version 0.5 of the Zthes schema. We have also prototyped some experimental Web services using co-occurrence mappings between the GSAFD vocabulary and LCSH.
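As a rough illustration of machine access through the OAI-PMH, the sketch below only builds request URLs; it does not contact any server. The verbs (ListRecords, GetRecord) and arguments (metadataPrefix, identifier) come from the OAI-PMH specification. The base URL and the record identifier are hypothetical placeholders, since the paper does not state the endpoint used for the GSAFD file.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the paper does not give the OAI-PMH base URL.
BASE_URL = "http://example.org/oai"


def oai_request(verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    query.update(arguments)
    return BASE_URL + "?" + urlencode(query)


# Request all records in oai_dc, the one metadata format that every
# OAI-PMH repository is required to support.
list_url = oai_request("ListRecords", metadataPrefix="oai_dc")

# Request a single record; the identifier is made up for illustration.
get_url = oai_request(
    "GetRecord",
    identifier="oai:example.org:gsafd-1",
    metadataPrefix="oai_dc",
)

print(list_url)
print(get_url)
```

A repository that serves MARC in XML would advertise an additional metadata prefix for it; a client can discover the available prefixes with the ListMetadataFormats verb rather than guessing.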

3.3 Mapping to LCSH

As Table 1 shows, much of our mapping activity involves LCSH. Describing the relationship between the Art and Architecture Thesaurus (AAT) and LCSH, Whitehead (1990, p. 82) asks: "Why map to LCSH?" and replies:

Despite the weaknesses and the critical assessments that have plagued LCSH over the years, the fact remains that LCSH is the standard vocabulary used by the majority of information resources, especially libraries, in the United States.

She also notes that efforts to improve or replace LCSH must take into account its widespread use and the probability that it will be maintained for a long time. Others have reached similar conclusions. For example, the FAST project sponsored by OCLC selected LCSH as the basis for creating a faceted vocabulary for metadata. O'Neill and Chan (2003) cite several reasons for choosing the LCSH scheme.

LCSH are also among the recommended encoding schemes that can be used to qualify the Dublin Core subject element. Several prominent projects that use Dublin Core metadata create subject elements based on LCSH, including the Colorado Digitization Program, DSpace, and ePrints UK.

3.4 Vocabulary encoding standards

Many standards exist for encoding vocabularies: see Koch (2003) and the SWAD-Europe Thesaurus Activity thesaurus link page for listings of some current standards. For authority files, subject headings and thesauri, we have decided to use the MARC21 Format for Authority Data. For classification data, we use the MARC21 Format for Classification Data. MARC was chosen because many large vocabularies are available in the MARC formats, and the MARC Authority format supports inter-vocabulary relationships, which are particularly important to us because of our mapping work. LCSH, for example, is available in the MARC authority format (see Appendix 4).

The MARC authority format enables us to provide detailed coding for many common controlled vocabulary elements. Preferred terms are coded in the block of MARC tags labeled 1XX. The tag 150 is used for topical terms. Non-preferred terms are coded in the 4XX range. The MARC authority format provides for the coding of some relationships between a preferred term and non-preferred terms, including earlier forms and acronyms. Broader term/narrower term relationships and associative relationships (related terms) are coded in MARC tags 5XX. Subfield $w is used to code relationships between 1XX and 4XX fields and 1XX and 5XX fields. Tags 7XX are used to provide links between equivalent terms in the same vocabulary and equivalent terms in different vocabularies. Section 5 provides a detailed explanation of MARC 7XX linking fields.
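The tag blocks described above can be summarized in a few lines of code. This is only a sketch of the roles named in this section; the function name and the role labels are ours, and tags outside the four blocks discussed are simply reported as "other".

```python
def tag_role(tag):
    """Return the role of a MARC authority tag as described in this section."""
    if not (len(tag) == 3 and tag.isdigit()):
        raise ValueError("MARC tags are three digits: %r" % tag)
    roles = {
        "1": "preferred term",                      # 1XX, e.g. 150 topical term
        "4": "non-preferred term",                  # 4XX
        "5": "broader, narrower, or related term",  # 5XX
        "7": "link to an equivalent term",          # 7XX
    }
    # Blocks not discussed here (control fields, notes, ...) fall through.
    return roles.get(tag[0], "other")


for tag in ("150", "450", "550", "750", "680"):
    print(tag, tag_role(tag))
```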

In the remainder of this paper we describe our approach to mapping the ERIC Thesaurus to LCSH. The ERIC Thesaurus was chosen because it is a well-established vocabulary, publicly accessible on the Web, and large enough to provide a meaningful test of our mapping approach. The ERIC Thesaurus is produced by the Educational Resources Information Center, an education information network, sponsored by the U.S. Department of Education, and provides public access to education literature (ERIC 2004).

4 Mapping the ERIC Thesaurus to LCSH

4.1 Converting ERIC to MARC

Vocabularies to be mapped are first converted to the MARC21 Authority Format. The effort involved in this step varies depending on the format of the source vocabulary (the vocabulary being mapped). We have converted vocabularies from formats primarily intended for display (e.g. word processing documents without extensive use of styles) and from more structured formats such as the ERIC file (Figure 1).

Multiple instances of broader terms (BT), narrower terms (NT), and related terms (RT) stored in single ERIC fields are encoded as separate fields in the MARC format (Figure 2). The RT field shown below generates 14 fields in the MARC record. These are the fields labeled with MARC tag 550 (without $w subfields). The field labeled UF is similarly converted into two MARC fields (tag 450). One of the terms, Student ability, represents a formerly valid term. The notation in parentheses in the ERIC record indicates this and gives the lifespan of the term. When this data is converted to MARC, a 688 field (Application History Note) is constructed for this data. In the 450 field, subfield $w is added to indicate the term was formerly valid. By encoding the source and target vocabularies in the MARC Authorities Format we are able to standardize the representation of similar information and improve our ability to match vocabularies.

Figure 1. Sample ERIC record
<TERM> Academic Ability
<SCOPE> The degree of actual competence to perform in scholastic or educational activities (Note: For potential competence, use "Academic Aptitude" -- for measured achievement, use "Academic Achievement")
<RT> Ability Grouping; Academic Achievement; Academic Aptitude; Academic Aspiration; Academically Gifted; Aptitude Treatment Interaction; Cognitive Ability; College Entrance Examinations; High Risk Students; Intelligence; Scholarship; Spatial Ability; Student Characteristics; Verbal Ability
<BT> Ability
<UF> Scholastic Ability; Student Ability (1966 1980)
<GROUP> 120
<TYPE> Main
<ADD> 07/01/1966

Figure 2. ERIC record in MARC21 authority format
001 ERIC00025
003 OCoLC-O
005 20031117154238.0
008 031118 n|a|znn|bb||||||||||| ||an| ||| d
040 $beng$cOCoLC-O$dOCoLC-O$eericd
072 $a120
150 $aAcademic Ability
450 $aScholastic Ability
450 $wa$aStudent Ability
550 $aAbility Grouping
550 $aAcademic Achievement
550 $aAcademic Aptitude
550 $aAcademic Aspiration
550 $aAcademically Gifted
550 $aAptitude Treatment Interaction
550 $aCognitive Ability
550 $aCollege Entrance Examinations
550 $aHigh Risk Students
550 $aIntelligence
550 $aScholarship
550 $aSpatial Ability
550 $aStudent Characteristics
550 $aVerbal Ability
550 $aAbility$wg
680 $iThe degree of actual competence to perform in scholastic or educational activities (Note: For potential competence, use "Academic Aptitude" -- for measured achievement, use "Academic Achievement")
688 $aStudent Ability (1966 1980)
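The conversion from Figure 1 to Figure 2 can be sketched as follows. This is an illustrative reconstruction from the two figures, not the project's conversion software: it handles only the data fields shown (072, 150, 450, 550, 680, 688), ignores the TYPE and ADD lines and the control fields, and abbreviates the scope note. The $wh coding for narrower terms is our assumption, since Figure 2 contains no NT field.

```python
import re

# A UF term followed by a lifespan, e.g. "Student Ability (1966 1980)".
LIFESPAN = re.compile(r"^(.*?)\s*\((\d{4}) (\d{4})\)$")


def split_terms(value):
    """Split a semicolon-separated ERIC field into individual terms."""
    return [t.strip() for t in value.split(";") if t.strip()]


def eric_to_marc(lines):
    """Convert ERIC record lines into (tag, data) pairs sorted by tag."""
    fields = []
    for line in lines:
        m = re.match(r"<(\w+)>\s*(.*)", line)
        if not m:
            continue
        label, value = m.group(1), m.group(2).strip()
        if label == "TERM":
            fields.append(("150", "$a" + value))
        elif label == "GROUP":
            fields.append(("072", "$a" + value))
        elif label == "SCOPE":
            fields.append(("680", "$i" + value))
        elif label == "RT":
            fields += [("550", "$a" + t) for t in split_terms(value)]
        elif label == "BT":
            fields += [("550", "$a" + t + "$wg") for t in split_terms(value)]
        elif label == "NT":  # assumed coding; not shown in Figure 2
            fields += [("550", "$a" + t + "$wh") for t in split_terms(value)]
        elif label == "UF":
            for t in split_terms(value):
                former = LIFESPAN.match(t)
                if former:
                    # Formerly valid term: flag it in 450 and keep the
                    # lifespan in a 688 note.
                    fields.append(("450", "$wa$a" + former.group(1)))
                    fields.append(("688", "$a" + t))
                else:
                    fields.append(("450", "$a" + t))
    # sorted() is stable, so fields keep their input order within a tag.
    return sorted(fields, key=lambda f: f[0])


record = [
    "<TERM> Academic Ability",
    "<SCOPE> The degree of actual competence to perform in scholastic or educational activities",
    "<RT> Ability Grouping; Academic Achievement; Academic Aptitude; Academic Aspiration; "
    "Academically Gifted; Aptitude Treatment Interaction; Cognitive Ability; "
    "College Entrance Examinations; High Risk Students; Intelligence; Scholarship; "
    "Spatial Ability; Student Characteristics; Verbal Ability",
    "<BT> Ability",
    "<UF> Scholastic Ability; Student Ability (1966 1980)",
    "<GROUP> 120",
]

marc_fields = eric_to_marc(record)
for tag, data in marc_fields:
    print(tag, data)
```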

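MARC field and subfield statistics are provided in Appendices 1-4 for the following versions of the files: (1) the ERIC Thesaurus, (2) the 773-record ERIC subset without mapping data, (3) the 773-record ERIC subset with mapping data, and (4) LCSH (updated in November 2003).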

As these statistics show, LCSH is a large vocabulary with more than 200,000 preferred terms (MARC tag 150) and nearly as many topical non-preferred terms (MARC tag 450). In contrast, the ERIC Thesaurus has about 6,000 preferred terms and 4,500 non-preferred terms. Although these statistics do not provide information about the potential subject overlap between ERIC and LCSH, the sheer size of the LCSH file compared with ERIC leads us to expect a favorable match rate. Statistics are provided for the subset of ERIC records, without and with mapping data, reported in this paper. This subset is described in detail in section 4.3.

4.2 Matching vocabulary terms

After the ERIC file is encoded in the MARC Authority format, the ERIC vocabulary is matched to the LCSH vocabulary. Using a series of computer programs, all preferred terms (MARC tag 150) and non-preferred terms (MARC tag 450) in the source and target vocabularies are matched. Differences in spacing, capitalization, and punctuation are ignored during the matching process. The following terms are considered matches:
 
ERIC Thesaurus Term | LCSH Term
Alzheimers Disease | Alzheimer's disease
Nurses Aides | Nurses' aides
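A minimal sketch of this kind of matching is shown below. It assumes that "ignoring" spacing, capitalization, and punctuation can be approximated by lowercasing each term and dropping every character that is not an ASCII letter or digit; the paper does not describe the actual normalization rules, and the function names are ours.

```python
import re


def match_key(term):
    """Normalize a term so that spacing, case, and punctuation are ignored."""
    return re.sub(r"[^a-z0-9]", "", term.lower())


def match_terms(source_terms, target_terms):
    """Return (source, target) pairs whose normalized forms are identical."""
    index = {}
    for target in target_terms:
        index.setdefault(match_key(target), []).append(target)
    return [
        (source, target)
        for source in source_terms
        for target in index.get(match_key(source), [])
    ]


eric = ["Alzheimers Disease", "Nurses Aides", "Echolocation", "Rh factors"]
lcsh = ["Alzheimer's disease", "Nurses' aides", "Echolocation (Physiology)", "Rh factor"]

pairs = match_terms(eric, lcsh)
for source, target in pairs:
    print(source, "=", target)
```

Indexing the target vocabulary by key keeps the comparison linear in the size of the two term lists, which matters when the target is as large as LCSH.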

Currently, plural versus singular forms, terms that differ only by the presence or absence of a parenthetical qualifier, and terms with a qualifier introduced by a comma are not being matched. These refinements would likely improve the match rate and will be employed in the next phase of the project.

ERIC Thesaurus Term | LCSH Term
Echolocation | Echolocation (Physiology)
Crack | Crack (Drug)
Radiology | Radiology, Medical
Rh factors | Rh factor
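One hypothetical way to implement the refinements named above is a looser match key that drops a trailing parenthetical qualifier, drops anything after the first comma, and strips a final "s". None of these rules is taken from the project's software, and each is deliberately naive.

```python
import re


def loose_key(term):
    """Match key that also ignores qualifiers and a trailing plural 's'."""
    # Drop a trailing parenthetical qualifier: "Crack (Drug)" -> "Crack".
    base = re.sub(r"\s*\([^)]*\)\s*$", "", term)
    # Drop a qualifier introduced by a comma: "Radiology, Medical" -> "Radiology".
    base = base.split(",")[0]
    key = re.sub(r"[^a-z0-9]", "", base.lower())
    # Naive singular form: strip one trailing "s", but leave "ss" alone.
    if key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key


examples = [
    ("Echolocation", "Echolocation (Physiology)"),
    ("Crack", "Crack (Drug)"),
    ("Radiology", "Radiology, Medical"),
    ("Rh factors", "Rh factor"),
]
for eric_term, lcsh_term in examples:
    print(eric_term, lcsh_term, loose_key(eric_term) == loose_key(lcsh_term))
```

Looser keys raise the match rate at the cost of more false matches, because a qualifier often exists precisely to separate two concepts that share a word. Matches found only through a loose key would presumably need the same manual review as the other match types.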

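A total of 3,797 ERIC terms were matched to LCSH and categorized according to the following match types: PT/PT (an ERIC preferred term matches an LCSH preferred term), PT/NPT (an ERIC preferred term matches an LCSH non-preferred term), NPT/NPT (an ERIC non-preferred term matches an LCSH non-preferred term), and NPT/PT (an ERIC non-preferred term matches an LCSH preferred term).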

4.3 Evaluating matches

Four categories of ERIC terms were reviewed and analyzed (numbers in parentheses are ERIC category codes): Learning & perception (110), Individual development & characteristics (120), Health & safety (210), and Disabilities (220). This subset comprises about 13% of ERIC preferred terms (773 of 6,080). Statistics for PT/PT matches and PT/NPT matches are shown in Table 2. Columns 3 and 4 show the number of term matches and concept matches for PT/PT matches and columns 5 and 6 present this information for PT/NPT matches.

Table 2. PT/PT and PT/NPT matches
ERIC Category | Total Preferred Terms | PT/PT Term Match | PT/PT Concept Match | PT/NPT Term Match | PT/NPT Concept Match
Learning & perception (110) | 164 | 49 | 49 | 10 | 8
Individual development & characteristics (120) | 269 | 83 | 81 | 25 | 23
Health & safety (210) | 227 | 129 | 127 | 31 | 30
Disabilities (220) | 113 | 37 | 37 | 12 | 10
Total | 773 | 298 | 294 | 78 | 71

About 99% of PT/PT matches were found to represent equivalent concepts in the two vocabularies and 91% of PT/NPT matches represent equivalent concepts. Very few false matches were observed for these two match types. A false match occurs when terms from the vocabularies are identical but the concepts represented are different. Some examples of false matches are:

Term | ERIC | LCSH
Females | For works on human females | For works on female organisms in general. Works on the human female are entered under Women.
Males | For works on human males | For works on male organisms in general. Works on the human male are entered under Men.
Radiology | For works on the use of radiation in medical diagnosis and treatment. | For works on radiological physics
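The rates quoted in this section follow directly from Table 2. The short calculation below recomputes them from the table rows; the variable names are ours.

```python
# Rows copied from Table 2, keyed by ERIC category code:
# (preferred terms, PT/PT term, PT/PT concept, PT/NPT term, PT/NPT concept)
TABLE_2 = {
    "110": (164, 49, 49, 10, 8),
    "120": (269, 83, 81, 25, 23),
    "210": (227, 129, 127, 31, 30),
    "220": (113, 37, 37, 12, 10),
}

# Column totals across the four categories.
totals = [sum(column) for column in zip(*TABLE_2.values())]
preferred, ptpt_term, ptpt_concept, ptnpt_term, ptnpt_concept = totals

# Share of term matches that turned out to be concept matches.
ptpt_rate = ptpt_concept / ptpt_term
ptnpt_rate = ptnpt_concept / ptnpt_term

# Equivalent concepts found, and their share of the subset.
equivalent = ptpt_concept + ptnpt_concept
coverage = equivalent / preferred

print(f"PT/PT concept rate:  {ptpt_rate:.1%}")
print(f"PT/NPT concept rate: {ptnpt_rate:.1%}")
print(f"Equivalent concepts: {equivalent} of {preferred} ({coverage:.1%})")
```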

A total of 365 (294 + 71) equivalent concepts were identified. This is 47% (365/773) of the preferred terms in the ERIC subset. All matches in the subset were manually reviewed to determine which matches represented valid mappings. The following guidelines established in the Northwestern University LCSH/MeSH mapping project (Olson and Strawn 1997) were applied in the evaluation:

(PT matches NPT)
ERIC | LCSH | Match Type | Valid Mapping
PT: Ametropia | PT: Eye-Refractive errors; NPT: Ametropia | PT/NPT | Yes

(PT matches PT) and (NPT matches PT)
ERIC | LCSH | Match Type | Valid Mapping
PT: Cleft Palate; NPT: Cleft Lip | PT: Cleft Palate | PT/PT | Yes
(same ERIC record) | PT: Cleft Lip | NPT/PT | Yes
[Diagram: multiple mappings, (PT matches PT) and (NPT matches PT)]

(NPT matches NPT) and (NPT matches PT)
ERIC | LCSH | Match Type | Valid Mapping
PT: Extraversion Introversion; NPT: Ambiversion; NPT: Extroversion; NPT: Introversion | PT: Extraversion; NPT: Extroversion | NPT/NPT | Yes
(same ERIC record) | PT: Introversion | NPT/PT | Yes
[Diagram: multiple mappings, (NPT matches NPT) and (NPT matches PT)]

The match types guided our review of the matches. Matches were coded by type and each type was assigned a different color. PT/PT (white) matches were reviewed first, followed by PT/NPT (green). Evaluation of these matches was relatively straightforward since most involved one-to-one matches. NPT/NPT (yellow) and NPT/PT (blue) were more complex to review because they often involved matches to multiple terms in the target vocabulary.

(PT matches NPT) and two (NPT matches PT)
ERIC | LCSH | Match Type | Valid Mapping
PT: Adolescents | PT: Teenagers; NPT: Adolescents | PT/NPT | Yes
NPT: Adolescence | PT: Adolescence | NPT/PT | No
NPT: Teenagers | PT: Teenagers | NPT/PT | Yes

In the example above, the NPT/PT match on the term Adolescence is an invalid mapping because the ERIC term and the LCSH term represent different concepts. The ERIC term Adolescents is for works on young people, 13-17 years of age. The LCSH term Adolescence is for works on the physiological, psychological, or social development of adolescents. The ERIC term, Adolescent Development, is a better match for the latter term. For terms that matched three or more LCSH terms, e.g. Neurological Impairments, the review could be quite time-consuming and sometimes did not yield a correct mapping. In the subset, NPT/NPT matches represent equivalent concepts about 82% of the time (18 of 22 reviewed matches), and NPT/PT matches represent equivalent concepts about 55% of the time (26 of 47 reviewed matches). This last set of statistics should be viewed with some caution, given the small number of matches analyzed. Even so, the mapping results do have some interesting implications for future mapping projects.

If the term/concept-mapping rate is constant within a vocabulary, it should be possible to predict the expected mapping rate for a vocabulary based on a review of a sample of matches. Further, if the false match rate can be predicted reliably, review of matches with a high term/concept-mapping rate (PT/PT and PT/NPT, Table 2) could be dispensed with when the false match rate is below a particular threshold. Only those types of matches with low term/concept mapping rates (NPT/NPT and NPT/PT, Table 3) would need to be reviewed. Further, for matches requiring review, more experienced reviewers could be assigned to complex matches while less experienced reviewers could be given simpler matches.

Table 3. NPT/NPT and NPT/PT matches
ERIC Category | Total NPT matches | NPT/NPT Term Match | NPT/NPT Concept Match | NPT/PT Term Match | NPT/PT Concept Match
Learning & perception (110) | 12 | 6 | 6 | 6 | 4
Individual development & characteristics (120) | 57 | 16 | 12 | 41 | 22
Health & safety (210) | 60 | 22 | n/a | 38 | n/a
Disabilities (220) | 30 | 15 | n/a | 15 | n/a
Total | 159 | 59 | 18 | 100 | 26
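The review strategy suggested above can be illustrated with the counts from Tables 2 and 3. In this sketch, a match type is sent to manual review only when its observed false match rate exceeds a threshold. The 10% threshold is a hypothetical value chosen for illustration; the paper does not propose a specific figure. For NPT/NPT and NPT/PT, only categories 110 and 120 are used, since those are the rows with concept counts.

```python
# (term matches, concept matches) among matches that were reviewed.
REVIEWED = {
    "PT/PT": (298, 294),           # Table 2 totals
    "PT/NPT": (78, 71),            # Table 2 totals
    "NPT/NPT": (6 + 16, 6 + 12),   # Table 3, categories 110 and 120
    "NPT/PT": (6 + 41, 4 + 22),    # Table 3, categories 110 and 120
}


def needs_review(reviewed, max_false_rate):
    """Flag match types whose false match rate exceeds the threshold."""
    decisions = {}
    for match_type, (terms, concepts) in reviewed.items():
        false_rate = 1 - concepts / terms
        decisions[match_type] = false_rate > max_false_rate
    return decisions


# Hypothetical threshold: accept up to 10% false matches without review.
decisions = needs_review(REVIEWED, max_false_rate=0.10)
for match_type, review in decisions.items():
    print(match_type, "review" if review else "accept")
```

With this threshold, PT/PT and PT/NPT matches would be accepted without review and the two NPT match types would be reviewed, which is the split described in the text. A stricter threshold would move PT/NPT into the review group.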

5 Inter-vocabulary linking

Vocabulary links are stored in MARC fields 7XX. Using these fields, we can encode the mapped term, the vocabulary it belongs to, its unique identifier, and the organization that supplied the mapping. In the example in Figure 3, the first two 750 fields are mappings to LCSH. This information is coded in the indicator value. The first two character positions at the beginning of a field are called indicators. These character positions can contain information that interprets or supplements the data found in the field. The unique identifier for the term is coded in the $0 subfield and the organization that supplied the mapping data in the $5 subfield (e.g. OCoLC-O). The code, OCoLC-O, is the MARC organization code for OCLC's Office of Research. MARC organization codes are used to represent names of libraries and other organizations that need to be identified in the bibliographic environment. The last 750 field is a mapping to MeSH with the unique MeSH identifier coded in the $0 subfield. No information about the mapping organization is provided. The unique identifiers for the mapped terms are linked to Web-accessible versions of LCSH and MeSH.

Mapping data of this type can be used to create high quality terminology services. For example, terminology services that support search and retrieval might use the full range of available mappings, while services invoked during indexing or cataloging might use only mappings produced by a specific organization or for a given vocabulary.

A legitimate concern about vocabulary mapping is how the mappings will be maintained. Although not a trivial task, mappings can be maintained with the help of software that tracks changes to vocabulary term records. Changes to vocabulary terms are recorded in a number of ways, e.g. by data in a vocabulary record that indicates when the record was last modified, by notes fields that chronicle changes to a vocabulary term (see field 688 in the MARC record examples), and through notifications of additions and changes distributed by vocabulary owners. Depending on the nature of the changes, human review may be needed to determine if mappings are still valid when a vocabulary term changes.

Figure 3. ERIC record with mapping data
001    ERIC03056
003    OCoLC-O
005    20031117154238.0
008    031118 n|a|znn|bb||||||||||| ||an| ||| d
040    $beng$cOCoLC-O$dOCoLC-O$eericd
072  7 $a110$2ericd
150    $aEidetic Imagery
450    $wa$aEidetic Images
450    $aPhotographic Memory
550    $aVisualization
550    $aMemory$wg
680    $iVividly clear, detailed imagery of something (usually visual) that has been previously perceived
688    $aEidetic Images (1967 1980)
750  0 $aEidetic imagery$0(DLC)sh 85041379  $5OCoLC-O
750  0 $aPhotographic memory$0(DLC)sh 00009368  $5OCoLC-O
750  2 $a Eidetic Imagery$0(DNLM)D004538
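To show how a terminology service might read these links, the sketch below parses the 750 lines of Figure 3 into simple dictionaries and then keeps only the mappings supplied by one organization. It assumes the display layout used in the figure (tag, space, two indicator positions, space, subfields); real MARC records would normally be read with a MARC library instead. The second indicator values 0 (LCSH) and 2 (MeSH) follow the MARC authority format; the function and key names are ours.

```python
import re

# Second indicator of a 7XX linking field identifies the target vocabulary.
SOURCE = {"0": "LCSH", "2": "MeSH"}


def parse_link(line):
    """Parse a 7XX line laid out as in Figure 3; return None for other tags."""
    m = re.match(r"(\d{3}) (.)(.) (.*)", line)
    if not m or not m.group(1).startswith("7"):
        return None
    # Split "$aTerm$0Identifier$5Org" into {code: value}.
    subfields = {
        code: value.strip()
        for code, value in re.findall(r"\$(\w)([^$]*)", m.group(4))
    }
    return {
        "vocabulary": SOURCE.get(m.group(3), "unknown"),
        "term": subfields.get("a"),
        "identifier": subfields.get("0"),
        "mapped_by": subfields.get("5"),  # None when no $5 is present
    }


record = [
    "150    $aEidetic Imagery",
    "750  0 $aEidetic imagery$0(DLC)sh 85041379  $5OCoLC-O",
    "750  0 $aPhotographic memory$0(DLC)sh 00009368  $5OCoLC-O",
    "750  2 $a Eidetic Imagery$0(DNLM)D004538",
]

links = [link for link in map(parse_link, record) if link]

# A cataloging service might keep only mappings from one organization.
oclc_only = [link for link in links if link["mapped_by"] == "OCoLC-O"]

for link in links:
    print(link)
```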

In this example, the LCSH terms are linked to LC subject authority records accessible through the OAI-Cat framework. These records are accessible to users via a browser and to machines through the OAI-PMH Web services mechanisms. The MeSH link generates a search of the MeSH vocabulary using the search features of the MeSH Browser.

6 Next steps

Our plans for the near term include refining the matching software and developing improved tools for reviewers. When the review of the ERIC/LCSH matches is complete, the file of mappings will be made available to other researchers. The file will be available in MARC in XML and also encoded according to version 0.5 of the Zthes schema. We also anticipate making this file available via OAI-PMH and for searching using SRU/SRW and the Zthes profile. See the Terminology Services project Web site for details.

Acknowledgements

We thank the reviewers of this paper for their many helpful comments and suggestions.

References

Doerr, M. (2001) "Semantic Problems of Thesaurus Mapping". Journal of Digital Information 1(8) http://jodi.tamu.edu/Articles/v01/i08/Doerr/

Gardner, T. (2001) "An Introduction to Web Services". Ariadne (29) http://www.ariadne.ac.uk/issue29/gardner/

Koch, T. (2003) "Activities to advance the powerful use of vocabularies in the digital environment - Structured overview" http://www.lub.lu.se/~traugott/drafts/seattlespec-vocab.html

Lancaster, F. W. and L. Smith (1983) "Compatibility Issues Affecting Information Systems and Services". General Information Programme and UNISIST, PGI-83/WS/23 (Paris: UNESCO)

Mandel, C. (1987) "Multiple Thesauri in Online Library Bibliographic Systems". Cataloging Distribution Service (Library of Congress: Washington, D.C.)

Olson, T. and G. Strawn (1997) "Mapping the LCSH and MeSH Systems". Information Technology and Libraries, 16(1), 5-19

O'Neill, E. and L. Chan (2003) "FAST (Faceted Application of Subject Terminology): A Simplified LCSH-based Vocabulary". World Library and Information Congress: 69th IFLA General Conference and Council, 1-9 August, Berlin http://www.ifla.org/IV/ifla69/papers/010e-ONeill_Mai-Chan.pdf

Tennant, R. (2002) "Digital Libraries-What To Know About Web Services". Library Journal 12 (July 15) http://www.libraryjournal.com/index.asp?layout=articleArchive&articleid=CA231639

Van de Sompel, H., Young, J. and T. Hickey (2003) "Using the OAI-PMH... Differently". D-Lib Magazine 9(7/8) http://www.dlib.org/dlib/july03/young/07young.html

Vizine-Goetz, D. (1998) "Popular LCSH with Dewey Numbers". In Annual Review of OCLC Research 1997 http://digitalarchive.oclc.org/da/ViewObject.jsp?objid=0000003449

Whitehead, C. (1990) "Mapping LCSH into Thesauri: the AAT Model". In Beyond the Book: Extending MARC for Subject Access, edited by T. Petersen and P. Molholt (Boston: G.K. Hall), p. 81

Zeng, M. and L. Chan (2003) "Trends and issues in establishing interoperability among knowledge organization systems". Journal of the American Society for Information Science and Technology, published online 16 Dec 2003

Appendices

Definitions for column labels 
Tag  3-character MARC field tag 
Occ  Total number of this field in all records 
%Recs  Percent of records that contain this field 
Occ/Rec  Occurrence of this field divided by the total number of records 
Len/Occ  Average length of this field 
Sub  1-character MARC subfield code 
Occ  Total number of this subfield in all records 
Occ/Rec  Occurrence of this subfield divided by the total number of records 
Len/Occ  Average length of this subfield 

Appendix 1

ERIC Thesaurus encoded in MARC Authority Format 
Field and Subfield Statistics 
Tag  Occ  %Recs  Occ/Rec  Len/Occ  Sub  Occ  Occ/Rec  Len/Occ 
001 6080  100.00  1.00  9.00       
                 
003 6080  100.00  1.00  7.00       
                 
040 6080  100.00  1.00  29.00  6080  100.00  7.00 
          6080  100.00  3.00 
          6080  100.00  7.00 
          6080  100.00  7.00 
          6080  100.00  5.00 
072 6080  100.00  1.00  3.00  6080  100.00  3.00 
150 6080  100.00  1.00  16.30  6080  100.00  16.30 
450 4562  43.90  0.75  18.05  4562  43.90  17.86 
          873  11.71  1.00 
550 68725  100.00  11.30  16.42  68725  100.00  16.24 
          11878  91.71  1.00 
680 3774  62.07  0.62  148.18  3774  62.07  148.18 
688 873  11.71  0.14  29.95  873  11.71  29.95 

Appendix 2

ERIC Thesaurus encoded in MARC Authority Format 
773 Record Subset without mapping data 
Field and Subfield Statistics 
Tag  Occ  %Recs  Occ/Rec  Len/Occ  Sub  Occ  Occ/Rec  Len/Occ 
001 773  0.00  1.00  9.00       
                 
003 773  0.00  1.00  7.00       
                 
040 773  0.00  1.00  29.00  773  0.00  7.00 
          773  0.00  3.00 
          773  0.00  7.00 
          773  0.00  7.00 
          773  0.00  5.00 
072 773  0.00  1.00  3.00  773  0.00  3.00 
150 773  0.00  1.00  15.93  773  0.00  15.93 
450 668  0.00  0.86  17.47  668  0.00  17.26 
          139  0.00  1.00 
550 9365  0.00  12.12  15.94  9365  0.00  15.77 
          1595  0.00  1.00 
680 520  0.00  0.67  140.28  520  0.00  140.28 
688 139  0.00  0.18  29.43  139  0.00  29.43 

Appendix 3

ERIC Thesaurus encoded in MARC Authority Format 
773 Record Subset with mapping data 
Field and Subfield Statistics 
Tag  Occ  %Recs  Occ/Rec  Len/Occ  Sub  Occ  Occ/Rec  Len/Occ 
001 773  100.00  1.00  9.00       
                 
003 773  100.00  1.00  7.00       
                 
005 773  100.00  1.00  16.00       
                 
040 773  100.00  1.00  29.00  773  100.00  7.00 
          773  100.00  3.00 
          773  100.00  7.00 
          773  100.00  7.00 
          773  100.00  5.00 
072 773  100.00  1.00  3.00  773  100.00  3.00 
150 773  100.00  1.00  15.93  773  100.00  15.93 
450 668  49.94  0.86  17.47  668  49.94  17.26 
          139  13.71  1.00 
550 5777  99.61  7.47  15.81  5777  99.61  15.62 
          1098  76.33  1.00 
680 520  67.27  0.67  140.28  520  67.27  140.28 
688 139  13.71  0.18  29.43  139  13.71  29.43 
750 404  50.19  0.52  30.18  404  50.19  16.00 
          404  50.19  13.71 
          12  1.55  15.75 

Appendix 4

LCSH (updated in November 2003) in MARC Authority Format 
Field and Subfield Statistics 
Tag  Occ  %Recs  Occ/Rec  Len/Occ  Sub  Occ  Occ/Rec  Len/Occ 
 
001 277272  100.00  1.00  12.00       
                 
005 277272  100.00  1.00  16.00       
                 
008 277272  100.00  1.00  40.00       
                 
010 277272  100.00  1.00  12.07  277272  100.00  12.00 
          1503  0.45  12.00 
035 0.00  0.00  6.20  0.00  6.20 
040 277272  100.00  1.00  7.99  277272  100.00  3.06 
          32824  11.84  3.00 
          277272  100.00  3.00 
          145241  51.45  3.01 
043 0.00  0.00  10.00  0.00  10.00 
053 89983  29.46  0.32  10.95  89983  29.46  7.40 
          15728  5.30  5.95 
          21721  5.17  10.43 
073 3286  1.19  0.01  13.32  5046  1.19  6.07 
          3286  1.19  4.00 
100 19949  7.19  0.07  15.78  19949  7.19  14.48 
          23  0.01  2.39 
          133  0.05  18.02 
          714  0.26  8.94 
          19  0.01  15.74 
          21  0.01  13.67 
          154  0.05  12.92 
          1167  0.32  11.88 
          16  0.01  12.69 
          54  0.02  9.07 
110 5644  2.04  0.02  37.25  5644  2.04  33.19 
          603  0.21  6.17 
          0.00  7.00 
          0.00  17.71 
          161  0.06  13.02 
          1111  0.36  14.09 
          52  0.02  18.77 
          33  0.01  9.85 
111 0.00  0.00  24.29  0.00  21.57 
          0.00  5.00 
          0.00  7.00 
130 465  0.17  0.00  27.38  465  0.17  9.69 
          0.00  4.00 
          21  0.01  9.00 
          67  0.02  6.45 
          114  0.04  16.40 
          336  0.10  16.63 
          0.00  15.22 
150 202494  73.03  0.73  22.79  202494  73.03  18.55 
          3513  1.24  13.25 
          43118  13.80  14.42 
          2687  0.97  13.75 
          16385  5.84  9.40 
151 45427  16.38  0.16  29.73  45427  16.38  23.29 
          694  0.24  14.06 
          13696  4.48  13.27 
          7758  2.80  12.95 
          76  0.03  7.79 
180 2858  1.03  0.01  18.57  82  0.03  11.99 
          3453  1.03  14.61 
          108  0.04  14.04 
          10  0.00  12.00 
181 0.00  0.00  27.50  0.00  21.00 
          0.00  17.00 
182 34  0.01  0.00  16.71  34  0.01  16.71 
185 392  0.14  0.00  21.78  419  0.14  19.72 
          14  0.01  19.43 
260 714  0.26  0.00  125.81  1193  0.26  27.97 
          1380  0.26  40.92 
360 3900  1.41  0.01  117.73  6316  1.23  25.42 
          7713  1.41  38.71 
400 30220  3.41  0.11  14.59  30220  3.41  14.21 
          10  0.00  2.40 
          70  0.02  15.07 
          324  0.08  8.86 
          14  0.00  22.00 
          0.00  18.00 
          11  0.00  21.18 
          57  0.02  15.58 
          567  0.17  3.00 
          330  0.07  13.77 
410 6134  1.24  0.02  38.44  6134  1.24  37.31 
          155  0.05  9.20 
          0.00  18.00 
          0.00  12.00 
          49  0.02  20.43 
          294  0.10  3.00 
          201  0.06  16.84 
          0.00  21.33 
          0.00  10.33 
411 0.00  0.00  50.00  0.00  42.00 
          0.00  3.00 
          0.00  21.00 
430 248  0.07  0.00  24.77  248  0.07  8.88 
          0.00  4.00 
          0.00  28.00 
          0.00  6.86 
          37  0.01  4.76 
          50  0.01  13.24 
          31  0.01  3.00 
          178  0.05  16.36 
          0.00  9.00 
450 189834  35.10  0.68  21.59  189834  35.10  20.16 
          797  0.24  14.56 
          19091  6.33  3.00 
          12686  3.73  13.56 
          520  0.17  18.90 
          1899  0.64  10.96 
451 37085  7.02  0.13  28.61  37085  7.02  26.78 
          160  0.06  17.42 
          3695  1.14  3.00 
          2188  0.57  15.91 
          1316  0.32  14.39 
          0.00  12.50 
480 742  0.21  0.00  21.05  16  0.01  13.44 
          210  0.06  3.00 
          809  0.21  18.26 
482 0.00  0.00  37.80  0.00  3.00 
          0.00  34.80 
485 247  0.06  0.00  19.77  256  0.06  18.02 
          80  0.02  3.00 
          0.00  10.33 
500 4193  1.19  0.02  15.46  4193  1.19  13.78 
          0.00  2.00 
          108  0.04  20.43 
          51  0.02  9.29 
          0.00  19.50 
          23  0.01  11.83 
          301  0.10  1.02 
          308  0.10  10.51 
          0.00  24.00 
          62  0.02  6.39 
510 471  0.16  0.00  33.04  471  0.16  20.50 
          155  0.05  5.02 
          0.00  9.00 
          442  0.15  1.00 
          319  0.10  13.91 
          0.00  12.00 
          20  0.01  9.45 
530 127  0.04  0.00  25.75  127  0.04  5.99 
          0.00  5.50 
          46  0.01  6.13 
          10  0.00  11.10 
          104  0.04  1.00 
          104  0.03  19.24 
550 220509  56.84  0.80  17.74  220509  56.84  13.82 
          358  0.12  11.74 
          205025  56.10  1.00 
          16008  4.90  12.26 
          1192  0.38  12.26 
          50205  15.23  8.85 
551 14009  4.48  0.05  22.73  14009  4.48  9.80 
          413  0.13  10.98 
          13774  4.41  1.00 
          12518  3.80  10.22 
          2046  0.52  16.78 
          70  0.01  7.90 
580 780  0.27  0.00  14.86  780  0.27  1.00 
          782  0.27  13.82 
581 0.00  0.00  18.00  0.00  1.00 
          0.00  17.00 
585 188  0.06  0.00  13.06  192  0.06  11.81 
          188  0.06  1.00 
667 3769  1.36  0.01  62.46  3769  1.36  62.46 
670 245924  42.46  0.89  90.70  14  0.00  114.64 
          0.00  76.43 
          0.00  314.00 
          0.00  65.89 
          0.00  23.00 
          0.00  121.00 
          0.00  102.50 
          0.00  360.00 
          0.00  9.50 
          245926  42.46  48.44 
          144503  24.35  71.88 
          0.00  76.50 
675 35640  12.85  0.13  40.27  83548  12.85  17.18 
680 9464  3.33  0.03  163.18  4029  1.02  22.89 
          10729  3.33  135.35 
681 8872  3.15  0.03  40.53  9094  3.15  22.53 
          9093  3.15  17.01 
682 12  0.00  0.00  167.00  0.00  41.11 
          21  0.00  77.81 
781 28049  10.11  0.10  27.63  0.00  14.50 
          0.00  17.00 
          53046  10.11  14.61