Nine questions to guide you in choosing a metadata schema

Abstract

This article is a guide for collection developers at the point of considering a metadata schema for their digital collection. The nine questions asked in this article will assist a developer in clarifying how he wants the collection to be organized, described, and used. This article uses examples to illustrate how these questions guided the development of a digital collection built at the University of Southern California.

Keywords: Metadata, Digital Libraries, Schemas, Collection Development.

1 Introduction

A single digital object may serve a variety of purposes, reaching a broader audience than traditional library resources. As a result of this multiplicity, it may be difficult to know how to organize and describe an object so that it reaches the broadest population. No single descriptive metadata schema addresses all possible purposes. The author calls this 'the Goldilocks problem': trying to find a metadata schema that offers neither too little of one thing nor too much of another, but is just right, in order to make the best use of an organization's resources.

This article is designed to assist digital collection developers who are at the point of considering how to choose a schema and assign metadata. This can be a daunting task if the collection developer is not familiar with metadata, and may not be certain what steps to take to develop a model for describing and administering the collection. By following the guiding questions described in this article, a collection developer should be able to discuss with his metadata staff the goals for the collection. This article uses examples to illustrate how these questions guided the development of a digital collection that was built at the Norris Medical Library at the University of Southern California.

2 Assumptions

Let us assume for this discussion that the rationale for developing the digital collection has already been determined, and questions such as these have already been answered:

"Is the content original and of substantial intellectual quality?
Is it useful in the short and/or long term for research and instruction?
Does it match campus programmatic priorities and library collecting interests?" (University of Michigan Digital Library Production Service 1999)

We assume that the goals for the collection have been devised, and expectations about how the collection may be used have been considered. We also assume that any copyright issues have been dealt with, so that the metadata will provide access to the collection without restriction, or any known restrictions will be addressed during the design of the metadata schema. The following nine questions are a practical complement to Smith's recommendations (Smith 2001) for building digital collections. Smith's report advises institutions on how to devise a rationale for building the collection; this article advises on the following step of developing a description and administration metadata model.

3 The Nine Guiding Questions

3.1 Who will be using the collection?

Digital collections may be built to satisfy a stated user need, with the users already known to the library building the collection. A collection of maps, for example, may be collected and digitized for a course on spatial analysis. Knowing the users of the collection can be helpful in choosing a metadata schema because the developer will know what kinds of information the users will be seeking, and can be certain to choose a schema to accommodate those elements. Building a collection for a known user group allows the metadata elements to be more specific, or granular.

Digital collections may be built to address a collection need, rather than a known user need. An art history collection, for example, may be lacking images from a commonly discussed artist. A collection developer who has identified a collection gap may deem it necessary to add to the library on a particular theme so that the entire collection is viewed as more well-rounded. This kind of collection building provokes different expectations for the kind of use it may get, since the users, or potential users, may be unknown. The metadata schema one may choose for this kind of collection may provide for broader access points, or provide access to general themes, rather than very specific details.

Describing the potential users of a digital collection is probably the biggest question to ask as one begins to think about metadata possibilities. Think about this: is your user a history major who is going to need specific dates and geographical locations identified in the metadata? If so, what are the implications for your metadata? Is your user a high school art student who needs to download some images for a class project? If so, what might be the implications for your metadata? Identifying the possible users of your collection will help you conceive a metadata model appropriate for them.

The Norris Medical Library at the University of Southern California is developing its first digital collection: 3000 slides of orthopedic anatomical images, some photographs and some illustrations. The collection is being built as a learning tool for the medical students, orthopedic surgery residents, and faculty on our campus. Since our known users are training to be medical experts, we decided to focus on assigning extensive image descriptions and keywords. We chose to use Medical Subject Headings (MeSH) to assist our local, expert searchers who would be familiar with that terminology, and also chose to assign "naive" keywords, or descriptors that someone in the wider population - a non-expert - would know to use.

3.2 Who is the collection cataloger?

Hert et al. describe a workable metadata model as one that "balance[s] the aspects internal to an organization...with those external" (Hert et al. 2007). The external aspects may be considered to be the user, as discussed in the previous section. The internal aspects may include things related to the collection infrastructure, such as a content management system and catalogers. This section focuses on the role of the cataloger in developing a balanced metadata model.

If the cataloger is an expert on the subject of your collection, the collection can be cataloged with terminology appropriate for your intended user group. Even a cataloger who is not a subject expert can succeed if he or she is familiar with the vocabulary or schema chosen for the collection; confidence with those tools ensures some level of success.

As the Norris Medical Library prepared a collection of digitized anatomy slides, we realized the need for an expert on the subject of anatomy to assign appropriate terminology to the metadata records. A retired anatomist was found to fill that role. In this case the subject expertise was a bonus for the collection, yet the schema was not familiar to the retired anatomist, and training had to be considered.

Even if the cataloger is familiar with the subject material, vocabulary, and schema, a collection developer must also consider the manner in which data are entered into the content management system. Knowing the level of comfort the cataloger has with technology may lead a collection developer to consider either a simplified or complex data entry system. If the cataloger must be trained on the elements of a schema or on an appropriate vocabulary, a collection developer may consider a content management system interface that makes data entry as seamless as possible, to counterbalance the intellectual strain of working with an unfamiliar vocabulary or schema.

If the collection developer has access to more than one cataloger, each can play a significant role in metadata creation. Perhaps a cataloger is not familiar with the terminology used to describe images of arrowheads, for example, but understands copyright language and is able to confidently enter metadata regarding image usage rights and restrictions. One metadata record may be split so that pieces may be completed by various catalogers, in order to play to their respective strengths. For example, the person creating the digital file may be the one to assign structural metadata such as file format, image size, date deposited, etc. (Robertson 2005).

3.3 How much time/money do you have?

In addition to funding and scheduling the digitization, storage, database construction, web design, and forward migration of a collection, a collection developer must decide how much time and money to set aside for metadata creation. Greenberg & Robertson assert that a collection must have enough resources allocated in order to create metadata that is "accurate, consistent, sufficient, and thus reliable" (2002). "Accurate" metadata express the essential nature of the collection, giving the user a true assessment of its contents. "Consistent" metadata use a term the same way throughout the collection: a term used one way to describe one digital object is used the same way for every other. "Sufficient" metadata provide enough description so that the objects may be discovered by a user. Sufficiency may be difficult to quantify, especially if the collection is being built for an undefined user group, because then it is harder to know when enough description has been given. Deciding how much metadata is sufficient depends on the user, and collection developers may consider which metadata elements will provide the most relevant information for the broadest user population (see Mizzaro 1997 for further discussion of relevance).

In addition to appropriate funding for metadata, a collection developer must consider the amount of time available to describe the collection. If the collection must be searchable within a short period of time after its digitization, perhaps certain metadata elements should be assigned first, with the expectation of going back to add other elements at a later point. Ideally the collection would be described in full before going live. If time is a pressing consideration, a collection developer may wish to train a large group of catalogers to enter the metadata, allowing for a quicker turnaround than having the usual staff of one or two catalogers enter data over time.

Collaborations can have the same effect as a large group of catalogers in a single location. If the digital collection has "partners," or people spread across areas of expertise, they can enter their portion of the metadata, working to complete a piece of each record. These partnerships with members outside the usual realm of the digital library can have a lasting effect, in addition to speeding up the metadata entry timeframe: encouraging partners to participate in the creation of a collection can build "closer, more sustained relationships" with those colleagues (Lim 2003).

In building the anatomy collection at the Norris Medical Library we involved multiple partners to assist in metadata entry, with no pressing time commitment for completion of the project. The luxury of time gave us the opportunity to enter data in a few records, review and discuss, and then continue.

3.4 How will your collection be accessed?

Deciding how to provide access to a digital collection is a complex process, and the discussion here is meant to guide a collection developer to consider how the metadata schema will interact with an appropriate user interface, and which kinds of metadata may be appropriate for the chosen interface.

If the collection will be accessed through a web interface designed by the local group, and will provide searching by only a few pre-defined elements such as subject or author, the metadata schema one chooses may be brief and simple. For example, if only five possible subject categories will be assigned to the objects in the collection, a drop-down menu may be made to display those five options to the user. The implication for choosing a metadata schema is that the fewer elements are needed, and the fewer relationships exist between objects in a collection, the simpler the schema can be.

If the collection will be searched via a mechanism where the relationships are not pre-coordinated, the metadata schema may be more extensive. The schema may be more extensive because the relationships between search terms are as yet unknown, and a developer will want to provide for as many possibilities as are reasonable, in order to provide a satisfactory search result. The collection developer must consider more possible relationships between data elements if the search mechanism is complex. This kind of search is likely to be done with an empty search box, in which the user types in any keywords that come to mind, rather than being guided by drop-down boxes of finite possible keywords.

The digital anatomy collection at the Norris Medical Library can be searched in two ways: as a discrete collection, identified in the collection interface with a picture and textual summary and browseable by thumbnail images; or as single images, found through a keyword search. Since the collection has more than one access method, we decided to use a simple metadata schema and focus on an extensive keyword list, using both the controlled vocabulary of MeSH and uncontrolled keywords.

3.5 How is your collection related to other collections?

As a collection developer considers his collection, he considers not only the single collection, but also how it fits in with other existing collections at his institution. Relationships may be drawn not only between digital objects within or across collections, but also across formats; a digital resource may relate to books, objects, human beings, or corporations (Dublin Core Metadata Initiative 2007). If the collection being developed relates to other collections, it is important to identify how those relationships will be described. If, for example, the date of the objects is a key element for the current collection, it would be wise to format that element as is done in other collections; ISO 8601 is an international standard for representing dates, and it is commonly used for the date in a metadata field (see http://www.w3.org/TR/NOTE-datetime for a description of the standard). If a collection developer wants collections to be cross-searched on the date element, those dates must be formatted in the same manner.
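As an illustration of the date example, the following is a minimal sketch, not a description of any system used at the Norris Medical Library. It assumes that legacy records contain dates in a handful of known layouts (the list KNOWN_FORMATS is hypothetical and should be replaced with whatever appears in local data) and rewrites each to the YYYY-MM-DD form described in the W3C note.

```python
from datetime import datetime

# Hypothetical layouts found in legacy records. "%m/%d/%Y" assumes
# US month-first order; the month-name layouts assume English names.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y"]


def to_iso8601(raw: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if no layout matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known layout
    raise ValueError(f"Unrecognized date format: {raw!r}")


print(to_iso8601("03/15/1998"))      # 1998-03-15
print(to_iso8601("March 15, 1998"))  # 1998-03-15
```

Raising an error for an unrecognized layout, rather than guessing, leaves ambiguous dates for a cataloger to resolve by hand.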

More broadly than considering just one particular element that may be cross-searched, a collection developer may choose a pre-coordinated vocabulary to assist in populating the elements once they have been identified. By choosing a vocabulary that is common to other existing collections, the possibility for keyword matches in a search is greater. For example, using terms from a known vocabulary such as the Art & Architecture Thesaurus to describe art images increases the possibility that the collection will be successfully searched, as the terms found in that thesaurus are common to the art field.

A collection developer may decide to digitize several small collections that are part of a larger, or "parent," collection. The Norris Medical Library has developed such a collection, with the parent collection called the Orthopaedic Surgical Anatomy Teaching Collection. This main collection is a set of 3000 digitized slides from an orthopedic anatomy teaching collection, and is considered the "parent," or top-level, collection. Under that collection the slides have been grouped into three smaller, or "child," collections: anatomy photographs, anatomical illustrations, and labeled anatomical illustrations. Identifying this parent/child relationship is important when considering a metadata schema so that the relating elements are defined in the same way. In addition to consistency, identifying key elements at the parent level allows that information to auto-populate the elements of the child collections. For example, we identified the parent collection as the Orthopaedic Surgical Anatomy Teaching Collection, and as a result each of the records held in the smaller collections has an element that says, "this record is part of the Orthopaedic Surgical Anatomy Teaching Collection."
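The auto-population described above can be sketched in a few lines. This is an assumption-laden illustration: the element names (is_part_of, sub_collection, title) and the sample slide are invented for the example, and a real content management system would handle inheritance through its own configuration.

```python
# Elements defined once at the parent level. The collection name comes
# from the text; the element name "is_part_of" is hypothetical.
PARENT_ELEMENTS = {
    "is_part_of": "Orthopaedic Surgical Anatomy Teaching Collection",
}


def build_child_record(child_elements: dict, parent_elements: dict = PARENT_ELEMENTS) -> dict:
    """Copy parent-level elements into a child record; child values win on conflict."""
    record = dict(parent_elements)
    record.update(child_elements)
    return record


# Made-up sample record for one slide in a child collection.
slide = build_child_record({
    "title": "Sample slide title",
    "sub_collection": "Anatomical illustrations",
})
print(slide["is_part_of"])  # Orthopaedic Surgical Anatomy Teaching Collection
```

Because the parent statement is stored in one place, a correction to it reaches every child record the next time records are built.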

3.6 What is the scope of your collection?

Knowing which other collections already exist at an institution can provide time- and effort-saving information about how a collection developer may choose to treat his new collection. For example, a developer building a digital collection of fine art images may look to see if there are other fine art collections in use in the institution. If there are existing collections, the developer may choose to use similar schema and vocabularies. In choosing a similar schema or vocabulary the developer can save some effort in schema design. In addition to saving time, by choosing the same vocabulary to describe the new objects, one sets the stage for potential relationships between objects to be discovered. The term landscape, for example, can provide a link between two separate collections of images of landscapes.

In addition to the enhanced discovery possibilities, a practical reason for using similar schemas and vocabularies across collections is that catalogers already know how to apply them. The cataloger or data entry technician will be more competent, confident, and quick in his/her use of the tools if he/she has already used them in another project. If the collection being built is a part of a larger collection, using the same schema provides quicker object analysis and data entry. Familiarity with the schema and vocabulary provides a level of consistency in its application.

In the case at the Norris Medical Library, the Orthopaedic Surgical Anatomy Teaching Collection was the first digital collection developed at that library. As a result we did not have other internal collections to look to for schema or vocabulary guidance, so we looked to our larger institution, the USC Digital Archive, instead. Since our collection did not need to "fit," or coordinate, with other existing medical collections, we made the decision to use the Digital Archive's general metadata schema and tools, to ensure that our collection can be cross-searched with collections already in the Archive.

3.7 Will your metadata be harvested?

When planning the metadata schema for a digital collection, the developer considers whether he wishes the collection to be "discovered" outside of the immediate user environment. If the developer wishes to build the collection for this wider purpose, one option for having the digital objects discovered is to deposit the metadata into a place where it will be gathered by a software tool called a harvester. The harvester then distributes the metadata so that users all over the world will be pointed to the digital objects in the local collection. One can imagine that in order for a software tool to gather metadata efficiently, it must all be structured similarly. To that end, if a developer wishes to have his metadata harvested, he must comply with a standard protocol. One such protocol is the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which requires that records be made available in unqualified Dublin Core; a developer must therefore map his metadata to this schema before depositing them. A general overview for developers considering this option is available at http://www.oaforum.org/tutorial/.
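To make the idea of "structured similarly" concrete, the sketch below serializes one record as unqualified Dublin Core in the oai_dc XML container used for harvesting. It is a simplified illustration: the sample values are invented, and a working repository would also wrap each record in the protocol's response envelope, which is not shown here.

```python
import xml.etree.ElementTree as ET

# Namespaces for the oai_dc container and the Dublin Core element set.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)


def to_oai_dc(record: dict) -> str:
    """Serialize {element: [values]} as an oai_dc XML string.

    Every Dublin Core element is optional and repeatable, so each
    value becomes its own child element.
    """
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for element, values in record.items():
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Invented sample record, for illustration only.
xml_text = to_oai_dc({
    "title": ["Orthopaedic Surgical Anatomy Teaching Collection"],
    "subject": ["Anatomy", "Orthopedics"],
    "date": ["1998-03-15"],
})
print(xml_text)
```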

Mapping means taking the elements of one metadata schema and determining how they match particular elements of another metadata schema, so that the information may be exchanged seamlessly from one schema to another; this process is called crosswalking. The difficulty with crosswalking is that the elements of one schema do not always match the elements of another well enough to have the data mean the same thing. Day and Gill both describe existing crosswalks between popular metadata schemas, lessening the amount of effort a collection developer will need in order to determine which metadata schema may be appropriate for his situation (Day 2002; Gill et al. 2005). These crosswalks are available to help a collection developer learn how one data element will migrate to another schema element, so that the least amount of data is lost.
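A crosswalk can be thought of as a lookup table from one schema's elements to another's. The sketch below assumes a hypothetical local schema (the element names on the left of CROSSWALK are invented) mapped to Dublin Core elements, and it reports any local element with no counterpart, since those are the points where data would be lost.

```python
# Hypothetical local element names mapped to Dublin Core elements.
CROSSWALK = {
    "image_title": "title",
    "mesh_heading": "subject",
    "naive_keyword": "subject",  # two local elements may collapse into one target
    "date_digitized": "date",
    "usage_rights": "rights",
}


def crosswalk(local_record: dict, mapping: dict = CROSSWALK):
    """Return (mapped_record, unmapped_elements) for one local record."""
    mapped = {}
    unmapped = []
    for element, value in local_record.items():
        target = mapping.get(element)
        if target is None:
            unmapped.append(element)  # no equivalent: flag for review
            continue
        mapped.setdefault(target, []).append(value)
    return mapped, unmapped


# Invented sample record, for illustration only.
mapped, unmapped = crosswalk({
    "image_title": "Sample slide title",
    "mesh_heading": "Femur",
    "naive_keyword": "thigh bone",
    "slide_number": "0042",
})
print(mapped["subject"])  # ['Femur', 'thigh bone']
print(unmapped)           # ['slide_number']
```

Note that the controlled heading and the naive keyword both land in the subject element, so the distinction between them is no longer visible after the crosswalk; this is the kind of meaning loss described above.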

3.8 Do you want your collection to work with other collections?

When one builds a digital collection, one considers it in relation to the other resources one's organization provides. The way metadata are assigned helps determine how new and existing resources can interact. The ability of collections to work together is called interoperability, and a metadata schema can be designed with it in mind.

Caplan (2003) defines interoperability as "...the ability to perform a search over diverse sets of metadata records and obtain meaningful results." In order to effectively search across metadata records one needs to have designed those schemas with the intention of interoperability; the result of this intention is that the schemas are similar, or share vocabularies or standards. Entering date information, for example, in the same way in each metadata record means that those elements allow interaction between records and schemas; this standardization provides a user the possibility of successful information retrieval. Though standardization of elements like the date seems reasonable and expected, Anderson and Ross (2005) note that they believe "...the greatest challenge facing multimedia repositories may be populating interoperable metadata frameworks rather than implementing the technology," furthering the idea that the expectation of interoperability may not always be considered when constructing a metadata schema.

Imagining a collection to be bigger than the collection itself and thinking broadly about one's collection helps to define how it will interact with other collections, and assists a developer in identifying what is important about his collection. Cole and Shreeves articulate that, "a good [digital] collection fits into the larger context of significant related national and international digital library initiatives" (2004). By identifying the factors that enable this developing collection to "fit" into a larger context, a developer can be certain to identify those elements in a metadata schema.

In constructing the Orthopaedic Surgical Anatomy Teaching Collection we clarified that the collection did not have known existing resources with which we wished it to interact. While no other anatomy collections are currently held in the USC Digital Archive, it is possible that other related resources may be added at a later point. To this end, we used a simple metadata schema and standardized vocabularies. By making this choice at the inception of our digital collection we ensure that our collections in the future may interoperate.

3.9 How much maintenance and quality control do you wish?

When considering which metadata schema to use, a collection developer will take into account the current status of the collection, as well as how it may grow: will the collection stay in this format or repository for the foreseeable future, or is migration to a new system or integration with another collection already in the plans? If migration is expected, the collection developer may consider a simple, straightforward metadata schema, with few alterations, so that maintenance of the metadata is minimal. This simplified approach may make the transition to a new schema or platform easier. This approach to schema design can be likened to the "touch it once" approach used in traditional book cataloging; by choosing a schema that can transition easily to another, the data do not have to be altered, thus allowing the cataloger to 'touch' the data just one time rather than having to return to records to make changes.

Another approach to developing a schema may be to plan to enter the metadata in stages, beginning with the essential elements useful for discovery, then moving to expand the metadata. This approach may be useful if the future of a collection is not known, and will save metadata from being discarded. This approach has its flaws in currency and lag, as discussed in Bruce & Hillman (2004).

An important quality control factor when deciding on data elements in a schema is determining how the data will be entered. It is helpful to make decisions, before any data are entered, on issues such as how numbers will be entered: will the cataloger use the word "two" or the numeral "2"? A search engine will not intuit that those elements are the same, and will return different search results depending on which is entered. Documenting your decisions about consistent data entry will allow your catalogers or administrators to do less data cleaning.
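A documented data entry rule can also be enforced mechanically. The following minimal sketch assumes the local decision is "always use numerals" and rewrites the number words zero through ten; the word list and the rule itself are assumptions for the example, not a recommendation from the literature cited here.

```python
import re

# Assumed local rule: numerals, not words, for zero through ten.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10",
}

# Whole-word match only, so "bone" and "tendon" are left alone.
_PATTERN = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE)


def normalize_numbers(text: str) -> str:
    """Replace number words with numerals according to NUMBER_WORDS."""
    return _PATTERN.sub(lambda m: NUMBER_WORDS[m.group(1).lower()], text)


print(normalize_numbers("Two views of the femur"))  # 2 views of the femur
```

A rule this blunt will also rewrite "one" where it is used as a pronoun, so it is better suited to short, structured fields than to free-text descriptions.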

Even if decisions like these are made before data entry begins, a certain level of quality control will be necessary after the data have been entered, to ensure accuracy in the database. Beall identifies automatic metadata generation, crosswalking, and harvesting as problematic, but describes typographical errors as the most common quality issue (2005). Fox et al. note four dimensions for determining quality: accuracy, currentness, completeness, and consistency (1994). Ojala (1996) notes that errors inevitably slip past the quality control process and discusses options for correction after the fact. Guy et al. (2004) discuss ways in which collection developers can decide for their collection which quality issues they wish to address.

If there are multiple participants in creating one metadata record, a work form may be a good solution to guide catalogers to enter data only in the sections of the record that apply to them. A digital interface, perhaps with the sections highlighted that are meant for a particular cataloger, ensures that the cataloger is clear about which data are to be entered. This kind of quality control design helps ensure that data are entered correctly the first time.

When designing the Orthopaedic Surgical Anatomy Teaching Collection we did not foresee any transition to a new repository or schema, and so our decisions about which metadata schema to use were not influenced by those considerations. We have made clearly documented decisions about quality control, however, since we have multiple participants working on the metadata records over time. After data are entered into a Web-based work form, another person performs initial quality control, looking at a number of specified fields for accuracy. Before final approval the principal cataloger reviews the record once more. Once final approval has been granted, the record is considered complete.
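The work form idea can be expressed as a simple rule: each participant may change only the sections assigned to his or her role. The sketch below is hypothetical throughout; the role names and field names are invented for illustration and do not describe the work form used at any particular library.

```python
# Hypothetical roles and the record sections each may edit.
EDITABLE_FIELDS = {
    "subject_expert": {"description", "mesh_headings", "keywords"},
    "imaging_technician": {"file_format", "image_size", "date_deposited"},
    "rights_cataloger": {"usage_rights", "restrictions"},
}


def apply_edits(record: dict, role: str, edits: dict) -> dict:
    """Return a copy of record with edits applied, if the role may make them."""
    allowed = EDITABLE_FIELDS.get(role, set())
    rejected = sorted(set(edits) - allowed)
    if rejected:
        raise PermissionError(f"{role} may not edit: {', '.join(rejected)}")
    updated = dict(record)  # leave the original record untouched
    updated.update(edits)
    return updated


record = apply_edits({}, "imaging_technician", {"file_format": "TIFF"})
record = apply_edits(record, "rights_cataloger", {"usage_rights": "Educational use only"})
print(sorted(record))  # ['file_format', 'usage_rights']
```

Rejecting the whole submission when any field is out of scope, rather than silently dropping that field, makes the mistake visible to the cataloger at the moment of entry.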

4 Summary

These nine guiding questions may be used to assist a collection developer in thinking about the goals for his collection in a real and practical way. Deliberations about how to construct an appropriate metadata schema can be confusing, but by answering these questions a collection developer can be clear with his catalogers and programmers about how the collection is to be described and used. The answers to these questions have a direct effect on the search results for the users of the collection.