Experiences of Educators Using a Portal of Aggregated Metadata: Shreeves and Kirkham: JoDI

Abstract

The University of Illinois at Urbana-Champaign Open Archives Initiative Metadata Harvesting Project sought to test the viability of a search portal containing aggregated metadata for cultural heritage resources harvested using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Metadata was collected from 39 providers, including museums, archives, libraries, historical societies, consortia, and digital libraries. Some resources existed in digital formats, such as .JPG images. Other resources were analog objects and were represented digitally through the metadata. The paper documents a pilot user test with a small group of K-12 teachers-in-training. The users were asked to use the portal to locate primary source materials for use in the classroom. The results highlight the challenges posed by aggregations of heterogeneous metadata for both users and service providers. Areas for further investigation and approaches for more in-depth studies are suggested.

1 Introduction

The Open Archives Initiative (OAI) Protocol for Metadata Harvesting (PMH) is now well established as an important tool for building aggregations of metadata from dispersed collections. The protocol relies on both data providers, who expose their metadata, and service providers, who harvest and aggregate metadata from one or more providers. This harvesting model allows the data providers to concentrate on developing content and the service providers to build services based on the aggregated metadata for large or small domains (Lagoze and Van de Sompel 2001). The OAI-PMH is a technically "low-barrier" protocol that relies primarily on HTTP and XML, and it has been particularly successful for:

sharing metadata describing resources not readily available to current Web search engines, such as those within databases or with non-HTML content (the so-called "hidden web").
allowing participation by content developers who may be unable to participate in other methods for federated searching, such as Z39.50, due to technical or other limitations.
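To make the harvesting model concrete, the following is a minimal sketch of what a service provider does with the protocol's ListRecords verb: build an HTTP GET URL and parse the XML response into records. The base URL, the record identifier, and the trimmed sample response are hypothetical and serve only as illustration; the namespaces are those defined for OAI-PMH 2.0 and unqualified Dublin Core. This is not the harvester used by the Illinois project.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined for OAI-PMH 2.0 responses and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oaidc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the GET URL a service provider requests from a data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def parse_records(xml_text):
    """Return one dict per record: OAI identifier plus repeatable DC elements."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        oai_id = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        fields = {}
        for el in rec.iterfind("oai:metadata/oaidc:dc/*", NS):
            name = el.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
            fields.setdefault(name, []).append((el.text or "").strip())
        records.append({"id": oai_id, "dc": fields})
    return records


# Hypothetical response, trimmed to the parts parsed above.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:photo-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Street scene</dc:title>
          <dc:subject>Streets</dc:subject>
          <dc:subject>Photographs</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

records = parse_records(SAMPLE)
```

A real harvester would also follow resumption tokens and handle protocol errors, both of which are omitted here.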

In 2001 the Andrew W. Mellon Foundation funded seven metadata harvesting projects to test the efficacy of OAI-PMH. The University of Illinois at Urbana-Champaign Library's OAI-PMH project began in June 2001 and ended in May 2003. The project sought to:

Document the benefits and obstacles of aggregating metadata via the OAI protocol.
Understand the viability of search and retrieval of aggregated heterogeneous metadata in a specific domain.
Investigate whether such a portal to minimally mediated metadata could be useful to a specific group of users.

The UIUC portal utilized metadata harvested from a restricted domain, cultural heritage. However, in terms of the type of material described, the domain was defined broadly. Data providers were selected based on their ability to provide metadata which, for the most part, described historical or cultural primary source materials--without regard to whether those materials were meant for use by a specific community or whether the descriptions were created with a specific community in mind. The team decided to build on earlier work done at Illinois in the Teaching with Digital Content project by targeting K-12 educators as the user group of interest. The portal, however, was not initially developed specifically for K-12 educators; no specialized services were developed, and only minimal metadata normalization and processing occurred. The focus of the grant was to expose and aggregate item-level metadata describing cultural heritage resources that could not be accessed at the item level via Web search engines. The project did not focus on providing community-specific services. That said, the Illinois team was interested in exploring what level of mediation would be required to make such a portal useful to a specific community. Would it require metadata enhancement? Contextual information? The study described here helps to establish a baseline for further investigation.

This paper describes a pilot study undertaken to test the utility of the Illinois portal for the K-12 teaching community. We describe the nature of the metadata in the portal and discuss research on OAI-enabled aggregations of metadata and the use of digitized primary source material by teachers. Methodology and results for the pilot study are described. We conclude with reflections on the limits of our methodology and with new questions raised by this admittedly small study.

2 Background

The following sections describe aspects of the development of the UIUC portal that are relevant to the pilot study. More information and technical details can be found on the project Web site as well as in published papers and conference proceedings.

2.1 Metadata

The Illinois project built a repository which can be accessed through a search portal called the UIUC Digital Gateway to Cultural Heritage Materials (referred to herein as the UIUC portal). At the point of the pilot study (Fall 2002) the repository contained approximately 1.1 million original metadata records and an additional 1.5 million records derived from Encoded Archival Description (EAD) finding aids. The aggregated metadata describes an array of cultural heritage resources from 39 providers, including museums, archives, academic and public libraries, historical societies, consortia, and digital libraries. Approximately half of the participating institutions were OAI-compliant data providers, whose records were harvested directly from their own servers. The non-OAI-compliant providers delivered "data dumps" of metadata, which acted as sources for surrogate data provider sites implemented at Illinois (and used only for harvest by this project). The resources described ranged from artifacts to photographs to finding aids. Some resources existed in digital formats, such as .JPG images. Other resources existed only in analog format and were represented digitally through the metadata.

The common schema used for metadata stored in the repository was Dublin Core (DC). The use of DC was dictated by the OAI-PMH, which requires that metadata be exposed in unqualified DC (although additional formats can be exposed as well). In its unqualified form, the Dublin Core is a 15-element metadata schema developed in the late 1990s to enable easy description of Web resources. It has since evolved into a cross-domain lingua franca for resource discovery and exchange. DC is flexible in that each element is optional and repeatable. In practice, use of the elements varies widely.

For the Illinois project, metadata from OAI-compliant data providers was received in unqualified DC. Metadata from the non-OAI-compliant data providers was received in multiple schemas, including MARC, EAD, Visual Resources Association (VRA) Core, and locally developed schemas. These schemas were mapped to unqualified DC using standard crosswalks (where they existed) or were mapped by project staff in consultation with the data provider. In addition, metadata quality and completeness were not used as criteria for inclusion.
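As an illustration of what a crosswalk to unqualified DC involves, the sketch below maps a record in a locally developed schema onto the 15 DC elements. The local field names (object_name, maker, and so on) and the sample record are invented for this example and do not correspond to any schema actually received by the project.

```python
# The 15 elements of unqualified Dublin Core; each is optional and repeatable.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Hypothetical crosswalk from an invented local museum schema to DC.
LOCAL_TO_DC = {
    "object_name": "title",
    "maker": "creator",
    "keywords": "subject",
    "date_made": "date",
    "image_url": "identifier",
}


def crosswalk(local_record, mapping=LOCAL_TO_DC):
    """Map a local record to a dict of DC element -> list of values.

    Unmapped local fields and empty values are dropped, which is one way
    information can be lost when records are reduced to unqualified DC.
    """
    dc = {}
    for field, value in local_record.items():
        target = mapping.get(field)
        if target is None or not value:
            continue
        values = value if isinstance(value, list) else [value]
        dc.setdefault(target, []).extend(values)
    return dc


example = crosswalk({
    "object_name": "Quilt",
    "keywords": ["Textiles", "Quilts"],
    "accession_no": "1999.4",  # no DC mapping in this sketch, so it is dropped
    "maker": "",               # empty, so no creator element is produced
})
```

Because every DC element is optional, the output record is valid whether or not a given local field was present, which is part of why element usage varies so widely in practice.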

Most of the metadata harvested and received for the Illinois project described individual resources. The EAD metadata, however, merits special mention. EAD is used to encode finding aids, which are hierarchical descriptions of collections, usually in archives. An EAD file will typically include a top-level collection description plus multiple nodes (sometimes thousands) that describe items or groups of items in the collection. The implementation of EAD varies across (and within) institutions, as does the extent of description at the lower nodes. Mapping an entire EAD record to one unqualified Dublin Core record is not feasible, although certainly the item-level nodes could be mapped to the relation element. In many cases only the top-level collection description is mapped to a Dublin Core record (several OAI-compliant data providers practice this method). The project team decided to explore whether breaking up a single EAD record into multiple DC records with the relationships between them preserved in the relation element might make it possible for archives to share more information about their collections in metadata aggregations. This in turn might allow searching of item-level resources alongside item-level data from archival collections. The project received approximately 8000 EAD finding aids in "data dumps". Using an algorithm developed by the project team, these 8000 EAD files generated more than 1.5 million item-level DC records (describing mostly analog resources), bringing the total number of item-level DC records to approximately 2.5 million (Prom 2002).
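The general idea of breaking one hierarchical finding aid into multiple DC records linked through the relation element can be sketched as follows. This is a simplified illustration only and is not the algorithm developed by the project team: the finding aid is represented as a nested Python dictionary rather than as EAD XML, and the identifier scheme (a collection identifier followed by a dotted node path) is an assumption made for the example.

```python
def decompose(node, collection_id, parent_id=None, path="1"):
    """Yield one DC-like record per node of a finding-aid tree.

    Each record carries the identifier of its parent node in the relation
    element so the hierarchy can be reconstructed from the flat records.
    """
    record_id = f"{collection_id}#{path}"
    record = {"identifier": [record_id], "title": [node["title"]]}
    if parent_id is not None:
        record["relation"] = [parent_id]
    yield record
    for i, child in enumerate(node.get("children", []), start=1):
        yield from decompose(child, collection_id, record_id, f"{path}.{i}")


# Hypothetical finding aid: a collection, two series, and one folder.
finding_aid = {
    "title": "Smith Family Papers",
    "children": [
        {"title": "Correspondence",
         "children": [{"title": "Letters, 1917-1918"}]},
        {"title": "Photographs"},
    ],
}

dc_records = list(decompose(finding_aid, "ead:smith"))
```

In this toy case a single finding aid yields four DC records; the lower-level records are meaningful only in relation to their parents, which is why preserving the link matters.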

Analysis of a subset of approximately 600,000 records provided natively in DC revealed wide variations in the interpretation and application of DC elements by different communities. For example, 93% of records from museums used the subject element; only 15% of records from academic libraries did so (Shreeves et al. 2003). Such disparities, coupled with the variety of controlled vocabularies in use, present specific problems for anyone attempting to build an effective search service for aggregated metadata. Disparities in the subject element in particular preclude browsing by subject without significant mapping and mediation efforts by the service provider. The project developed a variety of strategies to minimize these disparities, including indexing and organizing metadata by type of material (image, text, physical object, etc.) and applying a normalization vocabulary to temporal information contained in the date and coverage elements. However, the project was also interested in exploring the value of minimally mediated metadata, given the limited resources and time available.
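A figure such as "the share of records from a given type of provider that use the subject element" can be computed along the following lines. The sketch assumes each record is held as a pair of provider type and a dictionary of DC element values; the four toy records are invented and do not reproduce the project's data or results.

```python
from collections import defaultdict


def element_usage(records, element):
    """Share of records, per provider type, with a non-empty value for element."""
    totals = defaultdict(int)
    used = defaultdict(int)
    for provider_type, dc in records:
        totals[provider_type] += 1
        if any(value.strip() for value in dc.get(element, [])):
            used[provider_type] += 1
    return {ptype: used[ptype] / totals[ptype] for ptype in totals}


# Invented toy records, for illustration only.
toy_records = [
    ("museum", {"title": ["Quilt"], "subject": ["Textiles"]}),
    ("museum", {"title": ["Plow"], "subject": ["Agriculture", "Tools"]}),
    ("academic library", {"title": ["Campus view"], "subject": ["Universities"]}),
    ("academic library", {"title": ["Letter"], "subject": [""]}),
]

usage = element_usage(toy_records, "subject")
```

Note that an element present but empty is counted as unused here; whether to count such records is a design choice that affects the reported percentages.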

2.2 Portal interface

The aggregated metadata is accessed through a search portal built using the University of Michigan Digital Library eXtension Service (DLXS) tools. The project team adapted the DLXS middleware, bibclass, to unqualified DC, which reflected the XML files harvested via the OAI protocol. The XML files were organized according to the type of material the metadata described: archival collections; images; moving images; text and sheet music; audio; physical objects; and Web sites. Each of these categories comprised a separate index, allowing users to search within one category if desired. This organization was done at the data-provider level because of the diversity or lack of controlled vocabularies used in the type element at the item level.

The primary entry point for the portal was a simple search page (Figure 1).

Figure 1. UIUC portal's original simple search page

Also included was an advanced search page (Figure 2).

Figure 2. UIUC portal's original advanced search page

On both search pages, users were able to limit their searches to metadata records providing "online access" to the resources described. This limit was based on whether or not a metadata record provided a URL in its identifier element. Each URL led either to the digital object described by the metadata or to a Web page providing additional information about the resource. A Browse by Collection feature was also included, offering access to all records in the aggregation organized by institution. The About Collections page provided a description and URL for each data provider included in the repository (see Figure 3).
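The "online access" limit described above amounts to a simple test on the identifier element. A minimal sketch follows, assuming records are held as dictionaries of DC element values and that only http and https URLs count; the portal's actual implementation within DLXS may have differed, and the sample records are invented.

```python
from urllib.parse import urlparse


def has_online_access(dc_record):
    """True if any identifier value is an http(s) URL.

    The test looks only at the form of the identifier. It cannot tell
    whether the URL leads to the digital object itself or to a Web page
    with further information about the resource.
    """
    for value in dc_record.get("identifier", []):
        parsed = urlparse(value.strip())
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return True
    return False


# Invented sample records.
photo = {"title": ["Street scene"], "identifier": ["http://example.org/images/42.jpg"]}
artifact = {"title": ["Ration book"], "identifier": ["1999.4.17"]}
untagged = {"title": ["Diary"]}

online = [r["title"][0] for r in (photo, artifact, untagged) if has_online_access(r)]
```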

In response to a search, DLXS returns brief records organized first by type of material and, within that set, by institution. The full display shows the entire DC record, although some elements were renamed for clarification (e.g. creator was renamed author/artist). In addition to the default DLXS interface, search results were redesigned to contextualize the EAD-derived item-level DC records. A URL was included in the identifier element that, when clicked, jumped to the appropriate node within the EAD record (Prom and Habing 2002). The project team conducted minimal usability tests early in the development of the portal and made some minor adjustments to the interface.

3 Research question and literature review

For this pilot study, the Illinois team chose to focus on K-12 educators. Our research question is: What is the usefulness of an OAI service provider search portal to aggregated cultural heritage material for K-12 educators?

Related questions include: How did educators make choices about which resources to use? Did they pay attention to the institution that was providing a resource? Did educators use the decomposed EAD finding aids included in the portal? What level of mediation is needed to provide a useful aggregation of metadata harvested via the OAI protocol?

Little has been written on how users interact with collections of aggregated metadata. However, with the advent of the OAI-PMH several studies began to look at the challenges of aggregating metadata for both internal processes and end users. Hagedorn (2003) conducted log analysis, user testing, and a user survey for the University of Michigan OAIster service provider. She notes that users such as scholars and researchers often did not know where to begin to look for online information (online journals and reference sources). Arms et al. (2003) and Shreeves et al. (2003) have noted how variations in metadata authoring practices challenge service providers' abilities to build consistently searchable systems. Ward (2003) analyzes the use of Dublin Core by OAI data providers. Challenges mentioned by these authors include the variation in the elements and vocabularies used, the granularity of objects described, and the depth of description. Liu et al. (2002) document the problems of heterogeneous metadata in the Arc digital library. They suggest that through a series of interactions with subject terms culled from harvested collections a user can choose the appropriate collections in which to search.

This pilot study examines how educators look for digital primary sources and what criteria they use in choosing such sources for use in the classroom. The literature focuses on both the use of pre-packaged lesson plans and the use of search engines such as Yahoo! or Google. Many articles simply give lists of useful sites. VanFossen and Shiveley (2000) describe where educators can access primary sources: textbooks, which often include duplicates of primary sources; commercial reproductions of primary source material (sometimes called "jackdaws") which include suggestions for lesson plans; and the Internet. They suggest using both pre-packaged jackdaws, such as those found on the Library of Congress American Memory site, and using standard search engines to create one's own jackdaw. While VanFossen and Shiveley do not speak to the question of authoritative sources, Kobrin (2001) writes that sites are most useful when they have been vetted by a historian. He writes that History Matters as well as the Library of Congress and the National Archives Web sites are "safe, secure, informative, and always accurate". Warren (2001) also describes using jackdaws from the National Archives in a high school history classroom. Lee (2002) notes that "educators and historians must closely evaluate digital historical resources before using them".

Interpretation by historians or curators, historical context, and relevance were found to be important qualities for educators in digital systems. Deniman et al. (2003) identify several design considerations for educational discovery systems, including:

Enable users to decide quickly and with less effort whether a resource is relevant or not.
Achieve appropriate balance between precision and recall and empower users to favor one over the other.

In the Digital Cultural Heritage Community project at the University of Illinois at Urbana-Champaign, educators, librarians, museum curators, and archivists worked together to identify primary source materials from local museum, library, and archival collections to digitize for use within a classroom (in this case 3rd, 4th, and 5th grades), focusing on materials that could be linked to curriculum units and state learning standards. Bennett et al. (2000) note that it was sometimes problematic to fit primary source material in local collections into the broad scope of curricula units. Educators desired access to digitized national artifacts and relied heavily on the curator's or archivist's interpretation of the artifact in order to fit it into their classroom activities. Bennett and Jones (2001) note: "We need to spend more time making what we digitize useful for teachers and students and less time worrying about getting in on the web". Supporting this view, Gilliland-Swetland et al. (1999) note that "a comparatively small amount of primary source material, if appropriately selected, described, and contextualized [emphasis theirs]" can be adequate for use in the classroom. VanFossen and Shiveley (2000) note that while textbooks and physical jackdaws have gone through a vetting and editorial process and provide context for primary source material, very often the material found on the Internet does not: "The selection of one's own primary source documents from the Internet also presents the task of packaging the material in a contextually accurate manner". In addition to the provision of interpretation, historical context, and relevance, Lee (2002) notes that clarity and a commitment to K-12 education are also important.

Although the project did not have the resources to develop contextual material for the portal, the literature also indicates that a portal devoted specifically to primary source cultural heritage information could be a valuable tool for educators in that it contains vetted sources from reputable institutions and provides direct links to digital objects. The Illinois OAI-PMH project was developed to provide precisely these sorts of links.

4 Methodology

For this pilot study, the user population comprised 23 upper-class college students training to become K-12 social studies teachers in an honors-level curriculum and instruction course. Although the Illinois team lacked the resources to identify and gather working professionals, the test subjects had classroom experience as student teachers. The test subjects were asked by their professor to use the UIUC portal to find primary sources for use in preparing a lesson plan on a specific social sciences topic for a high school class. They were also assigned to write short papers about their experience (see Appendix 1). Prior to initiating their searches, users were introduced to the concept of metadata aggregation and were informed that the portal would provide pointers to digital content held elsewhere (thus correcting an inaccurate statement in the written assignment, which was authored by their instructor and not by project staff) and would include records for analog resources.

Prior to the test, the Illinois team created a duplicate portal for use by the test subjects, enabling transaction data from the pilot study to be gathered. The team was aware of the limitations of transaction log analysis but was interested in how it might supplement qualitative data gathered from the pilot users. Individually, the students conducted unobserved searches on the portal and wrote preliminary evaluations. Two focus groups were conducted simultaneously in separate locations, with each group consisting of approximately one-half of the test users, i.e. approximately 11 subjects per group. Facilitators posed several questions to the groups (see Appendix 2) and responded to questions from the subjects. Each focus group session was audio taped and transcribed. After the focus groups, the subjects used the portal a second time and wrote final reports. In total, 23 subjects tested the portal and 46 papers were read and minimally coded. Table 1 summarizes the results gathered from the papers.

5 Results

This section presents the users' experiences with and comments about the portal. Given that this was a small pilot study, our intention is merely to describe key results and not to generalize to other user groups or situations.

5.1 Focus groups and written evaluations

Perhaps as a result of group dynamics or because the subjects wished to vent their frustration at the obstacles they encountered in using the portal, a great deal of focus group time was given over to critiques of the interface. In particular, the subjects complained that the portal's simple and advanced search pages were confusing and lacked instructions, and that the portal did a poor job of explaining its purpose. All of the subjects reported trouble making use of the Online Access Only switch in their searches. This confusion was largely the result of the inclusion of item-level records derived from EAD finding aids in the "Online Access Only" category. The subjects were unanimous in their disapproval of both the intended and actual functioning of the switch. Because finding aids refer to analog objects, the subjects resented it when finding aids were labeled as offering "online access". In addition, they referred to finding aids as "descriptions" or "lists", i.e. secondary instead of primary sources. In focus groups, subjects also reported that they found certain jargon used in the portal confusing, e.g. unique identifier, contributor, metadata. They also reported difficulty making use of advanced search features such as field-restricting and date ranges.

In their written reports, all subjects praised the idea of a unified portal to diverse sources of online primary materials. However, after using the UIUC portal, 74% of subjects stated that they would not use it as practicing educators. Table 1 quantifies some commonly identified issues in the written reports.

Table 1. Recurring comments in written reports
Comment	Percentage of Users (out of 23)
Confused by Online Access Only switch	100%
Reported trouble with results describing items that could not be viewed online (including results linked to online finding aids)	100%
Accorded equal credibility to all data providers	100%
Reported that portal would not be useful for teachers	74%
Wanted relevance ranking in results	70%
Said they tried to use search-refining features	65%
Compared portal to commercial Web search engines (e.g. Google)*	57%
Reported null or near-null result sets for at least one search	48%
Compared portal to electronic abstracting and indexing services (e.g. ERIC)	44%
Had trouble with "type of material" categories	38%
Used portal as a clearinghouse for reputable Web sites	22%
Reported difficulty browsing due to volume of content	22%
Said lack of controlled vocabulary was a factor in unsuccessful searches	9%
* Generally, the subjects rated Google and other commercial search engines more efficient and better at indicating the relevance of results. In addition, they noted that all results in Google refer to accessible digital objects (Web pages). They rated the UIUC portal better than commercial search engines on one criterion: being restricted to credible sources.

5.1.1 Frustration with redirects within search results

Despite some subjects' stated preference for visiting data provider sites individually, in their searches in the UIUC portal, the subjects were universally frustrated when a particular result directed them to a different site. The assignment promised that "The [UIUC portal] allows for accessing artifacts without going directly to individual institutional homepages". Although this instruction was meant to indicate that the portal allowed for a federated search of multi-sourced metadata (and was clarified during an informational session prior to the test), the subjects understandably interpreted it to mean that all items of interest were accessible without leaving the UIUC portal. They reported a significant slowing of their efforts when a pointer, or active link, within a record led them to another institution's Web site in which they had to execute an additional search. The subjects clearly believed that a live URL in a search result should immediately display the digital object of interest. One student described the interaction as being comparable to going to McDonald's "and upon walking up to the counter the employee hands across directions to Burger King across town".

5.1.2 Frustration with EAD-derived records and records describing analog resources

All users reported disappointment when records referred to analog resources. The records derived from the EADs, constituting over half the total number of records in the portal, for the most part had no digitized item (as opposed to the digital finding aid) associated with them. The inclusion of finding aids and their decomposed item-level records was an obstacle for these users. For example, a user who selected a search result labeled "letters from a WWI soldier" might find that the record referred to the holding institution's finding aid instead of to the letters themselves. The subjects referred disparagingly to finding aids (a term they were unfamiliar with) as "abstracts", "lists", and "descriptions", as distinguished from "sources" or "materials". They did not understand why the records were included and were confounded by opaque labels, such as "Box 23".

5.1.3 Search results difficult to use

Variations in controlled vocabularies and disparities in the use of Dublin Core had resulted in widespread inconsistencies in the harvested metadata. As a result, the Illinois team had enabled greater recall by making a keyword search on all fields the default search. Not unexpectedly, keyword searches produced vast quantities of unsorted results. The lack of a ranking feature in search results exacerbated the difficulty of identifying useful resources. The vast numbers of unranked results left the test group feeling overwhelmed and unable to make good use of search results. They consistently lauded the ranked results retrieved in online commercial search engines, such as Google, and in abstracting and indexing services, such as Lexis-Nexis.

Although the goal of the research was to harvest a significant quantity of metadata (eventually totaling over 2.5 million records), 22% of subjects saw the quantity of records as a detriment to successful search and discovery. For instance, they commented that the Browse Collections feature was not useable because the total number of items in any given "type of material" category might number in the tens of thousands.

Among the 23 subjects, 15 (65%) stated that they attempted to refine their searches using the advanced search page, with the majority of them attempting to limit their results by date range. The testers clearly understood that the entries in metadata fields determined their retrieved results, asking whether the date range field retrieves items created during or relating to the period. (In fact, this option malfunctioned during the test period, so the users' frustration is justified.) Several users reported trouble making use of field-restriction operators. For example, one tester inquired whether letters written by Franklin Delano Roosevelt were best retrieved by entering "Roosevelt" in the Author field or the Description field.

As might be predicted, the lack of a common controlled vocabulary among metadata providers was problematic. Some (9%) subjects noted that synonyms, such as America, United States, and U.S.A., produced entirely different result sets. One subject's search for Asian American Working Class found no results, even though the repository contained records for a large number of photographs of an early 20th century Japanese-American community. One student commented: "I don't imagine all entries are standardized, but it would still be useful to have some enumeration of possibilities".

5.1.4 High incidence of null or near-null result sets

Despite the large amount of aggregated metadata, some users found that there were no matches for particular topics. Searches on Women's Suffrage and Great Depression, for example, found few or no results. The repository's content, of course, was dictated by the metadata available to be harvested using the OAI-PMH, and by which institutions were willing to share their metadata. The Illinois team did not actively pursue particular data providers in order to round out the topical coverage. As a result, coverage of cultural heritage topics is spotty.

Some subjects reported specific expectations of content they believed should be available from a "cultural heritage" portal, including certain classic documents. They expressed surprise that no scanned images of the United States Constitution or the Magna Carta, for example, can be retrieved through the portal. The majority of testers demonstrated an understanding that the portal's metadata was produced elsewhere. Nevertheless, they held the portal's architects responsible for the aggregated metadata, and they stated that steps should be taken to mitigate the problems caused by missing, unpredictable, or inconsistent metadata. Students commented that: "If [the UIUC portal] recognizes that a subject is underrepresented, then the [portal] managers must find an archival collection that has more sources (on a particular topic or group)", and "I understand some sources are more difficult to come by, but it is [the portal's] mission to provide diversity among sources".

5.1.5 Type of material categories hard to use

The bulk of the indexing that was performed on the harvested records involved chunking the data by type of material, e.g. documents, images, archival collections, etc. However, these categories proved problematic for the test subjects. In some cases, users did not understand the terms, and some users commented that the categories seemed to overlap. For example, would a digital photograph of a WWII ration book appear under Images or Archival Collections? The term Archival Collections was particularly confusing. The portal's stated concentration on "cultural heritage" materials apparently led these subjects to consider all of the records therein as referring to materials that are "archival" in nature.

5.1.6 Users accorded equal credibility to all data providers

In focus groups and written evaluations, there was consensus that the name of a holding institution did not influence which search results the test subjects chose to view. The subjects reported that they assumed the nature of the UIUC portal assured that all data providers could be considered credible and authoritative providers of primary source materials. This fact has implications for how metadata aggregators organize and present their collections.

It is interesting that, while the subjects credited Google with speedier delivery of relevant results, some subjects commented that, in order to use Google to locate reputable sources of material, they were required to scan the results for URLs in the domains .org, .gov or .edu. The ease of scanning the URLs in search results was identified as a benefit of Google.

5.1.7 Useful as a screening tool for other online collections

Nearly one-quarter of the subjects discovered an unintended use for the portal: as a clearinghouse for pre-screened reputable online collections. Subjects reported that they used the collection descriptions and URLs on the About Collections page (shown in Figure 3) to identify and link to topic-specific sites. Having once discovered the About Collections page, these subjects expressed a preference for bypassing the metadata search provided by the portal and going directly to individual institutional Web sites. The testers commented that they believed these context-specific collections were likely to offer more, and more relevant, material than the UIUC portal. Indeed some subjects felt that vetting and providing access to specific online collections was the proper mission for the UIUC portal and that it should abandon attempts to offer searching of aggregated metadata.

Figure 3. UIUC portal's About Collections page

5.2 Transaction log analysis

Although we are aware of the limitations of transaction log analysis, we found that site usage data helped to supplement the qualitative data we gathered.

Twenty-three users accessed the site 120 times during the test period. During these sessions, they performed 555 searches which were almost evenly split between the Simple Search (268) and the Advanced Search (287) pages.

Although 65% of subjects reported that they attempted to use the advanced search features, the transaction logs do not support these statements. The Title field was utilized in a total of 186 searches; in only 10 searches, however, was it chosen where it was not pre-selected by default. Author/Artist was selected in 218 searches but was selected only once in a list in which it was not the default. On the advanced search page users could combine search terms using from one to three Boolean operators. In only 22 cases was a Boolean operator selected when it was not the default selection.

Although users were vocal about their preference for quickly retrieving direct links to primary material, fewer than half of all searches (226) included the Online Access Only switch, which by default was not selected. In each retrieved record, users could select a link labeled Online Access Available which acted as a redirect, taking the user off the UIUC portal to a data provider site. Online Access Available links in search results were clicked 216 times. The test subjects reported that they interpreted the phrase online access to indicate that the link would display a digital object, such as a digital photograph. In reality, the redirect links took users to all types of materials including finding aids, digital images, and even other Web sites in which a user might have to re-execute a search. This was a source of frustration for users, who believed the Online Access Available option was malfunctioning.
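
The proportions behind these counts can be recomputed directly. The following is a minimal sketch in Python, offered only as an illustration and not as part of the project's own analysis. The constant names and the `share` helper are hypothetical; the figures are those reported in this section.

```python
# Counts reported in section 5.2. All names here are illustrative.
USERS = 23
SESSIONS = 120
SEARCHES = 555
SIMPLE = 268
ADVANCED = 287
ONLINE_ONLY = 226        # searches with the Online Access Only switch set
TITLE_USED = 186         # searches in which the Title field was used
TITLE_NON_DEFAULT = 10   # ... where Title was not pre-selected by default
AUTHOR_USED = 218        # searches in which Author/Artist was used
AUTHOR_NON_DEFAULT = 1   # ... where Author/Artist was not the default


def share(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# The two search pages together account for every logged search.
assert SIMPLE + ADVANCED == SEARCHES

report = {
    "Simple Search, % of all searches": share(SIMPLE, SEARCHES),
    "Advanced Search, % of all searches": share(ADVANCED, SEARCHES),
    "Online Access Only, % of all searches": share(ONLINE_ONLY, SEARCHES),
    "Title actively chosen, % of Title searches": share(TITLE_NON_DEFAULT, TITLE_USED),
    "Author/Artist actively chosen, % of Author/Artist searches": share(
        AUTHOR_NON_DEFAULT, AUTHOR_USED
    ),
}

for label, value in report.items():
    print(f"{label}: {value}")
```

On these figures, roughly 41% of searches used the Online Access Only switch, and the Title field was actively chosen in about 5% of the searches in which it was used.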

6 Discussion

6.1 General observations

A clear finding is that, while the OAI-PMH itself is readily implemented, the challenges posed by large amounts of heterogeneous metadata are significant. In addition to more concentrated attention on the usability of the interface early in a project's development, multiple levels of intervention and mediation are required to make aggregated heterogeneous metadata useful. To meet the needs of a community of educators, it is insufficient merely to offer access to records in a domain of interest. Rather, the results suggest that the aggregation of metadata is, by its very nature, likely to produce unsatisfactory user experiences unless the metadata is significantly modified and targeted services are developed. This is a significant challenge for OAI service providers, because metadata quality and completeness vary tremendously from data provider to data provider. An alternative approach would be to develop a community of OAI data providers that expose metadata in schemas meant for use by educators (for example, IEEE Learning Object Metadata) as well as in Dublin Core. Indeed, the implementation guidelines of the OAI Protocol are deliberately non-specific so as to provide room for community-specific applications of the protocol. (Lagoze and Van de Sompel 2003) Few such communities have developed; the Open Language Archives Community (OLAC) is a notable exception.

The majority of these pilot testers reported that the portal would not be useful for K-12 educators. They found that searching for primary sources through the portal was inefficient and ineffective. They also stated that they could not browse the collections by subject effectively because there was no consistent use of the subject field. The inclusion of finding aids and other "dead ends", the frequency of null result sets, the unpredictable coverage of cultural heritage topics, and the lack of ranking for results were key factors in the subjects' negative evaluations of the portal. Some of these problems could be mitigated through a thorough analysis and manipulation of aggregated metadata. Such an effort could create topical groupings, normalize subject searching, and clearly identify which metadata describes digital objects.
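
One small part of such manipulation, the clean-up of subject strings before indexing, might be sketched as follows. This is an illustration under stated assumptions, not the method used in the UIUC portal: the function name `normalize_subjects`, the choice of delimiters, and the sample values are all hypothetical. It only reconciles superficial differences in case, spacing and punctuation; it does not map terms to a controlled vocabulary.

```python
import re


def normalize_subjects(raw_values):
    """Split, clean and de-duplicate free-text subject values.

    Assumes (hypothetically) that providers separate multiple subjects
    with ';' or '|' and differ mainly in case, spacing and punctuation.
    """
    seen = set()
    normalized = []
    for raw in raw_values:
        for part in re.split(r"[;|]", raw):
            # Collapse runs of whitespace, trim stray punctuation, lowercase.
            term = re.sub(r"\s+", " ", part).strip(" .,").lower()
            if term and term not in seen:
                seen.add(term)
                normalized.append(term)
    return normalized


# Hypothetical subject values as three providers might supply them.
sample = [
    "World War, 1939-1945; Rationing.",
    "rationing",
    "  World  War, 1939-1945 ",
]
print(normalize_subjects(sample))
```

With the sample values above, the three inconsistent inputs reduce to two distinct subject terms, which could then serve as consistent browse entries.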

The testers generally agreed that the UIUC portal was useful for pre-screening collections of digitized resources and providing both collection descriptions and links. They stated that this vetting of available sites saved them time and effort and enabled them to look for primary source materials in context-specific, specialized collections, increasing the chance that they might find usable items.

Similarly, these results support the view that educators prefer contextualized primary materials. The test subjects demonstrated this preference by commenting that, after discovering a topic-specific data provider through the UIUC portal, they preferred to go directly to the data provider site to find additional material on their topic and to view these materials in a subject-specific context.

The problem of spotty coverage of the domain of cultural heritage raises important questions for metadata aggregators. For example, these users brought with them an expectation of active collection development (see section 5.1.4). How do we set realistic user expectations? How do we help users form a mental map of the subject areas that are covered and those that are excluded? One possible approach currently being explored in the IMLS-funded Digital Collections and Content project at the University of Illinois at Urbana-Champaign is the use of collection descriptions in conjunction with aggregated metadata to pre-select the item-level metadata sets that are searched. Users may thus be given a better sense of the collection "landscape" they are exploring. (Heaney 2000)
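
The general idea of using collection descriptions to narrow an item-level search can be sketched as a two-stage lookup. The code below is a minimal illustration only and does not describe the Digital Collections and Content project's actual design: the record layout, the function names `select_collections` and `search_items`, and the sample data are all assumptions made for the example.

```python
def select_collections(collections, topic):
    """Return the ids of collections whose description mentions the topic."""
    topic = topic.lower()
    return {c["id"] for c in collections if topic in c["description"].lower()}


def search_items(items, query, collection_ids):
    """Search item titles, restricted to the pre-selected collections."""
    query = query.lower()
    return [
        item
        for item in items
        if item["collection"] in collection_ids and query in item["title"].lower()
    ]


# Hypothetical collection descriptions and item-level records.
collections = [
    {"id": "c1", "description": "Photographs of the home front during World War II."},
    {"id": "c2", "description": "Nineteenth-century maps of Illinois counties."},
]
items = [
    {"collection": "c1", "title": "Ration book, 1943"},
    {"collection": "c2", "title": "Map of Champaign County"},
    {"collection": "c2", "title": "Ration depot site plan"},
]

chosen = select_collections(collections, "world war")
results = search_items(items, "ration", chosen)
print(chosen, [item["title"] for item in results])
```

In this toy example the item from the map collection that happens to contain the word "ration" is excluded, because the user has first chosen the collections relevant to the topic.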

Given the number of records derived from EAD finding aids (and the minimal amount of information in each derived record), it is not surprising that the inclusion of finding aids was an obstacle for all of the test users. Several members of the test group acknowledged that these records could be useful for a researcher or scholar but felt that they were not useful for K-12 teachers.

6.2 Changes made to UIUC portal

Based on the pilot test, the Illinois team implemented several improvements to the interface and functionality of the portal. First, the team eliminated EAD-only collections from the portal to avoid the resulting frustration when a preponderance of results for a particular search referred to item-level records derived from finding aids.

In addition, the team attempted to clarify for users which records offered direct online access--and which did not--by improving the labeling in search results. The single Online Access Available link in search results was replaced by two links with more specific wording:

View item was used for resources that are viewable directly from within the search result.
Learn more about this item was used in results that would lead the user to another Web site or to descriptive information about the item.

Finally, we placed the simple and advanced search interfaces on a single page in an attempt to increase the chances that users would make use of the search-limiting features (Figure 4). We improved labeling to better match user expectations, and we combined several resource-type categories into a simpler set of options.

Figure 4. Revised UIUC portal's combined simple and advanced search page

6.3 On the design of future user tests

This small pilot study enabled us to identify some of the most egregious usability problems with the UIUC portal. However, future user tests could be designed to deliver more specific--and therefore more actionable--feedback. The pilot testers conducted their searches privately, using search terms of their own choosing, so their experiences varied widely and were difficult to compare. We were able to identify a few recurring issues (described in section 5.1); by controlling the test environment and the tasks, we could obtain more meaningful data.

For example, by assigning all users to search for material on the same set of topics, we could discover patterns in how users approach the search interface. Naturally, we would direct them to discover materials that we know exist in the repository. Since the search results for the assigned topics would be predictable, we could study the users' interaction with the results in a more systematic way. As it happened, some user searches had a zero success rate due to the lack of metadata on certain search topics. This hampered our ability to analyze how users deal with large result sets from heterogeneous metadata. In addition, we might select the test subjects using different criteria, for example, by controlling for Internet searching experience.

Clearly, we would wish to conduct tests with individual users and observe them directly. This would allow us to track each user's progress through the interface, carefully recording how the users responded to search pages and results. Although usage logs can provide aggregate results, they do not tell us the motivation behind a user's actions. Using the "thinking aloud" protocol, we could also gather data about why users made the choices they did, paying particular attention to when and why users decide to abandon a given search or strategy. Finally, replacing focus groups with individual interviews would avoid the data-skewing that may have resulted from a natural tendency towards "groupthink".

7 Conclusion

The OAI Protocol for Metadata Harvesting has deservedly been hailed as an important tool in the development of digital libraries from multiple, dispersed digital collections. However, it is a tool designed only to easily move metadata from one place to another. To provide useful aggregation of metadata, it is clear that a non-trivial amount of mediation must take place. This pilot study helps to establish a baseline of metadata mediation that service providers may wish to consider.

A wide range of programmatic solutions could be undertaken to improve user interactions with the aggregated metadata: for instance, applying more sophisticated indexing tools, building more robust search features, and ranking results based on the frequency with which keywords appear in a resource or its metadata. The implementation, through automated means, of controlled vocabularies for certain fields, such as place names and personal names, would improve both recall and relevance of results. However, enforcing a controlled vocabulary for subject terms among such a diverse range of data providers is not likely to be feasible.
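
Ranking by keyword frequency, the simplest of these options, can be illustrated with a short sketch. It is an assumption-laden example and not a description of any existing portal feature: the field names, the `rank_records` function and the sample records are hypothetical, and the score is a raw term count with no weighting or length normalization.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_records(query, records, fields=("title", "subject", "description")):
    """Order records by how often the query terms occur in their metadata.

    Records with no matching term are dropped. Ties keep their input order.
    """
    query_terms = tokenize(query)
    scored = []
    for record in records:
        counts = Counter()
        for field in fields:
            counts.update(tokenize(record.get(field, "")))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, record))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored]


# Hypothetical harvested records reduced to a few Dublin Core-like fields.
records = [
    {"id": "a", "title": "Map of Illinois", "description": "County map, 1890."},
    {"id": "b", "title": "Ration book", "description": "WWII ration book issued in Chicago."},
    {"id": "c", "title": "Photograph of a storefront", "description": "Sign advertises ration stamps."},
]

ranked = rank_records("ration book", records)
print([record["id"] for record in ranked])  # prints ['b', 'c']
```

A production service would more likely use an established retrieval library and a weighted scoring scheme; the point here is only that even sparse metadata supports a basic ordering of results.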

As mentioned earlier, the IMLS Digital Collections and Content project is exploring the combination of a searchable set of collection descriptions linked to an item-level metadata aggregation as a useful approach to setting user expectations. A similar approach is used in the Arc search service at Old Dominion University. (Liu 2002) Providing metadata schemas that are more complex than unqualified Dublin Core could be helpful in building more useful portals. OLAC has developed a specific metadata schema (based on Dublin Core) to convey the important aspects of the linguistics materials described. (Simons and Bird 2003) Other options involving editorial intervention might also be productive, e.g. portal developers could build thematic exhibits (based on analysis of metadata) to offer glimpses into the range and type of materials available through the portal. The Digital Library for Earth System Education (DLESE) project combined metadata searching (based on human-applied indexing) with automated text retrieval that relied on term weighting to rank resources by relevance. (Deniman et al. 2003) Their results show that such a "hybrid" approach can deliver a more predictable user experience. Narrowing the domain of interest may also be a fruitful tactic. Emory University's americanSouth.Org project, another Mellon-funded OAI project, focused specifically on materials relating to the history and culture of the southern region of the United States and included commentary from scholars in that domain to help contextualize the aggregated metadata. (Halbert 2003) It is to be hoped that a combination of these approaches will be helpful in mitigating the challenges found in the pilot study described here and in allowing service providers to fully exploit the potential of OAI for community-specific services.
