Author-generated Dublin Core Metadata for Web Resources: A Baseline Study in an Organization
Keywords: author-generated metadata, NIEHS, Dublin Core, metadata evaluation
Organizational Web sites are growing at a rapid pace, particularly as employees and departmental offices increasingly turn to the Web as a chief source for disseminating important information. Statistical and semantic-based search engines facilitate resource discovery on organizational Web sites, often providing satisfactory results. As organizational Web sites continue to grow, search engine scalability and retrieval effectiveness are likely to decline, and organizations need to consider alternative or additional resource discovery options.
A metadata project is one way to improve resource discovery on an organizational Web site. This can be accomplished by embedding structured metadata in Web resource headers and installing a metadata search engine for searching on individual or a combination of metadata elements (e.g. HotMeta, developed by the Distributed Systems Technology Centre: http://www.dstc.edu.au/Research/Projects/hotmeta/search.html). Although this seems like a logical plan, organizations have been slow to explore this option due to uncertainties about who should create metadata. Professional metadata creators (e.g. catalogers and indexers) are ideal candidates (Milstead & Feldman 1999), although they are costly and limited in their availability. Resource authors might also be tapped for metadata creation, yet there is a perception that author-generated metadata will be of poor quality and may actually hamper rather than aid resource discovery (Thomas & Griffin 1999).
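As a minimal sketch of what embedding structured metadata in a Web resource header can look like, the Python below renders a record as HTML meta tags using the common "DC." name prefix. The function name render_dc_meta_tags and the sample record are hypothetical illustrations; they are not taken from the NIEHS project or from HotMeta.

```python
from html import escape


def render_dc_meta_tags(record):
    """Render a dict of Dublin Core element -> value(s) as HTML <meta> tags."""
    lines = []
    for element, values in record.items():
        # Accept a single string or a list of repeated values (e.g. several subjects).
        if isinstance(values, str):
            values = [values]
        for value in values:
            lines.append(
                '<meta name="DC.{}" content="{}">'.format(
                    escape(element, quote=True), escape(value, quote=True)
                )
            )
    return "\n".join(lines)


# Hypothetical sample record; the values are illustrative only.
sample_record = {
    "title": "Example Laboratory Home Page",
    "creator": "Doe, Jane",
    "subject": ["toxicology", "environmental health"],
}

print(render_dc_meta_tags(sample_record))
```

A metadata-aware search engine could then index each element separately, which is what allows searching on an individual element or a combination of elements.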
The research presented in this paper counters this notion and supports the hypothesis that resource authors are good candidates for metadata creation in an organizational setting. Resource creators are intimate with their work, they want their work to be discovered and consulted, and they know their audience and can thus describe their resources appropriately. These factors support the hypothesis that resource authors can create acceptable metadata when working with the Dublin Core, a schema initially designed for resource authors. The use of the Dublin Core ties in with this study's secondary hypothesis, which is that given basic guidance through a simple and intelligible Web form, resource authors can create professional quality metadata.
Web resource metadata can be generated via automatic or human processes. Within the context of the Web, search engine spiders and HTML and XML editors and generators produce various types of metadata automatically. These tools produce fairly accurate metadata for certain elements, such as "date produced" or "MIME type", but their results vary widely for more intellectually demanding metadata, such as "subject descriptors" or "author". As a result, many environments prefer human intellectual processing for the production of schema-specific metadata.
Professional metadata creators and resource authors represent two main classes of metadata creators. Metadata professionals, such as catalogers and indexers, are people who have had formal training and are proficient in the use of descriptive and content-value standards. Although researchers have noted problems with inter-indexer consistency (e.g. Chan 1989), professionals generally produce high quality metadata (Weinheimer 2000).
Resource authors (hereafter referred to as authors) are individuals responsible for the creation of the intellectual content of a work (Yee 1995). Researchers regularly produce abstracts, keywords and other types of metadata for their scientific and scholarly publications. Visual artists, another class of authors, generally sign and date their works. In the Web environment authors can provide metadata via a template or editor, while turning to webmasters or other skilled people to make their work Web-accessible.
A number of digital library projects support author-generated metadata. This practice makes sense when weighing the rapid growth of the Internet against the economics of hiring metadata professionals. Two examples include the National Digital Library of Theses and Dissertations (NDLTD) (http://www.ndltd.org) and the Synthesis Coalition's National Engineering Education Delivery System (NEEDS) digital library for engineering education (http://www.needs.org/engineering/). These projects both have Web forms that help authors to create consistent and accurate metadata for theses and dissertations contributed to the NDLTD and courseware contributed to NEEDS.
Author-generated metadata projects appear to be less popular in the organizational setting. The rationale given for this predicament is that authors lack the metadata professional's expert skills, and they will therefore produce insufficient and poor quality metadata (Weinheimer 2000). Another possibility is the absence of an official program for providing sufficient metadata training to authors. Limited executive support and inadequate financial, human and space resources are factors that may have an impact here. A final point to consider is that authors may be reluctant to participate in a metadata project because they view an executive recommendation for author-generated metadata as a bureaucratic order or extra chore as opposed to an option that has rewarding benefits. Organizations need to investigate these factors if they are to implement successful metadata projects, involve authors in metadata production, and improve resource discovery on their Web sites. The research reported in this paper contributes to this process by presenting the results of a baseline study on the ability of authors to generate acceptable metadata in an organizational setting.
Cataloging and indexing studies have counted the number of subject terms (subject headings or descriptors), name headings and MARC fields per metadata record (e.g. Xu & Lancaster 1998). Counting studies can provide valuable information and are fairly simple to conduct. Studies have also assessed metadata record quality by examining subject term specificity and exhaustivity, metadata record completeness, and other substantive factors (e.g. Zeng 1993). Qualitative metadata examinations can be problematic because they demonstrate a subjective condition and because "to date, no consensus has been reached on conceptual and operational definitions of metadata quality" (Moen 1997).
Although there is an absence of an established set of metadata metrics, researchers have identified a number of key characteristics that can be used for metadata evaluation. Moen et al.'s (1997) comparative analysis of bibliographic control, metadata and digital library work, by six different authors, is one of the most thorough investigations in this area--resulting in 23 evaluation criteria. Moen and colleagues extracted accuracy, consistency, completeness, and currency from this initial set (the 23 criteria) for their analysis of Government Information Locator Services (GILS) metadata records. Their four criteria overlap, to some degree, with Tozer's (1999) data quality measures of accuracy, completeness, consistency, timeliness, and intelligibility. Rothenberg (1996) identifies correctness and appropriateness as two key aspects for data evaluation and emphasizes that the data's contextual use needs to be considered for improving data quality. Metrics, derived from the work cited here, facilitated this study's metadata evaluation.
This preliminary study investigated the ability of authors to produce acceptable Dublin Core metadata for resources placed on the National Institute of Environmental Health Sciences' (NIEHS) Web site. With this study, we aimed to develop procedures for gathering baseline data and to obtain some preliminary results about authors as metadata creators. The overriding goal of this experiment was to assist with the implementation of the NIEHS metadata project. The study was guided by the following three research questions:
- Can authors create Dublin Core metadata of acceptable quality?
- What perceptions do authors have about metadata in general and metadata generation activities?
- What Web form features can assist with author-generated metadata?
A multi-method approach was used to examine this study's research questions. The primary methods included an experiment to collect the author-generated metadata and a content analysis to examine the acceptability of this metadata. A participant profile questionnaire and a post-metadata generation questionnaire gathered contextual information for data analysis.
NIEHS employees and scientists, identified as authors, were recruited for the metadata generation experiment. A set condition was that they had authored the intellectual content of at least one Web resource; joint authorship was acceptable. A one-hour session was held in the NIEHS Computer Training Laboratory. During the first half-hour participants completed a profile questionnaire and took part in a metadata tutorial that introduced the concept of metadata and explained the features of the NIEHS-Dublin Core metadata form (hereafter referred to as the NIEHS form; Figure 1). The NIEHS form is loosely based on the DC-Dot Dublin Core metadata editor (http://www.ukoln.ac.uk/metadata/dcdot/) and presents the NIEHS-Dublin Core metadata schema. Dublin Core was selected for the NIEHS metadata project because it was developed for author-generated metadata and supports resource sharing and interoperability among information systems (see Robertson et al. 2001 for NIEHS-Dublin Core Schema details and a discussion of how it relates to the Dublin Core).
During the second half-hour of the experimental session, participants produced metadata records via the NIEHS form. The form requires authors to manually input metadata, except for publisher and rights metadata, which have fixed values in the NIEHS-Dublin Core schema. As a result, these two elements are absent from the form's interface, but they are automatically generated when the author selects the submit button. All metadata records input into the form are stored in XML. After the metadata creation experiment, participants completed a post-metadata creation questionnaire.
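The submit step described above, in which fixed-value elements are merged with the author's input and the record is stored as XML, might be sketched as follows. This is an illustration only: the element names, the record layout, the function build_record, and both fixed values are assumptions, since the actual NIEHS-Dublin Core strings and XML structure are not given here.

```python
import xml.etree.ElementTree as ET

# Assumed fixed values; the real NIEHS-Dublin Core strings are not given in the text.
FIXED_VALUES = {
    "publisher": "National Institute of Environmental Health Sciences",
    "rights": "(hypothetical rights statement)",
}


def build_record(author_input):
    """Merge author-supplied metadata with the fixed values and return an XML string."""
    merged = dict(author_input)
    # Fixed elements are added at submit time and override any author-supplied value.
    merged.update(FIXED_VALUES)
    record = ET.Element("record")
    for element, value in merged.items():
        child = ET.SubElement(record, element)
        child.text = value
    return ET.tostring(record, encoding="unicode")


# Hypothetical author input for two of the elements.
xml_record = build_record({"title": "Example page", "language": "en"})
print(xml_record)
```

Keeping the fixed values out of the form and adding them at submission means authors cannot mistype them, which is one plausible reason for the design described above.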
The final step in this study was a content analysis conducted by two team members with professional cataloging experience in a joint session. Their goal was to determine the acceptability of the author-generated metadata. Metadata was acceptable if it was at the level that a professional cataloger would create or accept from another source, with or without modification, for inclusion in a resource header or database. The key factor was that acceptable metadata would support resource discovery.
Data analysis for this study focused on two components:
- metadata record content
- authors' perceptions about metadata generation.
The content analysis considered the participant profiles, the types of metadata produced, and the quality of the metadata.
Six participants, working in either NIEHS science or policy areas, with educational levels ranging from bachelor's to doctoral degrees, participated in the study. Four participants search the Web daily, one weekly, and one on a monthly basis. All but one participant had heard the word "metadata" in reference to the Web prior to the experiment, although it is not clear if this answer was influenced by the experiment's recruitment process. Half the participants had experience of HTML authoring.
The six participants produced a total of 11 metadata records during the experiment, averaging 1.8 metadata records per participant. Two participants produced one metadata record each, and one participant produced three metadata records. The mode was two metadata records per participant. A mean of roughly 15 minutes per metadata record can be approximated for this study based on the half-hour allotted for the metadata creation task. A correlation between Web skills and metadata production was not found.
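The summary figures above can be checked with a few lines of Python. The per-participant counts are an assumption inferred from the text: two participants with one record, one with three, and the remaining three taken to have produced two each, which matches the reported total of 11 and mode of two.

```python
from statistics import mean, mode

# Assumed distribution of records per participant, inferred from the reported totals.
records_per_participant = [1, 1, 2, 2, 2, 3]

total_records = sum(records_per_participant)
mean_records = mean(records_per_participant)
most_common = mode(records_per_participant)

# Each participant had 30 minutes for the metadata creation task.
minutes_per_record = 30 * len(records_per_participant) / total_records

print(total_records, round(mean_records, 1), most_common, round(minutes_per_record, 1))
# prints: 11 1.8 2 16.4
```

On these assumptions the time per record comes out at about 16 minutes, in line with the rough approximation given above.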
The form ensured that all of the records produced contained metadata for the 11 mandatory elements: These include publisher and rights metadata, which have fixed values, and are automatically generated via the NIEHS form, and title, audience, author/contributor, subject, date created, date modified, URL, language, and format, which require author input. The creation of metadata for the optional elements varied per metadata record.
[Table 1. Use of optional metadata elements; column heading: No. of records with this metadata]
Table 1 summarizes the use of optional metadata elements by participants. One participant created description metadata for only one of his two authored resources, although this metadata element could have been used by all of the participants for the total sample of eleven metadata records. Similarly, it is possible that alternative title, source, NIEHS number, and relation metadata could have been easily created (e.g. this metadata may have been part of the textual content or source code of the Web resources being described or known within NIEHS' organizational knowledge structure). The research reported in this paper emphasizes the quality of the metadata that was actually created; however, future studies will consider the metadata that could have been produced with a little extra effort on the part of the creator.
Two members of the research team, who have had extensive experience as professional catalogers, evaluated the metadata quality and determined the overall acceptability of the metadata records. Acceptance, to reiterate, meant that the author-generated metadata was equivalent to that a professional cataloger would create or accept from another source, with or without modification, for inclusion in a resource header or database; and the guiding principle was that the metadata would support resource discovery.
An online survey consisting of two parts guided this examination (Figure 2). Part one of the survey supported an element-based analysis. Binary measures of "accept" or "reject" were assigned to the data content values created for each metadata element for all 11 metadata records. These results are summarized in Table 2. Column two presents the total number of metadata records that included metadata for each element, and column three gives the total number of metadata records that had acceptable metadata for each element and the percentage based on the figure given in column two.
|Metadata element||No. of records with this metadata||No. of records with acceptable metadata (%)|
|Alternative title||3||2 (67%)|
|Date created||11||11 (100%)|
|Date modified||11||9 (82%)|
Table 2 shows that the participants produced acceptable metadata for the majority of the NIEHS-DC metadata elements. The metadata elements identifier (URL), author/contributor, date created, language, description, relation, and audience were 100% acceptable. The rest of the metadata elements (title, alternative title, subject, date modified, type, source, coverage, and format) had an acceptance rate ranging from 50% to 91%, with source and coverage at the 50% level. None of the 11 records produced contained other identifier metadata.
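The percentages in Table 2 are the number of records with acceptable metadata divided by the number of records that contained the element. A short sketch, using only the three rows quoted above (the helper name acceptance_rate is hypothetical):

```python
# (records with this metadata, records with acceptable metadata) for three Table 2 rows.
table2_rows = {
    "Alternative title": (3, 2),
    "Date created": (11, 11),
    "Date modified": (11, 9),
}


def acceptance_rate(with_metadata, accepted):
    """Return the acceptance percentage rounded to the nearest whole number."""
    return round(100 * accepted / with_metadata)


rates = {name: acceptance_rate(*counts) for name, counts in table2_rows.items()}
print(rates)
# prints: {'Alternative title': 67, 'Date created': 100, 'Date modified': 82}
```

Because the denominator is the number of records containing the element, a rarely used optional element can show a low rate on the basis of very few records.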
Part two of the evaluation survey included a series of questions for assessing the intelligibility and the general correctness of the author-generated metadata. The evaluators found that all of the metadata records produced were intelligible and that the authors placed data content values in the correct metadata field. This part of the evaluation also directed two questions to the subject metadata element because it is one of the key reasons that human processing is often preferred to automatic processing and because there are many questions about the author's ability to provide adequate subject access without being trained in the principles of subject analysis. Here, the metadata professionals evaluated both the specificity and exhaustivity of the keywords assigned by the authors. Specificity refers to the depth-level (e.g. granularity) of subject terms, and exhaustivity deals with breadth of topics represented by subject terms. Although 8 of 11 (73%) of the subject keywords assigned by authors were evaluated as acceptable, according to the criteria that they would support resource discovery, with or without modification, the results for this part of the data analysis differed slightly with 7 of 11 (64%) of the subject keywords being assigned at the appropriate descriptive level and sufficiently covering the resources' topics. Subject keywords in the other four records were found to be too general and did not sufficiently cover all of the topics represented in the resources being described. Only one participant produced unacceptable subject keywords for both of his authored metadata records. Participants creating unacceptable subject metadata in the other two records also demonstrated the ability to create acceptable subject keywords in their additional records, showing no evidence of a negative pattern.
A final evaluation question used a four-tier scale of "poor-reject", "fair-major revision", "good-minor revision" and "excellent-no revision" to measure the overall acceptability of the metadata records. All 11 metadata records were rated fair or better; none was rejected. In other words, the evaluators concluded that the author-generated metadata was acceptable for inclusion in a Web resource header or a database and that it would facilitate resource discovery. Of the 11 metadata records four (36%) were fair, requiring major revision of selected metadata elements; six (55%) were good, requiring only minor revision; and one (9%) was excellent and of professional quality, thus requiring no revision. No correlation was found between participants' Web skills and the quality of the metadata records produced.
Data gathered via the post-metadata generation questionnaire provided insight into authors' general perceptions about metadata and suggestions on how the NIEHS Web form might be improved to facilitate author-generated metadata.
A five-step semantic differential scale, on which 1 indicated "with difficulty" and 5 indicated "easily," gathered feedback on participants' perceptions about the task of metadata creation. All six participants indicated that creating metadata was fairly easy. This question resulted in a mean score of 4.7, with four participants scoring 5.0 and two participants scoring 4.0. A similar semantic differential scale was used to gather data on participants' thoughts about adding metadata to Web resources, where 1 indicated "never" and 5 indicated "always." Overall, participants were positive about the value of adding metadata, with a mean of 4.0.
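The mean ease-of-creation score follows directly from the individual responses reported above, as this brief check shows (the variable names are illustrative):

```python
from statistics import mean

# Ease-of-creation scores as reported: four participants chose 5, two chose 4.
ease_scores = [5, 5, 5, 5, 4, 4]

ease_mean = mean(ease_scores)
print(round(ease_mean, 1))
# prints: 4.7
```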
The participants were given a list of different categories of people and asked to specify who should create metadata. The options included authors, webmasters, departmental heads, secretaries/office assistants, librarians, and other, with room for suggestions. Author was most frequently selected, followed by webmasters, followed by librarians, and one person indicated that department heads should be involved in this task. Supporting the selection of author, one participant commented: "authors best know the [Web] page and target audience". Two participants indicated that both author and librarian were preferred for this task because they best know the "subject matter" and "[audience/user] searching patterns".
The post-metadata creation questionnaire included six questions to gather feedback on the usability of the NIEHS Web form. Five participants indicated that the Web form was easy to use, while one participant indicated that it was average. All participants agreed that using the Web form required minimal learning time. With the exception of one participant who indicated that the help text was average (neither useless nor overly helpful), all participants found the help text and terminology to be both helpful and understandable. Overall, these results show that the participants found the form usable. Several participants provided suggestions for improving the Web form. Among these recommendations were the need for better guidance about the level of detail for the subject metadata, additional examples for the selected metadata fields, and more categories for the type metadata, a point related more to the NIEHS-Dublin Core metadata schema. One participant summed up the form evaluation by commenting: "the page is self explanatory and intuitive: however, for actual use a 'philosophic' overview should be provided to major NIEHS users".
This research shows that the authors participating in the experiment were able to create acceptable metadata according to the NIEHS-Dublin Core metadata schema and that their metadata generally requires only minor, if any, revision. The quality of the metadata produced clearly suggests authors in an organizational setting can create good quality metadata and that they have the ability to create professional level metadata. This research also identifies Dublin Core metadata elements that authors might have difficulty understanding, and it provides evidence about Web form design that helps authors to generate acceptable metadata.
As part of the experiment, the NIEHS Web form required authors to provide metadata for the ten mandatory elements in order to produce (submit) their metadata records. As a result, mandatory and some optional metadata were collected for examination for 11 metadata records. This is a small sample, but certainly sufficient to evaluate the study's procedures and explore the potential of an author-generated metadata project at NIEHS. Furthermore, the small sample size is useful in setting expectations for the next study that aims to collect records from approximately 60 participants.
A secondary purpose of this study was to evaluate the basic usability of the Web form. For that purpose, the sample size seems adequate as suggested by Nielsen and Landauer's (1993) cost-benefit model, which recommends between three and five test users. The fact that all 11 metadata records were evaluated as acceptable, and that the majority of records analysed required only minor revisions suggests that authors are good candidates for generating metadata, and that they can indeed make a positive contribution to an organization's metadata project. Related to these results is the fact that a number of the participants conjectured that, as authors, they were good candidates for metadata creation. These participants expressed that they are obviously knowledgeable about their work, that they know their immediate and often potential audiences, and that they are aware of the way in which interested people will search for their work. In other words, these participants saw their authorship role, together with their command of a discipline's language, as important metadata production factors.
Participant feedback as expressed here, combined with the actual results of the evaluation, suggests that authors may even be able to produce better quality metadata than professionals for selected elements. The evaluators verified the accuracy of the participants' metadata by examining the Web page content and source code. In one case, date created metadata could not be confirmed and the evaluators asked the participant to clarify the origin of this information. The participant indicated that the date of creation was correct and based on his personal knowledge as the resource author. Date created for the intellectual content is frequently absent from Web resources, and although authors may not recall the specific date a resource was created, they are likely to have a better idea of the month and year of their intellectual activity compared to a cataloger, who did not produce the resource content. This example illustrates that author knowledge is extremely valuable for Web resource metadata creation. Here it should be emphasized that the date on which a resource's intellectual content is created can, and often does, differ from the date it was made Web accessible or revised on the Web; this second interpretation of date metadata can be automatically generated via editing software.
Over half of the metadata records created had metadata assigned for the relation element. Like date created, personal author knowledge appears to be a contributing factor for this element. In two records that had assigned URLs for relation, the Web page and source code did not provide any indication of the related resource(s). A closer examination showed that, in these two cases, the authors' interpretation emphasized "subject" relation, which is not supported by the Dublin Core's official list of qualifiers for this element (Dublin Core Qualifiers 2000). Even so, the evaluators considered the use of relation for these and other records acceptable because Dublin Core's definition for relation is quite general and does not exclude subject, and because the authors were not required to work with the Dublin Core Qualifiers in this experiment. The interpretation of the Dublin Core relation element and the value of qualifiers will be examined in future studies.
A comparison between authors' and metadata professionals' abilities to create metadata requires a larger sample and a more extensive analysis than this preliminary study. Even so, the results of this study, combined with participant feedback, suggest there are cases where authors may be able to provide equal if not better quality subject metadata compared with a metadata professional. This conclusion is based on the fact that almost three-quarters (8 of 11, 73%) of the metadata records had acceptable subject keywords according to the study's criteria of supporting resource discovery, with slightly less (7 of 11, 64%) of the records displaying the appropriate level of specificity and exhaustivity. While further research is required for subject metadata, particularly given the high use of this metadata element for resource discovery (subject searching is among the most popular means of searching on the Internet), it is likely that more guidance for authors could result in an even higher acceptance rate for this element.
While the results showed that authors are generally able to understand the Dublin Core, the low use and poor results found with all the optional metadata elements, excluding relation, suggest there may be interpretation difficulties here. This conclusion is loosely based on the limited use of these optional elements and the fact that one participant created unacceptable metadata for both source and coverage, although an acceptable metadata record was created.
A final area of comment is the design of the NIEHS Web form. The form includes selective use of pop-up windows, drop-down menus and scrolling lists, each containing essential but limited data, to assist authors in generating metadata. A key objective in designing the NIEHS form was to keep it simple, exemplifying the spirit of the Dublin Core. Participant feedback from this study indicates that the NIEHS form is simple, intelligible, and overall a good product. While this inquiry was limited to a few post-metadata creation questions, the positive feedback gathered, together with the acceptability rate for the author-generated metadata, indicates that intelligible textual guidance, selective use of features, keeping the form to one page (computer screen), and the use of a simple schema are important considerations for author-generated metadata. Given these results, it is not unreasonable to suggest that the NIEHS form may actually serve as a model for facilitating author-generated metadata in an organizational setting or even in other environments. A stronger argument may be presented here if the deficiencies noted, such as the need for more examples for selected elements and additional guidance on the level of subject detail required, were addressed in the form in an unobtrusive way. These suggestions will be incorporated into the next release of the NIEHS form.
In completing this discussion, it should be noted that during the data collection, logs recorded participants' navigation of the NIEHS form. This data is still being analyzed, and will provide more insight into the use of Web form features as well as other aspects of author-generated metadata.
This study investigated the ability of authors to create acceptable metadata in an organization, following the Dublin Core. In examining this, data were also gathered about authors' perceptions on metadata and Web form features that may facilitate author-generated metadata.
The results show that authors can create acceptable metadata according to the Dublin Core, specifically the NIEHS-Dublin Core schema, and they can produce metadata equivalent to that of a metadata professional. These results prove that authors are indeed good candidates for metadata creation and that the Dublin Core is successful in supporting author-generated metadata. The results of this study established that authors think metadata is valuable for resource discovery, that they think it should be created for Web resources, and that almost unanimously they think they should be involved in the production of metadata for their works. Finally, the study shows that the design of a simple form, with selective use of features, may be the best means for author-generated metadata.
As with any baseline or exploratory study, conclusions drawn are limited by sample size and test conditions. The researchers note the limitations posed by the small sample size. However, the context of this examination needs to be considered, in that it was primarily conducted to gather baseline data about the feasibility of implementing an author-generated metadata project at NIEHS. Given this is a baseline study, generalizability is limited. Furthermore, conclusions drawn about metadata quality, while based on professional analysis, cannot be confirmed without testing the actual value of this metadata in a resource discovery experiment that measures user satisfaction.
At the time of writing, the NIEHS metadata team is designing a follow-up study (with a much larger participant pool, aiming for a sample size of at least 60 records) in this area, incorporating results and refined procedures from this preliminary study. The next phase of this research will focus more on the subject metadata element and examine authors' perceptions about participating in an organizational metadata project as well as when might be the ideal time for authors to generate metadata. A long-term evaluation is also planned to measure retrieval effectiveness and user satisfaction in relation to author-generated metadata.
We would like to acknowledge Ellen M. Leadem, NIEHS Library; Jed Dube, NIEHS/OAO, Corp.; the NIEHS Computer Training Laboratory staff for their assistance in implementing this study; and Microsoft Inc. for research funding. We would also like to thank NIEHS employees for participating in this study.
Chan, L. M. (1989) "Inter-indexer consistency in subject cataloging". Information Technology & Libraries, 8(4), 349-358
Milstead, J. and Feldman, S. (1999) "Metadata: Cataloging by any other name". Online, January/February, 25-31
Moen, W. E., Stewart, E. L. and McClure, C. R. (1997) The Role of Content Analysis in Evaluating Metadata for the US Government Information Locator Service (GILS): results from an exploratory study http://www.unt.edu/wmoen/publications/GILSMDContentAnalysis.htm
Robertson, D., Leadem, E., Dube, J. and Greenberg, J. (2001) "Design and Implementation of the National Institute of Environmental Health Sciences Dublin Core Metadata Schema". Proceedings of the International Conference on Dublin Core and Metadata Applications 2001, Tokyo, Japan
Rothenberg, J. (1996) "Metadata to Support Data Quality and Longevity". 1st IEEE Metadata Conference, Silver Spring,
Thomas, C. and Griffin, L. (1999) "Who will create the Metadata for the Internet?" First Monday, Vol. 3, No. 12, December
Zeng, L. (1993) "A study of a rule-based data validation system for online Chinese cataloging". Proceedings of the 14th National Online Meeting, edited by Martha E. Williams (Medford, NJ: Learned Information, Inc.), pp. 439-42
The following changes were made at the authors' request on 12th March 2002, subsequent to original publication. The main text above is correct following the changes. The changes made are detailed below:
- 2nd paragraph under Table 2, which begins with the text "Part two of the evaluation survey...": the 6th sentence was changed
FROM "Subject keywords for eight (73%) of the 11 records were assigned at the appropriate descriptive level and sufficiently covered the topics of these resources."
TO "Although 8 of 11 (73%) of the subject keywords assigned by authors were evaluated as acceptable, according to the criteria that they would support resource discovery, with or without modification, the results for this part of the data analysis differed slightly with 7 of 11 (64%) of the subject keywords being assigned at the appropriate descriptive level and sufficiently covering the resources' topics."
- Section 7, Discussion of Results, 2nd paragraph, 1st sentence was changed
FROM "11 mandatory elements"
TO "10 of the 12 mandatory elements"
- Section 7, Discussion of Results, 5th paragraph. The first sentence reading "Although optional, the type element was entered for all records created." was DELETED.
- Section 7, Discussion of Results, 6th paragraph, 3rd sentence was changed
FROM "This conclusion is based on the fact that almost three-quarters (7 of 11, 73%) of the metadata records had acceptable subject keywords and displayed an appropriate level of subject specificity and exhaustivity."
TO "This conclusion is based on the fact that almost three-quarters (8 of 11, 73%) of the metadata records had acceptable subject keywords according to the study's criteria of supporting resource discovery, with slightly less (7 of 11, 64%) of the records displaying the appropriate level of specificity and exhaustivity."
- Section 7, Discussion of Results, 7th paragraph, 1st sentence, the words "type and" were DELETED from: "...the low use and poor results found with all the optional metadata elements, excluding type and relation, suggests there may..."