Scholarly communication and the digital library: Harter: JoDI

Scholarly Communication and the Digital Library: Problems and Issues (1)

Stephen P. Harter
School of Library and Information Science, Indiana University
Bloomington, Indiana 47405, USA
Email: harter@indiana.edu

Abstract

This paper considers a range of definitions for a digital library from the perspective of scholarly communication and the properties of a traditional research library. It then explores some of the problems and issues involved in creating and maintaining a digital library, depending on the characteristics one wants it to have. The paper stresses the need to consider the requirements of scholarship and research as we build the digital libraries of the future.

1 Introduction

As evidenced by the creation of this journal, there is much interest today in digital libraries. We see many research and development projects, a plethora of international conferences, high activity in the computer science, human/computer interaction, library and information science and other research and development communities, and a great deal of development activity on the Internet. An advanced Alta Vista search conducted in early July, 1996 on "digital library" OR "digital libraries" retrieved about 20,000 entries. Six months later, the same search retrieved 30,000 hits, a significant proportion of which were relevant to the subject.

In spite of all this activity, it is not at all clear what one means by the term "digital library." The term is rarely defined, or even characterized. It has been applied to an extraordinary range of applications -- from digital collaboratories to collections of electronic journals, software agents that support inquiry-based education (3), collections of email and similar objects (4), electronic versions of a public library (5), personal information collections (6), and the entire Internet (7), among others. It is not easy to see what these have in common except for their digitization. This property (which Daniel Atkins calls digital coherence) allows all the objects in a digital library -- sounds, images, texts, and everything else -- to be treated in essentially the same way, for the first time in the history of libraries.

If we know what "digital" means, what is meant by the library half of the term? In what ways are digital libraries indeed libraries in some meaningful sense? How are they not? More to the point -- what values, properties, and characteristics of the traditional library do we want to retain as we build the digital libraries of the future? Our new digital libraries will clearly have much added functionality -- capabilities that have never been present in traditional libraries. At the same time, however, we are in danger of losing important properties of the traditional library. In our efforts to build new systems we rarely ask what aspects of traditional libraries are important to retain.

This paper will address these issues in the context of scholarship and research. It will consider the nature of a digital library in terms of a range of definitions that are based on properties of a traditional research library. It will also explore some of the problems and issues involved in creating and maintaining a digital library, depending on the properties one wants it to have. There is insufficient space here to consider these issues in detail. Here I will provide only an overview that ignores questions of cost, implementation, and technical detail.

Let me emphasize that I am not claiming that the traditional library is a panacea; this is not a Luddite cry to take sledgehammers to the machines or to construct digital libraries in the images of our venerable local institutions. But what I do assert is that the traditional library encompasses many values and properties that are considered important for scholarly communication. I am calling for a clear understanding and consideration of these characteristics as we design and develop the digital libraries of the future.

2 Origins

The term "digital library" is simply the most recent in a long series of names for a concept that was written about long before the development of the first computer. The idea of a "computerized library" that would supplement, add functionality to, and even replace traditional libraries was invented first by H.G. Wells and other authors, who caught the imagination of millions with speculative writings about "world brains" and similar fanciful devices.

There is general agreement that much of the early actual application of computers to information retrieval was stimulated by the prominent scientist Vannevar Bush, who wrote about the "memex," a mechanical device based on microfilm technology that anticipated the ideas of both hypertext and personal information retrieval systems (8). The first real-world applications of computers to libraries began in the early 1950s with IBM and punched card applications to library technical services operations, and continued in the 1960s with the development of the MARC (machine-readable cataloging) standard for digitizing and communicating library catalog information. In 1965, J. C. R. Licklider coined the phrase "library of the future" to refer to his vision of a fully computer-based library (9), and about a decade later, F.W. Lancaster (10) wrote of the soon-to-come "paperless library." About the same time Ted Nelson (11) invented and named hypertext and hyperspace. He also analyzed in some detail several of the problems identified later in this paper, but never built an operational system. Many other terms have been coined to refer to the concept of a digitized library, including "electronic library," "virtual library," "library without walls," "bionic library," and others (12).

The relatively recent use of the term "digital library" can be traced to the Digital Libraries Initiative funded by the National Science Foundation, the Advanced Research Projects Agency, and the National Aeronautics and Space Administration in the United States. In 1994 these agencies granted 24.4 million dollars to six U.S. universities for digital library research, impelled by the sudden explosive growth of the Internet and the development of graphical Web browsers (13). The term was quickly adopted by computer scientists, librarians, and others. Thus, while the term "digital library" is relatively new, work in bringing digitized information resources to libraries (or thinking of digitized information resources as libraries) has a history spanning several decades.

There is little discussion and less agreement in the literature about what constitutes a digital library. One may insist on a relatively narrow definition -- based explicitly on the properties of the traditional print library -- or consider a much broader continuum of possibilities. The most inclusive view takes a digital library to be, as its starting point, essentially what the Internet is today. But from this extreme perspective it can be seen that the metaphor of the traditional library fails in several respects.

3 Properties of a digital library

Table 1 describes essential properties of a digital library ranging from quite traditional to extremely broad views. A digital library contains digital representations of the objects found in it. Most understandings of "digital library" probably also assume that it will be accessible via the Internet, though not necessarily to everyone. But the idea of digitization is perhaps the only characteristic of a digital library on which there is universal agreement.

Table 1. Potential Properties of a Digital Library
NARROW VIEW (based on traditional library) | BROADER VIEW (a middle position between the extremes) | BROADEST VIEW (loosely based on current Internet)
objects are located in a physical place | objects are located in a logical place (may be distributed) | objects are not located in a physical or logical place
objects are information resources | most of the objects are information resources | objects can be anything at all
objects are selected on the basis of quality | some of the objects are selected on the basis of quality | no quality control; no entry barriers
objects are organized | | no organization
objects are subjected to authority control | some aspects of authority control are present | no authority control
surrogates of objects are created | surrogates are created for some objects | no surrogates of objects are created
surrogates are "finely searchable" | surrogates and objects are finely searchable | only objects are searchable
authorship is an important concept | concept of author is weakened | no concept of author
objects are fixed (do not change) | objects change in a standardized way | objects are fluid (can change and mutate at any time)
objects are permanent (do not disappear) | disappearance of objects is controlled | objects are transient (can disappear at any time)
access to objects is limited to specific classes of users | access to some objects is limited to specific classes of users | access to everything by everyone
services such as reference assistance are offered | | the only services are those performed by computer software (AI)
human specialists (called librarians, etc.) can be found | | there are no librarians
there exist well-defined user groups | some classes of objects have associated user groups | there are no defined user groups (or, alternatively, infinitely many of them)
use of library is free for specified user groups | use of library requires payment for some services and/or user groups | use of library requires payment

Beyond the idea of digitization, a digital library is a library. Or is it? What makes a library a library? In what senses do we really want the digital libraries we are building to be libraries? What are the essential features of a "library"?

The first column of Table 1 summarizes essential characteristics of a traditional research library. The second and third columns consider successively broader views of these properties from the point of view of what constitutes (or should constitute) a digital library. For example, a digital library may be organized and represented in the form of object surrogates created by human specialists (indexed, classified, cataloged) or it may be entirely unorganized, with no "added value" whatever, using free text searching of the objects themselves -- rather than object surrogates -- to gain access to the objects in the library.

Of course, the digital libraries we are building will have properties not present in the traditional research library, with many of these innovations yet to be invented. The digital coherence of the objects, the near elimination of distance or physical location as an important consideration, and the existing computer and communications infrastructure (and that yet to be built) will give rise to a myriad new possibilities for enriching and redefining what we think of as a library. But there may be a tradeoff. We may be asked to give up some important properties to gain new ones. Table 1 summarizes what I consider to be the essential features of a digital library, viewed from the perspective of scholarship and research.

The traditional research library has a physical location, embodied in its physical building. Most of the objects in the library are information resources of some kind. The works are also selected. Criteria for the selection process are defined, and these criteria typically include measures of quality. The objects (information sources) in it are organized -- classified, cataloged, and indexed by human beings, in what are called value-added processes (14). Authority control is a key feature, in which names of authors, variants of works (editions), and subject headings or descriptors are all controlled. The concepts of authorship and ownership are extremely important in a traditional research library, in which various forms of an author's name are brought together in a name authority file. Surrogates of the objects in the library -- called index records, or in digital library terminology, metadata -- are created for purposes of representing the value added by catalogers and indexers. Data are recorded in dozens of specific fields and subfields of these records, and are "finely searchable." That is, highly specific searches can be conducted on particular combinations of fields or subfields of the index records. Retrieved records are linked to the objects themselves, which can then be obtained and used (15).
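The relationship among surrogates, authority control, and "fine" searching can be made concrete with a small sketch. What follows is only an illustration under stated assumptions, not a description of any real cataloging system: the record fields, the function names, the authority file, and the sample data are all invented here, and an actual index record has far more fields and subfields than these five.

```python
from dataclasses import dataclass

# Hypothetical name authority file: variant forms of a name map to one
# authorized heading. All names here are invented sample data.
NAME_AUTHORITY = {
    "doe, j. q.": "Doe, Jane Q.",
    "doe, jane": "Doe, Jane Q.",
    "doe, jane q.": "Doe, Jane Q.",
}


def authorized_name(name: str) -> str:
    """Return the authorized heading for a name, or the name itself if uncontrolled."""
    cleaned = name.strip()
    return NAME_AUTHORITY.get(cleaned.lower(), cleaned)


@dataclass(frozen=True)
class Surrogate:
    """A much simplified index record (metadata) standing in for a library object."""
    author: str
    title: str
    year: int
    language: str
    location: str  # link from the surrogate to the object itself


def make_surrogate(author, title, year, language, location):
    """Create a surrogate, applying authority control to the author's name."""
    return Surrogate(authorized_name(author), title, year, language, location)


def field_search(records, **criteria):
    """Return the records whose named fields all equal the requested values."""
    if "author" in criteria:
        criteria["author"] = authorized_name(criteria["author"])
    return [
        record
        for record in records
        if all(getattr(record, field) == value for field, value in criteria.items())
    ]


# Invented sample catalog; the same author appears under two variant name forms.
CATALOG = [
    make_surrogate("Doe, J. Q.", "Sample title A", 1996, "eng", "shelf-1"),
    make_surrogate("Doe, Jane", "Sample title B", 1997, "eng", "shelf-2"),
    make_surrogate("Roe, Richard", "Sample title C", 1978, "eng", "shelf-3"),
]

# A "fine" search on a combination of fields: author AND year.
hits = field_search(CATALOG, author="Doe, Jane Q.", year=1996)
```

In this sketch a search on any variant of the author's name retrieves both of her works, because the variants were brought together when the surrogates were created, and each retrieved record carries a link (here the invented "location" field) back to the object it represents.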

The treatment of authorship and ownership in the traditional research library reflects the importance of these ideas in traditional scholarly communication, in which scholars and scientists cite in reference lists the authors and works from whom they have borrowed ideas, words, or facts, thus paying intellectual debts and acknowledging original authorship. Ownership of intellectual property is central to publishing and scholarship. Plagiarism -- stealing the words of others without attribution -- is considered unethical in scholarly writing and science. Formal legal rights of ownership are also defined by national and international copyright law.

The objects in a traditional research library have certain properties as well. First, they are fixed -- they do not normally change, or if they do, various editions are identified and considered to be different from one another. Objects are also permanent -- they do not normally disappear from a collection. In addition, a variety of services to users are offered by librarians who work in the traditional library. These include assistance with searching for information resources, reference and research services, readers' advisory services, and others. A traditional research library typically offers only limited access to materials and services; access to certain services may be restricted to certain classes of potential users.

Finally, use of basic services in many traditional research libraries is free for defined user populations. Some of these libraries are large, tax-supported research institutions. My own university library, for example, offers free access to basic services to all the citizens of the state of Indiana.

4 Problems and issues

One can take a narrow or a broad view of digital libraries according to these properties. It seems clear that among all of the properties listed, physical location is the least likely to survive in a digital library. Resources in future digital libraries will be more likely to be distributed than not. But all of the other properties listed in Table 1 are also in jeopardy in at least some of the digital libraries being built or conceptualized. Writers have taken a variety of positions as they contemplate what a digital library should be.

Miksa and Doty (16) take a traditional perspective, defining a digital library as a collection of information sources in a place (if not a physical place, then at least a logical one). They argue that a broader definition would lead to something different from what is normally understood to be a library. Graham (17) stresses the support of research as he describes the "digital research library," which looks much like the research library of today in many of its essential features (see Table 1). Atkinson (18) calls for a "control zone" in which the traditional research library can continue to function in a digital environment.

Further along the continuum, Wellman et al. see a digital library of the future in which software agents use principles of artificial intelligence (AI) to perform "monitoring, management, and allocation of services and resources" (19). Indeed, they define a digital library as a "community of information agents" that would retain most of the properties of the traditional library listed in Table 1, but would perform them using intelligent software rather than human beings. However, the extent to which techniques of AI can actually perform the functions envisioned by Wellman et al. is not at all clear. Most of what the authors describe is presently no more than speculation.

Having evolved his position significantly in two years, Miksa views the traditional library as evolving into a "personal space library" that excludes many of the characteristics and values of the traditional library and which is configured for a single individual or small group (20).

At the far extreme is the Internet itself as it exists today, which has essentially none of the properties of the traditional library listed in Table 1. (See, for example, Wallace (21).) The Internet is anarchic and individualistic. It is not a collection of information resources selected on the basis of their quality, organized by subject, etc. The vast majority of objects on the Internet have no surrogates -- or metadata -- associated with them. Fine-grained searching -- searching limited to specific fields such as subject, editor, year of publication, version number, language, author, etc. -- is not possible. In general only the objects themselves are searchable, in a full-text, free-text mode that is presently extremely crude and inexact. However, some believe that the near future holds highly significant improvements in searching, through concept searching and vocabulary switching (22). If this prediction is accurate, perhaps many kinds of metadata -- but not all -- can be eliminated without great loss in future digital libraries.
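The crudeness of searching the objects themselves, with no surrogates to consult, can be illustrated with a toy example. This is a minimal sketch under invented assumptions: the three "pages," their identifiers, and the function names are made up for illustration, and the matching rule (every query word must occur somewhere in the text) is a deliberately simple stand-in rather than the algorithm of any actual search engine.

```python
import re

# Invented sample "objects": only their full text is available, with no fields.
PAGES = {
    "page-1": "Notes on cataloging by Jane Doe, 1996.",
    "page-2": "A review that criticizes Jane Doe and was written in 1997.",
    "page-3": "Doe populations in Indiana forests, 1996 survey.",
}


def tokens(text):
    """Lowercase a text and split it into a set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def free_text_search(pages, query):
    """Return the ids of pages whose text contains every word of the query."""
    wanted = tokens(query)
    return sorted(page_id for page_id, text in pages.items() if wanted <= tokens(text))


# The searcher wants works written by Doe in 1996, but can only supply words.
hits = free_text_search(PAGES, "doe 1996")
```

Here the query retrieves page-1, which is wanted, but also page-3, which merely contains the same character strings; and a search on "jane doe" cannot distinguish a page written by her from one written about her. Nothing in the free text says which words name an author, a subject, or a date.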

There are real problems with the concept of "author" on the Internet. The concept of "control" is almost entirely absent. Many of the objects on the Internet will one day vanish without a trace. Those that remain are in a constant state of change. There are very few services and few of these are offered by human beings, as opposed to computer software (the Internet Public Library is a welcome exception). The metaphor of the traditional library simply does not apply to the Internet; most of the values and properties of the traditional research library are absent. Of course there are certain spots on the Internet that do have some of these properties. These are much more like traditional libraries, if one can manage to find them and enjoys access privileges to them.

The metaphor of the traditional research library is powerful, useful, and compelling. Further, there are good reasons for the properties enumerated in Table 1. There is insufficient space here to go into these reasons in detail. However, it seems clear to me that science, scholarship, learning and teaching could not have evolved as we know them without the existence of the great and small "traditional" libraries of the world. Scholarship and learning imply the need to check and evaluate sources, to conduct careful, fine-grained comprehensive searches, to select, to be able to think about evidence critically, to more or less freely examine resources, to consider provenance. How well will the reader in future digital libraries be able to carry out these functions? Interestingly, as measured by references in published papers, electronic publications of all kinds have thus far made very little impact on scholarship and science (23), including electronic journals (24). This may be due in part to the difficulty of conducting scholarly work on today's Internet; for example, the problem of access to electronic journals is not trivial (25).

Readers in traditional research libraries are also able to consult with librarians as they attempt to accomplish their work. Who will scholars and researchers consult in future digital libraries? An extremely strong case can be made for including librarians in the digital library (26).

Finally, many public, tax-supported research libraries are open to the public and are free for basic services. Use is not limited to those wealthy enough to afford the equipment, telecommunications charges, and fees for services such as access to the collection and permission to use materials. Access to the objects in the traditional library is recognized as a public good and is supported with public tax monies. What kinds of access will the digital libraries of the future provide? What classes of users will be permitted free access to objects and services?

Table 2. Questions and Issues Related to Information Resources (IRs) in the Digital Library

  • How can we establish and control the currency, accuracy, and integrity of information sources? (quality problem)
  • What can be done to provide intellectual access to IRs? (organizational problem)
  • How can we maintain the data and intellectual integrity of IRs? (authority control problem)
  • How can we recognize different versions of the same IR? (fluidity problem)
  • How can we establish object surrogates, metadata, and corresponding fine-grained search tools so that we can find those objects that we are seeking?
  • How can we address the issue of transient IRs? (preservation problem)
  • How can we preserve the concept of authorship?
  • How can copyright laws for IRs be observed? (legal problem)
  • Will access to some IRs be limited to some classes of users? (political problem)
  • What services, if any, should be offered by the digital library?
  • Should digital libraries be integrated into traditional libraries? If so, how can this be accomplished?
  • Does a digital library have librarians? If so, what do they do?
  • Does a digital library have well-defined classes of users?
  • Who will have access to which services, and at what price? Will our digital libraries of the future only be for the use of the "haves"?

Table 2 summarizes the problems and issues that I have identified. Ignored are the many managerial questions that might be raised, as well as how solutions can be paid for. The traditional library attempts to deal with these problems and issues in a number of ways. Those who are building digital libraries must ask themselves whether these issues should be considered. Perhaps the most thorough study of these questions has been conducted by Ross Atkinson, who calls for librarians to lay claim to the "control zone" -- demarcating a single, distributed digital library created by the academic library community and based on principles of the traditional research library (27).

An alternative to establishing a control zone is to take a broad view, and build digital libraries in which some or all of the properties of the traditional library have largely disappeared. Questions then immediately arise concerning science, scholarship, teaching, and learning. Will students take what they find on the Internet as "truth"? This is already happening today. What kind of scholars and researchers will such students become? How can they (or anyone, for that matter) evaluate what they find on the Internet? The problems of quality, integrity and authorship are legion. What is the source of the information that one finds? Who actually wrote it? How old is it? How accurate? Is it really what it claims to be? What "edition" is it? What is its authority? Its provenance? What will happen to the concept of authorship and the notion of fixed, permanent documents? To the concept of evaluation of sources? How will these changes affect scholarship and research? These are crucial social questions that are extremely important to contemplate.

Consider a personal example. I recently conducted an Internet search for information on South Korea, using Alta Vista's advanced search mode. Among the materials retrieved were an entry in the CIA Factbook (published by the Central Intelligence Agency, an agency of the U.S. government), the home pages of private individuals, pages that had no clear source, pages of commercial firms, digitized newspaper articles, and several links whose referents had already disappeared. When I conducted this search, could it be said that I was searching the contents of a library? To what extent could I trust the accuracy of what I read? Were the documents purporting to be from the CIA Factbook or published by the Associated Press actually from these sources? If so, how current was the information in them? Or were they forgeries, or originals modified with small, subtle but significant changes? There is simply no easy way to tell. To what extent can information from private individuals be considered "factual"? What are the highest quality (most accurate, complete, error-free, current, etc.) sources of information on the Internet about South Korea's history and culture? Of course, these same questions can be asked of print materials in traditional libraries, but the problems are greatly exacerbated on the Internet. Only by stretching the metaphor of the library far beyond its traditional sense can the Internet be construed as a library.

Nearly two decades ago former U.S. Librarian of Congress Daniel Boorstin observed that Gresham's Law was at work in the information field; that information was driving knowledge out of circulation (28). In a recent study published by Reuters Business Information, empirical evidence was found to support this thesis (29). One in four managers in the UK, US, Australia, Hong Kong, and Singapore admitted to suffering ill effects -- including tension, stress, illness, and the breakdown of personal relationships, among others -- as a result of trying to deal with the amount of information they now handle, and fully half expect the problem to get worse with the continued growth and development of the Internet.

In a recent piece that is reminiscent of some of Daniel Boorstin's ideas, Mary Biggs lamented the disappearance of books and serious reading from our discussions of the virtual library (30). Why are we building digital libraries, anyway? What is our broad social purpose? What properties of our digital libraries are implied by these purposes? Will our digital libraries be part of the problem or part of the solution?

5 Conclusion

Perhaps the best of all possible worlds would be a broad, inclusive digital "library" filled with a multitude of interesting and informative objects and software agents of all kinds -- as well as a large amount of material that is worthless to almost everyone. Such a place would be built from the bottom up, and would consist of whatever materials and objects and libraries anyone wanted to build (and could afford to maintain). It would be an evolved version of what the Internet is like today.

But I would argue that one important aspect of such a place must be special spaces, digital libraries that have the properties of a traditional research library, a control zone, or perhaps more realistically, a collection of control zones. Here would be found high quality material, selected by specialists. True intellectual access would be provided in the form of fine-grained search tools and object surrogates constructed using the value-added processes of indexing, cataloging and classification. Such digital libraries would concern themselves with the currency, accuracy, and integrity of the information sources found within them, and would address the other concerns identified here as well. They would offer actual services to their user populations. Where these cannot be accomplished by computer software, they would be performed by human beings -- the librarians of the digital library. Finally, I hope that we will have digital libraries that are supported by tax monies and that will offer free basic services to defined constituent groups, not just to those who can afford to pay for them.

Notes and References

1. An earlier version of this paper was delivered at KOLISS DL '96: International Conference on Digital Libraries and Information Services for the 21st Century, September 10-13, 1996, Seoul, Korea.

2. Email address: harter@indiana.edu

3. Atkins, Daniel E., William P. Birmingham, Edmund H. Durfee, Eric J. Glover, Tracy Mullen, Elke A. Rundensteiner, Elliot Soloway, José M. Vidal, Raven Wallace, and Michael P. Wellman. 1996. Toward inquiry-based education through interacting software agents.

4. Winograd, Terry. 1995. Digital vs. libraries: Bridging the two cultures. SIGIR '95: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 18:2.

5. 1997. Internet public library: Same metaphors, new service. American Libraries: 56-59.

6. Miksa, Francis. 1996. The Cultural Legacy of the "Modern Library" for the Future. Journal of Education for Library and Information Science 37:100-119.

7. Wallace, Jonathan. "The Internet is a library." Sex, Laws, and Cyberspace Bulletin 1.

8. Bush, Vannevar. As we may think. Atlantic Monthly 176 (1945): 101-108.

9. Licklider, J. C. R. Libraries of the Future. Cambridge, Mass.: M.I.T. Press, 1965.

10. Lancaster, F. Wilfrid. Toward paperless information systems. New York: Academic Press, 1978.

11. Nelson, Theodor H. Computer Lib. Chicago: Nelson, 1974.

12. Drabenstott, Karen. Analytical Review of the Library of the Future. Council on Library Resources; Washington, D.C.

13. Pool, Robert. "Turning an info-glut into a library." Science 266 (1994): 20-22.

14. Taylor, Robert S. Value-added processes in information systems. Norwood, NJ: Ablex, 1986.

15. Although Alta Vista and a few other search engines permit field searching, most objects on the Internet have only a few identifiable fields. There is nothing remotely approaching the MARC communications format in common use.

16. Miksa, Francis L. and Philip Doty. 1994. Intellectual Realities and the Digital Library. Proceedings of the First Annual Conference on the Theory and Practice of Digital Libraries. June 19-21, 1994, College Station, Texas.

17. Graham, Peter S. 1995. The digital research library: Tasks and Commitments. Digital Libraries '95: The Second Annual Conference on the Theory and Practice of Digital Libraries, June 11-13, 1995, Austin, Texas, USA.

18. Atkinson, Ross. 1996. Library functions, scholarly communication, and the foundation of the digital library: Laying claim to the control zone. Library Quarterly 66:239-65.

19. Wellman, Michael P., Edmund H. Durfee and William P. Birmingham. The digital library as community of information agents. A position statement, to appear in IEEE Expert, June, 1996.

20. Miksa, 1996. "The cultural legacy of the 'modern library' for the future."

21. Wallace, 1996. "The Internet is a library."

22. Schatz, Bruce R. 1997. Information retrieval in digital libraries: Bringing search to the net. Science 275:327-33.

23. Harter, Stephen P. and Hak Joon Kim. 1996. Electronic journals and scholarly communication: A citation and reference study. Proceedings of the ASIS Midyear Meeting (San Diego, CA: May, 1996). pp. 299-315.

24. Harter, Stephen P. 1996. The Impact of Electronic Journals on Scholarly Communication: A Citation Analysis. Public-Access Computer Systems Review 7(5).

25. Harter, Stephen P. and Hak Joon Kim. 1996. Accessing electronic journals and other e-publications: An empirical study. College & Research Libraries 57:440-56.

26. Arnold, Kenneth. 1995. The electronic librarian is a verb/ The electronic library is not a sentence. Miksa, 1996. "The cultural legacy of the 'modern library' for the future."

27. Atkinson, Ross. 1996. "Library functions, scholarly communication, and the foundation of the digital library: Laying claim to the control zone."

28. Boorstin, Daniel. Gresham's Law: Knowledge or Information. Remarks at the White House Conference on Library and Information Services. Washington, D.C., November 19, 1979.

29. Reuters Business Information. 1996. New independent research reveals cost of the information revolution.

30. Biggs, Mary. 1995. "Virtual libraries & actual readers." The Seventh Nasser Sharify Lecture (Sunday, May 14, 1995, Pratt Manhattan Center). Pratt School of Information and Library Science.