Institutional Repositories, Long Term Preservation and the changing nature of Scholarly Publications

Paul Doorenbosch
Koninklijke Bibliotheek, national library of the Netherlands
paul.doorenbosch@kb.nl

Barbara Sierman
Koninklijke Bibliotheek, national library of the Netherlands
barbara.sierman@kb.nl

Abstract

The web offers new opportunities for scholars to publish the outcome of their research. One of these new forms is called Enhanced Publications. In an Enhanced Publication different objects and files that have a meaningful and close relation to each other are aggregated on the level of a resource map, in which not only the separate files are described but also the relations between those files. An example of an Enhanced Publication is a digital text publication together with a dataset on which the publication is based. Preserving these compound entities in the existing infrastructures raises new issues. This article discusses these issues against the background of the Dutch long term preservation infrastructure and organisation.

1. Introduction

The nature of publications in scholarly communication is changing. Enhanced Publications and Collaborative Research Environments are new phenomena in scholarly communication that use the wide range of possibilities of the digital environment in which researchers and their audience act. This rapidly changing digital environment also affects long term preservation archives. Raising awareness of long term preservation in the research community is important because researchers are responsible for public dissemination of their research output and need to understand their role in the life cycle of the digital object. At the moment of the creation of the digital object, choices are made that will influence the long term preservation chances of the object. Researchers should be aware that constant curation and preservation actions must be undertaken to keep the research results fit for verification, reuse, learning and history over time. This awareness raising is a topic in several European projects, like ParseInsight and APARSEN.

Over the last few years an infrastructure has been created that is able to deal with the long term preservation issues of the more traditional ways of publishing in digital form. In this paper we present the findings of a research project [1] that aimed to find out how far the existing infrastructure is fit for ingesting Enhanced Publications, or whether the infrastructure should be adjusted to accommodate them. We explain how the Dutch infrastructure is structured and discuss some specific preservation issues related to Enhanced Publications.

1.1 DARE, NARCIS and DRIVER

The DARE project was a joint initiative of the Koninklijke Bibliotheek (KB), which is the national library of the Netherlands, the Dutch universities, the Royal Netherlands Academy of Arts and Sciences (KNAW), the Netherlands Organisation for Scientific Research (NWO) and SURFfoundation. DARE's aim was to store the results of all Dutch research in a network of repositories, in which the KB, with its repository called the e-Depot, fulfils the role of safe place in charge of digital preservation. In this way all participants retain responsibility for their own data and keep control over it, while making it accessible at the same time. Moreover, as the KB takes responsibility for storage and long term preservation, the universities can concentrate on their research work. The programme started in 2003, and was successfully completed in 2006.

Part of the DARE project was the building of one single open access entry point for the research output from Dutch universities (at first only research universities, but later expanded to universities of applied sciences). This entry point was realised by the KNAW. It is nowadays part of the Dutch Research Information System under the name NARCIS, and is hosted by Data Archiving and Networked Services (DANS) of the KNAW.

The European FP6 project DRIVER (2006-2007) established a network of relevant experts and Open Access repositories, and the Dutch DARE/NARCIS infrastructure has been made part of this network and its search portal. Phase II of DRIVER (FP7), which lasted from 2007 until 2009, not only expanded this network in Europe and established a robust, scalable repository infrastructure accompanied by an open source software package (D-Net), but also performed technology watch studies on Enhanced Publications, the outcome of which will be discussed later.

1.2 Infrastructure for Institutional Repositories in the Netherlands

Research universities, universities of applied sciences, research institutions etc. in the Netherlands, coordinated by SURFfoundation, the innovation platform for scholarly information and network, have developed open access repositories to make the output of their research community available. In most cases the institutional or university library is in charge of coordination and maintenance. Every organisation provides access to its own repository and, in addition, NARCIS offers an integrated single access point to this open access material. NARCIS harvests the metadata from the repositories, and builds services on it. On a European level the Dutch repositories are harvested by DRIVER. The DRIVER website provides integrated access to the metadata of open access research material in European repositories.

Figure 1: Academic repository infrastructure and long term preservation

By agreement - the Netherlands has no deposit legislation - between the repositories and KB, the KB harvests the publications from all Dutch repositories together with the accompanying metadata, and stores them in the e-Depot (the long term preservation environment for publications and other digital material in the KB), where they are safeguarded for long term preservation and access.

1.3 Dutch Organisational Approach to Long Term Preservation

The KB is not the only party involved in long term preservation in the Netherlands. In winter 2009-10 five Dutch organisations, collaborating in the National Coalition for Digital Preservation (NCDD), offered a proposal to the Dutch Government on how long term preservation of digital material in the Netherlands could be organised. This proposal intends to make formal what is currently more or less reality (both in the analogue and in the digital world). A division of responsibilities over five organisations is proposed: the KB will take care of textual materials, DANS of the scientific data, the National Archive of national governmental information, the Netherlands Institute for Sound and Vision of the audiovisual material, and a recently established Cultural Coalition Digital Preservation (CCDP) of born digital and digitised cultural content. Making these responsibilities formal is a big step forward in the organisation of preserving the Dutch digital heritage.

2. Enhanced Publications

For 'traditional' ways of publishing this organisation of divided responsibilities works well, because research output is most of the time a document of a single nature: text or film or dataset. That might no longer always be the case. It is becoming increasingly common to accompany an article with other material, for example a data set on which the research was based. This set is called an Enhanced Publication (EP) or Compound Publication. In DRIVER II the definition of an EP is: "Enhanced publications are envisioned as compound digital objects which can combine various heterogeneous but related web resources. The basis of this compound object is the traditional academic publication. This latter term refers to a textual resource with original work which is intended to be read by human beings, and which puts forward certain academic claims." [Verhaar 2008] Enhancing a publication involves adding one or more resources to this ePrint. These can be the resources that have been produced or consulted during the creation of the text.

OAI-ORE is used to aggregate the parts of an EP and to describe the structure and the kind of relationship between those parts. The choice for ORE is based on the study Enhanced Publications: Linking Publications and Research Data in Digital Repositories, part 1-2 [Woutersen-Windhouwer 2009].
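To make the idea of an aggregation more concrete, the following minimal sketch represents a resource map as plain subject-predicate-object triples. It is an illustration only, not the serialisation used in DRIVER II: the example.org URIs, the function names and the use of Dublin Core terms for typing and linking the parts are assumptions. Only the ORE terms `ResourceMap`, `Aggregation`, `describes` and `aggregates` come from the OAI-ORE vocabulary.

```python
# Minimal sketch of an OAI-ORE style resource map as plain triples.
# All example.org URIs are hypothetical placeholders.

ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCTERMS = "http://purl.org/dc/terms/"


def build_resource_map(rem_uri, aggregation_uri, parts):
    """Return a list of (subject, predicate, object) triples.

    `parts` maps each aggregated resource URI to a dict with a
    'type' (e.g. 'text' or 'dataset') and optional 'relations',
    a list of (predicate, target_uri) pairs between parts.
    """
    triples = [
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        (rem_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    for uri, info in parts.items():
        # The aggregation lists each part ...
        triples.append((aggregation_uri, ORE + "aggregates", uri))
        # ... and each part is described (assumed: a simple type label).
        triples.append((uri, DCTERMS + "type", info["type"]))
        # Relations between the parts are recorded explicitly.
        for predicate, target in info.get("relations", []):
            triples.append((uri, predicate, target))
    return triples


def aggregated_resources(triples, aggregation_uri):
    """List the URIs that the aggregation aggregates, in insertion order."""
    return [o for s, p, o in triples
            if s == aggregation_uri and p == ORE + "aggregates"]


# Hypothetical EP: an article in one repository, its dataset in another.
example_map = build_resource_map(
    "http://example.org/ep/1/rem",
    "http://example.org/ep/1",
    {
        "http://example.org/repo-a/article.pdf": {
            "type": "text",
            "relations": [(DCTERMS + "references",
                           "http://example.org/repo-b/survey.csv")],
        },
        "http://example.org/repo-b/survey.csv": {"type": "dataset"},
    },
)
```

The point of the sketch is that the resource map describes the files and the relations between them without containing the files themselves, so the parts can live in different repositories.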

The author (a person, a group of persons, or an organisation directly involved with the content) is responsible for the EP as an intellectual entity. Sometimes libraries or bodies that are not directly involved with the content create aggregations, and also call them EPs. These are considered here as quasi EPs, but are not a priori excluded. There is also some confusion about whether the notion EP covers only the aggregation, or the aggregation and all the parts it is aggregating.

Even during the DRIVER project duration the definition of an EP turned out to be too traditional. An EP can be any combination of files (video and annotation, datasets and documentation, etc.). The traditional publication in digital form is not always the main file of an enhanced object. For practical reasons we kept to our original definition in DRIVER II, but we are aware of its limitations. Nevertheless it served the research goal: how could we archive such an EP in the existing Dutch infrastructure, and what issues to be resolved came out of this research?

3. Research Issue

For more than a decade, long term preservation archives like the KB e-Depot have had experience with all kinds of singular digital objects or data collections. An EP, however, is not a collection of singular objects or files: it is about the relations between those files, and about a collection of files that could be stored in a distributed way. The essence of studying the long term preservation aspects of an EP is not in the individual files but in the aggregation of the files, their mutual relations, and the fact that the storage of the files could be scattered all over the world.

4. Approach

To get better insight into the long term preservation issues of EPs, the definition of an EP as referred to in paragraph 2 was taken as given. It is just one of the many forms in which an EP could appear, but it is adequate to start studying the more generic problems. The long term preservation infrastructure in the Netherlands as described in paragraph 1.2 was taken as an example of the infrastructure into which an EP must be ingested. Although this Dutch infrastructure does not necessarily reflect the infrastructure in other countries or between countries, for this study it could be regarded as a laboratory environment, leaving aside all kinds of local and national differences.

The study was primarily undertaken theoretically, but to visualise the matter and to be sure that we had considered all the subsequent steps, an experiment was conducted and a demonstrator was built.

5. Results

The main outcome of this study into the long term preservation of EPs is a more detailed insight into the issues that will come up when EPs (or other networked publications) have to be preserved for long term access. The process for ingesting an EP was described, and the new software parts that were considered to be necessary were realised as experimental software, and are published as open source code. The process that we proposed for transferring an EP (aggregation and components) was visualised in a demonstrator [2].

5.1 Process

Firstly, we designed a process for the transfer of an EP, consisting of a text and extra data, to a long term preservation archive environment in the Dutch context. This would mean that the text should be stored in one archive, and the data in another archive, both of different organisations and in different locations. Here follows the description of this process. The steps in it are numbered, and these numbers refer to figure 2.

Figure 2: Process for Enhanced Publications in the long term preservation infrastructure.
The numbers in this figure refer to the article text (paragraph 5.1)

(1) The EP consists at minimum of a text file, a data file and an aggregation file (ORE). The constituent parts of the EP can be stored in one or more geographically scattered repositories (2). In this design the ORE file is harvested via OAI-PMH, and handed over to a service that determines which objects should be stored in which archive. In this case: research data in archive X, publications in archive KB (e-Depot). Because each archive must have the complete information about the whole entity, a separate MPEG21-DIDL file (the metadata format used for these archives) is prepared for every single archive, containing the metadata, the references to the appropriate files and the whole ORE file. (3) These packages are stored in an EP repository where the individual archives can harvest them via OAI-PMH. (4) KB, being the text archive, harvests the set that was prepared for it in the EP repository, retrieves the files that are referred to in the DIDL from the different repositories, and processes and ingests all the files in the usual way into an AIP. The ORE file is included in the AIP to make it possible to restore the EP for access purposes. (5) The same will be done by the other archive(s), e.g. a data archive, detected by the EP transformer service according to the nature (type, country, etc.) of the constituent files. Some issues we detected during this research are rather generic for long term archiving, and are not discussed here, but others have some special features because of the nature of an EP. They are discussed below.
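The routing step in this process can be sketched as follows. This is a simplified illustration under stated assumptions, not the experimental software from the project: the class and function names, the type labels "text" and "dataset" and the routing table are hypothetical, and the real packages are MPEG21-DIDL documents rather than Python objects.

```python
from dataclasses import dataclass, field

# Hypothetical routing rule: which archive receives which kind of part.
# The archive names follow the article's example (data in "archive X",
# publications in the KB e-Depot); the type labels are assumptions.
ARCHIVE_FOR_TYPE = {"text": "KB e-Depot", "dataset": "archive X"}


@dataclass
class Part:
    uri: str   # location of the file in its source repository
    kind: str  # "text" or "dataset"


@dataclass
class ArchivePackage:
    archive: str
    resource_map: str  # the whole ORE file, passed through unchanged
    file_refs: list = field(default_factory=list)


def prepare_packages(resource_map, parts):
    """Split an EP into one package per archive.

    Every package carries the complete resource map, so each archive
    knows the whole entity, but only references the files that this
    particular archive has to retrieve and ingest.
    """
    packages = {}
    for part in parts:
        if part.kind not in ARCHIVE_FOR_TYPE:
            raise ValueError(f"no archive configured for type {part.kind!r}")
        archive = ARCHIVE_FOR_TYPE[part.kind]
        if archive not in packages:
            packages[archive] = ArchivePackage(archive, resource_map)
        packages[archive].file_refs.append(part.uri)
    return packages


# Hypothetical EP with one article and one dataset.
packages = prepare_packages(
    "<resource map placeholder>",
    [
        Part("http://example.org/repo-a/article.pdf", "text"),
        Part("http://example.org/repo-b/survey.csv", "dataset"),
    ],
)
```

Raising an error for an unknown type is a design choice of this sketch. It reflects the point made in the issues below that the nature of a file cannot always be detected automatically, so an unroutable part should be flagged rather than silently dropped.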

5.2 Issues with the Preservation of Enhanced Publications

The very nature of EPs, in relation to the Dutch situation of several repositories under different organisations, raised some issues that might influence the activity of preserving these publications for the long term. Take for example rights management. Although the simplest form might be a publication by a Dutch author with a data set of which the rights are also with the Dutch author, it might be more realistic to look at EPs of which the intellectual ownership of some parts is held by non-Dutch authors, and which are subject to a different rights management regime. Some of the issues that have to do with the nature of EPs will be described here briefly, but the list can be extended without any doubt.

  • Ownership. As research is often done with partners from various (international) organisations, an EP can have several owners, sometimes geographically distributed. So first of all creators have to be identified because they own the intellectual property rights. In the Netherlands we assign a unique Digital Author Identifier (DAI) to every creator. These numbers are planned to be connected to the VIAF in order to have a globally unique identification for creators.
  • Rights. It has to be clear what an archive is permitted to do with the EP in the short term and in the long term, and with its various components, which could all be subject to different legislation. This relates not only to access rights but also to the rights related to preservation actions, such as: under what agreement is the copy archived; what actions are agreed to preserve the content and/or the form, etc. Copyright could be held by the author, the organisation the author is working for, or any institution or person the rights holder has transferred his rights to. Currently the KB's solution is to have the institutional repositories declare in the archival agreement that they will only store open access material in their repositories, so there is no need for an advanced access rights system.
    This approach will only work for a fairly simple and transparent EP. In complicated situations where the different types of data are more intertwined, it is harder to record exactly which part of the publication is under a free access licence, and which part is not. Take for example the situation where an EP has a textual part that is already available as open access, but where the related (commercially interesting) underlying data are not yet publicly available. In that case it is necessary to record for every part of the publication the exact copyright holder, the national law under which the licence is given, the form of licensing, the neighbouring rights, the clearance by the owner of the publication or by the deposit holder, the exceptions the owner and the deposit holder have agreed on, the period for which the agreed propositions are valid, etc.
  • Nature of the constituent files. The question is whether it will be possible to divide files according to their "textual" nature or "data" nature, and whether features like location and subject can be detected automatically. If there are separate archives for different types of material, subject or country, a distributing mechanism should understand automatically where to deliver the parts of an EP. Although "type" of object might seem the easiest feature to detect automatically, current identification tools still have many problems with determining the nature of files. Although identification tools are becoming more mature, manually added metadata is still important for detecting the correct properties of files.
  • Versioning/update policies. Data tend to be part of sets that could be subject to change. A "traditional" text is a unity in itself, and it is relatively easy to determine a new version. Modern texts are more and more becoming a patchwork of information units (like a hypertext), each of which could be subject to updates on its own. A dataset could be changed as a whole, but the items within a dataset could also be changed. An update policy can be based on a schedule, but this will not always be the right method, depending on the nature of the publication. More advanced and differentiated solutions have to be developed.
  • Authenticity. How can the future user trust that the EP as a whole still is the same as intended when created? A variety of measures need to be taken, like integrity checks, adding metadata about the origin of the parts, etc.
  • Persistency. Not only do the files themselves need to be kept intact, but the linking between the files (OAI-ORE) and the identification of the files stored in different places should also be persistent. A basic requirement of a long term preservation archive is that it can guarantee the persistency of the ingested files and the persistency of the identifiers of these files. In the Dutch academic world we use the URN:NBN as the system for persistency of identifiers. A persistent identifier is essential for identification, retrieval, referring and linking. For datasets HANDLE is an upcoming standard in the Netherlands. There is no problem in mixing several systems for persistent identifiers, as long as it is clearly stated which system is used.
  • Future use. A clear vision of the expected use of the archived EP will support shaping the preservation policies, but it is not an easy task to describe the future users and expected (re)use. Preservation is only useful if the goal of preservation is access in the long term. But how will systems and formats develop; what will users in the future expect from information created in our time? What aspects do we need to record to provide the right context for a future user to understand and reuse information from a long time ago? These questions are still to be answered.
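One of the integrity checks mentioned under authenticity and persistency could look like the minimal sketch below: at ingest a checksum is recorded per persistent identifier, and later the parts of the EP, possibly retrieved from different archives, are verified against that record. The function names and the identifiers in the example are hypothetical, and SHA-256 is an assumed choice of algorithm, not one prescribed by the archives discussed here.

```python
import hashlib


def checksum(content: bytes) -> str:
    """SHA-256 digest of a file's bytes, as a hex string."""
    return hashlib.sha256(content).hexdigest()


def make_manifest(parts):
    """Record a checksum per persistent identifier at ingest time.

    `parts` maps a persistent identifier to the bytes of the file.
    """
    return {pid: checksum(content) for pid, content in parts.items()}


def verify(manifest, parts):
    """Return the identifiers whose content is missing or has changed."""
    return sorted(
        pid for pid, digest in manifest.items()
        if pid not in parts or checksum(parts[pid]) != digest
    )


# Hypothetical identifiers: a URN:NBN-style one for the text and a
# Handle-style one for the dataset. Both are made up for illustration.
ingested = {
    "urn:nbn:nl:ui:00-example-text": b"article bytes",
    "hdl:0000/example-dataset": b"dataset bytes",
}
manifest = make_manifest(ingested)

# Later: the dataset was altered somewhere along the way.
retrieved = dict(ingested)
retrieved["hdl:0000/example-dataset"] = b"dataset bytes, modified"
changed = verify(manifest, retrieved)
```

Because the manifest is keyed by identifier rather than by storage location, it keeps working when different identifier systems are mixed, as long as each part keeps the identifier it was ingested with.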

6. Discussion

6.1 The Consequences of these Issues

Some of the above mentioned issues became very clear during the building of the prototype. For this prototype software, the starting point was a rather simple situation: a text and a dataset that could easily be separated, and distributed over two separate archives as long as the references between the parts were guaranteed. Although the test case was rather simple and theoretical, we could conclude that the distribution of the constituent parts of the EP over more than one archive should be considered realistic. Maybe not in the way it was done in this research situation, but in the real world the constituent parts will be in different places, even countries, and will be archived from there in various archives. So the issues mentioned above will be valid in reality. A question that is unsolved is where the best place will be to archive the aggregation: close to the most important file (if one could decide on this), in every place where there is a file belonging to the EP, or in a separate archive dedicated to the long term preservation of aggregations. Most of the issues mentioned are not particularly challenging from a technical perspective. The overall conclusion is that the issues are of an organisational nature rather than of a technical one.

The aim of the SURFshare programme is to have at least 1,500 EPs at the end of 2011. In a recently started review project called Sustainable Enhanced Publications we scrutinise a selection of these real EPs to verify our findings from the theoretical exercise we conducted, in order to come to a proposal for adjusting the Dutch infrastructure.

6.2 New Developments: Collaborative Workspaces

Publications are the concrete output of a research process. The researcher or the research group in its role as author determines when the publication will be delivered to a repository for access and archiving. Collaborative virtual research environments are considered to be the new workspaces for researchers. Future scholarly communication will take place in this environment, and the environment itself could be part of intermediate and end results of scientific research. As a consequence these environments could become the new way of publishing scientific output. Repositories should make connections to these environments to harvest and distribute the scientific outcomes. Long term preservation archives should archive and preserve relevant aspects of collaborative virtual research environments for future use.

But also in that environment it is the researcher who has to decide in the end what is ready for publication and thus for archiving. It is too early to describe how this could be done, but archives have to make it easy for a researcher or research group to transfer the scientific data and publications to the long term archive. The easiest way is to archive everything, but with that comes the obligation to preserve (not only store) and give access to all of it. However, it will be unavoidable that the costs of digital preservation will force organisations to select what to preserve. Every selection is a decision about what is now considered valuable for the future without knowledge of those future users. Nevertheless, even if we come to a selection of what is really valuable and relevant for future users, history has taught us that the history of science and the history of man can only be studied well if the artefacts that were judged non-valuable in those days can also be taken into account. Currently the Dutch SURFfoundation is performing pilots with virtual research environments, and is discussing whether to make the issue of preservation in this environment and the selection of data for future use part of its investigation.

Recently the European project ParseInsight published a report showing that researchers have become a little more aware of the long term preservation problems of the data they produce during their research [ParseInsight 2010]. For them libraries and archives are the parties that are in place to solve these problems. Funding bodies can also play an important role by making the depositing of data and publications part of the obligations they impose on researchers. Also on a European level we see a higher awareness of the value of research data for reuse, and an urge for open access to these data [DataCite and PersId]. The growing attention for long term preservation and the urge to provide high quality in delivering long term preservation services is also shown by the attempt to come to an official certification standard for long term preservation repositories (standard under development ISO 16363 - Space data and information transfer systems - Audit and certification of trustworthy digital repositories). We finish this article by pointing to the European project SCAPE, which started in January 2011, and has included in its work programme further research into the concept of EP and its long term preservation solutions.

(15-05-2011)

Acknowledgment

Partners in WP 4 (D4.3/M4.2) of the DRIVER II project were:
Paul Doorenbosch (WP-leader), Barbara Sierman (Koninklijke Bibliotheek, The Netherlands)
Eugène Durr (3.TU, The Netherlands)
Jens Ludwig, Birgit Schmidt (Göttingen State and University Library, Germany)
Maarten Hoogerwerf (DANS, The Netherlands)

Notes

References

The authors

Paul Doorenbosch, MA, is head of Research at the Koninklijke Bibliotheek, national library of the Netherlands. He studied Dutch literature of the 19th and 20th century at the University of Amsterdam. In 2001 he joined the KB as programme manager for the development and realisation of the Dutch national digitisation programme Memory of the Netherlands. Before 2001 he was employed by the Royal Netherlands Academy of Arts and Sciences (KNAW) as manager and scientific editor. He is vice chairman of the multidisciplinary programme Continuous Access to Cultural Heritage (CATCH), a joint research programme of computer science, cultural heritage and humanities, and member of the advisory boards of CLARIN (The Netherlands) and D-SPIN (Germany), both large scale infrastructural programmes for the Humanities.

Barbara Sierman, MA, is Digital Preservation Manager at the Koninklijke Bibliotheek. She studied Dutch literature of the Enlightenment at the University of Amsterdam, then joined Pica (now OCLC) as a library consultant and held various jobs as a consultant at IT companies, most recently at Cap Gemini. In 2005 she started at the KB in the Research and Development Department. She was engaged in the EU projects PLANETS and DRIVER and is now involved in SCAPE and APARSEN. She participates in international working groups on digital preservation, like TRAC, UDFR, IIPC and JHOVE2. She has given presentations on digital preservation, preservation metadata and organising digital preservation, and has published several articles on these topics.


Koninklijke Bibliotheek: www.kb.nl