Document Viewers for Non-Born-Digital Files in DSpace

Elías Tzoc
Miami University Libraries
tzoce@muohio.edu

Abstract

As more institutions work with large and diverse types of content in their digital repositories, there is an inherent need to evaluate, prototype, and implement user-friendly websites, regardless of the digital file size, format, location, or the content management system in use. This article provides an overview of the need for, and current development of, Document Viewers for digitized objects in DSpace repositories, including a local viewer developed for a newspaper collection and four other viewers currently implemented in DSpace repositories. According to the DSpace Registry, 22% of institutions are currently storing "Images" in their repositories and 21% are using DSpace for non-traditional IR content such as Image Repository, Subject Repository, Museum Cultural, or Learning Resources. The combination of current technologies such as Djatoka Image Server, IIP Image Server, DjVu Libre, and the Internet Archive BookReader, together with the growing number of digital repositories hosting digitized content, suggests that the DSpace community would probably benefit from an "out-of-the-box" Document Viewer, especially one for large, high-resolution, and multi-page objects.

1. Introduction

As academic and research institutions work with large and diverse types of content in their digital repositories, there is an inherent need to evaluate, prototype, and implement user-friendly websites, regardless of the digital files' size, format, location, or the content/digital management system in use. This article provides an overview of the need for, and current development of, Document Viewers for digitized objects in DSpace repositories. Although some may argue that the need for a viewer for non-born-digital files results from using a system originally designed for born-digital and traditional IR content such as PDF files, others may agree that the types of content and file formats in digital repositories are already diverse. In fact, according to the DSpace Registry [1], 249 of 1,117 (22%) institutions are currently storing "Images" in their repositories, and 231 (21%) are using DSpace for non-traditional IR content such as Image Repository, Subject Repository, Museum Cultural, or Learning Resources. These data seem to confirm the early finding of a 2005 survey conducted by the Coalition for Networked Information (Lynch and Lippincott, 2005), in which the authors concluded that a growing number of institutions were using institutional repositories not only for e-prints or born-digital materials, but also for digitized materials such as books, maps, and other primary source materials that are traditionally housed in libraries’ special collections or archives.

Digitized images are often saved as either TIFF or JPEG2000 files, especially for master copies; JPEG is the preferred format for access copies and remains the most common image format on the web. A more extensive analysis of image file formats and of how to make scanned documents web accessible is presented in the article Are Your Digital Documents Web Friendly?: Making Scanned Documents Web Accessible (Zhou, 2010). The presentation of large, high-quality images through a web browser has changed and improved significantly in the last few years. A single image can be viewed using a standard inline visualization method that simply loads the entire image into memory; however, this approach can be very inefficient for a set of high-quality image files such as those of books or newspapers. The digitized images of a book can result in large files, often hundreds of megabytes, which makes them unsuitable for traditional in-browser visualization. The benefits of friendly and customizable viewers for high-resolution images are well presented in the article Capturing and Viewing Gigapixel Images (Kopf et al., 2007). The ability to zoom and pan high-resolution images in digital repositories allows researchers to take a closer look at interesting parts of an image (e.g. captions under images or handwritten text) at a level of detail that a reduced-size access copy cannot offer, which can subsequently lead to interesting discoveries.
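To make the inefficiency of inline viewing concrete, the following Python sketch compares the memory needed to decode one whole page with the memory needed for only the tiles that cover a browser viewport. The page dimensions, the 256-pixel tile size, and the viewport size are illustrative assumptions, not figures taken from the collections discussed in this article.

```python
import math


def uncompressed_bytes(width_px, height_px, bytes_per_pixel=3):
    """Memory needed to hold a fully decoded image (24-bit RGB by default)."""
    return width_px * height_px * bytes_per_pixel


def tiles_for_viewport(viewport_w, viewport_h, tile_size=256):
    """Upper bound on the tiles needed to cover a viewport at one zoom level."""
    # One extra column and row, because the viewport rarely aligns with tile edges.
    cols = math.ceil(viewport_w / tile_size) + 1
    rows = math.ceil(viewport_h / tile_size) + 1
    return cols * rows


# Hypothetical scanned newspaper page and browser viewport.
page_w, page_h = 6000, 9000
full = uncompressed_bytes(page_w, page_h)
tiles = tiles_for_viewport(1024, 768)
tiled = tiles * uncompressed_bytes(256, 256)

print(f"whole page: {full / 1e6:.1f} MB, {tiles} tiles: {tiled / 1e6:.1f} MB")
```

Under these assumptions the decoded page occupies 162 MB, while the 20 tiles covering the viewport occupy about 3.9 MB. Tile-based image servers exploit this kind of difference, although the exact numbers depend on the tile size and image format in use.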

The topic of Document Viewers for DSpace generated a discussion on the DSpace-tech listserv in January 2011 [2]. Several developers talked about the possibilities of such an implementation, others provided links to examples or related technologies, and some even talked about the need for an “out-of-the-box” viewer. The following two sections of this article include a brief discussion of the development of a local OpenZoom viewer deployed in a student newspaper collection and an overview of four viewers currently implemented in other DSpace repositories.

2. A Local Example Using OpenZoom

The Miami Student Newspaper, established in 1826 as an irregular student publication and published regularly as a newspaper since 1867, is one of the oldest college newspapers in the United States. In 2007, the Miami University Libraries began a digitization project for this collection, using the archival microfilm and, when necessary, the original bound copies. The collection, launched in January 2009 and stored initially in CONTENTdm 4.3, soon became one of the libraries’ most popular digital collections. However, the proprietary image viewer presented challenges for viewing, reading, or printing the newspaper pages. As an alternative, in the summer of 2010 the team of the Digital Initiatives Department completed the migration of this large student newspaper collection (more than 4,600 issues) to DSpace 1.6. The project was possible thanks to a grant from the federal Institute of Museum and Library Services (IMLS), awarded by the State Library of Ohio, the Office for the Advancement of Research and Scholarship at Miami University, and the Miami University Libraries. The grant helped us test and prove the ability of DSpace to provide access to large non-born-digital collections with enhanced web usability and functionality features for online reading. The workflow included two main processes: a) data conversion from TIFF to JPEG2000 and b) development of a new public interface.

DSpace provides two options for presenting data on the web: JSPUI, a user interface based on Java Server Pages (JSP) and XMLUI, another user interface developed by Texas A&M University and based on the Apache Cocoon framework. Because of the need to create a custom look-and-feel for the collection, we decided to use XMLUI. This project allowed us to test and implement three major theme customizations:

  • OpenZoom image viewer: a front end for the IIP Image Server, which supports high-resolution jp2 images and feeds a dynamic flash-based viewer with a basic toolbar and options for zooming and full-screen view.
  • Customized metadata labels: digital collections often require specific metadata fields to describe them better. For instance, “Volume No.” is only relevant for newspaper collections, so creating a template for custom metadata labels was useful for this collection.
  • Calendar view for browsing: collections such as yearbooks, magazines, or newspapers often need a custom browse page based on dates. We exported a metadata file from DSpace via OAI and wrote a PHP script to create a twelve-month calendar view for browsing.

The major challenge in integrating the viewer into DSpace was accessing the set of jp2 files (bitstreams) associated with a single record, mainly because of how DSpace stores bitstreams in its assetstore directory. As a workaround, we created a separate copy of the jp2 files and saved them under the OpenZoom directory, in one folder per issue named with the publication date (e.g. YYYY-MM-DD). The trick was to send the date in ISO format from DSpace to OpenZoom as a variable in a URL (e.g. ?pub_date=1991-04-30); a PHP script then takes the date variable, uses it to match the corresponding folder, reads the folder content, and generates a drop-down menu that allows users to navigate the pages of that particular issue. When DSpace loads the page for a given record, an XSLT template sends a URL containing the date variable to call the PHP script and embeds the OpenZoom viewer into DSpace. Figure 1 illustrates the basic interaction between XSLT and PHP for embedding the viewer in DSpace. The live site is available at: http://digital.lib.muohio.edu/msnda/.
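The folder-matching step can be sketched as follows. This is a Python illustration of the logic described above, not the PHP script used in the project; the function names, the strict date check, and the form of the option values are assumptions made for the example.

```python
import html
import re
from pathlib import Path

# Only a strict YYYY-MM-DD value is accepted (an assumption of this sketch),
# so the URL variable cannot point outside the folder that holds the issues.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def pages_for_issue(root, pub_date):
    """Return the sorted jp2 file names stored under root/<pub_date>/."""
    if not DATE_RE.fullmatch(pub_date):
        raise ValueError(f"not an ISO date: {pub_date!r}")
    folder = Path(root) / pub_date
    if not folder.is_dir():
        return []  # no issue was published (or copied) for that date
    return sorted(p.name for p in folder.glob("*.jp2"))


def page_menu(pub_date, pages):
    """Build an HTML drop-down menu with one option per page of the issue."""
    options = []
    for number, name in enumerate(pages, start=1):
        value = html.escape(f"{pub_date}/{name}")
        options.append(f'  <option value="{value}">Page {number}</option>')
    return '<select name="page">\n' + "\n".join(options) + "\n</select>"
```

A call such as page_menu("1991-04-30", pages_for_issue("issues", "1991-04-30")), where "issues" is a hypothetical folder holding the copied jp2 files, returns the markup for the menu; in the setup described above, each option would then be handed to the viewer as the image to load.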

Figure 1. Representation of the XSLT and PHP interaction for embedding an OpenZoom viewer in DSpace.
http://arthur.lib.muohio.edu/handle/10617/2492

Although this approach required a separate copy of the jp2 files in a second location, we believe it was also a good experiment for understanding how practical it is to embed external and related files into a DSpace record. In fact, as the content in repositories and the devices used for access continue to diversify and grow, this approach of displaying different pieces of related data from different sources and formats in a single interface may become more common and expected. In this regard, we have recently created two other test themes: one that embeds videos hosted on Vimeo and another for JPEG images that are created on the fly from a multi-page DjVu file.

One limitation we found with the OpenZoom viewer is that, as a flash-based option, it does not display well on small-screen and keyboard-less devices such as the Apple iPad or other tablets. As part of the student newspaper migration, we also prototyped a mobile interface for this collection. We modified an existing PHP script that takes a DjVu file and creates and displays an individual JPG file for each page in a standard HTML page, and we added a navigation bar with thumbnails of all corresponding pages so that users can navigate from page to page or jump to any page. In the future, we plan to add this viewer as an alternative for devices that do not support flash. A theme with two viewers will require a JavaScript file that detects the user’s device and browser capabilities and uses that information to choose the viewer that works best for a particular device. Figure 2 is a screenshot of the newspaper collection viewed on an iPad.
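The decision such a detection script has to make can be sketched as follows. The planned implementation is client-side JavaScript; this Python sketch only illustrates the selection logic, and the function name, the viewer labels, and the list of user-agent markers are assumptions made for the example.

```python
# User-agent fragments of devices assumed not to run Flash (illustrative list).
NO_FLASH_MARKERS = ("ipad", "iphone", "ipod")


def choose_viewer(user_agent, has_flash=None):
    """Pick 'openzoom' (flash-based) or 'djvu' (plain HTML and JPG pages).

    has_flash is the result of a capability check when one is available;
    it takes precedence over guessing from the user-agent string.
    """
    if has_flash is not None:
        return "openzoom" if has_flash else "djvu"
    agent = user_agent.lower()
    if any(marker in agent for marker in NO_FLASH_MARKERS):
        return "djvu"
    return "openzoom"


print(choose_viewer("Mozilla/5.0 (iPad; CPU OS 4_3 like Mac OS X)"))
print(choose_viewer("Mozilla/5.0 (Windows NT 6.1)", has_flash=True))
```

Preferring an explicit capability check over user-agent matching keeps the choice correct for devices that are not on the list, which is why the sketch treats the user-agent string only as a fallback.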

Figure 2. Screenshot of a newspaper issue viewed on an iPad using the DjVu-based viewer.

The newspaper theme required some major edits in four files:

| File | Changes |
| --- | --- |
| mustudent.xsl | Added two XSLT templates: a) to get the filename of the primary bitstream and send it as a variable in a URL for the OpenZoom viewer; and b) to customize local metadata labels for web display. Removed templates such as ds-options* and recent-submission. |
| style.css | Modified the DIV ds-body properties in order to provide a cleaner/wider interface for the OpenZoom iframe element. Removed certain elements like the "Show full item record" link. |
| mustudent.php | Wrote a PHP file that takes a date variable to match a folder with jp2 files; when a match is found, the OpenZoom viewer is called and a drop-down menu is created for navigation. |
| calendar.php | Wrote a PHP file that reads a text file with two metadata fields, handle and date, and creates a twelve-month calendar view for browsing, with a decade option as well. |

* Removing the DIV ds-options element -which includes links to all the administrative functions- can be confusing, especially if administrators need to make changes using the web-interface. In some cases when we needed to perform an administrative task, we used the JSPUI interface.
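The calendar view can be sketched along the following lines. This is a Python illustration of the idea rather than the calendar.php script itself; the tab-separated input layout, the function names, and the sample handles are assumptions made for the example.

```python
import calendar
from collections import defaultdict


def load_issues(lines):
    """Map (year, month) -> {day: handle} from 'handle<TAB>YYYY-MM-DD' lines."""
    issues = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        handle, date = line.split("\t")
        year, month, day = (int(part) for part in date.split("-"))
        issues[(year, month)][day] = handle
    return issues


def month_rows(year, month, issues):
    """Return the weeks of a month; each cell is (day, handle or None).

    A day of 0 marks a padding cell that falls outside the month.
    """
    days = issues.get((year, month), {})
    return [[(day, days.get(day)) for day in week]
            for week in calendar.monthcalendar(year, month)]


def decades(issues):
    """List the decades that contain at least one issue, for a decade menu."""
    return sorted({(year // 10) * 10 for year, _month in issues})


# Hypothetical export with two issues.
sample = ["10617/0001\t1991-04-30", "10617/0002\t1987-09-01"]
index = load_issues(sample)
year_view = [month_rows(1991, month, index) for month in range(1, 13)]
```

Each cell that carries a handle can be rendered as a link to the corresponding DSpace record, and the twelve month grids together form the yearly browse page.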

The OpenZoom viewer example discussed in this article seems to solve the initial need of previewing large high-resolution files in DSpace: it allows users to read and examine minor details in multi-page files without having to download a large file. This is an essential step towards making digital repositories more accessible and easier to use. However, another key feature that would make this collection more accessible is integrated full-text searching with keyword highlighting. At present, the web interface lets users search at the collection level, using a hidden field with the OCRed text, but there is no option for searching within an individual issue unless its PDF file is downloaded. As users’ expectations and technologies continue to change, it is reasonable to ask whether a hybrid repository would be a better alternative; for instance, a DSpace system for the back-end supporting digital preservation and another open source system for the front-end where developers can implement enhanced user interfaces. The end-user solutions currently available for the Fedora Commons Repository Software [3] are a good example of this hybrid approach. After all, users may not care much about how and where the objects are stored, but they definitely care if the content is not available in a user-friendly interface and within a reasonable span of time.

3. Other Viewers in DSpace Repositories

To document current examples of Document Viewers in DSpace, this section describes four organizations that have successfully implemented document viewers in their repositories.

Brasiliana Digital Library
The Brasiliana Digital Library offers perhaps one of the most advanced examples of a document viewer. They use a combination of technologies such as the Adore Djatoka image server, ImageMagick, and the Internet Archive BookReader. Their repository contains mostly multi-page PDF files of scanned books and newspaper issues. Fabio Kepler said "When we started digitizing, many files had more than 100 MB ... the issue worrying us was that the user was forced to download the file in order to view it and decide if it was of one's interest." Their implementation included some major modifications to the XMLUI module and to a small part of the DSpace core in order to make it possible to get the files' paths from the assetstore.

Figure 3. An 1885 newspaper available at the Brasiliana Digital Library with a viewer based on the Internet Archive BookReader.
http://www.brasiliana.usp.br/bbd/handle/1918/060044-038

DRC-OhioLINK
The Digital Resource Commons (DRC) is a statewide platform that enables Ohio institutions to preserve, publish, and access all types and formats of digital materials. The DRC recently launched a JP2 Viewer based on the IIPImage Server. The viewer features a toolbar with options for zooming, panning, rotating and viewing images in full screen mode; the viewer also includes a navigation box with thumbnails and the total number of pages in a document, which helps in browsing multi-page files. The viewer supports JP2 and TIF files and is currently being deployed in more DSpace instances either as a new theme or as an additional XSLT template in existing themes.

Figure 4. An 1874 Atlas Map available at DRC-OhioLINK with a viewer based on the IIP Image Server.
http://drc.ohiolink.edu/handle/2374.OX/89412

Texas A&M Libraries
As the developers of XMLUI/Manakin for DSpace, Texas A&M Libraries has done extensive customization work for their digital repository. A good example of a document viewer for multi-page objects is implemented in the Geologic Atlas of the United States collection, which is a set of folios containing three types of content: text, maps, and photographs. Previews are provided through an image gallery-style viewing interface based on Lightbox, a JavaScript application used to display large images. Users also have the option to download either a full archival-quality TIFF file or a reduced-quality JPEG file.

Figure 5. A 1924 folio available at Texas A&M Libraries with a viewer based on the Lightbox JavaScript application.
http://repository.tamu.edu/handle/1969.1/3688

@mire
@mire is a commercial organization that develops modules for DSpace. Their Document Streaming module provides access to document files using Scribd's iPaper viewer. This flash-based viewer features options for zooming, printing, and full-screen viewing. In contrast to the other examples presented in this article, this alternative does not support the JP2 or TIF file formats. It also requires that a copy of each file be automatically uploaded to the scribd.com servers, although users can make the files fully private.

Figure 6. A 1916 newspaper available at the West Texas Digital Archives with a viewer provided by @mire.
http://wtda.alc.org/handle/123456789/33381

The above viewer implementations are good examples of how others are dealing with non-born-digital objects in DSpace. The Djatoka Image Server [4] seems to be a strong candidate, not only because of its existing features for zooming, panning, and selecting the URI of the current view, but also because its three main system requirements (Sun Java, Tomcat, and Apache Ant) are already in use by DSpace. Another related development using the Djatoka viewer is the Image Serving feature currently offered by DuraCloud [5]. The IIP Image Server [6] also works well for streaming extremely high-resolution images. Several viewers are available for it, including the IIPMooViewer, an HTML5 JavaScript solution that allows cross-platform, plugin-free browsing. According to its documentation, the only external dependencies are the libtiff TIFF library and the IJG JPEG library. Additionally, the Internet Archive BookReader could become a strong candidate for a Document Viewer in DSpace, not only because of its ubiquitous book interface but, most importantly, because of its full-text search functionality, which runs on Solr, itself an existing component of the DSpace software.

4. Conclusion

On the assumption that the number of digital repositories hosting digitized content continues to grow, the DSpace community would probably benefit from an "out-of-the-box" Document Viewer, especially one that supports large, high-resolution, and multi-page objects. A further study of the data available in the DSpace Registry could help identify institutions with common requirements and implementations of image viewers, which could in turn help form a group of developers to formalize this work in a DSpace release.

The growing popularity of JPEG2000 as a preferred format in many repositories, for both access and long-term preservation, seems to facilitate the implementation of new viewers as well. The current state of the art in Web Document Viewers for complex and large image files suggests that several technologies can allow users to preview individual pages of digitized multi-page objects. Those viewers also give users the ability to perform key interactions such as zooming, panning, or turning pages.

Examples of current technologies presented in this article include Djatoka Image Server, IIP Image Server, DjVu Libre, and the Internet Archive BookReader. A more recent development is the Diva multi-page document viewer (Hankinson et al., 2011), which works as a "document viewer designed to present high-resolution digitized images as a continuous and scrollable items." In the mid-to-long term, an ideal DSpace viewer should allow users to preview, zoom, and perform full-text searches with keyword highlighting. For web usability and accessibility, such a viewer should probably be device independent and avoid the need for either plug-ins or stand-alone applications.

5. Acknowledgements

The author would like to thank John Millard, Head of the Center for Digital Scholarship at Miami University, for helping in the deployment of the OpenZoom and DjVu viewers; and Fábio Kepler, Developer and Researcher at the Brasiliana Digital Library, for providing feedback on the viewers they have implemented.

6. References

7. Notes

  1. DSpace Registry: <http://www.dspace.org/whos-using-dspace>
  2. DSpace Tech Listserv - Online Document Viewer Functionality: <http://tinyurl.com/viewers-for-DSpace>
  3. DuraSpace Wiki - End User Apps: Simple Interfaces to Complete Solutions: <https://wiki.duraspace.org/display/FEDORACREATE/Complete+Solutions>
  4. Djatoka Jpeg 2000 Image Server: <http://djatoka.sourceforge.net/>
  5. DuraCloud - Image Serving: <http://www.duracloud.org/image_serving>
  6. IIPImage server: <http://iipimage.sourceforge.net/>