Document Viewers for Non-Born-Digital Files in DSpace
As more institutions continue to work with large and diverse types of content for their digital repositories, there is an inherent need to evaluate, prototype, and implement user-friendly websites, regardless of the digital file size, format, location, or the content management system in use. This article aims to provide an overview of the need for and current development of Document Viewers for digitized objects in DSpace repositories, including a local viewer developed for a newspaper collection and four other viewers currently implemented in DSpace repositories. According to the DSpace Registry, 22% of institutions are currently storing "Images" in their repositories and 21% are using DSpace for non-traditional IR content such as Image Repository, Subject Repository, Museum Cultural, or Learning Resources. The combination of current technologies such as Djatoka Image Server, IIP Image Server, DjVu Libre, and the Internet Archive BookReader, together with the growing number of digital repositories hosting digitized content, suggests that the DSpace community will probably benefit from an "out-of-the-box" Document Viewer, especially one for large, high-resolution, and multi-page objects.
As academic and research institutions continue to work with large and diverse types of content for their digital repositories, there is an inherent need to evaluate, prototype, and implement user-friendly websites, regardless of the digital files' size, format, location, or the content/digital management system in use. This article aims to provide an overview of the need for and current development of Document Viewers for digitized objects in DSpace repositories. Some may argue that the need for a viewer for non-born-digital files results from using a system that was originally designed for born-digital and traditional IR content such as PDF files; others may point out that the types of content and file formats in digital repositories are already diverse. In fact, according to the DSpace Registry, 249 of 1,117 (22%) institutions are currently storing "Images" in their repositories, and 231 (21%) are using DSpace for non-traditional IR content such as Image Repository, Subject Repository, Museum Cultural, or Learning Resources. These data seem to confirm the early finding of a 2005 survey conducted by the Coalition for Networked Information (Lynch and Lippincott, 2005), in which the authors concluded that a growing number of institutions were using institutional repositories not only for e-prints or born-digital materials, but also for digitized materials such as books, maps, and other primary source materials that are traditionally housed in libraries’ special collections or archives.
Digitized images are often saved as either TIFF or JPG2000 files, especially for master copies; JPEG is the preferred format for access copies and remains the most common image format used on the web. A more extensive analysis of file formats for images and how to make scanned documents web accessible is presented in the article Are Your Digital Documents Web Friendly?: Making Scanned Documents Web Accessible (Zhou, 2010). The presentation of large and high-quality images through a web browser has seen some significant changes and improvements in the last few years. For example, a single image can be viewed using a standard inline visualization method that simply loads an entire image into memory; however, this approach can be very inefficient for a set of high-quality image files such as those of books or newspapers. The digitized images of a book can result in large files (often hundreds of megabytes), which makes them unsuitable for traditional in-browser visualization. The benefits of presenting friendly and customizable viewers for high-resolution images are well presented in the article Capturing and Viewing Gigapixel Images (Kopf et al., 2007). The ability to zoom and pan high-resolution images in digital repositories can allow researchers to take a closer look at interesting parts of images (e.g. captions under images or handwritten text), far beyond the original image resolution, which can subsequently lead to interesting discoveries.
The topic of Document Viewers for DSpace generated a discussion in January 2011 on the DSpace-tech listserv: several developers talked about the possibilities of such an implementation, others provided links to examples or related technologies, and some even talked about the need for an “out-of-the-box” viewer. The following two sections of this article include a brief discussion of the development of a local OpenZoom viewer deployed in a student newspaper collection and an overview of four viewers currently implemented in other DSpace repositories.
2. A Local Example Using OpenZoom
The Miami Student Newspaper, established in 1826 as irregular student publications and from 1867 as a regularly published newspaper, is one of the oldest college newspapers in the United States. In 2007, the Miami University Libraries began a digitization project for this collection, using the archival microfilm and original bound copies when necessary. The collection, launched in January 2009 and stored initially in CONTENTdm 4.3, soon became one of the libraries’ most popular digital collections. However, the proprietary image viewer presented challenges to viewing, reading, or printing the newspaper pages. As an alternative, in the summer of 2010 the team of the Digital Initiatives Department completed the migration of this large student newspaper collection (more than 4,600 issues) to DSpace 1.6; the project was possible thanks to a grant from the federal Institute of Museum and Library Services (IMLS), awarded by the State Library of Ohio, the Office for the Advancement of Research and Scholarship at Miami University and the Miami University Libraries. The grant helped us to test and prove the ability of DSpace to provide access to large non-born-digital collections with enhanced web usability and functionality features for online reading. The workflow included two main processes: a) data conversion from TIFF to JPG2000 and b) development of a new public interface.
DSpace provides two options for presenting data on the web: JSPUI, a user interface based on JavaServer Pages (JSP), and XMLUI, a user interface developed by Texas A&M University and based on the Apache Cocoon framework. Because of the need to create a custom look-and-feel for the collection, we decided to use XMLUI. This project allowed us to test and implement three major theme customizations:
- OpenZoom image viewer: a front end for the IIP Image server, which supports the use of high-resolution jp2 images and feeds a dynamic Flash-based viewer with a basic toolbar and options for zooming and full-screen view.
- Customized metadata labels: digital collections often require specific metadata fields to describe them properly; for instance, “Volume No.” is only relevant for newspaper collections. Creating a template for custom metadata labels was useful for this collection.
- Calendar view for browsing: collections such as yearbooks, magazines, or newspapers often need a custom browse page based on dates. We exported a metadata file from DSpace via OAI and wrote a PHP script to create a twelve-month calendar view for browsing.
The major challenge in integrating the viewer into DSpace was accessing the set of jp2 files (bitstreams) associated with a single record, mainly because of how DSpace stores bitstreams in its assetstore directory. As a workaround, we created a separate copy of the jp2 files and saved it under the OpenZoom directory, in folders named with the publication date (e.g. YYYY-MM-DD). The trick was to send the date in ISO format from DSpace to OpenZoom as a variable in a URL (e.g. ?pub_date=1991-04-30). A PHP script then takes the date variable, uses it to match the corresponding folder, reads the folder content, and generates a drop-down menu that allows users to navigate and view the pages of that particular issue. When DSpace loads a page for a given record, an XSLT template sends a URL, with the date variable, to call the PHP script and embeds the OpenZoom viewer into DSpace. Figure 1 illustrates the basic interaction between XSLT and PHP for embedding the viewer in DSpace. The live site is available at: http://digital.lib.muohio.edu/msnda/.
Figure 1. Representation of the XSLT and PHP interaction for embedding an OpenZoom viewer in DSpace.
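The date-to-folder lookup described above can be illustrated with a minimal sketch. It is written in Python rather than the PHP used in the project, and it is not the project's actual script: the function names, the base URL passed to `viewer_url`, and the `image_root` folder layout are all assumptions made for illustration. Only the `pub_date` parameter and the one-folder-per-date convention come from the text.

```python
import html
import re
from pathlib import Path
from urllib.parse import parse_qs, urlencode, urlparse

# Strict YYYY-MM-DD pattern; rejecting anything else also keeps the
# request parameter from being used to walk outside the image folder.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def viewer_url(base_url, pub_date):
    """Build the URL the repository page would embed (the XSLT side).

    base_url is a hypothetical address of the viewer script.
    """
    return f"{base_url}?{urlencode({'pub_date': pub_date})}"


def pages_for_issue(url, image_root):
    """Match the pub_date variable to a folder and list its jp2 pages.

    image_root is an assumed directory holding one sub-folder per issue,
    each named with the publication date of that issue.
    """
    params = parse_qs(urlparse(url).query)
    pub_date = params.get("pub_date", [""])[0]
    if not ISO_DATE.fullmatch(pub_date):
        raise ValueError(f"pub_date is not an ISO date: {pub_date!r}")
    folder = Path(image_root) / pub_date
    if not folder.is_dir():
        return []  # no issue was copied for this date
    return sorted(p.name for p in folder.glob("*.jp2"))


def dropdown_html(pages):
    """Render a drop-down menu with one entry per page of the issue."""
    options = "\n".join(
        f'  <option value="{html.escape(name)}">Page {number}</option>'
        for number, name in enumerate(pages, start=1)
    )
    return f'<select name="page">\n{options}\n</select>'
```

Calling `pages_for_issue(viewer_url("http://example.org/viewer", "1991-04-30"), "/data/openzoom")` would return the sorted jp2 file names found in `/data/openzoom/1991-04-30`, which `dropdown_html` turns into the navigation menu; both the URL and the path here are placeholders.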
Although this approach required a separate copy of the jp2 files in a second location, we believe it was also a good experiment for understanding how practical it is to embed external and related files into a DSpace record. In fact, as the content in repositories and the devices used for access continue to diversify and grow, this approach of displaying different pieces of related data from different sources and formats in a single interface may become more common and expected. In this regard, we have recently created two other test themes, one that embeds videos hosted on Vimeo and another for JPEG images that are created on the fly from a multi-page DjVu file.
Figure 2. Screenshot of a newspaper issue viewed on an iPad using the DjVu-based viewer.
The newspaper theme required some major edits in four files:
|mustudent.xsl||Added two XSLT templates: a) to get the filename of the primary bitstream and send it as a variable in a URL for the OpenZoom viewer; and b) to customize local metadata labels for web display. Removed templates such as ds-options* and recent-submission.|
|style.css||Modified the DIV ds-body properties in order to provide a cleaner/wider interface for the OpenZoom iframe element. Removed certain elements like the "Show full item record" link.|
|mustudent.php||Wrote a PHP file that takes a date variable to match a folder with jp2 files; when a match is found, the OpenZoom viewer is called and a drop-down menu is created for navigation.|
|calendar.php||Wrote a PHP file that reads a text file with two metadata fields, handle and date, and then creates a twelve-month calendar view for browsing, with a decade option as well.|
* Removing the DIV ds-options element, which includes links to all the administrative functions, can be confusing, especially if administrators need to make changes using the web interface. In some cases when we needed to perform an administrative task, we used the JSPUI interface.
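The calendar browse described for calendar.php can be sketched roughly as follows, again in Python rather than PHP. This is an illustration under stated assumptions, not the project's script: the tab-separated input format, the function names, the Sunday-first week layout, and the sample handles in the usage note are hypothetical. Only the two fields (handle and date), the twelve-month view, and the decade option come from the text.

```python
import calendar
from collections import defaultdict
from datetime import date


def parse_records(lines):
    """Parse lines of the assumed form 'handle<TAB>YYYY-MM-DD'."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the exported file
        handle, iso_date = line.split("\t")
        records.append((date.fromisoformat(iso_date), handle))
    return records


def issues_by_day(records, year):
    """Index one year's issues as {month: {day: handle}}."""
    index = defaultdict(dict)
    for issue_date, handle in records:
        if issue_date.year == year:
            index[issue_date.month][issue_date.day] = handle
    return index


def month_grid(year, month, issues):
    """Return the month as weeks of (day, handle-or-None) cells.

    A day of 0 marks a padding cell outside the month. Weeks start on
    Sunday here, which is a layout assumption.
    """
    weeks = calendar.Calendar(firstweekday=6).monthdayscalendar(year, month)
    return [[(day, issues.get(day)) for day in week] for week in weeks]


def decades(records):
    """List the decades that contain at least one issue."""
    return sorted({issue_date.year // 10 * 10 for issue_date, _ in records})
```

A browse page for a given year would call `month_grid` once for each of the twelve months and link every cell that carries a handle to the corresponding record, while `decades` supplies the entries for the decade selector.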
The OpenZoom viewer example discussed in this article seems to solve the initial need of previewing large high-resolution files in DSpace: this approach allows users to read and examine minor details in multi-page files without having to download a large file. This is definitely an essential and major step towards making digital repositories more accessible and easy to use. However, another key feature that would make this collection more accessible is integrated full-text searching with keyword highlighting. At present, the web interface provides users the ability to search at the collection level, using a hidden field with the OCRed text, but there is no option for searching within individual issues unless a PDF file is downloaded. As users’ expectations and technologies continue to change, it may be valid to question whether a hybrid repository can be a better alternative; for instance, a DSpace system for the back-end supporting digital preservation and another open source system for the front-end where developers can implement enhanced user interfaces. The End User Solutions currently available for the Fedora Commons Repository Software are a good example of this hybrid approach. After all, users may not care much about how and where the objects are stored, but they definitely care if the content is not available in a user-friendly interface and within a reasonable span of time.
3. Other Viewers in DSpace Repositories
As part of documenting current examples of Document Viewers in DSpace, the following four organizations serve as successful examples of institutions that have implemented document viewers in their repositories.
Brasiliana Digital Library
The Brasiliana Digital Library offers perhaps one of the most advanced examples found on the topic of document viewers. They use a combination of technologies such as the Adore Djatoka image server, ImageMagick, and the Internet Archive BookReader. Their repository contains mostly multi-page PDF files of scanned books and newspaper issues. Fabio Kepler said "When we started digitizing, many files had more than 100 MB ... the issue worrying us was that the user was forced to download the file in order to view it and decide if it was of one's interest." Their implementation included some major modifications to the XMLUI module and to a small part of the DSpace core in order to make it possible to get the files' paths from the assetstore.
Figure 3. An 1885 newspaper available at the Brasiliana Digital Library with a viewer based on the Internet Archive BookReader.
The Digital Resource Commons (DRC) is a statewide platform that enables Ohio institutions to preserve, publish, and access all types and formats of digital materials. The DRC recently launched a JP2 Viewer based on the IIPImage Server. The viewer features a toolbar with options for zooming, panning, rotating and viewing images in full screen mode; the viewer also includes a navigation box with thumbnails and the total number of pages in a document, which helps in browsing multi-page files. The viewer supports JP2 and TIF files and is currently being deployed in more DSpace instances either as a new theme or as an additional XSLT template in existing themes.
Figure 4. An 1874 Atlas Map available at DRC-OhioLINK with a viewer based on the IIP Image Server.
Texas A&M Libraries
@mire is a commercial organization that develops modules for DSpace. Their Document Streaming module provides access to document files using Scribd's iPaper viewer. This Flash-based viewer features options for zooming, printing, and full-screen viewing. In contrast to other examples presented in this article, this alternative does not support JP2 or TIF file formats. It also requires a copy of the file to be uploaded automatically to the scribd.com servers, although users can make the files fully private.
Figure 6. A 1916 Newspaper available at the West Texas Digital Archives with a viewer provided by @mire.
On the assumption that the number of digital repositories hosting digitized content continues to grow, the DSpace community will probably benefit from an "out-of-the-box" Document Viewer, especially one that supports large, high-resolution, and multi-page objects. A further study of the data available in the DSpace Registry can help identify a list of institutions with common requirements for and implementations of image viewers, which can also help create a group of developers to formalize this work in a DSpace release.
The growing popularity of JPG2000 as a preferred format in many repositories, for both access and long-term preservation, seems to facilitate the implementation of new viewers as well. The current "state-of-the-art" in Web Document Viewers for complex and large image files suggests that there are several technologies that can allow users to preview individual pages of digitized multi-page objects. Those viewers also give users the ability to perform key interactions such as zooming, panning, or turning pages.
Examples of current technologies presented in this article include: Djatoka Image Server, IIP Image Server, DjVu Libre, and the Internet Archive BookReader. A more recent development is the Diva multi-page document viewer (Hankinson et al., 2011), which works as a "document viewer designed to present high-resolution digitized images as a continuous and scrollable items." In the mid-to-long term, an ideal DSpace viewer should allow users to preview, zoom, and perform full-text searches with keyword highlighting. For web usability and accessibility, such a viewer should probably be device independent and avoid the need for either plug-ins or stand-alone applications.
The author would like to thank John Millard, Head of the Center for Digital Scholarship at Miami University, for helping in the deployment of OpenZoom and DjVu Viewers; and Fábio Kepler, Developer and Researcher at the Brasiliana Digital Library, for providing feedback on the viewers they have implemented.
- Hankinson, A., Liu, W., Pugin, L., and Fujinaga, I. (2011) "Diva.js: A Continuous Document Viewing Interface". Code4Lib Journal, Issue 14. Available at: http://journal.code4lib.org/articles/5418 [Accessed August 17, 2011]
- Kopf, J., Uyttendaele, M., Deussen, O., and Cohen, M. F. (2007) "Capturing and viewing gigapixel images". ACM Transactions on Graphics, Vol. 26 No. 3. Available at: http://doi.acm.org/10.1145/1275808.1276494 [Accessed July 2, 2011]
- Lynch, C.A. and Lippincott, J.K. (2005) "Institutional Repository Deployment in the United States as of Early 2005". D-Lib Magazine, Vol. 11 No. 9. Available at: http://www.dlib.org/dlib/september05/lynch/09lynch.html [Accessed July 15, 2011]
- Zhou, Y. (2010) "Are Your Digital Documents Web Friendly?: Making Scanned Documents Web Accessible". Information Technology & Libraries, Vol. 29 No. 3. Available at: http://www.ala.org/ala/mgrps/divs/lita/publications/ital/29/3/zhou.pdf [Accessed June 26, 2011]
- DSpace Registry: <http://www.dspace.org/whos-using-dspace>
- DSpace Tech Listserv - Online Document Viewer Functionality: <http://tinyurl.com/viewers-for-DSpace>
- DuraSpace Wiki - End User Apps: Simple Interfaces to Complete Solutions: <https://wiki.duraspace.org/display/FEDORACREATE/Complete+Solutions>
- Djatoka Jpeg 2000 Image Server: <http://djatoka.sourceforge.net/>
- DuraCloud - Image Serving: <http://www.duracloud.org/image_serving>
- IIPImage server: <http://iipimage.sourceforge.net/>