The TheOREM project exercised the OAI-ORE protocol in the context of chemical theses by modeling the thesis as an aggregate of chapters and supporting information, and by proposing mechanisms to leverage ORE in a hypothetical scenario describing a thesis submission and consequent deposit and publication in an Open Access Institutional Repository. The ICE (Integrated Content Environment) (Sefton 2006) extension to that project showed how chemical theses could be authored in a word processing environment, following from proof of concept work presented at the Electronic Theses and Dissertations conference in 2007 (Murray-Rust 2007). We have been able to demonstrate theses that are both 'supported' by data in Neylon's terms (Neylon 2008) and are datuments (Murray-Rust & Rzepa 2004) in that they are hypertext aggregations of document and data, which are both human and machine-readable.
ICE-TheOREM was a joint project between the University of Cambridge (UC) and the University of Southern Queensland (USQ) funded by the JISC. At USQ, there was a team involved in this work: Oliver Lucido, Ron Ward, Linda Octalina, Bronwyn Chandler and Duncan Dickinson all assisted in programming and project management. At Cambridge, Nick Day was the implementer, with support from Joe Townsend.
The technical motivation for the project was to test the OAI-ORE standard, asking:
There are various approaches for using ORE with the SWORD-APP protocol to move theses and linked data from one system (the client in the context of this transfer) to a repository system (the server). The most practical option turned out to be one that used AtomPub to transport ORE Resource Maps (ReMs), and consequently did not use SWORD's support for content packaging.
The project designed a thesis process based on a purely web architecture, looking for interoperability between systems and looking towards a repository modeled as a set of services. This idea is promoted by repository theorists but is lacking in most IR installations, which tend to be monolithic.
From the point of view of open access, we tested whether disaggregation of theses into chapters could promote open access, by making it easier to disseminate open parts via existing OA systems and hold back embargoed content in an upstream system. In this model we posited that promoting embargo of sensitive chapters could accelerate publication of the remainder of the thesis. To explore this we had to look at issues such as:
Answers to these questions emerge in the project narrative below, which is organized to follow the life-cycle of a thesis.
In this paper we follow the life-cycle of a thesis, starting with writing and supervision, then examination and deposit, showing where the ICE-TheOREM project (Jacobs 2008) has produced proof-of-concept innovations that promise to improve on current repository practice. While the project was exploratory in nature, there have been some concrete outcomes.
The thesis is authored in the ICE system, which also handles the interactions between candidate and supervisor through annotations. When the thesis is submitted, a copy is sent to the Thesis Management System (TMS) and also to a hypothetical system representing the Graduate Studies Office (GSO), where it is sent or made available to the examiners. We did not prototype mechanisms for distributing copies to examiners or model the examination and correction process, but chose to focus on the interactions between the systems shown. When all necessary corrections have been made and an updated version has been transferred to the TMS, the GSO sends a message to the TMS to make the thesis available to the Institutional Repository (IR), and another to the IR that it should start to collect and republish the thesis.
This design was motivated by a desire to prototype a system that could work without an institutional mandate for thesis deposit or Open Access, and that decentralized control over embargo. The latter is important as it is supervisors or candidates, rather than central university administration, that most often make decisions about embargo.
Illustration 1: Overall thesis workflow with thesis repository and Board of Graduate Studies (BoGS)
The centerpiece of the ICE-TheOREM project has been a Master's thesis by Malcolm Tait. This thesis was collected as part of the JISC-sponsored SPECTRa-T project (Murray-Rust 2007) in its source format (Microsoft Word .doc) with permission to process and republish. It is a typical thesis in the area of chemical synthesis, with a review of the properties of the molecules in question and previous work, followed by a discussion of the work conducted. The narrative sections of the document are interspersed with tabular data and diagrammatic representations of molecular structures, and the appendices contain large amounts of detailed procedure description and characterization data. The thesis is shown here in the ICE system, running in the virtual machine we created for the project (http://hdl.handle.net/102.100.100/32). The thesis is broken up into multiple source documents, one for each chapter or section of the work, in this case in Microsoft Word (.doc) format, but OpenOffice.org (.odt) files are also supported. The view shown here is a web-rendered view of the thesis; ICE converts each part into HTML, and also creates a PDF version.
The key features of an ICE document are highlighted in Illustration 3. It uses styles to convey structural information about a document; the author applies styles using a toolbar, and the document can be converted to HTML format or sent to a website (usually a weblog) via the Atom Publishing Protocol. In this case, though, the author does not have to click any buttons in the word processor to see the thesis in HTML. They look at it through the ICE web application, which runs on their desktop, and changes to the document are automatically reflected in the web view when the author refreshes the page. This is an important feedback mechanism which helps to improve the quality of documents created in ICE: any inconsistencies between the print and web views can be spotted by the author immediately. This contrasts with workflows where authors send documents away for processing and may not see the results for hours, days or months.
One of the key features developed for this project is the ability to link to data in a meaningful way. In this case, the image of a molecule,
The ICE system allows for stand-off annotation of documents in a way that is similar to the digress.it tool, formerly commentpress. Supervisor(s) and peers are able to comment on a document without changing it.
When the document is ready for submission for examination, the ICE-TheOREM model proposes a repository which belongs to the graduate studies office, so the thesis needs to be deposited in that repository. This could be accomplished using a 'pull' process where the repository watches the ICE system and fetches theses with a certain flag set, such as ready_for_examination, as described and prototyped in the competition entry for Open Repositories 2008, Zero Click Ingest (Monus et al. 2008). In ICE-TheOREM we have used a push system, where the candidate uses the SWORD function to send the thesis to a thesis repository. This SWORD button is now in use at USQ with the ePrints institutional repository as well, allowing authors to post completed works as soon as they have been accepted into a journal or delivered at a conference, such as an earlier version of this paper, which is available in HTML as well as PDF (Sefton et al. 2009).
The use of SWORD here is special – we are using SWORD as a transport but OAI-ORE as well, to describe the structure of the thesis as an aggregate object.
But before we look at the submission process we need to consider embargo. One of the major contributions of ICE-TheOREM is a model for granular thesis embargo, allowing individual chapters or sections to be placed under embargo. While it is likely that this will be used for reasons of commercial exploitation or privacy of research subjects, there are few safe assumptions here; we have heard of a case where a PhD graduate was happy for an entire thesis to be made open access apart from the acknowledgements section. To model this situation, in our demonstration the acknowledgements section is placed under embargo.
Embargo metadata is encoded in a style:
And ICE can extract the metadata:
Illustration 8: Metadata extracted from the document by ICE
The initial demonstration encodes embargo information using a style, using a technique developed in the ICE and ICE-TheOREM projects.
When the thesis is sent to the thesis repository via SWORD, the metadata is sent with it. We propose that the graduate studies office get the student to submit and validate an OpenID (Recordon & Reed 2006), allowing the student to authenticate and administer embargoes after their institutional login expires. While ICE and the thesis repository based on The Fascinator can both accept OpenID login, the details of managing student identity have not been worked out. Whilst the idea of using OpenID is an interesting possibility, it should be noted that embargoes that lift automatically after a fixed period of time avoid such technical and/or administrative overheads, and are likely to be preferable where thesis deposit is mandated.
The SWORD deposit contains an OAI-ORE payload.
This XML is expressing the structure of the thesis.
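As a rough illustration of the kind of structure such a payload expresses, the following sketch builds a simplified Atom entry that lists the parts of a thesis as aggregated resources. It is not the payload produced by ICE. The function name and example URLs are hypothetical, the link relations follow our reading of the ORE Atom serialization, and required Atom elements such as id, updated and author are omitted for brevity.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ORE_TERMS = "http://www.openarchives.org/ore/terms/"

# Emit the Atom namespace with a readable prefix when serializing.
ET.register_namespace("atom", ATOM_NS)


def build_resource_map(aggregation_uri, title, part_uris):
    """Return a simplified Atom entry (as text) describing an aggregation.

    Hypothetical helper: a real Resource Map carries more metadata.
    """
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    # One link from the entry to the aggregation it describes.
    ET.SubElement(
        entry,
        f"{{{ATOM_NS}}}link",
        {"rel": ORE_TERMS + "describes", "href": aggregation_uri},
    )
    # One link per chapter or data file that the thesis aggregates.
    for uri in part_uris:
        ET.SubElement(
            entry,
            f"{{{ATOM_NS}}}link",
            {"rel": ORE_TERMS + "aggregates", "href": uri},
        )
    return ET.tostring(entry, encoding="unicode")


if __name__ == "__main__":
    print(build_resource_map(
        "http://ice.example/thesis/1#aggregation",
        "Example thesis",
        ["http://ice.example/thesis/1/ch1.htm",
         "http://ice.example/thesis/1/ch2.htm"],
    ))
```

Because each part appears as its own link, a recipient can treat chapters and data files individually, which is what the embargo handling discussed later relies on.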
ORE is important to this work because it:
(Currently repositories such as ePrints and DSpace do not do this at all well; content models for repository items are usually implicit.)
The use of ORE to describe a thesis as an aggregate of objects makes it easier to implement fine-grained embargoing, as in illustrations 12 and 13, which show the ICE-TheOREM mock-up repository, implemented using The Fascinator (Sefton & Lucido 2009) to serve theses from a Fedora 3 repository. Further discussion of the mechanics of using ORE to implement embargoed transfer can be found in Section 2.5, below.
Whereas if an administrator is logged in then the acknowledgements are visible.
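The per-part visibility rule described here can be sketched as a simple filter over the aggregated resources. This is an illustration only, not code from The Fascinator: the Part class, its field names and the visible_parts function are hypothetical, and we assume an embargo is recorded as a single release date.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Part:
    """One aggregated resource of a thesis (hypothetical model)."""
    uri: str
    title: str
    embargo_until: Optional[date] = None  # None means never embargoed


def visible_parts(parts: List[Part], today: date,
                  is_admin: bool = False) -> List[Part]:
    """Return the parts a viewer may see on the given day."""
    if is_admin:
        # Administrators see everything, including embargoed parts.
        return list(parts)
    # Other viewers see parts with no embargo or an embargo that has lifted.
    return [p for p in parts
            if p.embargo_until is None or p.embargo_until <= today]
```

With the acknowledgements carrying an embargo date in the future, an anonymous viewer would receive every part except the acknowledgements, while an administrator would receive the full list.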
The thesis repository is underdeveloped, with more work to do, but in a production version of the model presented here, the thesis repository would feed the institutional repository, using some approach to incremental embargo release (see Section 2.5).
There are various approaches for using ORE with the SWORD-APP protocol to move theses and linked data from one system (the client in the context of this transfer) to a repository system (the server), distinguished primarily by the way they use the ORE Resource Map (ReM):
In this approach the entire contents of the thesis, including a ReM in a file that acts as a manifest, are bundled into a single package archive file, such as a zip file or tape archive (tar) file, which is then transferred using SWORD-APP. The ReM includes relative URLs to refer to, and describe, the contents. This is the specific situation for which SWORD was designed. The main advantage is that the client does not need to act as a server as well. Because all of the aggregate object's parts are transferred in a single HTTP request, this approach provides transactional guarantees that are more difficult to implement in the other approaches. The disadvantages of this approach are that it precludes the benefits of pass-by-reference of the other approaches, and replicates content in a way that makes it difficult to track and resolve copies.
In terms of the initial transaction, these approaches are identical; the ReM is transferred using AtomPub. The SWORD extensions to support content packages are consequently not used, and the SWORD profile extensions will only be useful if there is some other requirement for them (e.g. mediated submission using OnBehalfOf). The difference is that in the Shopping List approach the server expects to dereference all of the aggregated resources immediately in order to republish them at new URLs, whereas the Road Signs approach uses the original URLs as they are, dereferencing only when necessary to access content. An important discriminatory feature between the approaches is the effect on access control. Since the server in the Shopping List approach republishes resources, access control is performed independently by the two systems, probably without co-ordination. We used the Shopping List approach in ICE-TheOREM, as it was more appropriate to separate concerns by having the repository system deal with access control and embargo management (discussed later), leaving the Thesis Management System free of these concerns. In other situations it would be more appropriate for the originating system to have the sole duty for access control, indicating the Road Signs approach as a better fit.
As indicated earlier, we were unable to fully implement a demonstration of incrementally transferring partially embargoed theses to IR software, but here we propose three potential mechanisms that use standard web mechanisms and ORE to implement the transfer of an aggregation as embargoes on parts of the aggregation lift over time. All three use SWORD + ORE to transfer the thesis ReM using the Shopping List recipe (described in Section 2.4.2), but have different approaches to incremental embargo release: pull (polling resource), pull (polling ReM), and (re-)push.
In the first scheme (Illustration 15: Scheme 1) the thesis recipient is told of all the resources in the aggregation and polls each of them. It is up to the sending system to protect embargoed resources using HTTP authentication. This scheme requires the sending system to be able to act as a server as well as a client, and requires the recipient system to be able to authenticate with the sender. Continued requests for a resource that might stay embargoed for several years are also inefficient, although probably not problematically so, since long (e.g. monthly) polling intervals would probably be acceptable.
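The contrast between the two approaches can be sketched from the server's side as follows. This is a simplified illustration rather than the project's implementation: the function names, the fetch callable and the URL layout for republished parts are all hypothetical.

```python
from typing import Callable, Dict, List, Optional


def ingest_shopping_list(part_urls: List[str],
                         fetch: Callable[[str], bytes],
                         repo_base: str) -> Dict[str, bytes]:
    """Shopping List: dereference every part now and republish it.

    Returns a mapping from the new (server-side) URL to the copied content.
    """
    republished = {}
    for index, url in enumerate(part_urls, start=1):
        new_url = f"{repo_base}/parts/{index}"  # hypothetical URL scheme
        republished[new_url] = fetch(url)
    return republished


def ingest_road_signs(part_urls: List[str]) -> Dict[str, Optional[bytes]]:
    """Road Signs: keep the original URLs and fetch nothing up front.

    Content stays with the originating system until it is needed.
    """
    return {url: None for url in part_urls}
```

In the first function the server ends up holding its own copies and so must apply its own access control; in the second the originating system remains the only place where content, and therefore access control, lives.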
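One polling pass under this first scheme might look like the sketch below. It assumes, as an illustration only, that the sender answers requests for still-embargoed resources with a non-200 status such as 401 or 403; the function name and the fetch callable are hypothetical, and real HTTP access is replaced by whatever fetch is supplied.

```python
from typing import Callable, Dict, List, Optional, Tuple


def poll_resources(
    pending: List[str],
    fetch: Callable[[str], Tuple[int, Optional[str]]],
) -> Tuple[Dict[str, str], List[str]]:
    """Try each pending URL once.

    Returns (collected, still_pending): content for resources that were
    released, and the URLs that must be polled again later.
    """
    collected: Dict[str, str] = {}
    still_pending: List[str] = []
    for url in pending:
        status, body = fetch(url)
        if status == 200 and body is not None:
            collected[url] = body
        else:
            # Assumed behaviour: refused while the embargo is in force.
            still_pending.append(url)
    return collected, still_pending
```

The recipient would repeat this pass at a long interval, feeding still_pending back in until it is empty.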
The second scheme (Illustration 16: Scheme 2) has the recipient system polling the ReM representing the thesis, and takes advantage of HTTP caching mechanisms (e.g. Entity Tags) to inform the recipient when an update has occurred. The inefficiencies of polling are slightly less problematic than scheme 1 (there are likely to be fewer ReMs than embargoed resources), and this scheme hides the URLs of embargoed resources, which might be desirable. It would also be possible to modify the ReM in other ways than simply omitting resources – for example including a blank or redacted version of a chapter rather than omitting it completely (important for pagination if the parts are to be automatically reassembled). The primary downside to this scheme is additional complexity; the recipient needs a way of discovering that a ReM is partial, and when there are likely to be no more updates, probably by including additional data in the ReM itself.
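The use of Entity Tags in this second scheme can be sketched with a conditional request: the recipient remembers the tag of the last ReM it saw and is told "not modified" until the ReM changes. The classes below are hypothetical, and the sending system is simulated in memory rather than contacted over HTTP, so the example shows only the control flow.

```python
import hashlib
from typing import Optional, Tuple


class FakeRemServer:
    """In-memory stand-in for the sending system (not a real HTTP server)."""

    def __init__(self, body: str):
        self.body = body

    def get(self, if_none_match: Optional[str] = None
            ) -> Tuple[int, str, Optional[str]]:
        # Derive the entity tag from the current content of the ReM.
        etag = '"' + hashlib.sha1(self.body.encode("utf-8")).hexdigest() + '"'
        if if_none_match == etag:
            return 304, etag, None  # unchanged since the last poll
        return 200, etag, self.body


class RemPoller:
    """Recipient side: remembers the last entity tag it received."""

    def __init__(self, server: FakeRemServer):
        self.server = server
        self.etag: Optional[str] = None
        self.rem: Optional[str] = None

    def poll(self) -> bool:
        """Return True if a new version of the ReM was received."""
        status, etag, body = self.server.get(if_none_match=self.etag)
        if status == 304:
            return False
        self.etag, self.rem = etag, body
        return True
```

When an embargo lifts, the sender republishes the ReM with the extra resource, the tag changes, and the next poll returns the updated ReM.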
In the third scheme (Illustration 17: Scheme 3) the ReM is simply sent to the recipient whenever the embargo lifts on part of the thesis. This requires the sender and recipient to agree on an identifier in the ReM to identify the thesis; it makes sense to use the URI for the aggregation (the URI-A). This scheme is particularly suited for situations in which the recipient is to be the primary point of publication for the thesis, and has the additional advantages that (like scheme 2) it allows redacted versions of chapters to be substituted for complete ones, and that the sender is not required to act as a server, simply as a client.
This section is a brief description of the technology used in this work. The main applications, ICE and The Fascinator, are both evolving open systems, with websites that track ongoing development. Code and documentation are available for both.
ICE is an open source application written in the Python programming language, available for Windows, Mac and Linux platforms. It simplifies management of a set of files managed using the Subversion revision control system by providing the user with a web view of content either on their local storage system or on a centralized server. ICE orchestrates content conversion from various file formats to web-ready formats using an extensible plugin system. Its original and main focus is on providing good quality HTML output for word processing documents using the OpenOffice.org application as part of the conversion system.
The Fascinator is a flexible repository component which provides a rich faceted index of content via the Apache Solr text indexer, with support for OAI-PMH ingest and dissemination. It is written in Java, with some interface and indexing plugins in Jython, to enable easy customization. The version used in the work reported here was tied to the FedoraCommons repository back-end, but current versions offer a choice of back-end storage systems via an API, with a simple file-based storage layer currently available as an alternative to Fedora.
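On the recipient side, the re-push scheme amounts to an update keyed on the aggregation URI. The following sketch assumes each pushed ReM has already been parsed into that URI plus a list of part URLs; the ThesisStore class and its method are hypothetical.

```python
from typing import Dict, List


class ThesisStore:
    """Recipient-side record of theses, keyed by aggregation URI."""

    def __init__(self) -> None:
        self.by_aggregation: Dict[str, List[str]] = {}

    def receive(self, uri_a: str, part_urls: List[str]) -> List[str]:
        """Create or update a thesis from a pushed ReM.

        Returns the part URLs that were not present in the previous push,
        i.e. the parts that have just been released.
        """
        previous = self.by_aggregation.get(uri_a, [])
        newly_released = [u for u in part_urls if u not in previous]
        # The latest push replaces the stored list, so a redacted part can
        # later be swapped for the complete one under a different URL.
        self.by_aggregation[uri_a] = list(part_urls)
        return newly_released
```

Because the same aggregation URI is used on every push, a second push for a thesis updates the existing record instead of creating a duplicate.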
Evolved and tested versions of the SWORD + ORE work described here will be released as part of The Fascinator in 2010.
To summarize, innovations in the workflow/lifecycle of a thesis include:
To summarize our work on workflow, ICE-TheOREM has followed existing academic workflows for authoring, examination and repository deposit. This work provides a proof of concept for true born-digital web eTheses. Embargo is handled by making sure that the requirements of the various stakeholders and parties are taken into account. Metadata about embargo is to be entered by the person best placed to know the requirement, the candidate, and we have recommended using an OpenID to identify the candidate so that the embargo can be managed even if they no longer have an institutional account.
The outcomes of the ICE-TheOREM project are summarized here:
The work reported here is a proof of principle for the ORE technology and a first step towards larger scale trials of repository-integrated thesis authoring workflows. A PhD thesis takes years to complete, so a true test of this infrastructure will involve a long-term commitment. This commitment is being made at the Australian Digital Futures Institute: beginning in 2009, all the theses being completed by institute staff and affiliates are housed in a system derived from the TheOREM work.
Further work starting now includes small-scale trials with PhD candidates and conversion of recent theses into ICE at USQ. But much more work is required:
Downing, J., Allinson, J. et al., 2009. SWORD AtomPub Profile. Available at: http://swordapp.org/sword/specifications [Accessed September 15, 2009].
Jacobs, N., 2008. Departmental Thesis Management System development using the Integrated Content Environment (TheOREM-ICE). Available at: http://www.jisc.ac.uk/whatwedo/programmes/digitalrepositories2007/theorem-ice.aspx [Accessed July 14, 2008].
Monus, L. et al., 2008. Zero Click Ingest. Available at: http://pubs.or08.ecs.soton.ac.uk/119/ [Accessed May 20, 2008].
Murray-Rust, P., 2007. The Power of the Electronic Scientific Thesis. 10th International Symposium on Electronic Theses and Dissertations. Available at: http://epc.ub.uu.se/ETD2007/sessions/keynote-2.html?keepThis=true&TB_iframe=true&height=480&width=640 [Accessed September 8, 2009].
Murray-Rust, P. & Rzepa, H.S., 2004. The Next Big Thing: From Hypermedia to Datuments. Journal of Digital Information, 5(1), 248. Available at: http://journals.tdl.org/jodi/article/viewArticle/130/128
Neylon, C., 2008. Science in the open » A personal view of Open Science - Part IV - Policies and standards. Available at: http://blog.openwetware.org/scienceintheopen/2008/10/26/a-personal-view-of-open-science-part-iv-policies-and-standards/ [Accessed February 5, 2009].
Recordon, D. & Reed, D., 2006. OpenID 2.0: a platform for user-centric identity management. Proceedings of the second ACM workshop on Digital identity management, 11-16. Available at: http://portal.acm.org/citation.cfm?id=1179532
Sefton, P., 2006. The integrated content environment. In AUSWEB 2006. Noosa: Southern Cross University. Available at: http://eprints.usq.edu.au/archive/00000697/01/Sefton_ICE-ausweb06-paper-revised-3.pdf .
Sefton, P., Downing, J. & Day, N., 2009. ICE-theorem - end to end semantically aware eResearch infrastructure for theses. University of Southern Queensland. Available at: http://eprints.usq.edu.au/5248/1/ice-theorem-paper-OR09.htm [Accessed August 24, 2009].
Sefton, P. & Lucido, O., 2009. The Fascinator: a lightweight, modular contribution to the Fedora-commons world. In Atlanta, Georgia. Available at: http://eprints.usq.edu.au/5259/ .