Intermediary schemas for complex XML applications: an example from research information management

Richard Gartner
Centre for e-Research, King's College, London
richard.gartner@kcl.ac.uk

Abstract

The complexity and flexibility of some XML schemas can make their implementation difficult in working environments. This is particularly true of CERIF, a standard for the interchange of research management information, which consists of 192 interlinked XML schemas. This article examines a possible approach of using 'intermediary' XML schemas, and associated XSLT stylesheets, to make such applications easier to employ. It specifically examines the use of an intermediary schema, CERIF4REF, which was designed to allow UK Higher Education institutions to submit data for a national periodic research assessment exercise in CERIF. The wider applicability of this methodology, particularly in relation to the METS standard, is also discussed.

1. Introduction

The complexity of important XML schemas may often present a major hurdle to their adoption, particularly in cases where they are required in environments which do not already have significant experience in XML authoring or editing. These problems may be exacerbated in cases where these schemas are highly flexible, the lack of constraint in the way in which they can be used often requiring extensive work on initial information architectural design before they are implemented in practice: the Text Encoding Initiative (TEI), for instance, usually requires a process of tag selection and semantic specification (as described, for instance, in Welty and Ide 1999, 62) unless a pre-constrained variant (such as TEI-Lite) is employed.

Such problems of complexity and over-flexibility become more acute in applications which employ multiple, linked XML files to capture the complexities of their required architectures. In such cases, not only must the elements and other components be determined, but also the form of the linkages between files (including the definition of any ontologies used to define any links with semantic meaning). The possibility of offering an 'off-the-shelf' scheme with a minimal learning curve becomes less likely the greater the complexity of the overall information environment to be encoded, and the potential take-up of any such schemas will inevitably become more limited.

This article examines one potential approach to obviating these problems in the form of 'intermediary' XML schemas and XSLT stylesheets. The context in which it is examined is that of the CERIF format (European Organisation for International Research Information 2010b), a complex application designed to facilitate the interoperability of research management information. CERIF is maintained by euroCRIS, the European Organisation for International Research Information, and offers a comprehensive model for all metadata necessary to maintain current research information systems (CRISs) in a readily interchangeable form. The CERIF model, originally instantiated as a set of relational SQL tables but since 2006 available in XML, is based on a small number of core components and an extensive set of linkages between these which can mirror their often complicated inter-relationships. Such an approach has the benefit of being able to fulfil the metadata requirements of almost any operating CRIS, but also the downside of potentially great complexity.

The methodology advocated here to alleviate some of this complexity is an intermediary XML schema, with associated XSLT stylesheets, which is used to select a relevant subset of the CERIF components and constrain the manner in which they are employed. This technique was employed in the Readiness for REF (R4R) project (Centre for e-Research 2011), from the United Kingdom's higher education community, which sought to examine the feasibility of employing CERIF in the periodic research assessment exercises used to determine the allocation of research funding to universities and other research institutions in that country. Its aim was to render the CERIF standard, which the higher education funding body had specified as a format for submissions to the next exercise in 2014, a feasible option for the majority of institutions which had not previously found it viable.

2. CERIF and its implementation challenges

The need to share information required for research management has long been recognised, particularly where this research is publicly funded and there is a consequent onus to ensure transparency in determining the allocation of funding and ensuring that it is well spent. In addition, the international nature of much research collaboration also requires this information to be readily shared across national (and often linguistic) boundaries. Early work on rationalising the metadata necessary for sharing information of this type began in Europe in the 1980s and eventually produced the CERIF standard, which has undergone several major revisions (in 2000, 2004 and 2006) since its first appearance in 1991 (European Organisation for International Research Information 2010c).

CERIF was conceived as a data model which is independent of any given syntax, although it was initially made available as SQL tables and later as a series of XML schemas. The model defines three 'base' entities (project, person and organisation unit), all of which include only very basic metadata (including, crucially, unique IDs), as shown in the diagram below:-

Figure 1. CERIF Base entities.

In addition, the CERIF core provides basic metadata for research outputs ('results') including publications, patents and products:-

Figure 2. CERIF Result entities.

and a secondary tier of metadata components, which encode concepts of potential relevance to any of these base entities, such as addresses (postal or electronic), countries or events.

Figure 3. CERIF Base, Result and 2nd Level entities.

A final group of entities, and an additional layer of complication for the implementer, allows CERIF to function in a multi-lingual environment. Any textual data capable of translation into multiple languages, such as publication titles, abstracts, descriptions of funding programs or descriptions of research environments, must be encoded using these entities: the following fragment, for instance, contains multiple language versions (English, Italian and German) of the title of this paper:-

                <cfResPublTitle>
                    <cfResPublId>1abcdefghijklmnop</cfResPublId>
                    <cfTitle cfLangCode="en-UK">Intermediary schemas for complex XML applications: an example from research information management</cfTitle>
                    <cfTitle cfLangCode="it">Schemi intermediari per complesse applicazioni XML: un esempio della gestione e ricerca delle informazioni</cfTitle>
                    <cfTitle cfLangCode="de">Intermediäre Schemata für komplexe XML-Anwendungen: ein Beispiel aus der Forschung im Bereich Informations-Management</cfTitle>
                </cfResPublTitle>
                
            

The core of any CERIF application, and the bulk of the standard itself, consists of an extensive series of linking tables or XML files which allow these base and second level entities to be joined together. These linking entities (of which there are currently 95, far more than the 3 Base and Results entities and 16 Second-level entities (Jörg et al. 2010, 36-38)) allow for a potentially complex set of relationships to be expressed which can model almost any research environment: one linkage, for example, may be used to join a member of staff to a research group, a research output, an event (such as a conference presentation), or to other researchers (for instance project collaborators). Similarly a research publication may be joined to its funding stream, to a conference at which it is presented or to a prize awarded for it.

These entities rely upon either user-defined, or preferably pre-published, semantic schemes to assign meaning to the linkages within a given application. The following XML fragment, for instance, is an example of part of the linking entity which joins research staff to their publications or other research outputs:-

                <cfPers_ResPubl>
                    <cfPersId>cerch-2008-001</cfPersId>
                    <cfResPublId>publ-9999-a-0001</cfResPublId>
                    <cfClassId>is-author-of</cfClassId>
                    <cfClassSchemeId>cerif-person-pub-roles</cfClassSchemeId>
                    <cfStartDate>2001-01-07T00:00:00-00:00</cfStartDate>
                    <cfEndDate>2099-12-31T00:00:00-00:00</cfEndDate>
                </cfPers_ResPubl>
            

Here the person identified by the cfPersId element is linked to the publication (designated by the cfResPublId element) of which he/she is the author: to designate this semantic relationship, it is necessary to declare both the semantic scheme used (in this case one designated by the identifier in cfClassSchemeId) and the semantic term itself (identified in cfClassId).
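The way a consuming application might read such a linking record can be sketched as follows. This is a minimal illustration in Python (standard library only), not part of CERIF itself: the KNOWN_TERMS vocabulary, including the term 'is-editor-of', is a hypothetical stand-in for a published semantic scheme, and the record is a shortened copy of the fragment above.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple

# Shortened copy of the cfPers_ResPubl fragment shown in the text.
LINK = """
<cfPers_ResPubl>
    <cfPersId>cerch-2008-001</cfPersId>
    <cfResPublId>publ-9999-a-0001</cfResPublId>
    <cfClassId>is-author-of</cfClassId>
    <cfClassSchemeId>cerif-person-pub-roles</cfClassSchemeId>
</cfPers_ResPubl>
"""

# Hypothetical vocabulary: scheme identifier -> terms it defines.
# A real application would load this from an agreed, published scheme.
KNOWN_TERMS = {"cerif-person-pub-roles": {"is-author-of", "is-editor-of"}}


class Link(NamedTuple):
    person: str
    publication: str
    scheme: str
    term: str


def read_link(xml_text):
    """Extract the two identifiers and the semantic scheme/term from a link record."""
    root = ET.fromstring(xml_text)
    return Link(
        person=root.findtext("cfPersId"),
        publication=root.findtext("cfResPublId"),
        scheme=root.findtext("cfClassSchemeId"),
        term=root.findtext("cfClassId"),
    )


def is_interpretable(link, vocabulary=KNOWN_TERMS):
    """A link carries meaning only if its term belongs to a scheme the receiver knows."""
    return link.term in vocabulary.get(link.scheme, set())
```

A record whose scheme is unknown to the receiving system still parses correctly, but is_interpretable returns False for it, which is the loss of interoperability at issue here.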

Using CERIF in a real-world application therefore requires the identification or definition of an extensive set of semantic schemes and their consistent application. This may be particularly problematic as CERIF records lose much of their interoperability without the application of a coherent semantic scheme. Although euroCRIS have themselves published a core set of semantic terms (European Organisation for International Research Information 2010a) which would, if widely adopted, move towards resolving this problem, this set at present covers only a proportion of the relationships likely to be required in a real-world application.

The complexities involved in implementing CERIF should be apparent from even this short introduction to it. This complexity is not alleviated in the XML instantiation of the CERIF data model, which translates directly the structure and content of the original relational database tables that formed CERIF's first version as a SQL application. This approach was adopted with good reason, particularly to retain the degree of flexibility present in the original data model, which would be very difficult to replicate in a single XML schema. Using XML does result in the loss of some potentially valuable integrity rules (such as uniqueness and referential integrity) which are present in the SQL model of CERIF (Jeffery, Lopatenko & Asserson 2002, 80), but the essential ability to model complex research environments remains intact in this use of XML. The disadvantage is the verbosity and complexity involved in employing the 192 XML schemas which form the model in this format.

3. Constraining the XML metadata universe

Much of the preceding discussion leads to the conclusion that for CERIF to achieve its potential as a medium for the interoperability of research information it requires some degree of constraint in its application. In addition, the terms under which it is constrained (for instance, the choice of schemas and the semantics to be employed) need to be adopted in an environment that extends beyond a single institution (preferably the whole research management community).

XML as a language offers fewer opportunities for constraining and validating content than are available in, for instance, a conventional relational database: the XSD schema language only allows constraint by domain range (constraining values, usually numerical, to a given range), mixed content (constraining the number and order of child elements for complex element types) and cardinality (minimum and maximum occurrences of an element) (Jacinto et al. 2002, 29). As is well known, XML validation procedures can test the conformance of a document syntactically but not semantically (Jacinto et al. 2002, 2); this requires any desired semantic constraint to be hard-coded into a schema (for instance, in the form of an enumerated list) so that it can be validated as if it were a syntactic rather than a semantic requirement. The possibilities offered by such simple validation procedures as lists or constrained attribute values are very limited, however, and the requirement for them to be incorporated into the schema when it is written makes their potential irrelevant when seeking to constrain standards such as CERIF which have already been published.

Two well-established methods for constraining the content of XML files already exist in the form of XCSL (XML Constraint Specification Language) (Jacinto et al. 2004) and the more widely-used Schematron (International Standards Organisation 2006), both of which offer the possibility of validating the content of an XML file in addition to its syntactical conformance. Both work by allowing conditions for element contents to be tested against specified contexts: for instance, the content of an ISBN element can be checked to ensure that it conforms to the required format and that it validates correctly against its check digit. Both also offer the possibility of conditional validation, so allowing, for instance, the value of a given element or attribute to dictate the structure or content of other components of a file: a poem, for instance, with a type attribute which can be set to such values as 'sonnet' or 'quatrain' could have its overall structure validated according to the value of this attribute (Jacinto et al. 2002, 14-20).
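The ISBN case gives a sense of what such content validation involves. The following is a minimal sketch in Python rather than in XCSL or Schematron syntax, and it assumes the ten-digit form of the ISBN; the isbn element name and the function names are illustrative assumptions, not drawn from either language.

```python
import re
import xml.etree.ElementTree as ET


def isbn10_is_valid(text):
    """Check the format of an ISBN-10 and verify its check digit.

    The ten characters are weighted 10 down to 1; a final 'X' counts as 10.
    The number is valid when the weighted sum is divisible by 11.
    """
    digits = text.replace("-", "").replace(" ", "").upper()
    if not re.fullmatch(r"[0-9]{9}[0-9X]", digits):
        return False
    total = sum(
        (10 - position) * (10 if char == "X" else int(char))
        for position, char in enumerate(digits)
    )
    return total % 11 == 0


def invalid_isbns(xml_text, element_name="isbn"):
    """Return the content of every element of the given name that fails the check."""
    root = ET.fromstring(xml_text)
    return [
        element.text
        for element in root.iter(element_name)
        if not isbn10_is_valid(element.text or "")
    ]
```

A schema can guarantee that an isbn element is present and that it matches a pattern, but a check of this arithmetic kind needs a rule language or procedural code layered on top of it.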

Neither solution is ideal for the challenges presented by a complex CERIF application. A comprehensive validation system would require 192 XCSL or Schematron files, one for each potential CERIF XML instance (although it is unlikely that any given application would in reality use anything near this potential number). The validation of CERIF's complex linkages is possible using either approach (for instance, by employing the XSLT document() function within a Schematron file), but rapidly becomes complicated and hard to maintain accurately once the number of linkages exceeds a relatively small number. In addition, for relatively non-technical users, the further validation steps required to use XCSL or Schematron add to the gradient of the learning curve for CERIF implementation.
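What validating a single linkage across files entails can be sketched as below, again in Python rather than Schematron. The wrapper elements (cfPersList, cfResPublList, links) and the second person identifier are assumptions made for the illustration; only the cfPers_ResPubl, cfPersId and cfResPublId names come from the examples in this article.

```python
import xml.etree.ElementTree as ET

# Three separate documents, standing in for three CERIF files.
PERSONS = "<cfPersList><cfPers><cfPersId>cerch-2008-001</cfPersId></cfPers></cfPersList>"
PUBLICATIONS = (
    "<cfResPublList><cfResPubl><cfResPublId>publ-9999-a-0001</cfResPublId>"
    "</cfResPubl></cfResPublList>"
)
LINKS = """
<links>
    <cfPers_ResPubl>
        <cfPersId>cerch-2008-001</cfPersId>
        <cfResPublId>publ-9999-a-0001</cfResPublId>
    </cfPers_ResPubl>
    <cfPers_ResPubl>
        <cfPersId>cerch-2008-002</cfPersId>
        <cfResPublId>publ-9999-a-0001</cfResPublId>
    </cfPers_ResPubl>
</links>
"""


def ids_in(xml_text, id_element):
    """Collect the text of every identifier element of the given name."""
    root = ET.fromstring(xml_text)
    return {e.text.strip() for e in root.iter(id_element) if e.text}


def dangling_links(links_xml, persons_xml, publications_xml):
    """List (element name, value) pairs in link records that point at nothing."""
    persons = ids_in(persons_xml, "cfPersId")
    publications = ids_in(publications_xml, "cfResPublId")
    problems = []
    for link in ET.fromstring(links_xml).iter("cfPers_ResPubl"):
        person = link.findtext("cfPersId", "").strip()
        publication = link.findtext("cfResPublId", "").strip()
        if person not in persons:
            problems.append(("cfPersId", person))
        if publication not in publications:
            problems.append(("cfResPublId", publication))
    return problems
```

Each further linking entity needs its own version of this check against its own pair of target files, which is where the maintenance burden described above comes from.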

4. CERIF in the context of the Research Excellence Framework (REF)

These problems became all too evident in the context of a research project undertaken at King's College London which sought to make CERIF a viable option for institutional submissions to research assessment exercises. The Readiness for REF (R4R) project (Centre for e-Research 2011) takes its name from the UK Government's Research Excellence Framework (REF) programme (Higher Education Funding Council for England 2010), the latest in a series of regular assessments of the research outputs of higher education institutions on the basis of which research funding is allocated to these bodies. It has been announced that the next exercise, due in 2014, will accept submissions in CERIF as its preferred format, although no detailed specification of the CERIF implementation envisaged has yet been published.

To test the feasibility of using CERIF as the medium for submissions, the R4R project undertook a detailed mapping to it of the data requirements from the previous exercise, undertaken in 2008 (Higher Education Funding Council for England 2008). This exercise, at that time called the Research Assessment Exercise (RAE), allowed submissions in XML which conformed to a schema devised specifically for this purpose. In the absence of any specification of the data requirements for the REF itself, this schema formed the basis of the mapping exercise. The results revealed the complexity of the task ahead, as most concepts in the RAE schema proved to be mappable only by employing three or four inter-linked CERIF files: for instance, linking a researcher to the title and bibliographic details of a research publication involves a minimum of four files linked as follows:-

Figure 4. Researcher/research publication linkages in CERIF.

The mapping exercise concluded that in total 19 of the 192 possible CERIF files were required to encode all of the metadata specified in the RAE exercise (Gartner & Grace 2010, 100). While this number of files was relatively small, the complexity of the linkages between them, and of the semantic vocabularies needed to form those linkages, appeared daunting, and would probably have rendered the standard an inappropriate solution for institutions without an advanced technical knowledge of CERIF and of XML itself.

5. Constraining CERIF with XSD and XSLT

This problem of complexity required some method of constraining CERIF when it is used for this specific application: this constraint is both syntactic (limiting the XML files used and the manner in which they are linked) and semantic (limiting the range of vocabularies used to enable the linkages). Before the advent of XML, SGML provided a mechanism to allow constraints of at least the former kind to be imposed: architectural processing could be used to derive small project-specific DTDs which could then be automatically mapped to larger, more complex applications constrained to the requirements of the project. Simons, for instance, uses this technique to map highly-constrained DTDs to the complex and highly flexible TEI (Text Encoding Initiative) (Simons 1998 & 1999). Such architectural processing is not available in XML, but much of its functionality can be duplicated by creating highly constrained 'intermediary' XML schemas which are then processed by XSLT to create the complex application required.

The XML schema devised for this project, called CERIF4REF, is properly termed intermediary as it mediates between the requirements of the RAE schema and CERIF. It was not possible simply to devise a stylesheet to translate RAE directly into CERIF as so much of RAE consists of aggregations from a variety of data sources. In the CERIF model, these sources are encoded explicitly, and it is impossible to disaggregate them from the summary form in which they are given in RAE: for instance, the RAE requires a simple count of full-time equivalent research assistants, whereas the CERIF model requires them to be listed individually. In addition, the RAE schema makes no use of XML IDs and IDREFs for linkages, making it impossible to validate these accurately and rendering the construction of the complex network of links in CERIF much more difficult.

The CERIF4REF schema is structured in a similar way to RAE, dividing into five sections:-

  1. research groups
  2. research personnel
  3. research outputs
  4. funding
  5. overall descriptions of the research environment

A comparison of the encoding of research groups and personnel in RAE and CERIF4REF provides an immediate example of the different approaches of the two schemas:-

Figure 5. Research Groups and Research Personnel in RAE and CERIF4REF.

Although both encode similar metadata, CERIF4REF relies heavily on XML IDs and IDREFs to establish linkages between components (for instance, between a research assistant and a supervisor, by use of the c4rResearchAssistantOf attribute in this example); RAE by contrast establishes linkages by element content (for instance the ResearchGroup1 and ResearchGroup2 elements). Using IDs in this way allows linkages to be validated with standard XML parsing rather than requiring the use of XCSL or Schematron, which the latter approach necessitates. In addition, the contents of many elements in CERIF4REF, for instance c4rPersonRole, are constrained by closed enumerated lists, and some numerical elements, such as ResearchAssistantFTE, are absent altogether as this data is calculated directly by aggregating information from elsewhere in the CERIF4REF file (in this case, the number of c4rResearchAssistantOf attributes which point to the ID of any given researcher).
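Both points, the resolution of references and the derivation of a count by aggregation, can be illustrated with a short sketch. It is written in Python under stated assumptions: the id attribute, the c4rResearchAssistant element and the wrapper element are hypothetical, as the full CERIF4REF schema is not reproduced here; only c4rResearchAssistantOf and c4rResearchActiveStaffMember are names used in this article. A validating parser would perform the first check itself from the schema's ID and IDREF declarations.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical instance: the last assistant points at an ID that does not exist.
SAMPLE = """
<c4rResearchPersonnel>
    <c4rResearchActiveStaffMember id="staff-001"/>
    <c4rResearchActiveStaffMember id="staff-002"/>
    <c4rResearchAssistant id="ra-001" c4rResearchAssistantOf="staff-001"/>
    <c4rResearchAssistant id="ra-002" c4rResearchAssistantOf="staff-001"/>
    <c4rResearchAssistant id="ra-003" c4rResearchAssistantOf="staff-999"/>
</c4rResearchPersonnel>
"""

REF_ATTRIBUTE = "c4rResearchAssistantOf"


def unresolved_references(xml_text):
    """Return reference values that match no id in the document."""
    root = ET.fromstring(xml_text)
    known_ids = {e.get("id") for e in root.iter() if e.get("id")}
    return [
        e.get(REF_ATTRIBUTE)
        for e in root.iter()
        if e.get(REF_ATTRIBUTE) and e.get(REF_ATTRIBUTE) not in known_ids
    ]


def assistants_per_researcher(xml_text):
    """Count how many reference attributes point at each researcher ID."""
    root = ET.fromstring(xml_text)
    return Counter(
        e.get(REF_ATTRIBUTE) for e in root.iter() if e.get(REF_ATTRIBUTE)
    )
```

Because the count is derived from the pointers rather than entered by hand, it cannot disagree with the list of assistants recorded in the same file.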

The encoding of research outputs also differs between the two schemas:-

Figure 6. Research Outputs in RAE and CERIF4REF.

As is the case for the research group and personnel metadata, the main difference is the increased precision allowed by the more constrained CERIF4REF schema. As before, extensive use is made of XML IDs and IDREFs (here, for instance, linking co-authors and research outputs to their authors and research groups). In addition, concepts which are merged confusingly in the RAE schema can be disentangled so that their components can be encoded more clearly. The RAE schema, for instance, uses the broad element ShortTitle to encode journal names when research outputs take the form of journal articles but the same element is also used to encompass such diverse concepts as volume numbers, patent registration numbers, places of performance or venues for art installations depending on the form of the research output: this conflation of concepts, and their semantic dependence on the value of other elements in the schema (specifically the research output type) introduce a degree of complexity into the encoding process which is likely to be error-prone. The CERIF4REF schema allows for separate elements to be used for each output type, named in a comprehensible manner (for instance MonographTitle in the example above) and so greatly reduces the possibility of such errors.

The CERIF4REF schema therefore offers the potential for a simpler and less error-prone medium for encoding the information required by the research assessment exercise for which the earlier RAE schema was devised. To test its practicability, a series of six case studies was undertaken in which attempts were made to map data held in higher education institutions' digital repositories and human resources, financial and student record systems to CERIF4REF. These studies all produced positive results, indicating that any problems of mapping and exporting to CERIF4REF result from incongruities between the original data requirements of RAE and the ways in which institutions hold and structure data on their systems: no problems were noted that arose through the strategic approach of a constrained, closely internally-linked schema (Gartner 2010, 14).

The second key component to this strategy is the XSLT file which is used to effect the necessary transformations between CERIF4REF and CERIF (and also, should it be required, to RAE). This file is designed to duplicate the mapping function of architectural processing in SGML and also to add a layer of semantics to the resulting XML outputs which was not possible using this earlier methodology. Using certain new features of XSLT 2.0, particularly the xsl:result-document element which allows output to be redirected to multiple named files in a single XSLT process, it proved possible to write a single stylesheet to generate all 19 CERIF files which the RAE to CERIF mapping exercise had identified as necessary to encode the range of metadata needed for the research assessment exercise.

In a few cases conversion to CERIF is a simple matter of the extraction and relocation of a given component from the CERIF4REF file to its CERIF equivalent: in this diagram, for instance, the personal identifier cfPersId in the CERIF file is a simple translation of the CERIF4REF c4rHESAStaffIdentifier attribute of c4rResearchActiveStaffMember:-

Figure 7. Deriving cfPersId in cfPERS-CORE from CERIF4REF using XSLT.
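The logic of this simple case can be sketched as follows. The project's own transformation is an XSLT 2.0 stylesheet; the Python below is only an analogue of the mapping in Figure 7, and the wrapper elements (c4rResearchPersonnel, CERIF) and the cfPers element are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical CERIF4REF instance containing two staff members.
C4R = """
<c4rResearchPersonnel>
    <c4rResearchActiveStaffMember c4rHESAStaffIdentifier="cerch-2008-001"/>
    <c4rResearchActiveStaffMember c4rHESAStaffIdentifier="cerch-2008-002"/>
</c4rResearchPersonnel>
"""


def derive_person_core(c4r_xml):
    """Copy each c4rHESAStaffIdentifier attribute into a cfPersId element."""
    source = ET.fromstring(c4r_xml)
    target = ET.Element("CERIF")  # assumed wrapper element
    for member in source.iter("c4rResearchActiveStaffMember"):
        person = ET.SubElement(target, "cfPers")  # assumed record element
        ET.SubElement(person, "cfPersId").text = member.get("c4rHESAStaffIdentifier")
    return target
```

The value is moved unchanged from an attribute in one structure to element content in another, which is all that a conversion of this kind involves.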

Producing the linking files which make up the bulk of any CERIF application requires the application of much more complicated XSLT transformations. Figure 8 below shows some of these required to link an individual to a research group:-

Figure 8. Deriving cfPERS-OrgUnit-LINK from CERIF4REF using XSLT.

The components of the CERIF file here are spread throughout the CERIF4REF file, and some (such as cfOrgUnitId) are constructed during processing: the XSLT is therefore of necessity more complex. It is, however, relatively straightforward to write as the CERIF4REF schema was conceived with the mapping to CERIF as its overriding architectural rationale, so that the stylesheet in effect merely translates this mapping to XSLT. It should also be noted that the XSLT file hard-codes the semantic components used by CERIF (in this example cfClassId and cfClassSchemeId), although these could equally well be supplied by a formal ontology (expressed, for instance, in OWL) which is read and processed during the transformations.
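The shape of such a transformation, a single pass over the intermediary file which produces several named outputs with hard-coded semantics, can be sketched as below. This is a Python analogue and not the project's stylesheet: the c4rMemberOf and c4rGroupCode attributes, the wrapper elements, the pattern used to construct cfOrgUnitId, the two semantic identifiers and the '.xml' file names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hard-coded semantic components (hypothetical values for illustration).
CLASS_ID = "is-member-of"
CLASS_SCHEME_ID = "c4r-person-orgunit-roles"

# Hypothetical CERIF4REF instance: one group and one member of it.
C4R = """
<c4rSubmission>
    <c4rResearchGroup c4rGroupCode="A"/>
    <c4rResearchActiveStaffMember c4rHESAStaffIdentifier="cerch-2008-001"
                                  c4rMemberOf="A"/>
</c4rSubmission>
"""


def build_outputs(c4r_xml, institution="kcl"):
    """Return a mapping of output file name to serialised XML content."""
    source = ET.fromstring(c4r_xml)
    core = ET.Element("CERIF")
    links = ET.Element("CERIF")
    for member in source.iter("c4rResearchActiveStaffMember"):
        person_id = member.get("c4rHESAStaffIdentifier")
        person = ET.SubElement(core, "cfPers")
        ET.SubElement(person, "cfPersId").text = person_id
        group = member.get("c4rMemberOf")
        if group is None:
            continue  # no group membership recorded, so no link record
        link = ET.SubElement(links, "cfPers_OrgUnit")
        ET.SubElement(link, "cfPersId").text = person_id
        # The organisation unit ID is constructed during processing.
        ET.SubElement(link, "cfOrgUnitId").text = f"{institution}-group-{group}"
        ET.SubElement(link, "cfClassId").text = CLASS_ID
        ET.SubElement(link, "cfClassSchemeId").text = CLASS_SCHEME_ID
    return {
        "cfPERS-CORE.xml": ET.tostring(core, encoding="unicode"),
        "cfPERS-OrgUnit-LINK.xml": ET.tostring(links, encoding="unicode"),
    }
```

Replacing the two constants with values read from an external vocabulary file would be the analogue of supplying the semantics from a formal ontology at transformation time.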

Using this approach, CERIF becomes a more viable option for institutions preparing for the forthcoming REF exercise: information held on multiple systems has been shown to map readily to the intermediary schema, and from there, using relatively straightforward stylesheets, it is possible to produce the complex web of CERIF files necessary to express this data and its interrelationships. Reversing the process also proved possible, allowing a stylesheet to be written to extract metadata from multiple CERIF files into the intermediary schema for further editing and re-export to CERIF.

6. Further potential applications

In many metadata environments, particularly in that of the digital library, the problems of complex and highly flexible generic schemas are as acute as they are in that of CERIF. A tension arises particularly between flexibility and interoperability: the more potential approaches to encoding are offered by a standard, the more problematic is the transfer of metadata to other systems and its interpretation and processing by them.

In the digital library arena, this problem has particularly been noted by implementers of the METS (Metadata Encoding and Transmission Standard) (Library of Congress 2011). A widely-read report for the UK's Joint Information Systems Committee (JISC) by the current author (Gartner 2008) proposes METS as the basis of an integrated metadata strategy for digital libraries, but recognises a number of difficulties which arise particularly because of its great flexibility (Gartner 2008, 13-14). Using PREMIS for administrative metadata within METS, for instance, requires the consistent application of best-practice guidelines to resolve such issues as redundancies between the two standards or clashes between METS's structural metadata functions and those expressed in PREMIS relationships (Gartner 2008, 13).

The usual practice to counteract these problems when implementing METS is to publish a METS Profile in which the usage of METS for a given application is documented in a standardised way (Gartner 2008, 14). Such Profiles, however, are merely human-readable documentation of a METS application, and are not machine-actionable as a mechanism for allowing the ready exchange of METS metadata. The use of intermediary schemas and associated XSLT transformations as documented here offers a possible way of obviating some of these problems: a METS application, including the extension schemas employed and any controlled vocabularies, could potentially be incorporated into such a schema and used to provide a constrained environment within which metadata can be encoded and from which METS files could be generated.

Such a schema would incorporate any required best-practice guidelines (such as those already published for using PREMIS with METS (Library of Congress 2008)), and ensure full conformance with the application's published METS Profile. Writing XSLT files to translate from a METS instance to the intermediary schema, and then from one intermediary schema to one conforming to another METS Profile, may make it possible to translate between METS profiles in a relatively automated manner. Such a scenario would in itself require some work on standardising approaches to the design of such schemas, but the methodology as it stands is capable of sustaining such a function.

7. Conclusions

Despite its great power as an encoding mechanism for the complex metadata needs of research environments, CERIF remains relatively underused in the area of research information management. Its flexibility and fragmented architecture in particular can produce significant problems for implementers and reduce its interoperability unless such key components as its semantic infrastructure are standardised between institutions. These problems were experienced more than a decade ago by implementers of such standards as the TEI, some of whom solved them by using the architectural mapping features of SGML. Without this facility in XML, the solution advocated here can replicate its best features but also add more powerful, non-syntactic features, such as semantic control.

The strategy has been tested thoroughly in several live research information management environments and found to be generally workable: the only problems experienced have proved to be those inherent in the metadata scheme on which the mapping to CERIF was based. The results have proved it to form a good compromise which allows the use of a key standard (with the consequent benefits of wider interoperability) in conjunction with a constrained, project-specific and more easily implemented element set. The successful application of this methodology suggests that it may be beneficial in the wider area of digital library metadata in general, where several key schemas are more easily implemented when constrained in this way.

8. Acknowledgements

The author acknowledges with thanks the assistance of Brigitte Jörg (Deutsches Forschungszentrum für Künstliche Intelligenz), who provided figures 1, 2 and 3, and the Joint Information Systems Committee (JISC), who financed the R4R project.

9. References