Personalization of Shared Data: The ShaRef Approach

Abstract

Personalization of services often has to cope with the conflicting goals of allowing cooperation and sharing, which require common data formats and services, and supporting individual use cases, which require as much personalization as possible. In this paper we present the ShaRef approach to personalization and sharing, which on the one hand allows users to cooperatively work with bibliographic references, and on the other hand supports the usage of this information in personalized and diverse ways. The goal of this approach is to foster as much cooperation as possible, while simultaneously supporting users with individualized ways of reusing the cooperatively managed data. This way of building applications combines the beneficial aspects of information sharing and personalization. Using this approach, applications are better suited to become building blocks in information infrastructures that are built by users in unpredictable ways.

1 Personalization and Sharing

At first sight, personalization and sharing seem to be conflicting goals. While personalization is about using data and optimizing it for the usage of a single individual and/or application, sharing is about reusing data among a number of cooperating individuals and/or applications, which often have conflicting or at least very different goals. Because of that, sharing often concentrates on making data available in a general way, while personalization often concentrates on meeting the goals of a single user.

In the ShaRef project [ECDL2005], both personalization and sharing are important goals, and this paper describes how ShaRef approaches these two issues. ShaRef is an application and service for managing bibliographic metadata. It can be used to manage any kind of referential data (i.e., metadata about published resources), but it is mainly designed to support the management of bibliographic data. The most common tools for managing bibliographic metadata are BibTeX (in conjunction with the LaTeX document preparation system) and EndNote (in conjunction with Office products), but both lack collaboration and sharing features. In many settings, however, researchers cooperate and could benefit from sharing their bibliographic metadata. ShaRef aims to provide a tool and a service that support collaborating researchers better than the tools currently in use.

ShaRef not only supports the collaborative management of bibliographic metadata, it also supports a data model (shown in Figure 1) which is more advanced than that of BibTeX or EndNote. In particular, it supports some hypertext features, which enable researchers to create links between their references [TIK224]. The idea behind this data model is to better support the knowledge management facet of bibliographic metadata. This means that the data model supports keywords (for categorizing references) and relationships between references (which are called associations). Associations can be used to link references which have a relationship, for example by stating that some referenced resource is an updated version of another, or that one referenced resource contains a supportive statement about a theory in another reference.

Figure 1: ShaRef Data Model

The details of ShaRef's data model are not important for the personalization issues discussed in this paper (detailed information about the data model has been published in a technical report about the hypermedia aspects of the ShaRef project [TIK224]). The important aspects are that each ShaRef installation has a number of registered users (which may use groups to associate collaborating users), and that bibliographic data is contained in bibliographies. Each bibliography has administrators, writers, and readers, and may contain any number of references as well as other entry types. The most important entry type for the purpose of this article is the shadow. A shadow can be thought of as a pointer to a reference (or a symbolic link, for those using Unix file systems). Shadows can be used across bibliographies, which means that it is possible to have a bibliography only containing shadows (in this case, the bibliography contains only pointers to references in other bibliographies).

To summarize, every ShaRef user has access to a number of bibliographies, which may contain references and/or shadows (i.e., pointers to references). Shadows thus provide an ideal way to produce views of bibliographies, by creating a bibliography which contains a subset of another bibliography in the form of shadows. All updates of the original references will be visible in the bibliography containing the shadows.
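
The view behaviour of shadows can be illustrated with a minimal sketch. It is not ShaRef's implementation: the class and field names (Reference, Shadow, Bibliography, key, title, year) are assumptions made for this example, and only the pointer semantics are taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Reference:
    """A bibliographic entry stored in one bibliography (hypothetical fields)."""
    key: str
    title: str
    year: int


@dataclass
class Shadow:
    """A pointer to a reference, possibly held in another bibliography."""
    target: Reference

    @property
    def title(self) -> str:
        # Always read through to the target, so edits made there show up here.
        return self.target.title


@dataclass
class Bibliography:
    name: str
    entries: list = field(default_factory=list)

    def titles(self) -> list:
        return [entry.title for entry in self.entries]


group = Bibliography("workgroup")
ref = Reference("ECDL2005", "Managment and Sharing of Bibliographies", 2005)
group.entries.append(ref)

# A view: a second bibliography that contains only shadows.
view = Bibliography("reading-list", [Shadow(ref)])

# Correcting the typo in the original reference is visible through the shadow.
ref.title = "Management and Sharing of Bibliographies"
```

After the correction, view.titles() and group.titles() return the same corrected title, because the view never held its own copy of the data.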

2 Publishing of Personalized Views

Using the view aspect described in the previous section, users can use or create bibliographies containing their own references and/or shadows of other references. It is also possible to use group bibliographies, where a group of collaborating researchers shares one bibliography. There are a number of possible use cases, but the important aspect is that ShaRef supports a wide variety of scenarios to create data that is maintained in bibliographies.

This data is stored in the ShaRef system and can be managed and viewed through the ShaRef clients. ShaRef has two clients: one is a Java-based rich client which runs on any machine equipped with JRE 1.5 or higher, and the other is a Web-based client which can be used with a Web browser. These clients are useful for working with bibliographic metadata, but often this data should also be used in existing applications, such as document preparation systems, or for publication on a Web server. For these scenarios, ShaRef supports a number of export formats, which transform ShaRef data into formats that other applications can use.

The following export formats are supported, with each of the formats having a number of options which can be used to control some of the features of the export format:

  • BibTeX: This is the data format of the BibTeX program, which is used together with the LaTeX document preparation system. The BibTeX format is also supported by a number of available tools for working with bibliographic data.
  • EndNote: This is a commercial product which is highly integrated with the Office product suite for document processing. While the internal format of EndNote is unavailable, EndNote also supports XML-based import and export. The XML format is undocumented and changes between different releases, but with some effort it is possible to create filters for this format.
  • HTML: HTML is a very popular format for publishing information and can be used to create bibliography information that should be published on the Web. This may be of interest for publication lists of individuals or projects, or for reading lists for lectures.
  • Silva: Silva is an open source content management system which is used for managing Web content. It uses an XML-based format for storing the information. Using the Silva export format, publication lists can be imported into the content management system.
  • XML: This is the internal XML format of ShaRef. While it may not be directly useful for other applications, it is documented and easy to understand, so that converters to other formats can be written fairly easily, for example by using [XSLT2].

While these formats are the ones supported by ShaRef directly, it is easily possible to support new formats, as described in Section 3.1. One principal limitation of exporting data is that it is a snapshot of the exported data at a certain point in time. This may be acceptable in certain scenarios, but in other scenarios it is necessary to have a live view of bibliographic data. ShaRef supports this concept of a live view through the publishing concept, which is an extension of the export process. Publishing works as follows:

  1. Selecting the data to be published: As a first step, the data to be published must be selected. This is done by selecting a bibliography and a set of search criteria for this bibliography. Consequently, the data being selected for publishing is either a complete bibliography, or a search-based subset of a bibliography.
  2. Selecting the publishing format: This step is identical to the export process described above. When selecting the publishing format, a data format has to be selected and the available options for this data format have to be selected as well.
  3. Selecting the publication channel: Finally, it must be decided how the selected references in the selected publishing format should be published. ShaRef publishes data through URIs on the ShaRef server. The front part of the URI is given by the protocol (HTTP), the host name of the ShaRef server, and the user name (of the user configuring the publication channel); the rest can either be configured manually, or ShaRef will select a random name. The resulting URI will look like http://sharef.ethz.ch/publish/dret/mybib, where http://sharef.ethz.ch/publish/ is the system-defined prefix, dret is the user name of the channel publisher, and mybib is the name chosen for the publishing channel. This URI can be used to retrieve the live view of the content selected for publishing.
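
The selection step and the composition of the channel URI can be sketched as follows. This is only an illustration under stated assumptions: the function names (select, channel_uri), the dictionary-based entries, the exact-match search criteria, and the format of the random channel name are invented for this example. Only the URI layout (prefix, user name, channel name) is taken from the description above.

```python
import secrets
from urllib.parse import quote

# System-defined prefix as given in the example URI in the text.
PUBLISH_PREFIX = "http://sharef.ethz.ch/publish/"


def select(entries, **criteria):
    """Step 1: keep the entries whose fields match all search criteria.

    Without criteria, the complete bibliography is selected.
    """
    return [entry for entry in entries
            if all(entry.get(name) == value for name, value in criteria.items())]


def channel_uri(user, name=None):
    """Step 3: build the URI of a publishing channel.

    If no channel name is configured, a random one is generated
    (the token format used here is an assumption).
    """
    if name is None:
        name = secrets.token_hex(4)
    return PUBLISH_PREFIX + quote(user, safe="") + "/" + quote(name, safe="")


# Hypothetical bibliography content, used only to exercise the functions.
bibliography = [
    {"key": "ECDL2005", "author": "Wilde", "year": 2005},
    {"key": "XSLT2", "author": "Kay", "year": 2007},
]

selected = select(bibliography, year=2005)
uri = channel_uri("dret", "mybib")
```

Because the selection is stored as search criteria rather than as a fixed list of entries, it can be evaluated again on every request to the URI, which is what makes the published view live.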

This concept of publishing together with the shared bibliographies can be used to support different ways of personalization. On the one hand, data in other bibliographies can be reused in personal bibliographies (described in Section 2.1), and on the other hand, data can be published in personalized ways (described in Section 2.2). By combining these two concepts, users have the opportunity to create a variety of ways in which they reuse the data inside a ShaRef system in a personalized way.

2.1 Creation of Personal Collections

ShaRef's flexible model of users, groups, and bibliographies allows a variety of use cases. In a typical ShaRef installation, several bibliographies will be used by a workgroup. One set of bibliographies are standard bibliographies which may be provided by the ShaRef administrator or through import from existing sources (ShaRef supports BibTeX and EndNote as import formats and uses a declarative mapping approach for import and export which can be easily updated to reflect user needs). For example, the complete set of Internet standard documents (known as Request for Comments (RFC) documents) is available as an XML document on the RFC editor's Web page. From this index, a ShaRef XML file can be produced (by writing a custom import filter as described in Section 3.2), which can then be imported into ShaRef. Using this kind of strategy, it is possible to provide a set of standard bibliographies to users which they can use without any need to input or manage data themselves.

As a second set of bibliographies, workgroups usually create bibliographies which contain bibliographic references specific to their domain of interest. These references may use entries from some of the more general bibliographies (for example by associating a workgroup bibliography entry with an entry from a system-wide available bibliography), but in general will reflect the domain of interest of the workgroup. In most cases, the workgroup bibliographies describe the workgroup's own publications as well as publications from other research groups. By using the shadow mechanism, it is then possible to create a bibliography which contains only the workgroup's own publications by creating shadows which point to the entries in the general workgroup bibliography.

As a final set of bibliographies, workgroup members may create their own bibliographies. These may contain references that are of private interest only, or they may contain selected subsets of workgroup bibliographies. By using the second strategy, it is possible to create views of workgroup bibliographies as described in Section 1. A common use case for this is a reading list for a lecture or a seminar. Rather than copying the references for this reading list into a new bibliography, the reading list bibliography is populated with shadows of the required entries. This way, changes in the original entries are reflected in the reading list entries as well, because the shadows are live pointers to the original references. If necessary, the shadows may later be instantiated if this live connection is no longer required, but in most cases the live connection is preferable, because corrections and updates of the original entries will be available in the reading list as well without any manual intervention being necessary.
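
The difference between a live shadow and an instantiated one can be sketched as follows. The sketch assumes that instantiating a shadow means replacing the pointer with an independent copy of the referenced entry. The Shadow class, its instantiate method, and the dictionary representation of a reference are hypothetical and not taken from ShaRef.

```python
import copy

# Hypothetical in-memory model: a reference is a plain dictionary.
original = {"key": "rfc2616", "title": "Hypertext Transfer Protocol"}


class Shadow:
    def __init__(self, target):
        self.target = target  # live pointer to the original, not a copy

    def instantiate(self):
        """Return an independent copy, cutting the live connection."""
        return copy.deepcopy(self.target)


reading_list = [Shadow(original)]
frozen = reading_list[0].instantiate()

# A later correction of the original entry reaches the shadow only.
original["title"] = "Hypertext Transfer Protocol -- HTTP/1.1"
```

The shadow in reading_list now reports the corrected title, while the instantiated copy in frozen keeps the title it had when it was created.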

By combining the three approaches to bibliography management described above (system-wide, workgroup-wide, and personal), ShaRef enables users in different application scenarios to balance the need for centrally maintained and managed bibliographic data against the need for personal data. In particular, the shadow mechanism can be used to reuse references which should be available in different contexts.

2.2 Personalization of Publishing

The creation of bibliographies for workgroups and/or individuals as described in the previous section provides a convenient way for managing bibliographic data in different contexts. However, so far there is no support for reusing this data in any other application than ShaRef itself. For this, the publishing feature as described in Section 2 can be used.

For any bibliography that has been created according to the strategy described in the previous section, the publishing feature can be used to create a personalized view of this bibliography. As described in Section 2, publishing involves (1) the selection of a bibliography and an associated search filter, (2) the definition of a publishing format and its options, and (3) the selection of a URI for the published results.

Figure 2: HTML Export Options

For example, Figure 2 shows the options of the HTML export format. It can be seen that the export options provide a variety of different HTML formatting possibilities, which make it possible to personalize the exported HTML in different ways. Through the definition of a personalized publication channel of a bibliography, a user can easily create a personalized HTML view of any of the bibliographies he is interested in. This HTML view is always accessible from the ShaRef server and reflects the latest version of the underlying bibliography data.

While the HTML export is intended to be used by humans (by rendering it in a Web browser), other export formats are intended to be used by applications. The Silva export format is the input format of the Silva Content Management System (CMS), and by creating a personalized Silva publishing channel, a user can create a personalized view of bibliographic data which can be integrated into Silva-managed Web sites. In this case, ShaRef makes the Silva-formatted view available at the publishing URI, and the CMS retrieves the contents from this URI and uses them for displaying the live contents of the ShaRef data. This way, the integration of publication lists into CMS-managed Web sites can be fully automated without the need for any manual updates.

3 Implementation

ShaRef is mainly built on top of XML technologies [TIK213], which means that the data model is defined as an XML Schema [XSD1,XSD2] (with additional constraints which cannot be expressed in XML Schema), even though the current implementation stores the data in a relational database. In addition, data manipulations (for example, import and export) have been implemented using XSL Transformations (XSLT) 2.0 [XSLT2], which is ideally suited for transforming XML (the only exception is BibTeX import, which first uses a BibTeX parser to produce an XML version of the BibTeX syntax).

Even though XML Schema proved to be less than ideal as a schema language for the project (because it cannot encode all constraints which are important for our data model), we decided to use XML Schema because it is widely accepted, supported by a large set of tools, and likely to be understood by others reusing ShaRef's data model. A different schema language might have been technically better, but would have reduced the ability to define the data model in a way which is accessible to a large set of users.

There are two main reasons for the XML-centric design of ShaRef: Firstly, XML provides an ideal foundation for working with structured data, and there are many tools and languages available which help developers when working with XML-based data. Thus, software development becomes easier because there is a large number of established tools, and software developers can reuse their existing knowledge when working with these tools. The advantages of this aspect are described in Section 3.1. Secondly, because of the XML-based data model, users of the system can easily extend it by adding external components for processing XML data. This can be done much more easily because the data format is described by an XML Schema.

3.1 New Publishing Formats

In addition to the formats described in Section 2, it may be necessary to support additional import and/or export formats in ShaRef installations. Following the guideline of XML-centric software design, this can be done in an easy way:

  1. The options of import and export formats are also described in an XML-based format, which makes it easy for developers of new import or export formats to describe the available options in a way that is supported by the existing software environment. The XML-based option file is used by the Java client as well as the Web client to produce the option dialogs for users.
  2. The import or export itself in most cases can be implemented using XSLT 2.0, which can be plugged into the software platform. There are some minimal guidelines to follow (for example, progress and other messages have to be output using a given messaging module), but apart from these, developers are free to use the full potential of XSLT 2.0. This is particularly important because XSLT 2.0 (in contrast to XSLT 1.0 [XSLT1], which can process XML files only) has built-in support for opening and parsing text files. (If the import or export requires more than just XSLT 2.0 processing, integration into the software platform becomes more complicated.)

This extensibility of the software platform in terms of additional import and export formats has been of great importance during the design and implementation of ShaRef, because we assume that the predefined formats are a good set to start with, but probably not sufficient in many application scenarios. One popular example of an import or export format that we currently do not support is the Dublin Core (DC) [ISO15836] metadata set. While it would be rather easy to create a mapping between DC metadata and ShaRef's data model, there was no immediate need in our application scenarios to support this format, so it remained on the to-do list.

3.2 Custom Personalization

The easy addition of new import or export formats as described in the previous section has been one of the main goals of ShaRef's software design, but it still requires some changes in the software, and maybe more than individuals can or want to do. Instead, if individual users are managing their data with ShaRef, they should have easier ways of creating custom personalizations, enabling them to import or export data formats which are not supported by their ShaRef installation.

ShaRef's XML Schema has been designed to be as easily understandable as possible. This not only means that the semantics have been designed to be easily understandable (which means that the supported semantics are geared towards end users and not librarians), but also that the markup design has been made as simple as possible. This is in stark contrast to other bibliographic metadata schemas such as the Metadata Object Description Schema (MODS) [MODS], which are comparable to ShaRef in terms of semantics, but use a much more complex markup structure to encode the data. The ease of use of the semantics and the markup design is a simple but effective way to encourage users to connect ShaRef with other applications by writing custom import and export filters.

Basically, the approach for writing custom filters is the same as for the addition of new import or export formats as described in the previous section, but it does not touch the ShaRef software itself. Instead, the ShaRef XML format is used for import or export, and the custom filter is written as a standalone program (preferably in XSLT 2.0) which processes the data before import or after export. If a user has some experience with XSLT, in most cases this program can be written fairly easily. Other options are the usage of a mapping tool for mapping XML Schemas (which are increasingly available in commercial XML tool suites), or the specification of the mappings, so that the local ShaRef system administrator can implement the corresponding XSLT.
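
A standalone export filter of this kind can be sketched as follows. The text recommends XSLT 2.0 for such filters; Python is used here only to keep the example self-contained. The element and attribute names of the input (bibliography, reference, id, type, author, title, year) are invented for illustration and are not taken from the actual ShaRef XML Schema, and the function name to_bibtex is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical exported data: the markup below is an assumption, NOT the
# real ShaRef XML format.
EXPORTED = """\
<bibliography>
  <reference id="ECDL2005" type="inproceedings">
    <author>Erik Wilde</author>
    <author>Sai Anand</author>
    <author>Petra Zimmermann</author>
    <title>Management and Sharing of Bibliographies</title>
    <year>2005</year>
  </reference>
</bibliography>
"""


def to_bibtex(xml_text):
    """Convert every <reference> element into one BibTeX entry."""
    entries = []
    for ref in ET.fromstring(xml_text).iter("reference"):
        # BibTeX separates multiple authors with the word "and".
        authors = " and ".join(a.text for a in ref.findall("author"))
        fields = [
            ("author", authors),
            ("title", ref.findtext("title", default="")),
            ("year", ref.findtext("year", default="")),
        ]
        # Skip fields that are empty in the source.
        body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields if value)
        entries.append(f"@{ref.get('type', 'misc')}{{{ref.get('id')},\n{body}\n}}")
    return "\n\n".join(entries)


bibtex = to_bibtex(EXPORTED)
```

The filter runs after export and never touches the ShaRef software itself, so a change in the target format only requires changing this one program.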

As an example of such a custom personalization, the XML and Web Service Glossary has been imported into ShaRef. The glossary is also available as an XML file (in fact, the published Web version is generated from this XML source). By writing a simple custom import filter, it is possible to create a ShaRef XML version of the glossary. Since the glossary does not directly contain any bibliographic information, it is mapped onto keyword definitions, which can then be used to classify bibliographic metadata according to the glossary's terms.
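
The mapping from glossary entries to keyword definitions can be sketched in the same way. Neither the markup of the glossary nor the keyword markup of ShaRef is shown in this paper, so every element and attribute name below (glossary, entry, term, definition, keywords, keyword, name) is a placeholder, as are the sample definitions. As before, Python stands in for the XSLT 2.0 filter that the text recommends.

```python
import xml.etree.ElementTree as ET

# Hypothetical glossary source with made-up sample entries.
GLOSSARY = """\
<glossary>
  <entry>
    <term>XSLT</term>
    <definition>A language for transforming XML documents.</definition>
  </entry>
  <entry>
    <term>XML Schema</term>
    <definition>A schema language for XML.</definition>
  </entry>
</glossary>
"""


def glossary_to_keywords(xml_text):
    """Map each glossary entry onto one keyword definition element."""
    keywords = ET.Element("keywords")  # hypothetical target markup
    for entry in ET.fromstring(xml_text).iter("entry"):
        keyword = ET.SubElement(keywords, "keyword", name=entry.findtext("term"))
        keyword.text = entry.findtext("definition")
    return keywords


keywords = glossary_to_keywords(GLOSSARY)
serialized = ET.tostring(keywords, encoding="unicode")
```

The serialized result would then be handed to the regular XML import, which is the only point of contact with the ShaRef system.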

4 Application Scenarios

Section 2 describes the publishing of personalized views and Section 3 describes how this can be implemented for the support of new import and export filters within the software, as well as for customized personalization with standalone filters. The following sections describe application scenarios which are the result of a user survey that had been conducted before the project started [TIK194]. As a general result from this survey, users thought that sharing bibliographic information would be useful for them, but they also considered their references as something they wanted to keep control over.

Figure 3: ShaRef Use Cases

Figure 3 shows the possible use cases of the ShaRef software and service. While some of the management and sharing tasks can be performed within the system, for others the export or publishing features must be used, which enable users to make ShaRef data available to other systems.

4.1 Personal Applications

ShaRef is about sharing data, but because of the answers in the user survey, it does not require users to share their data. Users can use the system in a completely private way (ignoring all sharing features), or they can set up the system to allow different levels of data sharing (as described in Section 2.1). Whatever setup users choose, they can always reuse their data in other applications. This is important because virtually all users need their data for document preparation, and ShaRef does not integrate into document preparation environments. Instead, the required data must be exported or retrieved from a publishing channel and can then be used for document preparation. Because different users are using different document preparation environments, they may use different formats for the same underlying data.

Another example is the glossary described in Section 3.2. In this case, a user may decide that some information that exists outside of ShaRef would be useful to have within the system. By writing a custom import filter, the data can be transformed and reused within the ShaRef system.

4.2 Workgroup Applications

While individual users may create custom import filters for their personal purposes, it may also make sense to reuse ShaRef data on the workgroup level. An interesting use case here is the university-wide publication database that already exists at our university. This system only contains publications authored by university members and has been designed for creating publication reports for departments and the university. Currently, university employees must manually enter their publications once a year. It would be much easier if the existing entries in their bibliographies could be transferred into the publication database.

Currently, the problem is that the research database does not have a well-defined import format. The only way of inputting data is the Web-based form which requires manual input. Once the system has been equipped with an import format and interface, it will be easy to write a filter for transforming ShaRef data to the research database format. This will save the thousands of university employees a lot of typing, and is another example of how the XML-based data model can be used to integrate ShaRef with other applications.

5 Related Work

The system presented here provides a unique combination of features: the sharing of data based on a flexible user and group concept, the personalized reuse of data, the integration with other applications, and the description of bibliographic metadata using a hypermedia-inspired data model. Other projects have worked on similar approaches in each of these areas, but no project has the exact same scope and combination of features.

Data sharing is supported by the BibShare system [BibShare], but this is focused on Office products and does not support information integration with other applications. Bibster [Bibster] is a peer-to-peer based system which is mainly focused on the aspects of semantics. Bibster does not support information integration very well, because the RDF-based data model is rather complex to understand. However, for integration into RDF-oriented environments, the Bibster approach may be the best way to go.

Other examples of sharing-oriented tools for managing bibliographic metadata are freely available software or services such as CiteULike or JabRef. These tools do not support ShaRef's flexible user and group model, and are thus less suited to be used as tools in organizations with a large number of users, many cooperating users, and complex requirements for how these users can manage the access rights to their bibliographic data.

6 Summary and Conclusions

The ShaRef approach to personalization presented in this paper is based on a model of data management which allows users to choose their preferred combination of privacy and sharing, and on an XML-based design of import and export features for easy information integration. We believe that this combination is the right mix for providing users with a service that is powerful and flexible enough so that most users will find it more useful than more traditional methods and tools for bibliography management.

Generally, choosing an XML-centric approach for the project has proven to be a useful decision. The number of tools for working with XML data has become so large that there must be very good reasons for not using XML when designing a data-centric application. Ideally, we would have liked to use a native XML database, which would have made life easier for us, but due to stability and availability issues (we required a native Java database for working with the Java-client in offline mode), we chose to implement our XML data model on top of a relational database. However, keeping the XML data model and the relational data model in sync required additional efforts and made it harder to follow the guidelines of agile software development, so ideally software should not have these internal model discrepancies.

The ShaRef project ended in 2/2007 and parts of it are now integrated into a solution which is developed by the ETH library in an effort to create the ETH bibliography, the authoritative list of all publications of all ETH members. The original plan of deploying ShaRef as a university-wide service was not executed because it turned out to be difficult to find a stakeholder for running this service for a longer period of time. Most institutions which were potential users either implemented isolated solutions or did not want to rely on a service without some guarantee of service quality. The individual users in most cases would have liked such a service, but have no voice when it comes to deciding which services should be provided as a central service of a university. Newer projects such as Zotero (which has substantial financial backing from a variety of sources) might be able to overcome that hurdle and establish a stable service for personal and institutional bibliography management.

7 Acknowledgements

The ShaRef approach is the joint work of all members of the ShaRef project team (the ShaRef project ran between 7/2004 and 2/2007). The ShaRef project team members are the two full-time members Sai Anand and Petra Zimmermann (working in the first project phase until 12/2005), the full-time member Max Jörg (working in the second project phase until 2/2007), the two student contributors Nick Nabholz (responsible for the user/group management concept) and Thierry Bücheler (who developed the Web client), and Till Quack, who developed the Web design. All the members of the project team were invaluable to the project's success and the work presented in this paper contains ideas from all of them.

References

[BibShare] José H. Canós and Eduardo Mena. BibShare: An Interoperable System to Access and Maintain Bibliographic References. In: III Jornadas de Trabajo DOLMEN, Madrid, Spain, November 2002.
[Bibster] Peter Haase, Björn Schnizler, Jeen Broekstra, Marc Ehrig, Frank van Harmelen, Maarten Menken, Peter Mika, Michal Plechawski, Pawel Pyszlak, Ronny Siebes, Steffen Staab, and Christoph Tempich. Bibster — A Semantics-Based Bibliographic Peer-to-Peer System. Journal of Web Semantics, 2(1), 2005.
[ECDL2005] Erik Wilde, Sai Anand, and Petra Zimmermann. Management and Sharing of Bibliographies. In: Andreas Rauber, Stavros Christodoulakis, and A. Min Tjoa (eds.), Proceedings of the 9th European Conference on Digital Libraries, Lecture Notes in Computer Science, pages 479-480, Vienna, Austria, September 2005. Springer-Verlag.
[ISO15836] International Organization for Standardization. Information and Documentation — The Dublin Core Metadata Element Set. ISO 15836, November 2003.
[TIK194] Erik Wilde. Usage and Management of Collections of References. Technical Report TIK Report No. 194, Computer Engineering and Networks Laboratory, ETH Zürich, Switzerland, June 2004.
[TIK213] Erik Wilde, Sai Anand, and Petra Zimmermann. ShaRef: XML-Centric Software Design. Technical Report TIK Report No. 213, Computer Engineering and Networks Laboratory (TIK), ETH Zürich, February 2005.
[TIK224] Erik Wilde. Shared Bibliographies as Hypertext. Technical Report TIK Report No. 224, Computer Engineering and Networks Laboratory, ETH Zürich, Switzerland, May 2005.
[TIK242] Erik Wilde. XML-Centric Application Development. Technical Report TIK Report No. 242, Computer Engineering and Networks Laboratory, ETH Zürich, Switzerland, February 2006.
[MODS] Library of Congress, Network Development and MARC Standards Office. Metadata Object Description Schema (MODS) Version 3.1. July 2005.
[XSD1] Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn. XML Schema Part 1: Structures Second Edition. World Wide Web Consortium, Recommendation REC-xmlschema-1-20041028, October 2004.
[XSD2] Paul V. Biron and Ashok Malhotra. XML Schema Part 2: Datatypes Second Edition. World Wide Web Consortium, Recommendation REC-xmlschema-2-20041028, October 2004.
[XSLT1] James Clark. XSL Transformations (XSLT) Version 1.0. World Wide Web Consortium, Recommendation REC-xslt-19991116, November 1999.
[XSLT2] Michael Kay. XSL Transformations (XSLT) Version 2.0. World Wide Web Consortium, Recommendation REC-xslt20-20070123, January 2007.

Author details

Erik Wilde was project leader of the ShaRef project at ETH Zürich (the ShaRef project ran between 7/2004 and 2/2007). Since then, he has moved to UC Berkeley's iSchool. For more details about the ShaRef project, please visit the ShaRef project Web page. Erik's general interests are Open Information Systems, or more specifically, Web technologies, ranging from communications mechanisms such as HTTP and SSL to content management and application server programming. His current focus is on the Extensible Markup Language (XML) and associated technologies. He is interested in XML itself and the core specifications such as XInclude, XML Base, XML Namespaces and the XML Information Set. On top of this foundation, he is interested in XML schema languages (languages, their design, and design patterns for using and combining them) and XSL Transformations. On the application level, his main interests are XML-based information integration, XML-centric software design, XML and databases, and Web Services. For more details about Erik, please visit his Web page, his publication list, or his CV.