1 Background
This article is an expansion of a five minute, slightly tongue-in-cheek, invited presentation (by Peter Murray-Rust) at ACM Hypertext 2003. In subsequent discussion the underlying serious message was felt to be important, and this is the emphasis here.
We start by defining what we mean by the term "data" in an electronic environment. This is used to cover any material which is not usefully human-readable in raw form. Examples include graphs, digitised maps, database tables, computer code, program output, chemical structures, graphics visualisations, audio and video streams, genomic and microarray data, and many more (really a superset of "hypermedia"). Our background may emphasise physical and biological sciences but the concept can be transferred to many domains. We believe that many concepts here are widely applicable to hypertext in all disciplines. However, practice and technology will differ and, for example, the classic concept of transclusion will vary considerably between fields. We often emphasise the "call-by-value" (i.e. direct copy) strategy rather than "call-by-reference" as we feel this is the manner in which open scientific disciplines wish to work.
We emphasise the term "open". This is well understood in software licenses where the term Open Source insists on the preservation of authors' moral rights and the integrity of information and metadata. For "data" it is much less clear and we highlight this concern, without providing complete solutions. "Open", therefore, refers to the desire to make information universally available without hindrance, under conditions which preserve authors' rights but make it unnecessary to contact the author for responsible re-use.
Our ideas are implemented as working examples within the chemistry domain that act as proofs of concept and have been peer-reviewed (Murray-Rust et al. 2000, 2001, 2003, 2004; Gkoutos et al. 2001). We illustrate these concepts with working examples that are part of the present article. What is needed is the political will of the scientific community to give the impetus to scaling up.
Some aspects of this discourse may appear as an uncritical diatribe against all scientific publishers. This was indeed one of the themes in the humorous presentation, and it elicited resonance from the audience. We recognise that there are forward-looking publishers and we are currently pleased to be working with them. However, we feel that the scientific publishing community is in many ways holding back the vision of increased scientific communication in the digital age. We will be pleased to hear from publishers who want to explore the concept of datuments further.
2 Problem and opportunity
Most publicly funded scientific information is never fully published and decays rapidly. As an example, the crystallography services in typical chemistry departments such as the University of Cambridge or Imperial College London carry out hundreds of analyses per year. Each is publishable in its own right, but the majority remain as "dusty files" where the effort required to "write them up" for a full peer-reviewed paper cannot be found. Yet these data are among the highest quality scientific experiments performed in any discipline. All information is produced electronically and only about 1% are found to be incorrect in some way. They contain very rich information. Nearly 1000 peer-reviewed papers have been published on information extraction ("data mining") from such crystallographic data alone. The International Union of Crystallography has produced an impressive electronic-only publication process where the complete "manuscript" is submitted electronically, and reviewed not only by humans but by extensive computer programs ("robots"). Such manuscripts are "almost always" accepted if there are no technical errors. Yet well over 80% of such material lies unpublished and unavailable to science.
Why? In some cases the scientists wish to have first use of their data and do not want competitors to get it. This was common in the protein crystallography community, which has developed acceptable practices such as putting data "on hold" for, say, six months. In each discipline the practice will vary. Often the result is that the scientific public gets a summary of the work in (e)paper form but does not have enough information to repeat the experiment. This is particularly true for in silico experiments (such as quantum mechanical calculations on molecules) where unless the reader has complete knowledge of the input information and installation details of the program, they may get different results or behaviour. Frequently a reader will carry out the experiment again from scratch, as the published information is insufficient.
A serious consequence is that data- and text-mining is non-existent in many communities - they lack a sufficiently large corpus to make it useful. Crystallography had J. D. Bernal (Goldsmith 1980), a visionary far ahead of his time like Bush (1945) and, in a more general scientific sense, Garfield (1962). Both the latter foresaw the globalisation of information and laid the infrastructure for the archiving of scientific and crystallographic information.
A feature of many sciences is that information is "micropublished" in many different journals. There are c. 3,000,000 new chemical compounds reported each year, but an individual article rarely reports more than about 50. Thus information about chemical compounds becomes spread over perhaps half a million articles each year. There are three main approaches to integrating and coordinating such micropublished information:
- Do nothing.
- Create a business for abstraction and curation. A well-known example (in molecular sciences) is the Chemical Abstracts Service, which extracts (by human effort) "every" new reported chemical compound from the public literature. Obviously this is costly and the costs must be recovered in per-use charges. The Cambridge Crystallographic Data Centre has a similar model. These organisations fill a useful role but their design is fundamentally incompatible with the datuments we propose.
- Deposit information at time of publication. Crystallography led the way in the 1970s by urging journals to insist that authors deposited "data" at time of publication. Originally this was on paper (early versions were typed), then as computer output, and electronic data is now the norm. Ultimately all major journals adopted this policy for crystallography. The best known modern example is the biosciences, where genomes, sequences and macromolecular structures must be deposited as a condition of publication. In most cases national or international data banks (e.g. the European Bioinformatics Institute or the Protein Databank) act as depositories and curators.
The latter approach is essentially an (incomplete) datument and we strongly promote this idea below. However, the number of publishers actively adopting this idea is small. Supporting "supplemental information" (nowadays prefixed with the term electronic and hence often referred to as "ESI") is a cost without obvious return and is unlikely to be actively refereed or curated. There are few standards and (anecdotally) little re-use. The publishing process itself militates against datuments. The author is required to recast their information into models that conform to the publisher's technology and business model, often a Microsoft Office document with a defined template, with all data converted into tables or semantically void images. In any case, most manuscripts are re-keyed at some stage in the publication process, so electronic submissions by authors have little value. Authors are, not surprisingly, discouraged from datument-like publishing.
Until recently this was inevitable, but we now have the technology to address it. Many information components in a hyperdocument can be recast as context-free XML and integrated with XML text and XML graphics. Here we show the overall information architecture with reference to the latest proofs of concept in the chemical field.
3 Robot readers and the digital age
The current transition to e-journals seems to be welcomed by many - but not us. E-journals published in portable document format (PDF) have missed a great opportunity for change and brought little value to the scientific community (in this sense, portable really means print anywhere rather than re-use anywhere). Many readers still print their reading on paper, so the effect is merely to transfer the cost of producing paper journals (including mail) to the readers' printing bills. Even where readers use the screen there are few or no tools to manage this information - each scientific article is a distinct entity whose linear concept dates from the 19th century. Electronic TOCs and bibliographic hyperlinks may provide some value but the idea of a dynamic knowledge base for the benefit of the community is wholly lacking. We accept that business goals and methods cannot change overnight, but novel forms of communication have usually been ignored. For example, the authors have pioneered e-conferences (Rzepa et al. 1995), e-courses (Murray-Rust et al. 1995) and sit on the board of an innovative e-journal where datuments can be published (Gkoutos et al. 2001). These and similar efforts in other disciplines have been largely ignored. The brave new world articulated in many of the talks at the first World-Wide Web conference in 1994 (Rzepa 1995), which foresaw radically new ways in the digital age, has been largely stifled by conventional business interests and methods.
A common feature of all mainstream science publication is the universal destruction of high-quality information. Spectra, graphs, etc., are semantically rich but are either never published or must be reduced to an emasculated chunk of linear text to fit the paper model. The reader has to carry out "information archeology" using the few bricks that remain from the building.
The true vision of the digital age is to use information beyond the limitations of paper. We use the test of the "robotic scientific reader". This robot can read and understand scientific discourse such as papers and emails. The understanding is very limited and has carefully controlled semantics but it has several major advantages over human reading:
- robots never get tired or bored
- robots seldom make mistakes and they can be mended when they do
- robots can carry out operations that humans cannot (e.g. in extreme or unsafe environments)
- robots scale (humans are unlikely to process the 500,000 articles we refer to above)
The last feature is critical. Science is being overwhelmed with information and it is essential that we develop robotic readers. This is keenly understood in the biosciences, where "text mining" is an active area. We accept that human natural language is a major current barrier, but this can be dramatically lowered if we have the will. By contrast, most publishers are continuing to make their products inaccessible to robots, mainly through PDF. Even Microsoft Word is better. GIF or JPEG images carry content which machines can understand only with great difficulty (Gkoutos et al. 2003). SVG (graphics in XML) is the natural choice for digital image information. We shall use SVG here, so if readers want to see our diagrams they can take their first steps in reading a datument (Figure 1).
Figure 1. Information loss in the current publication process. The author (a human/machine symbiote) has a rich (if legacy) information environment. This is downgraded to PDF during publication. The two images have "identical" content but use different technologies: a, SVG, can be scaled indefinitely without corruption, can be used for information extraction (e.g. "PDF" can be retrieved) and can be re-used in whole or part (human readers new to SVG should visit the W3C site http://www.w3.org/Graphics/SVG/ to get a plugin or other viewing technology); b, corresponding JPEG (ten times as large a file), shows the loss and the near impossibility of any information extraction
This is not science fiction. A program undertaken at Cambridge (Murray-Rust et al. 2003) has resulted in robots that can read and understand most of the data in a typical paper on the synthesis of new chemical compounds. The robots can read a paper in c. 5 seconds and create a complete datument of all analytical information. Using XSLT stylesheets, the robots can answer trivial (chemical) questions such as the following (a minimal stylesheet sketch is given after the list):
- How many nuclear magnetic resonance spectra were published this year? What is the distribution of magnetic field strengths used? What are the solvents used to prepare the samples?
- How many compounds contain an even number of carbon atoms? (a classic data mining exercise)
- What is the average range of melting temperature for a compound?
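To give a flavour of how such questions are answered, the fragment below sketches a stylesheet that counts the NMR spectra in a datument and lists the field strengths used. It is a sketch only: the element and attribute names (spectrum, type, field) and the namespace are illustrative and are not the actual CML vocabulary.

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:c="http://www.example.org/hypothetical-chemistry">

  <xsl:output method="text"/>

  <!-- Report the number of NMR spectra and the field strength of each -->
  <xsl:template match="/">
    <xsl:text>Number of NMR spectra: </xsl:text>
    <xsl:value-of select="count(//c:spectrum[@type='NMR'])"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:for-each select="//c:spectrum[@type='NMR']">
      <xsl:text>Field strength (MHz): </xsl:text>
      <xsl:value-of select="@field"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>

The same approach, with different selection expressions, answers the other questions above; the stylesheet never needs to "understand" the prose of the paper, only the marked-up data.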
Unfortunately the real spectra have been destroyed in the (conventional) publication process. Even so, these tools can carry out information archeology to make reasonable estimates of what the original spectra might have looked like.
It is easily conceivable that robots could take action on reading papers, such as "find all inhibitors of HIV protease in J. Med. Chem., order them from suppliers or, where unavailable, repeat the syntheses". In practice this will still require human oversight for some years, but it illustrates the power of the semantics.
This discourse, therefore, is a call for "accessibility for robots as well as humans".
4 Datuments, transclusion and integrity
A datument is a hyperdocument for transmitting "complete" information including content and behaviour. We differentiate between "machine-readability", merely that a document such as a JPEG image can be read into a system, and "understandability", where the machine is supplied with tools which are semantically aware of the document content. Examples of the latter are domain-specific XML components such as maps (GML), graphics (SVG) and molecules (Chemical Markup Language, CML). Understandability may require ontological (meaning) or semantic (behaviour) support for components. Neither is yet fully formalised, but within domains it is often possible to find that certain concepts are sufficiently agreed that programs from different authors will behave in acceptable ways on the same documents. We shall assume that most scientific disciplines can, given the will, support machine-understandability for large parts of their information.
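As an illustration of the difference, the fragment below shows a molecule (methanol) marked up in the spirit of CML. The namespace is hypothetical and the markup is simplified, but the point stands: a chemically aware program can count the atoms, perceive the connection table and compute properties, whereas a GIF or JPEG of the same structure cannot be interpreted in this way.

<molecule id="methanol" xmlns="http://www.example.org/simplified-cml">
  <!-- atoms of CH3OH, each with a unique id and an element type -->
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="O"/>
    <atom id="a3" elementType="H"/>
    <atom id="a4" elementType="H"/>
    <atom id="a5" elementType="H"/>
    <atom id="a6" elementType="H"/>
  </atomArray>
  <!-- explicit connection table: which atoms are bonded, and how -->
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
    <bond atomRefs2="a1 a4" order="1"/>
    <bond atomRefs2="a1 a5" order="1"/>
    <bond atomRefs2="a2 a6" order="1"/>
  </bondArray>
</molecule>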
In principle datuments can be infinite in size, both in terms of the semantic and ontological recursion and the need to provide complete information for every component. For example, a scientific paper has citations that are also datuments and which may be required to create the complete knowledge environment. In principle, also, a datument can be dynamic with components changing in time. Nonetheless we believe that in many sciences bounded static datuments are of great value and that many primary publications are valuable as such.
Classical transclusion normalises information by providing a single copy of each component and providing links to, rather than copies of, such sources. This works well on the Web as long as integrity is regarded as relatively unimportant (or at least poor integrity can be "lived with"). It also works where a single (monopolistic) supplier has control over all the transcluded information. In a heterogeneous environment it does not yet work. A supplier of transcludable content may have little business or moral motivation to provide continued integrity. A primary publisher may have no contractual obligation to continue to support authors' supplemental data or even full text indefinitely. While transclusion may work where microcontent is of very high value (e.g. arts and literature) it is difficult to see a business model in science.
An alternative model is the datument "snapshot" where all the components are copied and aggregated at "time of publication" (Figure 2).
Figure 2. a, hyperdocument (linked documents); b, bounded datument (text (XHTML) and data (MathML, SVG) aggregated). The figure emphasises the links between separate documents in (a) and the complete aggregation in (b), although there is a continuity of architectures
While this forgoes the power of dynamic linking, it enriches the original material enormously. An example could be a scientific thesis with multiple components, including generic components such as:
- human-understandable text
- numeric data of agreed types constrained by ontologies, units and errors
- graphs
- tables and other common data structures (lists, trees, matrices)
- graphics
- bibliography and other document components (TOC, abstracts ...)
and domain-specific ones such as:
- molecules
- biosequences
- spectra
- program output
- statistical analyses
- organisms
The graduand creates a thesis by aggregating the information as a single datument with integral XML copies of all the information collected to support the scientific work. After the examiners have torn it to pieces (critical examination!), the revised datument can then be published in its entirety. Whereas most paper theses are never re-read, PhDatuments can be universally accessible to humans and robots alike.
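A skeletal sketch of such a bounded thesis datument is shown below. The wrapper element and its namespace are hypothetical; the embedded vocabularies (XHTML, MathML, SVG and a simplified CML-like markup) stand for the generic and domain-specific components listed above, all copied in by value rather than linked.

<thesis xmlns="http://www.example.org/hypothetical-datument"
        xmlns:h="http://www.w3.org/1999/xhtml"
        xmlns:m="http://www.w3.org/1998/Math/MathML"
        xmlns:svg="http://www.w3.org/2000/svg">
  <chapter id="experimental">
    <!-- human-understandable text -->
    <h:p>Compound 3 was prepared as described below ...</h:p>
    <!-- domain-specific data, copied by value -->
    <molecule id="compound3" xmlns="http://www.example.org/simplified-cml">
      <!-- full atom and bond lists -->
    </molecule>
    <!-- mathematics -->
    <m:math>
      <m:mrow><m:mi>E</m:mi><m:mo>=</m:mo><m:mi>h</m:mi><m:mi>ν</m:mi></m:mrow>
    </m:math>
    <!-- graphics: 2D depictions, plotted spectra, etc. -->
    <svg:svg width="120" height="120">
      <!-- ... -->
    </svg:svg>
  </chapter>
</thesis>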
Remarkably, models for such aggregation are already arising within the so-called "blogging" communities, which are united by their published "Web logs" and some degree of semantic and ontological unity achieved using RSS metadata feeds (Murray-Rust and Rzepa 2003a, 2003b).
5 Open information
This article is addressed to those communities who genuinely wish to share scientific information. We believe that "most" scientists wish their data to be re-used, even if it occasionally leads to embarrassing retractions and revisions. Many authors do not recognise the value of aggregating their micropublished work, although this tradition has been common for 200+ years. We hope the datument will show that mutual contribution leads to a vastly richer resource for scientific discovery.
We accept that certain data cannot be made freely available, owing to patient confidentiality, patentability, etc. We are, however, urging that all data published in the primary literature be openly available for re-use. "Free" does not necessarily mean open, as re-use may be prohibited. By "open" we mean that the information can be aggregated, filtered and redistributed, and derivative works can be made, subject to appropriate license conditions. In open source software these licenses are well explored and (to paraphrase) include the preservation of original authorship, details of any changes in derivative works (if allowed) and full access to source code (not merely executable functionality).
A datument is generally composed of components from many sources. If these sources have any barriers to re-use, the distributability and re-use of the datument are severely limited. Among the barriers are the following (a sketch of embedded rights and provenance metadata is given after the list):
- Fees and non-open licenses of any sort. Understanding and honouring such licenses takes effort for humans and is beyond machines at present. This is probably a major barrier to the concept of the unbounded semantic Web. Robots generally do not have credit cards.
- Login procedures, even if free. The need to provide personal details and spammable emails deters many from accessing certain information sources.
- Unstated copyright. Since documents are presumed by default to carry the copyright of the author until many years after their death, most Web-based documents are de facto copyrighted and non-re-usable. A simple statement permitting re-use would be of great value.
- Lack of provenance. The data are unattributed and therefore of unknown value. Their semantics may be unknown and therefore misguessed.
- Lack of discovery metadata (including provenance). The resource is simply never found.
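Several of these barriers could be lowered simply by embedding explicit rights, provenance and discovery metadata in the datument itself. The sketch below uses Dublin Core elements for this purpose; the wrapper element is hypothetical and the rights statement is only an example of the kind of simple permission that would be of great value.

<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- discovery and provenance -->
  <dc:title>Synthesis and characterisation of compound 3</dc:title>
  <dc:creator>A. Student</dc:creator>
  <dc:date>2004-01-15</dc:date>
  <dc:publisher>Department of Chemistry, Example University</dc:publisher>
  <!-- an explicit, machine-locatable statement permitting re-use -->
  <dc:rights>
    The data in this datument may be freely aggregated, filtered and
    redistributed, provided the original authorship is preserved.
  </dc:rights>
</metadata>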
The protection of intellectual property (IP) on datuments is potentially extremely complex. Creative works are copyrightable but "facts" are not. However, collections of facts may be held to be creative works. The status of a datument, where many components including text are assembled, is unlikely to be clear and this could jeopardise the process of making data open in the community.
This could be simplified if authors made it clear they were making the complete scientific datument openly available. In most cases it has been created before submission to the publishers and we see little reason why copyright should be reassigned. If compromise seems inevitable we have heard of a recent case where authors keep copyright of the original manuscript and the publishers have copyright of the form that appears "in print" with pagination.
The international scientific unions have emphasised the importance of data being publicly available to the scientific community. In our view authors must not hand over copyright of the "data" to publishers. The datument (perhaps eviscerated of some of its "text") should be regarded as "data" and published in open view. We show below how this is technically straightforward and manageable with marginal costs.
6 The practice of publishing datuments
Although datuments are expressed in XML (Figure 3), this is not (yet) the format in which most scientists work. Data and text are collected in a variety of (often proprietary) non-extensible legacy formats, many in binary form. The two strategies are:
- to provide new XML-based technologies, including tools, and convert the community to their use
- to convert the legacy formats to XML, ideally as painlessly as possible
We have discussed this elsewhere but remark that the second approach, though unaesthetic and lossy, is likely to be the more tractable. Moreover, when it succeeds the community may be sufficiently impressed to invest in the infrastructure of XML. But 5000 years of linearisation will not disappear immediately.
Figure 3 (author-provided SVG). Datument-based publishing. The author(s) create a datument which is re-used by the publisher to create their preferred content and format. The datument is also published directly to the global community ("Web") where readers (human and machine) can re-use it with whatever tools they like
Each domain will have to create a significant amount of infrastructure and technology. In some cases this is well understood and under construction. We illustrate it from our own subject of molecular science (with the CML family of languages) and expect that the structure will map to other disciplines. With the help of the open source community we have created (an illustrative schema fragment follows the list):
- an extensible set of schemas to describe and define the language and its semantics
- additional tools for validation
- editing tools for humans to create native CML components (e.g. molecules, reactions)
- tools to convert from legacy (Word) to CML
- tools to parse the output of computational chemistry codes to CML
- DOMs (Document Object Models) to support the in-memory representation of datuments
- transformation tools (mainly XSLT stylesheets), including output to legacy
- tools to render the datuments graphically (viewers, sometimes combined with editors)
- interfaces to the mainstream computational programs and processes in the domain
- semantic agreements for binding datuments to behaviour
- ontologies to define the meaning of datument components
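To indicate what the first item involves, the fragment below sketches the kind of W3C XML Schema used to validate datument components. It is deliberately simplified and is not the actual CML schema; the target namespace is the same hypothetical one used in the earlier molecule sketch.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.org/simplified-cml"
           xmlns:c="http://www.example.org/simplified-cml"
           elementFormDefault="qualified">

  <!-- an atom must carry a unique id and an element type -->
  <xs:element name="atom">
    <xs:complexType>
      <xs:attribute name="id" type="xs:ID" use="required"/>
      <xs:attribute name="elementType" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>

  <!-- an atomArray is simply one or more atoms -->
  <xs:element name="atomArray">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="c:atom" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>

A validating parser, given such a schema, can act as the first "robot referee" of a submitted datument.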
The social dynamics of this daunting enterprise will vary considerably between domains. In some areas (e.g. crystallography) it is overseen by the appropriate scientific union or learned body. In others (e.g. new drug applications, NDA) it will be part of the regulatory process. In biosciences the (inter)national data curators have a major role. In chemistry the established nature of the chemical information industry has left a vacuum in communal development which is filled by a smallish group of open source enthusiasts such as ourselves. In all cases it requires considerable investment of some sort, but this is considerably lessened through the availability of open generic tools.
7 Datuments and Hypertext
The datument is therefore a hypermedia document accessible to robots and humans. At the ACM Hypertext conference we were impressed by the developments in human-understandable hypermedia but felt that robots were neglected in comparison. Web hypermedia systems are largely aimed at human readers and have few concessions for robots. Much of the analysis is post facto - analysing how humans and metadata-deprived robots navigate rather than building global hyperstructures ab initio. Developments such as ZigZag (Nelson 2004), with a non-traditional information structure, are exciting, but much evangelism will be required before they become tools in mainstream publishing.
8 Datument technology: a novel approach
This article contains two small examples of datuments (Figures 4 and 5) of published scientific information and both incorporate a mixture of "text" and "data". Their subject matter is chemistry but readers need no detailed domain knowledge. They are interactive, but are not just another example of scientific multimedia or hypermedia. We stress that the content is independent of the presentation and the graphical displays are created by tools operating on the display-neutral datuments. For example, a graphical display is irrelevant to a robot reader.
We argue that a cultural change in our approach to information is needed and that money on its own will not achieve it. Indeed, greater investment in mainstream publishing may worsen the situation. The publishers' primary selling point is their impact factor, not necessarily the functionality of the product. Funders and academic bodies compound this, and novel initiatives are often not welcomed if they have low impact. The model of publication must therefore change. Realistically this will take time, but we have to create something where the benefit is to the scientific community, and where the practitioners can be visionary. We propose students and their theses or reports as fertile ground.
Students have less fear of the impossible and less legacy to unlearn. We have involved both undergraduate and postgraduate students in authoring XML in many of the ways shown above and they have not only picked it up quickly but added their innovations. We therefore suggest that positive incentives should be given to students to create their theses as XML datuments.
We illustrate this approach with an example derived from a small part of a typical student chemistry thesis (Figure 4). The original component of the thesis is written in XML, with the chemistry carried directly using CML, itself an XML language. This datument can then be transformed into different representations for human assimilation. Figure 4a illustrates its conversion to an Acrobat file, destined largely for those humans who wish to print or archive the content, whereas the same datument can be transformed into e.g. Figure 4b, where the chemical content can now be viewed using either SVG (for 2D perception) or directly using a Java applet (where 3D perception might be needed).
Figure 4. a, Acrobat file derived from a chemistry datument; b, the same content presented using SVG/JMol viewers (both presentation styles are derived from the same XML datument, documents will display in a separate browser window)
The molecular structures emphasize the re-use of XML in three ways:
- Interactive 3D display.
- Transformation to high-quality 2D graphics (SVG). SVG allows for interactivity and animation, though we have not included this here.
- Offline transformation to "dumb" PDF. This shows that datuments can be "printed" in the normal way if required, but that the original information is still available as XML.
Note that the molecules can also be stored in searchable databases.
What are the immediate benefits of this approach? Some examples, which we contend may immediately save students work (a small stylesheet sketch is given after the list):
- It becomes much easier for students to check that their work fits scientific and procedural validity criteria, for example that all numbers should have scientific units.
- The data can be viewed synoptically. Have I got an IR spectrum for every compound?
- Data outliers can be detected. "The melting point seems very low compared with the other compounds - is this a typo or an interesting fact?"
- My supervisor has asked me to change the structure of the thesis (e.g. sort the experiments in alphabetic rather than chronological order, or put all the IR spectra before the UV ones).
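The last request becomes a mechanical transformation if the thesis is a datument. The sketch below shows the kind of stylesheet involved: an identity transform that copies the thesis unchanged except that the experiments are emitted in alphabetical order of compound name. The element names (experimentalSection, experiment, compound) are illustrative, not a normative thesis vocabulary.

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- copy everything unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- but emit the experiments sorted alphabetically by compound name -->
  <xsl:template match="experimentalSection">
    <xsl:copy>
      <xsl:apply-templates select="experiment">
        <xsl:sort select="compound/@name" data-type="text" order="ascending"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>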
The longer-term benefits are even more dramatic. Assuming the research group has five years' worth of student PhDatuments, they could:
- Correlate success of synthesis with solvent used.
- Find all compounds with unusually low NMR frequencies.
- Compute the predicted properties and spectra for all compounds. How well do they agree with experiment? Is in silico prediction a useful tool?
When the thesis is accepted, corrections will be easier to make. By using XSLT, the components of the thesis can be prepared as datuments for publication in the wider community. A working illustration of this process is given in Figure 5, where the action of XSLT stylesheets upon XML-based datuments can provide a variety of (user-driven) representations, including functional ones such as transformation of scientific units and manipulations of mathematical terms.
Figure 5. The use of XML and XSLT to provide a variety of rendering and transformation styles for scientific documents (this should be viewed using an XML/XSLT compliant browser such as Internet Explorer 6, document will display in a separate browser window)
Here the two datuments (organic chemical synthesis and computational chemistry) are cast in XML and retransformed on-the-fly by XSLT stylesheets. These transformations involve re-use of the information (filtering, sorting, tabulation, transformation of values). The stylesheets are independent of the precise content of each article and therefore applicable to a wide range of datuments.
- The "organic" stylesheets would act on any paper (in XML) reporting chemical synthesis, of which there are hundreds of thousands a year. A reader can re-view an article using (here) three such stylesheets. If all publishers adopted the idea of marking up scientific units in their papers, it would go some way towards preventing units-driven catastrophes.
- The computational stylesheets show how program manuals can come to life. The limited demonstration here shows retrieval of equations, which could then be input to symbolic algebra packages. It would be meaningful to search papers for functional forms containing exponentials and then robotically compute the second derivatives.
How could funding make this happen? We need:
- Serious investment in tools. We are starting to be approached by generic tool manufacturers who wish to create tools that effectively incorporate CML into datuments, so this does not require exceptional vision. We must overcome the conservatism of the academic community in how these are authored. A good place to start may be with undergraduate projects (Harrison et al. 2003) where restrictions are often fewer.
- Incentives, including cash prizes for students and a hall-of-fame supported by major sponsors (Harrison et al. 2003).
- Research sponsorship for creating demonstrators of the scientific benefits to evangelise this among our domain colleagues.