Towards Modular Access to Electronic Handbooks

Caterina Caracciolo
Language & Inference Technology Group, ILLC
University of Amsterdam, Nieuwe Achtergracht 166, 1018 WV Amsterdam, The Netherlands
Email: caterina@science.uva.nl

Abstract

This paper reports on an ongoing project aimed at providing an exemplary architecture for an electronic dissemination environment for scientific handbooks. It focuses on facilitating navigation through, and access to, electronic handbooks by means of a WordNet-like concept hierarchy: synsets (sets of synonyms) that are connected to each other and to external sources by semantic relations.

1 Introduction

This paper reports on the Logic and Language Links (LoLaLi) project (http://remote.science.uva.nl/~caterina/LoLaLi/), which is ongoing at the University of Amsterdam and is supported by Elsevier Science. The project aims to define an appropriate environment for accessing scientific handbooks published in electronic form. We base our work on the Handbook of Logic and Language (van Benthem and ter Meulen 1997).

Scientific handbooks are typically read in a non-linear manner, which suggests that we could also adopt a modular approach for electronic publication. Harmsze (2000) proposes a modular structure for articles in experimental sciences, but it is not clear whether this approach can be adapted to handbooks with a more abstract content. In the LoLaLi project we aim to define what electronic publications should look like: we are especially interested in developing a good hyperlink system to provide access to the content of the handbook. This system should be rich enough to account for the complexity of the domain (the interface between Logic and Linguistics), while avoiding disorientation of the reader. Our prototype focuses on a specific domain, but we believe that we will be able to draw general conclusions on dealing with electronic handbooks in a wider range of domains.

The approach uses a WordNet-like concept hierarchy to annotate and access the handbook. It consists of synsets (sets of synonyms) that are connected to each other and to external sources by semantic relations. Concept hierarchies are often used for the purpose of navigating through large collections of documents. They are very useful for the organization, display and exploration of a large amount of information (see Note 1).

Moreover, it has been shown that users in a hypertext search task who have hierarchical browsing patterns perform better than users who have sequential browsing paths (McEneaney 1999). Therefore, it is important that architectures for electronic handbooks allow, or even enforce, such hierarchical patterns: a concept hierarchy is a good way of doing this.

Section 2 describes in some detail the concept hierarchy developed within the Logic and Language Links project (referred to below as the LoLa hierarchy): Section 2.1 presents its internal structure, and Section 2.2 the XML encoding of the concept hierarchy. Section 3 is dedicated to ongoing work.

2 Organization of the Hierarchy

The LoLa hierarchy (http://lit.science.uva.nl/LoLaLi/alpha/) consists of concepts connected by several semantic relationships. By concept we mean every relevant notion or topic in the domain that is worth individual discussion. In line with WordNet (Fellbaum 1998), we make a distinction between words or terms on the one hand, and concepts on the other hand: a concept is denoted by a synset, a set of synonymous words (we only use the English nomenclature for our domain). Words are synonymous if they have (more or less) the same meaning in some settings. For example, first-order logic is also known as predicate logic, FOL or predicate calculus.
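The distinction between terms and concepts can be made concrete with a small data structure. The following is only a sketch under stated assumptions: the class name Concept, the identifier "c001" and the helper build_term_index are hypothetical and are not taken from the LoLaLi implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A concept, denoted by a synset of synonymous English terms."""
    concept_id: str        # hypothetical identifier format
    synset: frozenset      # the terms that denote this concept


def build_term_index(concepts):
    """Map each lower-cased term to the ids of the concepts it may denote."""
    index = {}
    for concept in concepts:
        for term in concept.synset:
            index.setdefault(term.lower(), set()).add(concept.concept_id)
    return index


# The synset given in the text for first-order logic.
fol = Concept(
    concept_id="c001",
    synset=frozenset(
        {"first-order logic", "predicate logic", "FOL", "predicate calculus"}
    ),
)
term_index = build_term_index([fol])
```

Looking up any of the four terms (for example term_index["predicate calculus"]) leads to the same concept, which is the point of separating terms from concepts.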

The semantic relationships linking synsets come in two kinds: ones that are internal to the concept hierarchy (Section 2.1), and ones that link the concepts to external resources (Section 3.2).

2.1 Internal Architecture

Concepts in the hierarchy are annotated with a gloss; for instance, the study of language meaning is a gloss for semantics. Moreover, they come with a longer description, provided by the authors of the concept especially for the LoLa hierarchy.

The hierarchy consists of a TOP concept, under which there are four main branches: computer science, mathematics, linguistics and philosophy; from each of these concepts stems a branch of the hierarchy, organized by relations of subtopic - supertopic. A concept is a subtopic of another concept if one (and only one) of the following relations holds:

  1. is a: epistemic logic is a kind of modal logic;
  2. part of: metaphysics is part of philosophy;
  3. technical notion: operator is a notion in mathematical logic;
  4. mathematical result: Goedel's incompleteness theorem is a mathematical result (theorem) of logic and mathematical logic;
  5. computational tool: SPASS is a computational tool for first-order logic (it is a first-order resolution-based theorem prover);
  6. historical view: the concept Frege on quantifiers gives an historical view of the concept quantifiers.
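The requirement that exactly one of these relations holds for each subtopic link can be sketched as follows. This is an illustration only: the names SubtopicRelation and SubtopicLinks are assumptions, and the actual LoLaLi system may enforce the constraint differently.

```python
from enum import Enum


class SubtopicRelation(Enum):
    """The six subtopic relations listed above."""
    IS_A = "is a"
    PART_OF = "part of"
    TECHNICAL_NOTION = "technical notion"
    MATHEMATICAL_RESULT = "mathematical result"
    COMPUTATIONAL_TOOL = "computational tool"
    HISTORICAL_VIEW = "historical view"


class SubtopicLinks:
    """Child-to-parent links, each typed with exactly one relation."""

    def __init__(self):
        self._links = {}  # (child, parent) -> SubtopicRelation

    def add(self, child, parent, relation):
        key = (child, parent)
        existing = self._links.get(key)
        if existing is not None and existing is not relation:
            # A second, different relation on the same link is rejected.
            raise ValueError(
                f"{child!r} is already a subtopic of {parent!r} "
                f"via {existing.value!r}"
            )
        self._links[key] = relation

    def relation(self, child, parent):
        return self._links[(child, parent)]


links = SubtopicLinks()
links.add("metaphysics", "philosophy", SubtopicRelation.PART_OF)
links.add("SPASS", "first-order logic", SubtopicRelation.COMPUTATIONAL_TOOL)
```

Trying to add a second relation for the pair (metaphysics, philosophy) raises a ValueError, so every link carries a single type.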

Abbreviations placed beside the name of a concept indicate to the reader the type of relation it holds with its parent concept. The semantics of these relations is made transparent to the reader by means of examples accessed by clicking on the abbreviation. More refined and compact visualizations (e.g. colors or icons) will be tested by means of usability tests with an appropriate sample of users.

The above set of relations is currently undergoing a detailed analysis to make sure they provide a reasonable coverage of important semantic and cognitive connections between concepts. In particular, we are currently studying the semantic distinctions between the notion of is a, interpreted as the set-theoretical notion of subclass (for example modal logic is a kind of logic, where modal logic stands for a family of logics using modal operators), and the notion of instance, interpreted as the set-theoretical notion of membership (using the above example, K2 is an instance of modal logics, i.e. a particular axiomatization of a modal system).

Figure 1. Graphical representation of a fragment of the LoLa hierarchy

The LoLa hierarchy is not a strict tree, because multiple parenthood is allowed: for example, the concept logic has the concepts mathematics and computer science as parents. Strictly speaking, it is therefore a graph rather than a tree.
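Multiple parenthood can be illustrated with a plain mapping from each concept to its parents. The fragment below is a sketch only: it contains just the concepts named in the text, and the function names are hypothetical.

```python
# Each concept maps to the list of its parents; TOP has none.
parents = {
    "TOP": [],
    "computer science": ["TOP"],
    "mathematics": ["TOP"],
    "logic": ["mathematics", "computer science"],
}


def ancestors(concept, parents):
    """Collect every concept reachable by following parent links upwards."""
    seen = set()
    stack = list(parents.get(concept, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def is_strict_tree(parents):
    """A strict tree allows at most one parent per concept."""
    return all(len(p) <= 1 for p in parents.values())
```

Here is_strict_tree(parents) is False because logic has two parents, while ancestors("logic", parents) still reaches TOP along either path.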

Beside the subtopic - supertopic relations, non-hierarchical relations are also allowed, and are used for navigational purposes. They include the following:

  1. Sibling: all concepts having the same parent(s). Informal experiments indicate that readers find it useful to know what the siblings of a given concept are. Provided that the siblings are listed in some meaningful order, they prevent the "lost in space" problem. Siblings are automatically computed and presented to the reader with a flag indicating what kind of relation they have with the parent.
  2. Other meanings: all concepts having the same title, but with a different gloss. For example, computer science and mathematics have logic as subconcept, with the following gloss: "A system or calculus of reasoning"; while logic under philosophy has the gloss: "The branch of philosophy that analyses inference". This relation is automatically computed, too.
  3. Associated concepts: concepts sharing some properties or somehow analogous to each other. For instance, finite state machine is similar in this sense to regular language. This relation is provided with a short explanation of the reason for the similarity.
  4. Antonymous concepts: as in the case of completeness and incompleteness. Learning the antonym of a concept not only teaches us more about the meaning of the antonymous concept, but also about the concept itself (Muehleisen 1997). Like the similarity relation, this relation comes with a short explanation.
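The two automatically computed relations, sibling and other meanings, can be derived from the parent links, titles and glosses alone. The sketch below makes two assumptions that the text does not settle: concepts count as siblings when they share at least one parent, and the identifiers c1 to c3 are invented for the example. The gloss of metaphysics is left empty because it is not given in the text.

```python
concepts = {
    "c1": {
        "title": "logic",
        "gloss": "A system or calculus of reasoning",
        "parents": ["mathematics", "computer science"],
    },
    "c2": {
        "title": "logic",
        "gloss": "The branch of philosophy that analyses inference",
        "parents": ["philosophy"],
    },
    "c3": {"title": "metaphysics", "gloss": "", "parents": ["philosophy"]},
}


def siblings(concept_id, concepts):
    """Ids of the other concepts that share at least one parent (assumption)."""
    own_parents = set(concepts[concept_id]["parents"])
    return {
        other_id
        for other_id, other in concepts.items()
        if other_id != concept_id and own_parents & set(other["parents"])
    }


def other_meanings(concept_id, concepts):
    """Ids of concepts with the same title but a different gloss."""
    me = concepts[concept_id]
    return {
        other_id
        for other_id, other in concepts.items()
        if other_id != concept_id
        and other["title"] == me["title"]
        and other["gloss"] != me["gloss"]
    }
```

With this toy data, the philosophical sense of logic (c2) has metaphysics (c3) as a sibling and the mathematical sense (c1) as another meaning.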

2.2 Encoding the Concept Hierarchy

Each concept is given a unique identifier and is represented as an XML document in which the following pieces of information are stored:

All these pieces of information constitute elements in the XML tree; some of them (e.g. title and gloss) are given an identifier to be individually addressable. Moreover, the XML documents incorporate a set of metadata (Dublin Core compliant) about the document, such as author and date of creation and modification. An extension of the DTD to accommodate bibliographic references is under development, in collaboration with Elsevier Science.
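As an illustration of such an encoding, the sketch below parses a small concept document with Python's standard xml.etree.ElementTree module. The element names, the identifier scheme and the author value are hypothetical placeholders, not the actual LoLaLi DTD; only the Dublin Core namespace URI and the gloss for semantics are taken from existing sources.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace

# Hypothetical document layout; the real DTD is not reproduced here.
CONCEPT_XML = """\
<concept id="c042" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <metadata>
    <dc:creator>placeholder author</dc:creator>
  </metadata>
  <title id="c042-title">semantics</title>
  <gloss id="c042-gloss">the study of language meaning</gloss>
</concept>
"""

root = ET.fromstring(CONCEPT_XML)

# Elements that carry an id attribute are individually addressable.
addressable = {el.get("id"): el for el in root.iter() if el.get("id")}

# Namespaced metadata is looked up through a prefix-to-URI mapping.
creator = root.find("metadata/dc:creator", {"dc": DC_NS}).text
```

Here addressable["c042-gloss"].text returns the gloss, and creator holds the Dublin Core author field.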

The graph structure is coded in a relational table, while descriptions are stored separately because they are typically written in LaTeX and can also contain non-textual objects, such as images. Users do not access the XML base but a static set of HTML documents, searchable and browsable, generated from the XML base at regular intervals.
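A relational encoding of the graph, and the generation of a static HTML fragment from it, might look like the following sketch. The table name subtopic, its columns and the HTML layout are assumptions made for the example; SQLite is used only because it ships with Python, not because the project is known to use it.

```python
import sqlite3
from html import escape


def build_graph_table(rows):
    """Create an in-memory table holding one row per subtopic link."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subtopic ("
        " child TEXT NOT NULL,"
        " parent TEXT NOT NULL,"
        " relation TEXT NOT NULL,"
        " PRIMARY KEY (child, parent))"  # one relation per link
    )
    conn.executemany("INSERT INTO subtopic VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def children_of(conn, parent):
    """Return (child, relation) pairs for a parent, ordered by child name."""
    cur = conn.execute(
        "SELECT child, relation FROM subtopic WHERE parent = ? ORDER BY child",
        (parent,),
    )
    return cur.fetchall()


def render_page(conn, concept):
    """Render a static HTML fragment listing the subtopics of a concept."""
    items = "".join(
        f"<li>{escape(child)} ({escape(relation)})</li>"
        for child, relation in children_of(conn, concept)
    )
    return f"<h1>{escape(concept)}</h1><ul>{items}</ul>"


# Links taken from the examples in Section 2.1.
conn = build_graph_table([
    ("metaphysics", "philosophy", "part of"),
    ("operator", "mathematical logic", "technical notion"),
    ("SPASS", "first-order logic", "computational tool"),
    ("Frege on quantifiers", "quantifiers", "historical view"),
])
page = render_page(conn, "philosophy")
```

Regenerating such pages at regular intervals keeps the HTML that users see decoupled from the underlying store.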

Although more sophisticated languages for representing hierarchical structures exist, such as RDF Schema, we decided to stay with XML, which is perhaps less expressive but certainly more established.

The current version of the hierarchy is populated with close to 500 concepts, provided by the LoLaLi group at ILLC. We plan to expand the group of authors and double the size of the concept hierarchy in the next two years.

3 Ongoing work

3.1 Searching the Hierarchy

The intended reader of the envisaged electronic version of the Handbook of Logic and Language is not a novice, because our hyperlink structure is not meant to provide a learning environment. Moreover, given the size of the concept hierarchy, it is best accessed through a mixture of browsing and searching, and searching presupposes some ability to phrase an information need.

A module for sophisticated searching of the hierarchy at the user end is under development. The search facility is a crucial feature for users, since access through browsing alone is not suitable for significantly large graphs. The user will be able to search the hierarchy using a structured search that allows for queries like "has title ...", "has gloss ..." or "has sibling called ...". In addition, generic (i.e. unstructured) search by string matching will be available.
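Since the search module is still under development, the following is only a sketch of how the structured and unstructured queries could behave. It assumes case-insensitive substring matching, siblings defined as concepts sharing at least one parent, and invented identifiers; none of these choices is fixed by the project. The gloss of metaphysics is left empty because it is not given in the text.

```python
CONCEPTS = [
    {"id": "c1", "title": "logic",
     "gloss": "A system or calculus of reasoning",
     "parents": ["mathematics", "computer science"]},
    {"id": "c2", "title": "logic",
     "gloss": "The branch of philosophy that analyses inference",
     "parents": ["philosophy"]},
    {"id": "c3", "title": "metaphysics", "gloss": "",
     "parents": ["philosophy"]},
]


def _contains(text, needle):
    """Case-insensitive substring test (an assumed matching rule)."""
    return needle.lower() in text.lower()


def sibling_titles(concept, concepts):
    """Titles of the other concepts sharing at least one parent."""
    own = set(concept["parents"])
    return [
        c["title"] for c in concepts
        if c is not concept and own & set(c["parents"])
    ]


def structured_search(concepts, title=None, gloss=None, sibling=None):
    """Combine 'has title', 'has gloss' and 'has sibling called' filters."""
    hits = []
    for c in concepts:
        if title is not None and not _contains(c["title"], title):
            continue
        if gloss is not None and not _contains(c["gloss"], gloss):
            continue
        if sibling is not None and not any(
            _contains(t, sibling) for t in sibling_titles(c, concepts)
        ):
            continue
        hits.append(c)
    return hits


def free_text_search(concepts, text):
    """Unstructured search: plain string matching on title and gloss."""
    return [
        c for c in concepts
        if _contains(c["title"], text) or _contains(c["gloss"], text)
    ]
```

For example, structured_search(CONCEPTS, title="logic", sibling="metaphysics") singles out the philosophical sense of logic, which a title query alone cannot do.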

3.2 Linking the Hierarchy to the Handbook

In addition to the internal links, our concept hierarchy will also accommodate external links in the sense that they are between concepts and targets outside the hierarchy. We distinguish between handbook links (to information in the handbook but outside the concept hierarchy), and web links (to information sources on the Web). Here we focus on the former.

The target of a handbook link can be of different levels of granularity (a part, a chapter, a subsection, a definition, etc.). Ideally, concepts higher in the hierarchy refer to larger fragments in the handbook, while lower concepts refer to smaller parts. However, as the handbook chapters are written by different authors, resulting in a different structuring and writing style for every chapter, this is hard to achieve.

Handbook links come with metadata describing crucial information about the publication linked (e.g. author, editor, publisher), enriched with an indication of the link type (e.g. definition, theorem, discussed-in, example, counterexample, ...).

At an earlier stage of the project we experimented with automatically generating hypertext links from concepts in the hierarchy to (electronic versions of) chapters in the original handbook. As the documents to be retrieved, we took pages of the original handbook; while arbitrary, this choice was forced on us by the diversity of the writing styles of the contributing authors. For the queries we explored several possibilities (term, term plus description, term and description plus additional weights on the term). We plan to use the current (richer) hierarchy to run more refined experiments and concentrate on the segmentation of the text with respect to the topic treated, and classification of the topic itself.
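The three query variants can be illustrated with a very simple bag-of-words scorer. This is a sketch under strong assumptions: the retrieval model actually used in the experiments is not described here, so the weighted term-count score, the function names and the two toy pages below are invented purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_query(term, description="", term_weight=1.0):
    """Build a weighted query from a term and an optional description.

    term only:                 build_query(term)
    term plus description:     build_query(term, description)
    extra weight on the term:  build_query(term, description, term_weight=3.0)
    """
    weights = Counter()
    for token in tokenize(description):
        weights[token] += 1.0
    for token in tokenize(term):
        weights[token] += term_weight
    return dict(weights)


def score(query, page_text):
    """Sum the query weights of the tokens occurring on the page."""
    counts = Counter(tokenize(page_text))
    return sum(weight * counts[token] for token, weight in query.items())


def rank(query, pages):
    """Order page ids by decreasing score."""
    return sorted(pages, key=lambda pid: score(query, pages[pid]), reverse=True)


# Toy pages, invented for the example.
pages = {
    "p1": "Modal logic studies modal operators.",
    "p2": "Semantics is the study of language meaning.",
}
query = build_query("semantics", "the study of language meaning", term_weight=3.0)
ranking = rank(query, pages)
```

Raising term_weight makes pages that mention the concept name itself count for more than pages that only share words with its description.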

3.3 Testing

Two groups of tests will be run in collaboration with the User-Centered Design of Elsevier Science. The first set of experiments will assess how the structure and interface of the LoLa concept hierarchy are exploited by the users. We are interested in how well and how fast a user can get acquainted with the system and navigate through it. Once this first task has been performed, and after having linked the concept hierarchy to the handbook, we will assess the effectiveness of the hierarchy for the disclosure of the text. This represents a global evaluation of the project, to find out how good a bridge a concept hierarchy is for accessing an electronic handbook.

4 Conclusion

We have reported ongoing work aimed at providing an exemplary architecture for an electronic dissemination environment for scientific handbooks. We focused on facilitating navigation through and access to electronic handbooks by means of a WordNet-like concept hierarchy of synsets connected to each other and to external sources by various semantic relations. We also reported on the state of the project, and outlined current and future developments.

Acknowledgements

This research was supported by Elsevier Science Publishers. Thanks to Anita de Waard, Guus Schreiber and Dagobert Soergel for helpful comments and suggestions. Thanks to Maarten de Rijke and Joost Kircz for their supervision.

References

Fellbaum, C. (ed.) (1998) WordNet, an Electronic Lexical Database (MIT Press)

Harmsze, F. (2000) "A Modular Structure for Scientific Articles in an Electronic Environment". PhD thesis, Universiteit van Amsterdam

McEneaney, J. E. (1999) "Visualizing and assessing navigation in hypertext". In Proceedings of the 10th ACM Conference on Hypertext and Hypermedia, pp. 61-70

Muehleisen, V. L. (1997) "Antonymy and Semantic Range in English". PhD thesis, Northwestern University, USA

van Benthem, J. and ter Meulen, A. (eds.) (1997) Handbook of Logic and Language (Elsevier)

Note

1 Well-known examples include Yahoo!'s topic hierarchy for exploring the Web (http://www.yahoo.com/), and Google's directories (http://www.google.com/dirhp?hl=en&tab=wd&ie=UTF-8) based on the DMOZ Open Directory Project (http://dmoz.org/).