1 Introduction
Thesauri and related knowledge organisation systems (KOS) (Hodge 2000) have formed the basis of a large number of commercial and research retrieval systems. They are being applied increasingly to Web site architecture and in interfaces to Web databases (Rosenfeld and Morville 2002). The international research context has seen significant interest in KOS-based metadata due to the rapid growth of the Semantic Web, Semantic Grid and digital library (DL) research communities. Recent NKOS workshops have reported initiatives to update international thesaurus standards to take account of these online developments (e.g. Tudhope 2003).
It can be argued, however, that the traditional focus has tended to be on tools for KOS construction rather than use. To realise the full potential of KOS within distributed DL environments, we need a separation of the intellectual content of domain thesauri (and other KOS) from the various services that search, index and provide end-user interfaces. This separation would facilitate a division of labour, where different players can concentrate on the services they are best equipped to provide. We need to identify the different ways that KOS services can (optionally) assist the various stages of the search process. This requires a means for interoperability between different KOS and DL services. The lack of standardised programmatic access and interchange formats is a barrier to wider use of thesaurus and KOS intellectual resources in automated Web and retrieval applications. Programmatic access to thesaurus resources, with a useful degree of interoperability, requires a commonly agreed protocol building on lower-level standards, such as Web services.
Recently, Hill et al. (2002) proposed a research agenda to consider the implications of treating KOS as integrated DL components. To this end, they proposed the development of "a general KOS service protocol from which protocols for specific types of KOS can be derived". This would serve to open up KOS content to programmatic access from a variety of (Web) services across the controlled terminology information lifecycle, from description (indexing and annotation) services to cross-mapping to different kinds of retrieval services. Various styles of interface should be possible for each service, with different design decisions on thesaurus visualisation and the appropriate balance between interactive and automatic processing of thesaurus content possible for different audiences. This is particularly important for applications that seek to open up some form of access to thesauri in end-user interfaces, as opposed to traditional use by more specialist intermediary searchers.
A thesaurus is a type of KOS which provides a controlled vocabulary the concepts of which are structured by a core set of semantic relationships specified by international standards (Aitchison et al. 2000). Hierarchical relationships structure broader and narrower concepts in relation to a given concept, and allow a thesaurus to be visualised in hierarchies. Associative relationships structure more loosely related concepts. Equivalence relationships specify terms that can be considered as effective synonyms for a concept, according to the objectives of a particular thesaurus. Thus a major thesaurus will have a large 'entry vocabulary' of equivalent terms and 'scope notes' will define how each concept can be used.
A thesaurus protocol should facilitate rapid access to the various kinds of thesaurus data required by different applications, with regard to the bandwidth restrictions imposed by anticipated operating environments. For example, easy and flexible methods (assisted by linguistic techniques) for mapping from a searcher's expression of information need to controlled terminology are important. The protocol must also provide the information needed for various types of browsing tools that navigate and inspect thesaurus content and which may be applied in conjunction with different types of search method. Search tools should be able to use thesaurus services to generate queries that can be output in various formats, such as Z39.50, HTTP and also terminology-enhanced queries to commercial and custom search engines, etc. Thus, in future a combination of thesaurus and query protocols could permit any thesaurus (in a standard representation) to be used with a choice of search tools on various kinds of database. This includes not only controlled vocabulary search applications where the thesaurus is used both for indexing and as the basis (possibly in conjunction with free text) for queries on the database. We also need to consider collections without controlled metadata, where a thesaurus may assist query (re)formulation either by interactively suggesting concepts and terms or by automatic query expansion. Again, query expansion services can be used with both free text and controlled vocabulary indexed collections.
To progress with commonly agreed protocols, however, an understanding is required of the practical requirements of different forms of interfaces and retrieval provision. This paper reflects on our experiences in building a Web demonstrator of some novel thesaurus browsing and search services as a case study in determining requirements for thesaurus access protocols. The Web-based system provides dynamically generated interface components for finding terms and browsing the thesaurus, building a query and returning ranked results (using term expansion) from a collections database. We designed a custom application programming interface (API) of lower-level thesaurus functions to support the various user interface requirements of the application demonstrator. Based on our experience with developing the system, we review the literature and make recommendations for further development of thesaurus service protocols.
Section 2 gives an overview of the FACET research project and the Web demonstrator. This includes a detailed description of (and links to) key elements of the Web demonstrator and the rationale, together with a discussion of the data elements required by the different interface components. Section 3 reviews the literature on current KOS service protocols and related projects. The CERES, Zthes and ADL thesaurus protocols are discussed. Section 4 reflects on lessons that can be drawn from constructing the Web demonstrator and implications for separating the service protocol from the interface. This leads on to suggestions for evolving the thesaurus protocols and possible new features, including a novel, unified expansion service.
2 FACET project
FACET is a recently completed EPSRC funded research project that investigated the integration of a (faceted) thesaurus into the search interface together with semantic query expansion techniques (FACET). It explored the potential of a thesaurus as a search tool for controlled vocabulary applications where the thesaurus is used for both indexing and search. The project was a collaboration with the UK National Museum of Science and Industry (NMSI), which includes the National Railway Museum, and the J. Paul Getty Trust which provided the main vocabulary used in the research - the faceted [1] Art & Architecture Thesaurus (AAT; Soergel 1995). The mda (Museum Documentation Association) and CHIN (Canadian Heritage Information Network) also acted as advisors to the project. In addition to the AAT, we have explored the techniques with a number of smaller specialist thesauri, including a draft version of the mda Railway Object Names Thesaurus and the Alexandria Digital Library's Feature Type Thesaurus. An extract of the NMSI Collections Database that had been indexed with the AAT acted as a test bed for the project and the Web demonstrator.
The project investigated query expansion techniques, so that a searcher is not required to match exactly the terms used to index an item. This permits ranked 'best match' results for multi-concept, controlled vocabulary queries - a novel feature of the FACET system. Combining single concept vocabulary elements permits highly specific metadata (or annotations) and queries. However, matching such multi-concept descriptors poses significant challenges when searching. Indexer and searcher may choose related but not identical terms, or indeed the same person may choose varying terms on different occasions (Chen et al. 1997). Thesaurus search systems are seen in these situations as potential precision enhancing tools but usually at a cost in recall performance with few matches resulting. The FACET matching function addresses this challenge by generalising queries; semantic expansion of concepts allows partially matching results, with index terms semantically close but not identical to query terms. Providing a ranked list of results allows a user to see the closest matches at the top of the list and decide according to context how far to scan down. In fact, the matching function incorporates common search tactics (Bates 1979) such as dropping a facet from a multi-concept query, or trying a broader term.
The query term expansion involves a measure of semantic closeness between thesaurus concepts. Semantic closeness is based on the minimum number of (weighted) relationships connecting two thesaurus concepts. Thus, a narrower term might be considered very close to its immediate parent term. However, the traversal algorithm can take into account more complex chains of relationships between concepts, with a cost factor applied at each step. The semantic expansion engine can be configured to traverse particular thesaurus relationships and to allocate different cost parameters to relationship types. For example, with typical settings a high cost would be allocated to traversing an associative relationship but a very low (if any) cost to a narrower term traversal. A threshold limits expansion to a neighbourhood considered semantically close to the original concept. For a discussion of semantic closeness measures, see Alani et al. (2000). Note that semantic closeness can also be used in browsing (section 2.3.2). The multi-concept matching function applies a weighted average match to the query, with various controlling factors (for details see Tudhope et al. 2002a).
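The expansion described above can be sketched as a cost-bounded shortest-path search over the graph of thesaurus relationships. The following is a minimal illustration only, not the FACET implementation: the cost values, the toy thesaurus fragment and the names `COSTS`, `GRAPH` and `expand` are all assumptions made for the example.

```python
import heapq

# Hypothetical traversal costs per relationship type (illustrative values only):
# narrower-term (NT) steps are cheap, associative (RT) steps are expensive.
COSTS = {"NT": 0.1, "BT": 0.3, "RT": 0.6}

# Toy thesaurus fragment (invented): concept -> list of (relationship, neighbour).
GRAPH = {
    "dye": [("NT", "aniline dye"), ("NT", "natural dye"), ("RT", "dyeing")],
    "aniline dye": [("BT", "dye"), ("NT", "magenta")],
    "magenta": [("BT", "aniline dye")],
    "natural dye": [("BT", "dye")],
    "dyeing": [("RT", "dye")],
}


def expand(graph, start, threshold, costs=COSTS):
    """Return {concept: cheapest cumulative cost} for every concept that can be
    reached from `start` without the cumulative cost exceeding `threshold`."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        cost, concept = heapq.heappop(queue)
        if cost > best.get(concept, float("inf")):
            continue  # stale queue entry; a cheaper path was already found
        for relationship, neighbour in graph.get(concept, []):
            if relationship not in costs:
                continue  # relationship type not configured for traversal
            new_cost = cost + costs[relationship]
            if new_cost <= threshold and new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return best


def ranked(expansion):
    """List (concept, cost) pairs, semantically closest first."""
    return sorted(expansion.items(), key=lambda pair: (pair[1], pair[0]))
```

With these assumed settings, `ranked(expand(GRAPH, "aniline dye", 0.5))` lists the starting concept first, followed by its narrower term and then concepts reached through broader-term steps, while the associatively related concept falls outside the threshold. A lower cumulative cost corresponds to greater semantic closeness.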
Various standalone systems were developed as part of an iterative design and evaluation cycle (Blocks et al. 2002). This paper is concerned with lessons learned from the development of the Web demonstrator. Its overall functionality is briefly outlined next, while key features are discussed in more detail in the following section.
2.1 FACET Web demonstrator
The demonstrator illustrates how thesaurus content (and particularly term expansion across thesaurus relationships) may be used to query a collection in a realistic Web prototype application. It was intended more as an exploration of FACET research outcomes as dynamically generated Web components than as a general interface, although it might, for example, form the basis of 'advanced search' components [2]. The interface does not rely on pre-built static HTML pages; thesaurus content is generated dynamically, mostly via an in-memory representation of concepts and their relationships, with the remainder (e.g. Scope Notes) drawn from a relational database. Design of the interface took account of various findings from evaluation studies of the standalone interface.
A modified extract from an old version of the NMSI collections database is employed to illustrate an example query service. In this case, both the KOS terminology service (the AAT) and the collections data reside on the same server. However, in general this need not be the case. For example, a provider might offer a KOS service on their portal while independently a DL might offer various collections to be searched, offering a choice of vocabulary search tools or allowing the user to choose if a standard protocol existed.
The various interface components are described in the following sections of the paper. Together they form a coherent application for building structured queries from faceted thesaurus terms, and submitting queries. Figure 1 illustrates the Web demonstrator interface. Note, it incorporates a facility that attempts to map from a user's initial search expression to controlled thesaurus terminology ('Find in Thesaurus'). Figure 1 also shows the hierarchical thesaurus browser (clicking on a term in any of the displays or results brings this up), a pane for any associative relationships, a multi-concept query pane with individual term controls for dynamically setting the extent of semantic expansion, and a results pane where matching items are ranked according to semantic closeness. Terms are currently colour coded as a visual indication of their originating facet, the colour coding being controlled via an external stylesheet. (Note that alternative visual indicators and user controlled colour schemes may be appropriate in the interests of accessibility in operational systems). Clicking on any controlled term in the index field of the results will bring it up in the current browser view and the user can go on to reformulate the query by adding / substituting terms as required. See http://www.comp.glam.ac.uk/~FACET/webdemo/ for an introduction to the Web demonstrator.
Figure 1. FACET Web demonstrator interface (see http://www.comp.glam.ac.uk/~FACET/webdemo/demo_QueryBuilder.htm for demonstrator [3] - links are given to individual components in the following subsections)
In fact, the interface employs a slightly different set of top-level categories than the AAT. An external XML file maps a category to a hierarchy or concept in a thesaurus. This approach was influenced by work on fundamental categories by the Classification Research Group (CRG) and is reported in Tudhope et al. (2002b). In this case, we essentially reproduce the AAT facets (the AAT itself was influenced by the CRG approach).
2.1.1 Technical issues
The current browser-based interface is an Active Server Pages (ASP) application, and much of the content is dynamically generated during each page request using a combination of server-side scripting and compiled components. Consistency in page layout and element style is achieved using cascading style sheets, conforming to the CSS2 standard. The final system comprises a tiered component-based architecture as shown in Figure 2, accessing a SQL Server relational database. Early experiments suggested that acceptable real-time performance of semantic expansion over thesaurus relationships was not viable over relational database representations, due to the need to perform multiple iterative SQL queries in order to assemble the final data set. Therefore we decided early on to use an in-memory representation of the network of relationships between concepts for semantic expansion purposes, in combination with a relational database for additional thesaurus information. This proved successful and yielded acceptable real-time performance over the 28,646 preferred terms in our current AAT version. The architecture permitted both the standalone and Web-based versions of the system to make use of the semantic expansion engine operating over the in-memory directed graph structure of thesaurus concepts and their relationships.
The persistence of state information between page requests is a problematic issue in the design of Web-based applications. The Web demonstrator operates over the HTTP protocol - which is (by design) stateless. This means that page requests occur independently of each other, and the scope of contextual information such as the current status of the user's query is limited to the currently loaded page. Various methods exist to overcome this basic problem, each with its own particular advantages and disadvantages. Examples include the use of URL query strings, browser based 'cookies', ASP session objects, JSP session tracking etc. URL query strings expose elements of the API unnecessarily, and multiple parameters become inelegant. We wanted to avoid any dependence on browser configuration, so cookies were not regarded as a viable option. The solution adopted for the current demonstrator involved the creation of small functional 'scriptlet' interface components, which had the ability to communicate with the server without causing a browser to refresh the entire page. The rationale behind this decision was to eliminate the persistence problem and simultaneously reduce bandwidth usage. However, it had the side effect of introducing an undesirable element of platform dependence [3] on the client browser side. We also attempted to minimise the amount and type of information transferred for each request. Data caching by the browser made a noticeable difference in the apparent speed of operation.
2.2 Searching the thesaurus (mapping from user terminology)
Linguistic search facilities are currently fairly basic; time constraints meant that more advanced functional search options such as Boolean operators, fuzzy matching and stemming were excluded from this version. The system utilises the entry vocabulary structure to map from the user's own terminology to preferred concept descriptors in the thesaurus. Alternate and non-preferred term fields are automatically included in the search scope, and the preferred term is displayed where any match is located. Searching within scope notes is an option, and this proved useful in identifying other potential terms of interest. The example search on the term aniline shown in Figure 3 has the "include term definitions" box checked, and as a result has uncovered two processes that utilise aniline dye, and an object for which aniline dye is used during the fabrication process. The user may also search on multiple terms or on compound terms by entering a sentence or a phrase (a search on wall clocks would yield the terms walls, clocks and wall clocks).
Figure 3. Search on aniline (see http://www.comp.glam.ac.uk/~FACET/webdemo/demo_TermFinder.htm for demonstrator [3])
Terms matching the search criteria are currently presented in an alphabetically ordered list. More recent discussions have favoured incorporating an option to group the matching terms by their originating facet (at the moment terms are colour coded by facet). The rationale for this type of display would be to make the context of the terms clearer, and to reinforce the faceted nature of the underlying system to the user. Other display and grouping options may also be considered in the future, for example displaying broader terms to provide further contextual clarification.
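The entry-vocabulary lookup described in this section can be sketched as follows. This is an illustrative simplification under stated assumptions: matching is a plain case-insensitive substring test, the sample vocabulary and scope notes are invented, and `find_terms` is a hypothetical name rather than part of the FACET API.

```python
# Invented sample data: every entry term (preferred or non-preferred)
# points at the preferred term for its concept.
ENTRY_VOCABULARY = {
    "aniline dye": "aniline dye",
    "coal-tar dye": "aniline dye",   # hypothetical non-preferred term
    "magenta": "magenta",
    "fuchsine": "magenta",           # hypothetical non-preferred term
}

# Invented scope notes, keyed by preferred term.
SCOPE_NOTES = {
    "aniline dye": "Synthetic dye used for colouring textiles.",
    "ebonizing": "Staining process that may use aniline dye.",
}


def find_terms(entry_vocabulary, scope_notes, query, include_notes=False):
    """Return the preferred terms matched by `query`, in alphabetical order.

    Non-preferred terms are always searched; scope notes are searched
    only when `include_notes` is set.
    """
    needle = query.strip().lower()
    if not needle:
        return []
    hits = set()
    for term, preferred in entry_vocabulary.items():
        if needle in term.lower():
            hits.add(preferred)
    if include_notes:
        for preferred, note in scope_notes.items():
            if needle in note.lower():
                hits.add(preferred)
    return sorted(hits)
```

In this sketch a search on a non-preferred term returns the preferred term for the same concept, and enabling `include_notes` can surface further concepts whose definitions mention the search string, mirroring the "include term definitions" option.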
2.3 Displaying and browsing thesaurus structure
Exposing the thesaurus as a browsable hyperlink structure allows the user to navigate and explore the scope of the thesaurus in three ways. The first method is browsing a traditional tree structure. Each displayed term takes the form of a hyperlink, allowing the user to navigate concepts by clicking the link.
2.3.1 Hierarchical display
The dynamically generated hierarchical display shown in Figure 4 shows all ancestry terms back to the root (facet) term. We then show immediate narrower (child) terms and siblings (narrower terms of parent term). Although we followed Getty AAT guidelines on displaying all broader concepts up to the top of the hierarchy, the decision to limit the display to one level below the currently selected term conflicted somewhat with the Getty Institute's recommendations [4] for the display of AAT data (two levels down). Our rationale was that in dense and extensive hierarchical structures such as the AAT, extending the hierarchy two narrower term levels below the currently selected term can result in the display of hundreds of terms. Immediate narrower terms already provide adequate local contextual information without further elaboration, providing they exhibit some visual indication where further levels exist. The user can make the decision to navigate further down the tree, simply by clicking on a term. In practice this display proved responsive and robust.
Figure 4. Hierarchical display for aniline dye (see http://www.comp.glam.ac.uk/~FACET/webdemo/demo_TermViewer.htm for demonstrator [3])
Note, although not implemented here, for display purposes a "term type" indicator could visually distinguish between the top three levels of the hierarchy to reduce confusion: Materials (facet), Materials (hierarchy) and materials (indexing term). This approach could also serve to indicate where no further narrower term relationships exist (term type = leaf), or to further distinguish guide terms (sub-facet indicators).
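The data needed for the hierarchical display (ancestors back to the root, siblings, and one level of narrower terms with an indication of whether further levels exist) can be assembled from a simple child-to-parent mapping. The sketch below assumes a single broader term per concept and uses an invented hierarchy fragment; `hierarchy_view` is a hypothetical helper, not a function of the FACET API.

```python
# Invented hierarchy fragment: narrower term -> its single broader term.
PARENT = {
    "dye": "materials",
    "aniline dye": "dye",
    "natural dye": "dye",
    "magenta": "aniline dye",
    "mauve": "aniline dye",
}


def hierarchy_view(parent, term):
    """Collect the data for a one-level-down hierarchical display of `term`."""
    # Invert the mapping once so that children can be looked up by parent.
    children = {}
    for child, broader in parent.items():
        children.setdefault(broader, []).append(child)

    # Walk up to the root, then reverse so the root (facet) comes first.
    ancestors = []
    node = term
    while node in parent:
        node = parent[node]
        ancestors.append(node)
    ancestors.reverse()

    siblings = sorted(c for c in children.get(parent.get(term), []) if c != term)
    # Each narrower term carries a flag showing whether further levels exist.
    narrower = [(c, c in children) for c in sorted(children.get(term, []))]
    return {"ancestors": ancestors, "siblings": siblings, "narrower": narrower}
```

The boolean attached to each narrower term is one way of supplying the visual indication that deeper levels exist, so that the display can stop one level below the selected term without hiding the presence of further structure.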
2.3.2 Semantic browsing display
Semantic browsing is a novel FACET feature that presents semantic expansion as a second browsing option. It offers an alternative to sequential hierarchical navigation of a complex thesaurus structure. With semantic browsing, the hierarchical display is replaced by a linear list, ranked by semantic closeness of terms to the base concept. Thus in Figure 5 the complexity of the underlying conceptual structure is hidden, but semantically relevant terms are still navigable. Again, each displayed term takes the form of a hyperlink; any selected term then becomes the top of the list and a new list of semantically relevant terms is generated below it. Based on evaluation feedback, the bar graph display of semantic closeness was adopted in preference to the percentage indicator employed in the standalone FACET application.
This is useful for 'local maps' that include Related Terms [5], which might be overlooked in hierarchical browsing. In some situations, semantic expansion may also be an easier browsing option than investigating which sub-hierarchies (or sub-facets) are fruitful to explore in large thesauri. A user can continue to browse via semantic expansion by clicking the terms presented.
Figure 5. Semantic expansion of aniline dye (see http://www.comp.glam.ac.uk/~FACET/webdemo/demo_TermViewer.htm for demonstrator [3])
2.3.3 Related terms - display format
The third method of browsing is via associative relationships. If present, the related terms for a concept are displayed as a series of comma separated navigable links in the form of preferred term labels (see Figure 6). The links lead to areas of the thesaurus that may be outside the immediate hierarchical structure, and sometimes may lead to terms within a completely different facet.
Figure 6. Related terms for alizarin (see http://www.comp.glam.ac.uk/~FACET/webdemo/demo_TermViewer.htm for demonstrator [3])
2.4 Scope notes - display format
A composite concept description is generated from the scope note and other fields and displayed on selection of any term. This serves to clarify the contextual scope of the displayed concept, allowing the user to disambiguate homograph terms and to establish whether they have selected the most appropriate concept. The inclusion of non-preferred variant terms in the display assists in this clarification.
Figure 7. Description for aniline dye (see http://www.comp.glam.ac.uk/~FACET/webdemo/demo_TermViewer.htm for demonstrator [3])
2.5 Interactive query building
In the Web demonstrator the results of interactive semantic expansion of query terms are fed back to the user dynamically, allowing them to vary the degree of expansion to apply to each individual query term through the use of a simple coarse grained control, as shown in Figure 8. The radio button controls were adopted in preference to the slider control employed in the standalone interface. Evaluation studies of the standalone system showed that in practice users experienced some difficulty manipulating the slider control effectively.
Figure 8. Expanded query on aniline dye (see http://www.comp.glam.ac.uk/~FACET/webdemo/demo_QueryTerms.htm for demonstrator [3])
2.6 Querying collections
To demonstrate FACET's partial matching and relevance ranking, queries are executed against a set of descriptive records representing museum artefacts. The results are displayed as a list of records in order of descending relevance. In Figure 9 an expanded query on a single term aniline dye has been executed, and three matching items have been retrieved. Note how semantic expansion has resulted in partial matches on the index term magenta. In the context of the indexed data and the indexing vocabulary (AAT), this term represents a material - a kind of aniline dye (refer to the hierarchical display in Figure 4). For more information on FACET's matching function, which provides ranked results for both single and multi-concept queries via semantic expansion, see Tudhope et al. (2002a).
Figure 9. Results for an expanded query on aniline dye (see http://www.comp.glam.ac.uk/~FACET/webdemo/demo_QueryResults.htm for demonstrator [3])
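Partial matching with ranked results can be illustrated with a simplified scoring scheme. The sketch below is an assumption-laden stand-in for the matching function detailed in Tudhope et al. (2002a): it uses a plain unweighted mean where the paper describes a weighted average with further controlling factors, and the closeness values, item records and the name `rank_items` are invented for the example.

```python
def rank_items(items, expansions):
    """Rank collection items against a multi-concept query.

    items:      {item_id: set of index terms}
    expansions: one {term: closeness in [0, 1]} dict per query concept,
                i.e. the expanded neighbourhood of that concept.
    Returns (item_id, score) pairs, best match first; non-matching
    items are omitted.
    """
    if not expansions:
        return []
    scored = []
    for item_id, index_terms in items.items():
        # For each query concept, keep the closest index term on the item.
        per_concept = [
            max((closeness.get(term, 0.0) for term in index_terms), default=0.0)
            for closeness in expansions
        ]
        score = sum(per_concept) / len(per_concept)
        if score > 0:
            scored.append((item_id, score))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Invented example: a two-concept query with expanded neighbourhoods.
EXPANSIONS = [
    {"aniline dye": 1.0, "magenta": 0.9},
    {"textiles": 1.0, "silk": 0.8},
]
ITEMS = {
    "item-A": {"magenta", "silk"},
    "item-B": {"aniline dye"},
    "item-C": {"wood"},
}
```

Here an item indexed with terms close to both query concepts outranks an item that matches only one concept exactly, which reflects the idea that dropping a facet from a multi-concept query still yields a (lower ranked) partial match.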
2.7 Data requirements for FACET interface components
The data requirements for building the various display components discussed in previous sections are summarised in Table 1. These requirements motivate the discussion in section 4 on employing external, distributed terminology services to build equivalent displays.
Component | Data required |
Search (Figure 3) |
For the overall display:
|
Hierarchical display (Figure 4) |
For the conceptual structure:
|
Semantic expansion (Figure 5) |
For the overall display:
|
Related terms (Figure 6) |
For the overall display:
|
Description (Figure 7) |
For the overall display:
|
Expanded query terms (Figure 8) |
For the overall display:
|
3 Programmatic access and thesaurus protocols
Many Web sites have offered a static thesaurus lookup facility via pre-built HTML pages. Notable thesaurus-based Web projects with some dynamic element, such as mapping from a given string to controlled terminology or database query capability, include AAT, AGROVOC, APAIS, CABI, CHIN, FATKS, MeSH, and several others. For comprehensive indexes, see (Koch, SWAD-EUROPE, Will). Experiments with novel online thesaurus interfaces include Beaulieu (1997), Johnson and Cochrane (1995), Koch et al. (2003), Pollitt (1997), Tudhope et al. (2002a), Yee et al. (2003). In light of the cognitive demands on non-specialist users posed by the incorporation of thesauri into the search process, the extent to which programmatic interface APIs can accommodate a range of interface styles is a challenge which underlies the following discussion.
There has also been interest recently in applying web service frameworks (Gardner 2001) to the provision of Web-based, programmatic access to KOS resources. Matthews (2002) argues that it is necessary for Web service architectures to be augmented by Semantic Web techniques if both initiatives are to achieve their ultimate goals [6]. Zisman et al. (2002) discuss experiences from applying Web service wrappers in an 'information bus' approach to the development of a prototype system that integrated various FAO data sources with disparate organisation and structure. The HILT project has explored the possibilities of a high-level thesaurus to provide terminology services at the collection level for UK higher educational communities. Two presentations at the ECDL 2003 NKOS Workshop discussed initial results on the provision of KOS-based distributed services. Miles et al. (2003) describe initial work on an RDF Thesaurus Interchange Format and online Java thesaurus tools. The latter package thesaurus data as RDF and employ an HTTP thesaurus protocol, the intention being to develop them as Web services in the SWAD-EUROPE project. Vizine-Goetz (2003) describes the OCLC Terminology Services Project that employs Z39.50 (SRU/W) with a modified Zthes profile (see section 3.2) to make available and extend the value of terminology resources via various Web services, including the automatic mapping between different resources (such as DDC and LCSH) and standards crosswalks.
While these projects have offered dynamic Web-based presentation of KOS content, the concern with attempting to define standard protocols for distributed access to KOS goes back several years. In 1998, the second NKOS Workshop [7] had as one of its themes a 'functional model of the process of using a KOS over a network'. The notes from that discussion categorised types of KOS use as Consultation (user attempts to find an unknown concept); Description (indexing); Search (query with a known concept and subsequent presentation of results); Other uses (e.g. translation). Discussion focused mainly on the consultation phase (initial entry to a KOS) and different types of information display for a single term, but included the ability to specify relationship types and the degree of expansion. The diagram illustrating that discussion is still pertinent and shows the concern with capturing both KOS lookup terminology services and KOS services for queries (Figure 10).
Figure 10. NKOS 1998 Workshop functional model (http://nkos.slis.kent.edu/SESS2.html)
Three projects in particular have explicitly been concerned with attempting to formulate general protocols that might form the basis of client-server communication and programmatic access to KOS content and services. These are briefly reviewed below. Coincidentally, both CERES and Zthes appear to have announced some form of public version around February 1999. The ADL Protocol is a later development.
3.1 CERES thesaurus protocol
The Californian CERES/NBII Thesaurus Partnership Project (CERES) developed a general protocol standard for distributed thesaurus communication [8]. This project was a collaboration between the California Environmental Resources Evaluation System (CERES), and the US Geological Survey Biological Resources Division (USGS/BRD) to facilitate access to environmental information. The aim was to construct an integrated controlled environmental vocabulary together with the tools that would enable it to be used for metadata creation and query construction, in both stand-alone and Web systems. This involved the development of a 'general-purpose thesaurus applications programming interface' to broker communication between the thesaurus and client applications. A working demonstration was provided on the project Web site. CERES developed an HTTP protocol using an RDF (XML) thesaurus representation format that followed the NISO Z39.19 standard.
The services provided by the CERES protocol include:
- return thesaurus properties (KOS metadata)
- given a term ID or term, return a description of that term
- Match: this takes a string parameter(s) and yields thesaurus terms 'that may share similar concepts to those of the parameters. How the server determines this is unspecified.' The type of term can also be specified. An option restricts terms to 'those that would help a user begin browsing the thesaurus'. Options include:
  - return all terms in the thesaurus
  - return terms matched by any type parameters
  - return terms matched by SQL style syntax
  - return terms matched by unix glob or regexp style matching
Problematic issues flagged by the project include:
- partial return and continuation mechanism when a Server sends only part of a potentially long set of information at one time (e.g. if a whole thesaurus is requested)
- USE+ relationship
- a mechanism for the client to query a server's capabilities
3.2 Zthes Z39.50 profile for thesaurus navigation
The Zthes Z39.50 profile for thesaurus navigation (ZTHES), 'an abstract model for representing and searching thesauri', was based on the Z39.50 protocol following ISO 2788 [9]. Thus part of the specification concerns the representation of thesaurus database records for Z39.50 implementation. It was intended, however, that the model could be general enough for use in other base communication protocols and an XML thesaurus DTD is given for the model. The Zthes profile has been used to make some thesauri available (via Z39.50) on the Internet by means of a Zthes-compliant Z39.50 server. Subsequently Zthes has been used as part of the ZING, 'Z39.50-International: Next Generation' effort, in the SRW Search/Retrieve Web Service protocol [10]. While looking to build on and facilitate access to Z39.50 systems, SRW includes both SOAP and URL-based access mechanisms.
In particular, the Zthes Qualifier-Set for CQL (Common Query Language used in SRW) provides the following set of Zthes services (note that other services are provided via SRW generally):
- search thesaurus representation for concept ID or term string
- find words occurring anywhere in the term record
- search for all narrower terms of the given concept
- search for the broader term of the given concept
- search for the preferred term of the given concept
- search for non-preferred terms of the given concept
- search for the related terms of the given concept
- search for 'linguistic equivalents' of given terms
Zthes term records are either full (element set 'f') or brief (element set 'b') records. The element sets are defined using a number of pre-defined 'tagSets' described within the Z39.50 standard. There are both mandatory and optional elements, with sub-records representing term relationships and postings.
Future possibilities noted for the profile (based on Zthes 0.5 specification) include (among others):
- versioning support at thesaurus or term level
- post-coordination support (e.g. coal mining USE COAL + MINING)
- terms 'considered suitable' as browsing starting points
- terms whose note contains specified words
- terms in a specified language
3.3 ADL thesaurus protocol
The ADL (Alexandria Digital Library) thesaurus protocol (ADL) [11] is intended as a lightweight, stateless programmatic interface to thesaurus servers, based on XML and HTTP. The protocol's model of a thesaurus closely follows Z39.19 and the definition is specified in an XML DTD and corresponding XML schema. Unlike the wider Z39.50 context of Zthes, the ADL protocol is focused on 'downloading, querying, and navigating thesauri'. A sister gazetteer protocol has also been developed. A generic, open source Java thesaurus server is supplied and demonstration forms illustrate the five independent services:
- get-properties() - return thesaurus properties (KOS-level metadata)
- download(include-nonpreferred, format) - return list of all terms with option on including non-preferred terms
- query(operator, text, fuzzy, format) - search the thesaurus for terms matching a supplied candidate string (user terminology), with operators equals, contains (all/any) and reg-exp, plus a fuzzy option (stemming, spelling correction). The exact semantics of these operators are not defined by the protocol, but there is a suggestion that the thesaurus service provider might document its interpretation in the response to get-properties().
- get-broader(starting-term, max-levels, format) - returns a hierarchy of terms above the starting term to a maximum level (or all terms above), where format allows different amounts of detail on describing a term, ranging from just the term name and whether preferred to all immediate relationships (e.g. RTs), etc.
- get-narrower([starting-term,] max-levels, format) - similarly returns a narrower hierarchy
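To make the behaviour of the two hierarchical services concrete, the following Python sketch models get-broader and get-narrower over a small in-memory thesaurus. It is not the ADL implementation: the Python function names, the nested-dictionary return shape and the sample hierarchy are assumptions made for illustration, and the format parameter is omitted. Only the starting term and the maximum number of levels are taken from the protocol description above.

```python
# Hypothetical sample data: term -> list of immediate broader terms (BT).
BROADER = {
    "aniline dye": ["synthetic dye"],
    "synthetic dye": ["dye"],
    "natural dye": ["dye"],
    "dye": ["colorant"],
    "colorant": [],
}

# Derive the inverse relation: term -> list of immediate narrower terms (NT).
NARROWER: dict = {term: [] for term in BROADER}
for term, parents in BROADER.items():
    for parent in parents:
        NARROWER[parent].append(term)


def _hierarchy(relation, term, max_levels):
    """Return {related term: sub-hierarchy}, nested down to max_levels.

    max_levels=None means "no limit" (an assumption of this sketch).
    """
    if max_levels is not None and max_levels <= 0:
        return {}
    remaining = None if max_levels is None else max_levels - 1
    return {t: _hierarchy(relation, t, remaining) for t in relation[term]}


def get_broader(starting_term, max_levels=None):
    """Composite call: the hierarchy of terms above the starting term."""
    return _hierarchy(BROADER, starting_term, max_levels)


def get_narrower(starting_term, max_levels=None):
    """Composite call: the hierarchy of terms below the starting term."""
    return _hierarchy(NARROWER, starting_term, max_levels)
```

A call such as get_broader("aniline dye") returns the whole chain of broader terms in one response, whereas get_broader("aniline dye", 1) returns only the immediate broader term.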
Current issues of debate for the protocol include [11]:
- a concept unique identifier; the protocol currently relies on a unique (qualified) term name
- an alternative to the "preferred" term model for dealing with synonyms and synonym rings
- a means of identifying top-level facets other than simply by level of the hierarchy
- support for sub-types of standard thesaurus relationships
- support for a language attribute
While it is possible to make use of the full term label with qualifiers to uniquely identify a term within a given thesaurus, separate unique identifiers occur in many existing KOS, such as the AAT (Harpring 2000). Unique identifiers would facilitate KOS updates where a label might change and also situations where multiple KOS were being referenced. Other points raised overlap with standards for representation of KOS generally and are discussed further in section 5.
4 Discussion of requirements and implications for protocols
The description of the three thesaurus protocols in section 3 shows that there is a fair degree of agreement on basic services, with varying ideas on string matching functionality and the use of linguistic equivalents. CERES, as published, appears to have focused particularly on term lookup functionality. The Zthes profile closely follows the standard thesaurus representation of data elements and relationships with a set of atomic calls. The ADL protocol includes provision for returning groupings of primitive elements (chunking), via the get-broader and get-narrower commands. This composite capability is highly relevant to the FACET Web demonstrator, as shown in section 4.2.
Reviewing the description of the Web demonstrator interface components in section 2 and the data requirements summarised in Table 1, we can see that the requirements include single items such as the currently selected concept or its scope note, lists of items such as the possible controlled vocabulary matches for a given user term, and also more complex groupings of items. These composite groupings range from the provision of appropriate context for a current concept (e.g. facet, immediate narrower terms, immediate broader term(s)) to the potentially wider set of information such as ancestry term relationships back to the root of the hierarchy, or a semantic expansion list of potentially relevant concepts.
4.1 Web demonstrator API functions
To understand the data requirements for possible protocol calls in future developments, it is instructive to examine a subset of the underlying custom API functions used in the implementation of the current Web demonstrator. These took the form of low-level COM API interface functions, called via ASP pages to assemble the required data. Referring back to Figure 2 and the underlying system architecture, this interaction takes place between the Server-side ASP pages and scriptlets component, and the Expansion engine and semantic network component.
- Term_Ancestry(term_identifier): a composite function similar to ADL's get-broader(starting-term, max-levels, format). The function is used to return the hierarchical ancestry of terms above (broader than) the identified start term, terminating at the root (facet) term.
- Term_Descendants(term_identifier, levels): a composite function similar to ADL's get-narrower(starting-term, max-levels, format). The function is used to return a hierarchy of terms below (narrower than) the identified start term.
- Term_Description(term_identifier): returns the scope note for a term. To perform the equivalent using the ADL protocol would require the use of the query(operator, text, fuzzy, format) function, using term-description as the format, then parsing the <note type="scope note">...</note> tag from the returned data. This would not be onerous and indeed the ADL approach might be considered more flexible, returning some useful associated composite information which we have currently achieved using a sequence of separate function calls.
- Expand(term_identifier(s), expansion_threshold, diminishment_factor): performs the semantic expansion of terms. Additional parameters for assigning costs to relationships for traversal are set separately.
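The role of the Expand parameters can be illustrated with a small best-first traversal in Python. This is a sketch only and not the FACET expansion engine: the paper does not give the arithmetic at this point, so the code assumes that closeness starts at 1.0 for each start term, that each traversal multiplies closeness by (1 - cost of the relationship type) and by the diminishment factor, and that expansion stops once closeness falls below the threshold. The sample links, the cost values and the function signature (which takes the costs as an argument rather than setting them separately) are likewise hypothetical.

```python
import heapq

# Hypothetical semantic network: term -> list of (relationship type, neighbour).
LINKS = {
    "dye": [("NT", "synthetic dye"), ("RT", "dyeing")],
    "synthetic dye": [("NT", "aniline dye"), ("BT", "dye")],
    "aniline dye": [("BT", "synthetic dye")],
    "dyeing": [("RT", "dye")],
}

# Hypothetical traversal costs per relationship type (values between 0 and 1).
COSTS = {"NT": 0.2, "BT": 0.3, "RT": 0.5}


def expand(start_terms, expansion_threshold, diminishment_factor, links, costs):
    """Return {term: closeness} for every term reached at or above the threshold.

    Assumed model: closeness shrinks multiplicatively with each traversal, so a
    best-first search always settles the closest route to a term first.
    """
    best = {}
    queue = [(-1.0, term) for term in start_terms]  # max-heap via negation
    heapq.heapify(queue)
    while queue:
        neg_closeness, term = heapq.heappop(queue)
        if term in best:
            continue  # already reached by an equally close or closer route
        closeness = -neg_closeness
        best[term] = closeness
        for rel_type, neighbour in links.get(term, []):
            next_closeness = closeness * (1.0 - costs[rel_type]) * diminishment_factor
            if neighbour not in best and next_closeness >= expansion_threshold:
                heapq.heappush(queue, (-next_closeness, neighbour))
    return best


result = expand(["dye"], 0.5, 0.9, LINKS, COSTS)
```

Under these assumed values the expansion from "dye" reaches "synthetic dye" and "aniline dye" but not "dyeing", because the single RT step already falls below the threshold.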
4.2 The expression of composite protocol operations
The Web demonstrator interface component is achieved by a combination of the above COM API calls with appropriate parameters. Achieving the same result by protocol calls similar to the ADL composite calls would be fairly straightforward. However, following an atomic protocol set such as Zthes would pose problems for a dynamically constructed interface component of the type illustrated by the browser. Note that while section 2 describes the rationale for our interface design, other interface styles also require composite protocol functionality. For example, some interfaces might display all narrower terms to the leaves of the hierarchical tree, while polyhierarchical displays require more complex broader groupings. The Getty guidelines (Harpring 2000) recommend displaying broader terms to the root and two levels of narrower terms, while Johnson and Cochrane (1995) and Koch et al. (2003) employ slightly different groupings in light of the types of KOS employed and the particular purposes of their applications.
Although it would be possible to reproduce this browsing hierarchy via combinations of primitive calls, we argue that the overheads resulting from the round-trip network latency of repeated calls to the service provider would hamper the performance of interactive interfaces over common Web bandwidth restrictions. Therefore an appropriate composite provision in the protocol is desirable. Something similar to the ADL's get-broader and get-narrower with parameters to control the level would seem to match the requirements of such hierarchical displays.
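The round-trip argument can be made concrete with a short Python sketch that counts server calls. It is an illustration under stated assumptions rather than a measurement: the CountingThesaurus class, its method names and the sample hierarchy are hypothetical, each method call stands in for one network round trip, and a monohierarchy is assumed (only the first broader term is followed).

```python
# Hypothetical sample data: term -> list of immediate broader terms (BT).
BROADER = {
    "aniline dye": ["synthetic dye"],
    "synthetic dye": ["dye"],
    "dye": ["colorant"],
    "colorant": [],
}


class CountingThesaurus:
    """Stand-in for a remote thesaurus service; every call is one round trip."""

    def __init__(self, broader):
        self.broader = broader
        self.calls = 0

    def get_bt(self, term):
        """Atomic call: the immediate broader terms only."""
        self.calls += 1
        return list(self.broader.get(term, []))

    def get_ancestry(self, term):
        """Composite call: the full chain of broader terms up to the root."""
        self.calls += 1
        chain, current = [], term
        while self.broader.get(current):
            current = self.broader[current][0]
            chain.append(current)
        return chain


def ancestry_via_atomic_calls(server, term):
    """Client-side loop that rebuilds the ancestry from atomic calls."""
    chain = []
    parents = server.get_bt(term)
    while parents:
        chain.append(parents[0])
        parents = server.get_bt(parents[0])
    return chain


atomic_server = CountingThesaurus(BROADER)
atomic_chain = ancestry_via_atomic_calls(atomic_server, "aniline dye")

composite_server = CountingThesaurus(BROADER)
composite_chain = composite_server.get_ancestry("aniline dye")
```

For this three-level ancestry the atomic client makes four calls (the last one discovers that the root has no broader term) against a single composite call. With a round-trip latency of 100 ms, chosen purely for illustration, that is roughly 400 ms against 100 ms before any narrower or sibling terms are fetched.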
The thesaurus protocols we have examined have no direct facility for semantic expansion operations. However, the arguments regarding composite services for display purposes also apply to semantic expansion, both in semantic browsing (section 2.3.2) and in query term expansion (section 2.5). For different applications of semantic query expansion in the Information Retrieval field, see Beaulieu (1997), Kekäläinen (1998) and Kristensen (1993). In the FACET system, this may include automatic traversal of associative relationships. This would involve some overhead with the ADL protocol which (unlike Zthes) subsumes Related Terms within the term details [12]. More fundamentally, with the current protocols, implementing expansion would involve multiple calls to the server, expanding over all permitted relationships and then consolidating the retrieved data. The two composite hierarchical calls discussed above would not achieve this general expansion. Treating expansion more generally would permit an expansion service which would unify various sets of functionalities, as discussed in section 4.3.
4.3 A unified expansion protocol operation
This section proposes a method to express the required extent of expansion succinctly within a concept/relationship KOS structure. This semantic expansion service could be employed within a useful composite operation for two related but different purposes:
- To describe the information to be retrieved for display purposes, within a single protocol call. This approach elegantly negates the need for a sequence of separate protocol calls (i.e. get-broader, get-narrower, etc.), thus potentially reducing the bandwidth and latency overheads. It could also allow the client application itself to express and retrieve the group of data fields that constitute a single term record, rather than this format being predefined by the protocol.
- To express the allowable paths (i.e. the intended extent) for a query term semantic expansion operation. Term expansion facilities are currently absent from thesaurus retrieval protocols.
The required extent and shape of expansion for a term in a conceptual KOS structure can be expressed as a tree of allowable paths emanating from that term (i.e. the allowable relationship types that may be traversed). To illustrate this, revisit the example shown in Figure 4, the hierarchical display for aniline dye. To recap, the information required to be retrieved to display that particular structure is:
- The ancestry back to the root of the hierarchy. This involves an iterative or recursive traversal of BT [13] relationships until a term with no BT relationships is reached (i.e. the root or facet term).
- The immediate narrower term (NT) relationships.
- Narrower term relationships for immediate broader terms (BT-NT) to display the starting term within its native sibling term group.
Table 2
The allowable paths:
    BT*
    BT-NT
    NT
    paths = BT* | BT-NT | NT
As a hierarchical tree:
    paths
    . BT*
    . BT
    . . NT
    . NT
As XML:
    <paths>
      <bt steps='-1' />
      <bt>
        <nt />
      </bt>
      <nt />
    </paths>
Using this simple, extensible syntax, any legitimate allowable path structure can be modelled (including the traversal of other relationship types such as UF and ALT). The selective traversal of specific relationship sub-types may also be accommodated, for example if the standard thesaurus relationships are extended. The optional "steps" attribute shown in the XML data in Table 2 defines the allowable number of repeated traversals for a relationship of the specified type. A positive numeric value (default '1') allows for repetition, e.g. <nt steps='3' /> would allow a path of NT-NT-NT, while a value of '-1' indicates an unbounded traversal. Any number of predefined expansion strategies can be represented in this format as platform-neutral XML strings, and a single protocol call of get-expansion(term, XML) would be passed the term identifier and the applicable XML (in the form of a string or a URL) as input parameters. The format of the returned data would be defined by the protocol or by some standard interchange format.
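A minimal Python sketch of how a server might interpret such a path specification is given here. It is one possible reading rather than a defined part of any protocol: the function name get_expansion, the dictionary-based thesaurus, the sample terms and the flat set returned are assumptions for illustration. The sketch further assumes that a repeated element matches between one and "steps" traversals, and that nested elements continue from every term reached by their parent element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: term -> {relationship type: [related terms]}.
THESAURUS = {
    "aniline dye": {"BT": ["synthetic dye"], "NT": ["mauveine"]},
    "synthetic dye": {"BT": ["dye"], "NT": ["aniline dye", "azo dye"]},
    "dye": {"BT": ["colorant"], "NT": ["synthetic dye", "natural dye"]},
    "colorant": {"NT": ["dye", "pigment"]},
}

# The allowable paths for the hierarchical display example.
DISPLAY_PATHS = "<paths><bt steps='-1' /><bt><nt /></bt><nt /></paths>"


def get_expansion(term, paths_xml, thesaurus):
    """Return the set of terms reachable from `term` along the allowable paths."""
    found = set()
    _follow(ET.fromstring(paths_xml), {term}, thesaurus, found)
    found.discard(term)  # the start term is not part of its own expansion
    return found


def _follow(spec, start_terms, thesaurus, found):
    for step in spec:  # each child element names one relationship type
        rel = step.tag.upper()
        limit = int(step.get("steps", "1"))  # -1 means unbounded
        frontier, reached, depth = set(start_terms), set(), 0
        while frontier and (limit < 0 or depth < limit):
            # Move one step along `rel`; skipping terms already reached
            # guarantees termination even if the data contains cycles.
            frontier = {
                neighbour
                for t in frontier
                for neighbour in thesaurus.get(t, {}).get(rel, [])
            } - reached
            reached |= frontier
            depth += 1
        found |= reached
        if len(step):  # nested elements extend the path from the terms reached
            _follow(step, reached, thesaurus, found)


expansion = get_expansion("aniline dye", DISPLAY_PATHS, THESAURUS)
```

With the sample data, the single call gathers the ancestry, the sibling group and the immediate narrower term of "aniline dye", while terms outside the allowable paths (such as "natural dye") are left out.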
The example in Table 3 illustrates reuse of the same allowable path strategy in a slightly more complex example. This case is intended to illustrate a possible query term expansion operation that might have been performed by the FACET semantic expansion engine. Note that the longer paths subsume the shorter paths in the hierarchical tree.
Table 3
The allowable paths:
    BT
    BT-NT
    BT-NT-NT
    NT*
    NT*-RT
    RT
    RT-NT*
    paths = BT-NT-NT | NT*-RT | RT-NT*
As a hierarchical tree:
    paths
    . BT
    . . NT
    . . . NT
    . NT*
    . . RT
    . RT
    . . NT*
As XML:
    <paths>
      <bt>
        <nt steps='2' />
      </bt>
      <nt steps='-1'>
        <rt />
      </nt>
      <rt>
        <nt steps='-1' />
      </rt>
    </paths>
In this case the expanded set of terms would be used, not in a thesaurus browser display, but as part of a semantic query expansion service, for example applying various costs according to relationship types, traversal steps and other factors to produce ranked results, as described in section 2.6. The method allows the behaviour of the expansion algorithm to be dynamically controlled by the client, possibly via an appropriate utility or configured for a particular collection or user group. This particular configuration is not intended as a universal recommendation but as an illustration of the principle. However, this expansion space is roughly consistent with Brooks' (1997) relevance aura findings from his empirical tests of semantic closeness perception. Support for allowing RT-NT* expansion can be found in the RT Guidelines, discussed in Tudhope et al. (2001) based on the Getty Editorial Manual for RTs. The AAT followed a principle of inheritance - an RT should be made to the broadest possible related term and the relationship should also hold for narrower terms of the one related. (See appendix in Tudhope et al. (2001).)
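The way longer paths subsume shorter ones can be checked mechanically. The Python sketch below lists every allowable path implied by a path specification, using the query expansion XML from the second example. The function names are hypothetical, and the treatment of repeated steps (a steps value of n yields paths of one to n traversals, and '-1' is written with a trailing asterisk) is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# The path specification from the query term expansion example.
QUERY_PATHS = (
    "<paths>"
    "<bt><nt steps='2' /></bt>"
    "<nt steps='-1'><rt /></nt>"
    "<rt><nt steps='-1' /></rt>"
    "</paths>"
)


def allowable_paths(paths_xml):
    """List every path implied by the specification, shortest prefixes first."""
    return _walk(ET.fromstring(paths_xml), "")


def _walk(spec, prefix):
    paths = []
    for step in spec:
        rel = step.tag.upper()
        limit = int(step.get("steps", "1"))
        if limit < 0:
            labels = [rel + "*"]  # unbounded repetition
        else:
            labels = ["-".join([rel] * k) for k in range(1, limit + 1)]
        for label in labels:
            path = f"{prefix}-{label}" if prefix else label
            paths.append(path)
            paths.extend(_walk(step, path))  # nested elements extend this path
    return paths


paths = allowable_paths(QUERY_PATHS)
```

For this specification the function yields BT, BT-NT, BT-NT-NT, NT*, NT*-RT, RT and RT-NT*, so the three longest paths are sufficient to define the whole set.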
4.4 Query services
The second example of a KOS semantic query expansion service leads on to consideration of query services generally. They are outside the scope of the KOS service protocols discussed in section 3. The issues involved here form part of wider DL service specification. In the long term, KOS services need to be integrated with DL collection query and discovery services (Hill et al. 2002). The SRW query language, CQL, offers a useful set of functionality relating to Boolean queries with proximity and linguistic (stemming, fuzzy) qualifiers, but does not exhaust the range of query services that could be offered. Probabilistic ranked result queries and also semantic expansion services, such as FACET's matching function (see section 2.6) are among the other possibilities. As can be seen from the Web demonstrator, there are various possible interactions between KOS and query services and this is a fruitful arena for further work in detailed specification. Examples include provision for building a query (interacting with KOS services), query term expansion, control of the matching function, display and inspection of results.
5 Conclusions
The Web demonstrator explores issues concerning the development of a client-server thesaurus retrieval application, where interface elements are dynamically constructed via a custom API to various thesaurus functional components. These include a mapping to controlled terminology, an interactive browser over thesaurus relationships, query building and searching a collection. Novel features include browsing via expansion and 'best match', ranked results to multi-concept queries, both based on semantic expansion of concepts. The demonstrator shows that integration of a thesaurus into the search interface is possible with a service-oriented architecture. The thesaurus data may originate from a distributed remote source but can be assembled within the interface to form an integral part of the application.
Our experience with the Web demonstrator and the data requirements of the interface components, together with the discussion in section 4 on general protocols for distributed access to KOS functionality, suggests the general conclusions outlined below.
5.1 A service oriented approach
Any protocol will involve judgements on common groupings of functionality that should be carried as primitive elements by the protocol and therefore services that are appropriate to provide at the base level. This may involve a trade-off between the complexity of the protocol (and ensuing problems for maintenance) and the practical requirements of services needed by client applications - and any novel services that may be envisaged. Thus it involves a policy on common forms of display structure and navigation patterns to support. It also involves practical judgements of current technological capability - bandwidth and platform capability to support perceived user expectations of interactive response. Note that similar issues have surfaced in previous efforts at standardising APIs for functional domains where different levels of implementation of a standard have been identified [14].
The current trend towards service oriented architectures (SOA) brings an opportunity of moving towards a clearer separation of interface components from the underlying data sources, via the use of appropriate Web services. There are many advantages to this approach: platform neutral dissemination of thesaurus content, leveraging existing intellectual effort in the compilation of thesauri - exploiting common representations, etc. However, in an SOA, basing distributed protocol services on the atomic elements of thesaurus data structures and standard relationships is not necessarily the best approach; client operations that require multiple client-server calls would carry an overhead, as each function call introduces an element of round-trip network latency. This would limit the interfaces that could be offered by applications adhering to the protocol. We argue that Web interfaces offering advanced thesaurus services require protocols which group primitive thesaurus data elements (via their relationships) into composites, to achieve a reasonable response rate. The ADL Protocol's composite services of get-broader and get-narrower with their parameterisation are a step in this direction.
5.2 A unified expansion service
We propose that a thesaurus service protocol should include a semantic expansion service, as described in section 4.3. The functionality to perform controlled expansion of thesaurus terms would form a useful supplement to the composite provision. Protocol functions can be characterised as forming two core classes - the lexical query (pattern matching facilities for term finding) and the structural query (retrieval of specific elements of the structure). Recognising that functions such as get-broader and get-narrower are structural queries that implement a unidirectional form of expansion, it follows that they are subsumed in a get-expansion function [15]. A unified term expansion function could support several areas of KOS functionality, ranging from KOS visualisation to query support:
- Provision of conventional hierarchical displays (of different configurations) within a browser user interface - a single function call to retrieve all necessary data.
- Provision of novel alternative navigation interfaces, such as navigation via semantic expansion.
- Automatic expansion of query terms, which could be used in various ranked result (best match) query services (including but not confined to FACET matching function).
- Term suggestion facilities to assist in document indexing applications.
- Reduced need for future protocol extensions to cater for further possible KOS relationship type and sub-type extensions [16] (since relationships are specified explicitly).
5.3 Standard KOS representations and explicit reference to relationships
The issue of sub-typing and extensions to standard thesaurus relationships illustrates that thesaurus (KOS) representations (interchange formats) and service protocols for retrieval are closely linked. Progress needs to be made on both dimensions if common standards are to be achieved. Standard KOS representations would encourage adoption of protocols. However, there are issues of extensibility and flexibility to consider. The current protocols tend to be limited to a small core set of standard semantic relationships, the rationale being that this restriction aids interoperability. A service protocol should be expressed in terms of a well defined but extensible set of KOS data elements and relationships (e.g. via a namespace) and relationship type should be a parameter to the appropriate protocol commands. This would allow the specialisation of the current thesaurus relationships, for example. Standards for representation and interchange of thesauri (and KOS more generally) are generally beyond the scope of this paper. However, consideration of the type of interface illustrated by the Web demonstrator suggests the need for explicit representation of facet structure, for example facet membership as an attribute of a concept. This might be needed for displaying thesaurus structure in browsing interfaces, as a means of restricting the scope of term matching, as an element of a query matching function, etc.
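The idea of passing the relationship type as a parameter, with specialised sub-types resolved against an extensible registry, can be sketched in Python as follows. All names here are hypothetical: get_related is not a call in any of the protocols reviewed, the registry stands in for whatever namespace mechanism a standard might adopt, and the sub-type labels NTG and NTP (generic and partitive narrower terms) and the sample links are used only as examples.

```python
# Hypothetical registry: specialised relationship type -> its parent type.
SUBTYPE_OF = {"NTG": "NT", "NTP": "NT", "NTI": "NT"}

# Hypothetical sample links: (source term, relationship type, target term).
LINKS = [
    ("vehicles", "NTG", "cars"),
    ("cars", "NTP", "engines"),
    ("vehicles", "RT", "transport"),
]


def is_a(rel_type, requested):
    """True if rel_type is the requested type or one of its specialisations."""
    while rel_type is not None:
        if rel_type == requested:
            return True
        rel_type = SUBTYPE_OF.get(rel_type)
    return False


def get_related(term, relationship_type, links=LINKS):
    """Return targets linked to `term` by the given type or any sub-type of it."""
    return [
        target
        for source, rel, target in links
        if source == term and is_a(rel, relationship_type)
    ]
```

A client that only knows the standard NT relationship still receives "cars" from get_related("vehicles", "NT"), while a client aware of the specialisation can ask for "NTG" or "NTP" explicitly. New sub-types are then added to the registry rather than to the set of protocol commands.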
5.4 Thesaurus protocol recommendations
Building on the above discussion, our general recommendations for the further evolution of the current thesaurus (and more generally KOS) protocols are for:
- composite service provision (as discussed in section 5.1)
- expansion service provision (as discussed in section 5.2)
- explicit reference to relationship type in services, rather than hardwiring relationships, assuming progress can be made in standard thesaurus representations (section 5.3).
We suggest that future protocol standards identify levels of implementation (Web services implementations could publish the explicit services provided). This would permit some application servers to offer a base level provision, while others might offer more advanced services. The base level could include both the atomic services based on specified relationship types and common composite services, such as get-broader and get-narrower (with parameters including relationships involved). The advanced level could include expansion, such as the proposed, unified expansion service, among other options. Different kinds of expansion options - and different degrees of complexity - could be offered by different service providers. Linguistic services have been touched on in section 3 while reviewing current protocols but have not been a focus of this paper. Clearly, various linguistic services which map to controlled terminology from initial user query formulations are other possibilities for advanced level provision.
5.5 Future work
Our intention is to move towards an open (Web service) platform in future work and build on a general programmatic KOS interface, along the lines discussed in this paper, rather than the custom API employed in the Web demonstrator. We are also investigating computational linguistic techniques with a view to applying FACET techniques to full-text collections that have not been indexed with a controlled vocabulary, and to search interfaces where the KOS is employed behind the scenes.