Optimal User-Centered Knowledge Organization and Classification Systems: Losee: JoDI

Optimal User-Centered Knowledge Organization and Classification Systems: Using Non-reflected Gray Codes

Robert M. Losee
School of Information and Library Science, University of North Carolina,
Chapel Hill, NC 27599-3360, USA
Email: losee@ils.unc.edu Web: http://ils.unc.edu/~losee

Abstract

Existing library classification systems, such as the Dewey Decimal Classification (DDC) system and the Library of Congress Classification system, are generally considered effective at bringing similar documents together. For example, the DDC system is widely used to classify books in libraries and increasingly is being used for online applications, such as browsing Online Public Access Catalogues (OPACs) and Cooperative Online Resource Catalogs (CORCs), and will probably see increasing use in digital libraries. Based on theoretical considerations, a classification system consistent with the Gray code provides an optimal organizing principle for documents; it can be used to classify knowledge, as can the classification systems commonly used in both paper and digital libraries. What is the relationship between a theoretically optimal Gray code-based classification system and the existing classification systems? We suggest that the non-reflected Gray code provides a basis for ordering documents that is more consistent with existing classification systems than is the more frequently discussed reflected Gray code. This provides both a theoretical basis for existing techniques and a standard by which document organizing systems in digital libraries may be compared, evaluated and improved.

1 Introduction: Classification

An optimal user-centered classification system assigns classification numbers to documents, placing those documents near others that are of use to an individual or group, given the system's ordering procedures. More explicitly, an optimal classification system places the most similar documents adjacent to each other. The developers of most existing classification systems placed documents on the same topic adjacent to each other and placed related documents near each other. Adding formal definitions and measures of such concepts as "topic", "similar" and "near" to a model of document ordering can serve as the basis for a mathematically formal model of an optimal classification system. Making the parameters and assumptions of this model explicit allows the user of such a system to understand both its inherent strengths and its inherent weaknesses, with the former ideally outweighing the latter.

Ranganathan (1965) claimed that "it is the duty of documentalists to spread the multi-dimensional universe of knowledge along one line. We must make a linear spectrum of it" (quoted in Tinker et al. 1999). Use of a subject-based linear arrangement of documents allows documents to be shelved by topic in a physical library. With the introduction of computers to paper-based libraries and the development of digital libraries, classification systems have been used for the display of ordered subject information or of ordered bibliographic records on computer screens (Tinker et al. 1999). Documents can be displayed in a two-dimensional space or displayed in a three-dimensional model, projected on to a two-dimensional surface such as a computer screen (Newby 2001), and our proposed model can support multi-dimensional arrangements, but a linear arrangement, such as that suggested by Ranganathan, may be the most cognitively efficient way to display groups of tens to hundreds of characters, representing the bibliographic information about each single entity in a digital library. Most search engines and online public access catalogs (OPACs) for libraries display lists of document surrogates, usually displaying a few to 10 or 20 documents on a screen; it is hard to imagine displaying this information as efficiently in a two- or three-dimensional model, although the multi-dimensional models may preserve information that is sometimes lost when compressing information into a lower-dimensional space.

A classification system may provide a value representing the subject or topic of a document, based on the document's characteristics. A theoretically developed classification system based upon the Gray code can be used to generate and order such classification numbers. The Gray code-based system is easily shown to be optimal when certain assumptions are met (e.g. there are documents on every subject and there are accurate subject indicators). While Ranganathan addressed the necessity for mapping subjects from a multi-dimensional knowledge space on to a line, he did not take the next step, which is to place precise criteria on this mapping. This paper discusses such criteria and shows that a system based on the Gray code is optimal and that existing classification systems, such as the Dewey Decimal System (DDS), can be optimal or near-optimal, even though they have what may seem to be irregularities or inconsistencies as one moves from one part of the system to another and the pattern of citation (feature) ordering changes. This leads us to try to understand the relationship between a theoretically optimal system, based on the Gray code, and the existing classification systems. In earlier work, the performance of an implemented Gray code-based system was described (Losee 1992, Losee 1993), as well as the means by which the system could adapt to individual users (Losee 1997) to provide a truly user-centered classification system. The purpose here is to move beyond a simple, theoretically optimal system and its implementation, to consider the relationship between a more flexible but still optimal variant and existing classification systems. Understanding theoretically optimal document organization systems and their relationships to some popular existing classification systems, such as the DDS, provides a desirable link between theory and practice and, more practically, a basis for understanding and improving the organization of materials in a digital library.

Anecdotal evidence suggests that existing classification systems such as the various Dewey-based systems and the Library of Congress Classification system perform reasonably well at approximating optimality; most libraries find their performance acceptable and continue to use one of these "standard" organizing systems or one of their cousins. Systems such as the DDS are being used in a wide variety of applications, ranging from the traditional ordering of books on shelves (it is the most widely used library classification system worldwide) to the organization of material on the Web with Cooperative Online Resource Catalogs (CORCs) or improved online catalog interfaces (Hickey and Vizine-Goetz 2000, Tinker et al. 1999).

Existing hierarchical classification systems group documents together based largely on taxonomic and naturalness considerations. This paper is limited to hierarchical classification systems; for a discussion of other systems, see Foskett (1996). In the Dewey-based classification systems, for example, sciences, broadly construed, are in the 500s, and technologies in the 600s. A search engine such as Yahoo similarly uses a taxonomy based on human, naturalness considerations. These systems are derived from the views of a particular culture and represent the values, interests and knowledge of a group of individuals at a given time in history. The Gray code system described below is optimal for a given individual or group, given only a fixed set of features and the expression of user preferences about the relative worth of different documents. If it is assumed that the features used in classifying are those features expressed in a document in a language, this system is independent of the extra-linguistic aspects of social and cultural systems, except for those aspects introduced directly or indirectly through language by a user or user group. When developed based on totally objective criteria derived from users, e.g. Latent Semantic Indexing (Deerwester et al. 1990, Manning and Schütze 1999), the features are likely to be a better reflection of the values and interests of those using the language, which may differ from the values and interests of those who manually develop classification systems.

Classification systems may be descriptively evaluated to determine the degree to which they provide a linear ordering of documents, such as the ordering provided for documents on library shelves. Evaluation of the quality of an existing ordering system might be based on how the system addresses the aggregated interests of a group of users, or relevance judgments, or commonly accepted topical criteria. One can then determine the degree to which the system brings together documents the user wants, or those documents perceived to be on a similar topic that are felt to belong together naturally.

In a more prescriptive manner, systems may be developed that meet the explicit criteria understood by the developer as being desirable in a classification system. For example, we might wish to design a system that ensures that each document is adjacent to another document from which it varies by at most one significant characteristic. An approach such as that taken below explicitly attempts to minimize this feature distance between adjacent documents. For a user-oriented system, one can use a classification system design that explicitly attempts to benefit the user's actions, unlike most other classification systems that attempt to organize documents based on considerations of the naturalness of a taxonomy. It is likely that naturalness and taxonomic considerations will be a part of an optimal system, but there are other likely factors, and these other factors will vary from individual to individual and from group to group.

A user-centered classification system organizes information in a manner that is consistent with the preferences of one or more users. For example, a digital library might appear to each user to have the collection organized consistent with the interests of that specific user, which we refer to as dynamic collection organization. A library supporting the needs of an organization or of any predetermined population might organize the collection based on the statistical average of the users' interests. A system organizing knowledge for an individual might take advantage of knowledge obtained through a variety of means, including systems analysis before the implementation of the system, or the incorporation of relevance feedback provided while the user examines documents, or through circulation or access records. Given this feedback, a classification system may learn the stated needs of a group and what would be optimal for the members of the group, given the data from which this is learned. A simple approach to user-centered classification for a very large group is to optimize the classification for all possible subjects and searches.

Document features are those aspects of knowledge that can represent a document's characteristics to someone accessing the collection. Features might represent the topic of a document, the date published, the publisher, the reading level, formally described metadata, or any of a number of characteristics (Schamber 1994, Foskett 1996, Theodoridis & Koutroumbas 1999, Greenberg 2001). A classification system that can be used to organize the collection so that the documents are ordered and thus optimized differently for each individual's interests might include non-topical features, for example, whether the individual has read the document, is the material in a language the user can read, or has the searcher enjoyed something else written by the author?

The classification systems used in most document management systems are static. However, dynamic user-centered systems that can adapt to the needs of a particular user or group of users will provide performance equal to or superior to that obtained with a static classification system. While continually reclassifying a traditional library's collections would be seen as impractical in virtually all circumstances, digital libraries can continually learn and adapt as times change and the system learns individual preferences. The adaptive electronic library may provide an ordering that differs from individual to individual, providing the best ordering available for each individual.

2 The Reflected Gray Code

A classification system provides a procedure for assigning a number to each possible topic within a topic space and enables one to count or enumerate through this set of all possible representations. The filing order used with a given classification system provides the rules for ordering and, in a sense, the rules for counting through the ordered set of possible classification numbers. Library classification systems most commonly use letters and decimal digits to represent various aspects of the document, including topicality. Subject-related characteristics are commonly represented in the body of a classification number, and authorship, date of publication, and other information are commonly represented with other techniques, such as the use of Cutter numbers.

Ordering classified documents can be treated as a counting problem if the value representing the document is completely numeric. Counting with the decimal system is second nature to virtually all those humans with a few years of formal education. In addition, most of us can order numbers containing decimal points. This form of counting serves as the basis for the ordering of classification numbers in systems based on the work of Melville Dewey. The decimal number system is based on numbers composed of members of a set of tokens, "0" through "9." However, other number systems may be used, such as octal, hexadecimal or binary.

The binary number system has only the characters "0" and "1" used in representing document characteristics as a classification number. Each binary digit is referred to as a bit. Counting in true binary results in the enumeration found in the center column in Table 1. In decimal systems such as the Dewey Decimal system, individual positions represent areas of coverage, e.g. the "5" in the 500s represents the sciences. In a binary number, each position can similarly represent the presence ("1") or absence ("0") of a feature or characteristic. For our purposes, we need not consider the source of these features, such as whether they are the product of human indexing, whether the feature set constitutes the set of terms occurring in a group of documents, or whether the features are derived from statistical methods such as Latent Semantic Indexing.

A Gray code provides a binary representation for numbers such that each successive representation differs from its predecessor by one feature or bit, a single changed zero or one. The rightmost column in Table 1 shows a set of Gray code codewords for a set of decimal numbers, along with the corresponding true binary values. Each cell in Table 1 constitutes a codeword. The reader is strongly encouraged to count slowly through these Gray code numbers at least once to develop a sense of the number system and how each number differs from its predecessor by a single character. One can derive the Gray code directly from the true binary using existing procedures (Savage 1997). Below, the paper considers how a modification in the most common counting procedure can produce Gray codes that may be similar to the structures used in existing classification systems, providing a basis for judging the existing classification systems to be nearly optimal.

Table 1. Comparing decimal, binary and Gray codes
Decimal True Binary Gray Code
0
0 0 0
0 0 0
1
0 0 1
0 0 1
2
0 1 0
0 1 1
3
0 1 1
0 1 0
4
1 0 0
1 1 0
5
1 0 1
1 1 1
6
1 1 0
1 0 1
7
1 1 1
1 0 0

Note that the optimality of a Gray code-based system assumes that all possible representations are used. At present, we have no guarantee that if there are subject gaps and particular representations do not occur, then the neighboring documents will necessarily be arranged so that the average similarity between adjacent documents is maximized. It is likely that the average similarity is close to its maximum, but we cannot guarantee that it is at its maximum.

The most commonly used Gray code is the reflected Gray code. In the third column of Table 1, the Gray code representation for the decimal values 2 and 3 is a mirror image representation for the decimal values 0 and 1, with a 1 placed in the second column (what the reader may call "the ten's column") for the Gray codes representing 2 and 3. Imagine a mirror placed below the line representing the Gray code for 1. The value for 2 is the reflection of 1, with a 1 added to the second (left) column for the Gray code. The value for decimal zero, further away from the mirror than the decimal 1, has as its reflection the value for 3, again with a 1 placed in the second column.

We may think of the Gray code geometrically and view the ordering imposed by it as providing a path connecting all the vertices (corners) of a unit-cube in three-dimensional space (or more generally, in an n-dimensional space, where the object is referred to as an n-cube, and n represents the number of dimensions or features). Note that the time it takes to compute such a path is similarly difficult to that of many other graph traversal algorithms. Each vertex is labeled with the number representing its location in the geometric space. For example, the labels on a square would include the values (moving from the lower-left clockwise around the square) (0,0), (0,1), (1,1), and (1,0). Any path moving through all the vertices exactly once, by traveling on the edges of the cube (and not traveling other parts of the cube than the edges), represents a Gray code. Thus, our labels on the four corners of the square represent a Gray code. If we consider a three-dimensional cube, a reflected Gray code would consist of moving, for example, clockwise around the four corners of one face of the cube and then moving counter-clockwise around the four corners of the opposite face (with directions being viewed from a single point in space).

One can imagine the reflective Gray code using this geometric approach by considering different paths around the edge of an n-cube. Consider the top surface of a cube on which you are asked to trace a path with your fingertip along the edges of the cube, such that each corner is touched exactly once. We assume that points A, B, C and D are consecutive corners (upper-left, upper-right, lower-right, lower-left) on the top face and points E, F, G and H are on the bottom face, with A above E, B above F, C above G, and D above H. If we were to begin with point A and finish with point B, we might follow the path: A, D, H, E, F, G, C, and then B, which moves over the left face overhand and then over the right face taking the reverse or reflected path. There are several paths that could be interpreted as Gray codes around the edges of such a cube.

The application of the Gray code to documents is relatively simple. Assume that we have a simple universe of only two features and four documents, with the four different sets of characteristics: {no features}, {f1}, {f2}, and {f1, f2}. These would be encoded as {0,0}, {0,1}, {1,0} and {1,1}. When arranged in common or normal counting order, they would be arranged:

{0,0}, {0, 1}, {1, 0}, {1,1}

with the documents being in a ring and the last document preceding the first document in the ring. Using the Gray code order, the same documents would be arranged:

{0,0}, {0,1}, {1,1}, {1,0}.

Clearly, the Gray code ordering places similar documents closer to each other. In a very dense information space, documents will differ from an adjacent document by at most one feature. A similar phenomena occurs when we move to documents at greater distance, with the general trend of inter-document distance and inter-document dissimilarity increasing together.

3 The Gray Code as the Gold Standard for Organizing Knowledge

The Gray code may serve as a gold standard for linearly organizing documents in collections. We know that if we wish to place documents adjacent to other documents, with the two documents differing by at most one feature, an organizing system that is consistent with the Gray code is required. Clearly, such a system can serve as a gold standard by which we may evaluate existing systems or design new systems. While systems may be designed to be optimal in the sense of bringing similar documents together, the systems may not be optimal in other senses, such as the naturalness of the grouping of documents.

This and other precise models make assumptions. It was mentioned above that when the document space is dense enough so that there exist documents for all possible representations, the Gray code-based classification is optimal. If the space is less dense, adjacent documents might differ by more than one feature. It is likely that this Gray code-based arrangement is optimal or very nearly optimal in this less-dense situation, but this is not a fundamental characteristic described in the Gray code literature. We also assume that the features present cover all those aspects of a document that might help discriminate between documents of greater usefulness and those of lesser utility. Ideally the features should all be statistically independent; some preliminary tests by the author suggest that significant deviation from independence may result in poor document organization.

Empirical evidence suggests that using the Gray code is generally effective in those environments tested. For example, in a study of books cataloged under four different subject headings, organizing the books by using the Gray code resulted in the documents with the same subject headings being placed closer to each other than when the documents were ordered by their Library of Congress Classification number (Losee 1992).

4 Ordering of Features

Features have been treated above as though they were all equal in worth in a Gray code-based classification system. However, we may order features based on their relative importance and use their ordering within a classification system to place similar documents closer together, on the average, than if the features were randomly ordered. This ordering may reflect the interests of an individual knowledge user, or a group, or support the information needs of the entire population.

Foskett (1996) suggests that there are reasons for ordering differently, consistent with different user preferences. He points out, for example, that the feature language may take precedence over literary style for documents addressing literature, while there are other reasons that one might wish to have literary style take precedence over language. Those who seek French language literature, for example, would find it easier to have the feature language to the left (or having precedence over) particular literary genres. Someone who wishes to focus on a genre, however, regardless of the language, would clearly prefer to have the genre or literary style to the left of and dominate the language used. Video material (Geisler et al. 2001) is normally organized so that the individual frames are sorted in time order with time as the dominant feature, but when indexing a video it may be desirable to organize it by topic instead of by time. Having a system that is flexible and can adjust to a particular user's preferences is desirable and, in fact, mandatory if a system is to be user-centered.

Feature reordering, reflecting relative levels of interest, provides an adaptive characteristic to a knowledge ordering system that allows it to be accessed in different ways, just as information retrieval systems allow us to retrieve different sets of documents by entering different queries. Someone interested in Ronald Reagan using a search engine might search for either Ronald Reagan AND Screen Actors Guild or, possibly, for Ronald Reagan AND White House. This flexibility, which may be seen as providing different entry points to the database, can be emulated by a browsing system that enables the features Screen Actors Guild and White House to be moved so as to represent the relative importance of each to the other.

The reordering of features often occurs in existing classification and subject-heading systems. This is a natural phenomenon that occurs when the focus or dominant subject changes as we move from one topical area to another. Viewed differently, change occurs in the economic benefit associated with assigning one feature as dominant instead of another feature. In fact, we may precisely define the subject of a document as the order of features, i.e. the relative economic importance of features. These feature-changing phenomena in organizing systems can be seen in lists of subject headings. For example, one finds that books on ambassadors and foreign relations might be classified to be consistent with the heading Ambassador - France or assigned the heading France - Foreign Relations. Similarly, one can find Food-United States and Chinese-Americans-Food. Quilts appear under Quilts-United States and under specific groups, e.g. Afro-American Quilts and Indian Quilts - North Carolina.

5 Feature Weighting

When using the Gray code, and consistent with information theoretic considerations, it can be shown that the expected information dissimilarity will be lower between adjacent documents when the features are ordered in a classification number so that features with higher probabilities are placed to the left of those features with lower probabilities (Losee 1992). Experimental evidence provided support for this argument by showing that ordering the features so those with lower probabilities are on the left resulted in the inferior placement of documents when compared to ordering features with the more common, higher probability features on the left.

One can move beyond this simple statistical ordering of features to include user-supplied preferences. Because economic costs may reflect the preferences of individuals, groups or populations, this model may easily incorporate expressed user interests. While user-oriented systems are often presumed to be qualitative in nature, we believe that any system such as this that formally incorporates information the user provides about preferences will be equally or more user-oriented than other, less rigorously defined methods. Following Losee (1997), one can compute the expected cost of the dissimilarity of features across documents. By placing the features with the highest expected cost of dissimilarity on the left, one can maximize the benefit of the organization of the data for the individual user. The result of this is to have infrequent transitions between documents with high transition costs and more frequent transitions between documents differing by features on the right that have a lower expected cost associated with the transitions.

User preferences can be supplied as relevance feedback, providing information about the relative merits of specific documents and of specific features within documents. By modifying the costs or benefits associated with a specific feature, by either explicitly modifying a given cost or by a system changing its estimate of costs, the ordering of features may be changed. Systems using such relevance feedback can adapt to individual preferences. Such user-centered classification can be used to organize an electronic library to address the interests and preferences of each individual, effectively personalizing the order of a previously impersonally ordered collection. While relevance feedback cannot economically be used to provide user-centered ordering for a paper-based document collection with expensive-to-replace paper classification labels, it can easily provide a user-centered ordering for knowledge represented in electronic form with computer-generated representations of dynamically produced classification numbers.

6 Non-reflected Gray Codes

The Gray coding system described above is the reflected Gray code. Each half of the code is the reflection of the other half of the code, with a single bit on the left being changed. All earlier library or document organization related empirical research on the Gray code being used for document classification addressed this reflected subset of all Gray codes (Losee 1992, Losee 1993, Losee 1997). However, there are non-reflective Gray codes that do not share this reflective property but that place documents adjacent to other similar documents, due to the definition and characteristics of all Gray codes.

Consider Table 2. In binary, we note that on the left we have the reflected Gray code described earlier. On the right, we have a set of binary Gray codes that differ from those on the left because the rightmost two bit positions are exchanged. We can see informally that a new Gray code can be made from the simple reflective Gray code by transposing bit columns. This is still a reflected Gray code.

Table 2. Making a new Gray code from simple reflective Gray code by transposing bit columns
Standard Reflected Code Right and Middle Bit Positions Exchanged
0 0 0
0 0 0
0 0 1
0 1 0
0 1 1
0 1 1
0 1 0
0 0 1
1 1 0
1 0 1
1 1 1
1 1 1
1 0 1
1 1 0
1 0 0
1 0 0

A non-reflective code differs from either of these codes by moving around certain representations in the code, while retaining the rule that each successive representation differs from its predecessor by exactly one position or one bit. Table 3 shows a reflective code and the insertion and deletion pairs (X & Y, A & B) that can be made to produce a non-reflective Gray code. Note that taking the entire left column as presented gives us a traditional Gray code, while inserting and deleting (as indicated) both the X & Y values (one pair of numbers) or the A & B values (a second pair of numbers) gives us another Gray code, albeit a non-reflected code. Note that we cannot move both the X & Y pair and the A & B pair and have a valid Gray code.

Table 3. Making a non-reflective Gray code
Traditional Gray code Non-reflected Gray code
0 0 0 0  
0 0 0 1  
0 0 1 1  
0 0 1 0  
  1 0 1 0 ß Insert X
  1 1 1 0 ß Insert Y
0 1 1 0  
0 1 1 1  
  1 1 1 1 ß Insert A
  1 1 0 1 ß Insert B
0 1 0 1  
0 1 0 0  
1 1 0 0  
1 1 0 1 B: Deleted à
1 1 1 1 A: Deleted à
1 1 1 0 Y: Deleted à
1 0 1 0 X: Deleted à
1 0 1 1  
1 0 0 1  
1 0 0 0  

7 Feature Ordering and Non-Reflective Gray Codes

The reordering of features can take place globally or locally. Often referred to as "citation ordering," we instead refer to the ordering or reordering of "features" because all possible features may be used in this system, although not all will produce the same degree of classification performance. The global reordering of features takes place throughout the coding system, changing the dominance of one feature over another through the collection. Local reordering can represent changes in particular domains that one may not want spread through the universe of all subjects. For example, one might want language to dominate literary style in many areas but not in poetry.

Changes in feature ordering such as these frequently occur in existing classification systems. While such asymmetric changes violate the assumptions of a symmetrical, optimal reflected Gray code, the changes may be consistent with the assumptions of an optimal but non-reflected Gray code. Many existing classification systems appear to be very good, at least to the point that they are widely accepted and supported, so we might ask whether they could approach optimality and to what extent and in what way some of the classificatory power found in these systems is due to their consistency with or approximation of the assumptions of an optimal, non-reflected Gray code. By definition, any system consistent with the sort of optimality described above is a Gray code. Thus, any existing classification system that meets these assumptions overall, or in parts of the coding and classification system, is a Gray code overall or in those parts.

These changes can occur where the economic costs for features vary in different subject areas. A feature that is very important in one area might be less important in others. The different economic values for different features and the resulting shifts in feature ordering can occur in segments of a Gray code classification that can be shifted around, as in the non-reflected code. In the example of the non-reflected code in section 6, pairs of codes were moved. Consider that in a much larger system, with hundreds or thousands of features, large blocks of representations may be moved. Within these blocks, bits can be exchanged, as was shown above, to reflect the different costs in different subject areas. Note that within a block of code, if there are two pairs of adjacent codewords (the two pairs need not be adjacent) that have the same values for two features, the features may be swapped for those codewords on (what is arbitrarily referred to as) the inside of the adjacent codewords and all codewords in between these codewords. This inside section represents those areas where we wish to flip bit positions. For example, if we had features represented from left to right as X, Y and Z, and the Gray code codewords 110 followed by 111 and then at some other point in the code the codeword 000 followed by 001, we can flip the leftmost and center bit positions from 111 through all the codewords up to and including 000. Thus, on the "inside" from values 111 to 000, the order of features can be changed to Y, X then Z. We can do this without violating the rule that Gray code codewords must differ from adjacent codewords by exactly one characteristic.

8 Further Discussion and Conclusions

Why do human-developed classification systems such as the Library of Congress Classification or Dewey Decimal Classification systems work as well as they do? We have described above a non-reflective form of the Gray code that allows feature ordering to be reversed in different parts of a classification system. Some hierarchical classification systems, such as the Dewey Decimal system, exhibit feature reordering, and the fact that an optimal coding system might be consistent with existing library classification systems, with their feature reordering, suggests that the existing systems may be optimal or near-optimal.

A classification system consistent with the Gray code ensures that adjacent documents differ at most by a single feature, given the assumption that there are documents on every subject. If this assumption is not met, the classification system is expected to place related documents very closely together. Using a Gray code-based classification system in a digital library could lead to near-optimal browsing and organizing performance.

Earlier work suggested that a document collection using the Gray code could organize documents so that similar documents are placed close to each other. In fact, using the Gray code where the feature set consisted of controlled subject headings resulted in classification performance equal to or superior to that obtained with human supplied classification numbers, for a small set of data (Losee 1992, Losee 1993). While this level of performance probably requires the careful choice of a feature set, it is indicative of the performance that may be obtained when using the Gray code. For these systems, the placement of features from left to right in the classification number remain constant through a particular arrangement of the documents.

Existing library classification systems often place the same features in different orders in the classification numbers of the system as a whole. While this makes the classification system asymmetrical, it can lead to a system that seems more consistent with human ideas about taxonomizing the known universe, and at the same time provides an efficient classification system. The weighting and difference in importance of features may be consistent with the transposition of the feature ordering found locally in non-reflected Gray codes. As one changes from one domain to another, the relative economic value of different features or facets of this domain may shift. Feature weighting can thus be used to explain why the feature positions may shift, producing relative valuations that may be incorporated into a Gray code. The structure of non-reflected Gray codes suggests that classification systems may be developed with portions of the code being asymmetric and yet still be optimal. More specifically, features that are to the left of others at some points in the enumeration of a set of classification values might be to the right of them at other points in the enumeration (and vice versa), and one could still have an optimal classification system.

References

Arazi, B. (1984) "An Approach for Generating Different Types of Gray Codes". Information and Control, 63, 1-10

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990) "Indexing by Latent Semantic Analysis". Journal of the American Society for Information Science, 41, 391-407

Flores, I. (1956) "Reflected Number Systems". IRE Transactions on Electronic Computers, EC-5(2), 79-82

Foskett, A. C. (1996) The Subject Approach to Information, 5th edition (London: Library Association Publishing)

Geisler, G., Marchionini, G., Nelson, M., Spinks, R., and Yang, M. (2001) "Interface Concepts for the Open Video Project". Proceedings of the 64th Annual Meeting of the American Society for Information Science and Technology, pp. 58-75

Gilbert, E. (1958) "Gray Codes and Paths on the N-Cube". Bell System Technical Journal, 37, 815-826

Greenberg, J. (2001) "Metadata and the World Wide Web". Encyclopedia of Library and Information Science (New York: Marcel Dekker)

Hickey, T. and Vizine-Goetz, D. (2000) "The Role of Classification in CORC". 23rd International Online Information Meeting, edited by B. McKenna (Oxford: Learned Information Europe), pp. 247-250

Losee, R. (1992) "A Gray Code Based Ordering for Documents on Shelves: Classification for Browsing and Retrieval". Journal of the American Society for Information Science, 43(4), 312-322

Losee, R. (1993) "The Relative Shelf Location of Circulated Books: A Study of Classification, Users, and Browsing". Library Resources & Technical Services, 37(2), 197-209

Losee, R. (1997) "Browsing Document Collections: Automatically Organizing Digital Libraries and Hypermedia using the Gray Code". Information Processing & Management, 33(2), 175-192

Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press)

Newby, G. (2001) "Cognitive Space and Information Space". Journal of the American Society for Information Science and Technology, 52, 1026-1048

Ranganathan, S. R. (1965) A Descriptive Account of the Colon Classification (Bombay: Asian Publishing House)

Savage, C. (1997) "A Survey of Combinatorial Gray Codes". SIAM Review, 39(4), 605-629

Schamber, L. (1994) "Relevance and Information Behavior". In Annual Review of Information Science and Technology, edited by Martha E. Williams (Washington, DC: ASIS), pp. 3-48

Tinker, A., Pollitt, A. S., O'Brien, A., and Braekevelt, P. (1999) "The Dewey Decimal Classification and the Transition from Physical to Electronic Knowledge Organisation". Knowledge Organisation, 26(2), 80-96

Theodoridis, S. and Koutroumbas, K. (1999) Pattern Recognition (San Diego, CA: Academic Press)