A Text Categorization Technique based on a Numerical Conversion of a Symbolic Expression and an Onion Layers Algorithm


Marios Poulos, Sozon Papavlasopoulos and Vasilios Chrissikopoulos
Department of Archives and Libraries Science, University of Ionian,
Palaia Anaktora, 49100, Corfu, Greece
Email: {mpoulos, sozon, vchris}@ionio.gr

Abstract

The dramatic increase in the amount of content available in digital form gives rise to large-scale digital libraries, targeted at millions of users. As a result, it has become necessary to categorize large collections of texts (documents). This paper develops a novel method in which text categorization is achieved via a reduction of the original data information using a numerical conversion of a symbolic expression and an onion layers algorithm. Three different semantic categories were considered and five texts were selected from each category for submission to a text categorization procedure using the proposed method. The results and the statistical evaluation of this procedure showed that the proposed method may be characterized as highly accurate for text categorization purposes.

Keywords: text categorization, onion algorithm, computational geometry

1 Introduction

Text categorization is the process of grouping texts into one or more predefined categories based on their content. Automatic text categorization can play an important role in a wide variety of more flexible, dynamic and personalized information management tasks, such as: real-time sorting of email or files into folder hierarchies; topic identification to support topic-specific processing operations; structured search and/or browsing; or finding texts that match long-standing interests or more dynamic task-based interests (Anick and Vaithyanathan 1997). A number of statistical classification and machine learning techniques have been applied to text categorization, including regression models, Bayesian classifiers, decision trees, nearest neighbour classifiers, neural networks, and support vector machines (Weston et al. 1997).

Classification technologies should be able to support category structures that are very general, consistent across individuals, and relatively static (e.g. Dewey Decimal or Library of Congress classification systems, Medical Subject Headings (MeSH), or Yahoo!'s topic hierarchy), as well as those that are more dynamic and customized to individual interests or tasks. Until now the first step in text categorization has been to transform texts, which are typically strings of characters, into a representation suitable for the learning algorithm and the classification task. Each document is represented as a vector of words, as is typically done in information retrieval (Salton and McGill 1983). For most text retrieval applications, entries are weighted to reflect the frequency of terms in texts and the distribution of terms across the collection as a whole. For text classification, simpler binary feature values (i.e. a word either occurs or does not occur in a document) are often used instead. For example, in the most commonly used text representation, the vector space model, a vector of words represents each text. A word-by-text matrix A is used for a collection of texts, where each entry represents the occurrence of a word in a text, i.e. A = [aij], where aij is the weight of word i in text j. Determining the weight and the frequency of words in a text is common to all existing methods.

Classification of the aforementioned vectors has been based on several learning techniques, including multivariate regression, nearest neighbour classifiers, probabilistic Bayesian models, decision trees, and neural networks. Overviews of this text classification work can be found in Lewis and Hayes (1994), Yang (1999), Joachims (1998) and Dumais et al. (1998). Furthermore, latent semantic analysis (LSA) (Deerwester et al. 1990), a semantic reduction technique for text categorization, is an approach to automatic indexing and information retrieval that attempts to overcome two common problems in information retrieval. The first problem, termed synonymy, refers to the fact that there are many names for the same object. The second, called polysemy, refers to the fact that most words have more than one meaning.

Recently, machine learning algorithms have been applied successfully to automatic text categorization problems, with a large number of unique words (over 15000). However, it has not yet been clarified, particularly for large-scale problems with more than a million feature terms, to what degree the following two factors work to the detriment of each other: the reduction of the classification error using computationally-intensive machine learning algorithms, and the loss of information due to the selection of feature terms before applying the expensive learning algorithms.

Specifically, in the learning procedure the algorithm of Slonim and Tishby (2001), which is based on the Bayesian classifier, is O(n³k) in complexity, where n is the total number of words and k is the number of classes. Furthermore, the philosophy of the aforementioned methodology is justified because a Bayesian classifier presents a weakness for point sets of large dimension (Domingos and Pazzani 1997). It is known that the high dimensionality of a classified vector S of words creates classification problems. These arise from the asymptotic error rate of the nearest neighbour rule (Toussaint et al. 1984). This problem is solved partially by the Voronoi diagram of S, which partitions the plane into convex polygonal Voronoi cells.

This paper attempts to improve the performance of naive text categorization methods by using the concept of a probability-weighted amount of information, and also by incorporating morphological information and probabilistic language modelling techniques in extracting and computing the weights of feature terms. Thus, our analysis is based on the advantages of the Voronoi diagram, and we reduce the complexity of this transformation by using the onion algorithm of Bremner et al. (2003). The idea behind this implementation is to apply the convex hull subroutine recursively, peeling away the outermost convex hull of the given points (characters) at each step until the smallest convex polygon remains. In particular, we have developed a method based on an onion polygon, each point of which represents a character of the document. For the numerical conversion of each character we used the double-precision numerical representation of its ASCII code.

In this novel way we avoid the problems of vector representation by converting the entire text into a unique set of numbers (more details of this method are given in section 2.1). This conversion took place in order for all the features of the characters of a document to be represented in the Cartesian plane and used in computational geometry. Furthermore, a characteristic of the onion layers method (Borgefors et al. 1992) is that all the features are sorted in an ideal way and all specific features of documents, such as spaces and punctuation marks, have a dominant role in the final reduction. This method may be characterized as a continuation of our recent research, such as specific feature extraction for fingerprint verification via onion layers (Poulos et al. 2003), and the study of the electroencephalogram (EEG), where the feature of a convex polygon gave positive results in diagnostic biometric tests (Poulos et al. 1999).

The aim of this paper is to describe the proposed algorithm in depth and with examples, and, furthermore, to compare it statistically with the similar LSA reduction method in order to extract significant conclusions regarding the statistical assurance of this method.

2 Method

In brief, our proposed method is divided into the following stages:

2.1 Preprocessing stage

In this stage we suppose that a selected text is an input vector S = [s1, s2, ..., sn], where the si represent the characters of the selected text. Then, using a conversion procedure in which a symbolic expression (in our case an array of characters of a text) is converted to the arithmetic string values of its ASCII characters, we obtained a numerical value vector V = [v1, v2, ..., vn], where these values range between 0 and 127.

In our case this conversion was achieved by using the double.m function of the Matlab language. This function converts a string to double precision, which is equivalent to converting each ASCII character to its numerical code.

For better comprehension we give the following example via Matlab:

>> S = 'This is a message to test the double "command".'

>> double(S)

ans =

Columns 1 through 12

84 104 105 115 32 105 115 32 97 32 109 101

Columns 13 through 24

115 97 103 101 32 116 111 32 116 101 115 116

Columns 25 through 36

32 116 104 101 32 100 111 117 98 108 101 32

Columns 37 through 46

34 99 111 109 109 97 110 100 34 46

2.2 Processing stage

Our proposed method is based on the following proposition:

The set of elements of vector V for each selected text contains a convex subset, which has a specific position in relation to the original set (Poulos et al. 1999, Poulos et al. 2003). This position may be determined by using a combination of computational geometry algorithms known as onion peeling algorithms (Bose and Toussaint 1995), with overall complexity O(d*n log n), where d is the depth of the smallest convex layer and n is the number of characters in the numerical representation (in accordance with section 2.1).

Thus, the smallest convex layer C of the original set of vector V carries specific information. In particular, C may be characterized as a common geometrical area of all the elements of V. In our case, this consideration is valuable because this subset may be characterized as representing the significant semantics of the selected text.

Implementation

We consider the set of converted characters of a selected text to be vector V. The algorithm starts with the finite set of points P that V defines in the plane. The following iterative process is considered. Let P1 be the set P minus all the points on the boundary of the convex hull of P. Similarly, define P2 as P1 minus the points on the boundary of the convex hull of P1, and so on. The process continues until the remaining set is empty (see Figure 1). The hulls of the sets P, P1, P2, ... are called the layers of the set, and the process of peeling away the layers is called onion peeling for obvious reasons (Figure 1). More details of the algorithmic construction of the proposed convex polygon and onion algorithm are presented in the Appendix.
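A minimal Matlab sketch of this peeling process is given below. It relies on the built-in convhull function rather than the convex2.m routine used in the study, and the function name onion_layers is ours, introduced purely for illustration.

function [layers, depth] = onion_layers(x, y)
    % x, y: column vectors with the point coordinates of a converted text
    % layers: cell array, layers{i} holds the closed vertex list of the i-th hull
    layers = {};
    depth = 0;
    while numel(x) >= 3
        try
            k = convhull(x, y);            % indices of the current outermost hull
        catch
            break                          % remaining points are collinear: stop peeling
        end
        depth = depth + 1;
        layers{depth} = [x(k), y(k)];      % store this layer (first vertex repeated at the end)
        keep = true(size(x));
        keep(k) = false;                   % peel the boundary points away
        x = x(keep);
        y = y(keep);
    end
end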

figure 1

Figure 1. Onion layers of a set of points

2.3 Categorization stage

In our case we considered that the smallest convex layer, which has depth 3 (Figure 1), carries specific information (Poulos et al. 2003), because this position gives a geometrical interpretation of the semantics of the selected text. In other words, the smallest convex polygon (layer) depicts a particular geometrical area in which the values of these semantics range. This structure has also been used, for example, in statistics to define a generalization of the concept of the median to two-dimensional data (O'Rourke et al. 1982). This feature may be characterized as unique to each selected text because the two conditions given in Poulos et al. (1999) are ensured.

Thus, two variables were extracted from the proposed text categorization method: the area of the smallest onion layer C and the depth (i) of this layer, which is a subset of the original set of text values. In more detail, the smallest onion layer C is a matrix that contains the set of convex coordinate values of the smallest onion layer, in other words C = [(x1, y1), (x2, y2), ..., (xk, yk)],

where x represents the frequency of each converted character and y the numerical value of each character.

Taking into account the specific features of the aforementioned variables it is easy to ascertain that these may be used in the next stage.

2.4 Decision stage of computational geometry method

In this stage the subset C1 obtained from one text is intersected on the Cartesian plane with a new subset C2, which came from the processing of another text (Figure 2).

figure 2

Figure 2. Decision stage between two different onion algorithm procedures

The correlation of the investigated subsets C1 and C2 is decided based on the following experimental conditions:

If the above subsets intersect, the degree of correlation is extracted by using the fraction of the common intersection over the area of the original convex set. In more detail, if we consider that C12 = C1 ∩ C2 and that the corresponding areas of these convex polygons are E1, E2 and E12, then the degree a of the correlation of subset C1 with regard to subset C2 is extracted as follows:

a = E12 / E1   (1)

In our case if a = 0, then the system decides that the two compared subsets C1 and C2 are not correlated (Figures 3 and 4).

figure 3

Figure 3. Procedure of correlating the two smallest convex polygons, extracted using the onion layer method (see Figure 2)

In this way the degree a is the defining factor in categorizing two investigated texts.

figure 4

Figure 4. Case of categorization in the same category. The green area is the common area of the compared text categorization documents

3 Experimental part

3.1 Data acquisition

In this study we tested the proposed method in three different semantic categories:

  1. Five documents (A1, A2, A3, A4 and A5) concerning Money and Marketing with the Reuters-21578 collection as a source.
  2. Five documents (B1, B2, B3, B4 and B5) concerning computational geometry with common keywords: Computational, Geometry, Algorithm, Onion Algorithm and Fingerprint Verification
  3. Five documents (C1, C2, C3, C4 and C5) concerning the AIDS virus from the Reuters-21578 collection

Each text had about 500 characters (with spaces), or about 100 words.

3.2 Implementation of the algorithm

The implementation of the algorithm takes place using the following steps.

In the preprocessing stage we selected a text from each category and submitted it for numerical conversion via the Matlab function double.m, as described in section 2.1.

For example, we took the following text from the third category (AIDS and Virus), which we called C1:

West German chemical group Bayer AG <BAYG.F> said it had received claims for damages from haemophiliacs who allege they were infected with AIDS through a blood plasma produced by a Bayer subsidiary. Bayer's shares fell 13 marks to close at 292 in Frankfurt today. Dealers said a newspaper report about possible huge claims for damages triggered off the share selling in Bayer, other pharmaceuticals and insurance companies. Bayer said in a statement it was insured against claims and had received two.

This text was submitted to numerical conversion, and a numerical vector V was constructed.

The same procedure was followed for other selected texts from the three categories.

In the processing stage vector V was placed on the Cartesian plane and the onion algorithm was applied, thus isolating the characteristic smallest convex polygon (vector C). This algorithm was implemented via the Matlab routine convex2.m.

This procedure was repeated for the other vectors until we had finally isolated the 15 smallest convex polygons, or vectors Cn, where n = 1 ... 15; a sketch of this loop is given below.
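The following Matlab fragment sketches this repetition under our own assumptions: the raw texts are held in a cell array named texts, the x-coordinate is taken as the frequency of each converted character (as described in section 2.2), and the helper onion_layers is the illustrative function sketched earlier, not the authors' convex2.m.

smallest = cell(1, 15);
for n = 1:15
    v = double(texts{n});                      % numerical conversion (section 2.1)
    [vals, ~, idx] = unique(v);                % distinct character codes
    freq = accumarray(idx(:), 1);              % frequency of each code (x-axis)
    [layers, depth] = onion_layers(freq, vals(:));
    smallest{n} = layers{depth};               % smallest convex polygon C_n of text n
end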

An example of this procedure is presented in Figures 5 and 6.

figure 5

Figure 5. Implementation of the onion polygon of text C1 using vector V



figure 6

Figure 6. Isolation of the smallest convex polygon of text C1, given by the coordinate vectors x1 and y1

In the categorization stage the coordinate vectors x1 and y1 of the smallest polygon were intersected with those of the other texts. The algorithmic procedure was as follows.

We considered that the intersected coordinate vectors were x1 and y1 from the first text and x2 and y2 from the second text. Then the convex polygon produced by their intersection had coordinate vectors xi and yi. This algorithm was implemented via a Matlab polygon intersection routine; a sketch of this step is given below.
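One possible way to carry out this step, assuming a recent Matlab release (R2017b or later) with polyshape objects, is sketched here; it is not the routine used in the study, and the variable names follow the notation introduced above.

p1  = polyshape(x1, y1);          % smallest convex polygon of the first text
p2  = polyshape(x2, y2);          % smallest convex polygon of the second text
                                  % (polyshape may warn about, and remove, a repeated closing vertex)
p12 = intersect(p1, p2);          % convex polygon produced by the intersection
xi  = p12.Vertices(:, 1);         % coordinate vectors of the intersection polygon
yi  = p12.Vertices(:, 2);         % (both are empty when the polygons do not intersect)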

For example, we used the following second text from the third category (Aids and Virus), which we called C2:

Bristol-Myers Co <BMY> said it would seek Food and Drug Administration permission to test its AIDS vaccine in humans by the end of March. A company spokesman said the company will file an investigational new drug application by the end of the month, requesting the FDA to allow testing of the vaccine in humans. Scientists at the company's Genetic Systems unit created an AIDS vaccine, which Bristol-Myers said produced antibodies to the AIDS virus in mice and monkeys. The vaccine consists of smallpox virus.

We also used the following text, which we called B1, from the second category, Computational Geometry:

In this paper, fingerprint segmentation for secure Internet verification purposes is investigated. The novel application of computational geometry algorithms in the fingerprint segmentation stage showed that the extracted feature (characteristic polygon) may be used as a secure and accurate method for fingerprint-based verification over the Internet. On the other hand, the proposed method promisingly allows very small false acceptance and false rejection rates, as it is based on specific segmentation,

Figures 7 and 8 show the implementation of the intersection of two semantic texts from the same category, C1 and C2, where we obtained a large common convex polygon C12 = C1 ∩ C2 (green). In contrast, Figures 9 and 10 show that the intersection of the semantic texts C1 and B1 is zero (B1 ∩ C1 = 0).

figure 7

Figure 7. Intersection of two semantic texts, C1 and C2, using the onion algorithm


figure 8

Figure 8. Isolation of the intersection (green) of the two smallest convex polygons of semantic texts C1 (blue) and C2 (red)


figure 9

Figure 9. Intersection of two different semantic texts, C1 and B1, using the onion algorithm


figure 10

Figure 10. Zero intersection between the two smallest convex polygons of semantic texts C1 (blue) and B1 (red)

In the decision stage the degree a of correlation between the two compared subsets was calculated as follows.

First, we calculated the areas E1 and E2 of the two convex subsets, followed by the area E12 of their common intersection. Then the degrees of correlation were calculated as follows:

a1 = E12 / E1 and a2 = E12 / E2

In Matlab this procedure was carried out by computing the corresponding polygon areas (a sketch is given below), and it was repeated for all text cases. The a degrees of correlation are presented in Table 1.
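A minimal sketch of this calculation, using the base-Matlab polyarea function on the coordinate vectors defined above (not the authors' original code), might look as follows.

E1 = polyarea(x1, y1);            % area of the first smallest convex polygon
E2 = polyarea(x2, y2);            % area of the second smallest convex polygon
if isempty(xi)
    E12 = 0;                      % no intersection: the texts are not correlated
else
    E12 = polyarea(xi, yi);       % area of the common intersection
end
a1 = E12 / E1;                    % degree of correlation of the first text with regard to the second
a2 = E12 / E2;                    % degree of correlation of the second text with regard to the first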

4 Results

In this section the results of the implementation of the proposed algorithm for text categorization purposes are presented in Table 1. More specifically, we extracted the a degrees of correlation using Equation 1. In total we obtained 15 x 15 = 225 a degrees of correlation for the three semantic categories.

Table 1: a degrees of correlation for the three semantic categories (A, B, C) using the onion algorithm

TEXTS  A1    A2    A3    A4    A5    B1    B2    B3    B4    B5    C1    C2    C3    C4    C5
A1     1     0.71  0.42  0.35  0.33  0     0     0     0     0     0     0     0     0     0
A2     0.53  1     0.67  0.22  0.18  0     0     0     0     0     0     0     0     0     0
A3     0.44  0.48  1     0.39  0.87  0     0     0     0     0     0     0     0     0     0
A4     0.67  0.56  0.33  1     0.26  0     0     0     0     0     0     0     0     0     0
A5     0.44  0.18  0.35  0.56  1     0     0     0     0     0     0     0     0     0     0
B1     0     0     0     0     0     1     0.67  0.43  0.38  0.35  0     0     0     0     0
B2     0     0     0     0     0     0.19  1     0.22  0.34  0.44  0     0     0     0     0
B3     0     0     0     0     0     0.54  0.87  1     0.56  0.57  0     0     0     0     0
B4     0     0     0     0     0     0.66  0.56  0.11  1     0.33  0     0     0     0     0
B5     0     0     0     0     0     0.23  0.43  0.31  0.27  1     0     0     0     0     0
C1     0     0     0     0     0     0     0     0     0     0     1     0.44  0.23  0.25  0.33
C2     0     0     0     0     0     0     0     0     0     0     0.81  1     0.12  0.17  0.19
C3     0     0     0     0     0     0     0     0     0     0     0.54  0.41  1     0.33  0.33
C4     0     0     0     0     0     0     0     0     0     0     0.30  0.29  0.46  1     0.27
C5     0     0     0     0     0     0     0     0     0     0     0.23  0.25  0.42  0.25  1

5 Statistical evaluation

In this section a statistical evaluation between the most widely used information retrieval method, latent semantic indexing (LSI), and the proposed method is attempted. In particular, we tested the five texts of the same semantic category C (AIDS-virus). To compare the two techniques we used rank correlation among several variables together with a chi-square test. In the following paragraphs we develop the procedure followed in order to yield the normalized matrices for each method and, thereafter, apply them to the selected statistical method.

5.1 Concordance: rank correlation among several variables

The concept of correlation between two variables can be expanded to consider association among more than two, as shown for multiple correlations (Kendall 1962). Such association is readily measured non-parametrically by a statistic known as the coefficient of concordance. In our case this correlation, W, gave the criterion of agreement among the five selected sets of data. In other words, this test investigates whether the data in the columns are in agreement. The coefficient W was computed by the following equation:

W = 12 Σi (Ri - R̄)² / [M²(n³ - n)]   (2)

where M is the number of variables being correlated, n is the number of data per variable, Ri is the sum of the values of each column, for i = 1 ... n, and R̄ is the mean of these column sums.

We can ask whether a calculated sample W is significant, that is, whether it represents an association different from zero in the population that was sampled. A simple way to establish the relationship between Kendall's coefficient of concordance W and the Friedman chi-square statistic χ² involves using the following formula:

χ² = M(n - 1)W   (3)

Therefore, we can convert the calculated W to its equivalent χ² and then employ the critical values of the χ² distribution.
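The following Matlab fragment is a small sketch of Equations (2) and (3) as we read them: it takes an n-by-M data matrix (rows are the values per variable, columns are the compared variables, as in Tables 3 and 6) and returns W together with its chi-square equivalent. The function name concordance is ours.

function [W, chi2] = concordance(A)
    % A: n-by-M matrix of (normalized) values; columns are the compared variables
    [n, M] = size(A);
    R = sum(A, 1);                        % column sums R_i (the rank scores of Tables 3 and 6)
    S = sum((R - mean(R)).^2);            % squared deviations of the column sums from their mean
    W = 12 * S / (M^2 * (n^3 - n));       % Equation (2)
    chi2 = M * (n - 1) * W;               % Equation (3)
end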

5.2 Latent semantic indexing

In the first stage we created Table 2 to sort out all the conceptual contents of the documents. In particular, each row in Table 2 represents the local weight (Lij) of a term, and each column represents a document. Thus, an individual entry in Table 2, aij, represents the weight of term i in document j, where, in our case, i=1...88 and j=1...5. More specifically, each document of the term-document matrix can be expressed as a vector consisting of the weights of its terms and mapped in a vector space (Tokunaga 1999):

dj = (w1j, w2j, ..., w88j)   (4)

where Lij is the local weighting. This approach defines the weights wij in each document. Let Pij denote the occurrence probability of term ti (its term frequency) in document dj. We ascribe significance to a term's occurrence, on the grounds that it represents a document's topics more than other factors do. Therefore, we base wij on the term occurrence probability Pij in each document, and we define the local weighting Lij as follows:

Lij = Pij²   (5)

The LSI method uses an algebraic model of document retrieval, employing a matrix technique known as singular value decomposition (SVD). Thus, using the SVD function of Matlab, we obtained three separate matrices from the original matrix A of Lij elements represented in Table 2. The first matrix, B, is a term-by-concept matrix. The second is a concept-by-concept matrix (see Table 3). The third matrix, D, is a concept-by-document matrix. One important note is that multiplying B x C x D may not necessarily yield the original Table 2, since many of the concepts may be very small and not worth keeping. In our case, we used the normalized 5 x 5 matrix of Table 3 in order to compare the LSI method conceptually with our proposed method.
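A small sketch of this decomposition step is given below, assuming a recent Matlab release (R2016b or later) for the element-wise division. The toy counts and word totals are ours, not the study's data, and the local weighting is written as the squared occurrence probability, which is how we read Equation (5) against the values reported in Table 2; the svd call itself is standard Matlab.

counts = [1 0 2; 3 1 1; 0 2 1; 1 1 0];     % toy term-occurrence matrix (rows = terms, columns = documents)
words  = [100 95 102];                     % assumed total number of words in each document
P = counts ./ words;                       % occurrence probability P_ij of term i in document j
L = P .^ 2;                                % local weighting L_ij, as read from Equation (5)
[B, C, D] = svd(L);                        % singular value decomposition: L = B*C*D'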

Table 2: Mapping of the most frequent common terms (words) from the same five semantic (AIDS) texts, where: T/O is the term's occurrence and Lij is the local weighting

Terms Document C1 Document C2 Document C3 Document C4 Document C5
T/O Lij T/O Lij T/O Lij T/O Lij T/O Lij
West 1 0.0001 0 0.0000 0 0.0000 0 0.0000 1 0.0001
German 1 0.0001 0 0.0000 0 0.0000 1 0.0001 0 0.0000
chemical 1 0.0001 0 0.0000 0 0.0000 1 0.0001 3 0.0011
group 1 0.0001 0 0.0000 0 0.0000 0 0.0000 0 0.0000
Bayer 4 0.0020 0 0.0000 0 0.0000 1 0.0001 1 0.0001
said 3 0.0011 3 0.0011 1 0.0001 2 0.0005 1 0.0001
received 2 0.0005 0 0.0000 0 0.0000 2 0.0005 0 0.0000
claims 3 0.0011 0 0.0000 1 0.0001 1 0.0001 0 0.0000
damages 2 0.0005 0 0.0000 0 0.0000 1 0.0001 1 0.0001
hemophiliacs 1 0.0001 0 0.0000 1 0.0001 2 0.0005 0 0.0000
allege 1 0.0001 0 0.0000 0 0.0000 0 0.0000 1 0.0001
infected 1 0.0001 0 0.0000 1 0.0001 0 0.0000 0 0.0000
AIDS 1 0.0001 3 0.0011 2 0.0005 2 0.0005 2 0.0005
through 1 0.0001 0 0.0000 1 0.0001 0 0.0000 0 0.0000
blood 1 0.0001 0 0.0000 0 0.0000 1 0.0001 1 0.0001
plasma 1 0.0001 0 0.0000 1 0.0001 0 0.0000 1 0.0001
produced 1 0.0001 1 0.0001 0 0.0000 1 0.0001 1 0.0001
Bayer 3 0.0011 0 0.0000 0 0.0000 2 0.0005 0 0.0000
subsidiary 1 0.0001 0 0.0000 1 0.0001 1 0.0001 0 0.0000
shares 2 0.0005 0 0.0000 0 0.0000 0 0.0000 1 0.0001
fell 1 0.0001 0 0.0000 0 0.0000 0 0.0000 0 0.0000
marks 1 0.0001 0 0.0000 1 0.0001 1 0.0001 1 0.0001
close 1 0.0001 0 0.0000 0 0.0000 1 0.0001 0 0.0000
Frankfurt 1 0.0001 0 0.0000 0 0.0000 0 0.0000 1 0.0001
today 1 0.0001 0 0.0000 0 0.0000 1 0.0001 0 0.0000
Dealers 1 0.0001 0 0.0000 0 0.0000 0 0.0000 0 0.0000
newspaper 1 0.0001 0 0.0000 1 0.0001 0 0.0000 1 0.0001
report 1 0.0001 0 0.0000 1 0.0001 1 0.0001 0 0.0000
about 1 0.0001 0 0.0000 0 0.0000 0 0.0000 1 0.0001
possible 1 0.0001 0 0.0000 1 0.0001 1 0.0001 1 0.0001
huge 1 0.0001 0 0.0000 0 0.0000 1 0.0001 1 0.0001
triggered 1 0.0001 0 0.0000 0 0.0000 0 0.0000 0 0.0000
selling 1 0.0001 0 0.0000 1 0.0001 0 0.0000 0 0.0000
other 1 0.0001 0 0.0000 2 0.0005 3 0.0011 1 0.0001
pharmaceutical 1 0.0001 0 0.0000 0 0.0000 0 0.0000 0 0.0000
insurance 1 0.0001 0 0.0000 1 0.0001 2 0.0005 1 0.0001
company 1 0.0001 3 0.0011 1 0.0001 2 0.0005 1 0.0001
statement 1 0.0001 0 0.0000 0 0.0000 0 0.0000 0 0.0000
insured 1 0.0001 0 0.0000 2 0.0005 1 0.0001 1 0.0001
against 1 0.0001 0 0.0000 0 0.0000 0 0.0000 0 0.0000
and 1 0.0001 2 0.0005 2 0.0005 2 0.0005 3 0.0011
two 1 0.0001 0 0.0000 0 0.0000 0 0.0000 1 0.0001
Bristol-Myers 0 0.0000 2 0.0000 1 0.0001 1 0.0001 1 0.0001
would 0 0.0000 1 0.0001 0 0.0000 1 0.0001 1 0.0001
seek 0 0.0000 1 0.0001 0 0.0000 0 0.0000 2 0.0005
Food 0 0.0000 1 0.0001 1 0.0001 1 0.0001 0 0.0000
Drug 0 0.0000 2 0.0005 0 0.0000 0 0.0000 1 0.0001
Administration 0 0.0000 1 0.0001 0 0.0000 0 0.0000 0 0.0000
permission 0 0.0000 1 0.0001 1 0.0001 0 0.0000 0 0.0000
test 0 0.0000 2 0.0005 0 0.0000 1 0.0001 2 0.0005
vaccine 0 0.0000 4 0.0020 1 0.0001 1 0.0001 0 0.0000
humans 0 0.0000 2 0.0005 1 0.0001 1 0.0001 1 0.0001
end 0 0.0000 2 0.0005 1 0.0001 1 0.0001 1 0.0001
March. 0 0.0000 1 0.0001 0 0.0000 0 0.0000 1 0.0001
spokesman 0 0.0000 1 0.0001 1 0.0001 2 0.0005 3 0.0011
file 0 0.0000 1 0.0001 0 0.0000 0 0.0000 0 0.0000
investigational 0 0.0000 1 0.0001 1 0.0001 1 0.0001 1 0.0001
new 0 0.0000 1 0.0001 0 0.0000 0 0.0000 1 0.0001
application 0 0.0000 1 0.0001 1 0.0001 0 0.0000 1 0.0001
month 0 0.0000 1 0.0001 0 0.0000 1 0.0001 0 0.0000
requesting 0 0.0000 1 0.0001 2 0.0005 0 0.0000 0 0.0000
FDA 0 0.0000 1 0.0001 0 0.0000 1 0.0001 1 0.0001
allow 0 0.0000 1 0.0001 1 0.0001 0 0.0000 1 0.0001
Scientists 0 0.0000 1 0.0001 0 0.0000 0 0.0000 0 0.0000
Genetic 0 0.0000 1 0.0001 2 0.0005 1 0.0001 1 0.0001
Systems 0 0.0000 1 0.0001 0 0.0000 0 0.0000 0 0.0000
unit 0 0.0000 1 0.0001 0 0.0000 0 0.0000 0 0.0000
created 0 0.0000 1 0.0001 1 0.0001 0 0.0000 1 0.0001
which 0 0.0000 1 0.0001 0 0.0000 0 0.0000 0 0.0000
antibodies 0 0.0000 1 0.0001 0 0.0000 0 0.0000 0 0.0000
virus 0 0.0000 2 0.0005 1 0.0001 1 0.0001 0 0.0000
monkey 0 0.0000 1 0.0001 0 0.0000 0 0.0000 0 0.0000
mice 0 0.0000 1 0.0001 1 0.0001 1 0.0000 0 0.0000
smallpox 0 0.0000 1 0.0001 1 0.0001 0 0.0000 0 0.0000
consists 0 0.0000 1 0.0001 1 0.0001 0 0.0000 1 0.0001
hospital 0 0.0000 0 0.0000 1 0.0001 1 0.0001 1 0.0001
general 0 0.0000 0 0.0000 1 0.0001 1 0.0001 0 0.0000
thus 0 0.0000 0 0.0000 0 0.0000 0 0.0000 0 0.0000
letter 0 0.0000 0 0.0000 1 0.0001 0 0.0000 1 0.0001
Growth 0 0.0000 0 0.0000 2 0.0005 1 0.0001 0 0.0000
agreed 0 0.0000 0 0.0000 0 0.0000 1 0.0001 0 0.0000
lecture 0 0.0000 0 0.0000 1 0.0001 0 0.0000 1 0.0001
report 0 0.0000 0 0.0000 1 0.0001 0 0.0000 1 0.0001
patient 0 0.0000 0 0.0000 2 0.0005 1 0.0001 1 0.0001
library 0 0.0000 0 0.0000 1 0.0001 0 0.0000 0 0.0000
food 0 0.0000 0 0.0000 1 0.0001 0 0.0000 0 0.0000
care 0 0.0000 0 0.0000 1 0.0001 0 0.0000 0 0.0000
staff 0 0.0000 0 0.0000 1 0.0001 0 0.0000 0 0.0000

Table 3: Concept-by-concept matrix obtained from the SVD decomposition

0.4489 0.8667 0.8840 0.8430 0.9810
0.5386 0.7050 0.6724 0.6230 0.7500
0.4733 0.7560 0.8038 0.4137 0.7860
0.5693 0.4980 0.5210 0.6750 0.9238
0.5220 0.8500 0.6610 1.0000 0.7870
Rank Score 2.5521 3.6757 3.5422 3.5547 4.2278

In the next stage the data of Table 3 can be tested using the top-down correlation technique (see Equation 2). The null hypothesis (H0) of this test is that there is agreement regarding the investigated data of the columns of Table 3. In our case, according to Equation (2), M=5 was the number of investigated documents, n=5 was the number of conceptual values after reduction, and R was the rank score (column sum) of each document (see Table 3). Then W=0.0059 and thus χ² = M(n-1)W = 0.118 (see Equation 3). Comparing this value to the chi-square distribution with ν = 5-1 = 4 degrees of freedom, we had non-rejection of the null hypothesis (H0) of agreement regarding the conceptuality of the compared documents (0.95 < P < 0.99).

5.3 Onion layers method

In this stage we obtained five sets of data, one for each document (Table 4), using the method described in section 3.2. It should be noted that each set contains the points of the smallest convex polygon plus a number of additional points ranging between 0 and 2 (see Figure 1).

Table 4: Pair points of the last (smallest) polygon of the five onion polygons of documents C1, C2, C3, C4, C5

Document C1 Document C2 Document C3 Document C4 Document C5
Axis X Axis Y Axis X Axis Y Axis X Axis Y Axis X Axis Y Axis X Axis Y
248 97 255 101 259 99 251 99 232 98
246 70 242 101 249 100 242 99 230 97
265 97 235 100 242 97 240 97 220 83
248 97 249 44 265 32 237 83 243 97
259 97 267 68 266 98 262 77 232 98
255 101 259 99 269 97 240 97
252 101 254 97 251 99
266 70 260 97

The next stage was to transform the data of Table 4 so that each pair of data is represented by a single value. Thus, for the whole set we calculated the magnitude of each pair point (xij, yij) from the origin O of the coordinate system using the following formula:

rij = √(xij² + yij²)   (6)

where i = 1, ..., 5 indexes the elements of each smallest convex layer and j = 1, ..., 5 indexes the documents.

We then took the first five values for each document (Table 5) in order to compare them statistically with those of the LSI method (Table 3), and normalized the data in Table 6.
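As a small worked illustration of Equation (6) and the normalization used for Table 6, the Matlab fragment below processes the first five pair points of document C1 from Table 4; the divisor 283.4784 is the maximum magnitude over all five documents reported in Table 5.

x = [248; 246; 265; 248; 259];    % first five pair points of document C1 (Table 4)
y = [ 97;  70;  97;  97;  97];
r = sqrt(x.^2 + y.^2);            % Equation (6): magnitudes, cf. the C1 column of Table 5
rnorm = r / 283.4784;             % normalization on the maximum value of Table 5
rankScore = sum(rnorm);           % column sum used as the rank score in Equation (2)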

Table 5: Magnitude of all the pair points of documents C1, C2, C3, C4, C5

Document C1   Document C2   Document C3   Document C4   Document C5
266.2949      274.2736      277.2760      269.8185      251.8492
255.7655      262.2308      268.3300      261.4670      249.6177
282.1950      255.3919      260.7163      258.8610      235.1361
266.2949      252.8577      266.9251      251.1135      261.6448
276.5683      275.5231      283.4784      273.0806      251.8492

Table 6: Normalization on the maximum value (283.48) of Table 5

Document C1   Document C2   Document C3   Document C4   Document C5
0.9394        0.9675        0.9781        0.9518        0.8884
0.9022        0.9250        0.9466        0.9224        0.8806
0.9955        0.9009        0.9197        0.9132        0.8295
0.9394        0.8920        0.9416        0.8858        0.9230
0.9756        0.9719        1.0000        0.9633        0.8884
Rank Score    4.7252        4.5768        4.5588        4.5818        4.7992

In this stage the data of Table 6 can be tested using the top-down correlation technique (Equation 2). The null hypothesis (H0) of this test is that there is agreement regarding the investigated data of the columns of Table 6. In our case, according to Equation (2), M=5 was the number of investigated documents, n=5 was the number of conceptual values after reduction, and R was the rank score (column sum) of each document (see Table 6). Then W=0.0003 and thus χ² = M(n-1)W = 0.006 (see Equation 3). Comparing this value to the chi-square distribution with ν = 5-1 = 4 degrees of freedom, we had non-rejection of the null hypothesis of agreement regarding the conceptuality of the compared documents (0.95 < P < 0.99).

6 Analysis and discussion

As can be seen from the results in Table 1, the proposed method correctly categorized all tested texts.

The statistical evaluation based on rank correlation (section 5.1) of our proposed method and the LSI method (section 5.2) showed that they have the same level of significance (0.95 < P < 0.99), although our proposed method may be considered superior to the LSI method because its reduction technique is independent of the number of documents and the number of terms involved. In more detail, the method promises a great reduction in the original information. For example, a selected text of 500 characters is now represented by a convex polygon (categorization stage) of two coordinate vectors x and y with 3-9 elements each: a total reduction in information of about (500 - 18)/500 = 0.964, or 96.4%.

Furthermore, for smaller texts (100 words in length) we avoid the training procedure and, in this way, the complexity of the proposed method is reduced dramatically in comparison with traditional methods.

For training on longer texts this method may be applied as follows:

7 Conclusions and future work

7.1 Conclusion

This paper has explained a text categorization technique that combines the numerical conversion of a symbolic expression with a computational geometry algorithm. The proposed technique is suitable for the specific partitioning of digital library collections and for the semantic categorization of emails and abstracts. This method may be characterized as a supervised method because it is based on the comparison of an unknown text categorization vector (smallest convex polygon) with a particular text categorization vector whose class is already known. Thus, a practical way to see if this method is able to classify an unknown document vector correctly is to test it against a particular number of different predetermined semantic document vectors and let the system decide which of these vectors is geometrically nearest to the unknown vector, according to the method described in section 3.2.

An example of this is the experiment in which three test categories (each of five 100-word documents) were used to test this technique. Statistical information was collected from the experiment and used to corroborate our findings:

  1. Statistical evaluation using rank correlation gave a significant level of accuracy (0.95 < P < 0.99) and showed that the results of the five tested documents are in agreement.
  2. The proposed method is based on a different philosophy to those methods that depend on the removal of features based on overall frequency counts (Salton and McGill 1983, Jones and Paynter 2002). In contrast to those methods, the proposed method is based on total word participation in the feature extraction. The results showed that this feature is efficient and accurate for semantic text categorization.
  3. The onion algorithm, with complexity O(d*n log n), may be characterized as more efficient than the text categorization methods that are currently used, for example the Bayesian classifier with O(n³k) complexity.

In conclusion, use of the onion algorithm for text categorization purposes may be considered to be a solution to the problems of managing information storage and retrieval in large digital libraries.

7.2 Future work

A way to improve our proposed method in future would be to develop numerical conversion of a symbolic expression based on a different encoding method. In this way, we propose to extend the encoding to a larger 8-bit conversion. The aim of this proposition is for each character of each word to be described in a more specific way. Phonetic conversion in place of ASCII conversion may provide an even more accurate solution in our proposed method.

The next aim of this study is to increase the text length in each processed document to more than 100 words.

Furthermore, this system has the ability to investigate the relation of the position of one convex polygon to that of another in the Cartesian plane. Thus, in this study we considered the degree a of the intersection of convex polygons as a measure of comparison. However, in future the centre of gravity of a convex polygon may be used as a measure of comparison and treated as a vector (O'Rourke 1995).

In our latest work we modified the proposed method to construct an anti-spam filter.

Finally, this method may be considered to be a preliminary study and the next aim is to apply it to a real test collection. However, the statistical conclusion and the results of the 15 tested documents show that we are moving in the right direction.

References

Anick, P. and Vaithyanathan, S. (1997) "Exploiting Clustering and Phrases for Context-Based Information Retrieval". In Proceedings of SIGIR'97: the 20th International Conference on Research and Development in Information Retrieval (ACM Press), pp. 314-322

Borgefors, G., Ragnemalm, I. and Sanniti di Baja, G. (1992) "Feature extraction on the Euclidean distance transform". In Progress in Image Analysis and Processing II, edited by V. Cantoni, M. Ferretti, S. Levialdi, R. Negrini and R. Stefanelli, (World Scientific: Singapore), pp. 115-122

Bose, P. and Toussaint, G. (1995) "No Quadrangulation is Extremely Odd". In 6th International Symposium on Algorithms and Computation (formerly SIGAL International Symposium on Algorithms), pp. 340-358

Bremner, D., et al. (2003) "Output-Sensitive Algorithms for Computing Nearest-Neighbour Decision Boundaries". In Proceedings of the 8th Workshop on Algorithms and Data Structures (WADS 2003), Ottawa, Ontario, Canada, July 30-August 1 http://citeseer.ist.psu.edu/593554.html

Deerwester, S., Dumais, S. T., et al. (1990) "Indexing by latent semantic analysis". Journal of the American Society for Information Science, 41(6), 391-407 http://citeseer.ist.psu.edu/deerwester90indexing.html

Domingos, P. and Pazzani, M. J. (1997) "On the optimality of the simple Bayesian classifier under zero-one loss". Machine Learning, 29(2-3), 103-130 http://citeseer.ist.psu.edu/domingos97optimality.html

Dumais, S. T., Platt, J., Heckerman, D. and Sahami, M. (1998) "Inductive Learning Algorithms and Representations for Text Categorization". In Proceedings of the 7th International Conference on Information and Knowledge Management (ACM Press), pp. 148-155

Graham, R. L. (1972) "An efficient algorithm for determining the convex hull of a finite planar set". Information Processing Letters, 1, 132-133

Joachims, T. (1998) "Text categorization with support vector machines: Learning with many relevant features". European Conference on Machine Learning (ECML), pp. 123-132 http://citeseer.ist.psu.edu/joachims97text.html

Jones, S. and Paynter, G. W. (2002) "Automatic Extraction of Document Keyphrases for Use in Digital Libraries: Evaluation and Applications". Journal of the American Society for Information Science and Technology, 53(8), 653-677

Kendall, M. G. (1962) Rank Correlation Methods, 3rd edition (Charles Griffin: London), p. 199

Lewis, D. D. and Hayes, P. J. (1994) ACM Transactions on Information Systems, special issue on text categorization, 12(3), July

O'Rourke, J., Chien, J., Olson, C. and Naddor, T. (1982) "A new linear algorithm for intersecting convex polygons". Computer Graphics and Image Processing, 19(4), 384-391

O'Rourke, J. (1995) Computational Geometry in C (Cambridge University Press: New York), p. 47

Poulos, M., Rangoussi, M., et al. (1999) "Parametric person identification from the EEG using computational geometry". Proceedings of the Sixth International Conference on Electronics, Circuits and Systems (ICECS'99), Pafos, Cyprus, September, pp. 1005-1012

Poulos, M., Magkos, E., Chrissikopoulos, V. and Alexandris, N. (2003) "Secure Fingerprint Verification Based on Image Segmentation using Computational Geometry Algorithms". In Proceedings of the IASTED International Conference Signal Processing Pattern Recognition & Application, pp. 308-312 http://www.ionio.gr/~mpoulos/papers/Fingerprint%20verification.pdf

Salton, G. and McGill, M. (1983) Introduction to Modern Information Retrieval (McGraw-Hill)

Slonim, N. and Tishby, N. (2001) "The power of word clusters for text classification". In 23rd European Colloquium on Information Retrieval Research (ECIR) http://www.cis.upenn.edu/group/datamining/ReadingGroup/papers/textclass-infobottleneck.pdf

Tokunaga, T. (1999) Computation and Language Volume 5: Information Retrieval and Natural Language Processing (University of Tokyo Press)

Toussaint, G., Bhattacharya, B. and Poulsen, R. (1984) "The application of Voronoi diagrams to non-parametric decision rules". In Proceedings of Computer Science and Statistics: 16th Symposium of the Interface

Weston, J., et al. (1997) "Density Estimation Using Support Vector Machines". Technical Report CSD-TR-97-23, Royal Holloway College, University of London http://citeseer.ist.psu.edu/weston98density.html

Yang, Y. (1999) "An evaluation of statistical approaches to text categorization". Journal of Information Retrieval, Vol 1, No. 1/2, 67-88 http://citeseer.ist.psu.edu/yang97evaluation.html

Links

Reuters-21578 collection http://www.daviddlewis.com/resources/testcollections/reuters21578/

Appendix

This Appendix describes the algorithms used in the text categorization method. Section A1 describes the algorithm for constructing a two-dimensional convex polygon, and section A2 outlines an onion algorithm. In our case we considered that the point P = (Px, Py) is a converted character, where Py on the y-axis represents the corresponding value of a text character and Px on the x-axis represents the frequency of this character.

A1 Construction of a two-dimensional convex polygon

Graham (1972) proposed an algorithm of complexity O(n log n) which computes the convex hull of a given set of n points in the 2D plane. This algorithm proceeds in three phases.

Phase 1. The point with the highest y-coordinate, which is clearly on the hull, is used as the pivot (Figure A1).

figure A1

Figure A1. The original points

Phase 2. The points are sorted in order of increasing angle about the pivot. A horizontal ray emanating from the pivot P0 is rotated counter-clockwise. The first point that the ray contacts is sorted as the first point (P1). In the same way, the remaining points are sorted, until the ray returns to its original horizontal position (Figure A2).

figure A2

Figure A2. Sorting procedure

Phase 3. The horizontal ray, emanating from the pivot (P0), begins to rotate counter-clockwise and stops at the first point sorted in phase 2. Then these points, the first and the second (P0, P1), form the new axis of rotation. Using point P1 as the pivot, the third point of the convex polygon is determined as the first point of contact on rotation of the new axis (Figure A3).

figure A3

Figure A3. Final phase of the construction of the convex polygon

This procedure is repeated until a phase where the first point of contact of the rotating ray is again P0. Then the hull is closed.

Algebraic explanation of the area of a convex polygon

Suppose the x and y coordinates of a point pi are denoted by (xi, yi) and there are n points in total in the plane. The algebraic area (or signed area) A(p1, p2, p3) of the oriented triangle p1, p2, p3 is given by half the determinant of the 3 x 3 matrix whose rows are (x1, y1, 1), (x2, y2, 1) and (x3, y3, 1), that is A(p1, p2, p3) = ½[(x2 - x1)(y3 - y1) - (x3 - x1)(y2 - y1)].

The sign of the algebraic area is positive if and only if the orientation of the triangle is counter-clockwise. Consider any n-vertex simple polygon P whose vertices around its boundary in some orientation (clockwise or counter-clockwise) are p1, p2, ..., pn. Let O denote the origin in the plane of P. Then the algebraic area of the convex polygon P is A(P) = A(O, p1, p2) + A(O, p2, p3) + ... + A(O, pn, pn+1), where we assume pn+1 = p1. Here again, the sign is positive if and only if the given orientation is counter-clockwise around the boundary of P. Thus, any simple polygon P with more than three vertices (i.e. n > 3) has at least one diagonal. A diagonal is a line segment that connects two non-adjacent vertices of P and has no intersection with the exterior of P.
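The fragment below is a small Matlab sketch of this signed-area formula (our own illustration, summing the algebraic areas of the triangles formed with the origin); for an unsigned area the base-Matlab polyarea function gives the same magnitude.

function A = signed_area(x, y)
    % x, y: vertex coordinates of a simple polygon, listed around its boundary
    n = numel(x);
    A = 0;
    for i = 1:n
        j = mod(i, n) + 1;                       % next vertex, with p_{n+1} = p_1
        A = A + 0.5 * (x(i)*y(j) - x(j)*y(i));   % algebraic area of the triangle (O, p_i, p_j)
    end
end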

A2 Onion layers of a set of points

Let P be a set of n points in R², and let (H0, H1, ..., Hk) be the onion of P; that is, H0 is the convex hull of P. Now remove the points on the boundary of H0 from P and let H1 be the convex hull of what remains in P, and so on until no more points are left in P (see the example in Figure A4).

figure A4

Figure A4. Onion layers of a set of points

Thus, the onion is constructed by iterative application of the gift-wrapping algorithm, and this construction takes a total time of O(d*n log n).