A Text Categorization Technique based on a Numerical Conversion of a Symbolic Expression and an Onion Layers Algorithm
Abstract
The dramatic increase in the amount of content available in digital forms gives rise to largescale digital libraries, targeted at millions of users. As a result, it has become a necessary to categorize large texts (documents). The paper develops a novel method where text categorization is achieved via a reduction in the original data information using numerical conversion of a symbolic expression and an onion layers algorithm. Three different semantic categories were considered and five texts selected from each category for submission to a text categorization procedure using the proposed method. The results and the statistical evaluation of this procedure showed that the proposed method may be characterized as highly accurate for text categorization purposes.
Keywords: text categorization, onion algorithm, computational geometry
1 Introduction
Text categorization is the process of grouping texts into one or more predefined categories based on their content. Automatic text categorization can play an important role in a wide variety of more flexible, dynamic and personalized information management tasks as well as: realtime sorting of email or files into folder hierarchies; topic identification to support topicspecific processing operations; structured search and/or browsing; or finding texts that match longstanding interests or more dynamic taskbased interests (Anick and Vaithyanathan 1997). A number of statistical classification and machine learning techniques have been applied to text categorization, including regression models, Bayesian classifiers, decision trees, nearest neighbour classifiers, neural networks, and support vector machines (Weston et al. 1997).
Classification technologies should be able to support category structures that are very general, consistent across individuals, and relatively static (e.g. Dewey Decimal or Library of Congress classification systems, Medical Subject Headings (MeSH), or Yahoo!'s topic hierarchy), as well as those that are more dynamic and customized to individual interests or tasks. Until now the first step in text categorization has been to transform texts, which are typically strings of characters, into a representation suitable for the learning algorithm and the classification task. Each document is represented as a vector of words, as is typically done in information retrieval (Salton and McGill 1983). For most text retrieval applications, entries are weighted to reflect the frequency of terms in texts and the distribution of terms across the collection as a whole. For text classification, simpler binary feature values (i.e. a word either occurs or does not occur in a document) are often used instead. For example, in the most commonly used text representation, the vector space model, a vector of words represents each text. A wordbytext matrix A is used for a collection of texts, where each entry represents the occurrence of a word in a text, i.e. , where is the weight of word i in text j. Determining the weight and the frequency of words in a text is common to all existing methods.
Classification of the aforementioned vectors has been based on several learning techniques, including multivariate regression, nearest neighbour classifiers, probabilistic Bayesian models, decision trees, and neural networks. Overviews of this text classification work can be found in Lewis and Hayes (1994), Yang (1999), Joachims (1998) and Dumais et al. (1998). Furthermore, in the semantic text categorization reduction technique, latent semantic analysis (LSA) (Deerwester et al. 1990), there is an approach to automatic indexing and information retrieval that attempts to overcome common problems in information retrieval. The first problem is termed synonym and refers to the fact that there are many names for the same object. The second problem is called polysemy, and refers to the fact that most words have more than one meaning.
Recently, machine learning algorithms have been applied successfully to automatic text categorization problems, with a large number of unique words (over 15000). However, it has not yet been clarified, particularly for largescale problems with more than a million feature terms, to what degree the following two factors work to the detriment of each other: the reduction of the classification error using computationallyintensive machine learning algorithms, and the loss of information due to the selection of feature terms before applying the expensive learning algorithms.
Specifically, in the learning procedure the algorithm of Slonim and Tishby (2001), which is based on the Bayesian classifier, is O(n^{3}k) in complexity where n is the total number of words and k is the number of classes. Furthermore, the philosophy of the aforementioned methodology is justified because a Bayesian classifier presents a weakness in the point sets for large dimensions (Domingos and Pazzani 1997). It is known that high dimensionality, which presents a classified vector S of words, creates classification problems. These arise from the asymptotic error rate of the nearest neighbour rule (Toussaint et al. 1984). This problem is solved partially by the Voronoi diagram of S, which partitions the plane convex polygonal Voronoi cells.
This paper attempts to improve the performance of the naive text categorization methods by using the concept of probability weighted amount of information, and also by incorporating morphological information and probabilistic language modelling techniques in extracting and computing the weights of feature terms. Thus, our analysis is based on the advantages of the Voronoi diagram and we reduce the complexity of this transformation by using the onion algorithm of Bremner et al. (2003). The idea behind this implementation is to use the convex hull subroutine recursively to extract the outmost convex hull (S1 smallest convex polygon) of the given points (characters). In particular, we have developed a method based on an onion polygon each point of which represents a character of the document. For the numerical conversion of each character we used a specific numerical representation of the conversion strings which equates to double precision of ASCII characters.
In this novel way we avoid the problems of vector representation by converting the entire text into a unique set of numbers (more details of this method are given in section 2.1). This conversion took place in order for all the features of the characters of a document to be represented in the Cartesian plane, and used in the computational geometry. Furthermore, a characteristic of the onion layers method (Borgefors et al. 1992) is that all the features are sorted in an ideal way and all specific features of documents, such as spaces and punctuation marks, have a domain role in the final reduction. This method may be characterized as a continuation of our latest research, such as specific feature extraction for fingerprint verification via onion layers (Poulos et al. 2003), and the study of the encephalogram where the feature of a convex polygon gave positive results in diagnostic biometrical tests (Poulos et al. 1999).
The aim of this paper is to describe the proposed algorithm in depth and with examples, and, furthermore, to compare it statistically with the similar LSA reduction method in order to extract significant conclusions regarding the statistical assurance of this method.
2 Method
In brief, our proposed method is divided into the following stages:
 Preprocessing stage: conversion of the symbolic expression (in our case an array of characters of a text) to numeric values.
 Processing stage: analysis of the proposed dimensionality reduction technique using an onion algorithm based on computational geometry for text categorization purposes.
 Categorization stage: the result of the above reduction is to obtain the smallest characteristic subset. This is compared geometrically with another characteristic subset from a different text in order to determine their semantic correlation.
 Decision stage: determination of the rules of categorization.
2.1 Preprocessing stage
In this stage we suppose that a selected text is an input vector , where represent the characters of the selected text. Then, using a conversion procedure where a symbolic expression (in our case an array of characters of a text) is converted to ASCII characters in string arithmetic values, we obtained a numerical value vector where these values ranged between 1128.
In our case this conversion was achieved by using the double.m function of the Matlab language. This function converts strings to double precision and equates with converting an ASCII character to its numerical representation.
For better comprehension we give the following example via Matlab:
>> S = 'This is a message to test the double "command".'
>> double(S)
ans =
Columns 1 through 12
84 104 105 115 32 105 115 32 97 32 109 101
Columns 13 through 24
115 97 103 101 32 116 111 32 116 101 115 116
Columns 25 through 36
32 116 104 101 32 100 111 117 98 108 101 32
Columns 37 through 46
34 99 111 109 109 97 110 100 34 46
2.2 Processing stage
Our proposed method is based on the following proposition:
The set of elements of vector for each selected text contains a convex subset, which has a specific position in relation to the original set (Poulos et al. 1999, Poulos et al. 2003). This position may be determined by using a combination of computational geometry algorithms, which is known as Onion Peeling Algorithms (Bose and Toussaint 1995) with overall complexity O(d*n log n) times, where d is the depth of the smallest convex layer and n is the number of characters in the numerical representation (in accordance with section 2.1)
Thus, the smallest convex layer of the original set of vector carries specific information. In particular, vector may be characterized as a common geometrical area of all the elements of vector . In our case, this consideration is valuable because this subset may be characterized as representing the significant semantics of the selected text.
Implementation
We consider the set of characters of a selected text to be vector . The algorithm starts
with a finite set of points in the plane. The following iterative process is considered. Let
be the set minus all the points
on the boundary of the hull of . Similarly, define . The process continues until the set is (see Figure 1). The
hulls are called the
layers of the set, and the process of peeling away the layers is called onion
peeling for obvious reasons (Figure 1). More details of the algorithmic construction
of the proposed convex polygon and onion algorithm are presented in the Appendix.
Figure 1. Onion layers of a set of points
2.3 Categorization stage
In our case we considered that the smallest convex layer that has depth 3 (Figure 1) carries specific information (Poulos et al. 2003), because this position gives a geometrical interpretation of the semantics of the selected text. In other words, the smallest convex polygon (layer) depicts a particular geometrical area in which the values of these semantics range. This structure has also been used for example in statistics to define a generalization of the concept of the median into twodimensional data (O'Rourke et al. 1982). This feature may be characterized as unique to each selected text because the two following conditions (Poulos et al. 1999) are ensured:
 the selected area layer is nonintersected with another layer
 the particular depth of the smallest layer is variable in each case.
Thus, two variables were extracted from the proposed text categorization method: the area of the smallest onion layer and the depth (i) of this layer, which is a subset of the original text set S values. In more detail, the smallest onion layer is a matrix that contains the set of convex coordinate values of the smallest onion layer, in other words,
where x represents the frequency of each converted character and y the numerical value of each character.
Taking into account the specific features of the aforementioned variables it is easy to ascertain that these may be used in the next stage.
2.4 Decision stage of computational geometry method
In this stage subset
intersects a new subset , which came from the processing of another set , on the Cartesian plane (Figure 2).
Figure 2. Decision stage between two different onion algorithm procedures
The correlation of investigated subsets , is decided based on the following experimental conditions:
If the above subsets are intersected, the degree of correlation is extracted by using the fraction of the common intersection and the area of the original convex set. In more detail, if we consider that and the corresponding areas of these convex polygons are , and , then the degree of the correlation of subset with regard to subset is extracted as follows:
(1)
In our case if , then
the system decides that the two compared subsets and are not correlated (Figures 3 and 4).
Figure 3. Procedure of correlating the two smallest convex polygons, extracted using the onion layer method (see Figure 2)
In this way degree is the defining factor in categorizing two investigated texts.
Figure 4. Case of categorization in the same category. The green area is the common area of the compared text categorization documents
3 Experimental part
3.1 Data acquisition
In this study we tested the proposed method in three different semantic categories:
 Five documents (A_{1, } A_{2, } A_{3, } A_{4 } and A_{5}) concerning Money and Marketing with the Reuters21578 collection as a source.
 Five documents (B_{1}, B_{2}, B_{3}, B_{4 } and B_{5}) concerning computational geometry with common keywords: Computational, Geometry, Algorithm, Onion Algorithm and Fingerprint Verification
 Five documents (C_{1, } C_{2, } C_{3, } C_{4 } and C_{5}) concerning the AIDS virus from the Reuters21578 collection
Each text had about 500 characters (with spaces), or about 100 words.
3.2 Implementation of the algorithm
The implementation of the algorithm takes place using the following steps.
In the preprocessing stage we selected a text from each category and submitted it for numerical conversion using the following algorithm implemented via Matlab function double.m.
For example, we took the following text from the third category (AIDS and Virus), which we called C_{1}:
West German chemical group Bayer AG <BAYG.F> said it had received claims for damages from haemophiliacs who allege they were infected with AIDS through a blood plasma produced by a Bayer subsidiary. Bayer's shares fell 13 marks to close at 292 in Frankfurt today. Dealers said a newspaper report about possible huge claims for damages triggered off the share selling in Bayer, other pharmaceuticals and insurance companies. Bayer said in a statement it was insured against claims and had received two.
This text was submitted to numerical conversion, and a vector of a set number was constructed.
The same procedure was followed for other selected texts from the three categories.
In the processing stage vector was put on the Cartesian plane and the onion algorithm was applied, thus isolating the characteristic smallest convex polygon (vector ). This algorithm was implemented via Matlab algorithms convex2.m.
This procedure was repeated for the other vectors until we finally isolated the 15 smallest convex polygons or vectors where n=1 ... 15
An example of this procedure is presented in Figures 5 and 6.
Figure 5. Implementation of the onion polygon of text C_{1} using vector
Figure 6. Isolation of the smallest convex polygon of text C_{1} or vectors ,
In the categorization stage vectors and of the smallest polygon were intersected with the other vectors. The algorithmic procedure was as follows.
We considered that the intersected coordinate vectors were and from the first text and and from the second text. Then the convex polygon, which was produced by chance intersection, had and coordinate vectors. This algorithm was implemented via the following Matlab function:
For example, we used the following second text from the third category (Aids and Virus), which we called C_{2}:
BristolMyers Co <BMY> said it would seek Food and Drug Administration permission to test its AIDS vaccine in humans by the end of March. A company spokesman said the company will file an investigational new drug application by the end of the month, requesting the FDA to allow testing of the vaccine in humans. Scientists at the company's Genetic Systems unit created an AIDS vaccine, which BristolMyers said produced antibodies to the AIDS virus in mice and monkeys. The vaccine consists of smallpox virus.
We also used the following text, which we called B_{1}, from the second category, Computational Geometry:
In this paper, fingerprint segmentation for secure Internet verification purposes is investigated. The novel application of computational geometry algorithms in the fingerprint segmentation stage showed that the extracted feature (characteristic polygon) may be used as a secure and accurate method for fingerprintbased verification over the Internet. On the other hand, the proposed method promisingly allows very small false acceptance and false rejection rates, as it is based on specific segmentation,
Figures 7 and 8 show the implementation of the intersection of semantic texts from
the same categories, C_{1}
and C_{2}, where we obtained a large convex polygon C
_{12}
(green) (). In
contrast, in Figures 9 and 10 show that the intersection of semantic texts
C_{1}
and B_{1}
are zero (B_{1}∩C_{1}=0).
Figure 7. Intersection of two semantic texts, C_{1} and C_{2}_{, } using the onion algorithm
Figure 8. Isolation of the intersection (green) of the two smallest convex polygons of semantic texts C_{1} (blue) and C_{2} (red)
Figure 9. Intersection of two different semantic texts, C_{1} and B_{1}, using the onion algorithm
Figure 10. Zero intersection between the two smallest convex polygons of semantic texts C_{1} (blue) and B_{1} (red)
In the decision stage degree of correlation between subsets and was calculated as follows.
First, we calculated the area of each convex subset and followed by the area of their common intersection . Then degree was calculated as follows:
and
In Matlab this procedure was calculated as follows:
and
and was repeated for all text cases. The degrees are presented in Table 1.
4 Results
In this part the results of the implementation of the proposed algorithm for text categorization purposes are presented in Table 1. More specifically, we extracted the degrees of correlation using Equation 1. In total we obtained 15 x 15 = 225 degrees of correlation for the three semantic categories.
Table 1:
degree of correlation for three semantic categories (A, B, C) using an onion
algorithm
TEXTS 
A1 
A2 
A3 
A4 
A5 
B1 
B2 
B3 
B4 
B5 
C1 
C2 
C3 
C4 
C5 
A1 
1 
0.71 
0.42 
0.35 
0.33 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
A2 
0.53 
1 
0.67 
0.22 
0.18 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
A3 
0.44 
0.48 
1 
0.39 
0.87 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
A4 
0.67 
0.56 
0.33 
1 
0.26 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
A5 
0.44 
0.18 
0.35 
0.56 
1 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
B1 
0 
0 
0 
0 
0 
1 
0.67 
0.43 
0.38 
0.35 
0 
0 
0 
0 
0 
B2 
0 
0 
0 
0 
0 
0.19 
1 
0.22 
0.34 
0.44 
0 
0 
0 
0 
0 
B3 
0 
0 
0 
0 
0 
0.54 
0.87 
1 
0.56 
0.57 
0 
0 
0 
0 
0 
B4 
0 
0 
0 
0 
0 
0.66 
0.56 
0.11 
1 
0.33 
0 
0 
0 
0 
0 
B5 
0 
0 
0 
0 
0 
0.23 
0.43 
0.31 
0.27 
1 
0 
0 
0 
0 
0 
C1 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
1 
0.44 
0.23 
0.25 
0.33 
C2 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0.81 
1 
0.12 
0.17 
0.19 
C3 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0.54 
0.41 
1 
0.33 
0.33 
C4 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0.30 
0.29 
0.46 
1 
0.27 
C5 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0.23 
0.25 
0.42 
0.25 
1 
5 Statistical evaluation
In this section a statistical evaluation between the most used information retrieval method, latent semantic indexing (LSI), and the proposed method is attempted. In particular, we tested the five texts of the same semantic category C (AIDSvirus). To compare the two techniques we used rank correlation among several variables of the chisquare control. In the following paragraphs we develop the procedure followed in order to yield the normalized matrices for each method and, thereafter, apply them to the selected statistical method.
5.1 Concordance: rank correlation among several variables
The concept of correlation between two variables can be expanded to consider association among more than two, as shown for multiple correlations (Kendall 1962). Such association is readily measured nonparametrically by a statistic known as Coefficients of Concordance. In our case this correlation, W, gave the perfect criterion of agreement among five selected sets of data. In other words, this test investigates if the data of columns are in agreement. The correlation W was employed by the following equation:
(2)
where M is the number of variables being correlated and n is the number of data per variable, R_{i} is the sum of the values of each column for i=1 ... n.
We can ask whether a calculated sample, W, is significant, that is, whether it represents an association different from zero in the population that was sampled. A simple way to find out the relationship between the Kendall Coefficients of Concordance, W, and the Friedman chisquare involves using the following formula:
(3)
Therefore, we can convert the calculated W to its equivalent and then employ the critical values of .
5.2 Latent semantic indexing
In the first stage we created Table 2 to sort out all the conceptual contents of the documents. In particular, the local weight (L_{ij}) of each term is represented by each row in Table 2, and every document is represented by each column. Thus, an individual entry in Table 2, a_{ij}, represents the weights of the terms i in document j, where, in our case, i=1...88 and j=1...5. More specifically, the term document matrix can be expressed as a vector consisting of the weights of each term and mapped in a vector space (Tokunaga 1999):
(4)
where: L_{ij} is the local weighting. This approach defines the weights w_{ij} in each document. Let P_{ij} denote the occurrence probability of t_{i} (terms frequency) and d_{j} (documents). We ascribe significance to a term's occurrence, on the grounds that it represents a document's topics more than other factors do. Therefore, we base w _{ij} on the term occurrence probability P_{ij} in each document, and we define a local weighting L_{ij} as follows:
(5)
The LSI method uses an algebraic model of document retrieval, using a matrix technique known as singular value decomposition. Thus, using the SVD function of Matlab, we obtained three separate matrices from the original table A, as represented in Table 2, of L_{ij} elements. The first matrix is a term by concept matrix B. The second matrix is a concept by concept matrix (see Table 3). The third matrix is a concept by document matrix D. One important note is that multiplying B x C x D may not necessarily yield the original Table 2, since many of the concepts may be very small and not worth keeping. In our case, we used the normalized matrix B (Table 3) of dimensionality 5 x 5 in order to conceptually compare the LSI method with our proposed method.
Table 2: Mapping of the most frequent common terms (words) from the same
five semantic (AIDS) texts, where: T/O is the term's occurrence and
L_{ij}
is the local weighting
Terms  Document C1  Document C2  Document C3  Document C4  Document C5  
T/O  L_{ij}  T/O  L_{ij}  T/O  L_{ij}  T/O  L_{ij}  T/O  L_{ij}  
West  1  0.0001  0  0.0000  0  0.0000  0  0.0000  1  0.0001 
German  1  0.0001  0  0.0000  0  0.0000  1  0.0001  0  0.0000 
chemical  1  0.0001  0  0.0000  0  0.0000  1  0.0001  3  0.0011 
group  1  0.0001  0  0.0000  0  0.0000  0  0.0000  0  0.0000 
Bayer  4  0.0020  0  0.0000  0  0.0000  1  0.0001  1  0.0001 
said  3  0.0011  3  0.0011  1  0.0001  2  0.0005  1  0.0001 
received  2  0.0005  0  0.0000  0  0.0000  2  0.0005  0  0.0000 
claims  3  0.0011  0  0.0000  1  0.0001  1  0.0001  0  0.0000 
damages  2  0.0005  0  0.0000  0  0.0000  1  0.0001  1  0.0001 
hemophiliacs  1  0.0001  0  0.0000  1  0.0001  2  0.0005  0  0.0000 
allege  1  0.0001  0  0.0000  0  0.0000  0  0.0000  1  0.0001 
infected  1  0.0001  0  0.0000  1  0.0001  0  0.0000  0  0.0000 
AIDS  1  0.0001  3  0.0011  2  0.0005  2  0.0005  2  0.0005 
through  1  0.0001  0  0.0000  1  0.0001  0  0.0000  0  0.0000 
blood  1  0.0001  0  0.0000  0  0.0000  1  0.0001  1  0.0001 
plasma  1  0.0001  0  0.0000  1  0.0001  0  0.0000  1  0.0001 
produced  1  0.0001  1  0.0001  0  0.0000  1  0.0001  1  0.0001 
Bayer  3  0.0011  0  0.0000  0  0.0000  2  0.0005  0  0.0000 
subsidiary  1  0.0001  0  0.0000  1  0.0001  1  0.0001  0  0.0000 
shares  2  0.0005  0  0.0000  0  0.0000  0  0.0000  1  0.0001 
fell  1  0.0001  0  0.0000  0  0.0000  0  0.0000  0  0.0000 
marks  1  0.0001  0  0.0000  1  0.0001  1  0.0001  1  0.0001 
close  1  0.0001  0  0.0000  0  0.0000  1  0.0001  0  0.0000 
Frankfurt  1  0.0001  0  0.0000  0  0.0000  0  0.0000  1  0.0001 
today  1  0.0001  0  0.0000  0  0.0000  1  0.0001  0  0.0000 
Dealers  1  0.0001  0  0.0000  0  0.0000  0  0.0000  0  0.0000 
newspaper  1  0.0001  0  0.0000  1  0.0001  0  0.0000  1  0.0001 
report  1  0.0001  0  0.0000  1  0.0001  1  0.0001  0  0.0000 
about  1  0.0001  0  0.0000  0  0.0000  0  0.0000  1  0.0001 
possible  1  0.0001  0  0.0000  1  0.0001  1  0.0001  1  0.0001 
huge  1  0.0001  0  0.0000  0  0.0000  1  0.0001  1  0.0001 
triggered  1  0.0001  0  0.0000  0  0.0000  0  0.0000  0  0.0000 
selling  1  0.0001  0  0.0000  1  0.0001  0  0.0000  0  0.0000 
other  1  0.0001  0  0.0000  2  0.0005  3  0.0011  1  0.0001 
harmaceutical  1  0.0001  0  0.0000  0  0.0000  0  0.0000  0  0.0000 
insurance  1  0.0001  0  0.0000  1  0.0001  2  0.0005  1  0.0001 
company  1  0.0001  3  0.0011  1  0.0001  2  0.0005  1  0.0001 
statement  1  0.0001  0  0.0000  0  0.0000  0  0.0000  0  0.0000 
insured  1  0.0001  0  0.0000  2  0.0005  1  0.0001  1  0.0001 
against  1  0.0001  0  0.0000  0  0.0000  0  0.0000  0  0.0000 
and  1  0.0001  2  0.0005  2  0.0005  2  0.0005  3  0.0011 
two  1  0.0001  0  0.0000  0  0.0000  0  0.0000  1  0.0001 
BristolMyers  0  0.0000  2  0.0000  1  0.0001  1  0.0001  1  0.0001 
would  0  0.0000  1  0.0001  0  0.0000  1  0.0001  1  0.0001 
seek  0  0.0000  1  0.0001  0  0.0000  0  0.0000  2  0.0005 
Food  0  0.0000  1  0.0001  1  0.0001  1  0.0001  0  0.0000 
Drug  0  0.0000  2  0.0005  0  0.0000  0  0.0000  1  0.0001 
Administration  0  0.0000  1  0.0001  0  0.0000  0  0.0000  0  0.0000 
permission  0  0.0000  1  0.0001  1  0.0001  0  0.0000  0  0.0000 
test  0  0.0000  2  0.0005  0  0.0000  1  0.0001  2  0.0005 
vaccine  0  0.0000  4  0.0020  1  0.0001  1  0.0001  0  0.0000 
humans  0  0.0000  2  0.0005  1  0.0001  1  0.0001  1  0.0001 
end  0  0.0000  2  0.0005  1  0.0001  1  0.0001  1  0.0001 
March.  0  0.0000  1  0.0001  0  0.0000  0  0.0000  1  0.0001 
spokesman  0  0.0000  1  0.0001  1  0.0001  2  0.0005  3  0.0011 
file  0  0.0000  1  0.0001  0  0.0000  0  0.0000  0  0.0000 
investigational  0  0.0000  1  0.0001  1  0.0001  1  0.0001  1  0.0001 
new  0  0.0000  1  0.0001  0  0.0000  0  0.0000  1  0.0001 
application  0  0.0000  1  0.0001  1  0.0001  0  0.0000  1  0.0001 
month  0  0.0000  1  0.0001  0  0.0000  1  0.0001  0  0.0000 
requesting  0  0.0000  1  0.0001  2  0.0005  0  0.0000  0  0.0000 
FDA  0  0.0000  1  0.0001  0  0.0000  1  0.0001  1  0.0001 
allow  0  0.0000  1  0.0001  1  0.0001  0  0.0000  1  0.0001 
Scientists  0  0.0000  1  0.0001  0  0.0000  0  0.0000  0  0.0000 
Genetic  0  0.0000  1  0.0001  2  0.0005  1  0.0001  1  0.0001 
Systems  0  0.0000  1  0.0001  0  0.0000  0  0.0000  0  0.0000 
unit  0  0.0000  1  0.0001  0  0.0000  0  0.0000  0  0.0000 
created  0  0.0000  1  0.0001  1  0.0001  0  0.0000  1  0.0001 
which  0  0.0000  1  0.0001  0  0.0000  0  0.0000  0  0.0000 
antibodies  0  0.0000  1  0.0001  0  0.0000  0  0.0000  0  0.0000 
virus  0  0.0000  2  0.0005  1  0.0001  1  0.0001  0  0.0000 
monkey  0  0.0000  1  0.0001  0  0.0000  0  0.0000  0  0.0000 
mice  0  0.0000  1  0.0001  1  0.0001  1  0.0000  0  0.0000 
smallpox  0  0.0000  1  0.0001  1  0.0001  0  0.0000  0  0.0000 
consists  0  0.0000  1  0.0001  1  0.0001  0  0.0000  1  0.0001 
hospital  0  0.0000  0  0.0000  1  0.0001  1  0.0001  1  0.0001 
general  0  0.0000  0  0.0000  1  0.0001  1  0.0001  0  0.0000 
thus  0  0.0000  0  0.0000  0  0.0000  0  0.0000  0  0.0000 
letter  0  0.0000  0  0.0000  1  0.0001  0  0.0000  1  0.0001 
Growth  0  0.0000  0  0.0000  2  0.0005  1  0.0001  0  0.0000 
aggreed  0  0.0000  0  0.0000  0  0.0000  1  0.0001  0  0.0000 
lecture  0  0.0000  0  0.0000  1  0.0001  0  0.0000  1  0.0001 
report  0  0.0000  0  0.0000  1  0.0001  0  0.0000  1  0.0001 
patient  0  0.0000  0  0.0000  2  0.0005  1  0.0001  1  0.0001 
library  0  0.0000  0  0.0000  1  0.0001  0  0.0000  0  0.0000 
food  0  0.0000  0  0.0000  1  0.0001  0  0.0000  0  0.0000 
care  0  0.0000  0  0.0000  1  0.0001  0  0.0000  0  0.0000 
staff  0  0.0000  0  0.0000  1  0.0001  0  0.0000  0  0.0000 
Table 3: Concept by concept, decompose matrix using SVD method
0.4489  0.8667  0.8840  0.8430  0.9810  
0.5386  0.7050  0.6724  0.6230  0.7500  
0.4733  0.7560  0.8038  0.4137  0.7860  
0.5693  0.4980  0.5210  0.6750  0.9238  
0.5220  0.8500  0.6610  1.0000  0.7870  
Rank Score  2.5521  3.6757  3.5422  3.5547  4.2278 
In the next stage the data of Table 3 can be tested using the topdown correlation technique (see Equation 2). The null hypothesis (H_{0}) of this test is that there is an agreement regarding the investigated data of the columns of Table 3. In our case, according to Equation (2), M=5 was the number of investigated documents, n=5 were the conceptual values after reduction, and R was the sum of the rank score (see Table 3). Then, W=0.0059 and thus (see Equation 3). Comparing it to the chisquare distribution with a degrees of freedom √≠=51= 4 and , we had nonrejection of the null hypothesis (H_{0}) of the agreement regarding the conceptuality of the compared documents (0.99 < P < 0.95).
5.3 Onion layers method
In this stage we obtained the five sets of data for each document (Table 4) using the method described in section 3.2. It should be noted that each set contains the points of the smallest convex polygon plus a number of points ranging between 02 (see Figure 1).
Table 4: Pair points of the last (smallest polygon) of the five onion
polygons of documents C1, C2, C3, C4, C5
Document C1  Document C2  Document C3  Document C4  Document C5  
Axis X  Axis Y  Axis X  Axis Y  Axis X  Axis Y  Axis X  Axis Y  Axis X  Axis Y 
248  97  255  101  259  99  251  99  232  98 
246  70  242  101  249  100  242  99  230  97 
265  97  235  100  242  97  240  97  220  83 
248  97  249  44  265  32  237  83  243  97 
259  97  267  68  266  98  262  77  232  98 
255  101  259  99  269  97  240  97  
252  101  254  97  251  99  
266  70  260  97 
The next stage was to transform the data of Table 4 in order for each pair of data to be represented by a value. Thus, for the whole set we calculated the magnitude of each pair point (x_{ij},y_{ij}) from the origin 0 of the coordinate system using the following formula:
(6)
where: i=1, 5 is the number of elements of each smallest convex layer and j=1, 5 is the number of documents.
We then obtained the first five values (Table 5) for each document in order to compare these statistically with those of the of LSI method (Table 3) and normalized the data in Table 6.
Table 5: Magnitude of all the pair points of documents C1, C2, C3, C4, C5
Document C1  Document C2  Document C3  Document C4  Document C5 
266.2949 
255.7655 
282.1950 
266.2949 
276.5683 
274.2736 
262.2308 
255.3919 
252.8577 
275.5231 
277.2760 
268.3300 
260.7163 
266.9251 
283.4784 
269.8185 
261.4670 
258.8610 
251.1135 
273.0806 
251.8492 
249.6177 
235.1361 
261.6448 
251.8492 
Table 6: Normalization on the maximum value (283.48) of Table 5
Document C1 
Document C2 
Document C3 
Document C4 
Document C5 

0.9394 
0.9022 
0.9955 
0.9394 
0.9756 

0.9675 
0.9250 
0.9009 
0.8920 
0.9719 

0.9781 
0.9466 
0.9197 
0.9416 
1.0000 

0.9518 
0.9224 
0.9132 
0.8858 
0.9633 

0.8884 
0.8806 
0.8295 
0.9230 
0.8884 

Rank Score  4.7252  4.5768  4.5588  4.5818  4.7992 
In this stage the data of Table 6 can be tested using the topdown correlation technique (Equation 2). The null hypothesis (H _{0} ) of this test is that there is an agreement regarding the investigated data of the columns of Table 6. In our case and according to Equation (2), M=5 was the number of investigated documents, n=5 were the conceptual values after reduction, and R was the sum of the rank score (see Table 6). Then W=0.0003 and thus (see Equation 3). On comparing it to the chisquare distribution with degree of freedom √≠=51=4 and , we had nonrejection of the null hypothesis of the agreement regarding the conceptuality of the compared documents (0.99 < P < 0.95).
6 Analysis and discussion
As can be seen from the results in Table 1, the proposed method correctly categorized all tested texts.
The statistical evaluation method based on rank correlation (section 5.1) of our proposed method and the LSI method (section 5.2) showed that they have the same level significance (0.99 < P < 0.95), although our proposed method may be considered to be superior to the LSI method because the reduction technique is independent of the number of documents and the number of terms involved. In more detail, the method promises a great reduction in the original information. For example, a selected text of 500 characters is now represented by a convex polygon (categorization stage) of two vectors and that have 39 elements each: a total reduction in information of about or 96.4%
Furthermore, for smaller texts (100 words in length) we avoid the training procedure and, in this way, the complexity of the proposed method is reduced dramatically in comparison with traditional methods.
For training on longer texts this method may be applied as follows:
 The text is divided into smaller subsets of about 100 words in length.
 The procedure of the preprocessing stage (section 3.2) is then applied to each subset in turn.
7 Conclusions and future work
7.1 Conclusion
This paper has explained a text categorization technique that combines numeric conversion of a symbolic expression and a computational geometric algorithm. The proposed technique is suitable for the specific partitioning of digital library collections, and semantic categorization of emails and abstracts. This method may be characterized as a supervised method because it is based on the claim that an unknown text categorization vector (smallest convex polygon) is compared with a particular text categorization vector the class of which is already known. Thus, a practical way to see if this method is able to classify an unknown document vector correctly is to test it with a particular number of different predetermined semantic document vectors and let the system decide which of these vectors is geometrically nearest the unknown vector according to the method described in section 3.2.
An example of this is the experiment in which five test categories (of 5 x 100 word documents per category) were used to test this technique. Statistical information was collected from the experiment and used to corroborate our findings:
 Statistical evaluation using rank correlation gave a significant level of accuracy (0.99 < P < 0.95) and showed that the results of the five tested documents are in agreement.
 The proposed method is based on a different philosophy to those methods that depend on the removal of features based on overall frequency counts (Salton and McGill 1983, Jones and Paynter 2002). In contrast to those methods, the proposed method is based on total word participation in the feature extraction. The results showed that this feature is efficient and accurate for semantic text categorization.
 The onion algorithm may be characterized as a more efficient algorithm for the lowest complexity, O(d*n log n), than the text categorization methods that are used currently, for example the Bayesian classifier with O(n^{3}k) complexity.
In conclusion, use of the onion algorithm for text categorization purposes may be considered to be a solution to the problems of managing information storage and retrieval in large digital libraries.
7.2 Future work
A way to improve our proposed method in future would be to develop numerical conversion of a symbolic expression based on a different encoding method. In this way, we propose to extend the encoding to a larger 8bit conversion. The aim of this proposition is for each character of each word to be described in a more specific way. Phonetic conversion in place of ASCII conversion may provide an even more accurate solution in our proposed method.
The next aim of this study is to increase the text length in each processed document to more than 100 words.
Furthermore, this system has the ability to investigate the relation of the position of one convex polygon to that of another in the Cartesian plane. Thus, in this study we considered the degree a of the intersection of convex polygons as a measure of comparison. However, in future the centre of gravity of a convex polygon may be used as a measure of comparison and treated as a vector (O'Rourke 1995).
In our latest work modified the proposed method to construct an antispam filter.
Finally, this method may be considered to be a preliminary study and the next aim is to apply it to a real test collection. However, the statistical conclusion and the results of the 15 tested documents show that we are moving in the right direction.
References
Anick, P. and Vaithyanathan, S. (1997) "Exploiting Clustering and Phrases for ContextBased Information Retrieval". In Proceedings of SIGIR'97: the 20th International Conference on Research and Development in Information Retrieval (ACM Press), pp. 314322
Borgefors, G., Ragnemalm, I. and Sanniti di Baja, G. (1992) "Feature extraction on the Euclidean distance transform". In Progress in Image Analysis and Processing II, edited by V. Cantoni, M. Ferretti, S. Levialdi, R. Negrini and R. Stefanelli, (World Scientific: Singapore), pp. 115122
Bose, P. and Toussaint, G. (1995) "No Quadrangulation is Extremely Odd". In 6th International Symposium on Algorithms and Computation (formerly SIGAL International Symposium on Algorithms), pp. 340358
Bremner, D., et al. (2003) "OutputSensitive Algorithms for Computing NearestNeighbour Decision Boundaries". In Proceedings of the 8th Workshop on Algorithms and Data Structures (WADS 2003), Ottawa, Ontario, Canada, July 30August 1 http://citeseer.ist.psu.edu/593554.html
Deerwester, S., Dumais, S. T., et al. (1990) "Indexing by latent semantic analysis". Journal of the American Society for Information Science, 41(6), 391407 http://citeseer.ist.psu.edu/deerwester90indexing.html
Domingos, P. and Pazzani, M. J. (1997) "On the optimality of the simple Bayesian classifier under zeroone loss". Machine Learning, 29(23), 103130 http://citeseer.ist.psu.edu/domingos97optimality.html
Dumais, S. T., Platt, J., Heckerman, D. and Sahami, M. (1998) "Inductive Learning Algorithms and Representations for Text Categorization". In Proceedings of the 7th International Conference on Information and Knowledge Management (ACM Press), pp. 148155
Graham, R. L. (1972) "An efficient algorithm for determining the convex hull of a finite planar set". Information Processing Letters, 1, 132133
Joachims, T. (1998) "Text categorization with support vector machines: Learning with many relevant features". European Conference on Machine Learning (ECML), pp. 123132 http://citeseer.ist.psu.edu/joachims97text.html
Jones, S. and Paynter, G. W. (2002) "Automatic Extraction of Document Keyphrases for Use in Digital Libraries: Evaluation and Applications". Journal of the American Society for Information Science and Technology, 53(8), 653677
Kendall, M. G. (1962) Rank Correlation Methods, 3rd edition (Charles Griffin: London), p. 199
Lewis, D. D. and Hayes, P. J. (1994) ACM Transactions on Information Systems, special issue on text categorization, 12(3), July
O'Rourke, J., Chien, J., Olson, C. and Naddor, T. (1982) "A new linear algorithm for intersecting convex polygons".Computer Graphics and Image Processing, 19 (4), 384391
O'Rourke, J. (1995) Computational Geometry in C (Cambridge University Press: New York), p. 47
Poulos, M., Rangoussi, M., et al. (1999) "Parametric person identification from the EEG using computational geometry". Proceedings of the Sixth International Conference on Electronics, Circuits and Systems (ICECS'99), Pafos, Cyprus, September, pp. 10051012
Poulos, M., Magkos, E., Chrissikopoulos, V. and Alexandris, N. (2003) "Secure Fingerprint Verification Based on Image Segmentation using Computational Geometry Algorithms". In Proceedings of the IASTED International Conference Signal Processing Pattern Recognition & Application, pp. 308312 http://www.ionio.gr/~mpoulos/papers/Fingerprint%20verification.pdf
Salton, G. and McGill, M. (1983) Introduction to Modern Information Retrieval (McGrawHill)
Slonim, N. and Tishby, N. (2001) "The power of word clusters for text classification". In 23rd European Colloquium on Information Retrieval Research (ECIR) http://www.cis.upenn.edu/group/datamining/ReadingGroup/papers/textclassinfobottleneck.pdf
Tokunaga, T. (1999) Computation and Language Volume 5: Information Retrieval and Natural Language Processing (University of Tokyo Press)
Toussaint, G., Bhattacharya, B. and Poulsen, R. (1984) "The application of Voronoi diagrams to non?≠parametric decision rules". In Proceedings of Computer Science and Statistics: 16th Symposium of the Interface
Weston, J., et al. (1997) "Density Estimation Using Support Vector Machines". Technical Report CSDTR9723, Royal Holloway College, University of London http://citeseer.ist.psu.edu/weston98density.html
Yang, Y. (1999) "An evaluation of statistical approaches to text categorization". Journal of Information Retrieval, Vol 1, No. 1/2, 6788 http://citeseer.ist.psu.edu/yang97evaluation.html
Links
Reuters21578 collection http://www.daviddlewis.com/resources/testcollections/reuters21578/
Appendix
This Appendix describes the algorithms used in the text categorization method. Section A1 describes the algorithm for constructing a twodimensional convex polygon, and section A2 outlines an onion algorithm. In our case we considered that the point P=P_{xy} is a converted character, where P_{y} on the yaxis represents the corresponding value of a text character and P_{x}on the xaxis represents the frequency of this character.
A1 Construction of a twodimensional convex polygon
Graham (1972) proposed an algorithm of complexity O(n logn) which computed the convex hull of a given set of n points on the 2D plane. This algorithm proceeds in three phases.
Phase 1. The highest ycoordinate, which is clearly
on the hull, is used as the pivot (Figure A1).
Figure A1. The original points
Phase 2. The points are sorted in order of
increasing angle about the pivot. A horizontal ray P_{0},
emanating from the pivot, is rotated counterclockwise. The first point that the
ray contacts is sorted as the first point (P_{1}). In the same
way, the remaining points are sorted, until the ray returns to its original
horizontal position (Figure A2).
Figure A2. Sorting procedure
Phase 3. The horizontal ray, emanating from the pivot
(P_{0}), begins to rotate counterclockwise and stops at
the first point sorted in phase 2. Then these points, the first and the second
(P_{0}, P_{1}), form the new axis of
rotation. Using point P_{1 } as the pivot, the third point of
the convex polygon is determined as the first point of contact on rotation of
the new axis (Figure A3).
Figure A3. Final phase of the construction of the convex polygon
This procedure is repeated until a phase where the first point of contact of the rotating ray is again P_{0}. Then the hull is closed.
Algebraic explanation of the area of a convex polygon
Suppose the x and y coordinates of a point p_{i}are denoted by (x_{i}, y_{i}) and there are n total points in the plane. If we considered that the algebraic area (or the signed area) of the oriented triangle p_{1}, p_{2}, p_{3} is , then determinant .
The sign of the algebraic area is positive if and only if the orientation of the triangle is counterclockwise. Consider any nvertex simple polygon P whose vertices around its boundary in some orientation (clockwise or counterclockwise) are . Let O denote the origin in the plane of P. Then the algebraic area of the convex polygon P is , where where we assume . Here again, the sign is positive if and only if the given orientation is counterclockwise around the boundary of P. Thus, any simple polygon P with more than three vertices (i.e. n > 3) has at least one diagonal. A diagonal is a line segment that connects two nonadjacent vertices of P and has no intersection with the exterior of P.
A.2 Onion layers of a set of points
Let P be a set of n points in R_{2}, and let (H_{0}, H_{1}, ... ,H_{k}) be the onion of P, that is, H_{0 } is the convex hull of P. Now remove the points on the boundary of H_{0}from P and let H_{1}be the convex hull of what remains in P, and so on until no more points are left in P (see the example in Figure A4).
Figure A4. Onion layers of a set of points
Thus, the construction of the onion is produced by iterative application of the giftwrapping algorithm and shows that it takes a total time of O(d*n log n).