JaTeCS (Java Text Categorization System)
JaTeCS is an open source Java library focused on Automatic Text Categorization (ATC). It covers all the steps of an experimental activity, from reading the corpus to the evaluation of the experimental results. JaTeCS focuses on text as the central input, and its code is optimized for this type of data. As with many other machine learning (ML) frameworks, it provides data readers for many formats and well-known corpora, NLP tools, feature selection and weighting methods, the implementation of many ML algorithms, as well as wrappers for well-known external software (e.g., libSVM, SVM_light). JaTeCS also provides implementations of methods related to ATC that are rarely, if ever, provided by other ML frameworks (e.g., active learning, quantification, transfer learning).
The software is released under the terms of the GPL license.
Software installation
To use the latest release of JaTeCS in your Maven projects, add the following to your project's POM:
<pre><code>
<repositories>
    <repository>
        <id>jatecs-mvn-repo</id>
        <url>https://github.com/jatecs/jatecs/raw/mvn-repo/</url>
        <snapshots>
            <enabled>true</enabled>
            <updatePolicy>always</updatePolicy>
        </snapshots>
    </repository>
</repositories>
</code></pre>
then, in the dependencies list, add:
<pre><code>
<dependencies>
    <dependency>
        <groupId>hlt.isti.cnr.it</groupId>
        <artifactId>jatecs-gpl</artifactId>
        <version>1.0.0</version>
    </dependency>
</dependencies>
</code></pre>
How to develop your apps with the software
Data representation through IIndex data structure
In JaTeCS, raw textual data is manipulated through an indexed structure named IIndex. This data structure handles all the relations among documents, features, and categories (which may be organized in a taxonomy). The IIndex can be used to manipulate or query the data.
The following snippet shows a very simple example that prints, for each document, the number of terms appearing more than 5 times:
<pre><code>
for (int docID : index.getDocumentDB().getDocuments()) {
    String documentName = index.getDocumentDB().getDocumentName(docID);
    int frequentTerms = 0;
    for (int featID : index.getContentDB().getDocumentFeatures(docID)) {
        if (index.getContentDB().getDocumentFeatureFrequency(docID, featID) > 5)
            frequentTerms++;
    }
    System.out.println("Document " + documentName + " contains " + frequentTerms + " frequent terms");
}
</code></pre>
A richer example on the use of the IIndex structure can be found in IndexQuery.java.
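As background, the document-feature relation that the IIndex manages can be pictured with plain Java collections. The following self-contained sketch (hypothetical names, no JaTeCS dependency) mimics the query above over a toy frequency map:

```java
import java.util.Map;

public class FrequentTermsDemo {
    // Count features whose frequency exceeds the threshold (mirrors the JaTeCS loop).
    static long countFrequent(Map<String, Integer> featureFreqs, int threshold) {
        return featureFreqs.values().stream().filter(f -> f > threshold).count();
    }

    public static void main(String[] args) {
        // Toy stand-in for IIndex's content DB: document name -> (feature -> frequency).
        Map<String, Map<String, Integer>> contentDB = Map.of(
                "doc1", Map.of("market", 7, "stock", 2, "price", 6),
                "doc2", Map.of("market", 1, "rain", 3));
        contentDB.forEach((doc, freqs) ->
                System.out.println("Document " + doc + " contains "
                        + countFrequent(freqs, 5) + " frequent terms"));
    }
}
```

In the real library the same relation is reached through the content DB of the index rather than nested maps.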
The class TroveMainIndexBuilder.java is meant to create the IIndex. It can be used on its own, or in combination with CorpusReader.java and FullIndexConstructor.java to construct an index from a raw corpus of documents; the dataset directory contains many examples of this latter kind of corpus indexing, including, e.g., the Reuters21578 collection (file IndexReuters21578.java) and the RCV1-v2 collection (file IndexRCV1.java), among many others. JaTeCS provides common feature extractors to derive features from raw textual data, like the BoW (bag-of-words) extractor and the character n-grams extractor. Both extractors are subclasses of the generic class FeatureExtractor.java, which provides additional capabilities like stemming, stopword removal, etc.
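To illustrate what these two kinds of extractors compute, here is a minimal, dependency-free sketch of bag-of-words and character n-gram extraction (the method names are illustrative, not the JaTeCS API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ExtractorsDemo {
    // Bag-of-words: lowercase, tokenize on non-letter runs, drop stopwords.
    static List<String> bagOfWords(String text, Set<String> stopwords) {
        List<String> features = new ArrayList<>();
        for (String tok : text.toLowerCase().split("[^a-z]+")) {
            if (!tok.isEmpty() && !stopwords.contains(tok))
                features.add(tok);
        }
        return features;
    }

    // Character n-grams of a fixed size n, sliding one character at a time.
    static List<String> charNGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++)
            grams.add(text.substring(i, i + n));
        return grams;
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("the", "of");
        System.out.println(bagOfWords("The price of stocks", stop)); // [price, stocks]
        System.out.println(charNGrams("text", 2));                   // [te, ex, xt]
    }
}
```

JaTeCS's FeatureExtractor additionally layers stemming and configurable stopword lists on top of this basic tokenization.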
Preparing data for experiments: feature selection and feature weighting
Once the index has been created, a common step in the experimental pipeline consists of selecting the most informative features (and discarding the rest). JaTeCS provides several implementations of global (see GlobalTSR.java) and local (see LocalTSR.java) Term Selection Reduction (TSR) methods. JaTeCS also provides many implementations of popular TSR functions, including InformationGain, ChiSquare, and GainRatio, among many others. Additionally, GlobalTSR can be configured with different policies, such as sum, average, or max (subclasses of IGlobalTSRPolicy.java). JaTeCS also implements the RoundRobinTSR method, which selects the most important features for each category in a round-robin manner. The following snippet illustrates how round-robin feature selection with information gain is carried out in JaTeCS (see the full example here):
<pre><code>
RoundRobinTSR tsr = new RoundRobinTSR(new InformationGain());
tsr.setNumberOfBestFeatures(5000);
tsr.computeTSR(index);
</code></pre>
The last step in data preparation consists of weighting the features so as to reflect the relative importance of each term in the documents. JaTeCS offers two popular weighting methods, the well-known TfIdf and BM25. Generally, a weighting function (here exemplified by TfIdf) is applied as follows (see the complete example here):
<pre><code>
IWeighting weighting = new TfNormalizedIdf(trainIndex);
IIndex weightedTrainingIndex = weighting.computeWeights(trainIndex);
IIndex weightedTestIndex = weighting.computeWeights(testIndex);
</code></pre>
Building the classifier
Building a classifier typically involves a two-step process: (i) model learning (ILearner), and (ii) document classification (IClassifier). JaTeCS implements several machine learning algorithms, including AdaBoost-MH, MP-Boost, KNN, logistic regression, naive Bayes, and SVM, among many others (located in the source directory classification).
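The learn-then-classify split can be pictured with a toy, dependency-free 1-nearest-neighbour classifier (purely illustrative names, not the JaTeCS ILearner/IClassifier API): "learning" produces a model object, and classification queries it with a new document vector.

```java
import java.util.List;

public class LearnClassifyDemo {
    // "Learning" for 1-NN just stores the training vectors and their labels.
    record Model(List<double[]> vectors, List<String> labels) {}

    static Model learn(List<double[]> vectors, List<String> labels) {
        return new Model(vectors, labels);
    }

    // Classification: return the label of the closest training vector
    // (squared Euclidean distance).
    static String classify(Model model, double[] doc) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < model.vectors().size(); i++) {
            double[] v = model.vectors().get(i);
            double d = 0;
            for (int j = 0; j < v.length; j++)
                d += (v[j] - doc[j]) * (v[j] - doc[j]);
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return model.labels().get(best);
    }

    public static void main(String[] args) {
        Model m = learn(List.of(new double[]{1, 0}, new double[]{0, 1}),
                        List.of("sports", "finance"));
        System.out.println(classify(m, new double[]{0.9, 0.2})); // sports
    }
}
```

In JaTeCS the same pattern appears as learner.build(trainIndex) returning an IClassifier, as shown next with SVM.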
The following code shows how SVMlib could be trained in JaTeCS (check LearnSVMlib.java for the full example, and the source directory classification for examples involving other learning algorithms):
<pre><code>
SvmLearner svmLearner = new SvmLearner();
IClassifier svmClassifier = svmLearner.build(trainIndex);
</code></pre>
Once trained, the model can be used to classify unseen documents. This is carried out in JaTeCS by running a classifier, instantiated with the previously learned model and receiving as argument an index containing the test documents to be classified (a full example is available in ClassifySVMlib.java):
<pre><code>
Classifier classifier = new Classifier(testIndex, svmClassifier);
classifier.exec();
</code></pre>
JaTeCS also supports the evaluation of results by means of the following classes: ClassificationComparer.java (simple flat evaluation) and HierarchicalClassificationComparer.java (evaluation for hierarchical taxonomies of codes); a full example involving both evaluation procedures can be found here. Evaluation is easily performed in just a few lines of code, e.g.:
<pre><code>
ClassificationComparer flatComparer = new ClassificationComparer(
        classifier.getClassificationDB(), testIndex.getClassificationDB());
ContingencyTableSet tableSet = flatComparer.evaluate();
</code></pre>
Applications of Text Classification
JaTeCS includes many ready-to-use applications. These are useful both for users who are mainly interested in quickly running experiments on their own data, and for practitioners who would rather develop their own algorithms and applications; the latter may find in the JaTeCS app implementations a perfect starting point for familiarizing themselves with the framework through examples. In what follows we show some selected examples; many others can be found here.
Text Quantification
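Quantification estimates the prevalence of each class in a document collection rather than labelling individual documents. As background on what the quantification methods in JaTeCS address, here is a dependency-free sketch of the classic classify-and-count and adjusted-count estimators from the quantification literature (this is illustrative code, not the JaTeCS API):

```java
public class QuantificationDemo {
    // Classify-and-count (CC): estimate prevalence as the fraction of
    // documents the classifier labels positive.
    static double classifyAndCount(boolean[] predictions) {
        int pos = 0;
        for (boolean p : predictions) if (p) pos++;
        return (double) pos / predictions.length;
    }

    // Adjusted count (ACC): correct CC using the classifier's true- and
    // false-positive rates estimated on held-out data:
    //   p = (cc - fpr) / (tpr - fpr), clipped to [0, 1].
    static double adjustedCount(double cc, double tpr, double fpr) {
        double p = (cc - fpr) / (tpr - fpr);
        return Math.max(0.0, Math.min(1.0, p));
    }

    public static void main(String[] args) {
        boolean[] preds = {true, true, false, true, false};
        double cc = classifyAndCount(preds); // 3 of 5 predicted positive
        System.out.println(adjustedCount(cc, 0.9, 0.2));
    }
}
```

Because CC inherits the classifier's bias, the ACC correction typically gives better prevalence estimates when the test class distribution differs from the training one.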
