Donate a beverage to help me to keep CRFSharp up to date :)

CRFSharp

CRFSharp is Conditional Random Fields (CRF) implemented by .Net Core(C#), a machine learning algorithm for learning from labeled sequences of examples.

Overview

CRFSharp is Conditional Random Fields implemented by .Net Core(C#), a machine learning algorithm for learning from labeled sequences of examples. It is widely used in Natural Language Process (NLP) tasks, for example: word breaker, postaging, named entity recognized and so on.

CRFSharp (aka CRF#) is based on .NET Core, so it can run on Windows/Linux and other platforms .Net core is supporting. Its main algorithm is similar as CRF++ written by Taku Kudo. It encodes model parameters by L-BFGS. Moreover, it also has many significant improvements than CRF++, such as totally parallel encoding, optimizing memory usage and so on.

Currently, when training corpus, compared with CRF++, CRFSharp can make full use of multi-core CPUs and use memory effectively, especially for very huge size training corpus and tags. So in the same environment, CRFSharp is able to encode much more complex models with less cost than CRF++.

The following screenshot is an example that CRFSharp is running on a machine with 16 cores CPUs and 96GB memory. The training corpus has 1.24 million records with nearly 1.2 billion features. From the screenshot, all CPU cores are full used and memory usage is stable. The average encoding time per iteration is 3 minutes and 33 seconds.

Besides command line tool, CRFSharp has also provided APIs and these APIs can be used into other projects and services for key techincal tasks. For example: WordSegment project has used CRFSharp to recognize named entity; Query Term Analyzer project has used it to analyze query term important level in word formation and Geography Coder project has used it to detect geo-entity from text. For detailed information about APIs, please see section [Use CRFSharp API in your project] in below.

To use CRFSharp, we need to prepare corpus and design feature templates at first. CRFSharp's file formats are compatible with CRF++(official website:https://taku910.github.io/crfpp/). The following paragraphs will introduce data formats and how to use CRFSharp in both command line and APIs

Training file format

Training corpus contains many records to describe what the model should be. For each record, it is split into one or many tokens and each token has one or many dimension features to describe itself.

In training file, each record can be represented as a matrix and ends with an empty line. In the matrix, each row describes one token and its features, and each column represents a feature in one dimension. In entire training corpus, the number of column must be fixed.

When CRFSharp encodes, if the column size is N, according template file describes, the first N-1 columns will usually be used as input data to generate binary feature set and train model. The Nth column (aka last column) is the answer that the model should output. The means, for one record, if we have an ideal encoded model, given all tokens’ the first N-1 columns, the model should output each token’s Nth column data as the entire record’s answer.

There is an example (a bigger training example file is at download section, you can see and download it there):

Word | Pos | Tag -----------|------|---- ! | PUN | S Tokyo | NNP | S_LOCATION and | CC | S New | NNP | B_LOCATION York | NNP | E_LOCATION are | VBP | S major | JJ | S financial | JJ | S centers | NNS | S . | PUN | S | |
! | PUN | S p | FW | S ' | PUN | S y | NN | S h | FW | S 44 | CD | S University | NNP | B_ORGANIZATION of | IN | M_ORGANIZATION Texas | NNP | M_ORGANIZATION Austin | NNP | E_ORGANIZATION

The example is for labeling named entities in records. It has two records and each token has three columns. The first column is the term of a token, the second column is the token’s pos-tag result and the third column is to describe whether the token is a named entity or a part of named entity and its type. The first and the second columns are input data for encoding model, and the third column is the model ideal output as answer.

In above example, we designed output answer as "POS_TYPE". POS means the position of the term in the chunk or named entity, TYPE means the output type of the term.

For POS, it supports four types as follows: S: the chunk has only one term B: the begin term of the chunk M: one of the middle term in the chunk E: the end term of the chunk

For TYPE, the example contains many types as follows: ORGANIZATION : the name of one organization LOCATION : the name of one location For output answer without TYPE, it's just a normal term, not a named entity.

Test file format

Test file has the similar format as training file. The only different between training and test file is the last column. In test file, all columns are features for CRF model.

CRFSharp command line tools

CRFSharpConsole.exe is a command line tool to encode and decode CRF model. By default, the help information showed as follows:
Linear-chain CRF encoder & decoder by Zhongkai Fu (fuzhongkai@gmail.com)
CRFSharpConsole.exe [parameter list...]
-encode [parameter list...] - Encode CRF model from given training corpus -decode [parameter list...] - Decode CRF model to label text -shrink [parameter list...] - Shrink encoded CRF model size

As the above information shows, the tool provides two run modes. Encode mode is for training model, and decode mode is for testing model. The following paragraphs introduces how to use these two modes.

Encode model

This mode is used to train CRF model from training corpus. Besides -encode parameter, the command line parameters as follows:
CRFSharpConsole.exe -encode [parameters list]
-template <filename>: template file name
-trainfile <filename>: training corpus file name
-modelfile <filename>: encoded model file name
-maxiter <integer number>: maximum iteration, when encoding iteration reaches this value, the process will be ended. Default value is 1000
-minfeafreq <integer number>: minimum feature frequency, if one feature's frequency is less than this value, the feature will be dropped. Default value is 2
-mindiff <float-point number>: minimum diff value, when diff less than the value consecutive 3 times, the process will be ended. Default value is 0.0001
-thread <integer number>: threads used to train model. Default value is 1
-slotrate <float-point value>: the maximum slot usage rate threshold when building feature set. it is ranged in (0.0, 1.0). the higher value means longer time to build feature set, but smaller feature set size. Default value is 0.95
-hugelexmem <integer>: build lexical dictionary in huge mode and shrink starts when used memory reaches this value. This mode can build more lexical items, but slowly. Value ranges [1,100] and default is disabled.
-regtype <type string>: regularization type. L1 and L2 regularization are supported. Default is L2
-retrainmodel <string>: the existing model for re-training.
-debug: encode model as debug mode

Note: either -maxiter reaches setting value or -mindiff reaches setting value in consecutive three times, the training process will be finished and saved encoded model.

Note: -hugelexmem is only used for special task, and it is not recommended for common task, since it costs lots of time for memory shrink in order to load more lexical features into memory

A command line example as follows: CRFSharpConsole.exe -encode -template template.1 -trainfile ner.train -modelfile ner.model -maxiter 100 -minfeafreq 1 -mindiff 0.0001 -thread 4 –debug

The entire encoding process contains four main steps as follows:

Load train corpus from file, generate and select feature set according templates.
Build selected feature set index data as double array trie-tree format, and save them into file.
Run encoding process iteratively to tune feature values until reach end condition.
Save encoded feature values into file.
In step 3, after run each iteration, some detailed encoding information will be show. For example:
M_RANK_1 [FR=47658, TE=54.84%]
M_RANK_2:27.07% M_RANK_0:26.65% E_RANK_0:0.31% B_RANK_0:0.21% E_RANK_1:0.19%
iter=65 terr=0.320290 serr=0.717372 diff=0.0559666295793355 fsize=73762836(1.10% act) Time span: 00:31:56.4866295, Aver. time span per iter: 00:00:29
The encoding information has two parts. The first part is information about each tag, the second part is information in overview.For each tag, it has two lines information. The first line shows the number of this tag in total (FR) and current token error rate (TE) about this tag. The second line shows this tag's token error distribution. In above example, in No.65 iteration, M_RANK_1 tag's token error rate is 54.84% in total. In these token error, 27.07% is M_RANK_2, 26.65% is M_RANK_0 and so on.For second part (information in overview), some global information is showed.
iter : the number of iteration processed
terr : tag's token error rate in all
serr : record's error rate in all
diff : different between current and previous iteration
fsize( x% act) : the number of feature set in total, x% act means the number of non-zero value features. In L1 regularization, with the increasement of iter, x% is reduced. In L2 regularization, x% is always 100%.
Time span : how long the encoding process has been taken
**Aver. time

CRFSharp

Install / Use

README

CRFSharp

Overview

Training file format

Test file format

CRFSharp command line tools

Encode model