
JNN (Java Neural Network Toolkit) - 2015-09-10

Original writer: Wang Ling

This package contains a Java Neural Network Toolkit with implementations of:
- A word representation model that can generate vectors from a word lookup table, a set of features, and/or the C2W model (words are represented by their sequence of characters)
- An LSTM-based language model
- An LSTM-based part-of-speech tagger model

The system requires Java 1.8+ and approximately 8-16 GB of memory, depending on the size and complexity of the network.

0.1 Quick Start

Examples for training a Part-of-speech tagger and language models can be found in scripts/run_pos.sh and scripts/run_lm.sh, respectively. These can be run with the following commands:

sh scripts/run_pos.sh
sh scripts/run_lm.sh

These scripts download the currently available data for both tasks and serve as examples of how to run the code. The POS tagger is trained on the ARK Twitter POS dataset available at https://code.google.com/p/ark-tweet-nlp/downloads/list. The language models are trained on subsets of Wikipedia, which we make available at https://www.l2f.inesc-id.pt/~wlin/wiki.gz.

1.1 Language Modeling

Sample datasets can be downloaded by running: sh scripts/download_wikidata.sh

The LSTM-based language model can be trained by calling:

java -Xmx10g -cp jnn.jar:libs/* jnn.functions.nlp.app.lm.LSTMLanguageModel -batch_size 10 -iterations 1000000 -lr 0.1 -output_dir sample_lm_model -softmax_function word-5000 -test_file wiki/wiki.test.en -threads 8 -train_file wiki/wiki.train.en -validation_file wiki/wiki.dev.en -validation_interval 10000 -word_dim 50 -char_dim 50 -char_state_dim 150 -lm_state_dim 150 -word_features characters -nd4j_resource_dir nd4j_resources -update momentum

This command will train a neural language model using the training file wiki/wiki.train.en, validating on wiki/wiki.dev.en, and testing on wiki/wiki.test.en.

Arguments are described below:

batch_size - number of sentences (lines) processed in each mini-batch
iterations - number of iterations the model is trained for (each iteration processes one mini-batch)
lr - learning rate
output_dir - directory where the model, the training statistics (perplexities), and the scores for the test data are written
word_features - type of word representation used (options described in 3)
softmax_function - type of softmax unit used for predicting words (options described in 1.2)
train_file - training text file
validation_file - validation text file
test_file - test text file
validation_interval - number of mini-batches to run before computing perplexities on the validation set
word_dim - word vector dimension (in a lookup table this generates a vocab*word_dim table, while in the C2W model the character LSTM states are projected into a vector of size word_dim)
char_dim - character vector dimension (always uses a lookup table)
char_state_dim - LSTM state and cell dimensions used to build character-based word representations
lm_state_dim - LSTM state and cell dimensions for the language model
nd4j_resource_dir - ND4J configuration directory (simply point to nd4j_resources)
threads - number of threads to use (sentences in each mini-batch are divided among threads)
update - SGD method (regular, momentum or adagrad)

The following files will be created in the directory specified by -output_dir:

model.gz - The model is stored in this file every time the validation perplexity improves over the previous best value. If this file exists when the command is called, the model is loaded and training is carried out from that point. This way, if something goes wrong during training (e.g. the server crashes), training resumes at the last saved point.
model.tmp.gz - A backup copy of the model.gz file, kept so that the model is not lost if the script fails while model.gz is being written. Thus, if model.gz is incomplete, simply copy model.tmp.gz over it.
rep.gz - The word representation model; it can be used to reuse the word representations trained on this task as initialization for other tasks.
test.scores.gz - Once the model finishes training, the file specified by -test_file is scored, and sentence-level perplexities are computed and stored in this file. (To simply run an existing model on the test set, make sure model.gz exists and set -iterations to 0.)
stats - Reports statistics during training. In this task, perplexities on the development set are reported.
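The model.gz / model.tmp.gz interplay described above can be sketched as follows. This is a minimal illustration of the checkpointing scheme, not JNN's actual code: the model is first written to model.tmp.gz and only then copied over model.gz, so a crash mid-write never leaves model.gz as the only (truncated) copy.

```java
import java.io.*;
import java.nio.file.*;

// Hedged sketch of the checkpoint/resume scheme described in the README;
// the class and method names here are illustrative, not JNN's API.
public class Checkpoint {
    // Write the serialized model to model.tmp.gz first, then copy it over
    // model.gz, keeping model.tmp.gz around as a backup.
    public static void save(Path dir, byte[] serializedModel) {
        try {
            Path tmp = dir.resolve("model.tmp.gz");
            Path fin = dir.resolve("model.gz");
            Files.write(tmp, serializedModel);  // a crash here leaves model.gz intact
            Files.copy(tmp, fin, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the saved model if model.gz exists, or null to signal that
    // training should start from scratch.
    public static byte[] loadIfPresent(Path dir) {
        try {
            Path fin = dir.resolve("model.gz");
            return Files.exists(fin) ? Files.readAllBytes(fin) : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```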

1.2 Softmax functions

The most straightforward way is to predict each word as a softmax over the whole training vocabulary (set -softmax_function to word). However, the normalization over the whole vocabulary is expensive. One way around this problem is to prune the vocabulary by replacing less frequent words with an unknown token. This can be done by setting -softmax_function to word-*, where * is the number of words to keep. Thus, word-5000 performs a softmax over the top 5000 words and replaces the remaining words with an unknown token.
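The pruning that word-K implies can be sketched as follows. This is an illustrative reimplementation, not JNN's code: the K most frequent training words each get an id, and every other word maps to a shared unknown token, so the output softmax only ranges over K+1 classes.

```java
import java.util.*;
import java.util.stream.*;

// Hedged sketch of top-K vocabulary pruning, as selected by
// -softmax_function word-K. Class and method names are illustrative.
public class TopKVocab {
    public static final String UNK = "<unk>";  // shared unknown token, id 0

    // Build a word -> id map over the k most frequent tokens.
    public static Map<String, Integer> build(List<String> tokens, int k) {
        Map<String, Long> counts = tokens.stream()
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        Map<String, Integer> vocab = new HashMap<>();
        vocab.put(UNK, 0);
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(k)
                .forEach(e -> vocab.put(e.getKey(), vocab.size()));
        return vocab;
    }

    // Out-of-vocabulary words fall back to the unknown token's id.
    public static int lookup(Map<String, Integer> vocab, String word) {
        return vocab.getOrDefault(word, vocab.get(UNK));
    }
}
```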

It is also possible to use Noise Contrastive Estimation by setting the -softmax_function parameter to word-nce. This allows parameters to be estimated for the whole vocabulary while avoiding the normalization over the whole vocabulary at training time.
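In one common NCE formulation, each observed word is contrasted against k noise words drawn from a noise distribution q, and the model learns a binary classifier that separates data from noise using only unnormalized scores. The sketch below illustrates that per-example objective; it is not JNN's implementation, and the names are hypothetical.

```java
// Hedged sketch of a per-example NCE loss with k noise samples.
// dataScore is the unnormalized model score s(w, h) of the observed word;
// each logKq value is log(k * q(w)) for the corresponding word under the
// noise distribution q. Minimizing this loss pushes the data word's score
// up and the noise words' scores down, with no softmax over the vocabulary.
public class NceLoss {
    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    public static double loss(double dataScore, double dataLogKq,
                              double[] noiseScores, double[] noiseLogKq) {
        // observed word should be classified as "data"
        double l = -Math.log(sigmoid(dataScore - dataLogKq));
        // each sampled word should be classified as "noise"
        for (int i = 0; i < noiseScores.length; i++)
            l -= Math.log(sigmoid(-(noiseScores[i] - noiseLogKq[i])));
        return l;
    }
}
```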

2.1 Part-of-Speech Tagging

Sample datasets can be downloaded by running: sh scripts/download_posdata.sh

The LSTM-based Part-of-Speech Tagger can be trained by calling:

java -Xmx10g -cp jnn.jar:libs/* jnn.functions.nlp.app.pos.PosTagger -lr 0.3 -batch_size 100 -validation_interval 10 -threads 8 -train_file twpos-data-v0.3/oct27.splits/oct27.train -validation_file twpos-data-v0.3/oct27.splits/oct27.dev -test_file twpos-data-v0.3/oct27.splits/oct27.test -input_format conll-0-1 -word_features characters -context_model blstm -iterations 1000 -output_dir /models/pos_model -sequence_activation 2 -word_dim 50 -char_dim 50 -char_state_dim 150 -context_state_dim 150 -update momentum -nd4j_resource_dir nd4j_resources/

This command will train a POS tagger using the training file twpos-data-v0.3/oct27.splits/oct27.train, validating on twpos-data-v0.3/oct27.splits/oct27.dev, and testing on twpos-data-v0.3/oct27.splits/oct27.test.

Arguments are described below:

batch_size - number of sentences (lines) processed in each mini-batch
iterations - number of iterations the model is trained for (each iteration processes one mini-batch)
lr - learning rate
output_dir - directory where the model, the training statistics (accuracies), and the scores for the test data are written
word_features - type of word representation used (options described in 3)
train_file - training file
validation_file - validation file
test_file - test file
input_format - file format (options described in 2.2)
context_model - model that encodes contextual information (options described in 2.3)
word_dim - word vector dimension (in a lookup table this generates a vocab*word_dim table, while in the C2W model the character LSTM states are projected into a vector of size word_dim)
char_dim - character vector dimension (always uses a lookup table)
char_state_dim - LSTM state and cell dimensions used to build character-based word representations
context_state_dim - LSTM state and cell dimensions for the context model
sequence_activation - activation function applied to the word vector after the composition (0 = none, 1 = logistic, 2 = tanh)
nd4j_resource_dir - ND4J configuration directory (simply point to nd4j_resources)
threads - number of threads to use (sentences in each mini-batch are divided among threads)
update - SGD method (regular, momentum or adagrad)
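The three -sequence_activation options can be illustrated as an elementwise transform of the composed word vector. This is a sketch of the stated options (0 = none, 1 = logistic, 2 = tanh), not JNN's code:

```java
// Hedged sketch of the -sequence_activation options, applied elementwise
// to a composed word vector. Names are illustrative.
public class SequenceActivation {
    public static double[] apply(double[] v, int mode) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            switch (mode) {
                case 0: out[i] = v[i]; break;                          // none
                case 1: out[i] = 1.0 / (1.0 + Math.exp(-v[i])); break; // logistic
                case 2: out[i] = Math.tanh(v[i]); break;               // tanh
                default: throw new IllegalArgumentException("mode must be 0, 1 or 2");
            }
        }
        return out;
    }
}
```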

The following files will be created in the directory specified by -output_dir:

model.gz - The model is stored in this file every time the validation accuracy exceeds the previous best value. If this file exists when the command is called, the model is loaded and training is carried out from that point. This way, if something goes wrong during training (e.g. the server crashes), training resumes at the last saved point.
model.tmp.gz - A backup copy of the model.gz file, kept so that the model is not lost if the script fails while model.gz is being written. Thus, if model.gz is incomplete, simply copy model.tmp.gz over it.
rep.gz - The word representation model; it can be used to reuse the word representations trained on this task as initialization for other tasks.
validation.output - The automatically tagged validation set; statistics during training (tagging accuracies on the validation set) are also reported here.
test.output - The automatically tagged test set; statistics during training (tagging accuracies on the test set) are also reported here.
validation.correct - Lists correctly labelled words in the validation set.
test.correct - Lists correctly labelled words in the test set.
validation.incorrect - Lists incorrectly labelled words in the validation set.
test.incorrect - Lists incorrectly labelled words in the test set.

2.2 File Formats

We allow three different formats. 1 - The CoNLL column format is displayed as follows:

1 In _ IN IN _ 43 ADV _ _
2 an _ DT DT _ 5 NMOD _ _
3 Oct. _ NN NNP _ 5 TMP _ _
4 19 _ CD CD _ 3 NMOD _ _
5 review _ NN NN _ 1 PMOD _ _
6 of _ IN IN _ 5 NMOD _ _
7 _ `` _ 9 P _ _
8 The _ DT DT _ 9 NMOD _ _
9 Misant
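Reading this kind of column data can be sketched as follows. This is an illustrative parser, not JNN's: each non-blank line is one token with whitespace-separated columns, a blank line ends a sentence, and two configurable column indices select the word and the tag. The -input_format flag conll-0-1 presumably encodes such a pair of column indices, but that mapping is an assumption here.

```java
import java.util.*;

// Hedged sketch of a CoNLL-style column reader; names are illustrative.
public class ConllReader {
    // Returns a list of sentences; each token is a {word, tag} pair taken
    // from the given column indices.
    public static List<List<String[]>> read(List<String> lines, int wordCol, int tagCol) {
        List<List<String[]>> sentences = new ArrayList<>();
        List<String[]> cur = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {            // blank line = sentence boundary
                if (!cur.isEmpty()) { sentences.add(cur); cur = new ArrayList<>(); }
                continue;
            }
            String[] cols = line.trim().split("\\s+");
            cur.add(new String[]{cols[wordCol], cols[tagCol]});
        }
        if (!cur.isEmpty()) sentences.add(cur);     // final sentence without trailing blank
        return sentences;
    }
}
```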
