DIORA
This is the official repo for our NAACL 2019 paper Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Autoencoders (DIORA), which presents a fully-unsupervised method for discovering syntax. If you use this code for research, please cite our paper as follows:
@inproceedings{drozdov2019diora,
title={Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Autoencoders},
author={Drozdov, Andrew and Verga, Pat and Yadav, Mohit and Iyyer, Mohit and McCallum, Andrew},
booktitle={North American Association for Computational Linguistics},
year={2019},
}
The paper is available on arXiv: https://arxiv.org/abs/1904.02142
For questions/concerns/bugs please contact adrozdov at cs.umass.edu.
Recent Related Work
Follow up work by us:
- Drozdov et al., 2019. Unsupervised Labeled Parsing with DIORA.
- Drozdov et al., 2020. Unsupervised Parsing with S-DIORA.
- Xu et al., 2021. Improved Latent Tree Induction with Distant Supervision via Span Constraints.
Selection of other work with DIORA:
- Hong et al., 2020. DIORA with All-span Objective.
- Anonymous, 2021. Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling (CLIORA).
Quick Start
Clone repository.
git clone git@github.com:iesl/diora.git
cd diora
Download the pre-trained model.
wget http://diora-naacl-2019.s3.amazonaws.com/diora-checkpoints.zip
unzip diora-checkpoints.zip
(Optional) Download training data: To reproduce experiments from our NAACL submission, concatenate the data from SNLI and MultiNLI.
cat ./snli_1.0/snli_1.0_train.jsonl ./multinli_1.0/multinli_1.0_train.jsonl > ./data/allnli.jsonl
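For reference, each line of the concatenated allnli.jsonl file is a JSON object in the standard SNLI/MultiNLI format. A minimal sketch of pulling sentences out of that format (a hypothetical helper for illustration, not DIORA's actual loader; it assumes the standard sentence1/sentence2 field names):

```python
import json

# Sketch of reading sentences from the NLI .jsonl format (illustrative
# helper, not DIORA's actual loader). Each line is a JSON object;
# "sentence1"/"sentence2" are the standard SNLI/MultiNLI field names,
# and although examples are pairs, DIORA uses one sentence at a time.
def read_sentences(lines):
    sentences = []
    for line in lines:
        example = json.loads(line)
        sentences.append(example["sentence1"].split())
        sentences.append(example["sentence2"].split())
    return sentences

sample = ['{"sentence1": "A man rides a horse .", '
          '"sentence2": "A person is outdoors .", '
          '"gold_label": "entailment"}']
print(read_sentences(sample))
```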
Running DIORA.
# Install dependencies (using conda).
conda create -n diora-latest python=3.8
source activate diora-latest
## PyTorch for macOS
conda install pytorch=1.10.1 torchvision=0.11.2 torchaudio=0.10.1 -c pytorch
## PyTorch for Linux (with GPU and CUDA 10.2)
conda install pytorch=1.10.1 torchvision=0.11.2 torchaudio=0.10.1 cudatoolkit=10.2 -c pytorch
pip install nltk
pip install h5py
pip install tqdm
# Example of running various commands.
export PYTHONPATH=$(pwd)/pytorch:$PYTHONPATH
## Add the --cuda flag if you have GPU access.
## Parse some text.
python pytorch/diora/scripts/parse.py \
--batch_size 10 \
--data_type txt_id \
--elmo_cache_dir ./cache \
--load_model_path ./diora-checkpoints/mlp-softmax/model.pt \
--model_flags ./diora-checkpoints/mlp-softmax/flags.json \
--validation_path ./pytorch/sample.txt \
--validation_filter_length 10
## Extract vectors using latent trees,
python pytorch/diora/scripts/phrase_embed_simple.py --parse_mode latent \
--batch_size 10 \
--data_type txt_id \
--elmo_cache_dir ./cache \
--load_model_path ./diora-checkpoints/mlp-softmax/model.pt \
--model_flags ./diora-checkpoints/mlp-softmax/flags.json \
--validation_path ./pytorch/sample.txt \
--validation_filter_length 10
## or specify the trees to use.
python pytorch/diora/scripts/phrase_embed_simple.py --parse_mode given \
--batch_size 10 \
--data_type jsonl \
--elmo_cache_dir ./cache \
--load_model_path ./diora-checkpoints/mlp-softmax/model.pt \
--model_flags ./diora-checkpoints/mlp-softmax/flags.json \
--validation_path ./pytorch/sample.jsonl \
--validation_filter_length 10
## Train from scratch.
python -m torch.distributed.launch --nproc_per_node=4 pytorch/diora/scripts/train.py \
--arch mlp-shared \
--batch_size 32 \
--data_type nli \
--elmo_cache_dir ./cache \
--emb elmo \
--hidden_dim 400 \
--k_neg 100 \
--log_every_batch 100 \
--lr 2e-3 \
--normalize unit \
--reconstruct_mode softmax \
--save_after 1000 \
--train_filter_length 20 \
--train_path ./data/allnli.jsonl \
--max_step 300000 \
--cuda --multigpu
Evaluation
First parse the data, then run evalb from our helper script.
# Parse the data.
python pytorch/diora/scripts/parse.py \
--retain_file_order \
--batch_size 10 \
--data_type ptb \
--elmo_cache_dir ./cache \
--load_model_path ./diora-checkpoints/mlp-softmax/model.pt \
--model_flags ./diora-checkpoints/mlp-softmax/flags.json \
--experiment_path ./log/eval-ptb \
--validation_path ./data/ptb/ptb-test.txt \
--validation_filter_length -1
# (optional) Build EVALB if you haven't already.
(cd EVALB && make)
# Run evaluation.
python pytorch/diora/scripts/evalb.py \
--evalb ./EVALB \
--evalb_config ./EVALB/diora.prm \
--out ./log/eval-ptb \
--pred ./log/eval-ptb/parse.jsonl \
--gold ./data/ptb/ptb-test.txt
Using the mlp-softmax checkpoint to parse the PTB test set should give the following output and results:
$ python pytorch/diora/scripts/evalb.py \
--evalb ./EVALB \
--evalb_config ./EVALB/diora.prm \
--out ./log/eval-ptb \
--pred ./log/eval-ptb/parse.jsonl \
--gold ./data/ptb/ptb-test.txt
Running: ./EVALB/evalb -p ./EVALB/diora.prm ./log/eval-ptb/gold.txt ./log/eval-ptb/pred.txt > ./log/eval-ptb/evalb.out
Results are ready at: ./log/eval-ptb/evalb.out
==== PREVIEW OF RESULTS (./log/eval-ptb/evalb.out) ====
-- All --
Number of sentence = 2416
Number of Error sentence = 0
Number of Skip sentence = 0
Number of Valid sentence = 2416
Bracketing Recall = 57.78
Bracketing Precision = 44.28
Bracketing FMeasure = 50.14
Complete match = 0.46
Average crossing = 5.71
No crossing = 10.10
2 or less crossing = 29.26
Tagging accuracy = 9.76
-- len<=40 --
Number of sentence = 2338
Number of Error sentence = 0
Number of Skip sentence = 0
Number of Valid sentence = 2338
Bracketing Recall = 57.96
Bracketing Precision = 44.57
Bracketing FMeasure = 50.39
Complete match = 0.47
Average crossing = 5.39
No crossing = 10.44
2 or less crossing = 30.24
Tagging accuracy = 9.79
Notes:
- Set --validation_filter_length -1 to read all of the data.
- Make sure to use --retain_file_order so that predictions line up with the reference file.
- Set --data_type ptb. The PTB data should have one sentence per line, in the following format:
  (S (NP (DT The) (VBG leading) (NNS indicators)) (VP (VBP have) (VP (VBN prompted) (NP (DT some) (NNS forecasters)))))
- DIORA will not attempt to parse 1- or 2-word sentences, since there is only one possible output.
- Using the provided configuration, the EVALB evaluation ignores part-of-speech and constituency labels, but does take unary branching into account.
- Our EVALB helper script automatically strips punctuation.
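As a quick sanity check of the bracketed format above, the terminal tokens can be pulled out of a PTB-style line with a small stdlib sketch (this is just an illustration, not DIORA's reader):

```python
import re

# Extract the terminal tokens from a PTB-style bracketed line.
# Terminals appear as "(TAG token)"; capture the token.
def ptb_leaves(line):
    return re.findall(r"\(\S+ ([^()\s]+)\)", line)

line = ("(S (NP (DT The) (VBG leading) (NNS indicators)) "
        "(VP (VBP have) (VP (VBN prompted) (NP (DT some) (NNS forecasters)))))")
print(ptb_leaves(line))
```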
Multi-GPU Training
Using DistributedDataParallel:
export CUDA_VISIBLE_DEVICES=0,1
export NGPUS=2
python -m torch.distributed.launch --nproc_per_node=$NGPUS pytorch/diora/scripts/train.py \
--cuda \
--multigpu \
... # other args
Useful Command Line Arguments
Data
--data_type Specifies the format of the data. Choices = nli, txt, txt_id, synthetic (the commands above also demonstrate jsonl and ptb). You can specify different types for training and validation using --train_data_type and --validation_data_type. The synthetic type does not require any input file.
For examples of the expected format, please refer to the following files:
- nli: The standard JSONL format used by SNLI and MultiNLI. Although examples are sentence pairs, the model only uses one sentence at a time.
- txt: A single space-delimited sentence per line.
- txt_id: Same as txt, except the first token is an example id.
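A small illustration of the txt and txt_id conventions (the sentences and ids here are made up, not from the DIORA data):

```python
# txt: one space-delimited sentence per line.
txt_lines = [
    "the leading indicators have prompted some forecasters",
    "she reads a book",
]

# txt_id: identical, except the first token of each line is an example id.
txt_id_lines = ["ex{} {}".format(i, s) for i, s in enumerate(txt_lines)]

# Splitting an id back off a txt_id line:
example_id, sentence = txt_id_lines[0].split(" ", 1)
print(example_id, sentence)
```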
--train_path and --validation_path Specify the paths to the input data for training and validation.
--train_filter_length Only examples shorter than this value will be used for training. To consider all examples, set this to 0. Similarly, you can use --validation_filter_length for validation.
--batch_size Specifies the batch size. The batch size specifically for validation can be set using --validation_batch_size, otherwise it will default to --batch_size.
--embeddings_path The path to GloVe-style word embeddings.
--emb Set to w2v for GloVe, elmo for ELMo, and both for a concatenation of the two.
--elmo_options_path and --elmo_weights_path The paths to the options and weights for ELMo.
Optimization and Model Configuration
--lr The learning rate.
--hidden_dim The dimension associated with the TreeLSTM.
--margin The margin value used in the objective for reconstruction.
--k_neg The number of negative examples to sample.
--freq_dist_power The negative examples are chosen according to their frequency within the training corpus. Lower values of --freq_dist_power make this distribution more peaked.
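This sampling scheme can be sketched as follows. The sketch assumes the common parameterization where a word's sampling weight is its corpus count raised to a power; DIORA's exact use of --freq_dist_power may differ:

```python
import random

# Sketch of frequency-based negative sampling (assumed form:
# weight = count ** power, then normalize to a distribution).
def negative_sampling_probs(counts, power):
    weights = [count ** power for count in counts]
    total = sum(weights)
    return [w / total for w in weights]

counts = [100, 10, 1]  # corpus frequencies for three words
probs = negative_sampling_probs(counts, power=0.75)
# Draw 5 negative examples according to the distribution.
negatives = random.choices(range(len(counts)), weights=probs, k=5)
print(probs, negatives)
```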
--normalize When set to unit, the values of each cell will have their norm set to 1. Choices = none, unit.
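A sketch of what unit normalization does to a single cell vector (plain Python for illustration; DIORA applies this to batched tensors):

```python
import math

# Scale a vector so its Euclidean norm is 1.
def unit_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

print(unit_normalize([3.0, 4.0]))  # -> [0.6, 0.8]
```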
--reconstruct_mode Specifies how to reconstruct the correct word. Choices = margin, softmax.
Logging
--load_model_path For evaluation, parsing, and fine-tuning you can use this parameter to specify a previous checkpoint to initialize your model.
--experiment_path Specifies a directory where log files and checkpoints will be saved.
--log_every_batch Every N gradient updates a summary will be printed to the log.
--save_latest Every N gradient updates, a checkpoint will be saved called model_periodic.pt.
--save_distinct Every N gradient updates, a checkpoint will be saved called model.step_${N}.pt.
--save_after Checkpoints will only be saved after N gradient updates have been applied.
