Diora

Deep Inside-Outside Recursive Autoencoder

DIORA

This is the official repo for our NAACL 2019 paper Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Autoencoders (DIORA), which presents a fully-unsupervised method for discovering syntax. If you use this code for research, please cite our paper as follows:

@inproceedings{drozdov2019diora,
  title={Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Autoencoders},
  author={Drozdov, Andrew and Verga, Pat and Yadav, Mohit and Iyyer, Mohit and McCallum, Andrew},
  booktitle={North American Association for Computational Linguistics},
  year={2019},
}

The paper is available on arXiv: https://arxiv.org/abs/1904.02142

For questions/concerns/bugs please contact adrozdov at cs.umass.edu.

Recent Related Work

Follow-up work by us:

Selection of other work with DIORA:

Quick Start

Clone repository.

git clone git@github.com:iesl/diora.git
cd diora

Download the pre-trained model.

wget http://diora-naacl-2019.s3.amazonaws.com/diora-checkpoints.zip
unzip diora-checkpoints.zip

(Optional) Download training data: To reproduce experiments from our NAACL submission, concatenate the data from SNLI and MultiNLI.

cat ./snli_1.0/snli_1.0_train.jsonl ./multinli_1.0/multinli_1.0_train.jsonl > ./data/allnli.jsonl
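
The same concatenation can be done in Python, which makes it easy to skip blank lines and catch malformed records before training. This is a minimal sketch; the helper name concat_jsonl is hypothetical and not part of this repo.

```python
import json


def concat_jsonl(inputs, output):
    """Concatenate JSONL files into one file and return the number of records.

    Blank lines are skipped, and every record is parsed once so that a
    malformed line fails here rather than in the middle of training.
    """
    count = 0
    with open(output, "w", encoding="utf-8") as out:
        for path in inputs:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    json.loads(line)  # raises ValueError on a malformed record
                    out.write(line + "\n")
                    count += 1
    return count
```

To mirror the cat command above, call it with the two train files as inputs and ./data/allnli.jsonl as the output.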

Running DIORA.

# Install dependencies (using conda).
conda create -n diora-latest python=3.8
source activate diora-latest

## PYTORCH for mac
conda install pytorch=1.10.1 torchvision=0.11.2 torchaudio=0.10.1 -c pytorch

## PYTORCH for linux (w/ GPU and CUDA 10.2)
conda install pytorch=1.10.1 torchvision=0.11.2 torchaudio=0.10.1 cudatoolkit=10.2 -c pytorch

pip install nltk
pip install h5py
pip install tqdm

# Example of running various commands.

export PYTHONPATH=$(pwd)/pytorch:$PYTHONPATH

## Add the --cuda flag if you have GPU access.

## Parse some text.
python pytorch/diora/scripts/parse.py \
    --batch_size 10 \
    --data_type txt_id \
    --elmo_cache_dir ./cache \
    --load_model_path ./diora-checkpoints/mlp-softmax/model.pt \
    --model_flags ./diora-checkpoints/mlp-softmax/flags.json \
    --validation_path ./pytorch/sample.txt \
    --validation_filter_length 10

## Extract vectors using latent trees,
python pytorch/diora/scripts/phrase_embed_simple.py --parse_mode latent \
    --batch_size 10 \
    --data_type txt_id \
    --elmo_cache_dir ./cache \
    --load_model_path ./diora-checkpoints/mlp-softmax/model.pt \
    --model_flags ./diora-checkpoints/mlp-softmax/flags.json \
    --validation_path ./pytorch/sample.txt \
    --validation_filter_length 10

## or specify the trees to use.
python pytorch/diora/scripts/phrase_embed_simple.py --parse_mode given \
    --batch_size 10 \
    --data_type jsonl \
    --elmo_cache_dir ./cache \
    --load_model_path ./diora-checkpoints/mlp-softmax/model.pt \
    --model_flags ./diora-checkpoints/mlp-softmax/flags.json \
    --validation_path ./pytorch/sample.jsonl \
    --validation_filter_length 10

## Train from scratch.
python -m torch.distributed.launch --nproc_per_node=4 pytorch/diora/scripts/train.py \
    --arch mlp-shared \
    --batch_size 32 \
    --data_type nli \
    --elmo_cache_dir ./cache \
    --emb elmo \
    --hidden_dim 400 \
    --k_neg 100 \
    --log_every_batch 100 \
    --lr 2e-3 \
    --normalize unit \
    --reconstruct_mode softmax \
    --save_after 1000 \
    --train_filter_length 20 \
    --train_path ./data/allnli.jsonl \
    --max_step 300000 \
    --cuda --multigpu

Evaluation

First parse the data, then run evalb from our helper script.

# Parse the data.
python pytorch/diora/scripts/parse.py \
    --retain_file_order \
    --batch_size 10 \
    --data_type ptb \
    --elmo_cache_dir ./cache \
    --load_model_path ./diora-checkpoints/mlp-softmax/model.pt \
    --model_flags ./diora-checkpoints/mlp-softmax/flags.json \
    --experiment_path ./log/eval-ptb \
    --validation_path ./data/ptb/ptb-test.txt \
    --validation_filter_length -1

# (optional) Build EVALB if you haven't already.
(cd EVALB && make)

# Run evaluation.
python pytorch/diora/scripts/evalb.py \
    --evalb ./EVALB \
    --evalb_config ./EVALB/diora.prm \
    --out ./log/eval-ptb \
    --pred ./log/eval-ptb/parse.jsonl \
    --gold ./data/ptb/ptb-test.txt

Using the mlp-softmax checkpoint to parse the PTB test set should give the following output and results:

$ python pytorch/diora/scripts/evalb.py \
    --evalb ./EVALB \
    --evalb_config ./EVALB/diora.prm \
    --out ./log/eval-ptb \
    --pred ./log/eval-ptb/parse.jsonl \
    --gold ./data/ptb/ptb-test.txt

Running: ./EVALB/evalb -p ./EVALB/diora.prm ./log/eval-ptb/gold.txt ./log/eval-ptb/pred.txt > ./log/eval-ptb/evalb.out

Results are ready at: ./log/eval-ptb/evalb.out

==== PREVIEW OF RESULTS (./log/eval-ptb/evalb.out) ====

-- All --
Number of sentence        =   2416
Number of Error sentence  =      0
Number of Skip  sentence  =      0
Number of Valid sentence  =   2416
Bracketing Recall         =  57.78
Bracketing Precision      =  44.28
Bracketing FMeasure       =  50.14
Complete match            =   0.46
Average crossing          =   5.71
No crossing               =  10.10
2 or less crossing        =  29.26
Tagging accuracy          =   9.76

-- len<=40 --
Number of sentence        =   2338
Number of Error sentence  =      0
Number of Skip  sentence  =      0
Number of Valid sentence  =   2338
Bracketing Recall         =  57.96
Bracketing Precision      =  44.57
Bracketing FMeasure       =  50.39
Complete match            =   0.47
Average crossing          =   5.39
No crossing               =  10.44
2 or less crossing        =  30.24
Tagging accuracy          =   9.79
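
As a sanity check on the preview above, the reported FMeasure is consistent with the harmonic mean of bracketing precision and recall. The sketch below parses "Name = value" lines from such a summary and recomputes it. The function names are hypothetical and not part of the helper script.

```python
def parse_evalb_summary(text):
    """Parse 'Name = value' lines from an EVALB summary into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        try:
            # Collapse runs of spaces in the name, e.g. 'Number of Skip  sentence'.
            metrics[" ".join(name.split())] = float(value)
        except ValueError:
            continue  # not a metric line (e.g. a '====' banner)
    return metrics


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


summary = """\
Bracketing Recall         =  57.78
Bracketing Precision      =  44.28
Bracketing FMeasure       =  50.14
"""
metrics = parse_evalb_summary(summary)
recomputed = f_measure(metrics["Bracketing Precision"], metrics["Bracketing Recall"])
print(round(recomputed, 2), metrics["Bracketing FMeasure"])
```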

Notes:

  • Set --validation_filter_length -1 to read all of the data.

  • Make sure to use --retain_file_order so that predictions line up with the reference file.

  • Set --data_type ptb. The PTB data should have one sentence per line and be in the following format:

(S (NP (DT The) (VBG leading) (NNS indicators)) (VP (VBP have) (VP (VBN prompted) (NP (DT some) (NNS forecasters)))))

  • DIORA will not attempt to parse 1- or 2-word sentences, since there is only 1 possible output.

  • Using the provided configuration, the EVALB evaluation ignores part-of-speech and constituency labels, but does take unary branching into account.

  • Our EVALB helper script automatically strips punctuation.
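
Two of the notes above can be illustrated with a short sketch: pulling the words out of a one-line PTB tree, and counting how many binary trees exist for a sentence of a given length (which is why 1- and 2-word sentences have a single possible output). Both function names are hypothetical. The regular expression is a simplification that treats every (TAG word) pair as a leaf, so it does not filter punctuation or trace tokens the way the repo's reader might.

```python
import re
from math import comb


def ptb_leaves(line):
    """Return the words of a one-line PTB tree, e.g. '(DT The)' -> 'The'."""
    # A leaf is a '(TAG word)' pair with no nested parentheses inside it.
    return re.findall(r"\(\s*[^\s()]+\s+([^\s()]+)\s*\)", line)


def num_binary_trees(n_words):
    """Count binary bracketings over n_words leaves: the Catalan number C(n - 1)."""
    if n_words < 1:
        return 0
    k = n_words - 1
    return comb(2 * k, k) // (k + 1)


example = (
    "(S (NP (DT The) (VBG leading) (NNS indicators)) "
    "(VP (VBP have) (VP (VBN prompted) (NP (DT some) (NNS forecasters)))))"
)
words = ptb_leaves(example)
print(words)
print([num_binary_trees(n) for n in range(1, 6)])
```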

Multi-GPU Training

Using DistributedDataParallel:

export CUDA_VISIBLE_DEVICES=0,1
export NGPUS=2
python -m torch.distributed.launch --nproc_per_node=$NGPUS pytorch/diora/scripts/train.py \
    --cuda \
    --multigpu \
    ... # other args

Useful Command Line Arguments

Data

--data_type Specifies the format of the data. Choices = nli, txt, txt_id, synthetic. Different types can be specified for training and validation using --train_data_type and --validation_data_type. The synthetic type does not require any input file. The example commands in this README also pass ptb and jsonl.

For examples of the expected format, please refer to the following files:

  • nli The standard JSONL format used by SNLI and MultiNLI. Although examples are sentence pairs, the model only uses one sentence at a time.
  • txt A single space-delimited sentence per line.
  • txt_id Same as txt except the first token is an example id.
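
The three formats above can be sketched as small line parsers. These are illustrative helpers with hypothetical names, not the repo's readers. The sentence1 and sentence2 keys are the standard SNLI/MultiNLI field names. Splitting the raw sentences on whitespace is an assumption, and the actual reader may tokenize differently.

```python
import json


def parse_txt(line):
    """txt: one space-delimited sentence per line."""
    return line.split()


def parse_txt_id(line):
    """txt_id: like txt, but the first token is an example id.

    Raises ValueError on an empty line, since there is no id to unpack.
    """
    example_id, *tokens = line.split()
    return example_id, tokens


def parse_nli(line):
    """nli: one JSON object per line; each sentence of the pair is used on its own."""
    record = json.loads(line)
    return [record[key].split() for key in ("sentence1", "sentence2")]
```

For example, parse_txt_id("ex0 the cat sat") returns the id "ex0" together with the three remaining tokens.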

--train_path and --validation_path Specify the paths to the input data for training and validation.

--train_filter_length Only examples shorter than this value will be used for training. To consider all examples, set this to 0. Similarly, use --validation_filter_length for validation.
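
A minimal sketch of the length filter described above. The function name is hypothetical. It follows the wording here (strictly shorter than the threshold) and assumes that any non-positive value disables filtering, which covers both the 0 mentioned here and the -1 used in the evaluation example. The repo's exact comparison may differ.

```python
def filter_by_length(sentences, filter_length):
    """Keep tokenized sentences with fewer than filter_length tokens.

    Assumption: a non-positive filter_length means no filtering at all.
    """
    if filter_length <= 0:
        return list(sentences)
    return [tokens for tokens in sentences if len(tokens) < filter_length]


batch = [["a", "b"], ["a", "b", "c", "d"], ["a"]]
print(filter_by_length(batch, 3))
print(len(filter_by_length(batch, 0)))
```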

--batch_size Specifies the batch size. The batch size specifically for validation can be set using --validation_batch_size, otherwise it will default to --batch_size.

--embeddings_path The path to GloVe-style word embeddings.

--emb Set to w2v for GloVe, elmo for ELMo, and both for a concatenation of the two.

--elmo_options_path and --elmo_weights_path The paths to the options and weights for ELMo.

Optimization and Model Configuration

--lr The learning rate.

--hidden_dim The dimension associated with the TreeLSTM.

--margin The margin value used in the objective for reconstruction.

--k_neg The number of negative examples to sample.

--freq_dist_power The negative examples are chosen according to their frequency within the training corpus. Lower values of --freq_dist_power make this distribution more peaked.

--normalize When set to unit, the values of each cell will have their norm set to 1. Choices = none, unit.
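
Unit normalization rescales a vector so that its L2 norm is 1. The sketch below shows the operation on a plain Python list. It illustrates the idea and is not the repo's implementation. The eps guard against division by zero is an added assumption. In PyTorch, torch.nn.functional.normalize(x, p=2, dim=-1) performs the same rescaling on tensors.

```python
import math


def normalize_unit(vector, eps=1e-12):
    """Scale a vector to unit L2 norm; eps avoids dividing by zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


cell = normalize_unit([3.0, 4.0])
print(cell, math.sqrt(sum(x * x for x in cell)))
```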

--reconstruct_mode Specifies how to reconstruct the correct word. Choices = margin, softmax (the training example above uses softmax).
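
To make the role of --margin, --k_neg, and --reconstruct_mode concrete, here is a generic sketch of two reconstruction objectives computed from one score for the correct word and a list of scores for sampled negatives. These are textbook hinge and softmax cross-entropy formulations under assumed names. The repo's exact loss (for example, summing versus averaging over negatives) may differ.

```python
import math


def margin_loss(pos_score, neg_scores, margin=1.0):
    """Hinge loss: each negative should score at least `margin` below the positive."""
    return sum(max(0.0, margin - pos_score + neg) for neg in neg_scores)


def softmax_loss(pos_score, neg_scores):
    """Cross-entropy of the positive against the positive plus sampled negatives."""
    scores = [pos_score] + list(neg_scores)
    top = max(scores)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - pos_score


print(margin_loss(2.0, [0.5, 1.5]))
print(softmax_loss(0.0, [0.0]))
```

In this sketch the length of neg_scores plays the role of --k_neg.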

Logging

--load_model_path For evaluation, parsing, and fine-tuning you can use this parameter to specify a previous checkpoint to initialize your model.

--experiment_path Specifies a directory where log files and checkpoints will be saved.

--log_every_batch Every N gradient updates a summary will be printed to the log.

--save_latest Every N gradient updates, a checkpoint will be saved called model_periodic.pt.

--save_distinct Every N gradient updates, a checkpoint will be saved called model.step_${N}.pt.

--save_after Checkpoints will only be saved after N gradient updates have been applied.
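
The interaction between the three checkpoint flags can be sketched as a function that lists the files that would be written at a given step. This is an illustration under assumptions, not the repo's logic: the function name is hypothetical, saving is assumed to start once the step count reaches --save_after, and ${N} in the distinct checkpoint name is assumed to be the current step.

```python
def checkpoints_to_write(step, save_latest=None, save_distinct=None, save_after=0):
    """Return the checkpoint file names that would be written at `step`."""
    if step < save_after:
        return []  # assumption: nothing is saved before save_after updates
    names = []
    if save_latest and step % save_latest == 0:
        names.append("model_periodic.pt")  # overwritten on every periodic save
    if save_distinct and step % save_distinct == 0:
        names.append(f"model.step_{step}.pt")  # kept as a separate file per step
    return names


for step in (500, 1100, 1500):
    print(step, checkpoints_to_write(step, save_latest=100, save_distinct=500, save_after=1000))
```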
