NLProofS

EMNLP 2022: Generating Natural Language Proofs with Verifier-Guided Search https://arxiv.org/abs/2205.12443

Generating Natural Language Proofs with Verifier-Guided Search

Code for the paper:

Generating Natural Language Proofs with Verifier-Guided Search
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Kaiyu Yang, Jia Deng, and Danqi Chen

Requirements

  1. Download and install Miniconda Python 3 (Anaconda should also work).
  2. Clone this repo and cd into its root.
  3. Install Python dependencies: conda env create -f nlproofs.yaml. You may need to edit nlproofs.yaml for your system, e.g., to use a different CUDA version. If the installation command fails, you may also manually install the packages listed in nlproofs.yaml in whatever way works for you.
  4. Activate the conda environment: conda activate nlproofs, and prepend the root of this repo to the PYTHONPATH environment variable.
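The PYTHONPATH requirement in step 4 can be sanity-checked from inside Python; a minimal check, assuming it is run from the repo root:

```python
import os
import sys

# The repo root should be importable after `export PYTHONPATH="$PWD:$PYTHONPATH"`.
repo_root = os.path.abspath(".")
if repo_root not in (os.path.abspath(p) for p in sys.path):
    # Fallback for an interactive session where PYTHONPATH was not set.
    sys.path.insert(0, repo_root)
assert any(os.path.abspath(p) == repo_root for p in sys.path)
print("repo root is on sys.path")
```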

Data Preprocessing

  1. Download the v3_May6_2022 version of EntailmentBank (MD5: 9cb91896325157cee1f35616be0be179) and unzip it as ./data/entailment_trees_emnlp2021_data_v3/.
  2. Download the OWA version of RuleTaker (MD5: bf490364bca241bb5ff9f0ab0c78b71a) and unzip it as ./data/proofwriter-dataset-V2020.12.3/.
  3. Run python check_data.py to validate the downloaded data.
  4. Run python preprocess_ruletaker.py to preprocess the RuleTaker dataset.
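The MD5 checksums above can be verified before unzipping; a minimal sketch, assuming the archives were saved under ./data/ with the hypothetical filenames below (adjust them to whatever your downloads are called):

```python
import hashlib
import os

def md5sum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical archive names -- not prescribed by the repo.
EXPECTED = {
    "data/entailment_trees_emnlp2021_data_v3.zip": "9cb91896325157cee1f35616be0be179",
    "data/proofwriter-dataset-V2020.12.3.zip": "bf490364bca241bb5ff9f0ab0c78b71a",
}

for path, expected in EXPECTED.items():
    if os.path.exists(path):
        actual = md5sum(path)
        status = "OK" if actual == expected else f"MISMATCH ({actual})"
        print(f"{path}: {status}")
```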

EntailmentBank Experiments

We use Lightning CLI to create scripts for training, validation, and testing: prover/main.py and verifier/main.py for the prover and the verifier, respectively. They take arguments from the command line as well as YAML configuration files. Please run python main.py --help or refer to the documentation of Lightning CLI for details.

We provide YAML files with our hyperparameters and experimental settings in ./prover/ and ./verifier/. We ran all experiments on a single NVIDIA A6000 GPU with 48GB memory. To run them on GPUs with less memory, you may have to reduce batch_size and increase accumulate_grad_batches. On newer GPUs, --trainer.precision bf16 may lead to significant speedups and memory savings. I have not tested these features thoroughly, so please use them at your own discretion. Note that pretrained T5 models do not play well with fp16.
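For instance, halving the per-step batch while keeping the effective batch size constant might look like the following config override. This is a sketch only: the exact section and parameter names (data.batch_size, trainer.accumulate_grad_batches) are assumptions that must match those in the provided YAML files.

```yaml
data:
  batch_size: 4                  # half of a hypothetical default of 8
trainer:
  accumulate_grad_batches: 2     # compensates, keeping the effective batch size unchanged
  precision: bf16                # optional on newer GPUs; avoid fp16 with pretrained T5
```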

Training

Prover

First, cd into ./prover/. Then run python main.py fit --help to see how to use the training script. Below are example commands used in our experiments:

python main.py fit --config cli_task1_single_shot_t5-large.yaml  # Train a single-shot prover on Task 1 of EntailmentBank.
python main.py fit --config cli_task1_stepwise_t5-large.yaml     # Train a stepwise prover on Task 1 of EntailmentBank.
python main.py fit --config cli_task2_single_shot_t5-large.yaml  # Train a single-shot prover on Task 2 of EntailmentBank.
python main.py fit --config cli_task2_stepwise_t5-large.yaml     # Train a stepwise prover on Task 2 of EntailmentBank.

The training script saves hyperparameters, model checkpoints, and other information to ./prover/lightning_logs/EXP_ID/, where EXP_ID is an arbitrary experiment ID that will be printed by the training script.
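Since EXP_ID is arbitrary, a small helper can locate the newest checkpoint for the later validation and testing commands. The helper below is an assumption, not part of the repo:

```python
from pathlib import Path

def latest_checkpoint(log_dir="lightning_logs"):
    """Return the most recently modified .ckpt file under log_dir, or None."""
    ckpts = sorted(Path(log_dir).rglob("*.ckpt"), key=lambda p: p.stat().st_mtime)
    return ckpts[-1] if ckpts else None

# Example: pass the result as --ckpt_path when validating or testing.
ckpt = latest_checkpoint()
if ckpt is not None:
    print(f"--ckpt_path {ckpt}")
```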

Verifier

First, cd into ./verifier/. Then run python main.py fit --help to see how to use the training script. Below are example commands used in our experiments:

python main.py fit --config cli_entailmentbank_task1.yaml  # Train a verifier on Task 1 of EntailmentBank.
python main.py fit --config cli_entailmentbank_task2.yaml  # Train a verifier on Task 2 of EntailmentBank.

The training script saves hyperparameters, model checkpoints, and other information to ./verifier/lightning_logs/EXP_ID/.

Validation and Testing

Once training completes, we use the model checkpoint to predict on the validation and testing data. cd into ./prover/ and run python main.py validate --help and python main.py test --help to see how to use the script for validation and testing. Assuming a prover checkpoint PATH_TO_PROVER_CKPT and a verifier checkpoint PATH_TO_VERIFIER_CKPT, below are example commands:

python main.py validate --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT                                                                                                     # Validate the stepwise prover without verifier-guided search on Task 2 of EntailmentBank.
python main.py validate --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 0.5 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true   # Validate NLProofS (stepwise prover + verifier-guided search).
python main.py validate --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 1.0 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true   # Validate NLProofS w/o prover score.
python main.py test --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 0.5 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true       # Test NLProofS (stepwise prover + verifier-guided search).
python main.py test --config cli_task1_single_shot_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT                                                                                                   # Test the single-shot prover on Task 1 of EntailmentBank.
python main.py test --config cli_task2_single_shot_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --data.path_test ../data/entailment_trees_emnlp2021_data_v3/dataset/task_3/test.jsonl            # Test the single-shot prover (trained on Task 2) on Task 3 of EntailmentBank.

Validation and testing results are saved as ./prover/lightning_logs/EXP_ID/results_val.tsv and ./prover/lightning_logs/EXP_ID/results_test.tsv. They serve as input to EntailmentBank's official evaluation code for computing the evaluation metrics.
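The prediction files are plain TSV and can be inspected before handing them to the official evaluation code. A minimal reader (the column layout is whatever the file provides; nothing about it is assumed here):

```python
import csv

def load_predictions(path):
    """Read a results_{val,test}.tsv file into a list of rows, split on tabs."""
    with open(path, newline="") as f:
        return list(csv.reader(f, delimiter="\t"))

# Example usage (path is illustrative):
# rows = load_predictions("lightning_logs/EXP_ID/results_val.tsv")
# print(len(rows), "rows")
```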

Test Results and Model Checkpoints

Slide right to see download links in the tables below.

Task 1

| Model | Leaves-F1 | Leaves-AllCorrect | Steps-F1 | Steps-AllCorrect | Intermediates-F1 | Intermediates-AllCorrect | Overall-AllCorrect | Model checkpoints | Validation predictions | Test predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NLProofS | 97.6 | 90.0 | 54.8 | 41.8 | 72.0 | 39.7 | 38.2 | prover, verifier | results_val.tsv | results_test.tsv |
| Stepwise prover | 98.8 | 98.5 | 54.8 | 41.5 | 71.9 | 38.5 | 36.8 | The prover above | results_val.tsv | results_test.tsv |
| Single-shot prover | 98.2 | 82.7 | 51.8 | 40.9 | 66.7 | 36.5 | 34.7 | prover | results_val.tsv | results_test.tsv |

Task 2

| Model | Leaves-F1 | Leaves-AllCorrect | Steps-F1 | Steps-AllCorrect | Intermediates-F1 | Intermediates-AllCorrect | Overall-AllCorrect | Model checkpoints | Validation predictions | Test predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NLProofS | 90.3 | 60.6 | 48.6 | 35.6 | 70.3 | 39.4 | 34.4 | prover, verifier | results_val.tsv | [results_test.tsv](https://huggingface.co/kaiyuy/NLProofS/resolve/main/prover/entailmentbank_task2/nlpr
