CoRPG

Code for the paper "Document-Level Paraphrase Generation with Sentence Rewriting and Reordering" by Zhe Lin, Yitao Cai and Xiaojun Wan. The paper was accepted to Findings of EMNLP 2021.


Datasets

We use ParaNMT to train a sentence-level paraphrasing model. We select News-Commentary as the document corpus, employ the sentence-level paraphrasing model to generate pseudo document-level paraphrases, and use ALBERT to generate their coherence relationship graphs. All of this data is released here.

System Output

If you only want the system output and would rather not install the dependencies and train a model (or run a pre-trained model), the all-system-output folder is for you.

Dependencies

PyTorch >= 1.4

Transformers == 4.1.1

nltk == 3.5

tqdm

torch_optimizer == 0.1.0
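
The pins above can be checked against the current environment before training. The following is a minimal sketch, not part of the repository: the helper names (`parse_version`, `satisfies`, `check_environment`) are hypothetical, and the distribution names (`torch` for PyTorch, `torch-optimizer` for torch_optimizer) are assumptions about how the packages are published.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list; the distribution names are assumptions.
REQUIREMENTS = {
    "torch": (">=", "1.4"),
    "transformers": ("==", "4.1.1"),
    "nltk": ("==", "3.5"),
    "tqdm": (None, None),  # no version constraint given
    "torch-optimizer": ("==", "0.1.0"),
}


def parse_version(text):
    """Turn '1.13.1+cu117' into (1, 13, 1); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies(installed, op, required):
    """Check an installed version string against a '>=' or '==' pin."""
    if op is None:
        return True
    have, want = parse_version(installed), parse_version(required)
    # Pad the shorter tuple with zeros so that 1.4 compares equal to 1.4.0.
    width = max(len(have), len(want))
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    return have >= want if op == ">=" else have == want


def check_environment(requirements=REQUIREMENTS):
    """Return a {package: status} report for the current environment."""
    report = {}
    for name, (op, required) in requirements.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        if satisfies(installed, op, required):
            report[name] = "ok"
        else:
            report[name] = f"found {installed}, need {op} {required}"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```

The version parser is deliberately simple and ignores pre-release tags; for stricter comparisons a dedicated version-parsing library would be a better fit.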

Train a Document-Level Paraphrase Model

Step1: Prepare dataset

We release the dataset we used in the data folder. If you want to use your own dataset, follow the procedure below.

First, train a sentence-level paraphrase model to generate a pseudo document-level paraphrase dataset (we use ParaNMT and fairseq to train this model).

Then, download the ALBERT model and fine-tune it on your own dataset with the following script:

python eval/coherence.py --train \
                         --pretrain_model [pretrain_model file] \
                         --save_file [the path to save fine-tune model] \
                         --text_file [the corpora used to fine-tune the pretrain_model]

We also provide our fine-tuned model here.

Finally, you can use ALBERT to generate the coherence relationship graph:

python eval/coherence.py --inference \
                         --pretrain_model [pretrain_model file] \
                         --text_file [generate the coherence relationship graph of this corpora]

NOTE: Our code only supports paraphrasing documents with 5 sentences. If you want to paraphrase longer or variable-length documents, you need to modify the code.
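
When preparing your own corpus under this constraint, one option is to cut each text into consecutive 5-sentence documents. The sketch below is an assumption about how such data could be prepared, not a description of how the released dataset was built; the function name `split_into_fixed_documents` is hypothetical, and it expects text that has already been split into sentences.

```python
SENTENCES_PER_DOC = 5  # document length the released code supports, per the note above


def split_into_fixed_documents(sentences, size=SENTENCES_PER_DOC):
    """Group sentences into consecutive documents of exactly `size` sentences.

    A trailing remainder shorter than `size` is dropped, since shorter
    documents would not match the fixed length.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    usable = len(sentences) - len(sentences) % size
    return [sentences[i:i + size] for i in range(0, usable, size)]


if __name__ == "__main__":
    article = [f"Sentence {i}." for i in range(1, 13)]  # 12 toy sentences
    for document in split_into_fixed_documents(article):
        print(" ".join(document))
```

With 12 input sentences this yields two 5-sentence documents and discards the last two sentences.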

Step2: Process dataset

Create Vocabulary:

python createVocab.py --file ./data/news-commentary/data/bpe/train.split.bpe \
                             ./data/news-commentary/data/bpe/train.pesu.split.bpe \
                      --save_path ./data/vocab.share

Processing Training Dataset:

python preprocess.py --source ./data/news-commentary/data/bpe/train.pesu.comb.bpe \
                     --graph ./data/news-commentary/data/train.pesu.graph \
                     --target ./data/news-commentary/data/bpe/train.comb.bpe \
                     --vocab ./data/vocab.share \
                     --save_file ./data/para.pair

Processing Test Dataset:

python preprocess.py --source ./data/news-commentary/data/bpe/test.comb.bpe \
                     --graph ./data/news-commentary/data/test.graph \
                     --vocab ./data/vocab.share \
                     --save_file ./data/sent.pt

Step3: Train a document-level paraphrase model

python train.py --cuda_num 0 1 2 3 \
                --vocab ./data/vocab.share \
                --train_file ./data/para.pair \
                --checkpoint_path ./data/model \
                --batch_print_info 200 \
                --grad_accum 1 \
                --graph_eps 0.5 \
                --max_tokens 5000

Step4: Generate document-level paraphrase

python generator.py --cuda_num 4 \
                    --file ./data/sent.pt \
                    --ref_file ./data/news-commentary/data/test.comb \
                    --max_tokens 10000 \
                    --vocab ./data/vocab.share \
                    --decode_method greedy \
                    --beam 5 \
                    --model_path ./data/model.pkl \
                    --output_path ./data/output \
                    --max_length 300

Pre-trained Models

We release our pre-trained model here.

Evaluation Metrics

We evaluate our model on three aspects: relevancy, diversity, and coherence.

Relevancy

We use BERTScore to evaluate the semantic relevancy between the paraphrase and the original sentence.

Diversity

We employ self-TER and self-WER to evaluate the diversity of our model.
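
To make the diversity measure concrete, here is a minimal sketch of a word error rate computation. It assumes that self-WER is the WER between a source text and its own paraphrase (so a higher value means a more diverse paraphrase) and that tokens are whitespace-separated; the function names are hypothetical and this is not the repository's evaluation code. TER additionally allows block shifts and is not covered here.

```python
def word_edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # delete ref_word
                current[j - 1] + 1,                   # insert hyp_word
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def self_wer(source, paraphrase):
    """WER of the paraphrase measured against its own source text."""
    source_words = source.split()
    if not source_words:
        raise ValueError("source must contain at least one word")
    distance = word_edit_distance(source_words, paraphrase.split())
    return distance / len(source_words)


if __name__ == "__main__":
    print(self_wer("the cat sat on the mat", "a cat was sitting on the mat"))
```

An unchanged copy of the source scores 0.0, and the score grows as more words are rewritten, inserted, or removed.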

Coherence

We propose two metrics, COH and COH-p, to evaluate the coherence of a paraphrase. Both are defined in terms of $P_{SOP}$, which is calculated by ALBERT. We provide a script for computing these two evaluation metrics:

python eval/coherence.py --coh \
                         --pretrain_model [the pretrain albert file] \
                         --text_file [the corpora to be evaluated]

Results

Citation

If you use any content of this repo in your work, please cite the following BibTeX entry:

@inproceedings{lin-wan-2021-document,
    title = "Document-Level Paraphrase Generation with Sentence Rewriting and Reordering",
    author = "Lin, Zhe and Cai, Yitao and Wan, Xiaojun",
    booktitle = "Findings of EMNLP",
    year = "2021"
}
