DiPS

NAACL 2019: Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation


Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation

Source code for NAACL 2019 paper: Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation

Figure: Overview of DiPS during decoding to generate k paraphrases. At each time step, a set of N sequences V(t) is used to determine k < N sequences (X∗) via submodular maximization. The figure illustrates the motivation behind each submodular component. Please see Section 4 of the paper for details.
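The selection step described above (choosing k of N candidate sequences by submodular maximization) can be illustrated with a standard greedy routine. This is only a minimal sketch, not the repository's implementation: the bigram-coverage objective and the names `ngrams`, `coverage`, and `greedy_select` are assumptions made for illustration, and the paper's actual objective (Section 4) combines several components that are not reproduced here.

```python
from typing import Callable, List, Sequence, Set, Tuple

Ngram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n: int = 2) -> Set[Ngram]:
    """Return the set of n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def coverage(selected: List[Sequence[str]]) -> float:
    """Toy monotone submodular objective: number of distinct bigrams covered.

    Hypothetical stand-in for the paper's objective, used only for illustration.
    """
    covered: Set[Ngram] = set()
    for seq in selected:
        covered |= ngrams(seq)
    return float(len(covered))


def greedy_select(
    candidates: List[Sequence[str]],
    k: int,
    objective: Callable[[List[Sequence[str]]], float] = coverage,
) -> List[Sequence[str]]:
    """Greedily pick k candidates, each time adding the largest marginal gain."""
    selected: List[Sequence[str]] = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        base = objective(selected)
        best = max(remaining, key=lambda c: objective(selected + [c]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


# N = 4 candidate sequences, from which k = 2 are kept.
candidates = [
    "how do i learn python".split(),
    "how do i learn python fast".split(),
    "what is the best way to study python".split(),
    "how can i learn python".split(),
]
chosen = greedy_select(candidates, k=2)
for seq in chosen:
    print(" ".join(seq))
```

For monotone submodular objectives, this greedy strategy is the usual choice because it carries a well-known (1 - 1/e) approximation guarantee under a cardinality constraint.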

Also on GEM/NL-Augmenter 🦎 → 🐍

For the transformer-model version, please use diverse_paraphrase in NL-Augmenter.

Dependencies

Compatible with Python 3.6.
Dependencies can be installed using requirements.txt.

Dataset

Download the following datasets:

Extract and place them in the data directory. Path: data/<dataset-folder-name>. A sample dataset folder might look like data/quora/<train/test/val>/<src.txt/tgt.txt>.
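As a quick sanity check of the layout described above, the following sketch reads one split into (source, target) pairs. It assumes that src.txt and tgt.txt are line-aligned, with one sentence per line; this is a common convention for parallel data but is not stated here. The helper name `load_pairs` is hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import List, Tuple


def load_pairs(split_dir: Path) -> List[Tuple[str, str]]:
    """Read (source, target) pairs from src.txt and tgt.txt in one split folder.

    Assumes the two files are line-aligned (an assumption, not documented here).
    """
    src_lines = (split_dir / "src.txt").read_text(encoding="utf-8").splitlines()
    tgt_lines = (split_dir / "tgt.txt").read_text(encoding="utf-8").splitlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"{split_dir}: {len(src_lines)} source lines "
            f"but {len(tgt_lines)} target lines"
        )
    return list(zip(src_lines, tgt_lines))


# Example path following the sample layout data/quora/train/.
split_dir = Path("data") / "quora" / "train"
if split_dir.is_dir():
    pairs = load_pairs(split_dir)
    print(f"{len(pairs)} pairs loaded from {split_dir}")
```

A mismatch in line counts raises an error early, which is usually cheaper to catch here than partway through training.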

Download GoogleNews-vectors-negative300.bin.gz into the data directory. In case the above link doesn't work, find the zip file here

Setup:

To get the project's source code, clone the GitHub repository:

$ git clone https://github.com/malllabiisc/DiPS

Install VirtualEnv using the following (optional):

$ [sudo] pip install virtualenv

Create and activate your virtual environment (optional):

$ virtualenv -p python3 venv
$ source venv/bin/activate

Install all the required packages:

$ pip install -r requirements.txt

Install the submodopt package by running the following commands from the root directory of the repository:

$ cd ./packages/submodopt
$ python setup.py install
$ cd ../../

Training the sequence to sequence model

python -m src.main -mode train -gpu 0 -use_attn -bidirectional -dataset quora -run_name <run_name>

Create dictionary for submodular subset selection. Used for Semantic similarity (L2)

To use trained embeddings -

python -m src.create_dict -model trained -run_name <run_name> -gpu 0

To use pretrained word2vec embeddings -

python -m src.create_dict -model pretrained -run_name <run_name> -gpu 0

Either command will generate the word2vec.pickle file in data/embeddings.
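To show how word embeddings can feed a semantic-similarity score, here is a minimal sketch that averages word vectors per sentence and compares the averages by cosine similarity. The structure of word2vec.pickle is not described here, so the sketch uses a hand-made toy dictionary (word to vector) instead of loading the file. The names `sentence_vector` and `cosine` are hypothetical, and the similarity term actually used in the paper may be computed differently.

```python
import math
from typing import Dict, List, Sequence


def sentence_vector(
    tokens: Sequence[str], embeddings: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    total = [0.0] * dim
    for vec in known:
        for i, value in enumerate(vec):
            total[i] += value
    return [value / len(known) for value in total]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 2-dimensional embeddings, invented purely for illustration.
toy_embeddings = {
    "learn": [1.0, 0.0],
    "study": [0.9, 0.1],
    "python": [0.0, 1.0],
    "banana": [-1.0, 0.0],
}

sim_close = cosine(
    sentence_vector(["learn", "python"], toy_embeddings, 2),
    sentence_vector(["study", "python"], toy_embeddings, 2),
)
sim_far = cosine(
    sentence_vector(["learn"], toy_embeddings, 2),
    sentence_vector(["banana"], toy_embeddings, 2),
)
print(f"close: {sim_close:.3f}, far: {sim_far:.3f}")
```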

Decoding using submodularity

python -m src.main -mode decode -selec submod -run_name <run_name> -beam_width 10 -gpu 0

Citation

Please cite the following paper if you find this work relevant to your application:

@inproceedings{dips2019,
    title = "Submodular Optimization-based Diverse Paraphrasing and its Effectiveness in Data Augmentation",
    author = "Kumar, Ashutosh  and
      Bhattamishra, Satwik  and
      Bhandari, Manik  and
      Talukdar, Partha",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1363",
    pages = "3609--3619"
}

For any clarification, comments, or suggestions, please create an issue or contact ashutosh@iisc.ac.in or Satwik Bhattamishra.
