ProtAugment
Code for ProtAugment: Unsupervised diverse short-texts paraphrasing for intent detection meta-learning
This repository contains the official code for the paper PROTAUGMENT: Unsupervised diverse short-texts paraphrasing for intent detection meta-learning. In this paper, we aim to train a few-shot text classification model. To improve the robustness of our model, we leverage unlabeled data and paraphrases. The paper can be found here.
Setup
To run this code, we advise creating a dedicated virtual environment. Please note this code was developed with Python 3.6.9; other Python versions might not work.
# Environment creation
python3 -m virtualenv .venv --python=python3.6
Then, install the required python libraries
# Activate environment
source .venv/bin/activate
# Install requirements
pip install -r requirements.txt
Modeling
PROTAUGMENT has the following architecture:

In this framework, two models are used:
- A paraphrase generation model
- An embedder model (usually a language model), which is used to do few-shot text classification
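To make the interplay concrete, here is a minimal sketch of how an embedder is used for few-shot classification in a prototypical-network setup (the approach PROTAUGMENT builds on). The toy 2-dimensional "embeddings" below are hypothetical stand-ins for real sentence embeddings produced by the language model.

```python
import numpy as np

def prototypical_predict(support_emb, support_labels, query_emb):
    """Classify each query embedding by its nearest class prototype.

    support_emb: (n_support, dim) embeddings of labeled support examples
    support_labels: (n_support,) integer class ids
    query_emb: (n_query, dim) embeddings of query examples
    """
    classes = np.unique(support_labels)
    # Prototype = mean embedding of each class's support examples
    prototypes = np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in classes]
    )
    # Euclidean distance from each query to each prototype
    dists = np.linalg.norm(query_emb[:, None, :] - prototypes[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]

# Tiny 2-way 2-shot episode with toy 2-dimensional "embeddings"
support = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
labels = np.array([0, 0, 1, 1])
queries = np.array([[0.05, 0.1], [1.0, 0.9]])
print(prototypical_predict(support, labels, queries))  # → [0 1]
```

During training, episodes like this are sampled repeatedly, which is what makes the setup "meta-learning" rather than standard supervised classification.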
Paraphrase generation Model
The paraphrase generation model is a BART (article, model) model, trained on the paraphrase generation task using 3 datasets: Google-PAWS, MSR, Quora. If you want to reproduce the results, the model is available on the 🤗 HuggingFace hub under the name tdopierre/ProtAugment-ParaphraseGenerator.
If you want to use another Seq2Seq model to generate paraphrases, feel free to change the --paraphrase-model-name-or-path and --paraphrase-tokenizer-name-or-path parameters.
If you want to train your own paraphrase generation model, follow the instructions in the paraphrase/fine-tune-BART directory.
Language Model
For each dataset, we first fine-tune a language model on the dataset, on the masked language modeling task. Such models are available on the 🤗 HuggingFace hub:
| Dataset | Model Identifier | URL |
| :-------- | :-------------------------------- | :---------- |
| BANKING77 | tdopierre/ProtAugment-LM-BANKING77 | link |
| HWU64 | tdopierre/ProtAugment-LM-HWU64 | link |
| Clinic150 | tdopierre/ProtAugment-LM-Clinic150 | link |
| Liu | tdopierre/ProtAugment-LM-Liu | link |
If you want to fine-tune your own language model, use the scripts available in the language_modeling directory.
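As a rough illustration of what masked language modeling fine-tuning involves, here is a sketch of generic BERT-style input masking (15% of tokens are selected; of those, 80% are replaced by [MASK], 10% by a random token, 10% kept unchanged). This is the standard MLM recipe, not necessarily the exact implementation in the language_modeling directory.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels): labels hold the original token at
    masked positions and None elsewhere (generic BERT-style MLM masking)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")       # 80%: replace with mask token
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: random replacement
            else:
                masked.append(tok)            # 10%: keep the original token
        else:
            labels.append(None)  # unmasked positions contribute no loss
            masked.append(tok)
    return masked, labels

sentence = "book a table for two tonight".split()
masked, labels = mask_tokens(sentence, vocab=sentence)
```

The fine-tuning loss is then computed only at the positions where `labels` is not None.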
Running PROTAUGMENT
If you want to run a single experiment using PROTAUGMENT, use models/proto/protaugment.sh. This script loads the correct environment and runs the python script models/proto/protaugment.py, passing it the arguments, as can be seen in utils/scripts/run_protaugment.sh. I tried my best to document the different arguments in protaugment.py. If anything is unclear, please reach out to @tdopierre directly.
To run the experiments reported in the paper, use utils/scripts/protaugment/run_protaugment.sh.
If you are using the SLURM job manager, you may use the alternative script run_protaugment-slurm.sh. This breaks each individual experiment into a separate job, and hence runs much faster (provided you have adequate hardware).
⚠️ WARNING ⚠️ This script takes a long time to run, as it contains all the experiments reported in the paper.
Example: Changing paraphrases source
If you have your own paraphrases of unlabeled texts available, you can use them in PROTAUGMENT by specifying their path in the --augmentation-data-path argument. As an example, you can have a look at the way I use augmentations produced by back-translation in the run_protaugment.sh script.
References
If you use this code or build upon it in your projects, please cite our paper:
@article{Dopierre2021ProtAugmentUD,
  title={ProtAugment: Unsupervised diverse short-texts paraphrasing for intent detection meta-learning},
  author={Thomas Dopierre and C. Gravier and Wilfried Logerais},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.12995}
}