Data Augmentation Techniques for NLP

If you'd like to add your paper, do not email us. Instead, read the protocol for adding a new entry and send a pull request.

We group the papers by text classification, translation, summarization, question-answering, sequence tagging, parsing, grammatical-error-correction, generation, dialogue, multimodal, mitigating bias, mitigating class imbalance, adversarial examples, compositionality, and automated augmentation.

This repository is based on our paper, "A Survey of Data Augmentation Approaches for NLP" (Findings of ACL '21). You can cite it as follows:

@inproceedings{feng-etal-2021-survey,
    title = "A Survey of Data Augmentation Approaches for {NLP}",
    author = "Feng, Steven Y.  and
      Gangal, Varun  and
      Wei, Jason  and
      Chandar, Sarath  and
      Vosoughi, Soroush  and
      Mitamura, Teruko  and
      Hovy, Eduard",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.84",
    doi = "10.18653/v1/2021.findings-acl.84",
    pages = "968--988",
}

Authors: <a href="https://scholar.google.ca/citations?hl=en&user=zwiszZIAAAAJ">Steven Y. Feng</a>, <a href="https://scholar.google.com/citations?user=rWZq2nQAAAAJ&hl=en">Varun Gangal</a>, <a href="https://scholar.google.com/citations?user=wA5TK_0AAAAJ&hl=en">Jason Wei</a>, <a href="https://scholar.google.co.in/citations?user=yxWtZLAAAAAJ&hl=en">Sarath Chandar</a>, <a href="https://scholar.google.ca/citations?user=45DAXkwAAAAJ&hl=en">Soroush Vosoughi</a>, <a href="https://scholar.google.com/citations?user=gjsxBCkAAAAJ&hl=en">Teruko Mitamura</a>, <a href="https://scholar.google.com/citations?user=PUFxrroAAAAJ&hl=en">Eduard Hovy</a>

Special thanks to Ryan Shentu, Fiona Feng, Karen Liu, Emily Nie, Tanya Lu, and Bonnie Ma for helping out with this repo. Note: this repository is a work in progress, and more papers from our survey paper will be added soon. Direct inquiries to stevenyfeng@gmail.com or open an issue here.

Also, check out our talk for Google Research (Steven Feng and Varun Gangal) here, and our podcast episode (Steven Feng and Eduard Hovy) here and here.

Text Classification

| Paper | Datasets |
| -- | --- |
| Unsupervised Word Sense Disambiguation Rivaling Supervised Methods (ACL '95) | Paper-Specific/Legacy Corpus |
| Synonym Replacement (Character-Level Convolutional Networks for Text Classification, NeurIPS '15) | AG’s News, DBPedia, Yelp, Yahoo Answers, Amazon |
| That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets (EMNLP '15) | Twitter |
| Robust Training under Linguistic Adversity (EACL '17) code | Movie review, customer review, SUBJ, SST |
| Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations (NAACL '18) code | SST, SUBJ, MRQA, RT, TREC |
| Variational Pretraining for Semi-supervised Text Classification (ACL '19) code | IMDB, AG News, Yahoo, hatespeech |
| EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks (EMNLP '19) code | SST, CR, SUBJ, TREC, PC |
| A Closer Look At Feature Space Data Augmentation For Few-Shot Intent Classification (DeepLo @ EMNLP '19) | SNIPS |
| Nonlinear Mixup: Out-Of-Manifold Data Augmentation for Text Classification (AAAI '20) | TREC, SST, Subj, MR |
| MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification (ACL '20) code | AG News, DBpedia, Yahoo, IMDb |
| Unsupervised Data Augmentation for Consistency Training (NeurIPS '20) code | Yelp, IMDb, Amazon, DBpedia |
| Not Enough Data? Deep Learning to the Rescue! (AAAI '20) | ATIS, TREC, WVA |
| Data Augmentation using Pre-trained Transformer Models (LifeLongNLP @ AACL '20) code | SNIPS, TREC, SST2 |
| SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness (EMNLP '20) code | IWSLT'14 |
| Data Boost: Text Data Augmentation Through Reinforcement Learning Guided Conditional Generation (EMNLP '20) | ICWSM '20 Data Challenge, SemEval '17 sentiment analysis, SemEval '18 irony |
| Textual Data Augmentation for Efficient Active Learning on Tiny Datasets (EMNLP '20) | SST2, TREC |
| Text Augmentation in a Multi-Task View (EACL '21) | SST2, TREC, SUBJ |
| GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation (arXiv '21) | SST2, CR, TREC, SUBJ, MPQA, CoLA |
| Few-Shot Text Classification with Triplet Loss, Data Augmentation, and Curriculum Learning (NAACL '21) code | HUFF, COV-Q, AMZN, FEWREL |
| Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification (EMNLP '21) code | IMDB, SST2, SST5, TREC, YELP2, YELP5 |
| AEDA: An Easier Data Augmentation Technique for Text Classification (EMNLP '21) code | SST, CR, SUBJ, TREC, PC |

Translation

| Paper | Datasets |
| -- | --- |
| Backtranslation (Improving Neural Machine Translation Models with Monolingual Data, ACL '16) | WMT '15 en-de, IWSLT '15 en-tr |
| Adapting Neural Machine Translation with Parallel Synthetic Data (WMT '17) | COMMON, 1 Billion Words, dev2013, XRCE, IT, E-Com |
| Data Augmentation for Low-Resource Neural Machine Translation (ACL '17) code | WMT '14/'15/'16 en-de/de-en |
| Synthetic Data for Neural Machine Translation of Spoken-Dialects (arXiv '17) | LDC2012T09, OpenSubtitles-2013 |
| Multi-Source Neural Machine Translation with Data Augmentation (IWSLT '18) | TED Talks |
| SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation (EMNLP '18) | IWSLT '15 en-vi, IWSLT '16 de-en, WMT '15 en-de |
| Generalizing Back-Translation in Neural Machine Translation (WMT '19) | ed NewsCrawl2, WMT'18 de-en |
| Neural Fuzzy Repair: Integrating Fuzzy Matches into Neural Machine Translation (ACL '19) | DGT-TM en-ml/en-hu |
| Augmenting Neural Machine Translation with Knowledge Graphs (arXiv '19) | WMT '14 -'18 |
| Generalized Data Augmentation for Low-Resource Translation (ACL '19) code | ENG-HRL-LRL, HRL-LRL |
| Improving Robustness of Machine Translation with Synthetic Noise (NAACL '19) code | EP, TED, MTNT en-fr en-jpn |
| Soft Contextual Data Augmentation for Neural Machine Translation (ACL '19) code | IWSLT '14 de/es/he-en, WMT '14 en-de |
| Data augmentation using back-translation for context-aware neural machine translation (DiscoMT @ EMNLP '19) code | IWSLT'17 en-ja/en-fr, BookCorpus, Europarl v7, National Diet of Japan |
| Improving Neural Machine Translation Robustness via Data Augmentation: Beyond Back-Translation (W-NUT @ EMNLP '19) | WMT'15/'19 en/fr, MTNT, IWSLT'17, MuST-C |
| Data augmentation for pipeline-based speech translation [(Baltic HLT '20)](https://hal.inria.
