
Nlpaug

Data augmentation for NLP


<p align="center"> <br> <img src="https://github.com/makcedward/nlpaug/blob/master/res/logo_small.png"/> <br> <p> <p align="center"> <a href="https://travis-ci.org/makcedward/nlpaug"> <img alt="Build" src="https://travis-ci.org/makcedward/nlpaug.svg?branch=master"> </a> <a href="https://www.codacy.com/app/makcedward/nlpaug?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=makcedward/nlpaug&amp;utm_campaign=Badge_Grade"> <img alt="Code Quality" src="https://api.codacy.com/project/badge/Grade/2d6d1d08016a4f78818161a89a2dfbfb"> </a> <a href="https://pepy.tech/badge/nlpaug"> <img alt="Downloads" src="https://pepy.tech/badge/nlpaug"> </a> </p>

nlpaug

This Python library helps you augment NLP data for your machine learning projects. Visit this introduction to learn about data augmentation in NLP. An Augmenter is the basic element of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters together.

Features

  • Generate synthetic data for improving model performance without manual effort
  • Simple, easy-to-use and lightweight library. Augment data in 3 lines of code
  • Plug and play with any machine learning/neural network framework (e.g. scikit-learn, PyTorch, TensorFlow)
  • Support textual and audio input
<h3 align="center">Textual Data Augmentation Example</h3> <br><p align="center"><img src="https://github.com/makcedward/nlpaug/blob/master/res/textual_example.png"/></p> <h3 align="center">Acoustic Data Augmentation Example</h3> <br><p align="center"><img src="https://github.com/makcedward/nlpaug/blob/master/res/audio_example.png"/></p>

| Section | Description |
|:---:|:---:|
| Quick Demo | How to use this library |
| Augmenter | Introduces all available augmentation methods |
| Installation | How to install this library |
| Recent Changes | Latest enhancements |
| Extension Reading | More real-life examples and research |
| Reference | References to external resources such as data or models |

Quick Demo

Augmenter

| Type | Target | Augmenter | Action | Description |
|:---:|:---:|:---:|:---:|:---:|
|Textual| Character | KeyboardAug | substitute | Simulate keyboard distance error |
|Textual| | OcrAug | substitute | Simulate OCR engine error |
|Textual| | RandomAug | insert, substitute, swap, delete | Apply augmentation randomly |
|Textual| Word | AntonymAug | substitute | Substitute a word with its opposite according to WordNet antonyms |
|Textual| | ContextualWordEmbsAug | insert, substitute | Feed surrounding words to a BERT, DistilBERT, RoBERTa or XLNet language model to find the most suitable word for augmentation |
|Textual| | RandomWordAug | swap, crop, delete | Apply augmentation randomly |
|Textual| | SpellingAug | substitute | Substitute a word according to a spelling mistake dictionary |
|Textual| | SplitAug | split | Split one word into two words randomly |
|Textual| | SynonymAug | substitute | Substitute a similar word according to WordNet/PPDB synonyms |
|Textual| | TfIdfAug | insert, substitute | Use TF-IDF to decide how a word should be augmented |
|Textual| | WordEmbsAug | insert, substitute | Leverage word2vec, GloVe or fasttext embeddings to apply augmentation |
|Textual| | BackTranslationAug | substitute | Leverage two translation models for augmentation |
|Textual| | ReservedAug | substitute | Replace reserved words |
|Textual| Sentence | ContextualWordEmbsForSentenceAug | insert | Insert a sentence according to XLNet, GPT2 or DistilGPT2 prediction |
|Textual| | AbstSummAug | substitute | Summarize an article with an abstractive summarization method |
|Textual| | LambadaAug | substitute | Use a language model to generate text, then a classification model to retain high-quality results |
|Signal| Audio | CropAug | delete | Delete a segment of the audio |
|Signal| | LoudnessAug | substitute | Adjust the audio's volume |
|Signal| | MaskAug | substitute | Mask a segment of the audio |
|Signal| | NoiseAug | substitute | Inject noise |
|Signal| | PitchAug | substitute | Adjust the audio's pitch |
|Signal| | ShiftAug | substitute | Shift the time dimension forward/backward |
|Signal| | SpeedAug | substitute | Adjust the audio's speed |
|Signal| | VtlpAug | substitute | Change vocal tract |
|Signal| | NormalizeAug | substitute | Normalize audio |
|Signal| | PolarityInverseAug | substitute | Swap positive and negative for audio |
|Signal| Spectrogram | FrequencyMaskingAug | substitute | Set a block of values to zero along the frequency dimension |
|Signal| | TimeMaskingAug | substitute | Set a block of values to zero along the time dimension |
|Signal| | LoudnessAug | substitute | Adjust volume |

Flow

| Type | Flow | Description |
|:---:|:---:|:---:|
|Pipeline| Sequential | Apply a list of augmentation functions sequentially |
|Pipeline| Sometimes | Apply some augmentation functions randomly |

Installation

The library supports Python 3.5+ on Linux and Windows.

To install the library:

pip install numpy requests nlpaug

or install the latest version (including BETA features) directly from GitHub:

pip install numpy git+https://github.com/makcedward/nlpaug.git

or install with conda:

conda install -c makcedward nlpaug

If you use BackTranslationAug, ContextualWordEmbsAug, ContextualWordEmbsForSentenceAug or AbstSummAug, install the following dependencies as well (the quotes keep the shell from treating `>` as a redirect):

pip install "torch>=1.6.0" "transformers>=4.11.3" sentencepiece

If you use LambadaAug, install the following dependency as well:

pip install "simpletransformers>=0.61.10"

If you use AntonymAug or SynonymAug, install the following dependency as well:

pip install "nltk>=3.4.5"

If you use WordEmbsAug (word2vec, GloVe or fasttext), download a pre-trained model first and install the following dependency as well:

from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.')  # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.')  # Download GloVe model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.')  # Download fasttext model

pip install "gensim>=4.1.2"

If you use SynonymAug (PPDB), download the file from the following URI. You may not be able to run the augmenter if you get the PPDB file from another website.

http://paraphrase.org/#/download

If you use PitchAug, SpeedAug or VtlpAug, install the following dependencies as well:

pip install "librosa>=0.9.1" matplotlib
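
Since the optional dependencies differ per augmenter, it can help to check what is installed before constructing one. The sketch below uses only the standard library. The `OPTIONAL_DEPS` mapping and the `missing_deps` helper are hypothetical names that restate the installation notes above; they are not part of nlpaug. The check covers presence only, not the minimum versions.

```python
from importlib.util import find_spec

# Hypothetical mapping that restates the installation notes above.
_TRANSFORMER_DEPS = ("torch", "transformers", "sentencepiece")
_AUDIO_DEPS = ("librosa", "matplotlib")

OPTIONAL_DEPS = {
    "BackTranslationAug": _TRANSFORMER_DEPS,
    "ContextualWordEmbsAug": _TRANSFORMER_DEPS,
    "ContextualWordEmbsForSentenceAug": _TRANSFORMER_DEPS,
    "AbstSummAug": _TRANSFORMER_DEPS,
    "LambadaAug": ("simpletransformers",),
    "AntonymAug": ("nltk",),
    "SynonymAug": ("nltk",),
    "WordEmbsAug": ("gensim",),
    "PitchAug": _AUDIO_DEPS,
    "SpeedAug": _AUDIO_DEPS,
    "VtlpAug": _AUDIO_DEPS,
}


def missing_deps(augmenter_name):
    """Return the optional packages for this augmenter that cannot be imported."""
    required = OPTIONAL_DEPS.get(augmenter_name, ())
    return [pkg for pkg in required if find_spec(pkg) is None]


for name in ("SynonymAug", "WordEmbsAug", "VtlpAug"):
    missing = missing_deps(name)
    status = "ready" if not missing else "missing: " + ", ".join(missing)
    print(f"{name}: {status}")
```

Augmenters that are not listed in the mapping, such as KeyboardAug, are treated as having no extra requirements.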

Recent Changes

1.1.11 Jul 6, 2022

See changelog for more details.

Extension Reading
