NeuSpell: A Neural Spelling Correction Toolkit
Contents
- Installation & Quick Start
- Toolkit
- Finetuning on custom data and creating new models
- Applications
- Additional Requirements
Updates
Latest
- April 2021:
- APIs for creating synthetic data now available for English language. See Synthetic data creation.
- neuspell is now available through pip. See Installation through pip.
- Added support for different transformer-based models such as DistilBERT, XLM-RoBERTa, etc. See the Finetuning on custom data and creating new models section for more details.
Previous
- March, 2021:
- Code-base reformatted. Addressed bug fixes and issues.
- November, 2020:
- Neuspell's BERT pretrained model is now available among Hugging Face models as murali1996/bert-base-cased-spell-correction. We provide an example code snippet at ./scripts/huggingface for curious practitioners.
- September, 2020:
- This work is accepted at EMNLP 2020 (system demonstrations)
Installation
git clone https://github.com/neuspell/neuspell; cd neuspell
pip install -e .
To install extra requirements,
pip install -r extras-requirements.txt
or individually as:
pip install -e .[elmo]
pip install -e .[spacy]
NOTE: For zsh, use ".[elmo]" and ".[spacy]" instead
Additionally, spacy models can be downloaded as:
python -m spacy download en_core_web_sm
Then, download neuspell's pretrained models by following Download Checkpoints.
Here is a quick-start snippet showing how to use a checker model from Python. See test_neuspell_correctors.py for more usage patterns.
import neuspell
from neuspell import available_checkers, BertChecker
""" see available checkers """
print(f"available checkers: {neuspell.available_checkers()}")
# → available checkers: ['BertsclstmChecker', 'CnnlstmChecker', 'NestedlstmChecker', 'SclstmChecker', 'SclstmbertChecker', 'BertChecker', 'SclstmelmoChecker', 'ElmosclstmChecker']
""" select spell checkers & load """
checker = BertChecker()
checker.from_pretrained()
""" spell correction """
checker.correct("I luk foward to receving your reply")
# → "I look forward to receiving your reply"
checker.correct_strings(["I luk foward to receving your reply", ])
# → ["I look forward to receiving your reply"]
checker.correct_from_file(src="noisy_texts.txt")
# → "Found 450 mistakes in 322 lines, total_lines=350"
""" evaluation of models """
checker.evaluate(clean_file="bea60k.txt", corrupt_file="bea60k.noise.txt")
# → data size: 63044
# → total inference time for this data is: 998.13 secs
# → total token count: 1032061
# → confusion table: corr2corr:940937, corr2incorr:21060,
# incorr2corr:55889, incorr2incorr:14175
# → accuracy is 96.58%
# → word correction rate is 79.76%
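The accuracy and word correction rate reported by evaluate() follow directly from the confusion table it prints; a minimal sketch of the arithmetic (assuming the four counts partition all evaluated tokens):

```python
# Confusion counts from the evaluation output above.
corr2corr, corr2incorr = 940937, 21060
incorr2corr, incorr2incorr = 55889, 14175

total = corr2corr + corr2incorr + incorr2corr + incorr2incorr

# accuracy: fraction of tokens that are correct after spell checking
accuracy = 100 * (corr2corr + incorr2corr) / total
# word correction rate: fraction of originally misspelled tokens that got fixed
correction_rate = 100 * incorr2corr / (incorr2corr + incorr2incorr)

print(total, accuracy, correction_rate)
# matches the figures above (1032061 tokens, ~96.58%, ~79.76%) up to rounding
```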
Alternatively, one can select and load a spell checker as follows:
from neuspell import SclstmChecker
checker = SclstmChecker()
checker = checker.add_("elmo", at="input") # "elmo" or "bert", "input" or "output"
checker.from_pretrained()
This feature of adding an ELMO or BERT model is currently supported only for selected models. See List of neural models in the toolkit for details.
If interested, follow Additional Requirements for installing the non-neural spell checkers Aspell and Jamspell.
Installation through pip
pip install neuspell
In v1.0, the allennlp library (required by the models that use ELMO) is not installed automatically. To use those checkers, do a source install as described in Installation & Quick Start.
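Before loading an ELMO-based checker from a pip install, it can help to check that the optional dependency is actually present; a small sketch (has_elmo_support is an illustrative helper, not part of neuspell's API):

```python
import importlib.util

def has_elmo_support() -> bool:
    """Return True if the optional allennlp dependency (needed by the
    ELMO-based checkers) is importable in the current environment."""
    return importlib.util.find_spec("allennlp") is not None

if not has_elmo_support():
    print("allennlp missing: do a source install with the [elmo] extra "
          "to use ELMO-based checkers")
```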
Toolkit
Introduction
NeuSpell is an open-source toolkit for context-sensitive spelling correction in English. The toolkit comprises 10 spell checkers, evaluated on naturally occurring misspellings from multiple (publicly available) sources. To make neural models for spell checking context dependent, (i) we train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings; and (ii) we use richer representations of the context.
This toolkit enables NLP practitioners to use our proposed and existing spelling correction systems via both a simple unified command line and a web interface. Among many potential applications, we demonstrate the utility of our spell checkers in combating adversarial misspellings.
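The synthetic-noising idea in (i) can be illustrated with a tiny character-level noiser that corrupts clean sentences into (noisy, clean) training pairs. This is a simplified sketch, not the toolkit's actual noising pipeline; all function names here are illustrative:

```python
import random

def noise_word(word: str, rng: random.Random) -> str:
    """Apply one random character-level edit (drop, swap, or replace)
    to an interior character of the word; short words pass through."""
    if len(word) < 3:
        return word
    i = rng.randrange(1, len(word) - 1)  # keep first/last chars intact
    op = rng.choice(["drop", "swap", "replace"])
    if op == "drop":
        return word[:i] + word[i + 1:]
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i + 1:]

def make_training_pair(sentence: str, p: float = 0.3, seed: int = 0):
    """Return a (noisy, clean) pair: each word is corrupted with probability p."""
    rng = random.Random(seed)
    noisy = [noise_word(w, rng) if rng.random() < p else w
             for w in sentence.split()]
    return " ".join(noisy), sentence
```

A model trained on many such pairs learns to map the noisy side back to the clean side; NeuSpell's actual noisers are richer (e.g. probabilistic, word-level replacements mined from real misspellings).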
Live demo available at http://neuspell.github.io/
<p align="center"> <br> <img src="https://github.com/neuspell/neuspell/blob/master/images/ui.png?raw=true" width="400"/> <br> </p>
List of neural models in the toolkit:
- CNN-LSTM
- SC-LSTM
- Nested-LSTM
- BERT
- SC-LSTM plus ELMO (at input)
- SC-LSTM plus ELMO (at output)
- SC-LSTM plus BERT (at input)
- SC-LSTM plus BERT (at output)
Performances
| Spell<br>Checker | Word<br>Correction <br>Rate | Time per<br>sentence <br>(in milliseconds) |
|-------------------------------------|-----------------------|--------------------------------------|
| Aspell | 48.7 | 7.3* |
| Jamspell | 68.9 | 2.6* |
| CNN-LSTM | 75.8 | 4.2 |
| SC-LSTM | 76.7 | 2.8 |
| Nested-LSTM | 77.3 | 6.4 |
| BERT | 79.1 | 7.1 |
| SC-LSTM plus ELMO (at input) | 79.8 | 15.8 |
| SC-LSTM plus ELMO (at output) | 78.5 | 16.3 |
| SC-LSTM plus BERT (at input) | 77.0 | 6.7 |
| SC-LSTM plus BERT (at output) | 76.0 | 7.2 |
Performance of different correctors in the NeuSpell toolkit on the BEA-60K dataset with real-world spelling mistakes. * indicates evaluation on a CPU (for others we use a GeForce RTX 2080 Ti GPU).
Download Checkpoints
To download selected checkpoints, pick a Checkpoint name from the table below and pass it to the download utility. Each checkpoint is associated with a neural spell checker as shown in the table.
| Spell Checker | Class | Checkpoint name | Disk space (approx.) |
|-------------------------------------|---------------------|-----------------------------|----------------------|
| CNN-LSTM | CnnlstmChecker | 'cnn-lstm-probwordnoise' | 450 MB |
| SC-LSTM | SclstmChecker | 'scrnn-probwordnoise' | 450 MB |
| Nested-LSTM | NestedlstmChecker | 'lstm-lstm-probwordnoise' | 455 MB |
| BERT | BertChecker | 'subwordbert-probwordnoise' | 740 MB |
| SC-LSTM plus ELMO (at input) | ElmosclstmChecker | 'elmoscrnn-probwordnoise' | 840 MB |
| SC-LSTM plus BERT (at input) | BertsclstmChecker | 'bertscrnn-probwordnoise' | 900 MB |
| SC-LSTM plus BERT (at output) | SclstmbertChecker | 'scrnnbert-probwordnoise' | 1.19 GB |
| SC-LSTM plus ELMO (at output) | SclstmelmoChecker | 'scrnnelmo-probwordnoise' | 1.23 GB |
import neuspell
neuspell.seq_modeling.downloads.download_pretrained_model("subwordbert-probwordnoise")
Alternatively, download all Neuspell neural models by running the following (available in versions after v1.0):
import neuspell
neuspell.seq_modeling.downloads.download_pretrained_model("_all_")
Datasets
We curate several synthetic and natural datasets for training and evaluating neuspell models. For full details, check our paper.