Checklist
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
CheckList
This repository contains code for testing NLP Models as described in the following paper:
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh
Association for Computational Linguistics (ACL), 2020
Bibtex for citations:
@inproceedings{checklist:acl20,
author = {Marco Tulio Ribeiro and Tongshuang Wu and Carlos Guestrin and Sameer Singh},
title = {Beyond Accuracy: Behavioral Testing of NLP models with CheckList},
booktitle = {Association for Computational Linguistics (ACL)},
year = {2020}
}
Installation
From pypi:
pip install checklist
jupyter nbextension install --py --sys-prefix checklist.viewer
jupyter nbextension enable --py --sys-prefix checklist.viewer
Note: --sys-prefix installs into Python's sys.prefix, which is useful in virtual environments such as conda or virtualenv. If you are not in such an environment, use --user instead to install into the user's home Jupyter directories.
From source:
git clone git@github.com:marcotcr/checklist.git
cd checklist
pip install -e .
Either way, you need to install PyTorch or TensorFlow if you want to use masked language model suggestions:
pip install torch
For most tutorials, you also need to download a spaCy model:
python -m spacy download en_core_web_sm
Tutorials
Please note that the visualizations are implemented as ipywidgets and do not work on Colab or JupyterLab (use Jupyter Notebook instead). Everything else should work in those environments.
- Generating data
- Perturbing data
- Test types, expectation functions, running tests
- The CheckList process
Paper tests
Notebooks: how we created the tests in the paper
Replicating paper tests, or running them with new models
For all of these, you need to unpack the release data (in the main repo folder after cloning):
tar xvzf release_data.tar.gz
Sentiment Analysis
Loading the suite:
import checklist
from checklist.test_suite import TestSuite
suite_path = 'release_data/sentiment/sentiment_suite.pkl'
suite = TestSuite.from_file(suite_path)
Running tests with precomputed BERT predictions (replace bert in pred_path with amazon, google, microsoft, or roberta to use the other models' predictions):
pred_path = 'release_data/sentiment/predictions/bert'
suite.run_from_file(pred_path, overwrite=True)
suite.summary() # or suite.visual_summary_table()
To test your own model, get predictions for the texts in release_data/sentiment/tests_n500 and save them in a file where each line has 4 numbers: the prediction (0 for negative, 1 for neutral, 2 for positive) and the prediction probabilities for (negative, neutral, positive).
Then, update pred_path with this file and run the lines above.
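As a rough sketch of the file format described above, the helper below turns per-class probabilities into one line per example. The function names, the space separator, and the six-decimal formatting are assumptions made for illustration, not part of CheckList; check them against what run_from_file accepts in your installed version.

```python
from typing import Iterable, Sequence


def format_prediction_line(probs: Sequence[float]) -> str:
    """Return 'label p_negative p_neutral p_positive' for one example.

    The label is the argmax over (negative, neutral, positive), i.e. 0, 1 or 2.
    Space separation is an assumption, not something the README specifies.
    """
    if len(probs) != 3:
        raise ValueError("expected probabilities for negative, neutral, positive")
    label = max(range(3), key=lambda i: probs[i])
    return " ".join([str(label)] + [f"{p:.6f}" for p in probs])


def write_predictions(path: str, all_probs: Iterable[Sequence[float]]) -> None:
    """Write one formatted line per example, in the same order as the test texts."""
    with open(path, "w", encoding="utf-8") as f:
        for probs in all_probs:
            f.write(format_prediction_line(probs) + "\n")


# Hypothetical model output for two texts.
example_probs = [[0.1, 0.2, 0.7], [0.8, 0.15, 0.05]]
example_lines = [format_prediction_line(p) for p in example_probs]
print(example_lines)
```

The lines must follow the order of the texts in the test file, since predictions are matched to examples by position.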
QQP
import checklist
from checklist.test_suite import TestSuite
suite_path = 'release_data/qqp/qqp_suite.pkl'
suite = TestSuite.from_file(suite_path)
Running tests with precomputed BERT predictions (replace bert in pred_path with roberta if you want):
pred_path = 'release_data/qqp/predictions/bert'
suite.run_from_file(pred_path, overwrite=True, file_format='binary_conf')
suite.visual_summary_table()
To test your own model, get predictions for pairs in release_data/qqp/tests_n500 (format: tsv) and output them in a file where each line has a single number: the probability that the pair is a duplicate.
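A minimal sketch of producing such a file, assuming you already have one duplicate probability per pair, in the same order as the test file. The helper name, the output filename, and the range check are illustrative additions, not part of CheckList.

```python
import os
import tempfile
from typing import Iterable


def write_duplicate_probs(path: str, probs: Iterable[float]) -> int:
    """Write one probability per line and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for p in probs:
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"not a probability: {p}")
            f.write(f"{p}\n")
            count += 1
    return count


# Hypothetical probabilities for three question pairs.
out_path = os.path.join(tempfile.mkdtemp(), "my_qqp_predictions.txt")
n_written = write_duplicate_probs(out_path, [0.93, 0.08, 0.5])
with open(out_path, encoding="utf-8") as f:
    written = f.read().splitlines()
print(n_written, written)
```

The resulting path would then take the place of pred_path in the lines above, with file_format='binary_conf' kept as is.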
SQuAD
import checklist
from checklist.test_suite import TestSuite
suite_path = 'release_data/squad/squad_suite.pkl'
suite = TestSuite.from_file(suite_path)
Running tests with precomputed bert predictions:
pred_path = 'release_data/squad/predictions/bert'
suite.run_from_file(pred_path, overwrite=True, file_format='pred_only')
suite.visual_summary_table()
To test your own model, get predictions for pairs in release_data/squad/squad.jsonl (format: jsonl) or release_data/squad/squad.json (format: json, like SQuAD dev) and output them in a file where each line has a single string: the prediction span.
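Because each line holds exactly one predicted span, a span that contains a newline would shift every later prediction by one example. The sketch below shows one way to guard against that; the helper name and the sample spans are hypothetical.

```python
from typing import Iterable, List


def spans_to_lines(spans: Iterable[str]) -> List[str]:
    """Collapse internal whitespace so each predicted span occupies exactly one line."""
    return [" ".join(span.split()) for span in spans]


# Hypothetical predicted spans, one per question, in test order.
predicted = ["Denver Broncos", "in  the\nlate 1990s", ""]
lines = spans_to_lines(predicted)
print("\n".join(lines))
```

An empty prediction still produces a (blank) line, so the number of lines stays equal to the number of questions.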
Testing huggingface transformer pipelines
See this notebook.
Code snippets
Templates
See 1. Generating data for more details.
import checklist
from checklist.editor import Editor
import numpy as np
editor = Editor()
ret = editor.template('{first_name} is {a:profession} from {country}.',
profession=['lawyer', 'doctor', 'accountant'])
np.random.choice(ret.data, 3)
['Mary is a doctor from Afghanistan.',
'Jordan is an accountant from Indonesia.',
'Kayla is a lawyer from Sierra Leone.']
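To make the template mechanics concrete, here is a toy re-implementation in plain Python. It is not CheckList's code: fill_template and add_article are hypothetical helpers, the article rule is a simple first-letter check, and the first_name and country lists are passed in explicitly, whereas the Editor takes them from its built-in lexicons. It assumes that a template expands to every combination of its fill-in values.

```python
import itertools
import re
from typing import Dict, List


def add_article(word: str) -> str:
    """Prefix 'a' or 'an' using a simple first-letter vowel rule (a simplification)."""
    return ("an " if word[0].lower() in "aeiou" else "a ") + word


def fill_template(template: str, **lexicon: List[str]) -> List[str]:
    """Expand every combination of fill-ins for the placeholders in the template."""
    # Placeholders look like {key} or {a:key}; collect the distinct keys in order.
    keys = list(dict.fromkeys(re.findall(r"\{(?:a:)?(\w+)\}", template)))
    results = []
    for combo in itertools.product(*(lexicon[k] for k in keys)):
        values: Dict[str, str] = dict(zip(keys, combo))
        text = re.sub(
            r"\{(a:)?(\w+)\}",
            lambda m: add_article(values[m.group(2)]) if m.group(1) else values[m.group(2)],
            template,
        )
        results.append(text)
    return results


sentences = fill_template(
    "{first_name} is {a:profession} from {country}.",
    first_name=["Mary", "Jordan"],
    profession=["lawyer", "accountant"],
    country=["Indonesia"],
)
print(sentences)
```

The {a:profession} form is what lets one template yield both "a lawyer" and "an accountant" without writing two templates.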
RoBERTa suggestions
See 1. Generating data for more details.
In template:
ret = editor.template('This is {a:adj} {mask}.',
adj=['good', 'bad', 'great', 'terrible'])
ret.data[:3]
['This is a good idea.',
'This is a good sign.',
'This is a good thing.']
Multiple masks:
ret = editor.template('This is {a:adj} {mask} {mask}.',
adj=['good', 'bad', 'great', 'terrible'])
ret.data[:3]
['This is a good history lesson.',
'This is a good chess move.',
'This is a good news story.']
Getting suggestions rather than filling out templates:
editor.suggest('This is {a:adj} {mask}.',
adj=['good', 'bad', 'great', 'terrible'])[:5]
['idea', 'sign', 'thing', 'example', 'start']
Getting suggestions for replacements (only a single text allowed, no templates):
editor.suggest_replace('This is a good movie.', 'good')[:5]
['great', 'horror', 'bad', 'terrible', 'cult']
Getting suggestions through jupyter visualization:
editor.visual_suggest('This is {a:mask} movie.')

Multilingual suggestions
Just initialize the editor with the language argument (this should work with both language names and ISO 639-1 codes):
import checklist
from checklist.editor import Editor
import numpy as np
# in Portuguese
editor = Editor(language='portuguese')
ret = editor.template('O João é um {mask}.',)
ret.data[:3]
['O João é um português.',
'O João é um poeta.',
'O João é um brasileiro.']
# in Chinese
editor = Editor(language='chinese')
ret = editor.template('西游记的故事很{mask}。',)
ret.data[:3]
['西游记的故事很精彩。',
'西游记的故事很真实。',
'西游记的故事很经典。']
We're using FlauBERT for French, German BERT for German, and XLM-RoBERTa for everything else (see the XLM-RoBERTa documentation for a list of supported languages). We can't vouch for the quality of the suggestions in other languages, but they seem to work reasonably well for the languages we speak (although not as well as in English).
Lexicons (somewhat multilingual)
editor.lexicons is a dictionary, which can be used in templates. For example:
import checklist
from checklist.editor import Editor
import numpy as np
# Default: English
editor = Editor()
ret = editor.template('{male1} went to see {male2} in {city}.', remove_duplicates=True)
list(np.random.choice(ret.data, 3))
['Dan went to see Hugh in Riverside.',
'Stephen went to see Eric in Omaha.',
'Patrick went to see Nick in Kansas City.']
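The numbered placeholders {male1} and {male2} both draw from the same male lexicon. The toy sketch below illustrates that idea in plain Python, assuming remove_duplicates drops any fill in which two placeholders receive the same value. fill_numbered and toy_lexicons are hypothetical names, and this is not CheckList's implementation.

```python
import itertools
import re
from typing import Dict, List


def fill_numbered(template: str, lexicons: Dict[str, List[str]],
                  remove_duplicates: bool = False) -> List[str]:
    """Fill placeholders such as {male1} and {male2} from a shared 'male' lexicon."""
    keys = list(dict.fromkeys(re.findall(r"\{(\w+)\}", template)))
    # Strip a trailing number to find the lexicon: 'male2' -> 'male'.
    pools = [lexicons[re.sub(r"\d+$", "", k)] for k in keys]
    results = []
    for combo in itertools.product(*pools):
        if remove_duplicates and len(set(combo)) < len(combo):
            continue  # e.g. skip 'Dan went to see Dan in Omaha.'
        results.append(template.format(**dict(zip(keys, combo))))
    return results


toy_lexicons = {"male": ["Dan", "Hugh"], "city": ["Omaha"]}
template = "{male1} went to see {male2} in {city}."
with_dups = fill_numbered(template, toy_lexicons)
without_dups = fill_numbered(template, toy_lexicons, remove_duplicates=True)
print(without_dups)
```

Without the filter, the toy version also produces sentences in which a person visits himself, which is rarely what a test case intends.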
Person names and location (country, city) names are multilingual, depending on the editor language. We got the data from Wikidata, so there is a bias towards names on Wikipedia.
editor = Editor(language='german')
ret = editor.template('{male1} went to see {male2} in {city}.', remove_dup