ConvoKit
<!-- ALL-CONTRIBUTORS-BADGE:START - Do not remove or modify this section --> <!-- ALL-CONTRIBUTORS-BADGE:END -->
This toolkit contains tools to extract conversational features and analyze social phenomena in conversations, using a single unified interface inspired by (and compatible with) scikit-learn. Several large conversational datasets are included together with scripts exemplifying the use of the toolkit on these datasets. The latest version is 4.1.0 (released Mar. 10, 2026); follow the project on GitHub to keep track of updates.
Join our Discord community to stay informed, connect with fellow developers, and be part of an engaging space where we share progress, discuss features, and tackle issues together.
Read our documentation or try ConvoKit in our interactive tutorial.
The toolkit currently implements features for:
Linguistic coordination <sub><sup>(API)</sup></sub>
A measure of linguistic influence (and relative power) between individuals or groups based on their use of function words. Example: exploring the balance of power in the U.S. Supreme Court.
Politeness strategies <sub><sup>(API)</sup></sub>
A set of lexical and parse-based features correlating with politeness and impoliteness. Example: understanding the (mis)use of politeness strategies in conversations gone awry on Wikipedia.
Expected Conversational Context Framework <sub><sup>(API)</sup></sub>
A framework for characterizing utterances and terms based on their expected conversational context, consisting of model implementations and wrapper pipelines. Examples: deriving question types and other characterizations in British parliamentary question periods, exploring the Switchboard dialog acts corpus, examining Wikipedia talk page discussions, and computing the orientation of justice utterances in the U.S. Supreme Court.
<!-- ### [Prompt types](http://www.cs.cornell.edu/~cristian/Asking_too_much.html) <sub><sup>[(API)](https://convokit.cornell.edu/documentation/promptTypes.html)</sup></sub> An unsupervised method for grouping utterances and utterance features by their rhetorical role. Examples: [extracting question types in the U.K. parliament](https://github.com/CornellNLP/ConvoKit/blob/master/examples/prompt-types/prompt-type-wrapper-demo.ipynb), [extended version demonstrating additional functionality](https://github.com/CornellNLP/ConvoKit/blob/master/examples/prompt-types/prompt-type-demo.ipynb), [understanding the use of conversational prompts in conversations gone awry on Wikipedia](https://github.com/CornellNLP/ConvoKit/blob/master/examples/conversations-gone-awry/Conversations_Gone_Awry_Prediction.ipynb). Also includes functionality to extract surface motifs to represent utterances, used in the above paper [(API)](https://convokit.cornell.edu/documentation/phrasingMotifs.html). -->
Hypergraph conversation representation <sub><sup>(API)</sup></sub>
A method for extracting structural features of conversations through a hypergraph representation. Example: hypergraph creation and feature extraction, visualization and interpretation on a subsample of Reddit.
Linguistic diversity in conversations <sub><sup>(API)</sup></sub>
A method to compute the linguistic diversity of individuals within their own conversations and between other individuals in a population. Example: speaker conversation attributes and diversity example on ChangeMyView.
CRAFT: Online forecasting of conversational outcomes <sub><sup>(API)</sup></sub>
A neural model for forecasting future outcomes of conversations (e.g., derailment into personal attacks) as they develop. Available as an interactive notebook: full version (fine-tuning + inference) or inference-only.
Redirection and Utterance Likelihood <sub><sup>(API)</sup></sub>
Methods to compute the extent to which utterances redirect the flow of the conversation (Redirection) and to measure the log-likelihoods of utterances given a defined conversation context (Utterance Likelihood). Example: redirection in Supreme Court oral arguments.
Pivotal Moment Measure <sub><sup>(API)</sup></sub>
A method to identify pivotal moments in conversations. Example: pivotal moments in conversations gone awry.
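As a rough illustration of the scikit-learn-style interface, the sketch below chains two transformers to annotate a corpus with politeness strategies. It assumes ConvoKit and a spaCy English model are installed. The helper names `annotate_politeness` and `active_strategies` are hypothetical, and the use of `TextParser` and `PolitenessStrategies` with default arguments follows the ConvoKit documentation rather than anything stated in this README, so check the API pages for your version.

```python
def annotate_politeness(corpus):
    """Parse a Corpus, then annotate each utterance with politeness strategies.

    Hypothetical helper for illustration. ConvoKit is imported inside the
    function so that defining the helper does not require the package.
    """
    from convokit import PolitenessStrategies, TextParser

    # Politeness features are partly parse-based, so run the parser first.
    # TextParser relies on a spaCy English model being installed.
    corpus = TextParser().transform(corpus)

    # Like the parser, this transformer takes a Corpus and returns it
    # with additional utterance-level metadata attached.
    corpus = PolitenessStrategies().transform(corpus)
    return corpus


def active_strategies(features):
    """Return the sorted names of the features whose value is 1."""
    return sorted(name for name, value in features.items() if value == 1)
```

After `annotate_politeness(corpus)`, each utterance's `meta["politeness_strategies"]` is expected to hold a dict of binary features (assuming the default attribute name), which can be passed to `active_strategies` to list the strategies that fired.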
Datasets
ConvoKit ships with several datasets ready for use "out-of-the-box".
These datasets can be downloaded using the convokit.download() helper function. Alternatively, you can access them directly here.
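A minimal sketch of the download workflow, assuming ConvoKit is installed (`pip install convokit`) and a network connection is available. `convokit.download()` and the `movie-corpus` name come from this README. The `load_corpus` and `main` helpers are hypothetical, and loading the downloaded path through `Corpus(filename=...)` is the pattern used in the ConvoKit documentation rather than something this section states.

```python
def load_corpus(name):
    """Download a ConvoKit dataset by its download name and load it as a Corpus.

    Hypothetical helper for illustration. ConvoKit is imported inside the
    function so that defining the helper does not require the package;
    calling it does.
    """
    from convokit import Corpus, download

    # download() fetches the dataset and returns its local path.
    path = download(name)
    return Corpus(filename=path)


def main():
    # "movie-corpus" is the download name of the Cornell Movie-Dialogs Corpus.
    corpus = load_corpus("movie-corpus")
    # Print the number of speakers, utterances, and conversations.
    corpus.print_summary_stats()
```

Calling `main()` downloads the corpus on first use, so the first run can take a while for the larger datasets. Any other "Name for download" listed in this section can be passed to `load_corpus` in the same way.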
Conversations Gone Awry Datasets (Wikipedia/CMV)
Three related corpora of conversations that derail into antisocial behavior. One corpus (CGA-WIKI) consists of Wikipedia talk page conversations that derail into personal attacks as labeled by crowdworkers (4,188 conversations containing 30,021 comments). Another (CGA-CMV) consists of discussion threads on the subreddit ChangeMyView (CMV) that derail into rule-violating behavior as determined by the presence of a moderator intervention (6,842 conversations containing 42,964 comments). The last (CGA-CMV-Large) is a recent expansion of the CGA-CMV dataset, now containing 19,578 conversations and 116,793 utterances.
Name for download: conversations-gone-awry-corpus (for CGA-WIKI), conversations-gone-awry-cmv-corpus (for CGA-CMV), and conversations-gone-awry-cmv-corpus-large (for CGA-CMV-Large)
Cornell Movie-Dialogs Corpus
A large metadata-rich collection of fictional conversations extracted from raw movie scripts (220,579 conversational exchanges between 10,292 pairs of movie characters in 617 movies).
Name for download: movie-corpus
Parliament Question Time Corpus
Parliamentary question periods from May 1979 to December 2016 (216,894 question-answer pairs).
Name for download: parliament-corpus
Supreme Court Corpus
A collection of conversations from the U.S. Supreme Court Oral Arguments.
Name for download: supreme-corpus
Wikipedia Talk Pages Corpus
A medium-size collection of conversations from Wikipedia editors' talk pages.
Name for download: wiki-corpus
Tennis Interviews
Transcripts for tennis singles
