ConvoKit

This toolkit contains tools to extract conversational features and analyze social phenomena in conversations, using a single unified interface inspired by (and compatible with) scikit-learn. Several large conversational datasets are included together with scripts exemplifying the use of the toolkit on these datasets. The latest version is 4.1.0 (released Mar. 10, 2026); follow the project on GitHub to keep track of updates.

Join our Discord community to stay informed, connect with fellow developers, and be part of an engaging space where we share progress, discuss features, and tackle issues together.

Read our documentation or try ConvoKit in our interactive tutorial.

The toolkit currently implements features for:

Linguistic coordination (API)

A measure of linguistic influence (and relative power) between individuals or groups based on their use of function words. Example: exploring the balance of power in the U.S. Supreme Court.
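To make the idea concrete, here is a minimal, self-contained sketch of a coordination-style score. It is not ConvoKit's implementation: it assumes a common simplified formulation in which coordination on a marker class is the probability that a reply uses the marker given that the utterance it answers used it, minus the overall probability that a reply uses the marker. The names `ARTICLES`, `uses_marker`, and `coordination`, and the toy exchanges, are all illustrative.

```python
from typing import Iterable, Optional, Tuple

# Illustrative function-word class (assumption: articles only).
ARTICLES = frozenset({"a", "an", "the"})


def uses_marker(utterance: str, marker_words=ARTICLES) -> bool:
    """Return True if any whitespace token, stripped of punctuation, is a marker word."""
    tokens = utterance.lower().split()
    return any(tok.strip(".,!?;:") in marker_words for tok in tokens)


def coordination(
    exchanges: Iterable[Tuple[str, str]], marker_words=ARTICLES
) -> Optional[float]:
    """Simplified coordination score over (target_utterance, reply) pairs.

    Computes P(reply uses marker | target used marker) - P(reply uses marker).
    Returns None when the score is undefined (no exchanges, or the target
    never used the marker).
    """
    exchanges = list(exchanges)
    triggered = [(t, r) for t, r in exchanges if uses_marker(t, marker_words)]
    if not exchanges or not triggered:
        return None
    p_reply = sum(uses_marker(r, marker_words) for _, r in exchanges) / len(exchanges)
    p_reply_given_target = (
        sum(uses_marker(r, marker_words) for _, r in triggered) / len(triggered)
    )
    return p_reply_given_target - p_reply


# Toy data: each pair is (utterance being replied to, reply).
exchanges = [
    ("Is the brief ready?", "The brief is ready."),
    ("Did you read the filing?", "Yes, I read the filing."),
    ("Why now?", "Because the deadline moved."),
    ("Any questions?", "None."),
]
score = coordination(exchanges)
```

A positive score means the replier uses articles more often right after the target used them than they do in replies overall. In practice, scores are typically aggregated over several function-word classes and many speaker pairs before drawing conclusions about relative power.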

Politeness strategies (API)

A set of lexical and parse-based features correlating with politeness and impoliteness. Example: understanding the (mis)use of politeness strategies in conversations gone awry on Wikipedia.

Expected Conversational Context Framework (API)

A framework for characterizing utterances and terms based on their expected conversational context, consisting of model implementations and wrapper pipelines. Examples: deriving question types and other characterizations in British parliamentary question periods, exploring the Switchboard Dialog Act Corpus, examining Wikipedia talk page discussions, and computing the orientation of justice utterances in the U.S. Supreme Court.

Hypergraph conversation representation (API)

A method for extracting structural features of conversations through a hypergraph representation. Example: hypergraph creation and feature extraction, visualization and interpretation on a subsample of Reddit.

Linguistic diversity in conversations (API)

A method to compute the linguistic diversity of individuals, both within their own conversations and relative to other individuals in a population. Example: speaker conversation attributes and diversity on ChangeMyView.

CRAFT: Online forecasting of conversational outcomes (API)

A neural model for forecasting future outcomes of conversations (e.g., derailment into personal attacks) as they develop. Available as an interactive notebook: full version (fine-tuning + inference) or inference-only.

Redirection and Utterance Likelihood (API)

Methods to compute the extent to which utterances redirect the flow of a conversation (Redirection) and to measure the log-likelihoods of utterances given a defined conversation context (Utterance Likelihood). Example: redirection in Supreme Court oral arguments.

Pivotal Moment Measure (API)

A method to identify pivotal moments in conversations. Example: pivotal moments in conversations gone awry.

Datasets

ConvoKit ships with several datasets ready for use "out-of-the-box". These datasets can be downloaded using the convokit.download() helper function. Alternatively, you can access them directly here.

Conversations Gone Awry Datasets (Wikipedia/CMV)

Three related corpora of conversations that derail into antisocial behavior. One corpus (CGA-WIKI) consists of Wikipedia talk page conversations that derail into personal attacks as labeled by crowdworkers (4,188 conversations containing 30,021 comments). Another (CGA-CMV) consists of discussion threads on the subreddit ChangeMyView (CMV) that derail into rule-violating behavior as determined by the presence of a moderator intervention (6,842 conversations containing 42,964 comments). The last (CGA-CMV-Large) is a recent expansion of the CGA-CMV dataset, now containing 19,578 conversations and 116,793 utterances. Names for download: conversations-gone-awry-corpus (for CGA-WIKI), conversations-gone-awry-cmv-corpus (for CGA-CMV), and conversations-gone-awry-cmv-corpus-large (for CGA-CMV-Large).
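As a rough sketch of how the download names above might be used, the snippet below maps the short labels to their download names and wraps the usual load pattern in a helper. The `CGA_DOWNLOAD_NAMES` dictionary and the `load_cga_corpus` helper are hypothetical conveniences introduced here, not part of ConvoKit. The helper assumes ConvoKit is installed and that `Corpus(filename=download(name))` is the loading pattern, so check the documentation for the version you have.

```python
# Download names as listed above, keyed by the short labels used in the text.
CGA_DOWNLOAD_NAMES = {
    "CGA-WIKI": "conversations-gone-awry-corpus",
    "CGA-CMV": "conversations-gone-awry-cmv-corpus",
    "CGA-CMV-Large": "conversations-gone-awry-cmv-corpus-large",
}


def load_cga_corpus(label: str):
    """Hypothetical helper: download (if needed) and load one CGA corpus.

    Assumes `pip install convokit` and network access on first use.
    The import is done lazily so the name mapping can be used without ConvoKit.
    """
    from convokit import Corpus, download

    return Corpus(filename=download(CGA_DOWNLOAD_NAMES[label]))


# Example usage (downloads data on first call, so it is left commented out):
# corpus = load_cga_corpus("CGA-WIKI")
# corpus.print_summary_stats()
```

Keeping the names in one mapping makes it easy to run the same analysis across all three corpora by iterating over the labels.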
