
MACE: Multi-Annotator Competence Estimation

Ask 5 people to label or rate something, and you will likely get several different answers. But for ML (and many other applications), you usually need a single aggregated answer. Taking the majority vote is easy, but often wrong. Disagreement isn't noise; it's information: it can mean the item is genuinely hard, or that someone wasn't paying attention.

MACE is an Expectation-Maximization (EM)-based algorithm that uses variational inference with Bayesian priors to simultaneously:

  • Learn the most likely aggregate labels for items from multiple annotators
  • Estimate the competence (reliability) of each annotator
  • Model how difficult each item is

It models annotators as either "knowing" the correct answer or "guessing" according to some strategy. (That assumes there is one correct answer. In pluralistic cases, where several answers can be correct, try setting beta > alpha and output distributions.)
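The "knowing vs. guessing" idea can be sketched as a tiny simulation. This is illustrative code, not part of MACE: the uniform guessing distribution below is a simplification, since MACE actually learns a per-annotator guessing strategy.

```python
import random

def simulate_annotation(true_label, competence, labels, rng):
    # MACE's generative story (sketch): with probability `competence`
    # the annotator "knows" and copies the true label; otherwise they
    # "guess" from a strategy distribution (uniform here for simplicity).
    if rng.random() < competence:
        return true_label
    return rng.choice(labels)

rng = random.Random(0)
labels = ["NOUN", "VERB", "ADJ"]
anns = [simulate_annotation("NOUN", 0.9, labels, rng) for _ in range(1000)]
print(anns.count("NOUN") / 1000)  # close to 0.9 + 0.1/3, since guesses can also be correct
```

Inverting this story with EM is what lets MACE separate reliable annotators from guessers, rather than giving every vote equal weight.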

Features

  • ✅ Supports discrete categorical labels (default) and continuous numeric values
  • ✅ Can incorporate control items (known ground truth) for semi-supervised learning
  • ✅ Allows specifying label priors (if known)
  • ✅ Provides confidence estimates via entropy calculations
  • ✅ Optional distribution output shows full probability distributions
  • ✅ Handles missing annotations (empty cells in CSV)

Installation

Requirements

  • Python 3.14 or higher
  • NumPy
  • SciPy

Install Dependencies

pip install numpy scipy

Usage

Basic Command

python3 mace.py [options] <CSV input file>

Options

| Option | Description |
|--------|-------------|
| --help | Display help information |
| --version | Display version information |
| --alpha <FLOAT> | First hyperparameter of the beta prior for Variational Bayes EM (the default method). alpha > beta means we assume most annotators are unreliable. Default: 0.5 |
| --beta <FLOAT> | Second hyperparameter of the beta prior for Variational Bayes EM (the default method). beta > alpha means we assume most annotators are reliable. Default: 0.5 |
| --continuous | Interpret data values as continuous numeric values (returns averages weighted by competence) |
| --controls <FILE> | File with control items (i.e., known ground-truth labels) for semi-supervised learning. Each line corresponds to one item, so the number of lines MUST match the input CSV file. Control items usually improve accuracy. |
| --distribution | Output full probability distributions instead of single predictions in '[prefix.]prediction' |
| --em | Use regular EM (Maximum Likelihood Estimation) instead of Variational Bayes EM (the default). Performance is usually worse than Variational Bayes. |
| --entropies | Write entropy values (an uncertainty measure) for each item to a separate file '[prefix.]entropies' |
| --headers | Add header rows to output files describing column contents |
| --iterations <INT> | Number of EM iterations per restart (1-1000). Default: 50 |
| --prefix <STRING> | Prefix for output files (e.g., out.prediction) |
| --priors <FILE> | File with label priors (tab-separated "label\tweight" pairs). All labels in the data must be covered. Weights will be normalized to probabilities. |
| --restarts <N> | Number of random restarts (1-1000). More restarts can find better solutions. Default: 10 |
| --smoothing <FLOAT> | Smoothing parameter added to fractional counts for regular EM. Default: 0.01/num_labels |
| --test <FILE> | Test file with gold-standard labels for evaluation (reports accuracy or RMSE). Each line corresponds to one item in the CSV file, so the number of lines must match. |
| --threshold <FLOAT> | Entropy threshold (0.0-1.0). Filter out uncertain instances by returning only the top n%. Default: 1.0 |

Input Files

1. Input Format

The main input file with the annotations must be a comma-separated (CSV) file where:

  • Each row represents one instance/item (rows can be empty, for example to separate input blocks, and to enable sequence labeling/time step prediction).
  • Each column represents one annotator (empty cells indicate missing annotations)
  • The file should be encoded in UTF-8 and use standard newline characters to avoid parsing problems

Note: Make sure the last line has a line break.

Example Input (Discrete Labels with 5 Annotators and Empty Line)

NOUN,,,NOUN,PRON
VERB,VERB,,VERB,

ADJ,,ADJ,,ADV
,VERB,,VERB,ADV
NOUN,,,NOUN,PRON

Example Input (Continuous Values)

3.5,4.2,,,3.8,3.9,4.1
,,4.0,4.5,,3.7,3.6,4.3
4.1,3.9,3.8,4.2,,4.0,,3.7

2. Label Priors

By default, MACE uses a uniform prior over labels (1/num_labels for each label). The priors file is optional; it gives the a-priori prevalence of the individual labels, if known. Supply it to MACE with --priors <FILE>. The file must list every label, one per line, followed by a tab and its weight, probability, or frequency (MACE normalizes these automatically).

  • Format: Tab-separated "label\tweight" pairs, one per line
  • Normalization: Weights are automatically normalized to sum to 1.0
  • Validation: All labels in the data must be present in the priors file
  • Usage: Priors are used in the E-step to compute gold label marginals

Example Input (Discrete Labels)

NOUN	30
VERB	30
ADJ	20
ADV	10
PRON	10
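The normalization MACE applies to these weights can be sketched in a few lines (the file contents are inlined here for illustration):

```python
# Parse priors ("label\tweight" per line) and normalize the weights
# to probabilities, as MACE does internally.
lines = ["NOUN\t30", "VERB\t30", "ADJ\t20", "ADV\t10", "PRON\t10"]
pairs = [line.split("\t") for line in lines]
total = sum(float(weight) for _, weight in pairs)
priors = {label: float(weight) / total for label, weight in pairs}
print(priors["NOUN"])  # 0.3
```

Because of the normalization, it does not matter whether you supply raw counts, percentages, or probabilities.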

3. Control Items

If we know the correct answer for some items, we can include control items via --controls <FILE>. This helps MACE assess annotator reliability in semi-supervised learning. The file with control items needs to have the same number of lines as the input file, with the correct labels specified for the control items.

Example Input (Discrete Labels):

PRON




NOUN
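Blank lines in the controls file simply mean "no known label for this item". A sketch of that convention, using the example above:

```python
# One line per input item; blank lines mean the item has no control label.
control_lines = ["PRON", "", "", "", "", "NOUN"]
controls = {i: label for i, label in enumerate(control_lines) if label}
print(controls)  # {0: 'PRON', 5: 'NOUN'}
```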

4. Test File

If we know all answers and only want to get the performance for MACE, we can supply a test file via --test <FILE>. This file must have the same number of lines as the input file. MACE will output an accuracy score.

Example Input (Discrete Labels)

PRON
VERB

ADJ
VERB
NOUN

Output Files

MACE generates the following output files:

1. Predictions (<prefix>.prediction)

  • Discrete mode: One label per line (most likely label for each instance)
  • Continuous mode: One weighted average per line
  • Distribution mode: Tab-separated distributions (see --distribution option)
  • Empty lines indicate instances filtered by threshold or with no input annotations

This file has the same number of lines as the input file.

Example Output

NOUN
VERB

ADJ
VERB
NOUN

If you set --distribution, each line contains the distribution over answer values, sorted by decreasing probability.

Example Output

NOUN 0.9997443833265887	PRON 7.140381903855615E-5	ADJ 6.140428479093134E-5	VERB 6.140428479093134E-5	ADV 6.140428479093134E-5
VERB 0.9999961943848287	NOUN 9.514037928812883E-7	ADJ 9.514037928812883E-7	PRON 9.514037928812883E-7	ADV 9.514037928812883E-7

ADJ 0.9990184050335877	ADV 2.741982824057974E-4	NOUN 2.3579889466878394E-4	VERB 2.3579889466878394E-4	PRON 2.3579889466878394E-4
VERB 0.9994950838119411	ADV 1.4104305366466138E-4	NOUN 1.2129104479807625E-4	ADJ 1.2129104479807625E-4	PRON 1.2129104479807625E-4
NOUN 0.9997443833265887	PRON 7.140381903855615E-5	ADJ 6.140428479093134E-5	VERB 6.140428479093134E-5	ADV 6.140428479093134E-5

2. Competence Scores (<prefix>.competence)

  • One line with tab-separated values
  • Each value (0-1) represents the reliability of one annotator
  • Higher values = more reliable annotator

Example Output

0.8820970950608722	0.7904155783217401	0.6598575839917008	0.8822161621354134	0.03114062354821738

Here, the first four annotators are fairly reliable, but the fifth is not.
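The competence file is easy to post-process. For example, to flag the least reliable annotator (illustrative snippet; values truncated from the example above):

```python
# Parse the single tab-separated line of a .competence file and find
# the least reliable annotator (0-based column index).
line = "0.8821\t0.7904\t0.6599\t0.8822\t0.0311"
competences = [float(x) for x in line.split("\t")]
worst = min(range(len(competences)), key=competences.__getitem__)
print(worst)  # annotator 4 (the fifth column)
```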

3. Entropies (<prefix>.entropies) - Optional

  • One entropy value per line (if --entropies is used)
  • Higher entropy = more uncertainty/disagreement among annotators, often more difficult items
  • Lower entropy = high confidence/agreement

This will output a file with the same number of lines as the input file.

Example Output

0.0027237895900081095
5.657170773284981E-5

0.009138546784668605
0.005036498835041038
0.0027237895900081095

Here, the first line after the break is the most difficult.
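These values appear to be the Shannon entropies (natural log) of the predicted label distributions; the first value above can be reproduced from the first line of the --distribution example:

```python
import math

# Shannon entropy (natural log) of the first predicted distribution
# from the --distribution example output.
dist = [0.9997443833265887, 7.140381903855615e-5, 6.140428479093134e-5,
        6.140428479093134e-5, 6.140428479093134e-5]
H = -sum(p * math.log(p) for p in dist if p > 0)
print(H)  # ≈ 0.0027238, matching the first .entropies line
```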

Examples

Basic Usage

# Evaluate annotations and write output to "prediction" and "competence"
python3 mace.py data/examples/example.csv

With Custom Prefix

# Write output to "out.prediction" and "out.competence"
python3 mace.py --prefix out data/examples/example.csv

Test Evaluation

# Evaluate against gold standard and print accuracy
python3 mace.py --test data/examples/example.key data/examples/example.csv
# Output: Coverage: 1.0  Accuracy on test set: 0.81

Filter Uncertain Instances

# Only predict for top 90% most confident instances
python3 mace.py --threshold 0.9 data/examples/example.csv
# Output: Coverage: 0.91  Accuracy on test set: 0.8571428571428571
# Improves accuracy at the expense of coverage
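The thresholding can be thought of as: sort items by entropy and keep only the most confident fraction. This is a sketch of the idea (MACE's exact rounding and tie-breaking may differ), using the entropy values from the example above:

```python
# Keep only the lowest-entropy (most confident) fraction of items.
entropies = [0.0027238, 5.657e-5, 0.0091385, 0.0050365, 0.0027238]
threshold = 0.8  # keep the top 80% most confident items
keep = int(round(threshold * len(entropies)))
cutoff = sorted(entropies)[keep - 1]
predicted = [e <= cutoff for e in entropies]
print(predicted)  # the highest-entropy item is filtered out
```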

Continuous Numeric Values

# Process numeric scores, return weighted averages
python3 mace.py --continuous data/examples/example.csv

# With test evaluation (uses RMSE instead of accuracy)
python3 mace.py --continuous --test data/examples/example.key data/examples/example.csv
# Output: RMSE on test set: 0.7520364577656833
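In continuous mode, the prediction for an item is, in essence, a competence-weighted average of its annotations. A sketch with hypothetical competence values (the real weights are learned by MACE):

```python
# Competence-weighted average for one item's continuous annotations.
values = [3.5, 4.2, 3.8, 3.9, 4.1]        # one item's annotations
competence = [0.9, 0.8, 0.7, 0.95, 0.1]   # hypothetical learned weights
pred = sum(w * v for w, v in zip(competence, values)) / sum(competence)
print(round(pred, 3))  # 3.851
```

Note how the unreliable fifth annotator's value (4.1) barely moves the result.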

Distribution Output

# Get full probability distributions for each instance
python3 mace.py --distribution data/examples/example.csv

# Discrete: "3 0.9992656061207903	0 0.00032738278917563427	2 0.00020579145942908883	1 0.00020121963060488852"
# Continuous: "2.7333598926512326	0.8278241191907356	0.0	3.0	10" (mean, std, min, max, n_annotators)
