MACE
Multi-Annotator Competence Estimation tool
Ask five people to label or rate something, and you will likely get several different answers. But for ML (and many other applications), you usually need a single aggregated answer. Taking the majority vote is easy, but often wrong. Disagreement isn't noise; it's information. It can mean the item is genuinely hard, or that someone wasn't paying attention.
MACE is an Expectation-Maximization (EM)-based algorithm that uses variational inference with Bayesian priors to simultaneously:
- Learn the most likely aggregate labels for items from multiple annotators
- Estimate the competence (reliability) of each annotator
- Model how difficult each item is
It models annotators as either "knowing" the correct answer or "guessing" according to some strategy. (That assumes there is one correct answer. In pluralistic cases, where several answers can be correct, try setting beta > alpha and output distributions.)
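The "knowing vs. guessing" idea can be sketched with a tiny simulation (illustrative code, not part of MACE; the uniform guessing strategy is an assumption for the sketch): each annotator reproduces the true label with probability equal to their competence, and otherwise guesses.

```python
import random

def annotate(true_label, competence, labels, rng):
    """One annotation: 'know' the answer with prob. `competence`, else guess."""
    if rng.random() < competence:
        return true_label
    return rng.choice(labels)  # guessing strategy: uniform, for illustration

rng = random.Random(0)
labels = ["NOUN", "VERB", "ADJ"]
# Over many items, a reliable annotator (competence 0.9) agrees with the
# truth far more often than a spammer (competence 0.1).
reliable = sum(annotate("NOUN", 0.9, labels, rng) == "NOUN" for _ in range(1000))
spammer = sum(annotate("NOUN", 0.1, labels, rng) == "NOUN" for _ in range(1000))
```

MACE runs this generative story in reverse: given only the annotations, it infers the competences and the most likely true labels.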
Features
- ✅ Supports discrete categorical labels (default) and continuous numeric values
- ✅ Can incorporate control items (known ground truth) for semi-supervised learning
- ✅ Allows specifying label priors (if known)
- ✅ Provides confidence estimates via entropy calculations
- ✅ Optional distribution output shows full probability distributions
- ✅ Handles missing annotations (empty cells in CSV)
Installation
Requirements
- Python 3.14 or higher
- NumPy
- SciPy
Install Dependencies
pip install numpy scipy
Usage
Basic Command
python3 mace.py [options] <CSV input file>
Options
| Option | Description |
|--------|-------------|
| --help | Display help information |
| --version | Display version information |
| --alpha <FLOAT> | First hyperparameter of beta prior for Variational Bayes EM (default method). alpha > beta means we assume most annotators are unreliable. Default: 0.5 |
| --beta <FLOAT> | Second hyperparameter of beta prior for Variational Bayes EM (default method). beta > alpha means we assume most annotators are reliable. Default: 0.5 |
| --continuous | Interpret data values as continuous numeric values (returns averages weighted by annotator competence) |
| --controls <FILE> | File with control items (i.e., known ground truth labels) for semi-supervised learning. Each line corresponds to one item, so the number of lines MUST match the input CSV file. Control items usually improve accuracy. |
| --distribution | Output full probability distributions instead of single predictions in '[prefix.]prediction' |
| --em | Use regular EM (Maximum Likelihood Estimation) instead of Variational Bayes EM (default). Performance is usually worse than Variational. |
| --entropies | Write entropy values (uncertainty measure) for each item to a separate file '[prefix.]entropies' |
| --headers | Add header rows to output files describing column contents |
| --iterations <INT> | Number of EM iterations per restart (1-1000). Default: 50 |
| --prefix <STRING> | Prefix for output files (e.g., out → out.prediction) |
| --priors <FILE> | File with label priors (tab-separated "label\tweight" pairs). All labels in the data must be covered. Weights will be normalized to probabilities |
| --restarts <N> | Number of random restarts (1-1000). More restarts can find better solutions. Default: 10 |
| --smoothing <FLOAT> | Smoothing parameter added to fractional counts for regular EM. Default: 0.01/num_labels |
| --test <FILE> | Test file with gold standard labels for evaluation (reports accuracy or RMSE). Each line corresponds to one item in the CSV file, so the number of lines must match. |
| --threshold <FLOAT> | Entropy threshold (0.0-1.0). Filter out uncertain instances by returning only the top n%. Default: 1.0 |
Input Files
1. Input Format
The main input file with the annotations must be a comma-separated (CSV) file where:
- Each row represents one instance/item (rows can be empty, for example to separate input blocks, and to enable sequence labeling/time step prediction).
- Each column represents one annotator (empty cells indicate missing annotations)
- File should be encoded in UTF-8; other encodings can cause problems, e.g., with newline characters
Note: Make sure the last line has a line break.
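One way to produce a well-formed input file from Python (a sketch; the filename and rows are examples) is the standard csv module, which writes missing annotations as empty cells and terminates every row, including the last one:

```python
import csv

rows = [
    ["NOUN", "", "", "NOUN", "PRON"],  # empty strings = missing annotations
    ["VERB", "VERB", "", "VERB", ""],
    ["ADJ", "", "ADJ", "", "ADV"],
]

# newline="" lets the csv module control line endings itself
with open("example.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)  # every row ends with a line break
```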
Example Input (Discrete Labels with 5 Annotators and Empty Line)
NOUN,,,NOUN,PRON
VERB,VERB,,VERB,

ADJ,,ADJ,,ADV
,VERB,,VERB,ADV
NOUN,,,NOUN,PRON
Example Input (Continuous Values)
3.5,4.2,,,3.8,3.9,4.1
,,4.0,4.5,,3.7,3.6,4.3
4.1,3.9,3.8,4.2,,4.0,,3.7
2. Label Priors
By default, MACE uses a uniform prior over labels (1/num_labels for each label). The prior file is optional and gives the a priori prevalence of the individual labels, if known. We can supply it with --priors <FILE>. The file must list each label on its own line, followed by a tab and its weight, probability, or frequency (MACE normalizes these automatically).
- Format: Tab-separated "label\tweight" pairs, one per line
- Normalization: Weights are automatically normalized to sum to 1.0
- Validation: All labels in the data must be present in the priors file
- Usage: Priors are used in the E-step to compute gold label marginals
Example Input (Discrete Labels)
NOUN 30
VERB 30
ADJ 20
ADV 10
PRON 10
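The weights above need not sum to 1; MACE normalizes them into probabilities. A sketch of that normalization:

```python
# Raw weights as in the example priors file
priors = {"NOUN": 30, "VERB": 30, "ADJ": 20, "ADV": 10, "PRON": 10}

# Normalize to a probability distribution: divide each weight by the total
total = sum(priors.values())
probs = {label: weight / total for label, weight in priors.items()}
# 30/100 -> 0.3, 20/100 -> 0.2, 10/100 -> 0.1
```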
3. Control Items
If we know the correct answer for some items, we can include control items via --controls <FILE>. This helps MACE assess annotator reliability in semi-supervised learning. The control file must have the same number of lines as the input file: put the correct label on the lines of the control items and leave all other lines empty.
Example Input (Discrete Labels):
PRON
NOUN
4. Test File
If we know all answers and only want to measure MACE's performance, we can supply a test file via --test <FILE>. This file must have the same number of lines as the input file. MACE will report an accuracy score (or RMSE in continuous mode).
Example Input (Discrete Labels)
PRON
VERB
ADJ
VERB
NOUN
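The reported numbers reduce to simple counting; here is a hypothetical sketch of the same computation (the `evaluate` helper is illustrative, not MACE's API). Instances filtered out by --threshold appear as empty prediction lines and are skipped, which is why coverage can drop below 1.0:

```python
def evaluate(predictions, gold):
    """Accuracy over predicted instances; coverage = fraction predicted."""
    assert len(predictions) == len(gold)  # line counts must match
    scored = [(p, g) for p, g in zip(predictions, gold) if p != ""]
    coverage = len(scored) / len(gold)
    accuracy = sum(p == g for p, g in scored) / len(scored)
    return coverage, accuracy

# One instance was filtered out (empty prediction), so coverage is 4/5
cov, acc = evaluate(["NOUN", "VERB", "", "VERB", "NOUN"],
                    ["PRON", "VERB", "ADJ", "VERB", "NOUN"])
```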
Output Files
MACE generates the following output files:
1. Predictions (<prefix>.prediction)
- Discrete mode: One label per line (most likely label for each instance)
- Continuous mode: One weighted average per line
- Distribution mode: Tab-separated distributions (see the --distribution option)
- Empty lines indicate instances filtered by threshold or with no input annotations
This file has the same number of lines as the input file.
Example Output
NOUN
VERB

ADJ
VERB
NOUN
If you set --distribution, each line contains the distribution over answer values, sorted by probability in descending order.
Example Output
NOUN 0.9997443833265887 PRON 7.140381903855615E-5 ADJ 6.140428479093134E-5 VERB 6.140428479093134E-5 ADV 6.140428479093134E-5
VERB 0.9999961943848287 NOUN 9.514037928812883E-7 ADJ 9.514037928812883E-7 PRON 9.514037928812883E-7 ADV 9.514037928812883E-7

ADJ 0.9990184050335877 ADV 2.741982824057974E-4 NOUN 2.3579889466878394E-4 VERB 2.3579889466878394E-4 PRON 2.3579889466878394E-4
VERB 0.9994950838119411 ADV 1.4104305366466138E-4 NOUN 1.2129104479807625E-4 ADJ 1.2129104479807625E-4 PRON 1.2129104479807625E-4
NOUN 0.9997443833265887 PRON 7.140381903855615E-5 ADJ 6.140428479093134E-5 VERB 6.140428479093134E-5 ADV 6.140428479093134E-5
2. Competence Scores (<prefix>.competence)
- One line with tab-separated values
- Each value (0-1) represents the reliability of one annotator
- Higher values = more reliable annotator
Example Output
0.8820970950608722 0.7904155783217401 0.6598575839917008 0.8822161621354134 0.03114062354821738
Here, the first four annotators are fairly reliable, but the 5th one is not.
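Parsing this file and flagging low-competence annotators takes only a few lines; a sketch (the 0.5 cutoff is an arbitrary choice for illustration, not something MACE prescribes):

```python
# One tab-separated line, as in [prefix.]competence (values shortened here)
line = "0.88\t0.79\t0.66\t0.88\t0.03"
competences = [float(x) for x in line.split("\t")]

# Flag annotators below an (arbitrary) competence cutoff
suspect = [i for i, c in enumerate(competences) if c < 0.5]
# only annotator 4 (0-indexed) falls below the cutoff
```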
3. Entropies (<prefix>.entropies) - Optional
- One entropy value per line (if --entropies is used)
- Higher entropy = more uncertainty/disagreement among annotators, often indicating more difficult items
- Lower entropy = high confidence/agreement
This will output a file with the same number of lines as the input file
Example Output
0.0027237895900081095
5.657170773284981E-5

0.009138546784668605
0.005036498835041038
0.0027237895900081095
Here, the first line after the break has the highest entropy: that item produced the most disagreement and is likely the most difficult.
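The uncertainty measure is the Shannon entropy of each item's label distribution; a small sketch of the computation (illustrative, not MACE's internal code):

```python
import math

def entropy(dist):
    """Shannon entropy in nats; 0 means complete agreement on one label."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# Near-unanimous distribution -> entropy close to 0
confident = entropy([0.999, 0.00025, 0.00025, 0.00025, 0.00025])
# Uniform distribution over 5 labels -> maximal entropy, log(5)
uncertain = entropy([0.2, 0.2, 0.2, 0.2, 0.2])
```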
Examples
Basic Usage
# Evaluate annotations and write output to "prediction" and "competence"
python3 mace.py data/examples/example.csv
With Custom Prefix
# Write output to "out.prediction" and "out.competence"
python3 mace.py --prefix out data/examples/example.csv
Test Evaluation
# Evaluate against gold standard and print accuracy
python3 mace.py --test data/examples/example.key data/examples/example.csv
# Output: Coverage: 1.0 Accuracy on test set: 0.81
Filter Uncertain Instances
# Only predict for top 90% most confident instances
python3 mace.py --threshold 0.9 data/examples/example.csv
# Output: Coverage: 0.91 Accuracy on test set: 0.8571428571428571
# Improves accuracy at the expense of coverage
Continuous Numeric Values
# Process numeric scores, return weighted averages
python3 mace.py --continuous data/examples/example.csv
# With test evaluation (uses RMSE instead of accuracy)
python3 mace.py --continuous --test data/examples/example.key data/examples/example.csv
# Output: RMSE on test set: 0.7520364577656833
Distribution Output
# Get full probability distributions for each instance
python3 mace.py --distribution data/examples/example.csv
# Discrete: "3 0.9992656061207903 0 0.00032738278917563427 2 0.00020579145942908883 1 0.00020121963060488852"
# Continuous: "2.7333598926512326 0.8278241191907356 0.0 3.0 10" (mean, std, min, max, n_annotators)
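In continuous mode, each prediction is an average of the available annotations weighted by annotator competence. A plausible sketch of such a weighted average using NumPy (already a dependency); the numbers are hypothetical, and MACE's exact internal handling may differ:

```python
import numpy as np

values = np.array([3.5, 4.2, 3.8])       # annotations present for one item
competences = np.array([0.9, 0.8, 0.1])  # the third annotator counts least

# Competence-weighted mean: sum(v * c) / sum(c)
prediction = np.average(values, weights=competences)
```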
