Evalica
Evalica, your favourite evaluation toolkit
Evalica [ɛˈʋalit͡sa] (eh-vah-lee-tsah) is an evaluation toolkit for statistical analysis, combining fast Rust implementations with Python APIs for ranking, reliability, and uncertainty estimation. Evalica is fully compatible with NumPy arrays and pandas data frames.
The logo was created using Recraft.
Installation
$ pip install evalica
Pairwise Comparisons
Imagine that we would like to rank different meals and have the following dataset of three comparisons produced by food experts.
| Item X| Item Y | Winner |
|:---:|:---:|:---:|
| pizza | burger | x |
| burger | sushi | y |
| pizza | sushi | tie |
Given this hypothetical example, Evalica takes these three columns and computes a score for each item according to the chosen model. Note that the first argument is the column Item X, the second argument is the column Item Y, and the third argument corresponds to the column Winner. The outcomes x, y, and tie in the table are passed as Winner.X, Winner.Y, and Winner.Draw, respectively.
>>> from evalica import elo, Winner
>>> result = elo(
... ['pizza', 'burger', 'pizza'],
... ['burger', 'sushi', 'sushi'],
... [Winner.X, Winner.Y, Winner.Draw],
... )
>>> result.scores
pizza 1014.972058
burger 970.647200
sushi 1014.380742
Name: elo, dtype: float64
As a result, we obtain Elo scores for our items. In this example, pizza was the most favoured item, sushi was a close runner-up (about 0.6 points behind), and burger was the least preferred item.
| Item| Score |
|---|---:|
| pizza | 1014.97 |
| burger | 970.65 |
| sushi | 1014.38 |
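To make these numbers less opaque, here is a minimal pure-Python sketch of a sequential Elo update. The helper name `elo_sketch` and the parameter values (initial rating 1000, K-factor 30, base 10, scale 400) are assumptions chosen because they reproduce the scores shown above; this is an illustration, not Evalica's Rust implementation.

```python
def elo_sketch(xs, ys, winners, initial=1000.0, k=30.0, base=10.0, scale=400.0):
    """Sequential Elo update; each outcome is 'x', 'y', or 'draw'.

    All default parameter values are assumptions made for this sketch.
    """
    scores = {}
    for x, y, winner in zip(xs, ys, winners):
        rx = scores.setdefault(x, initial)
        ry = scores.setdefault(y, initial)
        # Expected result for item x under the logistic Elo model.
        expected_x = 1.0 / (1.0 + base ** ((ry - rx) / scale))
        # Actual result for item x: win = 1, loss = 0, draw = 0.5.
        actual_x = {"x": 1.0, "y": 0.0, "draw": 0.5}[winner]
        delta = k * (actual_x - expected_x)
        # The update is zero-sum: what x gains, y loses.
        scores[x] = rx + delta
        scores[y] = ry - delta
    return scores


scores = elo_sketch(
    ["pizza", "burger", "pizza"],
    ["burger", "sushi", "sushi"],
    ["x", "y", "draw"],
)
for item, score in scores.items():
    print(f"{item}: {score:.2f}")
```

Because Elo processes comparisons one at a time, the result depends on their order: the draw between pizza and sushi moves both ratings only slightly, since they were already almost equal by the third comparison.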
Inter-Rater Reliability
Evalica also supports computing Krippendorff's alpha, a statistical measure of inter-rater reliability. Unlike pairwise comparisons, alpha accepts a matrix where rows represent raters (observers) and columns represent units (items being rated).
>>> import pandas as pd
>>> from evalica import alpha
>>> data = pd.DataFrame([
... [1, 1, None, 1],
... [2, 2, 3, 2],
... [3, 3, 3, 3],
... [3, 3, 3, 3],
... [2, 2, 2, 2],
... [1, 2, 3, 4],
... [4, 4, 4, 4],
... [1, 1, 2, 1],
... [2, 2, 2, 2],
... [None, 5, 5, 5],
... [None, None, 1, 1],
... ]).T
>>> result = alpha(data, distance='nominal')
>>> result.alpha
0.7434210526315788
>>> from evalica import alpha_bootstrap
>>> bootstrap_result = alpha_bootstrap(data, distance='nominal', n_resamples=1000, random_state=42)
>>> (bootstrap_result.low, bootstrap_result.high)
(0.4431818181818182, 0.9411764705882353)
This example demonstrates computing alpha and its bootstrap confidence intervals with nominal distance for categorical ratings. Evalica supports multiple distance metrics: nominal, ordinal, interval, ratio, or custom distance functions.
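For readers who want to see where the point estimate comes from, the following pure-Python sketch computes alpha with nominal distance as one minus the ratio of observed to expected disagreement. The function name `nominal_alpha_sketch` is hypothetical and the code is only an illustration of the textbook formula under that assumption; it is not Evalica's implementation and does not cover the other distance metrics or the bootstrap.

```python
from collections import Counter


def nominal_alpha_sketch(units):
    """Krippendorff's alpha (nominal) from a list of units.

    Each unit is the list of ratings it received; None marks a missing rating.
    """
    observed = 0.0
    totals = Counter()
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than two ratings carries no pairs
        counts = Counter(values)
        totals.update(counts)
        # Ordered pairs of ratings within the unit that disagree.
        disagreeing = m * m - sum(c * c for c in counts.values())
        observed += disagreeing / (m - 1)
    n = sum(totals.values())
    # Disagreeing ordered pairs expected if ratings were paired at random.
    expected = (n * n - sum(c * c for c in totals.values())) / (n - 1)
    return 1.0 - observed / expected


# Same ratings as above, one inner list per unit (i.e. before the transpose).
units = [
    [1, 1, None, 1], [2, 2, 3, 2], [3, 3, 3, 3], [3, 3, 3, 3],
    [2, 2, 2, 2], [1, 2, 3, 4], [4, 4, 4, 4], [1, 1, 2, 1],
    [2, 2, 2, 2], [None, 5, 5, 5], [None, None, 1, 1],
]
alpha_value = nominal_alpha_sketch(units)
print(round(alpha_value, 4))
```

On this dataset, only three units contain any disagreement, which is why the estimate stays fairly high despite the fully mixed sixth unit.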
Command-Line Interface
Evalica also provides a simple command-line interface, allowing the use of these methods in shell scripts and for prototyping.
Pairwise Ranking
$ evalica -i food.csv pairwise bradley-terry
item,score,rank
Tacos,2.509025136024378,1
Sushi,1.1011561298265815,2
Burger,0.8549063627182466,3
Pasta,0.7403814336665869,4
Pizza,0.5718366915548537,5
Refer to the food.csv file as an input example.
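As a rough illustration of what a Bradley–Terry score means, here is a pure-Python sketch of the classic minorization-maximization iteration. The function name `bradley_terry_sketch`, the toy comparisons, and the normalization to a sum of one are assumptions made for this example; the data is not the content of food.csv, ties are not handled, and Evalica's own solver and score scale may differ.

```python
from collections import defaultdict


def bradley_terry_sketch(comparisons, iterations=100):
    """Fit Bradley-Terry scores from (winner, loser) pairs.

    Assumes every item wins at least once; ties are not handled here.
    """
    wins = defaultdict(float)
    games = defaultdict(float)  # number of comparisons per ordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        games[(winner, loser)] += 1.0
        games[(loser, winner)] += 1.0
        items.update((winner, loser))
    scores = {item: 1.0 / len(items) for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denominator = sum(
                games[(i, j)] / (scores[i] + scores[j]) for j in items if j != i
            )
            updated[i] = wins[i] / denominator
        # Scores are identified only up to a constant factor, so normalize.
        total = sum(updated.values())
        scores = {item: value / total for item, value in updated.items()}
    return scores


# Hypothetical toy data: a mostly beats b and c, b mostly beats c.
toy = (
    [("a", "b")] * 3 + [("b", "a")]
    + [("b", "c")] * 2 + [("c", "b")]
    + [("a", "c")] * 2 + [("c", "a")]
)
bt_scores = bradley_terry_sketch(toy)
for rank, (item, score) in enumerate(
    sorted(bt_scores.items(), key=lambda kv: kv[1], reverse=True), start=1
):
    print(item, round(score, 3), rank)
```

Under this model, the ratio of two scores gives the odds that one item is preferred over the other, so only relative magnitudes matter.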
Krippendorff's Alpha
For Krippendorff's alpha, use a CSV file with ratings in a matrix format (no header):
$ evalica -i codings.csv alpha --distance=nominal
metric,value
alpha,0.743421052631579
observed,7.999999999999999
expected,31.179487179487182
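The three rows of this output appear to be related by the usual definition of alpha as one minus the ratio of observed to expected disagreement. The short check below assumes that the observed and expected rows are exactly those two quantities; the values are copied from the output above.

```python
# Values copied from the CLI output above.
observed = 7.999999999999999
expected = 31.179487179487182

# Assumed relationship: alpha = 1 - observed / expected.
alpha_from_rows = 1.0 - observed / expected
print(round(alpha_from_rows, 6))
```

The result agrees with the reported alpha row to at least six decimal places.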
Web Application
Evalica has a built-in Gradio application that can be launched with python3 -m evalica.gradio. Make sure the library was installed with the gradio extra: pip install evalica[gradio].
Implemented Methods
| Method | In Python | In Rust |
|---|:---:|:---:|
| Counting | ✅ | ✅ |
| Average Win Rate | ✅ | ✅ |
| Bradley–Terry | ✅ | ✅ |
| Elo | ✅ | ✅ |
| Eigenvalue | ✅ | ✅ |
| PageRank | ✅ | ✅ |
| Newman | ✅ | ✅ |
| Krippendorff's Alpha | ✅ | ✅ |
Contributing
Evalica is a mixed Rust/Python project that uses PyO3, so it requires setting up the Maturin build system.
To set up the environment, we recommend using the uv package manager, as demonstrated in our test suite:
$ uv venv
$ uv pip install maturin
$ source .venv/bin/activate
$ maturin develop --uv --extras dev,docs,gradio
In case uv is not available, you can use the following workaround:
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install maturin
$ maturin develop --extras dev,docs,gradio
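It is also possible to omit the Rust-accelerated routines via pip install --no-binary evalica evalica. The --no-binary option takes the package name as its argument, so the name appears twice: once for the option and once as the package to install.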
We welcome pull requests on GitHub: https://github.com/dustalov/evalica. To contribute, fork the repository, create a separate branch for your changes, and submit a pull request.
Citation
- Ustalov, D. Reliable, Reproducible, and Really Fast Leaderboards with Evalica. 2025. Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations. 46–53. arXiv: 2412.11314 [cs.CL].
@inproceedings{Ustalov:25,
author = {Ustalov, Dmitry},
title = {{Reliable, Reproducible, and Really Fast Leaderboards with Evalica}},
year = {2025},
booktitle = {Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations},
pages = {46--53},
address = {Abu Dhabi, UAE},
publisher = {Association for Computational Linguistics},
eprint = {2412.11314},
eprinttype = {arxiv},
eprintclass = {cs.CL},
url = {https://aclanthology.org/2025.coling-demos.6},
language = {english},
}
The code for replicating the experiments is available in the coling2025 directory.
Copyright
Copyright (c) 2024–2026 Dmitry Ustalov. See LICENSE for details.
