# TypeEvalPy

A Micro-benchmarking Framework for Python Type Inference Tools
## 📌 Features
- 📜 Contains 154 code snippets to test and benchmark.
- 🏷 Offers 845 type annotations across a diverse set of Python functionalities.
- 📂 Organized into 18 distinct categories targeting various Python features.
- 🚢 Seamlessly manages the execution of containerized tools.
- 🔄 Efficiently transforms inferred types into a standardized format.
- 📊 Automatically produces meaningful metrics for in-depth assessment and comparison.
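
To make the standardized format concrete, below is a minimal sketch of what a micro-benchmark snippet and its ground truth could look like. Both the snippet and the JSON-style annotation entries are illustrative assumptions, not copied from the repository; the actual schema may differ in field names and structure.

```python
# Illustrative micro-benchmark snippet (assumed, not from the repository).
def add(a, b):
    result = a + b
    return result


x = add(1, 2)

# Assumed ground-truth entries in a standardized, JSON-like format:
# each entry pins a type to a function return, a parameter, or a variable.
ground_truth = [
    {"file": "main.py", "line_number": 1, "function": "add", "type": ["int"]},
    {"file": "main.py", "line_number": 1, "function": "add",
     "parameter": "a", "type": ["int"]},
    {"file": "main.py", "line_number": 2, "function": "add",
     "variable": "result", "type": ["int"]},
    {"file": "main.py", "line_number": 7, "variable": "x", "type": ["int"]},
]
```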
## [New] TypeEvalPy Autogen

- 🤖 Autogenerates code snippets and ground truth to scale the benchmark based on the original TypeEvalPy benchmark (see the sketch after this list).
- 📈 The autogen benchmark now contains:
  - Python files: 7121
  - Type annotations: 78373
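
A hedged sketch of how such autogeneration can work: take a seed snippet, substitute values of different types, and regenerate the ground truth alongside the code, so correctness holds by construction. The template and helper below are hypothetical, illustrating the idea rather than TypeEvalPy's actual generator.

```python
# Hypothetical illustration of benchmark autogeneration: vary the value
# (and hence the type) in a seed snippet and emit matching ground truth.
TEMPLATE = """def identity(x):
    return x


y = identity({value!r})
"""

SEED_VALUES = [1, 1.0, "s", True, [1, 2], {"k": 1}]


def generate():
    cases = []
    for value in SEED_VALUES:
        code = TEMPLATE.format(value=value)
        type_name = type(value).__name__  # ground truth known by construction
        truth = [
            {"function": "identity", "type": [type_name]},
            {"variable": "y", "type": [type_name]},
        ]
        cases.append((code, truth))
    return cases


for code, truth in generate():
    print(truth[0]["type"], "->", code.splitlines()[-1])
```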
## 🛠️ Supported Tools
| Supported :white_check_mark: | In-progress :wrench: | Planned :bulb: |
| ---------------------------- | -------------------- | -------------- |
| HeaderGen                    | Intellij PSI         | MonkeyType     |
| Jedi                         | Pyre                 | Pyannotate     |
| Pyright                      | PySonar2             |                |
| HiTyper                      | Pytype               |                |
| Scalpel                      | TypeT5               |                |
| Type4Py                      |                      |                |
| GPT                          |                      |                |
| Ollama                       |                      |                |
| RightTyper                   |                      |                |
## 🏆 TypeEvalPy Leaderboard
Below is a comparison showcasing exact matches across different tools and LLMs on the Autogen benchmark.
| Rank | 🛠️ Tool                    | Function Return Type | Function Parameter Type | Local Variable Type | Total |
| ---- | -------------------------- | -------------------- | ----------------------- | ------------------- | ----- |
| 1    | mistral-large-it-2407-123b | 16701                | 728                     | 57550               | 74979 |
| 2    | qwen2-it-72b               | 16488                | 629                     | 55160               | 72277 |
| 3    | llama3.1-it-70b            | 16648                | 580                     | 54445               | 71673 |
| 4    | gemma2-it-27b              | 16342                | 599                     | 49772               | 66713 |
| 5    | codestral-v0.1-22b         | 16456                | 706                     | 49379               | 66541 |
| 6    | codellama-it-34b           | 15960                | 473                     | 48957               | 65390 |
| 7    | mistral-nemo-it-2407-12.2b | 16221                | 526                     | 48439               | 65186 |
| 8    | mistral-v0.3-it-7b         | 16686                | 472                     | 47935               | 65093 |
| 9    | phi3-medium-it-14b         | 16802                | 467                     | 45121               | 62390 |
| 10   | llama3.1-it-8b             | 16125                | 492                     | 44313               | 60930 |
| 11   | codellama-it-13b           | 16214                | 479                     | 43021               | 59714 |
| 12   | phi3-small-it-7.3b         | 16155                | 422                     | 38093               | 54670 |
| 13   | qwen2-it-7b                | 15684                | 313                     | 38109               | 54106 |
| 14   | HeaderGen                  | 14086                | 346                     | 36370               | 50802 |
| 15   | phi3-mini-it-3.8b          | 15908                | 320                     | 30341               | 46569 |
| 16   | phi3.5-mini-it-3.8b        | 15763                | 362                     | 28694               | 44819 |
| 17   | codellama-it-7b            | 13779                | 318                     | 29346               | 43443 |
| 18   | Jedi                       | 13160                | 0                       | 15403               | 28563 |
| 19   | Scalpel                    | 15383                | 171                     | 18                  | 15572 |
| 20   | gemma2-it-9b               | 1611                 | 66                      | 5464                | 7141  |
| 21   | Type4Py                    | 3143                 | 38                      | 2243                | 5424  |
| 22   | tinyllama-1.1b             | 1514                 | 28                      | 2699                | 4241  |
| 23   | mixtral-v0.1-it-8x7b       | 3235                 | 33                      | 377                 | 3645  |
| 24   | phi3.5-moe-it-41.9b        | 3090                 | 25                      | 273                 | 3388  |
| 25   | gemma2-it-2b               | 1497                 | 41                      | 1848                | 3386  |
<sub>(Auto-generated based on the analysis run on 30 Aug 2024)</sub>
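
The counts above are exact matches: an inferred annotation scores only if it names the same type as the ground truth at the same location. Below is a minimal sketch of such a comparison, using the assumed JSON-like schema from the Features section; the real evaluation logic may differ.

```python
def exact_matches(ground_truth, inferred):
    """Count inferred annotations that exactly match a ground-truth entry.

    Assumes both sides use the same JSON-like schema: file, line_number,
    one of function/parameter/variable, and a list of type names.
    """
    def key(entry):
        return (
            entry.get("file"),
            entry.get("line_number"),
            entry.get("function"),
            entry.get("parameter"),
            entry.get("variable"),
        )

    truth_index = {key(e): e["type"] for e in ground_truth}
    return sum(1 for e in inferred if truth_index.get(key(e)) == e["type"])
```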
## :whale: Running with Docker
### 1️⃣ Clone the repo

```bash
git clone https://github.com/secure-software-engineering/TypeEvalPy.git
```
### 2️⃣ Build Docker image

```bash
docker build -t typeevalpy .
```
### 3️⃣ Run TypeEvalPy

```bash
docker run \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ./results:/app/results \
    typeevalpy
```

🕒 The first run takes about 30 minutes, as the tool containers are built.

📂 Results are generated in the `results` folder within the root directory of the repository. Each results folder has a timestamp, allowing you to easily track and compare different runs.

- Table 1 in the paper is derived from three auto-generated CSV tables:
  - `paper_table_1.csv` - details Exact matches by type category.
  - `paper_table_2.csv` - lists Exact matches for the 18 micro-benchmark categories.
  - `paper_table_3.csv` - provides Sound and Complete values for tools.
- Table 2 in the paper is based on the following CSV table:
  - `paper_table_5.csv` - shows Exact matches with top_n values for machine learning tools.

Additionally, there are CSV tables that are not included in the paper:

- `paper_table_4.csv` - contains Sound and Complete values for the 18 micro-benchmark categories.
- `paper_table_6.csv` - features a sensitivity analysis.
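
After a run, the timestamped results folder can be inspected programmatically. A minimal sketch, assuming the CSVs are plain comma-separated tables under `results/<timestamp>/`; the exact folder layout and column names may differ per run.

```python
import csv
from pathlib import Path

# Pick the latest timestamped results folder (layout assumed, may differ).
results_root = Path("results")
latest = max(results_root.iterdir(), key=lambda p: p.name)

table = latest / "paper_table_1.csv"  # Exact matches by type category
with table.open(newline="") as f:
    for row in csv.DictReader(f):
        print(row)
```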
🔧 Optionally, run analysis on specific tools:

```bash
docker run \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ./results:/app/results \
    typeevalpy --runners headergen scalpel
```
📊 Run analysis on custom benchmarks. For example, running HeaderGen on the autogen benchmark:

```bash
docker run \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ./results:/app/results \
    typeevalpy \
    --runners headergen \
    --custom_benchmark_dir /app/autogen_typeevalpy_benchmark
```
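
The `--custom_benchmark_dir` flag points at a benchmark laid out like the shipped one. As an assumption about that layout (one directory per snippet holding the code and its ground truth; verify against the repository's benchmark before relying on it), a scaffold could look like:

```python
import json
from pathlib import Path

# Assumed layout: <benchmark>/<category>/<snippet>/main.py + main_gt.json.
# The directory and file naming here is a guess; check the shipped
# benchmark in the repository for the authoritative structure.
root = Path("my_benchmark/functions/simple_add")
root.mkdir(parents=True, exist_ok=True)

(root / "main.py").write_text(
    "def add(a, b):\n    return a + b\n\nx = add(1, 2)\n"
)
(root / "main_gt.json").write_text(json.dumps([
    {"file": "main.py", "line_number": 1, "function": "add", "type": ["int"]},
], indent=2))
```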