# TypeEvalPy

A Micro-benchmarking Framework for Python Type Inference Tools
## 📌 Features
- 📜 Contains 154 code snippets to test and benchmark.
- 🏷 Offers 845 type annotations across a diverse set of Python functionalities.
- 📂 Organized into 18 distinct categories targeting various Python features.
- 🚢 Seamlessly manages the execution of containerized tools.
- 🔄 Efficiently transforms inferred types into a standardized format.
- 📊 Automatically produces meaningful metrics for in-depth assessment and comparison.
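
To make the standardized format concrete, below is a minimal sketch of what a micro-benchmark snippet and its ground truth could look like. Both the snippet and the JSON-style annotation entries are illustrative assumptions, not copied from the repository; the actual schema may differ in field names and structure.

```python
# Illustrative micro-benchmark snippet (assumed, not from the repository).
def add(a, b):
    result = a + b
    return result


x = add(1, 2)

# Assumed ground-truth entries in a standardized, JSON-like format:
# each entry pins a type to a function return, a parameter, or a variable.
ground_truth = [
    {"file": "main.py", "line_number": 1, "function": "add", "type": ["int"]},
    {"file": "main.py", "line_number": 1, "function": "add",
     "parameter": "a", "type": ["int"]},
    {"file": "main.py", "line_number": 2, "function": "add",
     "variable": "result", "type": ["int"]},
    {"file": "main.py", "line_number": 7, "variable": "x", "type": ["int"]},
]
```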
## [New] TypeEvalPy Autogen

- 🤖 Autogenerates code snippets and ground truth to scale the benchmark based on the original TypeEvalPy benchmark (see the sketch after this list).
- 📈 The autogen benchmark now contains:
  - Python files: 7121
  - Type annotations: 78373
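
A hedged sketch of how such autogeneration can work: take a seed snippet, substitute values of different types, and regenerate the ground truth alongside the code, so correctness holds by construction. The template and helper below are hypothetical, illustrating the idea rather than TypeEvalPy's actual generator.

```python
# Hypothetical illustration of benchmark autogeneration: vary the value
# (and hence the type) in a seed snippet and emit matching ground truth.
TEMPLATE = """def identity(x):
    return x


y = identity({value!r})
"""

SEED_VALUES = [1, 1.0, "s", True, [1, 2], {"k": 1}]


def generate():
    cases = []
    for value in SEED_VALUES:
        code = TEMPLATE.format(value=value)
        type_name = type(value).__name__  # ground truth known by construction
        truth = [
            {"function": "identity", "type": [type_name]},
            {"variable": "y", "type": [type_name]},
        ]
        cases.append((code, truth))
    return cases


for code, truth in generate():
    print(truth[0]["type"], "->", code.splitlines()[-1])
```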
## 🛠️ Supported Tools
| Supported :white_check_mark: | In-progress :wrench: | Planned :bulb: |
| ---------------------------- | -------------------- | -------------- |
| HeaderGen                    | Intellij PSI         | MonkeyType     |
| Jedi                         | Pyre                 | Pyannotate     |
| Pyright                      | PySonar2             |                |
| HiTyper                      | Pytype               |                |
| Scalpel                      | TypeT5               |                |
| Type4Py                      |                      |                |
| GPT                          |                      |                |
| Ollama                       |                      |                |
| RightTyper                   |                      |                |
## 🏆 TypeEvalPy Leaderboard
Below is a comparison showcasing exact matches across different tools and LLMs on the Autogen benchmark.
| Rank | 🛠️ Tool                    | Function Return Type | Function Parameter Type | Local Variable Type | Total |
| ---- | -------------------------- | -------------------- | ----------------------- | ------------------- | ----- |
| 1    | mistral-large-it-2407-123b | 16701                | 728                     | 57550               | 74979 |
| 2    | qwen2-it-72b               | 16488                | 629                     | 55160               | 72277 |
| 3    | llama3.1-it-70b            | 16648                | 580                     | 54445               | 71673 |
| 4    | gemma2-it-27b              | 16342                | 599                     | 49772               | 66713 |
| 5    | codestral-v0.1-22b         | 16456                | 706                     | 49379               | 66541 |
| 6    | codellama-it-34b           | 15960                | 473                     | 48957               | 65390 |
| 7    | mistral-nemo-it-2407-12.2b | 16221                | 526                     | 48439               | 65186 |
| 8    | mistral-v0.3-it-7b         | 16686                | 472                     | 47935               | 65093 |
| 9    | phi3-medium-it-14b         | 16802                | 467                     | 45121               | 62390 |
| 10   | llama3.1-it-8b             | 16125                | 492                     | 44313               | 60930 |
| 11   | codellama-it-13b           | 16214                | 479                     | 43021               | 59714 |
| 12   | phi3-small-it-7.3b         | 16155                | 422                     | 38093               | 54670 |
| 13   | qwen2-it-7b                | 15684                | 313                     | 38109               | 54106 |
| 14   | HeaderGen                  | 14086                | 346                     | 36370               | 50802 |
| 15   | phi3-mini-it-3.8b          | 15908                | 320                     | 30341               | 46569 |
| 16   | phi3.5-mini-it-3.8b        | 15763                | 362                     | 28694               | 44819 |
| 17   | codellama-it-7b            | 13779                | 318                     | 29346               | 43443 |
| 18   | Jedi                       | 13160                | 0                       | 15403               | 28563 |
| 19   | Scalpel                    | 15383                | 171                     | 18                  | 15572 |
| 20   | gemma2-it-9b               | 1611                 | 66                      | 5464                | 7141  |
| 21   | Type4Py                    | 3143                 | 38                      | 2243                | 5424  |
| 22   | tinyllama-1.1b             | 1514                 | 28                      | 2699                | 4241  |
| 23   | mixtral-v0.1-it-8x7b       | 3235                 | 33                      | 377                 | 3645  |
| 24   | phi3.5-moe-it-41.9b        | 3090                 | 25                      | 273                 | 3388  |
| 25   | gemma2-it-2b               | 1497                 | 41                      | 1848                | 3386  |
<sub>(Auto-generated based on the analysis run on 30 Aug 2024)</sub>
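
The counts above are exact matches: an inferred annotation scores only if it names the same type as the ground truth at the same location. Below is a minimal sketch of such a comparison, using the assumed JSON-like schema from the Features section; the real evaluation logic may differ.

```python
def exact_matches(ground_truth, inferred):
    """Count inferred annotations that exactly match a ground-truth entry.

    Assumes both sides use the same JSON-like schema: file, line_number,
    one of function/parameter/variable, and a list of type names.
    """
    def key(entry):
        return (
            entry.get("file"),
            entry.get("line_number"),
            entry.get("function"),
            entry.get("parameter"),
            entry.get("variable"),
        )

    truth_index = {key(e): e["type"] for e in ground_truth}
    return sum(1 for e in inferred if truth_index.get(key(e)) == e["type"])
```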
## :whale: Running with Docker
### 1️⃣ Clone the repo

```bash
git clone https://github.com/secure-software-engineering/TypeEvalPy.git
```
### 2️⃣ Build Docker image

```bash
docker build -t typeevalpy .
```
### 3️⃣ Run TypeEvalPy

```bash
docker run \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ./results:/app/results \
    typeevalpy
```

🕒 The first run takes about 30 minutes, as the tool containers are built.

📂 Results are generated in the `results` folder within the root directory of the repository. Each results folder has a timestamp, allowing you to easily track and compare different runs.

- Table 1 in the paper is derived from three auto-generated CSV tables:
  - `paper_table_1.csv` - details Exact matches by type category.
  - `paper_table_2.csv` - lists Exact matches for the 18 micro-benchmark categories.
  - `paper_table_3.csv` - provides Sound and Complete values for tools.
- Table 2 in the paper is based on the following CSV table:
  - `paper_table_5.csv` - shows Exact matches with top_n values for machine learning tools.

Additionally, there are CSV tables that are not included in the paper:

- `paper_table_4.csv` - contains Sound and Complete values for the 18 micro-benchmark categories.
- `paper_table_6.csv` - features a sensitivity analysis.
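
After a run, the timestamped results folder can be inspected programmatically. A minimal sketch, assuming the CSVs are plain comma-separated tables under `results/<timestamp>/`; the exact folder layout and column names may differ per run.

```python
import csv
from pathlib import Path

# Pick the latest timestamped results folder (layout assumed, may differ).
results_root = Path("results")
latest = max(results_root.iterdir(), key=lambda p: p.name)

table = latest / "paper_table_1.csv"  # Exact matches by type category
with table.open(newline="") as f:
    for row in csv.DictReader(f):
        print(row)
```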
🔧 Optionally, run analysis on specific tools:

```bash
docker run \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ./results:/app/results \
    typeevalpy --runners headergen scalpel
```
📊 Run analysis on custom benchmarks. For example, running HeaderGen on the autogen benchmark:

```bash
docker run \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v ./results:/app/results \
    typeevalpy \
    --runners headergen \
    --custom_benchmark_dir /app/autogen_typeevalpy_benchmark
```
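
The `--custom_benchmark_dir` flag points at a benchmark laid out like the shipped one. As an assumption about that layout (one directory per snippet holding the code and its ground truth; verify against the repository's benchmark before relying on it), a scaffold could look like:

```python
import json
from pathlib import Path

# Assumed layout: <benchmark>/<category>/<snippet>/main.py + main_gt.json.
# The directory and file naming here is a guess; check the shipped
# benchmark in the repository for the authoritative structure.
root = Path("my_benchmark/functions/simple_add")
root.mkdir(parents=True, exist_ok=True)

(root / "main.py").write_text(
    "def add(a, b):\n    return a + b\n\nx = add(1, 2)\n"
)
(root / "main_gt.json").write_text(json.dumps([
    {"file": "main.py", "line_number": 1, "function": "add", "type": ["int"]},
], indent=2))
```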