<p align="center"> <img src="TypeEvalPy.jpg" width="75%" align="center"> <br> <h3 align="center"> A Micro-benchmarking Framework for Python Type Inference Tools </h3> </p>

📌 Features:

  • 📜 Contains 154 code snippets to test and benchmark.
  • 🏷 Offers 845 type annotations across a diverse set of Python functionalities.
  • 📂 Organized into 18 distinct categories targeting various Python features.
  • 🚢 Seamlessly manages the execution of containerized tools.
  • 🔄 Efficiently transforms inferred types into a standardized format.
  • 📊 Automatically produces meaningful metrics for in-depth assessment and comparison.
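The "standardized format" step can be sketched in Python: each inferred type is compared against a ground-truth record, and an exact match counts toward the metrics. This is a minimal illustration only; the field names (`file`, `line_number`, `function`, `type`) are assumptions for the sketch, not TypeEvalPy's exact schema.

```python
# Minimal sketch of a standardized type-annotation record and an
# exact-match check. Field names are illustrative assumptions,
# not TypeEvalPy's exact schema.

def exact_match(expected, inferred):
    """True when an inferred annotation matches the ground truth exactly."""
    keys = ("file", "line_number", "function", "type")
    return all(expected.get(k) == inferred.get(k) for k in keys)

ground_truth = {"file": "main.py", "line_number": 3,
                "function": "add", "type": ["int"]}
inferred = {"file": "main.py", "line_number": 3,
            "function": "add", "type": ["int"]}

print(exact_match(ground_truth, inferred))  # True
```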

[New] TypeEvalPy Autogen

  • 🤖 Autogenerates code snippets and ground truth to scale the benchmark based on the original TypeEvalPy benchmark.
  • 📈 The autogen benchmark now contains:
    • Python files: 7121
    • Type annotations: 78373

🛠️ Supported Tools

| Supported :white_check_mark: | In-progress :wrench: | Planned :bulb: |
| ---------------------------- | -------------------- | -------------- |
| HeaderGen                    | Intellij PSI         | MonkeyType     |
| Jedi                         | Pyre                 | Pyannotate     |
| Pyright                      | PySonar2             |                |
| HiTyper                      | Pytype               |                |
| Scalpel                      | TypeT5               |                |
| Type4Py                      |                      |                |
| GPT                          |                      |                |
| Ollama                       |                      |                |
| RightTyper                   |                      |                |


🏆 TypeEvalPy Leaderboard

Below is a comparison showcasing exact matches across different tools and LLMs on the Autogen benchmark.

| Rank | 🛠️ Tool | Function Return Type | Function Parameter Type | Local Variable Type | Total |
| ---- | -------------------------- | ----- | --- | ----- | ----- |
| 1    | mistral-large-it-2407-123b | 16701 | 728 | 57550 | 74979 |
| 2    | qwen2-it-72b               | 16488 | 629 | 55160 | 72277 |
| 3    | llama3.1-it-70b            | 16648 | 580 | 54445 | 71673 |
| 4    | gemma2-it-27b              | 16342 | 599 | 49772 | 66713 |
| 5    | codestral-v0.1-22b         | 16456 | 706 | 49379 | 66541 |
| 6    | codellama-it-34b           | 15960 | 473 | 48957 | 65390 |
| 7    | mistral-nemo-it-2407-12.2b | 16221 | 526 | 48439 | 65186 |
| 8    | mistral-v0.3-it-7b         | 16686 | 472 | 47935 | 65093 |
| 9    | phi3-medium-it-14b         | 16802 | 467 | 45121 | 62390 |
| 10   | llama3.1-it-8b             | 16125 | 492 | 44313 | 60930 |
| 11   | codellama-it-13b           | 16214 | 479 | 43021 | 59714 |
| 12   | phi3-small-it-7.3b         | 16155 | 422 | 38093 | 54670 |
| 13   | qwen2-it-7b                | 15684 | 313 | 38109 | 54106 |
| 14   | HeaderGen                  | 14086 | 346 | 36370 | 50802 |
| 15   | phi3-mini-it-3.8b          | 15908 | 320 | 30341 | 46569 |
| 16   | phi3.5-mini-it-3.8b        | 15763 | 362 | 28694 | 44819 |
| 17   | codellama-it-7b            | 13779 | 318 | 29346 | 43443 |
| 18   | Jedi                       | 13160 | 0   | 15403 | 28563 |
| 19   | Scalpel                    | 15383 | 171 | 18    | 15572 |
| 20   | gemma2-it-9b               | 1611  | 66  | 5464  | 7141  |
| 21   | Type4Py                    | 3143  | 38  | 2243  | 5424  |
| 22   | tinyllama-1.1b             | 1514  | 28  | 2699  | 4241  |
| 23   | mixtral-v0.1-it-8x7b       | 3235  | 33  | 377   | 3645  |
| 24   | phi3.5-moe-it-41.9b        | 3090  | 25  | 273   | 3388  |
| 25   | gemma2-it-2b               | 1497  | 41  | 1848  | 3386  |

<sub>(Auto-generated based on the analysis run on 30 Aug 2024)</sub>
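As a quick sanity check on how the leaderboard is tallied, the Total column is simply the sum of the three exact-match counts; for example, for the top-ranked model:

```python
# The Total column is the sum of the three exact-match counts
# (values taken from the top row of the leaderboard above).
function_return = 16701
function_parameter = 728
local_variable = 57550
total = function_return + function_parameter + local_variable
print(total)  # 74979
```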


:whale: Running with Docker

1️⃣ Clone the repo

```bash
git clone https://github.com/secure-software-engineering/TypeEvalPy.git
```

2️⃣ Build Docker image

```bash
docker build -t typeevalpy .
```

3️⃣ Run TypeEvalPy

🕒 The first run takes about 30 minutes, as the Docker containers for the tools are built.

📂 Results will be generated in the results folder within the root directory of the repository. Each results folder will have a timestamp, allowing you to easily track and compare different runs.

```bash
docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy
```

<details>
<summary><b>Correlation of CSV Files Generated to Tables in ICSE Paper</b></summary>

Here is how the auto-generated CSV tables relate to the paper's tables:

  • Table 1 in the paper is derived from three auto-generated CSV tables:
    • paper_table_1.csv - details exact matches by type category.
    • paper_table_2.csv - lists exact matches for the 18 micro-benchmark categories.
    • paper_table_3.csv - provides sound and complete values for tools.
  • Table 2 in the paper is based on the following CSV table:
    • paper_table_5.csv - shows exact matches with top_n values for machine learning tools.

Additionally, there are CSV tables that are not included in the paper:

  • paper_table_4.csv - contains sound and complete values for the 18 micro-benchmark categories.
  • paper_table_6.csv - features a sensitivity analysis.
</details>
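Since each run writes to its own timestamped folder under `results`, a small helper can locate the most recent run for post-processing. This is a hedged sketch: it assumes only the one-subdirectory-per-run layout described above and sorts by modification time rather than parsing any particular timestamp format.

```python
# Sketch: find the most recent run's results folder. Assumes results/
# contains one subdirectory per run (as described above); sorting by
# modification time avoids depending on the exact timestamp format.
from pathlib import Path

def latest_results_dir(results_root="results"):
    runs = [p for p in Path(results_root).iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime) if runs else None
```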

🔧 Optionally, run analysis on specific tools:

```bash
docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners headergen scalpel
```

📊 Run analysis on custom benchmarks:

For example, running HeaderGen on the autogen benchmark:

```bash
docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy \
      --runners headergen \
      --custom_benchmark_dir /app/autogen_typeevalpy_benchmark
```

