OmniCode
OmniCode: A Diverse Software Engineering Benchmark for Evaluating Large Language Models
Welcome to OmniCode! This is a benchmark for evaluating LLM-powered agents on a range of software development activities. Below, you'll find the commands to test your setup and evaluate the results.
OmniCode synthetically builds multiple tasks out of a base dataset to holistically evaluate software engineering agents. We consider four different types of tasks: bug fixing, test generation, responding to code review, and enforcing style guidelines.
<img width="800" height="400" alt="image" src="https://github.com/user-attachments/assets/46a4e55c-d8fd-4940-a7ad-26ea746f6c54" />
Supported Tasks
In this section, you will find details of the different specifications of our tasks: Bug Fixing, Test Generation, Style Review, and Review Response!
Bug Fixing Evaluation (--BugFixing)
- Description: The agent receives a repository and PR description, identifies and applies minimal source code changes (excluding tests) to meet the specified requirements. It verifies the fix by reproducing the issue, applying the fix, re-running the relevant test, and ensuring completeness.
- Evaluation: Success is measured by the fix passing all relevant tests without introducing unintended changes.
- Use Case: Ideal for evaluating a model’s ability to make minimal, correct, and test-verified code changes.
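The pass/fail criterion described above can be sketched as follows, in the style of SWE-bench's FAIL_TO_PASS / PASS_TO_PASS convention. The function and field names here are illustrative assumptions, not OmniCode's actual API:

```python
# Sketch: a fix is "resolved" when every previously failing test now passes
# (fail_to_pass) and no previously passing test regresses (pass_to_pass).

def is_resolved(fail_to_pass, pass_to_pass, results):
    """results maps test id -> "PASSED" / "FAILED" after applying the patch."""
    return (
        all(results.get(t) == "PASSED" for t in fail_to_pass)
        and all(results.get(t) == "PASSED" for t in pass_to_pass)
    )

# A patch that fixes the reported test without breaking existing ones:
print(is_resolved(["test_bug"], ["test_existing"],
                  {"test_bug": "PASSED", "test_existing": "PASSED"}))  # True

# A patch that fixes the bug but introduces a regression:
print(is_resolved(["test_bug"], ["test_existing"],
                  {"test_bug": "PASSED", "test_existing": "FAILED"}))  # False
```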
Test Generation Evaluation (--TestGeneration)
- Description: The agent receives a repository and a problem description, then writes a new test in the repository’s test suite that reproduces the reported issue using the existing testing framework (e.g., pytest).
- Evaluation: Success is measured by the test failing on incorrect implementations and passing on correct ones.
- Use Case: Useful for assessing a model's ability to generate meaningful, differentiating test cases.
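The "fail on incorrect, pass on correct" criterion can be illustrated with a toy example (the clamp function and its bug are invented for illustration; the real benchmark runs the generated test against the buggy and patched repository states):

```python
# A generated test is "differentiating" iff it fails on the buggy
# implementation and passes on the fixed one.

def clamp_buggy(x, lo, hi):
    return min(x, hi)          # bug: ignores the lower bound

def clamp_fixed(x, lo, hi):
    return max(lo, min(x, hi))

def reproducing_test(clamp):
    return clamp(-5, 0, 10) == 0   # True iff the issue is fixed

fails_on_buggy = not reproducing_test(clamp_buggy)
passes_on_fixed = reproducing_test(clamp_fixed)
print(fails_on_buggy and passes_on_fixed)  # True: the test differentiates
```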
Style Review Evaluation (--StyleReview)
- Description: The agent runs a style check on a given instance, applies fixes for detected issues, and verifies functionality remains unaffected by re-running relevant tests.
- Evaluation: Success is measured by the reduction of style violations without breaking functionality.
- Use Case: Designed for scenarios where code quality and adherence to style guidelines are important.
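A minimal sketch of the success criterion, assuming a toy style checker (the real benchmark uses a proper linter, not this regex):

```python
import re

# Toy style check: count trailing-whitespace and over-long lines, then
# verify that a cleanup reduces the count without breaking tests.

def count_violations(source, max_len=79):
    count = 0
    for line in source.splitlines():
        if re.search(r"[ \t]+$", line):   # trailing whitespace
            count += 1
        if len(line) > max_len:           # line too long
            count += 1
    return count

before = "def add(a, b):   \n    return a + b  \n"
after = "def add(a, b):\n    return a + b\n"

tests_still_pass = True  # stand-in for re-running the relevant test suite
success = count_violations(after) < count_violations(before) and tests_still_pass
print(success)  # True
```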
Review Response Evaluation (--ReviewResponse)
- Description: The agent receives a problem description, a failed patch, and a review explaining the failure. It uses this context to avoid repeating mistakes and implements an improved fix. The evaluation is the same as BugFixing since we check whether the predicted patch passes the final tests.
- Evaluation: Success is measured by whether the improved patch resolves the issue while avoiding pitfalls highlighted in the review.
- Use Case: Especially relevant for testing a model’s ability to apply reviewer feedback to refine implementations.
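The extra context a Review Response instance carries, compared to plain bug fixing, can be sketched as below. The field names (problem_statement, failed_patch, review) are hypothetical, chosen only to illustrate the inputs described above:

```python
# Sketch: assembling the agent's input from a Review Response instance.
# Field names are assumptions, not OmniCode's actual schema.

def build_review_context(instance):
    return (
        f"Issue:\n{instance['problem_statement']}\n\n"
        f"Previously attempted patch (rejected):\n{instance['failed_patch']}\n\n"
        f"Review of why it failed:\n{instance['review']}\n\n"
        "Write an improved patch that addresses the review."
    )

instance = {
    "problem_statement": "TimeDelta serialization loses precision.",
    "failed_patch": "diff --git a/...",
    "review": "The patch rounds instead of truncating; the zero case breaks.",
}
prompt = build_review_context(instance)
print("Review of why it failed" in prompt)  # True
```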
Setup
Environment
OmniCode requires Python 3.13. Its dependencies can be installed via pip install -r requirements.txt
Clone
The repository contains submodules that must be cloned as well. You can clone our repo with:
git clone --recursive git@github.com:seal-research/OmniCode.git
cd OmniCode
Dataset
Our dataset is currently hosted on Hugging Face at seal-research/OmniCode.
To use OmniCode, download the data from our Hugging Face repo to the ./data directory:
pip install -U huggingface_hub
hf download seal-research/OmniCode \
--repo-type dataset \
--local-dir data
OmniCode/
└── data/
├── omnicode_instances_python.json
├── omnicode_instances_java.json
├── omnicode_instances_cpp.json
├── omnicode_style_instances_python.json
├── omnicode_style_instances_java.json
└── omnicode_style_instances_cpp.json
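Once downloaded, instances can be loaded with the standard json module. The field names below (instance_id, repo, problem_statement) follow the SWE-bench convention and are assumptions about the OmniCode schema; inspect the downloaded file to confirm:

```python
import json

# Minimal sketch of loading instances from data/omnicode_instances_python.json
# and selecting one by id. The inline string stands in for
# open("data/omnicode_instances_python.json").read().
raw = '''[
  {"instance_id": "astropy__astropy-13236",
   "repo": "astropy/astropy",
   "problem_statement": "..."}
]'''

instances = json.loads(raw)
by_id = {inst["instance_id"]: inst for inst in instances}
print(by_id["astropy__astropy-13236"]["repo"])  # astropy/astropy
```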
Submodules
OmniCode is currently set up to work with specific versions of SWE-bench and Multi-SWE-bench, which can be installed using:
cd SWE-bench
pip install .
cd ..
cd multi-swe-bench
pip install .
cd ..
Alternatively, if you are comfortable using git submodules, you can use:
git submodule update --init --recursive
cd <submodule_path>
pip install .
NOTE: Running pip install . in multi-swe-bench installs multi-swe-bench as a package. If you make changes to multi-swe-bench and wish to run/test them locally, re-run pip install . in the multi-swe-bench folder to update the package for your local OmniCode.
Apptainer
Compared to Docker, Apptainer is designed to run containers without requiring root privileges, making it more suitable for shared or restricted environments (e.g., HPC clusters). Docker is commonly used in service and cloud deployments but usually relies on root or privileged daemons. By supporting Apptainer, OmniCode enables containerized workflows for users who cannot use Docker due to permission or security constraints.
Follow the official instructions to install Apptainer first. To use Apptainer mode, set the --use_apptainer parameter to True in your command. If --use_apptainer is False, OmniCode will use Docker automatically.
OmniCode Evaluation
To run the full OmniCode benchmark, you can pass the corresponding flags to the evaluation command line tool.
The omnicode command allows you to run multiple code evaluation benchmarks, such as BugFixing, TestGeneration, StyleReview, and ReviewResponse. You can specify flags to choose which benchmarks to execute. The command also supports running multiple benchmarks in one go.
Example 1: Running BugFixing for a single instance
OmniCode with the --BugFixing flag can be used to evaluate whether a patch resolves the tests for a particular issue.
In the following command, we pass --predictions_path gold to indicate that we want to evaluate the correct (gold) patch as a sanity check.
Passing the path to actual predictions here enables evaluation of generated patches.
This command will build the Docker image and run the evaluation on the instance astropy__astropy-13236 (which is a bug in the astropy library).
python omnicode.py --BugFixing --dataset_name data/omnicode_instances_python.json --predictions_path gold --run_id BugFixing --instance_ids astropy__astropy-13236 --use_apptainer False
Example 2: Running TestGeneration for a single instance
The following command with the --TestGeneration flag can be used to evaluate generated tests. The path to generated tests can be specified with --predictions_path.
python omnicode.py --TestGeneration --dataset_name data/omnicode_instances_python.json --predictions_path gold --language python --max_workers 1 --run_id BadPatchTest --use_apptainer False --instance_ids astropy__astropy-14995
Java Support
- Note: Bug Fixing and Test Generation agents also support Java repositories, including Java-specific build and test tooling. Please note that this is an experimental feature and may not always function correctly. In order to set up Java support, a few additional steps are needed:
- Add the desired repo into target_repos and repo_file_map in multiswebench_local/prepare_eval
- From the multiswebench_local directory, run python prepare_eval.py
- From the omnicode directory, run:
python omnicode.py --MSWEBugFixing --predictions_path gold --run_id mswebench_test --max_workers 1 --instance_ids elastic__logstash_17021 --mswe_phase all --force_rebuild True --clean True --use_apptainer False
For now, you should stick with the original three Java repos (elastic/logstash, alibaba/fastjson, mockito/mockito), since there may be some issues with repos that were added more recently.
The process often takes a while. Note that logging differs from standard swebench: output is written to a dedicated location under multiswebench_runs.
A custom predictions file can look like this, for example:
[
{
"id": "mockito/mockito:3424",
"org": "mockito",
"repo": "mockito",
"number": 3424,
"patch": "diff --git a..."
}
]
It should be saved in JSON format, and its path can replace gold in the example call above.
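Such a file can also be produced programmatically. The sketch below writes the example record from this README to a temporary JSON file; in practice you would write your agent's generated patches:

```python
import json
import os
import tempfile

# Write a custom predictions file in the format shown above. The fields
# (id, org, repo, number, patch) mirror the example JSON in this README.
preds = [
    {
        "id": "mockito/mockito:3424",
        "org": "mockito",
        "repo": "mockito",
        "number": 3424,
        "patch": "diff --git a...",
    }
]

path = os.path.join(tempfile.mkdtemp(), "preds.json")
with open(path, "w") as f:
    json.dump(preds, f, indent=2)

with open(path) as f:
    print(json.load(f)[0]["id"])  # mockito/mockito:3424
```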
MSWEBugFixing for newly onboarded Java Tasks
Prerequisites:
- The Multiswebench [org]__[repo]_dataset.jsonl for the new instance should be present
- Add the desired repo into target_repos and repo_file_map in multiswebench_local/prepare_eval
- From the multiswebench_local directory, run python prepare_eval.py
Example Command:
python omnicode.py --MSWEBugFixing --predictions_path gold --run_id mswebench_bugfixing_test --max_workers 1 --instance_ids google__gson_1093 --mswe_phase all --force_rebuild True --clean True --use_apptainer False
Java Test Generation
Test Generation for Java follows mostly the same format as Test Generation for Python. However, the output files are in a different format, and all instances must also exist in Multi-SWE-Bench's dataset.
Use the --MSWETestGeneration flag to run test generation for Java repos supported by multi-swe-bench.
Example Command
You can run test generation as follows. The flags behave the same as for Python test generation.
pyth
