OmniCode
OmniCode: A Diverse Software Engineering Benchmark for Evaluating Large Language Models
Welcome to OmniCode! This is a benchmark for evaluating LLM-powered agents on a range of software development activities. Below, you'll find the commands to test your setup and evaluate the results.
OmniCode synthetically builds multiple tasks out of a base dataset to holistically evaluate software engineering agents. We consider four different types of tasks: bug fixing, test generation, responding to code review, and enforcing style guidelines.
<img width="800" height="400" alt="image" src="https://github.com/user-attachments/assets/46a4e55c-d8fd-4940-a7ad-26ea746f6c54" />
Supported Tasks
In this section, you will find details of the different specifications of our tasks: Bug Fixing, Test Generation, Style Review, and Review Response!
Bug Fixing Evaluation (--BugFixing)
- Description: The agent receives a repository and PR description, identifies and applies minimal source code changes (excluding tests) to meet the specified requirements. It verifies the fix by reproducing the issue, applying the fix, re-running the relevant test, and ensuring completeness.
- Evaluation: Success is measured by the fix passing all relevant tests without introducing unintended changes.
- Use Case: Ideal for evaluating a model’s ability to make minimal, correct, and test-verified code changes.
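The pass/fail criterion described above can be sketched as follows, in the style of SWE-bench's FAIL_TO_PASS / PASS_TO_PASS convention. The function and field names here are illustrative assumptions, not OmniCode's actual API:

```python
# Sketch: a fix is "resolved" when every previously failing test now passes
# (fail_to_pass) and no previously passing test regresses (pass_to_pass).

def is_resolved(fail_to_pass, pass_to_pass, results):
    """results maps test id -> "PASSED" / "FAILED" after applying the patch."""
    return (
        all(results.get(t) == "PASSED" for t in fail_to_pass)
        and all(results.get(t) == "PASSED" for t in pass_to_pass)
    )

# A patch that fixes the reported test without breaking existing ones:
print(is_resolved(["test_bug"], ["test_existing"],
                  {"test_bug": "PASSED", "test_existing": "PASSED"}))  # True

# A patch that fixes the bug but introduces a regression:
print(is_resolved(["test_bug"], ["test_existing"],
                  {"test_bug": "PASSED", "test_existing": "FAILED"}))  # False
```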
Test Generation Evaluation (--TestGeneration)
- Description: The agent receives a repository and a problem description, then writes a new test in the repository’s test suite that reproduces the reported issue using the existing testing framework (e.g., pytest).
- Evaluation: Success is measured by the test failing on incorrect implementations and passing on correct ones.
- Use Case: Useful for assessing a model's ability to generate meaningful, differentiating test cases.
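The "fail on incorrect, pass on correct" criterion can be illustrated with a toy example (the clamp function and its bug are invented for illustration; the real benchmark runs the generated test against the buggy and patched repository states):

```python
# A generated test is "differentiating" iff it fails on the buggy
# implementation and passes on the fixed one.

def clamp_buggy(x, lo, hi):
    return min(x, hi)          # bug: ignores the lower bound

def clamp_fixed(x, lo, hi):
    return max(lo, min(x, hi))

def reproducing_test(clamp):
    return clamp(-5, 0, 10) == 0   # True iff the issue is fixed

fails_on_buggy = not reproducing_test(clamp_buggy)
passes_on_fixed = reproducing_test(clamp_fixed)
print(fails_on_buggy and passes_on_fixed)  # True: the test differentiates
```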
Style Review Evaluation (--StyleReview)
- Description: The agent runs a style check on a given instance, applies fixes for detected issues, and verifies functionality remains unaffected by re-running relevant tests.
- Evaluation: Success is measured by the reduction of style violations without breaking functionality.
- Use Case: Designed for scenarios where code quality and adherence to style guidelines are important.
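A minimal sketch of the success criterion, assuming a toy style checker (the real benchmark uses a proper linter, not this regex):

```python
import re

# Toy style check: count trailing-whitespace and over-long lines, then
# verify that a cleanup reduces the count without breaking tests.

def count_violations(source, max_len=79):
    count = 0
    for line in source.splitlines():
        if re.search(r"[ \t]+$", line):   # trailing whitespace
            count += 1
        if len(line) > max_len:           # line too long
            count += 1
    return count

before = "def add(a, b):   \n    return a + b  \n"
after = "def add(a, b):\n    return a + b\n"

tests_still_pass = True  # stand-in for re-running the relevant test suite
success = count_violations(after) < count_violations(before) and tests_still_pass
print(success)  # True
```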
Review Response Evaluation (--ReviewResponse)
- Description: The agent receives a problem description, a failed patch, and a review explaining the failure. It uses this context to avoid repeating mistakes and implements an improved fix. The evaluation is the same as BugFixing since we check whether the predicted patch passes the final tests.
- Evaluation: Success is measured by whether the improved patch resolves the issue while avoiding pitfalls highlighted in the review.
- Use Case: Especially relevant for testing a model’s ability to apply reviewer feedback to refine implementations.
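The extra context a Review Response instance carries, compared to plain bug fixing, can be sketched as below. The field names (problem_statement, failed_patch, review) are hypothetical, chosen only to illustrate the inputs described above:

```python
# Sketch: assembling the agent's input from a Review Response instance.
# Field names are assumptions, not OmniCode's actual schema.

def build_review_context(instance):
    return (
        f"Issue:\n{instance['problem_statement']}\n\n"
        f"Previously attempted patch (rejected):\n{instance['failed_patch']}\n\n"
        f"Review of why it failed:\n{instance['review']}\n\n"
        "Write an improved patch that addresses the review."
    )

instance = {
    "problem_statement": "TimeDelta serialization loses precision.",
    "failed_patch": "diff --git a/...",
    "review": "The patch rounds instead of truncating; the zero case breaks.",
}
prompt = build_review_context(instance)
print("Review of why it failed" in prompt)  # True
```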
Setup
Environment
OmniCode requires Python 3.13. Its dependencies can be installed via pip install -r requirements.txt
Clone
The repository contains submodules that must be cloned as well. You can clone our repo with:
git clone --recursive git@github.com:seal-research/OmniCode.git
cd OmniCode
Dataset
Our dataset is currently hosted on Hugging Face at seal-research/OmniCode.
To use OmniCode, download the data from our Hugging Face repo to the ./data directory:
pip install -U huggingface_hub
hf download seal-research/OmniCode \
--repo-type dataset \
--local-dir data
OmniCode/
└── data/
├── omnicode_instances_python.json
├── omnicode_instances_java.json
├── omnicode_instances_cpp.json
├── omnicode_style_instances_python.json
├── omnicode_style_instances_java.json
└── omnicode_style_instances_cpp.json
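Once downloaded, instances can be loaded with the standard json module. The field names below (instance_id, repo, problem_statement) follow the SWE-bench convention and are assumptions about the OmniCode schema; inspect the downloaded file to confirm:

```python
import json

# Minimal sketch of loading instances from data/omnicode_instances_python.json
# and selecting one by id. The inline string stands in for
# open("data/omnicode_instances_python.json").read().
raw = '''[
  {"instance_id": "astropy__astropy-13236",
   "repo": "astropy/astropy",
   "problem_statement": "..."}
]'''

instances = json.loads(raw)
by_id = {inst["instance_id"]: inst for inst in instances}
print(by_id["astropy__astropy-13236"]["repo"])  # astropy/astropy
```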
Submodules
OmniCode is currently set up to work with specific versions of SWE-bench and Multi-SWE-bench, which can be installed using:
cd SWE-bench
pip install .
cd ..
cd multi-swe-bench
pip install .
cd ..
Alternatively, if you are comfortable using git submodules, you can use:
git submodule update --init --recursive
cd <submodule_path>
pip install .
NOTE: Running pip install . in multi-swe-bench installs multi-swe-bench as a package. If you make changes to multi-swe-bench and wish to run/test them locally, re-run pip install . in the multi-swe-bench folder to update the package for your local OmniCode.
Apptainer
Compared to Docker, Apptainer is designed to run containers without requiring root privileges, making it more suitable for shared or restricted environments (e.g., HPC clusters). Docker is commonly used in service and cloud deployments but usually relies on root or privileged daemons. By supporting Apptainer, OmniCode enables containerized workflows for users who cannot use Docker due to permission or security constraints.
Follow the official instructions to install Apptainer first. To use Apptainer mode, set the --use_apptainer parameter to True in your command. If --use_apptainer is False, OmniCode will use Docker automatically.
OmniCode Evaluation
To run the full OmniCode benchmark, you can pass the corresponding flags to the evaluation command line tool.
The omnicode command allows you to run multiple code evaluation benchmarks, such as BugFixing, TestGeneration, StyleReview, and ReviewResponse. You can specify flags to choose which benchmarks to execute. The command also supports running multiple benchmarks in one go.
Example 1: Running BugFixing for a single instance
OmniCode with the --BugFixing flag can be used to evaluate whether a patch resolves the tests for a particular issue.
In the following command, we pass --predictions_path gold to indicate that we want to evaluate the correct (gold) patch as a sanity check.
Passing the path to actual predictions here enables evaluation of generated patches.
This command will build the Docker image and run the evaluation on the instance astropy__astropy-13236 (which is a bug in the astropy library).
python omnicode.py --BugFixing --dataset_name data/omnicode_instances_python.json --predictions_path gold --run_id BugFixing --instance_ids astropy__astropy-13236 --use_apptainer False
Example 2: Running TestGeneration for a single instance
The following command with the --TestGeneration flag can be used to evaluate generated tests. The path to generated tests can be specified with --predictions_path.
python omnicode.py --TestGeneration --dataset_name data/omnicode_instances_python.json --predictions_path gold --language python --max_workers 1 --run_id BadPatchTest --use_apptainer False --instance_ids astropy__astropy-14995
Java Support
- Note: Bug Fixing and Test Generation agents also support Java repositories, including Java-specific build and test tooling. Please note that this is an experimental feature and may not always function correctly. In order to set up Java support, a few additional steps are needed:
- Add the desired repo into target_repos and repo_file_map in multiswebench_local/prepare_eval
- From the multiswebench_local directory, run python prepare_eval.py
- From the omnicode directory, run:
python omnicode.py --MSWEBugFixing --predictions_path gold --run_id mswebench_test --max_workers 1 --instance_ids elastic__logstash_17021 --mswe_phase all --force_rebuild True --clean True --use_apptainer False
For now, you should stick with the original three Java repos (elastic/logstash, alibaba/fastjson, mockito/mockito), since there may be some issues with repos that were added more recently.
The process often takes a while. Note that logging differs from standard swebench: output is written to a dedicated location under multiswebench_runs.
A custom predictions file can look like this, for example:
[
{
"id": "mockito/mockito:3424",
"org": "mockito",
"repo": "mockito",
"number": 3424,
"patch": "diff --git a..."
}
]
It should be saved in JSON format, and its path can replace gold in the example call above.
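Such a file can also be produced programmatically. The sketch below writes the example record from this README to a temporary JSON file; in practice you would write your agent's generated patches:

```python
import json
import os
import tempfile

# Write a custom predictions file in the format shown above. The fields
# (id, org, repo, number, patch) mirror the example JSON in this README.
preds = [
    {
        "id": "mockito/mockito:3424",
        "org": "mockito",
        "repo": "mockito",
        "number": 3424,
        "patch": "diff --git a...",
    }
]

path = os.path.join(tempfile.mkdtemp(), "preds.json")
with open(path, "w") as f:
    json.dump(preds, f, indent=2)

with open(path) as f:
    print(json.load(f)[0]["id"])  # mockito/mockito:3424
```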
MSWEBugFixing for newly onboarded Java Tasks
Prerequisites:
- The Multiswebench [org]__[repo]_dataset.jsonl for the new instance should be present
- Add the desired repo into target_repos and repo_file_map in multiswebench_local/prepare_eval
- From the multiswebench_local directory, run python prepare_eval.py
Example Command:
python omnicode.py --MSWEBugFixing --predictions_path gold --run_id mswebench_bugfixing_test --max_workers 1 --instance_ids google__gson_1093 --mswe_phase all --force_rebuild True --clean True --use_apptainer False
Java Test Generation
Test Generation for Java follows mostly the same format as Test Generation for Python. However, the output files are in a different format, and all instances must also exist in Multi-SWE-Bench's dataset.
Use the --MSWETestGeneration flag to run test generation for Java repos supported by multi-swe-bench.
Example Command
You can run test generation as follows. The flags behave the same as for Python test generation.
pyth
