DeepGym
RL training environments with verifiable rewards for coding agents. Works with TRL, Unsloth, verl, OpenRLHF.
Your model writes code. DeepGym runs it in an isolated sandbox, executes tests against it, and returns a structured reward signal -- per-test-case scores, shaped reward components, execution metrics -- that plugs straight into TRL, verl, OpenRLHF, or your own GRPO/DAPO/PPO loop.
DeepSeek-R1 deliberately avoided neural reward models for code because they're susceptible to reward hacking at scale. DAPO, QwQ-32B, and Open-R1 followed the same path: rule-based, execution-verified rewards. That's what DeepGym provides -- deterministic, execution-based scoring with per-test granularity, running in sandboxed containers so untrusted model outputs can't touch your infrastructure.
                 reward signal
    +-------------------------------------+
    |                                     |
    v                                     |
+-------+     +----------+     +--------------------+
| Model | --> | DeepGym  | --> |      Sandbox       |
+-------+     +----------+     | (Daytona / local)  |
    ^              |           +--------------------+
    |              |                     |
    |              v                     v
    |        +-----------+          +----------+
    |        | RunResult |<---------| Verifier |
    |        +-----------+          +----------+
    |              |                     |
    |              | score: 0.85        | JSON stdout
    |              | passed: false      | per-test cases
    |              | cases: [...]       | reward components
    |              v
    |    +-------------------+
    +----| Training Loop     |
         | (TRL/verl/ORLHF)  |
         +-------------------+
Install
pip install deepgym
<details>
<summary>More install options</summary>
# With Daytona sandbox support
pip install deepgym[daytona]
# With HuggingFace Hub integration
pip install deepgym[hf]
# With lm-evaluation-harness
pip install deepgym[lm-eval]
# Everything (dev + daytona + hf + lm-eval)
pip install deepgym[all]
# From source
git clone https://github.com/DeepGym/deepgym.git
cd deepgym
pip install -e ".[all]"
</details>
Quick Start
from deepgym import DeepGym, load_environment
dg = DeepGym(mode='local')
env = load_environment('coin_change')
solution = '''
def coin_change(coins, amount):
    dp = [float('inf')] * (amount + 1)
    dp[0] = 0
    for coin in coins:
        for x in range(coin, amount + 1):
            dp[x] = min(dp[x], dp[x - coin] + 1)
    return dp[amount] if dp[amount] != float('inf') else -1
'''
result = dg.run(env, model_output=solution)
print(result.score) # 1.0
print(result.passed) # True
print(result.cases) # per-test breakdown: which tests passed, which failed
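Each entry in result.cases is one test; printing them is the quickest way to see where a solution falls short (the exact per-case fields are an assumption here -- inspect one to see the real schema):

# Walk the per-test breakdown. Per-case fields are an assumption;
# print a case to see the actual schema in your version.
for i, case in enumerate(result.cases):
    print(i, case)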
How it works
 Model            DeepGym            Sandbox             Verifier
   |                 |                  |                    |
   | solution code   |                  |                    |
   |---------------->|                  |                    |
   |                 | create sandbox   |                    |
   |                 |----------------->|                    |
   |                 | upload files     |                    |
   |                 |----------------->|                    |
   |                 |                  | python verifier.py |
   |                 |                  |------------------->|
   |                 |                  |                    | run tests
   |                 |                  |                    | (seeded)
   |                 |                  |    JSON stdout     |
   |                 |                  |<-------------------|
   |                 | stdout + stderr  |                    |
   |                 |<-----------------|                    |
   |                 | parse JSON       |                    |
   |   RunResult     |                  |                    |
   |<----------------|                  |                    |
   |                 |                  |                    |
The verifier returns structured JSON: a 0.0-1.0 score, pass/fail, per-test-case breakdown, and optional shaped reward components (correctness, efficiency, style -- whatever you define). The per-test granularity is what makes this useful for training. Binary pass/fail is a sparse signal. Knowing that 12 out of 14 tests passed, and specifically which two failed, gives the optimizer something to work with -- this is the same approach used by CodePRM, PRIME, and Posterior-GRPO, but without needing a separate process reward model.
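Concretely, a verifier run prints a single JSON object to stdout along these lines (an illustrative sketch of the shape described above -- the test names and exact nesting are made up):

{
  "score": 0.85,
  "passed": false,
  "cases": [
    {"name": "test_basic", "passed": true},
    {"name": "test_edge_case", "passed": false}
  ],
  "reward_components": {"correctness": 0.85, "efficiency": 0.9}
}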
Why execution-based rewards
The field has largely converged here. "A Practitioner's Guide to Multi-Turn Agentic RL" found that execution-based unit-test rewards reached 22% success on SWE-Gym, versus 4.2% for sparse binary rewards and 7-9% for model-based judges (including GPT-4.1). DeepSeek-R1, DAPO, and QwQ-32B all use rule-based execution rewards rather than neural reward models.
The catch is infrastructure. You need sandboxed execution (you can't run untrusted model output on your training nodes), deterministic scoring (GRPO computes advantages across completions -- non-determinism breaks this), and structured output (binary pass/fail is too sparse for GRPO/DAPO to learn from). DeepGym handles all three.
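That determinism requirement is cheap to check yourself: score the same solution twice and compare (reusing dg, env, and solution from the Quick Start above):

# Deterministic seeded scoring: identical solution in, identical score out.
r1 = dg.run(env, model_output=solution)
r2 = dg.run(env, model_output=solution)
assert r1.score == r2.score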
What you get
- Execution-based verification -- the approach DeepSeek-R1, DAPO, and QwQ-32B converged on, not neural reward models
- Per-test reward signals -- test-case-level scores like CodePRM and PRIME provide, without training a separate PRM
- Shaped reward components -- a reward_components dict for multi-signal composition (correctness + efficiency + style), similar to Posterior-GRPO's gated reward approach
- Deterministic seeded scoring -- same solution, same score, every time. GRPO and DAPO both require this
- Sandboxed execution via Daytona -- container isolation for untrusted code, same pattern as verl's Sandbox Fusion and DeepSWE's 512-container setup
- Reward hack detection -- 6 adversarial attack strategies. Anthropic's Nov 2025 paper showed reward hacking during RL causes emergent misalignment. Check your verifiers before you train
- 24 built-in environments + 2,350+ importable benchmarks (HumanEval, MBPP, EvalPlus, BigCodeBench)
- Drop-in integrations -- Axolotl, TRL GRPOTrainer, verl compute_score, OpenRLHF reward server, lm-eval tasks, HF Hub
- PRM data generation -- convert per-test results into Axolotl-compatible stepwise supervision datasets via deepgym generate-prm
- Batch scoring -- score N completions in parallel with run_batch(), async client with semaphore-based concurrency
- Gymnasium API -- reset()/step() for multi-turn agent training, same interface as Agent-R1 and VerlTool (see the sketch below)
- REST API -- FastAPI server with async jobs and API key auth
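A minimal multi-turn loop for that Gymnasium API, assuming the environment object exposes the standard reset()/step() contract directly and using the same placeholder model as the other snippets in this README:

from deepgym import load_environment

env = load_environment('coin_change')

# Standard Gymnasium contract: reset() -> (obs, info),
# step() -> (obs, reward, terminated, truncated, info).
# Whether seeding goes through reset(seed=...) here is an assumption.
obs, info = env.reset(seed=0)
done = False
while not done:
    action = model.generate(obs)  # your policy emits code or a tool call
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated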
Usage
Score a single solution
from deepgym import DeepGym, load_environment
dg = DeepGym(mode='local')
env = load_environment('two_sum')
result = dg.run(env, model_output='def two_sum(nums, target): ...')
print(result.score) # 0.85
print(result.passed) # False
print(result.reward_components) # {'correctness': 0.85, 'efficiency': 0.9}
Batch scoring for GRPO
Generate N completions, score them all, compute advantages:
solutions = [model.generate(prompt) for _ in range(8)]
batch = dg.run_batch(env, solutions, max_parallel=8)
scores = [r.score for r in batch.results]
mean = sum(scores) / len(scores)
std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
advantages = [(s - mean) / (std + 1e-8) for s in scores]
TRL
from deepgym.integrations.trl import make_trl_reward_fn
from trl import GRPOTrainer
reward_fn = make_trl_reward_fn(env)
trainer = GRPOTrainer(model=model, reward_funcs=[reward_fn])
trainer.train()
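For reference, TRL reward functions are plain callables that take the batch's prompts and completions (plus any extra columns as kwargs) and return one float per completion; a hand-rolled equivalent might look like this (a sketch, not make_trl_reward_fn's actual internals):

def sandbox_reward_fn(prompts, completions, **kwargs):
    # Sandbox-score each completion; TRL expects one float per completion.
    batch = dg.run_batch(env, completions, max_parallel=len(completions))
    return [r.score for r in batch.results]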
verl
from deepgym.integrations.verl import make_verl_compute_score
compute_score = make_verl_compute_score(env)
# In verl config: custom_reward_function.path = "your_reward_module.py"
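The module that config path points at just needs to expose the scoring function at module level; a minimal sketch of your_reward_module.py, assuming verl resolves its default hook name compute_score:

# your_reward_module.py -- minimal sketch. Assumes verl looks up a
# module-level function named compute_score (its default
# custom_reward_function.name) from the configured path.
from deepgym import load_environment
from deepgym.integrations.verl import make_verl_compute_score

env = load_environment('two_sum')
compute_score = make_verl_compute_score(env)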
OpenRLHF
from fastapi import FastAPI
from deepgym.integrations.openrlhf import create_openrlhf_router
app = FastAPI()
app.include_router(create_openrlhf_router(env, dg))
# uvicorn app:app --port 8000
# POST /reward/score {"prompts": [...], "outputs": [...]} -> {"rewards": [...]}
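Once the server is up, the endpoint can be exercised with any HTTP client; a sketch using requests with the payload shape from the comment above (add your API key header if auth is enabled):

import requests

resp = requests.post(
    "http://localhost:8000/reward/score",
    json={"prompts": ["Write two_sum"], "outputs": ["def two_sum(nums, target): ..."]},
)
print(resp.json())  # {"rewards": [...]}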
lm-evaluation-harness
python -c "from deepgym.integrations.lm_eval import register_deepg
