DeepGym
RL training environments with verifiable rewards for coding agents. Works with TRL, Unsloth, verl, OpenRLHF.
Your model writes code. DeepGym runs it in an isolated sandbox, executes tests against it, and returns a structured reward signal -- per-test-case scores, shaped reward components, execution metrics -- that plugs straight into TRL, verl, OpenRLHF, or your own GRPO/DAPO/PPO loop.
DeepSeek-R1 deliberately avoided neural reward models for code because they're susceptible to reward hacking at scale. DAPO, QwQ-32B, and Open-R1 followed the same path: rule-based, execution-verified rewards. That's what DeepGym provides -- deterministic, execution-based scoring with per-test granularity, running in sandboxed containers so untrusted model outputs can't touch your infrastructure.
                 reward signal
    +-------------------------------------+
    |                                     |
    v                                     |
+-------+     +----------+     +--------------------+
| Model | --> | DeepGym  | --> |      Sandbox       |
+-------+     +----------+     | (Daytona / local)  |
    ^              |           +--------------------+
    |              |                     |
    |              v                     v
    |        +-----------+          +----------+
    |        | RunResult |<---------| Verifier |
    |        +-----------+          +----------+
    |              |                     |
    |              | score: 0.85        | JSON stdout
    |              | passed: false      | per-test cases
    |              | cases: [...]       | reward components
    |              v
    |    +-------------------+
    +----| Training Loop     |
         | (TRL/verl/ORLHF)  |
         +-------------------+
Install
pip install deepgym
<details>
<summary>More install options</summary>
# With Daytona sandbox support
pip install deepgym[daytona]
# With HuggingFace Hub integration
pip install deepgym[hf]
# With lm-evaluation-harness
pip install deepgym[lm-eval]
# Everything (dev + daytona + hf + lm-eval)
pip install deepgym[all]
# From source
git clone https://github.com/DeepGym/deepgym.git
cd deepgym
pip install -e ".[all]"
</details>
Quick Start
from deepgym import DeepGym, load_environment
dg = DeepGym(mode='local')
env = load_environment('coin_change')
solution = '''
def coin_change(coins, amount):
    dp = [float('inf')] * (amount + 1)
    dp[0] = 0
    for coin in coins:
        for x in range(coin, amount + 1):
            dp[x] = min(dp[x], dp[x - coin] + 1)
    return dp[amount] if dp[amount] != float('inf') else -1
'''
result = dg.run(env, model_output=solution)
print(result.score) # 1.0
print(result.passed) # True
print(result.cases) # per-test breakdown: which tests passed, which failed
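Each entry in result.cases is one test; printing them is the quickest way to see where a solution falls short (the exact per-case fields are an assumption here -- inspect one to see the real schema):

# Walk the per-test breakdown. Per-case fields are an assumption;
# print a case to see the actual schema in your version.
for i, case in enumerate(result.cases):
    print(i, case)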
How it works
 Model            DeepGym            Sandbox             Verifier
   |                 |                  |                    |
   | solution code   |                  |                    |
   |---------------->|                  |                    |
   |                 | create sandbox   |                    |
   |                 |----------------->|                    |
   |                 | upload files     |                    |
   |                 |----------------->|                    |
   |                 |                  | python verifier.py |
   |                 |                  |------------------->|
   |                 |                  |                    | run tests
   |                 |                  |                    | (seeded)
   |                 |                  |    JSON stdout     |
   |                 |                  |<-------------------|
   |                 | stdout + stderr  |                    |
   |                 |<-----------------|                    |
   |                 | parse JSON       |                    |
   |   RunResult     |                  |                    |
   |<----------------|                  |                    |
   |                 |                  |                    |
The verifier returns structured JSON: a 0.0-1.0 score, pass/fail, per-test-case breakdown, and optional shaped reward components (correctness, efficiency, style -- whatever you define). The per-test granularity is what makes this useful for training. Binary pass/fail is a sparse signal. Knowing that 12 out of 14 tests passed, and specifically which two failed, gives the optimizer something to work with -- this is the same approach used by CodePRM, PRIME, and Posterior-GRPO, but without needing a separate process reward model.
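Concretely, a verifier run prints a single JSON object to stdout along these lines (an illustrative sketch of the shape described above -- the test names and exact nesting are made up):

{
  "score": 0.85,
  "passed": false,
  "cases": [
    {"name": "test_basic", "passed": true},
    {"name": "test_edge_case", "passed": false}
  ],
  "reward_components": {"correctness": 0.85, "efficiency": 0.9}
}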
Why execution-based rewards
The field has largely converged here. "A Practitioner's Guide to Multi-Turn Agentic RL" found that execution-based unit-test rewards reached 22% success on SWE-Gym, versus 4.2% for sparse binary rewards and 7-9% for model-based judges (including GPT-4.1). DeepSeek-R1, DAPO, and QwQ-32B all use rule-based execution rewards rather than neural reward models.
The catch is infrastructure. You need sandboxed execution (you can't run untrusted model output on your training nodes), deterministic scoring (GRPO computes advantages across completions -- non-determinism breaks this), and structured output (binary pass/fail is too sparse for GRPO/DAPO to learn from). DeepGym handles all three.
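That determinism requirement is cheap to check yourself: score the same solution twice and compare (reusing dg, env, and solution from the Quick Start above):

# Deterministic seeded scoring: identical solution in, identical score out.
r1 = dg.run(env, model_output=solution)
r2 = dg.run(env, model_output=solution)
assert r1.score == r2.score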
What you get
- Execution-based verification -- the approach DeepSeek-R1, DAPO, and QwQ-32B converged on, not neural reward models
- Per-test reward signals -- test-case-level scores like CodePRM and PRIME provide, without training a separate PRM
- Shaped reward components -- a reward_components dict for multi-signal composition (correctness + efficiency + style), similar to Posterior-GRPO's gated reward approach
- Deterministic seeded scoring -- same solution, same score, every time. GRPO and DAPO both require this
- Sandboxed execution via Daytona -- container isolation for untrusted code, same pattern as verl's Sandbox Fusion and DeepSWE's 512-container setup
- Reward hack detection -- 6 adversarial attack strategies. Anthropic's Nov 2025 paper showed reward hacking during RL causes emergent misalignment. Check your verifiers before you train
- 24 built-in environments + 2,350+ importable benchmarks (HumanEval, MBPP, EvalPlus, BigCodeBench)
- Drop-in integrations -- Axolotl, TRL GRPOTrainer, verl compute_score, OpenRLHF reward server, lm-eval tasks, HF Hub
- PRM data generation -- convert per-test results into Axolotl-compatible stepwise supervision datasets via deepgym generate-prm
- Batch scoring -- score N completions in parallel with run_batch(), async client with semaphore-based concurrency
- Gymnasium API -- reset()/step() for multi-turn agent training, same interface as Agent-R1 and VerlTool (see the sketch below)
- REST API -- FastAPI server with async jobs and API key auth
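A minimal multi-turn loop for that Gymnasium API, assuming the environment object exposes the standard reset()/step() contract directly and using the same placeholder model as the other snippets in this README:

from deepgym import load_environment

env = load_environment('coin_change')

# Standard Gymnasium contract: reset() -> (obs, info),
# step() -> (obs, reward, terminated, truncated, info).
# Whether seeding goes through reset(seed=...) here is an assumption.
obs, info = env.reset(seed=0)
done = False
while not done:
    action = model.generate(obs)  # your policy emits code or a tool call
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated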
Usage
Score a single solution
from deepgym import DeepGym, load_environment
dg = DeepGym(mode='local')
env = load_environment('two_sum')
result = dg.run(env, model_output='def two_sum(nums, target): ...')
print(result.score) # 0.85
print(result.passed) # False
print(result.reward_components) # {'correctness': 0.85, 'efficiency': 0.9}
Batch scoring for GRPO
Generate N completions, score them all, compute advantages:
solutions = [model.generate(prompt) for _ in range(8)]
batch = dg.run_batch(env, solutions, max_parallel=8)
scores = [r.score for r in batch.results]
mean = sum(scores) / len(scores)
std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
advantages = [(s - mean) / (std + 1e-8) for s in scores]
TRL
from deepgym.integrations.trl import make_trl_reward_fn
from trl import GRPOTrainer
reward_fn = make_trl_reward_fn(env)
trainer = GRPOTrainer(model=model, reward_funcs=[reward_fn])
trainer.train()
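For reference, TRL reward functions are plain callables that take the batch's prompts and completions (plus any extra columns as kwargs) and return one float per completion; a hand-rolled equivalent might look like this (a sketch, not make_trl_reward_fn's actual internals):

def sandbox_reward_fn(prompts, completions, **kwargs):
    # Sandbox-score each completion; TRL expects one float per completion.
    batch = dg.run_batch(env, completions, max_parallel=len(completions))
    return [r.score for r in batch.results]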
verl
from deepgym.integrations.verl import make_verl_compute_score
compute_score = make_verl_compute_score(env)
# In verl config: custom_reward_function.path = "your_reward_module.py"
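The module that config path points at just needs to expose the scoring function at module level; a minimal sketch of your_reward_module.py, assuming verl resolves its default hook name compute_score:

# your_reward_module.py -- minimal sketch. Assumes verl looks up a
# module-level function named compute_score (its default
# custom_reward_function.name) from the configured path.
from deepgym import load_environment
from deepgym.integrations.verl import make_verl_compute_score

env = load_environment('two_sum')
compute_score = make_verl_compute_score(env)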
OpenRLHF
from fastapi import FastAPI
from deepgym.integrations.openrlhf import create_openrlhf_router
app = FastAPI()
app.include_router(create_openrlhf_router(env, dg))
# uvicorn app:app --port 8000
# POST /reward/score {"prompts": [...], "outputs": [...]} -> {"rewards": [...]}
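Once the server is up, the endpoint can be exercised with any HTTP client; a sketch using requests with the payload shape from the comment above (add your API key header if auth is enabled):

import requests

resp = requests.post(
    "http://localhost:8000/reward/score",
    json={"prompts": ["Write two_sum"], "outputs": ["def two_sum(nums, target): ..."]},
)
print(resp.json())  # {"rewards": [...]}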
lm-evaluation-harness
python -c "from deepgym.integrations.lm_eval import register_deepg
