
<div align="center">

ThetaEvolve: Test-time Learning on Open Problems

Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, Hao Cheng, Pengcheng He, Weizhu Chen, Shuohang Wang, Simon Shaolei Du*, Yelong Shen*

<br>

Paper · Code · X Summary

</div>

Outline

We introduce ThetaEvolve, an open-source pipeline that simplifies (e.g., by using a single LLM) and extends AlphaEvolve to efficiently scale both ❄️in-context learning and 🔥RL training at test time.

With ThetaEvolve, an 8B model can outperform AlphaEvolve on open optimization problems by scaling compute for inference or test-time RL🚀:

⭕Circle packing:

  • AlphaEvolve (Gemini-2.0-Flash/Pro): 2.63586276

  • Ours (R1-Qwen3-8B): 2.63598308
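The margin over AlphaEvolve is small in absolute terms; a quick check of the gap between the two scores above:

```python
# Gap between our circle-packing score and AlphaEvolve's (values from above).
alphaevolve = 2.63586276
ours = 2.63598308
gap = ours - alphaevolve
print(f"{gap:.8f}")  # 0.00012032
```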

Figure 1

Setup

Our RL environment follows the same setup as slime and OpenEvolve. We use Docker (run in the ThetaEvolve folder):

# Reproducible setup (recommended): pin the exact image digest.
# This digest corresponds to slimerl/slime:latest at the time of writing.
SLIME_IMAGE="slimerl/slime@sha256:704eb14e1b02ef229e4ab440981aa81b543716c335e2af32cb32ffdc030e3008"
docker pull "${SLIME_IMAGE}"

# Start the container
docker run --rm --name slime-evolve \
  --gpus all --ipc=host --shm-size=16g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --ulimit nofile=1048576:1048576 \
  -v "$PWD":/workspace -w /workspace \
  -v /path/to/disk:/data \
  -it "${SLIME_IMAGE}" /bin/bash

If you explicitly want the newest image instead of reproducibility, you can use:

docker pull slimerl/slime:latest

latest is mutable and may change over time. For reproducible experiments, always use the pinned digest.

You can verify the exact digest pulled on your machine with:

docker inspect --format='{{join .RepoDigests "\n"}}' slimerl/slime

After entering the docker, run the installation commands:

cd /workspace
pip install -e .
cd openevolve_adapted
pip install --ignore-installed blinker
rm -rf openevolve.egg-info && pip install -e .
cd ..

Tasks

You can find our tasks in openevolve_adapted/examples. It is easy to extend the pipeline to more tasks with continuous objective values.
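A task in this style boils down to an evaluator that maps a candidate solution to a continuous objective value. As a rough, hypothetical sketch (the function name and interface are illustrative, not the exact openevolve_adapted evaluator API), a packing-style evaluator could look like:

```python
# Hypothetical evaluator sketch: names and interface are illustrative,
# not the exact openevolve_adapted API.
import math

def evaluate(centers, radii):
    """Return a continuous score (sum of radii) if the packing is valid,
    else 0.0. Circles must fit in the unit square without overlapping."""
    # Containment: each circle must lie inside [0, 1] x [0, 1].
    for (x, y), r in zip(centers, radii):
        if x - r < 0 or x + r > 1 or y - r < 0 or y + r > 1:
            return 0.0
    # Pairwise non-overlap: center distance >= sum of radii.
    n = len(radii)
    for i in range(n):
        for j in range(i + 1, n):
            dx = centers[i][0] - centers[j][0]
            dy = centers[i][1] - centers[j][1]
            if math.hypot(dx, dy) < radii[i] + radii[j]:
                return 0.0
    return sum(radii)

# A trivially valid packing: two circles of radius 0.25 in opposite corners.
score = evaluate([(0.25, 0.25), (0.75, 0.75)], [0.25, 0.25])
print(score)  # 0.5
```

Any task whose quality can be scored this way (a single real-valued objective, with 0.0 or a floor value for invalid outputs) slots into the same loop.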

Run

To run the experiments, you can change the parameters in run.sh and then run bash run.sh directly (note that the 8B model requires at least 8×80GB GPUs, e.g., A100s).

First, remember to set SAVE_PATH to store checkpoints:

export SAVE_PATH=/path/to/disk/save

Then, for example, to run prorl-v2-1.5B on circle packing with RL training and the original score as reward, set:

#### Model selection ####
SMALL_MODEL_NAME="dpsk_prorl_v2_1.5b"

#### Task configuration ####
TASK="circle_packing_modular"

#### CONFIG_POSTFIX options ####
CONFIG_POSTFIX="it_XL"

#### Training mode: True for training, False for inference-only ####
IS_TRAINING=True

#### Training parameters ####
# Options: "original_reward", "rl_normalized_reward"
REWARD_PROCESS_TYPE="original_reward"

#### Lazy output penalty ####
# 1 -> child = parent
# 2 -> child = any program in database
LAZY_OUTPUT_PENALTY=1
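The lazy-output penalty discourages the model from returning an essentially unchanged program. As a hedged illustration of the two modes above (the real pipeline's comparison, e.g., its code normalization, may differ), mode 1 checks the child against its parent, and mode 2 checks it against every program in the database:

```python
# Illustrative sketch of the two LAZY_OUTPUT_PENALTY modes; the actual
# implementation's program comparison may normalize code differently.
def normalize(code: str) -> str:
    # Crude normalization: strip surrounding whitespace on each line.
    return "\n".join(line.strip() for line in code.strip().splitlines())

def is_lazy(child: str, parent: str, database: list[str], mode: int) -> bool:
    if mode == 1:  # penalize child == parent
        return normalize(child) == normalize(parent)
    if mode == 2:  # penalize child == any program in the database
        return any(normalize(child) == normalize(p) for p in database)
    raise ValueError("mode must be 1 or 2")

db = ["x = 1", "x = 2"]
print(is_lazy("x = 1 ", "y = 3", db, 1))  # False: differs from its parent
print(is_lazy("x = 1 ", "y = 3", db, 2))  # True: matches a database program
```

Mode 2 is the stricter setting: a child can dodge mode 1 by copying a sibling, but not mode 2.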

Finally set the wandb configurations:

WANDB_API_KEY=aaa
WANDB_ENTITY=bbb
WANDB_PROJECT=ccc

Then you can directly run

bash run.sh

Recommended logging for future reference

Use a fixed data root and keep per-run metadata + logs:

export SAVE_PATH=/data/thetaevolve
mkdir -p "${SAVE_PATH}"/{runs,logs}

RUN_TS=$(date +%Y%m%d_%H%M%S)
RUN_LOG_DIR="${SAVE_PATH}/runs/${RUN_TS}"
mkdir -p "${RUN_LOG_DIR}"

# Save reproducibility info
git rev-parse HEAD > "${RUN_LOG_DIR}/git_commit.txt"
cp run.sh "${RUN_LOG_DIR}/run.sh.snapshot"
cp scripts_evolve/Nemotron-Research-Reasoning-Qwen-1.5B/general.sh "${RUN_LOG_DIR}/general.sh.snapshot"

# Launch and tee logs
bash run.sh 2>&1 | tee "${RUN_LOG_DIR}/train.log"

This preserves the exact run script/config used for each experiment.

You can also adjust more parameters in scripts_evolve/Nemotron-Research-Reasoning-Qwen-1.5B/general.sh, such as the checkpoint saving frequency (default 10), the number of evaluation threads (default 16), and the number of GPUs (default 8).

Results

Some of the results we obtained are available in Results. You can run python vis.py to see the verification results in each sub-task directory.

For example, we have our best-known solution for circle packing (with zero tolerance) in Results/CirclePacking/figs/8B-w_RL@65-Formal.png and AlphaEvolve's solution in Results/CirclePacking/figs/AlphaEvolve.png:

<div align="center"> <img src="Results/CirclePacking/figs/8B-w_RL@65-Formal.png" width="49%"> <img src="Results/CirclePacking/figs/AlphaEvolve.png" width="47%"> </div>

We point out that our solution is better than AlphaEvolve’s, and that our configuration is asymmetric, whereas AlphaEvolve’s solution is symmetric.

The program that found it (with the 1e-6 tolerance used in OpenEvolve verification, detailed in the paper) is shown in Results/CirclePacking/programs/8B-w_RL@65.py. For the formal one (with zero tolerance, as in AlphaEvolve), the program is shown in Results/CirclePacking/programs/8B-w_RL@65-Formal.py. The latter has a dedicated function for determining how much to shrink the radii, but in general you can get close results by shrinking the radii by a small value such as 1e-9.
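The idea of turning a within-tolerance solution into a zero-tolerance one by shrinking radii can be sketched as follows (a simplified illustration of uniform shrinking, not the specific shrinking function in 8B-w_RL@65-Formal.py):

```python
# Simplified sketch: uniformly shrink all radii by a small epsilon so that
# constraints that held only within tolerance now hold exactly.
import math

def shrink(radii, eps=1e-9):
    return [r - eps for r in radii]

def is_valid(centers, radii):
    # Exact (zero-tolerance) checks: containment in the unit square, no overlap.
    for (x, y), r in zip(centers, radii):
        if x - r < 0 or x + r > 1 or y - r < 0 or y + r > 1:
            return False
    for i in range(len(radii)):
        for j in range(i + 1, len(radii)):
            d = math.hypot(centers[i][0] - centers[j][0],
                           centers[i][1] - centers[j][1])
            if d < radii[i] + radii[j]:
                return False
    return True

centers = [(0.25, 0.5), (0.75, 0.5)]
radii = [0.25 + 5e-10, 0.25]  # violates constraints by ~5e-10 (well within 1e-6)
print(is_valid(centers, radii))          # False under zero tolerance
print(is_valid(centers, shrink(radii)))  # True after shrinking by 1e-9
```

Shrinking costs at most n·eps of objective value, which is negligible at eps = 1e-9.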

We also provide results from other tasks for visualization.

If you want to run these programs or the initial program, you can set the parameters from the config file:

TASK="circle_packing_modular"
CONFIG_POSTFIX="it_XL"

# test command with verifier
OPENEVOLVE_CONFIG_PATH=$PWD/examples/${TASK}/configs/config_${TASK}_${CONFIG_POSTFIX}.yaml \
PYTHONPATH=$PWD \
python $PWD/examples/${TASK}/evaluators/evaluator_modular.py \
$PWD/examples/${TASK}/initial_programs/initial_program.py

Or you can just replace the program path in the command above with one of these programs and rerun it directly.

Citation

If you find our work useful, please consider citing:

@article{wang2025thetaevolve,
  title={ThetaEvolve: Test-time Learning on Open Problems},
  author={Wang, Yiping and Su, Shao-Rong and Zeng, Zhiyuan and Xu, Eva and Ren, Liliang and Yang, Xinyu and Huang, Zeyi and He, Xuehai and Ma, Luyao and Peng, Baolin and Cheng, Hao and He, Pengcheng and Chen, Weizhu and Wang, Shuohang and Du, Simon Shaolei and Shen, Yelong},
  journal={arXiv preprint arXiv:2511.23473},
  year={2025}
}
