RLSolver
GPU-based Massively Parallel Environments for Large-Scale Combinatorial Optimization (CO) Problems Using Reinforcement Learning
RLSolver Contest 2025: docs website
We aim to showcase the effectiveness of GPU-based massively parallel environments for large-scale combinatorial optimization (CO) problems using reinforcement learning (RL). With the help of GPU-based parallel environments, RL can significantly improve the sampling speed and obtain high-quality solutions within a short time.
Overview
RLSolver has three layers:
- Environments: providing massively parallel environments using GPUs.
- RL agents: providing RL algorithms, e.g., REINFORCE, PPO, and DQN.
- Problems: typical CO problems, e.g., graph maxcut, TNCO and TSP.
Key Technologies
- GPU-based massively parallel environments: Markov chain Monte Carlo (MCMC) simulations run on thousands of CUDA cores and tensor cores of GPUs.
- Distribution-wise methods are much faster than instance-wise methods (such as MCPG and ISCO), since we can obtain the results directly by inference.
Why Use GPU-based Massively Parallel Environments?
The bottleneck of using RL for solving large-scale CO problems, especially in distribution-wise scenarios, is the low sampling speed, since existing solver engines (a.k.a., gym-style environments) are implemented on CPUs. Training the policy network is essentially estimating the gradients via a Markov chain Monte Carlo (MCMC) simulation, which requires a large number of samples from environments.
Existing CPU-based environments have two significant disadvantages: 1) The number of CPU cores is typically small, generally ranging from 16 to 256, resulting in a small number of parallel environments. 2) The communication link between CPUs and GPUs has limited bandwidth. Massively parallel environments overcome these disadvantages, since we can build thousands of environments on the GPU and the communication bottleneck between CPUs and GPUs is bypassed; therefore the sampling speed is significantly improved.
Improving the Sampling Speed
The sampling speed denotes the number of collected samples per second. We compared CPU-based and GPU-based environments. The sampling speed is improved by at least two orders of magnitude by using GPU-based massively parallel environments compared with conventional CPU-based environments.
<a target="\_blank"> <div align="center"> <img src=rlsolver/fig/maxcutStepVsSamplingSpeed.png width="80%"/> </div> </a> We see that, as the number of CPU-based environments increases, the sampling speed does not improve noticeably. However, the sampling speed improves greatly when using GPUs.
Improving the Convergence Speed
<p align="center"> <img src="rlsolver/fig/obj_time.png" width="93%"> </p>To achieve the same objective value, the more parallel environments we use, the less running time is required.
Improving the Quality of Solutions
<a target="\_blank"> <div align="center"> <img src=rlsolver/fig/objectives_epochs.png width="80%"/> </div> </a>GPU-based parallel environments can significantly improve the quality of solutions during training, since RL methods require many high-quality samples from the environments. Take graph maxcut as an example. We select G22 in the Gset dataset. The above figure shows the objective values vs. the number of epochs with different numbers of GPU-based parallel environments. We see that, generally, the more parallel environments, the higher the objective values and the faster the convergence.
Comparison of libraries in RL for combinatorial optimization.
| Library | RL methods | Supported pattern | Actor-critic algs | Non-actor-critic algs | Euclidean topology | Non-Euclidean topology | Distribution-wise | Instance-wise | Problem-specific methods | Commercial solvers | Integration of OR and RL |
|---------|------------|-------------------|-------------------|-----------------------|--------------------|------------------------|-------------------|---------------|--------------------------|--------------------|--------------------------|
| Jumanji | A2C | I, II | Y | N | Y | N | Y | N | N | N | N |
| RL4CO | A2C, PPO, REINFORCE | I | Y | Only REINFORCE | Y | N | Y | N | N | N | N |
| RLSolver (Ours) | PPO, S2V-DQN, ECO-DQN, MCPG, dREINFORCE, ISCO, PI-GNN, etc. | I, II | Y | Y | Y | Y | Y | Y | Y | Y | Y |
RLSolver crosses two domains: operations research (OR) and RL. Commercial solver-based OR methods use Gurobi (or SCIP) based on ILP or QUBO/Ising models. RLSolver supports the integration of OR and RL, e.g., RL for branching, cutting plane, or column generation (CG).
Two Patterns in RL-based methods
<a target="\_blank"> <div align="center"> <img src=rlsolver/fig/parallelSimPattern.png width="80%"/> </div> </a>Sparse-rewards Pattern (I): Rewards are given only when the goal is reached, and are therefore sparse. We formulate the CO problem as a Markov decision process (MDP), and then use RL algorithms to select a node and add it to a node set. There are three important functions for a gym-style environment:
- reset(): Set the selected nodes as an empty set.
- step(): Select the node with the maximum Q-value and then add it to the set.
- reward(): Calculate the objective values over all parallel environments.
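The three functions above can be sketched as a batched maxcut environment, with states stored as 0/1 node-selection tensors. This is an illustrative sketch, not RLSolver's actual API; the class name `MaxcutEnvSketch` and its signatures are assumptions:

```python
import torch

class MaxcutEnvSketch:
    """Illustrative pattern-(I) gym-style batch of maxcut environments."""
    def __init__(self, adj: torch.Tensor, num_envs: int):
        self.adj = adj                 # (n, n) symmetric adjacency matrix
        self.n = adj.shape[0]
        self.num_envs = num_envs
        self.x = None                  # (num_envs, n) node-selection states

    def reset(self) -> torch.Tensor:
        # Set the selected node set to empty in every environment.
        self.x = torch.zeros(self.num_envs, self.n)
        return self.x

    def step(self, q_values: torch.Tensor):
        # Add the unselected node with the maximum Q-value in each environment.
        masked = q_values.masked_fill(self.x.bool(), float("-inf"))
        best = masked.argmax(dim=1)    # (num_envs,)
        self.x[torch.arange(self.num_envs), best] = 1.0
        return self.x, self.reward()

    def reward(self) -> torch.Tensor:
        # Cut value per environment: total weight of edges crossing the partition.
        cross = (self.x @ self.adj) * (1 - self.x)   # (num_envs, n)
        return cross.sum(dim=1)
```

All environments advance in lockstep with one tensor operation per function, which is what maps the batch onto the GPU's cores.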
Dense-rewards Pattern (II): Rewards are obtained immediately, and are therefore dense. Policy-based methods first formulate the CO problem as a QUBO problem, and then learn a policy, using, say, the REINFORCE algorithm, to minimize the Hamiltonian objective function. Here, the policy is a vector of probabilities that the nodes belong to the set. For example, for a graph with 3 nodes, the policy [0, 0, 0.9] means that the probabilities of the first two nodes belonging to the set are 0, and the probability of the third node belonging to the set is 0.9. We introduce four important functions for all parallel environments:
- reset(): Generate random initial solutions for all parallel environments.
- step(): Search for better solutions based on the current solutions. It has two sub-functions.
- sampling() is the sampling method.
- local_search() returns a better solution by flipping some bits. It can improve the quality of the current solution in a local domain.
- pick_good_xs(): Select the good solutions over all parallel environments, where each environment returns exactly one good solution with its corresponding objective value.
- obj(): Calculate the objective values.
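A compact sketch of these pattern-(II) functions on graph maxcut, with a trivial random-flip local search standing in for the real sampling logic. All function bodies here are illustrative assumptions, not RLSolver's implementation:

```python
import torch

def reset(num_envs: int, n: int) -> torch.Tensor:
    # Random initial binary solutions, one row per parallel environment.
    return torch.randint(0, 2, (num_envs, n)).float()

def obj(xs: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    # Maxcut objective per environment: weight of edges crossing the partition.
    return ((xs @ adj) * (1 - xs)).sum(dim=1)

def local_search(xs: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    # Flip one random bit per environment; keep the flip only where it improves.
    rows = torch.arange(xs.shape[0]).unsqueeze(1)
    idx = torch.randint(0, xs.shape[1], (xs.shape[0], 1))
    flipped = xs.clone()
    flipped[rows, idx] = 1 - flipped[rows, idx]
    better = obj(flipped, adj) > obj(xs, adj)
    return torch.where(better.unsqueeze(1), flipped, xs)

def step(xs: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    # sampling() would propose candidates here; this sketch goes straight
    # to local_search() to improve the current solutions.
    return local_search(xs, adj)

def pick_good_xs(xs: torch.Tensor, adj: torch.Tensor):
    # One good solution per environment with its corresponding objective value.
    return xs, obj(xs, adj)
```

Because every function operates on the whole `(num_envs, n)` tensor at once, all environments are searched in parallel on the GPU.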
Example (Graph Maxcut)
<a target="\_blank"> <div align="center"> <img src=rlsolver/fig/parallelSimMaxcut.png width="80%"/> </div> </a>Sparse-rewards Pattern (I): In the left part of the above figure, the initial state is empty, i.e., no node is selected. We then select node 1, which has the maximum Q-value, and add it to the state; the new state is [1], and the reward is 2.
Dense-rewards Pattern (II): In the right part of the above figure, the current state is [2, 3], i.e., nodes 2 and 3 are selected, and the objective value is 2. The new state is [1, 3, 4], i.e., nodes 1, 3, and 4 are selected, and the objective value is 4.
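The objective value of a node-set state such as [2, 3] can be computed by converting the set into a 0/1 indicator vector. The sketch below uses a stand-in 4-node cycle graph, not the graph in the figure:

```python
import torch

def cut_value(selected_nodes, adj):
    # Convert a node set (1-indexed, as in the figure) to a 0/1 indicator
    # vector, then sum the weights of edges with exactly one endpoint selected.
    n = adj.shape[0]
    x = torch.zeros(n)
    x[torch.tensor(selected_nodes) - 1] = 1.0
    return ((x @ adj) * (1 - x)).sum().item()

# Stand-in graph: a 4-node cycle 1-2-3-4-1 (not the figure's graph).
adj = torch.tensor([[0., 1., 0., 1.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [1., 0., 1., 0.]])
print(cut_value([2, 3], adj))   # → 2.0 (edges 1-2 and 3-4 are cut)
```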
Implementation for GPU-based Parallelism
- All states and objective values are stored as PyTorch tensors, so that they are mapped to CUDA cores and tensor cores of GPUs.
- We use vmap (one GPU) or pmap (multiple GPUs) to push the map into PyTorch operations, effectively vectorizing those operations.
For example, we calculate the objective values of states over all parallel environments with the following code:

```python
from torch import vmap

batched_obj = vmap(objective)  # vectorize objective over the environment dimension
objs = batched_obj(states)
```

where the `objective` function denotes the calculation of the objective value for one state.
Dimension of states and objective values: The states over all parallel environments are stored as a PyTorch tensor with the environments as the first dimension and the graph nodes as the second dimension. For example, for a graph with 100 nodes and 1000 environments, the dimension of states is 1000 * 100. Correspondingly, the dimension of the objective values over all parallel environments is 1000 * 1.
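Putting the dimensions and the vmap call together, a runnable sketch (the maxcut `objective` and the random adjacency matrix are placeholders; `from torch import vmap` assumes PyTorch >= 2.0):

```python
import torch
from torch import vmap  # available in PyTorch 2.0+

num_envs, num_nodes = 1000, 100
# Placeholder: random symmetric 0/1 adjacency matrix.
adj = torch.triu((torch.rand(num_nodes, num_nodes) < 0.1).float(), diagonal=1)
adj = adj + adj.T
# States over all parallel environments: (1000, 100).
states = torch.randint(0, 2, (num_envs, num_nodes)).float()

def objective(x):
    # Maxcut value of a single state x of shape (num_nodes,).
    return ((x @ adj) * (1 - x)).sum()

batched_obj = vmap(objective)
objs = batched_obj(states)   # one objective value per environment: shape (1000,)
```

On a GPU, moving `adj` and `states` to `device="cuda"` runs the whole batch with the same code.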
Methods
RLSolver provides a diverse range of methods for large-scale combinatorial optimization (CO) problems.
| Method | Category | Pattern | Source | Parallel environments |
|---------------|--------------|---------|--------|-----------------------|
| Gurobi (QUBO) | Conventional | - | code | - |
| Greedy | Conventional | - | code | - |
| SDP | Conventional | - | code | - |
| S2V-DQN | RL | I | [code](https://gith
