
ch.ethz.idsc.subare <a href="https://travis-ci.org/idsc-frazzoli/subare"><img src="https://travis-ci.org/idsc-frazzoli/subare.svg?branch=master" alt="Build Status"></a>

Library for reinforcement learning in Java, version 0.3.8

The repository includes algorithms, examples, and exercises from the 2nd edition of Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.

Our implementation is inspired by the Python code by Shangtong Zhang, but differs from that reference in two respects:

  • the algorithms are implemented separately from the problem scenarios
  • the math is carried out in exact precision, which reproduces symmetries in the results whenever the problem itself features symmetries
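The exact-precision point can be illustrated with a minimal standalone sketch (not the subare API, which builds on the tensor library): a Bellman backup carried out in rational arithmetic yields bit-identical values for symmetric states, with no floating-point rounding noise.

```java
import java.math.BigInteger;

// Illustrative only: exact rational arithmetic for a Bellman backup.
// Symmetric inputs produce exactly equal outputs, which doubles cannot guarantee.
public final class ExactBackup {
  // reduced fraction num/den with positive denominator
  public record Rational(BigInteger num, BigInteger den) {
    public static Rational of(long n, long d) {
      return reduce(BigInteger.valueOf(n), BigInteger.valueOf(d));
    }
    static Rational reduce(BigInteger n, BigInteger d) {
      BigInteger g = n.gcd(d);
      if (d.signum() < 0)
        g = g.negate(); // keep denominator positive
      return new Rational(n.divide(g), d.divide(g));
    }
    public Rational add(Rational o) {
      return reduce(num.multiply(o.den).add(o.num.multiply(den)), den.multiply(o.den));
    }
    public Rational mul(Rational o) {
      return reduce(num.multiply(o.num), den.multiply(o.den));
    }
  }

  // one backup: v'(s) = r(s) + gamma * sum_s' p(s'|s) v(s')
  public static Rational backup(Rational reward, Rational gamma, Rational[] probs, Rational[] v) {
    Rational acc = Rational.of(0, 1);
    for (int i = 0; i < probs.length; i++)
      acc = acc.add(probs[i].mul(v[i]));
    return reward.add(gamma.mul(acc));
  }

  public static void main(String[] args) {
    Rational gamma = Rational.of(9, 10);
    Rational third = Rational.of(1, 3);
    Rational[] v = { third, third, third };
    // two states with identical transition structure get identical values
    Rational a = backup(Rational.of(1, 1), gamma, new Rational[] { third, third, third }, v);
    Rational b = backup(Rational.of(1, 1), gamma, new Rational[] { third, third, third }, v);
    System.out.println(a.equals(b)); // exact equality
  }
}
```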

Algorithms

  • Iterative Policy Evaluation (parallel, in 4.1, p.59)
  • Value Iteration to determine V*(s) (parallel, in 4.4, p.65)
  • Action-Value Iteration to determine Q*(s,a) (parallel)
  • First Visit Policy Evaluation (in 5.1, p.74)
  • Monte Carlo Exploring Starts (in 5.3, p.79)
  • Constant-alpha Monte Carlo
  • Tabular Temporal Difference (in 6.1, p.96)
  • Sarsa: An on-policy TD control algorithm (in 6.4, p.104)
  • Q-learning: An off-policy TD control algorithm (in 6.5, p.105)
  • Expected Sarsa (in 6.6, p.107)
  • Double Sarsa, Double Expected Sarsa, Double Q-Learning (in 6.7, p.109)
  • n-step Temporal Difference for estimating V(s) (in 7.1, p.115)
  • n-step Sarsa, n-step Expected Sarsa, n-step Q-Learning (in 7.2, p.118)
  • Random-sample one-step tabular Q-planning (parallel, in 8.1, p.131)
  • Tabular Dyna-Q (in 8.2, p.133)
  • Prioritized Sweeping (in 8.4, p.137)
  • Semi-gradient Tabular Temporal Difference (in 9.3, p.164)
  • True Online Sarsa (in 12.8, p.309)
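The library's own classes are not shown here, but the flavor of a tabular method is easy to sketch. The following self-contained Q-learning loop (Section 6.5 of the book) runs on a hypothetical 3-state chain that is not part of the repository; it implements the textbook update Q(s,a) ← Q(s,a) + α [r + γ maxₐ′ Q(s′,a′) − Q(s,a)].

```java
import java.util.Random;

// Illustrative tabular Q-learning on a tiny deterministic chain 0 - 1 - 2,
// where state 2 is terminal; not the subare API.
public final class TabularQLearning {
  static final int STATES = 3, ACTIONS = 2; // actions: 0 = left, 1 = right

  static int step(int s, int a) { // deterministic transition
    return a == 1 ? Math.min(s + 1, STATES - 1) : Math.max(s - 1, 0);
  }

  static double reward(int sNext) {
    return sNext == STATES - 1 ? 1.0 : 0.0; // reward only on reaching the goal
  }

  public static double[][] train(long seed, int episodes) {
    double alpha = 0.5, gamma = 0.9, epsilon = 0.1;
    double[][] q = new double[STATES][ACTIONS];
    Random random = new Random(seed);
    for (int e = 0; e < episodes; e++) {
      int s = 0;
      while (s != STATES - 1) {
        int a = random.nextDouble() < epsilon // epsilon-greedy behavior policy
            ? random.nextInt(ACTIONS)
            : (q[s][1] >= q[s][0] ? 1 : 0);
        int sNext = step(s, a);
        double target = reward(sNext) + gamma * Math.max(q[sNext][0], q[sNext][1]);
        q[s][a] += alpha * (target - q[s][a]); // off-policy TD update
        s = sNext;
      }
    }
    return q;
  }

  public static void main(String[] args) {
    double[][] q = train(42, 500);
    // the greedy policy should move right in every non-terminal state
    for (int s = 0; s < STATES - 1; s++)
      System.out.println(s + ": " + (q[s][1] > q[s][0] ? "right" : "left"));
  }
}
```

The same environment functions could be reused to sketch Sarsa or Expected Sarsa by swapping the target computation.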

Gallery

<table> <tr> <td>

prisonersdilemma

Prisoner's Dilemma

<td>

gambler_exact

Exact Gambler

</tr> </table>

Examples

4.1 Gridworld

<table><tr> <td valign="top">

AV-Iteration q(s,a)

gridworld_qsa_avi

<td>

TabularQPlan

gridworld_qsa_rstqp

<td>

Monte Carlo

gridworld_qsa_mces

</tr><tr> <td>

Q-Learning

gridworld_qsa_qlearning

<td>

Expected-Sarsa

gridworld_qsa_expected

<td>

Sarsa

gridworld_qsa_original

</tr><tr> <td>

3-step Q-Learning

gridworld_qsa_qlearning3

<td>

3-step E-Sarsa

gridworld_qsa_expected3

<td>

3-step Sarsa

gridworld_qsa_original3

</tr><tr> <td>

True Online Sarsa (original)

gridworld_tos_original

<td>

True Online Sarsa (expected)

gridworld_tos_expected

<td>

True Online Sarsa (Q-Learning)

gridworld_tos_qlearning

</tr></table>

4.2 Jack's car rental

Value Iteration v(s)

carrental_vi_true

4.4 Gambler's problem

Value Iteration v(s)

gambler_sv

Action Value Iteration and optimal policy

gambler_avi
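As a sketch of what the value-iteration figures compute, the following standalone snippet (plain doubles, not the repository's exact-precision implementation) solves the gambler's problem with head probability p_h = 0.4: v(s) is the probability of reaching the goal capital of 100 under the optimal betting policy.

```java
// Illustrative value iteration for the gambler's problem (Sec. 4.4).
// v[s] converges to the probability of reaching the goal from capital s.
public final class GamblerValueIteration {
  public static double[] solve(double ph, int goal) {
    double[] v = new double[goal + 1];
    v[goal] = 1.0; // reaching the goal pays 1; v[0] stays 0 (ruin)
    for (int sweep = 0; sweep < 10_000; sweep++) {
      double delta = 0.0;
      for (int s = 1; s < goal; s++) {
        double best = v[s];
        // stakes are limited by current capital and distance to the goal
        for (int stake = 1; stake <= Math.min(s, goal - s); stake++)
          best = Math.max(best, ph * v[s + stake] + (1 - ph) * v[s - stake]);
        delta = Math.max(delta, best - v[s]);
        v[s] = best; // in-place (Gauss-Seidel style) update
      }
      if (delta < 1e-12)
        break; // converged
    }
    return v;
  }

  public static void main(String[] args) {
    double[] v = solve(0.4, 100);
    System.out.printf("v(50) = %.4f%n", v[50]); // betting everything at 50 wins with prob 0.4
  }
}
```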

<table><tr><td>

Monte Carlo q(s,a)

gambler_qsa_mces

<td>

ESarsa q(s,a)

gambler_qsa_esarsa

<td>

QLearning q(s,a)

gambler_qsa_qlearn

</tr></table>

5.1 Blackjack

Monte Carlo Exploring Starts

blackjack_mces

5.2 Wireloop

<table><tr><td>

AV-Iteration

wire5_avi

<td>

TabularQPlan

wire5_qsa_rstqp

<td>

Q-Learning

wire5_qsa_qlearning

<td>

E-Sarsa

wire5_qsa_expected

<td>

Sarsa

wire5_qsa_original

<td>

Monte Carlo

wire5_mces

</tr></table>

5.8 Racetrack

Paths obtained using value iteration

<table><tr><td valign="top">

track 1

track1

<td valign="top">

track 2

track2

</tr></table>

6.5 Windygrid

<table><tr><td>

Action Value Iteration

windygrid_qsa_avi

<td>

TabularQPlan

windygrid_qsa_rstqp

</tr></table>

6.6 Cliffwalk

<table><tr><td>

Action Value Iteration

cliffwalk_qsa_avi

<td>

Q-Learning

cliffwalk_qsa_qlearning

<td>

TabularQPlan

cliffwalk_qsa_rstqp

<td>

Expected Sarsa

cliffwalk_qsa_expected

</tr></table>

8.1 Dynamaze

<table><tr><td>

Action Value Iteration

maze5_qsa_avi

<td>

Prioritized sweeping

maze2_ps_qlearning

</tr></table>

Additional Examples

Repeated Prisoner's Dilemma

Exact expected reward of two adversarial optimistic agents depending on their initial configuration:

opts

Exact expected reward of two adversarial Upper-Confidence-Bound agents depending on their initial configuration:

ucbs
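The Upper-Confidence-Bound agents above follow the UCB rule from Section 2.7 of the book; a minimal sketch of the selection step (the bandit values and the constant c below are hypothetical, not taken from the repository):

```java
// Illustrative UCB action selection: a = argmax_a  q(a) + c * sqrt(ln t / n(a)),
// where n(a) counts how often action a has been tried up to time t.
public final class UcbSelection {
  public static int select(double[] q, int[] counts, int t, double c) {
    int best = 0;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (int a = 0; a < q.length; a++) {
      if (counts[a] == 0)
        return a; // untried actions are selected first
      double score = q[a] + c * Math.sqrt(Math.log(t) / counts[a]);
      if (score > bestScore) {
        bestScore = score;
        best = a;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // action 1 has the higher estimate AND the larger exploration bonus
    System.out.println(select(new double[] { 0.5, 0.6 }, new int[] { 10, 2 }, 12, 2.0));
  }
}
```

The optimistic agents in the figure above differ only in the initialization of q, not in the selection rule.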

Integration

Specify the dependency and repository of the subare library in the pom.xml file of your Maven project:

<dependencies>
  <dependency>
    <groupId>ch.ethz.idsc</groupId>
    <artifactId>subare</artifactId>
    <version>0.3.8</version>
  </dependency>
</dependencies>

<repositories>
  <repository>
    <id>subare-mvn-repo</id>
    <url>https://raw.github.com/idsc-frazzoli/subare/mvn-repo/</url>
    <snapshots>
      <enabled>true</enabled>
      <updatePolicy>always</updatePolicy>
    </snapshots>
  </repository>
</repositories>

The source code is attached to every release.

Contributors

Jan Hakenberg, Christian Fluri

