# Reinforcement Learning

RL methods and techniques.
## Installation

```sh
uv pip install -e .
```
## Overview

This repository implements algorithms and models from Sutton and Barto's "Reinforcement Learning: An Introduction," the classic text that provides a comprehensive introduction to the field. The code is organized into several modules, each covering a different topic.
## Methods
- [x] Multi Armed Bandits
    - [x] Epsilon Greedy
    - [x] Optimistic Initial Values
    - [x] Gradient
    - [x] α (non-stationary)
- [x] Model Based
    - [x] Policy Evaluation
    - [x] Policy Iteration
    - [x] Value Iteration
- [x] Monte Carlo estimation and control
    - [x] First-visit α-MC
    - [x] Every-visit α-MC
    - [x] MC with Exploring Starts
    - [x] Off-policy MC, ordinary and weighted importance sampling
- [x] Temporal Difference
    - [x] TD(n) estimation
    - [x] n-step SARSA
    - [x] n-step Q-learning
    - [x] n-step Expected SARSA
    - [x] Double Q-learning
    - [x] n-step Tree Backup
- [x] Planning
    - [x] Dyna-Q/Dyna-Q+
    - [x] Prioritized Sweeping
    - [x] Trajectory Sampling
    - [x] MCTS
- [ ] On-policy Prediction
    - [x] Gradient MC
    - [x] $n$-step semi-gradient TD
    - [ ] ANN
    - [ ] Least-Squares TD
    - [ ] Kernel-based
- [x] On-policy Control
    - [x] Episodic semi-gradient
    - [x] Semi-gradient n-step Sarsa
    - [x] Differential semi-gradient n-step Sarsa
- [ ] Eligibility Traces
    - [x] TD($\lambda$)
    - [ ] True Online TD($\lambda$)
    - [x] Sarsa($\lambda$)
    - [ ] True Online Sarsa($\lambda$)
- [ ] Policy Gradient
    - [x] REINFORCE: Monte Carlo Policy Gradient w/wo Baseline
    - [ ] Actor-Critic (episodic) w/wo eligibility traces
    - [ ] Actor-Critic (continuing) with eligibility traces
All model-free solvers work by just defining states, actions, and a transition function. A transition is a function that takes a state and an action and returns a tuple of the next state and the reward, together with a boolean indicating whether the episode has terminated:

```python
states: Sequence[Any]
actions: Sequence[Any]
transition: Callable[[Any, Any], tuple[tuple[Any, float], bool]]
```
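As a minimal illustration of a function conforming to this signature, consider a two-state chain; the state names, reward, and termination rule below are invented for the example, not taken from the repository.

```python
from typing import Any, Sequence

states: Sequence[Any] = ["A", "B"]
actions: Sequence[Any] = ["stay", "move"]

def transition(state: Any, action: Any) -> tuple[tuple[Any, float], bool]:
    # "move" toggles between the two states; reaching "B" terminates
    # the episode with reward 1, everything else yields reward 0
    next_state = ("B" if state == "A" else "A") if action == "move" else state
    done = next_state == "B"
    reward = 1.0 if done else 0.0
    return (next_state, reward), done
```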
## Examples

### Single State Infinite Variance (Example 5.5)

```python
import numpy as np

from mypyrl import off_policy_mc, ModelFreePolicy

states = [0]
actions = ['left', 'right']

def single_state_transition(state, action):
    if action == 'right':
        return (state, 0), True
    if action == 'left':
        threshold = np.random.random()
        if threshold > 0.9:
            return (state, 1), True
        else:
            return (state, 0), False

b = ModelFreePolicy(actions, states)  # by default equiprobable
pi = ModelFreePolicy(actions, states)
pi.pi[0] = np.array([1, 0])  # target policy always chooses 'left'

# calculate ordinary and weighted importance-sampling state value functions
vqpi_ord, samples_ord = off_policy_mc(states, actions, single_state_transition,
    policy=pi, b=b, ordinary=True, first_visit=True, gamma=1., n_episodes=1E4)
vqpi_w, samples_w = off_policy_mc(states, actions, single_state_transition,
    policy=pi, b=b, ordinary=False, first_visit=True, gamma=1., n_episodes=1E4)
```
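This example reproduces the book's demonstration that ordinary importance sampling can have infinite variance. Writing $\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$ for the importance ratio and $\mathcal{T}(s)$ for the set of (first) visits to $s$, the two estimators compared above are

$$
V_{\text{ord}}(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|},
\qquad
V_{\text{w}}(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}.
$$

The ordinary estimator is unbiased but its variance is unbounded in this example, so its estimates never settle; the weighted estimator is biased (the bias vanishes asymptotically) but has bounded variance and converges.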

### Monte Carlo Tree Search maze solving
```python
s = START_XY
budget = 500
cp = 1 / np.sqrt(2)
end = False
max_steps = 50

while not end:
    action, tree = mcts(s, cp, budget, obstacle_maze, action_map, max_steps, eps=1)
    (s, _), end = obstacle_maze(s, action)

tree.plot()
```
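The `cp` constant above is the exploration parameter of the UCT selection rule. In the standard formulation (Kocsis and Szepesvári), the child $v'$ of node $v$ maximizing

$$
\text{UCT}(v') = \frac{Q(v')}{N(v')} + 2 c_p \sqrt{\frac{2 \ln N(v)}{N(v')}}
$$

is selected during the tree-descent phase, and $c_p = 1/\sqrt{2}$ is the value that satisfies the Hoeffding-based convergence analysis for rewards in $[0, 1]$, which is why it is a common default. (That this implementation uses exactly this rule is an assumption based on the standard algorithm.)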

## Contributing

The implementations in this package favor clarity over efficiency and are not necessarily the best-written versions of these algorithms. If you have suggestions for improving the code, please feel free to open an issue.

Overall, this package is a useful resource for anyone interested in learning reinforcement learning by implementing the book's algorithms from scratch. It is by no means production ready.
