Reinforcement Learning
Reinforcing Your Learning of Reinforcement Learning.
These are some notes and code I wrote while studying reinforcement learning. I created this GitHub project mainly so that we can learn from and exchange ideas with each other, and to make it easier for others to find reinforcement learning materials. My main motivation for studying reinforcement learning is to apply the AlphaZero approach (Monte Carlo tree search combined with deep learning) to RNA structure prediction. I have already made some attempts, such as searching for folding paths of RNA secondary structures.
The first book I read was Reinforcement Learning: An Introduction (Second Edition) by Richard S. Sutton and Andrew G. Barto.
While reading, I also wrote some simple code based on articles found online, listed in order below.
Table of contents
- Q-Learning
- Deep Q-Learning Network (DQN)
- Dueling Double DQN & Prioritized Experience Replay
- Policy Gradients (PG)
- Advantage Actor Critic (A2C)
- Asynchronous Advantage Actor Critic (A3C)
- Proximal Policy Optimization (PPO)
- Deep Deterministic Policy Gradient (DDPG)
- AlphaGoZero Introduction
- Monte Carlo Tree Search (MCTS)
- AlphaGomoku
- RNA Folding Path
- Atari Game Roms
Q-Learning
Bellman equation:

Frozen Lake Game
<div align=center> <img width="300" height="300" src="imgs/frozenlake.png" alt="Frozen Lake Game"> </div>
Playing the Frozen Lake game with Q-Learning: [code]
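The tabular update behind these examples can be sketched as follows (a minimal sketch; the `alpha` and `gamma` defaults are illustrative, not the repo's settings):

```python
# Tabular Q-learning update:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q

# Two states, two actions, all values start at zero.
Q = [[0.0, 0.0], [0.0, 0.0]]
q_update(Q, 0, 1, 1.0, 1)
print(Q[0][1])  # 0.1
```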
Tic Tac Toe
<div align=center> <img width="100" height="130" src="imgs/tic1.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic2.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic3.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic4.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic5.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic6.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic7.png" alt="Tic Tac Toe"> </div>
Playing Tic Tac Toe with Q-Learning: [code]
Training results:
Q-Learning Player vs Q-Learning Player
====================
Train result - 100000 episodes
Q-Learning player 1 win rate: 0.45383
Q-Learning player 2 win rate: 0.3527
players draw rate: 0.19347
====================
Q-Learning Player vs Random Player
====================
Train result - 100000 episodes
Q-Learning win rate: 0.874
Random win rate: 0.03072
players draw rate: 0.09528
====================
Taxi-v2
<div align=center> <img width="93" height="133" src="imgs/taxi1.png" alt="Taxi v2"> <img width="93" height="133" src="imgs/taxi2.png" alt="Taxi v2"> <img width="93" height="133" src="imgs/taxi3.png" alt="Taxi v2"> <img width="93" height="133" src="imgs/taxi4.png" alt="Taxi v2"> <img width="93" height="133" src="imgs/taxi5.png" alt="Taxi v2"> <img width="93" height="133" src="imgs/taxi6.png" alt="Taxi v2"> </div>
Playing the Taxi-v2 game with Q-Learning: [code]
[0]. Diving deeper into Reinforcement Learning with Q-Learning<br/> [1]. Q* Learning with FrozenLake - Notebook<br/> [2]. Q* Learning with OpenAI Taxi-v2 - Notebook
Deep Q-Learning Network
<div align=center> <img width="400" height="300" src="imgs/DQN.png" alt="Deep Q-Learning Network"> </div>
Weight update:
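The update minimizes the squared TD error between the predicted Q-value and a target computed from a frozen target network. A minimal NumPy sketch (the variable names are mine, not the repo's):

```python
import numpy as np

# Squared TD-error loss for a batch:
#   y = r + gamma * max_a' Q_target(s', a'),
# with no bootstrapping past terminal states (done = 1).
def dqn_loss(q_values, target_q_values, actions, rewards, dones, gamma=0.99):
    targets = rewards + gamma * target_q_values.max(axis=1) * (1.0 - dones)
    chosen = q_values[np.arange(len(actions)), actions]
    return np.mean((targets - chosen) ** 2)
```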

Doom Game
<div align=center> <img src="imgs/play_doom.gif" alt="play Doom"> </div>
The game environment here is ViZDoom, and the network is a three-layer convolutional network. [code]

After roughly 1200 episodes of training, the results are as follows:

Episode 0 Score: 61.0
Episode 1 Score: 68.0
Episode 2 Score: 51.0
Episode 3 Score: 62.0
Episode 4 Score: 56.0
Episode 5 Score: 33.0
Episode 6 Score: 86.0
Episode 7 Score: 57.0
Episode 8 Score: 88.0
Episode 9 Score: 61.0
[*] Average Score: 62.3
Atari Space Invaders
<div align=center> <img width="427" height="530" src="imgs/play_atari_space_invaders.gif" alt="Atari Space Invaders"> </div>
The game environment is Gym Retro; the network architecture is shown in the figure below. [code]

After roughly 25 episodes of training, the results are as follows:
[*] Episode: 11, total reward: 120.0, explore p: 0.7587, train loss: 0.0127
[*] Episode: 12, total reward: 80.0, explore p: 0.7495, train loss: 0.0194
[*] Episode: 13, total reward: 110.0, explore p: 0.7409, train loss: 0.0037
[*] Episode: 14, total reward: 410.0, explore p: 0.7233, train loss: 0.0004
[*] Episode: 15, total reward: 240.0, explore p: 0.7019, train loss: 0.0223
[*] Episode: 16, total reward: 230.0, explore p: 0.6813, train loss: 0.0535
[*] Episode: 17, total reward: 315.0, explore p: 0.6606, train loss: 9.7144
[*] Episode: 18, total reward: 140.0, explore p: 0.6455, train loss: 0.0022
[*] Episode: 19, total reward: 310.0, explore p: 0.6266, train loss: 1.5386
[*] Episode: 20, total reward: 200.0, explore p: 0.6114, train loss: 1.5545
[*] Episode: 21, total reward: 65.0, explore p: 0.6044, train loss: 0.0042
[*] Episode: 22, total reward: 210.0, explore p: 0.5895, train loss: 0.0161
[*] Episode: 23, total reward: 155.0, explore p: 0.5778, train loss: 0.0006
[*] Episode: 24, total reward: 105.0, explore p: 0.5665, train loss: 0.0016
[*] Episode: 25, total reward: 425.0, explore p: 0.5505, train loss: 0.0063
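The explore p column in the log is the epsilon-greedy exploration probability, which decays exponentially with the number of steps taken; a sketch (the decay constants here are illustrative, not the values used in the repo):

```python
import numpy as np

# Exploration probability after `step` steps: decays from eps_max to eps_min.
def explore_p(step, eps_min=0.01, eps_max=1.0, decay_rate=1e-4):
    return eps_min + (eps_max - eps_min) * np.exp(-decay_rate * step)

print(explore_p(0))       # 1.0
print(explore_p(50_000))  # ~0.017
```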
[0]. An introduction to Deep Q-Learning: let’s play Doom<br/> [1]. Deep Q learning with Doom - Notebook<br/> [2]. Deep Q Learning with Atari Space Invaders<br/> [3]. Atari 2600 VCS ROM Collection
Dueling Double DQN and Prioritized Experience Replay
Four improvements in Deep Q Learning:
- Fixed Q-targets

- Double DQN

- Dueling DQN

- Prioritized Experience Replay
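Of these, Dueling DQN only changes how Q-values are assembled: the network outputs a state value V(s) and per-action advantages A(s,a), combined as Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)). A minimal sketch of the aggregation step:

```python
import numpy as np

# Dueling aggregation: subtracting the mean advantage keeps
# the value and advantage streams identifiable.
def dueling_q(value, advantages):
    # value: (batch, 1), advantages: (batch, n_actions)
    return value + (advantages - advantages.mean(axis=1, keepdims=True))

print(dueling_q(np.array([[1.0]]), np.array([[1.0, 2.0, 3.0]])))
# [[0. 1. 2.]]
```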
Doom Deadly Corridor
<div align=center> <img src="imgs/play_doom_deadly_corridor.gif" alt="play Doom Deadly Corridor"> </div>
The Dueling DQN network architecture is shown in the figure below: [code]

Prioritized Experience Replay uses a SumTree:
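A SumTree stores priorities in its leaves and partial sums in its internal nodes, so sampling a transition with probability proportional to its priority costs O(log n). A minimal sketch (not the repo's implementation; capacity is assumed to be a power of two):

```python
import numpy as np

class SumTree:
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)  # internal nodes + leaves
        self.write = 0                          # next leaf to overwrite

    def add(self, priority):
        self.update(self.write + self.capacity - 1, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, idx, priority):
        # Propagate the change in this leaf up to the root.
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx != 0:
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def get(self, s):
        # Descend from the root, choosing left/right by cumulative sum.
        idx = 0
        while idx < self.capacity - 1:
            left = 2 * idx + 1
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = left + 1
        return idx - (self.capacity - 1)  # leaf (transition) index
```

`tree[0]` always holds the total priority, so sampling a batch means drawing uniform values in `[0, tree[0])` and calling `get` on each.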

[0]. Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets<br/> [1]. Let’s make a DQN: Double Learning and Prioritized Experience Replay<br/> [2]. Double Dueling Deep Q Learning with Prioritized Experience Replay - Notebook
Policy Gradients
<div align=center> <img width="500" src="imgs/policy_gradients.png" alt="Policy Gradients"> </div> <div align=center> <img src="imgs/pg_algorithm.svg" alt="PG Algorithm"> </div>
CartPole Game
<div align=center> <img src="imgs/play_cartpole.gif" alt="Play CartPole Game"> </div>
The Policy Gradient network is shown in the figure below.

After roughly 950 episodes of training, the results are as follows:

====================
Episode: 941
Reward: 39712.0
Mean Reward: 2246.384288747346
Max reward so far: 111837.0
====================
Episode: 942
Reward: 9417.0
Mean Reward: 2253.9883351007425
Max reward so far: 111837.0
====================
Episode: 943
Reward: 109958.0
Mean Reward: 2368.08156779661
Max reward so far: 111837.0
====================
Episode: 944
Reward: 73285.0
Mean Reward: 2443.125925925926
Max reward so far: 111837.0
====================
Episode: 945
Reward: 40370.0
Mean Reward: 2483.217758985201
Max reward so far: 111837.0
[*] Model Saved: ./model/model.ckpt
For the full code, see: [tensorflow] [pytorch]
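The key preprocessing step for the policy-gradient update is computing discounted, normalized returns for each timestep of an episode. A minimal sketch (the normalization follows common practice and may differ in detail from the repo's code):

```python
import numpy as np

# Discounted returns G_t = sum_k gamma^k * r_{t+k}, computed backwards,
# then normalized to reduce the variance of the gradient estimate.
def discounted_returns(rewards, gamma=0.99):
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return (returns - returns.mean()) / (returns.std() + 1e-8)
```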
Doom Deathmatch
<div align=center> <img src="imgs/play_doom_deathmatch.gif" alt="play Doom Deathmatch"> </div>
The network architecture is shown above; for the full code, see: [code]
[0]. An introduction to Policy Gradients with Cartpole and Doom
