Reinforcement Learning
Reinforcing Your Learning of Reinforcement Learning.
These are some notes and code I wrote while studying reinforcement learning. I created this GitHub project mainly so that we can learn from and exchange ideas with each other, and to make it easier for others to find reinforcement learning materials. My main motivation for studying reinforcement learning is to apply the AlphaZero approach (Monte Carlo tree search combined with deep learning) to RNA structure prediction. I have already made some attempts, such as searching for folding paths of RNA secondary structures.
The first book I read was Reinforcement Learning: An Introduction (Second Edition) by Richard S. Sutton and Andrew G. Barto.
While reading, I also wrote some simple code based on articles found online, listed in order below.
Table of contents
- Q-Learning
- Deep Q-Learning Network (DQN)
- Dueling Double DQN & Prioritized Experience Replay
- Policy Gradients (PG)
- Advantage Actor Critic (A2C)
- Asynchronous Advantage Actor Critic (A3C)
- Proximal Policy Optimization (PPO)
- Deep Deterministic Policy Gradient (DDPG)
- AlphaGoZero Introduction
- Monte Carlo Tree Search (MCTS)
- AlphaGomoku
- RNA Folding Path
- Atari Game Roms
Q-Learning
Bellman equation:

Frozen Lake Game
<div align=center> <img width="300" height="300" src="imgs/frozenlake.png" alt="Frozen Lake Game"> </div>
Playing the Frozen Lake game with Q-Learning: [code]
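The tabular update behind these examples can be sketched as follows (a minimal sketch; the `alpha` and `gamma` defaults are illustrative, not the repo's settings):

```python
# Tabular Q-learning update:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q

# Two states, two actions, all values start at zero.
Q = [[0.0, 0.0], [0.0, 0.0]]
q_update(Q, 0, 1, 1.0, 1)
print(Q[0][1])  # 0.1
```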
Tic Tac Toe
<div align=center> <img width="100" height="130" src="imgs/tic1.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic2.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic3.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic4.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic5.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic6.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic7.png" alt="Tic Tac Toe"> </div>
Playing Tic Tac Toe with Q-Learning: [code]
Training results:
Q-Learning Player vs Q-Learning Player
====================
Train result - 100000 episodes
Q-Learning player 1 win rate: 0.45383
Q-Learning player 2 win rate: 0.3527
players draw rate: 0.19347
====================
Q-Learning Player vs Random Player
====================
Train result - 100000 episodes
Q-Learning win rate: 0.874
Random win rate: 0.03072
players draw rate: 0.09528
====================
Taxi-v2
<div align=center> <img width="93" height="133" src="imgs/taxi1.png" alt="Taxi v2"> <img width="93" height="133" src="imgs/taxi2.png" alt="Taxi v2"> <img width="93" height="133" src="imgs/taxi3.png" alt="Taxi v2"> <img width="93" height="133" src="imgs/taxi4.png" alt="Taxi v2"> <img width="93" height="133" src="imgs/taxi5.png" alt="Taxi v2"> <img width="93" height="133" src="imgs/taxi6.png" alt="Taxi v2"> </div>
Playing the Taxi-v2 game with Q-Learning: [code]
[0]. Diving deeper into Reinforcement Learning with Q-Learning<br/> [1]. Q* Learning with FrozenLake - Notebook<br/> [2]. Q* Learning with OpenAI Taxi-v2 - Notebook
Deep Q-Learning Network
<div align=center> <img width="400" height="300" src="imgs/DQN.png" alt="Deep Q-Learning Network"> </div>
Weight update:
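The update minimizes the squared TD error between the predicted Q-value and a target computed from a frozen target network. A minimal NumPy sketch (the variable names are mine, not the repo's):

```python
import numpy as np

# Squared TD-error loss for a batch:
#   y = r + gamma * max_a' Q_target(s', a'),
# with no bootstrapping past terminal states (done = 1).
def dqn_loss(q_values, target_q_values, actions, rewards, dones, gamma=0.99):
    targets = rewards + gamma * target_q_values.max(axis=1) * (1.0 - dones)
    chosen = q_values[np.arange(len(actions)), actions]
    return np.mean((targets - chosen) ** 2)
```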

Doom Game
<div align=center> <img src="imgs/play_doom.gif" alt="play Doom"> </div>
The game environment here is ViZDoom, and the network is a three-layer convolutional network. [code]

After roughly 1200 episodes of training, the results are as follows:

Episode 0 Score: 61.0
Episode 1 Score: 68.0
Episode 2 Score: 51.0
Episode 3 Score: 62.0
Episode 4 Score: 56.0
Episode 5 Score: 33.0
Episode 6 Score: 86.0
Episode 7 Score: 57.0
Episode 8 Score: 88.0
Episode 9 Score: 61.0
[*] Average Score: 62.3
Atari Space Invaders
<div align=center> <img width="427" height="530" src="imgs/play_atari_space_invaders.gif" alt="Atari Space Invaders"> </div>
The game environment is Gym Retro; the network architecture is shown in the figure below. [code]

After roughly 25 episodes of training, the results are as follows:
[*] Episode: 11, total reward: 120.0, explore p: 0.7587, train loss: 0.0127
[*] Episode: 12, total reward: 80.0, explore p: 0.7495, train loss: 0.0194
[*] Episode: 13, total reward: 110.0, explore p: 0.7409, train loss: 0.0037
[*] Episode: 14, total reward: 410.0, explore p: 0.7233, train loss: 0.0004
[*] Episode: 15, total reward: 240.0, explore p: 0.7019, train loss: 0.0223
[*] Episode: 16, total reward: 230.0, explore p: 0.6813, train loss: 0.0535
[*] Episode: 17, total reward: 315.0, explore p: 0.6606, train loss: 9.7144
[*] Episode: 18, total reward: 140.0, explore p: 0.6455, train loss: 0.0022
[*] Episode: 19, total reward: 310.0, explore p: 0.6266, train loss: 1.5386
[*] Episode: 20, total reward: 200.0, explore p: 0.6114, train loss: 1.5545
[*] Episode: 21, total reward: 65.0, explore p: 0.6044, train loss: 0.0042
[*] Episode: 22, total reward: 210.0, explore p: 0.5895, train loss: 0.0161
[*] Episode: 23, total reward: 155.0, explore p: 0.5778, train loss: 0.0006
[*] Episode: 24, total reward: 105.0, explore p: 0.5665, train loss: 0.0016
[*] Episode: 25, total reward: 425.0, explore p: 0.5505, train loss: 0.0063
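The explore p column in the log is the epsilon-greedy exploration probability, which decays exponentially with the number of steps taken; a sketch (the decay constants here are illustrative, not the values used in the repo):

```python
import numpy as np

# Exploration probability after `step` steps: decays from eps_max to eps_min.
def explore_p(step, eps_min=0.01, eps_max=1.0, decay_rate=1e-4):
    return eps_min + (eps_max - eps_min) * np.exp(-decay_rate * step)

print(explore_p(0))       # 1.0
print(explore_p(50_000))  # ~0.017
```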
[0]. An introduction to Deep Q-Learning: let’s play Doom<br/> [1]. Deep Q learning with Doom - Notebook<br/> [2]. Deep Q Learning with Atari Space Invaders<br/> [3]. Atari 2600 VCS ROM Collection
Dueling Double DQN and Prioritized Experience Replay
Four improvements in Deep Q Learning:
- Fixed Q-targets

- Double DQN

- Dueling DQN

- Prioritized Experience Replay
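Of these, Dueling DQN only changes how Q-values are assembled: the network outputs a state value V(s) and per-action advantages A(s,a), combined as Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)). A minimal sketch of the aggregation step:

```python
import numpy as np

# Dueling aggregation: subtracting the mean advantage keeps
# the value and advantage streams identifiable.
def dueling_q(value, advantages):
    # value: (batch, 1), advantages: (batch, n_actions)
    return value + (advantages - advantages.mean(axis=1, keepdims=True))

print(dueling_q(np.array([[1.0]]), np.array([[1.0, 2.0, 3.0]])))
# [[0. 1. 2.]]
```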
Doom Deadly Corridor
<div align=center> <img src="imgs/play_doom_deadly_corridor.gif" alt="play Doom Deadly Corridor"> </div>
The Dueling DQN network architecture is shown in the figure below: [code]

Prioritized Experience Replay uses a SumTree:
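A SumTree stores priorities in its leaves and partial sums in its internal nodes, so sampling a transition with probability proportional to its priority costs O(log n). A minimal sketch (not the repo's implementation; capacity is assumed to be a power of two):

```python
import numpy as np

class SumTree:
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)  # internal nodes + leaves
        self.write = 0                          # next leaf to overwrite

    def add(self, priority):
        self.update(self.write + self.capacity - 1, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, idx, priority):
        # Propagate the change in this leaf up to the root.
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx != 0:
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def get(self, s):
        # Descend from the root, choosing left/right by cumulative sum.
        idx = 0
        while idx < self.capacity - 1:
            left = 2 * idx + 1
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = left + 1
        return idx - (self.capacity - 1)  # leaf (transition) index
```

`tree[0]` always holds the total priority, so sampling a batch means drawing uniform values in `[0, tree[0])` and calling `get` on each.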

[0]. Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets<br/> [1]. Let’s make a DQN: Double Learning and Prioritized Experience Replay<br/> [2]. Double Dueling Deep Q Learning with Prioritized Experience Replay - Notebook
Policy Gradients
<div align=center> <img width="500" src="imgs/policy_gradients.png" alt="Policy Gradients"> </div> <div align=center> <img src="imgs/pg_algorithm.svg" alt="PG Algorithm"> </div>
CartPole Game
<div align=center> <img src="imgs/play_cartpole.gif" alt="Play CartPole Game"> </div>
The Policy Gradient network is shown in the figure below.

After roughly 950 episodes of training, the results are as follows:

====================
Episode: 941
Reward: 39712.0
Mean Reward: 2246.384288747346
Max reward so far: 111837.0
====================
Episode: 942
Reward: 9417.0
Mean Reward: 2253.9883351007425
Max reward so far: 111837.0
====================
Episode: 943
Reward: 109958.0
Mean Reward: 2368.08156779661
Max reward so far: 111837.0
====================
Episode: 944
Reward: 73285.0
Mean Reward: 2443.125925925926
Max reward so far: 111837.0
====================
Episode: 945
Reward: 40370.0
Mean Reward: 2483.217758985201
Max reward so far: 111837.0
[*] Model Saved: ./model/model.ckpt
For the full code, see: [tensorflow] [pytorch]
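The key preprocessing step for the policy-gradient update is computing discounted, normalized returns for each timestep of an episode. A minimal sketch (the normalization follows common practice and may differ in detail from the repo's code):

```python
import numpy as np

# Discounted returns G_t = sum_k gamma^k * r_{t+k}, computed backwards,
# then normalized to reduce the variance of the gradient estimate.
def discounted_returns(rewards, gamma=0.99):
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return (returns - returns.mean()) / (returns.std() + 1e-8)
```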
Doom Deathmatch
<div align=center> <img src="imgs/play_doom_deathmatch.gif" alt="play Doom Deathmatch"> </div>
The network architecture is shown above; for the full code, see: [code]
[0]. An introduction to Policy Gradients with Cartpole and Doom
