Reinforcement Learning

Reinforcing Your Learning of Reinforcement Learning.

This repository collects my notes and code from learning reinforcement learning. I created it so we can learn from and exchange ideas with each other, and to make it easier for others to find reinforcement-learning material. My main motivation for studying reinforcement learning is to apply the AlphaZero approach (Monte Carlo tree search combined with deep learning) to RNA structure prediction; I have already made some attempts, such as searching for folding pathways of RNA secondary structures.

The first book I read was Reinforcement Learning: An Introduction (second edition) by Richard S. Sutton and Andrew G. Barto.

While reading, I also wrote some simple code based on articles found online, listed in order below.

Table of contents

Q-Learning

Bellman equation (the Q-Learning update rule):

`Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]`

Frozen Lake Game

<div align=center> <img width="300" height="300" src="imgs/frozenlake.png" alt="Frozen Lake Game"> </div>

Playing the Frozen Lake game with Q-Learning: [code]
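The FrozenLake script in this repo uses OpenAI Gym; as a dependency-free sketch of the same tabular Q-Learning loop, here is the update rule applied to a toy four-state corridor (the environment, hyperparameters, and names are illustrative, not the repo's code):

```python
import numpy as np

# Toy 1x4 corridor: states 0..3, actions 0 = left, 1 = right.
# Reaching state 3 gives reward 1 and ends the episode.
N_STATES, N_ACTIONS, GOAL = 4, 2, 3

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    done = s2 == GOAL
    return s2, float(done), done

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, eps = 0.5, 0.9, 0.1

for _ in range(500):
    s = 0
    while True:
        # epsilon-greedy action selection
        a = rng.integers(N_ACTIONS) if rng.random() < eps else int(Q[s].argmax())
        s2, r, done = step(s, a)
        # Q-Learning (Bellman) update; no bootstrapping past the terminal state
        Q[s, a] += alpha * (r + gamma * Q[s2].max() * (not done) - Q[s, a])
        s = s2
        if done:
            break

print(Q.argmax(axis=1)[:GOAL].tolist())  # greedy policy: [1, 1, 1] (always move right)
```

The same loop carries over to FrozenLake: only `step` is replaced by `env.step(a)` and the table sized to the environment's state and action spaces.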

Tic Tac Toe

<div align=center> <img width="100" height="130" src="imgs/tic1.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic2.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic3.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic4.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic5.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic6.png" alt="Tic Tac Toe"> <img width="100" height="130" src="imgs/tic7.png" alt="Tic Tac Toe"> </div>

Playing Tic Tac Toe with Q-Learning: [code]

Training results:

```
Q-Learning Player vs Q-Learning Player
====================
Train result - 100000 episodes
Q-Learning win rate: 0.45383
Q-Learning win rate: 0.3527
players draw rate: 0.19347
====================
```

```
Q-Learning Player vs Random Player
====================
Train result - 100000 episodes
Q-Learning win rate: 0.874
Random win rate: 0.03072
players draw rate: 0.09528
====================
```

Taxi v2

<div align=center> <img width="93" height="133" src="imgs/taxi1.png" alt="Taxi v2"> <img width="93" height="133" src="imgs/taxi2.png" alt="Taxi v2"> <img width="93" height="133" src="imgs/taxi3.png" alt="Taxi v2"> <img width="93" height="133" src="imgs/taxi4.png" alt="Taxi v2"> <img width="93" height="133" src="imgs/taxi5.png" alt="Taxi v2"> <img width="93" height="133" src="imgs/taxi6.png" alt="Taxi v2"> </div>

Playing the Taxi v2 game with Q-Learning: [code]

[0]. Diving deeper into Reinforcement Learning with Q-Learning<br/> [1]. Q* Learning with FrozenLake - Notebook<br/> [2]. Q* Learning with OpenAI Taxi-v2 - Notebook

Deep Q-Learning Network

<div align=center> <img width="400" height="300" src="imgs/DQN.png" alt="Deep Q-Learning Network"> </div>

Weight update:
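The weight update minimizes the squared TD error between Q(s, a) and the target y = r + γ max_a' Q(s', a'). A minimal numpy sketch of how those targets and the loss are formed for a batch (names and numbers are illustrative; the repo's implementation uses TensorFlow):

```python
import numpy as np

def dqn_td_targets(rewards, next_q, dones, gamma=0.99):
    # y = r + gamma * max_a' Q(s', a'); no bootstrapping past terminal states
    return rewards + gamma * next_q.max(axis=1) * (1.0 - dones)

# Q(s', .) from the network for a batch of 3 transitions over 2 actions
next_q = np.array([[0.5, 1.0], [0.2, 0.1], [0.0, 0.0]])
rewards = np.array([1.0, 0.0, 1.0])
dones = np.array([0.0, 0.0, 1.0])  # the last transition ended the episode

y = dqn_td_targets(rewards, next_q, dones, gamma=0.9)
# targets: 1.9, 0.18, 1.0

# the loss is the mean squared error against Q(s, a) of the actions taken
q_taken = np.array([1.5, 0.3, 0.8])
loss = np.mean((y - q_taken) ** 2)
```

Gradient descent on this loss is what the update-rule figure expresses; only the network's weights are trained, the targets are held fixed for each batch.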

Doom Game

<div align=center> <img src="imgs/play_doom.gif" alt="play Doom"> </div>

The game environment here is ViZDoom, and the network is a three-layer convolutional network. [code]

DQN neural network

After roughly 1200 training episodes, the results are as follows:

Doom loss

```
Episode 0 Score: 61.0
Episode 1 Score: 68.0
Episode 2 Score: 51.0
Episode 3 Score: 62.0
Episode 4 Score: 56.0
Episode 5 Score: 33.0
Episode 6 Score: 86.0
Episode 7 Score: 57.0
Episode 8 Score: 88.0
Episode 9 Score: 61.0
[*] Average Score: 62.3
```

Atari Space Invaders

<div align=center> <img width="427" height="530" src="imgs/play_atari_space_invaders.gif" alt="Atari Space Invaders"> </div>

The game environment is Gym Retro; the network is shown in the figure below. [code]

DQN neural network

After roughly 25 training episodes, the results are as follows:

```
[*] Episode: 11, total reward: 120.0, explore p: 0.7587, train loss: 0.0127
[*] Episode: 12, total reward: 80.0, explore p: 0.7495, train loss: 0.0194
[*] Episode: 13, total reward: 110.0, explore p: 0.7409, train loss: 0.0037
[*] Episode: 14, total reward: 410.0, explore p: 0.7233, train loss: 0.0004
[*] Episode: 15, total reward: 240.0, explore p: 0.7019, train loss: 0.0223
[*] Episode: 16, total reward: 230.0, explore p: 0.6813, train loss: 0.0535
[*] Episode: 17, total reward: 315.0, explore p: 0.6606, train loss: 9.7144
[*] Episode: 18, total reward: 140.0, explore p: 0.6455, train loss: 0.0022
[*] Episode: 19, total reward: 310.0, explore p: 0.6266, train loss: 1.5386
[*] Episode: 20, total reward: 200.0, explore p: 0.6114, train loss: 1.5545
[*] Episode: 21, total reward: 65.0, explore p: 0.6044, train loss: 0.0042
[*] Episode: 22, total reward: 210.0, explore p: 0.5895, train loss: 0.0161
[*] Episode: 23, total reward: 155.0, explore p: 0.5778, train loss: 0.0006
[*] Episode: 24, total reward: 105.0, explore p: 0.5665, train loss: 0.0016
[*] Episode: 25, total reward: 425.0, explore p: 0.5505, train loss: 0.0063
```

[0]. An introduction to Deep Q-Learning: let’s play Doom<br/> [1]. Deep Q learning with Doom - Notebook<br/> [2]. Deep Q Learning with Atari Space Invaders<br/> [3]. Atari 2600 VCS ROM Collection

Dueling Double DQN and Prioritized Experience Replay

Four improvements in Deep Q Learning:

  • Fixed Q-targets
  • Double DQN
  • Dueling DQN
  • Prioritized Experience Replay
<div align=center> <img height="400" src="imgs/PER.png" alt="PER"> </div>
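Of these improvements, Double DQN changes only how the TD target is built: the online network chooses the next action, while the target network evaluates it, which reduces overestimation. A hedged numpy sketch (names and numbers are illustrative, not the repo's code):

```python
import numpy as np

def double_dqn_targets(rewards, q_online_next, q_target_next, dones, gamma=0.99):
    # Online net picks argmax a'; target net supplies Q_target(s', a')
    best = q_online_next.argmax(axis=1)
    bootstrap = q_target_next[np.arange(len(best)), best]
    return rewards + gamma * bootstrap * (1.0 - dones)

# Batch of 2 transitions over 2 actions
q_online_next = np.array([[1.0, 2.0], [3.0, 0.0]])
q_target_next = np.array([[0.5, 0.4], [0.7, 0.9]])
y = double_dqn_targets(np.array([0.0, 1.0]), q_online_next, q_target_next,
                       np.array([0.0, 0.0]), gamma=0.9)
# online net picks actions [1, 0]; targets: 0.36, 1.63
```

Compare with plain DQN, which would take `q_target_next.max(axis=1)` directly, letting a single over-estimated value both select and evaluate the action.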

Doom Deadly Corridor

<div align=center> <img src="imgs/play_doom_deadly_corridor.gif" alt="play Doom Deadly Corridor"> </div>

The Dueling DQN network is shown in the figure below: [code]

Dueling DQN

Prioritized Experience Replay uses the SumTree approach:

SumTree
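A SumTree stores priorities at its leaves and partial sums at internal nodes, so sampling a transition proportionally to its priority costs O(log n). The repo's version differs in detail; this is a minimal illustrative implementation:

```python
import numpy as np

class SumTree:
    """Binary tree whose leaves hold priorities; each parent stores the
    sum of its children, so the root is the total priority mass."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)  # internal nodes + leaves
        self.write = 0                          # next leaf slot (ring buffer)

    def add(self, priority):
        idx = self.write + self.capacity - 1
        self.update(idx, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, idx, priority):
        delta = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx:  # propagate the change up to the root
            idx = (idx - 1) // 2
            self.tree[idx] += delta

    def total(self):
        return self.tree[0]

    def get(self, value):
        """Descend to the leaf whose cumulative-priority interval contains value."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx - (self.capacity - 1)  # leaf index in [0, capacity)

tree = SumTree(4)
for p in [1.0, 2.0, 3.0, 4.0]:
    tree.add(p)
print(tree.total())   # 10.0
print(tree.get(6.5))  # 3 -> the leaf with priority 4.0 covers the interval (6, 10]
```

To sample a minibatch, draw `value` uniformly from `[0, tree.total())` and call `tree.get(value)`; high-priority leaves own wider intervals and are therefore picked more often.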

[0]. Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets<br/> [1]. Let’s make a DQN: Double Learning and Prioritized Experience Replay<br/> [2]. Double Dueling Deep Q Learning with Prioritized Experience Replay - Notebook

Policy Gradients

<div align=center> <img width="500" src="imgs/policy_gradients.png" alt="Policy Gradients"> </div> <div align=center> <img src="imgs/pg_algorithm.svg" alt="PG Algorithm"> </div>
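The update in the algorithm figure weights each log-probability gradient by the discounted return G_t. The returns for one episode are computed backward; normalizing them is a common variance-reduction trick, shown here as an option (a simplified sketch, not the repo's exact code):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99, normalize=True):
    """G_t = r_t + gamma * G_{t+1}, accumulated backward over one episode."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    if normalize:  # zero-mean, unit-variance returns reduce gradient variance
        G = (G - G.mean()) / (G.std() + 1e-8)
    return G

print(discounted_returns([1.0, 0.0, 1.0], gamma=0.5, normalize=False))
# G = [1.25, 0.5, 1.0]
```

These returns then multiply the log-probabilities of the actions taken, and the policy network ascends that weighted objective.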

CartPole Game

<div align=center> <img src="imgs/play_cartpole.gif" alt="Play CartPole Game"> </div>

The Policy Gradient network is shown in the figure below.

Policy Gradient Network

After roughly 950 training episodes, the results are as follows:

```
====================
Episode: 941
Reward: 39712.0
Mean Reward: 2246.384288747346
Max reward so far: 111837.0
====================
Episode: 942
Reward: 9417.0
Mean Reward: 2253.9883351007425
Max reward so far: 111837.0
====================
Episode: 943
Reward: 109958.0
Mean Reward: 2368.08156779661
Max reward so far: 111837.0
====================
Episode: 944
Reward: 73285.0
Mean Reward: 2443.125925925926
Max reward so far: 111837.0
====================
Episode: 945
Reward: 40370.0
Mean Reward: 2483.217758985201
Max reward so far: 111837.0
[*] Model Saved: ./model/model.ckpt
```

For the code, see: [tensorflow] [pytorch]

Doom Deathmatch

<div align=center> <img src="imgs/play_doom_deathmatch.gif" alt="play Doom Deathmatch"> </div>

The network is the same as above; for the code, see: [code]

[0]. An introduction to Policy Gradients with Cartpole and Doom
