#########
RLSeq2Seq
#########


.. image:: https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat
   :target: https://github.com/yaserkl/RLSeq2Seq/pulls

.. image:: https://img.shields.io/badge/Made%20with-Python-1f425f.svg
   :target: https://www.python.org/

.. image:: https://img.shields.io/pypi/l/ansicolortags.svg
   :target: https://github.com/yaserkl/RLSeq2Seq/blob/master/LICENSE.txt

.. image:: https://img.shields.io/github/contributors/Naereen/StrapDown.js.svg
   :target: https://github.com/yaserkl/RLSeq2Seq/graphs/contributors

.. image:: https://img.shields.io/github/issues/Naereen/StrapDown.js.svg
   :target: https://github.com/yaserkl/RLSeq2Seq/issues

.. image:: https://img.shields.io/badge/arXiv-1805.09461-red.svg?style=flat
   :target: https://arxiv.org/abs/1805.09461

NOTE: This code is no longer actively maintained.

This repository contains the code developed in TensorFlow_ for the following paper:

| `Deep Reinforcement Learning For Sequence to Sequence Models`_,
| by: `Yaser Keneshloo`_, `Tian Shi`_, `Naren Ramakrishnan`_, and `Chandan K. Reddy`_

.. _Deep Reinforcement Learning For Sequence to Sequence Models: https://arxiv.org/abs/1805.09461
.. _TensorFlow: https://www.tensorflow.org/
.. _Yaser Keneshloo: https://github.com/yaserkl
.. _Tian Shi: http://life-tp.com/Tian_Shi/
.. _Chandan K. Reddy: http://people.cs.vt.edu/~reddy/
.. _Naren Ramakrishnan: http://people.cs.vt.edu/naren/

If you use this code, please consider citing the following paper:

.. code:: bibtex

    @article{keneshloo2018deep,
      title={Deep Reinforcement Learning For Sequence to Sequence Models},
      author={Keneshloo, Yaser and Shi, Tian and Ramakrishnan, Naren and Reddy, Chandan K.},
      journal={arXiv preprint arXiv:1805.09461},
      year={2018}
    }

#################
Table of Contents
#################

.. contents::
  :local:
  :depth: 3


.. image:: docs/_img/rlseq.png
   :target: docs/_img/rlseq.png

Motivation
==========

In recent years, sequence-to-sequence (seq2seq) models have been used in a variety of tasks, from machine translation, headline generation, text summarization, and speech-to-text to image caption generation. The underlying framework of all these models is usually a deep neural network containing an encoder and a decoder. The encoder processes the input data, and the decoder receives the encoder's output and generates the final output. Although simply using an encoder/decoder model would, most of the time, produce better results than traditional methods on the above-mentioned tasks, researchers have proposed further improvements over these sequence-to-sequence models, such as attention over the input, pointer-generator models, and self-attention models.

However, all these seq2seq models suffer from two common problems: 1) exposure bias and 2) inconsistency between the training objective and the test-time evaluation measure. Recently, a completely fresh point of view emerged for solving these two problems by applying methods from Reinforcement Learning (RL). This line of research looks at seq2seq problems from the RL point of view and seeks a formulation that combines the decision-making power of RL methods with the ability of sequence-to-sequence models to remember long-range context.

In this paper, we summarize some of the most recent frameworks that bring concepts from the RL world into the deep neural network area and explain how these two areas can benefit from each other in solving complex seq2seq tasks. In the end, we provide insights into some of the shortcomings of current models and how they could be improved with better RL methods. We also provide the source code for implementing most of the models discussed in this paper on the complex task of abstractive text summarization.


Requirements
============


Python
------

- Use Python 2.7

Python requirements can be installed as follows:

.. code:: bash

    pip install -r python_requirements.txt

TensorFlow
----------

- TensorFlow 1.10.1

GPU
---

- CUDA 9
- cuDNN 7.1

DATASET
=======


CNN/Daily Mail dataset
----------------------

https://github.com/abisee/cnn-dailymail


Newsroom dataset
----------------

https://summari.es/

We have provided helper scripts to download the CNN/Daily Mail dataset and to pre-process both it and the Newsroom dataset. Please refer to `this link <src/helper>`_ to access them.

We saw a large improvement in ROUGE scores on the summarization task when using our processed version of these datasets; therefore, we strongly suggest using these pre-processed files for all training.


Running Experiments
===================

This code is a general framework supporting a variety of training modes, with the following features:

  1. Scheduled Sampling, Soft-Scheduled Sampling, and End2EndBackProp.

  2. Policy-gradient with self-critic learning, temporal attention, and intra-decoder attention:

    A. Following `A Deep Reinforced Model for Abstractive Summarization <https://arxiv.org/abs/1705.04304>`_

  3. Actor-Critic model through DDQN and Dueling network based on these papers:

    A. `Deep Reinforcement Learning with Double Q-learning <https://arxiv.org/abs/1509.06461>`_
    B. `Dueling Network Architectures for Deep Reinforcement Learning <https://arxiv.org/abs/1511.06581>`_
    C. `An Actor-Critic Algorithm for Sequence Prediction <https://arxiv.org/abs/1607.07086>`_
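
The self-critic learning in item 2 can be illustrated with a minimal sketch (our own simplification in plain Python, not the repository's TensorFlow implementation): the reward of the model's greedy decode serves as the baseline for the sampled sequence, so no learned value network is needed.

```python
import math

def self_critic_loss(sample_log_probs, sample_reward, greedy_reward):
    """Self-critic policy-gradient loss for one output sequence.

    sample_log_probs: log-probabilities of the *sampled* tokens.
    sample_reward:    sequence-level reward (e.g. ROUGE) of the sample.
    greedy_reward:    reward of the greedy decode, used as the baseline.
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this loss raises the probability of samples that score
    # higher than the model's own greedy output, and lowers it otherwise.
    return -advantage * sum(sample_log_probs)

# Toy usage with made-up rewards and token probabilities: the sample
# beats the greedy baseline, so minimizing the loss reinforces its tokens.
loss = self_critic_loss([math.log(0.5), math.log(0.25)],
                        sample_reward=0.6, greedy_reward=0.4)
```

In the full model, `sample_log_probs` would come from the decoder's softmax over the sampled trajectory.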



Scheduled Sampling, Soft-Scheduled Sampling, and End2EndBackProp
----------------------------------------------------------------

`Bengio et al. <https://arxiv.org/abs/1506.03099>`_ proposed the idea of scheduled sampling for avoiding the exposure bias problem. Recently, `Goyal et al. <https://arxiv.org/abs/1506.03099>`_ proposed a differentiable relaxation of this method that uses soft-argmax rather than hard-argmax, which fixes the back-propagation issue in the original model. Also, `Ranzato et al. <https://arxiv.org/abs/1511.06732>`_ proposed another simple model, called End2EndBackProp, for avoiding the exposure bias problem. To train a model based on each of these papers, we provide the following flags:
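
The soft-argmax relaxation can be sketched in plain Python (a toy illustration under our own assumptions, not the repository's implementation; `alpha` mirrors the sharpening flag of the same name in the table below):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_argmax_embedding(logits, embeddings, alpha=1.0):
    """Differentiable stand-in for embeddings[argmax(logits)].

    Hard argmax picks one token and blocks gradients; the relaxation
    instead feeds back a softmax-weighted average of the token
    embeddings. A larger `alpha` sharpens the weights toward the
    embedding of the true argmax.
    """
    weights = softmax([alpha * logit for logit in logits])
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings))
            for d in range(dim)]

logits = [2.0, 0.1, -1.0]  # token 0 is the argmax
vocab_emb = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mixed = soft_argmax_embedding(logits, vocab_emb, alpha=1.0)
sharp = soft_argmax_embedding(logits, vocab_emb, alpha=10.0)
# `sharp` is close to vocab_emb[0]; `mixed` blends all three embeddings.
```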

+----------------------------+---------+------------------------------------------------------------------+
| Parameter                  | Default | Description                                                      |
+============================+=========+==================================================================+
| scheduled_sampling         | False   | Whether to do scheduled sampling or not                          |
+----------------------------+---------+------------------------------------------------------------------+
| sampling_probability       | 0       | Epsilon value for choosing ground-truth or model output          |
+----------------------------+---------+------------------------------------------------------------------+
| fixed_sampling_probability | False   | Whether to use a fixed or adaptive sampling probability          |
+----------------------------+---------+------------------------------------------------------------------+
| hard_argmax                | True    | Whether to use soft argmax or hard argmax                        |
+----------------------------+---------+------------------------------------------------------------------+
| greedy_scheduled_sampling  | False   | Whether to use greedy or sampled output; True means greedy       |
+----------------------------+---------+------------------------------------------------------------------+
| E2EBackProp                | False   | Whether to use the E2EBackProp algorithm to solve exposure bias  |
+----------------------------+---------+------------------------------------------------------------------+
| alpha                      | 1       | Soft argmax argument                                             |
+----------------------------+---------+------------------------------------------------------------------+
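
Read together, `scheduled_sampling`, `sampling_probability`, and `fixed_sampling_probability` amount to the per-step decision rule sketched below (an illustrative sketch under our own assumptions; the linear schedule shown is hypothetical, not necessarily the repository's exact adaptive rule):

```python
import random

def next_decoder_input(ground_truth_token, model_token, epsilon):
    """Scheduled sampling at one decoder step.

    With probability `epsilon` (the `sampling_probability` flag) feed
    the model's own previous output back in; otherwise feed the ground
    truth (plain teacher forcing).
    """
    if random.random() < epsilon:
        return model_token       # model output (greedy or sampled)
    return ground_truth_token    # teacher forcing

def linear_epsilon(step, rate=2.5e-5):
    """Hypothetical adaptive schedule: epsilon grows linearly with the
    training step, capped at 1.0. With fixed_sampling_probability=False
    the probability would change over training rather than stay constant.
    """
    return min(1.0, rate * step)
```

For example, `rate=2.5e-5` reaches epsilon of 1.0 after 40,000 steps, so by the end of training every decoder input comes from the model itself.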

Scheduled Sampling using Hard-Argmax and Greedy selection (`Bengio et al. <https://arxiv.org/abs/1506.03099>`_):

.. code:: bash

    CUDA_VISIBLE_DEVICES=0 python src/run_summarization.py \
        --mode=train \
        --data_path=$HOME/data/cnn_dm/finished_files/chunked/train_* \
        --vocab_path=$HOME/data/cnn_dm/finished_files/vocab \
        --log_root=$HOME/working_dir/cnn_dm/RLSeq2Seq/ \
        --exp_name=scheduled-sampling-hardargmax-greedy \
        --batch_size=80 \
        --max_iter=40000 \
        --scheduled_sampling=True \
        --sampling_probability=2.5E-05 \
        --hard_argmax=True \
        --greedy_scheduled_sampling=True


Scheduled Sampling using Soft-Argmax and Sampling selection (`Goyal et al. <https://arxiv.org/abs/1506.03099>`_):

.. code:: bash

    CUDA_V
