SkillAgentSearch skills...

TextRL

Implementation of ChatGPT RLHF (Reinforcement Learning with Human Feedback) on any generation model in huggingface's transformer (blommz-176B/bloom/gpt/bart/T5/MetaICL)

Install / Use

/learn @voidful/TextRL
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

TextRL: Text Generation with Reinforcement Learning

<p align="center"> <a href="https://pypi.org/project/textrl/"> <img alt="PyPI" src="https://img.shields.io/pypi/v/textrl"> </a> <a href="https://github.com/voidful/tfkit"> <img alt="Download" src="https://img.shields.io/pypi/dm/textrl"> </a> <a href="https://github.com/voidful/tfkit"> <img alt="Last Commit" src="https://img.shields.io/github/last-commit/voidful/textrl"> </a> <a href="https://www.codefactor.io/repository/github/voidful/textrl"> <img src="https://www.codefactor.io/repository/github/voidful/textrl/badge" alt="CodeFactor" /> </a> <a href="https://github.com/voidful/textrl"> <img src="https://visitor-badge.glitch.me/badge?page_id=voidful.textrl" alt="Visitor" /> </a> </p>

TextRL is a Python library that aims to improve text generation using reinforcement learning, building upon Hugging Face's Transformers, PFRL, and OpenAI GYM. TextRL is designed to be easily customizable and can be applied to various text-generation models.

TextRL

Table of Contents

Introduction

TextRL utilizes reinforcement learning to fine-tune text generation models. It is built upon the following libraries:

Example - gpt2

<details><summary>CLICK ME</summary> <p>

GPT2 Example

import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import AutoModelForCausalLM, AutoTokenizer
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

checkpoint = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto")

model = model.cuda()


class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):  # predicted will be the list of predicted token
        reward = [0]
        if finish:
            reward = [1]  # calculate reward score base on predicted_list
        return reward


observaton_list = [{"input":"explain how attention work in seq2seq model"}]
env = TextRLEnv(model, tokenizer, observation_input=observaton_list, max_length=20, compare_sample=2)
actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,
                    temperature=1.0,
                    top_k=0,
                    top_p=1.0,
                    repetition_penalty=2)
agent = actor.agent_ppo(update_interval=2, minibatch_size=2, epochs=10)
print(actor.predict(observaton_list[0]))

train_agent_with_evaluation(
    agent,
    env,
    steps=100,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=2,
    outdir='bloom—test',
)

print(actor.predict(observaton_list[0]))
</p> </details>

Example - flan-t5

<details><summary>CLICK ME</summary> <p>

Example Code

colab example: google/flan-t5-base

import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')


tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")  
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model.eval()
model.cuda()

sentiment = pipeline('sentiment-analysis',model="cardiffnlp/twitter-roberta-base-sentiment",tokenizer="cardiffnlp/twitter-roberta-base-sentiment",device=0,return_all_scores=True)

class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish): # predicted will be the list of predicted token
      reward = 0
      if finish or len(predicted_list[0]) >= self.env_max_length:
        predicted_text = tokenizer.convert_tokens_to_string(predicted_list[0])
        # sentiment classifier
        reward = sentiment(input_item['input']+predicted_text)[0][0]['score'] * 10
      return reward

observaton_list = [{'input':'i think dogecoin is'}]
env = MyRLEnv(model, tokenizer, observation_input=observaton_list, compare_sample=1)
actor = TextRLActor(env,model,tokenizer,optimizer='adamw',
                    temperature=0.8,
                    top_k=100,
                    top_p=0.85,)
agent = actor.agent_ppo(update_interval=50, minibatch_size=3, epochs=10,lr=3e-4)
print(actor.predict(observaton_list[0]))

pfrl.experiments.train_agent_with_evaluation(
    agent,
    env,
    steps=3000,
    eval_n_steps=None,
    eval_n_episodes=1,       
    train_max_episode_len=100,  
    eval_interval=10,
    outdir='checkpoint', 
)
agent.load("./checkpoint/best")
print(actor.predict(observaton_list[0]))
</p> </details>

Example - bigscience/bloomz-7b1-mt

<details><summary>CLICK ME</summary> <p>

bloomz-7b1-mt Example

import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import AutoModelForCausalLM, AutoTokenizer
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

checkpoint = "bigscience/bloomz-7b1-mt"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto")

model = model.cuda()


class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):  # predicted will be the list of predicted token
        reward = [0]
        if finish:
            reward = [1]  # calculate reward score base on predicted_list
        return reward


observaton_list = [{"input":"explain how attention work in seq2seq model"}]
env = TextRLEnv(model, tokenizer, observation_input=observaton_list, max_length=20, compare_sample=2)
actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,
                    temperature=1.0,
                    top_k=0,
                    top_p=1.0)
agent = actor.agent_ppo(update_interval=2, minibatch_size=2, epochs=10)
print(actor.predict(observaton_list[0]))

train_agent_with_evaluation(
    agent,
    env,
    steps=100,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=2,
    outdir='bloom—test',
)

print(actor.predict(observaton_list[0]))
</p> </details>

Example - 176B BLOOM

<details><summary>CLICK ME</summary> <p>

bloomz-176B Example

Strongly recommend contribute on public swarm to increase petals capacity

https://github.com/bigscience-workshop/petals

install pip install petals -U first

import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

MODEL_NAME = "bigscience/bloom-petals"
tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)
model = model.cuda()


class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):  # predicted will be the list of predicted token
        reward = [0]
        if finish:
            reward = [1]  # calculate reward score base on predicted_list
        return reward


observaton_list = [{"input":"explain how attention work in seq2seq model"}]
env = TextRLEnv(model, tokenizer, observation_input=observaton_list, max_length=20, compare_sample=2)
actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,
                    temperature=1.0,
                    top_k=0,
                    top_p=1.0)
agent = actor.agent_ppo(update_interval=2, minibatch_size=2, epochs=10)

print(actor.predict(observaton_list[0]))

train_agent_with_evaluation(
    agent,
    env,
    steps=100,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=2,
    outdir='bloom—test',
)

print(actor.predict(observaton_list[0]))
</p> </details>

Example - Controllable generation via RL to let Elon Musk speak ill of DOGE

<details><summary>CLICK ME</summary> <p> [Controllable generation via RL to let Elon Musk speak ill of DOGE ](https://github.com/voidful/TextRL/blob/main/example/2022-12-10-textrl-elon-musk.ipynb)

colab example: bigscience/bloom-560m

colab exmaple: huggingtweets/elonmusk

before: i think dogecoin is a great idea.
after: i think dogecoin is a great idea, but I think it is a little overused.

</p> </details>

Installation

pip install

pip install pfrl@git+https://github.com/voidful/pfrl.git
pip install textrl

Build from source

git clone and cd into this project.

pip install -e .

Usage

Initialize ag

Related Skills

View on GitHub
GitHub Stars565
CategoryEducation
Updated9d ago
Forks61

Languages

Python

Security Score

100/100

Audited on Mar 21, 2026

No findings