# TextRL: Text Generation with Reinforcement Learning

Implementation of ChatGPT-style RLHF (Reinforcement Learning with Human Feedback) on any generation model in Hugging Face Transformers (bloomz-176B / bloom / gpt / bart / T5 / MetaICL).
<p align="center"> <a href="https://pypi.org/project/textrl/"> <img alt="PyPI" src="https://img.shields.io/pypi/v/textrl"> </a> <a href="https://github.com/voidful/tfkit"> <img alt="Download" src="https://img.shields.io/pypi/dm/textrl"> </a> <a href="https://github.com/voidful/tfkit"> <img alt="Last Commit" src="https://img.shields.io/github/last-commit/voidful/textrl"> </a> <a href="https://www.codefactor.io/repository/github/voidful/textrl"> <img src="https://www.codefactor.io/repository/github/voidful/textrl/badge" alt="CodeFactor" /> </a> <a href="https://github.com/voidful/textrl"> <img src="https://visitor-badge.glitch.me/badge?page_id=voidful.textrl" alt="Visitor" /> </a> </p>

TextRL is a Python library that aims to improve text generation using reinforcement learning, building upon Hugging Face's Transformers, PFRL, and OpenAI GYM. TextRL is designed to be easily customizable and can be applied to various text-generation models.

## Table of Contents

- [Introduction](#introduction)
- [Example - gpt2](#example---gpt2)
- [Example - flan-t5](#example---flan-t5)
- [Example - bigscience/bloomz-7b1-mt](#example---bigsciencebloomz-7b1-mt)
- [Example - 176B BLOOM](#example---176b-bloom)
- [Example - Controllable generation via RL to let Elon Musk speak ill of DOGE](#example---controllable-generation-via-rl-to-let-elon-musk-speak-ill-of-doge)
- [Installation](#installation)
- [Usage](#usage)
## Introduction

TextRL utilizes reinforcement learning to fine-tune text-generation models. It is built upon the following libraries:

- [Hugging Face Transformers](https://github.com/huggingface/transformers)
- [PFRL](https://github.com/pfnet/pfrl)
- [OpenAI GYM](https://github.com/openai/gym)
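The main customization point throughout the examples below is the reward function: you subclass `TextRLEnv` and override `get_reward`. As a library-independent sketch (plain Python, no TextRL import; the length-based scoring is a made-up placeholder, not TextRL's default), the `get_reward` contract can be illustrated like this:

```python
# Standalone sketch of the get_reward contract used in the examples below.
# It mirrors the signature get_reward(input_item, predicted_list, finish),
# but the length-based scoring here is purely illustrative.

def get_reward(input_item, predicted_list, finish):
    """Return one reward per sampled candidate.

    input_item     -- dict with the prompt, e.g. {"input": "..."}
    predicted_list -- list of token lists, one per sampled candidate
    finish         -- True once generation for the episode has ended
    """
    reward = [0] * len(predicted_list)  # zero reward while still generating
    if finish:
        # Placeholder: reward longer completions; real setups plug in a
        # classifier or other scoring model here.
        reward = [len(tokens) for tokens in predicted_list]
    return reward

# Mid-episode: no scoring yet
print(get_reward({"input": "hello"}, [["a"], ["the", "best"]], False))  # [0, 0]
# Episode finished: one score per candidate
print(get_reward({"input": "hello"}, [["a"], ["the", "best"]], True))   # [1, 2]
```

Returning a list (one score per candidate) matches environments created with `compare_sample` greater than 1, as in the gpt2 example below.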
## Example - gpt2
<details><summary>CLICK ME</summary>
<p>
GPT-2 example:

```python
import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import AutoModelForCausalLM, AutoTokenizer
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

checkpoint = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto")
model = model.cuda()


class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):
        # predicted_list is the list of predicted tokens
        reward = [0]
        if finish:
            reward = [1]  # calculate the reward score based on predicted_list
        return reward


observation_list = [{"input": "explain how attention works in a seq2seq model"}]
env = MyRLEnv(model, tokenizer, observation_input=observation_list, max_length=20, compare_sample=2)
actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,
                    temperature=1.0,
                    top_k=0,
                    top_p=1.0,
                    repetition_penalty=2)
agent = actor.agent_ppo(update_interval=2, minibatch_size=2, epochs=10)

print(actor.predict(observation_list[0]))

train_agent_with_evaluation(
    agent,
    env,
    steps=100,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=2,
    outdir='gpt2-test',
)

print(actor.predict(observation_list[0]))
```
</p>
</details>
## Example - flan-t5
<details><summary>CLICK ME</summary>
<p>
Colab example: google/flan-t5-base

```python
import pfrl
from textrl import TextRLEnv, TextRLActor
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model.eval()
model.cuda()

sentiment = pipeline('sentiment-analysis',
                     model="cardiffnlp/twitter-roberta-base-sentiment",
                     tokenizer="cardiffnlp/twitter-roberta-base-sentiment",
                     device=0,
                     return_all_scores=True)


class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):
        # predicted_list is the list of predicted tokens
        reward = 0
        if finish or len(predicted_list[0]) >= self.env_max_length:
            predicted_text = tokenizer.convert_tokens_to_string(predicted_list[0])
            # reward from a sentiment classifier
            reward = sentiment(input_item['input'] + predicted_text)[0][0]['score'] * 10
        return reward


observation_list = [{'input': 'i think dogecoin is'}]
env = MyRLEnv(model, tokenizer, observation_input=observation_list, compare_sample=1)
actor = TextRLActor(env, model, tokenizer, optimizer='adamw',
                    temperature=0.8,
                    top_k=100,
                    top_p=0.85)
agent = actor.agent_ppo(update_interval=50, minibatch_size=3, epochs=10, lr=3e-4)

print(actor.predict(observation_list[0]))

pfrl.experiments.train_agent_with_evaluation(
    agent,
    env,
    steps=3000,
    eval_n_steps=None,
    eval_n_episodes=1,
    train_max_episode_len=100,
    eval_interval=10,
    outdir='checkpoint',
)
agent.load("./checkpoint/best")

print(actor.predict(observation_list[0]))
```
</p>
</details>
## Example - bigscience/bloomz-7b1-mt
<details><summary>CLICK ME</summary>
<p>
bloomz-7b1-mt example:

```python
import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import AutoModelForCausalLM, AutoTokenizer
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

checkpoint = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto")
model = model.cuda()


class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):
        # predicted_list is the list of predicted tokens
        reward = [0]
        if finish:
            reward = [1]  # calculate the reward score based on predicted_list
        return reward


observation_list = [{"input": "explain how attention works in a seq2seq model"}]
env = MyRLEnv(model, tokenizer, observation_input=observation_list, max_length=20, compare_sample=2)
actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,
                    temperature=1.0,
                    top_k=0,
                    top_p=1.0)
agent = actor.agent_ppo(update_interval=2, minibatch_size=2, epochs=10)

print(actor.predict(observation_list[0]))

train_agent_with_evaluation(
    agent,
    env,
    steps=100,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=2,
    outdir='bloom-test',
)

print(actor.predict(observation_list[0]))
```
</p>
</details>
## Example - 176B BLOOM
<details><summary>CLICK ME</summary>
<p>

bloomz-176B example, run over [Petals](https://github.com/bigscience-workshop/petals). We strongly recommend contributing to the public swarm to increase Petals capacity.

Install Petals first: `pip install petals -U`

```python
import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

MODEL_NAME = "bigscience/bloom-petals"
tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)
model = model.cuda()


class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):
        # predicted_list is the list of predicted tokens
        reward = [0]
        if finish:
            reward = [1]  # calculate the reward score based on predicted_list
        return reward


observation_list = [{"input": "explain how attention works in a seq2seq model"}]
env = MyRLEnv(model, tokenizer, observation_input=observation_list, max_length=20, compare_sample=2)
actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,
                    temperature=1.0,
                    top_k=0,
                    top_p=1.0)
agent = actor.agent_ppo(update_interval=2, minibatch_size=2, epochs=10)

print(actor.predict(observation_list[0]))

train_agent_with_evaluation(
    agent,
    env,
    steps=100,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=2,
    outdir='bloom-test',
)

print(actor.predict(observation_list[0]))
```
</p>
</details>
## Example - Controllable generation via RL to let Elon Musk speak ill of DOGE
<details><summary>CLICK ME</summary>
<p>

Notebook: [Controllable generation via RL to let Elon Musk speak ill of DOGE](https://github.com/voidful/TextRL/blob/main/example/2022-12-10-textrl-elon-musk.ipynb)

Colab example: bigscience/bloom-560m

Colab example: huggingtweets/elonmusk

Before: `i think dogecoin is a great idea.`

After: `i think dogecoin is a great idea, but I think it is a little overused.`

</p>
</details>
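The trick behind this example is the sign of the reward: to push the model toward speaking ill of DOGE, the reward can simply be how negative the generated text is. Below is a minimal, library-free sketch of that idea; the `negativity` word-counting scorer is a toy stand-in for a real sentiment model (such as the cardiffnlp pipeline used in the flan-t5 example above), and the helper names are hypothetical:

```python
# Sketch: reward = negative-sentiment score of prompt + completion.
# `negativity` is a toy stand-in for a real sentiment classifier.

NEGATIVE_WORDS = {"overused", "bad", "scam", "ill"}

def negativity(text):
    # Fraction of words that appear in a small "negative" word list.
    words = text.lower().split()
    return sum(w.strip(".,") in NEGATIVE_WORDS for w in words) / max(len(words), 1)

def get_reward(input_item, predicted_text, finish):
    if not finish:
        return 0
    # Higher reward for more negative completions, scaled like the
    # sentiment example above (score * 10).
    return negativity(input_item["input"] + " " + predicted_text) * 10

before = "a great idea."
after = "a great idea, but I think it is a little overused."
r_before = get_reward({"input": "i think dogecoin is"}, before, True)
r_after = get_reward({"input": "i think dogecoin is"}, after, True)
print(r_after > r_before)  # True: the more negative completion earns more reward
```

With PPO optimizing this reward, completions like the "after" sample above score higher than the neutral "before" sample, which is what steers generation.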
## Installation

### pip install

```shell
pip install pfrl@git+https://github.com/voidful/pfrl.git
pip install textrl
```

### Build from source

git clone and cd into this project, then:

```shell
pip install -e .
```
## Usage

### Initialize agent