# TextRL: Text Generation with Reinforcement Learning

Implementation of ChatGPT-style RLHF (Reinforcement Learning with Human Feedback) on any generation model in Hugging Face Transformers (bloomz-176B / bloom / gpt / bart / T5 / MetaICL).
<p align="center"> <a href="https://pypi.org/project/textrl/"> <img alt="PyPI" src="https://img.shields.io/pypi/v/textrl"> </a> <a href="https://github.com/voidful/tfkit"> <img alt="Download" src="https://img.shields.io/pypi/dm/textrl"> </a> <a href="https://github.com/voidful/tfkit"> <img alt="Last Commit" src="https://img.shields.io/github/last-commit/voidful/textrl"> </a> <a href="https://www.codefactor.io/repository/github/voidful/textrl"> <img src="https://www.codefactor.io/repository/github/voidful/textrl/badge" alt="CodeFactor" /> </a> <a href="https://github.com/voidful/textrl"> <img src="https://visitor-badge.glitch.me/badge?page_id=voidful.textrl" alt="Visitor" /> </a> </p>

TextRL is a Python library that aims to improve text generation using reinforcement learning, building upon Hugging Face's Transformers, PFRL, and OpenAI GYM. TextRL is designed to be easily customizable and can be applied to various text-generation models.

## Table of Contents

- [Introduction](#introduction)
- [Example - gpt2](#example---gpt2)
- [Example - flan-t5](#example---flan-t5)
- [Example - bigscience/bloomz-7b1-mt](#example---bigsciencebloomz-7b1-mt)
- [Example - 176B BLOOM](#example---176b-bloom)
- [Example - Controllable generation via RL to let Elon Musk speak ill of DOGE](#example---controllable-generation-via-rl-to-let-elon-musk-speak-ill-of-doge)
- [Installation](#installation)
- [Usage](#usage)
## Introduction

TextRL utilizes reinforcement learning to fine-tune text-generation models. It is built upon the following libraries:

- [Hugging Face Transformers](https://github.com/huggingface/transformers)
- [PFRL](https://github.com/pfnet/pfrl)
- [OpenAI GYM](https://github.com/openai/gym)
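The main customization point throughout the examples below is the reward function: you subclass `TextRLEnv` and override `get_reward`. As a library-independent sketch (plain Python, no TextRL import; the length-based scoring is a made-up placeholder, not TextRL's default), the `get_reward` contract can be illustrated like this:

```python
# Standalone sketch of the get_reward contract used in the examples below.
# It mirrors the signature get_reward(input_item, predicted_list, finish),
# but the length-based scoring here is purely illustrative.

def get_reward(input_item, predicted_list, finish):
    """Return one reward per sampled candidate.

    input_item     -- dict with the prompt, e.g. {"input": "..."}
    predicted_list -- list of token lists, one per sampled candidate
    finish         -- True once generation for the episode has ended
    """
    reward = [0] * len(predicted_list)  # zero reward while still generating
    if finish:
        # Placeholder: reward longer completions; real setups plug in a
        # classifier or other scoring model here.
        reward = [len(tokens) for tokens in predicted_list]
    return reward

# Mid-episode: no scoring yet
print(get_reward({"input": "hello"}, [["a"], ["the", "best"]], False))  # [0, 0]
# Episode finished: one score per candidate
print(get_reward({"input": "hello"}, [["a"], ["the", "best"]], True))   # [1, 2]
```

Returning a list (one score per candidate) matches environments created with `compare_sample` greater than 1, as in the gpt2 example below.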
## Example - gpt2
<details><summary>CLICK ME</summary>
<p>
GPT-2 example:

```python
import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import AutoModelForCausalLM, AutoTokenizer
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

checkpoint = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto")
model = model.cuda()


class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):
        # predicted_list is the list of predicted tokens
        reward = [0]
        if finish:
            reward = [1]  # calculate the reward score based on predicted_list
        return reward


observation_list = [{"input": "explain how attention works in a seq2seq model"}]
env = MyRLEnv(model, tokenizer, observation_input=observation_list, max_length=20, compare_sample=2)
actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,
                    temperature=1.0,
                    top_k=0,
                    top_p=1.0,
                    repetition_penalty=2)
agent = actor.agent_ppo(update_interval=2, minibatch_size=2, epochs=10)

print(actor.predict(observation_list[0]))

train_agent_with_evaluation(
    agent,
    env,
    steps=100,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=2,
    outdir='gpt2-test',
)

print(actor.predict(observation_list[0]))
```
</p>
</details>
## Example - flan-t5
<details><summary>CLICK ME</summary>
<p>
Colab example: google/flan-t5-base

```python
import pfrl
from textrl import TextRLEnv, TextRLActor
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model.eval()
model.cuda()

sentiment = pipeline('sentiment-analysis',
                     model="cardiffnlp/twitter-roberta-base-sentiment",
                     tokenizer="cardiffnlp/twitter-roberta-base-sentiment",
                     device=0,
                     return_all_scores=True)


class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):
        # predicted_list is the list of predicted tokens
        reward = 0
        if finish or len(predicted_list[0]) >= self.env_max_length:
            predicted_text = tokenizer.convert_tokens_to_string(predicted_list[0])
            # reward from a sentiment classifier
            reward = sentiment(input_item['input'] + predicted_text)[0][0]['score'] * 10
        return reward


observation_list = [{'input': 'i think dogecoin is'}]
env = MyRLEnv(model, tokenizer, observation_input=observation_list, compare_sample=1)
actor = TextRLActor(env, model, tokenizer, optimizer='adamw',
                    temperature=0.8,
                    top_k=100,
                    top_p=0.85)
agent = actor.agent_ppo(update_interval=50, minibatch_size=3, epochs=10, lr=3e-4)

print(actor.predict(observation_list[0]))

pfrl.experiments.train_agent_with_evaluation(
    agent,
    env,
    steps=3000,
    eval_n_steps=None,
    eval_n_episodes=1,
    train_max_episode_len=100,
    eval_interval=10,
    outdir='checkpoint',
)
agent.load("./checkpoint/best")

print(actor.predict(observation_list[0]))
```
</p>
</details>
## Example - bigscience/bloomz-7b1-mt
<details><summary>CLICK ME</summary>
<p>
bloomz-7b1-mt example:

```python
import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import AutoModelForCausalLM, AutoTokenizer
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

checkpoint = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto")
model = model.cuda()


class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):
        # predicted_list is the list of predicted tokens
        reward = [0]
        if finish:
            reward = [1]  # calculate the reward score based on predicted_list
        return reward


observation_list = [{"input": "explain how attention works in a seq2seq model"}]
env = MyRLEnv(model, tokenizer, observation_input=observation_list, max_length=20, compare_sample=2)
actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,
                    temperature=1.0,
                    top_k=0,
                    top_p=1.0)
agent = actor.agent_ppo(update_interval=2, minibatch_size=2, epochs=10)

print(actor.predict(observation_list[0]))

train_agent_with_evaluation(
    agent,
    env,
    steps=100,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=2,
    outdir='bloom-test',
)

print(actor.predict(observation_list[0]))
```
</p>
</details>
## Example - 176B BLOOM
<details><summary>CLICK ME</summary>
<p>

bloomz-176B example, run over [Petals](https://github.com/bigscience-workshop/petals). We strongly recommend contributing to the public swarm to increase Petals capacity.

Install Petals first: `pip install petals -U`

```python
import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

MODEL_NAME = "bigscience/bloom-petals"
tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)
model = model.cuda()


class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):
        # predicted_list is the list of predicted tokens
        reward = [0]
        if finish:
            reward = [1]  # calculate the reward score based on predicted_list
        return reward


observation_list = [{"input": "explain how attention works in a seq2seq model"}]
env = MyRLEnv(model, tokenizer, observation_input=observation_list, max_length=20, compare_sample=2)
actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,
                    temperature=1.0,
                    top_k=0,
                    top_p=1.0)
agent = actor.agent_ppo(update_interval=2, minibatch_size=2, epochs=10)

print(actor.predict(observation_list[0]))

train_agent_with_evaluation(
    agent,
    env,
    steps=100,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=2,
    outdir='bloom-test',
)

print(actor.predict(observation_list[0]))
```
</p>
</details>
## Example - Controllable generation via RL to let Elon Musk speak ill of DOGE
<details><summary>CLICK ME</summary>
<p>

Notebook: [Controllable generation via RL to let Elon Musk speak ill of DOGE](https://github.com/voidful/TextRL/blob/main/example/2022-12-10-textrl-elon-musk.ipynb)

Colab example: bigscience/bloom-560m

Colab example: huggingtweets/elonmusk

Before: `i think dogecoin is a great idea.`

After: `i think dogecoin is a great idea, but I think it is a little overused.`

</p>
</details>
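The trick behind this example is the sign of the reward: to push the model toward speaking ill of DOGE, the reward can simply be how negative the generated text is. Below is a minimal, library-free sketch of that idea; the `negativity` word-counting scorer is a toy stand-in for a real sentiment model (such as the cardiffnlp pipeline used in the flan-t5 example above), and the helper names are hypothetical:

```python
# Sketch: reward = negative-sentiment score of prompt + completion.
# `negativity` is a toy stand-in for a real sentiment classifier.

NEGATIVE_WORDS = {"overused", "bad", "scam", "ill"}

def negativity(text):
    # Fraction of words that appear in a small "negative" word list.
    words = text.lower().split()
    return sum(w.strip(".,") in NEGATIVE_WORDS for w in words) / max(len(words), 1)

def get_reward(input_item, predicted_text, finish):
    if not finish:
        return 0
    # Higher reward for more negative completions, scaled like the
    # sentiment example above (score * 10).
    return negativity(input_item["input"] + " " + predicted_text) * 10

before = "a great idea."
after = "a great idea, but I think it is a little overused."
r_before = get_reward({"input": "i think dogecoin is"}, before, True)
r_after = get_reward({"input": "i think dogecoin is"}, after, True)
print(r_after > r_before)  # True: the more negative completion earns more reward
```

With PPO optimizing this reward, completions like the "after" sample above score higher than the neutral "before" sample, which is what steers generation.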
## Installation

### pip install

```shell
pip install pfrl@git+https://github.com/voidful/pfrl.git
pip install textrl
```

### Build from source

git clone and cd into this project, then:

```shell
pip install -e .
```
## Usage

### Initialize agent