# Text2Text Language Modeling Toolkit
<details>
<summary>Overview</summary>

- Colab Notebooks
- Installation Requirements
- Quick Start Guide
- Languages Available
- Examples
- Questions?
- Citation
- Contributing
- Code of Conduct

</details>
## Colab Notebooks
## Installation Requirements

```
pip install -qq -U text2text
```

- Examples run with under 16 GB of RAM on free Colab GPUs.
## Quick Start Guide

Functionality | Invocation | Result
:------------: | :-------------: | :-------------:
Module Importing | `import text2text as t2t` | Libraries imported
Assistant | `t2t.Assistant().transform("Describe Text2Text in a few words: ")` | ['Text2Text is an AI-powered text generation tool that creates coherent and continuous text based on prompts.']
Language Model Setting | `t2t.Transformer.PRETRAINED_TRANSLATOR = "facebook/m2m100_418M"` | Change from the default
Tokenization | `t2t.Tokenizer().transform(["Hello, World!"])` | [['▁Hello', ',', '▁World', '!']]
Embedding | `t2t.Vectorizer().transform(["Hello, World!"])` | [[0.18745188, 0.05658336, ..., 0.6332584 , 0.43805206]]
TF-IDF | `t2t.Tfidfer().transform(["Hello, World!"])` | [{'!': 0.5, ',': 0.5, '▁Hello': 0.5, '▁World': 0.5}]
BM25 | `t2t.Bm25er().transform(["Hello, World!"])` | [{'!': 0.3068528194400547, ',': 0.3068528194400547, '▁Hello': 0.3068528194400547, '▁World': 0.3068528194400547}]
Indexer | `index = t2t.Indexer().transform(["Hello, World!"])` | Index object for information retrieval
Translation | `t2t.Translater().transform(["Hello, World!"], src_lang="en", tgt_lang="zh")` | ['你好,世界!']
Data Augmentation | `t2t.Variator().transform(["Hello, World!"], src_lang="en")` | ['Hello the world!', 'Welcome to the world.', 'Hello to the world!', ...]
Distance | `t2t.Measurer().transform(["Hello, World! [SEP] Hello, what?"])` | [2]
Identification | `t2t.Identifier().transform(["Aj keď sa Buzz Aldrin stal až „druhým človekom“..."])` | ['sk', 'Slovak']
## Examples
### Assistant

- Free, private, open-source alternative to commercial LLMs.
- Commercial LLMs are costly, collect your data, and impose quotas and rate limits that hinder development.
- Runs at no cost on the Google Colab free tier, so you don't even need your own device.
```python
import text2text as t2t

asst = t2t.Assistant()

# Streaming example
chat_history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello, how are you?"},
    {"role": "user", "content": "What should I do today?"}
]
# Streams a response such as: "1. Make a list of things to be grateful for.
# 2. Go outside and take a walk in nature. 3. Practice mindfulness meditation. ..."
result = asst.chat_completion(chat_history, stream=True)
for chunk in result:
    print(chunk.choices[0].delta.content, end='', flush=True)
```
```python
# Running conversation
messages = []
while True:
    user_input = input("User: ")
    print()
    messages.append({"role": "user", "content": user_input})
    print("Assistant: ")
    result = asst.chat_completion(messages, stream=False)
    print(result.choices[0].message.content)
    messages.append(dict(result.choices[0].message))
    print()
```
```python
# Schema for structured output
from pydantic import BaseModel

class Song(BaseModel):
    name: str
    artist: str

result = asst.chat_completion([
    {"role": "user", "content": "What is Britney Spears's best song?"}
], schema=Song)
# Song(name='Toxic', artist='Britney Spears')
```
### Tokenization

```python
t2t.Tokenizer().transform([
    "Let's go hiking tomorrow",
    "안녕하세요.",
    "돼지꿈을 꾸세요~~"
])
# Sub-word tokens
# [['▁Let', "'", 's', '▁go', '▁hik', 'ing', '▁tom', 'orrow'],
#  ['▁안녕', '하세요', '.'],
#  ['▁', '돼', '지', '꿈', '을', '▁꾸', '세요', '~~']]
```
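The `▁` marker follows the SentencePiece convention: it flags tokens that begin a new whitespace-separated piece, so the original text can be recovered by simple concatenation. A minimal round-trip sketch in plain Python (no text2text required):

```python
def detokenize(tokens):
    """Rejoin SentencePiece-style sub-word tokens into text."""
    # '▁' marks tokens that start a new whitespace-separated piece.
    return "".join(tokens).replace("▁", " ").strip()

tokens = ['▁Let', "'", 's', '▁go', '▁hik', 'ing', '▁tom', 'orrow']
print(detokenize(tokens))  # Let's go hiking tomorrow
```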
### Embedding / Vectorization

```python
t2t.Vectorizer().transform([
    "Let's go hiking tomorrow",
    "안녕하세요.",
    "돼지꿈을 꾸세요~~"
])
# Embeddings
# [[-0.00352954,  0.0260059 ,  0.00407429, ..., -0.04830331,
#   -0.02540749, -0.00924972],
#  [ 0.00043362,  0.00249816,  0.01755436, ...,  0.04451273,
#    0.05118701,  0.01895813],
#  [-0.03563676, -0.04856304,  0.00518898, ..., -0.00311068,
#    0.00071953, -0.00216325]]
```
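Embedding vectors like these are typically compared with cosine similarity: semantically close sentences score near 1, unrelated ones near 0. A minimal sketch in plain Python (the 3-dimensional vectors below are hypothetical, for illustration only, not actual model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical tiny embeddings, for illustration only
v1 = [-0.0035, 0.0260, 0.0041]
v2 = [0.0004, 0.0025, 0.0176]
print(cosine_similarity(v1, v2))
```

In practice the full-length vectors returned by `Vectorizer` would be compared the same way.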
### TF-IDF

```python
t2t.Tfidfer().transform([
    "Let's go hiking tomorrow, let's go!",
    "안녕하세요.",
    "돼지꿈을 꾸세요~~"
])
# TF-IDF values
# [{'!': 0.22360679774997894,
#   "'": 0.44721359549995787,
#   ',': 0.22360679774997894,
#   'ing': 0.22360679774997894,
#   'orrow': 0.22360679774997894,
#   's': 0.44721359549995787,
#   '▁Let': 0.22360679774997894,
#   '▁go': 0.44721359549995787,
#   '▁hik': 0.22360679774997894,
#   '▁let': 0.22360679774997894,
#   '▁tom': 0.22360679774997894},
#  {'.': 0.5773502691896258,
#   '▁안녕': 0.5773502691896258,
#   '하세요': 0.5773502691896258},
#  {'~~': 0.3535533905932738,
#   '▁': 0.3535533905932738,
#   '▁꾸': 0.3535533905932738,
#   '꿈': 0.3535533905932738,
#   '돼': 0.3535533905932738,
#   '세요': 0.3535533905932738,
#   '을': 0.3535533905932738,
#   '지': 0.3535533905932738}]
```
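The numbers above are consistent with L2-normalized term counts: the second sentence has 3 distinct tokens appearing once each, giving 1/√3 ≈ 0.5774, and the third has 8, giving 1/√8 ≈ 0.3536. A sketch of that weighting in plain Python (one plausible reading of the output; the library's exact IDF handling is not shown here):

```python
import math

def l2_normalized_tf(tokens):
    """Term frequencies scaled to unit L2 norm -- the weighting the
    output above appears to use when tokens are equally rare."""
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}

weights = l2_normalized_tf(['▁안녕', '하세요', '.'])
print(weights['▁안녕'])  # 0.5773502691896258, matching the output above
```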
### BM25

```python
t2t.Bm25er().transform([
    "Let's go hiking tomorrow",
    "안녕하세요.",
    "돼지꿈을 꾸세요~~"
])
# BM25 values
# [{"'": 1.2792257271403649,
#   'ing': 1.2792257271403649,
#   'orrow': 1.2792257271403649,
#   's': 1.2792257271403649,
#   '▁Let': 1.2792257271403649,
#   '▁go': 1.2792257271403649,
#   '▁hik': 1.2792257271403649,
#   '▁tom': 1.2792257271403649},
#  {'.': 1.751071282233123, '▁안녕': 1.751071282233123, '하세요': 1.751071282233123},
#  {'~~': 1.2792257271403649,
#   '▁': 1.2792257271403649,
#   '▁꾸': 1.2792257271403649,
#   '꿈': 1.2792257271403649,
#   '돼': 1.2792257271403649,
#   '세요': 1.2792257271403649,
#   '을': 1.2792257271403649,
#   '지': 1.2792257271403649}]
```
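Unlike TF-IDF, BM25 saturates repeated terms and rewards shorter documents, which is why the 3-token Korean sentence scores higher per term than the 8-token sentences. A sketch of the standard Okapi BM25 term weight (the `k1`, `b`, and IDF variant text2text uses internally may differ, so the exact numbers above are not reproduced here):

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=1.5, b=0.75):
    """Okapi BM25 weight for one term in one document.
    tf: term count in the document; df: documents containing the term."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

# A term occurring once in 1 of 3 documents: the shorter document wins
print(bm25_weight(tf=1, df=1, n_docs=3, doc_len=3, avg_len=19 / 3))
print(bm25_weight(tf=1, df=1, n_docs=3, doc_len=8, avg_len=19 / 3))
```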
### Index

```python
index = t2t.Indexer().transform([
    "Let's go hiking tomorrow, let's go!",
    "안녕하세요.",
    "돼지꿈을 꾸세요~~",
])
index.retrieve(["돼지"], k=1)  # [['돼지꿈을 꾸세요~~']]

# Add documents
index.add(["Hello, World! 你好,世界!"])

# Remove by ids
index.remove([2])  # removes "돼지꿈을 꾸세요~~"

# Retrieve k results per query, sorted by distance
index.retrieve(["你好, World"], k=3)
```

To learn more, see STF-IDF.
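An index like this pairs a document store with nearest-neighbor search over vector representations. The toy `ToyIndex` below (a hypothetical stand-in, not the text2text implementation) mimics the same add/remove/retrieve shape using brute-force character-overlap scoring instead of embeddings:

```python
class ToyIndex:
    """Toy stand-in for t2t.Indexer: brute-force retrieval by character overlap.
    The real Indexer ranks by distance between dense embeddings."""

    def __init__(self):
        self.docs = {}      # id -> text
        self.next_id = 0

    def add(self, texts):
        for text in texts:
            self.docs[self.next_id] = text
            self.next_id += 1

    def remove(self, ids):
        for doc_id in ids:
            self.docs.pop(doc_id, None)

    def retrieve(self, queries, k=1):
        results = []
        for query in queries:
            scored = sorted(self.docs.items(),
                            key=lambda item: -len(set(query) & set(item[1])))
            results.append([text for _, text in scored[:k]])
        return results

index = ToyIndex()
index.add(["Let's go hiking tomorrow, let's go!", "안녕하세요.", "돼지꿈을 꾸세요~~"])
print(index.retrieve(["돼지"], k=1))  # [['돼지꿈을 꾸세요~~']]
```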
### Levenshtein Sub-word Edit Distance

```python
t2t.Measurer().transform([
    "Hello, World! [SEP] Hello, what?",
    "안녕하세요. [SEP] 돼지꿈을 꾸세요~~"
], metric="levenshtein_distance")
# Distances
# [2, 8]
```
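The distances above can be reproduced with a plain dynamic-programming Levenshtein over the sub-word token sequences on each side of `[SEP]` (a sketch independent of text2text; the tokenization of "Hello, what?" as `['▁Hello', ',', '▁what', '?']` is an assumption consistent with the Tokenizer outputs shown earlier):

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

# "Hello, World!" vs "Hello, what?": two substituted tokens
print(levenshtein(['▁Hello', ',', '▁World', '!'], ['▁Hello', ',', '▁what', '?']))  # 2
# The Korean pair shares no tokens, so the distance is the longer length
print(levenshtein(['▁안녕', '하세요', '.'],
                  ['▁', '돼', '지', '꿈', '을', '▁꾸', '세요', '~~']))  # 8
```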
### Translation
# Sample texts
article_en = 'The Secretary-General of the United Nations says there is no military solution in Syria.'
notre_dame_str = "As at most other universities, Notre Dame's students run a number of news media outlets. The nine student - run outlets include three newspapers, both a radio and television station, and several magazines and journals. Begun as a one - page journal in September 1876, the Scholastic magazine is issued twice monthly and claims to be the oldest continuous collegiate publication in the United States. The other magazine, The Juggler, is released twice a year and focuses on student literature and artwork. The Dome yearbook is published annually. The newspapers have varying publication interests, with The Observer published daily and mainly reporting university and other news, and staffed by students from both Notre Dame and Saint Mary's College. Unlike Scholastic and The Dome, The Observer is a
