# Text2Text Language Modeling Toolkit
<details>
<summary>Overview</summary>

- Colab Notebooks
- Installation Requirements
- Quick Start Guide
- Languages Available
- Examples
- Questions?
- Citation
- Contributing
- Code of Conduct

</details>
## Colab Notebooks
## Installation Requirements

```
pip install -qq -U text2text
```

- Examples run with under 16 GB of RAM on free Colab GPUs.
## Quick Start Guide

Functionality | Invocation | Result
:------------: | :-------------: | :-------------:
Module Importing | `import text2text as t2t` | Libraries imported
Assistant | `t2t.Assistant().transform("Describe Text2Text in a few words: ")` | ['Text2Text is an AI-powered text generation tool that creates coherent and continuous text based on prompts.']
Language Model Setting | `t2t.Transformer.PRETRAINED_TRANSLATOR = "facebook/m2m100_418M"` | Change from the default
Tokenization | `t2t.Tokenizer().transform(["Hello, World!"])` | [['▁Hello', ',', '▁World', '!']]
Embedding | `t2t.Vectorizer().transform(["Hello, World!"])` | [[0.18745188, 0.05658336, ..., 0.6332584 , 0.43805206]]
TF-IDF | `t2t.Tfidfer().transform(["Hello, World!"])` | [{'!': 0.5, ',': 0.5, '▁Hello': 0.5, '▁World': 0.5}]
BM25 | `t2t.Bm25er().transform(["Hello, World!"])` | [{'!': 0.3068528194400547, ',': 0.3068528194400547, '▁Hello': 0.3068528194400547, '▁World': 0.3068528194400547}]
Indexer | `index = t2t.Indexer().transform(["Hello, World!"])` | Index object for information retrieval
Translation | `t2t.Translater().transform(["Hello, World!"], src_lang="en", tgt_lang="zh")` | ['你好,世界!']
Data Augmentation | `t2t.Variator().transform(["Hello, World!"], src_lang="en")` | ['Hello the world!', 'Welcome to the world.', 'Hello to the world!', ...]
Distance | `t2t.Measurer().transform(["Hello, World! [SEP] Hello, what?"])` | [2]
Identification | `t2t.Identifier().transform(["Aj keď sa Buzz Aldrin stal až „druhým človekom“..."])` | ['sk', 'Slovak']
## Examples
### Assistant

- Free, private, open-source alternative to commercial LLMs.
- Commercial LLMs are costly, collect your data, and impose quotas and rate limits that hinder development.
- Runs at no cost on the Google Colab free tier, so you don't even need your own device.
```python
import text2text as t2t

asst = t2t.Assistant()

# Streaming example
chat_history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello, how are you?"},
    {"role": "user", "content": "What should I do today?"}
]
# Streams a response such as: "1. Make a list of things to be grateful for.
# 2. Go outside and take a walk in nature. 3. Practice mindfulness meditation. ..."
result = asst.chat_completion(chat_history, stream=True)
for chunk in result:
    print(chunk.choices[0].delta.content, end='', flush=True)
```
```python
# Running conversation
messages = []
while True:
    user_input = input("User: ")
    print()
    messages.append({"role": "user", "content": user_input})
    print("Assistant: ")
    result = asst.chat_completion(messages, stream=False)
    print(result.choices[0].message.content)
    messages.append(dict(result.choices[0].message))
    print()
```
```python
# Schema for structured output
from pydantic import BaseModel

class Song(BaseModel):
    name: str
    artist: str

result = asst.chat_completion([
    {"role": "user", "content": "What is Britney Spears's best song?"}
], schema=Song)
# Song(name='Toxic', artist='Britney Spears')
```
### Tokenization

```python
t2t.Tokenizer().transform([
    "Let's go hiking tomorrow",
    "안녕하세요.",
    "돼지꿈을 꾸세요~~"
])
# Sub-word tokens
# [['▁Let', "'", 's', '▁go', '▁hik', 'ing', '▁tom', 'orrow'],
#  ['▁안녕', '하세요', '.'],
#  ['▁', '돼', '지', '꿈', '을', '▁꾸', '세요', '~~']]
```
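The `▁` marker follows the SentencePiece convention: it flags tokens that begin a new whitespace-separated piece, so the original text can be recovered by simple concatenation. A minimal round-trip sketch in plain Python (no text2text required):

```python
def detokenize(tokens):
    """Rejoin SentencePiece-style sub-word tokens into text."""
    # '▁' marks tokens that start a new whitespace-separated piece.
    return "".join(tokens).replace("▁", " ").strip()

tokens = ['▁Let', "'", 's', '▁go', '▁hik', 'ing', '▁tom', 'orrow']
print(detokenize(tokens))  # Let's go hiking tomorrow
```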
### Embedding / Vectorization

```python
t2t.Vectorizer().transform([
    "Let's go hiking tomorrow",
    "안녕하세요.",
    "돼지꿈을 꾸세요~~"
])
# Embeddings
# [[-0.00352954,  0.0260059 ,  0.00407429, ..., -0.04830331,
#   -0.02540749, -0.00924972],
#  [ 0.00043362,  0.00249816,  0.01755436, ...,  0.04451273,
#    0.05118701,  0.01895813],
#  [-0.03563676, -0.04856304,  0.00518898, ..., -0.00311068,
#    0.00071953, -0.00216325]]
```
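Embedding vectors like these are typically compared with cosine similarity: semantically close sentences score near 1, unrelated ones near 0. A minimal sketch in plain Python (the 3-dimensional vectors below are hypothetical, for illustration only, not actual model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical tiny embeddings, for illustration only
v1 = [-0.0035, 0.0260, 0.0041]
v2 = [0.0004, 0.0025, 0.0176]
print(cosine_similarity(v1, v2))
```

In practice the full-length vectors returned by `Vectorizer` would be compared the same way.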
### TF-IDF

```python
t2t.Tfidfer().transform([
    "Let's go hiking tomorrow, let's go!",
    "안녕하세요.",
    "돼지꿈을 꾸세요~~"
])
# TF-IDF values
# [{'!': 0.22360679774997894,
#   "'": 0.44721359549995787,
#   ',': 0.22360679774997894,
#   'ing': 0.22360679774997894,
#   'orrow': 0.22360679774997894,
#   's': 0.44721359549995787,
#   '▁Let': 0.22360679774997894,
#   '▁go': 0.44721359549995787,
#   '▁hik': 0.22360679774997894,
#   '▁let': 0.22360679774997894,
#   '▁tom': 0.22360679774997894},
#  {'.': 0.5773502691896258,
#   '▁안녕': 0.5773502691896258,
#   '하세요': 0.5773502691896258},
#  {'~~': 0.3535533905932738,
#   '▁': 0.3535533905932738,
#   '▁꾸': 0.3535533905932738,
#   '꿈': 0.3535533905932738,
#   '돼': 0.3535533905932738,
#   '세요': 0.3535533905932738,
#   '을': 0.3535533905932738,
#   '지': 0.3535533905932738}]
```
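The numbers above are consistent with L2-normalized term counts: the second sentence has 3 distinct tokens appearing once each, giving 1/√3 ≈ 0.5774, and the third has 8, giving 1/√8 ≈ 0.3536. A sketch of that weighting in plain Python (one plausible reading of the output; the library's exact IDF handling is not shown here):

```python
import math

def l2_normalized_tf(tokens):
    """Term frequencies scaled to unit L2 norm -- the weighting the
    output above appears to use when tokens are equally rare."""
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}

weights = l2_normalized_tf(['▁안녕', '하세요', '.'])
print(weights['▁안녕'])  # 0.5773502691896258, matching the output above
```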
### BM25

```python
t2t.Bm25er().transform([
    "Let's go hiking tomorrow",
    "안녕하세요.",
    "돼지꿈을 꾸세요~~"
])
# BM25 values
# [{"'": 1.2792257271403649,
#   'ing': 1.2792257271403649,
#   'orrow': 1.2792257271403649,
#   's': 1.2792257271403649,
#   '▁Let': 1.2792257271403649,
#   '▁go': 1.2792257271403649,
#   '▁hik': 1.2792257271403649,
#   '▁tom': 1.2792257271403649},
#  {'.': 1.751071282233123, '▁안녕': 1.751071282233123, '하세요': 1.751071282233123},
#  {'~~': 1.2792257271403649,
#   '▁': 1.2792257271403649,
#   '▁꾸': 1.2792257271403649,
#   '꿈': 1.2792257271403649,
#   '돼': 1.2792257271403649,
#   '세요': 1.2792257271403649,
#   '을': 1.2792257271403649,
#   '지': 1.2792257271403649}]
```
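Unlike TF-IDF, BM25 saturates repeated terms and rewards shorter documents, which is why the 3-token Korean sentence scores higher per term than the 8-token sentences. A sketch of the standard Okapi BM25 term weight (the `k1`, `b`, and IDF variant text2text uses internally may differ, so the exact numbers above are not reproduced here):

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=1.5, b=0.75):
    """Okapi BM25 weight for one term in one document.
    tf: term count in the document; df: documents containing the term."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

# A term occurring once in 1 of 3 documents: the shorter document wins
print(bm25_weight(tf=1, df=1, n_docs=3, doc_len=3, avg_len=19 / 3))
print(bm25_weight(tf=1, df=1, n_docs=3, doc_len=8, avg_len=19 / 3))
```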
### Index

```python
index = t2t.Indexer().transform([
    "Let's go hiking tomorrow, let's go!",
    "안녕하세요.",
    "돼지꿈을 꾸세요~~",
])
index.retrieve(["돼지"], k=1)  # [['돼지꿈을 꾸세요~~']]

# Add documents
index.add(["Hello, World! 你好,世界!"])

# Remove by ids
index.remove([2])  # removes "돼지꿈을 꾸세요~~"

# Retrieve k results per query, sorted by distance
index.retrieve(["你好, World"], k=3)
```

To learn more, see STF-IDF.
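An index like this pairs a document store with nearest-neighbor search over vector representations. The toy `ToyIndex` below (a hypothetical stand-in, not the text2text implementation) mimics the same add/remove/retrieve shape using brute-force character-overlap scoring instead of embeddings:

```python
class ToyIndex:
    """Toy stand-in for t2t.Indexer: brute-force retrieval by character overlap.
    The real Indexer ranks by distance between dense embeddings."""

    def __init__(self):
        self.docs = {}      # id -> text
        self.next_id = 0

    def add(self, texts):
        for text in texts:
            self.docs[self.next_id] = text
            self.next_id += 1

    def remove(self, ids):
        for doc_id in ids:
            self.docs.pop(doc_id, None)

    def retrieve(self, queries, k=1):
        results = []
        for query in queries:
            scored = sorted(self.docs.items(),
                            key=lambda item: -len(set(query) & set(item[1])))
            results.append([text for _, text in scored[:k]])
        return results

index = ToyIndex()
index.add(["Let's go hiking tomorrow, let's go!", "안녕하세요.", "돼지꿈을 꾸세요~~"])
print(index.retrieve(["돼지"], k=1))  # [['돼지꿈을 꾸세요~~']]
```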
### Levenshtein Sub-word Edit Distance

```python
t2t.Measurer().transform([
    "Hello, World! [SEP] Hello, what?",
    "안녕하세요. [SEP] 돼지꿈을 꾸세요~~"
], metric="levenshtein_distance")
# Distances
# [2, 8]
```
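The distances above can be reproduced with a plain dynamic-programming Levenshtein over the sub-word token sequences on each side of `[SEP]` (a sketch independent of text2text; the tokenization of "Hello, what?" as `['▁Hello', ',', '▁what', '?']` is an assumption consistent with the Tokenizer outputs shown earlier):

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

# "Hello, World!" vs "Hello, what?": two substituted tokens
print(levenshtein(['▁Hello', ',', '▁World', '!'], ['▁Hello', ',', '▁what', '?']))  # 2
# The Korean pair shares no tokens, so the distance is the longer length
print(levenshtein(['▁안녕', '하세요', '.'],
                  ['▁', '돼', '지', '꿈', '을', '▁꾸', '세요', '~~']))  # 8
```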
### Translation
# Sample texts
article_en = 'The Secretary-General of the United Nations says there is no military solution in Syria.'
notre_dame_str = "As at most other universities, Notre Dame's students run a number of news media outlets. The nine student - run outlets include three newspapers, both a radio and television station, and several magazines and journals. Begun as a one - page journal in September 1876, the Scholastic magazine is issued twice monthly and claims to be the oldest continuous collegiate publication in the United States. The other magazine, The Juggler, is released twice a year and focuses on student literature and artwork. The Dome yearbook is published annually. The newspapers have varying publication interests, with The Observer published daily and mainly reporting university and other news, and staffed by students from both Notre Dame and Saint Mary's College. Unlike Scholastic and The Dome, The Observer is a
