Attention: DataLLM is superseded by mostlyai-mock.
DataLLM: prompt LLMs for Tabular Data 🔮
Welcome to DataLLM, your go-to open-source platform for Tabular Data Generation!
DataLLM allows you to efficiently tap into the vast power of LLMs to...
- create mock data that fits your needs, as well as
- enrich datasets with world knowledge.
Start using DataLLM
- Sign in and retrieve an API key.
- Install the latest version of the DataLLM Python client.
pip install -U datallm
- Instantiate a client with your retrieved API key.
from datallm import DataLLM
datallm = DataLLM(api_key='INSERT_API_KEY', base_url='https://data.mostly.ai')
- Enrich an existing dataset with new columns that are coherent with any of the already present columns.
import pandas as pd
df = pd.DataFrame({
    "age in years": [5, 10, 13, 19, 30, 40, 50, 60, 70, 80],
    "gender": ["m", "f", "m", "f", "m", "f", "m", "f", "m", "f"],
    "country code": ["AT", "DE", "FR", "IT", "ES", "PT", "GR", "UK", "SE", "FI"],
})
# enrich the DataFrame with a new column containing the official country name
df["country"] = datallm.enrich(df, prompt="official name of the country")
# enrich the DataFrame with first name and last name
df["first name"] = datallm.enrich(df, prompt="the first name of that person")
df["last name"] = datallm.enrich(df, prompt="the last name of that person")
# enrich the DataFrame with a categorical
df["age group"] = datallm.enrich(
    df, prompt="age group", categories=["kid", "teen", "adult", "elderly"]
)
# enrich with a boolean value and an integer value
df["speaks german"] = datallm.enrich(df, prompt="speaks german?", dtype="boolean")
df["body height"] = datallm.enrich(df, prompt="the body height in cm", dtype="integer")
print(df)
#    age in years  gender  country code         country  first name   last name  age group  speaks german  body height
# 0             5       m            AT         Austria      Julian     Kittner        kid           True          106
# 1            10       f            DE         Germany       Julia     Buchner       teen           True          156
# 2            13       m            FR          France    Benjamin    Dumoulin       teen          False          174
# 3            19       f            IT           Italy     Alessia  Santamaria       teen          False          163
# 4            30       m            ES           Spain        Paco        Ruiz      adult          False          185
# 5            40       f            PT        Portugal       Elisa      Santos      adult          False          168
# 6            50       m            GR          Greece    Dimitris     Kleopas      adult          False          166
# 7            60       f            UK  United Kingdom       Diane     Huntley    elderly          False          162
# 8            70       m            SE          Sweden        Stig   Nordstrom    elderly          False          174
# 9            80       f            FI         Finland        Aili      Juhola    elderly          False          157
- Or create a completely new dataset from scratch.
df = datallm.mock(
    n=100,  # number of generated records
    data_description="Guests of an Alpine ski hotel in Austria",
    columns={
        "full name": {"prompt": "first name and last name of the guest"},
        "nationality": {"prompt": "the 2-letter code for the nationality"},
        "date_of_birth": {"prompt": "the date of birth of that guest", "dtype": "date"},
        "gender": {"categories": ["male", "female", "non-binary", "n/a"]},
        "beds": {"prompt": "the number of beds within the hotel room; min: 2", "dtype": "integer"},
        "email": {"prompt": "the customer's email address", "regex": "([a-z|0-9|\\.]+)(@foo\\.bar)"},
    },
    temperature=0.7,
)
print(df)
#            full name  nationality  date_of_birth  gender  beds                   email
# 0     Melinda Baxter           US     1986-07-09  female     2   melindabaxter@foo.bar
# 1         Andy Rouse           GB     1941-03-14    male     4       andyrouse@foo.bar
# 2      Andreas Kainz           AT     2001-01-10    male     2   andreas.kainz@foo.bar
# 3         Lisa Nowak           AT     1994-01-02  female     2       lisanowak@foo.bar
# ..               ...          ...            ...     ...   ...                     ...
# 96     Mike Peterson           US     1997-04-28    male     2    mikepeterson@foo.bar
# 97    Susanne Hintze           DE     1987-04-12  female     2         shintze@foo.bar
# 98  Ernst Wisniewski           AT     1992-04-03    male     2  erntwisniewski@foo.bar
# 99    Tobias Schmitt           AT     1987-06-24    male     2  tobias.schmitt@foo.bar
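To double-check that a regex-constrained column came back in the expected shape, a small client-side check can help. The following is a minimal sketch, not part of the DataLLM client: the helper name `regex_violations` is hypothetical, and the third sample value is deliberately invalid to show what a violation looks like.

```python
import re

import pandas as pd


def regex_violations(values, pattern):
    """Return the values that do not fully match `pattern`."""
    compiled = re.compile(pattern)
    return [v for v in values if compiled.fullmatch(str(v)) is None]


# Two emails copied from the output above, plus one deliberately invalid value.
emails = pd.Series([
    "melindabaxter@foo.bar",
    "andreas.kainz@foo.bar",
    "someone@example.com",
])

# Same pattern as in the mock() call, written here as a raw string.
print(regex_violations(emails, r"([a-z|0-9|\.]+)(@foo\.bar)"))
# → ['someone@example.com']
```

The same helper works for any other column generated with a `regex` argument.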
Key Features
- Efficient Tabular Data Generation: Easily prompt LLMs for structured data at scale.
- Contextual Generation: Each data row is sampled independently, using the prompt, the existing row values, and the dataset description as context.
- Data Type Adherence: Supported data types are string, categorical, integer, floats, boolean, date, and datetime.
- Regular Expression Support: Further constrain the range of allowed values with regular expressions.
- Flexible Sampling Parameters: Tailor the diversity and realism of your generated data via temperature and top_p.
- Easy-to-use Python Client: Use datallm.mock() and datallm.enrich() directly from any Python environment.
- Multi-model Support: Optionally host multiple models to cater for different speed / knowledge requirements of your users.
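For readers unfamiliar with the two sampling parameters, the sketch below shows how temperature and top_p are commonly applied to a model's next-token scores. It is a generic illustration of the technique and makes no claim about how DataLLM implements it internally; the function name `sampling_distribution` and the example scores are assumptions made for this sketch.

```python
import numpy as np


def sampling_distribution(logits, temperature=1.0, top_p=1.0):
    """Return token probabilities after temperature scaling and top-p filtering."""
    logits = np.asarray(logits, dtype=float)
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the best token.
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    # Temperature scaling followed by a numerically stable softmax.
    scaled = logits / temperature
    scaled -= scaled.max()
    probs = np.exp(scaled)
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


scores = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
print(sampling_distribution(scores, temperature=0.0))
print(sampling_distribution(scores, temperature=1.0, top_p=0.8))
```

A lower temperature concentrates probability on the most likely values, which is why the labeling examples in this README use temperature=0.0; a lower top_p discards the unlikely tail entirely.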
Use Case Examples
Mock PII fields
import pandas as pd
df = pd.read_csv('https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz', nrows=10)
df = df[['race', 'sex', 'native_country']]
df['mock name'] = datallm.enrich(df, prompt='full name, consisting of first name, last name but without any titles')
df['mock email'] = datallm.enrich(df, prompt='email')
df['mock SSN'] = datallm.enrich(df, prompt='social security number', regex='\\d{3}-\\d{2}-\\d{4}')
print(df)
#     race     sex  native_country        mock name                   mock email     mock SSN
# 0  White    Male   United-States    James Ridgway         james.ridgway@cw.com  393-36-5291
# 1  White    Male   United-States      Jacob Lopez      jacob.lopez@empresa.com  467-64-7848
# 2  White    Male   United-States    Robert Jansen     rjansen@michael-kors.com  963-13-6498
# 3  Black    Male   United-States    Darnell Dixon      darnell.dixon@gmail.com  125-59-9615
# 4  Black  Female            Cuba   Alexis Ramirez       aramirez12@example.com  881-46-9037
# 5  White  Female   United-States   Kristen Miller     kristen.miller@email.com  098-69-6224
# 6  Black  Female         Jamaica  Coleen Williams  mcoleenwilliams@example.com  980-26-3724
# 7  White    Male   United-States   Jay Stephenson      jaystephenson@gmail.com  464-05-4106
# 8  White  Female   United-States   Lois Rodriguez     lrodriguez75@hotmail.com  332-10-6400
# 9  White    Male   United-States     Eddie Watson        eddiewatson@email.com  645-47-1545
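When several columns are added one after another, as above, the calls can be driven from a single specification. The following is a minimal sketch under stated assumptions: `enrich_many` is a hypothetical helper, not part of the DataLLM client, and `fake_enrich` is a stand-in that only mimics the call shape of `datallm.enrich(df, prompt=..., ...)` so the example runs without an API key.

```python
import pandas as pd


def enrich_many(df, enrich_fn, specs):
    """Add one column per entry in `specs`, in order, and return a new DataFrame."""
    out = df.copy()
    for column, kwargs in specs.items():
        # Each call receives the columns added so far as part of its input.
        out[column] = enrich_fn(out, **kwargs)
    return out


def fake_enrich(df, prompt, **kwargs):
    """Offline stand-in for datallm.enrich: returns a placeholder per row."""
    return pd.Series([f"<{prompt}>"] * len(df), index=df.index)


people = pd.DataFrame({"sex": ["Male", "Female"], "native_country": ["United-States", "Cuba"]})
result = enrich_many(people, fake_enrich, {
    "mock email": {"prompt": "email"},
    "mock SSN": {"prompt": "social security number", "regex": "\\d{3}-\\d{2}-\\d{4}"},
})
print(list(result.columns))
# → ['sex', 'native_country', 'mock email', 'mock SSN']
```

With a real client, passing `datallm.enrich` in place of `fake_enrich` should give the same column layout, assuming the keyword arguments match those used in the examples above.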
Summarize data records
import pandas as pd
df = pd.read_csv('https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz', nrows=10)
df['summary'] = datallm.enrich(df, prompt='summarize the data record in a single sentence')
print(df[['summary']])
# summary
# 0 Never married male employee
# 1 White male from United States, 50 years old, works as an executive
# 2 White male who is divorced and working as a Handlers-cleaners with
# 3 Male from United-States, aged 53, works as Handlers
# 4 Black female from Cuba with Bachelors degree who works as Prof-specialty
# 5 White married female with masters degree who works as exec-managerial
# 6 Jamaican immigrant who works in other-service and makes 18
# 7 52 year old US born male married with high school education working as an executive
# 8 Professional with a masters degree working 50 hours a week
# 9 White male from United-States with Bachelors degree who works as Exec
Augment your data
import pandas as pd
df = pd.DataFrame({'movie title': [
    'A Fistful of Dollars', 'American Wedding', 'Ice Age', 'Liar Liar',
    'March of the Penguins', 'Curly Sue', 'Braveheart', 'Bruce Almighty'
]})
df['genre'] = datallm.enrich(
    df,
    prompt='what is the genre of that movie?',
    categories=["action", "comedy", "drama", "horror", "sci-fi", "fantasy", "thriller", "documentary", "animation"],
    temperature=0.0,
)
print(df)
#              movie title        genre
# 0   A Fistful of Dollars       action
# 1       American Wedding       comedy
# 2                Ice Age    animation
# 3              Liar Liar       comedy
# 4  March of the Penguins  documentary
# 5              Curly Sue       comedy
# 6             Braveheart        drama
# 7         Bruce Almighty       comedy
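For downstream analysis it can be useful to store such a column as a pandas categorical and to fail loudly if a label outside the allowed list ever appears. This is a minimal sketch using plain pandas; the helper name `to_categorical` is hypothetical, and the sample labels are copied from the output above.

```python
import pandas as pd

GENRES = ["action", "comedy", "drama", "horror", "sci-fi",
          "fantasy", "thriller", "documentary", "animation"]


def to_categorical(values, categories):
    """Convert `values` to a categorical Series, rejecting labels outside `categories`."""
    typed = pd.Series(pd.Categorical(values, categories=categories), index=values.index)
    # Labels outside `categories` become NaN during the conversion.
    unexpected = values[typed.isna() & values.notna()].unique().tolist()
    if unexpected:
        raise ValueError(f"unexpected categories: {unexpected}")
    return typed


# Genre labels copied from the output above.
genres = pd.Series(["action", "comedy", "animation", "comedy",
                    "documentary", "comedy", "drama", "comedy"])
typed = to_categorical(genres, GENRES)
print(typed.value_counts()["comedy"])
# → 4
```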
Label your data
import pandas as pd
df = pd.read_csv('https://github.com/mostly-ai/public-demo-data/raw/dev/tweets/TheSocialDilemma.csv.gz', nrows=10)[['text']]
df['DataLLM sentiment'] = datallm.enrich(
    df[['text']],
    prompt='tweet sentiment',
    categories=['Positive', 'Neutral', 'Negative'],
    temperature=0.0,
)
print(df)
# text DataLLM sentiment
# 0 @musicmadmarc @SocialDilemma_ @netflix @Facebo... Positive
# 1 @musicmadmarc @SocialDilemma_ @netflix @Facebo... Neutral
# 2 Go watch “The Social Dilemma” on Netflix!\n\nI... Positive
# 3 I watched #TheSocialDilemma last night. I’m sc... Negative
# 4 The problem of me
