Attention: DataLLM is superseded by mostlyai-mock.

DataLLM: prompt LLMs for Tabular Data 🔮

Welcome to DataLLM, your go-to open-source platform for Tabular Data Generation!

DataLLM allows you to efficiently tap into the vast power of LLMs to...

create mock data that fits your needs, as well as
enrich datasets with world knowledge.

Start using DataLLM

Sign in and retrieve an API key.
Install the latest version of the DataLLM Python client.

pip install -U datallm

Instantiate a client with your retrieved API key.

from datallm import DataLLM
datallm = DataLLM(api_key='INSERT_API_KEY', base_url='https://data.mostly.ai')
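
To keep the key out of scripts and notebooks, it can be read from the environment instead of being pasted inline. The following is a minimal sketch, assuming a hypothetical environment variable named `DATALLM_API_KEY`; that name and the `load_api_key` helper are illustrations, not part of the DataLLM client.

```python
import os


def load_api_key(var_name: str = "DATALLM_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is a hypothetical convention, not something the
    DataLLM client defines.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} to your DataLLM API key first.")
    return key


# Usage, reusing the client call from the step above:
# from datallm import DataLLM
# datallm = DataLLM(api_key=load_api_key(), base_url='https://data.mostly.ai')
```

Failing early with a clear message tends to be easier to debug than an authentication error from a later request.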

Enrich an existing dataset with new columns that are coherent with any of the already present columns.

import pandas as pd
df = pd.DataFrame({
    "age in years": [5, 10, 13, 19, 30, 40, 50, 60, 70, 80],
    "gender": ["m", "f", "m", "f", "m", "f", "m", "f", "m", "f"],
    "country code": ["AT", "DE", "FR", "IT", "ES", "PT", "GR", "UK", "SE", "FI"],
})

# enrich the DataFrame with a new column containing the official country name
df["country"] = datallm.enrich(df, prompt="official name of the country")

# enrich the DataFrame with first name and last name
df["first name"] = datallm.enrich(df, prompt="the first name of that person")
df["last name"] = datallm.enrich(df, prompt="the last name of that person")

# enrich the DataFrame with a categorical
df["age group"] = datallm.enrich(
    df, prompt="age group", categories=["kid", "teen", "adult", "elderly"]
)

# enrich with a boolean value and an integer value
df["speaks german"] = datallm.enrich(df, prompt="speaks german?", dtype="boolean")
df["body height"] = datallm.enrich(df, prompt="the body height in cm", dtype="integer")
print(df)
#    age in years gender country code         country first name   last name age group speaks german  body height
# 0             5      m           AT         Austria     Julian     Kittner       kid          True          106
# 1            10      f           DE         Germany      Julia     Buchner      teen          True          156
# 2            13      m           FR          France   Benjamin    Dumoulin      teen         False          174
# 3            19      f           IT           Italy    Alessia  Santamaria      teen         False          163
# 4            30      m           ES           Spain       Paco        Ruiz     adult         False          185
# 5            40      f           PT        Portugal      Elisa      Santos     adult         False          168
# 6            50      m           GR          Greece   Dimitris     Kleopas     adult         False          166
# 7            60      f           UK  United Kingdom      Diane     Huntley   elderly         False          162
# 8            70      m           SE          Sweden       Stig   Nordstrom   elderly         False          174
# 9            80      f           FI         Finland       Aili      Juhola   elderly         False          157
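
Enriched values are sampled from an LLM rather than computed by a fixed rule, so it can be useful to spot-check a column against a simple reference. The sketch below compares the "age group" values printed above with a hand-written mapping. The thresholds in `rule_based_age_group` are assumptions chosen for illustration, and the lists are copied from the printed output rather than fetched from the service.

```python
def rule_based_age_group(age):
    """Hand-written reference mapping; the thresholds are assumptions."""
    if age < 13:
        return "kid"
    if age < 20:
        return "teen"
    if age < 60:
        return "adult"
    return "elderly"


def disagreements(ages, groups):
    """Return (age, enriched, expected) where the enriched label differs from the rule."""
    return [
        (age, group, rule_based_age_group(age))
        for age, group in zip(ages, groups)
        if group != rule_based_age_group(age)
    ]


# Values copied from the printed DataFrame above.
ages = [5, 10, 13, 19, 30, 40, 50, 60, 70, 80]
age_groups = ["kid", "teen", "teen", "teen", "adult",
              "adult", "adult", "elderly", "elderly", "elderly"]

print(disagreements(ages, age_groups))
# → [(10, 'teen', 'kid')]
```

With a DataFrame, the same check could be run on `df["age in years"]` and `df["age group"]`. A disagreement is not necessarily an error; it only flags rows worth a second look.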

Or create a completely new dataset from scratch.

df = datallm.mock(
    n=100,  # number of generated records 
    data_description="Guests of an Alpine ski hotel in Austria",
    columns={
        "full name": {"prompt": "first name and last name of the guest"},
        "nationality": {"prompt": "the 2-letter code for the nationality"},
        "date_of_birth": {"prompt": "the date of birth of that guest", "dtype": "date"},
        "gender": {"categories": ["male", "female", "non-binary", "n/a"]},
        "beds": {"prompt": "the number of beds within the hotel room; min: 2", "dtype": "integer"},
        "email": {"prompt": "the customer's email address", "regex": "([a-z|0-9|\\.]+)(@foo\\.bar)"},
    },
    temperature=0.7
)
print(df)
#            full name nationality date_of_birth  gender  beds                   email
# 0     Melinda Baxter          US    1986-07-09  female     2   melindabaxter@foo.bar
# 1         Andy Rouse          GB    1941-03-14    male     4       andyrouse@foo.bar
# 2      Andreas Kainz          AT    2001-01-10    male     2   andreas.kainz@foo.bar
# 3         Lisa Nowak          AT    1994-01-02  female     2       lisanowak@foo.bar
# ..               ...         ...           ...     ...   ...                     ...
# 96     Mike Peterson          US    1997-04-28    male     2    mikepeterson@foo.bar
# 97    Susanne Hintze          DE    1987-04-12  female     2         shintze@foo.bar
# 98  Ernst Wisniewski          AT    1992-04-03    male     2  erntwisniewski@foo.bar
# 99    Tobias Schmitt          AT    1987-06-24    male     2  tobias.schmitt@foo.bar
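
A regex-constrained column can be re-checked locally with Python's `re` module. The sketch below uses a slightly stricter pattern with the same intent as the "email" regex above: inside a character class, `|` is a literal character, so it is left out here. The `invalid_values` helper and the deliberately bad sample value are illustrations, not part of DataLLM.

```python
import re

# Stricter local pattern with the same intent as the "email" regex above.
EMAIL_PATTERN = re.compile(r"[a-z0-9.]+@foo\.bar")


def invalid_values(values, pattern):
    """Return the values that do not fully match the compiled pattern."""
    return [v for v in values if pattern.fullmatch(str(v)) is None]


# Sample values copied from the printed output above,
# plus one hypothetical bad entry for illustration.
emails = [
    "melindabaxter@foo.bar",
    "andreas.kainz@foo.bar",
    "shintze@foo.bar",
    "Tobias.Schmitt@example.com",  # hypothetical bad value
]

print(invalid_values(emails, EMAIL_PATTERN))
# → ['Tobias.Schmitt@example.com']
```

Since iterating over a pandas Series yields its values, `invalid_values(df["email"], EMAIL_PATTERN)` should work on the generated DataFrame as well.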

Key Features

Efficient Tabular Data Generation: Easily prompt LLMs for structured data at scale.
Contextual Generation: Each data row is sampled independently, using the prompt, the existing row values, and the dataset description as context.
Data Type Adherence: Supported data types are string, categorical, integer, float, boolean, date, and datetime.
Regular Expression Support: Further constrain the range of allowed values with regular expressions.
Flexible Sampling Parameters: Tailor the diversity and realism of your generated data via temperature and top_p.
Easy-to-use Python Client: Use datallm.mock() and datallm.enrich() directly from any Python environment.
Multi-model Support: Optionally host multiple models to meet your users' different speed and knowledge requirements.
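
To give an intuition for the sampling parameters mentioned above, here is a toy sketch of how temperature and top_p are commonly defined for LLM sampling. It illustrates the general technique on made-up logits and is not DataLLM's internal code; the function names are chosen for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all mass on the most likely token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


# Toy logits for four candidate tokens (made-up numbers).
logits = [2.0, 1.0, 0.5, -1.0]
for t in (0.0, 0.7, 1.5):
    probs = top_p_filter(apply_temperature(logits, t), top_p=0.9)
    print(t, [round(p, 3) for p in probs])
```

Lower temperature and lower top_p concentrate the probability mass on the most likely values, which is why the labeling-style examples below pass temperature=0.0, while the mock example uses 0.7 for more varied records.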

Use Case Examples

Mock PII fields

import pandas as pd
df = pd.read_csv('https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz', nrows=10)
df = df[['race', 'sex', 'native_country']]
df['mock name'] = datallm.enrich(df, prompt='full name, consisting of first name, last name but without any titles')
df['mock email'] = datallm.enrich(df, prompt='email')
df['mock SSN'] = datallm.enrich(df, prompt='social security number', regex='\\d{3}-\\d{2}-\\d{4}')
print(df)
#     race     sex native_country        mock name                   mock email     mock SSN
# 0  White    Male  United-States    James Ridgway         james.ridgway@cw.com  393-36-5291
# 1  White    Male  United-States      Jacob Lopez      jacob.lopez@empresa.com  467-64-7848
# 2  White    Male  United-States    Robert Jansen     rjansen@michael-kors.com  963-13-6498
# 3  Black    Male  United-States    Darnell Dixon      darnell.dixon@gmail.com  125-59-9615
# 4  Black  Female           Cuba   Alexis Ramirez       aramirez12@example.com  881-46-9037
# 5  White  Female  United-States   Kristen Miller     kristen.miller@email.com  098-69-6224
# 6  Black  Female        Jamaica  Coleen Williams  mcoleenwilliams@example.com  980-26-3724
# 7  White    Male  United-States   Jay Stephenson      jaystephenson@gmail.com  464-05-4106
# 8  White  Female  United-States   Lois Rodriguez     lrodriguez75@hotmail.com  332-10-6400
# 9  White    Male  United-States     Eddie Watson        eddiewatson@email.com  645-47-1545

Summarize data records

import pandas as pd
df = pd.read_csv('https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz', nrows=10)
df['summary'] = datallm.enrich(df, prompt='summarize the data record in a single sentence')
print(df[['summary']])
#                                                                                summary
# 0                                                          Never married male employee
# 1                   White male from United States, 50 years old, works as an executive
# 2                   White male who is divorced and working as a Handlers-cleaners with
# 3                                  Male from United-States, aged 53, works as Handlers
# 4             Black female from Cuba with Bachelors degree who works as Prof-specialty
# 5                White married female with masters degree who works as exec-managerial
# 6                           Jamaican immigrant who works in other-service and makes 18
# 7  52 year old US born male married with high school education working as an executive
# 8                           Professional with a masters degree working 50 hours a week
# 9                White male from United-States with Bachelors degree who works as Exec

Augment your data

import pandas as pd
df = pd.DataFrame({'movie title': [
    'A Fistful of Dollars', 'American Wedding', 'Ice Age', 'Liar Liar',
    'March of the Penguins', 'Curly Sue', 'Braveheart', 'Bruce Almighty'
]})
df['genre'] = datallm.enrich(
    df,
    prompt='what is the genre of that movie?',
    categories=["action", "comedy", "drama", "horror", "sci-fi", "fantasy", "thriller", "documentary", "animation"],
    temperature=0.0
)
print(df)
#              movie title        genre
# 0   A Fistful of Dollars       action
# 1       American Wedding       comedy
# 2                Ice Age    animation
# 3              Liar Liar       comedy
# 4  March of the Penguins  documentary
# 5              Curly Sue       comedy
# 6             Braveheart        drama
# 7         Bruce Almighty       comedy

Label your data

import pandas as pd
df = pd.read_csv('https://github.com/mostly-ai/public-demo-data/raw/dev/tweets/TheSocialDilemma.csv.gz', nrows=10)[['text']]
df['DataLLM sentiment'] = datallm.enrich(
    df[['text']],
    prompt='tweet sentiment',
    categories=['Positive', 'Neutral', 'Negative'],
    temperature=0.0,
)
print(df)
#                                                 text DataLLM sentiment
# 0  @musicmadmarc @SocialDilemma_ @netflix @Facebo...          Positive
# 1  @musicmadmarc @SocialDilemma_ @netflix @Facebo...           Neutral
# 2  Go watch “The Social Dilemma” on Netflix!\n\nI...          Positive
# 3  I watched #TheSocialDilemma last night. I’m sc...          Negative
# 4  The problem of me
