# MessyData

Synthetic dirty data generator. Define a schema in YAML, get a realistic messy DataFrame.
MessyData generates structured datasets from a declarative config and injects configurable anomalies — missing values, duplicates, invalid categories, bad dates, and outliers. Designed for testing data pipelines, validating data quality tooling, and feeding AI/ML workflows that need realistic imperfect data.
## Claude Code Skill

MessyData includes a Claude Code skill that teaches any agent how to write configs, validate them, and use the CLI. Download SKILL.md and place it at:

```
~/.claude/skills/messydata/SKILL.md
```

Then invoke it with `/messydata` in any Claude Code session.
## Install

```shell
uv pip install messydata
# or
pip install messydata
```
## Quick Start

### With a Claude Code agent (fastest)

With the skill installed, just describe what you need in plain English:

```
/messydata generate a retail transactions dataset starting from 2024-01-01, 500 rows
per day. Include product catalog, customer region, payment method, and a realistic
price distribution. Add some missing values across all columns, a few duplicate
records, and occasional outlier prices. Save it to retail.csv.
```

The agent will write the YAML config, validate it, and run the CLI to produce the file — no manual config writing needed.
### CLI

```shell
# Generate to a file (format inferred from extension)
messydata generate my_config.yaml --rows 1000 --seed 42 --output data.csv
messydata generate my_config.yaml --rows 1000 --output data.parquet
messydata generate my_config.yaml --rows 1000 --output data.json

# Stream to stdout
messydata generate my_config.yaml --rows 1000

# Single day (requires temporal: true on a date field)
messydata generate my_config.yaml --start-date 2025-06-01 --rows 500

# Date range — --rows is rows per day
messydata generate my_config.yaml --start-date 2025-01-01 --end-date 2025-03-31 --rows 500 --output data.csv

# Validate a config without generating (exits 0/1 — useful in CI and agent loops)
messydata validate my_config.yaml

# Print the full JSON Schema for the config format
messydata schema
```
### YAML + Python

```yaml
# my_config.yaml
name: orders
primary_key: order_id
records_per_primary_key:
  type: lognormal
  mu: 2.0
  sigma: 0.5
anomalies:
  - name: missing_values
    prob: 1.0   # always inject
    rate: 0.05  # 5% of cells set to NaN
fields:
  - name: order_id
    dtype: int32
    unique_per_id: true
    nullable: false
    distribution:
      type: sequential
      start: 1
  - name: order_date
    dtype: object
    unique_per_id: true
    nullable: false
    temporal: true  # marks this as the date anchor
    distribution:
      type: sequential
      start: "2024-01-01"
  - name: amount
    dtype: float32
    nullable: false
    distribution:
      type: lognormal
      mu: 3.5
      sigma: 0.75
  - name: status
    dtype: object
    nullable: false
    distribution:
      type: weighted_choice
      values: [pending, shipped, delivered, cancelled]
      weights: [0.1, 0.3, 0.5, 0.1]
```

```python
from messydata import Pipeline

pipeline = Pipeline.from_config("my_config.yaml")

# All rows, sequential dates
df = pipeline.run(n_rows=1000, seed=42)

# All rows pinned to a single date
df = pipeline.run_for_date("2025-06-01", n_rows=500)

# One generation pass per day, concatenated
df = pipeline.run_date_range("2025-01-01", "2025-03-31", rows_per_day=500)
```
### Python-first

All distribution and anomaly types are importable as Python classes with full IDE support:

```python
from messydata import (
    DatasetSchema, Pipeline,
    FieldSpec, AnomalySpec,
    Lognormal, WeightedChoice, Sequential,
)

schema = DatasetSchema(
    name="orders",
    primary_key="order_id",
    records_per_primary_key=Lognormal(mu=2.0, sigma=0.5),
    fields=[
        FieldSpec(name="order_id", dtype="int32",
                  distribution=Sequential(start=1),
                  unique_per_id=True, nullable=False),
        FieldSpec(name="amount", dtype="float32",
                  distribution=Lognormal(mu=3.5, sigma=0.75),
                  nullable=False),
        FieldSpec(name="status", dtype="object",
                  distribution=WeightedChoice(
                      values=["pending", "shipped", "delivered"],
                      weights=[0.2, 0.5, 0.3])),
    ],
    anomalies=[AnomalySpec(name="missing_values", prob=1.0, rate=0.05)],
)

df = Pipeline(schema).run(n_rows=1000, seed=42)
```
## YAML Config Reference

### Top-level keys
| Key | Type | Required | Description |
|---|---|---|---|
| name | string | yes | Dataset identifier |
| primary_key | string | no (default: id) | Field used as the primary grouping key |
| records_per_primary_key | distribution block | yes | How many rows to generate per primary key value — accepts any continuous distribution |
| fields | list of field specs | yes | Column definitions |
| anomalies | list of anomaly specs | no | Data quality issues to inject |
Row count: `run(n_rows=N)` generates approximately N rows. Because each primary key group is sampled from `records_per_primary_key`, the actual count may differ slightly. Each group always has at least 1 row.
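The approximate-count behaviour can be sketched in plain NumPy. This is a hypothetical illustration of the mechanism, not the library's actual internals: per-key group sizes are drawn from the configured distribution, floored at 1 row, until the running total reaches N — so the final total can slightly overshoot.

```python
import numpy as np

def sketch_group_sizes(n_rows: int, mu: float = 2.0, sigma: float = 0.5,
                       seed: int = 42) -> list[int]:
    """Draw per-primary-key group sizes until ~n_rows total rows exist.

    Each draw is lognormal(mu, sigma), rounded and floored at 1 row,
    which is why the final count can slightly exceed n_rows.
    """
    rng = np.random.default_rng(seed)
    sizes: list[int] = []
    total = 0
    while total < n_rows:
        size = max(1, int(round(rng.lognormal(mean=mu, sigma=sigma))))
        sizes.append(size)
        total += size
    return sizes

sizes = sketch_group_sizes(1000)
```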
### Field spec properties
| Property | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | yes | — | Column name in the output DataFrame |
| dtype | string | no | object | Pandas dtype: int32, int64, float32, float64, object, bool |
| distribution | distribution block | yes | — | How values are sampled (see Distribution Reference) |
| unique_per_id | bool | no | false | If true, one value is drawn per primary key group and repeated for all rows in that group |
| nullable | bool | no | true | Marks the field as nullable — used by anomaly injection |
| temporal | bool | no | false | Marks this field as the date anchor for run_for_date / run_date_range. Exactly one field per schema. |
unique_per_id: true is appropriate for entity-level attributes that don't vary per transaction — e.g., a customer's region, a store's tier, a payment method for an order.
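The semantics can be illustrated with a small pandas sketch (hypothetical data, not the library's internals): one value is drawn per primary-key group and broadcast to every row in that group.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
group_sizes = [3, 2, 4]                      # rows per order_id
order_ids = np.repeat([101, 102, 103], group_sizes)

# unique_per_id=True behaviour: draw once per group, repeat across the group
per_group_region = rng.choice(["north", "south", "east", "west"],
                              size=len(group_sizes))
df = pd.DataFrame({
    "order_id": order_ids,
    "region": np.repeat(per_group_region, group_sizes),
})
```

Every `order_id` group then contains exactly one distinct `region` value, however many rows it spans.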
### Distribution reference
Each distribution block requires a type key. All other keys are parameters for that distribution type.
#### Continuous distributions
| type | Parameters | Notes |
|---|---|---|
| uniform | min, max | Uniform over [min, max] |
| normal | mean, std | Gaussian |
| lognormal | mu, sigma | Log-normal — good default for prices, quantities, durations |
| weibull | a, scale (default 1.0) | Parametrised by shape a |
| exponential | scale (default 1.0) | Rate = 1 / scale |
| beta | a, b | Output in [0, 1] — useful for rates and probabilities |
| gamma | shape, scale (default 1.0) | General-purpose skewed positive |
| mixture | components, weights | Weighted blend of continuous distributions — see below |
#### Categorical distributions
| type | Parameters | Notes |
|---|---|---|
| weighted_choice | values, weights | Draws from a fixed list. weights must sum to 1. |
| weighted_choice_mapping | columns, weights | Draws correlated multi-column outcomes from a joint table — see below |
#### Special distributions
| type | Parameters | Notes |
|---|---|---|
| sequential | start, step (default 1) | Auto-incrementing. start can be an integer or a date string ("2023-01-01"). Each primary key group advances by step. |
#### weighted_choice — categorical with probabilities

```yaml
distribution:
  type: weighted_choice
  values: [north, south, east, west]
  weights: [0.4, 0.3, 0.2, 0.1]
```
#### weighted_choice_mapping — correlated multi-column categorical

When two or more columns are always correlated (e.g., product_id and product_name always appear together), use a single weighted_choice_mapping field. All lists under columns must have the same length — each index is one joint outcome.

```yaml
- name: product  # field name is a placeholder; actual columns come from `columns:`
  dtype: object
  distribution:
    type: weighted_choice_mapping
    columns:
      product_id: [1001, 1002, 1003, 1004, 1005]
      product_name: [Widget, Gadget, Doohickey, Thingamajig, Whatsit]
    weights: [0.4, 0.2, 0.2, 0.1, 0.1]
```
This adds product_id and product_name as separate columns — guaranteed consistent. The placeholder name: product is not added to the DataFrame.
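The consistency guarantee follows from sampling a single index per row and reading every column from that same index. A hedged NumPy/pandas sketch of that idea (not the library's implementation):

```python
import numpy as np
import pandas as pd

ids = [1001, 1002, 1003, 1004, 1005]
names = ["Widget", "Gadget", "Doohickey", "Thingamajig", "Whatsit"]
weights = [0.4, 0.2, 0.2, 0.1, 0.1]

rng = np.random.default_rng(42)
idx = rng.choice(len(ids), size=1_000, p=weights)  # one joint outcome per row

df = pd.DataFrame({
    "product_id": np.take(ids, idx),      # both columns read the same index,
    "product_name": np.take(names, idx),  # so they can never disagree
})
```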
#### sequential — auto-incrementing integers or dates

```yaml
# Integer sequence starting at 1
distribution:
  type: sequential
  start: 1
  step: 1

# Date sequence — start must be a YYYY-MM-DD string
distribution:
  type: sequential
  start: "2023-01-01"
  step: 1  # advances by 1 day per primary key group
```
#### mixture — weighted blend of continuous distributions

```yaml
# Bimodal price distribution: budget items + premium items
distribution:
  type: mixture
  components:
    - type: normal
      mean: 15.0
      std: 3.0
    - type: lognormal
      mu: 5.0
      sigma: 0.8
  weights: [0.6, 0.4]
```
mixture only supports continuous component types (uniform, normal, lognormal, weibull, exponential, beta, gamma). Categorical and sequential types cannot be used as components.
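Mixture sampling can be sketched directly in NumPy — an illustrative sketch of the technique, not the library's code: pick a component per row according to the weights, then draw that row's value from its chosen component.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# Per-row component choice: 0 = budget normal, 1 = premium lognormal
component = rng.choice(2, size=n, p=[0.6, 0.4])

# Draw both candidates for every row, then select by component.
# Wasteful but simple; a real implementation would sample per group.
values = np.where(
    component == 0,
    rng.normal(15.0, 3.0, size=n),
    rng.lognormal(mean=5.0, sigma=0.8, size=n),
)
```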
## Anomaly reference
Each anomaly has two required fields:
| Field | Type | Description |
|---|---|---|
| prob | float [0–1] | Probability this anomaly fires on a given run. 1.0 = always inject. |
| rate | float [0–1] | Fraction of eligible rows or cells affected when the anomaly fires. |
Example: `prob: 0.3, rate: 0.05` means a 30% chance the anomaly is active; when active, 5% of eligible rows are affected. Use `prob: 1.0` for deterministic injection.
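A hypothetical two-stage injector matching these semantics for a missing_values-style anomaly — the function name and signature are illustrative, not the library's API:

```python
import numpy as np
import pandas as pd

def inject_missing(df: pd.DataFrame, prob: float, rate: float,
                   seed: int = 0) -> pd.DataFrame:
    """Stage 1: one Bernoulli(prob) draw decides whether the anomaly fires.
    Stage 2: if it fires, each cell independently becomes NaN with p = rate."""
    rng = np.random.default_rng(seed)
    out = df.astype("float64")        # float columns so cells can hold NaN
    if rng.random() >= prob:          # anomaly skipped this run
        return out
    mask = rng.random(out.shape) < rate
    return out.mask(mask)             # masked cells become NaN

clean = pd.DataFrame({"a": range(100), "b": range(100)})
dirty = inject_missing(clean, prob=1.0, rate=0.05, seed=1)
```

With `prob=1.0` the anomaly always fires; with `prob=0.0` the frame comes back untouched.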
### Anomaly types
| name | columns