# MessyData

Synthetic dirty data generator. Define a schema in YAML, get a realistic messy DataFrame.
MessyData generates structured datasets from a declarative config and injects configurable anomalies — missing values, duplicates, invalid categories, bad dates, and outliers. Designed for testing data pipelines, validating data quality tooling, and feeding AI/ML workflows that need realistic imperfect data.
## Claude Code Skill

MessyData includes a Claude Code skill that teaches any agent how to write configs, validate them, and use the CLI. Download SKILL.md and place it at:

```
~/.claude/skills/messydata/SKILL.md
```

Then invoke it with `/messydata` in any Claude Code session.
## Install

```shell
uv pip install messydata
# or
pip install messydata
```
## Quick Start

### With a Claude Code agent (fastest)

With the skill installed, just describe what you need in plain English:

```
/messydata generate a retail transactions dataset starting from 2024-01-01, 500 rows
per day. Include product catalog, customer region, payment method, and a realistic
price distribution. Add some missing values across all columns, a few duplicate
records, and occasional outlier prices. Save it to retail.csv.
```

The agent will write the YAML config, validate it, and run the CLI to produce the file — no manual config writing needed.
### CLI

```shell
# Generate to a file (format inferred from extension)
messydata generate my_config.yaml --rows 1000 --seed 42 --output data.csv
messydata generate my_config.yaml --rows 1000 --output data.parquet
messydata generate my_config.yaml --rows 1000 --output data.json

# Stream to stdout
messydata generate my_config.yaml --rows 1000

# Single day (requires temporal: true on a date field)
messydata generate my_config.yaml --start-date 2025-06-01 --rows 500

# Date range — --rows is rows per day
messydata generate my_config.yaml --start-date 2025-01-01 --end-date 2025-03-31 --rows 500 --output data.csv

# Validate a config without generating (exits 0/1 — useful in CI and agent loops)
messydata validate my_config.yaml

# Print the full JSON Schema for the config format
messydata schema
```
### YAML + Python

```yaml
# my_config.yaml
name: orders
primary_key: order_id
records_per_primary_key:
  type: lognormal
  mu: 2.0
  sigma: 0.5
anomalies:
  - name: missing_values
    prob: 1.0   # always inject
    rate: 0.05  # 5% of cells set to NaN
fields:
  - name: order_id
    dtype: int32
    unique_per_id: true
    nullable: false
    distribution:
      type: sequential
      start: 1
  - name: order_date
    dtype: object
    unique_per_id: true
    nullable: false
    temporal: true  # marks this as the date anchor
    distribution:
      type: sequential
      start: "2024-01-01"
  - name: amount
    dtype: float32
    nullable: false
    distribution:
      type: lognormal
      mu: 3.5
      sigma: 0.75
  - name: status
    dtype: object
    nullable: false
    distribution:
      type: weighted_choice
      values: [pending, shipped, delivered, cancelled]
      weights: [0.1, 0.3, 0.5, 0.1]
```

```python
from messydata import Pipeline

pipeline = Pipeline.from_config("my_config.yaml")

# All rows, sequential dates
df = pipeline.run(n_rows=1000, seed=42)

# All rows pinned to a single date
df = pipeline.run_for_date("2025-06-01", n_rows=500)

# One generation pass per day, concatenated
df = pipeline.run_date_range("2025-01-01", "2025-03-31", rows_per_day=500)
```
### Python-first

All distribution and anomaly types are importable as Python classes with full IDE support:

```python
from messydata import (
    DatasetSchema, Pipeline,
    FieldSpec, AnomalySpec,
    Lognormal, WeightedChoice, Sequential,
)

schema = DatasetSchema(
    name="orders",
    primary_key="order_id",
    records_per_primary_key=Lognormal(mu=2.0, sigma=0.5),
    fields=[
        FieldSpec(name="order_id", dtype="int32",
                  distribution=Sequential(start=1),
                  unique_per_id=True, nullable=False),
        FieldSpec(name="amount", dtype="float32",
                  distribution=Lognormal(mu=3.5, sigma=0.75),
                  nullable=False),
        FieldSpec(name="status", dtype="object",
                  distribution=WeightedChoice(
                      values=["pending", "shipped", "delivered"],
                      weights=[0.2, 0.5, 0.3])),
    ],
    anomalies=[AnomalySpec(name="missing_values", prob=1.0, rate=0.05)],
)

df = Pipeline(schema).run(n_rows=1000, seed=42)
```
## YAML Config Reference

### Top-level keys
| Key | Type | Required | Description |
|---|---|---|---|
| name | string | yes | Dataset identifier |
| primary_key | string | no (default: id) | Field used as the primary grouping key |
| records_per_primary_key | distribution block | yes | How many rows to generate per primary key value — accepts any continuous distribution |
| fields | list of field specs | yes | Column definitions |
| anomalies | list of anomaly specs | no | Data quality issues to inject |
Row count: `run(n_rows=N)` generates approximately N rows. Because each primary key group is sampled from `records_per_primary_key`, the actual count may differ slightly. Each group always has at least 1 row.
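The approximate-count behaviour can be sketched in plain NumPy. This is a hypothetical illustration of the mechanism, not the library's actual internals: per-key group sizes are drawn from the configured distribution, floored at 1 row, until the running total reaches N — so the final total can slightly overshoot.

```python
import numpy as np

def sketch_group_sizes(n_rows: int, mu: float = 2.0, sigma: float = 0.5,
                       seed: int = 42) -> list[int]:
    """Draw per-primary-key group sizes until ~n_rows total rows exist.

    Each draw is lognormal(mu, sigma), rounded and floored at 1 row,
    which is why the final count can slightly exceed n_rows.
    """
    rng = np.random.default_rng(seed)
    sizes: list[int] = []
    total = 0
    while total < n_rows:
        size = max(1, int(round(rng.lognormal(mean=mu, sigma=sigma))))
        sizes.append(size)
        total += size
    return sizes

sizes = sketch_group_sizes(1000)
```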
### Field spec properties
| Property | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | yes | — | Column name in the output DataFrame |
| dtype | string | no | object | Pandas dtype: int32, int64, float32, float64, object, bool |
| distribution | distribution block | yes | — | How values are sampled (see Distribution Reference) |
| unique_per_id | bool | no | false | If true, one value is drawn per primary key group and repeated for all rows in that group |
| nullable | bool | no | true | Marks the field as nullable — used by anomaly injection |
| temporal | bool | no | false | Marks this field as the date anchor for run_for_date / run_date_range. Exactly one field per schema. |
unique_per_id: true is appropriate for entity-level attributes that don't vary per transaction — e.g., a customer's region, a store's tier, a payment method for an order.
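The semantics can be illustrated with a small pandas sketch (hypothetical data, not the library's internals): one value is drawn per primary-key group and broadcast to every row in that group.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
group_sizes = [3, 2, 4]                      # rows per order_id
order_ids = np.repeat([101, 102, 103], group_sizes)

# unique_per_id=True behaviour: draw once per group, repeat across the group
per_group_region = rng.choice(["north", "south", "east", "west"],
                              size=len(group_sizes))
df = pd.DataFrame({
    "order_id": order_ids,
    "region": np.repeat(per_group_region, group_sizes),
})
```

Every `order_id` group then contains exactly one distinct `region` value, however many rows it spans.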
### Distribution reference
Each distribution block requires a type key. All other keys are parameters for that distribution type.
#### Continuous distributions
| type | Parameters | Notes |
|---|---|---|
| uniform | min, max | Uniform over [min, max] |
| normal | mean, std | Gaussian |
| lognormal | mu, sigma | Log-normal — good default for prices, quantities, durations |
| weibull | a, scale (default 1.0) | Parametrised by shape a |
| exponential | scale (default 1.0) | Rate = 1 / scale |
| beta | a, b | Output in [0, 1] — useful for rates and probabilities |
| gamma | shape, scale (default 1.0) | General-purpose skewed positive |
| mixture | components, weights | Weighted blend of continuous distributions — see below |
#### Categorical distributions
| type | Parameters | Notes |
|---|---|---|
| weighted_choice | values, weights | Draws from a fixed list. weights must sum to 1. |
| weighted_choice_mapping | columns, weights | Draws correlated multi-column outcomes from a joint table — see below |
#### Special distributions
| type | Parameters | Notes |
|---|---|---|
| sequential | start, step (default 1) | Auto-incrementing. start can be an integer or a date string ("2023-01-01"). Each primary key group advances by step. |
#### weighted_choice — categorical with probabilities

```yaml
distribution:
  type: weighted_choice
  values: [north, south, east, west]
  weights: [0.4, 0.3, 0.2, 0.1]
```
#### weighted_choice_mapping — correlated multi-column categorical

When two or more columns are always correlated (e.g., product_id and product_name always appear together), use a single weighted_choice_mapping field. All lists under columns must have the same length — each index is one joint outcome.

```yaml
- name: product  # field name is a placeholder; actual columns come from `columns:`
  dtype: object
  distribution:
    type: weighted_choice_mapping
    columns:
      product_id: [1001, 1002, 1003, 1004, 1005]
      product_name: [Widget, Gadget, Doohickey, Thingamajig, Whatsit]
    weights: [0.4, 0.2, 0.2, 0.1, 0.1]
```
This adds product_id and product_name as separate columns — guaranteed consistent. The placeholder name: product is not added to the DataFrame.
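The consistency guarantee follows from sampling a single index per row and reading every column from that same index. A hedged NumPy/pandas sketch of that idea (not the library's implementation):

```python
import numpy as np
import pandas as pd

ids = [1001, 1002, 1003, 1004, 1005]
names = ["Widget", "Gadget", "Doohickey", "Thingamajig", "Whatsit"]
weights = [0.4, 0.2, 0.2, 0.1, 0.1]

rng = np.random.default_rng(42)
idx = rng.choice(len(ids), size=1_000, p=weights)  # one joint outcome per row

df = pd.DataFrame({
    "product_id": np.take(ids, idx),      # both columns read the same index,
    "product_name": np.take(names, idx),  # so they can never disagree
})
```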
#### sequential — auto-incrementing integers or dates

```yaml
# Integer sequence starting at 1
distribution:
  type: sequential
  start: 1
  step: 1

# Date sequence — start must be a YYYY-MM-DD string
distribution:
  type: sequential
  start: "2023-01-01"
  step: 1  # advances by 1 day per primary key group
```
#### mixture — weighted blend of continuous distributions

```yaml
# Bimodal price distribution: budget items + premium items
distribution:
  type: mixture
  components:
    - type: normal
      mean: 15.0
      std: 3.0
    - type: lognormal
      mu: 5.0
      sigma: 0.8
  weights: [0.6, 0.4]
```
mixture only supports continuous component types (uniform, normal, lognormal, weibull, exponential, beta, gamma). Categorical and sequential types cannot be used as components.
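Mixture sampling can be sketched directly in NumPy — an illustrative sketch of the technique, not the library's code: pick a component per row according to the weights, then draw that row's value from its chosen component.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# Per-row component choice: 0 = budget normal, 1 = premium lognormal
component = rng.choice(2, size=n, p=[0.6, 0.4])

# Draw both candidates for every row, then select by component.
# Wasteful but simple; a real implementation would sample per group.
values = np.where(
    component == 0,
    rng.normal(15.0, 3.0, size=n),
    rng.lognormal(mean=5.0, sigma=0.8, size=n),
)
```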
## Anomaly reference
Each anomaly has two required fields:
| Field | Type | Description |
|---|---|---|
| prob | float [0–1] | Probability this anomaly fires on a given run. 1.0 = always inject. |
| rate | float [0–1] | Fraction of eligible rows or cells affected when the anomaly fires. |
Example: `prob: 0.3, rate: 0.05` means a 30% chance the anomaly is active; when active, 5% of eligible rows are affected. Use `prob: 1.0` for deterministic injection.
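A hypothetical two-stage injector matching these semantics for a missing_values-style anomaly — the function name and signature are illustrative, not the library's API:

```python
import numpy as np
import pandas as pd

def inject_missing(df: pd.DataFrame, prob: float, rate: float,
                   seed: int = 0) -> pd.DataFrame:
    """Stage 1: one Bernoulli(prob) draw decides whether the anomaly fires.
    Stage 2: if it fires, each cell independently becomes NaN with p = rate."""
    rng = np.random.default_rng(seed)
    out = df.astype("float64")        # float columns so cells can hold NaN
    if rng.random() >= prob:          # anomaly skipped this run
        return out
    mask = rng.random(out.shape) < rate
    return out.mask(mask)             # masked cells become NaN

clean = pd.DataFrame({"a": range(100), "b": range(100)})
dirty = inject_missing(clean, prob=1.0, rate=0.05, seed=1)
```

With `prob=1.0` the anomaly always fires; with `prob=0.0` the frame comes back untouched.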
### Anomaly types
| name | columns