MessyData

Synthetic dirty data generator. Define a schema in YAML, get a realistic messy DataFrame.

MessyData generates structured datasets from a declarative config and injects configurable anomalies — missing values, duplicates, invalid categories, bad dates, and outliers. Designed for testing data pipelines, validating data quality tooling, and feeding AI/ML workflows that need realistic imperfect data.


Claude Code Skill

MessyData includes a Claude Code skill that teaches any agent how to write configs, validate them, and use the CLI. Download SKILL.md and place it at:

~/.claude/skills/messydata/SKILL.md

Then invoke it with /messydata in any Claude Code session.


Install

uv pip install messydata
# or
pip install messydata

Quick Start

With a Claude Code agent (fastest)

With the skill installed, just describe what you need in plain English:

/messydata generate a retail transactions dataset starting from 2024-01-01, 500 rows
per day. Include product catalog, customer region, payment method, and a realistic
price distribution. Add some missing values across all columns, a few duplicate
records, and occasional outlier prices. Save it to retail.csv.

The agent will write the YAML config, validate it, and run the CLI to produce the file — no manual config writing needed.


CLI

# Generate to a file (format inferred from extension)
messydata generate my_config.yaml --rows 1000 --seed 42 --output data.csv
messydata generate my_config.yaml --rows 1000 --output data.parquet
messydata generate my_config.yaml --rows 1000 --output data.json

# Stream to stdout
messydata generate my_config.yaml --rows 1000

# Single day (requires temporal: true on a date field)
messydata generate my_config.yaml --start-date 2025-06-01 --rows 500

# Date range — --rows is rows per day
messydata generate my_config.yaml --start-date 2025-01-01 --end-date 2025-03-31 --rows 500 --output data.csv

# Validate a config without generating (exits 0/1 — useful in CI and agent loops)
messydata validate my_config.yaml

# Print the full JSON Schema for the config format
messydata schema

YAML + Python

# my_config.yaml
name: orders
primary_key: order_id

records_per_primary_key:
  type: lognormal
  mu: 2.0
  sigma: 0.5

anomalies:
  - name: missing_values
    prob: 1.0   # always inject
    rate: 0.05  # 5% of cells set to NaN

fields:
  - name: order_id
    dtype: int32
    unique_per_id: true
    nullable: false
    distribution:
      type: sequential
      start: 1

  - name: order_date
    dtype: object
    unique_per_id: true
    nullable: false
    temporal: true                  # marks this as the date anchor
    distribution:
      type: sequential
      start: "2024-01-01"

  - name: amount
    dtype: float32
    nullable: false
    distribution:
      type: lognormal
      mu: 3.5
      sigma: 0.75

  - name: status
    dtype: object
    nullable: false
    distribution:
      type: weighted_choice
      values: [pending, shipped, delivered, cancelled]
      weights: [0.1, 0.3, 0.5, 0.1]

from messydata import Pipeline

pipeline = Pipeline.from_config("my_config.yaml")

# All rows, sequential dates
df = pipeline.run(n_rows=1000, seed=42)

# All rows pinned to a single date
df = pipeline.run_for_date("2025-06-01", n_rows=500)

# One generation pass per day, concatenated
df = pipeline.run_date_range("2025-01-01", "2025-03-31", rows_per_day=500)

Python-first

All distribution and anomaly types are importable as Python classes with full IDE support:

from messydata import (
    DatasetSchema, Pipeline,
    FieldSpec, AnomalySpec,
    Lognormal, WeightedChoice, Sequential,
)

schema = DatasetSchema(
    name="orders",
    primary_key="order_id",
    records_per_primary_key=Lognormal(mu=2.0, sigma=0.5),
    fields=[
        FieldSpec(name="order_id", dtype="int32",
                  distribution=Sequential(start=1),
                  unique_per_id=True, nullable=False),
        FieldSpec(name="amount", dtype="float32",
                  distribution=Lognormal(mu=3.5, sigma=0.75),
                  nullable=False),
        FieldSpec(name="status", dtype="object",
                  distribution=WeightedChoice(
                      values=["pending", "shipped", "delivered"],
                      weights=[0.2, 0.5, 0.3])),
    ],
    anomalies=[AnomalySpec(name="missing_values", prob=1.0, rate=0.05)],
)

df = Pipeline(schema).run(n_rows=1000, seed=42)

YAML Config Reference

Top-level keys

| Key | Type | Required | Description |
|---|---|---|---|
| name | string | yes | Dataset identifier |
| primary_key | string | no (default: id) | Field used as the primary grouping key |
| records_per_primary_key | distribution block | yes | How many rows to generate per primary key value — accepts any continuous distribution |
| fields | list of field specs | yes | Column definitions |
| anomalies | list of anomaly specs | no | Data quality issues to inject |

Row count: run(n_rows=N) generates approximately N rows. Because each primary key group is sampled from records_per_primary_key, the actual count may differ slightly. Each group always has at least 1 row.
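The approximate row count can be sketched conceptually. This is an illustration of the sampling behaviour described above, not MessyData's actual internals: group sizes are drawn from the `records_per_primary_key` distribution (clamped to a minimum of 1) until the running total reaches `n_rows`, so the final total can overshoot by up to one group's size.

```python
import numpy as np

def approx_group_sizes(n_rows, mu=2.0, sigma=0.5, seed=42):
    """Sketch of why run(n_rows=N) is approximate: per-key group sizes
    are sampled until the running total reaches n_rows."""
    rng = np.random.default_rng(seed)
    sizes = []
    total = 0
    while total < n_rows:
        size = max(1, round(rng.lognormal(mean=mu, sigma=sigma)))  # every group has >= 1 row
        sizes.append(size)
        total += size
    return sizes

sizes = approx_group_sizes(1000)
# sum(sizes) >= 1000, overshooting by at most the last group's size --
# which is why the actual row count only approximates n_rows.
```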


Field spec properties

| Property | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | yes | — | Column name in the output DataFrame |
| dtype | string | no | object | Pandas dtype: int32, int64, float32, float64, object, bool |
| distribution | distribution block | yes | — | How values are sampled (see Distribution Reference) |
| unique_per_id | bool | no | false | If true, one value is drawn per primary key group and repeated for all rows in that group |
| nullable | bool | no | true | Marks the field as nullable — used by anomaly injection |
| temporal | bool | no | false | Marks this field as the date anchor for run_for_date / run_date_range. Exactly one field per schema. |

unique_per_id: true is appropriate for entity-level attributes that don't vary per transaction — e.g., a customer's region, a store's tier, a payment method for an order.
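The "one draw per group, repeated for all rows" behaviour can be illustrated with plain pandas. This is a conceptual sketch, not MessyData's implementation:

```python
import numpy as np
import pandas as pd

# Sketch of unique_per_id semantics: draw one value per primary-key
# group, then broadcast it to every row in that group.
rng = np.random.default_rng(0)
df = pd.DataFrame({"order_id": [1, 1, 1, 2, 2, 3]})

regions = ["north", "south", "east", "west"]
keys = df["order_id"].unique()
per_key = dict(zip(keys, rng.choice(regions, size=len(keys))))  # one draw per key
df["region"] = df["order_id"].map(per_key)                      # repeated within the group

# Every row sharing an order_id now shares the same region value.
```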


Distribution reference

Each distribution block requires a type key. All other keys are parameters for that distribution type.

Continuous distributions

| type | Parameters | Notes |
|---|---|---|
| uniform | min, max | Uniform over [min, max] |
| normal | mean, std | Gaussian |
| lognormal | mu, sigma | Log-normal — good default for prices, quantities, durations |
| weibull | a, scale (default 1.0) | Parametrised by shape a |
| exponential | scale (default 1.0) | Rate = 1 / scale |
| beta | a, b | Output in [0, 1] — useful for rates and probabilities |
| gamma | shape, scale (default 1.0) | General-purpose skewed positive |
| mixture | components, weights | Weighted blend of continuous distributions — see below |

Categorical distributions

| type | Parameters | Notes |
|---|---|---|
| weighted_choice | values, weights | Draws from a fixed list. weights must sum to 1. |
| weighted_choice_mapping | columns, weights | Draws correlated multi-column outcomes from a joint table — see below |

Special distributions

| type | Parameters | Notes |
|---|---|---|
| sequential | start, step (default 1) | Auto-incrementing. start can be an integer or a date string ("2023-01-01"). Each primary key group advances by step. |


weighted_choice — categorical with probabilities

distribution:
  type: weighted_choice
  values: [north, south, east, west]
  weights: [0.4, 0.3, 0.2, 0.1]

weighted_choice_mapping — correlated multi-column categorical

When two or more columns are always correlated (e.g., product_id and product_name always appear together), use a single weighted_choice_mapping field. All lists under columns must have the same length — each index is one joint outcome.

- name: product        # field name is a placeholder; actual columns come from `columns:`
  dtype: object
  distribution:
    type: weighted_choice_mapping
    columns:
      product_id:   [1001,        1002,    1003,      1004,         1005]
      product_name: [Widget,      Gadget,  Doohickey, Thingamajig,  Whatsit]
    weights: [0.4, 0.2, 0.2, 0.1, 0.1]

This adds product_id and product_name as separate columns — guaranteed consistent. The placeholder name: product is not added to the DataFrame.
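The consistency guarantee follows from how joint sampling works: one index is drawn per row, and every column reads that same index. A conceptual sketch (not MessyData's internals) in plain numpy/pandas:

```python
import numpy as np
import pandas as pd

# Sketch of weighted_choice_mapping: one joint index per row means
# product_id and product_name can never disagree.
columns = {
    "product_id":   [1001, 1002, 1003, 1004, 1005],
    "product_name": ["Widget", "Gadget", "Doohickey", "Thingamajig", "Whatsit"],
}
weights = [0.4, 0.2, 0.2, 0.1, 0.1]

rng = np.random.default_rng(7)
idx = rng.choice(len(weights), size=1000, p=weights)  # one joint outcome per row
df = pd.DataFrame({name: np.asarray(vals)[idx] for name, vals in columns.items()})

# Each (product_id, product_name) pair is one of the five defined outcomes.
```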


sequential — auto-incrementing integers or dates

# Integer sequence starting at 1
distribution:
  type: sequential
  start: 1
  step: 1

# Date sequence — start must be a YYYY-MM-DD string
distribution:
  type: sequential
  start: "2023-01-01"
  step: 1           # advances by 1 day per primary key group
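The per-group date advance described above can be sketched with pandas date arithmetic (a conceptual illustration, not MessyData's internals): group i gets `start + step * i` days.

```python
import pandas as pd

# Sketch of a date-valued sequential distribution: each successive
# primary-key group advances by `step` days from `start`.
start = pd.Timestamp("2023-01-01")
step = 1
n_groups = 5

group_dates = [start + pd.Timedelta(days=step * i) for i in range(n_groups)]
# group_dates -> 2023-01-01, 2023-01-02, 2023-01-03, 2023-01-04, 2023-01-05
```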

mixture — weighted blend of continuous distributions

# Bimodal price distribution: budget items + premium items
distribution:
  type: mixture
  components:
    - type: normal
      mean: 15.0
      std: 3.0
    - type: lognormal
      mu: 5.0
      sigma: 0.8
  weights: [0.6, 0.4]

mixture only supports continuous component types (uniform, normal, lognormal, weibull, exponential, beta, gamma). Categorical and sequential types cannot be used as components.
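Mixture sampling is the standard two-step draw: pick a component per row according to the weights, then sample from that component. A sketch of the bimodal price example above, using numpy directly rather than MessyData:

```python
import numpy as np

# Sketch of mixture sampling for the bimodal price config above:
# 60% normal(15, 3) "budget" prices, 40% lognormal(5, 0.8) "premium" prices.
rng = np.random.default_rng(3)
n = 10_000
weights = [0.6, 0.4]

component = rng.choice(2, size=n, p=weights)          # 0 = normal, 1 = lognormal
budget = rng.normal(loc=15.0, scale=3.0, size=n)       # component 0
premium = rng.lognormal(mean=5.0, sigma=0.8, size=n)   # component 1
prices = np.where(component == 0, budget, premium)

# Roughly 60% of prices cluster near 15; the rest follow the
# heavy-tailed lognormal, producing the bimodal shape.
```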


Anomaly reference

Each anomaly has two required fields:

| Field | Type | Description |
|---|---|---|
| prob | float [0–1] | Probability this anomaly fires on a given run. 1.0 = always inject. |
| rate | float [0–1] | Fraction of eligible rows or cells affected when the anomaly fires. |

Example: prob: 0.3, rate: 0.05 means a 30% chance the anomaly is active; when active, 5% of eligible rows are affected. Use prob: 1.0 for deterministic injection.
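The two-level prob/rate semantics can be sketched for a missing-values anomaly. This is an illustration of the semantics described above, not MessyData's implementation:

```python
import numpy as np
import pandas as pd

def inject_missing(df, prob, rate, seed=0):
    """Sketch of prob/rate semantics: with probability `prob` the anomaly
    fires at all on this run; when it fires, a `rate` fraction of cells
    is set to NaN."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    if rng.random() < prob:                  # level 1: does the anomaly fire?
        mask = rng.random(out.shape) < rate  # level 2: which cells are hit?
        out = out.mask(pd.DataFrame(mask, index=out.index, columns=out.columns))
    return out

df = pd.DataFrame({"a": range(1000), "b": range(1000)})
dirty = inject_missing(df, prob=1.0, rate=0.05)  # prob=1.0 -> deterministic injection
```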

Anomaly types

The available anomaly types cover the issues listed in the project description: missing values, duplicate records, invalid categories, bad dates, and outlier values. missing_values (with prob and rate) is shown in the examples above; run messydata schema to list every anomaly name and its parameters.
