DataClaw

This is a performance art project. Anthropic built their models on the world's freely shared information, then introduced increasingly dystopian data policies to stop anyone else from doing the same with their data - pulling up the ladder behind them. DataClaw lets you throw the ladder back down. The dataset it produces is yours to share.

Turn your Claude Code, Codex, and other coding-agent conversation history into structured data and publish it to Hugging Face with a single command. DataClaw parses session logs, redacts secrets and PII, and uploads the result as a ready-to-use dataset.

Every export is tagged dataclaw on Hugging Face. Together, they may someday form a growing distributed dataset of real-world human-AI coding collaboration.

Give this to your agent

Paste this into Claude Code, Codex, or any coding agent:

Help me export my Claude Code, Codex, and other coding-agent conversation history to Hugging Face using DataClaw.
Install it, then walk me through the process.

STEP 1 - INSTALL
  pip install -U dataclaw
  If that fails: git clone https://github.com/banodoco/dataclaw.git /tmp/dataclaw && pip install -U /tmp/dataclaw
  If that also fails, ask the user where the source is.

STEP 2 - INSTALL SKILL
  Skill support is currently only available for Claude Code.
  dataclaw update-skill claude
  For other agentic tools, skip this step and do not improvise a custom flow - follow the instructions in DataClaw's output on each step, especially next_steps and next_command.

STEP 3 - PREP
  dataclaw prep
  Every dataclaw command outputs next_steps in its JSON - follow them through the entire flow.

STEP 3A - CHOOSE SOURCE SCOPE (REQUIRED BEFORE EXPORT)
  Ask the user explicitly which source scope to export: a supported source key such as claude or codex, or all.
  dataclaw config --source all
  Do not export until source scope is explicitly confirmed.

STEP 3B - CHOOSE PROJECT SCOPE (REQUIRED BEFORE EXPORT)
  dataclaw list --source all
  Send the FULL project/folder list to the user in a message (name, source, sessions, size, excluded).
  Ask which projects to exclude.
  dataclaw config --exclude "project1,project2" OR dataclaw config --confirm-projects
  Do not export until folder selection is explicitly confirmed.

STEP 3C - SET REDACTED STRINGS
  Ask the user what additional strings should always be redacted, such as company names, client names, domains, internal URLs, or secrets that regex might miss.
  dataclaw config --redact "string1,string2"
  dataclaw config --redact-usernames "user1,user2"
  Only add these after explicit user confirmation.

STEP 4 - EXPORT LOCALLY
  dataclaw export --no-push --output dataclaw_export.jsonl

STEP 5 - REVIEW AND CONFIRM (REQUIRED BEFORE PUSH)
  Review PII findings and apply excludes/redactions as needed.
  Full name is requested for an exact-name privacy scan against the export.
  If the user declines to share their full name, use --skip-full-name-scan and attest the skip reason.
  dataclaw confirm --full-name "THEIR FULL NAME" --attest-full-name "..." --attest-sensitive "..." --attest-manual-scan "..."

STEP 6 - PUBLISH (ONLY AFTER EXPLICIT USER APPROVAL)
  dataclaw export --publish-attestation "User explicitly approved publishing to Hugging Face."
  Never publish unless the user explicitly says yes.

IF ANY COMMAND FAILS DUE TO A SKIPPED STEP:
  Restate the 6-step checklist above and resume from the blocked step (do not skip ahead).

IMPORTANT: Never run bare `hf auth login` when automating this with an agent - always use `--token`.
IMPORTANT: Always export with --no-push first and review for PII before publishing.
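
The prompt above relies on each command reporting next_steps and next_command in its JSON output. Below is a minimal sketch of how a wrapper script might read those two fields. It assumes next_steps is a list of strings and next_command is an optional string; the real payload shape is not documented here, and the sample payload (including its "stage" key) is invented for illustration.

```python
import json
from typing import List, Optional, Tuple


def read_next_actions(raw_output: str) -> Tuple[List[str], Optional[str]]:
    """Extract next_steps and next_command from a command's JSON output.

    Assumption: next_steps is a list of strings (a bare string is also
    tolerated) and next_command is an optional string. The actual
    DataClaw payload may be shaped differently.
    """
    payload = json.loads(raw_output)
    steps = payload.get("next_steps") or []
    if isinstance(steps, str):
        steps = [steps]
    return list(steps), payload.get("next_command")


# Hypothetical sample payload, not captured from a real run.
sample = json.dumps({
    "stage": "prep",
    "next_steps": ["Ask the user which source scope to export."],
    "next_command": "dataclaw config --source all",
})

steps, command = read_next_actions(sample)
for step in steps:
    print("-", step)
print("next:", command)
```

Reading the fields defensively (defaulting to an empty list and None) keeps the wrapper from crashing if a command omits one of them.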

Manual usage (without an agent)

# STEP 1 - INSTALL
pip install -U dataclaw
hf auth login --token YOUR_TOKEN

# STEP 3 - PREP
dataclaw prep
dataclaw config --repo username/my-personal-codex-data

# STEP 3A - CHOOSE SOURCE SCOPE
dataclaw config --source all  # REQUIRED: choose a supported source key or all

# STEP 3B - CHOOSE PROJECT SCOPE
dataclaw list --source all  # Present full list and confirm folder scope before export
dataclaw config --exclude "personal-stuff,scratch"  # or: dataclaw config --confirm-projects

# STEP 3C - SET REDACTED STRINGS
dataclaw config --redact-usernames "my_github_handle,my_discord_name"
dataclaw config --redact "my-domain.com,my-secret-project"

# STEP 4 - EXPORT LOCALLY
dataclaw export --no-push

# STEP 5 - REVIEW AND CONFIRM
dataclaw confirm \
  --full-name "YOUR FULL NAME" \
  --attest-full-name "Asked for full name and scanned export for YOUR FULL NAME." \
  --attest-sensitive "Asked about company/client/internal names and private URLs; none found or redactions updated." \
  --attest-manual-scan "Manually scanned 20 sessions across beginning/middle/end and reviewed findings."

# Or, if the user declines to share their full name:
dataclaw confirm \
  --skip-full-name-scan \
  --attest-full-name "User declined to share full name; skipped exact-name scan." \
  --attest-sensitive "Asked about company/client/internal names and private URLs; none found or redactions updated." \
  --attest-manual-scan "Manually scanned 20 sessions across beginning/middle/end and reviewed findings."

# STEP 6 - PUBLISH
dataclaw export --publish-attestation "User explicitly approved publishing to Hugging Face."

Step 2 (INSTALL SKILL) is omitted in manual usage.

Commands

| Command | Description |
|---------|-------------|
| dataclaw status | Show current stage and next steps |
| dataclaw prep | Discover projects, check HF auth, output JSON |
| dataclaw prep --source <source\|all> | Prep with an explicit source scope |
| dataclaw list | List all projects with exclusion status |
| dataclaw list --source <source\|all> | List projects for a specific source scope |
| dataclaw config | Show current config |
| dataclaw config --repo user/my-personal-codex-data | Set HF repo |
| dataclaw config --source <source\|all> | REQUIRED source scope selection (examples include claude, codex, and others) |
| dataclaw config --exclude "a,b" | Add excluded projects (appends) |
| dataclaw config --redact "str1,str2" | Add strings to always redact (appends) |
| dataclaw config --redact-usernames "u1,u2" | Add usernames to anonymize (appends) |
| dataclaw config --confirm-projects | Mark project selection as confirmed |
| dataclaw export --no-push | Export locally only (always do this first) |
| dataclaw export --source <source\|all> --no-push | Export a chosen source scope locally |
| dataclaw confirm --full-name "NAME" --attest-full-name "..." --attest-sensitive "..." --attest-manual-scan "..." | Scan for PII, run exact-name privacy check, verify review attestations, unlock pushing |
| dataclaw confirm --skip-full-name-scan --attest-full-name "..." --attest-sensitive "..." --attest-manual-scan "..." | Skip exact-name scan when user declines sharing full name (requires skip attestation) |
| dataclaw export --publish-attestation "..." | Export and push (requires dataclaw confirm first) |
| dataclaw export --all-projects | Include everything (ignore exclusions) |
| dataclaw export --no-thinking | Exclude extended thinking blocks |
| dataclaw update-skill claude | Install/update the dataclaw skill for Claude Code |

What gets exported

User messages - Including voice transcripts and images
Assistant responses
Assistant thinking - Opt out with --no-thinking
Tool calls - Tool name, inputs, outputs
Token usage - Input/output tokens per session
Metadata - Model name, git branch, timestamps

Privacy & Redaction

DataClaw applies multiple layers of protection:

Username redaction - Your OS username + any configured usernames replaced with stable hashes
Secret redaction - Regex patterns catch JWT tokens, API keys (Anthropic, OpenAI, HF, GitHub, AWS, etc.), database passwords, private keys, Discord webhooks, and more
Entropy analysis - Long high-entropy strings in quotes are flagged as potential secrets
Email redaction - Regex pattern catches email addresses
Custom redaction - You can configure additional strings to redact
Tool call redaction - Tool inputs and outputs are redacted with the same standard as regular messages

This is NOT foolproof. Always review your exported data before publishing. Automated redaction cannot catch everything - especially service-specific identifiers, third-party PII, or secrets in unusual formats.
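
To make the layers above concrete, here is a minimal sketch of three of them: stable username hashing, email redaction, and an entropy check on quoted strings. It is an illustration under stated assumptions, not DataClaw's actual implementation. The regexes, the placeholder strings, the 24-character minimum, and the 4.0 bits-per-character threshold are all hypothetical choices.

```python
import hashlib
import math
import re
from collections import Counter

# Illustrative patterns only; DataClaw's real patterns may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
QUOTED_RE = re.compile(r"""(["'])([A-Za-z0-9+/=_\-]{24,})\1""")


def stable_hash(username: str) -> str:
    """Map a username to the same placeholder on every run."""
    digest = hashlib.sha256(username.encode("utf-8")).hexdigest()[:8]
    return f"user_{digest}"


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def redact(text: str, usernames: list, entropy_threshold: float = 4.0) -> str:
    """Apply username, email, and high-entropy redaction in that order."""
    for name in usernames:
        text = text.replace(name, stable_hash(name))
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)

    def replace_if_random(match):
        quote, body = match.group(1), match.group(2)
        if shannon_entropy(body) >= entropy_threshold:
            return f"{quote}[REDACTED_SECRET]{quote}"
        return match.group(0)  # low entropy: leave the string alone

    return QUOTED_RE.sub(replace_if_random, text)


line = 'login as alice, mail alice@example.com, key="Zx9Qk2LmP0aT7vBn4RcY8wEu"'
print(redact(line, ["alice"]))
```

Hashing rather than deleting usernames keeps references to the same person consistent across sessions. The sketch also shows why review is still needed: a secret shorter than the length cutoff, or one outside quotes, passes through untouched.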

We recommend converting the exported JSONL into human-readable YAML using the script in https://github.com/peteromallet/dataclaw/issues/1 , then scanning it with tools such as trufflehog and gitleaks.
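
As a complement to those scanners, a short script can check whether strings you already know to be sensitive survived redaction. The sketch below is a hypothetical helper, not part of DataClaw: find_leftovers and its arguments are names introduced here for illustration, and it assumes only that the export is a JSONL file with one JSON object per line.

```python
import json
from pathlib import Path


def find_leftovers(export_path, needles):
    """Return (line_number, needle) pairs for needles still in the export.

    Matching is case-insensitive. Each line is parsed and re-serialized so
    that escaped characters compare as plain text.
    """
    hits = []
    lowered = [n.lower() for n in needles if n]
    with Path(export_path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines
            flat = json.dumps(json.loads(line), ensure_ascii=False).lower()
            for needle in lowered:
                if needle in flat:
                    hits.append((line_number, needle))
    return hits
```

For example, find_leftovers("dataclaw_export.jsonl", ["Acme Corp", "internal.example.com"]) returns an empty list when neither string appears; any hit points at a session line to inspect and a string to add with dataclaw config --redact.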

To help improve redaction, report issues: https://github.com/banodoco/dataclaw/issues

Data schema

Each line in conversations.jsonl is one session:

{
  "session_id": "abc-123",
  "project": "my-project",
  "model": "claude-opus-4-6",
  "git_branch": "main",
  "start_time": "2025-06-15T10:00:00+00:00",
  "end_time": "2025-06-15T10:30:00+00:00",
  "messages": [
    {
      "role": "user",
      "content": "Fix the login bug",
      "content_parts": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "..."}}
      ],
      "timestamp": "..."
    },
    {
      "role": "assistant",
      "content": "I'll investigate the login flow.",
      "thinking": "The user wants me to look at...",
      "tool_uses": [
        {
          "tool": "bash",
          "input": {"command": "grep -r 'login' src/"},
          "output": {
            "text": "src/auth.py:42: def login(user, password):",
            "raw": {"stderr": "", "interrupted": false}
          },
          "status": "success"
        }
      ],
      "timestam
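
A minimal sketch of consuming this schema, assuming only the fields visible in the example above (session_id, model, messages, role, tool_uses, tool). The function name summarize_session and the sample line are hypothetical, and every field is read with a default in case it is absent.

```python
import json
from collections import Counter


def summarize_session(line: str) -> dict:
    """Summarize one JSONL line: message counts by role and tool-call counts."""
    session = json.loads(line)
    messages = session.get("messages", [])
    roles = Counter(m.get("role", "unknown") for m in messages)
    tools = Counter(
        use.get("tool", "unknown")
        for m in messages
        for use in m.get("tool_uses", [])
    )
    return {
        "session_id": session.get("session_id"),
        "model": session.get("model"),
        "messages_by_role": dict(roles),
        "tool_calls": dict(tools),
    }


# Hypothetical sample line built from the fields shown in the schema.
sample_line = json.dumps({
    "session_id": "abc-123",
    "model": "claude-opus-4-6",
    "messages": [
        {"role": "user", "content": "Fix the login bug"},
        {
            "role": "assistant",
            "content": "I'll investigate the login flow.",
            "tool_uses": [{"tool": "bash", "status": "success"}],
        },
    ],
})

print(summarize_session(sample_line))
```

To process a whole export, call summarize_session on each non-empty line of the JSONL file.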
