Nbadb

Data Extraction (from https://stats.nba.com) and Processing Scripts to Produce the NBA Database on Kaggle (https://kaggle.com/wyattowalsh/basketball)

Install / Use

/learn @wyattowalsh/Nbadb
README

nbadb

nbadb logo

The most comprehensive open NBA database available.


nbadb warehouse preview

Basketball-native docs, lineage, and warehouse storytelling built around the unchanged nbadb mark.

| Extractors | Warehouse Models | Derived Outputs | Docs Pages |
| ---------- | ---------------- | --------------- | ---------- |
| 154        | 96               | 24              | 49         |

📊 What's Inside

nbadb exposes an analytics-first warehouse surface rather than a thin mirror of raw upstream payloads.

| Surface | What it covers |
| ------- | -------------- |
| `dim_*` | Stable identity and lookup context for players, teams, games, seasons, arenas, officials, and other conformed dimensions |
| `fact_*` | Event and measurement tables across box scores, tracking, shot charts, play-by-play, standings, matchups, and specialty feeds |
| `bridge_*` | Many-to-many connectors where public entities legitimately fan out |
| `agg_*` | Reusable rollups for season, career, pace, efficiency, and other repeated reporting needs |
| `analytics_*` | Convenience outputs for notebooks, dashboards, and quick exploratory analysis |

For the current public contract, use the generated docs surfaces: Schema Reference, Data Dictionary, and Lineage.
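To make the surface prefixes concrete, here is a minimal sketch of a typical star-schema query against the SQLite export. The table and column names (`dim_team`, `fact_game`, `team_id`, `pts`) are illustrative assumptions modeled on the prefixes above, not the actual schema — consult the Schema Reference for the real names.

```python
import sqlite3

# Toy stand-in for the star schema; real table/column names live in the
# Schema Reference. dim_team / fact_game / team_id / pts are assumed here.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_team (team_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_game (game_id INTEGER, team_id INTEGER, pts INTEGER);
    INSERT INTO dim_team VALUES (1, 'Warriors'), (2, 'Celtics');
    INSERT INTO fact_game VALUES (100, 1, 118), (100, 2, 112), (101, 1, 95);
""")

# Typical star-schema pattern: join a fact to its dimension, then aggregate.
rows = con.execute("""
    SELECT t.name, SUM(f.pts) AS total_pts
    FROM fact_game f
    JOIN dim_team t USING (team_id)
    GROUP BY t.name
    ORDER BY total_pts DESC
""").fetchall()
print(rows)  # [('Warriors', 213), ('Celtics', 112)]
```

The same fact-to-dimension join pattern applies unchanged in the DuckDB export, which is the primary analytics engine.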

🏀 Data Coverage

All data spans from the 1946-47 season to present (auto-updating via the daily pipeline).

  • Game-level — box scores (traditional, advanced, misc, four factors, hustle, tracking), play-by-play, shot charts, rotations, win probability, game context, scoring runs
  • Player-level — career stats, season splits, matchups, awards, draft combine measurements, player tracking (speed, distance, touches, passes, rebounding, shooting), estimated metrics
  • Team-level — game logs, matchups, splits, clutch stats, franchise history, IST standings, playoff picture, pace and efficiency, player dashboards
  • League-level — leaders, hustle stats, lineup visualizations, shot locations by zone, synergy play types, league-wide tracking

📦 Output Formats

| Format | Path | Description |
| ------ | ---- | ----------- |
| DuckDB | `nba.duckdb` | Primary analytics engine — columnar storage and fast SQL queries |
| SQLite | `nba.sqlite` | Portable single-file relational database |
| Parquet | `parquet/` | Zstd-compressed columnar files, partitioned by season |
| CSV | `csv/` | Universal flat files for any tool |

🚀 Quick Start

```bash
pip install nbadb    # or: uv add nbadb

# Full build from scratch (1946-present, ~2-4 hours)
nbadb init

# Daily incremental update (~5-15 minutes)
nbadb daily

# Export to all formats
nbadb export

# Query with natural language
nbadb ask "who led the league in scoring last season"

# Upload to Kaggle
nbadb upload
```

⌨️ CLI Reference

| Command | Description |
| ------- | ----------- |
| `nbadb init` | Full pipeline — extract all endpoints, stage, transform, export |
| `nbadb daily` | Incremental update for recent games |
| `nbadb monthly` | Dimension refresh + recent data |
| `nbadb full` | Full re-extraction without export |
| `nbadb migrate` | Run schema migrations |
| `nbadb audit-models` | Inventory consistency, column lineage, and validation gap audit |
| `nbadb export` | Re-export DuckDB → SQLite / Parquet / CSV |
| `nbadb upload` | Push the dataset to Kaggle |
| `nbadb download` | Pull the Kaggle dataset and seed local DuckDB |
| `nbadb extract-completeness` | Report endpoint coverage gaps |
| `nbadb docs-autogen` | Regenerate generator-owned schema, data dictionary, ER, and lineage artifacts |
| `nbadb schema [TABLE]` | Show schema for a table or list all star tables |
| `nbadb status` | Pipeline status, row counts, and watermarks |
| `nbadb ask QUESTION` | Natural-language query interface (read-only) |

Run `nbadb --help` or `nbadb <command> --help` for full option details.

For docs-site maintenance, regenerate generator-owned artifacts from the repo root with:

```bash
uv run nbadb docs-autogen --docs-root docs/content/docs
```

🤖 AI Query Interface

nbadb ask translates natural-language questions into read-only DuckDB queries:

```bash
nbadb ask "top 5 players by career three-pointers made"
nbadb ask "which teams had the best home record in 2023-24"
nbadb ask "LeBron James career averages by season"
```

Queries run against the star schema with safety guards (read-only mode, query limits, SQL injection protection).
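The read-only guard idea can be sketched as follows. The real `nbadb ask` guards run against DuckDB and are not shown here; this demo uses the SQLite export's URI `mode=ro`, which enforces the same contract at the connection level.

```python
import os
import sqlite3
import tempfile

# Build a throwaway database standing in for the nba.sqlite export.
path = os.path.join(tempfile.mkdtemp(), "nba.sqlite")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE fact_game (game_id INTEGER, pts INTEGER)")
rw.commit()
rw.close()

# Open read-only via SQLite's URI syntax: every write is rejected by the engine.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
try:
    ro.execute("INSERT INTO fact_game VALUES (1, 99)")
    blocked = False
except sqlite3.OperationalError:
    blocked = True
print(blocked)  # True
```

Enforcing read-only at the connection rather than by inspecting query text means even a cleverly constructed statement cannot mutate the database.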

📓 Kaggle Notebooks

Ten analysis notebooks are published on Kaggle, all powered by this dataset:

| Notebook | Description |
| -------- | ----------- |
| NBA Aging Curves | Peak, prime, and decline — career trajectory modeling |
| Defense Decoded | Tracking + hustle + synergy PCA to quantify defense |
| Draft Combine Analysis | What pre-draft measurements actually predict |
| Game Prediction | Stacking ensemble model for game outcomes |
| MVP Predictor | Explainable ML for MVP voting prediction |
| Play-by-Play Insights | Win probability, scoring runs, and clutch analysis |
| Player Archetypes | UMAP + GMM clustering — 8 data-driven player types |
| Player Dashboard | Interactive explorer with 50+ metrics |
| Player Similarity | Find any player's statistical twin |
| Shot Chart Analysis | The geography of scoring and the 3-point revolution |

🏗️ Architecture

```mermaid
flowchart LR
    A["NBA API + static sources"] -->|"extract"| B["Stage\nDuckDB staging"]
    B --> C["Transform"]
    C --> D["Warehouse\nDimensions / facts / bridges"]
    C --> E["Derived outputs\nAggregates / analytics"]
    D & E --> F["Export"]
    F --> G["DuckDB"]
    F --> H["SQLite"]
    F --> I["Parquet / CSV"]
```

  • Polars for all DataFrame operations with zero-copy Arrow interchange to DuckDB
  • 3-tier Pandera validation — raw → staging → star
  • SQL-first transforms for the star surface, with dependency-ordered execution
  • SCD Type 2 for dim_player and dim_team_history (surrogate keys, valid_from/valid_to)
  • Checkpoint/resume for interrupted transform runs
  • Watermark tracking for incremental extraction
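A minimal sketch of the SCD Type 2 pattern named above. The real implementation lives in the warehouse transforms; the tracked attribute (`team`) is illustrative, while `valid_from`/`valid_to` and surrogate keys are the mechanism the README describes.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PlayerRow:
    surrogate_key: int
    player_id: int            # natural key
    team: str                 # tracked attribute (illustrative choice)
    valid_from: str
    valid_to: Optional[str]   # None marks the current version

def scd2_apply(history: list[PlayerRow], player_id: int, team: str, as_of: str) -> None:
    """Close the open version if the tracked attribute changed, then append a new one."""
    current = next((r for r in history if r.player_id == player_id and r.valid_to is None), None)
    if current is not None:
        if current.team == team:
            return  # no change: keep the current version open
        current.valid_to = as_of
    new_key = max((r.surrogate_key for r in history), default=0) + 1
    history.append(PlayerRow(new_key, player_id, team, as_of, None))

history: list[PlayerRow] = []
scd2_apply(history, 23, "CLE", "2003-10-29")
scd2_apply(history, 23, "MIA", "2010-10-26")  # closes the CLE row, opens MIA
scd2_apply(history, 23, "MIA", "2011-01-01")  # no-op: attribute unchanged
print(len(history), history[0].valid_to)  # 2 2010-10-26
```

Keeping every closed version queryable is what lets the warehouse answer "which team was this player on as of date X" without rewriting history.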

Read more in the full Architecture Guide.
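The watermark tracking listed above can be sketched as follows. The file layout and names here are hypothetical, not nbadb's actual state tables: the idea is to persist the latest game date each endpoint has been extracted through, then fetch only newer games on the next run.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical watermark store; nbadb's real state tables are not shown here.
state_path = Path(tempfile.mkdtemp()) / "watermarks.json"

def load_watermark(endpoint: str) -> str:
    if not state_path.exists():
        return "1946-11-01"  # earliest season: first run backfills everything
    return json.loads(state_path.read_text()).get(endpoint, "1946-11-01")

def save_watermark(endpoint: str, value: str) -> None:
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    state[endpoint] = value
    state_path.write_text(json.dumps(state))

upstream = ["2024-04-10", "2024-04-11", "2024-04-12"]  # toy feed of game dates
new_games = [g for g in upstream if g > load_watermark("boxscore")]  # incremental slice
if new_games:
    save_watermark("boxscore", max(new_games))
print(new_games, load_watermark("boxscore"))
```

ISO-8601 dates compare correctly as strings, which keeps the watermark comparison trivial.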

🔧 Tech Stack

| Component | Technology |
| --------- | ---------- |
| Language | Python 3.12 |
| Package Manager | uv |
| DataFrames | Polars 1.38 |
| Validation | Pandera (Polars backend) |
| Analytics DB | DuckDB 1.4 |
| Relational DB | SQLModel + SQLite |
| CLI | Typer + Rich + Textual |
| Type Checking | ty |
| Linting | Ruff |
| Docs | Fumadocs + Next.js |
| CI | GitHub Actions (SHA-pinned) |

📖 Documentation

Full documentation lives at nbadb.w4w.dev.

  • Getting Started — install, run the pipeline, and learn where to go next
  • Architecture — pipeline stages, validation layers, and state tables
  • Schema Reference — curated star-surface guide plus generated raw/staging/star references
  • Data Dictionary — glossary plus generated raw/staging/star field references
  • Diagrams — ER, endpoint map, and pipeline visuals
  • Lineage — trace endpoints and staging inputs to final tables
  • Guides — onboarding, query recipes, …