Drainage 🌊


🌊 D R A I N A G E 🦀
Rust + Python Lake House Health Analyzer Detect • Diagnose • Optimize • Flow

A high-performance Rust library with Python bindings for analyzing the health of remote S3-stored data lakes (Delta Lake and Apache Iceberg). Drainage helps you understand and optimize your data lake by identifying issues like unreferenced files, suboptimal partitioning, and inefficient file sizes.

Features

  • 🚀 Fast Analysis: Built in Rust for maximum performance
  • 📊 Comprehensive Health Metrics:
    • Unreferenced and orphaned data files detection
    • Partition and clustering analysis (Delta Lake liquid clustering + Iceberg clustering)
    • File size distribution and optimization recommendations
    • Data skew analysis (partition and file size skew)
    • Metadata health monitoring
    • Snapshot retention analysis
    • Deletion vector impact analysis (Delta Lake & Iceberg v3+)
    • Schema evolution stability tracking (Delta Lake & Iceberg)
    • Time travel storage cost analysis (Delta Lake & Iceberg)
    • Table constraints and data quality insights (Delta Lake & Iceberg)
    • Advanced file compaction optimization (Delta Lake & Iceberg)
    • Overall health score calculation
  • 🔍 Multi-Format Support:
    • Delta Lake tables (including liquid clustering support)
    • Apache Iceberg tables (including clustering support)
  • ☁️ S3 Native: Direct S3 integration for analyzing remote data lakes
  • 🐍 Python Interface: Easy-to-use Python API powered by PyO3
  • 🧪 Comprehensive Testing: Full test suite with CI/CD across multiple platforms

Installation

As a Python Package

```bash
pip install drainage
```

Note: This package is automatically built and published to PyPI using GitHub Actions when version tags are pushed.

From Source

```bash
# Install maturin for building Rust Python extensions
pip install maturin

# Build and install the package
cd drainage
maturin develop --release
```

Quick Start

Quick Analysis (Auto-Detection)

```python
import drainage

# Analyze any table with automatic type detection
report = drainage.analyze_table(
    s3_path="s3://my-bucket/my-table",
    aws_region="us-west-2"
)

# Print a comprehensive health report
drainage.print_health_report(report)

# Or access individual metrics
print(f"Health Score: {report.health_score}")
print(f"Table Type: {report.table_type}")
print(f"Total Files: {report.metrics.total_files}")
```

Analyzing a Delta Lake Table

```python
import drainage

# Analyze a Delta Lake table
report = drainage.analyze_delta_lake(
    s3_path="s3://my-bucket/my-delta-table",
    aws_access_key_id="YOUR_ACCESS_KEY",  # Optional if using IAM roles
    aws_secret_access_key="YOUR_SECRET_KEY",  # Optional if using IAM roles
    aws_region="us-west-2"  # Optional, defaults to us-east-1
)

# View the health score (0.0 to 1.0)
print(f"Health Score: {report.health_score}")

# Check metrics
print(f"Total Files: {report.metrics.total_files}")
print(f"Total Size: {report.metrics.total_size_bytes} bytes")
print(f"Unreferenced Files: {len(report.metrics.unreferenced_files)}")
print(f"Partition Count: {report.metrics.partition_count}")

# View recommendations
for recommendation in report.metrics.recommendations:
    print(f"⚠️  {recommendation}")
```

Analyzing an Apache Iceberg Table

```python
import drainage

# Analyze an Apache Iceberg table
report = drainage.analyze_iceberg(
    s3_path="s3://my-bucket/my-iceberg-table",
    aws_region="us-west-2"
)

# View file size distribution
dist = report.metrics.file_size_distribution
print(f"Small files (<16MB): {dist.small_files}")
print(f"Medium files (16-128MB): {dist.medium_files}")
print(f"Large files (128MB-1GB): {dist.large_files}")
print(f"Very large files (>1GB): {dist.very_large_files}")
```

Working on Databricks

```python
import drainage

# Enumerate the tables in a schema and analyze each one by its storage location
tables = spark.sql("SHOW TABLES IN development.backend_dev")
for table in tables.collect():
    table_name = table['tableName']

    # Get table properties, including the storage location
    table_info = spark.sql(f"DESCRIBE TABLE EXTENDED development.backend_dev.{table_name}")

    # Look for the Location row in the output
    location_rows = table_info.filter(table_info['col_name'] == 'Location').collect()

    if location_rows:
        s3_path = location_rows[0]['data_type']
        print(f"Table: {table_name}, Location: {s3_path}")

        # Unity Catalog managed Delta tables store their data under __unitystorage
        if s3_path.startswith("s3://") and "__unitystorage" in s3_path:
            print(f"✅ Unity Catalog Delta table: {s3_path}")
            # Try analysis
            try:
                report = drainage.analyze_table(s3_path=s3_path, aws_region="us-east-1")
                drainage.print_health_report(report)
            except Exception as e:
                print(f"❌ Analysis failed: {e}")
```

Health Metrics Explained

Health Score

The health score ranges from 0.0 (poor health) to 1.0 (excellent health) and is calculated based on:

  • Unreferenced Files (-30%): Files that exist in S3 but aren't referenced in table metadata
  • Small Files (-20%): High percentage of small files (<16MB) indicates inefficient storage
  • Very Large Files (-10%): Files over 1GB may cause performance issues
  • Partitioning (-10% to -15%): Too many or too few files per partition
  • Data Skew (-15% to -25%): Uneven data distribution across partitions and file sizes
  • Metadata Bloat (-5%): Large metadata files that slow down operations
  • Snapshot Retention (-10%): Too many historical snapshots affecting performance
  • Deletion Vector Impact (-15%): High deletion vector impact affecting query performance
  • Schema Instability (-20%): Unstable schema evolution affecting compatibility and performance
  • Time Travel Storage Costs (-10%): High time travel storage costs affecting budget
  • Data Quality Issues (-15%): Poor data quality from insufficient constraints
  • File Compaction Opportunities (-10%): Missed compaction opportunities affecting performance
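To make the weighting scheme concrete, here is a minimal sketch of how penalties like these combine into a bounded score. This is an illustration only, not Drainage's actual formula: the weight table and severity inputs below are hypothetical, and only a few of the categories above are included.

```python
# Hypothetical penalty-based scoring sketch -- NOT Drainage's internals.
# Each issue has a maximum weight; a severity in [0.0, 1.0] scales it.
MAX_WEIGHTS = {
    "unreferenced_files": 0.30,
    "small_files": 0.20,
    "schema_instability": 0.20,
    "data_skew": 0.25,
    "deletion_vectors": 0.15,
}

def health_score(severities: dict[str, float]) -> float:
    """Start at a perfect 1.0 and subtract each weighted penalty."""
    score = 1.0
    for issue, severity in severities.items():
        score -= MAX_WEIGHTS.get(issue, 0.0) * severity
    return max(score, 0.0)  # clamp so the score never drops below 0.0

# Half-severity unreferenced files plus a full small-file problem
print(health_score({"unreferenced_files": 0.5, "small_files": 1.0}))
```

A clean table with no issues stays at 1.0, while a table exhibiting every issue at full severity bottoms out at 0.0.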

Key Metrics

File Analysis

  • total_files: Total number of data files in the table
  • total_size_bytes: Total size of all data files
  • avg_file_size_bytes: Average file size
  • unreferenced_files: List of files not referenced in table metadata
  • unreferenced_size_bytes: Total size of unreferenced files
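One quick health check these metrics enable is the fraction of total storage held by unreferenced files. The sketch below works on plain numbers; in practice you would feed it values such as `report.metrics.total_size_bytes` and `report.metrics.unreferenced_size_bytes` from the examples above.

```python
def unreferenced_fraction(total_size_bytes: int, unreferenced_size_bytes: int) -> float:
    """Fraction of table storage held by files no table metadata references."""
    if total_size_bytes == 0:
        return 0.0  # empty table: nothing can be wasted
    return unreferenced_size_bytes / total_size_bytes

# 1 GB unreferenced out of a 10 GB table
frac = unreferenced_fraction(10 * 1024**3, 1024**3)
print(f"{frac:.0%} of storage is unreferenced")
```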

Partition Analysis

  • partition_count: Number of partitions
  • partitions: Detailed information about each partition including:
    • Partition values
    • File count per partition
    • Total and average file sizes

File Size Distribution

  • small_files: Files under 16MB
  • medium_files: Files between 16MB and 128MB
  • large_files: Files between 128MB and 1GB
  • very_large_files: Files over 1GB
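The same thresholds can be written as a small classifier whose bucket names mirror the fields above (exact boundary handling at 16MB/128MB/1GB is an assumption here, since the README only gives the ranges):

```python
MB = 1024 * 1024
GB = 1024 * MB

def size_bucket(size_bytes: int) -> str:
    """Classify a data file size into the distribution buckets listed above."""
    if size_bytes < 16 * MB:
        return "small_files"
    if size_bytes < 128 * MB:
        return "medium_files"
    if size_bytes <= 1 * GB:
        return "large_files"
    return "very_large_files"

print(size_bucket(4 * MB))    # small_files
print(size_bucket(512 * MB))  # large_files
```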

Clustering (Delta Lake & Iceberg)

  • clustering_columns: Columns used for clustering/sorting
  • cluster_count: Number of clusters
  • avg_files_per_cluster: Average files per cluster
  • Delta Lake: Supports liquid clustering (up to 4 columns)
  • Iceberg: Supports traditional clustering and Z-order

Data Skew Analysis

  • partition_skew_score: How unevenly data is distributed across partitions (0.0 = perfect, 1.0 = highly skewed)
  • file_size_skew_score: Variation in file sizes within partitions
  • largest_partition_size: Size of the largest partition
  • smallest_partition_size: Size of the smallest partition
  • avg_partition_size: Average partition size
  • partition_size_std_dev: Standard deviation of partition sizes
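To build intuition for these numbers: a normalized spread measure such as the coefficient of variation (standard deviation divided by mean) behaves like a 0.0-is-even, higher-is-skewed score. This is an illustrative calculation, not necessarily the formula Drainage uses internally.

```python
from statistics import mean, pstdev

def partition_skew(partition_sizes: list[int]) -> float:
    """Coefficient of variation of partition sizes: 0.0 means perfectly even."""
    avg = mean(partition_sizes)
    if avg == 0:
        return 0.0  # all partitions empty: no skew to measure
    return pstdev(partition_sizes) / avg

even = [100, 100, 100, 100]
skewed = [10, 10, 10, 370]   # one hot partition dominates
print(partition_skew(even))    # 0.0
print(partition_skew(skewed))  # noticeably higher
```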

Metadata Health

  • metadata_file_count: Number of transaction logs/manifest files
  • metadata_total_size_bytes: Combined size of all metadata files
  • avg_metadata_file_size: Average size of metadata files
  • metadata_growth_rate: Estimated metadata growth rate
  • manifest_file_count: Number of manifest files (Iceberg only)

Snapshot Health

  • snapshot_count: Number of historical snapshots
  • oldest_snapshot_age_days: Age of the oldest snapshot
  • newest_snapshot_age_days: Age of the newest snapshot
  • avg_snapshot_age_days: Average snapshot age
  • snapshot_retention_risk: Risk level based on snapshot count (0.0 = good, 1.0 = high risk)
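A risk score like this can be pictured as a simple saturation ramp over the snapshot count. The `safe` and `high` thresholds below are made-up illustrative values, not Drainage's internals:

```python
def snapshot_retention_risk(snapshot_count: int, safe: int = 10, high: int = 100) -> float:
    """Ramp risk linearly from 0.0 at `safe` snapshots up to 1.0 at `high`.

    `safe` and `high` are hypothetical thresholds chosen for illustration.
    """
    if snapshot_count <= safe:
        return 0.0
    if snapshot_count >= high:
        return 1.0
    return (snapshot_count - safe) / (high - safe)

print(snapshot_retention_risk(55))  # 0.5
```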

Deletion Vector Analysis (Delta Lake & Iceberg v3+)

  • deletion_vector_count: Number of deletion vectors
  • total_deletion_vector_size_bytes: Total size of all deletion vectors
  • avg_deletion_vector_size_bytes: Average deletion vector size
  • deletion_vector_age_days: Age of the oldest deletion vector
  • deleted_rows_count: Total number of deleted rows
  • deletion_vector_impact_score: Performance impact score (0.0 = no impact, 1.0 = high impact)

Schema Evolution Tracking (Delta Lake & Iceberg)

  • total_schema_changes: Total number of schema changes
  • breaking_changes: Number of breaking schema changes
  • non_breaking_changes: Number of non-breaking schema changes
  • schema_stability_score: Schema stability score (0.0 = unstable, 1.0 = very stable)
  • days_since_last_change: Days since last schema change
  • schema_change_frequency: Schema changes per day
  • current_schema_version: Current schema version

Time Travel Analysis (Delta Lake & Iceberg)

  • total_snapshots: Total number of historical snapshots
  • oldest_snapshot_age_days: Age of the oldest snapshot in days
  • newest_snapshot_age_days: Age of the newest snapshot in days
  • total_historical_size_bytes: Total size of historical snapshot data in bytes
