Drainage 🌊


🌊 D R A I N A G E 🦀
Rust + Python Lake House Health Analyzer Detect • Diagnose • Optimize • Flow

A high-performance Rust library with Python bindings for analyzing the health of remote S3-stored data lakes (Delta Lake and Apache Iceberg). Drainage helps you understand and optimize your data lake by identifying issues like unreferenced files, suboptimal partitioning, and inefficient file sizes.

Features

  • 🚀 Fast Analysis: Built in Rust for maximum performance
  • 📊 Comprehensive Health Metrics:
    • Unreferenced and orphaned data files detection
    • Partition and clustering analysis (Delta Lake liquid clustering + Iceberg clustering)
    • File size distribution and optimization recommendations
    • Data skew analysis (partition and file size skew)
    • Metadata health monitoring
    • Snapshot retention analysis
    • Deletion vector impact analysis (Delta Lake & Iceberg v3+)
    • Schema evolution stability tracking (Delta Lake & Iceberg)
    • Time travel storage cost analysis (Delta Lake & Iceberg)
    • Table constraints and data quality insights (Delta Lake & Iceberg)
    • Advanced file compaction optimization (Delta Lake & Iceberg)
    • Overall health score calculation
  • 🔍 Multi-Format Support:
    • Delta Lake tables (including liquid clustering support)
    • Apache Iceberg tables (including clustering support)
  • ☁️ S3 Native: Direct S3 integration for analyzing remote data lakes
  • 🐍 Python Interface: Easy-to-use Python API powered by PyO3
  • 🧪 Comprehensive Testing: Full test suite with CI/CD across multiple platforms

Installation

As a Python Package

```bash
pip install drainage
```

Note: This package is automatically built and published to PyPI using GitHub Actions when version tags are pushed.

From Source

```bash
# Install maturin for building Rust Python extensions
pip install maturin

# Build and install the package
cd drainage
maturin develop --release
```

Quick Start

Quick Analysis (Auto-Detection)

```python
import drainage

# Analyze any table with automatic type detection
report = drainage.analyze_table(
    s3_path="s3://my-bucket/my-table",
    aws_region="us-west-2"
)

# Print a comprehensive health report
drainage.print_health_report(report)

# Or access individual metrics
print(f"Health Score: {report.health_score}")
print(f"Table Type: {report.table_type}")
print(f"Total Files: {report.metrics.total_files}")
```

Analyzing a Delta Lake Table

```python
import drainage

# Analyze a Delta Lake table
report = drainage.analyze_delta_lake(
    s3_path="s3://my-bucket/my-delta-table",
    aws_access_key_id="YOUR_ACCESS_KEY",  # Optional if using IAM roles
    aws_secret_access_key="YOUR_SECRET_KEY",  # Optional if using IAM roles
    aws_region="us-west-2"  # Optional, defaults to us-east-1
)

# View the health score (0.0 to 1.0)
print(f"Health Score: {report.health_score}")

# Check metrics
print(f"Total Files: {report.metrics.total_files}")
print(f"Total Size: {report.metrics.total_size_bytes} bytes")
print(f"Unreferenced Files: {len(report.metrics.unreferenced_files)}")
print(f"Partition Count: {report.metrics.partition_count}")

# View recommendations
for recommendation in report.metrics.recommendations:
    print(f"⚠️  {recommendation}")
```

Analyzing an Apache Iceberg Table

```python
import drainage

# Analyze an Apache Iceberg table
report = drainage.analyze_iceberg(
    s3_path="s3://my-bucket/my-iceberg-table",
    aws_region="us-west-2"
)

# View file size distribution
dist = report.metrics.file_size_distribution
print(f"Small files (<16MB): {dist.small_files}")
print(f"Medium files (16-128MB): {dist.medium_files}")
print(f"Large files (128MB-1GB): {dist.large_files}")
print(f"Very large files (>1GB): {dist.very_large_files}")
```

Working on Databricks

```python
import drainage

# Enumerate the tables in a schema and analyze each one by its storage location
tables = spark.sql("SHOW TABLES IN development.backend_dev")
for table in tables.collect():
    table_name = table['tableName']

    # Get table properties, including the storage location
    table_info = spark.sql(f"DESCRIBE TABLE EXTENDED development.backend_dev.{table_name}")

    # Look for the Location row in the output
    location_rows = table_info.filter(table_info['col_name'] == 'Location').collect()

    if location_rows:
        s3_path = location_rows[0]['data_type']
        print(f"Table: {table_name}, Location: {s3_path}")

        # Unity Catalog managed Delta tables store their data under __unitystorage
        if s3_path.startswith("s3://") and "__unitystorage" in s3_path:
            print(f"✅ Unity Catalog Delta table: {s3_path}")
            # Try analysis
            try:
                report = drainage.analyze_table(s3_path=s3_path, aws_region="us-east-1")
                drainage.print_health_report(report)
            except Exception as e:
                print(f"❌ Analysis failed: {e}")
```

Health Metrics Explained

Health Score

The health score ranges from 0.0 (poor health) to 1.0 (excellent health) and is calculated based on:

  • Unreferenced Files (-30%): Files that exist in S3 but aren't referenced in table metadata
  • Small Files (-20%): High percentage of small files (<16MB) indicates inefficient storage
  • Very Large Files (-10%): Files over 1GB may cause performance issues
  • Partitioning (-10% to -15%): Too many or too few files per partition
  • Data Skew (-15% to -25%): Uneven data distribution across partitions and file sizes
  • Metadata Bloat (-5%): Large metadata files that slow down operations
  • Snapshot Retention (-10%): Too many historical snapshots affecting performance
  • Deletion Vector Impact (-15%): High deletion vector impact affecting query performance
  • Schema Instability (-20%): Unstable schema evolution affecting compatibility and performance
  • Time Travel Storage Costs (-10%): High time travel storage costs affecting budget
  • Data Quality Issues (-15%): Poor data quality from insufficient constraints
  • File Compaction Opportunities (-10%): Missed compaction opportunities affecting performance
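To make the weighting scheme concrete, here is a minimal sketch of how penalties like these combine into a bounded score. This is an illustration only, not Drainage's actual formula: the weight table and severity inputs below are hypothetical, and only a few of the categories above are included.

```python
# Hypothetical penalty-based scoring sketch -- NOT Drainage's internals.
# Each issue has a maximum weight; a severity in [0.0, 1.0] scales it.
MAX_WEIGHTS = {
    "unreferenced_files": 0.30,
    "small_files": 0.20,
    "schema_instability": 0.20,
    "data_skew": 0.25,
    "deletion_vectors": 0.15,
}

def health_score(severities: dict[str, float]) -> float:
    """Start at a perfect 1.0 and subtract each weighted penalty."""
    score = 1.0
    for issue, severity in severities.items():
        score -= MAX_WEIGHTS.get(issue, 0.0) * severity
    return max(score, 0.0)  # clamp so the score never drops below 0.0

# Half-severity unreferenced files plus a full small-file problem
print(health_score({"unreferenced_files": 0.5, "small_files": 1.0}))
```

A clean table with no issues stays at 1.0, while a table exhibiting every issue at full severity bottoms out at 0.0.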

Key Metrics

File Analysis

  • total_files: Total number of data files in the table
  • total_size_bytes: Total size of all data files
  • avg_file_size_bytes: Average file size
  • unreferenced_files: List of files not referenced in table metadata
  • unreferenced_size_bytes: Total size of unreferenced files
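One quick health check these metrics enable is the fraction of total storage held by unreferenced files. The sketch below works on plain numbers; in practice you would feed it values such as `report.metrics.total_size_bytes` and `report.metrics.unreferenced_size_bytes` from the examples above.

```python
def unreferenced_fraction(total_size_bytes: int, unreferenced_size_bytes: int) -> float:
    """Fraction of table storage held by files no table metadata references."""
    if total_size_bytes == 0:
        return 0.0  # empty table: nothing can be wasted
    return unreferenced_size_bytes / total_size_bytes

# 1 GB unreferenced out of a 10 GB table
frac = unreferenced_fraction(10 * 1024**3, 1024**3)
print(f"{frac:.0%} of storage is unreferenced")
```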

Partition Analysis

  • partition_count: Number of partitions
  • partitions: Detailed information about each partition including:
    • Partition values
    • File count per partition
    • Total and average file sizes

File Size Distribution

  • small_files: Files under 16MB
  • medium_files: Files between 16MB and 128MB
  • large_files: Files between 128MB and 1GB
  • very_large_files: Files over 1GB
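The same thresholds can be written as a small classifier whose bucket names mirror the fields above (exact boundary handling at 16MB/128MB/1GB is an assumption here, since the README only gives the ranges):

```python
MB = 1024 * 1024
GB = 1024 * MB

def size_bucket(size_bytes: int) -> str:
    """Classify a data file size into the distribution buckets listed above."""
    if size_bytes < 16 * MB:
        return "small_files"
    if size_bytes < 128 * MB:
        return "medium_files"
    if size_bytes <= 1 * GB:
        return "large_files"
    return "very_large_files"

print(size_bucket(4 * MB))    # small_files
print(size_bucket(512 * MB))  # large_files
```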

Clustering (Delta Lake & Iceberg)

  • clustering_columns: Columns used for clustering/sorting
  • cluster_count: Number of clusters
  • avg_files_per_cluster: Average files per cluster
  • Delta Lake: Supports liquid clustering (up to 4 columns)
  • Iceberg: Supports traditional clustering and Z-order

Data Skew Analysis

  • partition_skew_score: How unevenly data is distributed across partitions (0.0 = perfect, 1.0 = highly skewed)
  • file_size_skew_score: Variation in file sizes within partitions
  • largest_partition_size: Size of the largest partition
  • smallest_partition_size: Size of the smallest partition
  • avg_partition_size: Average partition size
  • partition_size_std_dev: Standard deviation of partition sizes
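To build intuition for these numbers: a normalized spread measure such as the coefficient of variation (standard deviation divided by mean) behaves like a 0.0-is-even, higher-is-skewed score. This is an illustrative calculation, not necessarily the formula Drainage uses internally.

```python
from statistics import mean, pstdev

def partition_skew(partition_sizes: list[int]) -> float:
    """Coefficient of variation of partition sizes: 0.0 means perfectly even."""
    avg = mean(partition_sizes)
    if avg == 0:
        return 0.0  # all partitions empty: no skew to measure
    return pstdev(partition_sizes) / avg

even = [100, 100, 100, 100]
skewed = [10, 10, 10, 370]   # one hot partition dominates
print(partition_skew(even))    # 0.0
print(partition_skew(skewed))  # noticeably higher
```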

Metadata Health

  • metadata_file_count: Number of transaction logs/manifest files
  • metadata_total_size_bytes: Combined size of all metadata files
  • avg_metadata_file_size: Average size of metadata files
  • metadata_growth_rate: Estimated metadata growth rate
  • manifest_file_count: Number of manifest files (Iceberg only)

Snapshot Health

  • snapshot_count: Number of historical snapshots
  • oldest_snapshot_age_days: Age of the oldest snapshot
  • newest_snapshot_age_days: Age of the newest snapshot
  • avg_snapshot_age_days: Average snapshot age
  • snapshot_retention_risk: Risk level based on snapshot count (0.0 = good, 1.0 = high risk)
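A risk score like this can be pictured as a simple saturation ramp over the snapshot count. The `safe` and `high` thresholds below are made-up illustrative values, not Drainage's internals:

```python
def snapshot_retention_risk(snapshot_count: int, safe: int = 10, high: int = 100) -> float:
    """Ramp risk linearly from 0.0 at `safe` snapshots up to 1.0 at `high`.

    `safe` and `high` are hypothetical thresholds chosen for illustration.
    """
    if snapshot_count <= safe:
        return 0.0
    if snapshot_count >= high:
        return 1.0
    return (snapshot_count - safe) / (high - safe)

print(snapshot_retention_risk(55))  # 0.5
```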

Deletion Vector Analysis (Delta Lake & Iceberg v3+)

  • deletion_vector_count: Number of deletion vectors
  • total_deletion_vector_size_bytes: Total size of all deletion vectors
  • avg_deletion_vector_size_bytes: Average deletion vector size
  • deletion_vector_age_days: Age of the oldest deletion vector
  • deleted_rows_count: Total number of deleted rows
  • deletion_vector_impact_score: Performance impact score (0.0 = no impact, 1.0 = high impact)

Schema Evolution Tracking (Delta Lake & Iceberg)

  • total_schema_changes: Total number of schema changes
  • breaking_changes: Number of breaking schema changes
  • non_breaking_changes: Number of non-breaking schema changes
  • schema_stability_score: Schema stability score (0.0 = unstable, 1.0 = very stable)
  • days_since_last_change: Days since last schema change
  • schema_change_frequency: Schema changes per day
  • current_schema_version: Current schema version

Time Travel Analysis (Delta Lake & Iceberg)

  • total_snapshots: Total number of historical snapshots
  • oldest_snapshot_age_days: Age of the oldest snapshot in days
  • newest_snapshot_age_days: Age of the newest snapshot in days
  • total_historical_size_bytes: Total size of historical snapshot data in bytes
