SkillAgentSearch skills...

Minidb

MiniDB is a high-performance analytical database system built on Lakehouse architecture principles, combining the flexibility of data lakes with the performance and reliability of data warehouses.

Install / Use

/learn @yyun543/Minidb
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

MiniDB

<div align="center">

Version Go Version License Tests Architecture

High-performance Lakehouse Database Engine · Built on Apache Arrow and Parquet

English | 中文 | Quick Start | Documentation | Architecture

</div>

📖 Project Overview

MiniDB is a production-grade Lakehouse database engine that implements 72% of the core capabilities from the Delta Lake paper (PVLDB 2020), and achieves a 1000x write amplification improvement for UPDATE/DELETE operations beyond what's described in the paper. The project is written in Go, built on the Apache Arrow vectorized execution engine and Parquet columnar storage, providing complete ACID transaction guarantees.

🌟 Core Features

  • ✅ Full ACID Transactions - Atomicity/Consistency/Isolation/Durability guarantees based on Delta Log
  • ⚡ Vectorized Execution - Apache Arrow batch processing delivers 10-100x acceleration for analytical queries
  • 🔄 Merge-on-Read - Innovative MoR architecture reduces UPDATE/DELETE write amplification by 1000x
  • 📊 Intelligent Optimization - Z-Order multidimensional clustering, predicate pushdown, automatic compaction
  • 🕐 Time Travel - Complete version control and snapshot isolation, supporting historical data queries
  • 🔍 System Tables Bootstrap - Innovative SQL-queryable metadata system (sys.*)
  • 🎯 Dual Concurrency Control - Pessimistic + optimistic locks available, suitable for different deployment scenarios

📊 Performance Metrics

| Scenario | Performance Improvement | Description | |------|---------|------| | Vectorized Aggregation | 10-100x | GROUP BY + aggregation functions vs row-based execution | | Predicate Pushdown | 2-10x | Data skipping based on Min/Max statistics | | Z-Order Queries | 50-90% | File skip rate for multidimensional queries | | UPDATE Write Amplification | 1/1000 | MoR vs traditional Copy-on-Write | | Checkpoint Recovery | 10x | vs scanning all logs from the beginning |


🚀 Quick Start

System Requirements

  • Go 1.21+
  • Operating System: Linux/macOS/Windows
  • Memory: ≥4GB (8GB+ recommended)
  • Disk: ≥10GB available space

10-Second Installation

# Clone repository
git clone https://github.com/yyun543/minidb.git
cd minidb

# Install dependencies
go mod download

# Build binary
go build -o minidb ./cmd/server

# Start server
./minidb

The server will start on localhost:7205.

First Query

# Connect to MiniDB
nc localhost 7205

# Or use telnet
telnet localhost 7205
-- Create database and table
CREATE DATABASE ecommerce;
USE ecommerce;

CREATE TABLE products (
    id INT,
    name VARCHAR,
    price INT,
    category VARCHAR
);

-- Insert data
INSERT INTO products VALUES (1, 'Laptop', 999, 'Electronics');
INSERT INTO products VALUES (2, 'Mouse', 29, 'Electronics');
INSERT INTO products VALUES (3, 'Desk', 299, 'Furniture');

-- Vectorized analytical query
SELECT category, COUNT(*) as count, AVG(price) as avg_price
FROM products
GROUP BY category
HAVING count > 0
ORDER BY avg_price DESC;

-- Query transaction history (system table bootstrap feature)
SELECT version, operation, table_id, file_path
FROM sys.delta_log
ORDER BY version DESC
LIMIT 10;

📚 Core Architecture

Lakehouse Three-Layer Architecture

┌─────────────────────────────────────────────────────┐
│           SQL Layer (ANTLR4 Parser)                 │
│   DDL/DML/DQL · WHERE/JOIN/GROUP BY/ORDER BY        │
└─────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────┐
│        Execution Layer (Dual Engines)               │
│                                                     │
│  ┌─────────────────┐    ┌──────────────────────┐    │
│  │ Vectorized      │    │ Regular Executor     │    │
│  │ Executor        │    │ (Fallback)           │    │
│  │ (Arrow Batch)   │    │                      │    │
│  └─────────────────┘    └──────────────────────┘    │
│                                                     │
│         Cost-Based Optimizer (Statistics)           │
└─────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────┐
│         Storage Layer (Lakehouse)                   │
│                                                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────┐   │
│  │ Delta Log    │  │ Parquet      │  │ Object   │   │
│  │ Manager      │  │ Engine       │  │ Store    │   │
│  │ (ACID)       │  │ (Columnar)   │  │ (Local)  │   │
│  └──────────────┘  └──────────────┘  └──────────┘   │
│                                                     │
│  Features: MoR · Z-Order · Compaction · Pushdown    │
└─────────────────────────────────────────────────────┘

Delta Log Transaction Model

MiniDB implements two concurrency control mechanisms:

1. Pessimistic Lock Mode (Default)

type DeltaLog struct {
    entries    []LogEntry
    mu         sync.RWMutex  // Global read-write lock
    currentVer atomic.Int64
}
  • Use Case: Single-instance deployment, high-throughput writes
  • Advantages: Simple implementation, zero conflicts
  • Disadvantages: Doesn't support multi-client concurrency

2. Optimistic Lock Mode (Optional)

type OptimisticDeltaLog struct {
    conditionalStore ConditionalObjectStore
}

// Atomic operation: PUT if not exists
func (s *Store) PutIfNotExists(path string, data []byte) error
  • Use Case: Multi-client concurrency, cloud object storage
  • Advantages: High concurrency, no global locks
  • Disadvantages: Requires retry on conflict (default max 5 attempts)

Selecting Concurrency Mode:

// Enable optimistic locking
engine, _ := storage.NewParquetEngine(
    basePath,
    storage.WithOptimisticLock(true),
    storage.WithMaxRetries(5),
)

Storage File Structure

minidb_data/
├── sys/                          # System database
│   └── delta_log/
│       └── data/
│           └── *.parquet         # Transaction log persistence
│
├── ecommerce/                    # User database
│   ├── products/
│   │   └── data/
│   │       ├── products_xxx.parquet      # Base data files
│   │       ├── products_xxx_delta.parquet # Delta files (MoR)
│   │       └── zorder_xxx.parquet        # Z-Order optimized files
│   │
│   └── orders/
│       └── data/
│           └── *.parquet
│
└── logs/
    └── minidb.log               # Structured logs

💡 Core Features Explained

1. ACID Transaction Guarantees

MiniDB implements complete ACID properties through Delta Log:

-- Atomicity: Multi-row inserts either all succeed or all fail
BEGIN TRANSACTION;
INSERT INTO orders VALUES (1, 100, '2024-01-01');
INSERT INTO orders VALUES (2, 200, '2024-01-02');
COMMIT;  -- Atomic commit to Delta Log

-- Consistency: Constraint checking
CREATE UNIQUE INDEX idx_id ON products (id);
INSERT INTO products VALUES (1, 'Item1', 100);
INSERT INTO products VALUES (1, 'Item2', 200);  -- Violates unique constraint, rejected

-- Isolation: Snapshot isolation
-- Session 1: Reading snapshot version=10
-- Session 2: Concurrently writing to create version=11
-- Session 1 still reads consistent version=10 data

-- Durability: fsync guarantee
-- Data is immediately persisted to Parquet files
INSERT INTO products VALUES (3, 'Item3', 150);
-- After server crash and restart, data still exists

Test Coverage: test/delta_acid_test.go - 6 ACID scenario tests ✅ 100% passing

2. Merge-on-Read (MoR) Architecture

Traditional Copy-on-Write Problem:

UPDATE products SET price=1099 WHERE id=1;

Traditional approach:
1. Read 100MB Parquet file
2. Modify 1 row
3. Rewrite the entire 100MB file  ❌ 100MB write amplification

MiniDB MoR approach:
1. Write 1KB Delta file     ✅ Only 1KB written
2. Merge at read time

MoR Implementation Principle:

Product table query flow:
┌──────────────┐
│ Base Files   │  ← Base data (immutable)
│ 100MB        │
└──────────────┘
       +
┌──────────────┐
│ Delta Files  │  ← UPDATE/DELETE increments
│ 1KB          │
└──────────────┘
       ↓
   Read-Time
    Merge
       ↓
┌──────────────┐
│ Merged View  │  ← Latest data as seen by users
└──────────────┘

Code Example:

// internal/storage/merge_on_read.go
type MergeOnReadEngine struct {
    baseFiles  []ParquetFile   // Base files
    deltaFiles []DeltaFile     // Delta files
}

func (m *MergeOnReadEngine) Read() []Record {
    // 1. Read base files
    baseRecords := readBaseFiles(m.baseFiles)

    // 2. Apply delta updates
    for _, delta := range m.deltaFiles {
        baseRecords = applyDelta(baseRecords, delta)
    }

    return baseRecords
}

Performance Comparison: | Operation | Copy-on-Write | Merge-on-Read | Improvement Factor | |------|---------------|---------------|----------| | UPDATE 1 row (100MB file) | 100MB written | 1KB written | 100,000x | | DELETE 10 rows (1GB file) | 1GB rewritten | 10KB written | 100,000x | | Read latency | 0ms | 1-5ms | Slightly increased |

Test Coverage: test/merge_on_read_test.go - 3 MoR scenario tests ✅

3. Z-Order Multidimensional Clustering

Problem: Network security log query scenario

-- Scenario 1: Query by source IP
SELECT * FROM network_logs WHERE source_ip = '192.168.1.100';

-- Scenario 2: Query by destination IP
SELECT * FROM network_logs WHERE dest_ip = '10.0.0.50';

-- Scenario 3: Query by time
SELECT * FROM network_logs WHERE timestamp > '2024-01-01';

Traditional Single-Dimension Sorting:

Related Skills

View on GitHub
GitHub Stars77
CategoryData
Updated25d ago
Forks8

Languages

Go

Security Score

80/100

Audited on Mar 2, 2026

No findings