MiniDB
<div align="center">
High-performance Lakehouse Database Engine · Built on Apache Arrow and Parquet
English | 中文 | Quick Start | Documentation | Architecture
</div>

📖 Project Overview
MiniDB is a production-grade Lakehouse database engine that implements 72% of the core capabilities from the Delta Lake paper (PVLDB 2020), and achieves a 1000x write amplification improvement for UPDATE/DELETE operations beyond what's described in the paper. The project is written in Go, built on the Apache Arrow vectorized execution engine and Parquet columnar storage, providing complete ACID transaction guarantees.
🌟 Core Features
- ✅ Full ACID Transactions - Atomicity/Consistency/Isolation/Durability guarantees based on Delta Log
- ⚡ Vectorized Execution - Apache Arrow batch processing delivers 10-100x acceleration for analytical queries
- 🔄 Merge-on-Read - Innovative MoR architecture reduces UPDATE/DELETE write amplification by 1000x
- 📊 Intelligent Optimization - Z-Order multidimensional clustering, predicate pushdown, automatic compaction
- 🕐 Time Travel - Complete version control and snapshot isolation, supporting historical data queries
- 🔍 System Tables Bootstrap - Innovative SQL-queryable metadata system (sys.*)
- 🎯 Dual Concurrency Control - Pessimistic + optimistic locks available, suitable for different deployment scenarios
📊 Performance Metrics
| Scenario | Performance Improvement | Description |
|------|---------|------|
| Vectorized Aggregation | 10-100x | GROUP BY + aggregation functions vs row-based execution |
| Predicate Pushdown | 2-10x | Data skipping based on Min/Max statistics |
| Z-Order Queries | 50-90% | File skip rate for multidimensional queries |
| UPDATE Write Amplification | 1/1000 | MoR vs traditional Copy-on-Write |
| Checkpoint Recovery | 10x | vs scanning all logs from the beginning |
🚀 Quick Start
System Requirements
- Go 1.21+
- Operating System: Linux/macOS/Windows
- Memory: ≥4GB (8GB+ recommended)
- Disk: ≥10GB available space
10-Second Installation
```bash
# Clone repository
git clone https://github.com/yyun543/minidb.git
cd minidb

# Install dependencies
go mod download

# Build binary
go build -o minidb ./cmd/server

# Start server
./minidb
```
The server will start on localhost:7205.
First Query
```bash
# Connect to MiniDB
nc localhost 7205

# Or use telnet
telnet localhost 7205
```
```sql
-- Create database and table
CREATE DATABASE ecommerce;
USE ecommerce;

CREATE TABLE products (
    id INT,
    name VARCHAR,
    price INT,
    category VARCHAR
);

-- Insert data
INSERT INTO products VALUES (1, 'Laptop', 999, 'Electronics');
INSERT INTO products VALUES (2, 'Mouse', 29, 'Electronics');
INSERT INTO products VALUES (3, 'Desk', 299, 'Furniture');

-- Vectorized analytical query
SELECT category, COUNT(*) AS count, AVG(price) AS avg_price
FROM products
GROUP BY category
HAVING count > 0
ORDER BY avg_price DESC;

-- Query transaction history (system table bootstrap feature)
SELECT version, operation, table_id, file_path
FROM sys.delta_log
ORDER BY version DESC
LIMIT 10;
```
📚 Core Architecture
Lakehouse Three-Layer Architecture
```
┌─────────────────────────────────────────────────────┐
│              SQL Layer (ANTLR4 Parser)              │
│     DDL/DML/DQL · WHERE/JOIN/GROUP BY/ORDER BY      │
└─────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────┐
│           Execution Layer (Dual Engines)            │
│                                                     │
│  ┌─────────────────┐    ┌──────────────────────┐    │
│  │   Vectorized    │    │   Regular Executor   │    │
│  │    Executor     │    │      (Fallback)      │    │
│  │  (Arrow Batch)  │    │                      │    │
│  └─────────────────┘    └──────────────────────┘    │
│                                                     │
│        Cost-Based Optimizer (Statistics)            │
└─────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────┐
│              Storage Layer (Lakehouse)              │
│                                                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────┐   │
│  │  Delta Log   │  │   Parquet    │  │  Object  │   │
│  │   Manager    │  │    Engine    │  │  Store   │   │
│  │   (ACID)     │  │  (Columnar)  │  │  (Local) │   │
│  └──────────────┘  └──────────────┘  └──────────┘   │
│                                                     │
│  Features: MoR · Z-Order · Compaction · Pushdown    │
└─────────────────────────────────────────────────────┘
```
Delta Log Transaction Model
MiniDB implements two concurrency control mechanisms:
1. Pessimistic Lock Mode (Default)
```go
type DeltaLog struct {
    entries    []LogEntry
    mu         sync.RWMutex // Global read-write lock
    currentVer atomic.Int64
}
```
- Use Case: Single-instance deployment, high-throughput writes
- Advantages: Simple implementation, zero conflicts
- Disadvantages: Doesn't support multi-client concurrency
2. Optimistic Lock Mode (Optional)
```go
type OptimisticDeltaLog struct {
    conditionalStore ConditionalObjectStore
}

// Atomic operation: PUT if not exists
func (s *Store) PutIfNotExists(path string, data []byte) error
```
- Use Case: Multi-client concurrency, cloud object storage
- Advantages: High concurrency, no global locks
- Disadvantages: Requires retry on conflict (default max 5 attempts)
Selecting Concurrency Mode:
```go
// Enable optimistic locking
engine, _ := storage.NewParquetEngine(
    basePath,
    storage.WithOptimisticLock(true),
    storage.WithMaxRetries(5),
)
```
Storage File Structure
```
minidb_data/
├── sys/                                  # System database
│   └── delta_log/
│       └── data/
│           └── *.parquet                 # Transaction log persistence
│
├── ecommerce/                            # User database
│   ├── products/
│   │   └── data/
│   │       ├── products_xxx.parquet        # Base data files
│   │       ├── products_xxx_delta.parquet  # Delta files (MoR)
│   │       └── zorder_xxx.parquet          # Z-Order optimized files
│   │
│   └── orders/
│       └── data/
│           └── *.parquet
│
└── logs/
    └── minidb.log                        # Structured logs
```
💡 Core Features Explained
1. ACID Transaction Guarantees
MiniDB implements complete ACID properties through Delta Log:
```sql
-- Atomicity: Multi-row inserts either all succeed or all fail
BEGIN TRANSACTION;
INSERT INTO orders VALUES (1, 100, '2024-01-01');
INSERT INTO orders VALUES (2, 200, '2024-01-02');
COMMIT; -- Atomic commit to Delta Log

-- Consistency: Constraint checking
CREATE UNIQUE INDEX idx_id ON products (id);
INSERT INTO products VALUES (1, 'Item1', 100);
INSERT INTO products VALUES (1, 'Item2', 200); -- Violates unique constraint, rejected

-- Isolation: Snapshot isolation
-- Session 1: Reading snapshot version=10
-- Session 2: Concurrently writing to create version=11
-- Session 1 still reads consistent version=10 data

-- Durability: fsync guarantee
-- Data is immediately persisted to Parquet files
INSERT INTO products VALUES (3, 'Item3', 150);
-- After server crash and restart, data still exists
```
Test Coverage: test/delta_acid_test.go - 6 ACID scenario tests ✅ 100% passing
2. Merge-on-Read (MoR) Architecture
Traditional Copy-on-Write Problem:
```sql
UPDATE products SET price = 1099 WHERE id = 1;
```

```
Traditional approach:
1. Read 100MB Parquet file
2. Modify 1 row
3. Rewrite the entire 100MB file  ❌ 100MB write amplification

MiniDB MoR approach:
1. Write 1KB Delta file           ✅ Only 1KB written
2. Merge at read time
```
MoR Implementation Principle:
```
Product table query flow:

┌──────────────┐
│  Base Files  │  ← Base data (immutable)
│    100MB     │
└──────────────┘
        +
┌──────────────┐
│ Delta Files  │  ← UPDATE/DELETE increments
│     1KB      │
└──────────────┘
        ↓
   Read-Time
      Merge
        ↓
┌──────────────┐
│ Merged View  │  ← Latest data as seen by users
└──────────────┘
```
Code Example:
```go
// internal/storage/merge_on_read.go
type MergeOnReadEngine struct {
    baseFiles  []ParquetFile // Base files
    deltaFiles []DeltaFile   // Delta files
}

func (m *MergeOnReadEngine) Read() []Record {
    // 1. Read base files
    baseRecords := readBaseFiles(m.baseFiles)

    // 2. Apply delta updates
    for _, delta := range m.deltaFiles {
        baseRecords = applyDelta(baseRecords, delta)
    }
    return baseRecords
}
```
Performance Comparison:

| Operation | Copy-on-Write | Merge-on-Read | Improvement Factor |
|------|---------------|---------------|----------|
| UPDATE 1 row (100MB file) | 100MB written | 1KB written | 100,000x |
| DELETE 10 rows (1GB file) | 1GB rewritten | 10KB written | 100,000x |
| Read latency | 0ms | 1-5ms | Slightly increased |
Test Coverage: test/merge_on_read_test.go - 3 MoR scenario tests ✅
3. Z-Order Multidimensional Clustering
Problem: Network security log query scenario
```sql
-- Scenario 1: Query by source IP
SELECT * FROM network_logs WHERE source_ip = '192.168.1.100';

-- Scenario 2: Query by destination IP
SELECT * FROM network_logs WHERE dest_ip = '10.0.0.50';

-- Scenario 3: Query by time
SELECT * FROM network_logs WHERE timestamp > '2024-01-01';
```
Traditional Single-Dimension Sorting: