KaflowSQL
💬 Streaming SQL engine for Kafka with real-time joins, schema validation and DuckDB-powered transformations.
Install / Use
/learn @siqueiraa/KaflowSQLREADME
The Agony of Streaming Joins is Over
🚀 KaflowSQL v1.0.0 - Time-windowed streaming ETL with SQL-native temporal joins
Picture this: It's 3 PM on a Friday. Your product manager walks over with "a quick request" – they need real-time user personalization live by Monday.
You have user_actions in one Kafka topic and customer_profiles in another. In a relational database, this is a trivial LEFT JOIN. In the world of streaming, it's the beginning of a multi-week project involving Apache Flink or Spark Streaming, complex state management, and a new set of services to operate and monitor.
KaflowSQL turns that typical Friday afternoon crisis into a 30-minute task. The entire pipeline is defined in a single YAML file — it reads from Kafka, performs the transformations, and writes to a new topic:
What if joining streams was as simple as writing SQL?
# That's it. Your entire streaming pipeline.
name: user_personalization
query: |
SELECT
a.user_id,
a.action,
a.timestamp,
p.subscription_tier,
p.full_name
FROM user_actions a
LEFT JOIN customer_profiles p ON a.user_id = p.user_id
output:
topic: personalized_actions
format: avro
key: user_id
KaflowSQL transforms days into minutes. This isn't syntactic sugar – this declaration is the source of truth for a self-contained, high-performance engine that handles everything: consuming from Kafka, managing join state, handling out-of-order events, and producing results with exactly-once semantics.
Performance That Speaks for Itself
Real numbers from production workloads:
- 100K+ events/second throughput per pipeline
- Sub-millisecond join processing latency
- Minimal GC impact with Go's low-latency concurrent collector and zero-copy design
- Single binary deployment – no cluster management overhead
Tested on Mac M3 MAX - your production servers will be even faster.
From Complex to Simple: Real Use Cases
🔒 Fraud Detection (Financial Services)
Before: Multi-week Flink project with complex state management
After: Single YAML file deploying in minutes
name: fraud_detection_prep
query: |
SELECT
t.transaction_id,
t.amount,
u.risk_score,
u.account_age_days,
m.merchant_category,
m.country
FROM transactions t
LEFT JOIN users u ON t.user_id = u.user_id
LEFT JOIN merchants m ON t.merchant_id = m.merchant_id
WHERE t.amount > 100
output:
topic: fraud_detection_input
format: json
key: transaction_id
emission:
type: smart
require: [risk_score, merchant_category]
🛒 Real-Time Personalization (E-commerce)
Challenge: Join user clickstreams with product catalog data
Solution: Instant personalization with configurable data quality controls
name: personalization_engine
query: |
SELECT
c.user_id,
c.page_view,
c.timestamp,
u.subscription_tier,
p.category,
p.price,
h.last_purchase_date
FROM clickstreams c
LEFT JOIN user_profiles u ON c.user_id = u.user_id
LEFT JOIN products p ON c.product_id = p.product_id
LEFT JOIN purchase_history h ON c.user_id = h.user_id
output:
topic: personalized_recommendations
format: avro
key: user_id
state:
events:
default_ttl: 1h
overrides:
clickstreams: 30m # High-volume, short retention
purchase_history: 24h # Purchase patterns change slowly
dimensions:
- user_profiles # User demographics rarely change
- products # Product catalog is mostly static
🏭 IoT Monitoring (Manufacturing)
Challenge: Combine sensor readings with device metadata
Solution: High-throughput processing with minimal latency
name: sensor_enrichment
query: |
SELECT
s.device_id,
s.sensor_value,
s.timestamp,
d.location,
d.device_type,
c.threshold_config
FROM sensor_readings s
LEFT JOIN devices d ON s.device_id = d.device_id
LEFT JOIN device_config c ON s.device_id = c.device_id
output:
topic: enriched_sensor_data
format: avro
key: device_id
emission:
type: immediate # High throughput, partial data acceptable
state:
events:
default_ttl: 5m # High-volume, short retention
dimensions:
- devices
- device_config
Why KaflowSQL? The Architecture Difference
🎯 Specialized, Not Generic
Most streaming frameworks try to do everything. KaflowSQL does one thing exceptionally well: stateful SQL joins on Kafka streams.
⚡ Co-located State for Speed
- Embedded RocksDB: State and compute on the same node
- Sub-millisecond lookups: No network latency in the hot path
- Hash-based sharding: 256 partitions for concurrent processing
🧠 Temporal Stream Processing Engine
- Time-windowed joins with configurable retention and late-arrival handling
- Intelligent state management separating fast streams from slow dimensions
- DuckDB integration for complex SQL operations and aggregations
- Zero-copy architecture minimizing allocations and memory pressure
⏰ Sophisticated Windowing
- Business-driven TTL: Configure retention from minutes to days based on data velocity
- Per-topic temporal control: High-volume events (30m) vs user sessions (4h) vs static lookups (never expire)
- Late-arrival handling: Process out-of-order events within configurable time windows
- Automatic temporal cleanup: Memory-efficient state management prevents bloat
- Dimension vs event separation: Static reference data never expires, events have business-appropriate TTL
🎛️ Smart Emission Control
Choose between completeness and latency:
- Smart mode: Emit only when required fields are populated
- Immediate mode: Emit as soon as any data is available
- Fine-grained TTL: Per-topic expiration with dimension tables that never expire
🚀 Get Started in 2 Minutes
Option 1: Docker (Recommended)
# Create your pipeline
cat > pipelines/my_pipeline.yaml << EOF
name: user_enrichment
query: |
SELECT
e.user_id,
e.event_type,
e.timestamp,
p.subscription_tier,
p.country
FROM user_events e
LEFT JOIN user_profiles p ON e.user_id = p.user_id
output:
topic: enriched_events
format: json
key: user_id
emission:
type: smart
require: [subscription_tier]
state:
events:
default_ttl: 1h # Business-appropriate retention
overrides:
user_events: 2h # Longer for user journey tracking
dimensions:
- user_profiles # Static reference data never expires
EOF
# Configure Kafka connection
cat > config.yaml << EOF
kafka:
brokers: ["your-kafka:9092"]
schemaRegistry: "http://your-registry:8081"
useAvro: true
state:
rocksdb:
path: /data/rocksdb
EOF
# Run KaflowSQL
docker run --rm \
-v $(pwd)/pipelines:/app/pipelines:ro \
-v $(pwd)/config.yaml:/app/config.yaml:ro \
siqueiraa/kaflowsql:latest
That's it! Your streaming pipeline is now processing real-time joins.
Option 2: Native Binary
# Download latest release
wget https://github.com/siqueiraa/KaflowSQL/releases/latest/download/kaflowsql-linux-amd64.tar.gz
tar -xzf kaflowsql-linux-amd64.tar.gz
# Configure and run
./engine config.yaml pipelines/
When to Choose KaflowSQL
✅ Perfect For:
- Stream enrichment: Join events with lookup data
- Real-time personalization: User events + profile data
- Fraud detection: Transaction + risk score enrichment
- IoT processing: Sensor readings + device metadata
- Customer 360: Multi-stream customer views
- Teams that value: Developer velocity and operational simplicity
🤔 Consider Alternatives When:
- You need complex User-Defined Functions (UDFs) in Java/Scala
- You require distributed machine learning within your stream
- Your use case demands unlimited horizontal scale over simplicity
Real-World Performance
Financial Services Client
- Volume: 500K transactions/minute
- Latency: P99 < 2ms join processing
- Deployment: Single Docker container per region
- Ops overhead: Reduced from 40hrs/week to 2hrs/week
E-commerce Platform
- Volume: 1M+ page views/hour enriched with user profiles
- Memory usage: 2GB for 10M user profiles (dimension table)
- State recovery: Full cluster restart in < 30 seconds
- Team velocity: New pipelines from idea to production in same day
Windowing Strategies for Different Data Patterns
Multi-Velocity Stream Processing
name: customer_360_windowed
window: 24h
query: |
SELECT
e.user_id,
e.event_type,
e.timestamp,
p.email,
p.tier,
t.last_purchase_amount,
s.support_tickets
FROM events e
LEFT JOIN profiles p ON e.user_id = p.user_id
LEFT JOIN transactions t ON e.user_id = t.user_id
LEFT JOIN support s ON e.user_id = s.user_id
state:
events:
default_ttl: 1h # Most events are short-lived
overrides:
events: 2h # User interactions need longer correlation
transactions: 24h # Financial data patterns evolve slowly
support: 7d # Support tickets have long resolution cycles
dimensions:
- profiles # User profiles are slow-changing reference data
emission:
type: smart
require: [email, tier] # Wait for essential profile data
Why This Works:
- High-volume streams (events) get short TTL for memory efficiency
- Business-critical data (transactions) gets longer retention for pattern detection
- Slow-changing dimensions (profiles) never expire, always available for enrichment
- Late arrivals are handled within each stream's appropriate time window
Complete Configuration Reference
Pipeline Definition
# Required: Pipeline identification
name: my_pipeline
window: 1h # Optional: Global TTL override
# Required: SQL query defining the
