
KaflowSQL

💬 Streaming SQL engine for Kafka with real-time joins, schema validation and DuckDB-powered transformations.


The Agony of Streaming Joins is Over

🚀 KaflowSQL v1.0.0 - Time-windowed streaming ETL with SQL-native temporal joins


Picture this: It's 3 PM on a Friday. Your product manager walks over with "a quick request" – they need real-time user personalization live by Monday.

You have user_actions in one Kafka topic and customer_profiles in another. In a relational database, this is a trivial LEFT JOIN. In the world of streaming, it's the beginning of a multi-week project involving Apache Flink or Spark Streaming, complex state management, and a new set of services to operate and monitor.

What if joining streams were as simple as writing SQL? KaflowSQL turns that typical Friday afternoon crisis into a 30-minute task. The entire pipeline is defined in a single YAML file that reads from Kafka, performs the transformations, and writes to a new topic:

# That's it. Your entire streaming pipeline.
name: user_personalization
query: |
  SELECT 
    a.user_id,
    a.action,
    a.timestamp,
    p.subscription_tier,
    p.full_name
  FROM user_actions a
  LEFT JOIN customer_profiles p ON a.user_id = p.user_id

output:
  topic: personalized_actions
  format: avro
  key: user_id

KaflowSQL transforms days into minutes. This isn't syntactic sugar – this declaration is the source of truth for a self-contained, high-performance engine that handles everything: consuming from Kafka, managing join state, handling out-of-order events, and producing results with exactly-once semantics.

Performance That Speaks for Itself

Benchmark numbers from sustained workloads:

  • 100K+ events/second throughput per pipeline
  • Sub-millisecond join processing latency
  • Minimal GC impact with Go's low-latency concurrent collector and zero-copy design
  • Single binary deployment – no cluster management overhead

Tested on a Mac M3 Max; results on your production hardware will vary.

From Complex to Simple: Real Use Cases

🔒 Fraud Detection (Financial Services)

Before: Multi-week Flink project with complex state management
After: A single YAML file, deployed in minutes

name: fraud_detection_prep
query: |
  SELECT 
    t.transaction_id,
    t.amount,
    u.risk_score,
    u.account_age_days,
    m.merchant_category,
    m.country
  FROM transactions t
  LEFT JOIN users u ON t.user_id = u.user_id
  LEFT JOIN merchants m ON t.merchant_id = m.merchant_id
  WHERE t.amount > 100

output:
  topic: fraud_detection_input
  format: json
  key: transaction_id

emission:
  type: smart
  require: [risk_score, merchant_category]

🛒 Real-Time Personalization (E-commerce)

Challenge: Join user clickstreams with product catalog data
Solution: Instant personalization with configurable data quality controls

name: personalization_engine
query: |
  SELECT 
    c.user_id,
    c.page_view,
    c.timestamp,
    u.subscription_tier,
    p.category,
    p.price,
    h.last_purchase_date
  FROM clickstreams c
  LEFT JOIN user_profiles u ON c.user_id = u.user_id
  LEFT JOIN products p ON c.product_id = p.product_id
  LEFT JOIN purchase_history h ON c.user_id = h.user_id

output:
  topic: personalized_recommendations
  format: avro
  key: user_id

state:
  events:
    default_ttl: 1h
    overrides:
      clickstreams: 30m        # High-volume, short retention
      purchase_history: 24h    # Purchase patterns change slowly
  dimensions:
    - user_profiles            # User demographics rarely change
    - products                # Product catalog is mostly static

🏭 IoT Monitoring (Manufacturing)

Challenge: Combine sensor readings with device metadata
Solution: High-throughput processing with minimal latency

name: sensor_enrichment
query: |
  SELECT 
    s.device_id,
    s.sensor_value,
    s.timestamp,
    d.location,
    d.device_type,
    c.threshold_config
  FROM sensor_readings s
  LEFT JOIN devices d ON s.device_id = d.device_id
  LEFT JOIN device_config c ON s.device_id = c.device_id

output:
  topic: enriched_sensor_data
  format: avro
  key: device_id

emission:
  type: immediate  # High throughput, partial data acceptable
  
state:
  events:
    default_ttl: 5m  # High-volume, short retention
  dimensions:
    - devices
    - device_config

Why KaflowSQL? The Architecture Difference

🎯 Specialized, Not Generic

Most streaming frameworks try to do everything. KaflowSQL does one thing exceptionally well: stateful SQL joins on Kafka streams.

Co-located State for Speed

  • Embedded RocksDB: State and compute on the same node
  • Sub-millisecond lookups: No network latency in the hot path
  • Hash-based sharding: 256 partitions for concurrent processing
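A minimal sketch of this co-location scheme in Go (the 256-shard count comes from the bullet above; the FNV-1a hash and all names are illustrative assumptions, not KaflowSQL's actual code):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// numShards mirrors the 256 hash partitions mentioned above.
const numShards = 256

// shardFor maps a join key to a state shard so that reads and writes
// for different keys can proceed on independent shards concurrently.
func shardFor(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % numShards
}

func main() {
	// The same key always lands on the same shard, so all state for
	// one join key stays co-located with its compute.
	fmt.Println(shardFor("user-42") == shardFor("user-42")) // true
	fmt.Println(shardFor("user-42") < numShards)            // true
}
```

Because the shard is a pure function of the key, no coordination service is needed to find a key's state, which is what keeps lookups off the network in the hot path.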

🧠 Temporal Stream Processing Engine

  • Time-windowed joins with configurable retention and late-arrival handling
  • Intelligent state management separating fast streams from slow dimensions
  • DuckDB integration for complex SQL operations and aggregations
  • Zero-copy architecture minimizing allocations and memory pressure

Sophisticated Windowing

  • Business-driven TTL: Configure retention from minutes to days based on data velocity
  • Per-topic temporal control: High-volume events (30m) vs user sessions (4h) vs static lookups (never expire)
  • Late-arrival handling: Process out-of-order events within configurable time windows
  • Automatic temporal cleanup: Memory-efficient state management prevents bloat
  • Dimension vs event separation: Static reference data never expires, events have business-appropriate TTL

🎛️ Smart Emission Control

Choose between completeness and latency:

  • Smart mode: Emit only when required fields are populated
  • Immediate mode: Emit as soon as any data is available
  • Fine-grained TTL: Per-topic expiration with dimension tables that never expire
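The smart-versus-immediate trade-off can be sketched as follows (a hypothetical helper; the map-based row shape and field names are assumptions, not KaflowSQL's internals):

```go
package main

import "fmt"

// shouldEmit applies the "smart" emission rule described above: a joined
// row goes out only once every required field has been populated by a
// matching join. In "immediate" mode the row would be emitted without
// this check, accepting partial data in exchange for lower latency.
func shouldEmit(row map[string]any, required []string) bool {
	for _, field := range required {
		if v, ok := row[field]; !ok || v == nil {
			return false // still waiting on this side of the join
		}
	}
	return true
}

func main() {
	row := map[string]any{"user_id": "u1", "action": "click"}
	fmt.Println(shouldEmit(row, []string{"subscription_tier"})) // false: profile not joined yet
	row["subscription_tier"] = "pro"
	fmt.Println(shouldEmit(row, []string{"subscription_tier"})) // true
}
```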

🚀 Get Started in 2 Minutes

Option 1: Docker (Recommended)

# Create your pipeline
cat > pipelines/my_pipeline.yaml << EOF
name: user_enrichment
query: |
  SELECT 
    e.user_id,
    e.event_type,
    e.timestamp,
    p.subscription_tier,
    p.country
  FROM user_events e
  LEFT JOIN user_profiles p ON e.user_id = p.user_id

output:
  topic: enriched_events
  format: json
  key: user_id

emission:
  type: smart
  require: [subscription_tier]

state:
  events:
    default_ttl: 1h          # Business-appropriate retention
    overrides:
      user_events: 2h        # Longer for user journey tracking  
  dimensions:
    - user_profiles          # Static reference data never expires
EOF

# Configure Kafka connection
cat > config.yaml << EOF
kafka:
  brokers: ["your-kafka:9092"]
  schemaRegistry: "http://your-registry:8081"
  useAvro: true

state:
  rocksdb:
    path: /data/rocksdb
EOF

# Run KaflowSQL
docker run --rm \
  -v $(pwd)/pipelines:/app/pipelines:ro \
  -v $(pwd)/config.yaml:/app/config.yaml:ro \
  siqueiraa/kaflowsql:latest

That's it! Your streaming pipeline is now processing real-time joins.

Option 2: Native Binary

# Download latest release
wget https://github.com/siqueiraa/KaflowSQL/releases/latest/download/kaflowsql-linux-amd64.tar.gz
tar -xzf kaflowsql-linux-amd64.tar.gz

# Configure and run
./engine config.yaml pipelines/

When to Choose KaflowSQL

Perfect For:

  • Stream enrichment: Join events with lookup data
  • Real-time personalization: User events + profile data
  • Fraud detection: Transaction + risk score enrichment
  • IoT processing: Sensor readings + device metadata
  • Customer 360: Multi-stream customer views
  • Teams that value: Developer velocity and operational simplicity

🤔 Consider Alternatives When:

  • You need complex User-Defined Functions (UDFs) in Java/Scala
  • You require distributed machine learning within your stream
  • Your use case demands unlimited horizontal scale over simplicity

Real-World Performance

Financial Services Client

  • Volume: 500K transactions/minute
  • Latency: P99 < 2ms join processing
  • Deployment: Single Docker container per region
  • Ops overhead: Reduced from 40hrs/week to 2hrs/week

E-commerce Platform

  • Volume: 1M+ page views/hour enriched with user profiles
  • Memory usage: 2GB for 10M user profiles (dimension table)
  • State recovery: Full cluster restart in < 30 seconds
  • Team velocity: New pipelines from idea to production in the same day

Windowing Strategies for Different Data Patterns

Multi-Velocity Stream Processing

name: customer_360_windowed
window: 24h

query: |
  SELECT 
    e.user_id,
    e.event_type,
    e.timestamp,
    p.email,
    p.tier,
    t.last_purchase_amount,
    s.support_tickets
  FROM events e
  LEFT JOIN profiles p ON e.user_id = p.user_id
  LEFT JOIN transactions t ON e.user_id = t.user_id  
  LEFT JOIN support s ON e.user_id = s.user_id

state:
  events:
    default_ttl: 1h           # Most events are short-lived
    overrides:
      events: 2h              # User interactions need longer correlation
      transactions: 24h       # Financial data patterns evolve slowly
      support: 7d             # Support tickets have long resolution cycles
  dimensions:
    - profiles                # User profiles are slow-changing reference data

emission:
  type: smart
  require: [email, tier]      # Wait for essential profile data

Why This Works:

  • High-volume streams (events) get short TTL for memory efficiency
  • Business-critical data (transactions) gets longer retention for pattern detection
  • Slow-changing dimensions (profiles) never expire, always available for enrichment
  • Late arrivals are handled within each stream's appropriate time window

Complete Configuration Reference

Pipeline Definition

# Required: Pipeline identification
name: my_pipeline
window: 1h                     # Optional: Global TTL override

# Required: SQL query defining the join logic
query: |
  SELECT ...