# 🦁 Roar
Roar is a high-performance Arrow Flight gateway for Apache Kafka. It consumes messages from Kafka topics, converts them to Apache Arrow RecordBatches, and exposes them as ephemeral Arrow Flight streams.
## Features
- Real-time Kafka Ingestion: Stream data directly from Kafka topics
- Arrow Flight Integration: Expose topics as Arrow Flight streams
- Ephemeral Streams: Efficient memory management with TTL controls
- Intelligent Back Pressure: Built-in stream buffer management and memory monitoring
- DuckDB Persistence: Optionally persist streams to DuckDB tables
- Prometheus Metrics: Comprehensive metrics for monitoring stream health
- Profiling Support: Built-in pprof for performance analysis
## Installation

```bash
# Clone the repository
git clone https://github.com/TFMV/roar.git
cd roar

# Install dependencies
go mod tidy

# Build the binary
go build -o roar
```
## Quick Start

```bash
# Start the server with example topics
./roar serve --brokers localhost:9092 --topics logs,metrics --batch-size 1024 --ttl 60s --buffer-limit 100MB

# Check available topics
./roar client --server grpc://localhost:8080

# View data from a specific topic
./roar client --server grpc://localhost:8080 --topic logs --limit 10
```
## Usage

### Server

The `serve` command starts the Roar server:

```bash
./roar serve [flags]
```

#### Flags
| Flag | Description | Default |
|------|-------------|---------|
| --brokers | Kafka bootstrap servers | localhost:9092 |
| --topics | Comma-separated list of topics | (required) |
| --schema-registry | Avro/Protobuf schema registry URI | |
| --batch-size | Max records per RecordBatch | 1024 |
| --ttl | Stream TTL | 60s |
| --buffer-limit | Max in-memory bytes per stream | 100MB |
| --port | Arrow Flight server port | 8080 |
| --persist | Whether to write to DuckDB | false |
| --db-path | Path to DuckDB database file | roar.duckdb |
| --metrics | Enable Prometheus metrics | false |
| --metrics-addr | Metrics server address | :9090 |
| --pprof | Enable pprof profiling | false |
| --pprof-addr | Pprof server address | :6060 |
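
The `--buffer-limit` flag takes a human-readable size such as `100MB`. A minimal sketch of how such a value could be parsed into bytes (the parsing rules here are an assumption for illustration, not Roar's actual flag handling):

```python
import re

# Hypothetical size parser accepting values like "100MB", "512KB", "1GB".
# Roar's real CLI parsing may differ; this only illustrates the idea.
UNITS = {"B": 1, "KB": 1 << 10, "MB": 1 << 20, "GB": 1 << 30}

def parse_size(value: str) -> int:
    match = re.fullmatch(r"(\d+)\s*([KMG]?B)", value.strip(), re.IGNORECASE)
    if not match:
        raise ValueError(f"invalid size: {value!r}")
    number, unit = match.groups()
    return int(number) * UNITS[unit.upper()]

print(parse_size("100MB"))  # 104857600
```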
### Client

The `client` command connects to a Roar server and retrieves data:

```bash
./roar client [flags]
```

#### Flags
| Flag | Description | Default |
|------|-------------|---------|
| --server | Flight server address | grpc://localhost:8080 |
| --topic | Topic to fetch (if not specified, lists available topics) | |
| --output | Output format (table or csv) | table |
| --limit | Maximum number of rows to display | 10 |
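
The `--output` flag switches between a plain table and CSV. A rough sketch of that difference using hypothetical row data (not Roar's actual rendering code, whose column layout may differ):

```python
import csv
import io

# Hypothetical decoded rows as the client might hold them after reading Arrow data.
rows = [{"level": "info", "msg": "started"}, {"level": "warn", "msg": "slow"}]

def to_csv(rows):
    """Render rows as CSV, header first (the --output csv style)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_table(rows):
    """Render rows as an aligned text table (the --output table style)."""
    headers = list(rows[0].keys())
    widths = [max(len(h), *(len(str(r[h])) for r in rows)) for h in headers]
    lines = [" | ".join(h.ljust(w) for h, w in zip(headers, widths))]
    for r in rows:
        lines.append(" | ".join(str(r[h]).ljust(w) for h, w in zip(headers, widths)))
    return "\n".join(lines)

print(to_csv(rows))
print(to_table(rows))
```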
## Architecture

Roar consists of several key components:

- Kafka Ingest Layer: Consumes messages from Kafka topics using segmentio/kafka-go
- Arrow Conversion Layer: Transforms messages into Arrow RecordBatches using Arrow v18
- Flight Gateway: Serves Arrow data via the Flight protocol
- Stream Manager: Manages ephemeral streams with TTL and memory limits
- Back Pressure System: Monitors memory usage and applies back pressure when needed
- DuckDB Sink: Optional persistence to DuckDB tables (v2 API compatible)
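
The interplay of the Stream Manager's registry and TTL expirer can be sketched with a simplified model (single-threaded Python for illustration; Roar itself is written in Go, holds Arrow RecordBatches rather than strings, and the names here are hypothetical):

```python
import time

class StreamRegistry:
    """Toy model of an ephemeral stream registry with TTL-based expiry."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.streams = {}  # topic -> (batches, last_access_timestamp)

    def append(self, topic, batch, now=None):
        now = time.monotonic() if now is None else now
        batches, _ = self.streams.get(topic, ([], now))
        batches.append(batch)
        self.streams[topic] = (batches, now)

    def get(self, topic, now=None):
        now = time.monotonic() if now is None else now
        batches, _ = self.streams[topic]
        self.streams[topic] = (batches, now)  # access refreshes the TTL
        return batches

    def expire(self, now=None):
        """Drop streams idle longer than the TTL (the TTL expirer's job)."""
        now = time.monotonic() if now is None else now
        for topic in [t for t, (_, ts) in self.streams.items() if now - ts > self.ttl]:
            del self.streams[topic]

reg = StreamRegistry(ttl_seconds=60)
reg.append("logs", "batch-0", now=0.0)
reg.expire(now=61.0)          # idle for 61s > 60s TTL, so the stream is dropped
print("logs" in reg.streams)  # False
```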
```mermaid
flowchart TD
    %% Kafka Topics
    subgraph KAFKA["Apache Kafka"]
        K1["Topic: logs"]
        K2["Topic: metrics"]
        K3["Topic: events"]
    end

    %% Roar Gateway
    subgraph ROAR["Roar Gateway"]
        subgraph INGEST["Kafka Ingest Layer"]
            CON[Kafka Consumer]
            BP[Backpressure Controller]
        end
        subgraph CONVERT["Arrow Conversion Layer"]
            CONV[Arrow Converter]
            SCHEMA[Schema Resolver]
        end
        subgraph STREAM["Ephemeral Stream Manager"]
            REG[Stream Registry]
            TTL[TTL Expirer]
            BUF[Buffer Manager]
        end
        subgraph FLIGHT["Arrow Flight Server"]
            SERVER[Flight Server]
            LIST[ListFlights]
            GET[DoGet]
        end
        subgraph DUCKDB["Optional Persistence"]
            DWRITE[DuckDB Writer]
        end
        subgraph MONITOR["Observability"]
            PROM[Prometheus Metrics]
            PPROF[pprof Profiling]
        end
    end

    %% Client Applications
    subgraph CLIENTS["Flight Clients"]
        PY[PyArrow Client]
        DD[DuckDB Connector]
        GO[Go Flight Client]
    end

    %% Kafka to Roar Ingest
    K1 --> CON
    K2 --> CON
    K3 --> CON
    CON --> BP
    BP --> CONV
    CONV --> SCHEMA
    CONV --> REG
    REG --> TTL
    REG --> BUF
    REG --> SERVER
    BUF --> BP

    %% Arrow Flight API
    SERVER --> LIST
    SERVER --> GET

    %% Optional Persistence
    REG --> DWRITE

    %% Monitoring
    CON -.-> PROM
    BUF -.-> PROM
    REG -.-> PROM
    SERVER -.-> PROM
    BP -.-> PROM
    SERVER -.-> PPROF

    %% Client Access
    LIST --> PY
    GET --> PY
    LIST --> DD
    GET --> DD
    LIST --> GO
    GET --> GO

    %% Styling
    classDef dark fill:#1C2833,stroke:#3498DB,stroke-width:2px,color:white
    classDef kafka fill:#232F3E,stroke:#F0932B,stroke-width:2px,color:white
    classDef roar fill:#17202A,stroke:#5DADE2,stroke-width:2px,color:white
    classDef client fill:#145A32,stroke:#58D68D,stroke-width:2px,color:white

    class KAFKA kafka
    class ROAR roar
    class CLIENTS client
    class INGEST,CON,BP dark
    class CONVERT,CONV,SCHEMA dark
    class STREAM,REG,TTL,BUF dark
    class FLIGHT,SERVER,LIST,GET dark
    class DUCKDB,DWRITE dark
    class MONITOR,PROM,PPROF dark
```
## Back Pressure Implementation

Roar implements several back pressure mechanisms to stay stable under high load:
- Stream Buffer Control: Each stream has a configurable maximum buffer size
- Memory-Aware Processing: Tracks memory usage and slows down consumption when thresholds are reached
- Automatic Record Dropping: Can optionally drop oldest records when buffers are full
- Metrics Exposure: Comprehensive metrics exposed via Prometheus for monitoring
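
The drop-oldest behavior described above can be illustrated with a bounded buffer (a conceptual sketch; Roar tracks bytes rather than record counts, and dropping is optional, so treat this as an illustration only):

```python
from collections import deque

class BoundedStreamBuffer:
    """Drop-oldest bounded buffer, counting dropped records for metrics."""

    def __init__(self, limit: int):
        self.records = deque()
        self.limit = limit
        self.dropped = 0  # would feed a counter like roar_stream_records_dropped_total

    def push(self, record):
        if len(self.records) >= self.limit:
            self.records.popleft()  # drop the oldest record under pressure
            self.dropped += 1
        self.records.append(record)

buf = BoundedStreamBuffer(limit=3)
for i in range(5):
    buf.push(i)
print(list(buf.records), buf.dropped)  # [2, 3, 4] 2
```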
## Client Integration

### Python
```python
import pyarrow.flight as flight

# Connect to Roar server
client = flight.connect("grpc://localhost:8080")

# List available topics
for flight_info in client.list_flights():
    print(f"Topic: {flight_info.descriptor.path[0]}")

# Get data from a topic
descriptor = flight.FlightDescriptor.for_path("logs")
info = client.get_flight_info(descriptor)
reader = client.do_get(info.endpoints[0].ticket)

# Convert to pandas DataFrame
df = reader.read_all().to_pandas()
print(df)
```
## Monitoring

When metrics are enabled, Roar exposes Prometheus-compatible metrics at `/metrics` on the configured metrics port.

Key metrics include:

- `roar_kafka_messages_total`: Total number of Kafka messages processed
- `roar_record_batches_created_total`: Total number of Arrow RecordBatches created
- `roar_active_streams`: Current number of active streams
- `roar_stream_memory_bytes`: Current memory usage by stream
- `roar_stream_memory_percent`: Percentage of maximum allowed memory used
- `roar_stream_buffer_utilization`: Percentage of maximum allowed buffer used
- `roar_stream_records_dropped_total`: Number of records dropped due to back pressure
- `roar_stream_records_processed_total`: Total number of records processed
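
The percentage gauges are presumably derived from the raw byte counters and configured limits; a sketch of that relationship (the formula is assumed from the metric names, not taken from Roar's source):

```python
def utilization_percent(used_bytes: int, limit_bytes: int) -> float:
    """Percentage of a per-stream limit currently in use."""
    if limit_bytes <= 0:
        return 0.0
    return 100.0 * used_bytes / limit_bytes

# With the default 100MB buffer limit and 25MB buffered:
print(utilization_percent(25 * 1024 * 1024, 100 * 1024 * 1024))  # 25.0
```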
## Performance Tuning

For optimal performance:

- Batch Size: Adjust the `--batch-size` parameter based on your record size and memory constraints
- Buffer Limit: Set the `--buffer-limit` to control memory usage per stream
- TTL: Configure the `--ttl` to manage how long inactive streams stay in memory
- JVM Settings: When using with Java clients, ensure enough heap space is allocated
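
A back-of-the-envelope way to relate these flags: a full RecordBatch occupies roughly batch size times average record size, so the buffer limit bounds how many batches a stream can hold. A quick arithmetic sketch (the average record size is an assumed input you would measure for your own workload):

```python
def batches_in_buffer(buffer_limit_bytes: int, batch_size: int, avg_record_bytes: int) -> int:
    """Approximate number of full RecordBatches that fit in one stream buffer."""
    batch_bytes = batch_size * avg_record_bytes
    return buffer_limit_bytes // batch_bytes

# Defaults: 100MB buffer, 1024 records per batch, assuming ~256-byte records.
print(batches_in_buffer(100 * 1024 * 1024, 1024, 256))  # 400
```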
## Profiling

When profiling is enabled, Roar exposes pprof endpoints at the configured pprof address (default: `:6060`).

Access profiles at:

```
http://localhost:6060/debug/pprof/
```
