OLake

OLake - Fastest database, Kafka & S3 replication to Apache Iceberg or plain Parquet. ⚡ Efficient, quick, and scalable data ingestion for real-time analytics. Supported sources: Postgres, MongoDB, MySQL, Oracle, MSSQL, DB2, Kafka, S3.

<h1 align="center" style="border-bottom: none"> <a href="https://datazip.io/olake" target="_blank"> <img alt="olake" src="https://github.com/user-attachments/assets/d204f25f-5289-423c-b3f2-44b2194bdeaf" width="100" height="100"/> </a> <br>OLake </h1> <p align="center"> <strong>OLake</strong> is a high-performance, open-source data ingestion engine for replicating databases, S3, and Kafka into <strong>Apache Iceberg</strong> (or plain Parquet). <br/> Built for scalable, real-time pipelines, OLake provides a simple web UI and CLI for ingesting data into Iceberg tables that are free of vendor lock-in and work with all query engines and warehouses. <br/><br/> Read the docs and benchmarks at <a href="https://olake.io/docs" target="_blank">olake.io/docs</a>. Join our active community on <a href="https://olake.io/slack/" target="_blank">Slack</a>. </p> <p align="center"> <a href="https://github.com/datazip-inc/olake/issues"> <img alt="GitHub issues" src="https://img.shields.io/github/issues/datazip-inc/olake"/> </a> <a href="https://olake.io/docs"> <img alt="Documentation" src="https://img.shields.io/badge/view-Documentation-white"/> </a> <a href="https://olake.io/slack/"> <img alt="slack" src="https://img.shields.io/badge/Join%20Our%20Community-Slack-blue"/> </a> <a href="https://olake.io/docs/community/contributing/"> <img alt="Contribute to OLake" src="https://img.shields.io/badge/Contribute-OLake-2563eb"/> </a> </p>

OLake — Super-fast Sync to Apache Iceberg

OLake replicates data from transactional databases such as PostgreSQL, MySQL, MongoDB, Oracle, DB2, and MSSQL, event-streaming systems like Apache Kafka, and object stores like S3 into open data lakehouse formats such as Apache Iceberg or plain Parquet, delivering blazing-fast performance with minimal infrastructure cost.

🚀 Why OLake?

🧠 Smart sync: Full + CDC replication with automatic schema discovery & schema evolution
⚡ High throughput: 580K RPS (Postgres) & 338K RPS (MySQL)
➡️ Exactly-once delivery & Arrow writes: Accuracy with speed.
💾 Iceberg-native: Supports Glue, Hive, JDBC, REST catalogs
🖥️ Self-serve UI: Deploy via Docker Compose and sync in minutes
💸 Infra-light: No Spark, no Flink, no Kafka, no Debezium
🗜️ Iceberg Table Optimization (Coming soon): Compaction tailored for CDC ingestion

📊 Benchmarks & possible connections

Full Load

| Source → Destination | Full Load | Relative Performance (Full Load) | Full Report |
|----------------------|-----------|----------------------------------|-------------|
| Postgres → Iceberg | 580,113 RPS | 12.5× faster than Fivetran | Full Report |
| MySQL → Iceberg | 338,005 RPS | 2.83× faster than Fivetran | Full Report |
| MongoDB → Iceberg | 37,879 RPS | - | Full Report |
| Oracle → Iceberg | 526,337 RPS | - | Full Report |
| Kafka → Iceberg | 154,320 RPS (Bounded Incremental) | 1.8× faster than Flink | Full Report |

CDC

| Source → Destination | CDC | Relative Performance (CDC) | Full Report |
|----------------------|-----|----------------------------|-------------|
| Postgres → Iceberg | 55,555 RPS | 2× faster than Fivetran | Full Report |
| MySQL → Iceberg | 51,867 RPS | 1.85× faster than Fivetran | Full Report |
| MongoDB → Iceberg | 10,692 RPS | - | Full Report |
| Oracle → Iceberg | - | - | Full Report |

*These are preliminary results. Fully reproducible benchmark scores will be published soon.
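
To put the throughput figures in perspective, here is a minimal sketch that turns an RPS value (read here as records per second) into a rough wall-clock estimate. The function name `estimate_seconds` and the 1-billion-row table are illustrative assumptions, not part of OLake, and the result assumes the reported rate is sustained for the whole run, which the preliminary benchmarks do not guarantee.

```shell
#!/usr/bin/env bash
# Estimate the seconds needed to move ROWS records at a sustained RPS, rounded up.
# Hypothetical helper for back-of-the-envelope sizing only.
estimate_seconds() {
  local rows="$1" rps="$2"
  echo $(( (rows + rps - 1) / rps ))
}

# Postgres full load at the reported 580,113 RPS, for an assumed 1-billion-row table
estimate_seconds 1000000000 580113   # prints 1724 (about 29 minutes)
```

The same call with the CDC figures from the table gives an idea of how quickly a backlog of changes would drain under the same assumption.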

🔧 Supported Sources and Destinations

Sources (Databases and S3)

| Source | Full Load | CDC | Incremental | Notes | Documentation |
|--------|-----------|-----|-------------|-------|---------------|
| PostgreSQL | ✅ | ✅ pgoutput | ✅ | wal2json deprecated | Postgres Docs |
| MySQL | ✅ | ✅ | ✅ | Binlog-based CDC | MySQL Docs |
| MongoDB | ✅ | ✅ | ✅ | Oplog-based CDC | MongoDB Docs |
| Oracle | ✅ | WIP | ✅ | JDBC based Full Load & Incremental | Oracle Docs |
| DB2 | ✅ | - | ✅ | JDBC based Full Load & Incremental | DB2 Docs |
| MSSQL | ✅ | ✅ | ✅ | Full Load, CDC & Incremental | MSSQL Docs |
| S3 | ✅ | - | ✅ | Ingests from Amazon S3 or S3-compatible (MinIO, LocalStack) | S3 Docs |
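
As a quick pre-flight check before configuring a pipeline, the support matrix above can be encoded in a small helper. This is only a sketch: the function name `supports_mode` and the mode names `full_load`, `cdc`, and `incremental` are made up for illustration and are not OLake configuration keys. Oracle CDC is treated as unsupported because the table lists it as WIP.

```shell
#!/usr/bin/env bash
# Return 0 if SOURCE supports MODE according to the table above, 1 otherwise.
# Hypothetical helper; source and mode names are illustrative only.
supports_mode() {
  local src mode modes
  src="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  mode="$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
  [ -n "$mode" ] || return 1

  case "$src" in
    postgresql|mysql|mongodb|mssql) modes="full_load cdc incremental" ;;
    oracle|db2|s3)                  modes="full_load incremental" ;;  # no CDC (Oracle: WIP)
    *)                              modes="" ;;
  esac

  # Match the requested mode as a whole word within the supported list
  case " $modes " in
    *" $mode "*) return 0 ;;
    *)           return 1 ;;
  esac
}

# Example: supports_mode Oracle cdc || echo "Oracle CDC is not available yet"
```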

Sources (Kafka)

| Source | Bounded Incremental | Notes | Documentation |
|--------|---------------------|-------|---------------|
| Kafka | ✅ | Latest offset bounded incremental sync | Kafka Docs |

Destinations

| Destination | Format | Supported Catalogs |
|-------------|--------|--------------------|
| Iceberg | ✅ | Glue, Hive, JDBC, REST (Nessie, Polaris, Unity, Lakekeeper, AWS S3 tables) |
| Parquet | ✅ | Filesystem |
| Other formats | 🔜 | Planned: Delta Lake, Hudi |

Writer Docs

Apache Iceberg Docs
1. Catalogs
2. Azure ADLS Gen2
3. Google Cloud Storage (GCS)
4. MinIO (local)
5. Iceberg Table Management
  1. S3 Tables Supported
Parquet Writer

🧪 Quickstart (UI + Docker)

OLake UI is a web-based interface for managing OLake jobs, sources, destinations, and configurations. You can run the entire OLake stack (UI, Backend, and all dependencies) using Docker Compose. This is the recommended way to get started. Run the UI, connect your source DB, and start syncing in minutes.

curl -sSL https://raw.githubusercontent.com/datazip-inc/olake-ui/master/docker-compose.yml | docker compose -f - up -d

Access the UI:
* OLake UI: http://localhost:8000
* Log in with default credentials: admin / password.
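
The containers can take a moment to come up after `docker compose up -d`. A minimal sketch for waiting until the UI answers is shown below. It assumes `curl` is installed and that the UI returns a non-error HTTP response at the root URL; the function name `wait_for_ui` and its defaults are illustrative, not part of OLake.

```shell
#!/usr/bin/env bash
# Poll a URL until it responds without an HTTP error, or give up.
# Usage: wait_for_ui [url] [attempts] [delay_seconds]
wait_for_ui() {
  local url="${1:-http://localhost:8000}"
  local attempts="${2:-30}"
  local delay="${3:-2}"
  local i
  for ((i = 1; i <= attempts; i++)); do
    # -f: treat HTTP errors (>= 400) as failure, -s: silent, -o /dev/null: discard the body
    if curl -fs -o /dev/null --max-time 5 "$url"; then
      echo "UI is reachable at $url"
      return 0
    fi
    sleep "$delay"
  done
  echo "UI did not respond at $url after $attempts attempts" >&2
  return 1
}

# Example: wait_for_ui http://localhost:8000 30 2
```

The function returns a non-zero status on timeout, so it can gate later steps in a setup script.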

A detailed getting-started guide for OLake UI is available in the OLake documentation.

Creating Your First Job

With the UI running, you can create a data pipeline in a few steps:

1. Create a Job: Navigate to the Jobs tab and click Create Job.
2. Configure Source: Set up your source connection (e.g., PostgreSQL, MySQL, MongoDB).
3. Configure Destination: Set up your destination (e.g., Apache Iceberg with a Glue, REST, Hive, or JDBC catalog).
4. Select Streams: Choose which tables to sync and configure their sync mode (CDC or Full Refresh).
5. Configure & Run: Give your job a name, set a schedule, and click Create Job to finish.

For a detailed walkthrough, refer to the Jobs documentation.

🛠️ CLI Usage (Advanced)

For advanced users and automation, OLake's core logic is exposed via a powerful CLI. The core framework handles state management, configuration validation, logging, and type detection. It interacts with drivers using four main commands: