# SwiftLake

SwiftLake is a Java SQL engine built on Apache Iceberg and DuckDB for efficient lakehouse reads and writes.
## Overview
SwiftLake is a Java library that bridges the gap between traditional SQL databases and cloud-native data lakes. By combining Apache Iceberg and DuckDB, it provides a lightweight, single-node solution that delivers SQL familiarity with cloud storage benefits, without the complexity of distributed systems.
## Key Features and Benefits
- **Query and Manage Cloud Storage:** SwiftLake brings familiar SQL queries and data management capabilities to object storage-based data lakes, providing a comfortable transition path for teams with RDBMS experience.
- **Efficient Data Operations:** Leveraging DuckDB's columnar processing and Iceberg's transaction management, SwiftLake delivers fast data operations for ingestion, querying, and complex transformations.
- **Flexible Deployment:** SwiftLake operates as a single-process application that connects DuckDB's lightweight engine with cloud storage, eliminating the need for distributed infrastructure for moderate workloads.
- **Core Data Lake Capabilities:** SwiftLake provides CRUD operations, SCD support, schema evolution, and time travel functionality on cloud storage.
- **Cloud Economics:** By using object storage for data and running compute only when needed, SwiftLake offers significant cost advantages over traditional database scaling approaches.
## When to Use SwiftLake
SwiftLake is ideal for:
- Organizations wanting SQL database familiarity with cloud storage economics
- Teams needing schema evolution, time travel, or SCD merge capabilities
- Scenarios where distributed processing frameworks would be overkill
By providing a middle ground between traditional databases and complex distributed systems, SwiftLake lets teams modernize their data architecture with minimal disruption and maximum flexibility.
## SwiftLake Capabilities and Constraints
### Core Functionalities
- **Comprehensive Data Management:**
  - Execute queries
  - Perform write operations: insert, delete, update
  - Implement Slowly Changing Dimensions (SCD):
    - Type 1 merge
    - Type 2 merge
- **Dynamic Schema Evolution:**
  - Add, drop, rename, and reorder columns
  - Widen column types
- **Advanced Partitioning Strategies:**
  - Enhance query performance through intelligent data grouping
  - Support for multiple partition transforms:
    - Identity, bucket, truncate
    - Time-based: year, month, day, hour
  - Hidden partitioning capability
  - Partition evolution without data rewrite
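The two SCD merge modes differ in what happens when a key already exists: a Type 1 merge overwrites the stored row in place (no history), while a Type 2 merge closes out the current row and appends a new current version. As a library-agnostic illustration of these row-versioning semantics (plain Java collections, not SwiftLake's API):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ScdSketch {
    // A dimension row: business key, payload, and a Type 2 validity range.
    // effectiveTo == null marks the current version.
    public record Row(long id, String data, LocalDate effectiveFrom, LocalDate effectiveTo) {
        public boolean isCurrent() { return effectiveTo == null; }
    }

    // Type 1 merge: the incoming value simply overwrites the stored one; history is lost.
    public static void mergeType1(Map<Long, String> table, long id, String data) {
        table.put(id, data);
    }

    // Type 2 merge: close out the current row for the key, then append a new current row.
    public static void mergeType2(List<Row> table, long id, String data, LocalDate asOf) {
        for (int i = 0; i < table.size(); i++) {
            Row r = table.get(i);
            if (r.id() == id && r.isCurrent()) {
                table.set(i, new Row(r.id(), r.data(), r.effectiveFrom(), asOf)); // close out
            }
        }
        table.add(new Row(id, data, asOf, null)); // new current version
    }

    public static void main(String[] args) {
        Map<Long, String> t1 = new LinkedHashMap<>();
        mergeType1(t1, 1L, "a");
        mergeType1(t1, 1L, "a2");       // overwrite: only "a2" remains
        System.out.println(t1.get(1L)); // a2

        List<Row> t2 = new ArrayList<>();
        mergeType2(t2, 1L, "a",  LocalDate.of(2025, 1, 1));
        mergeType2(t2, 1L, "a2", LocalDate.of(2025, 2, 1)); // both versions retained
        System.out.println(t2.size());  // 2
    }
}
```

In SwiftLake, both merge modes operate on Iceberg tables through SQL; the sketch above only shows the versioning behavior each mode implies.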
### Performance Optimizations

- **Efficient Caching:** Optimize data access and query performance
- **MyBatis Integration:** Seamless interaction with the MyBatis framework
### Current System Boundaries

- **File Format Compatibility:** Currently supports only the Parquet format
- **Table Management Mode:**
  - Implements Copy-On-Write mode exclusively
  - Merge-On-Read is not supported
- **Metadata Handling:**
  - Querying metadata tables is not supported within SwiftLake
  - Snapshot and metadata management requires external engines (e.g., Spark)
- **Partitioning Limitation:** Cannot partition on columns from nested structs
> **Note:** For operations like data compaction, expiring snapshots, and deleting orphan files, use compatible external engines such as Apache Spark.
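Iceberg's Spark actions API (part of Apache Iceberg, not SwiftLake) covers these maintenance tasks. A sketch, assuming `spark` is a configured `SparkSession` and `table` is the Iceberg `Table` loaded from the same catalog SwiftLake writes to:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

// `spark` and `table` are assumed to be set up elsewhere.
SparkActions.get(spark).rewriteDataFiles(table).execute();   // compact small data files
SparkActions.get(spark).expireSnapshots(table).execute();    // expire old snapshots
SparkActions.get(spark).deleteOrphanFiles(table).execute();  // delete orphan files
```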
## Getting Started

### Including the SwiftLake Dependency

To use SwiftLake in your project, add the following dependency to your build file:
#### Maven

Add this to your `pom.xml`:

```xml
<dependency>
  <groupId>com.arcesium.swiftlake</groupId>
  <artifactId>swiftlake-core</artifactId>
  <version>0.2.0</version>
</dependency>
```
#### Gradle

Add this to your `build.gradle`:

```groovy
implementation 'com.arcesium.swiftlake:swiftlake-core:0.2.0'
```
### Setup

1. Configure and create a Catalog:

   ```java
   import java.util.HashMap;
   import java.util.Map;

   import org.apache.hadoop.conf.Configuration;
   import org.apache.iceberg.CatalogUtil;
   import org.apache.iceberg.catalog.Catalog;

   Map<String, String> properties = new HashMap<>();
   properties.put("warehouse", "warehouse");
   properties.put("type", "hadoop");
   properties.put("io-impl", "com.arcesium.swiftlake.io.SwiftLakeHadoopFileIO");
   Catalog catalog = CatalogUtil.buildIcebergCatalog("local", properties, new Configuration());
   ```
2. Build the SwiftLakeEngine:

   ```java
   import com.arcesium.swiftlake.SwiftLakeEngine;

   SwiftLakeEngine swiftLakeEngine = SwiftLakeEngine.builderFor("demo").catalog(catalog).build();
   ```
### Creating a Table

1. Define the schema and partition spec:

   ```java
   import org.apache.iceberg.PartitionSpec;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.types.Types;

   Schema schema = new Schema(
       Types.NestedField.required(1, "id", Types.LongType.get()),
       Types.NestedField.required(2, "data", Types.StringType.get()),
       Types.NestedField.required(3, "category", Types.StringType.get()),
       Types.NestedField.required(4, "date", Types.DateType.get())
   );

   PartitionSpec spec = PartitionSpec.builderFor(schema)
       .identity("date")
       .identity("category")
       .build();
   ```
2. Create the table:

   ```java
   import org.apache.iceberg.Table;
   import org.apache.iceberg.catalog.TableIdentifier;

   TableIdentifier name = TableIdentifier.of("db", "table");
   Table table = catalog.createTable(name, schema, spec);
   ```
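The spec above uses identity transforms; the time-based, bucket, and truncate transforms listed under partitioning are declared through the same Iceberg `PartitionSpec.Builder`. A sketch using standard Iceberg builder methods (the bucket count and truncate width are arbitrary choices):

```java
// Alternative spec for the same schema: hidden day-level partitioning on "date",
// hash-bucketing on "category" (16 buckets), and prefix truncation on "data" (width 10).
PartitionSpec altSpec = PartitionSpec.builderFor(schema)
    .day("date")
    .bucket("category", 16)
    .truncate("data", 10)
    .build();
```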
### Inserting Data

Use SQL to insert data:

```java
swiftLakeEngine.insertInto(table)
    .sql("SELECT * FROM (VALUES (1, 'a', 'category1', DATE'2025-01-01'), (2, 'b', 'category2', DATE'2025-01-01'), (3, 'c', 'category3', DATE'2025-03-01')) source(id, data, category, date)")
    .execute();
```
### Querying Data

Execute SQL queries using a JDBC-like interface:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.sql.DataSource;

DataSource dataSource = swiftLakeEngine.createDataSource();
String selectSql = "SELECT * FROM db.table WHERE id = 2";
try (Connection connection = dataSource.getConnection();
    Statement statement = connection.createStatement();
    ResultSet resultSet = statement.executeQuery(selectSql)) {
  // Process the resultSet
}
```
You can also perform aggregations:

```java
String aggregateSql = "SELECT count(1) AS count, data FROM db.table WHERE id > 0 GROUP BY data";
try (Connection connection = dataSource.getConnection();
    Statement statement = connection.createStatement();
    ResultSet resultSet = statement.executeQuery(aggregateSql)) {
  // Process the resultSet
}
```
## AWS Integration

### S3 Integration

To use SwiftLake with Amazon S3, you need to configure the S3 file system:
1. Add the dependency:

   **Maven**

   ```xml
   <dependency>
     <groupId>com.arcesium.swiftlake</groupId>
     <artifactId>swiftlake-aws</artifactId>
     <version>0.2.0</version>
   </dependency>
   ```

   **Gradle**

   ```groovy
   implementation 'com.arcesium.swiftlake:swiftlake-aws:0.2.0'
   ```
2. Configure S3 in your SwiftLake setup:

   ```java
   Map<String, String> properties = new HashMap<>();
   properties.put("warehouse", "s3://your-bucket-name/warehouse");
   properties.put("io-impl", "com.arcesium.swiftlake.aws.SwiftLakeS3FileIO");
   properties.put("client.region", "your-aws-region");
   properties.put("s3.access-key-id", "YOUR_ACCESS_KEY");
   properties.put("s3.secret-access-key", "YOUR_SECRET_KEY");
   ```
### AWS Glue Catalog Integration

To use SwiftLake with the AWS Glue Catalog, configure it in your SwiftLake setup:

```java
Map<String, String> properties = new HashMap<>();
properties.put("warehouse", "s3://your-bucket-name/warehouse");
properties.put("io-impl", "com.arcesium.swiftlake.aws.SwiftLakeS3FileIO");
properties.put("client.region", "your-aws-region");
properties.put("s3.access-key-id", "YOUR_ACCESS_KEY");
properties.put("s3.secret-access-key", "YOUR_SECRET_KEY");
properties.put("type", "glue");

Catalog catalog = CatalogUtil.buildIcebergCatalog("glue", properties, new Configuration());
SwiftLakeEngine swiftLakeEngine = SwiftLakeEngine.builderFor("demo").catalog(catalog).build();
// Create tables, insert data, and query as shown above
```
## Configuration

### SwiftLakeEngine Configuration

| Name | Default | Description |
|------|---------|-------------|
| localDir | A unique directory under the system's temporary directory | Local directory where temporary files are written |
| memoryLimitInMiB | 90% of the memory available outside the JVM heap, in MiB | Maximum memory of the DuckDB instance |
| memoryLimitFraction | - | Fraction of total memory used for the DuckDB instance |
| threads | Number of available processor cores | Total number of threads used by the DuckDB instance |
| tempStorageLimitInMiB | - | Maximum amount of disk space DuckDB can use for temporary storage |
| maxPartitionWriterThreads | Same as threads | |
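These options are presumably supplied through the engine builder shown in the Setup section. The setter names below are assumptions inferred from the option names in this table, not verified API; check the SwiftLake Javadoc for the actual signatures:

```java
// Hypothetical setter names inferred from the configuration table above.
SwiftLakeEngine engine = SwiftLakeEngine.builderFor("demo")
    .catalog(catalog)
    .localDir("/tmp/swiftlake")   // local storage for temp files
    .memoryLimitInMiB(4096)       // cap DuckDB memory at 4 GiB
    .threads(8)                   // DuckDB thread count
    .build();
```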
