# SwiftLake

SwiftLake is a Java SQL engine built on Apache Iceberg and DuckDB for efficient lakehouse reads and writes.
## Overview
SwiftLake is a Java library that bridges the gap between traditional SQL databases and cloud-native data lakes. By combining Apache Iceberg and DuckDB, it provides a lightweight, single-node solution that delivers SQL familiarity with cloud storage benefits, without the complexity of distributed systems.
## Key Features and Benefits
- **Query and Manage Cloud Storage:** SwiftLake brings familiar SQL queries and data management capabilities to object storage-based data lakes, providing a comfortable transition path for teams with RDBMS experience.
- **Efficient Data Operations:** Leveraging DuckDB's columnar processing and Iceberg's transaction management, SwiftLake delivers fast data operations for ingestion, querying, and complex transformations.
- **Flexible Deployment:** SwiftLake operates as a single-process application that connects DuckDB's lightweight engine with cloud storage, eliminating the need for distributed infrastructure for moderate workloads.
- **Core Data Lake Capabilities:** SwiftLake provides CRUD operations, SCD support, schema evolution, and time travel functionality on cloud storage.
- **Cloud Economics:** By using object storage for data and running compute only when needed, SwiftLake offers significant cost advantages over traditional database scaling approaches.
## When to Use SwiftLake
SwiftLake is ideal for:
- Organizations wanting SQL database familiarity with cloud storage economics
- Teams needing schema evolution, time travel, or SCD merge capabilities
- Scenarios where distributed processing frameworks would be overkill
By providing a middle ground between traditional databases and complex distributed systems, SwiftLake lets teams modernize their data architecture with minimal disruption and maximum flexibility.
## SwiftLake Capabilities and Constraints
### Core Functionalities
- **Comprehensive Data Management:**
  - Execute queries
  - Perform write operations: insert, delete, update
  - Implement Slowly Changing Dimensions (SCD):
    - Type 1 merge
    - Type 2 merge
- **Dynamic Schema Evolution:**
  - Add, drop, rename, and reorder columns
  - Widen column types
- **Advanced Partitioning Strategies:**
  - Enhance query performance through intelligent data grouping
  - Support for multiple partition transforms:
    - Identity, bucket, truncate
    - Time-based: year, month, day, hour
  - Hidden partitioning capability
  - Partition evolution without data rewrite
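The two SCD merge modes differ in what happens when a key already exists: a Type 1 merge overwrites the stored row in place (no history), while a Type 2 merge closes out the current row and appends a new current version. As a library-agnostic illustration of these row-versioning semantics (plain Java collections, not SwiftLake's API):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ScdSketch {
    // A dimension row: business key, payload, and a Type 2 validity range.
    // effectiveTo == null marks the current version.
    public record Row(long id, String data, LocalDate effectiveFrom, LocalDate effectiveTo) {
        public boolean isCurrent() { return effectiveTo == null; }
    }

    // Type 1 merge: the incoming value simply overwrites the stored one; history is lost.
    public static void mergeType1(Map<Long, String> table, long id, String data) {
        table.put(id, data);
    }

    // Type 2 merge: close out the current row for the key, then append a new current row.
    public static void mergeType2(List<Row> table, long id, String data, LocalDate asOf) {
        for (int i = 0; i < table.size(); i++) {
            Row r = table.get(i);
            if (r.id() == id && r.isCurrent()) {
                table.set(i, new Row(r.id(), r.data(), r.effectiveFrom(), asOf)); // close out
            }
        }
        table.add(new Row(id, data, asOf, null)); // new current version
    }

    public static void main(String[] args) {
        Map<Long, String> t1 = new LinkedHashMap<>();
        mergeType1(t1, 1L, "a");
        mergeType1(t1, 1L, "a2");       // overwrite: only "a2" remains
        System.out.println(t1.get(1L)); // a2

        List<Row> t2 = new ArrayList<>();
        mergeType2(t2, 1L, "a",  LocalDate.of(2025, 1, 1));
        mergeType2(t2, 1L, "a2", LocalDate.of(2025, 2, 1)); // both versions retained
        System.out.println(t2.size());  // 2
    }
}
```

In SwiftLake, both merge modes operate on Iceberg tables through SQL; the sketch above only shows the versioning behavior each mode implies.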
### Performance Optimizations

- **Efficient Caching:** Optimize data access and query performance
- **MyBatis Integration:** Seamless interaction with the MyBatis framework
### Current System Boundaries

- **File Format Compatibility:** Currently supports only the Parquet format
- **Table Management Mode:**
  - Implements Copy-On-Write mode exclusively
  - Merge-On-Read is not supported
- **Metadata Handling:**
  - Querying metadata tables is not supported within SwiftLake
  - Snapshot and metadata management requires external engines (e.g., Spark)
- **Partitioning Limitation:** Cannot partition on columns from nested structs
> **Note:** For operations like data compaction, expiring snapshots, and deleting orphan files, use compatible external engines such as Apache Spark.
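Iceberg's Spark actions API (part of Apache Iceberg, not SwiftLake) covers these maintenance tasks. A sketch, assuming `spark` is a configured `SparkSession` and `table` is the Iceberg `Table` loaded from the same catalog SwiftLake writes to:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

// `spark` and `table` are assumed to be set up elsewhere.
SparkActions.get(spark).rewriteDataFiles(table).execute();   // compact small data files
SparkActions.get(spark).expireSnapshots(table).execute();    // expire old snapshots
SparkActions.get(spark).deleteOrphanFiles(table).execute();  // delete orphan files
```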
## Getting Started

### Including the SwiftLake Dependency

To use SwiftLake in your project, add the following dependency to your build file:
#### Maven

Add this to your `pom.xml`:

```xml
<dependency>
  <groupId>com.arcesium.swiftlake</groupId>
  <artifactId>swiftlake-core</artifactId>
  <version>0.2.0</version>
</dependency>
```
#### Gradle

Add this to your `build.gradle`:

```groovy
implementation 'com.arcesium.swiftlake:swiftlake-core:0.2.0'
```
### Setup

1. Configure and create a Catalog:

   ```java
   import java.util.HashMap;
   import java.util.Map;

   import org.apache.hadoop.conf.Configuration;
   import org.apache.iceberg.CatalogUtil;
   import org.apache.iceberg.catalog.Catalog;

   Map<String, String> properties = new HashMap<>();
   properties.put("warehouse", "warehouse");
   properties.put("type", "hadoop");
   properties.put("io-impl", "com.arcesium.swiftlake.io.SwiftLakeHadoopFileIO");
   Catalog catalog = CatalogUtil.buildIcebergCatalog("local", properties, new Configuration());
   ```
2. Build the SwiftLakeEngine:

   ```java
   import com.arcesium.swiftlake.SwiftLakeEngine;

   SwiftLakeEngine swiftLakeEngine = SwiftLakeEngine.builderFor("demo").catalog(catalog).build();
   ```
### Creating a Table

1. Define the schema and partition spec:

   ```java
   import org.apache.iceberg.PartitionSpec;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.types.Types;

   Schema schema = new Schema(
       Types.NestedField.required(1, "id", Types.LongType.get()),
       Types.NestedField.required(2, "data", Types.StringType.get()),
       Types.NestedField.required(3, "category", Types.StringType.get()),
       Types.NestedField.required(4, "date", Types.DateType.get())
   );

   PartitionSpec spec = PartitionSpec.builderFor(schema)
       .identity("date")
       .identity("category")
       .build();
   ```
2. Create the table:

   ```java
   import org.apache.iceberg.Table;
   import org.apache.iceberg.catalog.TableIdentifier;

   TableIdentifier name = TableIdentifier.of("db", "table");
   Table table = catalog.createTable(name, schema, spec);
   ```
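The spec above uses identity transforms; the time-based, bucket, and truncate transforms listed under partitioning are declared through the same Iceberg `PartitionSpec.Builder`. A sketch using standard Iceberg builder methods (the bucket count and truncate width are arbitrary choices):

```java
// Alternative spec for the same schema: hidden day-level partitioning on "date",
// hash-bucketing on "category" (16 buckets), and prefix truncation on "data" (width 10).
PartitionSpec altSpec = PartitionSpec.builderFor(schema)
    .day("date")
    .bucket("category", 16)
    .truncate("data", 10)
    .build();
```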
### Inserting Data

Use SQL to insert data:

```java
swiftLakeEngine.insertInto(table)
    .sql("SELECT * FROM (VALUES (1, 'a', 'category1', DATE'2025-01-01'), (2, 'b', 'category2', DATE'2025-01-01'), (3, 'c', 'category3', DATE'2025-03-01')) source(id, data, category, date)")
    .execute();
```
### Querying Data

Execute SQL queries using a JDBC-like interface:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.sql.DataSource;

DataSource dataSource = swiftLakeEngine.createDataSource();
String selectSql = "SELECT * FROM db.table WHERE id = 2";
try (Connection connection = dataSource.getConnection();
    Statement statement = connection.createStatement();
    ResultSet resultSet = statement.executeQuery(selectSql)) {
  // Process the resultSet
}
```
You can also perform aggregations:

```java
String aggregateSql = "SELECT count(1) AS count, data FROM db.table WHERE id > 0 GROUP BY data";
try (Connection connection = dataSource.getConnection();
    Statement statement = connection.createStatement();
    ResultSet resultSet = statement.executeQuery(aggregateSql)) {
  // Process the resultSet
}
```
## AWS Integration

### S3 Integration

To use SwiftLake with Amazon S3, you need to configure the S3 file system:
1. Add the dependency:

   **Maven**

   ```xml
   <dependency>
     <groupId>com.arcesium.swiftlake</groupId>
     <artifactId>swiftlake-aws</artifactId>
     <version>0.2.0</version>
   </dependency>
   ```

   **Gradle**

   ```groovy
   implementation 'com.arcesium.swiftlake:swiftlake-aws:0.2.0'
   ```
2. Configure S3 in your SwiftLake setup:

   ```java
   Map<String, String> properties = new HashMap<>();
   properties.put("warehouse", "s3://your-bucket-name/warehouse");
   properties.put("io-impl", "com.arcesium.swiftlake.aws.SwiftLakeS3FileIO");
   properties.put("client.region", "your-aws-region");
   properties.put("s3.access-key-id", "YOUR_ACCESS_KEY");
   properties.put("s3.secret-access-key", "YOUR_SECRET_KEY");
   ```
### AWS Glue Catalog Integration

To use SwiftLake with the AWS Glue Catalog, configure it in your SwiftLake setup:

```java
Map<String, String> properties = new HashMap<>();
properties.put("warehouse", "s3://your-bucket-name/warehouse");
properties.put("io-impl", "com.arcesium.swiftlake.aws.SwiftLakeS3FileIO");
properties.put("client.region", "your-aws-region");
properties.put("s3.access-key-id", "YOUR_ACCESS_KEY");
properties.put("s3.secret-access-key", "YOUR_SECRET_KEY");
properties.put("type", "glue");

Catalog catalog = CatalogUtil.buildIcebergCatalog("glue", properties, new Configuration());
SwiftLakeEngine swiftLakeEngine = SwiftLakeEngine.builderFor("demo").catalog(catalog).build();
// Create tables, insert data, and query as shown above
```
## Configuration

### SwiftLakeEngine Configuration

| Name | Default | Description |
|------|---------|-------------|
| localDir | A unique directory under the system's temporary directory | Local directory where temporary files are written |
| memoryLimitInMiB | 90% of the memory available outside the JVM heap, in MiB | Maximum memory of the DuckDB instance |
| memoryLimitFraction | - | Fraction of total memory used for the DuckDB instance |
| threads | Number of available processor cores | Total number of threads used by the DuckDB instance |
| tempStorageLimitInMiB | - | Maximum amount of disk space DuckDB can use for temporary storage |
| maxPartitionWriterThreads | Same as threads | |
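These options are presumably supplied through the engine builder shown in the Setup section. The setter names below are assumptions inferred from the option names in this table, not verified API; check the SwiftLake Javadoc for the actual signatures:

```java
// Hypothetical setter names inferred from the configuration table above.
SwiftLakeEngine engine = SwiftLakeEngine.builderFor("demo")
    .catalog(catalog)
    .localDir("/tmp/swiftlake")   // local storage for temp files
    .memoryLimitInMiB(4096)       // cap DuckDB memory at 4 GiB
    .threads(8)                   // DuckDB thread count
    .build();
```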
