Velox4j
Community Java bindings for https://github.com/facebookincubator/velox
Velox4J: Community Java Bindings for Velox
Project Status
Velox4J is currently an experimental project.
Introduction
What is Velox?
Velox is an open source unified execution engine. The project was originally funded by Meta in 2020. Projects often use Velox as a C++ library to accelerate SQL query execution.
Homepages of Velox:
Critical open source projects depending on Velox:
What is Velox4J?
Velox4J is a set of Java bindings for Velox. It enables JVM applications to invoke Velox's functionality directly, without writing or maintaining any C++ / JNI code.
Velox4J is an independent community project and is not affiliated with or endorsed by Meta Platforms, Inc.
Powered by Velox4J
The Gluten-Flink subproject is backed by Velox via Velox4J:
Design
Velox4J is designed around the following principles:
Portable
Velox4J is designed to be portable. The eventual goal is for a single Velox4J release to ship to different platforms without rebuilding the Jar file.
Seamless Velox API Mapping
Velox4J directly adopts Velox's existing JSON serde framework and implements the following JSON-serializable Velox components on the Java side:
- Data types
- Query plans
- Expressions
- Connectors
With the help of Velox's own JSON serde, Velox4J's C++ code base needs no re-interpreting layer for query plans. This means the Velox components defined in Velox4J's Java code map one-to-one to the associated Velox components. The design keeps Velox4J's code base small and makes new Velox features easy to add to Velox4J.
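To make the one-to-one mapping idea concrete, here is a minimal sketch that serializes a Java-side type tree to JSON by hand. Everything in it is an assumption for illustration: `SketchType`, `SketchScalar`, `SketchRow`, and the JSON field names (`type`, `names`, `children`) are hypothetical and are not Velox4J's classes or Velox's actual wire format.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SerdeSketch {
  /** Hypothetical Java-side mirror of a native type; NOT Velox4J's API. */
  interface SketchType {
    String toJson();
  }

  /** Scalar type that serializes to a single-field JSON object. */
  static final class SketchScalar implements SketchType {
    private final String kind;

    SketchScalar(String kind) {
      this.kind = kind;
    }

    @Override
    public String toJson() {
      return "{\"type\":\"" + kind + "\"}";
    }
  }

  /** Row type: each field name is paired with one child type. */
  static final class SketchRow implements SketchType {
    private final List<String> names;
    private final List<SketchType> children;

    SketchRow(List<String> names, List<SketchType> children) {
      if (names.size() != children.size()) {
        throw new IllegalArgumentException("names and children differ in size");
      }
      this.names = names;
      this.children = children;
    }

    @Override
    public String toJson() {
      // No string escaping: names are assumed to be plain identifiers.
      String n = names.stream().map(s -> "\"" + s + "\"").collect(Collectors.joining(","));
      String c = children.stream().map(SketchType::toJson).collect(Collectors.joining(","));
      return "{\"type\":\"ROW\",\"names\":[" + n + "],\"children\":[" + c + "]}";
    }
  }

  public static void main(String[] args) {
    SketchType row = new SketchRow(
        Arrays.asList("n_nationkey", "n_name"),
        Arrays.asList(new SketchScalar("BIGINT"), new SketchScalar("VARCHAR")));
    // The Java object graph and the JSON document have the same shape,
    // so the native side could rebuild the type without a translation layer.
    System.out.println(row.toJson());
  }
}
```

Because each Java class only describes its own fields, supporting a new component in this style means adding one class rather than touching a shared translator.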
Compatible With Arrow Java
Velox4J is compatible with Apache Arrow's Java implementation. Built-in utilities converting between Velox4J's RowVector / BaseVector and Arrow Java's VectorSchemaRoot / Table / FieldVector are provided.
Prerequisites
Platform
The project and its releases are currently tested only on the following CPU architectures:
- x86-64
and on the following operating systems:
- Linux
Support for platforms not on the above list is currently not guaranteed by the mainline code of Velox4J. Contributions are always welcome from anyone who wants to get involved.
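An application may want to warn early when it runs outside the tested combination. The following is a minimal sketch using only standard JVM system properties; the class and method names (`PlatformCheck`, `isTestedPlatform`) are hypothetical and not part of Velox4J.

```java
import java.util.Locale;

public class PlatformCheck {
  /** Returns true only for the Linux / x86-64 combination listed as tested. */
  static boolean isTestedPlatform(String osName, String osArch) {
    String os = osName.toLowerCase(Locale.ROOT);
    String arch = osArch.toLowerCase(Locale.ROOT);
    boolean linux = os.contains("linux");
    // JVMs commonly report x86-64 as "amd64"; some report "x86_64".
    boolean x8664 = arch.equals("amd64") || arch.equals("x86_64");
    return linux && x8664;
  }

  public static void main(String[] args) {
    String osName = System.getProperty("os.name");
    String osArch = System.getProperty("os.arch");
    if (isTestedPlatform(osName, osArch)) {
      System.out.println(osName + "/" + osArch + " is on the tested platform list.");
    } else {
      System.err.println("Warning: " + osName + "/" + osArch + " is not a tested platform.");
    }
  }
}
```

The check only compares the reported OS name and architecture; it does not inspect the glibc version or any other property of the host.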
Build Toolchains
The minimum toolchain versions for building Velox4J:
- GCC 11
- JDK 8
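The Get Started example in this README uses `List.of` and `Map.of`, which were added to the JDK in Java 9. If an application has to compile at the Java 8 language level, equivalent collections can be built with `java.util.Arrays` and `java.util.Collections`. A minimal sketch follows; the helper names (`listOf`, `emptyMap`) are hypothetical, not Velox4J utilities.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class Java8Collections {
  /**
   * Java 8 stand-in for List.of(a, b, ...): an unmodifiable view over the arguments.
   * Unlike List.of, it does not reject null elements.
   */
  @SafeVarargs
  static <T> List<T> listOf(T... items) {
    return Collections.unmodifiableList(Arrays.asList(items));
  }

  /** Java 8 stand-in for the zero-argument Map.of(). */
  static <K, V> Map<K, V> emptyMap() {
    return Collections.emptyMap();
  }

  public static void main(String[] args) {
    List<String> names = listOf("n_nationkey", "n_name", "n_regionkey", "n_comment");
    Map<String, String> noProperties = emptyMap();
    System.out.println(names + " " + noProperties.size());
  }
}
```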
Releases
Velox4J currently only provides SNAPSHOT jar releases.
Maven
<dependency>
  <groupId>org.boostscale</groupId>
  <artifactId>velox4j</artifactId>
  <version>0.1.0-SNAPSHOT</version>
</dependency>
If the Central Portal Snapshots repository is not set in Maven, the following settings also need
to be added:
<repositories>
  <repository>
    <name>Central Portal Snapshots</name>
    <id>central-portal-snapshots</id>
    <url>https://central.sonatype.com/repository/maven-snapshots/</url>
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
NOTE:
- The released Jar is built on an x86-64 CPU running CentOS 7 (glibc 2.17).
- The released Jar is verified by a daily CI job maintained in velox4j-integration-test.
Build From Source
mvn clean install
Get Started
The following is a brief example of using Velox4J to execute a query:
public static void main(String[] args) {
  // 1. Initialize Velox4J.
  Velox4j.initialize();
  // 2. Define the plan output schema.
  final RowType outputType = new RowType(List.of(
      "n_nationkey",
      "n_name",
      "n_regionkey",
      "n_comment"
  ), List.of(
      new BigIntType(),
      new VarCharType(),
      new BigIntType(),
      new VarCharType()
  ));
  // 3. Create a table scan node.
  final TableScanNode scanNode = new TableScanNode(
      "plan-id-1",
      outputType,
      new HiveTableHandle(
          "connector-hive",
          "table-1",
          false,
          List.of(),
          null,
          outputType,
          Map.of()
      ),
      toAssignments(outputType)
  );
  // 4. Build the query.
  final Query query = new Query(scanNode, Config.empty(), ConnectorConfig.empty());
  // 5. Create a Velox4J session.
  final MemoryManager memoryManager = Velox4j.newMemoryManager(AllocationListener.NOOP);
  final Session session = Velox4j.newSession(memoryManager);
  // 6. Execute the query. A Velox serial task will be returned.
  final SerialTask task = session.queryOps().execute(query);
  // 7. Add a split associated with the table scan node to the task. This makes
  //    the scan read the local file "/tmp/nation.parquet".
  final File file = new File("/tmp/nation.parquet");
  final ConnectorSplit split = new HiveConnectorSplit(
      "connector-hive",
      0,
      false,
      file.getAbsolutePath(),
      FileFormat.PARQUET,
      0,
      file.length(),
      Map.of(),
      null,
      null,
      Map.of(),
      null,
      Map.of(),
      Map.of(),
      null,
      null
  );
  task.addSplit(scanNode.getId(), split);
  task.noMoreSplits(scanNode.getId());
  // 8. Create a Java iterator from the Velox task.
  final Iterator<RowVector> itr = UpIterators.asJavaIterator(task);
  // 9. Collect and print results.
  while (itr.hasNext()) {
    // 9.1. Get the next RowVector returned by Velox.
    final RowVector rowVector = itr.next();
    // 9.2. Convert the RowVector into Arrow format (an Arrow VectorSchemaRoot in this case).
    final VectorSchemaRoot vsr = Arrow.toArrowVectorSchemaRoot(new RootAllocator(), rowVector);
    // 9.3. Print the Arrow table to stdout.
    System.out.println(vsr.contentToTSVString());
    // 9.4. Release the Arrow VectorSchemaRoot.
    vsr.close();
  }
  // 10. Close the Velox4J session.
  session.close();
  memoryManager.close();
}
Code of the toAssignments utility method used above:
private static List<Assignment> toAssignments(RowType rowType) {
  final List<Assignment> list = new ArrayList<>();
  for (int i = 0; i < rowType.size(); i++) {
    final String name = rowType.getNames().get(i);
    final Type type = rowType.getChildren().get(i);
    list.add(new Assignment(name,
        new HiveColumnHandle(name, ColumnType.REGULAR, type, type, List.of())));
  }
  return list;
}
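The example above closes the session and then the memory manager only when every step succeeds. One way to keep the same close order when a query throws is nested try/finally, sketched below. `FakeSession` and `FakeMemoryManager` are hypothetical stand-ins so the block runs without the library; they are not Velox4J classes, and only the ordering mirrors the example.

```java
import java.util.ArrayList;
import java.util.List;

public class CleanupSketch {
  /** Records the order in which resources were closed. */
  static final List<String> closed = new ArrayList<>();

  /** Hypothetical stand-in for a memory manager; NOT a Velox4J class. */
  static final class FakeMemoryManager {
    void close() {
      closed.add("memoryManager");
    }
  }

  /** Hypothetical stand-in for a session; NOT a Velox4J class. */
  static final class FakeSession {
    void runQuery(boolean fail) {
      if (fail) {
        throw new IllegalStateException("query failed");
      }
    }

    void close() {
      closed.add("session");
    }
  }

  static void run(boolean fail) {
    FakeMemoryManager memoryManager = new FakeMemoryManager();
    try {
      FakeSession session = new FakeSession();
      try {
        session.runQuery(fail);
      } finally {
        // The session is closed first, as in step 10 of the example.
        session.close();
      }
    } finally {
      // The memory manager is closed last, even if the query threw.
      memoryManager.close();
    }
  }

  public static void main(String[] args) {
    try {
      run(true);
    } catch (IllegalStateException e) {
      System.out.println("caught: " + e.getMessage());
    }
    System.out.println("close order: " + closed);
  }
}
```

Running `main` prints `caught: query failed` followed by `close order: [session, memoryManager]`, showing that both resources are released in the original order despite the exception.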
Coding Style
Velox4J's Java code follows the style enforced by Google Java Format, and its C++ code follows Velox's C++ coding style.
You can run the following command to fix all the code style issues during development, including both the C++ code and Java code:
bash .github/workflows/scripts/format/format.sh -fix
Note that a Docker environment is required to run the script.
If you only need to check the code format without fixing it, use the -check subcommand instead:
bash .github/workflows/scripts/format/format.sh -check
License
This project is licensed under the Apache-2.0 License.
