# Cobrix - COBOL Data Source for Apache Spark

Pain-free Spark/COBOL file integration. Seamlessly query your COBOL/EBCDIC binary files as Spark DataFrames and streams. Add the mainframe as a source to your data engineering strategy.
## Motivation

Among the motivations for this project are:

- Lack of expertise in the COBOL ecosystem, which makes it hard to integrate mainframes into data engineering strategies.
- Lack of support from the open-source community for initiatives in this field.
- The overwhelming majority (if not all) of tools for this domain are proprietary.
- Several institutions struggle daily to maintain their legacy mainframes, which prevents them from evolving to more modern approaches to data management.
- Mainframe data can only take part in data science activities through very expensive investments.
## Features

- Supports primitive types (although some are "COBOL compiler specific").
- Supports REDEFINES, OCCURS and DEPENDING ON fields (e.g. unchecked unions and variable-size arrays).
- Supports nested structures and arrays.
- Supports Hadoop (HDFS, S3, ...) as well as the local file system.
- The COBOL copybook parser has no Spark dependency and can be reused for integration into other data processing engines.
- Supports reading files compressed in a Hadoop-compatible way (gzip, bzip2, etc.), but with limited parallelism. Uncompressed files are preferred for performance.
## Videos

We presented Cobrix at the DataWorks Summit 2019 and Spark Summit 2019 conferences. The screencasts are available here:

- DataWorks Summit 2019 (general Cobrix workflow for hierarchical databases): https://www.youtube.com/watch?v=o_up7X3ZL24
- Spark Summit 2019 (more detailed overview of performance optimizations): https://www.youtube.com/watch?v=BOBIdGf3Tm0
## Requirements

| spark-cobol | Spark  |
|-------------|--------|
| 0.x         | 2.2+   |
| 1.x         | 2.2+   |
| 2.x         | 2.4.3+ |
| 2.6.x+      | 3.2.0+ |
## Linking

You can link against this library in your program at the following coordinates:
<table> <tr><th>Scala 2.11</th><th>Scala 2.12</th><th>Scala 2.13</th></tr> <tr> <td align="center"> <a href = "https://mvnrepository.com/artifact/za.co.absa.cobrix/spark-cobol"><img src="https://img.shields.io/maven-central/v/za.co.absa.cobrix/spark-cobol_2.11?label=spark-cobol_2.11"></a></td> <td align="center"> <a href = "https://mvnrepository.com/artifact/za.co.absa.cobrix/spark-cobol"><img src="https://img.shields.io/maven-central/v/za.co.absa.cobrix/spark-cobol_2.12?label=spark-cobol_2.12"></a></td> <td align="center"> <a href = "https://mvnrepository.com/artifact/za.co.absa.cobrix/spark-cobol"><img src="https://img.shields.io/maven-central/v/za.co.absa.cobrix/spark-cobol_2.13?label=spark-cobol_2.13"></a></td> </tr> <tr> <td> <pre>groupId: za.co.absa.cobrix<br>artifactId: spark-cobol_2.11<br>version: 2.10.1</pre> </td> <td> <pre>groupId: za.co.absa.cobrix<br>artifactId: spark-cobol_2.12<br>version: 2.10.1</pre> </td> <td> <pre>groupId: za.co.absa.cobrix<br>artifactId: spark-cobol_2.13<br>version: 2.10.1</pre> </td> </tr> </table>

## Using with Spark shell
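For sbt builds, the same Maven coordinates can be expressed as a single dependency line. This is a sketch using the cross-built artifact name; `%%` appends your project's Scala version suffix (2.11, 2.12, or 2.13) automatically:

```scala
// In build.sbt: spark-cobol is cross-built, so %% picks the right _2.1x artifact
libraryDependencies += "za.co.absa.cobrix" %% "spark-cobol" % "2.10.1"
```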
This package can be added to Spark using the `--packages` command line option. For example, to include it when starting the Spark shell:

Spark compiled with Scala 2.11:

```sh
$SPARK_HOME/bin/spark-shell --packages za.co.absa.cobrix:spark-cobol_2.11:2.10.1
```

Spark compiled with Scala 2.12:

```sh
$SPARK_HOME/bin/spark-shell --packages za.co.absa.cobrix:spark-cobol_2.12:2.10.1
```

Spark compiled with Scala 2.13:

```sh
$SPARK_HOME/bin/spark-shell --packages za.co.absa.cobrix:spark-cobol_2.13:2.10.1
```
## Usage

### Quick start

This repository contains several standalone example applications in the `examples/spark-cobol-app` directory. It is a Maven project that contains several examples:

- `SparkTypesApp` is an example of very simple mainframe file processing. It reads a fixed record length raw data file with a corresponding copybook. The copybook contains examples of various numeric data types Cobrix supports.
- `SparkCobolApp` is an example of a Spark job for handling multisegment variable record length mainframe files.
- `SparkCodecApp` is an example usage of a custom record header parser. This application reads a variable record length file with non-standard RDW headers. In this example the RDW header is 5 bytes instead of 4.
- `SparkCobolHierarchical` is an example of processing an EBCDIC multisegment file extracted from a hierarchical database.

The example project can be used as a template for creating a Spark application. Refer to the README.md of that project for a detailed guide on how to run the examples locally and on a cluster.

When running `mvn clean package` in `examples/spark-cobol-app`, an uber jar will be created. It can be used to run jobs via `spark-submit` or `spark-shell`.
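A sketch of submitting the resulting uber jar with `spark-submit`. The main class and jar file name below are hypothetical placeholders; substitute the ones from your build:

```sh
# Submit a Cobrix job using the uber jar produced by `mvn clean package`.
# `com.example.SparkCobolApp` and the jar name are placeholders, not real artifacts.
$SPARK_HOME/bin/spark-submit \
  --class com.example.SparkCobolApp \
  --master "local[*]" \
  target/spark-cobol-app-uber.jar
```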
### How to generate a code coverage report

```sh
sbt ++{scala_version} jacoco
```

The code coverage report will be generated at:

```
{project-root}/cobrix/{module}/target/scala-{scala_version}/jacoco/report/html
```
### Reading COBOL binary files from Hadoop/local and querying them

1. Create a Spark `SQLContext`.
2. Start a `sqlContext.read` operation specifying `za.co.absa.cobrix.spark.cobol.source` as the format.
3. Provide the path to the copybook describing the files via `.option("copybook", "path_to_copybook_file")`.
   - By default the copybook is expected to be in the default Hadoop filesystem (HDFS, S3, etc.).
   - You can specify that a copybook is located in the local file system by adding the `file://` prefix, for example `.option("copybook", "file:///home/user/data/copybook.cpy")`.
   - Alternatively, instead of providing a path to a copybook file you can provide the contents of the copybook itself using `.option("copybook_contents", "...copybook contents...")`.
   - You can also store the copybook in the JAR itself in the resources section; in that case use the `jar://` prefix, e.g. `.option("copybook", "jar:///copybooks/copybook.cpy")`.
4. Provide the path to the Hadoop directory containing the files: `.load("s3a://path_to_directory_containing_the_binary_files")`.
5. Specify the query you would like to run on the COBOL DataFrame.
Below is an example whose full version can be found at `za.co.absa.cobrix.spark.cobol.examples.SampleApp` and `za.co.absa.cobrix.spark.cobol.examples.CobolSparkExample`:

```scala
val sparkBuilder = SparkSession.builder().appName("Example")
val spark = sparkBuilder
  .getOrCreate()

val cobolDataframe = spark
  .read
  .format("cobol")
  .option("copybook", "data/test1_copybook.cob")
  .load("data/test2_data")

cobolDataframe
  .filter("RECORD.ID % 2 = 0") // filter on even values of the nested field 'RECORD.ID'
  .take(10)
  .foreach(v => println(v))
```
In some scenarios Spark is unable to find the "cobol" data source by its short name. In that case you can use the full path to the source class instead: `.format("za.co.absa.cobrix.spark.cobol.source")`.
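As a minimal sketch, the earlier example can be rewritten using the fully qualified source name (same copybook and data paths as the previous example):

```scala
val cobolDataframe = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source") // full class path instead of the "cobol" alias
  .option("copybook", "data/test1_copybook.cob")
  .load("data/test2_data")
```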
Cobrix assumes input data is encoded in EBCDIC. You can load ASCII files as well by specifying the following option: `.option("encoding", "ascii")`.

If the input file is a text file (CRLF / LF are used to split records), use `.option("is_text", "true")`.

Multisegment ASCII text files are supported using `.option("record_format", "D")`.

Cobrix has better handling of special characters and partial records using its extension format: `.option("record_format", "D2")`.

Read more on record formats at https://www.ibm.com/docs/en/zos/2.4.0?topic=files-selecting-record-formats-non-vsam-data-sets
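Combining the options above, here is a hedged sketch of reading a line-delimited ASCII file. The copybook and data paths are hypothetical placeholders:

```scala
val asciiDf = spark
  .read
  .format("cobol")
  .option("copybook", "file:///path/to/copybook.cpy") // hypothetical local copybook path
  .option("encoding", "ascii")  // input is ASCII rather than the default EBCDIC
  .option("is_text", "true")    // records are split by CRLF / LF
  .load("/path/to/ascii_data")  // hypothetical data path
```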
### Streaming COBOL binary files from a directory

1. Create a Spark `StreamingContext`.
2. Import the binary file/stream conversion manager: `za.co.absa.spark.cobol.source.streaming.CobolStreamer._`
3. Read the binary files contained in the path provided during the creation of the `SparkSession` as a stream: `streamingContext.cobolStream()`
4. Apply queries on the stream: `stream.filter("some_filter")`
5. Start the streaming job.
Below is an example whose full version can be found at `za.co.absa.cobrix.spark.cobol.examples.StreamingExample`:

```scala
val spark = SparkSession
  .builder()
  .appName("CobolParser")
  .master("local[2]")
  .config("duration", 2)
  .config("copybook", "path_to_the_copybook")
  .config("path", "path_to_source_directory") // could be either local or Hadoop (s3://, hdfs://, etc.)
  .getOrCreate()

val streamingContext = new StreamingContext(spark.sparkContext, Seconds(3))

import za.co.absa.spark.cobol.source.streaming.CobolStreamer._ // imports the Cobol streams manager

val stream = streamingContext.cobolStream() // streams the binary files into the application

stream
  .filter(row => row.getAs[Integer]("NUMERIC_FLD") % 2 == 0) // filters on even values of the nested field 'NUMERIC_FLD'
  .print(10)

streamingContext.start()
streamingContext.awaitTermination()
```