Cobrix - COBOL Data Source for Apache Spark

License: Apache v2

Pain-free Spark/COBOL file integration.

Seamlessly query your COBOL/EBCDIC binary files as Spark Dataframes and streams.

Add mainframe as a source to your data engineering strategy.

Motivation

The motivations for this project include:

  • Lack of expertise in the Cobol ecosystem, which makes it hard to integrate mainframes into data engineering strategies.

  • Lack of support from the open-source community to initiatives in this field.

  • The overwhelming majority (if not all) of tools to cope with this domain are proprietary.

  • Several institutions struggle daily to maintain their legacy mainframes, which prevents them from evolving to more modern approaches to data management.

  • Mainframe data can only take part in data science activities through very expensive investments.

Features

  • Supports primitive types (although some are "Cobol compiler specific").

  • Supports REDEFINES, OCCURS and DEPENDING ON fields (e.g. unchecked unions and variable-size arrays).

  • Supports nested structures and arrays.

  • Supports Hadoop (HDFS, S3, ...) as well as local file system.

  • The COBOL copybook parser doesn't have a Spark dependency and can be reused to integrate with other data processing engines.

  • Supports reading files compressed in a Hadoop-compatible way (gzip, bzip2, etc.), but with limited parallelism. Uncompressed files are preferred for performance.

Videos

We have presented Cobrix at DataWorks Summit 2019 and Spark Summit 2019 conferences. The screencasts are available here:

DataWorks Summit 2019 (General Cobrix workflow for hierarchical databases): https://www.youtube.com/watch?v=o_up7X3ZL24

Spark Summit 2019 (More detailed overview of performance optimizations): https://www.youtube.com/watch?v=BOBIdGf3Tm0

Requirements

| spark-cobol | Spark  |
|-------------|--------|
| 0.x         | 2.2+   |
| 1.x         | 2.2+   |
| 2.x         | 2.4.3+ |
| 2.6.x+      | 3.2.0+ |

Linking

You can link against this library in your program at the following coordinates:

<table> <tr><th>Scala 2.11</th><th>Scala 2.12</th><th>Scala 2.13</th></tr> <tr> <td align="center"> <a href = "https://mvnrepository.com/artifact/za.co.absa.cobrix/spark-cobol"><img src="https://img.shields.io/maven-central/v/za.co.absa.cobrix/spark-cobol_2.11?label=spark-cobol_2.11"></a></td> <td align="center"> <a href = "https://mvnrepository.com/artifact/za.co.absa.cobrix/spark-cobol"><img src="https://img.shields.io/maven-central/v/za.co.absa.cobrix/spark-cobol_2.12?label=spark-cobol_2.12"></a></td> <td align="center"> <a href = "https://mvnrepository.com/artifact/za.co.absa.cobrix/spark-cobol"><img src="https://img.shields.io/maven-central/v/za.co.absa.cobrix/spark-cobol_2.13?label=spark-cobol_2.13"></a></td> </tr> <tr> <td> <pre>groupId: za.co.absa.cobrix<br>artifactId: spark-cobol_2.11<br>version: 2.10.1</pre> </td> <td> <pre>groupId: za.co.absa.cobrix<br>artifactId: spark-cobol_2.12<br>version: 2.10.1</pre> </td> <td> <pre>groupId: za.co.absa.cobrix<br>artifactId: spark-cobol_2.13<br>version: 2.10.1</pre> </td> </tr> </table>
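In a build tool, the coordinates above translate to a single dependency declaration. A minimal sbt sketch, assuming Scala 2.12 (the `%%` operator appends the Scala binary-version suffix such as `_2.12` automatically):

```scala
// build.sbt -- adds the Cobrix Spark data source; adjust the version for your setup
libraryDependencies += "za.co.absa.cobrix" %% "spark-cobol" % "2.10.1"
```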

Using with Spark shell

This package can be added to Spark using the --packages command-line option. For example, to include it when starting the Spark shell:

Spark compiled with Scala 2.11

$SPARK_HOME/bin/spark-shell --packages za.co.absa.cobrix:spark-cobol_2.11:2.10.1

Spark compiled with Scala 2.12

$SPARK_HOME/bin/spark-shell --packages za.co.absa.cobrix:spark-cobol_2.12:2.10.1

Spark compiled with Scala 2.13

$SPARK_HOME/bin/spark-shell --packages za.co.absa.cobrix:spark-cobol_2.13:2.10.1

Usage

Quick start

This repository contains several standalone example applications in the examples/spark-cobol-app directory. It is a Maven project containing several examples:

  • SparkTypesApp is an example of a very simple mainframe file processing. It is a fixed record length raw data file with a corresponding copybook. The copybook contains examples of various numeric data types Cobrix supports.
  • SparkCobolApp is an example of a Spark Job for handling multisegment variable record length mainframe files.
  • SparkCodecApp is an example usage of a custom record header parser. This application reads a variable record length file with non-standard RDW headers. In this example the RDW header is 5 bytes instead of 4.
  • SparkCobolHierarchical is an example of processing an EBCDIC multisegment file extracted from a hierarchical database.

The example project can be used as a template for creating Spark applications. Refer to the README.md of that project for a detailed guide on how to run the examples locally and on a cluster.

Running mvn clean package in examples/spark-cobol-app creates an uber jar. It can be used to run jobs via spark-submit or spark-shell.
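The uber jar can then be submitted, for instance, to a local Spark installation. A sketch only: the jar file name and the main class below are illustrative placeholders; check the actual build output and the example sources for the real names.

```shell
# Submit one of the example jobs locally.
# Jar path and --class value are placeholders, not the actual artifact names.
$SPARK_HOME/bin/spark-submit \
  --master local[*] \
  --class com.example.SparkCobolApp \
  target/spark-cobol-app.jar
```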

How to generate Code coverage report

sbt ++{scala_version} jacoco

The code coverage report will be generated at the path:

{project-root}/cobrix/{module}/target/scala-{scala_version}/jacoco/report/html

Reading Cobol binary files from Hadoop/local and querying them

  1. Create a Spark SQLContext

  2. Start a sqlContext.read operation specifying za.co.absa.cobrix.spark.cobol.source as the format

  3. Provide the path to the copybook describing the files via ... .option("copybook", "path_to_copybook_file").

    • By default the copybook is expected to be in the default Hadoop filesystem (HDFS, S3, etc).
    • You can specify that a copybook is located in the local file system by adding file:// prefix.
    • For example, you can specify a local file like this .option("copybook", "file:///home/user/data/copybook.cpy").
    • Alternatively, instead of providing a path to a copybook file you can provide the contents of the copybook itself by using .option("copybook_contents", "...copybook contents...").
    • You can store the copybook in the JAR itself, in the resources section; in this case, use the jar:// prefix, e.g.: .option("copybook", "jar:///copybooks/copybook.cpy").
  4. Provide the path to the Hadoop directory containing the files: ... .load("s3a://path_to_directory_containing_the_binary_files")

  5. Specify the query you would like to run on the Cobol DataFrame.

Below is an example whose full version can be found at za.co.absa.cobrix.spark.cobol.examples.SampleApp and za.co.absa.cobrix.spark.cobol.examples.CobolSparkExample

val sparkBuilder = SparkSession.builder().appName("Example")
val spark = sparkBuilder
  .getOrCreate()

val cobolDataframe = spark
  .read
  .format("cobol")
  .option("copybook", "data/test1_copybook.cob")
  .load("data/test2_data")

cobolDataframe
    .filter("RECORD.ID % 2 = 0") // filter the even values of the nested field 'RECORD.ID'
    .take(10)
    .foreach(v => println(v))

The full example is available in the za.co.absa.cobrix.spark.cobol.examples package.

In some scenarios Spark is unable to find the "cobol" data source by its short name. In that case you can use the full path to the source class instead: .format("za.co.absa.cobrix.spark.cobol.source")

Cobrix assumes input data is encoded in EBCDIC. You can load ASCII files as well by specifying the following option: .option("encoding", "ascii").

If the input file is a text file (CRLF / LF are used to split records), use .option("is_text", "true").

Multisegment ASCII text files are supported using this option: .option("record_format", "D").

Cobrix has better handling of special characters and partial records using its extension format: .option("record_format", "D2").

Read more on record formats at https://www.ibm.com/docs/en/zos/2.4.0?topic=files-selecting-record-formats-non-vsam-data-sets
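The options above can be combined. A minimal sketch for reading a plain ASCII text file; the application name and both paths are placeholders, not paths from this repository:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AsciiExample").getOrCreate()

// Input is ASCII instead of EBCDIC; records are split on CRLF / LF line endings.
val asciiDf = spark.read
  .format("cobol")
  .option("copybook", "file:///path/to/copybook.cpy") // placeholder path
  .option("encoding", "ascii")
  .option("is_text", "true")
  .load("/path/to/ascii_data") // placeholder path
```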

Streaming Cobol binary files from a directory

  1. Create a Spark StreamingContext

  2. Import the binary files/stream conversion manager: za.co.absa.spark.cobol.source.streaming.CobolStreamer._

  3. Read the binary files in the path provided at SparkSession creation as a stream: ... streamingContext.cobolStream()

  4. Apply queries on the stream: ... stream.filter("some_filter") ...

  5. Start the streaming job.

Below is an example whose full version can be found at za.co.absa.cobrix.spark.cobol.examples.StreamingExample

val spark = SparkSession
  .builder()
  .appName("CobolParser")
  .master("local[2]")
  .config("duration", 2)
  .config("copybook", "path_to_the_copybook")
  .config("path", "path_to_source_directory") // could be both, local or Hadoop (s3://, hdfs://, etc)
  .getOrCreate()          
      
val streamingContext = new StreamingContext(spark.sparkContext, Seconds(3))         
    
import za.co.absa.spark.cobol.source.streaming.CobolStreamer._ // imports the Cobol streams manager

val stream = streamingContext.cobolStream() // streams the binary files into the application    

stream
    .filter(row => row.getAs[Integer]("NUMERIC_FLD") % 2 == 0) // filters the even values of the nested field 'NUMERIC_FLD'
    .print(10)

streamingContext.start()
streamingContext.awaitTermination()
