Cobrix - COBOL Data Source for Apache Spark

License: Apache v2

Pain-free Spark/COBOL file integration.

Seamlessly query your COBOL/EBCDIC binary files as Spark Dataframes and streams.

Add mainframe as a source to your data engineering strategy.

Motivation

The motivations for this project include:

  • Lack of expertise in the Cobol ecosystem, which makes it hard to integrate mainframes into data engineering strategies.

  • Lack of support from the open-source community to initiatives in this field.

  • The overwhelming majority (if not all) of tools to cope with this domain are proprietary.

  • Several institutions struggle daily to maintain their legacy mainframes, which prevents them from evolving to more modern approaches to data management.

  • Mainframe data can only take part in data science activities through very expensive investments.

Features

  • Supports primitive types (although some are "Cobol compiler specific").

  • Supports REDEFINES, OCCURS and DEPENDING ON fields (e.g. unchecked unions and variable-size arrays).

  • Supports nested structures and arrays.

  • Supports Hadoop (HDFS, S3, ...) as well as local file system.

  • The COBOL copybook parser doesn't have a Spark dependency and can be reused to integrate with other data processing engines.

  • Supports reading files compressed in a Hadoop-compatible way (gzip, bzip2, etc.), but with limited parallelism. Uncompressed files are preferred for performance.

Videos

We have presented Cobrix at DataWorks Summit 2019 and Spark Summit 2019 conferences. The screencasts are available here:

DataWorks Summit 2019 (General Cobrix workflow for hierarchical databases): https://www.youtube.com/watch?v=o_up7X3ZL24

Spark Summit 2019 (More detailed overview of performance optimizations): https://www.youtube.com/watch?v=BOBIdGf3Tm0

Requirements

| spark-cobol | Spark  |
|-------------|--------|
| 0.x         | 2.2+   |
| 1.x         | 2.2+   |
| 2.x         | 2.4.3+ |
| 2.6.x+      | 3.2.0+ |

Linking

You can link against this library in your program at the following coordinates:

<table> <tr><th>Scala 2.11</th><th>Scala 2.12</th><th>Scala 2.13</th></tr> <tr> <td align="center"> <a href = "https://mvnrepository.com/artifact/za.co.absa.cobrix/spark-cobol"><img src="https://img.shields.io/maven-central/v/za.co.absa.cobrix/spark-cobol_2.11?label=spark-cobol_2.11"></a></td> <td align="center"> <a href = "https://mvnrepository.com/artifact/za.co.absa.cobrix/spark-cobol"><img src="https://img.shields.io/maven-central/v/za.co.absa.cobrix/spark-cobol_2.12?label=spark-cobol_2.12"></a></td> <td align="center"> <a href = "https://mvnrepository.com/artifact/za.co.absa.cobrix/spark-cobol"><img src="https://img.shields.io/maven-central/v/za.co.absa.cobrix/spark-cobol_2.13?label=spark-cobol_2.13"></a></td> </tr> <tr> <td> <pre>groupId: za.co.absa.cobrix<br>artifactId: spark-cobol_2.11<br>version: 2.10.1</pre> </td> <td> <pre>groupId: za.co.absa.cobrix<br>artifactId: spark-cobol_2.12<br>version: 2.10.1</pre> </td> <td> <pre>groupId: za.co.absa.cobrix<br>artifactId: spark-cobol_2.13<br>version: 2.10.1</pre> </td> </tr> </table>
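In a build tool, the coordinates above translate to a single dependency declaration. A minimal sbt sketch, assuming Scala 2.12 (the `%%` operator appends the Scala binary-version suffix such as `_2.12` automatically):

```scala
// build.sbt -- adds the Cobrix Spark data source; adjust the version for your setup
libraryDependencies += "za.co.absa.cobrix" %% "spark-cobol" % "2.10.1"
```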

Using with Spark shell

This package can be added to Spark using the --packages command-line option. For example, to include it when starting the Spark shell:

Spark compiled with Scala 2.11

$SPARK_HOME/bin/spark-shell --packages za.co.absa.cobrix:spark-cobol_2.11:2.10.1

Spark compiled with Scala 2.12

$SPARK_HOME/bin/spark-shell --packages za.co.absa.cobrix:spark-cobol_2.12:2.10.1

Spark compiled with Scala 2.13

$SPARK_HOME/bin/spark-shell --packages za.co.absa.cobrix:spark-cobol_2.13:2.10.1

Usage

Quick start

This repository contains several standalone example applications in the examples/spark-cobol-app directory. It is a Maven project containing several examples:

  • SparkTypesApp is an example of a very simple mainframe file processing. It is a fixed record length raw data file with a corresponding copybook. The copybook contains examples of various numeric data types Cobrix supports.
  • SparkCobolApp is an example of a Spark Job for handling multisegment variable record length mainframe files.
  • SparkCodecApp is an example usage of a custom record header parser. This application reads a variable record length file with non-standard RDW headers. In this example the RDW header is 5 bytes instead of 4.
  • SparkCobolHierarchical is an example of processing an EBCDIC multisegment file extracted from a hierarchical database.

The example project can be used as a template for creating Spark applications. Refer to the README.md of that project for a detailed guide on how to run the examples locally and on a cluster.

Running mvn clean package in examples/spark-cobol-app creates an uber jar. It can be used to run jobs via spark-submit or spark-shell.
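The uber jar can then be submitted, for instance, to a local Spark installation. A sketch only: the jar file name and the main class below are illustrative placeholders; check the actual build output and the example sources for the real names.

```shell
# Submit one of the example jobs locally.
# Jar path and --class value are placeholders, not the actual artifact names.
$SPARK_HOME/bin/spark-submit \
  --master local[*] \
  --class com.example.SparkCobolApp \
  target/spark-cobol-app.jar
```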

How to generate Code coverage report

sbt ++{scala_version} jacoco

The code coverage report will be generated at the path:

{project-root}/cobrix/{module}/target/scala-{scala_version}/jacoco/report/html

Reading Cobol binary files from Hadoop/local and querying them

  1. Create a Spark SQLContext

  2. Start a sqlContext.read operation specifying za.co.absa.cobrix.spark.cobol.source as the format

  3. Provide the path to the copybook describing the files via ... .option("copybook", "path_to_copybook_file").

    • By default the copybook is expected to be in the default Hadoop filesystem (HDFS, S3, etc).
    • You can specify that a copybook is located in the local file system by adding file:// prefix.
    • For example, you can specify a local file like this .option("copybook", "file:///home/user/data/copybook.cpy").
    • Alternatively, instead of providing a path to a copybook file you can provide the contents of the copybook itself by using .option("copybook_contents", "...copybook contents...").
    • You can store the copybook in the JAR itself, in the resources section; in this case, use the jar:// prefix, e.g.: .option("copybook", "jar:///copybooks/copybook.cpy").
  4. Provide the path to the Hadoop directory containing the files: ... .load("s3a://path_to_directory_containing_the_binary_files")

  5. Specify the query you would like to run on the Cobol DataFrame.

Below is an example whose full version can be found at za.co.absa.cobrix.spark.cobol.examples.SampleApp and za.co.absa.cobrix.spark.cobol.examples.CobolSparkExample

val sparkBuilder = SparkSession.builder().appName("Example")
val spark = sparkBuilder
  .getOrCreate()

val cobolDataframe = spark
  .read
  .format("cobol")
  .option("copybook", "data/test1_copybook.cob")
  .load("data/test2_data")

cobolDataframe
    .filter("RECORD.ID % 2 = 0") // filter the even values of the nested field 'RECORD.ID'
    .take(10)
    .foreach(v => println(v))

The full example is available in the za.co.absa.cobrix.spark.cobol.examples package.

In some scenarios Spark is unable to find the "cobol" data source by its short name. In that case you can use the full path to the source class instead: .format("za.co.absa.cobrix.spark.cobol.source")

Cobrix assumes input data is encoded in EBCDIC. You can load ASCII files as well by specifying the following option: .option("encoding", "ascii").

If the input file is a text file (CRLF / LF are used to split records), use .option("is_text", "true").

Multisegment ASCII text files are supported using this option: .option("record_format", "D").

Cobrix has better handling of special characters and partial records using its extension format: .option("record_format", "D2").

Read more on record formats at https://www.ibm.com/docs/en/zos/2.4.0?topic=files-selecting-record-formats-non-vsam-data-sets
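The options above can be combined. A minimal sketch for reading a plain ASCII text file; the application name and both paths are placeholders, not paths from this repository:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AsciiExample").getOrCreate()

// Input is ASCII instead of EBCDIC; records are split on CRLF / LF line endings.
val asciiDf = spark.read
  .format("cobol")
  .option("copybook", "file:///path/to/copybook.cpy") // placeholder path
  .option("encoding", "ascii")
  .option("is_text", "true")
  .load("/path/to/ascii_data") // placeholder path
```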

Streaming Cobol binary files from a directory

  1. Create a Spark StreamingContext

  2. Import the binary files/stream conversion manager: za.co.absa.spark.cobol.source.streaming.CobolStreamer._

  3. Read the binary files in the path provided at SparkSession creation as a stream: ... streamingContext.cobolStream()

  4. Apply queries on the stream: ... stream.filter("some_filter") ...

  5. Start the streaming job.

Below is an example whose full version can be found at za.co.absa.cobrix.spark.cobol.examples.StreamingExample

val spark = SparkSession
  .builder()
  .appName("CobolParser")
  .master("local[2]")
  .config("duration", 2)
  .config("copybook", "path_to_the_copybook")
  .config("path", "path_to_source_directory") // could be both, local or Hadoop (s3://, hdfs://, etc)
  .getOrCreate()          
      
val streamingContext = new StreamingContext(spark.sparkContext, Seconds(3))         
    
import za.co.absa.spark.cobol.source.streaming.CobolStreamer._ // imports the Cobol streams manager

val stream = streamingContext.cobolStream() // streams the binary files into the application    

stream
    .filter(row => row.getAs[Integer]("NUMERIC_FLD") % 2 == 0) // filters the even values of the nested field 'NUMERIC_FLD'
    .print(10)

streamingContext.start()
streamingContext.awaitTermination()
