CopybookInputFormat
Using JRecord to build mapred and mapreduce InputFormats for HDFS, MapReduce, Pig, Hive, Spark, ...
#CopybookInputFormat
##Overview
This project has a collection of tools that let you read copybook data files directly from HDFS using Map/Reduce, Hive, or Spark.

Here is what is in this project:
- BasicCopybookConvert: Example of how to read a copybook data file with the copybook schema with JRecord. This is single threaded.
- PrepCopybook: This tool will clean up a copybook file so it will work with Hive and JRecord.
- GenTestData: This will take a given copybook (cbl) file and create sample rows for testing.
- GenHiveCreateTable: This will read the copybook schema and generate a Hive table definition.
- mapred.InputFormat & RecordReader: This is a mapred implementation of FileInputFormat and RecordReader. It also works with Hive.
- mapreduce.InputFormat & RecordReader: This is a mapreduce implementation of FileInputFormat and RecordReader.
- Spark Example: An example of how to read copybook data with Spark.
##Build
JRecord is not in a public Maven repo, so I have included the JRecord jars. To build, you have to put these jars in your local repo at the following paths:
~/.m2/repository/net/sf/JRecord/JRecord/0.80/JRecord-0.80.jar
~/.m2/repository/net/sf/cb2xml/cb2xml/1.0/cb2xml-1.0.jar
After you do that, run mvn package and use target/copybookInputFormat.jar
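
The two jar locations above follow the standard Maven local repository layout (group id with dots turned into directories, then artifact id, then version). As a minimal sketch of placing the jars, assuming the bundled jars sit somewhere on local disk: the function names `m2_jar_path` and `install_jar` and the `M2_REPO` override are hypothetical helpers for illustration, not part of this project.

```shell
# Print the path a jar must have inside a Maven local repository.
# M2_REPO is a hypothetical override; it defaults to ~/.m2/repository.
m2_jar_path() {
  # Usage: m2_jar_path <groupId> <artifactId> <version>
  group_dir=$(printf '%s' "$1" | tr '.' '/')   # net.sf.JRecord -> net/sf/JRecord
  printf '%s/%s/%s/%s/%s-%s.jar\n' \
    "${M2_REPO:-$HOME/.m2/repository}" "$group_dir" "$2" "$3" "$2" "$3"
}

# Copy a jar into that location, creating the directories if needed.
install_jar() {
  # Usage: install_jar <jar file> <groupId> <artifactId> <version>
  dest=$(m2_jar_path "$2" "$3" "$4")
  mkdir -p "$(dirname "$dest")" && cp "$1" "$dest"
}
```

With these defined, something like install_jar JRecord-0.80.jar net.sf.JRecord JRecord 0.80 followed by install_jar cb2xml-1.0.jar net.sf.cb2xml cb2xml 1.0 should produce the two paths listed above. The first argument is wherever the bundled jar lives in your checkout.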
##Credits
Sekou Mckissick, Susan Greslik, Gwen Shapira, Jeremy Beard, and Ted Malaska
##Internal Notes
java -jar copybookInputFormat.jar GenHiveCreateTable example.cbl createTable.hql exampleTable /user/root/exampleTable /tmp/example.cbl
hive -f createTable.hql
java -jar copybookInputFormat.jar GenTestData example.cbl copyGen/example.dat 100 10
hadoop fs -put copyGen/example.dat /user/root/exampleTable/example.dat
hadoop fs -put example.cbl /tmp/example.cbl
hive
add jar copybookInputFormat.jar;
set copybook.inputformat.cbl.hdfs.path=/tmp/example.cbl;
desc exampleTable;
select * from exampleTable;
select * from exampleTable where user_id > '570';
hadoop jar SparkCopybookExample.jar com.cloudera.sa.copybook.spark.CopybookSparkExample spark://{host}:7077 hdfs://{host}:8020/tmp/example.cbl hdfs://{host}:8020/user/root/exampleTable hdfs://{host}:8020/user/root/op2
or
java -cp SparkCopybookExample.jar com.cloudera.sa.copybook.spark.CopybookSparkExample spark://{host}:7077 hdfs://{host}:8020/tmp/example.cbl hdfs://{host}:8020/user/root/exampleTable hdfs://{host}:8020/user/root/op3
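
The table setup commands in these notes hard-code example.cbl and exampleTable. A small sketch that prints the same setup commands for other names could look like the following. The helper name `print_copybook_steps` is hypothetical, and the meaning of each argument is inferred from the example invocation above rather than from any documentation, so treat it as an assumption. It only prints the commands and runs nothing.

```shell
# Print (do not run) the table setup commands for a given copybook and table.
# Argument meanings are inferred from the example in the notes, not documented.
print_copybook_steps() {
  cbl=$1        # local copybook file, e.g. example.cbl
  table=$2      # Hive table name, e.g. exampleTable
  hdfs_dir=$3   # HDFS directory backing the table (assumed)
  hdfs_cbl=$4   # HDFS path the copybook is uploaded to (assumed)
  echo "java -jar copybookInputFormat.jar GenHiveCreateTable $cbl createTable.hql $table $hdfs_dir $hdfs_cbl"
  echo "hive -f createTable.hql"
  echo "hadoop fs -put $cbl $hdfs_cbl"
}
```

Calling print_copybook_steps example.cbl exampleTable /user/root/exampleTable /tmp/example.cbl reproduces the corresponding lines from the notes. GenTestData is left out because the notes do not say what its trailing numeric arguments mean.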
##Extra Notes
<property>
  <name>hive.aux.jars.path</name>
  <value>hdfs:///user/root/copybook-0.0.1-SNAPSHOT.jar</value>
</property>
