MegaSparkDiff
MegaSparkDiff is a Spark-based data comparison tool that helps software development engineers compare many pair combinations of possible data sources at scale. Multiple execution modes in multiple environments let the user generate a diff report either as a Java/Scala-friendly DataFrame or as a file for later use. It ships with out-of-the-box SparkFactory and SparkCompare tools.
MegaSparkDiff is an open source tool that helps you compare any pair
combination of data sets of the following types:
HDFS, JDBC, S3, HBase, text files, Hive, JSON, and DynamoDB.
MegaSparkDiff can run on:
(a) Amazon EMR (Elastic MapReduce),
(b) Amazon EC2 instances and cloud environments with compatible Spark distributions,
(c) Databricks interactive notebooks, with visualizations via the displayHTML function.
How to Use from Within a Java or Scala Project
<dependency>
<groupId>org.finra.megasparkdiff</groupId>
<artifactId>mega-spark-diff</artifactId>
<version>0.4.0</version>
</dependency>
SparkFactory
Parallelizes source and target data.
The data sources can be in the following forms:
Text file
HDFS file
SQL query over a JDBC data source
Hive table
JSON file
DynamoDB table
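As a rough illustration, parallelizing two text sources and comparing them from Java might look like the sketch below. The package, class, and method names (org.finra.msd.*, initializeSparkLocalMode, parallelizeTextSource, compareAppleTables) are assumptions based on typical usage and should be verified against the javadoc for your version; the snippet also needs Spark and MegaSparkDiff on the classpath, so it is a sketch rather than a drop-in program.

```java
// Hedged sketch: package and method names below are assumptions --
// verify against the MegaSparkDiff javadoc for your version.
import org.finra.msd.sparkfactory.SparkFactory;
import org.finra.msd.sparkcompare.SparkCompare;
import org.finra.msd.containers.AppleTable;

public class CompareTextFiles {
    public static void main(String[] args) {
        // Start a local Spark context, convenient for development and testing.
        SparkFactory.initializeSparkLocalMode("local[*]", "WARN", "1");

        // Parallelize each text file into a comparable table abstraction.
        AppleTable left  = SparkFactory.parallelizeTextSource("data/source.txt", "left_table");
        AppleTable right = SparkFactory.parallelizeTextSource("data/target.txt", "right_table");

        // Compare the two sources; the result describes rows present in one
        // source but not the other.
        SparkCompare.compareAppleTables(left, right);
    }
}
```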
SparkCompare
Compares pair combinations of supported sources.
Note that when comparing a schema-based source to a non-schema-based source, the SparkCompare
class will attempt to flatten the schema-based source to delimited values and then perform the
comparison. The delimiter can be specified when launching the compare job.
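To make the flattening idea concrete, the hypothetical helper below joins a row's column values into a single delimited string, which is conceptually what happens before a schema-based row is compared against a raw text line. The FlattenExample class and its flatten method are illustrative only and not part of MegaSparkDiff's API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: shows the idea of flattening a schema-based row
// (modeled here as a list of column values) into one delimited string.
public class FlattenExample {
    public static String flatten(List<Object> row, String delimiter) {
        // Convert every column value to text and join with the chosen delimiter.
        return row.stream()
                  .map(String::valueOf)
                  .collect(Collectors.joining(delimiter));
    }

    public static void main(String[] args) {
        List<Object> row = Arrays.asList(1, "Alice", 3.5);
        System.out.println(flatten(row, ","));  // 1,Alice,3.5
    }
}
```

Once both sides are plain delimited lines, the comparison reduces to matching text rows, which is why the delimiter must be chosen so it does not occur inside the column values themselves.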
How to use via shell script in EMR
There will exist a shell script named msd.sh that wraps
this Java/Scala project. This script will accept several parameters
related to source definitions, the output destination, and run
configurations, as well as which two data sets to compare.
The parameters are as follows:
-ds=<data_source_folder>: The folder where the database
connection parameters and data queries reside
-od=<output_directory>: The directory where MegaSparkDiff will write
its output
-rc=<run_config_file_name>: The file that will be used to load
any special run and Spark configurations. This parameter is
optional
To specify a data set to compare, pass in the name of one of the
data queries found in a config file inside <data_source_folder>,
prefixed with "--". The program will execute the queries whose names
are passed on the command line, store the results in tables, and
perform the comparison.
Example call:
./msd.sh -ds=./data_sources/ -od=output --shraddha --carlos
Additionally, the user has the option to add JDBC driver jar
files by including them on the classpath, enabling extraction from
whichever database they choose.
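For instance, a hypothetical invocation that adds a MySQL driver jar to the classpath before launching the comparison could look like the following; the jar path, filename, and environment-variable approach are illustrative assumptions, not project conventions.

```shell
# Illustrative only: put the JDBC driver jar on the classpath so the
# comparison job can open connections to that database.
export CLASSPATH="$CLASSPATH:/path/to/mysql-connector-java-8.0.33.jar"
./msd.sh -ds=./data_sources/ -od=output --shraddha --carlos
```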
Run tests on Windows
- Download the Hadoop winutils binaries
- Extract them to some path, e.g. C:\Users\MegaSparkDiffFan\bin
- Run the tests while defining
hadoop.home.dir, e.g. mvn test -Dhadoop.home.dir=C:\Users\MegaSparkDiffFan
