Sparkora
Powerful rapid automatic EDA and feature engineering library with a very easy to use API 🌟
Exploratory data analysis toolkit for PySpark.
Don't forget to hit the star ⭐ button if you like this project and would like to see more such utilities.
Contents
- [Summary](#summary)
- [Setup](#setup)
- [Examples](#examples)
- [Usage](#use)
  - [Reading Data & Configuration](#config)
  - [Cleaning](#clean)
  - [Feature Selection & Extraction](#feature)
  - [Visualization](#visual)
  - [Model Validation](#model)
  - [Data Versioning](#version)
- [Testing](#test)
- [Contribute](#contribute)
- [License](#license)
<a name="summary"></a>
Summary
Sparkora is a Python library designed to automate the painful parts of exploratory data analysis in Apache Spark.
The library contains convenient functions for data cleaning, feature selection & extraction, visualization (Databricks-native), partitioning data for model validation, and versioning transformations of data.
The library uses the following modules:
- pyspark
- ipython
- os
- copy
It is intended to be a helpful addition to common PySpark data analysis tools.
<a name="setup"></a>
Setup
$ pip install Sparkora
$ python
>>> from Sparkora import Sparkora
<a name="examples"></a>
Examples
Go to the example Jupyter notebook, or import it into a Databricks workspace.
<a name="use"></a>
Usage
<a name="config" ></a>
Reading Data & Configuration
# without initial config
>>> sparkora = Sparkora()
>>> sparkora.configure(output = 'A', data = 'path/to/data.csv')
# is the same as
>>> from pyspark.shell import spark
>>> dataframe = spark.read.csv('path/to/data.csv', header=True)
>>> sparkora = Sparkora(output = 'A', data = dataframe)
>>> sparkora.data
A B C D useless_feature
0 1 2 0 left 1
1 4 NaN 1 right 1
2 7 8 2 left 1
<a name="clean" ></a>
Cleaning
# read data with missing and poorly scaled values
>>> from pyspark.shell import spark
>>> d = [
... (1, 2, 100),
... (2, None, 200),
... (1, 6, None)
... ]
>>> df = spark.createDataFrame(d, "_c0 int, _c1 int, _c2 int")
>>> sparkora = Sparkora(output = '_c0', data = df)
>>> sparkora.data
_c0 _c1 _c2
0 1 2 100
1 2 NaN 200
2 1 6 NaN
# impute the missing values (default strategy is 'mean'; other options are 'median' and 'mode')
# e.g. sparkora.impute_missing_values(strategy = 'mean')
>>> sparkora.impute_missing_values()
>>> sparkora.data
_c0 _c1 _c2
0 1 2 100
1 2 4 200
2 1 6 150
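Mean imputation fills each null with the average of the non-null values in the same column: (2 + 6) / 2 = 4 for `_c1` and (100 + 200) / 2 = 150 for `_c2`, matching the table above. A standalone sketch of that computation in plain Python (illustrative only, not Sparkora's internal implementation):

```python
# Mean imputation sketch: replace each None with the mean of the
# non-null values in the same column.
def impute_mean(column):
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

print(impute_mean([2, None, 6]))       # [2, 4.0, 6]
print(impute_mean([100, 200, None]))   # [100, 200, 150.0]
```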
# scale the values of the input variables (center to mean and scale to unit variance)
>>> sparkora.scale_input_values()
>>> sparkora.data
_c0 _c1 _c2
0 1 -1.224745 -1.224745
1 2 0.000000 1.224745
2 1 1.224745 0.000000
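The scaling shown above is standardization: z = (x - mean) / std, using the population standard deviation, which is how the column 2, 4, 6 becomes -1.224745, 0, 1.224745. A minimal sketch of the formula (independent of Sparkora):

```python
import math

# Standardization sketch: center to mean 0, scale to unit variance
# using the population standard deviation (divide by n, not n - 1).
def scale(column):
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]

print([round(z, 6) for z in scale([2, 4, 6])])
# [-1.224745, 0.0, 1.224745]
```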
<a name="feature" ></a>
Feature Selection & Extraction
# feature selection / removing a feature
>>> sparkora.data
A B C D useless_feature
0 1 2 0 left 1
1 4 NaN 1 right 1
2 7 8 2 left 1
>>> sparkora.remove_feature('useless_feature')
>>> sparkora.data
A B C D
0 1 2 0 left
1 4 NaN 1 right
2 7 8 2 left
# extract an ordinal feature through one-hot encoding
>>> sparkora.extract_ordinal_feature('D')
>>> sparkora.data
A B C D=left D=right
0 1 2 0 1 0
1 4 NaN 1 0 1
2 7 8 2 1 0
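`extract_ordinal_feature` replaces a categorical column with one 0/1 indicator column per distinct value, named `<column>=<value>`. A pure-Python sketch of that expansion (a conceptual illustration, not Sparkora's code):

```python
# One-hot encoding sketch: one indicator column per distinct value,
# named '<col>=<value>', as in the table above.
def one_hot(name, values):
    categories = sorted(set(values))
    return {
        f"{name}={cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

print(one_hot("D", ["left", "right", "left"]))
# {'D=left': [1, 0, 1], 'D=right': [0, 1, 0]}
```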
# extract a transformation of another feature
>>> from pyspark.sql import functions as F, types as T
>>> f_udf = F.udf(lambda x: x * 2, T.IntegerType())
>>> sparkora.extract_feature('C', 'twoC', f_udf)
>>> sparkora.data
A B C D=left D=right twoC
0 1 2 0 1 0 0
1 4 NaN 1 0 1 2
2 7 8 2 1 0 4
# extract a transformation of multiple other features
>>> f_udf = F.udf(lambda x,y: x * y, T.IntegerType())
>>> sparkora.extract_feature(['C','A'], 'newC', f_udf)
>>> sparkora.data
A B C D=left D=right newC
0 1 2 0 1 0 0
1 4 NaN 1 0 1 4
2 7 8 2 1 0 14
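Conceptually, a multi-column UDF is applied row by row: `extract_feature(['C','A'], 'newC', f_udf)` computes newC = C * A for each row (0*1 = 0, 1*4 = 4, 2*7 = 14). A plain-Python sketch of that row-wise application (a hypothetical helper, not Sparkora's implementation):

```python
# Row-wise UDF application sketch: build a new column by applying
# a function to the values of the source columns in each row.
def extract_feature(rows, source_cols, new_col, fn):
    return [
        {**row, new_col: fn(*(row[c] for c in source_cols))}
        for row in rows
    ]

rows = [{"A": 1, "C": 0}, {"A": 4, "C": 1}, {"A": 7, "C": 2}]
out = extract_feature(rows, ["C", "A"], "newC", lambda x, y: x * y)
print([r["newC"] for r in out])  # [0, 4, 14]
```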
<a name="visual" ></a>
Visualization
# Visualization features are only available if the platform is Databricks
# plot a single feature against the output variable
sparkora.plot_feature('column-name')
# render plots of each feature against the output variable
sparkora.explore()
<a name="model" ></a>
Model Validation
# create random partition of training / validation data (~ 80/20 split)
sparkora.set_training_and_validation(0.8)
# pass the train size only; the validation size is calculated automatically
# train a model on the data
X = sparkora.training_data.select(sparkora.input_columns())
y = sparkora.training_data.select(sparkora.output)
some_model.fit(X, y)
# validate the model
X = sparkora.validating_data.select(sparkora.input_columns())
y = sparkora.validating_data.select(sparkora.output)
some_model.score(X, y)
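The idea behind a random ~80/20 partition is that each row independently lands in the training set with probability equal to the train size (Spark itself exposes this as `DataFrame.randomSplit`). A plain-Python sketch of the mechanism, not Sparkora's actual code:

```python
import random

# Random train/validation split sketch: each row goes to the
# training set with probability `train_size` (so the split is
# approximately, not exactly, 80/20).
def split(rows, train_size, seed=42):
    rng = random.Random(seed)
    train, valid = [], []
    for row in rows:
        (train if rng.random() < train_size else valid).append(row)
    return train, valid

train, valid = split(list(range(1000)), 0.8)
print(len(train) + len(valid))  # 1000 rows total, roughly 80/20
```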
<a name="version" ></a>
Data Versioning
# save a version of your data
>>> sparkora.data
A B C D useless_feature
0 1 2 0 left 1
1 4 NaN 1 right 1
2 7 8 2 left 1
>>> sparkora.snapshot('initial_data')
# keep track of changes to data
>>> sparkora.remove_feature('useless_feature')
>>> sparkora.extract_ordinal_feature('D')
>>> sparkora.impute_missing_values()
>>> sparkora.scale_input_values()
>>> sparkora.data
A B C D=left D=right
0 1 -1.224745 -1.224745 0.707107 -0.707107
1 4 0.000000 0.000000 -1.414214 1.414214
2 7 1.224745 1.224745 0.707107 -0.707107
>>> sparkora.logs
["self.remove_feature('useless_feature')", "self.extract_ordinal_feature('D')", 'self.impute_missing_values()', 'self.scale_input_values()']
# use a previous version of the data
>>> sparkora.snapshot('transform1')
>>> sparkora.use_snapshot('initial_data')
>>> sparkora.data
A B C D useless_feature
0 1 2 0 left 1
1 4 NaN 1 right 1
2 7 8 2 left 1
>>> sparkora.logs
["sparkora.snapshot('initial_data')"]
# switch back to your transformation
>>> sparkora.use_snapshot('transform1')
>>> sparkora.data
A B C D=left D=right
0 1 -1.224745 -1.224745 0.707107 -0.707107
1 4 0.000000 0.000000 -1.414214 1.414214
2 7 1.224745 1.224745 0.707107 -0.707107
>>> sparkora.logs
["self.remove_feature('useless_feature')", "self.extract_ordinal_feature('D')", 'self.impute_missing_values()', 'self.scale_input_values()']
<a name="test" ></a>
Testing
To run the test suite, run `python testsparkora.py` from the Sparkora directory.
<a name="contribute" ></a>
Contribute
Pull requests welcome! Feature requests and bugs are tracked through issues on this repository. Not every feature request will necessarily be implemented, but maintaining a record for interested contributors is useful.
Additionally, feel free to submit pull requests that add features or fix bugs yourself.
<a name="license" ></a>
License
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
Copyright (C) 2007 Free Software Foundation, Inc. https://fsf.org/ Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
Preamble
The GNU General Public License is a free, copyleft license for software and other kinds of works.
The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too.
When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.
To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others.
For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.
Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it.
For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions.
