Sparkora

Exploratory data analysis toolkit for PySpark.

Don't forget to hit the star ⭐ button if you like this project and would like to see more utilities like it.

Summary
Setup
Examples
Usage
Testing
Contribute
License

Summary

Sparkora is a Python library designed to automate the painful parts of exploratory data analysis in Apache Spark.

The library contains convenience functions for data cleaning, feature selection and extraction, visualization (Databricks-native), partitioning data for model validation, and versioning transformations of data.

The library uses the following modules:

pyspark
ipython
os
copy

It is intended to be a helpful addition to common PySpark data analysis tools.

Setup

$ pip install Sparkora
$ python
>>> from Sparkora import Sparkora

Examples

See the example Jupyter notebook, or import it into a Databricks workspace.

Usage

Reading Data & Configuration

# without initial config
>>> sparkora = Sparkora()
>>> sparkora.configure(output = 'A', data = 'path/to/data.csv')

# is the same as
>>> from pyspark.shell import spark
>>> dataframe = spark.read.csv('path/to/data.csv', header=True)
>>> sparkora = Sparkora(output = 'A', data = dataframe)

>>> sparkora.data
   A   B  C      D  useless_feature
0  1   2  0   left                1
1  4 NaN  1  right                1
2  7   8  2   left                1

Cleaning

# read data with missing and poorly scaled values
>>> from pyspark.shell import spark 
>>> d = [
...   (1, 2, 100),
...   (2, None, 200),
...   (1, 6, None)
... ]
>>> df = spark.createDataFrame(d, "_c0 int, _c1 int, _c2 int")
>>> sparkora = Sparkora(output = '_c0', data = df)
>>> sparkora.data
   _c0   _c1    _c2
0    1     2    100
1    2   NaN    200
2    1     6    NaN

# impute the missing values (uses 'mean' by default; other options are 'median' and 'mode')
# sparkora.impute_missing_values(strategy = 'mean')
>>> sparkora.impute_missing_values()
>>> sparkora.data
   _c0   _c1    _c2
0    1     2    100
1    2     4    200
2    1     6    150

# scale the values of the input variables (center to mean and scale to unit variance)
>>> sparkora.scale_input_values()
>>> sparkora.data
   _c0        _c1         _c2
0    1     -1.224745   -1.224745
1    2      0.000000    1.224745
2    1      1.224745    0.000000
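The numbers in the cleaning example can be checked by hand. The following is a minimal pure-Python sketch, not Sparkora's implementation: the helper names `impute_mean` and `standardize` are hypothetical, and dividing by the population standard deviation is an assumption inferred from the output shown above.

```python
from statistics import fmean, pstdev

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = fmean(observed)
    return [mean if v is None else v for v in column]

def standardize(column):
    """Center to zero mean and divide by the population standard deviation."""
    mean = fmean(column)
    std = pstdev(column)
    return [(v - mean) / std for v in column]

# Columns _c1 and _c2 from the example above.
c1 = impute_mean([2, None, 6])
c2 = impute_mean([100, 200, None])

scaled_c1 = [round(v, 6) for v in standardize(c1)]
scaled_c2 = [round(v, 6) for v in standardize(c2)]

print(c1, c2)
print(scaled_c1, scaled_c2)
```

The imputed values (4 and 150) and the scaled values (±1.224745 and 0) agree with the tables above, which suggests the scaling divides by the population rather than the sample standard deviation.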

Feature Selection & Extraction

# feature selection / removing a feature
>>> sparkora.data
   A   B  C      D  useless_feature
0  1   2  0   left                1
1  4 NaN  1  right                1
2  7   8  2   left                1

>>> sparkora.remove_feature('useless_feature')
>>> sparkora.data
   A   B  C      D
0  1   2  0   left
1  4 NaN  1  right
2  7   8  2   left

# extract an ordinal feature through one-hot encoding
>>> sparkora.extract_ordinal_feature('D')
>>> sparkora.data
   A   B  C  D=left  D=right
0  1   2  0       1        0
1  4 NaN  1       0        1
2  7   8  2       1        0

# extract a transformation of another feature
>>> from pyspark.sql import functions as F, types as T
>>> f_udf = F.udf(lambda x: x * 2, T.IntegerType())
>>> sparkora.extract_feature('C', 'twoC', f_udf)
>>> sparkora.data
   A   B  C  D=left  D=right  twoC
0  1   2  0       1        0     0
1  4 NaN  1       0        1     2
2  7   8  2       1        0     4

# extract a transformation of multiple other features
>>> f_udf = F.udf(lambda x,y: x * y, T.IntegerType())
>>> sparkora.extract_feature(['C','A'], 'newC', f_udf)
>>> sparkora.data
   A   B  C  D=left  D=right  newC
0  1   2  0       1        0     0
1  4 NaN  1       0        1     4
2  7   8  2       1        0    14
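To make the one-hot step concrete, here is a small pure-Python sketch of what the encoding of column D produces. The `one_hot` helper is hypothetical and only mirrors the `D=left` / `D=right` naming seen in the output above; it says nothing about how Sparkora builds these columns in Spark.

```python
def one_hot(values, name):
    """Return one 0/1 indicator column per distinct category, keyed as 'name=category'."""
    categories = sorted(set(values))
    return {
        f"{name}={cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# Column D from the example above.
encoded = one_hot(["left", "right", "left"], "D")
print(encoded)
```

Each original value turns into exactly one 1 across the new columns, so the row sums of the indicator columns are always 1.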

Visualization

# Visualization features are only available if the platform is Databricks

# plot a single feature against the output variable
sparkora.plot_feature('column-name')

# render plots of each feature against the output variable
sparkora.explore()

Model Validation

# create a random partition of training / validation data (~ 80/20 split)
# pass the training fraction only; the validation size is calculated automatically
sparkora.set_training_and_validation(0.8)

# train a model on the data
X = sparkora.training_data.select(sparkora.input_columns())
y = sparkora.training_data.select(sparkora.output)

some_model.fit(X, y)

# validate the model
X = sparkora.validating_data.select(sparkora.input_columns())
y = sparkora.validating_data.select(sparkora.output)

some_model.score(X, y)
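The split is described as approximately 80/20. One way such a split can come about is sketched below in plain Python, assuming each row is assigned to the training set independently with probability 0.8 (the same idea as Spark's `DataFrame.randomSplit`). Whether Sparkora works this way is an assumption, and `random_partition` is a hypothetical helper.

```python
import random

def random_partition(rows, train_fraction, seed=None):
    """Assign each row to training with probability train_fraction, else to validation."""
    rng = random.Random(seed)
    training, validating = [], []
    for row in rows:
        if rng.random() < train_fraction:
            training.append(row)
        else:
            validating.append(row)
    return training, validating

rows = list(range(1000))
training, validating = random_partition(rows, 0.8, seed=42)

# Sizes are close to 800 / 200 but not exact, because rows are sampled independently.
print(len(training), len(validating))
```

Because rows are sampled independently, the two sets are disjoint and together cover all rows, but their sizes only match the requested fraction on average.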

Data Versioning

# save a version of your data
>>> sparkora.data
   A   B  C      D  useless_feature
0  1   2  0   left                1
1  4 NaN  1  right                1
2  7   8  2   left                1
>>> sparkora.snapshot('initial_data')

# keep track of changes to data
>>> sparkora.remove_feature('useless_feature')
>>> sparkora.extract_ordinal_feature('D')
>>> sparkora.impute_missing_values()
>>> sparkora.scale_input_values()
>>> sparkora.data
   A         B         C    D=left   D=right
0  1 -1.224745 -1.224745  0.707107 -0.707107
1  4  0.000000  0.000000 -1.414214  1.414214
2  7  1.224745  1.224745  0.707107 -0.707107

>>> sparkora.logs
["self.remove_feature('useless_feature')", "self.extract_ordinal_feature('D')", 'self.impute_missing_values()', 'self.scale_input_values()']

# use a previous version of the data
>>> sparkora.snapshot('transform1')
>>> sparkora.use_snapshot('initial_data')
>>> sparkora.data
   A   B  C      D  useless_feature
0  1   2  0   left                1
1  4 NaN  1  right                1
2  7   8  2   left                1
>>> sparkora.logs
["sparkora.snapshot('initial_data')"]

# switch back to your transformation
>>> sparkora.use_snapshot('transform1')
>>> sparkora.data
   A         B         C    D=left   D=right
0  1 -1.224745 -1.224745  0.707107 -0.707107
1  4  0.000000  0.000000 -1.414214  1.414214
2  7  1.224745  1.224745  0.707107 -0.707107
>>> sparkora.logs
["self.remove_feature('useless_feature')", "self.extract_ordinal_feature('D')", 'self.impute_missing_values()', 'self.scale_input_values()']
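The snapshot workflow above can be pictured as storing a named copy of the data together with the log of transformations applied so far. The sketch below is a hypothetical stand-in built on plain Python lists, not Sparkora's code: the class `VersionedData` and its internals are assumptions, and its logs record only transformations, so they will not match the library's log contents exactly.

```python
import copy

class VersionedData:
    """Hypothetical stand-in for an object that supports snapshot / use_snapshot."""

    def __init__(self, data):
        self.data = data          # list of row dicts, standing in for a DataFrame
        self.logs = []            # history of transformations applied to self.data
        self.snapshots = {}       # snapshot name -> (data copy, logs copy)

    def remove_feature(self, name):
        # Drop one column from every row and record the call.
        self.data = [{k: v for k, v in row.items() if k != name} for row in self.data]
        self.logs.append(f"self.remove_feature({name!r})")

    def snapshot(self, name):
        # Store independent copies so later transformations cannot alter the snapshot.
        self.snapshots[name] = (copy.deepcopy(self.data), list(self.logs))

    def use_snapshot(self, name):
        # Restore both the data and the logs that belong to the named version.
        data, logs = self.snapshots[name]
        self.data = copy.deepcopy(data)
        self.logs = list(logs)

v = VersionedData([{"A": 1, "useless_feature": 1}, {"A": 4, "useless_feature": 1}])
v.snapshot("initial_data")
v.remove_feature("useless_feature")
v.snapshot("transform1")

v.use_snapshot("initial_data")
restored_columns = sorted(v.data[0])   # the removed column is back

v.use_snapshot("transform1")
print(restored_columns, v.data, v.logs)
```

Keeping the logs alongside each copy is what lets a switch between versions restore the matching transformation history as well as the data.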

Testing

To run the test suite, simply run python testsparkora.py from the Sparkora directory.

Contribute

Pull requests are welcome! Please report feature requests and bugs through issues on this repository. I may not handle every feature request myself, but keeping a record is useful for interested contributors.

Feel free to submit pull requests that add features or fix bugs yourself.

License

                GNU GENERAL PUBLIC LICENSE
                   Version 3, 29 June 2007

                        Preamble

The GNU General Public License is a free, copyleft license for software and other kinds of works.

The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.

To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others.

For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.

Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it.

For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems wi
