Handyspark

HandySpark - bringing pandas-like capabilities to Spark dataframes

Generate Convert Improve

Install / Use

/learn @dvgodoy/Handyspark

About this skill

Quality Score

0/100

README

HandySpark

Bringing pandas-like capabilities to Spark dataframes!

HandySpark is a package designed to improve PySpark user experience, especially when it comes to exploratory data analysis, including visualization capabilities!

It makes fetching data or computing statistics for columns really easy, returning pandas objects straight away.

It also leverages on the recently released pandas UDFs in Spark to allow for an out-of-the-box usage of common pandas functions in a Spark dataframe.

Moreover, it introduces the stratify operation, so users can perform more sophisticated analysis, imputation and outlier detection on stratified data without incurring in very computationally expensive groupby operations.

It brings the long missing capability of plotting data while retaining the advantage of performing distributed computation (unlike many tutorials on the internet, which just convert the whole dataset to pandas and then plot it - don't ever do that!).

Finally, it also extends evaluation metrics for binary classification, so you can easily choose which threshold to use!

Google Colab

Eager to try it out right away? Don't wait any longer!

Open the notebook directly on Google Colab and try it yourself:

Exploring Titanic

Installation

To install HandySpark from PyPI, just type:

pip install handyspark

Documentation

You can find the full documentation here.

Here is a handy list of direct links to some classes, objects and methods used:

HandyFrame
- cols
- pandas
- transformers
- isnull
- fill
- outliers
- fence
- stratify
Bucket
Quantile
HandyImputer
HandyFencer

Quick Start

To use HandySpark, all you need to do is import the package and, after loading your data into a Spark dataframe, call the toHandy() method to get your own HandyFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from handyspark import *
sdf = spark.read.csv('./tests/rawdata/train.csv', header=True, inferSchema=True)
hdf = sdf.toHandy()

Fetching and plotting data

Now you can easily fetch data as if you were using pandas, just use the cols object from your HandyFrame:

hdf.cols['Name'][:5]

It should return a pandas Series object:

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

If you include a list of columns, it will return a pandas DataFrame.

Due to the distributed nature of data in Spark, it is only possible to fetch the top rows of any given HandyFrame.

Using cols you have access to several pandas-like column and DataFrame based methods implemented in Spark:

min / max / median / q1 / q3 / stddev / mode
nunique
value_counts
corr
hist
boxplot
scatterplot

For instance:

hdf.cols['Embarked'].value_counts(dropna=False)

S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64

You can also make some plots:

from matplotlib import pyplot as plt
fig, axs = plt.subplots(1, 4, figsize=(12, 4))
hdf.cols['Embarked'].hist(ax=axs[0])
hdf.cols['Age'].boxplot(ax=axs[1])
hdf.cols['Fare'].boxplot(ax=axs[2])
hdf.cols[['Fare', 'Age']].scatterplot(ax=axs[3])

cols plots

Handy, right (pun intended!)? But things can get even more interesting if you use stratify!

Stratify

Stratifying a HandyFrame means using a split-apply-combine approach. It will first split your HandyFrame according to the specified (discrete) columns, then it will apply some function to each stratum of data and finally combine the results back together.

This is better illustrated with an example - let's try the stratified version of our previous value_counts:

hdf.stratify(['Pclass']).cols['Embarked'].value_counts()

Pclass  Embarked
1       C            85
        Q             2
        S           127
2       C            17
        Q             3
        S           164
3       C            66
        Q            72
        S           353
Name: value_counts, dtype: int64

Cool, isn't it? Besides, under the hood, not a single group by operation was performed - everything is handled using filter clauses! So, no data shuffling!

What if you want to stratify on a column containing continuous values? No problem!

hdf.stratify(['Sex', Bucket('Age', 2)]).cols['Embarked'].value_counts()

Sex     Age                                Embarked
female  Age >= 0.4200 and Age < 40.2100    C            46
                                           Q            12
                                           S           154
        Age >= 40.2100 and Age <= 80.0000  C            15
                                           S            32
male    Age >= 0.4200 and Age < 40.2100    C            53
                                           Q            11
                                           S           287
        Age >= 40.2100 and Age <= 80.0000  C            16
                                           Q             5
                                           S            81
Name: value_counts, dtype: int64

You can use either Bucket or Quantile to discretize your data in any given number of bins!

What about plotting it? Yes, HandySpark can handle that as well!

hdf.stratify(['Sex', Bucket('Age', 2)]).cols['Embarked'].hist(figsize=(8, 6))

stratified hist

Handling missing data

HandySpark makes it very easy to spot and fill missing values. To figure if there are any missing values, just use isnull:

hdf.isnull(ratio=True)

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
Name: missing(ratio), dtype: float64

Ok, now you know there are 3 columns with missing values: Age, Cabin and Embarked. It's time to fill those values up! But, let's skip Cabin, which has 77% of its values missing!

So, Age is a continuous variable, while Embarked is a categorical variable. Let's start with the latter:

hdf_filled = hdf.fill(categorical=['Embarked'])

HandyFrame has a fill method which takes up to 3 arguments:

categorical: a list of categorical variables
continuous: a list of continuous variables
strategy: which strategy to use for each one of the continuous variables (either mean or median)

Categorical variables use a mode strategy by default.

But you do not need to stick with the basics anymore... you can fancy it up using stratify together with fill:

hdf_filled = hdf_filled.stratify(['Pclass', 'Sex']).fill(continuous=['Age'], strategy=['mean'])

How do you know which values are being used? Simple enough:

hdf_filled.statistics_

{'Age': {'Pclass == "1" and Sex == "female"': 34.61176470588235,
  'Pclass == "1" and Sex == "male"': 41.28138613861386,
  'Pclass == "2" and Sex == "female"': 28.722972972972972,
  'Pclass == "2" and Sex == "male"': 30.74070707070707,
  'Pclass == "3" and Sex == "female"': 21.75,
  'Pclass == "3" and Sex == "male"': 26.507588932806325},
 'Embarked': 'S'}

There you go! The filter clauses and the corresponding imputation values!

But there is more - once you're with your imputation procedure, why not generate a custom transformer to do that for you, either on your test set or in production?

You only need to call the imputer method of the transformer object that every HandyFrame has:

imputer = hdf_filled.transformers.imputer()

In the example above, imputer is now a full-fledged serializable PySpark transformer! What does that mean? You can use it in your pipeline and save / load at will :-)

Detecting outliers

Second only to the problem of missing data, outliers can pose a challenge for training machin

Related Skills

node-connect

339.3k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

claude-opus-4-5-migration

83.9k

Migrate prompts and code from Claude Sonnet 4.0, Sonnet 4.5, or Opus 4.1 to Opus 4.5

frontend-design

83.9k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

model-usage

339.3k

Use CodexBar CLI local cost usage to summarize per-model usage for Codex or Claude, including the current (most recent) model or a full model breakdown. Trigger when asked for model-level usage/cost data from codexbar, or when you need a scriptable per-model summary from codexbar cost JSON.

dvgodoy

View profile

View on GitHub

GitHub Stars199

CategoryDevelopment

Updated1mo ago

Forks27

dvgodoy/handyspark

Languages

Jupyter Notebook

Security Score

100/100

Audited on Feb 18, 2026

No findings