
Popmon

Monitor the stability of a Pandas or Spark dataframe ⚙︎


===========================
Population Shift Monitoring
===========================

|build| |docs| |release| |release_date| |downloads| |ruff|

|logo|

popmon is a package that allows one to check the stability of a dataset. popmon works with both pandas and spark datasets.

popmon creates histograms of features binned in time-slices, and compares the stability of the profiles_ and distributions of those histograms using `statistical tests <https://popmon.readthedocs.io/en/latest/comparisons.html>`_, both over time and with respect to a reference. It works with numerical, ordinal, and categorical features, and the histograms can be higher-dimensional, e.g. it can also track correlations between any two features. popmon can automatically flag and alert on changes observed over time, such as trends, shifts, peaks, outliers, anomalies, and changing correlations, using monitoring business rules.
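The idea of comparing time-sliced histograms against a reference can be illustrated with a minimal, standard-library sketch. It is not popmon's implementation: the helper names, the toy records, the ISO-week slicing and the 0.15 threshold are all assumptions made for illustration, and the maximum difference between normalized bin counts stands in for the wider set of comparisons that popmon offers.

```python
from collections import Counter, defaultdict
from datetime import date


def time_sliced_histograms(records, slice_of):
    """Group (day, category) records into one histogram (Counter) per time slice."""
    hists = defaultdict(Counter)
    for day, value in records:
        hists[slice_of(day)][value] += 1
    return dict(hists)


def max_prob_diff(hist, ref):
    """Largest absolute difference between two normalized histograms."""
    n_hist, n_ref = sum(hist.values()), sum(ref.values())
    keys = set(hist) | set(ref)
    # Counter returns 0 for categories that are missing in one histogram.
    return max(abs(hist[k] / n_hist - ref[k] / n_ref) for k in keys)


# Toy data (hypothetical): a categorical feature observed over three weeks.
records = [
    (date(2020, 1, 6), "F"), (date(2020, 1, 7), "M"),
    (date(2020, 1, 8), "F"), (date(2020, 1, 9), "M"),
    (date(2020, 1, 13), "F"), (date(2020, 1, 14), "M"),
    (date(2020, 1, 15), "M"), (date(2020, 1, 16), "F"),
    (date(2020, 1, 20), "M"), (date(2020, 1, 21), "M"),
    (date(2020, 1, 22), "M"), (date(2020, 1, 23), "F"),
]

# One histogram per (ISO year, ISO week) slice.
hists = time_sliced_histograms(records, slice_of=lambda d: tuple(d.isocalendar()[:2]))

# Reference: the histogram of the full dataset.
reference = sum(hists.values(), Counter())

THRESHOLD = 0.15  # hypothetical alert threshold
flags = {
    time_slice: max_prob_diff(hist, reference) > THRESHOLD
    for time_slice, hist in sorted(hists.items())
}
print(flags)
```

In this toy dataset only the last week, where the category mix changes, exceeds the assumed threshold.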

Latest update: Jan 2026.

|example|

|histograms|

Spark
=====

For Spark, make sure to pick up the correct histogrammar jar files. Spark 4.X is based on Scala 2.13; Spark 3.X is based on Scala 2.12 or 2.13.

.. code-block:: python

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.config(
        "spark.jars.packages",
        "io.github.histogrammar:histogrammar_2.13:1.0.30,io.github.histogrammar:histogrammar-sparksql_2.13:1.0.30",
    ).getOrCreate()

For Scala 2.12, in the string above simply replace "2.13" with "2.12" (twice).
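To avoid editing the string by hand, the coordinates can be built from the Scala version. The helper below is a hypothetical convenience function, not part of popmon or histogrammar; it only reproduces the package string shown above and assumes the same artifact names and the 1.0.30 release.

```python
def histogrammar_packages(scala_version: str, histogrammar_version: str = "1.0.30") -> str:
    """Build the value for ``spark.jars.packages`` (hypothetical helper)."""
    if scala_version not in {"2.12", "2.13"}:
        raise ValueError(f"unsupported Scala version: {scala_version!r}")
    # The two artifacts named in the example above.
    artifacts = ("histogrammar", "histogrammar-sparksql")
    return ",".join(
        f"io.github.histogrammar:{name}_{scala_version}:{histogrammar_version}"
        for name in artifacts
    )


packages_213 = histogrammar_packages("2.13")
packages_212 = histogrammar_packages("2.12")
print(packages_212)
```

The resulting string can then be passed as the second argument to ``SparkSession.builder.config("spark.jars.packages", ...)``.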

Examples
========

- `Flight Delays and Cancellations Kaggle data <https://crclz.com/popmon/reports/flight_delays_report.html>`_
- `Synthetic data (code example below) <https://crclz.com/popmon/reports/test_data_report.html>`_

Documentation
=============

The entire popmon documentation, including tutorials, can be found at `read-the-docs <https://popmon.readthedocs.io>`_.

Notebooks
=========

.. list-table::
   :widths: 80 20
   :header-rows: 1

   * - Tutorial
     - Colab link
   * - `Basic tutorial <https://nbviewer.jupyter.org/github/ing-bank/popmon/blob/master/popmon/notebooks/popmon_tutorial_basic.ipynb>`_
     - |notebook_basic_colab|
   * - `Detailed example (featuring configuration, Apache Spark and more) <https://nbviewer.jupyter.org/github/ing-bank/popmon/blob/master/popmon/notebooks/popmon_tutorial_advanced.ipynb>`_
     - |notebook_advanced_colab|
   * - `Incremental datasets (online analysis) <https://nbviewer.jupyter.org/github/ing-bank/popmon/blob/master/popmon/notebooks/popmon_tutorial_incremental_data.ipynb>`_
     - |notebook_incremental_data_colab|
   * - `Report interpretation (step-by-step guide) <https://nbviewer.jupyter.org/github/ing-bank/popmon/blob/master/popmon/notebooks/popmon_tutorial_reports.ipynb>`_
     - |notebook_reports_colab|

Check it out
============

The popmon library requires Python 3.6+ and is pip friendly. To get started, simply do:

.. code-block:: bash

    $ pip install popmon

or check out the code from our GitHub repository:

.. code-block:: bash

    $ git clone https://github.com/ing-bank/popmon.git
    $ pip install -e popmon

where in this example the code is installed in edit mode (option -e).

You can now use the package in Python with:

.. code-block:: python

    import popmon

Congratulations, you are now ready to use the popmon library!

Quick run
=========

As a quick example, you can do:

.. code-block:: python

    import pandas as pd
    import popmon
    from popmon import resources

    # open synthetic data
    df = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])
    df.head()

    # generate stability report using automatic binning of all encountered features
    # (importing popmon automatically adds this functionality to a dataframe)
    report = df.pm_stability_report(time_axis="date", features=["date:age", "date:gender"])

    # to show the output of the report in a Jupyter notebook you can simply run:
    report

    # or save the report to file
    report.to_file("monitoring_report.html")

To specify your own binning specifications and features you want to report on, you do:

.. code-block:: python

    # time-axis specifications alone; all other features are auto-binned.
    report = df.pm_stability_report(
        time_axis="date", time_width="1w", time_offset="2020-1-6"
    )

    # histogram selections. Here 'date' is the first axis of each histogram.
    features = [
        "date:isActive",
        "date:age",
        "date:eyeColor",
        "date:gender",
        "date:latitude",
        "date:longitude",
        "date:isActive:age",
    ]

    # Specify your own binning specifications for individual features or combinations thereof.
    # This bin specification uses open-ended ("sparse") histograms; unspecified features get
    # auto-binned. The time-axis binning, when specified here, needs to be in nanoseconds.
    bin_specs = {
        "longitude": {"bin_width": 5.0, "bin_offset": 0.0},
        "latitude": {"bin_width": 5.0, "bin_offset": 0.0},
        "age": {"bin_width": 10.0, "bin_offset": 0.0},
        "date": {
            "bin_width": pd.Timedelta("4w").value,
            "bin_offset": pd.Timestamp("2015-1-1").value,
        },
    }

    # generate stability report
    report = df.pm_stability_report(features=features, bin_specs=bin_specs, time_axis=True)

These examples also work with Spark dataframes. You can see the output of this example notebook code `here <https://crclz.com/popmon/reports/test_data_report.html>`_. For all available examples, please see the `tutorials <https://popmon.readthedocs.io/en/latest/tutorials.html>`_ at read-the-docs.
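To make the ``bin_width`` / ``bin_offset`` pairs in the bin specification more concrete, here is a small standard-library sketch of how an open-ended ("sparse") binning can map a value to a bin index. The rule ``floor((value - bin_offset) / bin_width)`` and the helper names ``bin_index`` and ``to_ns`` are assumptions for illustration, not popmon's internal code. The time axis is expressed in nanoseconds since the Unix epoch, matching the note in the example above.

```python
import math
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bin_index(value, bin_width, bin_offset=0.0):
    """Index of the open-ended bin containing ``value`` (assumed rule)."""
    return math.floor((value - bin_offset) / bin_width)


def to_ns(moment):
    """Nanoseconds since the Unix epoch for a timezone-aware datetime."""
    return (moment - EPOCH) // timedelta(microseconds=1) * 1_000


# Numeric features: bins of width 10 (age) and 5 (longitude), starting at 0.
age_bin = bin_index(37, bin_width=10.0)            # covers [30, 40)
longitude_bin = bin_index(-72.3, bin_width=5.0)    # covers [-75, -70)

# Time axis: four-week bins starting at 2015-01-01, both in nanoseconds.
four_weeks_ns = timedelta(weeks=4) // timedelta(microseconds=1) * 1_000
offset_ns = to_ns(datetime(2015, 1, 1, tzinfo=timezone.utc))
date_bin = bin_index(
    to_ns(datetime(2015, 2, 1, tzinfo=timezone.utc)), four_weeks_ns, offset_ns
)

print(age_bin, longitude_bin, date_bin)
```

Because bins are identified only by their index, no lower or upper range has to be fixed in advance, which is what makes this kind of binning open-ended.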

Pipelines for monitoring dataset shift
======================================

Advanced users can leverage popmon's modular data pipeline to customize their workflow. Visualization of the pipeline can be useful when debugging, or for didactic purposes. There is a `script <https://github.com/ing-bank/popmon/tree/master/tools/>`_ included with the package that you can use. The plotting is configurable, and depending on the options you will obtain a result that can be used for understanding the data flow, the high-level components and the (re)use of datasets.

|pipeline|

Example pipeline visualization (click to enlarge)

Reports and integrations
========================

The data shift computations that popmon performs are by default displayed in a self-contained HTML report. This format is favourable in many real-world environments, where access may be restricted. Moreover, reports can be easily shared with others.

Access to the datastore means that it is possible to integrate popmon in almost any workflow. To give an example, one could store the histogram data in a PostgreSQL database, load it from Grafana, and benefit from Grafana's visualisation and alert handling features (e.g. send an email or Slack message upon alert). This may be interesting to teams that are already invested in a particular choice of dashboarding tool.
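As a rough sketch of that idea, the snippet below writes a few metric rows to a SQL table and queries the rows that raised an alert. Everything here is an assumption for illustration: the ``metrics`` dictionary is a hand-written stand-in rather than popmon's actual datastore layout, the table and column names are hypothetical, the 0/1/2 traffic-light encoding is assumed, and SQLite from the standard library replaces PostgreSQL so that the example runs without a database server.

```python
import sqlite3

# Hypothetical stand-in for metrics extracted from a report's datastore:
# feature -> list of (time_bin, metric, value, traffic_light) rows.
metrics = {
    "age": [
        ("2020-01-06", "mean", 41.2, 0),
        ("2020-01-13", "mean", 47.9, 2),
    ],
    "gender": [
        ("2020-01-06", "max_prob_diff", 0.08, 0),
        ("2020-01-13", "max_prob_diff", 0.09, 0),
    ],
}

conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
conn.execute(
    """
    CREATE TABLE popmon_metrics (
        feature TEXT,
        time_bin TEXT,
        metric TEXT,
        value REAL,
        traffic_light INTEGER  -- assumed encoding: 0 green, 1 yellow, 2 red
    )
    """
)

# Flatten the nested dictionary into one row per (feature, time bin, metric).
rows = [(feature, *row) for feature, series in metrics.items() for row in series]
conn.executemany("INSERT INTO popmon_metrics VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# A dashboarding tool could run a query like this one to drive its alerts.
alerts = conn.execute(
    "SELECT feature, time_bin, metric FROM popmon_metrics "
    "WHERE traffic_light = 2 ORDER BY time_bin"
).fetchall()
print(alerts)
```

A long ("tidy") table with one metric per row is used here because it lets a dashboard filter by feature, metric or alert level with plain SQL.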

Possible integrations are:

+----------------+---------------+
| |grafana_logo| | |kibana_logo| |
+----------------+---------------+
| Grafana        | Kibana        |
+----------------+---------------+

Resources on how to integrate popmon are available in the `examples directory <https://github.com/ing-bank/popmon/tree/master/examples/integrations>`_. Contributions of additional or improved integrations are welcome!

.. |grafana_logo| image:: https://upload.wikimedia.org/wikipedia/commons/a/a1/Grafana_logo.svg
    :alt: Grafana logo
    :height: 120
    :target: https://github.com/grafana/grafana

.. |kibana_logo| image:: https://miro.medium.com/max/1400/1*HW_x9ZvIbUkyaqHstsB1ig.png
    :alt: Kibana logo
    :height: 120
    :target: https://github.com/elastic/kibana

Comparison and profile extensions
=================================

External libraries or custom functionality can be easily added to Profiles_ and Comparisons_. If you developed an extension that could be generically used, then please consider contributing it to the package.

Popmon currently integrates:

- `Diptest <https://github.com/RUrlus/diptest>`_

  A Python/C++ implementation of Hartigan & Hartigan's dip test for unimodality. The dip test tests for multimodality in a sample by taking the maximum difference, over all sample points, between the empirical distribution function and the unimodal distribution function that minimizes that maximum difference. Other than unimodality, it makes no further assumptions about the form of the null distribution.

  To enable this extension, install diptest using ``pip install diptest`` or ``pip install popmon[diptest]``.

Resources
=========

Presentations
-------------

.. list-table::
   :header-rows: 1

   * - Title
     - Host
     - Date
     - Speaker
   * - popmon: Analysis Package for Dataset Shift Detection
     - `SciPy Conference 2022 <https://www.scipy2022.scipy.org/>`_
     - July 13, 2022
     - Simon Brugman
   * - Popmon - population monitoring made easy
     - `Big Data Technology Warsaw Summit 2021 <https://bigdatatechwarsaw.eu/>`_
     - February 25, 2021
     - Simon Brugm
