Histogram

Streaming Histograms for Clojure/Java

Overview

This project is an implementation of the streaming, one-pass histograms described in Ben-Haim's Streaming Parallel Decision Trees. Inspired by Tyree's Parallel Boosted Regression Trees, the histograms are extended so that they may track multiple values.

The histograms act as an approximation of the underlying dataset. They can be used for learning, visualization, discretization, or analysis. The histograms may be built independently and merged, making them convenient for parallel and distributed algorithms.

While the core of this library is implemented in Java, it includes a full-featured Clojure wrapper. This readme focuses on the Clojure interface, but Java developers can find documented methods in com.bigml.histogram.Histogram.

Installation

histogram is available as a Maven artifact from Clojars.

For Leiningen:

[bigml/histogram "4.1.2"]

For Maven:

<repository>
  <id>clojars.org</id>
  <url>http://clojars.org/repo</url>
</repository>
<dependency>
  <groupId>bigml</groupId>
  <artifactId>histogram</artifactId>
  <version>4.1.2</version>
</dependency>

Basics

In the following examples we use Incanter to generate data and for charting.

The simplest way to use a histogram is to create one and then insert! points. In the example below, ex/normal-data refers to a sequence of 200K samples from a normal distribution (mean 0, variance 1).

user> (ns examples
        (:use [bigml.histogram.core])
        (:require (bigml.histogram.test [examples :as ex])))
examples> (def hist (reduce insert! (create) ex/normal-data))

You can use the sum fn to find the approximate number of points less than a given threshold:

examples> (sum hist 0)
99814.63248

The density fn gives us an estimate of the point density at the given location:

examples> (density hist 0)
80936.98291

The uniform fn returns a list of points that separate the distribution into equal population areas. Here's an example that produces quartiles:

examples> (uniform hist 4)
(-0.66904 0.00229 0.67605)

Arbitrary percentiles can be found using percentiles:

examples> (percentiles hist 0.5 0.95 0.99)
{0.5 0.00229, 0.95 1.63853, 0.99 2.31390}

We can plot the sums and density estimates as functions. The red line represents the sum, the blue line represents the density. If we normalize the values (dividing by 200K), these lines approximate the cumulative distribution function and the probability density function of the normal distribution.

examples> (ex/sum-density-chart hist) ;; also see (ex/cdf-pdf-chart hist)

Histogram from normal distribution
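The normalization described above can be checked by hand with the numbers already shown for (sum hist 0) and (density hist 0). The snippet below is plain arithmetic and does not call the library. The var names are made up for illustration, and the comparison assumes the data really is standard normal.

```clojure
(def n 200000.0)                      ; sample size stated in the text

(def cdf-at-0 (/ 99814.63248 n))      ; (sum hist 0) divided by n
(def pdf-at-0 (/ 80936.98291 n))      ; (density hist 0) divided by n

;; Exact values for a standard normal at 0, for comparison.
(def true-cdf-at-0 0.5)
(def true-pdf-at-0 (/ 1.0 (Math/sqrt (* 2.0 Math/PI))))

(println cdf-at-0 pdf-at-0 true-pdf-at-0)
```

The normalized estimates come out at roughly 0.499 and 0.405, against exact values of 0.5 and roughly 0.399.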

The histogram approximates distributions using a constant number of bins. This bin limit is a parameter when creating a histogram (:bins, defaults to 64). A bin contains a :count of the points within the bin along with the :mean for the values in the bin. The edges of the bin aren't captured. Instead the histogram assumes that points of a bin are distributed with half the points less than the bin mean and half greater. This explains the fractional sum in the example below:

examples> (def hist (-> (create :bins 3)
                        (insert! 1)
                        (insert! 2)
                        (insert! 3)))
examples> (bins hist)
({:mean 1.0, :count 1} {:mean 2.0, :count 1} {:mean 3.0, :count 1})
examples> (sum hist 2)
1.5
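To make the half-below, half-above assumption concrete, here is a minimal sketch of a sum estimate over plain bin maps. It follows the trapezoid-style interpolation from the Ben-Haim paper as a reading of that procedure, not as the library's code. approx-sum is a hypothetical helper, and the treatment of points outside the range of bin means is a simplifying assumption.

```clojure
;; Hypothetical helper for illustration; not part of bigml.histogram.core.
(defn approx-sum
  "Estimate how many points are <= b. bins must be sorted by :mean."
  [bins b]
  (let [total (reduce + (map :count bins))]
    (cond
      ;; Simplification: nothing lies below the first mean, everything
      ;; lies below any point past the last mean.
      (< b (:mean (first bins))) 0.0
      (> b (:mean (last bins)))  (double total)
      :else
      (let [below  (take-while #(<= (:mean %) b) bins) ; bins with mean <= b
            left   (last below)                        ; bin at or just below b
            right  (first (drop (count below) bins))   ; next bin, or nil
            before (reduce + (map :count (butlast below)))
            half   (/ (:count left) 2.0)]              ; half of the left bin
        (if (nil? right)
          (+ before half)
          (let [frac (/ (- b (:mean left))
                        (- (:mean right) (:mean left)))
                ;; interpolated bin height at b
                m-b  (+ (:count left)
                        (* (- (:count right) (:count left)) frac))
                ;; trapezoid area between the left mean and b
                trap (* (/ (+ (:count left) m-b) 2.0) frac)]
            (+ before half trap)))))))

(approx-sum [{:mean 1.0 :count 1} {:mean 2.0 :count 1} {:mean 3.0 :count 1}] 2)
;; => 1.5
```

For a threshold of 2, the bin at 1.0 contributes its full count and the bin at 2.0 contributes half of its count, which gives the 1.5 shown above.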

As mentioned earlier, the bin limit constrains the number of unique bins a histogram can use to capture a distribution. The histogram above was created with a limit of just three bins. When we add a fourth unique value it will create a fourth bin and then merge the nearest two.

examples> (bins (insert! hist 0.5))
({:mean 0.75, :count 2} {:mean 2.0, :count 1} {:mean 3.0, :count 1})
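The insert-then-merge step can be sketched over plain bin maps. This is a simplified illustration, not the library's implementation: combine, closest-pair and insert-point are hypothetical names, gap weighting and targets are ignored, and max-bins is assumed to be at least 1.

```clojure
;; Hypothetical helpers for illustration; not the library's implementation.
(defn combine
  "Merge two bins: counts add, the mean is the count-weighted average."
  [a b]
  (let [n (+ (:count a) (:count b))]
    {:mean  (/ (+ (* (:mean a) (:count a))
                  (* (:mean b) (:count b)))
               (double n))
     :count n}))

(defn closest-pair
  "Index i of the neighbouring bins (i, i+1) whose means are closest.
  bs must be a vector sorted by :mean with at least two bins."
  [bs]
  (apply min-key
         #(- (:mean (bs (inc %))) (:mean (bs %)))
         (range (dec (count bs)))))

(defn insert-point
  "Insert x into bins (sorted by :mean), keeping at most max-bins bins."
  [bins max-bins x]
  (let [x (double x)]
    (if (some #(== (:mean %) x) bins)
      ;; x matches an existing bin mean: just bump that bin's count.
      (mapv #(if (== (:mean %) x) (update % :count inc) %) bins)
      ;; Otherwise add a one-point bin, then merge closest pairs until we fit.
      (loop [bs (vec (sort-by :mean (conj (vec bins) {:mean x :count 1})))]
        (if (<= (count bs) max-bins)
          bs
          (let [i (closest-pair bs)]
            (recur (vec (concat (subvec bs 0 i)
                                [(combine (bs i) (bs (inc i)))]
                                (subvec bs (+ i 2)))))))))))

(insert-point [{:mean 1.0 :count 1} {:mean 2.0 :count 1} {:mean 3.0 :count 1}]
              3 0.5)
;; => [{:mean 0.75, :count 2} {:mean 2.0, :count 1} {:mean 3.0, :count 1}]
```

The new point at 0.5 is closest to the bin at 1.0, so those two bins merge into one with mean 0.75 and count 2, matching the output above.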

A larger bin limit means a higher quality picture of the distribution, but it also means a larger memory footprint. In the chart below, the red line represents a histogram with 8 bins and the blue line represents 64 bins.

examples> (ex/multi-pdf-chart
           [(reduce insert! (create :bins 8) ex/mixed-normal-data)
            (reduce insert! (create :bins 64) ex/mixed-normal-data)])

8 and 64 bins histograms

Another option when creating a histogram is to use gap weighting. When :gap-weighted? is true, the histogram is encouraged to spend more of its bins capturing the densest areas of the distribution. For the normal distribution that means better resolution near the mean and less resolution near the tails. The chart below shows a histogram without gap weighting in blue and with gap weighting in red. Near the center of the distribution, red uses more bins and better captures the Gaussian distribution's true curve.

examples> (ex/multi-pdf-chart
           [(reduce insert! (create :bins 8 :gap-weighted? true)
                    ex/normal-data)
            (reduce insert! (create :bins 8 :gap-weighted? false)
                    ex/normal-data)])

Gap weighting vs. No gap weighting

Merging

A strength of the histograms is their ability to merge with one another. Histograms can be built on separate data streams and then combined to give a better overall picture.

In this example, the blue line shows a density distribution from a histogram after merging 300 noisy histograms. The red line shows one of the original histograms:

examples> (let [samples (partition 1000 ex/mixed-normal-data)
                hists (map #(reduce insert! (create) %) samples)
                merged (reduce merge! (create) (take 300 hists))]
            (ex/multi-pdf-chart [(first hists) merged]))

Merged histograms
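One way to picture a merge is to pool the bins of both histograms and then shrink the pooled set back to the bin limit by repeatedly combining the closest neighbours. The sketch below works on plain bin maps under that assumption. combine, shrink and merge-hists are hypothetical names, not library functions, and the library's merge! may differ in details such as gap weighting or bins with identical means.

```clojure
;; Hypothetical helpers for illustration; not the library's merge!.
(defn combine
  "Merge two bins: counts add, the mean is the count-weighted average."
  [a b]
  (let [n (+ (:count a) (:count b))]
    {:mean  (/ (+ (* (:mean a) (:count a))
                  (* (:mean b) (:count b)))
               (double n))
     :count n}))

(defn shrink
  "Combine the closest neighbouring bins until at most max-bins remain.
  Assumes max-bins is at least 1."
  [bins max-bins]
  (loop [bs (vec (sort-by :mean bins))]
    (if (<= (count bs) max-bins)
      bs
      (let [i (apply min-key
                     #(- (:mean (bs (inc %))) (:mean (bs %)))
                     (range (dec (count bs))))]
        (recur (vec (concat (subvec bs 0 i)
                            [(combine (bs i) (bs (inc i)))]
                            (subvec bs (+ i 2)))))))))

(defn merge-hists
  "Pool the bins of two histograms, then shrink to the bin limit."
  [bins-a bins-b max-bins]
  (shrink (concat bins-a bins-b) max-bins))

(merge-hists [{:mean 1.0 :count 2} {:mean 5.0 :count 1}]
             [{:mean 2.0 :count 2} {:mean 9.0 :count 1}]
             3)
;; => [{:mean 1.5, :count 4} {:mean 5.0, :count 1} {:mean 9.0, :count 1}]
```

The total count is preserved by the merge (6 points before and after); only the resolution of the summary changes.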

Targets

While a simple histogram is nice for capturing the distribution of a single variable, it's often important to capture the correlation between variables. To that end, the histograms can track a second variable called the target.

The target may be either numeric or categorical. The insert! fn is overloaded to accept either type of target. Each histogram bin will contain information summarizing the target. For numeric targets the sum and sum-of-squares are tracked. For categoricals, a map of counts is maintained.

examples> (-> (create)
              (insert! 1 9)
              (insert! 2 8)
              (insert! 3 7)
              (insert! 3 6)
              (bins))
({:target {:sum 9.0, :sum-squares 81.0, :missing-count 0.0},
  :mean 1.0,
  :count 1}
 {:target {:sum 8.0, :sum-squares 64.0, :missing-count 0.0},
  :mean 2.0,
  :count 1}
 {:target {:sum 13.0, :sum-squares 85.0, :missing-count 0.0},
  :mean 3.0,
  :count 2})
examples> (-> (create)
              (insert! 1 :a)
              (insert! 2 :b)
              (insert! 3 :c)
              (insert! 3 :d)
              (bins))
({:target {:counts {:a 1.0}, :missing-count 0.0},
  :mean 1.0,
  :count 1}
 {:target {:counts {:b 1.0}, :missing-count 0.0},
  :mean 2.0,
  :count 1}
 {:target {:counts {:d 1.0, :c 1.0}, :missing-count 0.0},
  :mean 3.0,
  :count 2})
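The target summaries above are additive, which is what lets bins (and whole histograms) be combined without revisiting the data. The sketch below reproduces the numbers for the bin at 3.0 and shows how a mean and variance can be recovered from the sum and sum-of-squares. numeric-target, combine-targets and target-stats are hypothetical helpers, and the example assumes no missing targets.

```clojure
;; Hypothetical helpers for illustration; not part of bigml.histogram.core.
(defn numeric-target
  "Summary of a single numeric target value y."
  [y]
  {:sum (double y) :sum-squares (double (* y y)) :missing-count 0.0})

(defn combine-targets
  "Combine two numeric summaries by adding them field by field."
  [a b]
  (merge-with + a b))

(defn target-stats
  "Mean and population variance of the targets in a bin holding n points."
  [{:keys [sum sum-squares]} n]
  (let [mean (/ sum n)]
    {:mean     mean
     :variance (- (/ sum-squares n) (* mean mean))}))

;; The two inserts at 3 carried targets 7 and 6.
(def bin-at-3 (combine-targets (numeric-target 7) (numeric-target 6)))
;; bin-at-3 => {:sum 13.0, :sum-squares 85.0, :missing-count 0.0}

(target-stats bin-at-3 2)
;; => {:mean 6.5, :variance 0.25}

;; Categorical counts combine the same way.
(merge-with + {:c 1.0} {:d 1.0})
;; => {:c 1.0, :d 1.0}
```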

Mixing target types isn't allowed:

examples> (-> (create)
              (insert! 1 :a)
              (insert! 2 999))
Can't mix insert types
  [Thrown class com.bigml.histogram.MixedInsertException]

insert-numeric! and insert-categorical! allow target types to be set explicitly:

examples> (-> (create)
              (insert-categorical! 1 1)
              (insert-categorical! 1 2)
              (bins))
({:target {:counts {2 1.0, 1 1.0}, :missing-count 0.0}, :mean 1.0, :count 2})

The extended-sum fn works similarly to sum, but returns a result that includes the target information:

examples> (-> (create)
              (insert! 1 :a)
              (insert! 2 :b)
              (insert! 3 :c)
              (extended-sum 2))
{:sum 1.5, :target {:counts {:c 0.0, :b 0.5, :a 1.0}, :missing-count 0.0}}

The average-target fn returns the average target value given a point. To illustrate, the following histogram captures a dataset where the input field is a sample from the normal distribution while the target value is the sine of the input. The density is in red and the average target value is in blue:

examples> (def make-y (fn [x] (Math/sin x)))
examples> (def hist (let [target-data (map (fn [x] [x (make-y x)])
                                           ex/normal-data)]
                      (reduce (fn [h [x y]] (insert! h x y))
                              (create)
                              target-data)))
examples> (ex/pdf-target-chart hist)

Numeric target

Continuing with the same histogram, we can see that average-target produces values close to the original target:

examples> (def view-target (fn [x] {:actual (make-y x)
                                    :approx (:sum (average-target hist x))}))
examples> (view-target 0)
{:actual 0.0, :approx -0.00051}
examples> (view-target (/ Math/PI 2))
{:actual 1.0, :approx 0.9968169965429206}
examples> (view-target Math/PI)
{:actual 0.0, :approx 0.00463}

Missing Values

Information about missing values is captured whenever the input field or the target is nil. The missing-bin fn retrieves information summarizing the instances with a missing input. For a basic histogram, that is simply the count:

examples> (-> (create)
