| API docs | Clojars | Purpose | |----------|---------|---------| | tesser.core | tesser.core | The core library and essential folds | | tesser.math | tesser.math | Statistical folds: means, correlations, covariance matrices, quantiles, etc. |

Tesser

"Now we will tesser, we will wrinkle again. Do you understand?" "No," Meg said flatly. Mrs. Whatsit sighed. "Explanations are not easy when they are about things for which your civilization still has no words. Calvin talked about traveling at the speed of light. You understand that, little Meg?" "Yes," Meg nodded. "That, of course, is the impractical, long way around. We have learned to take short cuts wherever possible." "Sort of like in math?" Meg asked. "Like in math."

— Madeline L'Engle, A Wrinkle In Time.

A Motivating Example

You've got a big pile of data--say, JSON in files on disk, or TSVs in Hadoop--and you'd like to reduce over that data: computing some statistics, searching for special values, etc. You might want to find the median housing price in each city given a collection of all sales, or find the total mass of all main-sequence stars in a region of sky, or search for an anticorrelation between vaccine use and the prevalence of a disease. These are all folds: collapsing a collection of data into a smaller value.

In Clojure, we're used to writing programs like

(->> stars
     (filter main-sequence?)
     (map :mass)
     (reduce +))

But this reduction is singlethreaded, and can only run on a single machine. You've got 48 cores in your desktop computer. Why aren't they all helping?

(require '[tesser.core :as t])
(->> (t/filter main-sequence)
     (t/map :mass)
     (t/fold +)
     (t/tesser (t/chunk 1024 stars)))

Tesser goes much deeper, but this is the essence: writing understandable, composable, fast programs for exploring datasets.

A Clojure Library for Concurrent & Commutative Folds

Tesser gives us a library for building up folds, and applying those folds to a collection of inputs, divided into chunks. Chunks are reduced with maximal parallelism, and the results of those reductions are reduced together. We call the concurrent reduction :reducer, and the serial reduction :combiner.

Reduce/combine diagram

In order to reduce over a chunk, we need an initial value. A :reducer-identity function generates this initial value. Once the reduction is complete, we may want to transform it using a :post-reducer function--for instance, converting from a transient to a persistent data structure, or discarding internal reducer state.

Diagram of reducer identities and post-reduce

Likewise, we need a :combiner-identity function to generate an initial value for the combine reduction, and a final :post-combine function to transform the combiner's output.

Diagram of combiner identity and post-combine

Tesser folds are associative and commutative monoids: they can be applied concurrently and do not preserve order. We make this tradeoff to reduce the need for coordination between reducers. Not all concurrent folds require/provide commutativity: Algebird, for example, provides ordered distributed folds, and clojure.core.reducers preserves order as well. Right now Tesser doesn't, but we might change that someday.

Unlike CRDTs, Tesser's folds do not require idempotence. In Tesser, 1 + 1 is 2, not 1. Each input factors in to the reduced value exactly once--though the order is up to Tesser.

Representing a Fold

If all you need is a quick-n-dirty replacement for an existing reduce or fold, just drop tesser.simple into your program.

(require '[tesser.simple :as s])
user=> (s/reduce + 0 (range 10000000))
  49999995000000

But Tesser is capable of composing folds together--combining simple functions to express complex ideas. In Tesser, we represent a compiled fold as map of six functions:

{:reducer-identity  (fn [] ...)
 :reducer           (fn [accumulator input] ...)
 :post-reducer      (fn [accumulator] ...)
 :combiner-identity (fn [] ...)
 :combiner          (fn [accumulator post-reducer-result] ...)
 :post-combiner     (fn [accumulator] ...)}

For instance, here's a fold to find the sum of all inputs. While the reducer and combiner often have the same accumulator type and identities, this is not always the case.

(require '[tesser.core :as t])
(t/tesser [[1] [2 3]]
          (t/fold {:reducer-identity  (constantly 0)
                   :reducer           +
                   :post-reducer      identity
                   :combiner-identity (constantly 0)
                   :combiner          +
                   :post-combiner     identity}))
; => 6

Since (+) returns 0 (the additive identity), we can leave off the identities; t/fold will use the reducer as the identity function. The same goes for the post-reducer and post-combiner: (+ x) returns x, so we can leave them off too:

(t/tesser [[1] [2 3]]
          (t/fold {:reducer  +
                   :combiner +}))
; => 6

Or simply pass a function to t/fold, which will be used for all 6 functions.

(t/tesser [[1] [2 3]] (t/fold +))
; => 6

But that's not all; we can transform the summing fold into one that operates on strings like "1" with the map function:

(->> (t/map read-string)
     (t/fold +)
     (t/tesser [["1" "2" "3"]]))
; => 6

Tesser provides a rich library of fold transforms, allowing you to build up complex folds out of simple, modular parts.

Core

Tesser.core looks a lot like the Clojure seq API, and many of its functions have similar names. Their semantics differ, however: Tesser folds do not preserve the order of inputs, and when executed, they run in parallel.

Applying a fold using tesser.core/tesser uses multiple threads proportional to processor cores. Unlike reducers, we don't use the Java forkjoin pool, just plain old threads; it avoids contention issues and improves performance on most JDKs.

(require '[tesser.core :as t])

(t/tesser [[1 2 3] [4 5 6]] (t/into [] (t/map str)))
=> ["4" "5" "6" "1" "2" "3"]

(->> (t/map inc)      ; Increment each number
     (t/filter odd?)  ; Take only odd numbers
     (t/take 5)       ; *which* five odd numbers are selected is arbitrary
     (t/mapcat range) ; Explode each odd number n into the numbers from 0 to n
     (t/frequencies)  ; Compute the frequency of appearances
     (t/tesser (partition 3 (range 100))))
=> {0 5, 7 2, 20 1, 27 1, 1 4, 24 1, 4 3, 15 2, 21 1, 13 2, 22 1, 6 2, 28 1, 25 1, 17 1, 3 3, 12 2, 2 4, 23 1, 19 1, 11 2, 9 2, 5 2, 14 2, 26 1, 16 2, 10 2, 18 1, 8 2}

Fold combinators like facet and fuse allow multiple reductions to be done in a single pass, possibly sharing expensive operations like deserialization. This is a particularly effective way of working with a set of data files on disk or in Hadoop.

Given JSON records about a codebase like

{"year":         2004,
 "lines-of-code" {"ruby": 100,
                  "c":    1693}}

We can find the range of years and the total lines of code in each language in a single pass.

(->> (t/map #(json/parse-string % true))
     (t/fuse {:year-range (t/range (t/map :year))
              :total-code (->> (t/map :lines-of-code)
                               (t/facet)
                               (t/reduce + 0)})
     (t/tesser records))

=> {:year-range [1986 2014]
    :total-code {:c    153423,
                 :ruby 4578,
                 :tcl  3453
                 :bf   1}}

Ready to get started? Start with the tesser.core docs

Math

Tesser.math provides statistical folds. Some, like sum, mean, and correlation, are exact. Others, like quantiles, are estimates.

For instance, to find a Pearson correlation matrix between ln(age), height, and weight; and in the same pass, to find the total number of samples:

(require '[tesser.math :as m])
(->> (t/fuse {:count (t/count)
              :corrs (m/correlation-matrix {:log-age #(Math/ln (:age %))
                                            :height  :height
                                            :weight  :weight})})
     (t/tesser records))

=> {:count 7123525
    :corrs {[:log-age :weight] 0.74
            [:log-age :height] 0.86
            [:weight  :height] 0.91
            ... and the same keys in symmetric order ...}}

Ready? To the tesser.math API!

Hadoop

Update, 2021-03-01: Parkour, the library tesser.hadoop used to talk to Hadoop, hasn't been maintained in some time: its dependencies are now unresolvable. Tesser is likely still a good fit for Hadoop, but I need someone who actually uses Hadoop to take this over. PRs, please!

The tesser.hadoop API takes Tesser folds and distributes them using the Parkour Hadoop library. You can test your folds locally, then run them on a cluster to reduce over huge datasets.

(require '[tesser [core :as t]
                  [math :as m]
                  [hadoop :as h]])

(defn analyze
  "A fold that analyzes measurements of trees from a certain location."
  [location]
  (->> (t/map parse-record)
       (t/filter #(= location (:location %)))
       (t/fuse {:count (t/count)
                :oldest (->> (t/map :age)
                             (t/max))
                :corrs (m/correlation

Tesser

Install / Use

README

Tesser

A Motivating Example

A Clojure Library for Concurrent & Commutative Folds

Representing a Fold

Core

Math

Hadoop