Distribution Fitting in Clojure

The library provides a set of functions for fitting univariate distributions to your (uncensored) data.

The entry point is the fit function, which supports the following methods:

Maximum log likelihood estimation - :mle
Maximum goodness-of-fit estimation:
- Kolmogorov-Smirnov - :ks
- Cramer-von-Mises - :cvm
- Anderson-Darling - :ad, :adr, :adl, :ad2r, :ad2l and :ad2
Quantile matching estimation - :qme
Method of moments (modified) - :mme
Maximum (Product of) Spacing Estimation - :mps

Additionally you can use:

bootstrap to generate parameters from a set of resampled data
infer to generate parameters computationally from data

The library is heavily based on the fitdistrplus R package.

For details please read this paper.

fastmath distributions and optimization methods are used here.

How does it work?

For every method, a target function is created that accepts distribution parameters and returns the log-likelihood, the MSE/MAE of quantiles, or a measure of the difference between CDFs. This function is then minimized or maximized using one of the available algorithms (gradient-based or simplex-based). Optimization is bounded. Initial values for optimization are inferred from the data.
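The idea of a target function can be illustrated with a minimal, self-contained sketch. It does not use fitdistr or fastmath: the names exp-log-likelihood, make-target and grid-minimize are hypothetical, the distribution is a one-parameter exponential, and a crude grid search stands in for the real bounded optimizers.

```clojure
;; Hypothetical sketch, not fitdistr code: fit the rate of an exponential
;; distribution by minimizing the negative log-likelihood.

(defn exp-log-likelihood
  "Log-likelihood of `data` under Exponential(rate): n*ln(rate) - rate*sum(x)."
  [rate data]
  (- (* (count data) (Math/log (double rate)))
     (* rate (reduce + data))))

(defn make-target
  "Builds the target function: takes a parameter map, returns the value to minimize."
  [data]
  (fn [{:keys [rate]}]
    (- (exp-log-likelihood rate data))))

(defn grid-minimize
  "Stand-in for a bounded optimizer: evaluates `target` at evenly spaced
  rates in [lo, hi] and returns the parameter map with the smallest value."
  [target lo hi steps]
  (let [step (/ (- hi lo) (double steps))
        candidates (map (fn [i] {:rate (+ lo (* i step))}) (range (inc steps)))]
    (apply min-key target candidates)))

(def sample-data [0.5 1.2 0.3 2.0 0.8 1.1 0.1 0.6])

;; Search rates between 0.01 and 5.0 in steps of 0.001.
(def fitted (grid-minimize (make-target sample-data) 0.01 5.0 4990))
```

For the exponential distribution the maximum likelihood estimate has the closed form n / sum(x), so (:rate fitted) should land within one grid step of (/ 8 6.6).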

For a bootstrap, sequences of resampled data are created and each sequence is fitted. The best result (mean or median) is used as the final parametrization. Additionally, a confidence interval (or another extent, such as the IQR or min-max) is returned.
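The resampling step can be sketched in plain Clojure. Everything here is an assumption made for illustration: resample, quantile, sample-mean and toy-bootstrap are hypothetical names, the "fit" is just the sample mean, and the interval is a simple percentile interval rather than any of the interval types the library offers.

```clojure
;; Hypothetical sketch, not fitdistr code: bootstrap a single estimate.

(defn resample
  "Draws (count data) items from `data` with replacement."
  [^java.util.Random rng data]
  (let [v (vec data)
        n (count v)]
    (vec (repeatedly n #(v (.nextInt rng (int n)))))))

(defn quantile
  "Picks the element at position floor(q * n) of a sorted vector."
  [sorted-v q]
  (let [n (count sorted-v)]
    (sorted-v (min (dec n) (int (Math/floor (* q n)))))))

(defn sample-mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn toy-bootstrap
  "Fits every resampled dataset with `fit-fn`, then summarizes the estimates
  by their median and a 95% percentile interval."
  [fit-fn data n-resamples seed]
  (let [rng (java.util.Random. (long seed))
        estimates (->> (repeatedly n-resamples #(resample rng data))
                       (mapv fit-fn)
                       sort
                       vec)]
    {:params (quantile estimates 0.5)
     :ci [(quantile estimates 0.025) (quantile estimates 0.975)]
     :all-params estimates}))

(def boot (toy-bootstrap sample-mean [2.1 3.4 1.9 5.0 4.2 3.3 2.8 4.7] 200 42))
```

The keys of the returned map mirror the ones described in the Usage section, but the values are computed by this toy code only.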

Values of any target function can be calculated and returned as a fitness measure.

Method of moments - modified version

The distribution implementations don't provide higher-order moments, but they can calculate the mean (first moment) and variance (second central moment). The MME method matches both against the empirical mean and variance, regardless of the number of parameters to estimate. MSE or MAE is used as the optimization target.
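A minimal sketch of such a target, assuming a gamma distribution parametrized by shape and scale (mean = shape * scale, variance = shape * scale^2) and the population variance as the empirical variance. The names empirical-moments, gamma-moments and mme-target are hypothetical and not part of the library.

```clojure
;; Hypothetical sketch, not fitdistr code: MSE between empirical and
;; theoretical mean/variance for a gamma distribution.

(defn empirical-moments
  "Mean and (population) variance of `data`."
  [data]
  (let [n (double (count data))
        m (/ (reduce + data) n)
        v (/ (reduce + (map #(let [d (- % m)] (* d d)) data)) n)]
    {:mean m :variance v}))

(defn gamma-moments
  "Theoretical mean and variance of Gamma(shape, scale)."
  [{:keys [shape scale]}]
  {:mean (* shape scale)
   :variance (* shape scale scale)})

(defn mme-target
  "Returns a function of the parameters giving the MSE of the two moment errors."
  [data]
  (let [{m :mean v :variance} (empirical-moments data)]
    (fn [params]
      (let [{dm :mean dv :variance} (gamma-moments params)
            e1 (- dm m)
            e2 (- dv v)]
        (/ (+ (* e1 e1) (* e2 e2)) 2.0)))))

(def mme-data [1.2 0.7 3.1 2.4 0.9 1.8 2.2 1.5])
(def target (mme-target mme-data))

;; Closed-form moment estimates for the gamma distribution; the target
;; should be (numerically) zero there.
(def closed-form
  (let [{m :mean v :variance} (empirical-moments mme-data)]
    {:shape (/ (* m m) v) :scale (/ v m)}))
```

An optimizer minimizing target over shape and scale would converge towards closed-form; for distributions without such a closed form, numerical minimization is the only route.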

Usage

To run inference just call one of the following functions:

(fit method distribution data params)
(bootstrap method distribution data params)
(infer distribution data params)

where:

method - one of the supported methods, as a keyword (like :mle or :qme)
distribution - the name of the distribution, as a keyword (see below) (like :beta)
data - any seqable of numbers
params - parametrization (optional, see below)

All methods return a map with the following keys:

:params - best parametrization
:distribution - distribution object
:distribution-name - name of the distribution
:method - used fitting method
:stats - statistics (see below)
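Since the result is an ordinary map, it can be read with regular destructuring. The sketch below uses a literal map whose values are copied from the fit output in Example 1 (with the :distribution object left out), so it runs without the library.

```clojure
;; Literal copy of a fit result from Example 1, minus the :distribution object.
(def fit-result
  {:stats {:ad 0.19749431207310408
           :mle -19126.212671469282
           :aic 38256.425342938564
           :bic 38270.84602368252}
   :params {:alpha 0.5014214878565807 :beta 2.203213102262515}
   :distribution-name :weibull
   :method :ad})

;; Pull out the pieces of interest with map destructuring.
(let [{:keys [params stats distribution-name method]} fit-result]
  (println "fitted" distribution-name "using" method)
  (println "alpha:" (:alpha params) "beta:" (:beta params))
  (println "log-likelihood:" (:mle stats)))
```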

For bootstrap, you additionally receive:

:ci - confidence interval (several methods, see below)
:ci-type - name of the interval method
:all-params - (optional) list of parameters for each resampled dataset
:params - best parametrization (mean or median, depending on confidence interval)

Some validation of the data and initial parameters is performed.

Examples

(require '[fitdistr.core :refer :all]
         '[fitdistr.distributions :refer [distribution-data]])

Example 1

Proof that matching is accurate enough

(def target-data (->seq (distribution :weibull {:alpha 0.5 :beta 2.2}) 10000))

(fit :ad :weibull target-data {:stats [:mle]})
;; => {:stats
;;     {:ad 0.19749431207310408,
;;      :mle -19126.212671469282,
;;      :aic 38256.425342938564,
;;      :bic 38270.84602368252},
;;     :params {:alpha 0.5014214878565807, :beta 2.203213102262515},
;;     :distribution-name :weibull,
;;     :distribution #object[org.apache.commons.math3.distribution.WeibullDistribution 0x430997b7 "org.apache.commons.math3.distribution.WeibullDistribution@430997b7"],
;;     :method :ad}
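The :aic and :bic entries in the output above are consistent with the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, with k = 2 fitted parameters and n = 10000 samples. The following check is a sketch under that assumption; aic and bic are hypothetical helper names, not library functions.

```clojure
;; Hypothetical helpers using the textbook AIC/BIC definitions.
(defn aic
  "Akaike information criterion for k parameters and a given log-likelihood."
  [k log-likelihood]
  (- (* 2.0 k) (* 2.0 log-likelihood)))

(defn bic
  "Bayesian information criterion for k parameters and n observations."
  [k n log-likelihood]
  (- (* k (Math/log (double n))) (* 2.0 log-likelihood)))

;; Log-likelihood copied from the :mle entry of the result above.
(def log-l -19126.212671469282)

(def aic-value (aic 2 log-l))
(def bic-value (bic 2 10000 log-l))
```

Comparing aic-value and bic-value with the :aic and :bic entries shows agreement up to floating-point rounding.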

(bootstrap :mle :weibull target-data {:stats #{:ad}})

;; => {:stats
;;     {:mle -19126.178345014738,
;;      :ad 0.35561024021990306,
;;      :aic 38256.356690029475,
;;      :bic 38270.77737077343},
;;     :mad-median
;;     {:alpha [0.4910043451070347 0.5056336146263343 0.4983189798666845],
;;      :beta [2.1185018316179285 2.326029409552982 2.222265620585455]},
;;     :params {:alpha 0.4983189798666845, :beta 2.222265620585455},
;;     :distribution-name :weibull,
;;     :distribution #object[org.apache.commons.math3.distribution.WeibullDistribution 0x63a766b9 "org.apache.commons.math3.distribution.WeibullDistribution@63a766b9"]}

(infer :weibull target-data {:stats #{:mle :ad}})
;; => {:stats
;;     {:mle -19126.13369575803,
;;      :ad 0.22838225327177497,
;;      :aic 38256.26739151606,
;;      :bic 38270.68807226002},
;;     :params {:alpha 0.5012938746206328, :beta 2.215448048490149},
;;     :distribution-name :weibull,
;;     :distribution #object[org.apache.commons.math3.distribution.WeibullDistribution 0x3a0f2314 "org.apache.commons.math3.distribution.WeibullDistribution@3a0f2314"]}

Using the distribution

(def inferred-distribution (fit :ad :weibull target-data {:stats [:mle]}))

inferred-distribution
;; => {:stats
;;     {:ad 0.22793837346762302,
;;      :mle -18836.462685291066,
;;      :aic 37676.92537058213,
;;      :bic 37691.34605132609},
;;     :params {:alpha 0.5020961308787267, :beta 2.1661515133303646},
;;     :distribution-name :weibull,
;;     :distribution
;;     #object[org.apache.commons.math3.distribution.WeibullDistribution 0x7f421db2 "org.apache.commons.math3.distribution.WeibullDistribution@7f421db2"],
;;     :method :ad}

(->distribution inferred-distribution)
;; => #object[org.apache.commons.math3.distribution.WeibullDistribution 0x7f421db2 "org.apache.commons.math3.distribution.WeibullDistribution@7f421db2"]

(cdf inferred-distribution 0.5)
;; => 0.38057731286029817
(cdf inferred-distribution 0.5 10.0)
;; => 0.5035774570603441
(pdf inferred-distribution 0.5)
;; => 0.2979270382668242
(lpdf inferred-distribution 0.5)
;; => -1.2109066604852476
(icdf inferred-distribution 0.5)
;; => 1.0439237628813434
(sample inferred-distribution)
;; => 0.002386261009069703
(log-likelihood inferred-distribution (take 10 target-data))
;; => -24.233608452146168
(likelihood inferred-distribution (take 10 target-data))
;; => 2.988667305414128E-11
(mean inferred-distribution)
;; => 4.299110979375332
(variance inferred-distribution)
;; => 91.33715492734707
(lower-bound inferred-distribution)
;; => 0.0
(upper-bound inferred-distribution)
;; => ##Inf
(distribution-id inferred-distribution)
;; => :weibull
(distribution-parameters inferred-distribution)
;; => [:beta :alpha]
(drandom inferred-distribution)
;; => 7.4061584562769776
(lrandom inferred-distribution)
;; => 4
(irandom inferred-distribution)
;; => 40
(set-seed! inferred-distribution 1337)
(take 10 (->seq inferred-distribution))
;; => (2.5184760984751717
;;     2.9550268761778735
;;     9.930032804583968
;;     11.259341860117786
;;     0.0808352042851777
;;     17.399335542961957
;;     0.0564922448326893
;;     0.32170752149468795
;;     6.063628565109016
;;     2.2215931112225826)
(set-seed! inferred-distribution 1337)
(->seq inferred-distribution 10)
;; => (2.5184760984751717
;;     2.9550268761778735
;;     9.930032804583968
;;     11.259341860117786
;;     0.0808352042851777
;;     17.399335542961957
;;     0.0564922448326893
;;     0.32170752149468795
;;     6.063628565109016
;;     2.2215931112225826)

Example 2

Search for the best distribution and its parameters. Look at the last example, where the Pareto distribution is wrongly considered best when an inadequate method is used.

(def atv [0.6 2.8 182.2 0.8 478.0 1.1 215.0 0.7 7.9 316.2 0.2 17780.0 7.8 100.0 0.9 180.0 0.3 300.9
          0.6 17.5 10.0 0.1 5.8 87.7 4.1 3.5 4.9 7060.0 0.2 360.0 100.8 2.3 12.3 40.0 2.3 0.1
          2.7 2.2 0.4 2.6 0.2 1.0 7.3 3.2 0.8 1.2 33.7 14.0 21.4 7.7 1.0 1.9 0.7 12.6
          3.2 7.3 4.9 4000.0 2.5 6.7 3.0 63.0 6.0 1.6 10.1 1.2 1.5 1.2 30.0 3.2 3.5 1.2
          0.2 1.9 0.7 17.0 2.8 4.8 1.3 3.7 0.2 1.8 2.6 5.9 2.6 6.3 1.4 0.8 670.0 810.0
          1890.0 1800.0 8500.0 21000.0 31.0 20.5 4370.0 1000.0 39891.8
          316.2 6400.0 1000.0 7400.0 31622.8])

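(defn find-best
  "Fits every distribution in `ds` to `atv` with `method` and returns the best fit.
  Log-likelihood (:mle) is maximized; every other statistic is minimized."
  [method ds]
  (let [selector (if (= method :mle) last first)] ; results are sorted ascending
    (dissoc (->> (map #(fit method % atv {:stats #{:mle :ad :ks :cvm}}) ds)
                 ;; drop fits whose statistic is NaN
                 (remove (fn [v] (Double/isNaN (method (:stats v)))))
                 (sort-by (comp method :stats))
                 (selector))
            ;; remove the distribution object to keep the output readable
            :distribution)))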

(find-best :mle [:weibull :log-normal :gamma :exponential :normal :pareto])
;; => {:stats
;;     {:mle -532.4052019871922,
;;      :cvm 0.6373592936482382,
;;      :ks 0.1672497620724005,
;;      :ad 3.4721179220009617,
;;      :aic 1068.8104039743844,
;;      :bic 1074.0991857726672},
;;     :params {:scale 2.553816262077493, :shape 3.147240361221695},
;;     :distribution-name :log-normal,
;;     :method :mle}

(find-best :ad [:weibull :log-normal :gamma :exponential :normal :pareto])
;; => {:stats
;;     {:ad 3.0345123029861156,
;;      :cvm 0.4615381958965107,
;;      :ks 0.1332827771382316,
;;      :mle -532.9364810533066,
;;      :aic 1069.8729621066132,
;;      :bic 1075.161743904896},
;;     :params {:scale 2.2941800698596815, :shape 3.2934516278879205},
;;     :distribution-name :log-normal,
;;     :method :ad}

(find-best :ks [:weibull :log-normal :gamma :exponential :normal :pareto])
;; => {:stats
;;     {:ks 0.10129796316277348,
;;      :mle -535.0197747143928,
;;      :cvm 0.3675703954412623,
;;      :ad 3.830809551957188,
;;      :aic 1074.0395494287857,
;;      :bic 1079.3283312270685},
;;     :params {:scale 2.03465815391538, :shape 2.873339450786136},
;;     :distribution-name :log-normal,
;;     :method :ks}


Example 3

This i
