Tdigest
t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed enviroments like PySpark
Install / Use
/learn @CamDavidsonPilon/TdigestREADME
tdigest
Efficient percentile estimation of streaming or distributed data
This is a Python implementation of Ted Dunning's t-digest data structure. The t-digest data structure is designed around computing accurate estimates from either streaming data, or distributed data. These estimates are percentiles, quantiles, trimmed means, etc. Two t-digests can be added, making the data structure ideal for map-reduce settings, and can be serialized into much less than 10kB (instead of storing the entire list of data).
See a blog post about it here: Percentile and Quantile Estimation of Big Data: The t-Digest
Installation
tdigest is compatible with both Python 2 and Python 3.
pip install tdigest
Usage
Update the digest sequentially
from tdigest import TDigest
from numpy.random import random
digest = TDigest()
for x in range(5000):
digest.update(random())
print(digest.percentile(15)) # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution
Update the digest in batches
another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))
Sum two digests to create a new digest
sum_digest = digest + another_digest
sum_digest.percentile(30) # about 0.3
To dict or serializing a digest with JSON
You can use the to_dict() method to turn a TDigest object into a standard Python dictionary.
digest = TDigest()
digest.update(1)
digest.update(2)
digest.update(3)
print(digest.to_dict())
Or you can get only a list of Centroids with centroids_to_list().
digest.centroids_to_list()
Similarly, you can restore a Python dict of digest values with update_from_dict(). Centroids are merged with any existing ones in the digest.
For example, make a fresh digest and restore values from a python dictionary.
digest = TDigest()
digest.update_from_dict({'K': 25, 'delta': 0.01, 'centroids': [{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}]})
K and delta values are optional, or you can provide only a list of centroids with update_centroids_from_list().
digest = TDigest()
digest.update_centroids([{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}])
If you want to serialize with other tools like JSON, you can first convert to_dict().
json.dumps(digest.to_dict())
Alternatively, make a custom encoder function to provide as default to the standard json module.
def encoder(digest_obj):
return digest_obj.to_dict()
Then pass the encoder function as the default parameter.
json.dumps(digest, default=encoder)
API
TDigest.
update(x, w=1): update the tdigest with valuexand weightw.batch_update(x, w=1): update the tdigest with values in arrayxand weightw.compress(): perform a compression on the underlying data structure that will shrink the memory footprint of it, without hurting accuracy. Good to perform after adding many values.percentile(p): return thepth percentile. Example:p=50is the median.cdf(x): return the CDF the valuexis at.trimmed_mean(p1, p2): return the mean of data set without the values below and above thep1andp2percentile respectively.to_dict(): return a Python dictionary of the TDigest and internal Centroid values.update_from_dict(dict_values): update from serialized dictionary values into the TDigest object.centroids_to_list(): return a Python list of the TDigest object's internal Centroid values.update_centroids_from_list(list_values): update Centroids from a python list.
Related Skills
node-connect
349.0kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
claude-opus-4-5-migration
109.4kMigrate prompts and code from Claude Sonnet 4.0, Sonnet 4.5, or Opus 4.1 to Opus 4.5
frontend-design
109.4kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
model-usage
349.0kUse CodexBar CLI local cost usage to summarize per-model usage for Codex or Claude, including the current (most recent) model or a full model breakdown. Trigger when asked for model-level usage/cost data from codexbar, or when you need a scriptable per-model summary from codexbar cost JSON.
