Chimpmark
ChimpMARK-2010 is a collection of massive real-world datasets, interesting real-world problems, and simple example code to solve them. Learn Big Data processing, benchmark your cluster, or compete on implementation!
Install / Use
/learn @mrflip/ChimpmarkREADME
ChimpMARK-2010: Big Data Benchmark on Real-World Datasets
The ChimpMARK-2010 is a collection of massive real-world datasets, interesting real-world problems, and simple example code to solve them. Learn Big Data processing, benchmark your cluster, or compete on implementation!
Why?
- Estimate run time for a job using runtimes for tasks with similar size, shape and atomic operations.
- With known data and code, you can performance-qualify cluster. How fast should a join run on a 10-machine cluster of m1.large instances?
- Understand how cluster settings should be tuned in response to job characteristics.
- Compare clusters of equivalent nominal power (CPU, cores, memory) across data centers or hardware configs: AWS vs Rackspace vs private cluster, go!
- Different core technologies (pig vs. wukong vs. raw Java) can compete on run time, efficiency, elegance, and size of code.
Each problem is meant to a) to be a real-world task, b) to nonetheless encapsulate a very small, generic number of operations. For instance, adjacency pairs => adjacency list on the wikipedia link graph is really just a GROUP, but in particular it's a group on very small records with median multiplicity of ~5, maximum in the thousands, and low skew.
Note: this is a planning document: the repo contains no code yet. There's a great variety of example code -- the foundations of what will appear here -- in the "wukong repo":http://github.com/infochimps/wukong/tree/master/examples/
Datasets
The required datasets all bear open licenses and are available through Infochimps
- Daily Weather (
daily_weather) -- NCDC Global Summary of US Daily Weather (GSOD), 1930-2011 (Public Domain). Contains about 20 numeric columns; the basis for several geospatial, timeseries and statistical questions.- Weather Observations: weather station id, date, and observation data (temperature, wind speed and so forth).
- Weather Station metadata: weather station id, longitude/latitude/elevation, periods of service
- Weather Station spatial coverage: Polygons approximating each weather station's area of coverage by year. (Within each polygon, you are closer to the contained weather station than to any other weather station operating that year.) Contains a GeoJSON polygon feature describing the spatial extent along wiht the weather station id, year, and longitude/latitude/elevation.
- Wikipedia Corpus (
wp_corpus) -- The October 2011 dump of all english-language wikipedia articles in source (mediawiki) format. (This is the snapshot of each page's current state, not the larger dataset containing all edits.)page title | page id | redirect | extended abstract | lng | lat | keywords | text
- Wikipedia Extracted Facts (
wp_extraction) - Wikipedia Pagelinks (
wp_linkgraph)page_id | dest,dest,dest
- Wikipedia Pageview logs (
wp_weblogs)page_id | date | hour | count
Size and Shape
Dataset Rows GB RecSz +/- Skew
------- ---- -- ----- --- ----
Daily Weather xx xx xx xx low
WP Corpus xx xx xx xx
WP Page Graph xx
WP Page Views
Challenges
Statistical Summary of Global Weather Trends
Send weather observations to macro tiles, and calculate statistics (min, max, average, stdev, mode, median and percentiles).
-
Join weather data and voronoi tiles; dispatch
-
calculate same but on all rows
demonstrates: support for statistical analysis; high skew reduce
Similarity
Anomaly Detection on a Timeseries
Pageview anomalies
Sessionize pageviews
Logistic Regression
Which pages are correlated with the weather?
Simple Geopatial Rollup of Weather Data
Pagerank
Demonstrates: iterative workflows
Enumerate Triangles (Clustering Coefficient)
demonstrates: large amount of midstream data
- count in- and out-degree for each noe; filter out nodes with total degree <= 1.
- assemble the min-degree-ordered adjacency list (maps node to all its in- or out-neighbors of higher degree)
- emit cross product of neighbor+neighbor pairs
- intersect to find only triangles
Clustering
Document Clustering
-
Calculate top-k TF/IDF terms (based on threshold importance and max size)
-
use minhashing into LSH
- For i = 1...m
- Generate a random order h i on words
- m hi(u) = argmin {hi(w) | w Bu}
- For i = 1...m
-
Tokenize the documents (following mildly complicated rules)
-
Generate wordbag for each doc, stripping out words that occur > hi_thresh (fixed, and these stopwords are given in advance, or you can enumerate and then chop) and stripping out words that occur < lo_thresh in full corpus
-
(want circa 50_000 terms? 50,000 terms . 3,000,000 documents = 150 B cells
-
count of terms in document, count of usages in document
-
ent:
-sum(s*log(s))/log(length(s))(relative entropy of all sizes of the corpus parts)
For each term:
- Rg: range -- count of docs it occurs in
- f: freq (fractional count in whole corpus), dispersion (), log likelihood
- stdev: standard deviation
- chisq: chi-squared
- disp: Julliand's D --
1 - ( (sd(v/s) / mean(v/s)) / sqrt(length(v/s) - 1) ) - IDF:
log_2( N / Rg )
http://www.linguistics.ucsb.edu/faculty/stgries/research/2008_STG_Dispersion_IJCL.pdf
Co-ocurrence Graph
See Mining of Massive Datasets p208 - a-priori algorithm... step I calculate frequent items
Step II:
- For each basket, look in the frequent-items table to see which of its items are frequent.
- In a double loop, generate all frequent pairs.
- For each frequent pair, add one to its count in the data structure used to store counts.
- Finally, at the end of the second pass, examine the structure of counts to determine which pairs are frequent.
Park, Chen, and Yu (PCY)
In step I, keep a count of items. Also, for each piar take hash and bump the count in that bucket.
We can define the set of candidate pairs C2 to be those pairs {i, j} such that:
- i and j are frequent items.
- {i, j} hashes to a frequent bucket. For even better results, use two or more hashes (each in a separate hash table)
Tasks:
|_. Problem | Dataset | Operations exercised |
| | | |
|_. Cat wrangling | | |
| total sort on numeric field, small records | wikistats | ORDER |
| total sort on numeric field, medium records | weather tiles | ORDER |
| total sort on text field (wp page titles) | long documents | ORDER |
| uniform (x% random sample) | short documents | FILTER |
| uniform (x% random sample) | long documents | FILTER |
| prepare .tsv.gz files of 1G +/- 10% | pagerank | STORE, compress (very low entropy) |
| .gz compress in-place those files | pagerank.gz | STORE, compress (very high entropy) |
| simple transform on 100k distinct files | raw weather data | job startup time |
| uniq -c | short documents | distinct, count |
| uniq -c | long documents | distinct, count |
| | | |
|_. Text | | |
| create inverted index | long documents | tokenize, group |
| word count (simple tokenization) | long documents | tokenize, group, count |
| word count (advanced tokenization) | long documents | complex code, group |
| | | |
|_. Graph | | |
| edge pairs => adj list | wp graph | GROUP |
| in-degree + out degree for each node | wp graph | 1:1 JOIN |
| degree-sorted adj list (w/ degrees) | wp graph | FOREACH |
| pagerank | wp graph | workflow; complex code |
| sub-universe | graph + short documents | complex code |
| put aggregated counts on adjacency list | wp graph + wikistats | JOIN huge on big with 100% overlap |
| | | |
|_. Filter |
Related Skills
node-connect
352.2kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
111.1kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
352.2kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
352.2kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
