ChimpMARK-2010: Big Data Benchmark on Real-World Datasets

ChimpMARK-2010 is a collection of massive real-world datasets, interesting real-world problems, and simple example code to solve them. Learn Big Data processing, benchmark your cluster, or compete on implementation!

Why?

  • Estimate run time for a job using runtimes for tasks with similar size, shape and atomic operations.
  • With known data and code, you can performance-qualify a cluster. How fast should a join run on a 10-machine cluster of m1.large instances?
  • Understand how cluster settings should be tuned in response to job characteristics.
  • Compare clusters of equivalent nominal power (CPU, cores, memory) across data centers or hardware configs: AWS vs Rackspace vs private cluster, go!
  • Different core technologies (pig vs. wukong vs. raw Java) can compete on run time, efficiency, elegance, and size of code.

Each problem is meant a) to be a real-world task, yet b) to encapsulate a very small number of generic operations. For instance, adjacency pairs => adjacency list on the Wikipedia link graph is really just a GROUP, but in particular it's a group on very small records with median multiplicity of ~5, maximum in the thousands, and low skew.

Note: this is a planning document: the repo contains no code yet. There's a great variety of example code -- the foundations of what will appear here -- in the "wukong repo":http://github.com/infochimps/wukong/tree/master/examples/
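
For a flavor of what "really just a GROUP" means above, here is a minimal sketch of the adjacency-pairs => adjacency-list step as a plain-Ruby Hadoop Streaming reducer. The tab-separated (source page id, destination page id) input layout is an assumption for illustration, not the final file format.

    #!/usr/bin/env ruby
    # Sketch only: adjacency pairs => adjacency list as a Hadoop Streaming reducer.
    # Assumes tab-separated "src_page_id \t dest_page_id" pairs, already grouped
    # (sorted) on the first field by the Hadoop shuffle.
    current, dests = nil, []

    flush = lambda do
      puts [current, dests.join(',')].join("\t") unless current.nil?
    end

    STDIN.each_line do |line|
      src, dest = line.chomp.split("\t", 2)
      if src != current
        flush.call
        current, dests = src, []
      end
      dests << dest
    end
    flush.call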


Datasets

The required datasets all bear open licenses and are available through Infochimps.

  • Daily Weather (daily_weather) -- NCDC Global Summary of US Daily Weather (GSOD), 1930-2011 (Public Domain). Contains about 20 numeric columns; the basis for several geospatial, timeseries and statistical questions.
    • Weather Observations: weather station id, date, and observation data (temperature, wind speed and so forth).
    • Weather Station metadata: weather station id, longitude/latitude/elevation, periods of service
    • Weather Station spatial coverage: Polygons approximating each weather station's area of coverage by year. (Within each polygon, you are closer to the contained weather station than to any other weather station operating that year.) Contains a GeoJSON polygon feature describing the spatial extent along with the weather station id, year, and longitude/latitude/elevation.
  • Wikipedia Corpus (wp_corpus) -- The October 2011 dump of all English-language Wikipedia articles in source (mediawiki) format. (This is the snapshot of each page's current state, not the larger dataset containing all edits.)
    • page title | page id | redirect | extended abstract | lng | lat | keywords | text
  • Wikipedia Extracted Facts (wp_extraction)
  • Wikipedia Pagelinks (wp_linkgraph)
    • page_id | dest,dest,dest
  • Wikipedia Pageview logs (wp_weblogs)
    • page_id | date | hour | count
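
The record layouts above describe column order, not a literal file format. As a minimal illustration, here is a plain-Ruby reader for the wp_weblogs pageview records; the tab delimiter is an assumption about the dump and is easy to swap out.

    # Sketch only: reading the wp_weblogs pageview records.  Field order follows
    # the listing above; the tab delimiter is an assumption about the dump format.
    Pageview = Struct.new(:page_id, :date, :hour, :count)

    def each_pageview(io)
      io.each_line do |line|
        page_id, date, hour, count = line.chomp.split("\t")
        yield Pageview.new(page_id, date, hour.to_i, count.to_i)
      end
    end

    # Example: pages with more than 1000 views in a single hour
    # each_pageview(STDIN) { |pv| puts pv.page_id if pv.count > 1000 }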

Size and Shape

Dataset             Rows    GB      RecSz   +/-     Skew
-------             ----    --      -----   ---     ----
Daily Weather       xx      xx      xx      xx      low
WP Corpus           xx      xx      xx      xx      xx
WP Page Graph       xx      xx      xx      xx      xx
WP Page Views       xx      xx      xx      xx      xx

Challenges

Statistical Summary of Global Weather Trends

Send weather observations to macro tiles, and calculate statistics (min, max, average, stdev, mode, median and percentiles).

  • Join weather observations to the Voronoi coverage tiles; dispatch each observation to its tile

  • Calculate the same statistics over all rows (no tiling)

Demonstrates: support for statistical analysis; high-skew reduce
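
A minimal sketch of the reduce side of this challenge, as a plain-Ruby Hadoop Streaming reducer over "tile_id <tab> observation_value" lines (an assumed layout). It keeps only running sums, which is the point of the high-skew reduce: even a huge key stays O(1) in memory. Mode, median and percentiles need a sorted pass or a quantile sketch and are omitted here.

    #!/usr/bin/env ruby
    # Sketch only: per-tile summary statistics as a Hadoop Streaming reducer.
    # Input: "tile_id \t observation_value" lines, grouped (sorted) by tile_id.
    def emit(tile, n, min, max, sum, sumsq)
      return if tile.nil? || n.zero?
      mean  = sum / n
      stdev = Math.sqrt([(sumsq / n) - mean**2, 0.0].max)
      puts [tile, n.to_i, min, max, mean.round(3), stdev.round(3)].join("\t")
    end

    tile = nil
    n = min = max = sum = sumsq = 0.0
    STDIN.each_line do |line|
      key, val = line.chomp.split("\t", 2)
      v = val.to_f
      if key != tile
        emit(tile, n, min, max, sum, sumsq)
        tile, n, min, max, sum, sumsq = key, 0.0, v, v, 0.0, 0.0
      end
      n += 1 ; sum += v ; sumsq += v * v
      min = v if v < min
      max = v if v > max
    end
    emit(tile, n, min, max, sum, sumsq)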

Similarity

Anomaly Detection on a Timeseries

Pageview anomalies

Sessionize pageviews

Logistic Regression

Which pages are correlated with the weather?

Simple Geospatial Rollup of Weather Data

Pagerank

Demonstrates: iterative workflows
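
To show the shape of the iterative workflow, here is an in-memory sketch of a single PageRank pass over an adjacency list; on the cluster each pass would be its own map/reduce job over wp_linkgraph. The damping factor (0.85) and the toy graph are assumptions for illustration.

    # Sketch only: one in-memory PageRank pass over an adjacency list
    # { node => [out_neighbors] }.  Repeat the pass until ranks converge.
    DAMPING = 0.85

    def pagerank_step(adj, ranks)
      base       = (1.0 - DAMPING) / adj.size
      next_ranks = adj.keys.map { |node| [node, base] }.to_h
      adj.each do |node, dests|
        next if dests.empty?
        share = DAMPING * ranks[node] / dests.size
        dests.each { |d| next_ranks[d] = next_ranks.fetch(d, base) + share }
      end
      next_ranks
    end

    adj   = { 'a' => ['b', 'c'], 'b' => ['c'], 'c' => ['a'] }
    ranks = adj.keys.map { |node| [node, 1.0 / adj.size] }.to_h
    20.times { ranks = pagerank_step(adj, ranks) }
    puts ranks.inspect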

Enumerate Triangles (Clustering Coefficient)

Demonstrates: large amount of midstream data

  • count in- and out-degree for each node; filter out nodes with total degree <= 1.
  • assemble the min-degree-ordered adjacency list (maps node to all its in- or out-neighbors of higher degree)
  • emit cross product of neighbor+neighbor pairs
  • intersect to find only triangles
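
A small in-memory sketch of the recipe above, treating the link graph as undirected. On the cluster each step is a separate map/reduce pass, and the neighbor+neighbor cross product is where the large midstream data comes from.

    # Sketch only: in-memory triangle enumeration following the steps above.
    require 'set'

    def triangles(edges)
      adj = Hash.new { |h, k| h[k] = Set.new }
      edges.each { |a, b| next if a == b; adj[a] << b; adj[b] << a }

      # Degree-ordered adjacency list: each node keeps only neighbors of higher
      # degree (ties broken by name), so each triangle is emitted exactly once.
      lower  = ->(u, v) { [adj[u].size, u] <=> [adj[v].size, v] }
      higher = adj.each_with_object({}) do |(u, nbrs), h|
        h[u] = nbrs.select { |v| lower.call(u, v) < 0 }
      end

      found = []
      higher.each do |u, nbrs|
        nbrs.combination(2) do |v, w|                      # neighbor+neighbor pairs
          found << [u, v, w].sort if adj[v].include?(w)    # keep only closed ones
        end
      end
      found
    end

    p triangles([['a', 'b'], ['b', 'c'], ['a', 'c'], ['c', 'd']])  # => [["a", "b", "c"]]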

Clustering

Document Clustering

  • Calculate top-k TF/IDF terms (based on threshold importance and max size)

  • use minhashing to feed LSH (locality-sensitive hashing); a small sketch follows this list

    • For i = 1...m:
      • generate a random order h_i on the words
      • the minhash of document u is h_i(u) = argmin { h_i(w) : w ∈ B_u }, where B_u is u's word bag
  • Tokenize the documents (following mildly complicated rules)

  • Generate a wordbag for each doc, stripping out words that occur more than hi_thresh times in the full corpus (these stopwords can be fixed and given in advance, or you can enumerate and then chop) and words that occur fewer than lo_thresh times in the full corpus

  • (Want circa 50,000 terms? 50,000 terms x 3,000,000 documents = 150 billion cells.)

  • count of distinct terms in each document, and count of total term usages in each document

  • ent: -sum(s*log(s))/log(length(s)) (relative entropy of all sizes of the corpus parts)
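
Here is the minhash sketch referred to above. Rather than materializing m random permutations of the vocabulary, each order h_i is simulated with a seeded hash of the word; the signature length of 64 is an arbitrary assumption, and the LSH banding/bucketing step is not shown.

    # Sketch only: minhash signatures for the word bags described above.
    require 'digest'

    M = 64  # number of hash functions, i.e. signature length (an assumption)

    def h(i, word)
      Digest::MD5.hexdigest("#{i}:#{word}").to_i(16)
    end

    # h_i(u) = argmin { h_i(w) : w in B_u }, B_u being the document's word bag
    def minhash_signature(wordbag)
      (0...M).map { |i| wordbag.map { |w| h(i, w) }.min }
    end

    # Documents whose signatures agree in many positions are likely similar:
    sig_a = minhash_signature(%w[rain snow gauge station temperature])
    sig_b = minhash_signature(%w[rain snow gauge station thermometer])
    similarity = sig_a.zip(sig_b).count { |a, b| a == b } / M.to_f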

For each term:

  • Rg: range -- count of docs it occurs in
  • f: freq (fractional count in whole corpus), dispersion (), log likelihood
  • stdev: standard deviation
  • chisq: chi-squared
  • disp: Juilland's D -- 1 - ( (sd(v/s) / mean(v/s)) / sqrt(length(v/s) - 1) )
  • IDF: log_2( N / Rg )

http://www.linguistics.ucsb.edu/faculty/stgries/research/2008_STG_Dispersion_IJCL.pdf
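
A minimal sketch of two of the per-term statistics above -- IDF and Juilland's D -- written straight from the formulas. The parallel arrays `counts` (the term's raw count in each corpus part) and `sizes` (each part's total word count) are an assumed input layout.

    # Sketch only: per-term statistics following the formulas above.
    def idf(n_docs, doc_freq)
      Math.log2(n_docs.to_f / doc_freq)    # IDF = log_2( N / Rg )
    end

    def juillands_d(counts, sizes)
      rel  = counts.zip(sizes).map { |c, s| c.to_f / s }      # v/s in the formula
      mean = rel.sum / rel.size
      sd   = Math.sqrt(rel.sum { |r| (r - mean)**2 } / rel.size)
      1.0 - (sd / mean) / Math.sqrt(rel.size - 1)             # Juilland's D
    end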

Co-occurrence Graph

See Mining of Massive Datasets, p. 208, on the A-Priori algorithm. Step I: count item occurrences to find the frequent items.

Step II:

  1. For each basket, look in the frequent-items table to see which of its items are frequent.
  2. In a double loop, generate all frequent pairs.
  3. For each frequent pair, add one to its count in the data structure used to store counts.
  4. Finally, at the end of the second pass, examine the structure of counts to determine which pairs are frequent.
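
An in-memory sketch of that second pass, with the four steps marked in comments; `frequent` is assumed to be the set of items that passed the support threshold in pass I.

    # Sketch only: the second A-Priori pass described above.
    require 'set'

    def frequent_pairs(baskets, frequent, support)
      counts = Hash.new(0)
      baskets.each do |basket|
        kept = basket.select { |item| frequent.include?(item) }   # step 1
        kept.combination(2) do |a, b|                             # step 2
          counts[[a, b].sort] += 1                                # step 3
        end
      end
      counts.select { |_pair, n| n >= support }                   # step 4
    end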

Park, Chen, and Yu (PCY)

In step I, keep a count of items. Also, hash each pair and bump the count in that pair's bucket.

We can define the set of candidate pairs C2 to be those pairs {i, j} such that:

  1. i and j are frequent items.
  2. {i, j} hashes to a frequent bucket. For even better results, use two or more hashes (each in a separate hash table).
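
A sketch of the PCY bookkeeping under assumed placeholder values for the bucket-array size and support threshold: pass I counts items and hashes every pair into a bucket; only pairs of frequent items that fall in a frequent bucket become candidates for pass II.

    # Sketch only: the PCY refinement of A-Priori.
    N_BUCKETS = 1 << 20

    def bucket(a, b)
      [a, b].sort.hash.abs % N_BUCKETS
    end

    def pcy_pass1(baskets)
      item_counts   = Hash.new(0)
      bucket_counts = Array.new(N_BUCKETS, 0)
      baskets.each do |basket|
        basket.each { |item| item_counts[item] += 1 }
        basket.combination(2) { |a, b| bucket_counts[bucket(a, b)] += 1 }
      end
      [item_counts, bucket_counts]
    end

    def candidate_pair?(a, b, item_counts, bucket_counts, support)
      item_counts[a] >= support && item_counts[b] >= support &&
        bucket_counts[bucket(a, b)] >= support
    end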

Tasks:

|_. Problem                                      | Dataset                 | Operations exercised                               |
|                                                |                         |                                                    |
|_. Cat wrangling                                |                         |                                                    |
|  total sort on numeric field, small records    | wikistats               | ORDER                                              |
|  total sort on numeric field, medium records   | weather tiles           | ORDER                                              |
|  total sort on text field (wp page titles)     | long documents          | ORDER                                              |
|  uniform (x% random sample)                    | short documents         | FILTER                                             |
|  uniform (x% random sample)                    | long documents          | FILTER                                             |
|  prepare .tsv.gz files of 1G +/- 10%           | pagerank                | STORE, compress (very low entropy)                 |
|  .gz compress in-place those files             | pagerank.gz             | STORE, compress (very high entropy)                |
|  simple transform on 100k distinct files       | raw weather data        | job startup time                                   |
|  uniq -c                                       | short documents         | distinct, count                                    |
|  uniq -c                                       | long documents          | distinct, count                                    |
|                                                |                         |                                                    |
|_. Text                                         |                         |                                                    |
|  create inverted index                         | long documents          | tokenize, group                                    |
|  word count (simple tokenization)              | long documents          | tokenize, group, count                             |
|  word count (advanced tokenization)            | long documents          | complex code, group                                |
|                                                |                         |                                                    |
|_. Graph                                        |                         |                                                    |
|  edge pairs => adj list                        | wp graph                | GROUP                                              |
|  in-degree + out-degree for each node          | wp graph                | 1:1 JOIN                                           |
|  degree-sorted adj list (w/ degrees)           | wp graph                | FOREACH                                            |
|  pagerank                                      | wp graph                | workflow; complex code                             |
|  sub-universe                                  | graph + short documents | complex code                                       |
|  put aggregated counts on adjacency list       | wp graph + wikistats    | JOIN huge on big with 100% overlap                 |
|                                                |                         |                                                    |
|_. Filter                                       |                         |                                                    |

No findings