# Spindle
Next-generation web analytics processing with Scala, Spark, and Parquet.
Spindle is Brandon Amos' 2014 summer internship project with Adobe Research and is not under active development.

Analytics platforms such as [Adobe Analytics][adobe-analytics] are growing to process petabytes of data in real-time. Delivering responsive interfaces querying this amount of data is difficult, and there are many distributed data processing technologies such as [Hadoop MapReduce][mapreduce], [Apache Spark][spark], [Apache Drill][drill], and [Cloudera Impala][impala] to build low-latency query systems.
Spark is part of the [Apache Software Foundation][apache] and claims speedups of up to 100x over Hadoop for in-memory processing. Spark is shifting from a research project to a production-ready library, and academic publications and presentations from the [2014 Spark Summit][2014-spark-summit] archive several use cases of Spark and related technologies. For example, [NBC Universal][nbc] presents their use of Spark to query [HBase][hbase] tables and analyze international cable TV video distribution [here][nbc-pres]. Telefonica presents their use of Spark with [Cassandra][cassandra] for cyber security analytics [here][telefonica-pres]. [ADAM][adam] is an open source data storage format and processing pipeline for genomics data built on Spark and [Parquet][parquet].
Even though use cases of Spark are being published, few people have shared their experiences of building and tuning production-ready Spark systems. Achieving optimal performance from Spark applications requires thorough knowledge of Spark internals and of the libraries that interoperate well with Spark.
Spindle is a prototype Spark-based web analytics query engine designed around the requirements of production workloads. Spindle exposes query requests through a multi-threaded HTTP interface implemented with [Spray][spray]. Queries are processed by loading data stored in the [Apache Parquet][parquet] columnar format on the [Hadoop distributed filesystem][hdfs].
This repo contains the Spindle implementation and benchmarking scripts to observe Spindle's performance while exploring Spark's tuning options. Spindle's goal is to process petabytes of data on thousands of nodes, but the current implementation has not yet been tested at this scale. Our current experimental results use six nodes, each with 24 cores and 21 GB of Spark memory, to query 13.1 GB of analytics data. The trends show that further Spark tuning and optimizations should be investigated before attempting larger scale deployments.
## Demo
We used Spindle to generate static webpages that are hosted [here][demo]. Unfortunately, the demo is only for illustrative purposes and does not run Spindle in real time.

[Grunt][grunt] is used to deploy the demo to [GitHub Pages][ghp]
in the [gh-pages][ghp] branch with the [grunt-build-control][gbc] plugin.
The [npm][npm] dependencies are managed in [package.json][pjson]
and can be installed with `npm install`.
## Loading Sample Data
The `load-sample-data` directory contains a Scala program,
modeled after
[adobe-research/spark-parquet-thrift-example][spark-parquet-thrift-example],
that loads the following sample data into [HDFS][hdfs].
See [adobe-research/spark-parquet-thrift-example][spark-parquet-thrift-example]
for more information on running this application
with [adobe-research/spark-cluster-deployment][spark-cluster-deployment].
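The sample rows below map naturally onto a flat record type. A minimal sketch of such a record in plain Scala (the case class and camel-cased field names are hypothetical; the actual loader serializes Thrift-defined records to Parquet on HDFS):

```scala
// Hypothetical flat record mirroring the sample-data columns below.
// The real loader writes Thrift-serialized records to Parquet on HDFS.
case class AnalyticsHit(
  postPagename: String,
  userAgent: String,
  visitReferrer: String,
  postVisidHigh: String,
  postVisidLow: String,
  visitNum: Int,
  hitTimeGmt: Long,
  postPurchaseid: String,
  postProductList: String,
  firstHitReferrer: String
)

// The first row of the 2014-08-14 sample data as a record.
val hit = AnalyticsHit("Page A", "Chrome", "http://facebook.com",
  "111", "111", 1, 1408007374L, "", "", "http://google.com")
```

Empty strings stand in for the blank `post_purchaseid` and `post_product_list` cells of non-purchase hits.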
`hdfs://hdfs_server_address:8020/spindle-sample-data/2014-08-14`

| post_pagename | user_agent | visit_referrer | post_visid_high | post_visid_low | visit_num | hit_time_gmt | post_purchaseid | post_product_list | first_hit_referrer |
|---|---|---|---|---|---|---|---|---|---|
| Page A | Chrome | http://facebook.com | 111 | 111 | 1 | 1408007374 | | | http://google.com |
| Page B | Chrome | http://facebook.com | 111 | 111 | 1 | 1408007377 | | | http://google.com |
| Page C | Chrome | http://facebook.com | 111 | 111 | 1 | 1408007380 | purchase1 | ;ProductID1;1;40;,;ProductID2;1;20; | http://google.com |
| Page B | Chrome | http://google.com | 222 | 222 | 1 | 1408007379 | | | http://google.com |
| Page C | Chrome | http://google.com | 222 | 222 | 1 | 1408007381 | | | http://google.com |
| Page A | Firefox | http://google.com | 222 | 222 | 1 | 1408007382 | | | http://google.com |
| Page A | Safari | http://google.com | 333 | 333 | 1 | 1408007383 | | | http://facebook.com |
| Page B | Safari | http://google.com | 333 | 333 | 1 | 1408007386 | | | http://facebook.com |
`hdfs://hdfs_server_address:8020/spindle-sample-data/2014-08-15`

| post_pagename | user_agent | visit_referrer | post_visid_high | post_visid_low | visit_num | hit_time_gmt | post_purchaseid | post_product_list | first_hit_referrer |
|---|---|---|---|---|---|---|---|---|---|
| Page A | Chrome | http://facebook.com | 111 | 111 | 1 | 1408097374 | | | http://google.com |
| Page B | Chrome | http://facebook.com | 111 | 111 | 1 | 1408097377 | | | http://google.com |
| Page C | Chrome | http://facebook.com | 111 | 111 | 1 | 1408097380 | purchase1 | ;ProductID1;1;60;,;ProductID2;1;100; | http://google.com |
| Page B | Chrome | http://google.com | 222 | 222 | 1 | 1408097379 | | | http://google.com |
| Page A | Safari | http://google.com | 333 | 333 | 1 | 1408097383 | | | http://facebook.com |
| Page B | Safari | http://google.com | 333 | 333 | 1 | 1408097386 | | | http://facebook.com |
`hdfs://hdfs_server_address:8020/spindle-sample-data/2014-08-16`

| post_pagename | user_agent | visit_referrer | post_visid_high | post_visid_low | visit_num | hit_time_gmt | post_purchaseid | post_product_list | first_hit_referrer |
|---|---|---|---|---|---|---|---|---|---|
| Page A | Chrome | http://facebook.com | 111 | 111 | 1 | 1408187380 | purchase1 | ;ProductID1;1;60;,;ProductID2;1;100; | http://google.com |
| Page B | Chrome | http://facebook.com | 111 | 111 | 1 | 1408187380 | purchase1 | ;ProductID1;1;200; | http://google.com |
| Page D | Chrome | http://google.com | 222 | 222 | 1 | 1408187379 | | | http://google.com |
| Page A | Safari | http://google.com | 333 | 333 | 1 | 1408187383 | | | http://facebook.com |
| Page B | Safari | http://google.com | 333 | 333 | 1 | 1408187386 | | | http://facebook.com |
| Page C | Safari | http://google.com | 333 | 333 | 1 | 1408187388 | | | http://facebook.com |
## Queries
Spindle includes eight queries that are representative of the data sets and computations of real queries the Adobe Marketing Cloud processes. All `collect` statements refer to the combined filter-and-map operation, not the operation that gathers an RDD as a local Scala object.
- Q0 (Pageviews) is a breakdown of the number of pages viewed each day in the specified range.
- Q1 (Revenue) is the overall revenue for each day in the specified range.
- Q2 (RevenueFromTopReferringDomains) obtains the top referring domains for each visit and breaks down the revenue by day. The `visit_referrer` field is preprocessed into each record in the raw data.
- Q3 (RevenueFromTopReferringDomainsFirstVisitGoogle) is the same as RevenueFromTopReferringDomains, but restricted to visitors whose absolute first referrer is Google. The `first_hit_referrer` field is preprocessed into each record in the raw data.
- Q4 (TopPages) is a breakdown of the top pages for the entire date range, not per day.
- Q5 (TopPagesByBrowser) is a breakdown of the browsers used for TopPages.
- Q6 (TopPagesByPreviousTopPages) breaks down, for TopPages, the top previous pages visitors came from.
- Q7 (TopReferringDomains) is the top referring domains for the entire date range, not per day.
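The shape shared by these queries — a combined filter-and-map `collect` followed by an aggregation — can be sketched for Q0 on a plain Scala collection (record layout and names are illustrative; the real implementation runs the same pipeline shape on an RDD of Parquet records, with the aggregation as a `reduceByKey`):

```scala
import java.time.{Instant, LocalDate, ZoneOffset}

// Illustrative hits as (pageName, hit_time_gmt in epoch seconds),
// taken from the sample data above.
val hits: Seq[(String, Long)] = Seq(
  ("Page A", 1408007374L), ("Page B", 1408007377L), // 2014-08-14
  ("Page A", 1408097374L)                           // 2014-08-15
)

// "Collect" step: map each hit to its UTC day, then count per day.
val pageviewsByDay: Map[LocalDate, Int] =
  hits
    .map { case (_, t) =>
      Instant.ofEpochSecond(t).atZone(ZoneOffset.UTC).toLocalDate
    }
    .groupBy(identity)
    .map { case (day, v) => day -> v.size }
```

The other queries vary the key (referring domain, page, browser) and the aggregated value (revenue, counts), but keep this same pipeline structure.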
The following table shows the subset of columns each query utilizes.

The following table shows the operations each query performs and is intended as a summary rather than a full description of the implementations. The bold text indicates operations in which the target partition size is specified, which is described further in the "Partitioning" section below.
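For intuition on what specifying a target partition count does, Spark's default hash partitioner assigns each key to one of the requested partitions. A rough plain-Scala illustration of that assignment logic (this mirrors Spark's `HashPartitioner` behavior for illustration; it is not Spindle's code):

```scala
// Mirrors Spark's HashPartitioner: a non-negative modulus of the
// key's hashCode over the requested number of partitions.
def partitionFor(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

// With a target of 4 partitions, every key lands in exactly one
// of the buckets 0..3; all records sharing a key are co-located.
val keys = Seq("Page A", "Page B", "Page C", "Page D")
val buckets: Map[Int, Seq[String]] = keys.groupBy(k => partitionFor(k, 4))
```

Choosing the target partition count trades off per-task overhead (too many small partitions) against parallelism and memory pressure (too few large ones), which is what the benchmarks below explore.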

## Spindle Architecture
The query engine provides a request-and-response interface to the application layer; Spindle's goal is to benchmark a realistic low-latency web analytics query engine.
Spindle serves query requests and reports over HTTP with the [Spray][spray] library, which is multi-threaded and provides a REST/HTTP-based integration layer in Scala for queries and their parameters, as illustrated in the figure below.

When a user requests a query execution over HTTP,
Spray allocates a thread to process the HTTP request and converts
it into a Spray request.
The Spray request follows a route defined in the QueryService actor,
and queries are processed with the QueryProcessor singleton object.
The QueryProcessor interacts with a global Spark context,
which connects the Scala application to the Spark cluster.
The Spark context supports multi-threading and offers
FIFO and FAIR scheduling options for concurrent queries.
Spindle uses Spark's FAIR scheduling option to minimize overall latency.
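The threading model above — per-request threads funneling into one shared processor over a single multi-threaded Spark context — can be sketched with plain Scala futures (the names are illustrative stand-ins; Spray routing and the Spark context are stubbed out):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Stand-in for the singleton that wraps the global Spark context.
// Spark contexts accept job submissions from multiple threads.
object QueryProcessor {
  def process(query: String): String = s"result-of-$query"
}

// Each HTTP request runs on its own thread/future, as with Spray,
// but all of them share the single QueryProcessor.
val responses: Future[Seq[String]] =
  Future.traverse(Seq("Q0", "Q1", "Q7")) { q =>
    Future(QueryProcessor.process(q))
  }

val results: Seq[String] = Await.result(responses, 10.seconds)
```

With FAIR scheduling enabled on the real Spark context, concurrent submissions like these share cluster resources instead of queueing behind one another.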
## Future Work: Utilizing Spark job servers or resource managers
Spindle's architecture can likely be improved on larger clusters by utilizing a job server or resource manager to maintain a pool of Spark contexts for query execution. [Ooyala's spark-jobserver][spark-jobserver] provides a RESTful interface for submitting Spark jobs that Spindle could use instead of interfacing with Spark directly. [YARN][yarn] can also be used to manage Spark's resources on a cluster.
