
[ARCHIVED] edX Insights Early Prototype

This repository has been archived and is no longer supported—use it at your own risk. This repository may depend on out-of-date libraries with security issues, and security updates will not be provided. Pull requests against this repository will also not be merged.

Warning: This repository contains an early prototype of the edX analytics infrastructure. edX will not be continuing development of this source code, but interested parties are free to fork and modify it.

The code base for the current version of the Insights product can be found in the following repositories:

  • https://github.com/edx/edx-analytics-dashboard
  • https://github.com/edx/edx-analytics-data-api
  • https://github.com/edx/edx-analytics-pipeline

This is a development version of an analytics framework for the edX infrastructure. It will replace the ad-hoc analytics used in the software previously. The goal of this framework is to define an architecture for simple, pluggable analytics modules. The architecture must have the following properties:

  1. Easy to use. Professors, graduate students, etc. should be able to write plug-ins quickly and easily. These should be able to run in the system without impacting the overall stability. Results should be automatically shown to customers.
  2. The API must support robust, scalable implementations. The current back-end is not designed for mass scaling, but the API for the modules should permit e.g. sharding in the future.
  3. Reusable. The individual analytics modules should be able to use the results from other modules, and people should be able to build on each other's work.
  4. Interoperable. We would like the framework to be sufficiently generic to be usable outside of edX.
  5. Cross-scope. There should be a smooth path from off-line analytics, to on-line batched analytics (e.g. for an instructor dashboard), to on-line realtime analytics (e.g. for the system to react to an event the analytics detects).

The model behind Insights is the app store model ([Image: app store]). As with an app store, we provide a runtime. This runtime provides a fixed set of technologies (Python, numpy, scipy, pylab, pandas, Mongo, a cache, etc.). If you restrict yourself to this runtime, anyone running Insights can host your analytic. If you'd like to move outside this set of tools, you can do that too, but then you may have to host your own analytics server.

Comparison to other systems:

  • Tincan is an SOA and a format for streaming analytics events. Insights is an API and runtime for handling those events. The two are complementary.
  • Twitter Storm is a framework for sending events around. Insights is an API and runtime which would benefit from moving to something like Storm.
  • Hadoop is a distributed computation engine. For most learning analytics, Hadoop is overkill, but it could be embedded in an analytics module if desired.

Examples

Views show up in the dashboards. To define an analytic which just shows "Hello World" in the analytics dashboard:

@view()
def hello_world():
    return "<html>Hello world!</html>"

Queries return data for use in other parts of the system. If you would like to define a new analytic which shows a histogram of grades, the first step is to define a query which returns grades. How this is done depends on your LMS, but it is often convenient to start with a dummy query that does not rely on a functioning LMS; this allows off-line development without live student data:

import numpy

@query()
def get_grades(course):
    '''Dummy data query: returns plausible-looking random grades.'''
    grades = 3 * numpy.random.randn(1000, 4) + \
        12 * numpy.random.binomial(1, 0.3, (1000, 4)) + 40
    return grades

Once this is in place, you can define a view which will call this query:

import time
from pylab import title, hist, savefig

@view()
def plot_grades(fs, query, course):
    grades = query.get_grades(course)
    filename = course + "_" + str(time.time()) + ".png"
    title("Histogram of course grades")
    hist(grades)
    f = fs.open(filename, "w")
    savefig(f)
    f.close()
    fs.expire(filename, 5 * 60)  # keep the rendered image for five minutes
    return "<img src='" + fs.get_url(filename) + "'>"

At this point, the following will show up in the instructor dashboard:

[Image: grade histogram]

Note that the query and the view don't have to live on the same machine. If someone wants to reuse your grade histogram in a different LMS, all they need to do is define a new get_grades query.
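To make the reuse path concrete, here is a sketch of a replacement get_grades for a hypothetical LMS that keeps grades in a SQL database. Everything framework-specific is stubbed so the sketch runs standalone: the @query decorator is a no-op stand-in, and an in-memory SQLite database stands in for a read replica; the table and column names are illustrative, not part of any real LMS schema.

```python
import sqlite3

def query():  # stand-in for Insights' @query decorator (illustration only)
    def wrap(f):
        return f
    return wrap

@query()
def get_grades(course):
    # Hypothetical data-layer query: pull grades for a course from a read
    # replica instead of generating dummy data. Schema is illustrative.
    conn = sqlite3.connect(":memory:")  # stand-in for the replica connection
    conn.execute("CREATE TABLE grades (course TEXT, grade REAL)")
    conn.executemany("INSERT INTO grades VALUES (?, ?)",
                     [(course, g) for g in (55.0, 72.5, 88.0)])
    rows = conn.execute("SELECT grade FROM grades WHERE course = ?",
                        (course,)).fetchall()
    return [r[0] for r in rows]
```

Because plot_grades only ever calls query.get_grades(course), swapping in a query like this requires no change to the view itself.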

To build a module which takes all incoming events and dumps them into a database:

@event_handler()
def dump_to_db(mongodb, events):
    collection = mongodb['event_log']
    collection.insert([e.event for e in events])

Except for imports, that's all that's required.
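To see the handler's contract concretely, the runnable sketch below drives dump_to_db end to end. The decorator, the Mongo collection, and the event wrapper (with its .event attribute holding the JSON dictionary) are all minimal stubs invented for illustration, not the real Insights or pymongo objects.

```python
def event_handler():  # stand-in for Insights' @event_handler decorator
    def wrap(f):
        return f
    return wrap

class FakeCollection:
    # Minimal stand-in for a Mongo collection: just enough for the example.
    def __init__(self):
        self.docs = []
    def insert(self, docs):
        self.docs.extend(docs)

class Event:
    # Hypothetical event wrapper: .event holds the JSON dictionary.
    def __init__(self, payload):
        self.event = payload

@event_handler()
def dump_to_db(mongodb, events):
    collection = mongodb['event_log']
    collection.insert([e.event for e in events])

db = {'event_log': FakeCollection()}
dump_to_db(db, [Event({'type': 'play_video'}), Event({'type': 'seek'})])
```

After the call, db['event_log'] holds both event dictionaries, which is all the real module does against a real Mongo collection.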

Architecture

A block diagram of where the analytics might fit into an overall learning system is:

[Image: system structure]

The learning management system (and potentially other sources) streams events to the analytics framework. In addition, the modules in the framework will typically have access to read replicas of production databases. In practice, many analytics can be computed directly from the LMS databases with much less effort than by processing events.

A single module

A rough diagram of a single analytics module is:

[Image: analytics module]

Each module in the analytics framework is an independent Python module. It has its own Mongo database, a filesystem abstraction, and a cache. In addition, it can have access to read replicas of production databases and, in the near future, to read replicas of other modules' databases.

Note that all of these are optional. A hello world module could be as simple as defining a single view:

@view()
def hello_world():
    return "<html>Hello world!</html>"

If you wanted the view to be per-user, you could include a user parameter:

@view()
def hello_world(user):
    return "<html>Hello " + user + "</html>"

The views and queries are automatically inspected for parameters, and the system will do the right thing. If you would like a per-module database, simply take a db parameter, and so on.
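The parameter-inspection idea can be pictured with a small sketch. The resource registry and call_view dispatcher below are hypothetical, not the actual Insights internals; they only illustrate how a framework can read a view's signature and supply exactly the arguments it asks for.

```python
import inspect

# Hypothetical resource registry: maps a parameter name to a factory that
# produces the resource from some request context. Names are illustrative.
RESOURCES = {
    'user': lambda ctx: ctx.get('user', 'anonymous'),
    'db': lambda ctx: ctx.setdefault('db', {}),
}

def call_view(view, ctx):
    # Inspect the view's signature and inject only the resources it requests.
    wanted = inspect.signature(view).parameters
    kwargs = {name: RESOURCES[name](ctx) for name in wanted if name in RESOURCES}
    return view(**kwargs)

def hello_world(user):
    return "<html>Hello " + user + "</html>"

print(call_view(hello_world, {'user': 'Ada'}))  # → <html>Hello Ada</html>
```

A view declaring no parameters would simply be called with none, which is why the hello-world examples above work unchanged.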

To understand how to build modules in more detail, the best place to start is the module which defines the test cases, in modules/testmodule/__init__.py. The next place to look is the code for the decorators, and finally the main views and dashboard.

Using with other LMSes

The architecture is designed to be usable with common analytics shared between multiple LMSes. The structure for this is:

[Image: multiple LMSes]

Here, each instance has a data layer module. This module translates the data generated by the particular LMS into a common representation. Higher-level analytics are built on top of that common representation. We're still working out a process for creating this data layer, but it's not essential that we get it 100% right. In most cases, it is relatively easy to include backwards-compatibility queries.
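As a concrete (and entirely hypothetical) illustration of a data-layer translation, the function below maps one LMS's raw event fields into a shared representation that higher-level analytics could consume. The field names on both sides are made up for the example; a real data layer would be driven by the actual LMS schema.

```python
def to_common_event(raw):
    # Hypothetical data-layer translation: normalize one LMS's raw event
    # into a shared actor/verb/object shape. Field names are illustrative.
    return {
        'actor': raw.get('username'),
        'verb': raw.get('event_type'),
        'object': raw.get('page'),
    }

common = to_common_event({'username': 'ada',
                          'event_type': 'play_video',
                          'page': '/courseware/unit1'})
```

Analytics written against the common shape then work across every LMS that ships such a translation module.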

Structuring servers

The system is agnostic to how analytics are split across servers. There are several models for how this might be used.

First, we might keep production-grade code on, for example, a critical server holding student profile and grading information, while still maintaining prototype analytics servers, which may be online more intermittently:

[Image: multiple servers]

A second way to use this might be by function. For example, we might embed analytics in the LMS, in the forums, in the wiki, in the student registration system, and in other parts of the system. Those would provide access to data from those subsystems. We may also wish to have specialized runtimes providing access to additional tools like Hadoop or R. A single computer can query across all of these servers from the Insights API:

[Image: per-system analytics]

Installing

Follow the instructions in INSTALL.md

If installed for development, the advertised views and queries for the test module will be at:

http://127.0.0.1:8000/static/index.html

Running periodic tasks

Periodic tasks (which are scheduled with core.decorators.cron) rely on Celery for execution. It is the responsibility of the client Django project to ensure Celery is configured and running. To configure it, add the following to the settings.py of your Django project:

from edinsights.celerysettings import *

To start Celery, run the following from your Django project:

python manage.py celery worker -B

Only tasks located in files named tasks.py in the main directory of your Django project, or of an installed Django app, will be scheduled.
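A periodic task in such a tasks.py might then look like the sketch below. The run_every keyword is an assumption, since this README does not document the exact signature of core.decorators.cron; the decorator here is a stub that merely records the interval, standing in for the real one, which registers the task with Celery.

```python
SCHEDULE = {}

def cron(run_every):
    # Stand-in for core.decorators.cron (illustration only): the real
    # decorator registers the task with Celery; this one just records
    # the requested interval in seconds.
    def wrap(f):
        SCHEDULE[f.__name__] = run_every
        return f
    return wrap

@cron(run_every=60 * 60)  # hypothetical: run hourly
def rebuild_grade_histograms():
    # A periodic job might precompute expensive aggregates off-line.
    return "rebuilt"
```

With Celery beat running (manage.py celery worker -B), such a task would fire on its schedule without any view or query having to trigger it.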

Building on top of the framework

To build on top of the framework, you will need several things:

  • A log handler which can stream the events out over an SOA. The ones we wrote for edX are available at: https://github.com/edx/loghandlersplus
  • A way of piping these events into the analytics framework. The version we wrote for edX is available at: https://github.com/edx/djeventstream At a high level, this is a module which sends Django signals of type djeventstream.signals.event_received. The events are JSON dictionaries; the event handler can handle dictionaries, lists of dictionaries, or JSON-encoded string representations of either.
  • A way of embedding the analytics in your LMS based on the SOA.
  • Potentially, some set of analytics modules. At the very least, you should define appropriate (TBD) properties to ornament your events with and appropriate queries (TBD) to get data out of your read-replica databases, so that modules written by other