edx2bigquery

edx2bigquery is a tool for importing edX SQL and log data into Google BigQuery for research and analysis.

Read more about it in this writeup by the Harvard University Office of the Vice Provost for Advances in Learning, and the article "Google BigQuery for Education: Framework for Parsing and Analyzing edX MOOC Data", by Glenn Lopez, Daniel Seaton, Andrew Ang, Dustin Tingley, and Isaac Chuang.

Getting Started

To get started, install the Google Cloud SDK first:

https://cloud.google.com/sdk/

Then generate authentication tokens using "gcloud auth login". Make sure that "bq" and "gsutil" work properly.

Install edxcut dependency:

git clone https://github.com/mitodl/edxcut.git
pip install ./edxcut

To install:

python setup.py develop

Also, setup the file edx2bigquery_config.py in your current working directory. Example:

#-----------------------------------------------------------------------------
#
# sample edx2bigquery_config.py file
#
course_id_list = [
        "MITx/2.03x/3T2013",
]

courses = {
    'year2': course_id_list,
    'all_harvardx': [
        "HarvardX/AI12.1x/2013_SOND",
    ],
}

# google cloud project access
auth_key_file = "USE_GCLOUD_AUTH"
auth_service_acct = None

# google bigquery config
PROJECT_ID = "x-data"

# google cloud storage
GS_BUCKET = "gs://x-data"

# local file configuration
COURSE_SQL_BASE_DIR = "X-Year-1-data-sql"
COURSE_SQL_DATE_DIR = '2013-09-08'
TRACKING_LOGS_DIRECTORY = "TRACKING_LOGS"

#-----------------------------------------------------------------------------

To run:

edx2bigquery

Command line parameters and options

Command help message:

usage: edx2bigquery [-h] [--course-base-dir COURSE_BASE_DIR] [--course-date-dir COURSE_DATE_DIR] [--start-date START_DATE] [--end-date END_DATE]
                    [--tlfn TLFN] [-v] [--parallel] [--year2] [--clist CLIST] [--clist-from-missing-table CLIST_FROM_MISSING_TABLE]
                    [--force-recompute] [--dataset-latest] [--download-only] [--course-id-type COURSE_ID_TYPE] [--latest-sql-dir] [--skiprun]
                    [--external] [--extparam EXTPARAM] [--submit-condor] [--max-parallel MAX_PARALLEL] [--skip-geoip] [--skip-if-exists]
                    [--skip-log-loading] [--just-do-nightly] [--just-do-geoip] [--just-do-totals] [--just-get-schema] [--only-if-newer]
                    [--limit-query-size] [--table-max-size-mb TABLE_MAX_SIZE_MB] [--nskip NSKIP] [--only-step ONLY_STEP] [--logs-dir LOGS_DIR]
                    [--listings LISTINGS] [--dbname DBNAME] [--project-id PROJECT_ID] [--table TABLE] [--org ORG] [--combine-into COMBINE_INTO]
                    [--add-courseid] [--combine-into-table COMBINE_INTO_TABLE] [--skip-missing] [--output-format-json] [--collection COLLECTION]
                    [--output-project-id OUTPUT_PROJECT_ID] [--output-dataset-id OUTPUT_DATASET_ID] [--output-bucket OUTPUT_BUCKET] [--dynamic-dates]
                    [--logfn-keepdir] [--skip-last-day] [--gzip] [--time-on-task-config TIME_ON_TASK_CONFIG] [--subsection]
                    command [courses [courses ...]]

usage: %prog [command] [options] [arguments]

Examples of common commands:

edx2bigquery --clist=all_mitx logs2gs 
edx2bigquery setup_sql MITx/24.00x/2013_SOND
edx2bigquery --tlfn=DAILY/mitx-edx-events-2014-10-14.log.gz  --year2 daily_logs
edx2bigquery --year2 person_course
edx2bigquery --year2 report
edx2bigquery --year2 combinepc
edx2bigquery --year2 --output-bucket="gs://harvardx-data" --nskip=2 --output-project-id='harvardx-data' combinepc >& LOG.combinepc

Examples of not-so common commands:

edx2bigquery person_day MITx/2.03x/3T2013 >& LOG.person_day
edx2bigquery --force-recompute person_course --year2 >& LOG.person_course
edx2bigquery testbq
edx2bigquery make_uic --year2
edx2bigquery logs2bq MITx/24.00x/2013_SOND
edx2bigquery person_course MITx/24.00x/2013_SOND >& LOG.person_course
edx2bigquery split DAILY/mitx-edx-events-2014-10-14.log.gz 

positional arguments:
  command               A variety of commands are available, each with different arguments:
                        
                        --- TOP LEVEL COMMANDS
                        
                        setup_sql <course_id> ...   : Do all commands (make_uic, sql2bq, load_forum) to get edX SQL data into the right format, upload to
                                                      google storage, and import into BigQuery.  See more information about each of those commands, below.
                                                      This step is idempotent - it can be re-run multiple times, and the result should not change.
                                                      Returns when all uploads and imports are completed.
                        
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                                                      Accepts the --dataset-latest flag, to use the latest directory date in the SQL data directory.
                                                      Directories should be named YYYY-MM-DD.  When this flag is used, the course SQL dataset name has
                                                      "_latest" appended to it.
                        
                                                      Also accepts the "--clist=XXX" option, to specify which list of courses to act upon.
                        
                                                      Before running this command, make sure your SQL files are converted and formatted according to the 
                                                      "Waldo" convention established by Harvard.  Use the "waldofy" command (see below) for this, 
                                                      if necessary.
                        
                        daily_logs --tlfn=<path>    : Do all commands (split, logs2gs, logs2bq) to get one day's edX tracking logs into google storage 
                                   <course_id>        and import into BigQuery.  See more information about each of those commands, below.
                                   ...                This step is idempotent - it can be re-run multiple times, and the result should not change.
                                                      Returns when all uploads and imports are completed.
                        
                                                      Accepts the "--year2" flag, to process all courses in the config file's course_id_list.
                        
                        doall <course_id> ...       : run setup_sql, analyze_problems, logs2gs, logs2bq, axis2bq, person_day, enrollment_day,
                                                      person_course, and problem_check for each of the specified courses.  This is idempotent, and can be run
                                                      weekly when new SQL dumps come in.
                        
                        nightly <course_id> ...     : Run sequence of commands for common nightly update (based on having new tracking logs available).
                                                      This includes logs2gs, logs2bq, person_day, enrollment_day, person_course (forced recompute),
                                                      and problem_check.
                        
                        --external command <cid>... : Run external command on data from one or more course_id's.  Also uses the --extparam settings.
                                                      External commands are defined in edx2bigquery_config.  Use --skiprun to create the external script
                                                      without running.  Use --submit-condor to submit command as a condor job.
                        
                        --- SQL DATA RELATED COMMANDS
                        
                        waldofy <sql_data_dir>      : Apply HarvardX Jim Waldo conventions to SQL data as received from edX, which renames files to be
                                <course_id> ...       more user friendly (e.g. courseware_studentmodule -> studentmodule) and converts the tab-separated
                                                      values form (*.sql) to comma-separated values (*.csv).  Also compresses the resulting csv files.
                                                      Does this only for the specified course's, because the edX SQL dump may contain a bunch of
                                                      uknown courses, or scratch courses from the edge site, which should not be co-mingled with
                                                      course data from the main production site.  
                                                      
                                                      It is assumed that <sql_data_dir> has a name which contains a date, e.g. xorg-2014-05-11 ;
                                                      the resulting data are put into the course SQL base directory, into a subdirectory with
                                                      name given by the course_id and date, YYYY-MM-DD.
                        
                                                      The SQL files from edX must already be decrypted (not *.gpg), before running this command.
                        
                                                      Be sure to specify which course_id's to act upon.  Courses which are not explicitly specified
                                                      are put in a subdirectory named "UNKNOWN".

Edx2bigquery

Install / Use

README

edx2bigquery

Getting Started

Command line parameters and options