Demystifying Educational MOOC Data Using Google BigQuery: The Person-Course Dataset (Part 1)

by Glenn Lopez (VPAL Data Scientist)

HarvardX learners average 10K unique users daily accessing videos, posting and reading discussion forums or completing problem sets, generating over 1 million clicks per day. With thousands of users generating millions of click events daily, how do we make useful data products to help understand trends and improve learning? Why do we need to use Big Data technologies, such as Google BigQuery, to process this data?

Starting in October 2014, Isaac ‘Ike’ Chuang (MIT) developed an initial prototype for processing edX MOOC data using Google BigQuery called edx2bigquery. My role at Harvard has provided me the opportunity to contribute code to develop canonical datasets using this open-source framework, while continuing the joint Harvardx-MITx collaboration that began under the guidance of the HarvardX Research Committee Chair, Andrew Ho in 2012. Chuang, Jim Waldo, myself and others have made incremental improvements to the open-source edX data processing framework over the years to support scholarly research to improving learning, enabling collaborative research between HarvardX/MITx as evidenced through the HarvardX and MITx: The First Year of Open Online Courses and HarvardX and MITx: Two Years of Open Online Courses Fall 2012 - Summer 2014 reports.

Prior to edx2bigquery, Chuang also developed an edx processing pipeline using MongoDB, however, the amount of data generated per day and processing time required soon exceeded the desired nightly refresh timeframe taking longer than 24 hours to generate canonical data products. Instead of upgrading the on-premise MongoDB configuration to handle the increasing amount of data by scaling up our resources, Harvard/MIT both decided it was best to choose the Google BigQuery cloud solution to save cost and reduce complexity in the long run.

Since the beginning, HarvardX/MITx goal was to pave the way for Universities across the entire edX consortium to develop practical data products that researchers, institutional administration and course team leads could use to satsify a wide number of use cases including generating course reports, year-end public reports, or research datasets that could be used as the basis for ground-breaking educational MOOC research. Three years later and we would like to grow the community using these open-source tools beyond Harvard and MIT. Other institutions have begun using the edx2bigquery framework and benefitting from the open-source code; Daniel Seaton (Harvard) has developed a nice tutorial and most recently led a joint Harvard-MIT data workflow workshop for a number of edX partner institutions on how to setup and automate the edx2bigquery framework.

The Google BigQuery Solution: Scalable and Affordable

Google BigQuery is a fully-managed and cloud-based interactive query service for massive datasets. At the time on this writing, the current cost is based on the number bytes processed from all queries at 1 TB of data processed / month for free, and $5 per additional TB of data processed. The pricing model for charging per data query should be widely affordable for most institutions given the amount of MOOC data that exists. As an example, Harvard spends in the range of $20-$50 per month to process 100+ courses, containing over 700 million click events generated by almost 2 million users.

In summary, Google BigQuery provides some notable advantages over the previous on-premise MongoDB solution that Harvard-MIT used in the past. These benefits include:

  1. No technical overhead costs for maintaining the server infrastructure;
  2. Scalability of processing data products across a growing number of courses;
  3. Quick interactive data analysis and ability to perform ad-hoc queries across any or all courses, and obtain results within seconds
  4. Better visibility into how tables are generated with the ability to add descriptions attached to tables, allowing better table metadata and insights into how a table was generated (for instance, including the exact SQL query itself, how long it took to process the query, who executed the query and when)

edx2bigquery: A MOOC Research Framework

By adopting the edx2bigquery framework, the immediate benefits include the automated generation of MOOC canonical data products which can be used to answer questions posed by course developers, business analysts, researchers, faculty and institutional administration. Two of the most widely used datasets that answer a majority of these questions include the Person-Course and Person-Course-Day datasets. The Person-Course dataset will be described in this blog post in more detail. The edx2bigquery framework allows educational institutions using the edX platform to synchronize data extraction, transformation and loading (ETL) processes with the weekly database dumps and daily extraction of event tracking logs. The edx2bigquery framework performs the extract, transform and load (ETL) processes for converting the raw edX data products, into more useful and interpretable datasets for all stakeholders in the form of CSV (comma-separated values) files.

For every unique edX Course ID, the Google BigQuery tables above are refreshed daily (under each respective courses dataset <edX Course ID>_latest' in Figure 1) using SQL queries and then exported into CSV files to be analyzed by researchers, course teams or institutional administration. This is accomplished with a nightly cronjob that processes the daily click stream data by first splitting the data into separate compressed json files for each course in the format DIR/course/tracklog-YYYY-MM-DD.json.gz and transforming JSON variables into strings (e.g.: "event") or JSON dictionaries with known key values that are relevant to research or course reports. One additional table is created daily under the dataset <edX Course ID>_logs for each respective course. Only the latest daily event log is processed taking about 1.5 - 2 hours to process over a million click events for courses with activity on that day.

The weekly database dumps for edX files containing *.sql extensions are updated once a week, typically on a Sunday night, and then downloaded from Amazon S3 and processed. This job takes about 5.5 - 6 hours to extract, transform and load data in Google BigQuery for about 4 million registrants with the main bottleneck being the loading phase. Lastly, the final job performed weekly is the generation of a combined Person Course dataset for over 130+ HarvardX courses containing all registrants ever. Once a week, HarvardX generates a new table called 'person_course_YYYY_MM_DD' under the dataset 'course_report_latest' for archival purposes. The combined person course table provides a readily accessible, update-to-date dataset for researchers, course teams, faculty and institutional administration to ask questions or analyze data across all courses, a subset of courses or individual courses quickly through ad-hoc Google BigQuery SQL queries, or through accessing CSV files.

The Person-Course Dataset

One useful way to think about Person-Course dataset is that it summarizes the learning activity of a particular learner enrolled in an individual course by capturing information related to enrollment time and mode, demographics, resources accessed, progression through the course, and time spent in the course. The definition of a canonical dataset or data product for a learning management system is a useful framework to define since it provides a common interface to be used across all stakeholders. Here, the canonical Person-Course data product is defined at the course enrollment level, where a learner is enrolled in a specific course, represented by a unique edX course ID. A machine-readable Person Course Schema in JSON format is defined, along with a Harvard VPAL Private Datasets Documentation maintained on Dataverse with version control by the Harvard Vice Provost for Advances in Learning (VPAL) Research Team. The Person-Course data product contains 60+ variables mainly centered on the learners and summarizing their activity within the course, including sample visualizations generated with this dataset from various HarvardX/MITx Working Papers:

Course Enrollment Information:

The course enrollment information is extracted from five (5) file types from edX: users, profiles, enrollment, certification and tracking log files. The start_time and cert_created_date variables are extracted directly from enrollment and certification files, while the verified_* variables are parsed from the tracking log files for each course.

•   course_id: unique edX course id

•   username: unique edX username

•   mode: Mode of registrant, e.g.: honor, audit, verified

•   start_time: date of enrollment in the course

•   cert_created_date: date when the certificate was generated

•   verified_enroll_time: time when user initially had their first id-verified enrollment

•   verified_unenroll_time: time when user had their last id-verified un-enrollment event

Demographic Information:

The basic profile information for LoE, YoB and gender are extracted from profiles, and indicate values at the time of registering for an edX user account. Generation of the demographic geo-location variables, however, are arguably require the most computationally intensive operation in the dataset, requiring parsing of the tracking log events to determine the modal IP per user, and then performing an IP address lookup using multiple external geo-location datasets, including Google's free built-in table fh-bigquery:geocode.geolite_city_bq_b2b, Maxmind public geoip dataset, and United Nations Geographic Region datasets. All geo-location variables, including cc_by_ip, city, zipcode, un_economic_group and subdivision, are derived using the modal IP address for each user.

•   LoE: Level of education when registering for an edX user account

•   YoB: Year of birth

•   gender: gender

•   cc_by_ip: Two-letter country code

•   city: Full name of the city

•   zipcode: Zip code of the city in the United States

•   un_economic_group: UN defined major geographical economic groups

•   subdivision: Full name of region/subdivision or U.S. state

Course Role Information:

The Course Role information is extracted using edX data provided on a per-request basis using 2 files - one file for course roles, and a separate file for forum roles. EdX should be able to provide this data readily upon request.

•   roles_isStaff: 1, if user is staff

•   roles_isBetaTester: 1, if user is a beta tester

•   roles_isInstructor: 1, if user is an instructor

•   forumRoles_isStudent: 1, if user is a student

•   forumRoles_isAdmin: 1, if user is a forum administrator

•   forumRoles_isCommunityTA: 1, if user is a community TA

•   forumRoles_isModerator: 1, if user is a forum moderator

Course Resource Access:

The course resource access data are extracted from the studentmodule, tracking logs and forum.mongo edx files. Course Resource data include problems/assessments, discussion forums, and video interactions. Course chapter access variable 'nchapters' is obtained from studentmodule. Forum data is extracted using forum.mongo. Video events are extracted from tracking log events.

•   nchapters: Number of chapters visited by user

•   nproblem_check: Number of problem check events

•   nprog_check: Number of progress check events

•   nshow_answer: Number of show answer events

•   nforum_posts: Number of forum posts and comments made by user

•   nforum_threads: Number of forum post threads by user

•   nforum_comments: Number of forum post comments by user

•   nvideo: Number of video events, from the tracking logs

•   nseek_video: Number of 'video seek' events

•   npause_video:Number of 'video pause' events

•   ntranscript: Number of video transcript events

Course Progression and current Course Grade:

The course progression variables categorize learner progression within a course based on chapters accessed. This data is extracted using studentmodule. The current course grade is obtained from the certificates file.

•   viewed: Boolean flag, indicating that the user visited the course at least once

•   explored: Boolean flag, indicating that the user viewed at least half of the chapters of the course

•   certified: Boolean flag, indicating that the user earned a certificate in the course

•   grade: final grade earned in the course, out of 100 points

Course Time on Task:

The time on task variables attempts to calculate and approximate the amount of time a learner has spent in a given course, based on tracking log activity. The current timing threshold used is 5 minutes which is a more conservative estimate.

•   sum_dt: Total elapsed time (in seconds) spent by user on this course, based on time difference of consecutive events, within 5 min max cutoff, from tracking logs

•   avg_dt: Average time difference (in seconds) between consecutive events from tracking logs

•   max_dt: Maximum difference (in seconds) between consecutive events from tracking logs

•   n_dt:Number of consecutive events used in time difference computations

•   ndays_act: Number of days with activity, from tracking logs

Research and Institutional Benefits

The canonical Person-course dataset can serve a wide-variety of stakeholders in different ways and some example use cases for each stakeholder are described below.

For Course Team designers, the Person-Course dataset provides descriptive statistics at the individual course level. Course teams can find out useful statistics for their courses including out the total registrants, viewers, explorers, certified, verified-id and certified verified-id.

For Business Analysts, the Person-Course dataset provides the ability to filter learners based on geographic location, course completion ('certified'), age, or courses taken, for instance. This may be useful for targeted marketing campaigns for new programs or events that learners may be interested in pursuing.

For Faculty, the Person-Course dataset can be grouped across multiple courses taught by that respective faculty member in order to compare multiple versions and runs of the same course offering.

For Institutional Administration, the Person-Course dataset can be grouped across multiple courses in various groupings to observe possible trends, for example. Some example groupings may be courses offered by a school, department, or across all courses.


The Person-Course dataset is only one of many canonical data products generated using edx2bigquery. If all Universities within the edX consortium were to adopt the edx2bigquery framework, or at a minimum the schemas for data products, then the overarching benefit would be the ability to easily merge datasets across multiple Universities together in a common way, using a common definition of variables, and using a standardized way to process data. Ultimately, this would allow the ability to ask interesting multi-institutional questions such cross-enrollment, thus allowing us to visualize a course-enrollment network diagram across multiple institutions apart from Harvard-MIT.

We've discussed the canonical dataset Person-Course in this blog post, but the edx2bigquery framework provides the ability to generate other useful and meaningful canonical data products which will be discussed in more detail in the future as part of a multi-series blog post including:

•   Person Course

•   Person Course Day

•   Discussion Forums

•   Problems and Assessments

•   Video Data


  1. edx2bigquery
  2. HarvardX and MITx: The First Year of Open Online Courses
  3. HarvardX and MITx: Two Years of Open Online Courses Fall 2012 - Summer 2014
  4. MITx Spring 2014 Course Report
  5. HarvardX Working Papers

Disclaimer: Responsibility for the information and views expressed in this blog lies entirely with the author and are NOT endorsed in any way by organizations mentioned in the blog.