Syllabus Explorer: Technical Notes

How We Did It

Course syllabus texts are acquired from multiple sources. One source is uploaded course files: instructors regularly include a variant of the word "syllabus" (case insensitive and allowing for common misspellings) in the file name, which we use to infer which files are syllabi. Another source is the native Canvas Syllabus tool, where the instructor can enter the syllabus text directly, upload a syllabus file, or embed a link to one. In the last two cases, we look for a variant of the word "syllabus" in the file name or the link. With Canvas API access granted by the participating School (the FAS Dean, for example) and HUIT, we programmatically download these syllabus files across all of the School's courses. Additionally, we download course metadata (supplemental information about course departments, instructors, and terms) from the iCommons API, provided by HUIT Academic Technology.
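The file-name check described above can be sketched as a case-insensitive regular expression. This is only an illustration: the notes do not say which misspellings are accepted, so the pattern, the helper name looks_like_syllabus, and the sample file names are all assumptions.

```python
import re

# Hypothetical pattern (an assumption, not the project's actual rule):
# accepts "syllabus"/"syllabi", a single or doubled "l", and "i" in place of "y".
SYLLABUS_PATTERN = re.compile(r"s[yi]l+ab(us|i)", re.IGNORECASE)


def looks_like_syllabus(name: str) -> bool:
    """Return True if a file name or link contains a syllabus-like token."""
    return SYLLABUS_PATTERN.search(name) is not None


# Made-up file names and link, for illustration only.
candidates = [
    "HIST101_Syllabus_Fall.pdf",
    "sylabus-draft.docx",
    "week1_slides.pptx",
    "https://example.edu/files/SYLLABI.pdf",
]
matches = [name for name in candidates if looks_like_syllabus(name)]
```

A pattern this loose will produce some false positives and miss unusual spellings, so in practice the accepted variants would need to be tuned against real file names.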

Text from syllabi and associated metadata are cleaned, filtered, and reformatted so that the text can be modeled with an ensemble of different algorithms. These models are used to extract course and department keywords, compute the similarity between courses, and enable searching of the documents and their metadata. The final dataset is loaded into an R Shiny application, where the data can be searched and explored using a web browser.

Calculating Course Similarity

Similarity is established from the course description and syllabus texts using an ensemble of several methods: latent semantic analysis, alignment of rows in the document-term matrix, tf-idf, and three versions of a correlated topic model. The output of each model is a distance between any two courses: a number between 0 and 1, where smaller distances indicate higher similarity. This ensemble of distance matrices is collapsed into a single list of pairwise distances. For each model's distance matrix, we impute missing pairwise distances using that model's median pairwise distance and convert the raw distances to percentiles. Then the element-wise mean of the resulting matrices is calculated and used as the similarity score between pairs of courses; because it is an average of distance percentiles, a lower score indicates more similar courses.
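The collapsing step can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the notes do not give the exact percentile formula, so a pair's percentile is taken here to be the fraction of pairs in the same model whose distance is less than or equal to it. The function names and the toy matrices are hypothetical.

```python
import numpy as np


def to_percentiles(dist: np.ndarray) -> np.ndarray:
    """Impute missing pairwise distances with the model's median,
    then convert each distance to its percentile within that model."""
    n = dist.shape[0]
    upper = np.triu_indices(n, k=1)          # each unordered pair once
    pairs = dist[upper].astype(float)
    median = np.nanmedian(pairs)             # median of the observed pairs
    pairs = np.where(np.isnan(pairs), median, pairs)
    # Assumed percentile definition: share of pairs with distance <= this one.
    ranks = np.searchsorted(np.sort(pairs), pairs, side="right")
    pct = ranks / pairs.size
    out = np.zeros((n, n))
    out[upper] = pct
    return out + out.T                       # symmetric, zero diagonal


def ensemble_distance(matrices) -> np.ndarray:
    """Element-wise mean of the per-model percentile matrices."""
    return np.mean([to_percentiles(m) for m in matrices], axis=0)


# Toy data for three courses and two models; NaN marks a missing distance.
model_a = np.array([[0.0, 0.2, 0.9],
                    [0.2, 0.0, np.nan],
                    [0.9, np.nan, 0.0]])
model_b = np.array([[0.0, 0.4, 0.8],
                    [0.4, 0.0, 0.1],
                    [0.8, 0.1, 0.0]])
combined = ensemble_distance([model_a, model_b])
```

Converting to percentiles before averaging puts every model on the same scale, so a model whose raw distances happen to cluster near 0 or near 1 does not dominate the mean.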