Categorizing learning content

From a hand-coded to an algorithmic approach

The Previous Course Categories

Coursera’s original categorization scheme dated back to our founding in 2012 and was heavily influenced by the content available at the time. For example, we had five categories for computer science subfields but only one category for all of the humanities. The categories were also defined manually and arbitrarily, which led to redundancy (e.g., “Food and Nutrition” was nearly a subset of “Health and Society”) and vagueness (e.g., “Information, Tech & Design”).

We wanted the new categories to be:

  1. Simple (as few categories as possible)
  2. Minimally redundant (as mutually exclusive as possible)
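
One way to make the "minimally redundant" criterion concrete is to measure how much of a smaller category's course list also appears in a larger one. The sketch below is only an illustration: the course IDs, the `containment` and `redundant_pairs` helpers, and the 0.7 threshold are all hypothetical and do not come from Coursera's actual data or tooling.

```python
from itertools import combinations


def containment(a, b):
    """Fraction of the smaller set that also appears in the larger set."""
    smaller, larger = (a, b) if len(a) <= len(b) else (b, a)
    if not smaller:
        return 0.0
    return len(smaller & larger) / len(smaller)


def redundant_pairs(categories, threshold=0.7):
    """Return (name_a, name_b, score) for category pairs that overlap heavily."""
    pairs = []
    for (name_a, courses_a), (name_b, courses_b) in combinations(categories.items(), 2):
        score = containment(courses_a, courses_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return pairs


# Hypothetical course IDs, invented for this example only.
categories = {
    "Food and Nutrition": {"c1", "c2", "c3", "c4"},
    "Health and Society": {"c1", "c2", "c3", "c5", "c6", "c7"},
    "Information, Tech & Design": {"c8", "c9"},
}

flagged = redundant_pairs(categories)
for name_a, name_b, score in flagged:
    print(f"{name_a} overlaps {name_b}: {score:.2f}")
```

A score near 1.0 means the smaller category is almost entirely contained in the larger one, which is the kind of redundancy described above.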

t-SNE to the rescue

Rather than re-coding by hand, or replicating traditional university departments, we took a data-driven approach.

Figure 1. t-SNE visualization of courses colored by cluster, circa 2015.
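
A plot like this could be produced roughly as follows. This is a minimal sketch, not the pipeline behind the figure: it assumes each course is represented by a numeric feature vector (synthetic random data here, since the post does not describe the real features), uses scikit-learn's `TSNE` for the 2-D projection, and uses k-means with three clusters purely as a stand-in clustering step.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for course features: 3 loose groups of 20 "courses"
# in a 15-dimensional space. The real features are an unknown here.
centers = rng.normal(scale=5.0, size=(3, 15))
features = np.vstack([c + rng.normal(size=(20, 15)) for c in centers])

# Project to 2-D for plotting. Perplexity must be smaller than the sample count.
embedding = TSNE(
    n_components=2, perplexity=10, init="pca", random_state=0
).fit_transform(features)

# Cluster the original feature vectors. k-means and k=3 are assumptions
# made for this sketch, not the method used for the figure.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Summarize where each cluster lands in the 2-D embedding.
for cluster_id in range(3):
    points = embedding[labels == cluster_id]
    print(cluster_id, len(points), points.mean(axis=0).round(1))
```

Coloring the rows of `embedding` by `labels` in a scatter plot gives a picture in the spirit of Figure 1. Note that t-SNE preserves local neighborhoods far better than global distances, so the relative position of far-apart clusters should be read with caution.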

The general structure of our content

Figure 2. General subject area of courses.
  • Courses on business and finance are clustered together on the right
  • Courses about the natural sciences (physics, chemistry, and biology) are on the left
  • Courses on the computational sciences (math, CS, and statistics) are at the bottom
  • Courses on the social sciences and humanities are at the top
Figure 3. General division of science and humanities courses.
Figure 4. Substructure of courses within each half of the plot.
Figure 5. Interdisciplinary courses sit roughly between the right clusters.


Credit for the t-SNE approach goes to Zhenghao Chen, a previous Coursera data scientist.




Chris Liu

Passionate about education and solving hard problems in a collaborative fashion.