Regression Discontinuity: Understanding the Benefit of Subtitles on Coursera
This is Part III of our Causal Impact @ Coursera series. (Part II is here)
At Coursera we use data to power strategic decision making, leveraging a variety of causal inference techniques to inform our product and business roadmaps. In this causal inference series, we will show how we utilize the following techniques to understand the stories in our data:
(1) controlled regression
(2) instrumental variables
(3) regression discontinuity
(4) difference in differences
This third post in the series covers an application of regression discontinuity to measure the impact of subtitling courses in other languages on enrollments.
At Coursera we believe in making our content as accessible as possible to everyone around the world, in accordance with our mission of expanding universal access to high-quality education.
One important aspect of this is making our content available in different languages to accommodate diverse audiences, and a big part of that is subtitling. Understanding the value of subtitling helps us know where our efforts might have the largest impact.
As soon as an individual video lecture within a course is subtitled in a particular language, we work to make the subtitles available to learners. We would therefore expect enrollments to rise as the fraction of subtitled videos in a course increases, and historically they do.
However, to understand the actual impact of subtitles on enrollments, we can’t simply correlate the share of subtitled videos in a course with its enrollments. Historically, more popular courses also tended to be the ones with the most subtitles, which makes course popularity a confounder when measuring the causal impact of subtitles with historical data.
It turns out, though, that once 80% of a course's videos are subtitled in a particular language, we advertise the course as being available in that language. Below that threshold, the language is not explicitly displayed.
This 80% threshold provides a discontinuity point around which to compare the relationship between subtitle percentage and enrollments. Because the threshold is the same for all courses, it has no relationship to course popularity or other confounders. So if we see a jump in enrollments when a course crosses the 80% threshold, we can attribute it to advertising the presence of subtitles, and the technique of regression discontinuity lets us rigorously quantify that causal impact.
To do this we regress daily enrollments on the fraction of videos subtitled in a course as of each day, an indicator of whether this fraction is above 80% or not, and the interaction of the two (along with potential controls). This effectively fits two separate regression lines around the 80% subtitle threshold, one before and one after, allowing us to quantify the causal impact of advertising subtitles on enrollments as the difference in the two regression lines.
The table below shows the results of the different regressions we ran:
We can estimate the causal impact of subtitles on daily enrollments using this formula:
(coefficient on % Subtitled Above 80%) + 80% × (coefficient on the interaction of % Subtitled and % Subtitled Above 80%)
which is just the difference between the two regression lines, evaluated at the 80% discontinuity.
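To see where this comes from, here is a short derivation. The symbols are our own shorthand and do not appear in the regression output: s is the fraction of videos subtitled, D is the indicator for s being at or above 80%, and the betas are the regression coefficients, with controls omitted.

```latex
\begin{aligned}
\text{enrollments} &= \beta_0 + \beta_1 s + \beta_2 D + \beta_3 (s \cdot D) + \varepsilon,
  \qquad D = \mathbf{1}[s \ge 0.8] \\
\text{fitted line just below } 0.8 &= \beta_0 + 0.8\,\beta_1 \\
\text{fitted line just above } 0.8 &= \beta_0 + 0.8\,\beta_1 + \beta_2 + 0.8\,\beta_3 \\
\text{jump at the threshold} &= \beta_2 + 0.8\,\beta_3
\end{aligned}
```

The terms shared by both lines cancel, leaving only the indicator coefficient plus the interaction coefficient scaled by the threshold value.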
Using this method we can see that the effect of subtitles on daily enrollments is small. But because we can leverage our large community of learners to increase access to Coursera, we still promote the subtitling of courses through our Global Translator Community (GTC). The GTC allows learner volunteers to subtitle any course in any language they want, and helps us ensure that anyone, anywhere, can access the transformative learning available on Coursera.
Interested in Data Science @ Coursera? Check out available roles here.