Improving Coursera Global Site Performance: A Head-to-Head CDN Battle With Production Traffic

Frank Chen, Brennan Saeta (Infrastructure Engineering @ Coursera)

In order to fulfill our mission of universal access to the world’s best education, our learning platform must be fast and reliable throughout the globe. Because more than 60% of our learners come from outside North America, we use a content distribution network (CDN) to serve our static assets (e.g. JavaScript, CSS, and images). With a CDN, learners downloading our static assets connect to the closest point of presence (PoP) instead of our primary datacenter in Northern Virginia, so the site loads significantly faster.

Earlier this year, we evaluated multiple CDN providers to find the best partner for Coursera’s static asset delivery. Although we have many different types of traffic, this specific use case involves delivering lots of relatively small, highly cacheable CSS, JS, and image files to millions of learners around the globe, and in developing nations in particular. Of course, being a startup, we are also very cost-conscious and are always looking to reduce costs in all aspects of our infrastructure.

Traditionally, alternate CDNs are evaluated with synthetic web beacons: probes in select locations configured to load the site through each candidate CDN. Unfortunately, our previous performance work taught us that while beacons can be useful for delving into performance details, they must be validated against real user monitoring data; website performance measured from a high-bandwidth datacenter in Sydney is not representative of our learners in Southeast Asia.

Although testing multiple CDNs on real users of a website has traditionally been very hard, we have a powerful internal A/B testing framework: EPIC (Experimentation Platform and Instrumentation for Coursera). We coupled this system with Edge, our outermost tier through which all external traffic for Coursera flows, giving us complete control to direct individual learners and sessions to different CDNs in real time. In this case, Edge transforms a skeleton HTML page, converting all the relative URLs to absolute URLs for a particular CDN before sending the page back to our learners.
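The post does not show Edge's implementation (Edge itself is not written in Python), but the rewrite step can be sketched as follows. All names here — the CDN hostnames, the `/static/` path convention, and the function name — are illustrative assumptions, not Coursera's actual configuration:

```python
import re

# Hypothetical CDN hostnames for two experiment variants (illustrative only).
CDN_HOSTS = {
    "variant_a": "https://assets-a.example-cdn.net",
    "variant_b": "https://assets-b.example-cdn.com",
}

# Match relative static-asset references in src/href attributes.
ASSET_RE = re.compile(r'(src|href)="(/static/[^"]+)"')

def rewrite_asset_urls(html: str, variant: str) -> str:
    """Convert relative static-asset URLs in a skeleton HTML page into
    absolute URLs pointing at the CDN chosen for this learner's variant."""
    base = CDN_HOSTS[variant]
    return ASSET_RE.sub(lambda m: f'{m.group(1)}="{base}{m.group(2)}"', html)
```

Because the rewrite happens per request at the edge tier, the same cached skeleton page can be served to every variant, with only the asset hostnames differing.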

Technical architecture for our multi-CDN experiment

As part of our experimentation and data analysis platform, Edge assigns a unique identifier to every learner who visits our site, and EPIC uses this identifier to assign the learner to one of several buckets per experiment. EPIC also instruments each session, collecting aggregate information for each experiment variant, including signup rate, enrollment rate, and pageview rate. This in turn allows us to compare the success of each variant in our experiments.
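A common way to implement this kind of bucketing — and a plausible reading of what EPIC does, though the post does not specify — is to hash the learner's identifier together with the experiment name, so assignment is deterministic (the same learner always sees the same CDN) while traffic splits evenly across variants. A minimal sketch, with the variant list purely illustrative:

```python
import hashlib

# Illustrative variant list; the experiment compared several CDN partners.
VARIANTS = ["cdn_a", "cdn_b", "cdn_c", "cdn_d", "cdn_e", "cdn_f", "courcdn"]

def assign_variant(learner_id: str, experiment: str = "multi_cdn") -> str:
    """Deterministically map a learner's unique identifier to one variant,
    giving each variant an approximately equal share of traffic."""
    digest = hashlib.sha256(f"{experiment}:{learner_id}".encode()).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]
```

Hashing with the experiment name in the key means different experiments bucket the same learner independently, so concurrent experiments do not correlate with each other.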

Because our traffic exhibits daily and weekly cycles, we ran our experiment over multiple weeks to ensure a representative view of potential performance improvements. Further, each CDN received an equal proportion of traffic in order to mitigate any potential advantages of higher traffic (i.e. higher cache hit rate).

Results of our multi-CDN experiments at the 75th percentile

After months of work, we found that for this particular class of traffic, and for our current learner population, there were no significant performance differences between the top global CDNs. Even CourCDN, a home-grown system composed of Nginx instances in each AWS EC2 region, had page load times within 10–15% of the best CDNs. More importantly, in the context of the rest of the site’s performance envelope, we saw no statistically significant differences in learner behavior: signups, registrations, and lecture watching did not change in any statistically significant manner. In the end, we opted to renew our contract with Amazon CloudFront for global static asset delivery and pursue alternate strategies to optimize our site’s performance. Stay tuned for further details!

Finally, we would like to sincerely thank the support and business development teams at all of our CDN partners, all of whom put in a significant amount of time and effort to enable us to perform this experiment: Akamai, ChinaCache, Cloudflare, CloudFront, Limelight, and Quantil (not the same order as the graph).
