Analytics at Coursera: three years later

May 16, 2016

What is it like to work in the Analytics organization at Coursera? Three years ago, I wrote a Quora answer describing what was then the Analytics engineering team at Coursera. Today, the Analytics organization consists of engineers and data scientists who collaborate with teams throughout the company. Analytics is Coursera’s primary data hub — the home of business intelligence, product and business analysis, data products, and data infrastructure engineering.

As a founding member of the Analytics team, I’ve had the unique privilege of watching it evolve over the past three and a half years. In this post, I’ll describe where we are today and provide some reflections on the past and future of Analytics at Coursera.

Engineering

From the beginning, Analytics at Coursera has been strongly rooted in engineering. Even though Analytics is no longer purely a software engineering function, the integration of data engineering and data science within the organization has been a core driver of its productivity and effectiveness.

Three years ago, we were just beginning to build out Coursera’s first data warehouse. At the time, Amazon Redshift and AWS Data Pipeline (the technologies on which we based our data warehouse) were new, and no mature software packages for working with these systems were available. In lieu of a data warehouse, most internal data analysis relied on makeshift libraries that executed scatter-gather queries across the hundreds of replica databases backing each of Coursera’s courses. Needless to say, even simple aggregations of engagement metrics across all courses were significant undertakings!
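To make the scatter-gather pattern concrete, here is a minimal sketch in Python. It is not our original library: the in-memory SQLite databases, the `enrollments` table, and the function names are all hypothetical stand-ins for per-course replicas. The same aggregate query is sent to every course database (scatter), and the partial results are summed on the client (gather).

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def make_course_db(enrollments, completions):
    """Build a hypothetical in-memory stand-in for one course's replica."""
    # check_same_thread=False lets a worker thread use this connection later.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE enrollments (user_id INTEGER, completed INTEGER)")
    rows = [(i, 1 if i < completions else 0) for i in range(enrollments)]
    conn.executemany("INSERT INTO enrollments VALUES (?, ?)", rows)
    conn.commit()
    return conn


def query_course(conn):
    """Scatter step: run the same aggregate against a single course database."""
    sql = "SELECT COUNT(*), COALESCE(SUM(completed), 0) FROM enrollments"
    return conn.execute(sql).fetchone()


def scatter_gather(conns, max_workers=4):
    """Gather step: combine per-course partial aggregates into global totals."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(query_course, conns))
    total_enrolled = sum(p[0] for p in partials)
    total_completed = sum(p[1] for p in partials)
    return total_enrolled, total_completed


if __name__ == "__main__":
    course_dbs = [make_course_db(10, 3), make_course_db(5, 5), make_course_db(0, 0)]
    enrolled, completed = scatter_gather(course_dbs)
    print(f"enrolled={enrolled} completed={completed}")
```

Even in this toy form, the client has to own fan-out, error handling, and result merging for every new question, which is part of why a single warehouse table is so much easier to work with.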

Today, our data warehouse is central to all data operations at Coursera, and we have open-sourced the framework on which it is based. Over time, we have also developed an ecosystem of internal tools around our data warehouse to make data available and usable. The impact of these tools on our business has been enormous. For example, in early 2015, we rolled out a new experimentation platform for variant allocation and reporting, which served over 400 A/B tests in its first year alone. By integrating with our data warehouse, this tool reports not only on direct conversion events (e.g., clicks) but also on long-term downstream user behavior (e.g., eight-week course completion rates). Similarly, we built a web-based portal for data access, visualization, and documentation. Every week, roughly 40% of all employees at Coursera use this portal (a figure that, remarkably, does not even include usage of our internal metrics dashboard and self-serve reporting platforms!).
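For readers unfamiliar with variant allocation, the sketch below shows one common approach: deterministic, hash-based bucketing. This is an illustrative assumption rather than a description of our platform's internals; the function name, the `experiment_id:user_id` key format, and the weighting scheme are all hypothetical.

```python
import hashlib


def allocate_variant(experiment_id, user_id, variants):
    """Deterministically map a user to a variant.

    `variants` is a list of (name, weight) pairs with positive weights.
    The same (experiment_id, user_id) pair always yields the same variant.
    """
    total = sum(weight for _, weight in variants)
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes of the hash as a point in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, weight in variants:
        cumulative += weight / total
        if point < cumulative:
            return name
    # Guard against floating-point rounding at the upper boundary.
    return variants[-1][0]


if __name__ == "__main__":
    arms = [("control", 1), ("treatment", 1)]
    print(allocate_variant("hypothetical-experiment", 12345, arms))
```

Because the assignment depends only on the experiment and user identifiers, it can be recomputed anywhere without storing state, and including the experiment identifier in the hash key keeps allocations in different experiments independent of one another.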

As Coursera’s data needs have become more sophisticated, data tools and libraries originally developed in isolation by teams throughout the company have been redesigned and integrated into our data infrastructure. For example, our internally developed email communication tools have full access to the personalization and experimentation capabilities that power our website. Recently, we have centralized development of our client-side and server-side eventing libraries for instrumenting our website and mobile apps. Shifting from building one-off tools and libraries to building an end-to-end ecosystem of deeply integrated, interoperable data tools has required significant investment. However, the gains in long-term team productivity we’ve already seen have been enormous.

Given the breadth of technologies we use, Analytics engineers have become increasingly full-stack over time, with team members writing code across our front-end, back-end, and mobile platforms (both iOS and Android). Aside from our tool development work, we have also recently begun to invest more heavily in developing high-quality data assets, including shared tables, metrics definitions, documentation, and training materials. Our newly created business intelligence engineering team is the primary steward of these Coursera data resources, which are consumed not only within Analytics but also throughout the company and by external researchers at our partner institutions.

Data science

Three years ago, we had no notion of a data science group, distinct from engineering, that would focus on contributing to the product through data analysis and experimentation. At the time, the team still largely consisted of computer scientists. A few weeks after posting my Quora answer, we extended our first non-engineering offer to a talented economist from Harvard. But even then, creating a purely analyst role felt like a significant step, as much of our internal data infrastructure was still immature (e.g., we had no data warehouse, so all data access involved writing scripts to directly access replicas of production SQL databases).

Around this time, the broader Coursera organization began its transition into a matrix structure. Based on reading about industry best practices and hearing the recommendation of our advisors at more mature tech companies, we began exploring the concept of “embedding” — i.e., directly allocating data scientists to specific product and business teams to optimize their ability to form relationships, gather context, and influence direction. Our year-long experiment with embedding was largely successful, but we found this model difficult to scale effectively. As the organization grew from roughly 50 employees to over 140 a year later, the number of internal teams increased more quickly than we could hire data scientists, often leading to resourcing gaps. Furthermore, embedding data scientists sometimes led to information silos with reduced knowledge sharing and duplication of work, despite our attempts to create a “center-of-excellence”. Finally, embedding generally resulted in poor investment in developing shared resources that could be used by the group as a whole.

Today, we’ve evolved our embedding model into a more nuanced structure in which a small number of data science sub-teams (or “clusters”) collaborate with one or more partner teams throughout the company. We assign partner teams to clusters based on the relatedness of their goals. For example, one of our clusters works with both the customer operations team and the learning experience product team, since both groups deal with improving the “within-course” experience for learners. Another cluster works with both the growth and marketing teams, who have a shared goal around top-of-funnel optimization.

We’ve observed that, compared to a pure embedding model, clusters promote better communication and collaboration among data scientists working in related areas while keeping them closely connected with partner teams. Clusters also improve team efficiency by reducing redundant work and making it easier to reallocate resources to the highest-impact efforts within a given domain. For example, in late 2015, we recognized the importance of improving event tracking for mobile devices, but the significant investment required made it difficult to carve out time to get this work done. Here, clusters allowed us to shift responsibilities internally to accommodate this longer-term effort while continuing to collaborate effectively with our partner teams.

Within clusters, most data scientists function as decision scientists, representing the “voice” of data in informing product and business decisions. In practice, this involves balancing reactive work (i.e., supporting the ongoing efforts of teams by answering questions or guiding the design of complicated experiments) with proactive work (i.e., identifying novel insights through exploratory deep-dive analysis, ad hoc modeling, or advanced statistical inference). When done well, decision science at Coursera has had a powerful influence on the way we do business. For example, data scientists played a critical supporting role in guiding experimentation efforts that led to a quadrupling of course completion rates on our new learning platform over the past year. Similarly, proactive analyses identified content quantity and quality as key drivers of learner engagement and satisfaction, respectively. These discoveries helped motivate changes in Coursera’s strategy for content sourcing and sharpen the company’s resolve to implement quality standards for course production. Because of Coursera’s strong internal culture around using data to make decisions, data scientists often have very significant opportunities to shape the direction of the organization.

Paradoxically, for a company founded by two experts in machine learning, we’ve actually invested less to date into leveraging machine learning than one might expect. Historically, this has been driven by a need to prioritize fundamental improvements in infrastructure and data quality as the company has been navigating a major platform transition over the past two years. Recently, as this transition has neared completion, we’ve begun to increase our emphasis on driving impact directly through the development of data products. Today, we have one cluster dedicated to the application of machine learning methods for personalizing the learning experience. The current focus of this cluster is on improving our course recommendation emails, which today account for roughly one-sixth of all enrollments on the site. In other parts of the data science organization, we also have ongoing efforts in improving the quality of our catalog search and are beginning to explore the development of new data products for improving the learning experience. With many of these efforts just beginning, we believe the time is ripe to invest in data products as a way to improve and scale the product.

Reflections

The past three years at Coursera have been an incredible experience. Looking back at the state of the team and the company and all the challenges we’ve overcome, I couldn’t be prouder of the team’s progress and everything we have accomplished.

To put things in perspective, three years ago, manual edits to production databases were standard practice, data quality issues resulted from major architectural oversights in our backend systems, and the business use of data primarily involved answering ad hoc requests. Today, the backend infrastructure powering the Coursera website is world-class, product engineering teams more actively consider data quality when deploying code, and hardly any product or business decisions are made without consulting data.

Of course, there’s still much more to be done, and quite frankly, we’ve hardly begun to scratch the surface of what is possible. How can we use data to develop richer, personalized online learning experiences that are maximally effective at generating real learning outcomes? How do we develop data-driven algorithms that take advantage of Coursera’s unprecedented scale to provide actionable insights to instructors? How can analytics help grow Coursera’s business and ensure that we successfully provide access to educational resources for millions of learners around the world? We don’t have a great answer to any of these questions, but we believe that solving these challenges will be critical for achieving the company’s mission. If solving technical problems in the education space that have the potential to transform people’s lives sounds appealing, Coursera Analytics might be the place for you. We’re just getting started.

Originally published at building.coursera.org on May 14, 2016.

Written by Chuong Do