Building data infrastructure at Coursera
I’ve been working on data infrastructure at Coursera for about 3.5 years. This week, I had the opportunity to speak about our data infrastructure at the Data Engineering in EdTech event at Udemy. This article is adapted from my notes for that talk, in which I shared a few lessons from building a real-world data infrastructure from scratch.
At Coursera, data plays a big role in our daily work, from overall aggregated numbers (e.g., 27M learners on the platform) to specific zoomed-in metrics (e.g., 45% of our learners are from emerging markets). All these numbers help us make data-driven decisions every day. Giving people across the company easy access to our data is always the first priority of the data infrastructure team, and it is definitely not an easy task.
Challenge 1: Data is everywhere
Like at every other internet company, data is everywhere at Coursera. When I joined, we were starting to rebuild our web application and build our mobile applications. Data was tracked through multiple channels in a few inconsistent ways: some data was tracked in an unstructured format and sent directly to our eventing system, some lived in our MySQL or Cassandra databases, and some existed only in third-party tools like SurveyMonkey. I even heard stories of people manually logging into each database to calculate the daily active learner metric. Luckily, I didn’t need to do any of that, as we had just started building our enterprise data warehouse (EDW). Though there were only a few dozen tables in our EDW at the time, it was a solid start.
We picked Redshift as our EDW system. Besides the standard SQL interface that every data scientist understands, Redshift is fast and, more importantly, reliable. We only had three engineers at that time, and with a few clicks in the Redshift console we could reboot, resize, and perform other operations on the cluster. Unlike with Hadoop or Spark at that time (I hear both are much better now, and we will look into them again when Redshift is no longer sufficient for us), we rarely need to debug issues such as OOM errors: Redshift reliably executes most of our queries without any memory or performance tuning of the query.
We did try Hadoop and Spark at that time. Compared with Redshift, they carried a huge operational cost that we couldn’t afford as a small team. Thanks to Redshift, we can focus on building tools that have a direct impact on our business instead of spending time on operational tasks. Redshift has served us well over the past four years, and we haven’t looked back.
Solution: build an EDW system to keep all your data in one place (some latency is OK in most cases).
Challenge 2: Data requests are from everywhere.
Once we started moving data into Redshift, data requests flooded in from everywhere: engineers, data scientists, marketing, customer support, sales, and external users like universities, enterprise content providers, and customers. Everyone in the company wants to understand our data in a more quantitative way, and we saw requests across a huge spectrum of domains.
To meet this demand while the team was small, our solution was very simple: we built an internal query page, a web page where people can write SQL queries, draw simple charts, and share their work. This helps address a few issues:
- Access centralization. Before this tool, people used all sorts of tools to access Redshift, which gave us a hard time because any misuse of a tool could potentially bring down the EDW. For example, some tools don’t release their locks on Redshift tables until the connection is closed. If people forget to disconnect (which they often do), our ETL system breaks because it cannot write new data into tables that are still locked by the open connection. Now that people always access the EDW through this tool, we can easily monitor and manage their queries. For example, if the EDW is hot and overwhelmed, we can limit access through the tool to throttle the jobs sent to the EDW until it cools down.
- Democratization. Since this tool is built on the web, every query and execution result is saved in the tool. People share any query or result by copying and pasting a URL. This allows anyone in the company to run a simple ad hoc analysis and share the result with others. People in non-data-scientist roles especially love it: as long as they have internet access, they can go to the query page and write queries to get the answers they want. This self-serve query system reduces the daily load on data scientists, allowing them to focus more on deep-dive analysis and less on day-to-day data inquiry support.
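The throttling idea from the first point can be sketched as a small gate sitting between the query page and the warehouse. This is only an illustration under stated assumptions, not the query page's actual code: `QueryGate`, `run_query`, and the `execute` callable are hypothetical names, and a real system would more likely queue requests than reject them.

```python
import threading
from contextlib import contextmanager


class QueryGate:
    """Caps how many queries are forwarded to the warehouse at once (hypothetical)."""

    def __init__(self, max_concurrent):
        self._lock = threading.Lock()
        self._max = max_concurrent
        self._running = 0

    def set_limit(self, max_concurrent):
        # Operators lower this while the warehouse is overloaded.
        with self._lock:
            self._max = max_concurrent

    @contextmanager
    def slot(self):
        # Claim a slot, or refuse immediately if the limit is reached.
        with self._lock:
            if self._running >= self._max:
                raise RuntimeError("warehouse is busy, try again later")
            self._running += 1
        try:
            yield
        finally:
            # Always give the slot back, even if the query fails.
            with self._lock:
                self._running -= 1


def run_query(gate, execute, sql):
    # `execute` stands in for whatever actually talks to the warehouse.
    with gate.slot():
        return execute(sql)
```

Because every query goes through `run_query`, lowering the limit with `set_limit` throttles all users at once, which is the benefit of a single access point.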
Solution: focus on self-serve data access through a centralized access point. Avoid letting your customers pick their own tools to access the EDW, because that is hard to debug and manage.
Challenge 3: Everyone hates ETLs, everyone needs ETLs.
As we expanded our product lines and business, the demand for ETLs grew higher and higher.
This illustration accurately describes what happened at Coursera before we built our in-house ETL system. tl;dr: no one was really happy.
So, let’s imagine a case: our customers (data scientists or PMs) want to understand the performance of a new feature they just launched. They ask us where the data is in the EDW, we tell them that it is not actually in the EDW yet, and then they ask whether we could ETL the data. The problem is that we don’t build products directly, so we don’t know when a new product launches or what data it tracks. How the hell are we supposed to build this ETL?!
Trying to be good neighbors, we spent a lot of time with product engineers figuring out where the data was and writing ETL jobs for our customers. After a lot of back and forth, we would finally build the ETL job. But because we are not the users of the data, we don’t know whether it has quality issues. We are not the creators of the data either; our product engineers are. So when our customers see problems in the data, they ask us, we ask the product engineers to fix them, and the cycle repeats until all issues are resolved. Our customers can’t get good data in time, our product engineers are constantly bugged by us, and we are constantly stuck in the middle of the process. No one is happy.
Having built the query page, we had seen the power of providing an easy-to-use tool around data infrastructure. So we spent time talking with our customers and found a set of common operations. We developed a set of operators that let them specify each ETL job without worrying about the implementation details. These operators are implemented as standard Docker images managed by AWS ECS (EC2 Container Service). We also exposed them through a web page, so people can define their ETL jobs with just a few clicks and parameters.
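To make the idea of "an operator plus a few parameters" concrete, here is a minimal sketch of what a declarative job definition could look like. Everything in it is an assumption for illustration: the operator names, the `EtlJob` class, and the `ETL_` environment variable convention are hypothetical and not taken from our system.

```python
from dataclasses import dataclass, field

# Hypothetical operator catalogue: operator name -> parameters it requires.
OPERATORS = {
    "mysql_to_redshift": {"source_table", "target_table"},
    "cassandra_to_redshift": {"keyspace", "source_table", "target_table"},
}


@dataclass(frozen=True)
class EtlJob:
    """What a user would fill in on the web form (hypothetical shape)."""

    name: str
    operator: str
    params: dict = field(default_factory=dict)

    def validate(self):
        # Reject jobs that name an unknown operator or omit a parameter.
        if self.operator not in OPERATORS:
            raise ValueError(f"unknown operator: {self.operator}")
        missing = OPERATORS[self.operator] - set(self.params)
        if missing:
            raise ValueError(f"missing parameters: {sorted(missing)}")

    def container_env(self):
        # Parameters are handed to the operator's container as
        # environment variables, so the user never touches the image.
        self.validate()
        return {f"ETL_{key.upper()}": str(value)
                for key, value in sorted(self.params.items())}


job = EtlJob(
    name="daily_enrollments",
    operator="mysql_to_redshift",
    params={"source_table": "enrollments", "target_table": "edw.enrollments"},
)
print(job.container_env())
```

The point of the design is that the person who owns the data only supplies the operator name and parameters, while the infrastructure team owns the container image behind each operator.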
The result is that everyone is happy. We are happy because we no longer need to be the middleman in every ETL process, nor understand the nitty-gritty details of every ETL. We can focus on maintaining the tool, while product engineers, data scientists, and other customers talk directly with each other through it. Data takes much less time to be ETLed into Redshift, and data quality issues are resolved faster because customers can talk directly with the owner of the data.
Solution: build an easy-to-use ETL system and don’t be the middleman. Don’t write ETLs yourself; let your customers write them in an effective way.
Challenge 4: Data scientists are not engineers, and they are not the same as each other.
Our data scientists definitely use the tools I described, and they love them, but they are also often the advanced users of the data infrastructure. For example, if they want to do advanced analysis or model building, SQL access is not enough.
My own experience with data scientists is that their title is a lie: every single one has their own role and function, and every single one has their own tools for data analysis.
A simple Google search shows that people have put a lot of thought into how many different types of data scientists there are in the world. Keeping in mind that data scientists differ from one another is super important for building a data infrastructure that is useful to them.
For a few years, our data scientists could pick whatever tools they wanted, and the result was meh. For ad hoc analysis, this is actually fine because people care most about the conclusion. But for daily dashboards or advanced models, it easily becomes a problem. People picked Python 2, Python 3, and/or R for different tasks, and even people who chose the same language could still pick different libraries for the same task. The development environments were also maintained by the data scientists on their own laptops. These inconsistent environments became an operational nightmare whenever people wanted to work together or pick up each other’s tasks.
Our solution is to provide a standard Docker image and require everyone to use the same set of tools instead of inventing their own. On top of this, we provide a cloud-based service so that people can simply log in from a browser and access RStudio and Jupyter notebooks. Both RStudio and JupyterHub run on top of this standardized container as well.
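One way to see how far an environment has drifted from a shared image is to compare installed package versions against a pinned list. The sketch below is an assumption-laden illustration rather than part of our tooling: `find_drift` and the `PINNED` versions are hypothetical, and a real pin list would mirror whatever the standard image installs.

```python
from importlib import metadata


def find_drift(pinned):
    """Return {package: (pinned_version, installed_version_or_None)} for mismatches."""
    drift = {}
    for package, wanted in pinned.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment at all.
            installed = None
        if installed != wanted:
            drift[package] = (wanted, installed)
    return drift


# Hypothetical pins, for illustration only.
PINNED = {"pandas": "1.5.3", "scikit-learn": "1.2.2"}

if __name__ == "__main__":
    for package, (wanted, installed) in find_drift(PINNED).items():
        print(f"{package}: image pins {wanted}, this environment has {installed}")
```

Inside the standardized container this check should report nothing, which is exactly what makes it easy to pick up someone else's notebook.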
It turns out that they are more than happy to use this. I guess the reason that they picked random tools to begin with is that they don’t care which tool they use. Now they are happier than before because they can also collaborate easily with each other.
One important thing that we believe we did right is running this analytics dev environment in the cloud and asking people to access it through browsers. Today, both RStudio and Jupyter provide good browser-based interfaces.
One benefit is that when someone has a technical issue, we can log in to their account to see what’s wrong and help fix it remotely, without being physically next to them. The other benefit is that the tool is accessible to everyone at Coursera. Besides data scientists, people in other roles such as engineers and content managers also love it, because it gives them the power to do advanced analysis without maintaining their own dev environment.
Solution: Standardize your analytics dev environment for advanced analysis.
Data Infrastructure Orchestration
We developed our data access tool, ETL tool, and analytics dev environment on top of the EDW using RESTful APIs, Docker, and ECS. These decisions have served us well. You might ask why we didn’t buy enterprise solutions. The simple answer is that we didn’t see many good alternatives three years ago. I admit that many good alternatives have appeared in recent years that could potentially replace these in-house systems, but a nice benefit of building the tools in house is that we can easily build other systems on top of them and adapt them to new business needs.
For example, our experimentation platform and email management system talk to our data access point through an API without worrying about how data access is implemented. Similarly, for our machine learning system, data scientists build a model in the dev environment and push it to the ML system, which manages the pipeline without knowing the details of the model because the model is containerized in Docker. The ML system can also access data, giving people the ability to introspect our models and monitor our ML products.
There are three principles that I think are super critical for building a data infrastructure:
- Centralize: Everything should start with centralization. Put all the data in one place, provide a centralized toolset for accessing it, and centralize all analytics dev environments in the cloud. This turned out to be a super useful principle for us, and it set up the foundation for taking on the other challenges.
- Standardize: Once data and data access are centralized, standardization becomes a natural next step, because it is super easy to spot inconsistencies among data and tools. Standardization also gives us basic building blocks and increases the reusability of our data infrastructure.
- Democratize: Finally, really hold on to the idea of democratizing data. You could argue that democratization is a side effect of centralizing and standardizing the data infrastructure, but we think about it differently. We intentionally build our tools to be accessible to everyone, and the result is tremendous.
I want to end with the chart above. It shows the number of unique weekly users of the EDW platform and the experimentation platform through their web UIs. We have 300 people in the company, and our analytics team is only 27 people. Each week, basically everyone in the company uses the tools we build to access data.
When you make data accessible, people will access the data. This means a lot to us, and we will continue building self-serve data infrastructure to democratize data at Coursera.