How do we collect and use data at Coursera?

Copied from my Quora answer: https://www.quora.com/What-tools-do-startups-use-to-collect-data-How-can-they-use-the-data-to-improve-the-product

At Coursera, we use a mix of in-house, open-source, and cloud-based enterprise solutions.

  • User and service activity data. At Coursera, our clients make API calls to backend services, and our backend is a fleet of services built on a service-oriented architecture. We developed tracking libraries for web, mobile, and backend, along with a backend service called eventing that receives all the events collected by our clients and other services. The eventing service periodically uploads all received events to S3 and continuously publishes them to Kafka.
  • User data stored in our production databases. We periodically dump data from our production databases into our enterprise data warehouse, which is powered by Redshift. This is also called dimensional data. It not only provides great insights about our users on its own, it also works well when joined with user activity data.
  • Site health. We use Datadog and Sumo Logic to monitor our backend services and the numerous logs they write.
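
To make the two output paths of the eventing service concrete, here is a minimal Python sketch. It is a toy illustration, not our production code: the class name `EventingService`, the `publish` and `upload` callbacks (stand-ins for a Kafka producer and an S3 upload), the 60-second flush interval, and the `events/<timestamp>.jsonl` key layout are all assumptions made up for this example.

```python
import json
import time
from typing import Callable, Dict, List

Event = Dict[str, object]


class EventingService:
    """Toy event collector: streams every event and uploads periodic batches."""

    def __init__(
        self,
        publish: Callable[[str], None],      # hypothetical stand-in for a Kafka producer
        upload: Callable[[str, str], None],  # hypothetical stand-in for an S3 put(key, body)
        flush_interval_s: float = 60.0,      # assumed interval, not a real setting
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._publish = publish
        self._upload = upload
        self._flush_interval_s = flush_interval_s
        self._clock = clock
        self._buffer: List[str] = []
        self._last_flush = clock()

    def receive(self, event: Event) -> None:
        """Accept one event from a client or another service."""
        line = json.dumps(event, sort_keys=True)
        self._publish(line)        # continuous path: publish right away
        self._buffer.append(line)  # batch path: hold until the next flush
        if self._clock() - self._last_flush >= self._flush_interval_s:
            self.flush()

    def flush(self) -> None:
        """Upload all buffered events as one newline-delimited JSON object."""
        now = self._clock()
        if self._buffer:
            key = f"events/{int(now)}.jsonl"  # hypothetical key layout
            self._upload(key, "\n".join(self._buffer))
            self._buffer = []
        self._last_flush = now
```

Passing the sinks in as callbacks keeps the sketch runnable without any cloud credentials. For example, `EventingService(print, lambda key, body: None)` would print each event as it arrives and discard the batches.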

How do we use this data to improve our product?

  • We build dashboards on top of this data to show the company how our product is doing and to help executives make strategic decisions. Every morning, all of our dashboards are refreshed with the previous day's data, and everyone in the company can access the data.
  • We built an A/B testing framework to help our product team make informed decisions about launching features.
  • Our data scientists use a common data science toolkit (Python notebooks, R, SQL, statistics, machine learning, etc.) to conduct deep-dive analyses of particular product issues, e.g., predicting the next course a course completer will enroll in, or understanding why students drop out of our courses.
  • We improve our site reliability and debug site issues using all the data we collect. Most production bugs can’t be deciphered without the data we collect. On the other hand, with the help of that data, many site issues are resolved with ease.
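
One building block that most A/B testing frameworks share is deterministic assignment: a given user always sees the same variant of a given experiment. The Python sketch below shows one common way to do this with a hash. It is an illustration of the general technique, not a description of our framework, and the function name `assign_variant` and the experiment name in the usage lines are hypothetical.

```python
import hashlib
from typing import Sequence


def assign_variant(
    user_id: str,
    experiment: str,
    variants: Sequence[str] = ("control", "treatment"),
) -> str:
    """Deterministically map a user to one variant of one experiment."""
    # Hash the experiment name together with the user ID, so the same user
    # can land in different buckets across different experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it to a bucket index.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# Repeated calls with the same inputs always agree.
first = assign_variant("user-42", "new_enroll_button")
second = assign_variant("user-42", "new_enroll_button")
```

Because the assignment is a pure function of the user ID and experiment name, it needs no lookup table, and the variant can be recomputed later when joining experiment exposure with the activity data described above.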

Data is only valuable when it provides business value. I would focus on developing a coherent view of the product, and on building tools that collect data with concrete value and clear goals.