We have a data lake, now what?
At Kapten, we are proud to write that in about a year we have developed a pretty stable and mature data lake. Our architecture mainly revolves around daily batch data ingestion and processing jobs orchestrated by Airflow. We host all our workers and services on Google Cloud Platform and deploy pretty much everything on their awesome Kubernetes Engine. We manage around 50 TB of data and 400 jobs (a.k.a. DAGs in Airflow). We also developed a custom framework on top of Airflow so that mere mortals with SQL knowledge can create their own data ingestion and processing pipelines with a pull request of a few lines of configuration. At Kapten, data engineers do not write custom ETLs anymore: the future users of those ETLs, mainly data analysts, write them themselves. Delivering this architecture was a great milestone in the life of our data engineering team. But with growing data volumes and new business needs comes a new challenge: making our data lake real time.
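To give an idea of what "a few lines of configuration" can look like, here is a minimal sketch of a config-driven pipeline. Everything in it is illustrative: the field names (`source_table`, `schedule`, `transforms`) and the expansion logic are hypothetical, not Kapten's actual framework, and a real implementation would emit Airflow operators instead of plain strings.

```python
# Illustrative sketch of a config-driven ETL framework: a small declarative
# config is expanded into an ordered list of ingestion and transform steps.
# In a real setup these would become Airflow tasks; here they are strings.

def build_pipeline(config: dict) -> list:
    """Expand a declarative config into ordered step descriptions."""
    # First step: ingest the raw source table into the data lake.
    steps = ["ingest {0} -> raw.{0}".format(config["source_table"])]
    # Then one step per SQL transform declared by the analyst.
    for name, sql in config.get("transforms", {}).items():
        steps.append("run transform {}: {}".format(name, sql))
    return steps

# What an analyst's pull request might contain (hypothetical schema).
config = {
    "source_table": "rides",
    "schedule": "0 2 * * *",  # nightly batch at 02:00
    "transforms": {
        "rides_per_driver": "SELECT driver_id, COUNT(*) FROM raw.rides GROUP BY driver_id",
    },
}

for step in build_pipeline(config):
    print(step)
```

The point of this design is that the analyst only declares *what* to ingest and transform; the framework owns *how* it runs (scheduling, retries, deployment).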
Why go real time?
Apart from the fact that stream processing is an innovative field in data engineering, we have other valid reasons to make the shift to real time. These reasons come from the technical and business context our company is currently in.
Data growth impacts our capacity to deliver fresh data every day
All company data is updated in our data lake every day: every night we ingest all the changes made the day before on production databases. This can represent quite a lot of data, and extracting it has a cost. Transferring it to our data lake creates spike loads on top of normal production traffic, and it can take several hours for our ingestion jobs to finish. As the company and its data grow, these long ingestions delay the availability of BI dashboards in the morning, and late data is bad for business.
If we could process new data as it is created, we would smooth out our ingestion load throughout the day instead of waiting for midnight to ingest it all at once.
Capturing data evolution, not only snapshots
Data does not have the same lifecycle in production databases and in our data lake. On a production database, a document can be updated several times a day, but our ingestion jobs only capture its state at midnight, so intermediate changes are missed. It means that in some cases our data lake only holds coarse-grained snapshots of the data. Real-time ingestion would let us go beyond this limit: by capturing all changes made in a database as they happen, we would ingest every mutation of a document, turning each data update into an event rather than a snapshot. This offers a finer data resolution and enables new data mining capabilities.
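The difference between the two approaches can be sketched in a few lines. The document, fields, and timestamps below are invented for illustration: a nightly snapshot keeps only the last state of a document, while change-data-capture keeps every mutation as an event.

```python
# Illustrative sketch: snapshot ingestion vs change-data-capture.
# The document, field names, and timestamps are hypothetical examples.
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    doc_id: str
    field: str
    value: str
    ts: str

# A single ride document updated three times during the day.
events = [
    ChangeEvent("ride-42", "status", "requested", "2019-06-01T09:00"),
    ChangeEvent("ride-42", "status", "accepted",  "2019-06-01T09:01"),
    ChangeEvent("ride-42", "status", "completed", "2019-06-01T09:20"),
]

# Nightly snapshot ingestion: later states overwrite earlier ones,
# so only the state at midnight survives.
snapshot = {e.doc_id: e.value for e in sorted(events, key=lambda e: e.ts)}

# CDC-style ingestion: every mutation becomes its own event row.
event_log = [(e.doc_id, e.field, e.value, e.ts) for e in events]

print(snapshot)        # {'ride-42': 'completed'}
print(len(event_log))  # 3
```

With the snapshot, the intermediate `requested` and `accepted` states are lost; the event log preserves the full history of the document.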
Evolving business needs of data customers
In our current architecture data is refreshed once a day.
Reports and dashboards built on yesterday's data are made available to analysts and business teams around 8 am. Everyone arrives at the office around 9, and all decisions made during the day rest on figures computed from yesterday's platform activity. You might think that is fine and that the data is fresh enough to make accurate assumptions about our market. In most cases it is true, but new needs that require more frequent data updates are arising.
Send the right message, at the right time
At Kapten, as a ride-hailing marketplace, we have two types of customers: riders and drivers. We have various communication channels to manage our interactions with them in an automated way. All these systems work on trigger logic: send a templated SMS to a driver when he has made more than X rides in the last X days, or send a coupon to a rider when he has not taken a ride in the last X days.
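A trigger rule like the ones above can be sketched as a simple decision function. The thresholds, campaign names, and function name are all hypothetical stand-ins for the actual CRM rules.

```python
# Illustrative sketch of trigger-based CRM logic; thresholds and
# campaign names are invented for the example.
from typing import Optional

def pick_campaign(rides_last_7_days: int, days_since_last_ride: int) -> Optional[str]:
    """Return the campaign to trigger for a customer, if any."""
    if rides_last_7_days > 50:        # reward very active drivers
        return "congrats_sms"
    if days_since_last_ride > 14:     # win back inactive riders
        return "comeback_coupon"
    return None

print(pick_campaign(60, 0))   # congrats_sms
print(pick_campaign(0, 30))   # comeback_coupon
```

The catch, discussed next, is that the *inputs* to such a function are only as fresh as the last batch ingestion.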
The efficiency of these automated operations is multiplied when the right message is sent at the right time, and the right time is probably not when the ETL jobs end in the early morning. Making customer data ingestion real time will enable more fluid interactions with our customers and empower our CRM teams.
Gauge the pulse of the market in real time
At Kapten, operations teams interact daily with drivers in each city. They often need to check that everything is going fine in the market, either by reading macro metrics or by investigating specific driver cases, so they would welcome access to real-time data. For instance, when we open a new city it is crucial to follow the supply and demand evolution minute by minute and act upon these figures, so that every problem can be fixed in a matter of hours rather than days.
Machine learning wants hot data
Data science, and more specifically prediction models, are all about predicting an outcome from a context. When you ship a model into an app, the current context is the one your user is in. So for an accurate prediction you need an accurate context and an up-to-date view of customer data. It is not relevant to make a prediction about current behavior if your features are not computed from the latest customer actions in your app.
In a nutshell, performant data science models in the wild (in production) need to be fed with freshly computed features, otherwise your data scientists will become frustrated.
Ok, but why not give access to production databases for these use cases?
We could, and we did. After all, data in production databases is always up to date, not just every morning.
But running analysis on top of production databases is what we call prehistoric data engineering at Kapten.
Why should the company bother having a data lake if a business team can query production databases?
For several important reasons:
GDPR: business teams must not access personal data, nor data unrelated to their field of expertise.
Load and tooling: some analyses and computations may require a lot of computing resources that are not available in our production databases. Production databases are made to serve our app and microservices with data; dedicated computing infrastructure and a different technology stack need to be deployed to perform analysis. This is why we use BigQuery as a highly scalable and performant query engine.
DSL: our production databases run on MongoDB, whose query DSL and document structure are not made for analytical purposes. BigQuery offers a SQL DSL, which is a basic commodity for analysts.
Centralized data: one of the biggest advantages of our data lake is that it offers a clean registry of our business data. A lot of work has been built upon the raw data provided by production databases, and our data cleaning and standardization allow convenient access to the data for business operations.
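The DSL point above is easy to see side by side: the same "rides per driver" question expressed as a MongoDB aggregation pipeline versus SQL. Collection and table names (`rides`, `analytics.rides`) and the `status` filter are hypothetical examples, not our actual schema.

```python
# Illustrative comparison of the two query DSLs: the same aggregation
# written as a MongoDB pipeline and as BigQuery-style SQL.
# Collection/table names and fields are invented for the example.

mongo_pipeline = [
    {"$match": {"status": "completed"}},
    {"$group": {"_id": "$driver_id", "rides": {"$sum": 1}}},
]

bigquery_sql = """
SELECT driver_id, COUNT(*) AS rides
FROM `analytics.rides`
WHERE status = 'completed'
GROUP BY driver_id
"""
```

Both say the same thing, but the SQL version is the one most analysts can read and write without training on MongoDB's operators.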
All these reasons explain why we have a data lake at Kapten, and why using our production databases as a real-time data source is not conceivable.
Ok, now let’s code
We are currently working on this real-time shift and will write a follow-up article to walk you through the technical process of moving to a real-time data lake.