Every tech company that operates at very large scale will tell you about the importance of properly handling data transport and manipulation. Providing context-aware location services for 50 million users on mobile phones all over the world, we at In Loco need to constantly re-imagine and re-invent our infrastructure so that not only can our developers and data analysts inspect and extract insights from our data — but applications can also communicate efficiently under very high traffic volumes.
While designing our data infrastructure, our data engineering team worked under two very strong premises that would shape the current architecture:
Motivation #1: For data-driven decision making
At In Loco, we avoid making uninformed decisions at all costs. Having the support of a statistical measure or a key performance indicator is extremely important for those who have to make a decision, be it at the product level or the development level. Not only should marketing and business teams be able to make informed decisions, but development teams as well.
Having a centralized “data team” as the squad responsible for collecting metrics is not scalable, from either a business or an engineering perspective. Moreover, the producer of the data is the person (or team) best positioned to extract the most information and knowledge out of it, and they should face as few barriers to it as possible.
Motivation #2: For efficient application communication
Just as employees at a company need crisp, clear communication to achieve a common goal, so do applications. We designed our data infrastructure so that applications can easily propagate data without having to worry (much) about cumbersome procedures and protocols.
The following diagram illustrates the main components of our engineering stack. We go into more detail on each layer below.
Application data transport layer
We rely heavily on Kafka to transport data amongst applications in an efficient manner. There are several benefits to using a message broker system to transfer data amongst applications:
- Responsibility isolation: instead of an application having to track every single consumer of its data, it is much easier for future applications to integrate themselves into a publisher-subscriber system. Having a message queue allows scaling or tweaking data producers and consumers separately;
- Backpressure: if an application generates data at a much higher pace than its clients can consume (eg: one app sending a lot of POST requests to another), instead of having those applications communicate directly, it is possible to apply backpressure by producing the data to a message queue such as Kafka and having the clients consume it from the broker. It is much easier to control the rate at which data is transferred through Kafka, with clients consuming data at a steady pace and scaling the consumers whenever record consumption starts getting delayed;
- Database data ingestion: putting both points above together, it is also much easier to transfer data from an application to a database through a message queue. Instead of an application writing directly to the database (and again, having to deal with the issue of scaling the whole application just to achieve better database insertion), a better approach is to have the service produce its data to Kafka and then use another service to ingest the data from Kafka into the databases — for example, our team has developed an app written in Go to efficiently transfer data from Kafka to ElasticSearch. This allows us to have different views of the same data — for example, to easily cache key attributes on Redis, store on Postgres and index for textual search on ElasticSearch.
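To make the backpressure point concrete, here is a minimal sketch of the kind of lag-based sizing check a consumer autoscaler might perform. The function name, rates and thresholds are hypothetical illustrations, not our production logic:

```python
import math

# Illustrative sketch: decide how many Kafka consumers a group needs
# in order to drain its current record lag within a target window.
# All names and numbers here are hypothetical.

def consumers_needed(total_lag, records_per_sec_each, target_secs, max_consumers=32):
    """Return how many consumers can drain `total_lag` records within
    `target_secs`, given each consumer processes `records_per_sec_each`
    records per second. Clamped to the range [1, max_consumers]."""
    capacity_each = records_per_sec_each * target_secs  # records one consumer drains
    needed = math.ceil(total_lag / capacity_each)
    return max(1, min(needed, max_consumers))
```

For example, `consumers_needed(2_000_000, 5_000, 60)` returns `7`: each consumer drains 300,000 records per minute, so seven are needed to clear a two-million-record lag within the window.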
Almost all of our data that goes through Kafka is encoded in Avro. Using Avro allows us to easily generate code for data manipulation in many languages (Java, Go) without having to write clients or libs ourselves, while also taking advantage of well-defined data schemas and efficient (de)serialization.
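For illustration, here is what a hypothetical Avro record schema for a location event might look like, expressed as the Python-dict form of Avro's JSON schema language (the namespace and field names are made up, not our actual schemas):

```python
# Hypothetical Avro schema for a location event, written as the Python-dict
# form of Avro's JSON schema language. Namespace and fields are illustrative.
location_event_schema = {
    "type": "record",
    "name": "LocationEvent",
    "namespace": "com.example.events",  # hypothetical namespace
    "fields": [
        {"name": "device_id", "type": "string"},
        {"name": "latitude",  "type": "double"},
        {"name": "longitude", "type": "double"},
        {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-millis"}},
        # A nullable field with a default lets the schema evolve safely:
        # old readers and writers keep working when the field is absent.
        {"name": "accuracy",  "type": ["null", "float"], "default": None},
    ],
}
```

The same JSON schema feeds Avro's code generators, which is how typed classes for Java and Go are produced without hand-written clients.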
You can't improve what you can't measure! We closely monitor all of our applications, leveraging Prometheus on our Kubernetes clusters to easily expose metrics. We use Grafana to create alerting systems — most of which either page on-call developers or send alerts to Slack channels — and dashboards to keep track of the overall health of our applications.
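Prometheus scrapes each application's plain-text /metrics endpoint. As a sketch of what it collects, here is a tiny formatter for its text exposition format; the metric and label names are hypothetical, and in practice a client library (e.g. prometheus_client) renders these lines for you:

```python
# Minimal sketch of Prometheus's text exposition format: each sample is
# rendered as `metric_name{label="value"} value`. Hand-rolling this is
# only for illustration; real services use a Prometheus client library.

def format_metric(name, value, labels=None):
    """Render one sample line in Prometheus text exposition format."""
    if labels:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{label_str}}} {value}"
    return f"{name} {value}"
```

For example, `format_metric("records_consumed_total", 1024, {"topic": "events"})` yields `records_consumed_total{topic="events"} 1024`, which Grafana can then query and alert on.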
Batch processing layer
Some kinds of processing workloads are heavier by nature and require batch-oriented processing, high I/O throughput and very long time frames of data (several days).
We use Secor to back up our data from Kafka to S3. Once the data is written as Avro files on S3, scheduled Spark jobs read those files, convert them to Parquet and write them back to S3. Spark's DataFrame API has many performance optimizations when working with Parquet files, mainly due to Parquet's column-oriented storage, which also greatly helps analytics. Spark jobs either read data directly from S3 or cache it into their own HDFS cluster for higher locality. We mostly use Airflow to schedule the Spark jobs and to automate tasks.
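One design choice worth making explicit is date-partitioned output paths: laying Parquet files out by day lets Spark (and, later, Presto) prune whole partitions when a query only touches a few days. A sketch of such a layout, with a hypothetical bucket and prefix scheme:

```python
from datetime import date

# Hypothetical S3 key layout for the Kafka -> S3 -> Parquet pipeline.
# The `dt=YYYY-MM-DD` convention is a common Hive-style partition scheme;
# bucket and prefix names below are illustrative.

def parquet_partition_prefix(bucket, topic, day):
    """Build the S3 prefix where one day of one topic's Parquet output lives."""
    return f"s3://{bucket}/parquet/{topic}/dt={day.isoformat()}/"
```

A query restricted to `dt = '2019-01-15'` then only ever lists and reads objects under that one prefix, instead of scanning the topic's full history.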
Analytics and Business Intelligence layer
Although having Parquet files is very efficient for Spark and other distributed processing workloads, it is not easy for less experienced developers to write Scala or Python code for Spark on Zeppelin notebooks. To empower our BI team (and analytics-oriented employees across the company), we expose our Parquet data on S3 through external tables on Presto. After trying to manage several HDFS clusters, we found that the performance gains in data locality from running distributed query engines directly on top of HDFS were not worth the burden of maintaining Hadoop clusters.
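For illustration, this is roughly the kind of Hive-compatible DDL that registers Parquet files on S3 as an external table queryable from Presto. The table, columns, partition and bucket names here are all hypothetical:

```python
# Hypothetical Hive-compatible DDL exposing S3 Parquet data as an external
# table. All names are illustrative; real schemas will differ.
create_events_table = """
CREATE EXTERNAL TABLE IF NOT EXISTS location_events (
  device_id STRING,
  latitude  DOUBLE,
  longitude DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://example-datalake/parquet/location-events/'
"""
```

Because the table is external, dropping it removes only the metadata; the Parquet files on S3 stay intact for Spark and other consumers.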
Presto was our top choice for running interactive, ad-hoc queries for several reasons:
- SQL is a universal query language, and Presto has several extensions on top of base ANSI SQL. These extensions enable creating more complex and advanced queries without having to resort to imperative programming;
- Easy to scale horizontally with more worker nodes;
- Tight integration with Hive, HDFS and S3;
- Built for large data sources in the terabyte range;
- Still under active development by the community, open source and battle-tested by the industry.
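Since Presto speaks ANSI SQL, analysts can write ordinary declarative aggregates rather than Spark code. A hypothetical example of the kind of query this enables, run here against an in-memory SQLite database purely for illustration (on Presto the same style of query would run over the S3-backed external tables, and the table and data below are made up):

```python
import sqlite3

# Toy stand-in for an analytics table; on Presto this would be a
# Parquet-backed external table instead of an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (store TEXT, duration_minutes INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("mall_a", 12), ("mall_a", 30), ("mall_b", 7)],
)

# A plain ANSI SQL aggregate: visits and average dwell time per store.
rows = conn.execute(
    """
    SELECT store, COUNT(*) AS visits, AVG(duration_minutes) AS avg_minutes
    FROM visits
    GROUP BY store
    ORDER BY visits DESC
    """
).fetchall()
# rows == [("mall_a", 2, 21.0), ("mall_b", 1, 7.0)]
```

The point is that none of this requires writing or deploying Spark jobs; the query is versionable text that any SQL-literate analyst can iterate on.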
Finally, we expose our Presto analytics clusters through several dashboard tools, such as Metabase and Redash. Using dashboards gives the BI team more flexibility (query versioning, data visualization, finer control over their analytics) and more independence, without needing to pull in experienced engineers for every question.
From a technical standpoint, the presented data architecture has allowed us to scale to a throughput of over 200 MB/s (over 15 TB per day) produced on our Kafka clusters. This infrastructure has allowed us to efficiently build our real-time, beaconless location-based intelligence services, such as fraud detection, indoor visit detection and behavior-driven mobile engagement and analytics, while also powering our batch workload pipelines.
From the business intelligence point of view, instead of centralizing the knowledge of our data in a single team, we built an infrastructure that brings people closer to the data and empowers them to easily make decisions with better analytics and data support.
If you are interested in building location and context-aware intelligent services and products, check out our job postings! Also, we'd love to hear from you if you have any questions or just want to learn more from us. Let us know what you would like to hear about in upcoming posts!