Efficient and cost-effective Data Engineering at Tile

Published in Tile Engineering · 4 min read · Apr 12, 2019

At Tile, data is at the heart of everything we do. The Tile community has grown from 4 million daily active Tiles (Sept 2018) to 5.6 million daily active Tiles today. Since then we have also successfully launched our Tile Premium subscription service, Tiles with replaceable batteries, Bose headsets with Tile tracking, and much more!

These growing numbers and successful product launches also bring an exponential and abrupt increase in data volume. This calls for scalable, robust, and efficient data pipelines that make data analysis nearly effortless.

In this post, I will take you through (a) the growth of Tile’s data engineering infrastructure, (b) where we are now, and (c) where we see ourselves going.

The Data Science Hierarchy of Needs (shown below) demonstrates how important a solid data engineering foundation is for building effective metrics, analytics, experimentation, and machine learning systems.

Figure 1: The pyramid of data needs illustrated by Monica Rogati

At the bottom of the pyramid is data collection. Because Tile works in a unique space where we control the hardware design, the variety of data is richer than at a typical software company, and so are the challenges.

Figure 2: Incoming data landscape at Tile

The frequency at which data points are collected varies from sub-second to once a week. Airflow is our tool for managing job scheduling. Multiple incoming data sources also mean more potential points of failure.

This brings us to the second layer from the bottom of the data needs pyramid (see Figure 1): reliable data flow, one of the most important components of the infrastructure. Airflow is the tool that helps us achieve this.
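To make the idea of reliable data flow concrete, here is a minimal sketch of retrying a flaky ingestion step with exponential backoff. It uses only the Python standard library and is not Tile's actual Airflow configuration; Airflow offers task retries natively. The names `run_with_retries` and `flaky_ingest` are hypothetical and exist only for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")


# Hypothetical flaky source: fails twice, then succeeds.
calls = {"n": 0}


def flaky_ingest() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source temporarily unavailable")
    return "ingested"


# Record the delays instead of actually sleeping, so the example runs instantly.
delays: list = []
result = run_with_retries(flaky_ingest, sleep=delays.append)
print(result, delays)
```

The more independent sources a pipeline has, the more often at least one of them is briefly unavailable, which is why a retry policy of this kind is usually set per task rather than per pipeline.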

Next comes processing (a.k.a. ETL). We rely on Apache Spark for most of the ET (Extract, Transform) and make heavy use of AWS spot instances to do this at the lowest possible cost. A major part of this phase is data cleaning, a notoriously unloved job that nearly everyone does under a different title (Data Engineer, Data Scientist, Machine Learning Engineer, Analyst).
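As an illustration of what a cleaning step typically involves, the sketch below drops incomplete rows, normalizes fields, and removes duplicates. It is plain Python rather than Spark, and the field names `device_id` and `ts` and the sample records are assumptions made up for this example, not Tile's schema.

```python
from datetime import datetime, timezone
from typing import Iterable


def clean_events(raw_events: Iterable[dict]) -> list:
    """Drop incomplete rows, normalize fields, and de-duplicate events."""
    seen = set()
    cleaned = []
    for event in raw_events:
        # Normalize the identifier: trim whitespace and lowercase it.
        device_id = str(event.get("device_id") or "").strip().lower()
        ts = event.get("ts")
        if not device_id or ts is None:
            continue  # incomplete row
        try:
            # Interpret `ts` as Unix seconds and convert to a UTC datetime.
            when = datetime.fromtimestamp(float(ts), tz=timezone.utc)
        except (TypeError, ValueError, OverflowError, OSError):
            continue  # unparseable timestamp
        key = (device_id, when)
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        cleaned.append({"device_id": device_id, "ts": when.isoformat()})
    return cleaned


# Hypothetical raw records with typical problems.
raw = [
    {"device_id": " AB12 ", "ts": 0},
    {"device_id": "ab12", "ts": "0"},             # duplicate of the first row
    {"device_id": "", "ts": 5},                   # missing id
    {"device_id": "cd34", "ts": None},            # missing timestamp
    {"device_id": "cd34", "ts": "not-a-number"},  # bad timestamp
    {"device_id": "cd34", "ts": 60},
]
cleaned = clean_events(raw)
print(cleaned)
```

In a Spark job the same three steps would be expressed as column transformations, a filter, and a de-duplication over the key columns.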

Figure 3: Forbes Analysis of how data scientists spend their time.

For all storage requirements, we rely on Amazon S3. Besides standard storage, we also use Glacier storage to help reduce overall cost.
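One common way to combine standard and Glacier storage is an S3 lifecycle rule that moves objects to Glacier after a fixed number of days. The sketch below only builds the rule as a Python dictionary in the shape S3 expects; the prefix `raw/events/` and the 90-day threshold are hypothetical values, and the post does not say how Tile actually configures this.

```python
def glacier_lifecycle_config(prefix: str, days_until_archive: int) -> dict:
    """Build an S3 lifecycle configuration that archives a prefix to Glacier."""
    if days_until_archive < 0:
        raise ValueError("days_until_archive must be non-negative")
    rule_id = "archive-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move objects to the Glacier storage class after N days.
                "Transitions": [
                    {"Days": days_until_archive, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


# Hypothetical values: archive raw event files after 90 days.
config = glacier_lifecycle_config("raw/events/", 90)
print(config)

# With boto3, a dict of this shape would be passed as the
# `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`.
```

Keeping the rule in code makes the archive threshold reviewable alongside the pipelines that write the data.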

Once we have cleaned the data, we move on to phase 3, i.e., anomaly detection. We use our own homegrown system for data monitoring (metadata and data). The system was contributed by Ben as part of his 2018 Tile internship.
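The internals of the homegrown system are not described here, so the following is only a generic sketch of one widely used monitoring check: flagging a metadata value, such as a daily row count, whose modified z-score (based on the median and the median absolute deviation) is unusually large. The function name, the sample counts, and the choice of this particular technique are assumptions for illustration.

```python
from statistics import median
from typing import Sequence


def find_anomalies(values: Sequence[float], threshold: float = 3.5) -> list:
    """Return indices of values whose modified z-score exceeds `threshold`."""
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values equal the median; flag anything that differs.
        return [i for i, v in enumerate(values) if v != med]
    return [
        i
        for i, v in enumerate(values)
        # 0.6745 rescales the MAD to be comparable to a standard deviation.
        if abs(0.6745 * (v - med) / mad) > threshold
    ]


# Hypothetical daily row counts for one table; day 5 looks like a partial load.
daily_row_counts = [1000, 1020, 990, 1010, 1005, 120, 995]
anomalies = find_anomalies(daily_row_counts)
print(anomalies)
```

A median-based score is used here instead of a mean-based one because a single bad day would otherwise inflate the standard deviation and hide itself.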

For analytics, we rely on the mighty combination of Presto and the Hive metastore, deployed on EMR. If you have not used this combination before, you can think of it as shown in Figure 4. We use Tableau, Mode, and Grafana for all of our reporting.

Figure 4: Presto and Hive combination for Scalable analytics at Tile

Zooming out a bit, the current infrastructure which serves Tile data analytics is as shown below.

Figure 5: A high-level illustration of the Data Infrastructure at Tile

Tying it back to the Hierarchy of Needs (Figure 1), we are (mostly) accomplishing the lower five strata of the pyramid. We have achieved this through continuous improvement and proactive issue resolution.

We at Tile have taken a conscious approach to building the pyramid correctly, from the bottom up. We are not trying to plug in data that is dirty and full of gaps, or that spans years but is not yet understood. We are looking forward to 2019 to unleash the power of quality data.

Future Work

Going forward, we plan to:

Continue to improve the data infrastructure to take on scaling-related challenges
Improve monitoring (metadata and data)
Detect data anomalies (in multiple scenarios)
Add streaming analytics and online learning
Build data-based products

And more. We will publish further posts as we make progress on some of the challenging problems we are solving.

We’d love to have some help — see our current open Engineering positions.

If you would like to learn more about any of these components in detail, feel free to reach out to me.


Written by manishranjan