Building a Data Lake in AWS

Kostya Tsykulenko
Oct 3

People.ai provides revenue intelligence and CRM data input automation by analyzing business activities and matching them with business entities such as leads, contacts, or deals. Our matching technology is enabled by our data pipeline, which ingests and analyzes millions of activities (such as emails, meetings, or calls) on a daily basis.

These activities are then used as a source to build a presentation data layer (powered by Elasticsearch), to export data to our customers, or to perform a variety of analysis and exploration tasks. Until recently, most of our data stores were either key-value or relational databases that served both online and offline processing, which required expensive replicas or oversized clusters. This approach proved costly and hard to scale in the long run: offline batch jobs degraded application performance, and reads suffered from poor performance and bottlenecks.

That’s when we decided to fully separate online and offline storage and use optimal technology for both.

Enter Data Lake

The concept of a data lake is hardly new; it is covered nicely in Martin Fowler's article on the subject. Since we run a multitude of batch jobs over large data sets, our main concerns when designing our offline storage were storage cost and read performance. Amazon S3 and Apache Parquet have proven to be a great fit for both requirements: S3 storage is highly cost-effective and offers great read throughput, and partitioned Hive tables on top of Parquet files allow us to query the data efficiently.
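For illustration, here is a minimal sketch of how such a table is queried; the bucket, table, and column names are placeholders rather than our actual schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# A partitioned, Parquet-backed table stored on S3 might be laid out as:
#   s3://example-data-lake/activities/ds=2019-10-01/part-00000.parquet
#   s3://example-data-lake/activities/ds=2019-10-02/part-00000.parquet
#
# Filtering on the partition column lets Spark read only the matching
# S3 prefixes (partition pruning) instead of scanning the whole table.
daily = spark.table("activities").where("ds = '2019-10-01'")
daily.groupBy("activity_type").count().show()
```

Partitioning by day matches our access pattern well, since most batch jobs only need a bounded window of recent activity.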

Here is roughly what our data lake infrastructure looks like:

  • The activity ingestion pipeline processes activities and stores them in online storage, while simultaneously streaming them to a Kinesis stream.
  • The raw data is stored on S3 in JSON format.
  • Using Spark for extract, transform, and load (ETL), we clean up and transform the data, store the output as Parquet files on S3, and make it queryable through the Hive Metastore (see the sketch after this list). The native AWS solution for this would be AWS Glue, but we use Databricks extensively, so we’ve settled on Databricks’s internal Hive Metastore.
  • Structured data from relational databases is dumped directly to S3 as Parquet files and registered in the Hive Metastore.
  • Analytics jobs query the Hive tables, then populate online stores such as Elasticsearch or produce results to SQS, S3, or any other data sink.
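The ETL step looks roughly like the sketch below; the paths, table name, and cleanup logic are simplified placeholders, not our production job:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read the raw JSON events that the ingestion pipeline landed on S3.
raw = spark.read.json("s3://example-data-lake/raw/activities/")

# Minimal cleanup: drop malformed rows and derive a daily partition
# column from the event timestamp.
cleaned = (
    raw
    .where(F.col("activity_id").isNotNull())
    .withColumn("ds", F.to_date("event_time"))
)

# Write the result as partitioned Parquet on S3 and register it in the
# Hive Metastore so downstream jobs can query it as a table.
(
    cleaned.write
    .mode("overwrite")
    .partitionBy("ds")
    .format("parquet")
    .option("path", "s3://example-data-lake/curated/activities/")
    .saveAsTable("activities")
)
```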

The approach above allowed us to decouple online and offline data storage systems as well as enable easier access to the data for teams inside the company.

Dealing with Mutable Data

Dealing with mutable data on S3 is a complicated undertaking. Overwriting anything is not a good option because object updates on S3 are not atomic, and you have to read and then rewrite the whole file. You should treat S3 as an append-only storage system.

So, how do you handle updates and deletions? The answer is to write your data as an event log.

Any update or deletion is written to S3 as a new entity. The event log is then reduced to the latest state of entities using a view or a materialized view.
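Here is a minimal sketch of that reduction in PySpark, assuming each event row carries the entity id, an event timestamp, and a deletion flag (the column and table names are illustrative):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The append-only event log: one row per create/update/delete event.
events = spark.table("activity_events")

# Keep only the most recent event for each entity, then drop entities
# whose latest event is a deletion.
latest_first = Window.partitionBy("activity_id").orderBy(F.col("event_time").desc())
current_state = (
    events
    .withColumn("rn", F.row_number().over(latest_first))
    .where("rn = 1")
    .where(~F.col("is_deleted"))
    .drop("rn")
)

# Expose the reduced state as a view for downstream queries.
current_state.createOrReplaceTempView("activities_current")
```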

However, storing data as an event log forever isn’t practical: it’s costly both to store and to query. Periodically, we compact the real-time event log and store it as a snapshot. The final, all-time view of the data is then the union of the historical snapshot and the event log.
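A compaction job along these lines folds the accumulated events into a new snapshot (again a simplified sketch with placeholder table names); the all-time view is then the snapshot unioned with any events written after the compaction cut-off:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

snapshot = spark.table("activities_snapshot")   # last compacted state
events = spark.table("activity_events")         # real-time event log

# Fold the event log into the previous snapshot: for every entity,
# keep the most recent row across both sources.
# (Assumes the snapshot and the event log share the same schema.)
latest_first = Window.partitionBy("activity_id").orderBy(F.col("event_time").desc())
new_snapshot = (
    snapshot.unionByName(events)
    .withColumn("rn", F.row_number().over(latest_first))
    .where("rn = 1")
    .drop("rn")
)

# Write the compacted snapshot as Parquet; events up to the compaction
# cut-off can then be dropped, and the all-time view becomes the new
# snapshot unioned with any events written after it.
(
    new_snapshot.write
    .mode("overwrite")
    .format("parquet")
    .saveAsTable("activities_snapshot_compacted")
)
```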

Conclusion

Using S3 and Hive, we’ve seen a 5–6x performance improvement in our heavier batch jobs that previously loaded data from key-value databases. Offloading batch processing from the online databases to the data lake has removed CPU spikes, allowed us to downscale our most expensive database cluster, and let us retire some of the replicas we were running almost exclusively to power batch jobs. It has also enabled teams across the company to experiment with the data without impacting production systems. In the end, moving to the new architecture allowed us to significantly cut infrastructure costs, improve batch job performance, and provide easier, uniform access to the data throughout the company.

Ready to start building the future? See all open opportunities by visiting: https://people.ai/careers/

