Building a Data Lake in AWS

Kostya Tsykulenko
Oct 3, 2019 · 4 min read

We provide revenue intelligence and CRM data input automation by analyzing business activities and matching them with business entities such as leads, contacts, or deals. Our matching technology is powered by our data pipeline, which ingests and analyzes millions of activities (such as emails, meetings, or calls) every day.

These activities then serve as a source for building a presentation data layer (powered by Elasticsearch), exporting data to our customers, and performing a variety of analysis and exploration tasks. Until recently, most of our data stores were either key-value or relational databases and served both online and offline processing, which required expensive replicas or oversized clusters. This approach proved extremely expensive and unsustainable in the long run: offline batch jobs degraded application performance, and reads suffered from poor performance and bottlenecks.

That’s when we decided to fully separate online and offline storage and use optimal technology for both.

Enter Data Lake

The concept of a data lake is hardly new, and it has been covered nicely in an article by Martin Fowler. Since we run a multitude of batch jobs over large data sets, our main concerns when designing offline storage were storage cost and read performance. Amazon S3 and Apache Parquet have proven to be a great fit for both requirements: S3 storage is highly cost-effective and offers great read throughput, and partitioned Hive tables on top of Parquet files let us query the data efficiently.
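To see why partitioning helps, here is a minimal sketch in plain Python of the idea behind Hive-style partition pruning. The bucket name, table name, and `activity_date` partition key below are hypothetical, not our actual layout; in practice the query engine does this pruning for you.

```python
# With data laid out as .../table/partition_key=value/part-*.parquet,
# a query filtering on the partition key only needs to read the files
# under the matching partition directories.
paths = [
    "s3://lake/activities/activity_date=2019-10-01/part-0.parquet",
    "s3://lake/activities/activity_date=2019-10-02/part-0.parquet",
    "s3://lake/activities/activity_date=2019-10-03/part-0.parquet",
]

def prune(paths, wanted_date):
    """Keep only partition paths whose activity_date matches the filter."""
    key = f"activity_date={wanted_date}"
    return [p for p in paths if key in p]

selected = prune(paths, "2019-10-03")
```

Instead of scanning every file in the table, a date-filtered query touches only the files in `selected`.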

Here is how our data lake infrastructure roughly looks:

  • The activity ingestion pipeline processes data and stores it in online storage; simultaneously, it streams the data to a Kinesis stream.
  • The raw data is stored on S3 in JSON format.
  • Using Spark for extraction, transformation, and loading (ETL), we clean and transform the data, store the output as Parquet files on S3, and make it queryable through the Hive Metastore. The native AWS solution for this would be AWS Glue, but since we use Databricks extensively, we’ve settled on Databricks’s internal Hive Metastore.
  • Structured data from relational databases is dumped directly to S3 as Parquet files and indexed in the Hive Metastore.
  • Analytics jobs query the Hive tables, then populate online stores such as Elasticsearch or produce results to SQS, S3, or any other data sink.
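To make the cleanup/transform step concrete, here is a simplified sketch in plain Python. The record shape and field names (`id`, `subject`, `participants`) are hypothetical; our actual pipeline does this in Spark at scale.

```python
import json

# A hypothetical raw activity record, as it might arrive from the stream.
RAW_EVENT = (
    '{"id": "act-1", "type": "email", "subject": "  Intro call  ", '
    '"timestamp": "2019-10-03T10:00:00Z", '
    '"participants": ["a@x.com", null]}'
)

def clean_activity(raw_json: str) -> dict:
    """Parse a raw JSON activity and apply simple cleanup rules:
    keep only known fields, trim whitespace, drop null participants."""
    record = json.loads(raw_json)
    return {
        "id": record["id"],
        "type": record["type"],
        "subject": record.get("subject", "").strip(),
        "timestamp": record["timestamp"],
        "participants": [p for p in record.get("participants", []) if p],
    }

cleaned = clean_activity(RAW_EVENT)
```

The real job applies transformations like these to millions of records and writes the result as Parquet, partitioned for efficient querying.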

The approach above allowed us to decouple online and offline data storage systems as well as enable easier access to the data for teams inside the company.

Dealing with Mutable Data

Dealing with mutable data on S3 is a complicated undertaking. Overwriting anything in place is not a good option: object updates on S3 are not atomic, and updating even a single record means reading and then rewriting the whole file. You should treat S3 as append-only storage.

So, how do you handle updates and deletions? The answer is, you need to write your data as an event log.


Any update or deletion is written to S3 as a new entity. The event log is then reduced to the latest state of entities using a view or a materialized view.
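The reduction step can be sketched in plain Python as follows. The event shape (`op`, `id`, `ts` fields) is hypothetical; in our pipeline this reduction runs as a view over the Hive tables rather than in-memory code.

```python
# An append-only event log: updates and deletions arrive as new events.
events = [
    {"id": "e1", "op": "upsert", "ts": 1, "name": "Lead A"},
    {"id": "e2", "op": "upsert", "ts": 1, "name": "Deal B"},
    {"id": "e1", "op": "upsert", "ts": 2, "name": "Lead A (renamed)"},
    {"id": "e2", "op": "delete", "ts": 3},
]

def latest_state(event_log):
    """Reduce the event log to the current state of entities:
    keep the newest event per id, then drop deleted entities."""
    newest = {}
    for ev in sorted(event_log, key=lambda e: e["ts"]):
        newest[ev["id"]] = ev
    return {eid: ev for eid, ev in newest.items() if ev["op"] != "delete"}

state = latest_state(events)
```

After the reduction, `e1` survives with its latest name and `e2` is gone, even though its earlier upsert is still physically present in the log.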


However, storing data as an event log forever isn’t practical; it’s costly to store and query. Periodically, we compact the real-time event log and store it as a snapshot. The final view of the all-time data is then the union of the recent event log and the historical snapshot.
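The compaction step can be sketched in plain Python like this. The snapshot and event shapes are hypothetical, and the sketch assumes every event in the log is newer than the snapshot it is folded into; our actual compaction runs as a Spark job over the Parquet tables.

```python
def compact(snapshot, event_log):
    """Fold newer events into a historical snapshot (dict id -> entity),
    producing the next snapshot: the newest event per id wins, and
    deletions drop the entity entirely."""
    merged = dict(snapshot)
    for ev in sorted(event_log, key=lambda e: e["ts"]):
        if ev["op"] == "delete":
            merged.pop(ev["id"], None)
        else:
            merged[ev["id"]] = {k: v for k, v in ev.items() if k != "op"}
    return merged

snapshot = {"e1": {"id": "e1", "ts": 1, "name": "Lead A"}}
new_events = [
    {"id": "e1", "op": "upsert", "ts": 2, "name": "Lead A (renamed)"},
    {"id": "e3", "op": "upsert", "ts": 2, "name": "Contact C"},
]
next_snapshot = compact(snapshot, new_events)
```

Queries then see the union of `next_snapshot` and whatever events have arrived since the last compaction, so the event log stays small and cheap to scan.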


Using S3 and Hive, we’ve seen a 5–6x performance improvement in our heavier batch jobs that previously loaded data from key-value databases. Offloading batch processing from the online databases has removed CPU spikes, allowed us to downscale our most expensive database cluster, and let us get rid of some of the replicas we were running almost exclusively to power batch jobs. It has also enabled different teams in the company to experiment with the data without impacting production systems. In the end, moving to the new architecture allowed us to significantly cut infrastructure costs, improve batch job performance, and provide easier, uniform access to the data throughout the company.

