Why Build a Data Lake?

Nick Nathan
unified-engineering
6 min read · Dec 5, 2019

Unified offers customers unique insights into their social media data across multiple publisher platforms to help them execute a holistic social strategy. To provide these cross-platform insights we must collect, store, process, analyze and display data from a wide variety of sources. As Unified has grown over the years, so has the diversity and complexity of our applications, and our data pipeline has needed to evolve to support them. A key component of that pipeline today is our data lake. In this mini-series we'll explore why Unified chose to build a data lake, the problems it solves and how we went about building it.

The Problem

Our previous data pipeline relied primarily on AWS Redshift for storage along with a number of computationally heavy ETL processes. After data was collected from one of the social publishers, the JSON responses would be loaded into a Kafka topic and then dumped into an S3 bucket. Next, a series of Spark jobs would parse the raw JSON and load the records into a temporary table in Redshift. Finally, another process removed all duplicate records from the temporary table before loading the final dataset into a permanent Redshift table. While this process was sufficient in the earlier stages of the company, it suffered from several significant limitations.
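To make the deduplication stage concrete, here is a minimal sketch of the idea in plain Python: given parsed publisher records, keep only the most recently collected copy of each record before loading the final dataset. The field names (`id`, `collected_at`, `likes`) are illustrative, not Unified's actual schema, and the real work happened in Redshift rather than application code.

```python
# Hypothetical sketch of the dedup step: keep the latest copy of each record.
# Field names are illustrative, not the actual production schema.
def deduplicate(records):
    """Return one record per id, keeping the most recently collected copy."""
    latest = {}
    for rec in records:
        rid = rec["id"]
        if rid not in latest or rec["collected_at"] > latest[rid]["collected_at"]:
            latest[rid] = rec
    return list(latest.values())

# The same post collected twice: only the newer snapshot should survive.
raw = [
    {"id": "post-1", "likes": 10, "collected_at": "2019-12-01"},
    {"id": "post-1", "likes": 14, "collected_at": "2019-12-03"},
    {"id": "post-2", "likes": 3,  "collected_at": "2019-12-02"},
]
final = deduplicate(raw)
```

In the real pipeline this logic ran as SQL inside Redshift, which is precisely why ETL jobs competed with reporting queries, as described below.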

Inflexibility

The biggest problem with the previous pipeline was that it was highly inflexible, for two reasons. First, the data in its raw form was effectively unusable: it could not easily be retrieved and required heavy processing to be accessed. As a result, the only usable output of our data collection was the set of tables at the end of the ETL pipeline. This brings us to the second source of inflexibility. Because the only way to access our data was via those Redshift tables, we were constrained by their schemas, and each schema was ultimately informed by the consuming application. As applications changed and new features were introduced, so did their data requirements, which meant modifying the schema of our source tables. And because the table schemas were tightly coupled to the ETL pipeline, we would also have to modify our Spark jobs to ensure the original data was parsed correctly to support the new schema. In short, small application changes could ripple through the entire ETL pipeline. This tight coupling slowed the pace of change and meant that the same data could only be utilized by a small number of applications.

Performance

The second problem arose because our data warehouse was utilized by several reporting applications, some of which delivered reports directly to clients. Because our ETL pipeline required a fair amount of deduplication and processing in the database, our ETL jobs were competing for access to the same tables as our reporting applications. If a large ETL job was running, it could slow down reporting and directly impact end users. Similarly, if internal applications or teams wanted to query a dataset for analysis, product development or reporting, they could impact client-facing applications or simply suffer from poor performance. If, for example, our data science team needed to run a wide variety of ad hoc queries to understand the shape and structure of a dataset for feature creation, they had to worry about impacting the whole business.

Cost

Data warehousing is expensive. All our data was stored in Redshift whether or not we were using it. As a storage layer, Redshift was fairly costly compared to cheaper alternatives like S3. And because our compute engine was directly coupled to our data storage, we had to expand our Redshift cluster as our data volume increased. Growing the cluster meant paying for additional storage and compute together, whether or not we needed both. As the company grew and the amount of data under management increased, we were starting to push the limits of our Redshift cluster.

What is a Data Lake?

Data lakes are best understood in contrast to data warehouses. A data warehouse is meant to serve as a central repository for all of an organization's data regardless of where that data originated. Data warehouses, however, suffer from some key limitations that a data lake attempts to address. A warehouse is generally a reflection of the business entities and reporting requirements defined by an organization: the needs of the data consumer drive its structure, and considerable research and engineering time is invested in building tables and schemas that provide direct value to those consumers. By design, a data warehouse imposes a high degree of structure on an organization's data.

Whereas a data warehouse requires that data be schematized, a data lake does not enforce structure at all. Data requires no preprocessing to be loaded into a data lake, which can therefore serve as a repository for high volumes of raw data. After the data is loaded, an organization can decide how best to use it and start to impose structure in the form of various schematized datasets. Because they do not require structure, data lakes are highly flexible and can store data from a wide range of sources, and that same flexibility becomes powerful when reporting needs or business entities change. Similarly, data lakes are typically architected so that storage is cheap, allowing all historical data to be preserved whether or not it's being used. Finally, data lakes can be utilized by a wide range of stakeholders within an organization. While most users will only need structured, transformed data, a subset of users will need to create new datasets and search for insights within the raw data. The data lake makes the data available to those users as well.
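This "schema-on-read" idea can be sketched in a few lines of Python: raw API responses are written to the lake exactly as received, and structure is imposed only when a consumer reads them. The paths, partition layout and field names here are illustrative assumptions, not Unified's actual layout.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical lake layout: raw data partitioned by publisher and date.
# Paths and field names are illustrative only.
lake = Path(tempfile.mkdtemp()) / "raw" / "publisher=facebook" / "dt=2019-12-05"
lake.mkdir(parents=True)

# Ingest: persist the response exactly as received -- no parsing, no schema.
response = {"id": "post-1", "message": "hello", "reactions": {"like": 10}}
(lake / "page-0.json").write_text(json.dumps(response))

# Read: each consumer projects only the fields its own schema needs.
def project(path, fields):
    """Impose a minimal schema on a raw record at read time."""
    rec = json.loads(path.read_text())
    return {f: rec.get(f) for f in fields}

row = project(lake / "page-0.json", ["id", "message"])
```

Because the raw record is preserved, a second consumer with different requirements can later project a different set of fields from the same file without touching the ingestion side.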

The Unified Data Lake

Given the problems outlined above, a data lake made a lot of sense for Unified. Once completed, the data lake addressed the inflexibility problem by letting teams load raw data collected from social media publishers without any transformation or schematization. This is critical not only because we work with so many different publishers but also because, if the structure of those API responses changes, our data collection process is unaffected. We could therefore build our data collection pipelines without any consideration of the final schema. With the raw data intact, whenever we want to build a new application that depends on a different view of the data, whether that means showing different fields or applying different transformations, it is relatively simple to build out a parallel ETL pipeline.

The data lake also helped address the performance issues because all ETL processing and data transformations could now occur outside the data warehouse, independently of any reporting or customer-facing applications. This became especially important when we started to load extremely large proprietary datasets from third-party vendors to enrich our publisher data. The business could now ingest and transform terabytes of data completely independently of any unrelated ETL processes.

Finally, because all storage moved to S3, the business effectively pays only for compute in the form of EMR. This is more efficient because we spin up EMR clusters only when we need to run a transformation. Compute and storage are thus fully decoupled.
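As a rough illustration of this transient-cluster model, the function below builds the kind of request that might be passed to the EMR API (e.g. boto3's `run_job_flow`) to run a single Spark transformation and then tear the cluster down. The cluster name, instance types and S3 paths are made-up examples, and the API call itself is deliberately not made here.

```python
# Hypothetical sketch of a transient EMR cluster request: compute exists
# only for the duration of one transformation, reading from and writing
# back to S3. Names, sizes and paths are illustrative assumptions.
def transient_cluster_request(job_name, script_s3_path):
    """Build a run_job_flow-style request for a self-terminating cluster."""
    return {
        "Name": job_name,
        "ReleaseLabel": "emr-5.28.0",
        "Instances": {
            "InstanceCount": 3,
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            # The cluster shuts down when its steps finish, so we pay
            # for compute only while the job is actually running.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "transform",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", script_s3_path],
            },
        }],
    }

req = transient_cluster_request("nightly-etl", "s3://example-bucket/jobs/transform.py")
```

The key line is `KeepJobFlowAliveWhenNoSteps: False`: storage stays in S3 permanently while compute is created and destroyed per job, which is what decoupling compute from storage means in practice.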

While the team is still finding new ways to utilize our data lake, it was an important step in building out a more cost-effective, flexible and scalable data platform. If you're a developer looking to join a team working on interesting problems at the intersection of technology and marketing, be sure to check out Unified at https://unified.com/about/careers-and-culture.

Nick Nathan
unified-engineering

Building apps and technical infrastructure for startups and growing businesses.