Building a Data Streaming Platform — How Zillow Sends Data to its Data Lake

feroze daud · Published in Zillow Tech Hub · Jan 22, 2020 · 7 min read


Zillow produces a lot of data! We are a source of information on 110 million homes in the U.S., as well as multifamily rental properties. Both require ingesting and storing a large volume of data. Zillow also uses external data sources, including clickstream data from Google Analytics. In a previous blog post, the Zestimate team described how they use data as event streams to speed up the calculation of Zestimates. This post details how we developed pipelines to process clickstream data, overcame scale issues, and built a generic platform for data collection and processing.

Data from Databases

When Zillow was originally formed, we used a lot of databases to store the data, with a cache in front to enable fast searching and quick lookups. Later, we standardized on Amazon S3 as our data lake.

We had to overcome the challenge of getting the data from these databases into the data lake. Initially, we used custom Sqoop jobs to extract data straight out of the tables and load it into S3. While this solved the immediate problem of getting data into S3, it also raised some issues.

First, the Sqoop jobs were written by developers in the Data Science/Engineering org who had no insight into the semantics of the tables or how they fit into the product, and they had to constantly keep up with changes in the schema.

Second, the schema of the exported data closely followed the schema of the database (DB), and the DB schema was not necessarily optimized for Data Science/Machine Learning applications.

Third, since the Sqoop export jobs ran every day and sometimes impacted live, site-facing databases, our DBAs had to create special read-only replicas of those databases. This added maintenance and overhead. Some of those databases could not be replicated easily, forcing us to read from one-day-old database snapshots.

Writing directly to Data Lake

Some product teams wrote code to write data directly to S3. While this empowered product teams to send data directly, it also meant that there was no enforcement of schema. Some teams wrote JSON, while others wrote text or CSV files. The file structures were defined by each team and were inconsistent. When a schema changed, there were no consistent rules for backfilling historical data.

This also required teams to create and manage their own AWS resources. If they were writing directly to S3, they needed to create the appropriate IAM roles and credentials. If they were using Kinesis Data Firehose to write to S3, they had to create a Firehose delivery stream in addition to credentials.
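
As a rough illustration of the burden this placed on teams, here is the kind of code each team had to own itself, with its own credentials and its own delivery stream. The stream name and payload shape are hypothetical, not Zillow's actual setup.

```python
# A minimal sketch of what each team previously had to wire up on its own:
# its own credentials, its own Firehose delivery stream, its own file format.
import json
import boto3

firehose = boto3.client("firehose", region_name="us-west-2")

event = {"page": "home-details", "zpid": 12345, "action": "view"}

firehose.put_record(
    DeliveryStreamName="team-owned-delivery-stream",  # each team created its own
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```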

Finally, people were not always aware of the governance and lifecycle policies for the data. For example, data containing PII was supposed to be encrypted, and raw and processed data were supposed to have different lifecycle policies. Teams writing directly to S3 were frequently unaware of these policies.

Data Streaming Platform

To solve the problem of getting data to the data lake in a consistent form, we developed a streaming platform as a service. Our goal was to architect a stream-processing-as-a-service platform to power real-time analytics and machine learning applications. The key tenets of this architecture are as follows:

Build streaming infrastructure

We standardized on using streams to send data into the data lake. Teams have no knowledge of the streams; instead, they call a REST API that uses the underlying streams to send data. Abstracting the underlying technologies from producers allows us to monitor usage and scale out capacity as needed. Teams can send events and other data sets without any knowledge of the underlying streaming technologies, and can focus on data analysis instead of infrastructure. We use a persistent routing table to route messages to the correct destination.
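
For example, a client call to such a REST ingestion API might look roughly like this; the endpoint, payload fields, and auth header are assumptions for illustration only.

```python
# A hypothetical client call to the platform's REST ingestion API.
import requests

event = {
    "source": "home-details-page",
    "event_type": "page_view",
    "payload": {"zpid": 12345, "timestamp": "2020-01-22T10:15:00Z"},
}

resp = requests.post(
    "https://data-platform.example.zillow.net/v1/events",  # hypothetical endpoint
    json=event,
    headers={"Authorization": "Bearer <token>"},
    timeout=5,
)
resp.raise_for_status()
```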

Create Separate Streams for each Application

Since users no longer interact with the underlying streaming technology directly, they can request resources and start using them right away. This lets them get as many streaming resources as necessary and use them at their desired granularity.

Separating Producer and Consumer Streams

A producer stream is accessible only to the infrastructure team, for data transformations. Since we don't know how many clients will access the streams or what their access patterns will be, it is safer to create consumer streams separate from the producer streams. This ensures that consumers can connect without affecting our ability to receive messages. It also enables us to scale consumer and producer streams separately, allowing us to maintain the tough Service Level Agreements (SLAs) we set for data delivery.
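
A rough sketch of the relay idea under these assumptions: the infrastructure team reads from a producer stream, applies any transformations, and republishes to a separate consumer stream. The stream names, the single-shard loop, and the pass-through "transform" are simplifications for illustration.

```python
# Simplified producer-to-consumer relay: read from the producer stream,
# then republish to a separate consumer stream that clients are allowed to read.
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-west-2")

shard_iterator = kinesis.get_shard_iterator(
    StreamName="clickstream-producer",
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)["ShardIterator"]

while True:
    batch = kinesis.get_records(ShardIterator=shard_iterator, Limit=500)
    if batch["Records"]:
        # Transformations would happen here; this sketch just passes data through.
        kinesis.put_records(
            StreamName="clickstream-consumer",
            Records=[
                {"Data": r["Data"], "PartitionKey": r["PartitionKey"]}
                for r in batch["Records"]
            ],
        )
    shard_iterator = batch["NextShardIterator"]
    time.sleep(1)
```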

Support Sending Data to Kafka

We are spinning up an in-house Kafka cluster to support even lower-latency and higher-throughput use cases. Additionally, we are adding support for sending to a Kafka topic as one of the destinations. This integrates tightly with the schema registry to support messages with schemas and to enforce compatibility constraints.
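
As a sketch of what producing a schema-enforced message to such a topic could look like with Confluent's Python client and schema registry; the topic name, schema, and broker/registry addresses are placeholders, not Zillow's actual setup.

```python
# Produce an Avro-encoded message whose schema is registered and checked
# against the schema registry's compatibility rules.
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

value_schema = avro.loads("""
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "zpid", "type": "long"},
    {"name": "timestamp", "type": "string"}
  ]
}
""")

producer = AvroProducer(
    {
        "bootstrap.servers": "kafka-broker:9092",
        "schema.registry.url": "http://schema-registry:8081",
    },
    default_value_schema=value_schema,
)

producer.produce(
    topic="clickstream.page_views",
    value={"zpid": 12345, "timestamp": "2020-01-22T10:15:00Z"},
)
producer.flush()
```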

Supporting Common Processing/Archival Scenarios

Usually, when someone wants to send data to our system, they also want the ability to query it easily. To support this, we implemented a "Stream Processing as a Service" paradigm, whereby data sent to our system is automatically archived into a Hive table and made queryable for end users (analysts and business users) with Mode or Tableau.
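
To make the "queryable" half concrete, here is a minimal sketch of how an analyst might query such an automatically created Hive table from Python; the PyHive connection details and the table and column names are assumptions for illustration.

```python
# Query the archived Hive table directly; Mode or Tableau would issue
# similar SQL through their own connectors.
from pyhive import hive

conn = hive.connect(host="hive-server.example.net", port=10000, username="analyst")
cursor = conn.cursor()
cursor.execute(
    """
    SELECT event_type, COUNT(*) AS events
    FROM data_platform.clickstream_page_views
    WHERE ds = '2020-01-21'
    GROUP BY event_type
    """
)
for event_type, events in cursor.fetchall():
    print(event_type, events)
```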

Data Catalog and Discovery

As our data producers and consumers grow, there is a need for cataloging so we know who is producing the data, and in what format and schema. We also need to know who the consumers are, for data governance purposes and to be able to alert them about issues or changes with an upstream dataset. To support this, we implemented a searchable Data Catalog which stores current metadata about all data entities and relevant context, including data lineage. The Data Catalog is also used to tag datasets with special characteristics, such as Personally Identifiable Information (PII) and lifecycle policies.
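
As an illustration of the kind of context such a catalog can hold, here is a hypothetical sketch of a catalog entry; the fields and example values are assumptions, not the actual catalog's schema.

```python
# A toy model of a catalog entry: ownership, schema, lineage, consumers,
# and the governance flags (PII, lifecycle) mentioned above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    dataset: str
    owner: str
    schema_subject: str                                  # subject in the schema registry
    upstream: List[str] = field(default_factory=list)    # lineage: source datasets
    consumers: List[str] = field(default_factory=list)   # teams to alert on changes
    contains_pii: bool = False                           # drives encryption requirements
    lifecycle_days: int = 365                            # retention / lifecycle policy

clickstream = CatalogEntry(
    dataset="clickstream.page_views",
    owner="data-platform",
    schema_subject="clickstream.page_views-value",
    upstream=["google-analytics.raw_hits"],
    consumers=["personalization", "zestimate"],
    contains_pii=True,
    lifecycle_days=90,
)
```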

Data Quality

All the data flowing into the system needs to be checked for quality. This might include schema checks, data completeness checks, as well as metric value checks. We implemented a data quality service called Luminaire that tracks the quality of datasets using a combination of heuristics and models. It uses a collection of time series models to make sure the data flow matches what we expect, and if not, it alerts the upstream producers.
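
The sketch below illustrates the general idea of such a volume check with a deliberately simple z-score heuristic; it is not Luminaire's actual API or modeling approach, and the numbers are made up.

```python
# Toy volume check: compare today's record count against recent history
# and flag large deviations for the upstream producers.
import statistics

def check_volume(daily_counts, today_count, z_threshold=3.0):
    """Return (ok, z_score) for today's count relative to recent history."""
    mean = statistics.mean(daily_counts)
    stdev = statistics.stdev(daily_counts)
    z = (today_count - mean) / stdev if stdev else 0.0
    return abs(z) <= z_threshold, z

history = [1_020_000, 998_000, 1_045_000, 1_010_000, 1_032_000, 1_007_000, 1_025_000]
ok, z_score = check_volume(history, today_count=640_000)
if not ok:
    print(f"Alert upstream producers: volume z-score {z_score:.1f} outside expected range")
```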

Current Usage

We currently have a variety of data types flowing through this system, including clickstream data.

Current Challenges

Static Infrastructure Tools

We use Terraform heavily for creating infrastructure at Zillow. This includes setting up AWS resources such as Kinesis and Firehose streams, spinning up EMR clusters for processing data, and so on. Using Terraform to spin up resources on demand has not scaled well for us, especially because the resource requests come from external teams, and fulfilling them requires some turnaround due to the shared nature of our data lake account.

To solve this, we are gradually moving teams to Kafka topics instead, and we are developing a CI/CD pipeline that can automatically create topics and register Avro schemas with the schema registry. We will describe this process in more detail in a future blog post.
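
For illustration, a CI/CD step along these lines might create the topic and register its Avro schema roughly as follows; the topic name, partition counts, and broker/registry addresses are placeholders rather than Zillow's actual pipeline code.

```python
# Create a Kafka topic and register its Avro value schema with the registry.
import json
import requests
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka-broker:9092"})

# Create the topic; handling "already exists" would need extra logic in practice.
futures = admin.create_topics(
    [NewTopic("clickstream.page_views", num_partitions=6, replication_factor=3)]
)
futures["clickstream.page_views"].result()  # raises if creation failed

# Register the Avro value schema under the topic's subject.
schema = {
    "type": "record",
    "name": "PageView",
    "fields": [{"name": "zpid", "type": "long"}],
}
requests.post(
    "http://schema-registry:8081/subjects/clickstream.page_views-value/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(schema)}),
).raise_for_status()
```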

Kinesis Client Ecosystem

Kinesis streams have a limited capacity per shard and are scaled by adding more shards. To utilize shard capacity as fully as possible, AWS recommends using the Kinesis Producer Library (KPL). Our initial deployment used KPL to write to the Kinesis streams. However, we found that this did not scale well: our service was writing to many different streams, and the overhead KPL incurs to track shards per stream and buffer messages per shard caused our service to exhaust its JVM heap and die. Instead of spending more time tuning this, we wrote our own KPL-compatible library that provides some of KPL's capabilities. In other words, we decided to trade off less efficient utilization of Kinesis shard capacity in favor of better service stability.
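
The sketch below illustrates the general shape of that trade-off: buffering records per stream and flushing them with plain PutRecords calls instead of KPL's per-shard aggregation. It is not our library's actual code; names and thresholds are illustrative.

```python
# Per-stream buffering flushed with PutRecords (which accepts at most 500
# records per call), trading shard-level efficiency for a simpler, stabler client.
from collections import defaultdict
import boto3

kinesis = boto3.client("kinesis", region_name="us-west-2")
MAX_BATCH = 500  # PutRecords limit per request

buffers = defaultdict(list)

def enqueue(stream_name, data, partition_key):
    buffers[stream_name].append({"Data": data, "PartitionKey": partition_key})
    if len(buffers[stream_name]) >= MAX_BATCH:
        flush(stream_name)

def flush(stream_name):
    records, buffers[stream_name] = buffers[stream_name], []
    if records:
        kinesis.put_records(StreamName=stream_name, Records=records)
```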

Also, KPL support for other languages (e.g., Python) is not fully baked, since it requires running a native-language daemon on the side. Due to these issues, we think that moving to Kafka will let us provide both high utilization of partitions and service stability.

Conclusion

With the creation of the Data Streaming Platform, we have made it easy for teams to send data to the data lake. They can now request resources without contacting AWS account administrators, and all data sent through the platform is validated against a schema. This has reduced the time it takes teams to start sending data to the data lake from multiple weeks to a couple of days. It has also enabled the data science team to gain fresh insights into our clickstream data, which powers the personalized "Related Homes" recommendations on home details pages.
