Leveraging Google Cloud Dataflow for clickstream processing
Clickstream is the largest dataset at zulily. As mentioned in Zulily's big data platform, we store all our data in Google Cloud Storage (GCS) and use a Hadoop cluster on Google Compute Engine to process it. Our data size has been growing at a significant pace, which forced us to reevaluate our design, and the dataset growing fastest was clickstream.
We had to decide whether to drastically increase the size of the cluster or try something new. We came across Cloud Dataflow and started evaluating it as an alternative.
Why did we move clickstream processing to Cloud Dataflow?
- Instead of increasing the size of the cluster to handle peak clickstream loads, Cloud Dataflow gave us an on-demand cluster that could dynamically grow or shrink, reducing costs
- Clickstream data was growing very fast, and isolating its processing from other operational data processing would help the whole system in the long run
- Clickstream data processing did not require the complex correlations and joins that some of our other datasets depend on, which made the transition easy.
- The programming model was extremely simple, and the transition was easy from a development perspective
- Most importantly, we have always thought of our data processing clusters as transient entities, so bringing in another engine, in this case Cloud Dataflow instead of Hadoop, was not a big deal. Our data would still reside in GCS, and we could still use the clickstream data from the Hadoop cluster where required.
High Level Flow
One of the core principles of our clickstream system is self-service. In short, teams can start recording new types of events at any time simply by registering a schema with our service. Clickstream events are pushed to our data platform through our real-time system (more on that later) and through log shipping to GCS.
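To make the self-service model concrete, a registered event type might look something like the sketch below. This is purely illustrative: the post does not show zulily's actual schema format, so the event name and fields are invented.

```java
// Hypothetical example only: the event name and fields are invented for
// illustration and are not zulily's actual schema format.
public class ProductViewedEvent {
    public String eventId;       // unique id, useful for de-duplication
    public String userId;        // visitor who viewed the product
    public String productId;     // product that was viewed
    public String eventTimeUtc;  // recorded in UTC; normalized to PST during processing
}
```

Once the schema is registered, a team can start emitting these events, and they reach GCS through the real-time system and log shipping.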
In the new process using Cloud Dataflow we:
- Create a pipeline object and read the raw log data from GCS into a PCollection
- Split the log data into multiple outputs based on event type and schema, using a PCollectionTuple (see the sketch after this list)
- Apply ParDo transformations to optimize the data for querying by business users (e.g. convert all dates to PST)
- Store the data back in GCS, which is our central data store, for use by other Hadoop processes and for loading into BigQuery.
- Load the data into BigQuery, not directly from Dataflow but with the bq load command, which we found to be much more performant (example below)
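The sketch below shows the shape of such a pipeline. The post predates Apache Beam, so zulily's actual code would have used the Cloud Dataflow Java SDK of the time; this version is written against the Apache Beam Java SDK (its open-source successor), and the bucket paths, event names, routing rules, and date handling are assumptions for illustration, not zulily's production code.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

public class ClickstreamPipeline {

  // One output tag per event type we want to split out of the raw logs.
  static final TupleTag<String> PAGE_VIEWS = new TupleTag<String>() {};
  static final TupleTag<String> CLICKS = new TupleTag<String>() {};

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    // Read the shipped log files from GCS (path is illustrative).
    PCollection<String> rawLogs =
        p.apply("ReadLogs", TextIO.read().from("gs://my-bucket/clickstream/2015/08/10/*.log"));

    // One pass over the logs fans out into one output per event type.
    PCollectionTuple byType = rawLogs.apply("SplitByEventType",
        ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                String line = c.element();
                // Hypothetical routing rule: the event type is embedded in the log line.
                if (line.contains("\"event\":\"page_view\"")) {
                  c.output(line);          // main output: page views
                } else if (line.contains("\"event\":\"click\"")) {
                  c.output(CLICKS, line);  // additional output: clicks
                }
              }
            })
            .withOutputTags(PAGE_VIEWS, TupleTagList.of(CLICKS)));

    // Normalize each event type for querying, e.g. converting timestamps to PST.
    PCollection<String> pageViews =
        byType.get(PAGE_VIEWS).apply("NormalizePageViews", ParDo.of(new NormalizeDatesFn()));

    // Write the cleaned events back to GCS for Hadoop jobs and BigQuery loads.
    pageViews.apply("WritePageViews",
        TextIO.write().to("gs://my-bucket/clickstream-processed/page_views"));

    p.run();
  }

  // Placeholder standing in for the real date-normalization logic.
  static class NormalizeDatesFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      c.output(c.element()); // real code would rewrite timestamp fields to PST
    }
  }
}
```

The key piece is withOutputTags: a single ParDo over the raw logs produces a PCollectionTuple with one PCollection per event type, and each of those can get its own normalization step before being written back to GCS.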
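For the final BigQuery step, assuming the processed output is newline-delimited JSON and the target table already exists (the post does not say which file format zulily used), the load might look like `bq load --source_format=NEWLINE_DELIMITED_JSON clickstream.page_views "gs://my-bucket/clickstream-processed/page_views*"`, run once per processed output.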
The next step for us is to identify other scenarios that are a good fit for Cloud Dataflow, with characteristics similar to clickstream: high-volume, unstructured data that does not require heavy correlation with other datasets.
About Sudhir Hasbe
Sudhir is an accomplished product management leader with over 16 years of experience building industry-leading products at startup and blue-chip companies. He has a proven record of leading product teams across engineering and marketing to deliver business results through customer insights and innovation.
Originally published at https://zulily-tech.com on August 10, 2015.