We’ve been using Cloud Dataflow with it’s powerful window semantics for a few years now. But till now our windows where fairly small. Tracking a user session will generally not span more then an hour.
With our next generation data architecture we want to go further then ever before and do even more calculations in real time on bigger windows. With a bunch of experiments we plan to prove that Cloud Dataflow and it’s new and improved Apache Beam model is the correct choice. The Update experiment already proved that we don’t loose any data when updating our pipeline. Now what would happen to the memory and disk usage when using very large overlapping windows, spanning several days till over a week?
If you look at the illustration we want to know several characteristics for a campaign, for example the unique number of visitors. The campaign in general runs for a week. Our first thought was creating custom windows but if you zoom out you see that the windows strategy is exactly the same as a session window. But instead of speaking in terms of minutes, we talk of days.
In our experiment we defined our session gap as 10 day, this is longer then our longest campaign. We predicted first a steady rise in memory and disk usage and then a stabilisation after a few days. The experiment was started and we kept an eye on Stackdriver Dataflow monitoring. In the Dataflow monitoring we only saw a linear size, so we where a bit discouraged till diving deeper into the metrics we where looking at. The thing is that the Dataflow metrics that are shown are cumulative and of course they would only rise. Luckily when looking at the instance metric the numbers where much better.
The disk usage showed exactly what we predicted, first the rise in usage then the stabilisation. The memory usage was actually more stable then expected but as it’s the Java implementation the memory usage is hard to predict anyway.
The numbers made me happy and it’s certainly a nice result to go into the weekend with. But it makes me wonder what the default dataflow monitoring characteristics are good for, maybe I should have a look at Next ‘17 Monitoring and improving your big data application session over the weekend.
Thanks to our newest team member Wout Scheepers for bringing his ideas to the table and making this experiment happen.