Scaling Spark Data Processing

Jan Antala
Pygmalios Engineering
Apr 9, 2019 · 3 min read

The last year was huge for Pygmalios. We have installed our retail analytics platform in a couple more European countries and started collecting data from hundreds more stores. As we use physical hardware, it was not only an installation challenge but a data processing challenge as well. And I am not even talking about the migration from AWS to GCP we went through during the summer.

It is still about finding the balance between new product development and paying off technical debt. After the GCP migration we used one monolithic data processing application in Dataproc which had to run all the time as a never-ending Spark job and required a lot of resources. It was connected to Kafka for realtime stream processing, ran batches during the night and served an API for aggregated data requests. And it became a challenge for scaling, user experience and cost.

Pygmalios platform

Splitting up the application

One goal was to identify the parts of the application which have to run all the time with allocated resources (Kafka streaming, API) and the parts which can scale down to zero when they are not used (batches). So instead of one application we now use three, one per use case (see the sketch after the list):

  • realtime streaming
  • API
  • batches
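
As a rough sketch of the split (the class and jar names below are made up for illustration): the long-running streaming application gets a pinned executor count, while the batch application keeps dynamic allocation on, so its executors, and with cluster autoscaling the workers underneath, are released once a run finishes.

# Long-running streaming job: fixed resources, dynamic allocation off
gcloud dataproc jobs submit spark \
--cluster sep-dev \
--region europe-west1 \
--class com.pygmalios.sep.RealtimeStreaming \
--jars gs://dataproc-sep-dev/sep-streaming.jar \
--properties spark.dynamicAllocation.enabled=false,spark.executor.instances=4

# Batch job: dynamic allocation on, executors released when idle
gcloud dataproc jobs submit spark \
--cluster sep-dev \
--region europe-west1 \
--class com.pygmalios.sep.Batches \
--jars gs://dataproc-sep-dev/sep-batches.jar \
--properties spark.dynamicAllocation.enabled=true,spark.dynamicAllocation.minExecutors=0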

Cluster autoscaling

As we don’t need the cluster at full power all the time, we can use a minimal setup by default and autoscale the workers when more computing power is required. This feature is currently in beta in Cloud Dataproc, however it works fine for us. You can define your worker limits, scale factor, cooldown period, graceful decommission timeout and more, and expect to spend some time tuning them. For more information visit the Autoscaling clusters documentation. Here is the example cluster definition; the two primary workers stay fixed while up to 100 secondary (preemptible) workers are added and removed as needed:

gcloud beta dataproc clusters create sep-dev \
--image-version=1.3 \
--project develop-sep \
--async \
--tags develop,sep,sep-develop \
--subnet projects/general-214708/regions/europe-west1/subnetworks/develop-sep \
--region europe-west1 \
--zone europe-west1-b \
--scopes cloud-platform \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 100 \
--num-workers 2 \
--worker-machine-type n1-standard-16 \
--worker-boot-disk-size 100 \
--num-worker-local-ssds 1 \
--initialization-actions gs://dataproc-sep-dev/bootstrap.sh \
--properties "\
dataproc:alpha.autoscaling.enabled=true,\
dataproc:alpha.autoscaling.primary.max_workers=2,\
dataproc:alpha.autoscaling.primary.min_workers=2,\
dataproc:alpha.autoscaling.secondary.max_workers=100,\
dataproc:alpha.autoscaling.secondary_worker_fraction=1.0,\
dataproc:alpha.autoscaling.cooldown_period=2m,\
dataproc:alpha.autoscaling.scale_up.factor=1.0,\
dataproc:alpha.autoscaling.graceful_decommission_timeout=6m"

Don’t forget to increase your project CPU quota before you get stuck debugging why your autoscaling doesn’t work as expected :)
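
A quick way to check where you stand, using the europe-west1 region from the cluster definition above (secondary workers are preemptible VMs by default, so keep an eye on the PREEMPTIBLE_CPUS quota as well):

# Print the CPU-related quota limits and current usage for the region
gcloud compute regions describe europe-west1 | grep -B 1 -A 1 CPUS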

And this is the final cluster autoscaling result; you can see the spike in the morning when the batches were running:

What’s next?

It is not the end. We still have a long road ahead, and spark-on-k8s-operator looks promising; we should definitely give it a try. We can also consider moving batch scheduling outside the Spark application and running the batches as standalone Spark jobs.
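
Here is a sketch of how the standalone variant could look with a Dataproc workflow template (template, step, class and jar names are hypothetical), triggered from cron or any external scheduler instead of from inside the Spark application:

# Create a workflow template for the nightly batch
gcloud dataproc workflow-templates create sep-nightly-batch \
--region europe-west1

# Run it on the existing autoscaling cluster, selected by label
gcloud dataproc workflow-templates set-cluster-selector sep-nightly-batch \
--region europe-west1 \
--cluster-labels goog-dataproc-cluster-name=sep-dev

# Add the batch Spark job as a single step
gcloud dataproc workflow-templates add-job spark \
--workflow-template sep-nightly-batch \
--region europe-west1 \
--step-id nightly-batch \
--class com.pygmalios.sep.Batches \
--jars gs://dataproc-sep-dev/sep-batches.jar

# Kick off a run
gcloud dataproc workflow-templates instantiate sep-nightly-batch \
--region europe-west1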

However, this works for now and there is more technical debt we have to pay off sooner rather than later. And when you increase your data processing power you also bring more load to your databases, so you have to scale them as well. This is our next big thing.
