Auto Scaling Impact on Data Scientists Day-to-Day

Pedro Fonseca
Published in Feedzai Techblog
4 min read · Mar 15, 2022

More big data and analytics workloads are moving to cloud platforms, where data scientists and data engineers do much of their work. Cloud platforms can be public, private, or a combination of the two (a “hybrid cloud”). The data science and data engineering communities should understand cloud platforms at a deeper level so they can accomplish analytics work more effectively.

Data scientists and data engineers frequently run their data processing and analysis work in non-scalable environments. Cloud technology, however, offers new capabilities that enable more efficient and elastic use of available resources.

Leveraging Auto Scaling Capabilities of the Cloud

Feedzai continuously works to deliver better user experiences and performance capabilities to one of its core users: data scientists. In an effort to address the challenges that larger data science teams face, we have introduced multiple improvements and features in our service. One of them is seamless integration with the cloud’s auto scaling capabilities.

As a model is built, more questions are asked and extra data is required to answer them, resulting in multiple iterations of the Data Science Loop. If you run the model on premises, you face a limited, fixed set of resource types and capacity. That’s where cloud readiness really comes in handy.

We bring the knowledge and experience gained from managing multiple customers on the cloud, resulting in much-needed efficiency, performance, and elasticity. Under the hood, we use Apache Spark with YARN as the resource manager. Spark is a multi-language engine for executing data engineering, data science, and machine learning workloads on single-node machines or on clusters.
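To make this concrete, here is a minimal sketch of how a job is handed off to YARN via `spark-submit`. The application name, queue name, and resource sizes are invented for illustration; the flags themselves are standard `spark-submit` options.

```python
# Hypothetical sketch: assembling a spark-submit invocation that targets a
# YARN-managed cluster. Script name, queue, and resource sizes are
# illustrative only.
def build_spark_submit(app, queue="default", executors=4,
                       executor_memory="8g", executor_cores=4):
    """Return the spark-submit argument list for a YARN-managed job."""
    return [
        "spark-submit",
        "--master", "yarn",            # hand resource management to YARN
        "--deploy-mode", "cluster",    # driver runs inside the cluster
        "--queue", queue,              # YARN queue the job is charged to
        "--num-executors", str(executors),
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        app,
    ]

cmd = build_spark_submit("train_fraud_model.py", queue="data-science")
print(" ".join(cmd))
```

With a static `--num-executors` like this, the job holds its executors for its whole lifetime, which is exactly the fixed-capacity behavior discussed next.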

Without Auto Scaling

In the picture above you can see:

  • multiple jobs (ML model training, feature generation, data exploration, …) competing for the same fixed capacity;
  • jobs waiting to start because there are no infrastructure resources available;
  • a significant amount of unused capacity.

Moreover, with fixed capacity, jobs that run simultaneously compete for the same resources and slow each other down, while jobs forced to run sequentially wait their turn. Either way, deliverables are delayed.
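A toy calculation (with made-up job durations) shows why sequential execution hurts: on a fixed cluster the wall-clock time is roughly the sum of the job durations, while with per-job dedicated resources the jobs overlap and the wall-clock time is only the longest job.

```python
# Illustrative durations (hours) for three typical data science jobs.
job_hours = {"model_training": 4, "feature_generation": 3, "data_exploration": 2}

sequential = sum(job_hours.values())  # fixed capacity: one job at a time
parallel = max(job_hours.values())    # elastic capacity: jobs run side by side

print(f"sequential: {sequential}h, parallel: {parallel}h")
# prints "sequential: 9h, parallel: 4h"
```

The 9h-to-4h gap in this toy example is in the same ballpark as the 2–5x speedups reported below.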

The multiple inefficiencies outlined here were solved when our Cloud Operations, Data Science, and Product teams worked together with customer Data Scientists to take advantage of the cloud capabilities of Feedzai RiskOps Platform. Together, we were able to achieve impressive results!

With Auto Scaling

Have a look at the picture above. Can you spot the differences? Let me help you. Whereas before there was no way to run multiple jobs simultaneously without significantly impacting efficiency, we can now do it by provisioning dedicated resources for each job. Jobs finish faster, and once they finish, the cluster scales down to minimal resources, ensuring minimal waste. During this engagement we achieved the following results:
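One common way to get this elastic behavior with Spark on YARN is dynamic allocation. The property names below are standard Spark settings; the values (and the backlog heuristic) are illustrative, not the exact configuration described in this post.

```python
# Sketch of the Spark properties that enable elastic executor counts on YARN.
# Property names are standard Spark settings; values are illustrative.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",           # floor kept when idle
    "spark.dynamicAllocation.maxExecutors": "50",          # ceiling during peaks
    "spark.dynamicAllocation.executorIdleTimeout": "60s",  # release idle executors
    "spark.shuffle.service.enabled": "true",  # lets shuffle data survive scale-down
}

def executors_for_backlog(pending_tasks, tasks_per_executor=4,
                          min_executors=1, max_executors=50):
    """Clamp the executor count requested for a task backlog (toy heuristic)."""
    wanted = -(-pending_tasks // tasks_per_executor)  # ceiling division
    return max(min_executors, min(wanted, max_executors))
```

With no pending work the count falls back to the floor (`executors_for_backlog(0)` returns 1, so the cluster can scale down), while a large backlog is capped at the ceiling (`executors_for_backlog(1000)` returns 50).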

Scalability: We were able to execute jobs 2–5x faster allowing more iterations, essential for data scientists to deliver better business results;

Capacity Increase: We were able to deliver 4x more capacity to the customer at the same cost;

Availability: Increase of 98% in jobs completed (measured before and after Auto Scaling + Yarn queues). Auto Scaling helps ensure that applications always have the right capacity to handle current traffic demand;

Team Efficiency: Increase from ~10 active data scientists to ~25 active data scientists. The cluster automatically reacts to the team’s workload, eliminating the need for the team to self-organize around idle periods to use the cluster.
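The "Auto Scaling + Yarn queues" combination mentioned above pairs elastic capacity with per-team YARN queues. As a hypothetical sketch (queue names, capacity shares, and the routing rules are invented; a real layout is customer-specific):

```python
# Hypothetical YARN capacity-scheduler layout: each queue gets a share of the
# cluster, and jobs are routed to a queue by workload type.
queue_capacity = {  # percent of cluster capacity per queue (illustrative)
    "model-training": 50,
    "feature-engineering": 30,
    "exploration": 20,
}

def queue_for_job(job_type):
    """Pick the YARN queue a job should be submitted to (toy routing rule)."""
    routing = {
        "training": "model-training",
        "features": "feature-engineering",
    }
    return routing.get(job_type, "exploration")  # everything else shares a default
```

Dedicated queues keep one team's burst from starving another, while auto scaling grows the cluster so each queue's share translates into real, available capacity.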

Customer Data Science teams can now scale resources and have dedicated teams using those resources. Our customer concluded that we achieved better performance, faster iterations, and increased capacity at lower costs.

The efficiencies gained mean that getting to the point of building a model is quicker, and data scientists have more time to tweak and improve their models, making the models themselves better.


IT Manager | Corporate Innovation | Enterprise Architecture | Digital Transformation | PMP. Find out more about me: http://lnked.in/fonseca