Adopt these 3 key practices to drive better long-term application performance on Spark

Published in DBS Tech Blog · Sep 1, 2022 · 8 min read

In the second part of a two-part series, we look at long-term solutions that bring performance improvements

By Kapil Shukla

In the first article of my two-part Spark series, I focused on five short-term solutions that helped our team overcome Spark job performance problems. While effective, their biggest drawback is that they are not meant for the long haul. It is therefore essential to look beyond code-level optimisation to minimise recurring performance problems. This article focuses on the three steps our teams have taken, and the benefits observed after implementation.

1) Balancing Project And Product Mindsets

Team structure is crucial in governing how a product is developed, delivered, and maintained over the application’s lifecycle. Organisations typically have either a project or a product mindset, and this mindset in turn drives how teams are structured to build and deliver products.

Organisations with a project mindset focus on getting the job done, and the team will likely be led by a delivery lead or a project manager. Often enough, architecture principles like scalability and observability are afterthoughts, and will only be prioritised when things don’t work well during the production stage.

Organisations with a product mindset focus on the quality of the end product, and the team will likely be led by a development lead or product manager. Adopting architecture and SRE principles along with delivering business value would be taken into account right from the start.

While performance problems are observed under both approaches, they are usually much harder to solve when teams adopt the former mindset, for a couple of reasons:

  1. Delivery timelines are fixed, and assigning dedicated people to sort out performance problems risks the team failing to meet timelines; and
  2. The culture and mindset within the team are formed around delivering functional/business features. Workarounds become the priority each time the team faces problems delivering features. These workarounds add to the tech debt, and the team finds it hard to step back and relook at the design and architecture

At DBS, the finance team is undergoing a transformation to balance the needs of the business (features) with those of technology (design, engineering, scale, etc). New capabilities such as scalability, observability, and resilience are needed in several teams to move towards a product mindset. We’ve taken performance problems as an opportunity to build a few of these capabilities. We set up a centralised team (which is expanding) comprising colleagues skilled in areas such as performance improvement, observability, and automation. These specialists work alongside project teams to resolve performance issues and build new capabilities that minimise future recurrences.

We observed a few benefits to this centralised approach:

  1. An outside-in perspective that allows team members to re-evaluate the architecture, design, or process;
  2. The centralised team is motivated to understand the business and architecture goals through the ‘Whys’, before delving into the ‘Hows’. Once they nail the ‘Whys’, they can better structure how they can achieve the outcome (the ‘Hows’); and
  3. The centralised team drives non-functional requirements (including automation, observability, logging, and operational procedures) from start to finish, giving the existing team hands-on learning and increased exposure and experience

These benefits led to results that have been rather promising:

  1. Rearchitecting the design and architecture based on business inputs and identification of gaps in the existing processes;
  2. Prevention of poorly written code from reaching production;
  3. Identification of key metrics that drive performance and making them visible in the dashboards; and
  4. Moving away from shell scripts to event-driven architecture

It would have been difficult to roll out the observability, regression framework, and automation features if the developers from the project team were the only ones involved. The next two sections delve deeper into preventing poorly written code from reaching production and identifying key metrics, both of which also happen to be solutions for optimising Spark job performance.

2) Preventing Poorly Written Code From Reaching Production

When examining performance tuning problems, one data point that is often looked at is the job’s historical timing, followed by the question, “What has changed since the last good timing?” The reasons could be many, including developers adding more code or a sheer increase in the volume of data being crunched. We investigated the development and release cycle and found that the team was already following code review, unit testing, and functional testing processes before rolling out code to production, as shown below.

Figure 1: SDLC Process

It was apparent that the team needed to pivot to other practices, as the existing ones were not able to catch the problems in testing environments. As developers added more transformation-related logic to the code, it was difficult to figure out whether the transformations had an adverse effect on job performance, even if the size of the dataset remained the same.

We adopted a ‘shift-left’ approach and changed our development and release cycle. Our goal was to catch performance problems during the development phase, not at UAT, and certainly not in production. Hence, we wrote a small yet powerful Performance Test Regression Framework using the following principles:

  1. Every developer runs his or her code on selected jobs that crunch large, production-like datasets;
  2. The run time must be within a 10% range of the observed run time for the same job in production; and
  3. The developer must attach the regression report to his or her pull request in order to commit the code

Figure 2: SDLC Process with Regression Testing

Essentially, we put a gate in place to hold developers responsible if their code degrades the performance of key jobs, given that there is no change in the dataset or the resources assigned to run the job. As mentioned earlier, distributed programming is difficult, and code reviewers and developers often do not pay attention to the parts of the code that eventually cause performance problems.

The following high-level steps were used to build the performance test regression framework:

  1. Generating production-like datasets;
  2. Selecting critical jobs in our system and running these jobs against the dataset with known Spark configurations (including number of executors and cores per executor);
  3. Noting the processing time for each Spark job;
  4. Creating a data model to store the jobs, timings, resources, dependencies, and other key configurations; and
  5. Automating the process via a simple shell script that takes the commit ID of the developer branch as input. It then compiles the code, runs the jobs sequentially, keeps track of the completion times, and finally creates an assessment report (a minimal sketch of this step follows below)
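
To make this concrete, here is a minimal sketch of the assessment step in Python. The job names, baseline timings, spark-submit arguments, and artifact path are hypothetical placeholders; only the sequential execution, the 10% tolerance, and the report attached to the pull request reflect the principles described above.

```python
import json
import subprocess
import sys
import time

# Hypothetical baseline: observed production run times (seconds) per critical job.
BASELINE_SECS = {
    "daily_gl_aggregation": 1260,
    "fx_revaluation": 840,
    "regulatory_extract": 2100,
}

TOLERANCE = 0.10  # a job may run at most 10% slower than its production baseline


def run_job(job_name: str, commit_id: str) -> float:
    """Run one Spark job built from the given commit and return its wall-clock time.

    The spark-submit arguments and artifact path are placeholders; the real
    framework compiles the developer branch first and pins the executor settings.
    """
    start = time.monotonic()
    subprocess.run(
        [
            "spark-submit",
            "--num-executors", "10",
            "--executor-cores", "4",
            "--class", f"jobs.{job_name}",       # hypothetical entry point
            f"/builds/{commit_id}/app.jar",      # hypothetical artifact path
        ],
        check=True,
    )
    return time.monotonic() - start


def assess(commit_id: str) -> dict:
    """Run every baseline job sequentially and flag regressions beyond the tolerance."""
    report = {}
    for job, baseline in BASELINE_SECS.items():
        elapsed = run_job(job, commit_id)
        report[job] = {
            "baseline_secs": baseline,
            "observed_secs": round(elapsed, 1),
            "regressed": elapsed > baseline * (1 + TOLERANCE),
        }
    return report


if __name__ == "__main__":
    result = assess(sys.argv[1])            # commit ID passed on the command line
    print(json.dumps(result, indent=2))     # this report is attached to the pull request
    if any(r["regressed"] for r in result.values()):
        sys.exit(1)                         # fail the gate so the merge is blocked
```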

This strategy paid immediate dividends, as the regression framework prevented the core team (which wrote it) from committing their own code while they were still optimising it. The new process puts the onus on developers to own the issues introduced by their code.

3) Identification of Key Metrics

The Phoenix Project describes four kinds of work: business work, internal projects, operational work, and the dreaded ‘unplanned’ work, which takes a person away from completing their main scope of work. Unnecessary calls, meetings, and additional work done for others all fall under this category.

In large organisations, unplanned work often goes unnoticed by managers, making it more difficult to reason about an employee’s productivity. Simply put: what doesn’t get noticed doesn’t get measured. System performance behaves similarly. Systems built without visible key metrics make it difficult for others to find the root cause of a performance problem. Moreover, the unplanned work done by the people engaged in solving the issue puts a dent in productivity.

Designing performance-related indicators that are visible in dashboards is worth the effort when it comes to measuring the efficacy of the application. Prior to this, our teams were performing basic monitoring, yet the data didn’t help us gain additional insight. We focused on observability and built two dashboards: Live Status and Trend Status.

The Live Status dashboard shows the metrics we want to observe while jobs are running. Metrics such as data processed by jobs, data skewness, runtimes of various jobs, job status, checkpointing for every job, and infra assignment are now readily available. Importantly, we defined two Service Level Indicators (SLIs) for our system: Data Quality and Performance.
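
As an illustration of how such metrics can be collected, the sketch below pulls per-job figures from Spark’s monitoring REST API. The host, port, application ID, and the simple skew heuristic are assumptions for this example rather than a description of our actual collector.

```python
import requests

# Placeholder endpoint: Spark's monitoring REST API is served by the driver UI
# (port 4040 by default) or by the History Server for completed applications.
SPARK_API = "http://spark-driver.example.com:4040/api/v1"   # hypothetical host


def job_metrics(app_id: str) -> list:
    """Return one compact record per Spark job for the Live Status dashboard."""
    jobs = requests.get(f"{SPARK_API}/applications/{app_id}/jobs", timeout=10).json()
    return [
        {
            "job_id": j["jobId"],
            "status": j["status"],                  # RUNNING / SUCCEEDED / FAILED
            "submitted": j.get("submissionTime"),
            "completed": j.get("completionTime"),
            "tasks_total": j["numTasks"],
            "tasks_failed": j["numFailedTasks"],
        }
        for j in jobs
    ]


def skew_hint(app_id: str, stage_id: int, attempt: int = 0) -> float:
    """Rough data-skew indicator: slowest task duration divided by the median."""
    tasks = requests.get(
        f"{SPARK_API}/applications/{app_id}/stages/{stage_id}/{attempt}/taskList",
        params={"length": 1000},
        timeout=10,
    ).json()
    durations = sorted(t["duration"] for t in tasks if "duration" in t)
    if not durations:
        return 1.0
    median = durations[len(durations) // 2]
    return durations[-1] / max(median, 1)
```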

Figure 3: Data Quality SLI

Data Quality SLI: We conduct data quality checks at the end of every phase. We defined this indicator as the percentage of total jobs producing the right data, as shown in Figure 3. This indicator is necessary because if our Spark jobs don’t produce the right data, manual intervention is needed, which prolongs the overall completion time. We wanted to minimise manual interventions and make the metric visible, regardless of how well or poorly our jobs were doing.

Performance SLI: We defined this indicator as the percentage of jobs completing within 10% of the average run time of their past five runs. If this percentage falls below a threshold of 15%, we know that we won’t be able to complete our jobs within the specified timeframe.
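
For concreteness, both indicators can be computed from the per-job records already captured for the dashboards, roughly as sketched below. The record structures are illustrative; the 10% band and the five-run average come from the definitions above.

```python
from statistics import mean


def data_quality_sli(job_results: list) -> float:
    """Data Quality SLI: percentage of jobs whose end-of-phase quality checks passed.

    job_results is a list of dicts like {"job": ..., "quality_check_passed": bool}.
    """
    passed = sum(1 for r in job_results if r["quality_check_passed"])
    return 100.0 * passed / len(job_results)


def performance_sli(run_history: dict) -> float:
    """Performance SLI: percentage of jobs whose latest run time stays within 10%
    of the average of their previous five runs.

    run_history maps a job name to its run times in seconds, oldest first.
    """
    within_band, eligible = 0, 0
    for job, runs in run_history.items():
        if len(runs) < 6:                 # need five prior runs plus the latest one
            continue
        eligible += 1
        latest, baseline = runs[-1], mean(runs[-6:-1])
        if latest <= baseline * 1.10:     # one reading of "within the 10% range"
            within_band += 1
    return 100.0 * within_band / eligible if eligible else 100.0
```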

The Trend Status dashboard gives an overview of trends over the last few periods, based on the various metrics captured in the Live Status dashboard. Whenever jobs are not completed within the desired timeframe, we want to see what (including which metrics) has changed over the last few months, so that we can determine the cause(s) of the delays.
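
A trend view can be derived from the same stored run records, for example by aggregating run times per month. The sketch below uses pandas and assumes a simple record schema (job_name, run_date, runtime_secs) purely for illustration.

```python
import pandas as pd


def monthly_runtime_trend(runs: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-job run times into a monthly trend for the Trend Status view.

    Assumes columns: job_name (str), run_date (datetime64), runtime_secs (float).
    """
    runs = runs.assign(month=runs["run_date"].dt.to_period("M"))
    return (
        runs.groupby(["job_name", "month"])["runtime_secs"]
            .agg(["mean", "max", "count"])
            .rename(columns={"mean": "avg_secs", "max": "worst_secs", "count": "runs"})
            .reset_index()
    )
```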

Building these dashboards yielded the following benefits to our team:

  • We figured out the underlying metrics responsible for the performance of our system;
  • Moving to SLIs, and then to Service Level Objectives (SLOs), brought visibility and fostered conversations with business stakeholders, as no system is guaranteed to work 24/7;
  • Moving from reactive to proactive problem solving; and
  • Receiving immediate insights on various metrics and the toil hours saved through our Trend Status and Live Status dashboards

Conclusion

Adopting the above three techniques has been challenging and time-consuming, yet we now experience a higher level of efficiency in our existing practices and engineering. Building the ecosystem around the application requires additional time, people, and effort, but that investment is recouped when things go wrong, because this same ecosystem accelerates problem solving.

Kapil is a technologist with over a decade of experience working as a software developer, product manager, and solution architect building large-scale enterprise products. In his current role as Head of Engineering and Architecture, Kapil is responsible for defining technology strategy and adoption, and for hiring and coaching talent to build superior products for the finance platform.
