Performance improvement efforts — the dreaded side of software development (part 2)

In the first part, we covered the basics of performance improvement efforts, the factors to consider when planning one, and a checklist of things to keep in mind when jumping in at the deep end of performance improvement waters. This part focuses on a real-world example and shows how flexible that checklist can be.

There, all fixed and patched up

The example comes from a recent performance improvement effort I was involved in. I was tasked with improving the performance of a feature in a trading application that had been driving users crazy for some time.

In this part, we will cover:

  • the problem and how the performance degradation occurred
  • the planning factors that were considered for this example
  • the checklist for phase 1 and its results
  • the checklist for phase 2 and its results

So without further ado, let’s dive in.

The problem

Uhmm, it’s a natural phenomenon, I’m sure it will pass…

Sometimes it feels like we don’t know how a performance degradation occurred in the first place, or how we got to that point in our app’s life cycle. However, that’s not really true. We already covered some of the reasons for performance degradation in the first part: a lack of development skills and experience, or of the budget and time needed to produce quality solutions; hacking up a solution to get to market as quickly as possible, then never taking the time to improve and scale it properly. The list goes on and on, but in this example it came down to unfamiliarity with the inner behaviour of a 3rd-party component.

In this article, we will look at the charting functionality for currency pairs. Since we are building a trading app, one of the main features is letting users track how currency pairs (for example, EURUSD) have behaved over time. The charting library on the frontend shows how a currency pair has behaved over the observed period, with an option to choose between several granularities (1, 5, 15 minutes, etc.). These granularities are called ticks in our chart functionality, and each tick bar has its open, close, high and low price values. The library polls our server for data every 0.5 seconds, and lazy loading lets you go back in time as well.
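To make the data model concrete, here is a minimal Go sketch of what a tick bar could look like on the backend; all of the type and field names are my own illustration, not code from the actual project.

```go
// A minimal sketch of the charting data model; names are illustrative only.
package charting

import "time"

// TickPeriod is one of the observable periods the chart can display.
type TickPeriod time.Duration

const (
	Tick1m  = TickPeriod(1 * time.Minute)
	Tick5m  = TickPeriod(5 * time.Minute)
	Tick15m = TickPeriod(15 * time.Minute)
)

// Bar holds the open, high, low and close prices of one currency pair
// (e.g. EURUSD) over a single tick period.
type Bar struct {
	Pair  string    // e.g. "EURUSD"
	Start time.Time // start of the tick period
	Open  float64
	High  float64
	Low   float64
	Close float64
}

// PollInterval is how often the frontend charting library asks for fresh data.
const PollInterval = 500 * time.Millisecond
```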

So, how was it built in the first place?

We built a proxy server that used a 3rd-party library to communicate with the trading data server (which was not under our control). When users polled for charting data, we simply relayed those calls to the trading data server and served the response. This would have worked out well if the trading data server hadn’t been used for other things as well. The main reason for our performance degradation wasn’t even one of the reasons mentioned previously: we were unfamiliar with how the trading data server worked behind the scenes. It was built so that, under high load, it would lower the priority of chart data requests. Take that fact, factor in the frontend charting library’s polling mechanism, and multiply it by tens of thousands of users using this functionality when the trading market is hot, and you get the disastrous effect of polling calls lasting 5-10, sometimes even 15 seconds each. To put it mildly, it was a slow, sluggish and very frustrating experience for users.
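For a rough picture of how thin that relay was, here is a minimal Go sketch of the “just forward the call” idea, assuming a plain HTTP upstream; the real proxy used a 3rd-party library to talk to the trading data server, and the URL below is made up.

```go
// A sketch of the original "relay the charting calls" proxy, assuming a plain
// HTTP upstream; the upstream URL is hypothetical.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical address of the trading data server (not under our control).
	upstream, err := url.Parse("https://trading-data.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Every charting request is forwarded as-is. Under heavy load the upstream
	// deprioritizes these calls, which is exactly what hurt us.
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	http.Handle("/chart-data/", proxy)

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```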

The plan and planning factors

…something corny like “If you fail to plan, you plan to fail!” :)

Let’s go over the planning factors first. By the time I was tasked with this endeavor, users had gotten used to the charts behaving poorly and didn’t complain as much as before. Also, since I was the only one involved in this effort, I could assume it was not a red-hot “Earth will be scorched unless you fix it” priority. That is not to say I had all the time in the world; I figured I had maybe two months to get this done and show some results before I got hard-pressed about it. I didn’t have a tight budget, but I couldn’t get flashy with it either. Let’s just say I had the option to choose the technologies and other resources, as long as the choices looked fairly reasonable.

With this in mind, the initial plan was:

  • Still use the proxy server, but only as a secondary source of data, and save the data in my own data stores optimized for reading (read stores). With only my services using the proxy server, it should not suffer the same performance issues, as the number of clients hitting it would be limited.
  • Choose Go for building the services that serve and process the charting data, Cassandra for the read models that store it, and Redis for caching (a rough sketch of the intended read path follows this list).
  • Utilize all other services and components that already exist in the system.
  • Roll out as fast as possible with steady performance results, and make further improvements in later iterations.
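Here is that read path as a hedged Go sketch; the interface, type and method names are hypothetical stand-ins for Redis, the Cassandra read stores and the legacy proxy.

```go
// A minimal sketch of the planned read path: cache first, then the read
// store, with the proxy (and the trading data server behind it) only as a
// fallback. All names are illustrative, not from the actual codebase.
package charting

import (
	"context"
	"errors"
	"time"
)

// Bar is one tick bar (open/high/low/close for a currency pair).
type Bar struct {
	Pair                   string
	Start                  time.Time
	Open, High, Low, Close float64
}

// BarSource abstracts Redis, the Cassandra read stores and the legacy proxy.
type BarSource interface {
	GetBars(ctx context.Context, pair string, tick time.Duration, from, to time.Time) ([]Bar, error)
}

// ReadPath tries each source in order of increasing cost.
type ReadPath struct {
	Cache, ReadStore, Proxy BarSource
}

func (r *ReadPath) GetBars(ctx context.Context, pair string, tick time.Duration, from, to time.Time) ([]Bar, error) {
	for _, src := range []BarSource{r.Cache, r.ReadStore, r.Proxy} {
		if bars, err := src.GetBars(ctx, pair, tick, from, to); err == nil && len(bars) > 0 {
			return bars, nil
		}
	}
	return nil, errors.New("no source could serve the requested range")
}
```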

Phase 1 — go over check list and results

To give you a taste of the journey this endeavor took me on, let’s go over the checklist.

See?? Almost a straight line!

Time to plan

About a quarter of the entire length of this performance improvement effort went into the planning and solution design phases, and almost half of that was spent on planning. That included investigating the current solution, researching potential technologies and evaluating them against my current skill set. The evaluation took the most time because it was important that, if I was about to use a technology, framework and/or library I had zero prior experience with, I assessed its learning curve to avoid misusing it. I have already mentioned that I considered Go and Cassandra. While I’m confident I can build pretty much anything with Go, that wasn’t the case for Cassandra. By evaluating the available learning resources, I determined that I would be able to quickly pick up the skills needed to design and build a proper solution.

Time to design a proper solution

So, I chose to go with Go and Cassandra, and I took my time designing the architecture. Most of that time went into designing the read models. Writing to the read stores was not a problem. Reading from them was what I wanted to make as fast as possible, minimising the number of trips to the read stores. I also wanted to exploit the fact that reading an entire partition in Cassandra takes nearly constant time, since the partition key is hashed (via consistent hashing) straight to the node that owns the data. With this in mind, and with proper management of the read stores, I would have a solid base for any future improvements.

Take measurable and comparable data

Unfortunately, I didn’t do any profiling at the beginning of this improvement effort; all I had were the reports from users. Comparable data would certainly have come in handy later, both for comparisons and for reports. Comparable metrics were only taken at the end of phase 1. To tell you the truth, it was only during this effort, and this phase in particular, that I came to appreciate how important this checklist item really is.

Build a solution that is able to scale easily

This is where the near-constant time for partition reads comes into play. The solution was to give each tick its own table, with partitions sized to suit that specific tick. Data retrieval meant fetching data from one partition, or two in the worst-case scenario (which happened maybe 20% of the time). This set a solid foundation for any future improvements.
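As an illustration of the table-per-tick idea, here is a hedged sketch using gocql; the keyspace, table and column names, as well as the day-sized partition bucket for the 1-minute tick, are my assumptions, not the project’s actual configuration.

```go
// A sketch of "one table per tick, one partition per time bucket" using gocql.
// Keyspace, table and column names are invented; a "charting" keyspace is
// assumed to already exist.
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

// Each 1-minute bar lives in a partition keyed by (pair, day), so loading a
// day of 1-minute bars is a single-partition read; a range that crosses
// midnight touches two partitions at most.
const createBars1m = `
CREATE TABLE IF NOT EXISTS charting.bars_1m (
    pair      text,
    day       date,
    bar_start timestamp,
    open      double,
    high      double,
    low       double,
    close     double,
    PRIMARY KEY ((pair, day), bar_start)
) WITH CLUSTERING ORDER BY (bar_start DESC)`

func main() {
	cluster := gocql.NewCluster("127.0.0.1") // hypothetical contact point
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	if err := session.Query(createBars1m).Exec(); err != nil {
		log.Fatal(err)
	}

	// Reading one day of EURUSD 1-minute bars hits exactly one partition.
	day := time.Now().UTC().Truncate(24 * time.Hour)
	iter := session.Query(
		`SELECT bar_start, open, high, low, close FROM charting.bars_1m WHERE pair = ? AND day = ?`,
		"EURUSD", day).Iter()

	var start time.Time
	var open, high, low, closePrice float64
	for iter.Scan(&start, &open, &high, &low, &closePrice) {
		// ...hand the bars over to the API response...
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
}
```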

Break down delivery to smaller deliverable chunks

I have to say, when designing a solution like the one described above, I could have put a lot more effort into delivery planning. I didn’t. Designing the read models aside, most of the effort went into writing the code for the services. The solution couldn’t really work without all of the services working properly, so I easily dismissed delivering smaller chunks and went with one big push. Looking back now, I could have spent some more time breaking up the delivery. I could have gone with several deliveries, with 1–3 days between them. It doesn’t look like much, but I would have started delivering and deploying to production at least a week before the entire functionality was pushed.

Communicate, communicate, communicate

Without comparable data measured before this improvement effort started, and without smaller deliverable chunks planned, there was really not much information to put in the reports. All I was left with were the boring kind of status updates, like “building services is at 30%” or “building read stores is at 45%”. In the reports, I also communicated future plans, ETA changes and roadblocks, but without measurable data and smaller deliverables to liven them up, reports like these tend to stay on the dull side. You can guess how excited the top brass is to read long, boring reports.

yes, yes, well done, kudos!

To sum up the checklist above: the solution was delivered in two months, with immediate results. Users were happy, or at least they weren’t complaining. The metrics are below, of course.

table histogram for 1-minute table

Phase 2 — trust the process and repeat the steps

The success of phase 1 was more than I had planned for. The top brass were happy and the users were not complaining. It was reported that load time “felt” better during the busy hours. I hate the word “felt” in this context, but it was all I had without any measurable data taken at the beginning of phase 1.

While supporting the phase 1 solution, I noticed some potential improvements, and I was keen to tackle them as soon as I had some time to spare. So, without further ado, let’s go over the checklist for phase 2.

Time to plan

Learning from the mistakes I made in the last phase, the main focus this time was breaking the work down into smaller deliverable chunks. Since I was doing these improvements between other assignments, I did not have more than a couple of days at a time to spare on this. So, the plan was to break the work into chunks that could be worked on and tested in a day or two.

Time to design a proper solution

At the end of phase 1, I looked into how the client-side charting library works in more detail. For every tick, there are two stages when a chart is loaded. The first is loading a bigger chunk of data, used for going back and forth through the timeline. The second is polling for the current values, a much smaller chunk of data (the last ten tick bars). This was hardly optimal, as we were retrieving an entire partition just to serve those ten bars around the current timestamp. As you can see from the table histogram above (taken at the end of phase 1), that’s about 24–35 KB of data pulled from the database just to return 10 tick bars. The new solution introduced two tables for each time frame: one for retrieving bulk data, used for the initial load, and another for polling, with a much smaller partition.
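Here is a hedged sketch of what that split could look like for the 1-minute tick, again with invented table names and bucket sizes: the bulk table keeps the larger partitions for the initial load, while a polling table uses a much smaller (here, hourly) partition, so fetching the last ten bars only touches a handful of rows.

```go
// A sketch of the phase 2 split for the 1-minute tick: the existing bulk
// table serves the initial load, while a polling table with hourly partitions
// serves the "last ten bars" requests. Names and bucket sizes are invented.
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

// Bucketing by hour keeps the "current" partition tiny, so the latest ten
// bars come from a single, small partition.
const createBars1mPoll = `
CREATE TABLE IF NOT EXISTS charting.bars_1m_poll (
    pair      text,
    hour      timestamp,
    bar_start timestamp,
    open      double,
    high      double,
    low       double,
    close     double,
    PRIMARY KEY ((pair, hour), bar_start)
) WITH CLUSTERING ORDER BY (bar_start DESC)`

func main() {
	cluster := gocql.NewCluster("127.0.0.1") // hypothetical contact point
	session, err := cluster.CreateSession()  // assumes the keyspace exists
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	if err := session.Query(createBars1mPoll).Exec(); err != nil {
		log.Fatal(err)
	}

	// The polling endpoint only ever needs the latest ten bars.
	hour := time.Now().UTC().Truncate(time.Hour)
	iter := session.Query(
		`SELECT bar_start, open, high, low, close FROM charting.bars_1m_poll WHERE pair = ? AND hour = ? LIMIT 10`,
		"EURUSD", hour).Iter()

	var start time.Time
	var open, high, low, closePrice float64
	for iter.Scan(&start, &open, &high, &low, &closePrice) {
		// ...return the freshest bars to the charting library...
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
}
```

One caveat with a bucket this small: right after an hour rolls over, the current partition holds fewer than ten bars, so the polling query would occasionally have to read the previous bucket as well.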

Take measurable and comparable data

I wasn’t going to make the same mistake twice. Before deploying the new improvements, I took comparable metrics of the current solution so I could compare it to the new one. The table histogram above shows the result of phase 1; below is the result of the phase 2 improvements.

table histogram for 1-minute table used for polling

Build a solution that is able to scale easily

As you can see from the table histogram above, the Read latency and Partition size numbers for the polling functionality were drastically reduced. The solution from phase 1 cleared the way for easy additions and minor tweaks that improved performance even further.

Break down delivery to smaller deliverable chunks

As the main focus of the planning effort for this phase, the new solution was delivered in smaller chunks. Each deliverable contained the changes for a single tick, and coding and deploying each one took about a day or two.

Communicate, communicate, communicate

Since the deliverables, and the effort to deliver them, were smaller, there was no longer a need for constant updates. The reports became simpler as well, as it took just a couple of days to go from starting development work to deploying to production.

Conclusion

Performance improvement efforts vary from five-minute fixes to refactoring efforts spanning several months. They could mean just plugging a leak somewhere in your code, or they could mean having the entire architecture evaluated and refactored. Maybe you, or someone on your team, noticed the problem, or maybe your users were complaining. The point is, it varies from one case to another, and you cannot have a bulletproof checklist that handles every case the same way. That doesn’t mean there shouldn’t be some general guidelines to follow. The list I use is something I’ve compiled over the years and, as you could clearly see, sometimes even I omit a guideline or two.

Mirror, mirror, on the wall, which guidelines should I use, if any at all??

Basically, it boils down to this:

  • You need to give yourself enough time to investigate the issue and create a general plan around that investigation. Implementing fixes based on hunches and guesses can sometimes work or buy you some time, but it can also lead you down the “nothing to show for” rabbit hole.
  • After the investigation, the general plan should be developed into a detailed one, with considered priorities and implications. Sometimes it’s more important to show fast progress than to design a better solution. Sometimes it’s not.
  • Everyone needs information about progress during all phases of the improvement efforts, so communication is key. Damage control is as important as fixing the issue.

I would like to hear what you think about this subject. Do you have your own guidelines, or do you just wing it? I would also like to hear about your own performance improvement endeavors, especially the horror stories. Maybe someday there will be a TV show about the gnarliest performance improvement efforts, narrated by Liam Neeson, Morgan Freeman or James Earl Jones ;).
