Optimising Flask Applications with Elastic APM

Ch’ng Yaohong
StashAway Engineering
4 min read · Aug 31, 2019

New Feature

We recently launched a revamped portfolio analytics service to display more information to our customers, and to overhaul the original service that lived in our Scala backend. We wanted to decouple our existing services as much as possible and increase throughput.

Fig 1: New Returns

Originally, the net asset value (NAV) and performance of portfolios over time were calculated once a day and stored in a table in a Cassandra database. We improved retrieval speed by caching the queries. However, as we refined our NAV calculation to include intraday NAV (to account for deposits or withdrawals), our existing processes had to change.

Initially, we calculated returns based on the Modified Dietz method. After customer feedback, we decided to calculate returns using time-weighted and money-weighted methods to better reflect a portfolio’s return. (Read more about how we calculate our returns here.)
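To make the starting point concrete, here is a minimal sketch of the Modified Dietz formula we began with. The function name and inputs are illustrative, not our production code:

```python
def modified_dietz(bmv, emv, flows, period_days):
    """Modified Dietz return over a single period.

    bmv/emv: beginning and ending market value of the portfolio.
    flows: list of (day, amount) external cash flows, where day is the
           number of days into the period at which the flow occurred.
    """
    net_flow = sum(amount for _, amount in flows)
    # Each flow is weighted by the fraction of the period it was invested.
    weighted_flows = sum(
        ((period_days - day) / period_days) * amount for day, amount in flows
    )
    return (emv - bmv - net_flow) / (bmv + weighted_flows)
```

For example, a portfolio that starts at 1,000, receives a 100 deposit halfway through a 30-day period, and ends at 1,200 has a gain of 100 over a weighted base of 1,050, so the return is about 9.52%.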

So we built a small microservice in Flask to calculate these returns on the fly. We wanted to leverage the Pandas library, as it is optimised for fast numerical computation. Python is widely used within our existing stack, so we could quickly build something around it without incurring additional infrastructure load.

When the main logic was ready to be launched, we needed to benchmark it against the existing calculations to ensure that we would not scare our customers unnecessarily with significantly different returns. So we created a script to output results from both the old and new services into a CSV file. However, we noticed that the new service’s response time was unacceptable! Somehow our new API drove like an old beat-up car, and we were left scratching our heads over all the possible reasons for the slowdown. Fingers were pointed at Cassandra read speed, the slowness of Flask, needless loops in the calculations. Honestly, we had no real clue at all.

Cue developer’s despair.

Intro to Elastic APM

New Toys!

At the same time, we were also experimenting with Elastic APM to monitor the performance of one of our NodeJS microservices. Our talented DevOps engineer set it all up in our preproduction environment and encouraged the team to make use of the service. Elastic APM is, as its name suggests, an application performance monitoring system built on top of the Elastic Stack. We already used the ELK stack to monitor our logs, so it was natural to adopt an open-source alternative for application monitoring as well.

Elastic APM allowed us to monitor our microservices in real time, collect detailed information on how our responses performed, and drill into what made up each response time: database queries, application logic, external calls and so on. This would, in theory, give us much more insight into our API performance. More details on Elastic APM can be found in the official documentation.
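Wiring the agent into a Flask service takes only a few lines. The sketch below uses the official Elastic APM Flask integration; the service name, server URL and environment values are placeholders, not our actual configuration:

```python
from flask import Flask
from elasticapm.contrib.flask import ElasticAPM

app = Flask(__name__)

# Placeholder configuration; point SERVER_URL at your APM Server.
app.config['ELASTIC_APM'] = {
    'SERVICE_NAME': 'portfolio-analytics',
    'SERVER_URL': 'http://apm-server:8200',
    'ENVIRONMENT': 'preproduction',
}

# The agent hooks into Flask and reports each request as a transaction,
# with spans for database queries and outgoing HTTP calls out of the box.
apm = ElasticAPM(app)
```

Once the agent is attached, every request shows up in the APM UI with its duration distribution and a span-by-span breakdown.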

From the distribution below, we could see that each call to the API took on average 1 second to complete. This was unacceptable from a customer’s perspective, especially when loading something as critical as a portfolio’s returns. Bottlenecks could exist on multiple layers: network, database and code. Previously, it would have taken quite a bit of work to set up profiling and measure the results. With Elastic APM, however, we could quickly see where the slowdowns were happening. From the screenshot below, we could see that one particular function, `calculating-returns`, accounted for roughly 80% of the total response time.

Fig 2: Transactions duration distribution
Fig 3: API Call Breakdown

We could see that the function we were calling to derive returns was extremely slow. So we broke the function down further by adding more span captures in the code, and found that a recursive function we had written to calculate net present value was taking up most of the time. In the screenshot below, you can see that `analytics.calc.xnpv` was called at least 6 times, averaging 100 ms per call.
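Adding these extra spans is lightweight. A sketch of how custom spans can be captured with the Elastic APM Python agent follows; the function and helper names are illustrative:

```python
import elasticapm

# As a decorator: each invocation appears as a span named "xnpv"
# under the current transaction in the APM UI.
@elasticapm.capture_span()
def xnpv(rate, cashflows):
    ...

# As a context manager, to instrument an arbitrary block of code:
def get_returns(portfolio):
    with elasticapm.capture_span('calculating-returns'):
        return calculate_returns(portfolio)  # hypothetical helper
```

Spans captured this way nest under the request’s transaction, which is what produced the drill-down views in the screenshots.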

Fig 4: Further Drill Down

Now that we had uncovered the culprit, we looked into optimising the function: the recursive algorithm was simply taking too long to compute its results. After some tweaking, we were able to reduce the function’s time to 3 ms. Running the load test again showed the majority of calls completing in around 130 ms, a reduction of more than 80% in total time taken.
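To illustrate the kind of change involved: a per-cash-flow, Python-level NPV loop can be replaced with a single vectorised NumPy expression. This is a sketch under an Actual/365 day-count assumption, not our exact production function:

```python
from datetime import date

import numpy as np

def xnpv(rate, cashflows):
    """Net present value of irregularly dated cash flows.

    cashflows: list of (date, amount) pairs; amounts are discounted
    relative to the date of the first cash flow (Actual/365).
    """
    t0 = cashflows[0][0]
    years = np.array([(d - t0).days / 365.0 for d, _ in cashflows])
    amounts = np.array([amount for _, amount in cashflows])
    # One vectorised operation instead of a recursive Python call chain.
    return float(np.sum(amounts / (1.0 + rate) ** years))
```

As a sanity check, investing 100 and receiving 110 exactly one year later nets out to zero at a 10% discount rate.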

Fig 5: Improved API Responses
Fig 6: Improved Drill Down

Conclusion

Given how useful APM was in improving visibility into our code’s performance, we decided to incorporate it into several other applications. Without APM, running load tests and setting up metrics collection would have taken us considerably more time. We are glad that its introduction had an immediate effect on our production systems.

We are constantly on the lookout for great tech talent to join our team — visit our website to learn more and feel free to reach out to us!
