Optimizing Our Legacy Calendar Service

PushPress Engineering

Jeremy Farbota / June 29, 2022

Introduction

For our gym owner clients, it is imperative that the PushPress calendar service remains fast, functional, and reliable. Recently, our team ran into a scalability limit in our Legacy Calendar Service: one endpoint was overloading the entire system by making too many database requests, and we were unable to scale it. To resolve this, Engineering split out that endpoint's API requests using a load balancer proxy.

In our journey to optimize and improve the calendar service, we landed on a few key improvements that met our goals and made the largest impact on this project. First, we created a Legacy Read API to isolate heavy read requests into their own cluster, backed by read-replica database nodes. Additionally, we rewrote the API data request from several select loops into one large query with outer joins, so that all of the information is obtained in a single request instead of several.

After implementing these changes, our main API load dropped by 80%, and our P90 latency dropped by 90% on these endpoints. The result was a much more scalable system, paving the way for a far better user experience.

Background

The Legacy Calendar Service supports all of PushPress's class scheduling and management, for both gym owners and athletes. The system was originally designed to reliably list classes along with all of the pertinent details, such as descriptions, participants, and waitlist information. As our customer base continues to grow, this system has become increasingly taxed by the added usage. In recent weeks, the calendar service grew exceptionally slow for customers attempting to check in to and manage their classes, particularly during peak hours. On a recent Monday, this issue culminated in a severe spike in traffic that drastically impacted the overall interface services for a short window of time.

The calendar system was running in our Legacy PHP codebase, a monolithic API reading from a large, shared MySQL datastore. The legacy codebase presents many issues, from inconsistent mechanisms for reads and mutations to inefficient code architecture. In the face of these usability concerns, we aimed to optimize the legacy system to speed up processes and provide more scalability, buying us time to replace the functionality in the new system. Additionally, we wanted to ensure that the impact of any slowness would be isolated. To do this, we needed to avoid making any significant architectural changes to the PHP.

Strategic Approach and Challenges

The first step was to isolate and migrate this endpoint’s Legacy API requests to the Read Database. There were several considerable challenges with this endeavor; to start, we could not safely migrate the endpoints to their own API without significant work. The endpoint’s business logic was also too complex to move quickly. Furthermore, the libraries on the Legacy API were very old, requiring us to utilize a backdoor approach to migrate the API to a read version.

To solve the isolation problem, we were able to use the same API image and direct specific requests to the new read server via load balancer configuration. The load balancer can map requests with specific headers to a different service without disrupting any other calls.
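
Our actual routing rules live in the load balancer itself, but as a rough illustration of the pattern, header-based routing like this could be expressed in an nginx-style configuration along the following lines (the header name, upstream names, and ports are all hypothetical):

```nginx
# Hypothetical sketch only; the actual proxy, header, and service names are not ours.
# Both services run the same API image; the load balancer simply decides which
# service receives a given request based on a header.

map $http_x_calendar_read $calendar_backend {
    default legacy_api;       # normal traffic keeps hitting the main Legacy API
    "true"  legacy_read_api;  # flagged read requests go to the read-only service
}

upstream legacy_api      { server legacy-api.internal:8080; }
upstream legacy_read_api { server legacy-read-api.internal:8080; }

server {
    listen 80;

    location / {
        proxy_set_header Host $host;
        proxy_pass http://$calendar_backend;
    }
}
```

The important property is that only calls carrying the header are rerouted to the read service; every other request continues to hit the main API exactly as before.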

Focusing on the Read Database issue, we needed to change the database connection configuration for the Legacy API without impacting the main API, since we still needed to use the same image. To achieve parity between the services, with the database configuration as the only difference, we used environment variables in the secrets configuration to enable a specific setting in the task definition for the service. This way, the image could remain identical between the two services, while the task definition in the read service pointed to the read cluster instead of the main writer.
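
As a minimal sketch of that idea (the variable names below are hypothetical, not our actual configuration), the connection code in the shared image can simply read its host from whatever environment the task definition injects:

```php
<?php
// Hypothetical sketch: both services run this exact code; only the environment
// injected by each task definition (and its secrets) differs.

// The main service leaves USE_READ_REPLICA unset, while the read service's
// task definition sets it to "true" and points DB_READ_HOST at the read cluster.
$useReadReplica = getenv('USE_READ_REPLICA') === 'true';

$host = $useReadReplica
    ? getenv('DB_READ_HOST')    // read-replica cluster endpoint
    : getenv('DB_WRITE_HOST');  // primary (writer) endpoint

$pdo = new PDO(
    sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, getenv('DB_NAME')),
    getenv('DB_USER'),
    getenv('DB_PASSWORD'),
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
);
```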

Results

After making this initial change, we immediately noted a major improvement in the main API. By migrating the overloaded endpoint to a new cluster, our main API load dropped by 80%. Our main write database node was suddenly far healthier with regard to its load, and the read API responses for the overloaded endpoint were faster than before, now that they were isolated on the read database cluster. Because of this, we were also able to scale our database by adding read nodes, further decreasing latency.

[Charts: latency (average response time), request volume, and service CPU before and after isolating the endpoint]

After isolating the requests to their own dedicated services, the next step was to optimize how the API requests data from the database and maps it to the necessary outputs. By using joins in the SQL request instead of pulling in several separate objects, we reduced the number of SQL requests from eight to one. The challenge with this change was ensuring that the response remained the same and that all of the items from the Legacy system were correctly mapped from the datastore or API model; there were over 80 attributes in this large, nested object.
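
The real query and schema are more involved than we can show here, but the shape of the change looks roughly like the sketch below (table and column names are illustrative only, and a PDO connection like the one sketched earlier is assumed): instead of a loop issuing a separate SELECT for each related object and joining the results in PHP, the endpoint issues one query with outer joins and lets MySQL do the assembly.

```php
<?php
// Illustrative sketch only; the real query and schema differ.
// Before: roughly eight separate SELECTs (class, description, participants,
// waitlist, ...) stitched together in PHP.
// After: a single query with outer joins, one round trip to the database.

$sql = <<<SQL
SELECT
    c.id            AS class_id,
    c.starts_at,
    ct.name         AS class_name,
    ct.description,
    u.id            AS member_id,
    u.first_name,
    u.last_name,
    r.status        AS reservation_status,
    w.position      AS waitlist_position
FROM classes AS c
LEFT JOIN class_types  AS ct ON ct.id = c.class_type_id
LEFT JOIN reservations AS r  ON r.class_id = c.id
LEFT JOIN users        AS u  ON u.id = r.user_id
LEFT JOIN waitlists    AS w  ON w.class_id = c.id AND w.user_id = u.id
WHERE c.gym_id = :gymId
  AND c.starts_at BETWEEN :from AND :to
ORDER BY c.starts_at, u.last_name
SQL;

$stmt = $pdo->prepare($sql);
$stmt->execute([':gymId' => $gymId, ':from' => $from, ':to' => $to]);

// The flat rows are then folded back into the same nested, 80+ attribute
// structure that the Legacy endpoint has always returned.
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```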

[Charts: latency (average response time) and 5xx errors, showing fewer load balancer timeouts]

Additionally, this work had a short deadline of two days. By focusing solely on this specific improvement and building an effective validation mechanism, the engineers delivered a pull request within the time frame, with responses that completely matched the Legacy response. Not only did this reduce the number of requests and speed up the entire process, it also reduced the memory and CPU tax on the API. The API no longer needed to manage and internally join all of the data, so its overall memory and CPU needs decreased significantly. Moving to a single SQL query reduced the overall latency on this request by a significant 80%.
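
The validation mechanism itself is not the focus of this post, but conceptually it came down to diffing the two payloads before shipping. A minimal sketch of that idea (the fetch functions below are hypothetical stand-ins for the old and new code paths) might look like:

```php
<?php
// Hypothetical sketch of response validation: build the same calendar payload
// via the legacy multi-query path and the optimized single-query path,
// normalize both, and fail loudly on any difference.

function normalizePayload(array $payload): array
{
    ksort($payload);
    foreach ($payload as $key => $value) {
        if (is_array($value)) {
            $payload[$key] = normalizePayload($value);
        }
    }
    return $payload;
}

$legacy    = normalizePayload(fetchCalendarLegacy($gymId, $from, $to));    // old path
$optimized = normalizePayload(fetchCalendarOptimized($gymId, $from, $to)); // new path

if ($legacy !== $optimized) {
    throw new RuntimeException('Optimized calendar response does not match the Legacy response');
}
```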

Lessons and Impact

This calendar endpoint is widely used across our systems, so the impact of this fix was very noticeable to our customers. By reducing the latency on this endpoint, we minimized the potential for other timeout scenarios and ensured that every request received a response. Because this endpoint is heavily used in our Member App for participants to reserve and check in for classes, Engineering worked with PushPress's mobile team to update the user experience in the mobile app, reducing errors while leveraging the faster data responses for a better overall experience. While the interface was extremely slow during peak hours prior to this improvement, after the fix all of our applications were reliably fast, operating without any noticeable slowness.

One of the PushPress core values is “Execute Incremental Perfection.” With this improvement, Engineering increased our capacity to build an even more efficient calendar service. On our team, we are constantly forced to choose between fixing Legacy code and rebuilding it entirely. Here we were able to implement a hybrid solution: we built a new supplement without needing to rebuild the entire service. This process gave us a new pattern for isolating read requests, as well as for improving endpoints through SQL optimization. When one endpoint is optimized, we improve not only that product directly, but also other experiences, since that endpoint is no longer taxing the entire Legacy system.

Above all, we learned that the PHP systems are a major risk to the site's flow, and as engineers we need to continue to prioritize stabilizing and migrating those systems. Moving ahead, we intend to develop a new API to manage calendar requests and responses so that we can discontinue the overloaded request entirely. We look forward to documenting that process in another blog post!

Team Shout-Outs

The calendar optimization included the work of many different collaborators, and it would not have been possible without the contributions from Gustavo Emmel, Fernando Zeferino, Maximiliano Goffman, Brian Aung, and our amazing Customer Experience and Quality Assurance Teams!
