Debugging Issues At Their Root

Shannon Tully
Coffee Meets Bagel Engineering

--

The heart of Coffee Meets Bagel is the data we collect and analyze. We use the data you provide in your profile and preferences to decide who to introduce you to, and we analyze all of your likes and passes to calculate which matches you see. At our scale, at any moment, we have megabytes of data coursing through our production systems waiting to be processed.

We have a system we call Eventshuttle that takes important data from several locations and sends it to an AWS Redshift data warehouse used for analytics. We have another system that captures events triggered by pre-defined interactions and writes them to AWS RDS, which our Customer Service team uses to help users debug issues they encounter. We decided to extend Eventshuttle to also send data to RDS so we could have a more unified system.

Before diving into the details of the challenge we faced, let’s start with a quick overview of the Eventshuttle system for context.

  • The system is composed of a producer (written in Python) with threading logic and a consumer (written in Go).
  • The producer serializes and sends data to a RabbitMQ exchange; the consumer pulls records out of the RabbitMQ exchange and propagates them to different datastores. (A minimal sketch of the producer side follows this list.)
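
The sketch below shows roughly how the producer side works: application threads serialize events and hand them to a publisher thread, which pushes them to a RabbitMQ exchange with pika. This is illustrative rather than the actual Eventshuttle code; the exchange name, event shape, and helper names are assumptions.

```python
# Illustrative sketch of the producer side, not the real Eventshuttle code.
import json
import queue
import threading

import pika

EVENT_EXCHANGE = "eventshuttle"  # assumed exchange name
pending = queue.Queue()          # events waiting to be published


def publish_loop() -> None:
    # Publisher thread: owns the RabbitMQ connection and drains the in-memory queue.
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    while True:
        event = pending.get()    # blocks until an application thread enqueues an event
        channel.basic_publish(
            exchange=EVENT_EXCHANGE,
            routing_key=event["type"],
            body=json.dumps(event).encode("utf-8"),
        )


# Application code serializes an event and hands it to the publisher thread.
threading.Thread(target=publish_loop, daemon=True).start()
pending.put({"type": "like", "user_id": 123, "target_id": 456})
```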

Work to extend the Eventshuttle system began, and the code allowing the event consumer to write to RDS was released late on Oct 3rd. The next day we saw that the consumer had been stopped all night. We spun up some new consumers manually and everything cleared. We then started seeing elevated BrokerUnavailableExceptions (BUEs) coming from all of our event producers, which prior to the release we had only seen occasionally and at a very low volume. The investigation into the new errors began.

We found a warning that looked like it could be the cause of the BUEs: “Giving up on initializing configuration for an exchange due to timeout acquiring initializer lock”. The warning correlated with the errors, but everyone agreed it was more likely the churning of the servers causing the issue than a flaw in the threading logic.

Our first thought was that there was a period during server start-up when the producer would simply return errors. A few days after the initial release, we removed a max requests limit that was causing our servers to churn. Later that night, we saw that the 500s had increased significantly: every box was now returning errors every time an event was sent. After bringing back the cap and restarting all of the boxes, the 500s dropped back below the alert threshold. However, errors had increased enough at this point to more than double average latency across the whole app.
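
For reference, the cap in question was presumably gunicorn’s max_requests setting, which recycles each worker after it has served a fixed number of requests; a config with such a cap looks roughly like this, with illustrative values:

```python
# gunicorn.conf.py — illustrative values, not our production config
workers = 8
max_requests = 1000        # recycle a worker after 1000 requests (the "churn")
max_requests_jitter = 100  # stagger restarts so workers don't all recycle at once
```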

Still thinking the issue was related to server start-up, we attempted to “pre-warm” the producer, meaning we would wait for confirmation from RabbitMQ before sending any events. After deploying this, there was no change in the error rate. Attacking the idea from the other end, we also tried changing where the SIGTERM signal we received from gunicorn was handled, moving it from the producer to the parent process to make sure that tasks were shutting down in the correct order. This didn’t help either.
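
A minimal sketch of that change, under the assumption that the producer runs as a background thread inside each gunicorn worker; the handler and shutdown steps are illustrative rather than our exact code:

```python
# Illustrative sketch: handle SIGTERM in the parent (main) process so shutdown
# happens in a known order — signal the producer thread, wait for it, then exit.
import signal
import sys
import threading

stop_event = threading.Event()


def producer_loop():
    # Stand-in for the Eventshuttle producer thread: runs until told to stop.
    while not stop_event.is_set():
        stop_event.wait(0.1)   # placeholder for "publish pending events"


producer_thread = threading.Thread(target=producer_loop)
producer_thread.start()


def handle_sigterm(signum, frame):
    # Tell the producer to finish its current work, wait for it, then exit,
    # instead of letting it be killed mid-publish.
    stop_event.set()
    producer_thread.join(timeout=5)
    sys.exit(0)


# Registered in the parent process rather than inside the producer, so
# gunicorn's SIGTERM is handled before worker threads are torn down.
signal.signal(signal.SIGTERM, handle_sigterm)
```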

We then tried to upgrade a library we were using, “pika”, from version 0.13.0 to 1.1.0 because there were some major changes that we thought could help our situation, but they did not.

Up to this point, we had been working under the assumption that no messages were being dropped, but about a week after the first changes we found out that the “…timeout acquiring initializer lock” warning was resulting in a 7% data loss. The warning was only supposed to trigger when two events attempted to initialize the same configuration for an exchange at the same time, in which case one event would be enqueued to be retried. That logic was flawed, because occasionally we were unable to acquire a lock on the thread at all. When acquiring the lock failed, we would put the events into a pending events list, which we thought would be consumed the next time we successfully acquired the lock. The problem was that the list was only consumed when a new queue was created, and because all of the queues we needed had already been created, the list was never consumed. Once the server was churned, those events were gone forever.
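
A hedged reconstruction of the flawed pattern (not the actual Eventshuttle source) makes the failure mode easier to see:

```python
# Illustrative reconstruction of the bug: events that time out waiting for the
# initializer lock are parked in pending_events, but that list is only drained
# inside _create_queue(), which never runs once every queue already exists.
import logging
import threading

initializer_lock = threading.Lock()
initialized_exchanges = set()
pending_events = []                       # events parked after a lock timeout


def _publish(event):
    # Stand-in for the actual RabbitMQ publish call.
    pass


def _create_queue(exchange):
    # Declare the exchange/queue, then flush anything that was parked earlier.
    initialized_exchanges.add(exchange)
    while pending_events:
        _publish(pending_events.pop())    # the ONLY place pending_events is drained


def send_event(event):
    if not initializer_lock.acquire(timeout=1):
        logging.warning(
            "Giving up on initializing configuration for an exchange "
            "due to timeout acquiring initializer lock"
        )
        pending_events.append(event)      # parked "to be retried"... but it never is,
        return                            # because every queue already exists
    try:
        if event["exchange"] not in initialized_exchanges:
            _create_queue(event["exchange"])
        _publish(event)
    finally:
        initializer_lock.release()
```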

Next, we tried fixing the flawed threading logic. We modified our code to use Python’s threading.Event, using its .set() and .wait() methods to let the threads handling each event wait for one another before continuing. But the timeout issue persisted with this new logic as well.
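
A minimal sketch of that approach, with illustrative names: the first thread to touch an exchange initializes it and calls .set(), while every other thread calls .wait() until the initialization has finished.

```python
# Illustrative sketch of coordinating exchange initialization with threading.Event.
import threading

_init_events = {}                 # exchange name -> threading.Event
_init_events_lock = threading.Lock()


def ensure_exchange(exchange, initialize):
    with _init_events_lock:
        event = _init_events.get(exchange)
        first = event is None
        if first:
            event = _init_events[exchange] = threading.Event()
    if first:
        initialize(exchange)      # declare the exchange/queue on RabbitMQ
        event.set()               # wake up everyone who is waiting
    else:
        event.wait(timeout=5)     # block until the first thread finishes
```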

Eventually, we decided that the logic of having every event check its queue configuration was flawed at its base. We removed the configuration-initialization logic entirely and instead documented how to manually create the configuration you need on RabbitMQ when creating a new event. When we deployed this fix a few weeks after the initial changes, everything worked.

No more errors.

No more warnings.

No more data loss.
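
For new event types, the manual RabbitMQ setup we documented amounts to a one-off declaration like the sketch below; the exchange, queue, and routing-key names are illustrative.

```python
# Illustrative one-off setup script: declare the exchange, queue, and binding
# once, up front, so the producer and consumer never initialize anything at
# send time.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="eventshuttle", exchange_type="direct", durable=True)
channel.queue_declare(queue="eventshuttle.rds", durable=True)
channel.queue_bind(queue="eventshuttle.rds", exchange="eventshuttle", routing_key="like")

connection.close()
```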

In conclusion, there was a lot of very odd behavior in this whole debugging process, but I also learned some very important lessons that I will take with me. I believe we should have taken more time at the beginning of the debugging to really explore why the system was built the way it was. Doing so would have allowed us to better understand exactly what effect our changes would have, and to think beyond problems with our recent changes to issues with the architecture of the system as a whole.

Eventshuttle was designed to be scalable, but it was only ever tested with one thread and a light load. We could have prevented these problems from arising if we had done more testing around threading when the system was originally built. In the future, we will need to spend more time testing so we can catch these problems before they reach production.

One great outcome of this whole episode was that our team developed expertise in managing this important system. We now have a number of developers who can monitor and maintain Eventshuttle for the future. We have a number of projects planned to expand and improve the way we handle our events and our entire team is very excited to work on them.
