Increasing application performance and deep-diving into open-source code

Daniel Vu Dao · Published in Life at Chime · 11 min read · Oct 27, 2022

By Daniel Vu Dao, a member of Chime’s Engineering Team

Every day millions of Chime members use our app to achieve financial peace of mind. Whether it be checking their balance, sending a Boost, or receiving a referral bonus — our members usually start in the same place: the Home Screen.

On average, our homepage sees tens of thousands of requests per minute.

Earlier this year, we noticed an opportunity to improve performance in one of our services that would speed up our homepage load time.

An investigation ensued, and as a result we made our homepage load 50 ms faster (original latency was ~60 ms, so roughly an 80% reduction) and also patched a major open-source tool.

In this post, we’ll go over how we discovered an opportunity for improvement, the exploration we undertook, and how we tackled the issue. What started out as a small investigation turned into a win for our members and the wider Ruby open-source community 💎.

If you’re eager to see the open-source contribution, here’s a link to the PR ahead of time: https://github.com/aws/aws-sdk-ruby/pull/2642

Setting the Stage

At Chime, we have many micro-services that serve different use-cases for our members. On our mobile client we use GraphQL to make calls to our API gateway service. The API gateway fans out to the underlying micro-services to complete the request.

A high-level overview of Chime’s architecture, which receives tens of thousands of requests per minute on average.

Consequently, if an underlying service is taking too long, it becomes a bottleneck for our API gateway and mobile app.

We began to notice that one of our services, which calls DynamoDB, was taking a long time to respond, fluctuating in latency and degrading our home screen.

You’re probably looking at this graph and thinking…

"But 60 ms? That's nothing!" 🤔

For folks who don’t know, DynamoDB is a key-value store service provided by AWS. Key-value stores are generally faster than a traditional RDBMS when accessing records, as they don’t require expensive JOIN clauses. Depending on how one sets up their DynamoDB table (sort key, partition key, clustering, etc.), querying records is expected to be fast. For reference, other queries within our service were taking ~10 ms, and a slow DynamoDB query like this was becoming a bottleneck for our application.
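To make that concrete, here’s a minimal sketch of the kind of key-based lookup we’re talking about (the table and attribute names are hypothetical, not our actual schema):

require 'aws-sdk-dynamodb'

client = Aws::DynamoDB::Client.new

# Query a single partition by its partition key; DynamoDB can answer this
# without touching other partitions, which is why these reads are typically fast.
resp = client.query(
  table_name: "members",                        # hypothetical table name
  key_condition_expression: "member_id = :id",  # partition key lookup
  expression_attribute_values: { ":id" => "1234" }
)
resp.items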

Once we identified where the latency was occurring, it was time to start the investigation of why.

It starts with an educated guess 💡

Looking at our graph over time, we noticed that our deployments loosely correlated with latency spikes (somewhat of an educated guess, but a place to start).

Top chart: requests to the service, colored by deployment; bottom chart: latency.

Since these spikes started to flatten over time, our first guess was DNS.

We knew that when starting an external API request:

  1. A client needs to do a DNS lookup.
  2. DNS lookups can be expensive, as they require resolving a hostname to an IP address.
  3. Unfortunately, these lookups are necessary at least once.

As a result, our first experiment was to install resolv-replace, a pure-Ruby DNS resolver, in place of the default C-based resolution. This would allow other threads to acquire the GIL and do work while one thread is blocked on DNS I/O.

Hold on! What do some of those words mean? 😰

First off, Chime uses Ruby on Rails to power many of its backend services.

In Ruby, there is the GIL (Global Interpreter Lock). This means Ruby has concurrency but not true parallelism for threads (this depends on your Ruby VM: JRuby can run threads in parallel, while MRI cannot). The GIL is there to protect developers from unintentionally executing code that isn’t thread-safe. Ruby threads hold the GIL while executing code, unless they’re doing I/O, in which case the GIL is released for other threads, allowing concurrency.
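As a quick, self-contained illustration of that last point (ours, not Chime production code): CPU-bound threads don’t overlap under MRI, but threads blocked on I/O do, because the GIL is released while they wait.

require 'benchmark'

cpu = -> { 3_000_000.times { Math.sqrt(rand) } }
io  = -> { sleep 0.5 } # stands in for blocking I/O; MRI releases the GIL here

# CPU-bound: roughly 4x the single-threaded time, since only one thread runs at a time.
puts Benchmark.realtime { 4.times.map { Thread.new(&cpu) }.each(&:join) }

# I/O-bound: roughly 0.5s total, since the waits overlap.
puts Benchmark.realtime { 4.times.map { Thread.new(&io) }.each(&:join) }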

From this article, DNS lookups are C extensions:

MRI is a little less forgiving when it comes to C extensions. Any time that a library uses the C extension API, the GIL blocks any other code from execution. There is no exception for blocking I/O here, if a C extension is blocking on IO then all other threads will be blocked.

In Chime’s case, our micro-services are hosted on Puma, a web server which serves requests concurrently using threads. By installing resolv-replace we were hoping that the thread doing DNS resolution wasn’t acquiring the GIL and blocking other threads from serving web requests.
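The change itself is tiny; here is a minimal sketch of what we tried, assuming a Rails initializer (the file path is ours to choose):

# config/initializers/resolv_replace.rb (hypothetical path)
# Swaps the C-based DNS resolution in Ruby's socket classes for the pure-Ruby
# Resolv implementation, so a thread blocked on a DNS lookup releases the GIL.
require 'resolv-replace'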

And then it becomes a not-so-wild guess 😁

As guesses go, not all of them pan out (though some do!). The gem installation likely resulted in an improvement, but it didn’t drop latency as much as we anticipated. Stepping back, we had a few bits of knowledge:

  1. We were able to see overall DynamoDB latency per query in the DataDog graph.
  2. We had flame graphs that showed us the spans of each request.
  3. We also had access to the aws-sdk-core code, as it’s open-source.

By looking at our average request times, we sorted our DataDog traces by duration and found the most expensive ones.

APM traces sorted in DataDog by duration

We noticed that on longer traces, there was an extra POST being made to sts.us-east-1.amazonaws.com.

An extra STS call takes ~20ms in some cases 🤔

Finally! We had a better lead for what might be going on.

The Hunt Begins! 🚗

Knowing that some requests were making an unexpected call to AWS sts.us-east-1.amazonaws.com, we had a start. Let’s recap what we now know along with our tools:

  1. We knew the longest traces had an extra POST being made to STS.
  2. By 👆, we also knew what code was being executed.
  3. We were able to manually instrument DataDog traces in our Rails console.

Given all of this, we were able to reproduce what was going on. Reproducibility is important in any investigation, as it allows one to narrow down their problem space.

When you work in a programming language every day, you become familiar with some of its features. Ruby optimizes for developer experience, and there are many features developers can take advantage of.

If your gem is installed via bundler, you can open it with a simple command like bundle open aws-sdk-core. This doesn’t just let you look at a gem’s source code; it lets you edit it as well (also known as monkey-patching).

Note: when you’re done monkey-patching a gem, it’s recommended to run bundle pristine afterwards. This restores the gem to its original state after your edits.

Another feature engineers use heavily is breakpoints. Ruby has a well-known gem, pry-byebug, which allows developers not only to place a breakpoint using binding.pry, but also to edit code in a live debugger session with the edit command (e.g. edit ClassName#method_name; more features here).
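For example (the class below is hypothetical, for illustration only), a debugging session might look like this:

require 'pry-byebug'

class BalanceFetcher # hypothetical class, for illustration only
  def call
    binding.pry # execution pauses here and drops into a REPL
    fetch_from_dynamodb
  end

  def fetch_from_dynamodb
    # ... the slow call under investigation would live here
  end
end

# Inside the pry session you can then run:
#   next                      # step over the next line
#   continue                  # resume execution
#   edit BalanceFetcher#call  # open this method in $EDITOR and hot-edit it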

Given our knowledge of Ruby’s breakpoint and monkey-patch tooling, our goal was to track down where this STS call was happening within the aws-sdk-core gem.

To kick-start our investigation, we instrumented the expensive code by monkey-patching it with pry, using edit Aws::DynamoDB::Client#initialize (edit is a handy command that opens the class directly in your favorite editor during a REPL session).

Datadog.tracer.trace("dynamodb", resource: "AwsInvestigation") do
  # We're calling #initialize - but this is doing an HTTP call, odd!
  client = Aws::DynamoDB::Client.new
  # ...
end

Looking at this DataDog trace, we were able to see the STS call again after constructing the client, but only on the first initialization within the Rails console session.

Using bundle open aws-sdk-core, we were able to place that same instrumentation in multiple places within the gem and narrow down our traces. We started off in the SDK’s initialize method by using edit.

# ...
def initialize(plugins, options)
  Datadog.tracer.trace("dynamodb", resource: "AwsInvestigation") do
    @config = build_config(plugins, options)
    @handlers = build_handler_list(plugins)
    after_initialize(plugins)
  end
end
# ...

We placed different breakpoints along the way, using binding.pry’s continue, next, and edit to follow the flow of the code.

Coupled with DataDog instrumentation, we landed in the AWS credential provider chain’s resolve method here. Looking at the source code, this method checks which credential provider is configured and tries to initialize it and refresh the token. We instrumented the code again to make sure this was where the bottleneck was.

def resolve
  Datadog.tracer.trace("dynamodb", resource: "AwsInvestigation") do
    providers.each do |method_name, options|
      provider = send(method_name, options.merge(config: @config))
      return provider if provider && provider.set?
    end
    nil
  end
end

After adding more DataDog traces in, we managed to find the credential provider that was executing.

def initialize
  # ...
  @client = client_opts[:client] || STS::Client.new(client_opts.merge(credentials: false))
  # ...
end

private

def refresh
  # ...
  @client.assume_role_with_web_identity(...)
  # ...
end

After inspecting deeper, it turned out that calling #assume_role_with_web_identity (here) was making a request to the sts.us-east-1.amazonaws.com endpoint! 😱

Taking a Step Back 😅

Let’s take a step back with what we know now. Using a combination of instrumentation and breakpoints, we narrowed down our problem space and found what was making the extra POST to STS. The next question was why this POST was happening.

After reading documentation and looking at our system, we learned this credential provider is used when we don’t set static AWS environment variables like AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Rather than managing credentials ourselves and setting a static set of provider credentials, we refresh our access tokens periodically. At Chime, we regularly rotate our AWS EKS pods, and because of that it would be difficult and risky to store a static set of credentials.

For example, you can securely distribute your AWS credentials to the instances, enabling the applications on those instances to use your credentials to sign requests, while protecting your credentials from other users. However, it’s challenging to securely distribute credentials to each instance, especially those that AWS creates on your behalf, such as Spot Instances or instances in Auto Scaling groups. You must also be able to update the credentials on each instance when you rotate your AWS credentials.

— From AWS Docs as to why this is necessary.
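One quick way to confirm which provider the chain resolved to is from a Rails console; the output shown below is illustrative of our EKS setup, not a guaranteed result:

client = Aws::DynamoDB::Client.new

# With no static AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY configured, the SDK
# falls back to web-identity credentials, which are obtained (and refreshed) via STS.
client.config.credentials.class
# => Aws::AssumeRoleWebIdentityCredentials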

Putting it all together 👏

Now we knew the following:

  1. When initializing a new DynamoDB client, we were making an initial STS call to obtain credentials.
  2. STS calls are required, as it’s difficult to securely distribute credentials. We definitely don’t want to compromise our members’ data.

So… why was it still taking so long?

It turned out that we had introduced a bug into our code: we were initializing one Aws::DynamoDB::Client per class. While the DynamoDB client manages its own connection pool under the hood, it doesn’t provide a singleton access pattern. This meant that on every client initialization, we were making an STS call without realizing it.

The DynamoDB client should be accessed as a singleton, as we don’t need to initialize a new instance for every request. Because we run so many instances of our services at Chime, we were creating a new DynamoDB client object per class, per process, per thread. 😱

The Initial Fix 🔧: Singleton + Rails Initializer

The initial fix was simple: use a singleton and initialize it on application boot. Initializing at boot lets us create the client and make the STS request before the Rails app starts accepting web requests. After doing this, we saw a large performance improvement.

Rather than doing something like this inline:

# app/services/example.rb
class Example
  def call
    client = Aws::DynamoDB::Client.new
    client.do_stuff
  end
end

We swapped to:

# config/initializers/dynamodb.rb
DYNAMODB_CLIENT = Aws::DynamoDB::Client.new

And then used it as:

# app/services/example.rb
class Example
  def call
    DYNAMODB_CLIENT.do_stuff
  end
end

This allowed us to reference the same DynamoDB object throughout our code!
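A plain constant works because the SDK client is designed to be shared across threads; an equivalent, slightly more encapsulated variant (module and file names hypothetical) is a memoized accessor that’s touched once in an initializer so the STS call still happens at boot:

# app/lib/dynamo.rb (hypothetical)
module Dynamo
  def self.client
    # Built once and reused by every request, thread, and class.
    @client ||= Aws::DynamoDB::Client.new
  end
end

# config/initializers/dynamodb.rb
Dynamo.client # warm it up at boot so the STS call happens before traffic arrives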

Performance improved when using our DynamoDB client 💪

This is a ~10x improvement 😱

The Second Fix 🔧🔧: Fix tail-end latency upstream

While the singleton fixed initialization latency, we recognized it didn’t fix tail-end latency. Remember how we mentioned that a client will refresh its token automatically? Refreshing is blocking I/O performed under a lock. In the aws-sdk-core code, it looks like this:

def refresh_token
  @mutex.synchronize do
    # ...
    refresh
  end
end

What’s going on?

  1. The main thread waits to acquire a mutex, acquires it, and then refreshes the STS token by making an external request.
  2. The issue is that a contended mutex is expensive: a thread that wants it blocks until the holder releases it.
  3. If one thread is executing inside the mutex block (the critical section), all other threads are blocked waiting to acquire the mutex (see the sketch after this list).
  4. This contributes to periodic spikes and peaks in latency, as clients only refresh and acquire their token after waiting on the mutex.
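Here’s that sketch (our illustration, not the gem’s code): one thread holds the mutex while it performs the slow refresh, and every other thread that needs credentials parks behind it.

mutex = Mutex.new

refresher = Thread.new do
  mutex.synchronize { sleep 0.5 } # simulates holding the lock during the slow STS call
end

sleep 0.05 # give the refresher a head start so it holds the lock first

workers = 5.times.map do
  Thread.new do
    # Each request thread stalls here until the refresh finishes, which surfaces
    # as a periodic latency spike on whatever requests those threads are serving.
    mutex.synchronize { }
  end
end

([refresher] + workers).each(&:join)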

Instead of monkey-patching a fix for the tail-end latency, we contributed a PR to the AWS gem that auto-refreshes the token ahead of time using threads. This addressed the locking issue, as request threads would no longer need to refresh the token and wait on the mutex.

This is a simplification of the code merged:

def refresh_token
  if near_refresh?
    Thread.new do
      @mutex.synchronize do
        # ...
        refresh
      end
    end
  end
end

There are more details and discussion in the PR, so hop on over there if you want to read more. Threads are notoriously difficult to manage, so we promise it’s a great discussion that digs into Ruby semantics as well: https://github.com/aws/aws-sdk-ruby/pull/2642.

Finally, after releasing the gem we looked at our max latency graph and noticed a drop in the peaks and occurrences of our spikes.

Max latency graph for an endpoint using the DynamoDB client; the blue line shows the new max latency after release.

Conclusion 💸

By the end of this process, we were able to reduce latency for one of our core services, bringing average service calls down from 50 ms per request to 10 ms per request. In doing so, we didn’t just bring down latency; we also improved the core member experience of opening the app.

And finally, we were able to give back to the Ruby open-source community, helping everyone who sees periodic latency spikes from AWS token refreshes rather than keeping the fix internal.

Shoutouts 📣

  • Jared Smith for one of the longest pair programming sessions we’ve had to date.
  • Akhil Naini for the excellent technical guidance and resources, and for the nudge to open-source our fix and contribute it upstream.
  • My amazing coworkers (Rahul Gupta, Jessica Lo, and Cole Bennett) who helped me edit this post.
