Casting Light On The Covid Vaccine Supply

One shot or two?

Gavin Thompson
apree health (Castlight) Engineering
May 24, 2021

In early 2020, this would’ve been the type of question you’d expect a barista to pose to a customer who would, momentarily, be sitting down to enjoy their latte, unmasked, with a friend. They may chat, share food, and embrace as a farewell when the time comes.

Fast forward to early 2021 and the world is a very different place. As the vaccine rollout gathers pace, the same question is now uttered by those in line for a COVID-19 vaccine, who are curious whether they or their contemporaries will be receiving the J&J single-dose vaccine or the two-dose versions offered by Pfizer and Moderna.

The COVID-19 pandemic has changed the world forever. Regardless of whether you’re vaccine-hesitant or not, countries all over the world are relying heavily on vaccines to play a key part in reopening society, so it’s nigh on impossible to ignore the efforts of the US in its rollout.

The Two Problems

1. The CDC needs to track inventory levels of all COVID vaccines on a daily basis

2. End users want to find nearby locations that have vaccine supply

When the CDC and Castlight discussed partnering to solve these problems in October 2020, I don’t think we entirely knew what we were getting ourselves into. Part two had been partially solved by Boston Children’s Hospital (BCH) and Google with vaccinefinder.org (currently redirecting to vaccines.gov), which had been born out of the 2009 H1N1 pandemic. VaccineFinder allowed users to search for seasonal vaccines, but its back end was not deemed scalable enough for the high traffic expected for COVID vaccine searches.

Part one sounds like something that would already be available to the CDC. A system was already in place for ordering supplies but not for tracking inventory. Simple — a couple of minor modifications to that system and we should be done, right? Well, “that system” turned out to be dozens of systems, many of which hadn’t been enhanced for years, so this was a non-starter.

Problem One (Inventory)

For part one, the CDC would give us a list of providers. These could be pharmacies, hospitals, medical centers, or “mom & pop shops” — anyone who had been approved by the CDC to order COVID vaccine supply. Our job was to invite them to a portal, enable them to enter in-stock dose counts per vaccine, and then extract this data into a daily export file. The CDC could feed this data into their dashboard to make decisions around inventory distribution. Large pharmacy chains already provided similar data for seasonal vaccinations in existing SFTP file feeds and they would need to extend these to provide COVID vaccine data. We needed to marry this with portal-entered data and upload it to the CDC every 24 hours.

Solution One

Time was of the essence, so most technical decisions had to be driven by the speed of delivery. The project team was formed in late October and the expectation was for select providers to be in a position to log in and record inventory at the end of November. Trial runs to test the logistics of shipping the vaccines began in the first week of December, and providers were required to log “doses” for the empty vials they had unpacked from dry ice, to prove that the correct data flowed back to the CDC.

We’d already learned that a custom solution was required, so the next step was to identify components that could be outsourced to reduce implementation times. The authentication, invitation, and identity management components fit that bill, so we used Okta for them. Its sign-in widget and admin dashboards let us invite users over email and gave our Support team out-of-the-box tooling.

Okta also provides a signup email for which a template can be defined. This provided us a real shot in the arm with regard to delivering the onboarding piece.

To ingest inventory files from retail pharmacy chains and pre-enrollment files from the CDC, we wrote a series of Python data loaders to wrap the SQL to load the data to staging tables. This paved the way for optimisations in the extraction process in terms of only loading what was needed to the application schema and an efficient extract process for the CDC export file.
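As a rough illustration of the loader pattern described above, here is a minimal sketch of a Python function that wraps the SQL needed to load a feed into a staging table. It is not the code we shipped: the table name, column names, and CSV layout are hypothetical, and SQLite stands in for the real database so that the example is self-contained.

```python
import csv
import io
import sqlite3

# Hypothetical staging schema; the real table and column names are not
# described in this article.
STAGING_DDL = """
CREATE TABLE IF NOT EXISTS staging_inventory (
    provider_id TEXT    NOT NULL,
    vaccine     TEXT    NOT NULL,
    doses       INTEGER NOT NULL,
    source      TEXT    NOT NULL
)
"""


def load_inventory(conn, csv_file, source):
    """Load one inventory feed into the staging table; return the row count."""
    conn.execute(STAGING_DDL)
    rows = [
        (r["provider_id"], r["vaccine"], int(r["doses"]), source)
        for r in csv.DictReader(csv_file)
    ]
    with conn:  # commits on success, rolls back if the insert fails
        conn.executemany(
            "INSERT INTO staging_inventory (provider_id, vaccine, doses, source) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


def daily_export(conn):
    """Return all staged rows, from every source, in a stable order."""
    return conn.execute(
        "SELECT provider_id, vaccine, doses, source FROM staging_inventory "
        "ORDER BY provider_id, vaccine"
    ).fetchall()


# Example: one pharmacy-chain feed and one set of portal-entered counts.
conn = sqlite3.connect(":memory:")
chain_feed = io.StringIO("provider_id,vaccine,doses\nP1,Pfizer,120\nP2,Moderna,80\n")
portal_feed = io.StringIO("provider_id,vaccine,doses\nP3,J&J,15\n")
load_inventory(conn, chain_feed, "sftp")
load_inventory(conn, portal_feed, "portal")
```

Landing every source in the same staging shape is what makes the later steps simple: the export only has to read one table, regardless of whether a row arrived by file feed or through the portal.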

Problem Two (Search)

With an early February deadline for the search functionality, we had a little more breathing room in terms of making sound and measured engineering decisions. It was contractually predetermined that BCH would own the front end for vaccinefinder.org (now vaccines.gov) and Castlight’s role would be to provide the back end endpoints for COVID searches only. The existing front end would be retired in favor of a COVID-only experience until the regular flu season was back upon us.

One difficult aspect of this problem was the level of uncertainty for the user traffic estimates. One estimate came from BCH which had solid domain knowledge from their work with vaccinefinder.org. The U.S. Digital Service (USDS), which oversees all major technology launches for the US government, and Castlight also provided independent estimates. We intended to prepare for a worst-case scenario but a lot of the “back of the envelope” calculations resulted in numbers with an order of magnitude difference between the three parties.

Based on our experience building and supporting a very popular COVID Test Site Finder, we knew we could leverage Akamai to offload some of the requests. However, it was unclear how much offload we could anticipate, given that the requirement for address-based searching would limit cacheability due to the inclusion of distance to the search origin as part of the response.

Solution Two

We arrived at a solution of periodically re-indexing the data in Elasticsearch and building our search API endpoints on top of the index. We were receiving updates from the pharmacy chains once per day, and individual providers typically recorded their inventory towards the end of the day, so we did not need to index more than every few hours, which helped increase the percentage of requests offloaded to Akamai.
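To make the shape of such a search concrete, here is a hedged sketch of the kind of Elasticsearch request body a "providers near me" endpoint might send. The `geo_distance` filter and `_geo_distance` sort are standard Elasticsearch query DSL, but the field names (`location`, `in_stock_vaccines`) and the defaults are assumptions for illustration, not our actual index mapping.

```python
def build_search_query(lat, lon, radius_miles, vaccine=None, size=50):
    """Build an Elasticsearch request body for providers near a point.

    The field names "location" (a geo_point) and "in_stock_vaccines"
    (a keyword field) are hypothetical.
    """
    origin = {"lat": lat, "lon": lon}
    filters = [
        {"geo_distance": {"distance": f"{radius_miles}mi", "location": origin}}
    ]
    if vaccine is not None:
        # Optional attribute filter, applied inside the same query.
        filters.append({"term": {"in_stock_vaccines": vaccine}})
    return {
        "size": size,
        "query": {"bool": {"filter": filters}},
        # Sorting by distance also returns each hit's distance to the origin.
        "sort": [
            {"_geo_distance": {"location": origin, "order": "asc", "unit": "mi"}}
        ],
    }


# Example: Moderna providers within 25 miles of downtown San Francisco.
query = build_search_query(37.7749, -122.4194, 25, vaccine="moderna")
```

Because the distance to the search origin is part of every response, two users a few blocks apart get different payloads, which is the cacheability limit mentioned earlier.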

What else was considered?

Since we were already storing the inventory data in MySQL and MySQL supports geospatial indexes, we started with a proof-of-concept using location latitudes and longitudes in an indexed geometry column. For the most part, search performance was adequate but with concurrent load, some of the larger radius searches (more than 100 miles) generated a very large CPU load. Since scaling MySQL horizontally isn’t as straightforward as the other options and we didn’t yet have a clear picture of the Akamai offload percentage, we steered away from this option.

Redis also supports geospatial searching and is on the technology menu at Castlight. For geospatial searches only, Redis performed as well if not better than Elasticsearch. However, adding additional attributes to the search, for example, searching for providers near me with vaccine X in stock rather than just providers near me, resulted in degraded performance numbers with Redis as we needed custom filtering on top of the result set that Redis returned.
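The two-step pattern that hurt us with Redis can be sketched in plain Python. This is not Redis client code: `geo_candidates` merely stands in for a radius-only lookup in a geospatial store, and the provider records and field names are made up for illustration. The point is that the stock filter has to run in application code over everything the radius lookup returns.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def geo_candidates(providers, lat, lon, radius_miles):
    """Step 1: radius-only lookup (stand-in for the geospatial store)."""
    hits = []
    for p in providers:
        d = haversine_miles(lat, lon, p["lat"], p["lon"])
        if d <= radius_miles:
            hits.append((d, p))
    hits.sort(key=lambda h: h[0])  # nearest first
    return hits


def filter_in_stock(hits, vaccine):
    """Step 2: application-side filtering over every candidate."""
    return [(d, p) for d, p in hits if p["stock"].get(vaccine, 0) > 0]


# Hypothetical provider records.
PROVIDERS = [
    {"id": "sf", "lat": 37.7749, "lon": -122.4194, "stock": {"pfizer": 0, "moderna": 10}},
    {"id": "oak", "lat": 37.8044, "lon": -122.2712, "stock": {"pfizer": 5}},
    {"id": "la", "lat": 34.0522, "lon": -118.2437, "stock": {"pfizer": 50}},
]

nearby = geo_candidates(PROVIDERS, 37.7749, -122.4194, 25)
with_pfizer = filter_in_stock(nearby, "pfizer")
```

With a large radius, step 1 can return many candidates that step 2 then throws away, so the cost of the second pass grows with the radius rather than with the number of useful results.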

The preparation

Load testing dominated our launch preparation time. Given the traffic estimation challenges mentioned previously and the distributed nature of the Akamai CDN cache, it was more difficult than usual to performance test in a realistic way. Even when talking to Akamai staff directly, there was variation in the numbers provided for how many edge servers they had (I believe this is in the thousands), so estimating cache miss percentages was tricky.

Much of the performance testing relied upon extrapolation. For example, testing directly against our origin using either JMeter or Gatling gave us reasonable numbers around latency and throughput for Akamai cache miss requests. With 64 pods of the back end service, eight Castlight edge servers, and a nine-node Elasticsearch cluster we handled 4500 req/s with a sub-second p90 latency.
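The extrapolation itself is simple arithmetic. If the origin can sustain a given rate of cache-miss requests and a fraction of all traffic is served from the CDN cache, the total traffic the whole system can absorb is the origin rate divided by the miss fraction. The sketch below uses the 4500 req/s figure from our origin tests; the offload fractions are purely hypothetical, since the real offload was exactly what we could not measure in advance.

```python
def max_edge_rps(origin_rps, offload):
    """Total req/s sustainable at the edge, given the origin's capacity.

    offload is the fraction of requests served from the CDN cache (0 <= offload < 1),
    so only (1 - offload) of the total traffic reaches the origin.
    """
    if not 0 <= offload < 1:
        raise ValueError("offload must be in the range [0, 1)")
    return origin_rps / (1 - offload)


# Hypothetical offload fractions applied to the measured origin capacity.
for offload in (0.0, 0.5, 0.75):
    print(f"offload {offload:.0%}: {max_edge_rps(4500, offload):,.0f} req/s")
```

For instance, a 75% offload would imply that only a quarter of requests reach the origin, so 4500 req/s at the origin would correspond to 18,000 req/s overall. Small changes in the assumed offload move the answer a lot, which is why the missing offload number mattered so much.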

Pre-launch, we had hoped to be able to measure the expected Akamai offload, since many of the questions we were fielding centered on the maximum req/s we could handle. By switching to BlazeMeter, we were able to run the JMeter test from six different regions (e.g., AWS US West, GCP East) and view aggregated and per-location results. We knew this test would produce better performance numbers than the much more finely geographically distributed load from real-life clients. Nonetheless, the results were encouraging, and we were ready ahead of schedule to open the gates to the unvaccinated masses.

The lessons learned

Flexibility was key

The only constant was change in this project. The high-level goals didn’t change but the nitty-gritty in terms of the core business logic changed frequently. On occasion, requirements were written by our product teams, signed off externally, and reversed on the same day due to new or emerging information. This occurred more frequently than on many other projects, given the four sets of stakeholders involved. The lesson here was that, for a project of this nature, flexibility should usually be the overriding factor when making trade-off decisions. For example, where items could be inexpensively moved from code to service configuration, we took that approach. We also had to be lean with the automation suite and work to minimize duplication in tests written by different teams.

Akamai Peculiarities

For certain endpoints, we observed better latency numbers when it was a cache miss from the Akamai side — this is something that I still can’t explain :)

Press the zoom button

As an engineer, one can often get deep into the technical detail of a problem and forget to zoom out and focus on the original user problem. In the first weeks post-launch, users could find locations with inventory but not secure an appointment.

In March and April, when traffic was peaking, we only included the in stock / out of stock indicator, which solved just the first half of the user’s problem. Some early feedback was that the majority of users used vaccinefinder.org to target sites with inventory and then a second site to find out when appointments opened up. Fortunately, individuals and small groups created a number of sites, such as Vaccine Spotter, to solve the latter requirement. Looking at the problem in a more holistic way might have encouraged us to add appointment availability earlier.

The result

The Numbers

4m searches per day against Vaccine Finder*

500k users per day according to Google Analytics

20k providers onboarded to log inventory

10m+ doses of inventory logged every day

Vaccine Finder and later vaccines.gov were both launched on schedule and thus far, BCH, the CDC, and the USDS have provided plenty of positive feedback about the stability, response time, and uptime of our API. The CDC has engaged Castlight for a phase 2 body of work above and beyond the original agreement, which one would hope means that Castlight is now a trusted partner for them.

Praise and gratitude have been shared online and on social media towards vaccines.gov and vaccinefinder.org regarding how these sites have helped people get vaccinated who were otherwise struggling. The entire team, across four organizations, should be extremely proud of their efforts towards such an important initiative.

* Proving that all of this is legitimate traffic is challenging. For example, some of it will have come from scrapers.

Collaborators

https://medium.com/@hgorur
