How We Tackled Improving API Reliability & Performance

Sarah Ward
Samsara R&D
Aug 6, 2020 · 7 min read

Samsara’s customers build their businesses around ours, and being able to integrate with our platform is a significant component of that partnership. I work on the Developer Ecosystem team, and my team’s charter is to deliver a best-in-class developer experience for our customers and partners. The elevator pitch for my team is, “we make sure our APIs are well-designed and performant.”

The reality

If I have a favorite meme right now, it’s probably Oh No. A year and some change ago, “oh no” was pretty much our reaction when we evaluated our APIs:

  • APIs were inconsistently defined — a vehicle returned by one endpoint might not match a vehicle returned by another endpoint. Sometimes you needed to pass in a query param as groupId, other times as groupID. Sometimes we required the POST HTTP method for what should have been GET endpoints. 🤦‍♀
  • Too many endpoints were unbounded — that is, they were not paginated, or they were allowed to query massive amounts of data in one request. This can strain our backend services and affect both reliability and performance of some endpoints. As more customers and larger fleets integrated with our APIs, we were starting to feel the effects.
  • If an internal team wanted to create new endpoints, there were no clear patterns or guidance on how to design a good endpoint, and no oversight to ensure that the new endpoint was consistent with the overall API surface.
Mood.

The tricky thing about APIs is that once you release them, you have to support them. Forever. Or at least as long as your customers have built their businesses around them.

So how do we give our current and future customers a better API experience?

In-place fixes

We knew that we would need to support our current API surface for as long as our customers are using it, so a large part of our effort was improving the performance and reliability of our existing endpoints.

Our first priority was to build tooling to measure the current state; good metrics about our system would show us where we could improve. We use BigQuery to log detailed metrics about API requests, and we used these logs to build a list of endpoints we wanted to improve. To prioritize that list, we looked at API reliability from our customers’ perspective: which endpoints does a given customer hit, and how reliable are they? The key point is that we look at per-org, per-endpoint performance and reliability. This view of our metrics ensures that an org experiencing poor reliability on an endpoint is seen, even if most orgs have good reliability on that endpoint.
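
To make that concrete, here’s a rough sketch of the per-org, per-endpoint rollup that view implies. The actual aggregation happens over our BigQuery request logs; this Go version just shows the shape of the computation, and the types, status-code rule, and threshold are all illustrative assumptions.

```go
package main

import "fmt"

// RequestLog is an illustrative stand-in for one logged API request.
type RequestLog struct {
	OrgID      string
	Endpoint   string
	StatusCode int
}

type key struct{ org, endpoint string }

// successRates rolls logs up into per-(org, endpoint) success rates, so an
// org with poor reliability on an endpoint is visible even when the
// endpoint-wide average looks healthy.
func successRates(logs []RequestLog) map[key]float64 {
	total := map[key]int{}
	ok := map[key]int{}
	for _, l := range logs {
		k := key{l.OrgID, l.Endpoint}
		total[k]++
		if l.StatusCode < 500 { // treat 5xx as failures (assumption)
			ok[k]++
		}
	}
	rates := make(map[key]float64, len(total))
	for k, n := range total {
		rates[k] = float64(ok[k]) / float64(n)
	}
	return rates
}

func main() {
	logs := []RequestLog{
		{"org-1", "/v1/fleet/locations", 200},
		{"org-1", "/v1/fleet/locations", 500},
		{"org-2", "/v1/fleet/locations", 200},
	}
	for k, rate := range successRates(logs) {
		if rate < 0.99 { // illustrative reliability threshold
			fmt.Printf("needs attention: %s %s (%.2f)\n", k.org, k.endpoint, rate)
		}
	}
}
```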

Sometimes adding better monitoring was as simple as adding tags to our existing metrics, but the big win for us was being able to look at good traces across API and backend services. These traces allowed us to identify bottlenecks in specific services that we could target for investigation. Generally, our team was able to use the existing monitoring infrastructure shared across most Samsara teams: aggregated metrics in Datadog for answering questions like “how many requests did we see for a given endpoint this week?”, and traces in Lightstep for looking at how a single API request flows through our backend services.
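
As an illustration of how little “adding tags” can be: tagging each request metric with the org and endpoint is what unlocks the per-org view above. This sketch uses the open-source Datadog Go client; the metric and tag names are assumptions, not our real ones.

```go
package main

import (
	"log"
	"time"

	"github.com/DataDog/datadog-go/v5/statsd"
)

func main() {
	// Connect to a locally running Datadog agent.
	client, err := statsd.New("127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Illustrative metric and tag names: tagging by org and endpoint is
	// what lets us slice reliability and latency per customer later.
	tags := []string{"org:org-1", "endpoint:/v1/fleet/locations", "status:200"}
	_ = client.Incr("api.request.count", tags, 1)
	_ = client.Timing("api.request.duration", 120*time.Millisecond, tags, 1)
}
```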

Armed with our system metrics, we targeted the endpoints that would deliver the highest impact in performance and reliability. Some improvements were quick wins, like when we noticed that the request to get a single driver was actually looking up all the drivers and then filtering down to just the requested one. 🤦‍♀

The average P99 request duration for v1/fleet/drivers/{driver_id} for orgs with many drivers.
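
In code, that quick win looked roughly like the following. This is a hedged sketch: the Store interface and function names are invented stand-ins for our actual data layer.

```go
package drivers

import (
	"context"
	"errors"
)

// Driver and Store are illustrative stand-ins for our real types.
type Driver struct{ ID, Name string }

type Store interface {
	ListDrivers(ctx context.Context, orgID string) ([]Driver, error)
	GetDriver(ctx context.Context, orgID, driverID string) (Driver, error)
}

var ErrNotFound = errors.New("driver not found")

// Before: fetch every driver in the org, then filter down to one.
// For orgs with many drivers this did far more work than necessary.
func getDriverSlow(ctx context.Context, s Store, orgID, driverID string) (Driver, error) {
	all, err := s.ListDrivers(ctx, orgID) // fetches ALL drivers
	if err != nil {
		return Driver{}, err
	}
	for _, d := range all {
		if d.ID == driverID {
			return d, nil
		}
	}
	return Driver{}, ErrNotFound
}

// After: ask the data layer for exactly the requested driver.
func getDriver(ctx context.Context, s Store, orgID, driverID string) (Driver, error) {
	return s.GetDriver(ctx, orgID, driverID) // single-row lookup
}
```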

Others were accumulations of ongoing incremental improvements, like the work done on v1/fleet/locations. This endpoint makes a lot of requests — a LOT of requests — to several downstream services. This is great in the sense that it returns a variety of useful data, but also not great because now there are many points of failure for a popular endpoint. 😐 To understand why this matters, it helps to take a high-level look at what happens when we receive a request.

Our API server — apiserver — is essentially a thin adapter that translates incoming API requests into GraphQL queries and mutations. The GraphQL handlers talk to various data stores and call out to different gRPC services that talk to their respective data stores. We shard apiserver to provide some protection from bad actors or disruptions in a single shard, but with a request like v1/fleet/locations, we have so many dependencies that an issue with, or high load on, a single service will fail the entire request.
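
Here’s a simplified sketch of that fan-out (the downstream service names are invented). The all-or-nothing behavior is the point: any one failing or overloaded dependency fails the whole response.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// fetchLocations sketches the fan-out behind an endpoint like
// v1/fleet/locations: every downstream call is another point of failure,
// and g.Wait() returns the first error, failing the entire request.
func fetchLocations(ctx context.Context) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, svc := range []string{"gps", "drivers", "vehicles", "geofences"} {
		svc := svc
		g.Go(func() error {
			return callService(ctx, svc) // a gRPC call in the real system
		})
	}
	return g.Wait() // one failure fails the whole request
}

// callService is a stand-in for a real gRPC client call.
func callService(ctx context.Context, name string) error {
	return nil
}

func main() {
	fmt.Println(fetchLocations(context.Background()))
}
```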

Good traces across apiserver’s downstream services let us identify some spectacularly non-performant calls. Traces often revealed a pattern where we made serialized requests to a database that could instead be batched (e.g. a single request to fetch a batch of 100 drivers instead of 100 requests that each fetch a single driver). In many places we found this sort of batching to be the most effective improvement: converting single SQL queries into batched SQL queries and, where possible, executing some number of these batched queries in parallel.
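
A minimal sketch of that batching with database/sql, assuming an invented drivers table: the before version makes one round trip per driver, the after version makes a single round trip for the whole batch.

```go
package drivers

import (
	"context"
	"database/sql"
	"fmt"
	"strings"
)

// Before: 100 drivers means 100 serialized round trips to the database.
func namesByIDSlow(ctx context.Context, db *sql.DB, ids []int64) (map[int64]string, error) {
	names := make(map[int64]string, len(ids))
	for _, id := range ids {
		var name string
		err := db.QueryRowContext(ctx,
			"SELECT name FROM drivers WHERE id = ?", id).Scan(&name)
		if err != nil {
			return nil, err
		}
		names[id] = name
	}
	return names, nil
}

// After: a single batched query fetches the whole set at once.
func namesByID(ctx context.Context, db *sql.DB, ids []int64) (map[int64]string, error) {
	if len(ids) == 0 {
		return map[int64]string{}, nil
	}
	placeholders := strings.TrimSuffix(strings.Repeat("?,", len(ids)), ",")
	args := make([]any, len(ids))
	for i, id := range ids {
		args[i] = id
	}
	rows, err := db.QueryContext(ctx,
		fmt.Sprintf("SELECT id, name FROM drivers WHERE id IN (%s)", placeholders),
		args...)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	names := make(map[int64]string, len(ids))
	for rows.Next() {
		var (
			id   int64
			name string
		)
		if err := rows.Scan(&id, &name); err != nil {
			return nil, err
		}
		names[id] = name
	}
	return names, rows.Err()
}
```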

Although the v1/fleet/locations endpoint is receiving more traffic, it has become more performant and reliable!

Looking forward

Of course, improving performance and reliability of existing APIs isn’t the only answer to the question, “How do we give our current and future customers a better API experience?”

We decided that the best way to give future customers a better API experience was by actually giving them a better API. 🙃 Redesigning the surface of our APIs was not easy! We went through a few iterations and many design discussions. I’ll spare you the entire manifesto, but our core philosophy revolves around a few points:

  • APIs should be designed to solve the user’s problems, to answer the user’s questions. Our designs should always keep the customer use cases top of mind.
  • APIs should be intuitive, consistent, and resource-oriented, and should follow industry best practices.
  • Endpoints do not live in isolation; they should make sense within the context of our entire API surface.
  • APIs should be performant! Endpoints must be bounded either with pagination or by limiting the scope of the request (see the sketch after this list). It should not be possible to issue unbounded queries that bring our services to their knees. For example, our apiserver should not be issuing a request for all drivers for all time to any of the services it calls out to.
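
To make the “bounded” rule concrete, here’s a minimal sketch of how a list endpoint might cap and cursor its results. The parameter names (limit, after) and the cap are illustrative assumptions, not necessarily our real API surface.

```go
package api

import (
	"net/http"
	"strconv"
)

// maxPageSize is an illustrative hard cap: no request may be unbounded.
const maxPageSize = 512

type Driver struct{ ID, Name string }

// page is the bounded response shape: a slice of results plus an opaque
// cursor the client uses to request the next page.
type page struct {
	Drivers     []Driver `json:"drivers"`
	EndCursor   string   `json:"endCursor,omitempty"`
	HasNextPage bool     `json:"hasNextPage"`
}

// parsePagination bounds every list request: clients may ask for fewer
// results than the cap, but never more, and resume via the cursor.
func parsePagination(r *http.Request) (limit int, after string) {
	limit = maxPageSize
	if raw := r.URL.Query().Get("limit"); raw != "" {
		if n, err := strconv.Atoi(raw); err == nil && n > 0 && n < maxPageSize {
			limit = n
		}
	}
	return limit, r.URL.Query().Get("after")
}
```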

After determining what our APIs should be like, we set about designing our shiny new API endpoints with the internal teams that owned those features. We wanted to make it easier for internal teams to develop endpoints consistent with our philosophy. And let’s be honest, we can’t rely on every team having the time to read a beautiful what-an-API-should-really-be manifesto. We knew that, for our redesign to be successful, we’d need to embed as much as possible of what makes a good endpoint into the process of authoring one. We didn’t want to find ourselves in the same position a year from now!

What we did NOT want.

Samsara already has a strong culture of writing design docs for large new changes, so we created an API-specific design template for new endpoints that captures as much of our new design ethos as possible. We built a glossary of common query parameters and added tests that lint for some of the rules. We worked closely with teams as they designed and implemented new endpoints, and we continue to gather regular feedback on what could improve their experience.
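
To give a flavor of those lint tests, here’s a hedged sketch of one possible rule: enforcing camelCase query parameter names so a groupId/groupID mismatch can’t ship again. The rule and the hard-coded parameter list are invented for illustration.

```go
package apilint

import (
	"regexp"
	"testing"
)

// camelCase: lower-case start, no underscores, and no all-caps runs,
// so "groupId" passes while "groupID" and "group_id" fail.
var camelCase = regexp.MustCompile(`^[a-z]+(?:[A-Z][a-z0-9]+)*$`)

func TestQueryParamNaming(t *testing.T) {
	// In a real test these would be loaded from the endpoint definitions.
	params := []string{"groupId", "startTime", "endTime"}
	for _, p := range params {
		if !camelCase.MatchString(p) {
			t.Errorf("query param %q is not camelCase", p)
		}
	}
}
```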

Even with these new changes in place, we’re not done yet! We have dreams of releasing more redesigned endpoints, and we’re already seeing increased adoption of the ones we’ve shipped.

Some things to think about when designing an endpoint.

What’s next?

So what’s on the horizon for my team? We’re keeping an eye on our API reliability and performance, using our new monitors and the per-customer view to identify when individual customers are experiencing pain.

Now that the first generation of endpoints has been designed with our new framework, we’re using feedback from those teams to further improve the process of designing APIs at Samsara. And we’re expanding the surface of our redesigned APIs to solve more customer use cases.

Finally, we really want to encourage customers to adopt our more performant endpoints, so we’re working on putting out SDKs and getting feedback from our Technical Support Engineers. Take a peek at our new docs!

The team!

And here’s a link to the Oh No comics I love.

Learn more about us and stay on top of Samsara Engineering happenings by following us on Facebook! We’re always looking for great people to join us as we learn and grow together, and if you love learning and building things in a highly collaborative environment, we’d love to hear from you! 👋
