Performance: Killing an API and Rebuilding the Service Yourself

How do you determine when you need to replace a third-party tool with a custom solution built by your own engineering team?

This is a question that software engineers have undoubtedly asked themselves many times throughout their careers. When building software and iterating quickly, it’s necessary to use third-party services, libraries, and APIs to solve specific tasks; otherwise you’d constantly reinvent the wheel and never get any work done. The trade-off is that you don’t have full control over the service, but your team gets to focus on larger, more important software and product issues.

Experienced engineers know that there is a point at which a third-party service won’t cut it anymore, and a custom solution designed specifically for their requirements needs to be built internally. But how do engineers decide when a third-party service needs to be rebuilt?

It reduces to a question of performance. At some point the service being used isn’t capable of doing what your software needs it to do, so the only sustainable, long-term option is to rebuild the service with the necessary optimizations internally.

Observing Inefficiencies

Because software can get so complex and its dependencies so intertwined, it’s important to build fault-tolerant, resilient systems that can handle the occasional but inevitable bugs, errors, outages, and network delays.

At a certain point it becomes difficult to improve, or even maintain, the performance of your software if a service or library that it relies on is being pushed to its boundaries. It will reach a point where it can’t handle the scale or volume of your software anymore, or it’ll become difficult to build new features because the service will act as a blocker in some way.

It’s often beneficial to start with a third-party service for a given task, run it in production, and observe its performance. If needed, you can then rebuild it internally (or build on top of it) with the specific optimizations and features you need. A few examples of companies building tools internally:

  • Amazon building a key-value store because of past performance issues.
  • Uber building a mapping service because of issues with Google Maps.
  • Netflix building a caching service as an improvement over memcached.
  • Netflix building on top of React by creating React-Gibbon.
  • Facebook creating Yarn with improvements over npm.
  • LinkedIn building Kafka because ActiveMQ was difficult to scale.
  • Facebook improving parts of the PHP language by creating Hack.
  • Google building TensorFlow as an improvement over DistBelief.
  • Airbnb building their own payments platform after experimenting with other services.

By first using a popular third-party service as a component in your own software, you observe how it functions and learn what its positive and negative features are. In doing so, you’re able to identify exactly how the service should be optimized to perform better within your own software.

Rebuilding a Video Streaming Server

Haggai Weiser, VP of Engineering at Alpha, encountered exactly this type of issue when his engineering team was building a video streaming application that had to record a user’s screen and the audio from their microphone while they were taking a test online. Once the user was finished, the software would process the two media streams and create a final media file.

To get something up and running, the engineering team decided to use a leading third-party service that focused on WebRTC and video streaming. The process of incorporating the service into their software was fairly straightforward and only took a few days. A few weeks after pushing the entire tool to production, it was processing several hundred videos per week. But soon after, the team ran into an issue that rendered almost half of the videos unusable.

We found that over 40% of our videos were out of sync by several seconds. This rendered a lot of videos unusable because you would clearly see a delay when the user was interacting with the prototype and talking.

The server that was processing the media streams wasn’t syncing the video and audio files correctly, so the first thing the engineering team did was analyze the network statistics that were recorded on the third party’s servers.

It was clear exactly when the video and audio streams got out of sync — it was right around the point where latency and packet loss spiked in the graphs. When the engineering team emailed someone from the company about this issue, they were told the following:
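As a rough illustration of that kind of analysis, the sketch below flags the moments in a recording where latency or packet loss crosses a threshold. The sample data, the field names, and both thresholds are made up for the example; the article does not describe the vendor's actual statistics format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NetworkSample:
    """One hypothetical reading from a session's network statistics."""
    t_seconds: float        # time since the recording started
    latency_ms: float       # measured latency
    packet_loss_pct: float  # percentage of packets lost


def find_spikes(samples: List[NetworkSample],
                latency_limit_ms: float = 300.0,
                loss_limit_pct: float = 5.0) -> List[float]:
    """Return the timestamps where either metric exceeds its (assumed) limit."""
    return [
        s.t_seconds
        for s in samples
        if s.latency_ms > latency_limit_ms or s.packet_loss_pct > loss_limit_pct
    ]


# Illustrative data only: one obvious spike ten seconds in.
samples = [
    NetworkSample(0.0, 80.0, 0.1),
    NetworkSample(5.0, 95.0, 0.3),
    NetworkSample(10.0, 620.0, 12.0),
    NetworkSample(15.0, 110.0, 0.4),
]

spikes = find_spikes(samples)
print(spikes)
```

Comparing those timestamps against the point where audio and video visibly drift apart is one way to confirm that the two are related.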

A possible solution to mitigate the issue a bit would be to do some post-processing on the recorded files by rescaling the timestamps of the streams by using the durations indicated in the JSON manifest file.
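One way to read that suggestion is sketched below: stretch or compress a stream's timestamps so that its total length matches the duration listed in the manifest. The manifest layout (`streams`, `kind`, `duration_ms`) and the timestamp values are assumptions for illustration, not the vendor's real format.

```python
import json
from typing import List

# Hypothetical manifest; the real file's structure is not given in the article.
manifest = json.loads("""
{"streams": [
  {"kind": "video", "duration_ms": 60000},
  {"kind": "audio", "duration_ms": 60000}
]}
""")


def rescale_timestamps(timestamps_ms: List[float],
                       target_duration_ms: float) -> List[float]:
    """Linearly rescale timestamps so the stream spans target_duration_ms."""
    start = timestamps_ms[0]
    recorded = timestamps_ms[-1] - start
    if recorded <= 0:
        return list(timestamps_ms)  # nothing meaningful to rescale
    factor = target_duration_ms / recorded
    return [start + (t - start) * factor for t in timestamps_ms]


# Illustrative audio stream that came out three seconds short.
audio_ts = [0.0, 20000.0, 40000.0, 57000.0]
audio_target = next(
    s["duration_ms"] for s in manifest["streams"] if s["kind"] == "audio"
)
rescaled = rescale_timestamps(audio_ts, audio_target)
```

A uniform rescale like this spreads the correction evenly across the whole file, so it cannot fully repair a desync that begins at one specific moment in the recording.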

They didn’t really explain why this network issue was happening, so the Alpha engineering team wasn’t sure whether it was a bug somewhere in their system or whether they were simply pushing their server to its limit. After a few days of trying to work around or patch the network issues, Haggai decided it would be easier to write their own server to process the media streams, especially because the tool would soon be processing over a thousand videos per week.

The engineering team at Alpha ended up writing their own server in Node.js that focused specifically on processing and then merging incoming media streams that were recorded in the user’s browser. It took a few days to build it to their specifications, but once they did, it was worth it.

After replacing the third-party service with our own server, less than 3% of videos failed, which is a major improvement from the 40% failure rate per week we kept hitting. We built the server and optimized it specifically to cater to our needs.

Understanding the Trade-offs

An experienced engineer is able to weigh the different options and determine how each option will benefit the company and product. If full control over the service is required, then it will have to be built in-house by the engineering team at the cost of time and resources. But if a simple solution is desired in order to iterate on the product quickly, then choosing a third-party service may be the better choice.

The benefit of starting with a third-party service, though, is that the engineering team can focus on building more important features for the product while keeping an eye on the service and observing how it functions. It can always be rebuilt later, optimized specifically for the product.
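Keeping that later rebuild cheap is easier if the rest of the application talks to the service through a thin interface. The following is a minimal sketch of that idea, not a description of how Alpha structured their code; every class and function name here is hypothetical, and the method bodies are placeholders.

```python
from typing import Protocol


class MediaProcessor(Protocol):
    """What the application needs, regardless of who provides it."""

    def merge(self, video_path: str, audio_path: str) -> str:
        ...


class VendorProcessor:
    """Placeholder for a wrapper around a third-party service."""

    def merge(self, video_path: str, audio_path: str) -> str:
        # A real wrapper would call the vendor's API here.
        return f"vendor:{video_path}+{audio_path}"


class InHouseProcessor:
    """Placeholder for a custom server built later."""

    def merge(self, video_path: str, audio_path: str) -> str:
        # A real implementation would do its own syncing and merging here.
        return f"inhouse:{video_path}+{audio_path}"


def finish_recording(processor: MediaProcessor,
                     video_path: str, audio_path: str) -> str:
    """Application code depends only on the interface, not the provider."""
    return processor.merge(video_path, audio_path)


# Swapping providers is a one-line change at the call site.
first = finish_recording(VendorProcessor(), "screen.webm", "mic.webm")
later = finish_recording(InHouseProcessor(), "screen.webm", "mic.webm")
```

With this shape, replacing the vendor means writing one new class rather than touching every place that handles recordings.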
