3rd Party Integrations - Lessons Learned

Published in Wix Engineering, Jan 21, 2017

Over the past 3 years, I have been developing a web-based billing system that operates at a massively growing scale. The system uses over a dozen 3rd party services, and the list is growing.

A big part of current day software engineering is integrating 3rd party service providers into your own service, especially in a SaaS world.

There is always the dilemma: should I solve a specific problem by using a 3rd party library or service, or should I code my own solution? There are pros and cons to each possibility. Sometimes business requirements push you towards using a 3rd party.

Anyway, here are some lessons I’ve learned about 3rd party integrations.

Resilience

From my experience as part of a development team, every time we introduce more code to our system we increase its complexity. The effect of this grows once we introduce a 3rd party provider into our code, whether it’s a library or a service. Every 3rd party provider is a black box, so in fact every integration increases the complexity of our system.

Their bugs are your bugs. Own that shit.

In the web ecosystem there is no greater failure than downtime. That’s why we’re tasked with building resilient systems with the ability to bounce back from failures, even failures of 3rd party services. How do we do that? That’s a topic for a follow-up post, but here’s an overview.

Defensive Programming

Our service is only as resilient as the weakest dependency in its chain. We should keep this in mind every time we integrate a 3rd party service provider.

At first, I am navigating through the mist of uncertainty, trying to grow my confidence in the service. During this process, I learned not to make any assumptions. Every API endpoint is a potential point of failure until my confidence grows.

There is that famous quote about insanity:

“Insanity: Doing the same thing over and over again and expecting different results.”

When it comes to 3rd party providers, it is completely possible to get different outputs for a given input.

Don’t get me wrong, my confidence should also be backed up by a proper alerting mechanism that monitors the 3rd party provider.

For example, I integrated a payment provider that uses a string status value in its DTO (Data Transfer Object). According to the API documentation, the possible values are “success”, “failure” and “pending”.

At first, I raised an alert for any status value that differs from these 3 options. After opening the integration to users, this defensive precaution helped me catch a rare scenario. The payment status was listed as “success-updated”, which represents payments that were marked as failures at first but were automatically changed to “success-updated”. However, this status wasn’t listed as a possible value in the API, so the alert helped me prevent data discrepancies.
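A minimal sketch of that kind of defensive check, in Python. The names `PaymentStatus`, `parse_status` and the `alert` callback are assumptions made for illustration; they are not the provider's API or the original implementation.

```python
from enum import Enum
from typing import Callable, Optional


class PaymentStatus(Enum):
    # The three values the provider's API documents.
    SUCCESS = "success"
    FAILURE = "failure"
    PENDING = "pending"


def parse_status(raw: str, alert: Callable[[str], None]) -> Optional[PaymentStatus]:
    """Map the provider's raw status string to a documented status.

    For an undocumented value, call the (hypothetical) alert hook and return
    None so the payment can be held for review instead of being misclassified.
    """
    try:
        return PaymentStatus(raw)
    except ValueError:
        alert(f"Undocumented payment status from provider: {raw!r}")
        return None
```

With this in place, a value such as “success-updated” triggers the alert instead of silently flowing into the billing records.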

MVP - ASAP!

Deliver usable products to allow learning to take place. Illustration by Henrik Kniberg

I ship my MVP code as soon as possible to start exposing it to users. This enables me to start growing my confidence and test whether my assumptions hold water.

The key to continuously improving integration with a 3rd party service provider is feedback. We should aspire to make the feedback loop as quick and efficient as possible. I have learned that building Minimum Viable Products plays an integral role in shortening the feedback loop.

For example, I integrated a specific offline payment provider which is based on payment slips. These slips are paid at kiosks, which is very common in Latin America. We shipped the most basic functionality to our users and started getting feedback. We realized the feature wasn’t as successful as expected, and we canceled the next phase of development. Deploying an MVP helped us test our assumptions and avoid wasting dev time.

Composition Over Inheritance

There are a few reasons why designing your system by composing functionality is preferred over inheriting it. The main reason is gaining flexibility.

I learned this lesson the hard way. I once tried to refactor a base class and it triggered multiple changes down the inheritance tree. This created a huge mess, wasted precious time and increased the risk of the change.

From my experience, software that uses a 3rd party service provider tends to change frequently. Unforeseen edge cases, at least in the beginning of the integration, will undoubtedly make me change my code over and over again. That’s why I prefer composition, to make it as easy as possible to introduce fixes to my design.

Another lesson I learned is to write code which is loosely coupled to the 3rd party service. This enables the removal of the integration when it’s no longer required or needs to be replaced.
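Here is a rough Python sketch of what composing a loosely coupled integration could look like. `PaymentGateway`, `AcmePayGateway` and `BillingService` are hypothetical names invented for this example, and the adapter returns a made-up transaction id instead of calling a real provider.

```python
from dataclasses import dataclass
from typing import Protocol


class PaymentGateway(Protocol):
    """The only surface the billing code knows about (hypothetical)."""

    def charge(self, amount_cents: int, currency: str) -> str:
        """Charge the customer and return a provider transaction id."""
        ...


class AcmePayGateway:
    """Adapter for an imaginary provider; provider-specific code lives here."""

    def charge(self, amount_cents: int, currency: str) -> str:
        # A real adapter would call the provider's SDK or HTTP API here.
        return f"acme-{currency}-{amount_cents}"


@dataclass
class BillingService:
    # The gateway is composed in rather than inherited, so it can be
    # replaced or removed without touching a class hierarchy.
    gateway: PaymentGateway

    def bill(self, amount_cents: int, currency: str = "USD") -> str:
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        return self.gateway.charge(amount_cents, currency)
```

Swapping providers then means writing another adapter with the same `charge` method and passing it to `BillingService`; the billing logic itself stays untouched.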

TDD — Red Green Refactor

The Cycles of TDD. Illustration by Robert C. Martin (Uncle Bob)

Coding your system using the TDD methodology will drive you towards a loosely coupled design, where you feel safe introducing changes quickly and with low risk. This is especially relevant when integrating 3rd party services.

Here are some benefits of using TDD when it comes to 3rd party integrations:

Using TDD will drive your learning of the 3rd party API.
Using TDD will make it easy to maintain and refactor your integration code.
Using TDD will reduce the risk of upgrading versions of 3rd party providers since you have good test coverage.
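To make this concrete, here is a small Python sketch of tests written against a test double instead of the live provider. The VAT lookup, its response shape and the names `FakeVatProvider` and `is_vat_valid` are all assumptions made up for this example.

```python
import unittest


class FakeVatProvider:
    """Test double standing in for a 3rd party VAT service (hypothetical)."""

    def __init__(self, responses):
        self.responses = responses

    def lookup(self, vat_number: str) -> dict:
        return self.responses[vat_number]


def is_vat_valid(provider, vat_number: str) -> bool:
    """Integration code under test: only an explicit True counts as valid."""
    response = provider.lookup(vat_number)
    return response.get("valid") is True


class VatValidationTest(unittest.TestCase):
    def test_valid_number_is_accepted(self):
        provider = FakeVatProvider({"IE123": {"valid": True}})
        self.assertTrue(is_vat_valid(provider, "IE123"))

    def test_missing_valid_field_is_rejected(self):
        # An imagined edge case: the response omits the "valid" field.
        # In a red-green cycle this test is written before the handling code.
        provider = FakeVatProvider({"IE999": {}})
        self.assertFalse(is_vat_valid(provider, "IE999"))
```

Each newly discovered edge case becomes one more failing test first, which is how the tests end up documenting what I have learned about the provider.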

The Wrong Abstraction — Rule of Three

Don’t be quick to create abstraction layers, as explained in Sandi Metz’s post, The Wrong Abstraction.

“Duplication is far cheaper than the wrong abstraction”.

As a general rule of thumb, I try to wait for the third duplication of service integration code before introducing an abstraction. This reduces the chances of creating the wrong abstraction. Bad abstractions are hard to change without affecting all service integrations.

Self-Healing and Reconciliation

*Image: Terminator T1000 opening head to avoid bullet (TriStar Pictures)*

I try to write self-healing integrations to boost the resilience of my software and reduce manual work required for disaster recovery.

You always need to be prepared for failure.

As in all HTTP-based integrations, some requests fail, leaving us in an inconsistent state. For example, when a payment request gets a ReadTimeoutException, the payment might have failed or it might have succeeded. At first, I handled such timeouts by processing reports which were checked manually. Later on, I added a self-healing mechanism that automatically reconciles payments against our records. This enables us to recover from network failures automatically.
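A simplified Python sketch of such a reconciliation pass follows. It assumes that timed-out payments are stored with an “unknown” status and that the provider offers some way to look up a payment's final status; `fetch_provider_status` is a hypothetical stand-in for that lookup, not a real API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Payment:
    payment_id: str
    status: str  # "success", "failure", or "unknown" after a timeout


def reconcile(
    payments: Dict[str, Payment],
    fetch_provider_status: Callable[[str], Optional[str]],
) -> int:
    """Resolve payments left in an unknown state; return how many were fixed.

    fetch_provider_status is a hypothetical lookup returning the provider's
    final status for a payment id, or None if there is no answer yet.
    """
    resolved = 0
    for payment in payments.values():
        if payment.status != "unknown":
            continue  # our record is already consistent
        provider_status = fetch_provider_status(payment.payment_id)
        if provider_status in ("success", "failure"):
            payment.status = provider_status
            resolved += 1
        # Otherwise leave it unknown so the next scheduled run retries it.
    return resolved
```

Running a job like this on a schedule means a timeout only delays the final status instead of requiring someone to go through a report by hand.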

Circuit Breaker and Cascading Failures

I use a circuit breaker as a simple state machine to monitor errors of a given protected code segment. Once the number of failures reaches a pre-determined threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error. This prevents the protected call from being made. When the predefined “back off” time interval concludes, the circuit breaker resets and the protected code segment is made available for execution again.

To prevent the errors of 3rd party dependencies from affecting the health of my entire system, I learned that it’s a good idea to wrap the dependencies with circuit breakers. This way, failures will not cascade to other segments of my system, and resilience will increase.

For example, the billing system I’m developing uses a 3rd party VAT validation service, and I noticed that VAT validation requests sometimes time out due to very long response times. I added a circuit breaker in front of the VAT validation service and configured a maximum number of failures, so the validation won’t execute if the service isn’t healthy. After the back-off period, the circuit breaker resets and re-evaluates the health of the VAT validation service. This way my system will fail fast, prevent cascading failures and automatically recover.
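The state machine described above can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the implementation used in the billing system: it counts consecutive failures, is not thread-safe, and takes an injectable `clock` only to make it easy to test.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the protected call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures: int, backoff_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.backoff_seconds = backoff_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    def call(self, protected: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.backoff_seconds:
                raise CircuitOpenError("failing fast; dependency is unhealthy")
            # Back-off elapsed: reset and let calls through again.
            self.opened_at = None
            self.failures = 0
        try:
            result = protected()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # a success clears the failure count
        return result
```

A caller would wrap the remote request, for example `breaker.call(lambda: vat_client.validate(number))` with a hypothetical `vat_client`, and treat `CircuitOpenError` as “validation unavailable right now”.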

Feature Toggles

A feature toggle is a powerful software pattern that allows teams to modify system behaviors without changing code.

Similar to circuit breakers, a feature toggle can give you more active control over the service providers you integrate. It enables you to switch on/off the usage of a specific service, thus gaining a few benefits:

Advance preparation for known downtimes, like the maintenance window of a specific provider.
Flexibility to adapt to new business requirements without deploying new code.
Improved design, because the mere existence of the feature toggle drives your code to prepare for the option of an unhealthy service.
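As an illustration, here is a minimal Python sketch of a toggle guarding a 3rd party call. The in-memory `FeatureToggles` store, the `vat_validation` toggle name and the `remote_check` callback are assumptions for this example; a real system would typically read toggles from configuration or a database so they can change at runtime.

```python
from typing import Callable, Dict


class FeatureToggles:
    """In-memory toggle store, used here only to keep the sketch runnable."""

    def __init__(self, initial: Dict[str, bool]) -> None:
        self._flags = dict(initial)

    def is_enabled(self, name: str) -> bool:
        # Unknown toggles default to off.
        return self._flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def validate_vat(vat_number: str, toggles: FeatureToggles,
                 remote_check: Callable[[str], bool]) -> str:
    """Return "valid", "invalid", or "skipped" when the provider is toggled off."""
    if not toggles.is_enabled("vat_validation"):
        return "skipped"  # the caller decides how to handle unvalidated numbers
    return "valid" if remote_check(vat_number) else "invalid"
```

Having to return something meaningful for the “skipped” case is exactly the design pressure mentioned in the last benefit: the code must already know what to do when the provider is unavailable.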

Summary

Web services still haven’t reached the same level of quality or resilience as other engineering fields, and we need to be committed to constantly improving.

Here are the major points we should keep in mind:

Resilience — build your systems to bounce back from your failures, or the failures of your 3rd party service providers.
Visibility — construct tools within your system to increase real-time visibility into the health of the system and its dependencies.
Control — design your architecture so you’ll have control over each component, making it possible to tweak its behavior while the system is running.
Long term maintenance — keep refactoring your code. Let go of fear and fine tune the behavior of every component in your system.

Hi, I’m Stas Wishnevetsky.
I’ve spent the last 6 years writing code. I currently work as a backend developer at Wix.com.
You’re welcome to follow me on Twitter.
