How not to lose sleep when integrating with a 3rd party REST API — Part 2

Dealing with network outages, data inconsistencies and random authentication errors

Published in Wix Engineering, Apr 8, 2019

In Part 1 of How not to lose sleep when integrating with a 3rd party REST API, I talked about how we designed an integration with ReliableVendor.com through their RV.com REST API, dealing elegantly with failure, and enabling us to fully test all the business logic.

In this post, I will go into detail on how we implemented the class that calls the REST API and how we dealt with the types of errors that can occur.

I’d like to get to know you a little better

So, first things first. Since we are using RV.com to manage transactions on behalf of our users, the first thing we needed to do was to create an account in RV.com through which we’ll manage these transactions. Before coding anything we wanted to test out this REST call against RV.com.

First, we did not want to create too many test accounts in the RV.com production environment, so we were given access to their parallel testing environment, where we could fire requests freely.

Even better, the RV.com API documentation conveniently linked to a downloadable Postman collection, which greatly eased crafting HTTP requests against their API in the test environment.

Through playing with the API in Postman, we acquired a much better understanding of the input to the RV.com API, what is returned, and the types of errors.

Using Postman was substantially more convenient than coding testing logic and going through our whole deployment lifecycle to test some behavior.

For more advanced research scenarios, for example when we needed to simulate account creation for all supported countries, scripting with Python proved more convenient. These scripts proved valuable not just as a way to do repetitive testing, but also as a great way to collaborate on technical issues when we encountered unexpected behavior in the REST API in more complex scenarios. Since Python is widely known across tech companies, we simply shared our Python scripts with RV.com tech support, who could run them locally straight away.
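As an illustration, a country sweep of this kind might look like the sketch below. The country list, payload field names and the `fake_send` transport are all assumptions made so the example runs offline; a real script would send each payload to the vendor's test environment instead.

```python
import json
from typing import Callable, Dict, List, Tuple

# Hypothetical values: the real country list and payload fields come from the vendor docs.
COUNTRIES = ["US", "GB", "DE"]


def build_create_account_payload(country: str) -> Dict[str, str]:
    """Build the JSON body for one create-account call (field names are assumed)."""
    return {"country": country, "accountName": f"test-{country.lower()}"}


def sweep_countries(
    send: Callable[[Dict[str, str]], Tuple[int, str]],
    countries: List[str],
) -> Dict[str, int]:
    """Call `send` once per country and record the HTTP status returned for each."""
    results = {}
    for country in countries:
        status, _body = send(build_create_account_payload(country))
        results[country] = status
    return results


def fake_send(payload: Dict[str, str]) -> Tuple[int, str]:
    """Stand-in transport so the sketch runs offline; rejects one country on purpose."""
    if payload["country"] == "DE":
        return 400, json.dumps({"fault": "unsupported country"})
    return 200, json.dumps({"accountId": "acc-" + payload["country"]})


if __name__ == "__main__":
    for country, status in sweep_countries(fake_send, COUNTRIES).items():
        print(country, status)
```

Passing the transport in as a function keeps the script easy to share: the recipient only has to swap `fake_send` for a function that performs the real HTTP call.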

Component diagram for building RV.com REST integration

Test test test…

Once we had finished our initial research, it was time to start implementing the RV.com Transaction Service. A crucial piece of this was structuring our system so that we could test every line of code we were writing. To do this, we made sure that our test code would call our production logic. We would also need complete control over the responses coming out of the REST API server, whether a valid OK response, a validation error, a 400 error, or a chaotic, unexpected ‘object not found’ error. Since we couldn’t control such responses on the real RV.com server, we had to have our own RV.com REST server on which we could fake whatever behavior we desired.

Once we had our fake server, we could write tests that simulate server behavior and then validate that the Wix Accounting Service behaves correctly. Since the Wix Accounting Service holds a reference to an instance of the RV.com Transaction Service, all we had to do was point the RV.com Transaction Service to our fake server and, for each test, inject the behavior required for the HTTP requests.

Building a mock server

We chose to use an HTTP Test Kit that was part of the Wix infrastructure. Essentially, it brought up an embedded in-memory HTTP server that was initialized with a list of mappings between pre-canned requests and responses.

Initial version of IT tests using Mock HTTP Server

Even though this gave us all the capabilities that we needed at this stage, there was still quite a lot of boilerplate in creating the fake requests and responses for each test.
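To make the mapping idea concrete, here is a minimal sketch of an embedded mock server built on Python's standard library. It is not the Wix HTTP Test Kit; the `Mappings` type, the helper names and the example paths are assumptions for illustration only.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple

# (method, path) -> (status, JSON body). Paths and bodies are invented for illustration.
Mappings = Dict[Tuple[str, str], Tuple[int, str]]


def make_handler(mappings: Mappings):
    """Create a request handler class that answers from the canned mappings."""

    class Handler(BaseHTTPRequestHandler):
        def _reply(self):
            # Consume any request body before answering.
            length = int(self.headers.get("Content-Length", 0))
            self.rfile.read(length)
            # Unmapped requests get a 404 with an empty JSON object.
            status, body = mappings.get((self.command, self.path), (404, "{}"))
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        do_GET = do_POST = _reply

        def log_message(self, *args):
            pass  # keep test output quiet

    return Handler


def start_mock_server(mappings: Mappings) -> HTTPServer:
    """Start the server on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), make_handler(mappings))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call(server: HTTPServer, method: str, path: str) -> Tuple[int, str]:
    """Tiny client used by the tests to hit the mock server."""
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
    try:
        conn.request(method, path)
        response = conn.getresponse()
        return response.status, response.read().decode("utf-8")
    finally:
        conn.close()
```

A test would call `start_mock_server` with the mappings it needs, point the service under test at `server.server_port`, and call `server.shutdown()` when done.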

The first improvement was in how we crafted errors. We noticed that RV.com errors had a fixed structure: the response was always a JSON structure with 4 relevant fields (code, number, severity and fault). We captured these in a case class, RVErrorResponse, and changed our API to allow creating error responses in a simpler manner.
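A rough Python equivalent of that idea is sketched below, with a dataclass standing in for the Scala case class. The field types, the `error_response` helper, the 404 status and the example values are assumptions; only the four field names come from the text.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class RVErrorResponse:
    # The four fields named in the text; the types are assumptions.
    code: str
    number: int
    severity: str
    fault: str


def error_response(status: int, error: RVErrorResponse) -> Tuple[int, str]:
    """Render an error as the (status, JSON body) pair a mock server would return."""
    return status, json.dumps(asdict(error))


# Example values are invented; the status RV.com actually used is not stated.
account_not_found = error_response(
    404,
    RVErrorResponse(code="AccountNotFound", number=1001, severity="ERROR", fault="Client"),
)
```

With a helper like this, a test states only the error it cares about instead of hand-writing the JSON body each time.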

Secondly, we created factory methods for creating requests for each type of REST call. Whether the API call was a GET or a POST, and whatever entity was being passed, we had a simple domain model to help us build requests.
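The request factories could be as small as the following sketch. The `FakeRequest` type, the paths, the HTTP verbs and the body fields are hypothetical; the point is only that each kind of call gets one named constructor.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FakeRequest:
    """A request the mock server should expect (shape is an assumption)."""
    method: str
    path: str
    body: Optional[Dict[str, str]] = None


def create_account_request(country: str) -> FakeRequest:
    """Expected request for creating an account (path and body are assumed)."""
    return FakeRequest("POST", "/accounts", {"country": country})


def edit_account_request(account_id: str, changes: Dict[str, str]) -> FakeRequest:
    """Expected request for editing an existing account."""
    return FakeRequest("POST", f"/accounts/{account_id}", dict(changes))


def get_account_request(account_id: str) -> FakeRequest:
    """Expected request for reading an account; GET calls carry no body."""
    return FakeRequest("GET", f"/accounts/{account_id}")
```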

Now, it was a whole lot simpler to create new tests for a bunch of scenarios.

This capability was extended to enable faking the request/response for other REST APIs in RV.com so that we could fully test the multi-stage account creation mentioned in the previous post.

Summing up so far

We were now at the stage where we could go to production: we had researched the APIs well enough to code against them, and we had implemented the service that calls the API in such a way that we could test it under different scenarios.

As we developed the solution we saw that we had a nice little lifecycle going:

Mission Accomplished? Unfortunately not

After releasing to a small subpopulation of our users, we started seeing some really strange behaviour. Just a refresher: we were creating an account in 3 stages. First we create the account, then we edit the newly created account, and then we activate it.

So, back to our terrifying bug! We were seeing that the second call, in which we call editAccount, was failing with an AccountNotFound error code. This was not happening consistently enough to identify the cause easily.

On calls with RV.com support, we were met with blank faces. No external integrator had used this API yet. It was used solely in the RV.com admin portal and they had no idea what was wrong.

After this call, we realized we had to let the RV.com tech team see the problem for themselves. We quickly threw together a Python script that called the create account REST API followed by the edit account REST API. By repeating this pair of calls many times, we managed to reproduce the error, so we could share the script with RV.com to help them understand the bug and work with us on a workaround. Whenever edit account failed, we retried to see whether it failed forever or succeeded after a while. The results showed that after about 5–6 retries, the edit account request usually succeeded.
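The core of such a reproduction script can be sketched as follows. The function names are hypothetical, and the two API calls are passed in as plain callables so the example runs without network access; `make_flaky_edit` is a stand-in that imitates the failure pattern rather than the real RV.com behavior.

```python
from typing import Callable, List, Optional


def attempts_until_edit_succeeds(
    create_account: Callable[[], str],
    edit_account: Callable[[str], bool],
    max_attempts: int = 10,
) -> Optional[int]:
    """Create an account, then retry the edit until it succeeds.

    Returns the 1-based attempt that succeeded, or None if every attempt failed.
    """
    account_id = create_account()
    for attempt in range(1, max_attempts + 1):
        if edit_account(account_id):
            return attempt
    return None


def run_trials(
    create_account: Callable[[], str],
    make_edit: Callable[[], Callable[[str], bool]],
    trials: int,
) -> List[Optional[int]]:
    """Repeat the create-then-edit experiment and collect the attempt counts."""
    return [
        attempts_until_edit_succeeds(create_account, make_edit())
        for _ in range(trials)
    ]


def make_flaky_edit(failures_before_success: int) -> Callable[[str], bool]:
    """Stand-in for the edit call: fails a fixed number of times, then succeeds."""
    calls = {"count": 0}

    def edit(account_id: str) -> bool:
        calls["count"] += 1
        return calls["count"] > failures_before_success

    return edit
```

A real script would replace the stand-ins with HTTP calls and would probably pause between retries, which this sketch leaves out.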

We then sent this script to RV.com and they had a big Aha moment when they realized what the cause of the problem was. The suggestion was that the RV.com server cluster was not reflecting a consistent view of created accounts.

To help understand this, let’s assume that RV.com runs on 3 servers: Server A, Server B and Server C. Each call to the REST API is routed to only one of these servers. Let’s say our createAccount call hits Server A. That means that our account is created on Server A and that information is stored on Server A. When we call editAccount, we might hit Server A, and that will be fine, since the account that we are accessing exists there. However, we might hit Server B or C, on which the account is not yet present, resulting in an AccountNotFound error. Account creation was eventually consistent, i.e. after a short time (a number of minutes in this case), all servers in RV.com reflected the newly created account.
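The scenario can be modelled in a few lines. This is a toy model of the explanation only, not a description of how RV.com is actually built; the `Cluster` class and its methods are invented for illustration, and routing is made explicit so the outcome is deterministic.

```python
from typing import Dict, Iterable, Set


class AccountNotFound(Exception):
    """Raised when a server has not yet seen the account."""


class Cluster:
    """Toy cluster in which each server keeps its own view of the accounts."""

    def __init__(self, server_names: Iterable[str]):
        self.accounts: Dict[str, Set[str]] = {name: set() for name in server_names}

    def create_account(self, server: str, account_id: str) -> None:
        # The new account is initially visible only on the server that handled the call.
        self.accounts[server].add(account_id)

    def edit_account(self, server: str, account_id: str) -> None:
        if account_id not in self.accounts[server]:
            raise AccountNotFound(account_id)

    def replicate(self) -> None:
        # After replication, every server knows every account.
        everything = set().union(*self.accounts.values())
        for known in self.accounts.values():
            known |= everything
```

Creating on one server and editing on another fails until `replicate()` has run, which mirrors the window in which the retries were needed.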

As RV.com started working to fix this bug, we started looking into how we could plug this issue as fast as possible. First things first, we wanted to be able to reproduce this bug. Essentially, we wanted to be able to inject behavior such that editAccount would sometimes fail and sometimes succeed. Once we could reproduce the bug in our test environment, we could confidently start to fix it.

Testing chaotic behavior

We decided to add the following behaviour to our mock server. We created a subclass of HttpResponse, called HttpResponseList, that held a list of HttpResponse objects instead of always returning the same response. Each time the mock server received an HTTP request that matched this handler, it selected the next HTTP response in the list.
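The sequencing behavior described above can be sketched like this. The `ResponseList` class is a hypothetical stand-in rather than the actual Wix class, responses are plain (status, body) pairs, and repeating the last response once the list is exhausted is an assumption the article does not cover.

```python
from typing import List, Tuple

Response = Tuple[int, str]  # (status, JSON body)


class ResponseList:
    """Returns the next canned response each time a matching request arrives."""

    def __init__(self, responses: List[Response]):
        if not responses:
            raise ValueError("at least one response is required")
        self._responses = list(responses)
        self._next = 0

    def next_response(self) -> Response:
        # Assumption: once the list is exhausted, keep returning the last response.
        index = min(self._next, len(self._responses) - 1)
        self._next += 1
        return self._responses[index]


# Two AccountNotFound failures followed by a success (bodies are invented).
edit_account_responses = ResponseList([
    (404, '{"code": "AccountNotFound"}'),
    (404, '{"code": "AccountNotFound"}'),
    (200, '{"status": "edited"}'),
])
```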

In this manner, we could write tests that reproduced this chaotic behavior with great ease.

With the hard part out of the way, we could now easily implement a fix. Essentially, each call on accounts (edit account and activate account) was wrapped in a retry mechanism that retries the call when it fails with an AccountNotFound error. This retry mechanism was implemented internally within the RV.com Transaction Service, and was deployed rapidly to production.
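A retry wrapper of this kind might look like the sketch below. The function name, the attempt limit and the delay are assumptions, since the article does not give the values used in production; the `sleep` parameter exists only so tests can skip the waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class AccountNotFound(Exception):
    """Stand-in for the AccountNotFound error code returned by the API."""


def retry_on_account_not_found(
    call: Callable[[], T],
    max_attempts: int = 6,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying only when it raises AccountNotFound.

    Any other exception propagates immediately; after the final failed
    attempt the AccountNotFound error is re-raised to the caller.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except AccountNotFound:
            if attempt == max_attempts:
                raise
            sleep(delay_seconds)
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Retrying only on AccountNotFound keeps genuine failures, such as validation errors, from being masked by the workaround.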

So where do we stand?

After thrashing out the challenges of testing the real behavior of the RV.com API and tackling real bugs from the integration, we have reached the stage where we have a pretty robust solution: we were able to research the API and to deal with somewhat chaotic bugs. In addition, we’ve got the infrastructure to deal with new unexpected bugs and new requirements that could come our way.

So what’s left? To boost our confidence in the integration, we really needed to be able to monitor activity and get alerts if things break.

In the next and final post, I’ll share how we added monitoring and alerts to this integration, and strategies for dealing with a complete outage of the RV.com API.