Navigating the API Error landmines

Published in The Startup · 10 min read · Apr 17, 2020

An API is truly successful in an enterprise only when it is designed to handle errors reliably and gracefully. Many APIs are built only for the success scenarios, or handle error scenarios so haphazardly that they feel like landmines that can go off at any time, leaving the team scrambling to provide support when things go wrong. Why is it so hard to implement a successful and reliable error strategy in an organization? Is there a simple way to get started?

API Error handling fundamentals
Anatomy of an API Error Handling Strategy
API Response Error Structure
Support Error Structure
Enterprise Event Management Tool
Documentation for the API consumer
Documentation for the support team
Final Thoughts

API Error handling fundamentals

Before we get started with the enterprise error handling process, let's get some fundamentals out of the way so that we are on the same page.

API vs Enterprise API

In any integration scenario, interaction between an API consumer and API provider happens one on one.

But in an enterprise, an API does not stand alone. There will be more than one API, and they come in different sizes, functionality and trust levels. So, while designing and implementing APIs, we should always think of the Enterprise API rather than a single API, so that all APIs in the organization look and behave in the same reliable fashion.

Enterprise API is the progenitor of all APIs within an enterprise. It stands to reason that API error handling also should stem from the Enterprise API!

In this article, wherever API is mentioned, it means Enterprise API unless specifically stated otherwise.

API error handling considerations

Whenever we talk about error handling in general, two types of errors immediately come to mind:

Technical errors — like Network being down, database not available, etc
Business errors — like Validation error from the business applications, authorization issues, etc

But when we look at API error handling, we should also have the following actors as additional considerations:

API Consumer: Many consumers assume that an API abstracts its implementation logic, so most errors should be abstracted from the consumer too (unless there is a specific need to expose them). For example, if the error is due to a connectivity issue with system XYZ, you will probably not want to tell the consumer that XYZ failed, only that there has been a connectivity issue.
Support Team: Although the support team is not the primary caller of the API, it provides run-time support for each API when things go wrong. For example, if an API fails due to a connectivity issue, the support team needs to know that the issue is with system XYZ and not ABC. Even though the error response does not contain these details, the team needs access to further information to help sort out the issue.

Last but not least, we need to understand the type of API we are building — especially from the point of view of who is responsible for errors in the API:

API Consumer responsible API: APIs in which the API consumer is expected to handle most of the errors (by retrying or by correcting the request). For example, if you call a customer creation or query API and it fails, the responsibility is on the consumer to handle the errors (invalid post codes, wrong data types, network errors, etc.) and try again. The API owner can investigate and fix issues on their side, but handling most errors falls within the scope of the API consumer.
API Publisher responsible API: APIs in which the API publisher takes ownership of the errors. For example, a payment API should take ownership of all errors once a payment is triggered, and the API consumer is not expected to handle them (for example, no retry).
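To make the distinction concrete, here is a minimal consumer-side sketch in Python. It assumes a hypothetical `call_with_retry` helper and uses `ConnectionError` as a stand-in for any retryable technical error. The consumer retries only when the API is consumer responsible and makes a single attempt otherwise.

```python
import time
from typing import Callable


def call_with_retry(
    call: Callable[[], dict],
    consumer_responsible: bool,
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> dict:
    """Invoke `call`; retry only when the consumer owns error handling."""
    # Publisher responsible APIs (e.g. payments) get exactly one attempt.
    attempts = max(1, max_attempts) if consumer_responsible else 1
    last_error: Exception = RuntimeError("call was never attempted")
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError as exc:  # assumption: stand-in for a retryable error
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    raise last_error
```

A real consumer would usually add backoff and retry only on errors the API documentation marks as retryable.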

Now that we know the considerations we should have in mind to design an Enterprise API Error strategy, how do we go about designing and implementing such a system properly?

Anatomy of an API Error Handling Strategy

Enterprises should have a central error handling strategy in place that helps developers and consumers tackle errors in a uniform way.

Let’s see a couple of examples:

Oracle Database: Oracle always communicates errors to the user in the form ORA-XXXXX, which the user can look up to find out what the issue is. The error codes and descriptions are part of the database documentation, which helps users understand and deal with errors properly.
Microsoft system error codes: these are communicated to the user in case of issues. Users can search by error code to understand what they can do to rectify the issue.
Facebook Marketing API: error codes and their descriptions are defined similarly, which helps API callers clearly identify and fix issues in their API calls in a uniform manner.

The common points across these (and other similar) examples give us the building blocks of an exception handling system:

Building blocks of Enterprise Exception Handling Strategy

Common structure for error handling — both API responses used by API consumer and Support error response used by the API support team. Note that both of these are generated by the API provider itself — but with different intent
Enterprise Event Management Tool — Single source of truth for the organization to handle all sorts of events and exceptions
Run Book/Operations Book for Support Team — Support team would require more information than what is sent between the API provider and consumer (whether in the form of log messages, system under scrutiny or details of the exception) to help trace the errors
API Error Documentation — API Consumer will have the error codes documented in the API documentation to clearly provide guidance of what types of errors to expect and what to do when it happens

Let’s take a look at each of these components and come up with a very simple implementation strategy.

API Response Error Structure

API consumers only deal with the messages they send to an API and the responses they receive. If errors are handled in a common way across the organization, consumers have less work to do when designing their own error handling. The error structure can evolve over time, starting with a simple structure and strategy and growing into a more complex one.

For APIs, there are two major ways in which errors can be communicated:

HTTP status codes: these represent the standard error types and typically denote the standard (and technical) errors that occur with an API call, for example Authentication failure (401), Resource not found (404), Internal Server Error (500), etc. These also need to be defined at the enterprise level and strictly adhered to. We won't be getting into HTTP status codes in this post.
HTTP Body: this can represent both technical and business errors, providing more detailed errors with respect to an API call.

Let’s look at a very basic design for HTTP Body structure to start building the enterprise error handling strategy.

{
  "code": "ERROR_CODE",
  "type": "ERROR_TYPE", 
  "message": "ERROR_MESSAGE",
  "details": []
}

Essential sections for the HTTP Body design:

Code — represents the error code which could be the HTTP error code repeated (in case of technical errors) or better yet follow an ORG error code which is clearly documented somewhere
Type — useful in communicating the type of error to the caller (Connectivity, Validation, etc)
Message — useful in communicating a summary message in human understandable format — generally for display in a screen or debug scenario
Details — Providing additional details of the error in either machine readable or human readable format (and if required to send array of errors/additional details back to the caller)

Additional sections (for example, a link to the error documentation) could be added based on enterprise requirements. But the above structure, although simple, can handle most enterprise requirements quite well.
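As a rough illustration, the four-field body could be modelled as a small Python dataclass. The class name `ApiError` and the sample code `ORG-00101` are assumptions made for this sketch, not part of any standard.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, List


@dataclass
class ApiError:
    """Consumer-facing error body: code, type, message, details."""
    code: str
    type: str
    message: str
    details: List[Any] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() also converts nested ApiError entries inside `details`.
        return json.dumps(asdict(self))


# The consumer sees a generic connectivity error, not the failing system's name.
error = ApiError(
    code="ORG-00101",  # hypothetical organization error code
    type="Connectivity",
    message="A downstream connectivity issue occurred. Please try again later.",
)
body = error.to_json()
```

Keeping the structure in one shared type (or schema) is one way to make every API in the organization emit the same shape.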

For example, suppose a customer is being created through a customer API and multiple fields fail validation, and you want to return all the errors to the caller in one shot. You could return the field-level errors as an array like below, where each entry replicates the generic structure:

{
  "code": "SUMMARY_ERROR_CODE",
  "type": "SUMMARY_ERROR_TYPE", 
  "message": "SUMMARY_ERROR_MESSAGE",
  "details": [{
    "code": "SUB_ERROR_CODE1",
    "type": "SUB_ERROR_TYPE", 
    "message": "SUB_ERROR_MESSAGE",
    "details":[]
  },{
    "code": "SUB_ERROR_CODE2",
    "type": "SUB_ERROR_TYPE", 
    "message": "SUB_ERROR_MESSAGE",
    "details":[]
  }]
}

In such a structure, you can not only send field level errors in the SUB_ERROR_CODE types, you can even assign a summary error for the entire API call with SUMMARY_ERROR_CODE.
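A minimal sketch of how such a response might be assembled in Python follows. The field names, the four-digit post code rule and the `ORG-002xx` codes are all hypothetical and only serve to show sub-errors collected under one summary error.

```python
from typing import Dict, List


def validate_customer(payload: Dict[str, str]) -> List[dict]:
    """Collect every field-level error instead of stopping at the first one."""
    errors: List[dict] = []
    if not payload.get("name"):
        errors.append({"code": "ORG-00201", "type": "Validation",
                       "message": "name is required", "details": []})
    post_code = payload.get("post_code", "")
    # Assumption for illustration: a valid post code is exactly four digits.
    if not (post_code.isdigit() and len(post_code) == 4):
        errors.append({"code": "ORG-00202", "type": "Validation",
                       "message": "post_code must be four digits", "details": []})
    return errors


def build_error_response(sub_errors: List[dict]) -> dict:
    """Wrap the field-level errors in a single summary error."""
    return {
        "code": "ORG-00200",
        "type": "Validation",
        "message": f"{len(sub_errors)} field(s) failed validation",
        "details": sub_errors,
    }


response = build_error_response(validate_customer({"name": "", "post_code": "12A"}))
```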

Trace / Correlation ID
Another common error field in most implementations is a trace Id or a message Id. Two things to keep in mind regarding these trace identifiers:
1. For trace/correlation to work properly across API calls, the caller should send the trace Id to the API as input. If they did, there is really no reason to include the trace Id in the response as well (unless you are sending an asynchronous response, in which case see point 2 below).
2. Trace Id/Correlation Id are better sent as HTTP headers than in the body. Use a custom HTTP header such as X-<ORG>-Trace-Id to carry the Id. Hyphens are safer than underscores here (as in X_<ORG>_TRACE_ID), because some servers and proxies drop header names that contain underscores by default.
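On the provider side, this could look roughly like the Python sketch below. The header name `X-ORG-Trace-Id` is an assumed, hyphenated variant of the naming above, and generating an Id when the caller sends none is a design choice of this sketch rather than something the strategy requires.

```python
import uuid
from typing import Dict, Mapping, Tuple

TRACE_HEADER = "X-ORG-Trace-Id"  # hypothetical header name


def resolve_trace_id(request_headers: Mapping[str, str]) -> Tuple[str, bool]:
    """Return (trace_id, supplied_by_caller). Header names are case-insensitive."""
    for name, value in request_headers.items():
        if name.lower() == TRACE_HEADER.lower() and value:
            return value, True
    # Assumption: fall back to a generated Id so the call can still be traced.
    return uuid.uuid4().hex, False


def response_trace_headers(trace_id: str, supplied_by_caller: bool) -> Dict[str, str]:
    """Echo the Id back only when the caller did not already know it."""
    return {} if supplied_by_caller else {TRACE_HEADER: trace_id}
```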

Support Error Structure

This is the often overlooked part of the error handling strategy, since it is not documented in the API (or in the requirements!) and should instead be part of the enterprise API support process. But it is probably more important than the API response error structure (especially for API publisher responsible APIs), because it helps the support team resolve issues smoothly.

Depending on the error handling strategy of the organization, you might log the support message as a debug/trace message (if you have a log aggregation tool like ELK or Splunk) or post the error to an external system (like a SIEM or a custom event management application). In either case, you might want to publish more information than the error message structure above, so that it is helpful to the support team, like below:

{
  "message_id": "MESSAGE_ID",
  "correlation_id": "CORRELATION_ID",
  "timestamp": "SYSTEM_TIME_STAMP",
  "http_status_code": "HTTP_STATUS_CODE",
  "url_path": "URL_PATH",
  "remote_ip": "REMOTE_IP_OR_HOSTNAME",
  "code": "ERROR_CODE",
  "type": "ERROR_TYPE",
  "system": "SYSTEM",  
  "message": "ERROR_MESSAGE",
  "details": []
}

Some of these fields can be dropped if the organization does not already use them. Additional fields, such as the input payload, could be logged as required, but PII shouldn't be logged, so filter or obfuscate the content before logging it.
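One possible way to produce such a record is sketched below in Python using the standard `logging` module. The logger name, the list of sensitive keys and the masking rule are assumptions for illustration; a real implementation would follow the organization's own PII policy.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Any, Dict, Optional

logger = logging.getLogger("api.support")  # hypothetical logger name
SENSITIVE_KEYS = {"email", "phone", "card_number"}  # hypothetical PII field names


def redact(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Mask top-level PII fields; nested payloads would need a recursive walk."""
    return {k: ("***" if k in SENSITIVE_KEYS else v) for k, v in payload.items()}


def build_support_record(
    *, message_id: str, correlation_id: str, http_status_code: int,
    url_path: str, remote_ip: str, code: str, error_type: str,
    system: str, message: str, payload: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the support-facing record, which names the failing system."""
    record: Dict[str, Any] = {
        "message_id": message_id,
        "correlation_id": correlation_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "http_status_code": http_status_code,
        "url_path": url_path,
        "remote_ip": remote_ip,
        "code": code,
        "type": error_type,
        "system": system,
        "message": message,
        "details": [],
    }
    if payload is not None:
        record["details"].append({"input_payload": redact(payload)})
    return record


def log_support_error(record: Dict[str, Any]) -> None:
    """Emit one JSON line that a log aggregation tool can index."""
    logger.error(json.dumps(record))
```

Unlike the consumer response, this record can name system XYZ directly, because it never leaves the organization.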

Error Code Strategy
If the organization wishes to have a single response structure for all its APIs, it is advisable to have a uniform error code naming strategy in place. This could be implemented in either of two ways:
A centralized team maintains the list of error codes to be used by all the APIs developed in the organization. For example, the central team gathers inputs from the API teams and either maps an existing error code or creates a new one based on need. This works well when you want all the organization's APIs to behave similarly and want to maintain strict central control over the API error codes.
Alternatively, a range of codes is assigned to each API (for example: API A gets error codes ORG-00100 to ORG-00199, API B gets ORG-00200 to ORG-00299, etc.). This provides both centralized control (assigning ranges) and per-API/project control (projects define their own codes within the range, such as ORG-00200, ORG-00201, etc.).
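The range-based option can be sketched as a small registry. The class `ErrorCodeRegistry` and its method names are hypothetical; the block size of 100 and the `ORG-` prefix follow the example ranges above.

```python
from typing import Dict


class ErrorCodeRegistry:
    """Central allocation of code ranges; each API numbers its own codes inside."""

    def __init__(self, prefix: str = "ORG", block_size: int = 100) -> None:
        self._prefix = prefix
        self._block_size = block_size
        self._next_start = block_size  # first block is 100..199
        self._ranges: Dict[str, range] = {}

    def allocate(self, api_name: str) -> range:
        """Assign the next free block to an API (idempotent per API name)."""
        if api_name not in self._ranges:
            start = self._next_start
            self._ranges[api_name] = range(start, start + self._block_size)
            self._next_start += self._block_size
        return self._ranges[api_name]

    def format_code(self, api_name: str, offset: int) -> str:
        """Turn an API-local offset into an organization-wide error code."""
        block = self._ranges[api_name]
        if not 0 <= offset < len(block):
            raise ValueError(f"offset {offset} is outside the range of {api_name}")
        return f"{self._prefix}-{block[offset]:05d}"


registry = ErrorCodeRegistry()
registry.allocate("API A")
registry.allocate("API B")
first_code_a = registry.format_code("API A", 0)   # "ORG-00100"
last_code_b = registry.format_code("API B", 99)   # "ORG-00299"
```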

Enterprise Event Management Tool

The enterprise should have a strategy for handling the different events occurring within its application landscape. If there is an enterprise-approved tool, we should plug the events and exceptions from the API into it. In the absence of such a system, you could use a log aggregation tool (like ELK or Splunk) or build a custom solution.

The following are the key requirements for such a system:

Store events with timestamps — for detailed time based analysis of the events
Ability to query and gather inputs from within the event message (search based on type, system or ID)
Correlate events across different applications/transactions based on a correlation ID
Support of some sort of dynamic query mechanism to enable self service discovery
Contain network and connectivity events from the application landscape (since the majority of technical errors stem from such issues)
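To show how the first three requirements fit together, here is a toy in-memory event store in Python. It is only a sketch under the assumption that events are plain dictionaries; a real deployment would rely on the enterprise tool or a log aggregation platform rather than code like this.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


class EventStore:
    """Toy store: timestamped events, field-based queries, correlation by Id."""

    def __init__(self) -> None:
        self._events: List[Dict[str, Any]] = []

    def record(self, event: Dict[str, Any], timestamp: Optional[datetime] = None) -> None:
        """Store a copy of the event together with its timestamp."""
        stored = dict(event)
        stored["timestamp"] = timestamp or datetime.now(timezone.utc)
        self._events.append(stored)

    def query(self, **criteria: Any) -> List[Dict[str, Any]]:
        """Return events whose fields match every criterion (type, system, Id...)."""
        return [
            e for e in self._events
            if all(e.get(key) == value for key, value in criteria.items())
        ]

    def correlate(self, correlation_id: str) -> List[Dict[str, Any]]:
        """Return all events of one transaction in time order."""
        matches = self.query(correlation_id=correlation_id)
        return sorted(matches, key=lambda e: e["timestamp"])
```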

Documentation for the API consumer

Error documentation for the API consumer can be brief, providing high-level information and troubleshooting guidance, either at the API level or on a centralized documentation page.

The level of documentation should provide details such as:

Error code and Error type details
What the error means (brief summary or details of what generally causes the error)
Who is expected to take action (for example, if it's a wrong post code, the consumer has to resend valid input for the API to work. In the case of a payment API, once the request is submitted, the response might come back asynchronously from the provider even if there are errors, and hence no action is needed from the consumer)
How to reach out to the API provider support team in case further support is required
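If the error catalogue is kept as data, the consumer documentation can be generated from it. The Python sketch below assumes a hypothetical `ErrorDocEntry` record with one field per item in the list above; the sample code and contact address are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorDocEntry:
    code: str
    error_type: str
    meaning: str
    action_owner: str      # "consumer" or "provider"
    support_contact: str


def render_entry(entry: ErrorDocEntry) -> str:
    """Render one catalogue entry as plain text for the API documentation."""
    return "\n".join([
        f"{entry.code} ({entry.error_type})",
        f"  Meaning: {entry.meaning}",
        f"  Action by: {entry.action_owner}",
        f"  Support: {entry.support_contact}",
    ])


entry = ErrorDocEntry(
    code="ORG-00202",                       # hypothetical code
    error_type="Validation",
    meaning="The post code in the request is not valid.",
    action_owner="consumer",
    support_contact="api-support@example.com",  # placeholder address
)
doc_text = render_entry(entry)
```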

Documentation for the support team

Documentation for the support team would be in the form of run books for the team which should include:

How to recognize the error has occurred (Alerts, Dashboards, etc)
Details of the error code and type of error
What is the usual cause of the error
How to debug further (which systems need to be checked and which logs need to be looked into)
What corrective actions to be taken — if any
Notifications to be sent — if any

Final Thoughts

An enterprise API error strategy need not be complex or time consuming; a first version can be implemented with very little effort. Since APIs are like the face of an organization, the additional work to make them truly enterprise grade is well worth it. It is better to start with a simple strategy and gradually enhance it to suit the enterprise vision than to have disjointed, stand-alone error handling implemented per API.

Thanks for reading! And I would love to hear your suggestions and comments!