Since 2002, when Jeff Bezos issued a mandate requiring every team at Amazon to expose data and functionality through APIs, service-oriented architecture has come a long way. Most companies now expose and consume internal and external data and functionality through APIs, because APIs provide loose coupling: individual services can be managed, scaled, and evolved independently. The foundations of SOA also enable teams to move fast in conceiving and building new functionality, run faster testing cycles, and deploy continuously through CI/CD. Over the last decade, REST has become the default modus operandi for implementing software services.
This article is called the API Manifesto because, as APIs have become extremely important to organizations, getting them right is critical. A manifesto is a public document proclaiming the aims of an organization, a team, or an individual. It declares not only the goal but also the means to achieve it. In this case, the manifesto is about achieving the best API outcome by employing the right design principles, implementing the right semantics, managing the expectations of API consumers, and using the right tools to monitor, control, and debug.
This article tries to cover the entire area of API management: API design, implementation, authentication and authorization, rate limiting, audit logging, metrics, monetization, documentation, health checks, reporting, and so on. In other words, it is a discussion of the best practices for designing and implementing REST APIs, as followed across the industry.
Best Practices in API Design
REST is an architectural style for modeling distributed systems as a set of resources. Resources can be data, objects, or services that can be accessed by clients. Every resource is represented by a URI which uniquely identifies the resource. Also, REST APIs (based on HTTP) are built around the HTTP actions such as GET, POST, PUT, DELETE, and PATCH.
A well-designed API will have the following characteristics:
- Easy to use: It follows a standards-based approach, so developers can easily understand and work with the API. The better the ease of use, the less the need for support and extensive documentation.
- Hard to misuse: Good API design makes it hard for consumers to make mistakes or misuse the API for unintended purposes. This is very important for the security and stability of the API.
- Complete and concise: The API should provide enough functionality to build meaningful applications around it, without being overly verbose; its scope should be limited to its context.
Organizing the APIs around resources
- Avoid mirroring the underlying data model in your resources. Resources should represent business or domain objects. For example, an Order or Invoice entity may be presented to the client as a single resource but stored across multiple tables under the hood. The client should be abstracted from the internal implementation.
- Entities are often grouped into collections (for example, users, items, orders, etc). A collection should be exposed as a separate resource and should have its own URI. In other words, use plural nouns to represent collections.
- Sending an HTTP GET request to a collection resource (for example, /orders) returns a list of the items in the collection, whereas the URI /orders/12345 identifies the individual order with ID 12345.
- Provide navigability in API responses. For example, if /orders/12345 returns order details that include a collection of line items, provide a navigable link to access those line items.
- Sometimes it may be necessary to expose a function as a resource. For example, a URI such as /customers/121/tax?year=2019 may be exposed to calculate tax for a given customer for a given financial year. However, such API definitions should be used sparingly.
Define operations in terms of HTTP methods
Most of the REST APIs use HTTP semantics to implement APIs. Common HTTP methods used are:
- GET: Retrieves a resource and returns its representation. The status line carries the response code, and the body contains the representation of the resource.
- POST: Creates a new resource under the specified URI. The body of the request message contains the details required to create the resource. The response usually contains a representation of the resource, as in GET. This method can also be used as a trigger to invoke operations, for example, starting a data-processing job.
- PUT: Used to either create or replace (modify) a resource. As with POST, the body of the request contains the details of the resource to be created or modified.
- DELETE: Removes the resource at the given URI.
- PATCH: Used to partially update a resource. Not used commonly.
The following table enumerates common implementation conventions for these HTTP methods, illustrated with an orders resource:

| Method | Collection URI (/orders) | Item URI (/orders/12345) |
|--------|--------------------------|--------------------------|
| GET | Retrieve the list of orders | Retrieve the details of order 12345 |
| POST | Create a new order | Generally not supported |
| PUT | Bulk-replace orders (if supported) | Create or replace order 12345 |
| PATCH | Generally not supported | Partially update order 12345 |
| DELETE | Not allowed (405) | Delete order 12345 |
Design the APIs around HTTP semantics
All the guidelines in this section MUST be adhered to.
- Always include “Content-Type” header in the response. For example: Content-Type: application/json
- Read the "Content-Type" header from the request and, if the server cannot support that media type, reject the request by returning HTTP error code 415 (unsupported media type)
- Read the "Accept" header from the client request, which specifies the media type accepted by the client. If the server doesn't support that media type, return HTTP error code 406 (not acceptable)
- A GET method, when successful, returns 200. If the specified resource cannot be found, it should return HTTP error code 404 (not found).
- If successful, the response will contain the object in the message body.
- A GET request on an empty collection should return an empty array in the response body along with response code 200.
- A POST method should create a new resource and return a response with HTTP status code 201 (created), a location header containing the URI of the resource, and a body containing a representation of the resource.
- If the method doesn’t create a new resource but does some processing, it should return HTTP status code 200 with the result of the operation in the response body.
- If there is nothing to return, the HTTP status code passed on to the client should be 204 (no content)
- If the client has sent invalid data in the request, the response should have an HTTP code of 400 (bad request), with additional information about the error in the body.
- If the PUT method creates a resource, it should return the response with HTTP status code 201 (created), location header containing URI of the resource and body containing a representation of the resource.
- If it successfully updates an existing resource, it should return a HTTP response code of 200, and body containing the representation of the resource.
- If it successfully updates the resource, but there is no content to return, it can return the HTTP response code of 204 (no content).
- If the API cannot update the resource for any reason (for example, the resource is locked), it should return 409 (conflict)
- Bulk updates should be implemented as PUT operations: the request should specify the URI of the collection, and the body should contain the details of the resources that need to be updated. On success, the API returns an HTTP status code of 200. On partial success (some entities successfully updated and some not), it should still return a status code of 200, with the response payload listing successes and failures. If the bulk update is designed to run as a long-running process, the API should return 202 (accepted). Handling long-running requests is discussed in a later section.
- If the Delete method is successful, it must return HTTP response code of 204 (no content). Response body need not contain any information.
- If the resource doesn’t exist, it must return HTTP response code of 404 (not found)
- DELETE should not be implemented on a collection. If executed, it must return response code 405 (method not allowed).
- A HEAD request asks the server for the headers it would return for a resource, without the resource itself.
- It should be treated as identical to a GET request, except that it must not return a message body. It must return only meta-information such as content length (especially useful for large objects) and content type.
- The request can contain URL parameters.
- If the resource exists, a response code of 200 should be returned; if it doesn't, the API must return a 404 response code.
The PATCH method is used by clients to send updates to an existing resource in the form of a patch document. A patch document need not contain all the fields of the resource. In other words, the patch document doesn't describe the whole resource, only the changes to be applied.
There are two main JSON-based document formats for patching. For the sake of the following discussion, consider a resource with the representation shown below.
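The original example resource is not reproduced here, so assume a hypothetical order resource with the following representation (the field names are illustrative only):

```json
{
  "id": 12345,
  "status": "open",
  "shippingAddress": "221B Baker Street, London",
  "discountCode": "SUMMER10"
}
```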
JSON Patch
In this format, the patch document contains a list of operations, each carrying a directive such as "add", "replace", "remove", "move", or "copy" along with the path it applies to. The media type for a JSON Patch payload must be "application/json-patch+json".
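A sketch of a JSON Patch document against a hypothetical order resource with status, discountCode, and trackingNumber fields (the fields are illustrative, the operation syntax follows RFC 6902):

```json
[
  { "op": "replace", "path": "/status", "value": "shipped" },
  { "op": "remove", "path": "/discountCode" },
  { "op": "add", "path": "/trackingNumber", "value": "TN-98765" }
]
```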
JSON Merge Patch
This is a simpler format, where the patch document has the same shape as the resource representation but includes just the subset of fields that should be changed or added. In addition, a field can be deleted by specifying "null" for it in the patch document, as shown below.
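A sketch of a merge-patch document for a hypothetical order resource (the field names are illustrative): status is updated, trackingNumber is added, and discountCode is removed by setting it to null.

```json
{
  "status": "shipped",
  "trackingNumber": "TN-98765",
  "discountCode": null
}
```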
The media type for Merge Patch JSON payload must be “application/merge-patch+json”.
In both cases, 200 (OK) must be returned as the success response. If the document format is not supported, the server should return 415 (unsupported media type); if the document is invalid, 400 (bad request) must be returned; and if the document is valid but the changes cannot be applied to the object, 409 (conflict) must be returned.
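The merge-patch semantics described above can be sketched in a few lines of Python; this is a minimal rendering of the RFC 7386 algorithm, not a production implementation:

```python
def merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7386) to a target document."""
    if not isinstance(patch, dict):
        # A non-object patch replaces the target outright.
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null means "remove this field"
        else:
            result[key] = merge_patch(result.get(key), value)
    return result

# A patch that updates one field and deletes another:
order = {"status": "open", "discountCode": "SUMMER10"}
patched = merge_patch(order, {"status": "shipped", "discountCode": None})
# patched == {"status": "shipped"}
```

Note that the recursion means nested objects are merged field by field rather than replaced wholesale.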
Patterns for API Implementation
Guidelines in this section are based on various features of APIs such as asynchronous operations and filtering. When implemented, the APIs SHOULD follow the below-listed common patterns to bring consistency across all the APIs.
Handling Asynchronous Operations
It is possible that some update operations (POST, PUT, DELETE) take a while to complete, and if the API waits for processing to finish before sending a response, it causes unacceptable latency on the client's end. In such cases, it is better to implement the operation asynchronously. An asynchronous operation returns HTTP status code 202 (accepted) along with the URI of a "status" endpoint in the location header. For example:
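A sketch of such a response (the status path is a hypothetical example):

```http
HTTP/1.1 202 Accepted
Location: /api/status/12345
```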
A status endpoint must be implemented to report progress on the request. If the client sends a GET request to the endpoint specified in the location header, it must return the current status of the asynchronous request with a 200 (OK) status code. For example:
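A client polling a hypothetical /api/status/12345 endpoint might see the following (the JSON fields are illustrative, not a standard):

```http
GET /api/status/12345 HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/json

{ "status": "in progress", "percentComplete": 40 }
```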
While the asynchronous request is in progress, if a DELETE is sent to the status API endpoint, the processing should be canceled (if it is possible).
Data filtering, sorting, and pagination
When we expose a collection of resources (for example, orders), there is a possibility that a large amount of data is fetched when only a subset of the information would suffice. Let's say we provided a plain vanilla REST API to access orders for a given customer. If the client wants only those orders exceeding a specific amount, it has to fetch all orders, apply the filter, and then extract the information needed. This is inefficient, since it wastes processing power and bandwidth on both the client and the server. The optimal approach is for the client to pass a set of filters to the API and for the API to apply those filters while reading data from the data source.
Any API potentially returning a large number of items should implement filtering and pagination. This also limits the possibility of DoS (denial of service) attacks on the application layer.
An example of filtering is shown below:
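A sketch of such a request; the filter field names (minAmount, status) are hypothetical:

```http
GET /orders?minAmount=100&status=shipped&sort=orderDate&order=descending
```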
The above example is a GET request that passes filters as URL parameters. Since GET requests don't support a body, APIs supporting complex filters may need to be implemented as POST requests, even though that is semantically incorrect. The example also shows sorting directives passed through "sort" and "order": the "sort" parameter contains the name of the field by which the records need to be sorted, and "order" contains either an "ascending" or "descending" directive.
An example of pagination is shown below:
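A sketch of such a request, using limit and offset parameters:

```http
GET /orders?limit=25&offset=50
```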
The above example depicts pagination through limit and offset. The first page starts at offset = 0, and the limit represents the page size the client expects. As the client moves to the next page, the limit usually stays the same while the offset keeps increasing. The API should implement a default page size and a maximum page size (the latter to guard against DoS attacks).
Response with pagination is generally structured as shown below.
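A sketch of such a response body, where data would hold the current page of order objects (the field names are illustrative, not a standard):

```json
{
  "limit": 25,
  "offset": 50,
  "links": {
    "prev": "/orders?limit=25&offset=25",
    "next": "/orders?limit=25&offset=75"
  },
  "data": []
}
```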
It should contain prev and next links along with the limit and offset of the current page, which allows the client program to easily navigate the pages. For the first page, the prev link need not be provided, and for the last page, the next link need not be provided; that way, the client knows when it has reached the end of the data.
API Versioning
All APIs evolve over time. As business requirements change, new resources may be added, old resources may be amended, and relationships between resources may change. However, the clients of the API might not have the bandwidth to consume the changes immediately. Hence, while continuing to innovate, improve, and evolve the APIs, it is imperative to help existing client applications continue to work without breaking their functionality. Versioning is the approach that isolates existing clients from breaking when new changes are released.
Versioning through URI
In this method, every time the API signature (data contract, behavior, or response) changes, a new version number is added to the URI of the resource. For example, in https://api.xx.com/v1/orders, the v1 in the path indicates the version number. The existing versions should continue to operate as before, returning resource representations conforming to the original schema.
Even though this versioning mechanism is simple, it depends on the server being able to route requests to the appropriate endpoint based on the version path parameter, and it becomes unwieldy as more and more versions are released. Navigability (including paths of objects) within REST results also becomes more complicated, as the paths need to include versions as well. Most API gateways support this type of version-based request routing.
Versioning through Query String
Rather than using the URI path to indicate the version, query-string versioning uses a query parameter to specify the version of the API being invoked. For example:
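A sketch of such a request, assuming a hypothetical "version" query parameter:

```http
GET https://api.xyz.com/orders/12345?version=2
```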
In this case, we need to implement a default version that is used when no version parameter is specified. Also, the versioning needs to be handled within the code, which must parse the query string and construct the object conforming to that version. In other words, the routing to the right API endpoint cannot be handled by an API gateway or a load balancer.
Versioning through Header
This approach works through a custom header that indicates the version of the API being invoked. The client is expected to add this header to the request. For example:
GET https://api.xyz.com/orders/12345 HTTP/1.1
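The custom header name is not standardized; assuming a hypothetical api-version header, the full request might look like:

```http
GET https://api.xyz.com/orders/12345 HTTP/1.1
api-version: 2
```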
Even in this case, the routing cannot be done by API gateway or load balancer.
Idempotency
Theoretically, idempotency means that the same operation repeated multiple times results in the same value; that is, F(x) = F(F(x)).
Implement GET, PUT, and DELETE operations to be idempotent. In other words, the same request repeated on the same resource should result in the same state for the resource and the same response to the client, without causing any side effects. In the case of a hard delete, it is possible that the client gets a 204 (no content) response the first time but 404 (resource not found) on subsequent requests, because the resource has been hard deleted. Otherwise, the API should ensure that it returns the same response (unless there is a server error). POST operations that do not create a resource but perform processing to move an object from state A to state B are also ideal candidates for idempotency.
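The state semantics can be sketched with a toy in-memory store (an illustration, not a production design): repeating a PUT leaves the resource in the same state, and repeating a hard DELETE leaves the store unchanged even though the status code changes from 204 to 404, as noted above.

```python
store = {}  # URI -> resource representation

def put(uri, representation):
    """Create or replace a resource; repeating the call yields the same state."""
    created = uri not in store
    store[uri] = representation
    return 201 if created else 200

def delete(uri):
    """Hard delete: the first call returns 204, repeated calls return 404."""
    if uri in store:
        del store[uri]
        return 204
    return 404

put("/orders/1", {"status": "open"})
assert put("/orders/1", {"status": "open"}) == 200  # same state, no side effects
assert delete("/orders/1") == 204
assert delete("/orders/1") == 404  # state unchanged: still deleted
```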
In a loosely connected world of distributed systems, with many points of failure (servers, routers, switches, etc.), idempotency reduces friction. Let's say the client initiated a transaction that timed out on the client's side. The client at this point doesn't know whether its request succeeded or failed. If the client retries the same transaction, the server can either respond with an error code (returned by a state machine or the database), or return the same response it would have returned on success. The second option requires more work, but it removes the ambiguity for the client.
To avoid chattiness, it is recommended to support POST and PUT over the entire collections (for example: /orders). A POST request should be able to accept the resource array in the payload and create them in bulk and a PUT request should be able to replace multiple resources in a collection.
Error Handling
It is very important to pass correct error codes and error descriptions to clients. Any internal errors need to be caught and appropriate error responses returned to the clients. The framework/platform implementing the APIs should make sure that uncaught errors are not propagated to the clients. Avoid sending 500 status codes to clients, because they are not actionable. For example, if a client is trying to delete an order when one of its lines is in the shipped state, return 409 (conflict) instead of 500 (internal server error). If any condition (or rule) makes the request unachievable, return 400 (bad request).
On many web servers and API gateways, you can configure authentication providers. This routine executes even before the request reaches the API endpoint. If an authentication error occurs at the web server or API gateway, it returns 401 (unauthorized). Once the client is authenticated, it is the responsibility of the API to authorize the client, that is, to check the client's privileges to execute the current API. If the authorization fails, the API should return 403 (forbidden).
The full list of standard HTTP response codes and their meanings is enumerated in the HTTP specification document hosted by the W3C.
Enabling Client-side Caching
In distributed systems, network latency is something that cannot be wished away. The client will experience this every time it makes a request and receives the response. Wherever clients are frequently sending requests and receiving responses, we should aim to reduce the amount of network traffic flowing through the network.
The HTTP protocol supports caching by clients and by the intermediate proxy servers through which a request is routed, using cache-control headers. When the server sends a response to a client request, it needs to include a Cache-Control header indicating whether the data in the body can be safely cached and for how long. An example of such a response is shown below:
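A sketch of such a response (the body is an illustrative order resource):

```http
HTTP/1.1 200 OK
Cache-Control: max-age=600, private
Content-Type: application/json

{ "id": 12345, "status": "open" }
```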
In the above example, the Cache-Control header specifies that the content can be cached for 600 seconds (10 minutes) and only by a private client (such as a browser). In other words, the above response will not be cached in shared caches (such as a proxy). Specifying "public" in the Cache-Control header enables caching in shared caches, whereas specifying "no-store" disables caching by clients altogether.
A word of caution: with client-side caching enabled, objects can go stale in the cache. Acting on stale data, the client may try to update the object, causing data consistency issues. To avoid this, ETags need to be used.
Using API Gateway
In an API-centric world, we have to expose our APIs to clients that depend on them to get things done. However, it is not secure to expose the API endpoints directly: exposed endpoints can be hacked or attacked. API gateways serve the same purpose as proxy servers do for web applications. They provide a layer of managed indirection, hiding the real endpoint from the consumer while monitoring and protecting it. One of the most important functions of an API gateway is rate limiting, which reduces the exposure to DoS (denial of service) attacks on the API. API gateways can also be used to offload common functionality such as SSL termination, authentication/authorization, metrics collection, audit logging, and transformations. Some of the commercial and open-source API gateways available in the market are NGINX, MuleSoft, Kong, and Zuul.
- SSL Termination: It is recommended that all enterprise APIs be exposed over HTTPS. Some API gateways allow SSL certificates to be installed on them, enabling SSL termination at the API gateway instead of the service. This is very useful when multiple microservices expose their API endpoints through the same API gateway.
- Authentication/Authorization: Common functions such as authentication (and even authorization) can be done at the gateway relieving the burden on the services. API gateways provide the capability to write plugins to do the same. Each API must be authenticated (whether internal or external) through many available standard mechanisms (JWT or API key). In other words, some form of API authentication is a MUST in both internal and external facing APIs.
- Metrics and audit logging: API gateways are the best place to collect metrics such as API latency and success and error rates. These metrics can be extracted periodically and stored in a time-series database, which allows for excellent reporting capability. Metrics and logging can be achieved through tools such as Medusa and Splunk.
- Rate limiting: Rate limiting (or throttling) guards against sudden surges of requests. APIs can get overwhelmed by too many requests, which impacts all consumers of the API by causing latency to skyrocket or forcing the servers to shut down. DoS (denial of service) attacks achieve the same result; rate limiting is, in fact, a defense against DoS attacks. It is also used to make sure one client doesn't use up the entire capacity.
- Transformations: Some API gateways allow request/response transformations. For example, through the gateway, you can transform payload (for example: from JSON to XML, ver2 to ver1 and so on).
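To make the rate-limiting discussion concrete, here is a minimal token-bucket limiter in Python, the mechanism many gateways use under the hood (a sketch, not a production implementation):

```python
import time

class TokenBucket:
    """Allow `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill tokens for the time elapsed since the last request.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True  # request admitted
        return False     # request rejected; the API would return 429 (too many requests)

bucket = TokenBucket(rate=1, capacity=2)
# The first two requests burst through; a third immediate request is rejected.
```

A real gateway keeps one bucket per client (keyed by API key or source IP), which is how per-client capacity limits are enforced.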
Where API Gateway cannot be deployed, the assumption is that the API itself will implement the required features such as Authentication, Metrics, Audit logging, and Rate limiting.
Backward Compatibility
An API is backward compatible if client code written for the previous version of the API works with the current version. In other words, a client written for version 1 can work with version 2. This has various advantages: clients don't have to invest development effort every time the API changes, and the release cycle is faster since no clients break because of a new release.
Developers SHOULD, whenever possible, maintain the backward compatibility of the resources and objects (input/output). If the new changes make it impossible to maintain backward compatibility, a new resource and resource representation (input/output) will have to be created. The exception to this rule will be made when the current behavior or input/output constitutes a security threat or when the API has been incorrectly implemented, affecting a large number of customers. In this case, the API will be changed even if it breaks backward compatibility. However, when the API change is not backward compatible, customers need to be notified and educated.
Rules of Backward Compatibility
Stable URIs: The resource that existed at a given URI for the previous version, should continue to exist at the same URI without a change in meaning. HTTP response codes should not change between versions. However, the resource may support new query parameters in new versions, but they SHOULD be optional. Not providing them should not break the functionality. The new version of the resource can return a redirection response (301/302), which needs to be handled by the client. In this case, a location header MUST be sent to the client.
Stable Representations (input/output objects): If a resource accepts a representation (input object), via POST or PUT, it MUST continue to accept the same representation in future versions. Additional properties are allowed but will NOT be mandatory. The default value that is substituted for the absent property must carry the same meaning as the previous version.
Default values and limits: Default values with respect to page size, object (input/output) size limits, and rate limits (throttling) can change between versions.
Robustness: To make it easier for clients to use our APIs, we should build them robustly. The API should be resilient to failures, meaning it should be tolerant of variations in input data, query parameters, headers, etc. from the clients. The API should decide how to handle a request based only on what it recognizes. For example, if a query parameter in the client request is unrecognized, the server should ignore it. If any fields in the JSON payload are unrecognized, those fields should be ignored. If the header contains an unrecognized attribute, it must be ignored.
Monitoring API Health
API monitoring is serious business: many clients may depend on a business-critical API, and an outage will cause dependent services and clients to fail. Monitoring APIs for outages alone is not enough; we also need to monitor APIs for failures to meet the SLA (see the next section). The following attributes of an API call are recommended to be recorded, preferably at the API gateway level. Recording these metrics gives us strong analytics and alerting capability.
- Client ID or Source IP or API key (to identify the client)
- Request length (as a measure of payload)
- Request timestamp
- Response timestamp
- Latency (ms)
- Response code (2xx, 3xx, 4xx, 5xx)
- Endpoint (path)
Recording these in a time-series database (such as Prometheus, ClickHouse, or Druid) or an analytics database, as idempotent entries, will be very useful for deep analysis of API performance and uptime.
External health check
Checking the health of APIs can be accomplished by periodically calling an API with demo/dummy data from outside the corporate network. When a consumer calls an external API, the call goes through a network of routers, switches, firewalls, load balancers, and gateways; even if one link along the chain is broken, the consumer experiences an outage. Internal monitoring alone will not provide a real-world view of an API's uptime, hence it is absolutely necessary to have an external monitoring setup. Many companies, such as Postman, provide this service. An API health check can be performed from a set of geographically distinct clients (for example: US East, South Asia, etc.) to measure uptime, latency, and outages from a customer perspective.
For API monitoring, it is not enough to measure the CPU, memory, and outages of the system. It is equally important to make sure the API is complying with a given SLO (see below). To achieve this, alerts are very useful: alerting needs to be set up not only for outages but also for breaches of the SLO. In other words, an SLO breach (availability or performance) MUST be treated as an outage for operational purposes.
SLAs, SLOs, and SLIs
SLA (Service level agreement): The agreement a company makes with the consumer for a given API. The SLAs are generally drawn up by business/legal teams in terms of responsiveness, uptime, and responsibilities (customer vs provider).
SLO (Service level objective): The objectives that the team must meet to satisfy the SLA. In other words, SLO is a line-item within an SLA which refers to a specific metric such as response time or uptime. SLOs are the individual promises that hold the engineering and DevOps teams accountable for meeting them. SLOs can also be defined for internal systems as well, for example, a CRM system or an IAM system.
SLI (Service level indicator): The real metrics (numbers) gathered on the performance and availability of a service. In other words, an SLI measures actual performance against a given SLO. For example, let's say the SLO for an API is 99.5% uptime. To meet the SLA, the SLI has to meet or exceed the SLO; with respect to our example API, the measured uptime needs to meet or exceed 99.5%.
It is obvious that before SLA or SLO can be provided, an API has to undergo performance and scalability testing.
Providing a Monthly Uptime SLA
Actual Monthly Uptime Percentage = (A-B+C)/A , where:
A = Total Monthly Time (in seconds/minutes);
B = Unavailable Monthly Time (duration for which the service was not available to consumers inside and outside the network); and
C = Excluded Monthly Times (should include maintenance window and outages caused by factors outside the company’s control)
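The formula can be checked with a quick Python sketch; the downtime numbers below are made up purely for illustration:

```python
def monthly_uptime_pct(total_secs, unavailable_secs, excluded_secs):
    """Actual Monthly Uptime Percentage = (A - B + C) / A, as a percentage."""
    return (total_secs - unavailable_secs + excluded_secs) / total_secs * 100

# A 30-day month with 1 hour of downtime, 10 minutes of which
# fell inside a scheduled maintenance window (and is therefore excluded):
uptime = monthly_uptime_pct(30 * 24 * 3600, 3600, 600)
# uptime is roughly 99.884
```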
Providing Response Time SLA
Response time SLA for a given (set of) APIs can be provided in terms of either 90th, or 95th or 99th percentile response times either in milliseconds or seconds over a fixed period of time. For example,
95th percentile response time of 1 second calculated on a daily basis.
An important thing to note is that this metric needs to be monitored and published from the API gateways. It can show a lot of variance if measured from the last mile (the API consumer's end), since network delays add up.
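Computing a percentile over gateway-recorded latencies can be sketched as follows, using the simple nearest-rank method (a sketch, not a production implementation):

```python
import math

def percentile(latencies_ms, pct):
    """Nearest-rank percentile: the smallest sample at or above `pct` percent of the data."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# With samples of 1..100 ms, the 95th percentile is 95 ms.
```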
API Documentation
API documentation can be defined as a set of instructions on how to use an API effectively, written specifically for developers. It can be thought of as a reference manual containing all the information needed to work with the API, such as authentication/authorization, input/output payloads, headers, and parameters. There are several API description formats available, such as the Swagger/OpenAPI Specification and RAML. I have found the Stripe API reference to be one of the best examples of API documentation.
Unless we provide client libraries or SDKs to our clients for integration with our APIs, there is no need to provide code samples. However, we need to provide comprehensive examples of calling our APIs using curl commands, which include authentication, query params, headers and payload. We MUST also provide complete JSON object examples for input/output. Here too, it is worth emulating Stripe.
A sandbox environment helps consumers of the APIs test their integrations and validate their application flows and use cases before deploying to their production environment.
An API can be in one of several lifecycle states, for example: preview, stable, deprecated, or retired.
We should strive to provide our customers with stable APIs. If we need to discontinue an API or remove features from it, we should give API consumers at least 60 days' notice, providing the following information:
- API and version being deprecated
- Whether a new API is replacing the old one, and its description
- Reason for not maintaining the backward compatibility of the API
- Responsibility of the client: such as moving to the new API within the deprecation period
- Indemnity which states the company is not responsible for security risks and other problems arising out of using deprecated APIs
- Detectability: the deprecated API should send a deprecated flag in the response header as an indicator to the API consumer
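One way to signal this in responses is with the Sunset header (standardized in RFC 8594) alongside a Deprecation header; the date and successor link below are hypothetical:

```http
HTTP/1.1 200 OK
Deprecation: true
Sunset: Wed, 31 Dec 2025 23:59:59 GMT
Link: <https://api.xyz.com/v2/orders>; rel="successor-version"
```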
Documentation of each API should carry the current status of the API.
We need to build a database of frequently asked questions about the APIs and make it available to API consumers. FAQs reduce the number of customer calls to support and improve the customer experience.
Thank you for reading a really long article. Please leave your thoughts in the comments section.