ING Contact Center 2.0 — Creating Resilient APIs

ayush mittal
Published in ING Blog
Sep 15, 2022

Contact Center 2.0 is ING’s global, omni-channel platform for customer communication. The platform is at the core of all communication between ING employees and ING customers. As with any other mission-critical system, CC 2.0 is designed and developed to be highly available and resilient. In this blog post we shall look at some of the patterns that engineers in the CC 2.0 family apply to achieve high availability and resiliency.

Before we go into details, let’s understand the CC 2.0 platform’s business goals and look at a high level design.

Business goals for CC 2.0 Platform -

“Offer one shared platform for all ING countries that facilitates communication between ING customers and ING employees via any channel.”

CC 2.0 and Twilio

CC 2.0 uses Twilio to achieve the one shared platform goal. The reasons for choosing Twilio are -

  • ING pays Twilio based on usage; there are no fixed charges for licences or maintenance.
  • ING can develop quickly with Twilio. With its APIs and SDKs, Twilio is a perfect fit in the ING architecture.
  • Practically every customer journey can be realised with Twilio because of its flexible setup with building blocks and easy integration.

CC 2.0 High level design

The following diagram shows an overview of the functional capabilities of the Contact Center 2.0 platform. The capabilities are divided into 4 layers: touchpoints, communication channels, orchestration & routing and utilities.

(Image: CC 2.0 and Twilio)

All communication channels and routing functionalities are backed by one or more APIs that combine with other ING APIs and Twilio to deliver a complete customer journey. The CC2.0 backend APIs are critical to the working of this digital contact center. All APIs are designed and built with the principles of “security, availability, resilience, and observability.”

(Image: components overview)

CC 2.0 in Numbers

The Contact Center 2.0 platform is currently used in more than 10 different countries for Chat, Voice, Email and Video-based solutions. Here are some interesting statistics on the platform usage and the load it handles.

1. Number of ING customers routed to ING agents via the CC2.0 platform in the chat channel in 2022. More than 100,000 customers were supported in June alone.

(Image: chat numbers)

2. Number of Voice calls handled by the CC2.0 platform in 3 major countries in 2022. Close to 1,500,000 calls are handled every month in just these 3 countries combined.

3. Incoming calls from ING customers as seen over the last 60 minutes on a given day. The peak load is as high as 150 incoming calls per minute.

Here is how these numbers translate to our load metrics:

  1. On average we handle a peak load of 180 TPS in a single channel such as voice, chat or email.
  2. The average throughput for the most critical APIs in our platform is 8000 RPM.
  3. The average response times for these APIs must be between 0.1 and 1 seconds for us to handle the load (a rough back-of-the-envelope estimate follows).
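
To make the relationship between throughput and response time concrete, here is a rough back-of-the-envelope estimate based on Little’s law. This is only an illustration using the numbers above, not part of our actual capacity model.

// A rough, illustrative estimate using Little's law:
// average requests in flight = arrival rate x average response time.
val arrivalRatePerSecond   = 8000.0 / 60.0                 // 8000 RPM ≈ 133 requests per second
val inFlightAtFastResponse = arrivalRatePerSecond * 0.1    // ≈ 13 concurrent requests at 0.1 s
val inFlightAtSlowResponse = arrivalRatePerSecond * 1.0    // ≈ 133 concurrent requests at 1 s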

Resiliency — CC2.0 Platform

Two key goals in the design of the platform are “availability” and “resiliency”. Resiliency in a distributed system is easier said than done. The very nature of a distributed system increases the number of failure points. In a distributed, micro-services architecture, faults can cause failures unless the system is resilient. Confused? :) Let’s unpack that statement.

Faults, Failures and Resilience -

A fault is the presence of an invalid or false state in our internal system.

“A fault is anything that behaves unexpectedly and could be any hardware, software, network, or operational aspect of the system that does something we didn’t plan for!”

There are many reasons a fault can happen. Some common examples of software system faults include:

  1. Memory leaks in application
  2. Threading issues
  3. Dependency issues
  4. Corrupt data

A failure is the inability of the system to perform what it is meant to do. Failure is when our customers are not able to reach us or when we cannot reach them. Failure is when our APIs are not reachable. Failure means loss of uptime and availability of our systems.

So why do failures happen? Failures happen when faults in one part of the system go unchecked and propagate. A fault leads to a failure when it is not contained and spreads further into the system.

Resilience, then, is all about preventing faults from turning into failures. We cannot guarantee fault-free systems. However, we can ensure high availability and resiliency by constraining faults, keeping them localised so they do not turn into failures.

Common resiliency patterns used in CC 2.0 Backend APIs

  • TIMEOUT

“At some point, you have to give up.”

Let’s consider the following scenario.

We have an API A that depends on API B for serving its requests. API B is, for some reason, not in a healthy state and is slow.

Implementing a timeout on the call from A to B ensures fast failure. The slowness of B is detected quickly, and A can either propagate the error further or use a fallback mechanism to still serve its request.

Timeout at Controller Layer

When a server receives an HTTP request, it’s good practice to specify the time limits within which this HTTP request must be completed.

For example, in Jersey we can use the async API to specify the timeout at the controller level.

@Path("/tasks")
public class TaskResource {
@GET
public void createTask(@Suspended final AsyncResponse ar) {
ExecutorService es = Executors.newSingleThreadExecutor();
ar.setTimeoutHandler(new TimeoutHandler() {
@Override
public void handleTimeout(AsyncResponse asyncResponse) {
asyncResponse.resume("Processing timeout.");
es.shutdown();
}
});
ar.setTimeout(2, TimeUnit.SECONDS);
es.submit(() -> {
..
});
}
}

Timeout for Twilio calls

We use the Apache HTTP client to configure the timeout policies for ING-Twilio API calls.

def buildClient: CloseableHttpClient = {
  val clientBuilder = HttpClients.custom
  val requestBuilder = RequestConfig.custom
    .setConnectTimeout(connectTimeout)
    .setSocketTimeout(socketTimeout)
    .setConnectionRequestTimeout(connectionRequestTimeout)
  clientBuilder.setDefaultRequestConfig(requestBuilder.build).build
}

The Apache client is created inside an INGHttpClient, which is an extension of Twilio’s HttpClient.

class INGHttpClient extends HttpClient {
  val client = buildClient
  override def makeRequest(request: Request): Response = {
    client.execute(request)
  }
}

Since INGHttpClient is an extension of Twilio’s HttpClient, it can be passed to a TwilioRestClient, the class that provides access to the Twilio APIs.

def buildTwilioClient(httpClient: HttpClient): TwilioRestClient = {
  new TwilioRestClient.Builder(sid, token)
    .httpClient(httpClient)
    .build
}

Finally, this TwilioRestClient can be used to access Twilio resources:

Task task = Task.creator("WSXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
    .setAttributes("{\"type\":\"support\"}")
    .setWorkflowSid("WWXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
    .create(twilioRestClient);

Timeout for API-API call

We use Finagle for API-API requests within ING. Finagle contains two modules that are responsible for handling timeouts: the Session Timeout module and the Request Timeout module.

The snippet below shows how to configure session timeouts. These parameters set a maximum lifetime for the client session, a maximum idle time during which no requests are sent from that client, and a maximum time to wait when acquiring a session.

client
  .withSession.maxLifeTime(20.seconds)
  .withSession.maxIdleTime(10.seconds)
  .withSession.acquisitionTimeout(10.seconds)

The most basic configuration to define a request timeout on the client side is to use the “withRequestTimeout” method on the client API. The snippet below, for example, defines a timeout for a request.

client.withRequestTimeout(200.milliseconds)

  • RETRIES

“If at first you don’t succeed, try, try again”

Retries can help reduce the time spent recovering from failures. Retrying is especially useful when dealing with intermittent failures. Retries are used in conjunction with timeouts: we give a call a specific timeout and then retry after the timeout.

Retries for Twilio calls

For retrying Twilio calls made via the Twilio SDK, the INGHttpClient is useful again. The client maintains a retry count per HTTP call and retries failed calls that match certain criteria.

We use resilience4j and similar libraries to configure retry logic. A naive representation of that logic is shown here:

def execute(request: HttpUriRequest): Response = execute(request, 0)

def execute(request: HttpUriRequest, retryCount: Int): Response = {
  executor.execute(request).recoverWith {
    case exception: Throwable =>
      // Retry only if the failure is retryable and the retry budget is not exhausted.
      if (canRetry(exception, retryCount)) {
        execute(request, retryCount + 1)
      }
      else throw exception
  }
}
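
For illustration only, a retry configured with resilience4j could look roughly like the sketch below. The name “callTwilio” and the concrete values are hypothetical and not taken from our actual configuration; the sketch assumes the resilience4j-retry module is on the classpath.

import io.github.resilience4j.retry.{Retry, RetryConfig}
import java.time.Duration

// Hypothetical sketch: allow 3 attempts in total, waiting 500 ms between attempts.
val retryConfig = RetryConfig.custom()
  .maxAttempts(3)
  .waitDuration(Duration.ofMillis(500))
  .build()

val retry = Retry.of("twilio-calls", retryConfig)

// callTwilio() is a placeholder for the actual call made via the INGHttpClient.
def callTwilio(): Response = ???

val response = Retry.decorateSupplier(retry, () => callTwilio()).get()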

Retries for API-API calls

Taking Finagle as an example again: by default, a “RequeueFilter” is configured in each client. This filter retries requests, but it does so without any delay, which is not recommended. Backoff policies should be used to configure the behaviour of the filter, and Finagle provides several built-in policies for that purpose. In addition to a backoff policy, a retry budget (the number of allowed retries) should also be configured when using a “RequeueFilter”.

This is how you can configure a “RequeueFilter” with a backoff policy and a “RetryBudget”:

client
  .withRetryBackoff(Backoff.decorrelatedJittered(2.seconds, 32.seconds))
  .withRetryBudget(RetryBudget(ttl = 5.seconds, minRetriesPerSec = 5, percentCanRetry = 0.1))

  • FALLBACKS

“Graceful degradation”

Like every other API built in this world, our APIs will sometimes not work as expected. When there are faults inside our APIs, or in those we depend upon, we select alternative mechanisms to respond with a degraded response instead of failing completely.

The selected alternative mechanism depends on the specific customer or employee journey. Let’s look at how CC 2.0 uses Twilio Functions for voice fallbacks.

Fallback to Twilio Functions

The Callflow API is our voice channel orchestration API. A happy scenario for the API looks as follows:

  • Incoming phone calls from customers to ING are sent from the PSTN network to Twilio, where the calls are handled by Twilio Programmable Voice.
  • Twilio Programmable Voice sends a request to the CC 2.0 Callflow API at ING to start a callflow for the incoming phone call.
  • The callflow has an automated voice dialog with the customer to collect information about the customer and the intent of the phone call.
  • The Callflow/Voice API creates a task in Twilio TaskRouter to route the customer to an ING employee, based on the information that was collected in the callflow.

What happens when there are issues with the Callflow API? The fallback scenario mitigates the situation where starting callflows at ING fails. When this occurs, Twilio Functions is used as an alternative to route the phone call to an ING employee (a small code sketch of the general fallback idea follows the list below).

  • When Twilio Programmable Voice receives an error or a timeout on starting a callflow, it will execute the Twilio Function that is configured as fallback. This is also recommended by Twilio.
  • The Twilio Function has a limited automated voice dialog with the customer to collect basic information for coarse-grained routing.
  • The Twilio Function creates a task in Twilio TaskRouter to route the customer to an ING employee, based on the information that was collected in the function.
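
To illustrate the general fallback idea in code: this is a simplified sketch, not the actual Callflow implementation, and “primaryIntentLookup” and the default intent are hypothetical. A failing or timed-out dependency call is recovered with a degraded, coarse-grained result so the customer can still be routed.

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical primary lookup that may fail or time out.
def primaryIntentLookup(callSid: String): Future[String] =
  Future.failed(new RuntimeException("Callflow API unavailable"))

// Fall back to a coarse-grained default intent so the call can still be routed to an employee.
def customerIntent(callSid: String): Future[String] =
  primaryIntentLookup(callSid).recover { case _: Exception => "general-support" }
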
  • CIRCUIT BREAKERS

“Stop the line”

We use circuit breakers in our homes to guard against a sudden surge in current and prevent a potential fire in the house. A circuit breaker trips the circuit and stops the flow of current.

The same concept is applied to our distributed systems: we stop making calls to a downstream service when we know it is unhealthy, failing, and in need of time to recover.

Failure Accrual is one of the three types of circuit breakers provided by Finagle, along with Fail Fast and Threshold Failure Detection.

Failure Accrual

Similar to retry filters, Failure Accrual is configured using a policy that can be either success-biased or failure-biased. The Failure Accrual module is enabled by default for all clients, with a policy based on five consecutive failures and an equal jittered backoff. The examples below show a failure-biased and a success-biased configuration respectively.

// Failure-biased: mark the endpoint dead after 10 consecutive failures.
client
  .configured(FailureAccrualFactory.Param(() =>
    FailureAccrualPolicy.consecutiveFailures(
      numFailures = 10,
      markDeadFor = Backoff.decorrelatedJittered(5.seconds, 20.seconds))))

// Success-biased: mark the endpoint dead when the success rate drops below 99%.
client
  .configured(FailureAccrualFactory.Param(() =>
    FailureAccrualPolicy.successRate(
      requiredSuccessRate = 0.99,
      markDeadFor = Backoff.decorrelatedJittered(5.seconds, 20.seconds))))

  • BULKHEADS

The name “Bulkhead” comes from the sectioned partitions of a ship. If one partition is damaged or compromised, only that section fills with water, which prevents the whole ship from sinking.

Similarly, you can prevent failure in one part of your distributed system from affecting and bringing down other parts. The Bulkhead pattern can be applied in multiple ways within a system.

We use “Categorized Resource Allocation”, splitting the resources of a system into separate buckets instead of one common pool, as our bulkhead mechanism. One resource we apply this pattern to is the thread pools supplied to an API or to the Twilio SDK.

object ExecutionContextConfig {

  val unboundedExecutor: ExecutorService = Executors.newFixed..

  val listeningExecutorService: ListeningExecutorService =
    MoreExecutors.listeningDecorator(unboundedExecutor)

  /**
   * Set the thread pool that Twilio will use for all the `async` operations.
   */
  Twilio setExecutorService listeningExecutorService

  /**
   * Execution context backed by unbounded threads. To be used for
   * blocking, I/O operations.
   */
  val ioExecutionContext: ExecutionContext = ExecutionContext.fromExecutor(unboundedExecutor)

  /**
   * Execution context backed by a limited number of threads. To be used for
   * non-blocking, CPU-bound operations.
   */
  val boundedExecutor: ExecutionContextExecutor = concurrent.ExecutionContext.global
}

  • MONITORING AND ALERTING

The key objective of monitoring is to make sure that the APIs individually, and the distributed system as a whole, are “observable”. We should be able to look at the system from the outside and identify things that are going wrong. Upon identifying such faults, we should be able to trigger alerts. The following techniques help us achieve that:

  1. Logs
  2. Metrics
  3. Distributed Tracing

Each CC 2.0 API produces the aforementioned data sets, enriched with contextual data. The end goal is to simplify finding the symptoms and causes of faults in the CC 2.0 platform. A simplified sketch of this kind of instrumentation follows.
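
As a simplified illustration (not our actual instrumentation), the sketch below records a failure with both a contextual log line and a counter metric, assuming SLF4J for logging and a Finagle StatsReceiver for metrics. “CallflowInstrumentation”, “recordFailure” and the metric names are hypothetical.

import com.twitter.finagle.stats.{LoadedStatsReceiver, StatsReceiver}
import org.slf4j.{LoggerFactory, MDC}

object CallflowInstrumentation {
  private val logger = LoggerFactory.getLogger(getClass)
  private val stats: StatsReceiver = LoadedStatsReceiver.scope("callflow")

  // Record a failure with contextual data so that symptoms (metrics)
  // and causes (logs) can be correlated during an incident.
  def recordFailure(callSid: String, error: Throwable): Unit = {
    MDC.put("callSid", callSid)          // contextual data attached to the log line
    try {
      logger.error("Callflow request failed", error)
      stats.counter("failures").incr()   // metric that an alert can be configured on
    } finally MDC.remove("callSid")
  }
}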

Why do we care more about “Resilience”?

The resilience of a system is a driver of its availability: higher resilience generally yields higher availability. Failing to be resilient can affect us in many ways. For the CC 2.0 platform, not being resilient means:

1. Customers cannot use features like the ChatBot or the self-service IVR.
  2. Customers cannot reach out to ING.

The consequences of all these are unacceptable to us. As a system that sits between our customers and agents, we must always be ready: ready to route phone calls to the most suitable agent, ready to reply to chat messages, ready to serve customers on mobile apps, ready to answer your queries via the ChatBot, and ready for more…
