Deploying WSO2 API Manager on multiple data centers — caveats, tips, and test conditions

There has been an increase in the number of multi data center deployments in the past several months. Enterprises are looking to deploy products on multiple data centers as well as refactor existing applications to make them “multi data center aware”.

Why deploy on multiple data centers?

  1. Data center maintenance — When doing data center maintenance, all software systems inside the data center will be unavailable. Having another data center makes maintenance simple, with no service disruption to users/customers. (One could also argue that data center maintenance should be possible without making the entire data center unavailable.)
  2. Faster response times — Users in the same or nearest geographic location get better response times
  3. Data locality policies or compliance — e.g. EU data protection regulations

Although the points discussed below apply to API Manager, some are generic requirements that apply to any software component deployed across multiple data centers.

Routing traffic

When there are data centers across different geographic regions (US-East and EU, for example), it’s not ideal to route traffic in a round-robin manner from a global load balancer, as this yields widely varying latency for users. Routing based on geolocation is a much better strategy. Active-active round-robin load balancing may be acceptable when the data centers are within a single state or a couple of adjacent states/towns/locations.
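
To make the routing decision concrete, below is a minimal sketch of geolocation-based data center selection. The region names and gateway URLs are hypothetical; in a real deployment this logic lives in the global load balancer or DNS service (GeoIP/geolocation routing policies), not in application code.

```python
# Minimal sketch of geolocation-based routing. Region names and gateway URLs
# are hypothetical placeholders.
REGION_TO_DATACENTER = {
    "us-east": "https://gateway.us-east.example.com",
    "eu": "https://gateway.eu.example.com",
}
DEFAULT_DATACENTER = REGION_TO_DATACENTER["us-east"]

def pick_datacenter(client_region: str) -> str:
    """Send the client to the gateway in (or nearest to) its own region."""
    return REGION_TO_DATACENTER.get(client_region, DEFAULT_DATACENTER)

print(pick_datacenter("eu"))        # EU users stay in the EU data center
print(pick_datacenter("ap-south"))  # unknown regions fall back to the default
```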

Database setup

When multiple data centers are actively serving traffic, the databases used by API Manager should be replicated across all data centers. Each node (or instance/API Manager component) deployed in a data center should resolve to a locally accessible database instance.
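
As an illustration of “resolve to a locally accessible database instance”, the sketch below picks a database host per data center. The DATACENTER variable and host names are assumptions for illustration; the actual value would go into each node’s datasource configuration.

```python
# Minimal sketch: each node connects to the database replica in its own data
# center, while cross data center replication keeps the replicas consistent.
# DATACENTER and the host names are hypothetical.
import os

LOCAL_DB_HOST = {
    "us-east": "apimgtdb.us-east.internal",
    "eu": "apimgtdb.eu.internal",
}

def local_db_url() -> str:
    dc = os.environ.get("DATACENTER", "us-east")
    return f"jdbc:mysql://{LOCAL_DB_HOST[dc]}:3306/apimgtdb"

print(local_db_url())  # the value a node in this data center would be configured with
```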

Instance/node/machine/VM configuration

All instances in all data centers should use a common timezone (e.g. UTC). If the local timezone is used, log entries will be written with local timestamps. Without a central log processing system that normalizes timestamps and processes log entries, it will be difficult to correlate logs across data centers (an unlikely task, but a painful one if you ever have to do it).

When clocks are not in sync, a token generated in one data center might fail when subsequent requests are routed to a different data center. It may still work if the token lifetime is greater than the time difference between the two data centers. For consistency, each node should use the same timezone and ideally be synced with a time server (ntpd on Linux).
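
The failure mode is simple to see with a sketch. Using illustrative numbers, if the validating data center’s clock runs ahead of the issuing one by more than the token lifetime, a freshly issued token is already “expired” there:

```python
# Minimal sketch of clock skew invalidating a token; numbers are illustrative.
from datetime import datetime, timedelta, timezone

token_lifetime = timedelta(minutes=60)
issued_at_dc1 = datetime(2016, 6, 1, 10, 0, tzinfo=timezone.utc)  # DC1's clock
expires_at = issued_at_dc1 + token_lifetime

# DC2's clock runs 90 minutes ahead due to a timezone/NTP misconfiguration.
now_at_dc2 = issued_at_dc1 + timedelta(minutes=90)

print("token accepted at DC2?", now_at_dc2 < expires_at)  # False: rejected
# With all nodes on UTC and synced via ntpd, "now" is effectively the same
# everywhere and only the real token lifetime matters.
```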

Syncing artifacts (APIs) across data centers can be done in three ways. Raj explains these strategies in detail in the post below.

Throttling

In all API Manager versions before 2.0.0, when you set up a multi data center deployment, the effective throttling limit doubles if you route requests between data centers in a round-robin fashion. When two data centers are in different cluster domains, throttling counters are maintained per cluster. So when you apply a throttling policy of x calls per second to an API, a user can get up to 2x calls across the two data centers. Care should be taken when defining throttling policies if traffic for the same user is served simultaneously from two data centers.
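
The doubling is easy to see as arithmetic. The sketch below uses an illustrative limit of 10 calls per second and shows how two independent per-cluster counters let one user reach 2x the configured limit:

```python
# Minimal sketch: each cluster domain keeps its own throttle counter, so neither
# data center sees the other's count. Limit and counter names are illustrative.
LIMIT_PER_SECOND = 10  # policy: x = 10 calls per second

counters = {"dc1": 0, "dc2": 0}

def allow(dc: str) -> bool:
    if counters[dc] < LIMIT_PER_SECOND:
        counters[dc] += 1
        return True
    return False

# Round-robin one user's calls across both data centers for one second.
allowed = sum(allow("dc1" if i % 2 == 0 else "dc2") for i in range(40))
print(allowed)  # 20, i.e. 2x the policy, because the counters are independent
```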

The following article provides background information on setting up API Manager in different clusters.

Test scenarios

It’s always recommended to test a deployment thoroughly in a staging environment before moving to production. The following are some edge cases that should be tested:

  • Primary key collisions — Generate two OAuth keys from two data centers at the same time. A script can be used to generate tokens continuously after reducing the token lifetime during testing (see the sketch after this list)
  • Split brain scenarios — Stop DB replication and run a long-running test that involves token generation/revocation (this might not be a valid scenario in some cases). If requests come from trusted clients, a longer token expiration time may be an option. A way to resolve conflicts is needed if any arise
  • Replication delay — Test the replication delay between data centers. If a user generates a token in one data center and uses it immediately in the other, the call will fail when that gap is shorter than the replication delay. Assess whether this is a valid scenario for the deployment/business use case. If it is, and the replication delay is large, there can be occasional invalid-token errors. Also keep in mind that the connection between data centers can sometimes become unreliable; the additional latency will contribute to invalid-token errors
  • Artifact synchronization delay — Depending on the strategy chosen to sync artifacts across data centers, when an API is published from one data center there will be a delay before that API becomes available in the other. This is usually not an issue for end-user scenarios; however, integration/automation tests might fail
  • Other system/application testing — Along with API Manager, a data center hosts various other software/services that implement different business capabilities. When introducing another data center, all of those applications have to be multi data center aware. Expect a larger testing (and possibly development) investment to make every other component deployed in the data center work across multiple data centers
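
Below is a minimal sketch of the continuous token-generation test mentioned in the first and third bullets above. The gateway URLs, client credentials, and sample API path are hypothetical placeholders, and it assumes the client_credentials grant and the Python requests library; adapt it to the actual endpoints and credentials of the deployment under test.

```python
# Minimal sketch: request tokens from both data centers concurrently (to look
# for key collisions) and immediately replay each token against the *other*
# data center (to surface replication delay). All URLs/credentials are
# hypothetical placeholders.
import concurrent.futures

import requests

DATACENTERS = {
    "dc1": "https://gateway.us-east.example.com",
    "dc2": "https://gateway.eu.example.com",
}
CLIENT_ID, CLIENT_SECRET = "my-client-id", "my-client-secret"  # placeholders

def get_token(base_url: str) -> str:
    resp = requests.post(
        f"{base_url}/token",
        data={"grant_type": "client_credentials"},
        auth=(CLIENT_ID, CLIENT_SECRET),
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def call_api(base_url: str, token: str) -> int:
    resp = requests.get(
        f"{base_url}/sample/1.0/status",  # a hypothetical test API
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    return resp.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futures = {dc: pool.submit(get_token, url) for dc, url in DATACENTERS.items()}
    tokens = {dc: f.result() for dc, f in futures.items()}

# Use each token against the opposite data center right away.
print("dc1 token on dc2:", call_api(DATACENTERS["dc2"], tokens["dc1"]))
print("dc2 token on dc1:", call_api(DATACENTERS["dc1"], tokens["dc2"]))
```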