Architecting: The Operational

Be sympathetic to infrastructure:

  • Virtualised? I/O and network concerns increase and noisy neighbours need accounting for. Virtualisation is not magic and can introduce hidden performance bottlenecks.
  • Cloud? Assume volatility and instability in I/O and network. Account for the random spread of services across infrastructure: closeness and low latency can’t be counted upon, so caching and the like take on greater importance. Elasticity can be a double-edged sword: you can grow fast automatically but incur additional costs that might break the budget.
  • Understand the underpinnings. Developers know their code and shouldn’t be leaving GC tuning to operational teams. Core counts should be accounted for when sizing thread pools (a sizing sketch follows this list).
  • Where are the costs and what do they look like as you scale? This was always relevant, but increasingly so in the cloud, where costs are more visible and receive more scrutiny.
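
For instance, here is a minimal Java sketch of sizing thread pools from the core count the JVM actually sees; the pool split and the I/O multiplier are illustrative assumptions, not prescriptions:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        // Size pools from the cores actually available to the JVM rather than
        // a hard-coded constant; on a VM or in a container this can be far
        // fewer than the physical host offers.
        int cores = Runtime.getRuntime().availableProcessors();

        // CPU-bound work: roughly one thread per core.
        ExecutorService cpuPool = Executors.newFixedThreadPool(cores);

        // I/O-bound work: threads spend time blocked, so a common rule of thumb
        // is cores * (1 + wait/compute); the factor of 4 here is an assumption
        // purely to illustrate the idea.
        int ioThreads = cores * 4;
        ExecutorService ioPool = Executors.newFixedThreadPool(ioThreads);

        System.out.println("cores=" + cores + " cpuPool=" + cores + " ioPool=" + ioThreads);

        cpuPool.shutdown();
        ioPool.shutdown();
    }
}
```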

Monitoring and measurement need to be in place to track growth and its potential impact on architectural lifetime constraints (when do you need to start thinking about a change to cope with growth?). More fundamentally, if you don’t know when everything is healthy, you can’t know when it’s safe to deploy, when to stop a deploy, or when to make the high-cost call to wake someone up.
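
As a rough illustration, here is a hypothetical sketch of the kind of aggregate health signal that might gate those decisions; the check names, metric sources and thresholds are assumptions for the sake of the example:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// A registry of named health checks whose aggregate answer ("is everything
// healthy right now?") gates deploy, rollback and paging decisions.
public class HealthRegistry {
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    public void register(String name, BooleanSupplier check) {
        checks.put(name, check);
    }

    // Safe to deploy (or keep deploying) only when every check passes.
    public boolean allHealthy() {
        return checks.values().stream().allMatch(BooleanSupplier::getAsBoolean);
    }

    public static void main(String[] args) {
        HealthRegistry registry = new HealthRegistry();
        registry.register("error-rate", () -> currentErrorRate() < 0.01);  // < 1% errors
        registry.register("p99-latency", () -> currentP99Millis() < 250);  // < 250ms p99
        System.out.println(registry.allHealthy() ? "safe to proceed" : "stop / page");
    }

    // Stand-ins for real metric sources (e.g. a metrics backend); values are invented.
    private static double currentErrorRate() { return 0.002; }
    private static long currentP99Millis() { return 180; }
}
```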

Configuration can be a significant source of coupling; house it with the entity that requires it wherever possible. Provide appropriate methods for discovering, describing and manipulating configuration. Do not confuse a central point of access (e.g. a console) with a central point of storage or implementation.
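
One hypothetical way to express that separation (the interface and service names are invented for illustration): each service stores and validates its own configuration and exposes it through a small interface, while a console is merely an aggregating client of those interfaces.

```java
import java.util.Map;

// Each service owns (stores) its configuration and exposes operations to
// discover, describe and change it. A central console is just a client of
// these interfaces: a central point of access, not a central point of storage.
interface Configurable {
    Map<String, String> describe();      // discover keys and current values
    String get(String key);
    void set(String key, String value);  // validated and applied by the owner
}

class CacheService implements Configurable {
    private final Map<String, String> config =
            new java.util.concurrent.ConcurrentHashMap<>(
                    Map.of("maxEntries", "10000", "ttlSeconds", "300"));

    @Override public Map<String, String> describe() { return Map.copyOf(config); }
    @Override public String get(String key) { return config.get(key); }
    @Override public void set(String key, String value) { config.put(key, value); }
}

public class OpsConsole {
    public static void main(String[] args) {
        // The console reaches out to each service; it holds no configuration itself.
        Configurable cache = new CacheService();
        System.out.println("Before: " + cache.describe());
        cache.set("ttlSeconds", "600");
        System.out.println("After:  " + cache.describe());
    }
}
```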

Make sure to isolate/abstract key points of scale such as databases. Vertical scaling is acceptable early but can be painful to transition from in the absence of decent preparation. Cloud changes the balance, often making horizontal scale the better architectural approach from day one.
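
A minimal sketch of that isolation, assuming a hypothetical user store: callers depend on a narrow repository interface, so moving from a single node to shards changes only the implementation behind it.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Callers see only this interface; the storage underneath can evolve from a
// single vertically-scaled database to a sharded, horizontally-scaled one
// without touching business code.
interface UserRepository {
    Optional<String> findEmail(long userId);
    void saveEmail(long userId, String email);
}

// Day-one implementation: a single node (in-memory here to keep the sketch runnable).
class SingleNodeUserRepository implements UserRepository {
    private final ConcurrentHashMap<Long, String> table = new ConcurrentHashMap<>();

    @Override public Optional<String> findEmail(long userId) {
        return Optional.ofNullable(table.get(userId));
    }
    @Override public void saveEmail(long userId, String email) {
        table.put(userId, email);
    }
}

// Later implementation: route by shard key; only this class knows shards exist.
class ShardedUserRepository implements UserRepository {
    private final UserRepository[] shards;

    ShardedUserRepository(UserRepository[] shards) { this.shards = shards; }

    private UserRepository shardFor(long userId) {
        return shards[(int) Math.floorMod(userId, (long) shards.length)];
    }
    @Override public Optional<String> findEmail(long userId) {
        return shardFor(userId).findEmail(userId);
    }
    @Override public void saveEmail(long userId, String email) {
        shardFor(userId).saveEmail(userId, email);
    }
}
```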

It’s essential to establish SLAs and/or baselines to allow for routine testing and analysis of performance improvement or degradation.
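
A minimal sketch of such a check, with made-up numbers: compare a measured p99 latency against both the SLA ceiling and the last known-good baseline.

```java
// A coarse regression gate. The thresholds and tolerance are illustrative
// assumptions, not real SLA figures.
public class LatencyGate {
    static final double SLA_P99_MILLIS = 300.0;       // contractual ceiling
    static final double BASELINE_P99_MILLIS = 180.0;  // last known-good run
    static final double TOLERANCE = 0.10;             // allow 10% drift before flagging

    static String evaluate(double measuredP99) {
        if (measuredP99 > SLA_P99_MILLIS) {
            return "FAIL: SLA breached";
        }
        if (measuredP99 > BASELINE_P99_MILLIS * (1 + TOLERANCE)) {
            return "WARN: degradation vs baseline";
        }
        return "OK";
    }

    public static void main(String[] args) {
        System.out.println(evaluate(175.0)); // OK
        System.out.println(evaluate(210.0)); // WARN: degradation vs baseline
        System.out.println(evaluate(350.0)); // FAIL: SLA breached
    }
}
```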

Testing environments beyond unit test and system test with stubs become increasingly less viable as the number of components grows:

1. Providing an approximation of production hardware is costly, as the kit is typically under-utilised.

2. Importing database snapshots from production is high-cost and requires downtime.

3. Maintenance and version control create friction that is tolerated in production, where predictability and traceability are considered important. In other environments the stance is more relaxed, making stability a problem that limits effectiveness.

It is often better to invest early in dark-testing (in-production testing) with appropriate infrastructure to:

1. Record, duplicate and re-route traffic (a sketch follows this list).
2. Roll out changes incrementally to sets of boxes and/or clusters.
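
As a sketch of the first point, here is a hypothetical shadowing proxy: the live handler serves the caller while a copy of each request exercises a dark deployment whose response is only compared and logged. The handler behaviour and the comparison are assumptions for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Traffic duplication ("shadowing"): every request is served by the live
// handler, and a copy is sent asynchronously to a dark deployment whose
// response is recorded and compared but never returned to the caller.
public class ShadowingProxy {
    private final Function<String, String> live;
    private final Function<String, String> dark;
    private final ExecutorService shadowPool = Executors.newFixedThreadPool(2);

    ShadowingProxy(Function<String, String> live, Function<String, String> dark) {
        this.live = live;
        this.dark = dark;
    }

    String handle(String request) {
        String liveResponse = live.apply(request);         // the caller only ever sees this
        shadowPool.submit(() -> {
            try {
                String darkResponse = dark.apply(request); // exercised, not served
                if (!darkResponse.equals(liveResponse)) {
                    System.out.println("divergence for request: " + request);
                }
            } catch (Exception e) {
                System.out.println("dark deployment failed: " + e.getMessage());
            }
        });
        return liveResponse;
    }

    public static void main(String[] args) throws InterruptedException {
        ShadowingProxy proxy = new ShadowingProxy(
                req -> "v1:" + req,   // current production behaviour
                req -> "v2:" + req);  // candidate under dark test
        System.out.println(proxy.handle("GET /orders/42"));
        Thread.sleep(200);            // let the shadow comparison print
        proxy.shadowPool.shutdown();
    }
}
```

Incremental rollout then follows naturally from the same routing layer: shift a small fraction of real traffic to the new version and widen it only while the health signals stay green.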

The benefit of such an approach is felt not only in testing but also in dealing with the inevitable failures that creep through regardless of QA efforts. It also closes the difficult-to-plug holes around validating non-functional requirements.
