Rollback as an Agile Strategy

Hamed Ghasemzadeh
Aug 8, 2020

Rollback is downgrading the current version of a software component to the previous one. This usually happens due to failures severe enough that the current version cannot keep running, for example when a back-end service doesn’t start or when a web application doesn’t show up.

Traditionally, rollback was part of operations and support around the deployment of new versions. A developer releases tested software, and later that component gets deployed by Ops people in an instruction-based process. The Ops person follows the release instructions, which are normally written by the developers, who know more about the recent change: for example, instructions for the database, or new settings that should be applied in the router. Rollback happens in case any of these steps fails.

Cleaning

As rollback is the way to come back to a normal situation, there should be enough assurance that the rollback can both run the previous version and clean up the remains of the failed service. Cleaning is mostly related to state, including the database and settings files, and sometimes the cache. In all cases some data has been updated, and these updates should be rolled back. Rollback is mainly deploying the older working version and making sure the data is in the state it was in before deployment, or at least in a state compatible with the older version (backward-compatible upgrades).
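The idea of deploying the old version and restoring the state can be sketched in a few lines. This is a minimal illustration, not a real deployment tool: the state is a plain dictionary, and `deploy_with_rollback`, `upgrade_to_v2` and `service_starts` are hypothetical names made up for the example.

```python
import copy


def deploy_with_rollback(state, upgrade, health_check):
    """Apply `upgrade` to `state`; return the snapshot if the new version fails.

    Returns a tuple (state_after, rolled_back).
    """
    snapshot = copy.deepcopy(state)  # state as it was before deployment
    try:
        upgrade(state)               # may touch data, settings and cache
        if health_check(state):
            return state, False      # the new version keeps running
    except Exception:
        pass                         # a crashing upgrade counts as a failed one
    # Rollback: discard everything the failed version touched.
    return snapshot, True


def upgrade_to_v2(state):
    # Hypothetical upgrade that changes version, settings and cache.
    state["version"] = "2.0"
    state["settings"]["new_flag"] = True
    state["cache"]["warm"] = ["a", "b"]


def service_starts(state):
    # Hypothetical failure: the new version does not start.
    return state["version"] != "2.0"


before = {"version": "1.0", "settings": {}, "cache": {}}
after, rolled_back = deploy_with_rollback(before, upgrade_to_v2, service_starts)
print(rolled_back, after)
```

Returning the snapshot rather than undoing individual changes mirrors the "always clean" position: nothing written by the failed version survives the rollback.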

Someone might argue that there is no need to roll the data back as part of a service rollback, as the old service version might be compatible with the modified data. This might be true in some cases, but generally the data is updated for the new version and is not needed by the old one. Removing data that is not needed can be seen as a maintenance strategy: we always clean the data as part of the rollback instead of arguing about exceptions.

Dependencies

A deployment consists of an artifact and a set of dependencies. The relation between a service and its data can also be seen as a dependency, especially when the data needs to be updated upon deployment. In that sense the dependencies include both data and other services. There might be some dependents as well, which use the services provided by the component at hand.

An upgrade for a component gives a broken dependent

A new version might need updates to other services or dependencies in order to function. In that sense, a rollback of any of those dependencies potentially triggers a rollback of the dependent service as well. In other words, a rollback could be the downgrade of a failed component or of its data, cascaded to all its dependents.

The dependency rollback is cascaded to component rollback

There are cases in which a dependent doesn’t need to roll back when a dependency fails. For example, a dependency might have two dependents, one of which works only with the latest version while the other works with both the latest and the previous version. The first dependent is subject to rollback if the dependency fails, but the second might keep working as is thanks to its wider compatibility. A dependent works with a range of versions of a dependency, and that range comes into account in case of rollback. These could be called compatibility versions. A compatibility graph for each release should be given by the developers, so the Ops team can work out per release where a rollback should be cascaded.
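A compatibility graph of this kind can be checked mechanically. The sketch below is one possible shape for it, under assumptions that are not in the original: versions are dotted numbers, each dependent declares an inclusive version range per dependency, and every component has a known previous version to return to. All service names and version numbers are invented for the example.

```python
def parse(version):
    """Turn '1.9.0' into (1, 9, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def cascaded_rollbacks(compat, previous, failed):
    """Return every component that has to roll back, starting with `failed`.

    compat[dependent][dependency] = (lowest, highest) version the currently
    deployed dependent works with (inclusive).
    previous[component] = version the component returns to on rollback.
    """
    must_roll_back = [failed]
    queue = [failed]
    while queue:
        dependency = queue.pop(0)
        target = parse(previous[dependency])
        for dependent, ranges in compat.items():
            if dependency not in ranges or dependent in must_roll_back:
                continue
            low, high = ranges[dependency]
            if not (parse(low) <= target <= parse(high)):
                # The dependent cannot work with the rolled-back version,
                # so the rollback cascades to it.
                must_roll_back.append(dependent)
                queue.append(dependent)
    return must_roll_back


# "web" works only with the latest "auth"; "reports" works with both versions.
compat = {
    "web": {"auth": ("2.0.0", "2.0.0")},
    "reports": {"auth": ("1.9.0", "2.0.0")},
    "portal": {"web": ("4.0.0", "5.1.0")},
}
previous = {"auth": "1.9.0", "web": "5.0.0", "reports": "3.2.0", "portal": "1.0.0"}

print(cascaded_rollbacks(compat, previous, "auth"))
```

In this example the rollback of `auth` cascades to `web` only: `reports` accepts the previous `auth`, and `portal` accepts the previous `web`, so the cascade stops there.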

Agility

There are advantages to avoiding unnecessary rollbacks. First, a new version has new features or bug fixes that might be important for stakeholders. Second, avoiding rollback means more availability and less interruption for the service. Third, it minimizes cascaded rollbacks. On the other hand, relying on safe rollback as an option has an impact on the agility of developers.

A developer who knows that a predictable rollback is available might need fewer checks and tests before deployment, which contributes to productivity, as the cost of failure is much lower when rollback is predictable and quick. So the lack of a rollback option raises the cost of failure and reduces development agility. Investing in quick rollbacks can therefore be seen as part of the agility of the company. The Ops team requires developers to share information about dependencies per deployment in order to handle cascading, which makes rollback more predictable when it happens.

The more common a component is, the higher the chance that a rollback gets cascaded, which makes it expensive: if the common component fails, all its dependents should roll back. The bigger the component, the higher the chance of failure and, in turn, of cascaded rollbacks. So with a big component as a dependency, expensive rollbacks cancel out the benefit of having rollback at all. In that case no one can count on rollback, which pushes developers toward demanding more assurance before release, and that adds to the development cost.

Cost

There are three factors determining the cost of rollback:

  1. How big the rollback is, meaning how far it has been cascaded
  2. How often a service goes into rollback
  3. How easy the rollback is

The third factor is a matter of tooling. For example, Kubernetes has made rollback easy for stateless services. Database migration tools have made rollback of the database schema predictable. Containers have made rollback easier for Ops, as they don’t need to know a lot about the internals, and less cascading, as fewer dependencies are shared between services.
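What makes schema rollback predictable in migration tools is that each upgrade step is paired with a step that undoes it. The sketch below illustrates that pairing with Python's built-in `sqlite3` module and an in-memory database. It is not the API of any particular migration tool; the `MIGRATIONS` list, the `migrate` function and the table names are assumptions made for the example.

```python
import sqlite3

# Each migration pairs an upgrade statement with the statement that undoes it.
MIGRATIONS = [
    ("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
     "DROP TABLE users"),
    ("CREATE TABLE notifications (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)",
     "DROP TABLE notifications"),
]


def migrate(conn, current, target):
    """Move the schema from version `current` to `target`, in either direction."""
    while current < target:
        conn.execute(MIGRATIONS[current][0])  # upgrade step
        current += 1
    while current > target:
        current -= 1
        conn.execute(MIGRATIONS[current][1])  # matching rollback step
    conn.commit()
    return current


def tables(conn):
    """List the table names currently present in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
version = migrate(conn, 0, 2)        # deploy the new release: schema version 2
after_upgrade = tables(conn)
version = migrate(conn, version, 1)  # roll back to schema version 1
after_rollback = tables(conn)
print(after_upgrade, after_rollback, version)
```

Because the rollback step only drops what the new version added, the older service version finds the schema it expects.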

The quality of the service determines how often it goes into rollback (factor 2). It is also related to the quality of its dependencies in case of cascaded rollbacks.

A service goes into rollback in case of a failure or when it has a dependent which doesn’t work. If a dependency doesn’t exist or a dependent doesn’t work with a specific dependency, the newer version has to roll back in all cases, and that rollback is cascaded to eliminate incompatibilities. This is why big deployment units with many dependents are the biggest risk.

When smaller deployments depend on a big component, the smaller components always have to be forward compatible, which means both the old and the new version of the smaller component should work with the new version of the big deployment unit. To achieve this, the monolith has to keep accumulating features so that both the old and the new version keep working. For example, removing an endpoint is not easy. Changes to the big monolith should be backward compatible even when no dependent remains in the next release after the upgrade. Sometimes this leads to having multiple versions of the same component per release.

Platform Services

Platform services normally have more dependents, as they are intended to be shared among other services. For example, a notification service used by many app developers can typically be considered a platform service.

Backward compatibility is not free; in practice it is a trade-off against the cost of cascading rollbacks. Rollback cost inherently follows the dependency structure: highly coupled microservices potentially have the same issues as a big monolith when it comes to rollback cascading. Having common services with backward compatibility reduces the cascading risk in case a dependent rolls back.

Platform-level dependencies with many dependents may in practice have to be backward compatible, as rollback could be expensive. Backward compatibility helps when there is no rollback, but requiring it of all services with dependents makes many platform-level services forward only, which sets rollback aside at least for those services. This can make development of platform services hard.

The other strategy is to not use a service until enough time has passed and the risk of rollback is lower. This strategy reduces the productivity of platform teams, as it adds to the time between when a feature is requested and when it is ready for use. On the other hand, it makes development of platform services easier: there is always one group of users who depend on the stable version and are protected against rollbacks, while the other group in practice plays the role of early user for the platform service, allowing the service to be developed by constantly finding defects.

Another strategy is to allow all services to roll back and to remove a feature that is not needed as soon as possible. This makes cross dependencies between services an important factor in the cost of rollbacks: the more dependencies, the higher the chance of cascading, especially for platform-level services, which have more dependents than edge services. Keeping cross-service dependencies as low as possible keeps the rollback cost low, which lets developers assume that a predictable and quick rollback is available.

Lack of rollback means a service is forward only. There are cases in which updates to the data are not easy to roll back, for example when a column should be deleted. If the migration removes the column, the data is no longer backward compatible, so the dependent service cannot roll back. Destructive updates are therefore always forward only and limit rollback. Having forward-only updates in the dependency graph diminishes the effect of rollback to the extent that it limits the rollback option. Having forward-only components also means there is a need for backward compatibility, which comes at a cost.
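The effect of forward-only updates on the dependency graph can be sketched as a simple reachability check. The rule used here is a deliberately conservative assumption, not something the text prescribes: a service is treated as safe to roll back only if neither it nor anything it depends on, directly or transitively, received a forward-only update in the release. The function name and the service names are hypothetical.

```python
def rollback_is_safe(service, depends_on, forward_only):
    """Return True if no forward-only component is reachable from `service`.

    depends_on[name] = list of components `name` depends on.
    forward_only = set of components that received a destructive update.
    """
    seen = set()
    stack = [service]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        if name in forward_only:
            return False  # e.g. a column was dropped; the old data is gone
        stack.extend(depends_on.get(name, []))
    return True


depends_on = {
    "orders-api": ["orders-db"],
    "reports": ["orders-api"],
    "search": ["search-index"],
}
forward_only = {"orders-db"}  # this release dropped a column in orders-db

for service in ("orders-api", "reports", "search"):
    print(service, rollback_is_safe(service, depends_on, forward_only))
```

A single destructive migration in `orders-db` removes the rollback option for `orders-api` and, under the conservative rule, for `reports` as well, while `search` is unaffected.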

Conclusion

Backward compatibility is not free and in practice is a trade-off against the cost of cascaded rollbacks. Platform-level services inherently impose a higher cost of development, as there is a risk of cascading rollbacks. Making platform services backward compatible is one strategy, alongside separating a stable and a latest version, which allows development to keep its agility and avoid expensive incompatibilities, failures and cascaded rollbacks.

Rollback has an impact on the agility and quality of the product by giving developers the option to have the service tested in practice with controlled risk. This can also be true for platform services if Ops have access to a clear dependency structure that allows predictable rollbacks. Controlling unnecessary coupling between services also helps reduce cascading risk, while giving more agility to developers by keeping rollback a practical option.
