Next chapter on our journey to achieve and maintain operational excellence of ft.com
A few years ago, the ft.com group worked on a project that involved a complete rebuild of ft.com called NEXT.
ft.com is now:
- Built on modern technologies
- Using the fundamentals of distributed systems, i.e. a microservices architecture
- Shipping hundreds of times a week
- Focused on measurements and a healthy appetite for A/B testing.
Since the rebuild, the strategy for ft.com has been ‘No Next NEXT’. We want to make sure that the product and the underlying technology do not drift so far off course that we have to rebuild ft.com again in a few years, spending entirely avoidable time and money. This vision means that achieving and maintaining operational excellence on ft.com is paramount to keeping the website sustainable. In this blog post, I talk about how we plan to do this.
Operational Excellence is about sustainable improvement within an organisation
Operational Excellence is a mindset that embraces certain principles and tools to create sustainable improvement within an organisation. The objectives for achieving operational excellence are as follows:
- Keeping toil low while maintaining live products and services
- Ensuring minimum waste during the entire lifecycle of products and services, i.e. from the inception of an idea to retiring and decommissioning underutilised services
- Having smooth processes for any incoming issues, with a clear path for resolving them
- Having a high level of visibility of group-level operational concerns and a clear prioritisation mechanism for fixing large operational concerns
- Enabling stream-aligned product teams to work with minimal disruption
We used to have the Opscop team dealing with incidents and bugs reported on ft.com
In 2020, the ft.com and apps group moved from initiative-based teams to stream-aligned durable teams. As part of this change, we set up a small dedicated team, called the Opscop team, to deal with all incidents and incoming requests from stakeholders for ft.com. This team had two permanent members (a tech lead and a delivery manager) and each week was assisted by two engineers rotating in from the durable teams. The team would triage all incoming issues and delegate the non-urgent ones to the durable team that owned the affected service. This setup had some benefits:
- Having permanent members meant there were two people who had the context for all incoming incidents
- Durable teams owning the service had visibility of incoming issues and could prioritise those issues alongside product initiatives
- Having a delivery manager meant that the engineers could focus on technical work and avoid interruptions from stakeholders.
However, the Opscop team was set up with a very narrow focus: triaging, fixing and delegating issues on ft.com. It also depended heavily on its two permanent members. This had some drawbacks.
No metrics and Service Level Agreements for incoming issues
The team didn’t work towards any operational metrics such as mean time to fix incoming issues or the cost of incidents. This meant:
- The process to fix incoming bugs or minor customer issues was not well defined
- The stream-aligned durable teams had no way to prioritise these issues in a consistent way
- The Opscop team had no way to identify areas with recurring issues that could do with wider improvements
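To make the two gaps above concrete, here is a minimal sketch of how a mean time to fix and a list of recurring problem areas could be derived from issue records. It is an illustration only, not how ft.com measures these things: the `Issue` record shape, the service names and the `threshold` parameter are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Issue:
    # Hypothetical record shape; the field names are assumptions.
    service: str
    reported_at: datetime
    fixed_at: datetime


def mean_time_to_fix(issues: list[Issue]) -> timedelta:
    """Average time between an issue being reported and being fixed."""
    if not issues:
        raise ValueError("need at least one fixed issue")
    total = sum((i.fixed_at - i.reported_at for i in issues), timedelta())
    return total / len(issues)


def recurring_areas(issues: list[Issue], threshold: int = 2) -> list[str]:
    """Services with at least `threshold` issues, most affected first."""
    counts = Counter(i.service for i in issues)
    return [name for name, n in counts.most_common() if n >= threshold]


# Made-up sample data; the service names are hypothetical.
issues = [
    Issue("article-page", datetime(2021, 3, 1, 9), datetime(2021, 3, 1, 13)),
    Issue("article-page", datetime(2021, 3, 2, 9), datetime(2021, 3, 2, 11)),
    Issue("search", datetime(2021, 3, 3, 9), datetime(2021, 3, 3, 12)),
]

print(mean_time_to_fix(issues))  # 3:00:00
print(recurring_areas(issues))   # ['article-page']
```

Even a simple measure like this gives durable teams a consistent basis for prioritising fixes, and shows which services keep generating issues.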
Some large operational issues without a clear owner to drive strategic changes
In an ecosystem of mature distributed systems, we had no one looking at ft.com as a whole or at the larger operational concerns on ft.com. This meant:
- Performance metrics on ft.com were only looked at per microservice rather than for the whole website
- We were struggling with ownership of wider operational concerns on the website, such as driving initiatives around accessibility, SEO and other improvements that span ft.com.
No systematic process for creating a shared sense of ownership across ft.com
This team had a permanent engineer (the tech lead) and two engineers rotating from our stream-aligned durable teams. We set it up this way so that engineers in a durable team could get a flavour of working on operational aspects of ft.com and learn about microservices on ft.com. However, in reality this meant each engineer’s learning experience depended entirely on the incidents that happened during their week. Some engineers picked up interesting issues, learnt loads and took useful operational ideas back to their teams, while others saw the rotation as a mere disruption from their product work. We also found it hard to work on larger fixes for operational concerns on ft.com, as the cost of handovers and knowledge transfers each week was very high.
We are reshaping the Opscop team, and calling it the ft.com team
Our plan now is to reshape the Opscop team. We’ll do this by ending rotations: all team members will be permanent so they can take on longer-term initiatives and proactively find ways to enable operational excellence on ft.com. They will still be the single point of contact for live issues and bugs on the website, which they will then triage and delegate to the team that owns the service. Our stream-aligned product teams are accountable for the operational concerns of their products, and the ft.com team will help them achieve operational excellence. This will enable us to keep operational excellence at the forefront of everything we do. We believe this will reduce the reactive operational work that teams have to do, giving us more time as a group to focus on the proactive, preventive and predictive maintenance of ft.com systems and services. Our vision is to get to a place where the ft.com team, in partnership with the stream-aligned product teams and the platform team, spends 90% of its time improving the overall reliability of ft.com and only 10% of its time triaging issues that the group has never seen before.
The ft.com team will also be responsible for creating metrics that help our stream-aligned product teams prioritise operational work alongside new feature development. It will also help the platform team look for reusable patterns that reduce duplicate implementations and hence reduce waste. Most importantly, this team will advocate for operational excellence and educate both internal and external stakeholders of ft.com about it.
Because this team only works on ft.com initiatives and concerns, it’s going to be called the ft.com team. Our stream-aligned product teams work on specific business value products that live on both ft.com and the apps.
For this team to be high-performing, it needs permanent engineers with a specialisation in site reliability and a dedicated product owner with a strong technical background. This will be the first time we’ve hired specialists in site reliability, so there is lots of interesting work to do.
Rebuilding ft.com took close to two years and cost in the order of £10m. It also meant we had limited capacity to change the old site for the users still on it. We don’t want to do that again. So the plan is to deliver on our ‘No Next NEXT’ strategy by achieving and maintaining operational excellence on ft.com.
We are not quite where we want to be with this, because the Opscop team had the wrong structure and a limited ability to create awareness and educate our teams about operational excellence. Hence the plan is to create a team called ft.com that can help us with the next steps on our journey to achieve and maintain operational excellence. We will capture operational metrics and stakeholder happiness metrics to measure the success of this team, and we will keep acting on what those metrics tell us to make sure we are successful.
This is a very exciting time for the ft.com group and we are hiring for this team. If this interests you, please get in touch with our talent team at firstname.lastname@example.org.