SmartShop — Growing from 2 stores to the whole estate
When developers write blog posts about scaling, they usually talk about a particular part of their tech stack that was limiting their ability to scale, or even their architecture as a whole. I’ve found that people seldom talk about scaling a whole product at the macro level: how do you grow a digital product from a small trial to a business-as-usual service used by millions of customers?
It’s certainly not an easy question to answer, but having been on the SmartShop team since the first part of our journey, I thought I’d give it a shot from an engineering perspective, so here goes…
(Note: there is a TL;DR section at the end if you want to know the secret sauce to our success)
Chapter 1: The pilot
Self-scanning shopping wasn’t revolutionary when I joined Sainsbury’s 4 years ago. Sainsbury’s already had a 3rd-party solution called FastTrack working in around 35 stores, but for many good reasons, the company decided to scrap the vendor solution before it was too late and develop its own product in-house. By late 2015, SmartShop (as it became known) was in its “pilot” phase, live to customers in only 2 stores.
I joined Sainsbury’s as part of the digital services team, a team tasked with building unified (where possible) backend services to be consumed by other teams in our new Digital & Technology division. But this was an in-house tech division in its infancy, growing hugely, and even with our team’s good intentions, the microservices we built were not used by many (if any) teams other than SmartShop. This created some legacy (I’m looking at you, lists service and SmartShop lists decorator service), but it did also help us focus on true microservice architecture.
In the pilot phase, the SmartShop backend consisted of a mixture of PHP, Python & Go services, following Service-Orientated Architecture to the letter. In fact, we were the first team in Sainsbury’s to adopt Go (a journey which I’ve talked about previously at London Gophers). We had the mentality that we would write each service in the language that was best for it. That was an honourable sentiment, but on reflection each decision was probably driven by how comfortable the team was coding in certain languages. This legacy of our mixed backend (for better or worse) lives on to the present day, although we have tackled it over time: we’re almost completely rid of our Python service, and for the past 2 years our strategy has been to build all new microservices in Go (where appropriate still, of course).
The pilot grew into a fully-fledged trial (who knew they were different!), and within 12 months SmartShop was operational in 10 stores.
Chapter 2: Growing pains
SmartShop was originally made up of two distinct development teams: apps and services. That’s not uncommon in other companies of course, and probably equally common was the lack of working closely together. It soon became apparent we (the services team) weren’t aware of what the apps team were working on or indeed much of what was on their roadmap. We built several features which never saw the light of day, often leaving a lingering code smell as these parts were seldom touched after initial development. Ultimately, the two teams also grew a little disengaged from each other, with no real communication about the reasons why. And as we started to scale to more stores, it became obvious our two technical dependencies (the in-store Wi-Fi network and the checkout system) weren’t always keeping pace.
The network we were on was never meant for us to use so greedily, and some of the problems with that showed after a while. It was traffic-shaped to benefit SmartShop, but still these were the early days where we had to send a ping every 30 seconds from each in-store handset to indicate any potential network issues (a great idea that also rather counter-intuitively put a bit more strain on the network).
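To make the heartbeat idea concrete, here is a minimal sketch in Go of how a service might track those 30-second pings and flag handsets that have gone quiet. It is an illustration under stated assumptions rather than the code we actually ran: the `Tracker` type, the `missedPingsAllowed` tolerance, the handset IDs and the start time are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// pingInterval matches the 30-second heartbeat described above.
const pingInterval = 30 * time.Second

// missedPingsAllowed is an assumed tolerance: a handset is only treated as
// stale once more than this many intervals pass without a ping.
const missedPingsAllowed = 3

// start is an arbitrary reference time used by the example below.
var start = time.Date(2017, time.December, 1, 9, 0, 0, 0, time.UTC)

// Tracker (hypothetical) remembers when each handset was last heard from.
type Tracker struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
}

// NewTracker returns an empty Tracker.
func NewTracker() *Tracker {
	return &Tracker{lastSeen: make(map[string]time.Time)}
}

// RecordPing stores the time of the latest ping from a handset.
func (t *Tracker) RecordPing(handsetID string, at time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastSeen[handsetID] = at
}

// Stale returns the sorted IDs of handsets whose last ping is older than
// missedPingsAllowed intervals at the given time.
func (t *Tracker) Stale(now time.Time) []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	var stale []string
	for id, seen := range t.lastSeen {
		if now.Sub(seen) > missedPingsAllowed*pingInterval {
			stale = append(stale, id)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	tr := NewTracker()
	tr.RecordPing("handset-01", start)
	tr.RecordPing("handset-02", start)

	// Two minutes later only handset-02 has pinged again.
	later := start.Add(4 * pingInterval)
	tr.RecordPing("handset-02", later)

	fmt.Println(tr.Stale(later)) // prints [handset-01]
}
```

The trade-off mentioned above shows up directly in the two constants: a shorter interval or a lower tolerance spots problems sooner, but every handset then sends more traffic over a network that is already under strain.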
We were also tightly coupled to the checkout system from the beginning. Although there is sound logic to this (pricing, promotions and product range are all identical to in-store checkouts), we were using a 3rd-party system that was a proof of concept rather than something meant to see us beyond a product pilot.
Despite all of this, we wanted to keep growing the proposition to more stores to benefit more customers. So we chugged along with the rollout, addressing performance issues where we could and protecting the customer experience whilst developing and growing the product further.
Chapter 3: Organisation change
To be honest, this chapter started when I joined and hasn’t finished yet. We have worked through a few organisational changes, but one in particular affected us directly — the services and the apps teams for SmartShop combined into two feature teams of cross-functional engineers.
Woah. Lots of change in that last sentence, so let’s take on each bit separately.
“The services and the apps teams combined”: Big news, but sensible and predictable. We were too far apart for too long, sometimes blaming each other for mistakes, and at times lobbing the occasional hot potato over the wall. But still, we had different ways of working, different expectations and different legacy.
“Two feature teams”: Bigger news — we were now going to split in a different way. There were too many of us for one team (not even the world’s largest pizza would have fed us), with 2 product owners and 2 backlogs of work for the same one product.
“Cross-functional engineers”: The biggest news. In essence, this meant not just working on your particular expertise but also learning, over time and from others, to work on different parts of the stack. Importantly, it’s different from a 100% full-stack team, but that was a distinction we took time to learn properly and address (and something this blog post covers pretty well).
These changes went down enthusiastically for some, less so for others. But everyone gave it a real go and importantly it has stood the test of time.
There was now a real emphasis on learning in the team, and one of the immediate benefits of that was to share best-practice engineering between ourselves and tackle some of the things that had been dragging us down. Test-Driven Development was adopted across the team, and heavy refactoring of some of the apps was taken on in order to increase test coverage dramatically. We actively moved away from manual QA testing, which was previously slow to do and sometimes inconsistent, and introduced automation everywhere across the stack. Pair-programming started to be used by default, not only to enforce the cross-functional learning mentality, but also to provide a constant code review that ultimately spelt the end of most Pull Requests (and waiting for reviews) across the project.
Collaboration thrived and the whole project benefited, even though productivity and (visible) deliverables probably dropped in the short-term, and it took a bit of time to get exactly right.
Chapter 4: The race to mobile payments
No sooner had we finally started building a real, worthwhile cadence from our team re-jiggle than we were tasked with the next big challenge. The race was on to be the first UK retailer to offer a checkout-less experience to customers, where they can scan and pay in-app and walk out of the store.
The funny thing was, from SmartShop’s inception we had all envisaged our app doing this. In fact, we had a sister team spun up to investigate and prototype this very thing with customers from a good 6 months before work started — a team that had the capacity to do this work separately but nonetheless a team we stayed close to.
We folded this sister team (which at this point consisted of one developer) into our team, which brought immediate knowledge-sharing and forced collaboration 😁. Not so counter-intuitively, it also gave us the opportunity to rebuild SmartShop into the prototype apps, as the prototypes were essentially lightweight versions of SmartShop that had been built from the ground up in Swift & Kotlin with high test coverage. Refactoring or rebuilding our own apps was already on the agenda, so this gave us another option.
Amazon Go only launched its first US customer store in January 2018. We weren’t far behind them, launching the UK’s first till-free shopping experience that same summer (not to be confused with when we launched a till-free store and made it SmartShop only in April 2019).
For the initial launch, we concentrated heavily on the iOS & Apple Pay implementation first before focusing on Android & Google Pay. In getting to the UK-first accolade, we did make some conscious (and less-conscious) decisions to take on technical debt to mop up later. A lot of code was clearly just copied from the prototype, which led to inconsistent design patterns and unclear context (built into code that was now owned by a new team). And by the time we got on to the Android implementation, Android Pay had become Google Pay, and the APIs had changed significantly. So context that was buried in code was now even harder to unpick given we had to rework it all.
But hey — been there, done that, got the newspaper headlines (and t-shirts). And over time a lot of the tech debt has been mopped up.
Chapter 5: Christmas goes off without a hitch
Our previous Christmases had been more than challenging. 2017 was the year of the “not releasing handsets” problem. We only found out that we had a handset releasing problem when all the handsets needed to be used (and some didn’t release). The fault was larger and more complicated than one single issue and involved cross-team collaboration, from networks to engineering to 3rd-party hardware suppliers. And 2018 was the year our new production-grade checkout-in-the-cloud dependency ran into scaling problems when we started hitting Christmas peak, even though it had been load and capacity tested ahead of time.
But then Christmas 2019 came along. With a fully upgraded store WAN and Wi-Fi network across the estate, and several learnings and fixes applied to the checkout-in-the-cloud system throughout the year, we just had to rely on our own apps and services — that had been through much larger capacity testing — to stand firm and deliver, and they did! We sat in awe during the peak hours watching very large usage visualisations on Kibana and Grafana in a spookily quiet office… bliss.
Chapter 6: The whole estate and beyond
At the time of this publication, SmartShop is in over 600 stores nationwide. That figure includes all but a handful of our supermarket estate, and has been growing rapidly: we were only in 72 stores 12 months ago, so the number of stores has grown more than eightfold in a year. We’re in the land where we (as an engineering team) don’t hear about each new store launch or have to prepare for it (thanks in part to the 4th iteration of our web-based administration panel, which is one for another blog post I’m sure). This is a very nice place to be, a sentiment I’m sure fellow engineers amongst you will understand very well.
We’re now using data to inform decisions to a greater extent than ever, from building metrics into everything we do, to sitting with our in-team user researchers and designers to set up, plan and experiment with A/B tests across mobile and in-store devices. We proactively monitor and alert on issues rather than waiting for stores to tell us there’s a problem (although that still sometimes happens). And we have a support rota for engineers where we hardly ever get called, and people actually want to volunteer to be a part of it!
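For readers who haven’t run A/B tests before, one common building block is assigning each customer to a variant deterministically, so the same customer sees the same experience on every visit and on every device. The sketch below shows one way that could be done in Go by hashing an experiment name together with a customer ID. It is an illustrative assumption, not a description of our implementation: the `Variant` function, the experiment name, the variant names and the customer IDs are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// Variant (hypothetical) deterministically assigns a user to one of the
// given variants. Hashing the experiment name together with the user ID
// means the same user can land in different buckets for different
// experiments, while always getting the same bucket within one experiment.
func Variant(experiment, userID string, variants []string) string {
	if len(variants) == 0 {
		return ""
	}
	sum := sha256.Sum256([]byte(experiment + ":" + userID))
	// Use the first 8 bytes of the digest as an unsigned integer bucket key.
	bucket := binary.BigEndian.Uint64(sum[:8]) % uint64(len(variants))
	return variants[bucket]
}

func main() {
	variants := []string{"control", "new-basket-layout"}
	for _, id := range []string{"customer-1", "customer-2", "customer-3"} {
		fmt.Printf("%s -> %s\n", id, Variant("basket-layout", id, variants))
	}
}
```

Because the assignment is a pure function of the experiment name and the customer ID, nothing needs to be stored to keep it consistent, which is convenient when the same test spans a mobile app and an in-store handset.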
We’ve brought the technology to a handful of convenience stores and, with nearly 800 of them, if we continue to roll out we could potentially more than double the number of stores we’re in at the moment. We’re also looking closely at our mobile payments proposition (currently in 10 stores) and how we can give more of our customers the ability to check out in this super-fast way. It’s no longer as much of a technical challenge but more of a logistical one, and one that, as a large cross-functional team, we’re helping to solve as engineers too. So I’m certain there’s lots to come in the near future. Watch this space!
TL;DR — What we learnt:
- There is no secret sauce but a load of important lessons learnt (sorry/not sorry to those who took the shortcut here)
- We overestimated the capability of our dependencies, and underestimated our reliance on them, but ultimately they got us to where we are today
- We learnt how to support our product as we went along
- There are always gaps in testing — even if you test-all-the-things — and you only realise that when something goes wrong in Production
- It’s not all about the red tape, and we can’t solve everything* — especially when we rely on other teams that have different practices, problems & legacy
- The whole team should be involved in decision-making wherever possible (most of the time), but also given time to understand decisions when that’s not possible
- When you gain respect as a team, you are able to push back on things much more effectively when they don’t make sense
- A product like ours will always be made up of several teams across multiple functions, and that is a great thing as long as you all collaborate, sit with each other and understand how each other works
- It can be a long slog, so make sure you have a great team around you!
- When you get there, you forget what you went through until you write a blog post on it
* with limited money, team capacity and other priorities