At a time when A.I. is ever present in our lives — from personalised recommendations to voice assistants, from image recognition to automated chatbots — it’s becoming apparent that the way companies leverage these technologies will shape their future success against their competitors.
At ASOS, we’ve quickly identified this as a key factor for our future growth — by investing in our talented mix of engineers and scientists, by experimenting with the latest tools and frameworks, by building core capabilities like our bespoke data lake. After all, our vision is to be able to integrate A.I. systems easily and safely within every aspect of our business.
But these systems, beyond their inherent complexity, are different in nature from traditional software: we need to understand what it takes to deliver them in production. In other words, we want to answer the following questions:
How can we accelerate the delivery of A.I. systems? How do we ensure these systems deliver value to the business? How do we propose to scale?
Framing the problem
Beyond the (cool) alliterations and the (less cool) click-bait appeal of the title, it was always going to be a little problematic using loaded terms like ‘agile’ or ‘A.I.’. So, before we dive in, it’s worthwhile clarifying what these two terms mean for us at ASOS:
- ‘Agile’ is seen here as more than a set of practices that help us deliver software. It’s a mindset that focuses on trusting and empowering the skilled people who do the work, engaging actively with the customer every step of the way, adapting swiftly to changing conditions and emphasising transparency.
- ‘A.I.’ — for Artificial Intelligence — has been the buzzword du jour for quite some time now and refers to the ability of computer systems to solve problems in an intelligent way. More pragmatically, the recent boom in A.I. corresponds to the boom of one of its subsets, machine learning: the ability of computer systems to learn to solve problems from data. What’s covered in this post concerns exclusively the latter.
At this stage, it’s also important to note that machine learning is by no means a substitute for traditional software development; both are needed to solve different sets of problems, and they are compatible with each other. Over time, we see them becoming increasingly intertwined.
But they’re inherently different by nature; whereas in traditional programming we build software to solve a problem, in machine learning, we build software which learns to solve a problem.
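The contrast can be made concrete with a toy sketch (entirely hypothetical names and numbers, not ASOS code): in the traditional version we write the rule ourselves; in the machine learning version the software derives the rule from labelled examples.

```python
# Hypothetical example: flagging "premium" orders above a price threshold.

# Traditional programming: we encode the rule ourselves.
def is_premium_rule(price: float) -> bool:
    return price > 50.0  # the threshold is hand-written into the software


# Machine learning (minimal sketch): the software learns the rule from data.
def learn_threshold(examples: list[tuple[float, bool]]) -> float:
    """Pick the midpoint between the highest non-premium price and the
    lowest premium price seen in the labelled training data."""
    premium = [price for price, label in examples if label]
    regular = [price for price, label in examples if not label]
    return (max(regular) + min(premium)) / 2


training_data = [(10.0, False), (30.0, False), (60.0, True), (90.0, True)]
threshold = learn_threshold(training_data)  # 45.0 — learned, not hard-coded


def is_premium_learned(price: float) -> bool:
    return price > threshold
```

The two functions behave similarly here, but the second one changes whenever the data changes — which is precisely why delivering such systems calls for different processes.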
The difference can seem tenuous at first (simply adding two words!), but as you can see in the table above, the implications of taking a different approach have numerous ramifications, notably in the way we want to deliver these systems to production.
While adhering to the highest engineering standards set by the organisation (reliability, availability, logging, security, monitoring, alerting, CI/CD pipelines, etc…), we need to imagine new processes, develop new tools, evolve our existing architectures and invent new practices for accelerating the way we deliver these systems. Only once we have discovered them can we start to envisage scaling A.I. capabilities across our tech estate.
Demonstrating the art of the possible
When embarking on that journey, it seemed sensible to start small instead of designing and implementing grand organisational changes tied to ambitious delivery plans. The idea was to gain key insights into the technology and ways of working while minimising the risks to the business.
We identified a non-critical use case (personalised brand recommendations), we built a small cross-functional team around motivated individuals (engineers and scientists), and we selected a piece of technology which would help us make our vision come to life (Azure Machine Learning).
The first objective was to prove that the technology was mature enough (can we have a single end-to-end pipeline which handles feature engineering, model training and serving our inference API?). The second was to understand how we could work together (how can we optimise the way we work to deliver machine learning models faster and more often?).
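The ‘single end-to-end pipeline’ idea can be sketched in a few lines. This is an illustration only — scikit-learn standing in for the actual Azure Machine Learning pipeline, with made-up toy data — but it shows the principle: feature engineering and model training are chained into one fitted artefact, and that same artefact is what the inference API serves.

```python
# Illustrative sketch (not ASOS's actual pipeline): chain feature
# engineering and model training so the trained artefact is exactly
# what gets served behind the inference API.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("features", StandardScaler()),   # feature engineering step
    ("model", LogisticRegression()),  # model training step
])

# Toy numbers standing in for customer/brand interaction features.
X_train = [[1.0, 20.0], [2.0, 10.0], [8.0, 90.0], [9.0, 80.0]]
y_train = [0, 0, 1, 1]

pipeline.fit(X_train, y_train)

# At serving time, the API calls the same fitted pipeline end to end:
prediction = pipeline.predict([[7.5, 85.0]])
```

Keeping the feature transformation inside the pipeline avoids training/serving skew: the exact scaling fitted at training time is reapplied at inference time.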
Several months later, this team delivered our Brand Recommendations API to the business, which was the first end-to-end machine learning application ever delivered at ASOS. Interestingly, we ran an MVT against a static list of brands and it turned out that customers favoured the static list over the personalised list! But the building blocks were now in place, and we could iterate extremely fast on this model to make small improvements and eventually beat the static list.
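For readers unfamiliar with how such an MVT is read: a common approach is a two-proportion z-test on the conversion rates of the two variants. The figures below are entirely made up for illustration; they are not the results of our actual test.

```python
# Hypothetical MVT read-out: static brand list (control) vs personalised
# list (variant), using a standard two-proportion z-test.
from math import sqrt


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z-score for the difference between two conversion rates
    (variant minus control), using a pooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


# Made-up figures: control converts at 5.0%, personalised at 4.6%.
z = two_proportion_z(conv_a=500, n_a=10_000, conv_b=460, n_b=10_000)
# Here z ≈ -1.32: the personalised list looks worse, but |z| < 1.96,
# so the difference is not significant at the 95% level — iterate and
# re-test rather than conclude.
```

This is exactly the loop the post describes: measure, improve the model, and run the test again until the personalised list wins.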
What was important for us was elsewhere; we successfully proved the technology and we understood how we could work together. We could now see a clear path, through a set of guiding principles, as to how we could scale our approach across the organisation.
Our guiding principles
1. Cross-functional collaboration
We recognise we need a mix of skills and perspectives to deliver machine learning models efficiently. We decided to pivot from component teams (e.g. Big Data team / Data Science team / API team) to fully cross-functional teams aligned to clear business outcomes. These teams can be composed of software engineers, machine learning scientists, data scientists, operational staff, big data engineers, quality assurance, product owners, subject matter experts, etc… and are responsible for deciding how to implement the solutions in line with the company objectives.
This approach has naturally opened up conversations about the opportunity to unify our tech stack. Although C# is the dominant programming language across our technology department, our platform has centred around Python for developing our feature generation and training pipelines, our machine learning models, our inferencing APIs. Having one single programming language across different activities favours cross skilling and shared ownership across our teams.
2. Set up for iterating and experimenting
Machine learning models are rarely optimal in their first iterations — we often need to refine the features used for training, try different algorithms and then tune their parameters. Also, models are not absolute by nature: we can always do more work to increase their accuracy and/or adjust what they should be optimised for.
This has two major implications if we are to deliver model improvements fast and often:
- The ability to iterate quickly
- The ability to test and measure quickly
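A minimal sketch of that iterate-and-measure loop, with an assumed scikit-learn setup and synthetic data (the algorithms and parameter grids we actually sweep are product-specific): several configurations are evaluated against the same data under cross-validation, and the best one is kept for the next iteration.

```python
# Sketch of quick iteration: evaluate a few model configurations and
# keep the best. Data and parameter grid are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for real training data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # regularisation strengths
    cv=5,                                      # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

best_model = search.best_estimator_  # ship it, or iterate on it again
```

The same pattern extends to swapping whole algorithms in and out, which is why having the surrounding infrastructure (tracking, dashboards, CI/CD) in place early matters more than the first model’s quality.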
We have invested early in foundational capabilities like automatic infrastructure provisioning, centralised experiment-tracking tools, near-real-time MVT dashboards, a data lake, CI/CD infrastructure, etc… When embarking on a new project, the emphasis is on creating the underlying infrastructure which will enable quick iterations, even if the model is far from perfect at first.
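To make ‘centralised experiment tracking’ concrete, here is a deliberately minimal, stdlib-only sketch of the idea — our real tooling is far richer, and every name and number below is hypothetical: each run appends its parameters and metrics to a shared store, so anyone on the team can compare runs.

```python
# Minimal sketch of experiment tracking: log each run's parameters and
# metrics to a shared JSON-lines store, then compare runs. All values
# are made up for illustration.
import json
import tempfile
import time
from pathlib import Path


def log_run(store: Path, params: dict, metrics: dict) -> dict:
    """Append one experiment run to a shared JSON-lines store."""
    run = {"timestamp": time.time(), "params": params, "metrics": metrics}
    with store.open("a") as f:
        f.write(json.dumps(run) + "\n")
    return run


# A temporary file stands in for the centralised store here.
store = Path(tempfile.mkdtemp()) / "experiments.jsonl"
log_run(store, params={"lr": 0.1, "epochs": 5}, metrics={"auc": 0.71})
log_run(store, params={"lr": 0.01, "epochs": 10}, metrics={"auc": 0.74})

# Anyone on the team can then rank runs, e.g. by AUC:
runs = [json.loads(line) for line in store.read_text().splitlines()]
best = max(runs, key=lambda r: r["metrics"]["auc"])
```

The value is less in the mechanism than in the habit: when every experiment leaves a comparable record, iteration speed compounds across the team.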
These capabilities are always built with a specific product development in mind, never with the intention to fulfil a utopian list of desired enablers for the platform. When we deliver another project, we look for opportunities to use what has been built before, perhaps iterating on it or extracting a reusable framework for future work.
This approach only functions if communication and awareness are high across teams. On top of traditional sprint reviews, where everyone is invited to discuss progress, we encourage discussion in focus groups around specific topics (Databricks, Azure Kubernetes Service (AKS), paper peer review, security, etc…) where engineers and scientists can collaborate, learn or teach outside of their team.
3. Different initiatives with different timelines
Research is an essential part of building machine learning models — it gives us space to experiment with new technology and new algorithmic approaches, it encourages machine learning scientists to be involved in the community by publishing papers and collaborating with universities and it helps us attract new talent looking for innovative companies.
But research can be time-consuming and uncertain, and might not yield any visible return on investment immediately. How we prioritise and plan our research against our deliveries becomes critical when we engage with the business.
We aim to have an 18-month view where we balance our short-term deliveries with our mid- to long-term initiatives, to continuously deliver value to our customers while leaving some space for research and innovation. We also understand that these projects, while different in nature, do not operate in isolation. In fact, we constantly look out for opportunities to combine them when revising our plans.
4. A.I. is not an island
Machine learning models are generally useless on their own — a recommendation engine cannot fully function without a compelling user experience; a customer lifetime value model needs to fit within a marketing strategy.
Moreover, their impact can sometimes be misunderstood or hard to grasp. As mentioned previously, it’s more than a simple plug and play capability: it’s a novel way of solving problems, fuelled by the increasing amount of data at our disposal.
We have to be comfortable with the lack of explicit requirements and the need for hypothesising. With predictive models, it’s sometimes as if we need to find the problems we could solve with the solutions we have uncovered, instead of the other way round.
In each of our cross-functional teams we have translators. Their job is to relentlessly bridge the gap between our team of engineers and scientists on one hand and our stakeholders on the other. Their official job titles might differ (product manager, data analyst, subject matter expert, business analyst), but their function is essentially the same: build strong relationships with the business, drive transparency for our teams and steer product development to achieve our goals.
Rewiring the culture
These are exciting times for A.I. at ASOS. We can clearly see how the systems we build impact our business and revolutionise our customer experience; we can iterate and experiment faster thanks to our technology investments and thanks to our common understanding of how we want to work together; we are entrusted by the business to unlock the potential of our data.
Once our vision was clarified, we chose a phased and empirical approach, starting small and then scaling up little by little. We centred first on our people and our teams. We combined the adoption of new technology with cultural change as we see them deeply intertwined.
We are now entering the next phase of our transformation, where we will start to devolve A.I. capabilities to other tech platforms. A new set of questions will emerge (How do we balance autonomy and alignment? How do we manage research and hiring at scale? How do we envisage cross-platform delivery while staying close to the business needs?). We are confident we will answer them, step by step and collectively, by staying true to our principles.
Reda Kechouri is a Delivery Manager at ASOS.com. When he’s not helping machine learning teams, he collects records and occasionally DJs in East London.