Where Engineering meets Data Science — An Architecture Overview
By João Faria, Data Scientist.
This post was originally published on our F-Tech Blog.
Data is fuel for success for organisations of any size, across all industries. Yet it is still hard today for many organisations to turn project ideas with data science components into reality. Integrating the data science discipline has proven to be a challenging task: many have tried, and some have failed to benefit from such an investment (see, for instance, the article by Nick Elprin of Domino Data Lab, which surveyed this topic).
Data scientists and engineers typically come from very different academic backgrounds, which can make it difficult to integrate them around a common goal. Under these multidisciplinary conditions, it is desirable to design a project architecture that lets both disciplines focus on their strengths while ensuring common ground and knowledge sharing.
This blog post will share an overview of our teams’ architectural solutions.
Firstly, let us guide you through the team routines and organisation. Farfetch's Engineering department is organised in clusters, each responsible for one area of the business. The Search cluster consists of two teams: Rank and Search. Both teams are multidisciplinary, comprising people with software engineering, computer science, statistics, pure mathematics, machine learning and even biomedical instrumentation engineering backgrounds. Each of these teams is composed of software developers, data scientists, quality assurance engineers and product owners. This way, we foster synergies between all disciplines, regardless of academic background.
But how exactly do we combine science with engineering? First of all, the entire team participates in all agile practices, such as daily stand-ups, sprint planning and retrospectives. Also, since most team members share the same physical space, collaboration happens naturally, as do pair programming and ad-hoc brainstorming sessions. Because both disciplines are focused on the same goal and working towards the same objective, we eliminate the need for separate alignment meetings and reduce the risk of fragmenting the ecosystem.
Taking this cooperation into account, and keeping our differences in mind, it was necessary to adapt the architecture of our solutions to better accommodate our dynamics.
Historically, projects at Farfetch have been developed as services. There was an Application Programming Interface (API) layer, written in .NET, which accepted external requests. Much of the complexity was contained inside the Python components: monolithic Extract, Transform, Load (ETL) processes that periodically consumed external data sources, performed all the necessary computations and stored the results in the service's local database. That database was the only interaction point between the Python components and the API layer, which accessed it on every incoming request to the platform.
In this case, the Python components performed the heavy lifting, making all the processed information ready for the API layer to use. Among other tasks, they were responsible for fetching data from a myriad of external sources, applying all the pre-processing stages, training the Artificial Intelligence (AI) models, producing the predictions, applying business rules and writing the results to the local database with minimal impact on external read accesses. Most services following this architecture needed to run against several different contexts while collecting and producing large amounts of data. As you can imagine, scaling up this way became difficult.
Furthermore, within Farfetch, Python had been exclusive to data scientists. This left them owning all development across the whole ETL process and dealing with engineering requirements (e.g. service performance and scalability). Data scientists were thus forced to split their focus across a large set of tasks instead of concentrating on their domain, hindering not only the development of the machine learning components but also of the whole project.
In order to tackle the aforementioned problems, the Search cluster devised a new, modularised architecture. The database look-up from the API layer works as before: only results already computed by the entire flow and stored in the database can be returned immediately. For the data-updating process, however, a new set of modular components has been introduced:
- a Feeder (responsible for collecting all the necessary information for the service task);
- a pool of AI modules (which compute the results given the input data);
- an Updater (responsible for populating the internal database with the new result entries).
These three components communicate through Kafka messaging. Under certain project conditions (e.g. no pre-computed results are found in the database), a request to the API triggers a message to the Feeder component. The Feeder gathers all the necessary information from other services and publishes a Kafka message with the collected data. The AI modules consume this message and compute the expected predictions, which are in turn published as another Kafka message. Finally, the Updater component consumes that message and uploads the predictions into the database that the API component reads from.
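The flow above can be sketched in a few lines of Python. This is only an illustrative stand-in: in-memory queues take the place of the Kafka topics, a dictionary takes the place of the service's database, and the message fields and trivial "model" are assumptions, not our production code.

```python
import json
import queue

# In-memory queues stand in for the two Kafka topics; the component
# names (Feeder, AI module, Updater) follow the architecture above,
# but payload fields and the prediction logic are illustrative.
feeder_topic = queue.Queue()   # Feeder -> AI modules
results_topic = queue.Queue()  # AI modules -> Updater
database = {}                  # stands in for the service's local database


def feeder(request):
    """Gather the data needed for the task and publish it as a message."""
    payload = {"request_id": request["id"], "features": request["features"]}
    feeder_topic.put(json.dumps(payload))


def ai_module():
    """Consume a Feeder message, compute a prediction, publish the result."""
    msg = json.loads(feeder_topic.get())
    # A trivial stand-in for the real model: score = sum of the features.
    prediction = sum(msg["features"])
    results_topic.put(json.dumps({"request_id": msg["request_id"],
                                  "prediction": prediction}))


def updater():
    """Consume a result message and write it to the database the API reads."""
    msg = json.loads(results_topic.get())
    database[msg["request_id"]] = msg["prediction"]


# A request with no pre-computed result triggers the full flow.
feeder({"id": "req-1", "features": [0.2, 0.3, 0.5]})
ai_module()
updater()
print(database)
```

The key property is that each component only sees messages, never its neighbours' internals, which is what lets each stage be owned, implemented and scaled independently.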
This is a generic view of our architecture. It is flexible and, depending on the project, certain modifications may be put in place.
The AI modules are developed in Python, since it provides data scientists with the best available toolset and is internationally established as the de facto programming language in this area. All the other components are usually language-agnostic. For cultural reasons within the company, and given the natural engineering ownership of such components, the .NET Core platform predominates.
The reformulated architecture brings several advantages to the table. Due to its modular nature, besides the traditional gains from code decoupling, it is now much easier for engineering and data science to share ownership of different components.
Moreover, all the components communicate through messaging, which highlights our modularity gains:
- a language-agnostic communication layer;
- resilience: even if a component fails, messages keep accumulating for later processing;
- scalability: since queues of messages wait at each stage, we can duplicate processing components as needed.
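The last two points can be demonstrated with a small sketch. A queue accumulates a backlog while no consumer is running, and scaling out is just starting more identical consumers on it. Here worker threads and an in-memory queue stand in for duplicated component instances and a Kafka topic (whose consumer groups provide the same guarantee via partition assignment); all names are illustrative.

```python
import queue
import threading

topic = queue.Queue()  # stands in for a Kafka topic
processed = []
lock = threading.Lock()

# A backlog accumulates even while no consumer is running -- the
# messages simply wait in the queue for later processing.
for i in range(100):
    topic.put({"request_id": i, "features": [i, i + 1]})


def worker():
    """One duplicated instance of a processing component."""
    while True:
        try:
            msg = topic.get_nowait()
        except queue.Empty:
            return  # backlog drained
        result = sum(msg["features"])  # trivial stand-in for the real model
        with lock:
            processed.append((msg["request_id"], result))


# Scaling out: start several identical consumers on the same queue.
workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(len(processed))  # 100 -- every queued message handled exactly once
```

Because the queue hands each message to exactly one consumer, adding instances increases throughput without changing any component's code.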
This architecture is not suitable for every kind of solution. In a near real-time scenario (for example, when the results from the AI modules must be returned within a short time span for each request), the layers involved introduce significant latency that would, in practice, need to be removed. However, real-time predictions are not currently a requirement, which makes this solution a success within the Search cluster's scope.
A Brief Overview of our Services
Currently, there are three projects working under our proposed architecture.
SmartRank is the project responsible for ordering the products on each listing page of the Farfetch platform. SmartRank not only helps our fashion experts manually curate listing pages but also produces AI-powered ranks using Learning-to-Rank-based approaches. With these two options, we aim to provide the best possible experience for our customers by combining AI with the creativity and innovation of our fashion connoisseurs. We will deep-dive into the inner workings of SmartRank in a later blog post.
Another project, Semantic Search, is a query understanding platform that aims to extract the semantic meaning of a user's query, thereby returning more refined search results. From an engineering point of view, this is a challenging task, given the number of requests to which the API needs to respond. The machine learning task is also quite demanding: a chain of deep learning models deals with Part-of-Speech Tagging, Dependency Parsing and Named Entity Recognition tasks (described in a scientific publication at the KDD Workshop on AI for Fashion) to identify the entities present in a user's query. Later, we will publish a blog post describing this project in more detail.
Finally, there is also a project in the Computer Vision field, named VIPER, which stands for Visual Information for Product Exploration and Retrieval. The core idea is to fully describe a product's properties, namely its category hierarchy, colours and attributes, given only its image. A novel deep learning approach has been developed to tackle this problem. VIPER's development led to a scientific publication at the KDD Workshop on AI for Fashion, entitled A Unified Model with Structured Output for Fashion Images Classification, and its foundations have also been presented in a previous blog post.
Our current service architecture has enabled us to take the best from both the engineering and data science disciplines, in order to deliver scalable, robust and high-performing projects embedded with machine learning components. We have found a way to make data scientists and engineers work together towards a common goal while using a diverse technological stack.
Splitting the monolithic AI component into several smaller ones was the main key to success. From that moment on, data scientists were able to focus on high-quality development of a project's traditional pre-processing, training and prediction pipelines, using the most suitable technologies at their disposal. Likewise, the engineering discipline could better contribute to the system's development using the tools they consider best for the task.
Working as one team has also been a pillar of our accomplishments. By sharing knowledge, each discipline learns from the other and every project decision becomes a joint effort — this has certainly been our secret weapon behind all these projects.
However, not everything is perfect, and we are still finding our way. For instance, on projects that require the AI models to produce real-time predictions, this project architecture will not suffice. It will have to be adapted and improved towards that requirement. Nevertheless, we believe we have found a good solution to join together two distinct, yet complementary, disciplines to deliver amazing products.
Special thanks to António Garcez, Daniela Ferreira, Edgar Coelho, Hugo Galvão, Isabel Chaves, João Faria, João Pires, João Santos, João Teixeira, Jorge Marques, Nikola Misic, José Marcelino, Luís Baía, Mário Barbosa, Miguel Monteiro, Pedro Nogueira, Pedro Vale, Peter Knox, Ricardo Gamelas Sousa, Rui Silva, Vitor Saraiva and Vitor Teixeira.