Transforming Data Science Workflows with GCP’s Vertex AI

Francesca Iannuzzi
Maisons du Monde
Jul 5, 2023

Co-authored with Nicolas Gorrity, ML Engineer, and Melissa Cardinale Cortes, Data Scientist

A Data team has existed at Maisons du Monde since 2017. As discussed in other stories from this publication, our technical stack is built around GCP, Airflow, GitLab, and QlikSense. Over the years, the team has developed a solid foundation of Extract-Load-Transform workflows that feed into a comprehensive data warehouse featuring multiple levels of data enrichment. This robust backend has been complemented by extensive frontend BI development, enabling a range of business units and stakeholders to access and utilize data through dashboards and data marts. This, coupled with a steady investment in Data Governance, has revolutionized the way the company accesses and uses data.

This wealth of information opened the door to both descriptive analytics and early predictive use-cases. The latter set the stage for what later became a more deliberate strategy.

Indeed, as the team grew in maturity and foundational needs were met, it was time for a dedicated data science team to step in and address a broader array of use cases as part of a continuous effort.

The Catalyst: Sales Forecasting POC

The initial demands faced by the Data Science Team at Maisons du Monde were primarily around sales forecasting. In typical fashion, the need arose from a context of urgency that translated into stringent time constraints. Indeed, towards the end of 2021, the volatility of post-Covid sales behaviors required close monitoring on the business side. We were asked for frequent, up-to-date forecasts to support decision-making.

We tackled this task as a proof of concept, aiming to provide immediate business support while simultaneously demonstrating the potential of a data-science driven approach to the problem. (Up until this point, sales forecasting was handled primarily on the business side via a blend of in-house resources and external proprietary solutions.)

This phase was characterized by an intensive use of notebooks, laptops, and GCP-based virtual machines for more time-consuming processes. We used Prophet as our primary algorithm as it is fast to set up, can factor in external variables, and provides interpretable results. (We later added other methods through the Darts library, but their discussion goes beyond the purpose of this article.) We retrained models every week to incorporate the latest available ground-truth data before making predictions. Feature engineering was continuously updated and, as a consequence, hyperparameters were frequently tuned.
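
To make this concrete, here is a minimal sketch of the kind of weekly retraining step described above, assuming Prophet with a single external regressor. The column name `promo_intensity` and the eight-week horizon are illustrative placeholders, not our actual configuration.

```python
import pandas as pd
from prophet import Prophet  # pip install prophet


def retrain_and_forecast(history: pd.DataFrame, regressors: pd.DataFrame) -> pd.DataFrame:
    """Fit a fresh Prophet model on the latest ground truth and forecast 8 weeks ahead.

    `history` needs Prophet's `ds` (date) and `y` (target) columns plus the regressor
    column; `regressors` must provide `promo_intensity` for every date in the history
    *and* in the forecast horizon. Column names are placeholders.
    """
    model = Prophet(weekly_seasonality=True, yearly_seasonality=True)
    model.add_regressor("promo_intensity")  # external variable factored into the fit
    model.fit(history)

    # Extend the time index by eight weeks and attach the known regressor values.
    future = model.make_future_dataframe(periods=8, freq="W")
    future = future.merge(regressors, on="ds", how="left")

    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]
```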

By the end of the POC, we had developed dozens of models and generated nearly a hundred forecasts. This would not have been possible without concurrently developing a semi-automated workflow alongside our primary modeling objectives.

To this end, we set up a Python application that would handle this repetitive pipeline of tasks (a simplified sketch follows the list):

  • Downloading the most recent data from BigQuery;
  • Formatting the data as a valid time series;
  • Fetching the hyperparameters used for the previous model (or new ones that we would have selected with hyperparameter optimization processes);
  • Training a new model;
  • Generating predictions; and
  • Uploading predictions into BigQuery.
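
A heavily simplified skeleton of such an application could look like the sketch below. Table names, the eight-week horizon, and the way hyperparameters are passed in are placeholders; the real app also handled hyperparameter optimization and multiple models.

```python
import pandas as pd
from google.cloud import bigquery
from prophet import Prophet


def run_weekly_forecast(project: str, source_table: str, destination_table: str, params: dict) -> None:
    client = bigquery.Client(project=project)

    # 1. Download the most recent data from BigQuery.
    raw = client.query(f"SELECT date, sales FROM `{source_table}` ORDER BY date").to_dataframe()

    # 2. Format the data as a valid time series (Prophet expects `ds` / `y` columns).
    history = raw.rename(columns={"date": "ds", "sales": "y"})

    # 3. Reuse the previous model's hyperparameters (here simply passed in as `params`).
    # 4. Train a new model on the latest ground-truth data.
    model = Prophet(**params)
    model.fit(history)

    # 5. Generate predictions for the coming weeks.
    future = model.make_future_dataframe(periods=8, freq="W")
    forecast = model.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]

    # 6. Upload predictions into BigQuery.
    client.load_table_from_dataframe(forecast, destination_table).result()
```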

This Python app was dockerized and scheduled using Airflow, the orchestrator that was already in place for data engineering tasks. Every Monday, the necessary data would be automatically collected, processed, and modeled. The resulting forecasts became accessible via a QlikSense Dashboard.
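
For illustration, a minimal Airflow DAG along these lines could schedule the dockerized app every Monday. The operator choice, image name, and schedule below are assumptions for the sketch, not our exact setup.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

# A hypothetical weekly DAG: image name and cron schedule are placeholders.
with DAG(
    dag_id="sales_forecasting_poc",
    schedule_interval="0 6 * * 1",  # every Monday morning
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:

    collect_and_preprocess = DockerOperator(
        task_id="collect_and_preprocess",
        image="eu.gcr.io/my-project/sales-forecasting:latest",
        command="python -m app.preprocess",
    )

    train_and_predict = DockerOperator(
        task_id="train_and_predict",
        image="eu.gcr.io/my-project/sales-forecasting:latest",
        command="python -m app.train_predict",
    )

    collect_and_preprocess >> train_and_predict
```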

Fig. 1 — A simplified view of our first DAG, developed during the POC phase of the Sales Forecasting project. Data collection and preprocessing were triggered automatically every Monday. Following this, we would inspect the data, review our modeling approach if necessary, and then manually trigger the training and prediction stages.

From POC To Project: New Challenges

The successful results from the proof of concept led to sponsorship for a more ambitious project focused on sales forecasting. This required us to ensure the rapid deployment of a larger number of models, along with the establishment of strong technical foundations to support scalable growth. (In retail, forecasting can involve multiple target variables and potentially thousands of relevant time series at the most granular level.)

Our existing workflow served us well during the early stages of the project, supporting the first deployments. However, we understood that it needed to evolve to meet the increasing complexity of our tasks.

Our first concern was scalability. Using the standard infrastructure, originally designed for data engineering workloads, limited our flexibility in choosing the hardware configuration of the virtual machine that ran the retraining jobs. Retraining models may require scaling out (running multiple workloads in parallel) or scaling up (allocating more RAM, CPU, or GPUs for particularly heavy models).

Another significant aspect was code maintainability and the ease of onboarding new team members on the project. Although our program was structured to minimize the amount of coding required by a data scientist (typically only a new forecasting model or a new data transformation function), as the complexity of our forecasting tasks increased, the deployment of new models became increasingly time-consuming. Under the existing setup, a team member needed to understand almost the entirety of the app to write configuration files correctly.

Lastly, we had to contemplate our long-term strategy, especially in relation to the production of future data science products, their costs, and speed of implementation. As the team expanded and started to tackle demands beyond sales forecasting, each new deployment phase felt like starting from scratch. Creating a new automated workflow for each project was time-consuming and required DataOps skills beyond what could reasonably be expected from even our most versatile full-stack data scientists.

Beyond deployment, it was clear we had to revamp our development environment. It was time to transition away from local Python installations and the constant shuttling of information to and from virtual machines.

Elevating The Stack With Vertex AI

Over a year ago, we chose to build our data science workflows around the features of GCP’s Vertex AI.

This was a somewhat natural choice, considering that the rest of our stack already relied heavily on GCP and that our data science team has pronounced computer-science and engineering skills. Far from abstracting away the need to code the details of our data processing, the platform enabled us to streamline the overall development, testing, and deployment processes.

In what follows, we focus on two main Vertex AI features: Pipelines and Workbench.

Efficient Deployments with Pipelines

Mirroring the SQL-based workflows already run by the data engineering team, we migrated to a tech stack where Vertex pipelines could be triggered by Airflow tasks.

With Vertex AI Pipelines, our data scientists design their workflows in terms of distinct steps, or components. Each component is a functional building block of the workflow — for example, it could be transforming a dataframe, training a model, or uploading data into BigQuery. These components ingest and output data objects known as artifacts. Interdependencies between components are formulated into a pipeline using a very Pythonic syntax.
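
To give a flavour of that syntax, here is a minimal sketch of two components and a pipeline written with the Kubeflow Pipelines (KFP) v2 SDK, which Vertex AI Pipelines runs. The component logic and names are illustrative rather than taken from our codebase.

```python
from kfp import dsl


@dsl.component(base_image="python:3.10", packages_to_install=["pandas", "pyarrow", "gcsfs"])
def make_time_series(source_uri: str, series: dsl.Output[dsl.Dataset]):
    """Turn raw sales data into a time-series artifact (illustrative logic)."""
    import pandas as pd

    raw = pd.read_parquet(source_uri)
    raw.rename(columns={"date": "ds", "sales": "y"}).to_parquet(series.path)


@dsl.component(base_image="python:3.10", packages_to_install=["pandas", "pyarrow", "prophet"])
def train_and_forecast(series: dsl.Input[dsl.Dataset], forecast: dsl.Output[dsl.Dataset]):
    """Train a model on the time-series artifact and write the forecasts."""
    import pandas as pd
    from prophet import Prophet

    history = pd.read_parquet(series.path)
    model = Prophet()
    model.fit(history)
    future = model.make_future_dataframe(periods=8, freq="W")
    model.predict(future).to_parquet(forecast.path)


@dsl.pipeline(name="sales-forecasting")
def sales_forecasting_pipeline(source_uri: str):
    # Interdependencies are expressed by passing artifacts between components.
    series_task = make_time_series(source_uri=source_uri)
    train_and_forecast(series=series_task.outputs["series"])
```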

In addition to migrating our deployments to Vertex AI Pipelines, we implemented some technical layers to further facilitate the process for data scientists.

  • Once the Vertex pipeline is created, a CI/CD pipeline is automatically triggered to run the unit tests for the code, compile the pipeline into a template file using the Python SDK, and upload the compiled file to Google Cloud Storage.
  • We also developed a reusable Airflow operator that runs a dockerized app designed to trigger the execution of a Vertex pipeline simply given its name and desired parameters.
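
A condensed sketch of those two layers might look as follows, assuming the standard KFP and Vertex AI SDKs; project, bucket, and file names are placeholders, and `sales_forecasting_pipeline` refers to the pipeline sketched earlier.

```python
from google.cloud import aiplatform, storage
from kfp import compiler

# --- CI/CD side: compile the pipeline and upload the template file to GCS. ---
compiler.Compiler().compile(
    pipeline_func=sales_forecasting_pipeline,  # the pipeline defined earlier
    package_path="sales_forecasting.json",
)
storage.Client(project="my-project").bucket("my-pipeline-templates").blob(
    "sales_forecasting.json"
).upload_from_filename("sales_forecasting.json")

# --- Airflow side: the dockerized app submits a run given a name and parameters. ---
aiplatform.init(project="my-project", location="europe-west1")
job = aiplatform.PipelineJob(
    display_name="sales-forecasting-weekly",
    template_path="gs://my-pipeline-templates/sales_forecasting.json",
    parameter_values={"source_uri": "gs://my-bucket/sales.parquet"},
)
job.submit()  # or job.run() to block until completion
```

In practice, the compile-and-upload half lives in the GitLab CI/CD pipeline, while the submission half is what the dockerized app behind our reusable Airflow operator performs.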

As a result, data scientists are left to focus solely on coding the specifics of their use cases — the components and pipelines. They do not need to worry about Airflow operators, service accounts, or CI/CD pipelines. Transitioning to Vertex AI not only addressed the technical challenges we faced, but it also significantly improved the efficiency of our data scientists’ deployments and fostered an Agile approach in our workflow.

The creation, maintenance, and continuous evolution of this infrastructure is now the responsibility of the ML Engineer — a role that we grew internally in parallel to this process.

Upon triggering a pipeline, the Vertex AI Pipelines UI presents a visual representation of its execution, with components organized as a DAG and data artifacts flowing in and out of them. An intuitive color code marks running components in blue, successful ones in green, and failed ones in red.

Fig. 2 — An excerpt from one of our Vertex Pipelines. Note how components (the rectangular boxes) are interconnected and artifacts are created and exchanged along the way. Also note the indication of execution status (the component at the top has failed).

The interface becomes a powerful tool for debugging, as we simply have to click on a component to visualize its logging messages. They are automatically captured by GCP from the standard output of the component execution and then uploaded into Google Cloud Logging. Identifying the cause of a failing pipeline only takes two clicks, and diagnosing the root issue is simplified as the workflow is divided into smaller, functional units: the components. This has been a huge time saver for us in performing model maintenance.

Finally, Fig. 3 shows how the Airflow DAG has evolved from the one shown in Fig. 1.

Fig. 3 — A simplified view of our latest DAG. Note how each of the green Airflow tasks corresponds to an entire Vertex Pipeline, like the one shown in Fig. 2.

Using Workbench for Development

As we improved our deployment process, it was equally crucial to upgrade our development environment. On this front, our team has had an insightful experience working with Workbench, a Jupyter-notebook-based development environment that is part of the Vertex AI suite.

This platform allowed us to enhance our data science workflows and introduced several useful features.

A significant advantage is that each Workbench instance arrives pre-installed with numerous kernels containing common data science libraries. This eliminates the need for managing local Python installations and multiple virtual environments. Furthermore, users can create custom Python environments to install project-specific packages.

We primarily use Vertex AI’s Workbench for exploring and analyzing large datasets, as well as for tasks like building models, running training jobs, and optimizing hyperparameters.

Indeed, this tool is particularly useful when running long jobs such as hyperparameter tuning, as it allows us to modify hardware on the fly. We can start with a smaller, less costly machine for preliminary testing and seamlessly switch to a more powerful one (with one or more GPUs) for resource-intensive tasks, boosting performance.

Despite all these advantages, our journey with Workbench was not without its challenges. We occasionally encountered operational issues such as prolonged startup times and unexpected machine shutdowns, which disrupted our workflows. Furthermore, the pre-installed Python version (3.7) didn’t always match our project requirements, a limitation we overcame by using Conda.

Overall, our experience with Vertex AI’s Workbench has been positive despite a few hiccups along the way. We will continue our journey, further exploring the features offered by the platform. Transitioning from entirely Google-managed to user-managed notebooks will be one of our next steps, allowing us to have more control over our environment.

Six Months On

Today, our sales forecasting project features automated modeling of over 40 hierarchically arranged time series. Some of these series are tackled by multiple algorithms, leading to over 60 distinct batch predictions with each weekly run.

Should we decide to delve deeper into the hierarchy, generating forecasts for hundreds or even thousands of series, the primary considerations would be adopting the appropriate modeling approach and accepting the increase in computing costs. Deployment, in and of itself, is no longer an obstacle.

In parallel with this main project, we have migrated a product-assortment algorithm from custom code to a pipeline-based process on Vertex AI. Beyond the benefits already mentioned, the integration with other GCP components like Pub/Sub and Cloud Run has allowed us to expose this algorithm through a custom, interactive user interface, thereby opening up new possibilities for data products beyond our QlikSense dashboards.
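
As an illustration of this kind of wiring (one plausible setup, not necessarily our exact architecture), a small Cloud Run service could receive a Pub/Sub push message sent from the user interface and submit the corresponding Vertex pipeline run:

```python
import base64
import json

from flask import Flask, request
from google.cloud import aiplatform

app = Flask(__name__)


@app.route("/", methods=["POST"])
def handle_pubsub_push():
    """Decode a Pub/Sub push message and launch the corresponding Vertex pipeline."""
    envelope = request.get_json()
    payload = json.loads(base64.b64decode(envelope["message"]["data"]).decode("utf-8"))

    aiplatform.init(project="my-project", location="europe-west1")
    aiplatform.PipelineJob(
        display_name="product-assortment",
        template_path="gs://my-pipeline-templates/product_assortment.json",
        parameter_values=payload,  # e.g. options selected by the user in the UI
    ).submit()
    return ("", 204)


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```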

Onboarding new team members has become more straightforward, thanks to a simplified code base and the development environment provided by Vertex AI’s Workbench. While we continue to work on initial proof-of-concept projects that can be a bit messy, operating within an integrated GCP environment overall provides a smoother experience.

As we continued to explore the potential of Vertex AI, we experimented with AutoML, which lets us train state-of-the-art machine learning models that are easy to deploy and customize with minimal effort. AutoML also offers valuable tools such as automatic feature engineering, feature importance analysis, and hierarchical forecasting that further simplify our work.
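
As a rough sketch of what AutoML forecasting looks like with the Vertex AI Python SDK (the dataset, column names, horizon, and budget below are placeholders, not our configuration):

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="europe-west1")

# Register the historical sales as a time-series dataset (table name is a placeholder).
dataset = aiplatform.TimeSeriesDataset.create(
    display_name="sales-history",
    bq_source="bq://my-project.forecasting.sales_weekly",
)

job = aiplatform.AutoMLForecastingTrainingJob(
    display_name="automl-sales-forecast",
    optimization_objective="minimize-rmse",
)

# Column names, horizon, and training budget are illustrative placeholders.
model = job.run(
    dataset=dataset,
    target_column="sales",
    time_column="week",
    time_series_identifier_column="series_id",
    unavailable_at_forecast_columns=["sales"],
    available_at_forecast_columns=["week", "promo_intensity"],
    forecast_horizon=8,
    data_granularity_unit="week",
    data_granularity_count=1,
    budget_milli_node_hours=1000,
)
```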

We also delved into Google’s latest generative AI model, PaLM 2, exploring its functionalities through the PaLM API and its models tuned for language tasks such as classification and summarization.
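
A minimal example of calling a PaLM model through the Vertex AI SDK might look as follows; the prompt, model version, and parameters are illustrative, and depending on the SDK version the module may live under `vertexai.preview.language_models`.

```python
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="my-project", location="us-central1")

# Load a PaLM text model and ask it for a one-sentence summary (placeholder prompt).
model = TextGenerationModel.from_pretrained("text-bison@001")
response = model.predict(
    "Summarize this customer review in one sentence: "
    "'The oak table arrived quickly and looks great, but one leg was slightly scratched.'",
    temperature=0.2,
    max_output_tokens=128,
)
print(response.text)
```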

By integrating Vertex AI into our tech stack, we’ve built a foundation that accelerates our ability to deliver complex data-driven solutions addressing diverse operational challenges. These solutions may leverage custom machine-learning models or glossy off-the-shelf AI functionalities, and the importance of a streamlined deployment avenue for such high-profile applications cannot be overstated. However, the advantages of using Vertex extend to the development and deployment of even more traditional algorithms. Indeed, in a company like Maisons du Monde, there are still simple data-powered workflows holding considerable potential to transform and optimize a variety of retail processes.
