Cdiscount: one single platform for universal data governance

Fabien Jaunas
Peaksys Engineering
Aug 30, 2022

Technological strata based on changing needs

We have used several different data analysis platforms since Cdiscount was launched more than 20 years ago.

The first SQL Server platform appeared in the 2000s, with the launch of an enterprise data warehouse, and was used to address traditional BI needs.

In 2013, two years after the marketplace was launched, we were faced with a significant increase in the volume of data produced by website activity. To tackle this new challenge and meet the company's growing data science needs, we set up a Hadoop cluster. More than 50 data scientists and as many algorithms "crunched" data on this platform until its demise in 2022.

Finally, a third data platform was launched to cater to specific requirements regarding near real-time processing and sharing of KPIs for measuring the commercial performance of the website cdiscount.com, a service not provided by traditional data marts.

What this meant was that, as of late 2019, we had three platforms for handling all of our data analysis needs, each dealing with a specific use case (traditional BI, data science, and real-time performance tracking).

Time to switch to one single platform

Running three platforms wasn’t easy, and had a range of consequences:

  • Our positioning wasn't always clear to users: who does what? Where is the data?
  • Data duplication
  • Higher project costs, as it was often necessary to scale up all three systems at the same time
  • Difficult to use: data from three different sources had to be combined for analysis purposes
  • Inconsistent data between platforms
  • Higher maintenance costs from running three platforms at the same time
  • Horizontal/vertical scalability extremely complicated, or even impossible in certain cases

In response to this, we opted to explore the possibility of merging these platforms, factoring in new business needs.

Given the size of the company, the number of users of our platforms and the high (and growing) volume of data (several hundred terabytes), this change in data architecture was a real challenge, from both a technical and an organizational perspective.

Snowflake chosen to cut down on maintenance and accelerate value generation

We opted for a SaaS solution with the capacity to address all of the problems cited above. This single solution had to be capable of handling all of the use cases previously discussed, while meeting the main criteria below:

  • Availability: Coping with the seasonality of the business, with a particular focus on major commercial events (e.g. sales or Black Friday) and significant peaks in demand, with instant horizontal and vertical scalability (Scale OUT & Scale UP).
  • Performance: Dedicating virtual warehouses to specific usages enables us to guarantee high performance by allocating resources per use case (data ingestion, reporting, ad hoc analysis, data science, etc.).
  • The Great Wall of China: We employ a very simple system for compartmentalising data across our different subsidiaries, with each subsidiary's data held securely by a different owner.
  • Data sharing: When it comes to our partners and B2B clients, we realised we needed a simple and secure way of sharing data. The new platform enables us to deliver this service at high speed (see the sketch after this list).
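
To make the availability and data-sharing criteria concrete, here is a minimal Snowflake SQL sketch of the mechanisms involved: resizing and multi-clustering a virtual warehouse (scale UP / scale OUT), and a secure share for a B2B partner. All warehouse, database, schema and account names here are hypothetical, not our actual configuration, and multi-cluster warehouses assume an Enterprise-edition account.

    -- Scale UP (bigger clusters) and scale OUT (more clusters) ahead of a peak.
    -- Hypothetical warehouse name.
    ALTER WAREHOUSE WH_REPORTING SET
      WAREHOUSE_SIZE = 'XLARGE'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 4;

    -- Hypothetical secure share exposing one KPI table to a partner account.
    CREATE SHARE IF NOT EXISTS PARTNER_SALES_SHARE;
    GRANT USAGE ON DATABASE ANALYTICS TO SHARE PARTNER_SALES_SHARE;
    GRANT USAGE ON SCHEMA ANALYTICS.PUBLIC_KPI TO SHARE PARTNER_SALES_SHARE;
    GRANT SELECT ON TABLE ANALYTICS.PUBLIC_KPI.DAILY_SALES TO SHARE PARTNER_SALES_SHARE;
    ALTER SHARE PARTNER_SALES_SHARE ADD ACCOUNTS = PARTNER_ORG.PARTNER_ACCOUNT;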

New platform, new organization

Following these technological changes and the various problems outlined above, we revised the scope of each team in order to get as close as possible to the needs of our users.

  • Centre of Excellence and data architecture: Responsible for overseeing and scaling the platform, for all architectural choices across the data stack, and for communicating best practices to all our users (developers, data analysts, data scientists) through training and community networks.
  • Run Data: Oversees the platforms and ensures their availability. Handles platform incidents, delivery requests for technical assets specific to the data platforms, and the continuous improvement of flows.
  • Data Management: Responsible for data governance. Classifies, documents, validates and evaluates all data integrated into the platform. Draws on a broad community of data owners, each responsible for a clearly-defined business silo. Assigns each piece of information stored a rating, like a stock price, in order to evaluate its worth.
  • Project Hub: Comprising versatile teams tasked with executing projects for specific lines of business. Responsible for overseeing the successful completion of our plan for switching from our old data platforms to Snowflake, in addition to executing data projects for business product teams.

A simple, high-performance data stack

Drawing on our experience designing industrialised data platform architectures for Cdiscount's various needs, we have expanded and renovated our data stack to build a platform that delivers the following main services:

Ingestion

  • Batch: With our long-standing duo of Talend/DollarUniverse for everything relating to sources such as relational databases, files and APIs.
  • Streaming: A Kafka Connect Docker image deployed on Kubernetes ingests all relevant Kafka topics directly into Snowflake, where they are then reprocessed via Snowflake Streams/Tasks (see the sketch below).
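
As a minimal, hypothetical sketch of that reprocessing step (table, stream, task and warehouse names are illustrative): the Kafka connector lands each message as semi-structured VARIANT rows, and a stream plus a scheduled task flatten new rows into a typed staging table.

    -- Track new rows landed by the Kafka connector (hypothetical names).
    CREATE STREAM IF NOT EXISTS RAW.ORDERS_EVENTS_STREAM
      ON TABLE RAW.ORDERS_EVENTS;

    -- Periodically flatten new VARIANT messages into a typed staging table.
    CREATE TASK IF NOT EXISTS RAW.T_FLATTEN_ORDERS
      WAREHOUSE = WH_INGEST
      SCHEDULE = '5 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('RAW.ORDERS_EVENTS_STREAM')
    AS
      INSERT INTO STAGING.ORDERS (ORDER_ID, AMOUNT, EVENT_TS)
      SELECT RECORD_CONTENT:order_id::NUMBER,
             RECORD_CONTENT:amount::FLOAT,
             RECORD_CONTENT:event_ts::TIMESTAMP_NTZ
      FROM RAW.ORDERS_EVENTS_STREAM;

    -- Tasks are created suspended; resume to activate.
    ALTER TASK RAW.T_FLATTEN_ORDERS RESUME;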

Transformation, synchronisation and aggregation

  • The Talend/DollarUniverse duo is used to trigger, sequence and send queries to Snowflake for post-ingestion processing; the heavy lifting itself runs inside Snowflake (an example of such a query is sketched below).
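
A hypothetical example of the kind of statement such a job might submit — an upsert from a staging table into a warehouse dimension (all table and column names are illustrative):

    -- Hypothetical post-ingestion upsert submitted by the orchestrator.
    MERGE INTO DWH.DIM_PRODUCT d
    USING STAGING.PRODUCT_UPDATES s
      ON d.PRODUCT_ID = s.PRODUCT_ID
    WHEN MATCHED THEN UPDATE SET
      d.LABEL = s.LABEL,
      d.PRICE = s.PRICE,
      d.UPDATED_AT = CURRENT_TIMESTAMP()
    WHEN NOT MATCHED THEN INSERT (PRODUCT_ID, LABEL, PRICE, UPDATED_AT)
      VALUES (s.PRODUCT_ID, s.LABEL, s.PRICE, CURRENT_TIMESTAMP());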

Managing Snowflake roles, access and rights

  • Implemented at a schema level.
  • Executed via specific roles.
  • Automated and coordinated via Terraform (the equivalent SQL is sketched below).
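
In practice Terraform declares these objects for us; the SQL it boils down to looks roughly like the following schema-level model (role, database and schema names are hypothetical):

    -- One read role per schema, granted onwards to functional roles.
    CREATE ROLE IF NOT EXISTS R_SALES_READ;
    GRANT USAGE ON DATABASE ANALYTICS TO ROLE R_SALES_READ;
    GRANT USAGE ON SCHEMA ANALYTICS.SALES TO ROLE R_SALES_READ;
    GRANT SELECT ON ALL TABLES IN SCHEMA ANALYTICS.SALES TO ROLE R_SALES_READ;
    GRANT SELECT ON FUTURE TABLES IN SCHEMA ANALYTICS.SALES TO ROLE R_SALES_READ;

    -- Functional roles inherit schema-level roles.
    GRANT ROLE R_SALES_READ TO ROLE R_DATA_ANALYST;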

Managing changes to object structures (e.g. tables, views, tasks, streams)

  • Snowflake object structures are created and versioned with Liquibase and deployed via Azure DevOps (an example changeset is sketched below).
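
Liquibase supports plain "formatted SQL" changelogs, which fit naturally here; a minimal hypothetical changeset (our actual changelogs may use another format, and all object names are illustrative):

    --liquibase formatted sql

    --changeset data-team:create-daily-orders-view
    CREATE OR REPLACE VIEW ANALYTICS.SALES.V_DAILY_ORDERS AS
    SELECT ORDER_DATE,
           COUNT(*) AS NB_ORDERS,
           SUM(AMOUNT) AS TOTAL_AMOUNT
    FROM DWH.FACT_ORDERS
    GROUP BY ORDER_DATE;
    --rollback DROP VIEW ANALYTICS.SALES.V_DAILY_ORDERS;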

Machine learning

  • ML processing is packaged as Python Docker images running on Kubernetes, leveraging Snowflake's compute power (where needed) and data storage; a typical pattern is sketched below. Data scientists also have the option of using on-demand Jupyter notebooks via Kubernetes.
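
"Leveraging Snowflake's compute" typically means pushing heavy aggregation down to the warehouse so the containerised Python job only pulls back a compact feature set. A hypothetical feature-extraction query of that kind (table and column names are illustrative):

    -- Hypothetical feature extraction pushed down to Snowflake; the training
    -- container retrieves only this aggregated result set.
    SELECT CUSTOMER_ID,
           COUNT(*) AS NB_ORDERS_90D,
           AVG(AMOUNT) AS AVG_BASKET_90D,
           DATEDIFF('day', MAX(ORDER_DATE), CURRENT_DATE()) AS DAYS_SINCE_LAST_ORDER
    FROM DWH.FACT_ORDERS
    WHERE ORDER_DATE >= DATEADD('day', -90, CURRENT_DATE())
    GROUP BY CUSTOMER_ID;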

Documentation

  • All users will soon have access to a data catalogue service.
  • Freely accessible reports document all of the tables and columns available in Snowflake, how they are supplied and how fresh they are.

Ad-hoc analysis and reporting

  • Delivered by Power BI and open to all Cdiscount employees (subject to conditions for accessing underlying data).

Data export and output (reverse ETL)

  • Executed via Talend or Kafka depending on the target output (database, API, search-engine index, NoSQL store, etc.)

Keeping users happy while saving money

Combining our different data platforms into a single platform has enabled us to:

  • Offer new services to users
  • Deliver much higher availability than the previous platforms
  • Drastically improve the performance of user queries and processing
  • Make accessing data easy and secure
  • Significantly reduce project duration (time to market)

This also presented us with an opportunity to reorganise the data hub and to review the roles of each team. There are still one or two technical building blocks to be developed. We are constantly thinking about future changes to the data platform as we seek to make life easier for all our users and to deliver more features.

We will soon be publishing an article outlining how data is governed on the platform, chiefly through systematic classification and anonymisation by design of personal data.
