Analyze clients & orders with a unified data platform

Adrien Auclair
Serenytics
Nov 10, 2017 · 5 min read


Creating daily-updated KPIs and dashboards from large clients & orders datasets (or any star-schema dataset) is not an easy project. It usually requires an ETL tool (like Talend), a data warehouse (like AWS Redshift) and a dataviz tool (like Tableau). Setting up and maintaining these three modules in production requires technical skills and creates opportunities for errors and delays.

In this article, I’ll show you how this kind of data project can be highly simplified using our unified data platform: Serenytics.

Note that for a one-shot analysis, a data analyst/scientist could also use low-level tools (e.g. Python, Pandas and Jupyter notebooks). But this requires programming skills, is usually time-consuming (mostly spent reading the Pandas documentation and StackOverflow) and is limited to small datasets. Most importantly, it is not suited to KPIs/dashboards that must be updated daily and shared.

The input data

For the purpose of this article, we’ll use an ORDERS-CLIENTS model:

  • The ORDERS dataset contains one row per order (from an e-commerce website or from a physical store) with columns such as date, price, client-id.
  • The CLIENTS dataset contains one row per client with columns such as id, inscription-date, birthdate, country.

And the final goal is to compute KPIs like:

  • Revenue by country, by age, by month.
  • Monthly growth of the number of clients by age (age is not a column of the dataset; let's say only birthdate is available)
  • Number of customers who are new churners this month
  • Number of customers who are re-activated this month (i.e. did not buy for the last 12 months, but bought in the previous month)
  • Client segmentation by number of orders (1 order, 2–5 orders, 6–10 orders…)

We could have chosen to work with any CLIENTS & INTERACTIONS data (e.g. opens & clicks data from marketing campaign tools like Adobe Campaign, Selligent…) or any data with a star schema. The following steps would be very similar.

Loading the data

Let’s say we have one CSV file for the CLIENTS and another one for the ORDERS.

Within Serenytics, we just need to create one BigCSV data source per file and load the associated CSV file into it. As a result, the data is now stored in our internal data warehouse (i.e. in AWS Redshift). This also works for large files (we've done projects with 200M orders).
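If you prefer to follow along in code, here is a minimal pandas sketch of the same two datasets. The file names clients.csv and orders.csv are assumptions for illustration; in Serenytics, the files are simply uploaded into the two BigCSV data sources.

import pandas as pd

# Hypothetical file names, used only to mirror the two BigCSV data sources.
clients = pd.read_csv("clients.csv", parse_dates=["inscription-date", "birthdate"])
orders = pd.read_csv("orders.csv", parse_dates=["date"])

# clients: one row per client (id, inscription-date, birthdate, country, ...)
# orders:  one row per order  (date, price, client-id, ...)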

Joining ORDERS and CLIENTS in a single data source

To create a table that contains columns from both ORDERS and CLIENTS, we have to create a join between the two data sources we just created.

To achieve this, we create a new data source of type Join, selecting the two initial data sources and the keys:

  • id column for the CLIENTS dataset
  • client-id for the ORDERS dataset (using an inner join)

As a result, we obtain a data source we can already use to create a dashboard with basic KPIs (revenue by country, revenue by month, number of clients by month…). But that’s not enough for more advanced KPIs.

Note that in the next paragraphs, we’ll refer to this join table as ORDERS_CLIENTS_JOIN.
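For comparison, here is the same inner join in the pandas sketch started above (purely illustrative; note that after the Serenytics join, columns are referenced with a source prefix such as [client.birth_date] or [order.date], whereas this sketch keeps the raw column names):

# Inner join on the keys described above: CLIENTS.id <-> ORDERS.client-id.
orders_clients_join = orders.merge(
    clients,
    left_on="client-id",
    right_on="id",
    how="inner",
)
# Result: one row per order, enriched with the client's columns.

Note that an inner join drops clients without any order from this table, which is one reason to keep the original CLIENTS source around for client-only KPIs.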

Enriching your data

From the ORDERS_CLIENTS_JOIN data source, using the Formulas tab, we'll add the new columns required for our analysis, without writing SQL.

For this example, we’ll add the following columns:

  • age defined as a simple formula:
    date_diff([client.birth_date], now(), "year")
  • age-bin defined as a conditional formula:
    IF [age] <=18 THEN "0-18"
    ELSE IF [age] <= 30 THEN "19-30"
    ELSE IF [age] <= 40 THEN "31-40"
    ELSE IF [age] <= 50 THEN "41-50"
    ELSE "51-100"
  • is_order_in_last_12_months_before_previous_month defined as a conditional formula:
    IF [order.date] >= month(-13) and [order.date] < month(-1) THEN 1
    ELSE 0
  • is_order_in_previous_month defined as a conditional formula:
    IF [order.date] in month(-1) THEN 1
    ELSE 0
  • growth_clients_creation_YtoY defined as a value formula:
    (count_if([client.id], [client.date-creation] in year(0)) - count_if([client.id], [client.date-creation] in year(-1))) / count_if([client.id], [client.date-creation] in year(-1))

Note that this last formula would be more readable with two intermediate formulas: nb_clients_created_this_year and nb_clients_created_last_year.
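For illustration, here is a rough pandas equivalent of these formulas, continuing the sketch above. My reading of the Serenytics date functions is an assumption: I take month(-1) as the previous calendar month, month(-13) as thirteen calendar months ago, and year(0)/year(-1) as the current/previous calendar year.

import pandas as pd

today = pd.Timestamp.today().normalize()
this_month_start = today.to_period("M").to_timestamp()
prev_month_start = this_month_start - pd.DateOffset(months=1)   # start of month(-1)
window_start = this_month_start - pd.DateOffset(months=13)      # start of month(-13)

df = orders_clients_join

# age: full years between birthdate and today (365-day approximation)
df["age"] = (today - df["birthdate"]).dt.days // 365

# age-bin: same thresholds as the conditional formula above
df["age-bin"] = pd.cut(
    df["age"],
    bins=[0, 18, 30, 40, 50, 100],
    labels=["0-18", "19-30", "31-40", "41-50", "51-100"],
    include_lowest=True,
)

# order flags reused later for the re-activation KPI
df["is_order_in_last_12_months_before_previous_month"] = (
    (df["date"] >= window_start) & (df["date"] < prev_month_start)
).astype(int)
df["is_order_in_previous_month"] = (
    (df["date"] >= prev_month_start) & (df["date"] < this_month_start)
).astype(int)

# year-over-year growth of client creations (assuming inscription-date is the
# column that [client.date-creation] refers to)
created_year = clients["inscription-date"].dt.year
nb_this_year = (created_year == today.year).sum()
nb_last_year = (created_year == today.year - 1).sum()
growth_clients_creation_YtoY = (nb_this_year - nb_last_year) / nb_last_year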

Computing data per client

From the steps above, we obtained a data source with one row per order. But let's say we want to create a bar chart of the number of re-activated customers (i.e. those who did not buy during the last 12 months, but bought in the previous month) by age-bin. We need a table with one row per client, with a column indicating whether or not each client was re-activated.

To create this new table in the data-warehouse, in the Automation menu, we create an ETL step with these settings:

  • Input data source: we select the ORDERS_CLIENTS_JOIN source created previously
  • Output source: let’s name it COMPUTED_INFO_BY_CLIENT
  • Group-by dimensions: we group by client.id to have one row per client

In the Data section, we select all the metrics we want to aggregate by client:

  • sum(is_order_in_last_12_months_before_previous_month)
    We rename this aggregate to nb_orders_in_last_12_months_before_previous_month
  • sum(is_order_in_previous_month)
    We rename this aggregate to nb_orders_in_previous_month

To actually create the new table, in the Execution tab, we click on Run Now.
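The pandas counterpart of this ETL step is a plain group-by with two summed aggregates (again a sketch of the logic, not of what Serenytics runs internally):

# One row per client, summing the two order flags computed earlier.
computed_info_by_client = (
    orders_clients_join
    .groupby("id", as_index=False)
    .agg(
        nb_orders_in_last_12_months_before_previous_month=(
            "is_order_in_last_12_months_before_previous_month", "sum"),
        nb_orders_in_previous_month=("is_order_in_previous_month", "sum"),
    )
)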

In the created COMPUTED_INFO_BY_CLIENT data source, we can now add a conditional formula is_reactivated_in_previous_month:

IF [nb_orders_in_previous_month]>0 and [nb_orders_in_last_12_months_before_previous_month]==0 THEN 1
ELSE 0

In some situations, we also want to pass raw client information from the input join through to the output table (e.g. the client country). The best option is to add this column to the group-by dimensions selected in the ETL step. An equivalent solution is to add it as a metric and use the min aggregation.
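Continuing the sketch, the re-activation flag and the country pass-through could look as follows (here the country is carried by adding it to the group-by keys rather than by a min aggregation; the two options are equivalent since a client has a single country):

# Re-activation flag: bought in the previous month but not in the
# 12 months before it (same logic as the conditional formula above).
computed_info_by_client["is_reactivated_in_previous_month"] = (
    (computed_info_by_client["nb_orders_in_previous_month"] > 0)
    & (computed_info_by_client["nb_orders_in_last_12_months_before_previous_month"] == 0)
).astype(int)

# Carrying a raw client column (e.g. country) through the aggregation.
computed_info_by_client_with_country = (
    orders_clients_join
    .groupby(["id", "country"], as_index=False)
    .agg(nb_orders_in_previous_month=("is_order_in_previous_month", "sum"))
)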

Creating the dashboard

We now have 4 data sources to use in our project:

  • ORDERS: one row per order
  • CLIENTS: one row per client
  • ORDERS_CLIENTS_JOIN: one row per order, with columns about orders and clients
  • COMPUTED_INFO_BY_CLIENT: one row per client, with computed columns

Within the dashboard editor, it’s now very easy to compute many KPIs from these 4 data sources.
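To make this concrete, here is how two of these KPIs would be computed in the pandas sketch used throughout this article (purely illustrative; in Serenytics they are configured directly in the dashboard editor, without code):

# Revenue by country and by month, from the one-row-per-order table.
revenue_by_country_month = (
    orders_clients_join
    .assign(month=orders_clients_join["date"].dt.to_period("M"))
    .groupby(["country", "month"], as_index=False)["price"]
    .sum()
)

# Number of re-activated customers by age bin. A client has a single
# birthdate, hence a single age-bin, so we can deduplicate per client.
age_bin_by_client = orders_clients_join[["id", "age-bin"]].drop_duplicates("id")
reactivated_by_age_bin = (
    computed_info_by_client
    .merge(age_bin_by_client, on="id")
    .groupby("age-bin", observed=True)["is_reactivated_in_previous_month"]
    .sum()
)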

Automating the process

The data loading described in this article is manual. In a production environment, we would schedule it (e.g. loading the CSV files from an SFTP server each night). This can be done with a few clicks in Serenytics.

We would also:

  • Add a data cleaning step right after the loading.
  • Automatically reload the dashboard cache each night so that the dashboard opens instantly every morning.
  • Launch all these Automation steps (loading + cleaning + computing intermediate tables + cache reloading) within a single Flow task, which sends an email alert if anything goes wrong.
  • Send the dashboard by email as a PDF file each Monday morning.

Conclusion

In this article, we've seen all the steps required to perform a clients-orders analysis with Serenytics. Everything is done within a single cloud application, without any tools to set up and without any programming or SQL skills.

Of course, one can achieve the same result with a multi-tool stack (e.g. Talend / AWS Redshift / Tableau) and benefit from the advanced features of each module. But that comes with a lot more complexity, and only people skilled in the full stack (or a team of experts with at least one data analyst and one devops engineer) can deliver the project. In most cases, the total cost of the project would be much lower with a unified data platform such as Serenytics.
