Shopify’s Data Science & Engineering Foundations


We currently face a global pandemic. People are in pain around the world. Daily life has been disrupted. Many face monumental financial damage or unemployment. Small businesses are struggling to find creative solutions to adapt and stay afloat — and they need to do so quickly.

At Shopify, we have a mission to make commerce better for everyone, but we did not expect 2020 to start out with a global pandemic. Shopify is a mini-economy, with merchants, partners, buyers, carriers, and payment providers all interacting, and planning helps us build products that positively impact the entire system. When COVID-19 happened, those plans went out the window. …

How I am using basic data work to ensure I am getting a good price on my trip.


It has been 4 years since my wife and I took a vacation in a sunny place. Last time, for our honeymoon, we spent some quality time in Mexico: we enjoyed 10 days at a very nice all-inclusive resort in the Riviera Maya. Since then: a house, two kids, a new job and many other things. After some reflection, we decided that it was time to go back to the beach. So, next December (2019), we (my wife, our 3-year-old, our 4-month-old and I) will be heading to the Riviera Maya once again.

Don’t worry, I am not turning my data blog into a travel and lifestyle blog. I want to share with you how I am making sure I am getting the right price for the trip. …

Picture from Spark and AI Summit 2019

My review of the latest Spark and AI Summit, hosted in San Francisco on April 24th and 25th, 2019.

Last week, the latest edition of the Spark Conference was held. It was my first time attending the conference. Here is a breakdown of its different aspects.

The big news

Databricks, organizer of the conference and the main contributor to Spark, announced a couple of items:


They announced a new project called Koalas, a native implementation of the Pandas API on top of Spark. You can now automagically port your Pandas code to the distributed world of Spark. This will be a fantastic bridge for people used to the Pandas environment. Many online classes and universities teach data science using Pandas. …
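To illustrate the idea, here is a sketch of what that bridge looks like. The example below runs with plain Pandas; the promise of Koalas is that the same DataFrame code can be distributed by swapping only the import (the data itself is made up for illustration):

```python
import pandas as pd
# With Koalas, the same code runs on Spark by swapping this import for
# `import databricks.koalas as pd` (or `import pyspark.pandas as pd`
# in Spark 3.2+, where Koalas was merged into Spark itself).

# Toy sales data: same pandas API, whether local or distributed.
df = pd.DataFrame({"shop": ["a", "b", "a"], "sales": [10, 20, 30]})
totals = df.groupby("shop")["sales"].sum()
print(totals["a"])  # 40
```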


A more technical post about how I ended up efficiently JOINING 2 datasets with REGEX using a custom UDF in SPARK.


For the past couple of months I have been struggling with this small problem: I have a list of REGEX patterns and I want to know which WIKIPEDIA articles contain them.

What I wanted to end up with was a table with the following columns:

  • Wikipedia Article ID
  • Wikipedia Article Text
  • Matching Pattern (or null if no pattern was triggered)
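The excerpt does not show the post's actual code, but the per-row matching logic one might wrap in a Spark UDF can be sketched like this (the pattern list and function names here are my own, for illustration):

```python
import re

# Hypothetical pattern list; in Spark this would be compiled once per
# executor (e.g. captured in the UDF's closure or a broadcast variable).
PATTERNS = [re.compile(p) for p in (r"machine learning", r"neural network")]

def first_match(text):
    """Return the first pattern that matches the article text,
    or None (which Spark surfaces as null in the result table)."""
    for pattern in PATTERNS:
        if pattern.search(text):
            return pattern.pattern
    return None

# In PySpark this function would be registered as a string-returning UDF,
# roughly: match_udf = F.udf(first_match, StringType())
# articles.withColumn("matching_pattern", match_udf("article_text"))
```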


Let’s not start with data science this time; let’s start with psychology. I am far from having any competence in this domain, but I remember being introduced to Maslow’s hierarchy of needs in high school. The best way I can describe it is as the different stages humans must go through to find happiness. To get a better understanding of it, you can look here.

Here is the famous pyramid.

[Image: Maslow’s hierarchy of needs pyramid]

Data science is getting very popular and many people are trying to jump on the bandwagon, and this is GREAT. But many assume that data science, machine learning, or any other buzzword you plug in here, amounts to plugging data into some scikit-learn library. Here is what the actual job is.

To give you some context, all of the following happens after the data has been collected. Don’t get me wrong, I don’t think collection should be considered a simple step, but I would like to focus on data pre-processing and normalization.

The Problem

If you follow my blog, you have probably realized that I work a lot in the machine-to-machine (M2M) field. Recently at work I was trying to cluster machines together based on their behaviour, i.e. their data consumption. …


A more efficient loss function for Siamese NN

At work, we are using Siamese Neural Nets (NNs) for one-shot training on telecom data. Our goal is to create a NN that can easily detect failures in telecom operators’ networks. To do so, we are building an N-dimensional encoding that describes the actual status of the network. With this encoding we can then evaluate the status of the network and detect faults. This encoding has the same goal as something like a word embedding (Word2Vec or others). To train this encoding we use a Siamese network [Koch et al.] to create a one-shot encoding, so it would work on any network. A simplified description of Siamese networks is available here. …
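The excerpt does not show the "more efficient" loss itself, so as a baseline for comparison, here is the standard contrastive loss commonly used to train Siamese encodings (Hadsell et al.); the post's variant may well differ:

```python
def contrastive_loss(distance, same_pair, margin=1.0):
    """Contrastive loss for one pair of encodings.

    Same-class pairs are pulled together (penalize their distance);
    different-class pairs are pushed at least `margin` apart, and
    contribute zero loss once they are farther than the margin.
    """
    if same_pair:
        return distance ** 2
    return max(0.0, margin - distance) ** 2
```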


A needed tool in your data science toolbox

Lately, at work, we had to do a lot of unsupervised classification: we basically had to distinguish N classes from a sample population. We had a rough idea of how many classes were present, but nothing was certain. We discovered that the Kolmogorov–Smirnov test is a very efficient way to determine whether two samples are significantly different from each other.

I will give you a bit of context on the Kolmogorov–Smirnov test and walk you through one problem we solved with it.
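The two-sample KS statistic is just the largest vertical gap between the two empirical CDFs. Here is a small pure-Python sketch of that definition (in practice you would use `scipy.stats.ks_2samp`, which also gives the p-value needed to decide significance):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    distance between the two empirical CDFs. Since ECDFs are step
    functions, it suffices to check the gap at every sample point."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in a + b:
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

A statistic of 0 means the empirical distributions coincide at every point; 1 means the samples do not overlap at all.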

Original post on


Rejecting the null hypothesis. That sounds like a painful memory from a university statistics class, but it’s actually exactly what we want to do here: we want to reject the possibility that the two samples come from the exact same distribution. Let’s look at a very high-level, non-mathematical overview of some of the available tests. If you want a good understanding of the mathematics behind all these tests, use the Wikipedia link provided in each section. …


Marc-Olivier Arsenault

Data Science Lead at Shopify— Personal blog available at
