What “big data” should retailers collect?

The biggest mistake retailers make when pursuing big data is to collect every data point they can find and store it in their data warehouse.

They could end up investing many years and creating millions of terabytes of data, yet see very little business impact.

It’s not the breadth of your data. It’s the actionability.

First things first

The best way to understand what data to collect is to understand the principles of maximizing profit in both the brick-and-mortar and the online context.

In a brick-and-mortar store, you offer the same layout and shopping experience to all customers, so your goal is to implement the layout that maximizes the chance of a sale for the average customer who walks into that store. You can do much of this just by knowing the breakdown of sales for each store; you do not need to tie each sale to a customer. However, there are also ways to personalize the experience in a brick-and-mortar context, for instance if you knew customers’ identities the moment they walked into the store.

In the online context, you can personalize the experience for each customer because you know their identity once they log in, so your goal is to understand the customer as well as possible and show them the right message, at the right time, in the right place.

Customer data

From a marketing perspective, customer data falls into a hierarchy of importance, ranked by how strongly each action correlates with purchase intent:

  • Purchases (what did they buy?)
  • Add to cart (what did they add to their cart?)
  • Product click (what products did they click on?)
  • Product view (what products did they view?)
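
One way to put this hierarchy to work is to give each event type a weight and sum the weights per customer and product. The sketch below is illustrative only: the weight values, the event names and the `score_intent` helper are assumptions made for this example. Only the ordering of the weights comes from the list above.

```python
from collections import defaultdict

# Hypothetical weights: a higher number means a stronger link to purchase
# intent. The ordering follows the hierarchy above; the values are assumed.
EVENT_WEIGHTS = {
    "purchase": 8,
    "add_to_cart": 4,
    "product_click": 2,
    "product_view": 1,
}


def score_intent(events):
    """Sum event weights for each (customer_id, product_id) pair.

    `events` is an iterable of (customer_id, product_id, event_type) tuples.
    Event types that are not in EVENT_WEIGHTS are skipped.
    """
    scores = defaultdict(int)
    for customer_id, product_id, event_type in events:
        weight = EVENT_WEIGHTS.get(event_type)
        if weight is None:
            continue  # unknown event type: ignore rather than guess a weight
        scores[(customer_id, product_id)] += weight
    return dict(scores)
```

In practice you would tune the weights against your own conversion data rather than rely on fixed numbers like these.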

The most important data to collect is purchase information (and, of course, a means of contacting the customer via email, SMS, etc.). In brick-and-mortar stores, this is done through loyalty programs, because otherwise you do not know who bought what. In online stores, this is usually collected by default, except when you sell through a third party like Amazon.

The next level of data (add-to-carts, clicks and views) can be tracked online using clickstream tools like Google BigQuery and Adobe Marketing Cloud. In brick-and-mortar stores, beacons and RFID tags are starting to gain popularity for tracking things like the number of times a product is lifted from the rack (roughly equivalent to a product click) and the amount of traffic in a certain area of the store.

Data such as browser type, device, country and IP address gives you much-needed information about unregistered online users. For instance, you could personalize content based on where the user’s IP address is located, or figure out whether Mac or PC users are more valuable to you. As an aside, you could also identify common fingerprints for fraud by combining this data with other payment-related data.
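
As a rough illustration of the Mac versus PC comparison, the sketch below computes the average order value per device type. The function name and the input shape, a list of (device, order total) pairs, are assumptions for this example; your own order data will look different.

```python
from collections import defaultdict


def average_order_value_by_device(orders):
    """Return the mean order total for each device type.

    `orders` is an iterable of (device, order_total) tuples, an assumed
    shape chosen for this sketch.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for device, order_total in orders:
        totals[device] += order_total
        counts[device] += 1
    # Every device in `totals` has at least one order, so no division by zero.
    return {device: totals[device] / counts[device] for device in totals}
```

A difference in averages is only a starting point: check that each segment has enough orders before acting on it.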

Competitor data

There is another type of data, used to determine pricing, sizing and product design strategy: information about your own and your competitors’ products, such as ratings, reviews, colors and pricing.

In the brick-and-mortar context, brands rely on research companies to give them such intelligence.

In the online context, this is used extensively when selling on marketplaces like Amazon. Selling on Amazon is somewhat like thinking about product positioning in a physical store. When you search for a term on Amazon, you get the “shelf” of products displayed in a grid. Analyzing the products that show up beside yours (their descriptions, reviews and colors) can help you figure out how you should design new products or tweak existing ones.

Other data (inventory, logistics, etc.)

Much of the other data is not exactly “big data”, but regular data such as inventory, logistics and advertising spend that gets stored in structured databases. Your ERP, WMS and other systems will already capture it, and there’s no need to move it to a separate data warehouse.

This data can be made available to your teams using tools such as RJMetrics, Tableau or Power BI. If you have a more technically oriented team that is comfortable running data pipelines and SQL queries, Amazon Redshift is also a very flexible option.


In a nutshell, you do not need that much “big data” to get started with predictive algorithms. What is more important is the principle of shipping early and often. Define a business outcome you’re looking for, collect the data and ship a deployable app that you can use to achieve that business outcome.

If you can do that, you are already outperforming most of your competition.