Getting Started with Data Science — Instacart Dataset & Google BigQuery
In 2017, Instacart open-sourced a dataset of 3 million grocery orders. As far as we know, it is still the largest publicly available dataset of real grocery sales. We will use this very interesting dataset to hone our skills in Exploratory Data Analysis and to try Google Cloud's BigQuery and Data Studio. It will be a fun, informative, beginner-friendly tutorial on data analysis.
There are limitations to this data:
- The users are anonymized. There is no demographic data: no gender, no age. Instacart explained in its blog post that it is too hard to protect users' privacy if such data is included. In real life, Instacart also does not collect such data, but it does use scripts to infer gender from usernames. We have also heard of genderize.io, a developer tool from a different developer at a different company, that figures out gender.
- There is no brand-specific data. All almond milk is classified as almond milk, regardless of brand.
- There is no explicit time or date data, but there is interval data: some fields imply time, and others indicate the sequence of each user's purchases.
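To make the last limitation concrete, here is a minimal sketch of how interval data can still give a relative timeline per user. It assumes the orders table has `user_id`, `order_number` and `days_since_prior_order` columns (our recollection of the published schema; check it against your copy). The sample rows are invented for illustration, and `relative_timeline` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# Hypothetical sample rows shaped like the dataset's orders table.
# Values are invented; None marks a user's first order, which has
# no prior order to measure an interval from.
orders = [
    {"user_id": 1, "order_number": 1, "days_since_prior_order": None},
    {"user_id": 1, "order_number": 2, "days_since_prior_order": 15.0},
    {"user_id": 1, "order_number": 3, "days_since_prior_order": 21.0},
    {"user_id": 2, "order_number": 1, "days_since_prior_order": None},
    {"user_id": 2, "order_number": 2, "days_since_prior_order": 7.0},
]


def relative_timeline(rows):
    """Return {user_id: [days elapsed since that user's first order, ...]}."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row["user_id"]].append(row)

    timelines = {}
    for user_id, user_rows in by_user.items():
        # order_number gives the purchase sequence for each user.
        user_rows.sort(key=lambda r: r["order_number"])
        elapsed, days = 0.0, []
        for r in user_rows:
            # Treat the missing interval on the first order as zero days.
            elapsed += r["days_since_prior_order"] or 0.0
            days.append(elapsed)
        timelines[user_id] = days
    return timelines


print(relative_timeline(orders))
```

The result only places each order relative to that user's first order. Without real dates, orders from different users cannot be aligned on a shared calendar.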
Despite its limitations, it is a huge dataset of granular purchasing data. We love it and we are grateful to Instacart for opening it to data scientists and developers.
Citation: “The Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on April 2019. Instacart Note: the dataset…