Post Exploratory Statistics

Emily Jaekle
Published in Superior Sommelier
Nov 16, 2017

The database we created from the CellarTracker data has the following statistics:

Notably, price is missing from this dataset. We plan to add it to the overall database by matching wines against other datasets. However, we will likely not have as much data on price, and we will need to be careful when collecting it to control for time (when the price was recorded) and location (where the price was recorded) to avoid inconsistencies in the data.

We looked at the data in tabular form, and while that was somewhat informative, we also created some visualizations to get a better idea of the scope of the data for our purposes.

One of the most important considerations is the number of reviews associated with each wine. If we have too few reviews per wine, there is a limited number of conclusions we can draw.

(for the top 100,000 most reviewed wines)

With that in mind, we’re in pretty good shape, as you can see in the figure on the left. For the 100,000 most reviewed wines, our mean is ~14 reviews/wine, with a lot of outliers on the high end.
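As a rough illustration of how a reviews-per-wine summary like the one above could be computed, here is a minimal sketch in plain Python. It is not the code we used for the figure: the function name `reviews_per_wine`, the wine IDs, and the toy data are all hypothetical, and it assumes the reviews are available as a flat list with one wine identifier per review.

```python
from collections import Counter
from statistics import mean, median


def reviews_per_wine(review_wine_ids, top_n=100_000):
    """Return review counts for the top_n most reviewed wines, highest first."""
    counts = Counter(review_wine_ids)
    return [n for _, n in counts.most_common(top_n)]


# Toy data (hypothetical): one entry per review, holding the reviewed wine's ID.
reviews = ["w1", "w1", "w1", "w1", "w2", "w2", "w3", "w1", "w2", "w4"]

counts = reviews_per_wine(reviews, top_n=3)
print("reviews per wine:", counts)
print("mean:", mean(counts), "median:", median(counts))
```

Reporting the median next to the mean is a deliberate choice here: with a lot of outliers on the high end, the mean alone can overstate how many reviews a typical wine has.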

(for the top 100,000 wines, with the 1,000 highest priced wines removed)

Using the given wine names, we searched the wine app Vivino and were able to obtain prices. We have a good distribution of wine prices for the 100,000 most reviewed wines, as the figure on the left shows. The high average isn’t a surprise given the data source.
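The price figure is described as having the 1,000 highest priced wines removed, so a sketch of that kind of trimming may help show why it matters. This is an assumption about how such a step could be done, not a record of our actual code; `drop_highest_priced` and the toy prices are hypothetical.

```python
from statistics import mean


def drop_highest_priced(prices, n_drop=1_000):
    """Return prices sorted ascending with the n_drop largest values removed."""
    ordered = sorted(prices)
    if n_drop <= 0:
        return ordered
    # Slicing with a negative stop drops the last (largest) n_drop values.
    return ordered[:-n_drop]


# Toy prices (hypothetical), including two very expensive bottles.
prices = [12.0, 25.0, 18.0, 950.0, 40.0, 3200.0]

trimmed = drop_highest_priced(prices, n_drop=2)
print("mean before trimming:", mean(prices))
print("mean after trimming:", mean(trimmed))
```

Even in this tiny example, a couple of extreme prices dominate the mean, which is the usual reason for trimming the top of the range before plotting a distribution.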

Additionally, we have a rich range of wine varieties to work with, as the figure below shows:

(for the top 100,000 most reviewed wines)
