ETL (extract, transform, load) + Database

Emily Jaekle · Published in Superior Sommelier · Nov 16, 2017

Initially we looked at using Wine.com’s API, but unfortunately it is no longer accessible as of September of this year. We then looked into scraping data from a few different websites, but that did not seem possible either: for many of these sites, data is the product, so they all make some attempt to guard it. Many websites have measures in place to challenge bots and block simple scraping methods. In some cases these protections are tough to defeat (and circumventing them raises ethical considerations). Finally, we found data that someone else had compiled into two .csv files:

https://data.world/petergensler/cellartracker-reviews

http://snap.stanford.edu/data/web-CellarTracker.html

We also found a dataset from Kaggle with similar information that also includes wine price, although it only has 150,000 wine reviews:

https://www.kaggle.com/zynicide/wine-reviews

Since the Kaggle dataset has only 150,000 reviews, versus over 2 million in Peter Gensler’s dataset, we will proceed (at least initially) with Peter’s data.

We created a Python script that reads in the .csv files and loads the data into a SQLite database. The script runs through each line of the raw data and loads the rows individually into the database (see the sketch after the field list below). At this time, each row in our database represents an individual’s review of a wine, with all available data fields:

  • Review_points
  • Review_text
  • Review_time
  • Review_userId
  • Review_userName
  • Wine_name
  • Wine_variant
  • Wine_wineId
  • Wine_year
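
Below is a minimal sketch of this load step using Python’s built-in csv and sqlite3 modules. The file, table, and column names are assumptions based on the field list above; the actual script may differ.

```python
import csv
import sqlite3

# Column names assumed from the field list above (hypothetical headers)
FIELDS = [
    "review_points", "review_text", "review_time", "review_userId",
    "review_userName", "wine_name", "wine_variant", "wine_wineId",
    "wine_year",
]

conn = sqlite3.connect("cellartracker.db")  # placeholder filename
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS reviews (%s)" % ", ".join(FIELDS))

with open("cellartracker-reviews.csv", newline="", encoding="utf-8") as f:
    # Load each row individually, as described above
    for row in csv.DictReader(f):
        cur.execute(
            "INSERT INTO reviews VALUES (%s)" % ", ".join("?" * len(FIELDS)),
            [row.get(field) for field in FIELDS],
        )

conn.commit()
conn.close()
```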

Since we currently have over 1 GB of data, we are setting up a directory on the CS GPU cluster where the database and scripts can be stored.

In order to get pricing, we scraped Vivino’s website (https://www.vivino.com/) for the average price per bottle. We were able to get pricing for 100,000 wines.
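
For illustration, the general fetch-and-parse pattern looks like the sketch below. The function and CSS selector are hypothetical placeholders (Vivino’s actual markup differs and changes over time), and scraping in practice also needs rate limiting and error handling.

```python
import requests
from bs4 import BeautifulSoup

def average_price(wine_url):
    """Fetch a wine page and pull out the average bottle price.

    The CSS selector below is a placeholder, not Vivino's real markup.
    """
    resp = requests.get(wine_url, headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    price_tag = soup.select_one(".average-price")  # placeholder selector
    return price_tag.get_text(strip=True) if price_tag else None
```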

Our updated database has the following structure (consolidated to the most important tables):

[Diagram: consolidated database structure]

The full database structure is as follows:

[Diagram: full database schema]

Once we had our database set up, we had to clean the review text before we could use it effectively. We did this in three steps: tokenizing, lemmatizing/stemming, and removing stopwords.

Tokenizing

The first thing we did was tokenize our reviews. The tokenizer took in a review string and separated it into individual words, which we could then stem/lemmatize.
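
As a sketch, tokenizing with NLTK’s word_tokenize (one reasonable choice; assumed here for illustration) looks like this:

```python
from nltk.tokenize import word_tokenize
# import nltk; nltk.download("punkt")  # tokenizer models, needed once

review = "Lots of ripe peaches on the nose, with a peachy finish."
tokens = word_tokenize(review.lower())
print(tokens)
# ['lots', 'of', 'ripe', 'peaches', 'on', 'the', 'nose', ',',
#  'with', 'a', 'peachy', 'finish', '.']
```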

Lemmatizing vs. Stemming

In order to account for variations in words (peaches, peachy, and peach, for example), we decided to lemmatize/stem our words. Originally we used stemming, with the stemmer from NLTK (the Natural Language Toolkit). Unfortunately, the stemmer did not work the way we wanted it to. An example was the peach, peaches, peachy case: while it stemmed peaches to peach, peachy went to peachi (which is not a word, let alone the adjective we wanted). We then decided to explore lemmatizing. The lemmatizer from NLTK requires a part-of-speech input (noun, verb, adjective, etc.). We decided to lemmatize our nouns and adjectives. For the most part the adjectives did not lemmatize to nouns (peachy stayed as peachy), but for our word clouds at least they remained recognizable words. We decided not to lemmatize verbs because we did not want verbs to have high weights (based on frequency) in our model, since they were not representative of tasting notes.
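
Here is the peach example sketched with NLTK’s PorterStemmer and WordNetLemmatizer (assuming the Porter stemmer; NLTK ships several):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# import nltk; nltk.download("wordnet")  # lemmatizer data, needed once

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["peach", "peaches", "peachy"]

print([stemmer.stem(w) for w in words])
# ['peach', 'peach', 'peachi']  <- 'peachi' is not a word

print([lemmatizer.lemmatize(w, pos="n") for w in words])
# ['peach', 'peach', 'peachy']  <- nouns collapse; 'peachy' stays a word
```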

Stopwords

In addition to the above steps, we also needed to create a list of stopwords to remove from our data. Many of these stopwords were common to every review; some examples are bottle, wine, good, and drink. Beyond words common to every review, we also removed the names of wine varieties from our data, as sketched below.
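
A sketch of the stopword filtering, combining NLTK’s English stopword list with domain words and variety names (the sets below are illustrative samples, not our full lists):

```python
from nltk.corpus import stopwords
# import nltk; nltk.download("stopwords")  # stopword list, needed once

# Illustrative subsets; the real lists were longer
domain_stopwords = {"bottle", "wine", "good", "drink"}
variety_names = {"chardonnay", "merlot", "riesling", "zinfandel"}

stop = set(stopwords.words("english")) | domain_stopwords | variety_names

tokens = ["a", "good", "merlot", "with", "ripe", "peach", "notes"]
print([t for t in tokens if t not in stop])
# ['ripe', 'peach', 'notes']
```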
