Stories by Ricky Kim on Medium

Building Spotify Discover Weekly Email Alert with Luigi

Ricky Kim — Mon, 03 Jun 2019 14:22:43 GMT

A simple Luigi data pipeline

First of all, I would like you to understand that this is a record of my personal learning. You might find a better way to implement things than what I have tried and shown here. If you have any suggestions to improve the code, I’d be very happy to hear your advice and comments.

If you are a Spotify user, you may have heard of their feature called “Discover Weekly”. Discover Weekly is a playlist of 30 songs that Spotify recommends based on your listening history. I absolutely love Discover Weekly, and sometimes I even get a little scared by how well Spotify knows me. And it appears it is not just me.

dave horwitz on Twitter

It's scary how well @Spotify Discover Weekly playlists know me. Like former-lover-who-lived-through-a-near-death experience-with-me well.

The only problem I have with Discover Weekly is that I can’t access my past Discover Weekly playlists, since the playlist automatically refreshes every Monday. If I forget to save the songs I like to my library or playlists, the list is completely gone the next week, and I have no way to figure out which song it was that I absolutely loved but forgot to save.

The small private project I am sharing here started from the above problem. The solution I came up with is to extract the list of songs from Discover Weekly every Monday and send the list as an email to myself. It was a perfect opportunity to try a simple data pipeline with Luigi.

Requires:

Spotify account
Gmail account
AWS account

(BTW I am using a Macbook for this project, some of the steps might be slightly different if you are on Windows.)

Getting Spotify API Access

In order to be able to access Spotify programmatically, you need client ID and client secret from Spotify. You can get them easily by going to Spotify for Developers page. Once you are in the dashboard page, you can click on the green “CREATE A CLIENT ID” button, then you will be asked questions like app name, description, etc.

Next, you will be asked if the app is for commercial use. In my case, I am just building this for myself, so I clicked “No”.

Finally, tick the checkboxes, then submit.

Then you will be given a client ID and a client secret that you can use to access Spotify API.

Write them down or copy and paste them somewhere, because we will need them later. Click “EDIT SETTINGS”, and add “https://localhost:8080” to Redirect URIs. In a proper app, this would redirect the user to the app after confirming the API access, but in this case, it will only be used as part of the authentication parameters.

One last thing you need to do is to follow Discover Weekly on your Spotify. This makes it possible to retrieve Discover Weekly from our Python program.

Launching EC2 Instance on AWS

Sign in to your AWS Management Console, and click into EC2. I am writing this assuming that you already have an AWS account. Click “Launch Instance”. For this project, I chose Amazon Linux AMI 2018.03.0.

Make sure the instance type chosen is t2.micro which is eligible for free tier. One important step we should do is to open the port for Luigi so that we can access Luigi’s central scheduler.

Keep the default settings for the rest except for “6. Configure Security Group”. Once you get here, click “Add Rules”, choose Custom TCP from the “Type” dropdown, and type “8082” in “Port Ranges”. Luigi’s central scheduler GUI uses port 8082 by default, so this step enables us to access the Luigi GUI in a web browser. As an additional step, you can add your own IP address in the “Sources” section, so that you only allow inbound traffic from a certain IP address. If you want to allow only your own IP address, type “your-IP-address/32” in the “Sources” section. Now click “Review and Launch”.
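
To see why “/32” restricts the rule to a single address, here is a minimal sketch using Python’s standard ipaddress module. The helper name describe_source and the address 203.0.113.7 (a documentation-only placeholder) are assumptions for illustration, not values from this project.

```python
import ipaddress


def describe_source(cidr: str) -> int:
    """Return how many IP addresses a security-group source range covers."""
    network = ipaddress.ip_network(cidr, strict=False)
    return network.num_addresses


# 203.0.113.7 is a placeholder address reserved for documentation.
print(describe_source("203.0.113.7/32"))  # one single address
print(describe_source("0.0.0.0/0"))       # every IPv4 address
```

A /32 range covers exactly one address, while 0.0.0.0/0 would leave port 8082 open to everyone.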

Next, you will be prompted to either choose an existing key pair or create a new one. The key will be used when communicating with your instance from your local machine. Let’s create a new key pair for the project.

First download the key pair, then launch the instance. Go back to the EC2 section of the AWS console, and you might see that the instance is not yet in the “running” state. Give it a few moments, and when it turns to “running”, take note of its Public DNS (IPv4) and IPv4 Public IP.

Additional AWS Preparation

Either from terminal or on AWS web console, create an S3 bucket named “luigi-spotify”. This will be later used to store the list of songs extracted from Spotify as TSV.

Connecting to EC2 Instance

I hope there was nothing too complicated so far. Now that we have launched the instance, we can ssh into it. Before we do that, we need to change the file permission of the key pair we downloaded above, because the ssh client will refuse to use a private key file that other users can read. Open your terminal and run the command below after replacing the “directory…” part with your own directory.

chmod 400 directory-where-you-downloaded-the-key-file/luigi_spotify.pem

There are 3 permissions (Read, Write, Execute) for 3 types of users (User, Group, Others). The above command changes the file permission so that the key file has only one permission (Read) allowed for one type of user (User). Now we are ready to ssh into our instance. Again, please replace the “directory…” and “your-instance…” parts with your own directory and public DNS.

ssh -i directory-where-you-downloaded-the-key-file/luigi_spotify.pem ec2-user@your-instance's-public-DNS
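
If the octal number in chmod 400 feels opaque, the sketch below renders permission bits the same way “ls -l” prints them, using only Python’s standard stat module. The helper name describe_mode is an assumption made up for this illustration.

```python
import stat


def describe_mode(mode: int) -> str:
    """Render permission bits of a regular file the way `ls -l` shows them."""
    return stat.filemode(stat.S_IFREG | mode)


print(describe_mode(0o400))  # read-only for the owner, nothing for group/others
print(describe_mode(0o644))  # a typical, more permissive default for comparison
```

The first digit is the owner, the second the group, the third everyone else; 4 means read, so 400 leaves a single read permission for the owner.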

Preparing EC2 Instance for Luigi Tasks

Once in your EC2 instance, let’s first install Git so that we can clone the repository I prepared for this project.

sudo yum -y install git

Now clone the repository using git clone command.

git clone https://github.com/tthustla/luigi_spotify.git

Go to the cloned directory, and let’s first take a look at files.

cd ~/luigi_spotify
ls

ec2-prep.sh will be used to install required packages. luigi.cfg is a configuration file where you will put all the API keys and credentials. luigi_cron.sh is a bash script that runs the Luigi pipeline defined in run_luigi.py.

Make both of the bash scripts executable by running the command below.

chmod +x *.sh

Now let’s first run the ec2-prep.sh

./ec2-prep.sh

Luigi

Before we actually run the pipeline, it’d be good to have an understanding of what the pipeline does and how it does it. Below is the code for run_luigi.py.

https://medium.com/media/e2123c34bb7081fddabb8934f969808a/href

At a high level, it performs two tasks. First, it gets the list of songs from the Discover Weekly playlist and stores them in S3 as a TSV file. Once the TSV has been stored, it uses the TSV to create an email message that shows [Song Title] — [Artist] as Spotify links, then sends the message to you. The first task is defined in the GetWeeklyTracks class, and the second is defined in the SendWeeklyEmail class. For these tasks to run, they need credential info, which is retrieved from the luigi.cfg file using the luigi.Config class. Getting the Spotify API token and establishing the connection to Spotify are done outside of the Luigi tasks.
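
To make the second task more concrete, here is a minimal sketch of turning TSV rows into linked lines for an email body. It is not the code from run_luigi.py: the column names (name, artist, url), the function name build_email_body, and the sample track are all assumptions for illustration, and it uses a plain hyphen as the separator.

```python
import csv
import io
from html import escape


def build_email_body(tsv_text: str) -> str:
    """Turn TSV rows into one HTML link per track, separated by line breaks."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        # Escape values so odd characters in titles cannot break the HTML.
        label = escape(f"{row['name']} - {row['artist']}")
        url = escape(row["url"], quote=True)
        lines.append(f'<a href="{url}">{label}</a>')
    return "<br>\n".join(lines)


# Hypothetical TSV content with a header row and one track.
sample = "name\tartist\turl\nSong A\tArtist A\thttps://open.spotify.com/track/abc\n"
print(build_email_body(sample))
```

Keeping the formatting in a small pure function like this makes it easy to check the email layout without touching Spotify, S3, or Gmail.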

Running Luigi on Local Scheduler

The next thing we need to do is fill in the information in the luigi.cfg file. First, open the file with Nano.

nano luigi.cfg

Fill in each value with your own credentials, without single or double quotes. Now we are finally ready to do a local test run of the pipeline within our EC2.

python3 run_luigi.py SendWeeklyEmail --local-scheduler

Due to Spotipy’s (a Python library for Spotify API) authentication flow, you will see instructions like below.

If tested on your local machine, this would have opened the web browser, but that doesn’t happen on EC2 since there is no web browser installed. Copy the URL (the blue highlighted part in the above screenshot), then paste it into a web browser on your local machine and open it.

If you see a screen like above, click “AGREE”, and it will then show an error page like below. You don’t have to worry about this error page. It appears because the redirect URI we provided is just a localhost port without anything running on it. Copy the URL of the error page; there is a code embedded in the URL that will be used by Spotipy’s authentication flow.
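
Spotipy reads the code out of the pasted URL for you, but if you are curious where it sits, the sketch below pulls it out with Python’s standard urllib.parse. The function name extract_auth_code and the example code value are made up for illustration.

```python
from urllib.parse import parse_qs, urlparse


def extract_auth_code(redirected_url: str) -> str:
    """Return the value of the `code` query parameter from a redirect URL."""
    query = parse_qs(urlparse(redirected_url).query)
    if "code" not in query:
        raise ValueError("no 'code' parameter found in the redirected URL")
    return query["code"][0]


# Hypothetical redirect URL; a real code is a much longer string.
example = "https://localhost:8080/?code=AQD-example-code"
print(extract_auth_code(example))
```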

Now go back to your EC2 terminal and paste the URL into the console, where it shows “Enter the URL you were redirected to:”. Chances are high that this won’t succeed on the first try, because Google may block the Gmail login from an unknown IP address at first. If this happens, log in to your Gmail using your local machine’s web browser, then try running the command again. If everything goes well, you will receive the email sent from the trial run.

Running Luigi on Central Scheduler

We are almost there. The trial run succeeded, and we have a few more steps to go. Now let’s do a proper run of the pipeline with Luigi’s Central Scheduler, so that we can access the Luigi GUI. First, create a directory for log files.

mkdir ~/luigi_spotify/log

When we do a background run, console output will be stored in log files in the directory we just created. Let’s launch the Luigi Central Scheduler.

luigid --background --logdir ~/luigi_spotify/log

Since we have opened the port 8082 to access GUI, we can now open GUI on a web browser. Open the page with the IPv4 Public IP and “:8082” attached as below.

your-EC2-public-IP:8082

We haven’t run any tasks yet, so you won’t see any tasks now. Now let’s run the pipeline without the “--local-scheduler” param at the end. You might want to delete the folder created in the S3 bucket during the trial run to see the whole pipeline running from scratch. Otherwise, Luigi will see that the output files already exist in the S3 bucket and mark the tasks as successful without running any of them. Now you will see two tasks run successfully.
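
The “output already exists, so skip the work” behaviour is easier to picture with a toy example. The sketch below is plain Python, not Luigi’s actual implementation: run_if_missing and the tracks.tsv file name are assumptions, and a local temporary folder stands in for the S3 bucket.

```python
import os
import tempfile


def run_if_missing(output_path, work):
    """Run `work` only when the output file does not exist yet."""
    if os.path.exists(output_path):
        return "skipped"  # the output is there, so the task counts as done
    with open(output_path, "w") as handle:
        handle.write(work())
    return "ran"


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "tracks.tsv")
    first = run_if_missing(target, lambda: "song\tartist\n")
    second = run_if_missing(target, lambda: "song\tartist\n")
    print(first, second)
```

The second call does nothing because the file from the first call is still there, which is why the leftover folder from the trial run has to be deleted to see the tasks actually execute.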

Hooray!!

Image courtesy of vernonv on Redbubble

Setting Up a Cron Job

The very last part is to set up a Cron job so that we can decide when and how frequently these tasks run. One thing you have to consider here is that your EC2 instance’s system time might be different from your local time. Run the commands below to set the time zone in your EC2 instance.

cd /usr/share/zoneinfo
tzselect

After you select the right time zone, copy the part that’s highlighted in blue from your own terminal and run it. Since I will finish setting up my EC2 without restarting, I just ran the command directly in the terminal without appending it to ‘.profile’.

TZ='Europe/London'; export TZ

We will be setting up a Cron job with luigi_cron.sh that will run run_luigi.py. As you will see from the Cron command below, I am specifying LC_CTYPE with the correct value of the EC2 instance. This small part took me a while to figure out. The same file and the same tasks were throwing an encoding error when run as a Cron job, while they worked perfectly fine without Cron. After a lot of googling, I finally found a way that works. You can find the EC2 instance’s LC_CTYPE value by running “locale”.

locale
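
My best guess at why this matters (an assumption, since I only saw a generic encoding error) is that Cron starts jobs with a minimal environment, so Python can fall back to an ASCII-only encoding and choke on song titles with accented or non-Latin characters. The sketch below shows that kind of failure; the helper can_encode and the sample title are made up for illustration.

```python
import locale


def can_encode(text: str, encoding: str) -> bool:
    """Check whether `text` can be represented in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


title = "Señorita"  # hypothetical track title with a non-ASCII character

# The encoding Python prefers in the current environment.
print(locale.getpreferredencoding(False))
print(can_encode(title, "ascii"))  # ASCII cannot represent the accented letter
print(can_encode(title, "utf-8"))  # UTF-8 can
```

Running the first print both in your terminal and from a Cron job is a quick way to see whether the two environments really differ.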

Once you have the LC_CTYPE info, open the Crontab with the command below.

crontab -e

You won’t find anything there yet. Press “i” and go into “insert” mode, then paste below code, and press esc then type “:wq” to write the changes and exit.

0 8 * * MON LC_CTYPE="en_GB.UTF-8" ~/luigi_spotify/luigi_cron.sh

The above Crontab expression schedules the bash script to run at 08:00 AM every Monday, but you can change it to your own preference. If you need help with Crontab expressions, you can try your own expression at https://crontab.guru/.
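
If the five schedule fields are hard to remember, the sketch below splits a Crontab line into named parts. It is only a reading aid, not a full Cron parser, and the names FIELDS and split_cron are assumptions for this illustration.

```python
FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(line: str) -> dict:
    """Split a crontab line into its five schedule fields and the command."""
    parts = line.split(None, 5)  # five schedule fields, then the rest
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    schedule = dict(zip(FIELDS, parts[:5]))
    schedule["command"] = parts[5]
    return schedule


entry = split_cron('0 8 * * MON LC_CTYPE="en_GB.UTF-8" ~/luigi_spotify/luigi_cron.sh')
for key, value in entry.items():
    print(f"{key}: {value}")
```

Read this way, the line says minute 0 of hour 8, any day of the month, any month, on Mondays, followed by the command with its LC_CTYPE setting.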

If you want to check if the Cron works, you can first set the Crontab value as below (It will run the task every minute), then check if it works, and change it back to the weekly Crontab value you want to set. Again if you want to do this check, please don’t forget to delete the folder from your S3 bucket.

*/1 * * * * LC_CTYPE="en_GB.UTF-8" ~/luigi_spotify/luigi_cron.sh

That is it! Now the Luigi pipeline will run every Monday to fetch songs from my Discover Weekly and will send me an email!

I know this is not a very complicated task. But it was such a wonderful learning experience for me. Of course, there is room for improvement in my code, but I am one happy man today who solved one of my daily problems using data and Luigi!

Thank you for reading. You can find the Git repository at the link below.

https://github.com/tthustla/luigi_spotify

Building Spotify Discover Weekly Email Alert with Luigi was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building K-pop Idol Identifier with Amazon Rekognition

Ricky Kim — Sat, 25 May 2019 07:18:17 GMT

A gentle guide to face recognition

Photo by Bach Tran on Unsplash

Building a data science model from scratch is quite a big job. There are many elements that make up a single model, many steps involved, and many iterations needed to create a decent model. Even though going through these steps will definitely help you to have a deeper understanding of the algorithm being used in the model, sometimes you just don’t have enough time to go through all the trials and errors especially when you have a tight deadline to meet.

Image recognition is a field in machine learning that has been intensively explored by many tech giants such as Google, Amazon, and Microsoft. Among all the features of image processing, probably the most discussed is facial recognition. There is a lot of debate on the ethical aspects of the technology, but that is beyond the scope of this post. I will simply share what I have tried with Amazon Rekognition, and hope you can get something out of this post.

The urge to write this post started when I played around with the Amazon Rekognition demo on their web interface. It provides many useful services like “object and scene detection”, “facial recognition”, “facial analysis”, and “celebrity recognition”. I tried a few pictures, and everything ran smoothly until I got to “celebrity recognition”. Celebrity recognition seemed to work fine at first, until I tried it with pictures of K-pop celebrities. The performance of the recognition dropped significantly with K-pop celebrities. Sometimes it gave me the right answer, sometimes it could not recognise the person, and sometimes it gave me the wrong name.

By the way, the above picture is Tzuyu from a group called Twice, which is my favourite K-pop girl group, and I cannot accept that Amazon recognises this picture as Seolhyun (who’s a member of another group called AOA).

So I decided to write a simple Python script using Amazon Rekognition which can accurately detect the members of Twice.

In addition to short code blocks you can find in the post, I will attach the link for the whole Jupyter Notebook at the end of this post.
This post is based on the tutorial “Build Your Own Face Recognition Service Using Amazon Rekognition”, but modified from the original code to fit the specific purpose of the project.

Face Detection with Amazon Rekognition

There are a few prerequisites for running the steps below in your Jupyter Notebook.

Amazon AWS account
AWS credentials configured with AWS CLI
The latest version of Boto3

Let’s first start by importing some packages that will be directly used for the next step.

import boto3
from PIL import Image

%matplotlib inline

Now we need to have an image that we want to process. I chose the same image that I tried with the above web interface demo, and we will send this image to Rekognition API to get the result of its image recognition. (The image can also be found in the Github link that I will share at the end of this post.) Let’s take a quick look at the image.

display(Image.open('Tzuyu.jpeg'))

The most basic task we can ask of Rekognition is face detection on the given image, and this can be done with just a few lines of code.

import io

rekognition = boto3.client('rekognition')

image = Image.open("Tzuyu.jpeg")
stream = io.BytesIO()
image.save(stream,format="JPEG")
image_binary = stream.getvalue()

rekognition.detect_faces(
    Image={'Bytes':image_binary},
    Attributes=['ALL']
)

You can either send the image to Rekognition as an in-memory binary object directly from your local machine, or upload your image to S3 and give your bucket and key details as a parameter when calling rekognition.detect_faces(). In the above example, I am sending the binary object directly from my local machine. The response you will get from the above call will be quite long, with all the information you can get from the detect_faces function of Rekognition.

{'FaceDetails': [{'AgeRange': {'High': 38, 'Low': 20},
   'Beard': {'Confidence': 99.98848724365234, 'Value': False},
   'BoundingBox': {'Height': 0.1584049016237259,
    'Left': 0.4546355605125427,
    'Top': 0.0878104418516159,
    'Width': 0.09999311715364456},
   'Confidence': 100.0,
   'Emotions': [{'Confidence': 37.66959762573242, 'Type': 'SURPRISED'},
    {'Confidence': 29.646778106689453, 'Type': 'CALM'},
    {'Confidence': 3.8459930419921875, 'Type': 'SAD'},
    {'Confidence': 3.134934186935425, 'Type': 'DISGUSTED'},
    {'Confidence': 2.061260938644409, 'Type': 'HAPPY'},
    {'Confidence': 18.516468048095703, 'Type': 'CONFUSED'},
    {'Confidence': 5.1249613761901855, 'Type': 'ANGRY'}],
   'Eyeglasses': {'Confidence': 99.98339080810547, 'Value': False},
   'EyesOpen': {'Confidence': 99.9864730834961, 'Value': True},
   'Gender': {'Confidence': 99.84709167480469, 'Value': 'Female'},
   'Landmarks': [{'Type': 'eyeLeft',
     'X': 0.47338899970054626,
     'Y': 0.15436244010925293},
    {'Type': 'eyeRight', 'X': 0.5152773261070251, 'Y': 0.1474122554063797},
    {'Type': 'mouthLeft', 'X': 0.48312342166900635, 'Y': 0.211111381649971},
    {'Type': 'mouthRight', 'X': 0.5174261927604675, 'Y': 0.20560002326965332},
    {'Type': 'nose', 'X': 0.4872787892818451, 'Y': 0.1808750480413437},
    {'Type': 'leftEyeBrowLeft',
     'X': 0.45876359939575195,
     'Y': 0.14424000680446625},
    {'Type': 'leftEyeBrowRight',
     'X': 0.4760720133781433,
     'Y': 0.13612663745880127},
    {'Type': 'leftEyeBrowUp',
     'X': 0.4654795229434967,
     'Y': 0.13559915125370026},
    {'Type': 'rightEyeBrowLeft',
     'X': 0.5008187890052795,
     'Y': 0.1317606270313263},
    {'Type': 'rightEyeBrowRight',
     'X': 0.5342025756835938,
     'Y': 0.1317359358072281},
    {'Type': 'rightEyeBrowUp',
     'X': 0.5151524543762207,
     'Y': 0.12679456174373627},
    {'Type': 'leftEyeLeft', 'X': 0.4674917757511139, 'Y': 0.15510375797748566},
    {'Type': 'leftEyeRight',
     'X': 0.4817998707294464,
     'Y': 0.15343616902828217},
    {'Type': 'leftEyeUp', 'X': 0.47253310680389404, 'Y': 0.1514900177717209},
    {'Type': 'leftEyeDown',
     'X': 0.47370508313179016,
     'Y': 0.15651680529117584},
    {'Type': 'rightEyeLeft',
     'X': 0.5069678425788879,
     'Y': 0.14930757880210876},
    {'Type': 'rightEyeRight',
     'X': 0.5239912867546082,
     'Y': 0.1460886150598526},
    {'Type': 'rightEyeUp', 'X': 0.5144344568252563, 'Y': 0.1447771191596985},
    {'Type': 'rightEyeDown',
     'X': 0.5150220394134521,
     'Y': 0.14997448027133942},
    {'Type': 'noseLeft', 'X': 0.4858757555484772, 'Y': 0.18927086889743805},
    {'Type': 'noseRight', 'X': 0.5023624897003174, 'Y': 0.1855706423521042},
    {'Type': 'mouthUp', 'X': 0.4945952594280243, 'Y': 0.2002507448196411},
    {'Type': 'mouthDown', 'X': 0.4980264902114868, 'Y': 0.21687346696853638},
    {'Type': 'leftPupil', 'X': 0.47338899970054626, 'Y': 0.15436244010925293},
    {'Type': 'rightPupil', 'X': 0.5152773261070251, 'Y': 0.1474122554063797},
    {'Type': 'upperJawlineLeft',
     'X': 0.46607205271720886,
     'Y': 0.15965013206005096},
    {'Type': 'midJawlineLeft',
     'X': 0.47901660203933716,
     'Y': 0.21797965466976166},
    {'Type': 'chinBottom', 'X': 0.5062429904937744, 'Y': 0.24532964825630188},
    {'Type': 'midJawlineRight',
     'X': 0.5554487109184265,
     'Y': 0.20579127967357635},
    {'Type': 'upperJawlineRight',
     'X': 0.561174750328064,
     'Y': 0.14439250528812408}],
   'MouthOpen': {'Confidence': 99.0997543334961, 'Value': True},
   'Mustache': {'Confidence': 99.99714660644531, 'Value': False},
   'Pose': {'Pitch': 1.8594770431518555,
    'Roll': -11.335309982299805,
    'Yaw': -33.68760681152344},
   'Quality': {'Brightness': 89.57070922851562,
    'Sharpness': 86.86019134521484},
   'Smile': {'Confidence': 99.23001861572266, 'Value': False},
   'Sunglasses': {'Confidence': 99.99723815917969, 'Value': False}}],
 'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
   'content-length': '3297',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Sun, 19 May 2019 08:45:56 GMT',
   'x-amzn-requestid': '824f5dc3-7a12-11e9-a384-dfb84e388b7e'},
  'HTTPStatusCode': 200,
  'RequestId': '824f5dc3-7a12-11e9-a384-dfb84e388b7e',
  'RetryAttempts': 0}}

As you can see from the above example response of the detect_faces call, it has not only bounding box information of the location of the face in the picture but also more advanced features such as emotions, gender, age range, etc.
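
Since the full response is long, it often helps to pull out just the piece you care about. The sketch below picks the emotion with the highest confidence from one entry of ‘FaceDetails’. The helper name top_emotion is an assumption, and the sample dictionary is a shortened copy of the response above with rounded values.

```python
def top_emotion(face_detail: dict) -> str:
    """Return the emotion type with the highest confidence for one face."""
    best = max(face_detail["Emotions"], key=lambda emotion: emotion["Confidence"])
    return best["Type"]


# Shortened stand-in for one item of response['FaceDetails'].
face = {
    "Emotions": [
        {"Confidence": 37.67, "Type": "SURPRISED"},
        {"Confidence": 29.65, "Type": "CALM"},
        {"Confidence": 18.52, "Type": "CONFUSED"},
    ]
}
print(top_emotion(face))
```

With a real response you would call it as top_emotion(response['FaceDetails'][0]).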

Comparing Faces

With Amazon Rekognition, you can compare faces in two pictures. For example, if I set a picture of Tzuyu as my source picture, then send a group photo of Twice as my target picture, Rekognition will find the face in the target picture which is the most similar to the source picture. The group photo of Twice I’ll be using is below.

It might be difficult even for humans, especially if you’re not Asian (or not a Twice fan). You can take your guess who is Tzuyu in the picture. As a Korean, and at the same time a Twice fan, I know the answer, but let’s see how well Rekognition can find Tzuyu from this picture.

sourceFile='Tzuyu.jpeg'
targetFile='twice_group.jpg'
   
imageSource=open(sourceFile,'rb')
imageTarget=open(targetFile,'rb')

response = rekognition.compare_faces(SimilarityThreshold=80,
                              SourceImage={'Bytes': imageSource.read()},
                              TargetImage={'Bytes': imageTarget.read()})
response['FaceMatches']

The response of the above compare_faces call also includes information on all the unmatched faces in the group picture, which can get quite long, so I’m only outputting the match that Rekognition found by specifying response['FaceMatches']. It seems that a matching face has been found in the group photo with a similarity of around 97%. With the bounding box information, let’s check which face Rekognition is referring to as Tzuyu’s face.

By the way, the values in the BoundingBox section are ratios of the overall image size. So, in order to draw a box with the values in BoundingBox, you need to calculate the location of each corner of the box by multiplying the ratios by the actual image height or width. You can see how this is done in the code snippet below.

from PIL import ImageDraw

image = Image.open("twice_group.jpg")
imgWidth,imgHeight  = image.size  
draw = ImageDraw.Draw(image)
box = response['FaceMatches'][0]['Face']['BoundingBox']
left = imgWidth * box['Left']
top = imgHeight * box['Top']
width = imgWidth * box['Width']
height = imgHeight * box['Height']
points = (
            (left,top),
            (left + width, top),
            (left + width, top + height),
            (left , top + height),
            (left, top)

)
draw.line(points, fill='#00d400', width=2)

display(image)

Yes! Well done, Rekognition! That is Tzuyu indeed!

Creating Collection

Now we can detect faces in a picture, and find the face in the target picture that is most similar to the source picture. But these are all one-off calls, and we need something more to store the information of each member’s face and their name, so that when we send a new picture of Twice, it can retrieve the data, detect each member’s face, and display their names. In order to implement this, we need to use what Amazon calls “Storage-Based API Operations”. There are two Amazon-specific terms for this type of operation. The “collection” is a virtual space where Rekognition stores information about detected faces. With a collection, we can “index” faces, which means to detect faces in an image, then store the information in the specified collection. What’s important is that the information Rekognition stores in the collection is not actual images, but feature vectors extracted by Rekognition’s algorithm. Let’s see how we can create a collection and add indexes.

collectionId='test-collection'
rekognition.create_collection(CollectionId=collectionId)

Yes. It is as simple as that. Since this is a new collection we just created, we don’t have any information stored in the collection. But, let’s double check.

rekognition.describe_collection(CollectionId=collectionId)

In the above response, you can see ‘FaceCount’ is 0. This will change if we index any face and store that information in the collection.

Indexing Faces

Indexing faces is again as simple as one line of code with Rekognition.

sourceFile='Tzuyu.jpeg'   
imageSource=open(sourceFile,'rb')

rekognition.index_faces(Image={'Bytes':imageSource.read()},ExternalImageId='Tzuyu',CollectionId=collectionId)

From the above code, you can see that I am passing the ExternalImageId parameter and giving it the string value “Tzuyu”. Later when we try to recognise Tzuyu from a new picture, Rekognition will search for faces that match any of the indexed faces. As you will see later, when indexing a face, Rekognition will give it a unique face ID. But I want to display the name “Tzuyu” when a matching face is found in a new picture. For this purpose, I am using ExternalImageId. Now if we check our collection, we can see 1 face has been added to the collection.

rekognition.describe_collection(CollectionId=collectionId)

Search Faces by Image

Now with Tzuyu’s face indexed in our collection, we can send a new unseen picture to Rekognition and find the matching face. But a problem with search_faces_by_image function is that it can only detect one face (the largest in the image). So if we want to send a group picture of Twice and find Tzuyu from there, we will need to do an additional step. Below we will first detect all the faces in the picture by using detect_faces, then with the bounding box information of each face, we will call search_faces_by_image one by one. First let’s detect each face.

imageSource=open('twice_group.jpg','rb')
resp = rekognition.detect_faces(Image={'Bytes':imageSource.read()})
all_faces = resp['FaceDetails']
len(all_faces)

Rekognition detected 9 faces from the group picture. Good. Now let’s crop each face and call search_faces_by_image one by one.

image = Image.open("twice_group.jpg")
image_width,image_height  = image.size

for face in all_faces:
    box=face['BoundingBox']
    x1 = box['Left'] * image_width
    y1 = box['Top'] * image_height
    x2 = x1 + box['Width'] * image_width
    y2 = y1 + box['Height']  * image_height
    image_crop = image.crop((x1,y1,x2,y2))
    
    stream = io.BytesIO()
    image_crop.save(stream,format="JPEG")
    image_crop_binary = stream.getvalue()

    response = rekognition.search_faces_by_image(
        CollectionId=collectionId,
        Image={'Bytes':image_crop_binary}
    )
    print(response)
    print('-'*100)

Among the 9 search_faces_by_image calls we have made, Rekognition has found one face that matches the indexed face in our collection. We only indexed one face of Tzuyu, so what it has found is Tzuyu’s face from the group picture. Let’s display this on the image with the bounding box and the name. For the name part, we will use the ExternalImageId we set when we indexed the face. By the way, the ‘FaceMatches’ part of the search_faces_by_image response is an array, and if more than one match is found in the collection, it will show all the matches. According to Amazon, this array is ordered by similarity score with the highest similarity first. We will get the match with the highest score by taking the first item of the array.

from PIL import ImageFont
import io

image = Image.open("twice_group.jpg")
image_width,image_height  = image.size 
   
for face in all_faces:
    box=face['BoundingBox']
    x1 = box['Left'] * image_width
    y1 = box['Top'] * image_height
    x2 = x1 + box['Width'] * image_width
    y2 = y1 + box['Height']  * image_height
    image_crop = image.crop((x1,y1,x2,y2))
    
    stream = io.BytesIO()
    image_crop.save(stream,format="JPEG")
    image_crop_binary = stream.getvalue()

    response = rekognition.search_faces_by_image(
        CollectionId=collectionId,
        Image={'Bytes':image_crop_binary}
    )
    
    if len(response['FaceMatches']) > 0:
        draw = ImageDraw.Draw(image)
        points = (
                    (x1,y1),
                    (x2, y1),
                    (x2, y2),
                    (x1 , y2),
                    (x1, y1)

)
        draw.line(points, fill='#00d400', width=2)
        fnt = ImageFont.truetype('/Library/Fonts/Arial.ttf', 15)
        draw.text((x1,y2),response['FaceMatches'][0]['Face']['ExternalImageId'], font=fnt, fill=(255, 255, 0))
        display(image)

Hooray! Again the correct answer!
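
The “take the first match” step can be wrapped in a small helper that also copes with an empty ‘FaceMatches’ array. This is a minimal sketch: best_match_name and the min_similarity cut-off are assumptions for illustration, and the sample response is hand-written in the same shape as a real one rather than copied from Rekognition.

```python
def best_match_name(response: dict, min_similarity: float = 80.0):
    """Return the ExternalImageId of the most similar match, or None."""
    matches = [
        match
        for match in response.get("FaceMatches", [])
        if match["Similarity"] >= min_similarity
    ]
    if not matches:
        return None
    # Do not rely on ordering here; pick the highest similarity explicitly.
    best = max(matches, key=lambda match: match["Similarity"])
    return best["Face"].get("ExternalImageId")


# Hand-written stand-in for a search_faces_by_image response.
sample_response = {
    "FaceMatches": [
        {"Similarity": 97.1, "Face": {"ExternalImageId": "Tzuyu"}},
        {"Similarity": 85.4, "Face": {"ExternalImageId": "Tzuyu"}},
    ]
}
print(best_match_name(sample_response))
print(best_match_name({"FaceMatches": []}))
```

Returning None for “no match” keeps the drawing loop simple: draw the box and the name only when the helper returns something.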

Identifying All Group Members of Twice

Now let’s expand the project to identify all members from the group picture. In order to do that, we first need to index faces of all members (there are 9 members). I have prepared 4 pictures of each member. I have added multiple pictures of the same person following the logic of Amazon tutorial written by Christian Petters. According to Petters, “adding multiple reference images per person greatly enhances the potential match rate for a person”, which makes intuitive sense. From the Github link I’ll share at the end, you will find all the pictures that are used in this project.

collectionId='twice'
rekognition.create_collection(CollectionId=collectionId)

import os

path = 'Twice'

for r, d, f in os.walk(path):
    for file in f:
        if file != '.DS_Store':
            sourceFile = os.path.join(r,file)
            imageSource=open(sourceFile,'rb')
            rekognition.index_faces(Image={'Bytes':imageSource.read()},ExternalImageId=file.split('_')[0],CollectionId=collectionId)
rekognition.describe_collection(CollectionId=collectionId)
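
The indexing loop above takes each member’s name from the part of the file name before the first underscore. Before sending anything to Rekognition, it can be worth checking that every member really has the expected number of pictures. The sketch below does that with the standard library only; count_by_label and the file names are assumptions for illustration, not the actual files in the repository.

```python
from collections import Counter


def count_by_label(filenames):
    """Count pictures per label, where the label is the text before '_'."""
    return Counter(
        name.split("_")[0] for name in filenames if name != ".DS_Store"
    )


# Hypothetical file names following a "<member>_<number>.jpg" pattern.
files = ["Tzuyu_1.jpg", "Tzuyu_2.jpg", "Nayeon_1.jpg", ".DS_Store"]
counts = count_by_label(files)
print(counts)
```

With the real folder you could pass in the file names collected by os.walk and confirm that each label appears 4 times.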

OK. It seems like all 36 pictures are indexed in our “twice” collection. Now it’s time to check the final result. Can Rekognition be enhanced to identify each member of Twice?

from PIL import ImageFont

image = Image.open("twice_group.jpg")
image_width,image_height  = image.size 
   
for face in all_faces:
    box=face['BoundingBox']
    x1 = box['Left'] * image_width
    y1 = box['Top'] * image_height
    x2 = x1 + box['Width'] * image_width
    y2 = y1 + box['Height']  * image_height
    image_crop = image.crop((x1,y1,x2,y2))
    
    stream = io.BytesIO()
    image_crop.save(stream,format="JPEG")
    image_crop_binary = stream.getvalue()

    response = rekognition.search_faces_by_image(
        CollectionId=collectionId,
        Image={'Bytes':image_crop_binary}
    )
    
    if len(response['FaceMatches']) > 0:
        draw = ImageDraw.Draw(image)
        points = (
                    (x1,y1),
                    (x2, y1),
                    (x2, y2),
                    (x1 , y2),
                    (x1, y1)

)
        draw.line(points, fill='#00d400', width=2)
        fnt = ImageFont.truetype('/Library/Fonts/Arial.ttf', 15)
        draw.text((x1,y2),response['FaceMatches'][0]['Face']['ExternalImageId'], font=fnt, fill=(255, 255, 0))

display(image)

YES! It can! It identified all the members correctly!

Thank you for reading. You can find the Jupyter Notebook and the pictures used for the project at the link below.

https://github.com/tthustla/twice_recognition

Building K-pop Idol Identifier with Amazon Rekognition was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Deploying PySpark ML Model on Google Compute Engine as a REST API

Ricky Kim — Mon, 31 Dec 2018 07:07:51 GMT

Step-by-step tutorial

Photo by Sigmund on Unsplash

This post is loosely connected to my previous post.

Step-by-Step Tutorial: PySpark Sentiment Analysis on Google Dataproc

In my previous post, I trained a PySpark sentiment analysis model on Google Dataproc, and saved the model to Google Cloud Storage. In this post, I will show you how you can deploy a PySpark model on Google Compute Engine as a REST API. I will use the model I trained in my previous post, but I’m sure you can make some minor changes to the code I will share and use it with your own PySpark ML model.

I have the whole pipeline saved as a pipelineModel, and now I want to use the model for a REST API so that it can serve real-time predictions through simple REST API calls.

I first looked at Google Cloud ML Engine to see if it was a valid option for this specific use case. According to its introduction, it lets you not only train a machine learning model but also serve the model for predictions. But sadly it seems that Cloud ML Engine does not support Spark ML models. The next Google service I found was Google App Engine, which makes it really easy for anyone to deploy a web app. But after a couple of attempts, I realised it is not easy to set up the Java Development Kit (which is needed to run PySpark) on the VM instances created through Google App Engine. It might be possible, but at least it was not straightforward enough for me to choose that route.

After some consideration and a few trials, below is the approach that I found to work. I will first go through the process step by step, then I will also tell you about the downside of using a Spark ML model for online real-time predictions.

In this tutorial, I will not go through the basic setup process again, such as setting up a free account, enabling APIs for the services you want to use, and installing Google Cloud SDK. If this is your first time trying Google Cloud Platform, I recommend checking my previous post and going through the setup steps (Creating a Free Trial Account on GCP, Enabling APIs, Installing Google Cloud SDK). Before you proceed with the steps below, you will also have to enable the Google Compute Engine API.

Cloning Git Repository

Now clone the Git repository for this project by running the command below in your terminal.

git clone https://github.com/tthustla/flask_sparkml

Once you clone the repository, it will create a folder named flask_sparkml. Go into the folder and check what files are there.

cd flask_sparkml/
ls

You will see three files and one subfolder.

install.sh (startup script when creating Compute Engine instance)
main.py (Flask web app to serve predictions as REST API)
model (the folder where trained PySpark pipeline model is stored)
response_time.py (simple Python script to measure API response time)

Creating Google Compute Engine VM Instance

You can either create an instance through the web console or from your terminal using Google Cloud SDK. Before we create an instance, let’s take a quick look at the startup script install.sh to see what it does.

https://medium.com/media/9fa18fefe070191eedce575fb8d580b6/href

The above code will be a startup script for the VM.

Web Console

Go to your console by visiting https://console.cloud.google.com/. Once you are in the console, click “Compute Engine” and “VM instances” from the left side menu. Click “Create”.

Type in the name for your VM instance, and choose the region and zone where you want your VM to be created. For this tutorial, we won’t use any other GCP services, so you can choose any region/zone you like, but it’s better to consider where the majority of the traffic will be coming from once the API is online.

Scroll down and choose “Allow full access to all Cloud APIs” in Identity and API access section, and tick “Allow HTTP traffic” in Firewall so that it can be reached from outside your VM. Finally, let’s add the startup script so that the VM will install required packages when starting up.

From the above screen just below Firewall section, you will see blue text “Management, security, disks, networking, sole tenancy”. Click on it to expand, and find the text box under “Startup script”. Copy and paste the whole install.sh code.

Click “Create” at the bottom.

Google Cloud SDK

In order to interact with GCP from your terminal, you should be logged in to your account, and the SDK should be set to the project you intend to work on. (I’m assuming you already have Google Cloud SDK installed; if that’s not the case, please follow the instructions at https://cloud.google.com/sdk/.)

To check whether you’re signed in, copy and paste the command below into your terminal, and it will show you which account is active.

gcloud auth list

If you want to check whether Google Cloud SDK is currently set to the project you want to work on, you can use the command below from the terminal.

gcloud config list

If everything looks fine, run the command below from your terminal in the flask_sparkml folder that you cloned from Git.

gcloud compute instances create flask-sparkml \
--zone=europe-west1-b \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--metadata-from-file startup-script=install.sh --tags http-server

This creates a VM instance with

instance name set to flask-sparkml
zone set to europe-west1-b
allow full access to all Cloud APIs
using local file install.sh as a startup script for the VM
allow HTTP traffic

Creating Firewall Rule

In order to allow access to the application on port 8080 from outside, we need to create an inbound firewall rule which opens port 8080. Again you can choose to do this on web console or from the terminal.

Web Console

From the left side menu on your console, find “VPC network” under the “NETWORKING” section. Click into “Firewall rules” and click the “CREATE FIREWALL RULE” button at the top. This will bring you to the screen for creating a firewall rule.

First, give it a descriptive name. I named it “default-allow-http-8080”. Scroll down, and you will see that “Direction of traffic” is already set to “Ingress” by default, and “Action on match” is set to “Allow”. If for some reason they are not, please make sure to set them properly.

Scroll down more to “Target tags” and give it the tag “http-server”. Tags are how GCP applies network rules to VM instances or instance templates. When we created the VM in the console, we ticked the box for “Allow HTTP traffic”. Doing this attaches the tag “http-server” to the VM instance, so any firewall rule that targets this tag applies to it. Thus a new firewall rule with the same target tag will automatically be applied to your VM. Next, type in “0.0.0.0/0” for “Source IP ranges”. This will allow access to our API from any IP address. If you want to restrict access to some limited IP ranges, you can do so here. Finally, under “Protocols and ports”, type in “8080” next to “tcp” and click the “Create” button at the bottom.

Google Cloud SDK

If you want to create the firewall rule from your local terminal, you can run the command below.

gcloud compute firewall-rules create default-allow-http-8080 \
  --allow tcp:8080 \
  --source-ranges 0.0.0.0/0 \
  --target-tags http-server

You can also check whether this firewall rule was created properly.

gcloud compute firewall-rules list

Connecting to Created VM Instance

Now a Compute Engine VM is created either through web console or from the terminal. To check whether the startup script is properly triggered, let’s connect to the VM. Again you can do this either through web console or from the terminal.

Web Console

Go to “Compute Engine” and “VM instances” from the left side menu. You will see an instance running from the list. Click “SSH” under “Connect”.

Google Cloud SDK

If your default zone is set to the zone where your VM is created, and the project is set to the current project you are working on (properties you can see by running “gcloud config list” from the terminal), then you can simply type the command below to SSH into your VM.

gcloud compute ssh [your_Google_user_name]@flask-sparkml

Checking Installed Packages on VM

Now that we are on our VM, let’s inspect a few things. The startup script we provided earlier installs Java JDK 8 and a few Python packages including PySpark. We can check whether they’re installed properly. (Please note that you need to give it a minute or two so that the VM will have enough time to finish the installations in the background.)

java -version

Java seems to be installed. What about the Python packages?

pip list

We can see four Python packages we specified in the startup script are installed on VM. Don’t close the VM terminal yet, since we will need to come back to this later.

Copying Local Files to VM

The next step is to upload main.py and the model from our local machine to the VM. Go to your local terminal and cd into the flask_sparkml folder (the folder cloned from Git). Run the command below to securely copy the files to the VM.

gcloud compute scp --recurse . [your_Google_user_name]@flask-sparkml:~/flask_sparkml

Go back to VM terminal and check if the files are uploaded.

cd /home/[your_Google_user_name]/flask_sparkml
ls -l

Finally, we are ready to run the actual Flask app.

Running Flask REST API

Before we run the actual code, let’s take a quick look at the code to see what it does. The original code for this is from a Medium post Deploying a Machine Learning Model as a REST API by Nguyen Ngo, and I made some small changes to fit my specific use case. Thank you for the great tutorial Nguyen Ngo! I have already added comments inside the code, so I will not go through it line by line.

https://medium.com/media/e8a3e2a31112e8b830c85ca5aaa9c2f2/href

It’s time to run the application and make a REST API call to the app to get a sentiment prediction for your own text! On your VM terminal (the one you left open from the “Connecting to Created VM Instance” step above), run the command below from the /home/[your_Google_user_name]/flask_sparkml directory. The “nohup” command keeps the app running after you exit the VM terminal.

nohup python main.py

You won’t be able to see the log output of the file main.py, since it writes logs to a file ‘nohup.out’. Let’s close the VM terminal to double check if ‘nohup’ is properly working, and get real-time sentiment prediction through an API call. In order to do that, we first need to know the external IP address that our app is deployed at. You can either check on your web console’s “Compute Engine” -> “VM Instances”

or check from your local terminal by running the command below.

gcloud compute instances list

Making API Call to the Deployed App

Either start Python from your local terminal or open a Jupyter Notebook, then copy, paste and run the code block below to see the result.

https://medium.com/media/6a23a2f437055ea0d8d5cd9105d941a2/href
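
The embedded snippet is not reproduced in this text, so here is a rough sketch of what such a call can look like using only the Python standard library. Please treat the details as assumptions: I’m assuming the app answers GET requests at the root path, reads the text from a parameter named query, and returns JSON. Check main.py for the real route and parameter name before using it. The helper names build_query_url and get_prediction are made up for this sketch.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, text, param_name="query"):
    """Build a GET URL carrying the text as a URL-encoded parameter.

    param_name is an assumption; use whatever name main.py expects.
    """
    return base_url.rstrip("/") + "/?" + urlencode({param_name: text})


def get_prediction(base_url, text):
    """Send the request and decode a JSON body (the app must be running)."""
    with urlopen(build_query_url(base_url, text), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


text = ("Listen Morty, I hate to break it to you, but what people call 'love' "
        "is just a chemical reaction that compels animals to breed.")
# Placeholder address: replace it with the external IP of your VM
url = build_query_url("http://[external_IP_address_of_your_app]:8080", text)
print(url)
```

Calling get_prediction with the same base URL and text sends the actual request once the app is up.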

I expect the model to predict the above text as negative.

“Listen Morty, I hate to break it to you, but what people call ‘love’ is just a chemical reaction that compels animals to breed.”

Hooray! The prediction from our model says it’s negative with quite high confidence.

Measuring API Response Time

You might have noticed that there is one more file in the folder that I haven’t mentioned yet. While going through this project, there has been one big question mark in my head. I know Spark can handle big data, and this might be beneficial in the model training stage. But how does Spark ML perform when it is deployed to make real-time predictions on single entries? When I made the API call above, it sure didn’t feel very fast. So I added a final experiment to the finished API. It is a simple Python program, designed to make 100 queries to the API and record the API response times. At the end, it shows the mean, median, min and max of the recorded response times.

https://medium.com/media/b52ded69987077056575b5c46228de63/href
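
For readers who cannot see the embedded script, the idea can be sketched roughly as below. This is not the actual response_time.py, just a minimal illustration under my own assumptions: the function names are made up, and a short sleep stands in for the real HTTP request so that the sketch runs without a deployed API.

```python
import statistics
import time


def measure_response_times(send_request, n_requests=100):
    """Call send_request() n_requests times and return each duration in seconds."""
    durations = []
    for _ in range(n_requests):
        start = time.perf_counter()
        send_request()
        durations.append(time.perf_counter() - start)
    return durations


def summarise(durations):
    """Return the mean, median, min and max of the recorded durations."""
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }


def fake_request():
    # Stand-in for a real API call; replace with an actual HTTP request
    time.sleep(0.001)


durations = measure_response_times(fake_request, n_requests=5)
summary = summarise(durations)
print(summary)
```

Reporting the median next to the mean is useful here, because a single slow request can pull the mean up noticeably.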

Go to your local terminal, in the flask_sparkml folder, run the program and check the output.

python response_time.py http://[external_IP_address_of_your_app]:8080

Is this fast or slow? We have no way of knowing without a benchmark. So I googled hard to find response times of other people’s machine learning APIs, and I finally found the Medium post Falcon vs. Flask-Which one to pick to create a scalable deep learning REST API by Dat Tran. Thank you for the informative post, Dat Tran!

From Dat Tran’s post, the average response time for Keras CNN model on MNIST dataset is 60ms. We can see that the average response time of Spark ML is 0.99 seconds, which is 990ms. That doesn’t sound good at all.

While I was looking for an answer to this, I came across a presentation slide Productionizing Spark ML Pipelines with the Portable Format for Analytics by Nick Pentreath. Thank you for the great slide, Nick Pentreath! PFA (Portable Format for Analytics) is a JSON representation of a Spark ML model that can be transferred across different languages and platforms. The reason I’m mentioning this, even though I didn’t export my model as PFA, is that he goes through some limitations of Spark ML models in production.

Image courtesy of Nick Pentreath at slideshare.net

According to the slide (page 9: Challenges specific to Spark), scoring models in Spark is slow due to the overheads of Spark DataFrames and task scheduling, and because of this latency Spark is not the best framework for real-time scoring.

Further Consideration

Even though we have a working REST API deployed, this does not seem to be the ideal solution for serving machine learning predictions in real time. I can explore further into exporting the model as PFA and compare the performance. Or I can also try to build a new model from scratch using Keras or TensorFlow, and deploy it to compare the performance with Spark ML. At the moment, I’m more tempted to try the latter, since it will give me chances to explore other GCP services such as Google Cloud ML Engine. Anyway, I will try to share the next part of my journey here.

Cleaning

To avoid incurring unwanted charges on your GCP account, it’s important to clean up. You can simply delete the VM instance you created, but in case you want to know how to kill the nohup process we started with main.py, I will go through the simple steps to kill processes running in the background. Go to your VM terminal either from the web console or from your local terminal using the “gcloud compute ssh” command. Once you are in the VM terminal, run the command below to check the process ID of main.py running in the background.

ps aux | grep python | grep main.py

Once you have the IDs of the currently running processes, kill them using the command below, replacing the [] parts with your own process IDs.

sudo kill [PID1] [PID2]

Deleting the instance itself can be done either on the web console or from the terminal.

Web Console

Google Cloud SDK

gcloud compute instances delete flask-sparkml

Thank you for reading. You can find the Git Repository of the scripts from the below link.

https://github.com/tthustla/flask_sparkml

Deploying PySpark ML Model on Google Compute Engine as a REST API was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Step-by-Step Tutorial: PySpark Sentiment Analysis on Google Dataproc

Ricky Kim — Mon, 24 Dec 2018 09:22:50 GMT

PySpark Sentiment Analysis on Google Dataproc

A Step-by-Step Tutorial

Photo by Joshua Sortino on Unsplash

I recently had a chance to play around with Google Cloud Platform through a specialization course on Coursera; Data Engineering on Google Cloud Platform Specialization. Overall I learned a lot through the courses, and it was such a good opportunity to try various services of Google Cloud Platform(GCP) for free while going through the assignments. Even though I’m not using any of GCP’s services at work at the moment, if I have a chance I’d be happy to migrate some parts of my data works to GCP.

However, one thing that the course lacks is room for your own creativity. The assignments of the course were more like tutorials than assignments. You basically follow along with already written code. Of course, you can still learn a lot by trying to read every single line of code and understand what each line does in detail. Still, without applying what you have learned to your own problem-solving, it is difficult to make this knowledge completely yours. That’s also what the instructor Lak Lakshmanan advised at the end of the course. (Shout out to Lak Lakshmanan, thank you for the great courses!)

*In addition to short code blocks I will attach, you can find the link for the whole Git Repository at the end of this post.

Requirements

Homebrew (https://brew.sh/)
Git (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)

Creating a Free Trial Account on GCP

So I have decided to do some personal mini projects making use of various GCP services. Luckily, if you haven’t tried GCP yet, Google generously offers a free trial which gives you $300 credit you can use over 12 months.

You can easily start your free trial by visiting https://cloud.google.com/gcp/

The first project I tried is Spark sentiment analysis model training on Google Dataproc. There are a couple of reasons why I chose it as my first project on GCP. I already wrote about PySpark sentiment analysis in one of my previous posts, which means I can use it as a starting point and easily make this a standalone Python program. The other reason is I just wanted to try Google Dataproc! I was fascinated by how easy and fast it is to spin up a cluster on GCP and couldn’t help myself from trying it outside the Coursera course.

Once you have clicked “TRY GCP FREE” and filled in information such as your billing account (even though you set up a billing account, you won’t be charged unless you upgrade to a paid account), you will be directed to the home screen of the web console.

Home screen of GCP web console

On the top menu bar, you can see “My First Project” next to Google Cloud Platform. In GCP, a “project” is the base-level entity for using GCP services, enabling billing, etc. On the first login, you can see that Google automatically created a “project” called “My First Project” for you. Click on it to see the ID of the current project, and copy it or write it down, as it will be used later. By clicking into “Billing” on the left-side menu from the web console home screen, you can see that “My First Project” is automatically linked to the free credit you received.

Enabling APIs

In GCP, there are many different services; Compute Engine, Cloud Storage, BigQuery, Cloud SQL, Cloud Dataproc to name a few. In order to use any of these services in your project, you first have to enable them.

Put your mouse over “APIs & Services” on the left-side menu, then click into “Library”. For this project, we will enable three APIs: Cloud Dataproc, Compute Engine, and Cloud Storage.

On the API Library page, search for the three APIs mentioned above one by one by typing the name in the search box. Click into the search result, and enable the API by clicking the “ENABLE” button on the screen.

When I tried it myself, I only had to enable Cloud Dataproc API, since the other two (Compute Engine, Cloud Storage) were already enabled when I clicked into them. But if that’s not the case for you, please enable Compute Engine API, Cloud Storage API.

Installing Google Cloud SDK

If this is your very first time to try GCP, you first might want to install the Google Cloud SDK so that you can interact with many services of GCP from the command-line. You can find more information on how to install from here.

Install Google Cloud SDK by following instructions on https://cloud.google.com/sdk/

When you follow the instructions from the link, you will be prompted to log in (use the Google account you used to start the free trial), then to select a project and a compute zone. For the project, if there is more than one, choose the project for which you enabled the APIs in the steps above. For the compute zone, you might want to choose a zone that is close to you to decrease network latency. You can check the physical locations of each zone from here.

Creating Bucket

Since you have installed Google Cloud SDK, you can either create a bucket from the command-line or from the web console.

Web Console

Click into “Storage” from left-side menu, then you’ll see a page like the above. Click “Create bucket”

For convenience, enter project ID you checked at the end of “Creating a Free Trial Account on GCP” stage. You can just click “create” without changing any other details, or choose the same location as your project.

Google Cloud SDK

Replace your_project_id with the project ID that you copied and run the line below in your terminal to set the PROJECT_ID variable to your project ID and make it available to sub-processes. (A Bash script you need to run later will make use of this.)

export PROJECT_ID='your_project_id'

Then create a bucket by running gsutil mb command as below.

gsutil mb gs://${PROJECT_ID}

The above command will create a bucket with the default settings. If you want to create a bucket in a specific region or multi-region, you can give it -l option to specify the region. You can see available bucket locations from here.

#ex1) multi-region europe
gsutil mb -l eu gs://${PROJECT_ID}

#ex2) region europe-west1
gsutil mb -l europe-west1 gs://${PROJECT_ID}

Cloning Git Repository

Now clone the Git repository I uploaded by running the command below in your terminal.

git clone https://github.com/tthustla/pyspark_sa_gcp.git

Preparing Data

Once you clone the repository, it will create a folder named pyspark_sa_gcp. Go into the folder and check what files are there.

cd pyspark_sa_gcp/
ls

You will see three files in the directory: data_prep.sh, pyspark_sa.py, train_test_split.py. In order to download the training data and prepare for training let’s run the Bash script data_prep.sh. Below is the content of the script and I have added comments to explain what each line does.

The original dataset for training is “Sentiment140”, which originated from Stanford University. The dataset has 1.6 million labelled tweets: 50% of the data has negative labels and the other 50% has positive labels. More info on the dataset can be found at http://help.sentiment140.com/for-students/

https://medium.com/media/0a8d0f8c255dc9072a4a7e9b226b9437/href

In the above Bash script, you can see it’s calling a Python script train_test_split.py. Let’s also take a look at what it does.

https://medium.com/media/4940526455d60642883327fff808559c/href
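
In case the embedded script does not render for you, the general idea of a reproducible train/test split can be sketched with the standard library alone. To be clear, this is not the content of train_test_split.py: the function name, the 10% test fraction and the seed are all values I picked for illustration. The only dataset-specific detail used is that Sentiment140 encodes negative tweets as 0 and positive tweets as 4.

```python
import random


def split_rows(rows, test_fraction=0.1, seed=42):
    """Shuffle rows with a fixed seed and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy rows in (label, text) form; Sentiment140 uses 0 = negative, 4 = positive
rows = [(0, "sad tweet %d" % i) for i in range(50)]
rows += [(4, "happy tweet %d" % i) for i in range(50)]

train, test = split_rows(rows)
print(len(train), len(test))
```

Fixing the seed means the same rows end up in the test set on every run, which makes accuracy numbers comparable between runs.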

Now we can run the Bash script to prepare the data. Once it’s finished, it will have uploaded prepared data to the cloud storage bucket you created earlier. It will take 5~6 mins to upload the data.

./data_prep.sh

Checking the Uploaded Data

Web Console

Go to the Storage from the left side menu and click into your bucket -> pyspark_nlp -> data. You will see two files are uploaded.

Google Cloud SDK

Or you can check the content of your bucket from your terminal by running the command below.

gsutil ls -r gs://${PROJECT_ID}/**

Creating Google Dataproc Cluster

Cloud Dataproc is a Google cloud service for running Apache Spark and Apache Hadoop clusters. I have to say it is ridiculously simple and easy-to-use and it only takes a couple of minutes to spin up a cluster with Google Dataproc. Also, Google Dataproc offers autoscaling if you need, and you can adjust the cluster at any time, even when jobs are running on the cluster.

Web Console

Go to Dataproc from the left side menu (you have to scroll down a bit. It’s under Big Data section) and click on “Clusters”. Click “Create clusters”, then you’ll see a page like below.

Give it a name (for convenience, I gave the project ID as its name), choose Region and Zone. To decrease the latency, it is a good idea to set the region to be the same as your bucket region. Here you need to change the default settings for worker nodes a little, as the free trial only gives you permission to run up to 8 cores. The default setting for a cluster is one master and two workers all with 4 CPUs each, which will exceed the 8 cores quota. So change the setting for your worker nodes to 2 CPUs, then click create at the bottom. After a couple of minutes of provisioning, you will see the cluster created with one master node (4 CPUs, 15GB memory, 500GB standard persistent disk) and two worker nodes (2 CPUs, 15GB memory, 500GB standard persistent disk each).
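
The quota arithmetic behind this change is simple enough to check in a few lines of Python. The numbers come from the paragraph above; the function name is just for illustration.

```python
def total_cores(master_cores, worker_cores, n_workers):
    """Total CPU cores used by one master node plus n_workers worker nodes."""
    return master_cores + worker_cores * n_workers


FREE_TRIAL_QUOTA = 8  # maximum cores allowed on the free trial

default_total = total_cores(master_cores=4, worker_cores=4, n_workers=2)
adjusted_total = total_cores(master_cores=4, worker_cores=2, n_workers=2)

print(default_total, default_total <= FREE_TRIAL_QUOTA)    # default cluster
print(adjusted_total, adjusted_total <= FREE_TRIAL_QUOTA)  # 2-CPU workers
```

The default cluster needs 12 cores, which is over the quota, while the adjusted one needs exactly 8.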

Google Cloud SDK

Since we need to change the default setting a little bit, we need to add one more argument to the command, but it’s simple enough. Let’s create a cluster and give it the same name as the project ID, and set worker nodes to have 2 CPUs each.

gcloud dataproc clusters create ${PROJECT_ID} \
--project=${PROJECT_ID} \
--worker-machine-type='n1-standard-2' \
--zone='europe-west1-b'

You can change the zone to be close to your bucket region.

Submitting Spark Job

Finally, we are ready to run the training on Google Dataproc. The Python script (pyspark_sa.py) for the training is included in the Git repository you cloned earlier. The code is a slightly refactored version of what I did in a Jupyter Notebook for my previous post. Below are a few of my previous posts, in case you want to know more about PySpark or NLP feature extraction.

And let’s take a look at what the Python script looks like.

https://medium.com/media/b1d254fc7af1cad76c7080c07d182561/href

Since I commented inside the script to explain what each line does, I will not go through the code extensively. But in a nutshell, the above script will take three command line arguments: Cloud Storage location where the training and test data are stored, a Cloud storage directory to store prediction result of the test data, and finally a Cloud storage directory to store the trained model. When called, it will first do the preprocessing of the training data -> build a pipeline -> fit the pipeline -> and make predictions on the test data -> print the accuracy of the predictions -> save prediction result as CSV -> save fitted pipeline model -> load the saved model -> print the accuracy again on the test data (to see if the model is properly saved).
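
To make the three command line arguments concrete, here is a minimal sketch of how a script can read them with argparse. This is only an illustration: the argument names data_dir, result_dir and model_dir are my own, the bucket name is a placeholder, and pyspark_sa.py may well read its arguments in a different way (for example directly from sys.argv).

```python
import argparse


def parse_args(argv=None):
    """Parse the three positional Cloud Storage locations."""
    parser = argparse.ArgumentParser(description="Train a sentiment model")
    parser.add_argument("data_dir", help="location of the training and test data")
    parser.add_argument("result_dir", help="directory for the prediction results")
    parser.add_argument("model_dir", help="directory for the fitted pipeline model")
    return parser.parse_args(argv)


# Placeholder bucket name; in a real run the values come from the command line
args = parse_args([
    "gs://my-bucket/pyspark_nlp/data/",
    "gs://my-bucket/pyspark_nlp/result",
    "gs://my-bucket/pyspark_nlp/model",
])
print(args.data_dir, args.result_dir, args.model_dir)
```

When argv is left as None, argparse falls back to the real command line, so the same function works both in tests and in the submitted job.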

Web Console

In order to run this job through the web console, we first need to upload the Python script to our Cloud Storage bucket so that we can point the job to the script. Let’s upload the script by running the command below. (I’m assuming that you are still in the pyspark_sa_gcp directory in your terminal.)

gsutil cp pyspark_sa.py gs://${PROJECT_ID}/pyspark_nlp/

Now click into Dataproc on the web console, and click “Jobs” then click “SUBMIT JOB”.

In the form shown in the screenshot, replace the blurred parts of the text with your project ID, then click “submit” at the bottom. You can inspect the output of the job by clicking into it.

The job is finished after 15 minutes, and by looking at the output, it seems like the cluster struggled a bit, but nonetheless, the prediction looks fine and the model seems to be saved properly.

Google Cloud SDK

If you submit a job from the command line, you don’t even need to upload your script to Cloud Storage. The command will grab the local file and move it to the Dataproc cluster to execute. (Again I’m assuming that you are still in the pyspark_sa_gcp directory in your terminal.)

gcloud dataproc jobs submit pyspark pyspark_sa.py \
--cluster=${PROJECT_ID} \
-- gs://${PROJECT_ID}/pyspark_nlp/data/ gs://${PROJECT_ID}/pyspark_nlp/result gs://${PROJECT_ID}/pyspark_nlp/model

Again the cluster seemed to struggle a bit, but still got the result and model saved properly. (I have tried to submit the same job on my paid account with 4 CPUs worker nodes, then it didn’t throw any warnings)

Checking the Results

Go to your bucket, then go into pyspark_nlp folder. You will see that the results of the above Spark job have been saved into “result” directory (for the prediction data frame), and “model” directory (fitted pipeline model).

Finally, don’t forget to delete the Dataproc cluster you have created to ensure it will not use up any more of your credit.

Through this post, I went through how to train Spark ML model on Google Dataproc and save the trained model for later use. What I showed here is only a small part of what GCP is capable of and I encourage you to explore other services on GCP and play around with it.

Thank you for reading. You can find the Git Repository of the scripts from the below link.

https://github.com/tthustla/pyspark_sa_gcp

Step-by-Step Tutorial: PySpark Sentiment Analysis on Google Dataproc was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Urban Sound Classification — Part 2: sample rate conversion, Librosa

Ricky Kim — Sat, 04 Aug 2018 09:32:24 GMT

Photo by Thomas Ribaud on Unsplash

This is the 2nd part of my ongoing Urban Sound Classification project. You can find the previous post from the link below.

Part 1: sound wave, digital audio signal

At the end of the previous post, I realised that the sound files in the data might differ from file to file in sampling rate, bit depth and number of channels. First, let me extract this information from each sound file and see what the distribution looks like.

*In addition to short code blocks I will attach the link for the whole Jupyter Notebook at the end of this post.

Sound Data Format Investigation

I haven’t properly explained this in my previous post, but a WAVE file contains not only the sampled sound data but also all the information about the data format. The picture below shows what kind of information is available in a WAVE file.

Image courtesy of http://soundfile.sapp.org/doc/WaveFormat/

In addition to the path_class function I defined in the previous post (to build a full path name from each file name), I will add one more function, wav_fmt_parser, to extract the three types of information I need from each WAVE file: number of channels, sampling rate, and bit depth.

https://medium.com/media/f2791570ad3fe8601a0281b9aed72b97/href
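
The wav_fmt_parser function itself lives in the embedded gist above. If you just want to try the idea without it, the same three values can be read with the wave module from the Python standard library, as in the sketch below. Note that this is a different approach from parsing the header by hand, that read_wav_format is a name I made up, and that the wave module only handles uncompressed PCM files, so it may fail on some files in the dataset.

```python
import os
import tempfile
import wave


def read_wav_format(path):
    """Return (n_channels, sampling_rate, bit_depth) of a PCM WAVE file."""
    with wave.open(path, "rb") as wav:
        n_channels = wav.getnchannels()
        sampling_rate = wav.getframerate()
        bit_depth = wav.getsampwidth() * 8  # sample width is given in bytes
    return n_channels, sampling_rate, bit_depth


# Write a tiny stereo 16-bit file so the sketch runs without the dataset
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(2)
    wav.setsampwidth(2)
    wav.setframerate(44100)
    wav.writeframes(b"\x00\x00" * 2 * 10)  # 10 silent stereo frames

fmt = read_wav_format(demo_path)
print(fmt)
```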

Now we can use this wav_fmt_parser function to add additional information to our data frame to see what kind of different data types exist in our dataset, and how they are distributed.

wav_fmt_data = [wav_fmt_parser(i) for i in data.slice_file_name]
data[['n_channels','sampling_rate','bit_depth']] = pd.DataFrame(wav_fmt_data)
data.head()

data.sampling_rate.value_counts()

data.n_channels.value_counts()

data.bit_depth.value_counts()

The most worrying of the above is the sampling rate. The number of channels can be handled by either extracting data from only one channel or averaging the values of the two channels. The bit depth determines the range of values each sample can take, and intuitively it feels like it can be handled by normalising the values with respect to the maximum and minimum values possible within that bit depth. (Warning: this is just my intuition, and I might have missed some pitfalls. The more I read and learn about DSP, the more I learn that things are not as simple as I thought.)
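
To make the two ideas above a bit more concrete, here is a minimal NumPy sketch of averaging channels and scaling integer samples by the bit depth. It carries the same caveat as my intuition: it assumes signed integer PCM samples (16-bit in the example) stored as an array of shape (n_frames, n_channels), and the function names and sample values are made up for illustration.

```python
import numpy as np


def to_mono(samples):
    """Average the channels of an array shaped (n_frames, n_channels)."""
    samples = np.asarray(samples, dtype=np.float64)
    if samples.ndim == 1:
        return samples  # already mono
    return samples.mean(axis=1)


def normalise_pcm(samples, bit_depth):
    """Scale signed integer PCM samples to the range [-1.0, 1.0)."""
    full_scale = 2 ** (bit_depth - 1)  # e.g. 32768 for 16-bit audio
    return np.asarray(samples, dtype=np.float64) / full_scale


# Three made-up stereo frames of 16-bit audio
stereo = np.array([[16384, -16384], [32767, 32767], [-32768, 0]], dtype=np.int16)
mono = to_mono(stereo)
scaled = normalise_pcm(mono, bit_depth=16)
print(mono, scaled)
```

Converting to float before averaging avoids integer overflow when two large samples are added together.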

Sample Rate Conversion

The reason why I’m doing sample rate conversion is to transform data so that they all have the same shape and easy to be processed with machine learning models. But in real life, there are many more use cases of sample rate conversion. For example, typical studio recording audio has 192khz and to make this recording as a CD, it should be resampled to CD sampling rate of 44.1khz, and different mediums might have different sample rate requirements (DVD has 48khz sample rate).

We saw that the majority of our data has a 44.1 kHz sample rate. Let’s say we want to resample our data to 22.05 kHz. Why? 44.1 kHz gives better audio quality, but for sound classification purposes 22.05 kHz is good enough to capture the differences between sounds. And our model will be faster if each clip is half its original size. There are other sample rates present in the dataset, but first let’s think about the simplest case: resampling audio from 44.1 kHz to 22.05 kHz.

At first glance, this doesn’t seem so complicated. 22.05 is exactly half of 44.1, and since the sampling rate means how frequently samples are taken from the original sound, it feels like we can just skip every other sample to get half the original sample rate. No! I will show you what happens when you do that.

As an example, let’s create a simple sine sweep. A sine sweep is a signal whose frequency changes over time from a starting frequency to a finishing frequency. If we create a 10-second sine sweep from a starting frequency of 20 Hz to a finishing frequency of 22.05 kHz, we hear a sound with increasing pitch over time.

import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipd
from scipy.signal import chirp
import soundfile as sf

fs = 44100
T = 10
t = np.linspace(0, T, T*fs, endpoint=False)

w = chirp(t, f0=20, f1=22050, t1=T, method='linear')
sf.write('sine_sweep_44100.wav', w, fs)

plt.figure(figsize=(15,3))
plt.specgram(w, Fs=44100)
plt.colorbar()
_=plt.axis((0,10,0,22050))
ipd.Audio('sine_sweep_44100.wav')

What you see above is a spectrogram of the sine sweep, with time on the x-axis and frequency on the y-axis. By playing the actual audio file, you can easily guess what the spectrogram shows.
It shows the frequency spectrum over time. Since our sine sweep starts at 20 Hz, you can see the red line start low and then rise in pitch until it reaches 22.05 kHz.

What if we simply skip every other sample to halve the sampling rate?

down_sampled = w[::2]
sf.write('sine_sweep_downsampled.wav', down_sampled, 22050)

plt.figure(figsize=(15,3))
plt.specgram(down_sampled, Fs=22050)
plt.colorbar()
_=plt.axis((0,10,0,22050))
ipd.Audio('sine_sweep_downsampled.wav')

What just happened?? The pitch is not constantly increasing anymore; it increases, then decreases. The only thing I did was skip every other sample. I didn’t change anything else.

To explain what has happened, we need to understand the Nyquist Sampling Theorem. According to the theorem, half the sampling rate, known as the Nyquist limit, is the highest frequency component that can be accurately represented. So in a signal with a 22.05 kHz sampling rate, the highest frequency that can be represented is 11.025 kHz. But the frequency (pitch) of our original sine sweep increases up to 22.05 kHz, so all the energy that cannot be represented at a 22.05 kHz sampling rate (the spectrum from 11.025 kHz to 22.05 kHz) folds back below the Nyquist limit, which is called aliasing, and creates the decreasing sine sweep.
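The fold-back can be put into a small formula: a tone above the Nyquist limit shows up mirrored around it. Below is a minimal sketch of that arithmetic for a real-valued signal; the function name aliased_frequency is hypothetical.

```python
def aliased_frequency(f, fs):
    """Frequency at which a tone of f Hz appears after sampling at fs Hz."""
    nyquist = fs / 2
    f = f % fs                          # the sampled spectrum repeats every fs
    return fs - f if f > nyquist else f

# A few points along the sweep, sampled at the reduced rate of 22.05 kHz.
for f in (5000, 11025, 15000, 22050):
    print(f, '->', aliased_frequency(f, 22050))
```

Frequencies up to 11.025 kHz come out unchanged, while anything between 11.025 kHz and 22.05 kHz is mirrored downwards, which matches the rise-then-fall shape in the spectrogram.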

To avoid this, the signal first needs to be passed through a low-pass filter, and only then should every other sample be selected. When the ratio of the two sample rates is not a simple integer, things get even more complicated. Furthermore, in the case of upsampling, we might need to interpolate to get the samples which were not included in the original sampled signal.
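For the simple factor-of-two case, SciPy already offers this filter-then-skip step as scipy.signal.decimate. The sketch below compares it with naive skipping on a shorter, 2-second sweep; the sweep length, the FIR filter choice, and the rms helper are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import chirp, decimate

fs = 44100
T = 2
t = np.linspace(0, T, T * fs, endpoint=False)
w = chirp(t, f0=20, f1=22050, t1=T, method='linear')

naive = w[::2]                                           # skip every other sample
filtered = decimate(w, 2, ftype='fir', zero_phase=True)  # low-pass first, then skip

def rms(x):
    """Root mean square level of a signal."""
    return np.sqrt(np.mean(x ** 2))

# In the last quarter of the sweep the original pitch is above 11.025 kHz,
# so a well-behaved downsampler should leave very little energy there.
tail = slice(3 * len(naive) // 4, None)
print('naive tail level:   ', rms(naive[tail]))
print('filtered tail level:', rms(filtered[tail]))
```

The naive version keeps the aliased tone at close to full level, while the filtered version is close to silent in that region.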

The reason I went through the above example was to show you some caveats when you are dealing with digital signal data. But the purpose of this post is not to create a new package for sample rate conversion. Then what do we do? Luckily we already have a wonderful library called Librosa which does all the conversion for us, hooray! (The actual sample rate conversion part in Librosa is done by either Resampy by default or Scipy’s resample)

Librosa

Now let’s pick one file from our dataset, load it both with Librosa and with Scipy’s wavfile module, and see how the results differ.

data[data.slice_file_name == '100652-3-0-1.wav']

By default, Librosa’s load function converts the sampling rate to 22.05 kHz, reduces the number of channels to 1 (mono), and normalises the data so that the values range from -1 to 1.

import librosa
from scipy.io import wavfile as wav  # alias assumed from the wav.read call below
fullpath,_ = path_class('100652-3-0-1.wav')
librosa_load, librosa_sampling_rate = librosa.load(fullpath)
scipy_sampling_rate, scipy_load = wav.read(fullpath)
print('original sample rate:',scipy_sampling_rate)
print('converted sample rate:',librosa_sampling_rate)
print('\n')
print('original wav file min~max range:',np.min(scipy_load),'~',np.max(scipy_load))
print('converted wav file min~max range:',np.min(librosa_load),'~',np.max(librosa_load))

plt.figure(figsize=(12, 4))
plt.plot(scipy_load)

plt.figure(figsize=(12, 4))
plt.plot(librosa_load)

By plotting the two arrays loaded from the same source sound file, we can see that the data loaded with Librosa has been reduced to mono (only one coloured line on the graph), while the original has two channels (green for one channel, and blue for the other).

And guess what. All the worries I had about the files having different sampling rates, numbers of channels, and value ranges have just been beautifully solved by loading the sound data with Librosa!

I could have just used Librosa from the beginning and not worried about any of these different WAVE file formats, but by investigating what Librosa actually does when loading a sound file, I now have a better understanding of the different sound data types and what they mean.

I think I now have quite a good understanding of the data that I am dealing with and next is the fun part, machine learning. In the next post, I will finally start the first phase of model building: feature extraction.

Thank you for reading. You can find the Jupyter Notebook of the code at the link below.

https://github.com/tthustla/urban_sound_classification/blob/master/urban_sound_classification_part2.ipynb

Urban Sound Classification — Part 2: sample rate conversion, Librosa was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Urban Sound Classification — Part 1: sound wave, digital audio signal

Ricky Kim — Sun, 24 Jun 2018 21:57:51 GMT

Photo by Sora Sagano on Unsplash

It’s been a while since my last post. So far I’ve mainly been writing about NLP, and I should admit that my focus has been quite narrow. Out of the 19 posts I have written so far, 13 were about NLP, and more specifically sentiment analysis. Even though building a sentiment classifier encompasses a lot of side tasks, the end goal is still sentiment analysis. Of course, I’m still nowhere near an expert, and I have a lot more to learn than I have already learned, but to give my data science learning a bit of diversity, I turned to another type of data.

*In addition to short code blocks I will attach, you can find the link for the whole Jupyter Notebook at the end of this post.

Audio data

Before I try anything, the first thing I wanted to do was get to know this new friend better. What is sound? To have a basic understanding of the data that I’ll be dealing with, I’ll have to go to a fundamental level.

Sound is compressions and rarefactions in the air that an ear picks up. Sound is a movement of air. It is often expressed as a waveform, which shows what’s happening to air particles as they move back and forth over time. The vertical axis shows how the air is moving either backwards or forwards with respect to a zero position. The horizontal axis shows time.

The above explanation is taken from a YouTube video. The presenter did a really good job of explaining sound, sampling, and the Nyquist Theorem. But wait, the waveform of a sound looks like it’s moving up and down, but he’s saying it’s moving back and forth. What am I missing? To be honest, I spent quite some time trying to get my head around this. It might be elementary physics, but I didn’t pay much attention in physics classes during my school years. I never imagined this would come back to haunt me.

From here, it will be a bit of review of introductory Physics, so if you are already familiar with it, you might find the content of this post a bit too basic.

We often hear the term “sound wave”, and yes, sound is a wave. Then what is a “wave”? Waves are vibrations that transfer energy from place to place without matter being transferred. Waves fall into different categories depending on their characteristics. Here I will talk about two of them: transverse waves and longitudinal waves.

Image courtesy of www.difference.wiki

A transverse wave is a wave in which the particles of the medium are displaced in a direction perpendicular to the direction of energy transport. You can think of a stretched rope sending waves when one end is moved up and down. A longitudinal wave is a wave in which the particles of the medium are displaced in a direction parallel to the direction of energy transport. As you can imagine from the above picture, one example is a slinky. With a stretched-out slinky (one end fixed), if you hold the other end and move it back and forth, this sends a wave that looks similar to the one on the left of the above picture.

An example of a transverse wave is a water wave, and an example of a longitudinal wave is sound. Then I was confused. OK, I now understand that sound is a longitudinal wave, but why do all the sound plots I see look like transverse waves?

You might have seen plots like the one above. In the above plot, A and C are pitches, and 440 Hz and 523.25 Hz are their frequencies. After a lot of Googling, I finally got it. By the way, sorry if this is too elementary for you. But for me, it was one of those Eureka moments.

“An incorrect understanding of this graph would be to picture air molecules going up and down as they travel across space from the place in which the sound originates to the place in which it is heard. This would be as if a particular molecule starts out where the sound originates and ends up in the listener’s ear. This is not what is being pictured in a graph of a sound wave. It is the energy, not the air molecules themselves, that is being transmitted from the source of a sound to the listener’s ear.”

When we hear a sound, we don’t get air molecules hitting our ears like wind. We just hear the sound, because air is the medium that transports the energy (sound), and the medium itself is not transferred. So the waves we see in a sound wave plot do not show an up-and-down movement of air; they show the compression and rarefaction of air particles. What looks like a crest in a sound wave plot is actually a compression, where air molecules are close together, and what looks like a trough is actually a rarefaction, where air molecules are more spread out. Following the same logic, what looks like equilibrium (0 on the y-axis) is the ambient pressure present before the sound was produced.

Dataset: UrbanSound8K

With a basic understanding of sound and sound wave plots, we can take a peek at our dataset. The dataset is called UrbanSound8K.

https://serv.cusp.nyu.edu/projects/urbansounddataset/urbansound8k.html

You can find more information about how the classes are drawn and data is collected, but to give you a short overview of data, “this dataset contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, enginge_idling, gun_shot, jackhammer, siren, and street_music”

According to the original paper, sound excerpts are taken from www.freesound.org and are already pre-sorted into ten folds for cross-validation.

Let’s first take a look at the meta-data.

import pandas as pd
import numpy as np

data = pd.read_csv("UrbanSound8K/metadata/UrbanSound8K.csv")
data.shape

data.head()

The meta-data contains 8 columns.

slice_file_name: name of the audio file
fsID: FreesoundID of the recording where the excerpt is taken from
start: start time of the slice
end: end time of the slice
salience: salience rating of the sound. 1 = foreground, 2 = background
fold: The fold number (1–10) to which this file has been allocated
classID:
0 = air_conditioner
1 = car_horn
2 = children_playing
3 = dog_bark
4 = drilling
5 = engine_idling
6 = gun_shot
7 = jackhammer
8 = siren
9 = street_music
class: class name

The audio data has already been sliced and excerpted, and even allocated to 10 different folds. Some of the excerpts are different slices of the same original file. If one slice from a certain recording were in the training data and a different slice from the same recording were in the test data, this could falsely inflate the accuracy of the final model. Thanks to the original research, this has also been taken care of: slices are allocated to folds such that all slices originating from the same Freesound recording go into the same fold.
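This property is easy to check with pandas, and it also suggests how the folds are meant to be used: hold out one fold at a time. The sketch below runs on a small made-up table with the same column names as the metadata (the rows themselves are hypothetical); with the real metadata loaded, the same check could be run on it instead.

```python
import pandas as pd

# Toy stand-in for the metadata: six slices cut from four recordings.
toy = pd.DataFrame({
    'slice_file_name': ['100-0-0-0.wav', '100-0-0-1.wav', '101-1-0-0.wav',
                        '102-2-0-0.wav', '102-2-0-1.wav', '103-3-0-0.wav'],
    'fsID': [100, 100, 101, 102, 102, 103],
    'fold': [1, 1, 2, 2, 2, 3],
})

# A recording leaks if its slices are spread over more than one fold.
folds_per_recording = toy.groupby('fsID')['fold'].nunique()
leaky = folds_per_recording[folds_per_recording > 1]
print('recordings spanning several folds:', list(leaky.index))

# Leave-one-fold-out: train and test never share a source recording.
shared_counts = []
for test_fold in sorted(toy.fold.unique()):
    train = toy[toy.fold != test_fold]
    test = toy[toy.fold == test_fold]
    shared_counts.append(len(set(train.fsID) & set(test.fsID)))
print(shared_counts)
```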

Now let’s take a look at the class distribution of each fold to see how balanced the dataset is.

appended = []
for i in range(1,11):
    appended.append(data[data.fold == i]['class'].value_counts())
    
class_distribution = pd.DataFrame(appended)
class_distribution = class_distribution.reset_index()
class_distribution['index'] = ["fold"+str(x) for x in range(1,11)]
class_distribution

OK, it looks like the dataset is not perfectly balanced. Let’s take a look at the total balance.

data['class'].value_counts(normalize=True)

There are two classes (car_horn, gun_shot) which have a bit less than half the number of entries of the other 8 classes. The dataset doesn’t look severely unbalanced, so for the moment I have decided not to consider any data augmentation for these two minority classes.

Below I defined two functions: the first gets the full path name of a WAV file and its label, and the second plots the waveform with additional information, along with an audio player that lets you play the sound file. Before I go into any detail, let’s first plot one sound file and see what it shows.

https://medium.com/media/f4347d70f16f809793d2bcd789dd5b8a/href

fullpath, label = path_class('100263-2-0-117.wav')
wav_plotter(fullpath,label)

I have briefly mentioned the characteristics of a sound wave, but I still haven’t looked at the concept of a digital audio signal. Without an understanding of digital signals, it is hard to understand what all this information (sampling rate, bit depth, etc.) means. So please allow me to take a detour to touch on the basics of digital audio. Again, if this is too elementary for you, feel free to skip it.

Sampling Rate, Bit Depth

An audio signal is a continuous analogue signal, and computers cannot process this type of continuous analogue data directly. It first needs to be transformed into a series of discrete values, and “sampling” does just that. “Sampling rate” and “bit depth” are two of the most important elements when discretizing an audio signal. In the picture below, you can see how they relate to analogue-to-digital conversion. In the graph, the x-axis is time and the y-axis is amplitude. The sampling rate decides how frequently samples are taken, and the bit depth decides how finely each sample’s amplitude is recorded.
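To make the two knobs concrete, here is a small sketch that samples a sine tone and rounds each sample to one of the levels allowed by a given bit depth. The function name sample_and_quantise and the toy values (a 440 Hz tone, 8 kHz sampling rate, 3 bits) are assumptions chosen to keep the numbers small.

```python
import numpy as np

def sample_and_quantise(freq_hz, duration_s, sampling_rate, bit_depth):
    """Sample a sine tone and round each sample to one of 2**bit_depth integer codes."""
    n = int(round(duration_s * sampling_rate))   # sampling rate sets how many samples
    t = np.arange(n) / sampling_rate
    analogue = np.sin(2 * np.pi * freq_hz * t)   # stand-in for the continuous signal
    levels = 2 ** bit_depth                      # bit depth sets how many distinct values
    codes = np.round((analogue + 1) / 2 * (levels - 1)).astype(int)
    return t, codes

t, codes = sample_and_quantise(freq_hz=440, duration_s=0.01, sampling_rate=8000, bit_depth=3)
print(len(codes), 'samples using codes', sorted(set(codes.tolist())))
```

Raising the sampling rate gives more samples for the same duration, and raising the bit depth gives more distinct codes to choose from for each sample.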

Let’s take a CD as an example. Normally, a CD has a 44.1 kHz sampling rate with 16-bit depth. First, the sampling rate of 44.1 kHz tells us that samples are taken 44,100 times per second. 16 bit tells us that each sample can take one of 65,536 values corresponding to its amplitude. Samples taken with 16 bit are 256 times more detailed than those taken with 8 bit. By the way, if you are wondering why a CD has a 44.1 kHz sampling rate, I recommend the same YouTube video I mentioned above. It will give you an intuitive understanding of sampling, aliasing, and the Nyquist Theorem.

https://medium.com/media/4df523a36f7a12885ff46a21dc2f56f6/href

The number of channels tells us how many channels there are. We call it stereo when 2 channels are used, and mono when only one channel is used. Of course, mono sound can be played through more than one speaker, but each speaker still plays an exact copy of the same signal. Stereo, on the other hand, is recorded with two different input channels for the same sound. What we normally see is stereo sound with left and right channels, which gives us a sense of directionality, perspective, and space.

Now we have some of the important pieces of the puzzle to understand what the metadata of the WAV file is telling us.

The sampling rate is the same as CD quality, 44.1 kHz, and the bit depth, 16 bit, is again CD quality. It is a stereo sound, and we can also see that from the plot. The green colour plots one channel while blue plots the other. It is a 4-second clip. Since the sampling rate is 44.1 kHz and the duration is 4 seconds, we can easily calculate the number of samples in the data by multiplying 44100 by 4, which is 176,400.
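The numbers quoted in this section are simple enough to verify by hand, but a few lines of Python make the relationships explicit. The helper names below are hypothetical.

```python
def quantisation_levels(bit_depth):
    """Number of distinct values a sample can take at a given bit depth."""
    return 2 ** bit_depth

def n_samples(sampling_rate, duration_s):
    """Number of samples per channel in a clip of the given duration."""
    return int(sampling_rate * duration_s)

levels_16 = quantisation_levels(16)
levels_8 = quantisation_levels(8)
per_channel = n_samples(44100, 4)

print(levels_16)              # values available to a 16-bit sample
print(levels_16 // levels_8)  # how many times finer 16 bit is than 8 bit
print(per_channel)            # samples per channel in a 4-second clip at 44.1 kHz
print(per_channel * 2)        # total values once both stereo channels are counted
```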

But here’s the bad news. Taking another look at the information on UrbanSound8K, I found a note saying “8732 audio files of urban sounds (see description above) in WAV format. The sampling rate, bit depth, and number of channels are the same as those of the original file uploaded to Freesound (and hence may vary from file to file).”

Uh oh. It means there might be many different sample rates in the data, which means that even with the same duration, the number of samples will be different. That doesn’t sound good for building a model. Moreover, different bit depths mean the samples can take different ranges of values. Some of the files might be stereo, while others are mono. That also doesn’t sound good.

I have spent most of this post going over the basic concepts of sound signals. In the next post, I will continue my journey through sound data preparation, which will hopefully involve more coding.

Thank you for reading. You can find the Jupyter Notebook of the code at the link below.

https://github.com/tthustla/urban_sound_classification/blob/master/urban_sound_classification_part1.ipynb

Urban Sound Classification — Part 1: sound wave, digital audio signal was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Yet Another Twitter Sentiment Analysis Part 1 — tackling class imbalance

Ricky Kim — Fri, 20 Apr 2018 16:58:34 GMT

Photo by Piret Ilver on Unsplash

I finished an 11-part series of blog posts on Twitter sentiment analysis not long ago. Why do I want to do sentiment analysis again? I wanted to take it further and run sentiment analysis on real retrieved tweets. My previous sentiment analysis project also had other limits.

The project stopped at the final trained model and lacks application of the model to retrieved tweets
The model was trained on only positive and negative class, so it lacks the ability to predict a neutral class

Regarding the neutral class, it might be possible to set threshold values for the negative, neutral, and positive classes and map the final output probability to one of the three, but I wanted to train a model with training data that has three sentiment classes: negative, neutral, positive.
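For reference, the threshold idea that I decided against would look roughly like the sketch below. The function name and the cut-off values 0.4 and 0.6 are purely hypothetical; nothing here was tuned or used in this project.

```python
def three_way_label(p_positive, lower=0.4, upper=0.6):
    """Map a binary classifier's positive-class probability to three labels."""
    if not 0.0 <= p_positive <= 1.0:
        raise ValueError('probability must be between 0 and 1')
    if p_positive < lower:
        return 'negative'
    if p_positive > upper:
        return 'positive'
    return 'neutral'   # the model is not confident either way

labels = [three_way_label(p) for p in (0.1, 0.5, 0.9)]
print(labels)
```

The weakness of this approach is that "not confident" and "neutral" are not the same thing, which is one reason to prefer training on data with a real neutral class.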

Since I already wrote quite a lengthy series on NLP and sentiment analysis, I won’t go into detailed explanations of concepts already covered in my previous posts. Also, the main data visualisation will be done with retrieved tweets, so I won’t go through extensive data visualisation of the data I use for training and testing a model.

*In addition to short code blocks I will attach, you can find the link for the whole Jupyter Notebook at the end of this post.

Data

In order to train my sentiment classifier, I need a dataset which meets the conditions below.

Preferably tweet text data with annotated sentiment labels
with 3 sentiment classes: negative, neutral, positive
big enough to train a model

While googling to find a good data source, I learned about a renowned NLP competition called SemEval. “SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems, organized under the umbrella of SIGLEX, the Special Interest Group on the Lexicon of the Association for Computational Linguistics.”

You might have already heard of this if you’re interested in NLP. Highly-skilled teams from all around the world compete on a couple of tasks such as “semantic textual similarity”, “multilingual semantic word similarity”, etc. One of the competition tasks is the Twitter sentiment analysis. It also has a couple of subtasks, but what I would want to focus on is “Subtask A. : Message Polarity Classification: Given a message, classify whether the message is of positive, negative, or neutral sentiment”.

Luckily the dataset they provide for the competition is available to download. The training data consists of SemEval’s previous training and test data. What’s even better is they provide test data, and all the teams who participated in the competition are scored with the same test data. This means I can compare my model performance with 2017 participants in SemEval.

I first downloaded full training data for SemEval 2017 Task 4.

There are 11 txt files in total, spanning from SemEval 2013 to SemEval 2016. While trying to read the files into a Pandas dataframe, I found that two files cannot be loaded properly as tsv files. It seems some entries are not properly tab-separated, so they end up as chunks of 10 or more tweets stuck together. I could have tried retrieving them with the tweet IDs provided, but I decided to ignore these two files for now and build a training set from only 9 txt files.

import pandas as pd  
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Once I import basic dependencies, I’ll read the data to a Pandas dataframe.

import glob
path = 'Subtask_A/'
# path already ends with a slash, so no extra separator is needed
all_files = glob.glob(path + "twitter*.txt")
list_ = []
for file_ in all_files:
    # read the last column as 'to_delete' and drop it again below
    df = pd.read_csv(file_,index_col=None, sep='\t', header=None, names=['id','sentiment','text','to_delete'])
    list_.append(df.iloc[:,:-1])
df = pd.concat(list_)

df = df.drop_duplicates()
df = df.reset_index(drop=True)
df.tail()

The dataset looks fairly simple with individual tweet ID, sentiment label, and tweet text.

df.info()

There are 41,705 tweets in total. As another sanity check, let’s take a look at how many words there are in each tweet.

df['token_length'] = [len(x.split(" ")) for x in df.text]
max(df.token_length)

df.loc[df.token_length.idxmax(),'text']

OK, the token length looks fine, and the tweet for maximum token length seems like a properly parsed tweet. Let’s take a look at the class distribution of the data.

df.sentiment.value_counts()

The data is not well balanced: the negative class has the fewest entries with 6,485, and the neutral class has the most with 19,466. I want to rebalance the data so that I have a balanced dataset, at least for training. I will deal with this after I define the cleaning function.

Data Cleaning

The data cleaning process is similar to that of my previous project, but this time I added a long list of contractions to expand most contracted forms to their original forms, such as “don’t” to “do not”. And this time, instead of regex, I used Spacy to parse the documents and filter out numbers, URLs, punctuation, etc. Below are the steps I took to clean the tweets.

Decoding: unicode_escape for extra “\” before unicode character, then unidecode
Apostrophe handling: there are two characters people use for contractions, “’” (curly apostrophe) and “'” (straight single quote). If both symbols are used for contractions, it is difficult to detect them and map them to the right expanded form, so any “’” (curly apostrophe) is changed to “'” (straight single quote)
Contraction check: check if there’s any contracted form, and replace it with its original form
Parsing: done with Spacy
Filtering punctuation, white space, numbers, URL using Spacy methods while keeping the text content of hashtag intact
Removed @mention
Lemmatize: lemmatized each token using Spacy method ‘.lemma_’. Pronouns are kept as they are since Spacy lemmatizer transforms every pronoun to “-PRON-”
Special character removal
Single syllable token removal
Spell correction: this is a simple spell correction dealing with repeated characters such as “sooooo goooood”. If the same character is repeated more than two times, the repetition is shortened to two. For example, “sooooo goooood” is transformed into “soo good”. This is not a perfect solution, since “soo” is still not a correct spelling after the correction. But at least it helps to reduce the feature space by mapping “sooo”, “soooo”, “sooooo” to the same word, “soo”
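The full cleaner is in the embedded gists below. Three of the steps above (apostrophe handling, contraction check, and spell correction) can be sketched with the standard library alone. This is a simplified illustration, not the original spacy_cleaner: the function names are hypothetical and the contraction mapping holds just three made-up entries instead of the long list.

```python
import re

# Tiny stand-in for the long contraction list used in the real cleaner.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "isn't": "is not"}

def normalise_apostrophes(text):
    """Replace the curly apostrophe (U+2019) with a straight single quote."""
    return text.replace("\u2019", "'")

def expand_contractions(text, mapping=CONTRACTIONS):
    """Replace every contraction found in the mapping with its expanded form."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in mapping) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)

def squeeze_repeats(text):
    """Shorten any character repeated more than twice in a row to exactly two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

raw = "I don\u2019t like it, it isn't sooooo goooood"
cleaned = squeeze_repeats(expand_contractions(normalise_apostrophes(raw)))
print(cleaned)
```

The order matters: the apostrophes have to be unified first, otherwise the contraction lookup misses every form typed with the curly character.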

https://medium.com/media/5e4834ad57ecf0dfe4b73e4cf76c82cb/href https://medium.com/media/0834b4fd6fff4c0c9d321f01c6402498/href

OK now let’s see how this custom cleaner works with tweets.

pd.set_option('display.max_colwidth', None)  # None disables truncation; older pandas versions used -1 for this
df.text[:10]

[spacy_cleaner(t) for t in df.text[:10]]

It looks like it’s doing what I intended it to do. I’ll clean the “text” column and create a new column called “clean_text”.

df['clean_text'] = [spacy_cleaner(t) for t in df.text]

While running the cleaning function, I can see it raised some “invalid escape sequence” warnings. Let’s see what these are.

for i,t in enumerate(df.text):
    if '\\m' in t:  # escaped backslash: matches a literal "\m" in the tweet
        print(i,t)

The tweets that contain ‘\m’ actually contained the emoticon ‘\m/’. I didn’t know about this until I googled it. Apparently ‘\m/’ stands for the horn sign you make with your hand, which is popular in metal music. Anyway, this is just a warning, not an error. Let’s see how the cleaner deals with it.

df.text[2064]

spacy_cleaner(df.text[2064])

Again, it seems to be doing what I intended it to do. So far so good.

Imbalanced Learning

“The class imbalance problem typically occurs when, in a classification problem, there are many more instances of some classes than others. In such cases, standard classifiers tend to be overwhelmed by the large classes and ignore the small ones.”

As I have already noted, the training data is not perfectly balanced: the ‘neutral’ class has 3 times as much data as the ‘negative’ class, and the ‘positive’ class has around 2.4 times as much. I will try fitting a model with three different versions of the data (oversampled, downsampled, original) to see how different sampling techniques affect the learning of a classifier.
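These ratios follow directly from the counts quoted earlier. In the sketch below, the negative and neutral counts and the total come from the text; the positive count is derived by subtraction, which assumes those three labels are the only ones in the data.

```python
counts = {'negative': 6485, 'neutral': 19466}
total = 41705

# Derived, not quoted: assumes there are only three sentiment labels.
counts['positive'] = total - sum(counts.values())

ratios = {label: n / counts['negative'] for label, n in counts.items()}
for label, ratio in ratios.items():
    print(label, counts[label], round(ratio, 2))
```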

The simple default classifier I’ll use to compare performances of different datasets will be the logistic regression. From my previous sentiment analysis project, I learned that Tf-Idf with Logistic Regression is a pretty powerful combination. Before I apply any other more complex models such as ANN, CNN, RNN etc, the performances with logistic regression will hopefully give me a good idea of which data sampling methods I should choose. If you want to know more about Tf-Idf, and how it extracts features from text, you can check my old post, “Another Twitter Sentiment Analysis with Python-Part5”.

In terms of validation, I will use K-Fold Cross Validation. In my previous project, I split the data into three sets (training, validation, test); all the parameter tuning was done with the reserved validation set, and the final model was applied to the test set. Considering that I had more than 1 million entries for training, this kind of validation set approach was acceptable. But this time the data I have is much smaller (around 40,000 tweets), and by holding out a validation set we might leave out interesting information about the data.

Original Imbalanced Data

https://medium.com/media/4fe9f2a4d7e5aa7c60ca31eb2c7afe47/href

from sklearn.pipeline import Pipeline

original_pipeline = Pipeline([
    ('vectorizer', tvec),
    ('classifier', lr)
])

lr_cv(5, df.clean_text, df.sentiment, original_pipeline, 'macro')

With data as it is without any resampling, we can see that the precision is higher than the recall. If you want to know more about precision and recall, you can check my old post, “Another Twitter sentiment analysis with Python — Part4”.

If we take a closer look at the result from each fold, we can also see that the recall for the negative class is quite low, around 28~30%, while the precision for the negative class is as high as 61~65%. This means the classifier is very picky and does not think many things are negative. The text it classifies as negative really is negative 61~65% of the time. However, it also misses a lot of the actual negative class, because it is so picky. We have low recall but comparatively high precision. The intuition behind this precision and recall has been taken from a Medium blog post by Andreas Klintberg.
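In terms of counts, precision and recall for one class come from three numbers: true positives, false positives, and false negatives. The sketch below uses made-up counts chosen only to land in the same region as the figures above; they are not taken from the actual folds.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for a single class from raw prediction counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical "picky" classifier: it rarely predicts negative (48 times),
# is usually right when it does (30 of 48), but misses 70 of 100 real negatives.
precision, recall = precision_recall(tp=30, fp=18, fn=70)
print(precision, recall)
```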

Oversampling

There is a very useful Python package called “imbalanced-learn”, which helps you deal with class imbalance issues. It is compatible with Scikit Learn and easy to use.

Within imbalanced-learn, there are different techniques you can use for oversampling. I will use the two below.

RandomOverSampler
SMOTE (Synthetic Minority Over-Sampling Technique)

There is one more point to consider if you are cross-validating with oversampled data. Oversampling the minority class can result in overfitting problems if we oversample before cross-validating. Why is that so? Because by oversampling before the cross-validation split, you are leaking information from the validation data into your training set. As they say, “What has been seen, cannot be unseen.”
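The leak can be made visible with a toy experiment that tracks row ids instead of real features. Everything in the sketch is an assumption made for illustration: 16 made-up rows, label 0 as the minority class, plain contiguous folds, and a hand-written random oversampler rather than the imbalanced-learn one.

```python
import numpy as np

def random_oversample(ids, y, rng):
    """Append randomly repeated minority rows (label 0) until both classes match in size."""
    minority = ids[y == 0]
    n_extra = int((y == 1).sum() - (y == 0).sum())
    if n_extra <= 0 or len(minority) == 0:
        return ids, y
    extra = rng.choice(minority, size=n_extra, replace=True)
    return np.concatenate([ids, extra]), np.concatenate([y, np.zeros(n_extra, dtype=int)])

rng = np.random.default_rng(777)
y = np.tile([0, 1, 1, 1], 4)   # 4 minority rows and 12 majority rows
ids = np.arange(len(y))        # every original row gets a unique id
k = 4

# Wrong order: oversample first, then split into folds.
ids_os, y_os = random_oversample(ids, y, rng)
leaks_before = 0
for val_pos in np.array_split(np.arange(len(ids_os)), k):
    train_pos = np.setdiff1d(np.arange(len(ids_os)), val_pos)
    leaks_before += len(set(ids_os[train_pos]) & set(ids_os[val_pos]))

# Right order: split first, then oversample only the training part.
leaks_after = 0
for val_pos in np.array_split(ids, k):
    train_pos = np.setdiff1d(ids, val_pos)
    train_ids, _ = random_oversample(ids[train_pos], y[train_pos], rng)
    leaks_after += len(set(train_ids) & set(ids[val_pos]))

print('rows shared between train and validation:', leaks_before, 'vs', leaks_after)
```

When the copies are made before the split, the same original row ends up on both sides of a fold, so the model is validated on rows it has effectively already seen.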

If you want more detailed explanation, I recommend this Youtube video “Machine Learning — Over-& Undersampling — Python/ Scikit/ Scikit-Imblearn”

Luckily, the cross-validation function I defined above as “lr_cv()” fits the pipeline only on the training split after the cross-validation split, so it does not leak any information from the validation set to the model.

RandomOverSampler

Random over-sampling is simply the process of repeating some samples of the minority class to balance the number of samples between classes in the dataset.

from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler

ROS_pipeline = make_pipeline(tvec, RandomOverSampler(random_state=777),lr)
SMOTE_pipeline = make_pipeline(tvec, SMOTE(random_state=777),lr)

Before we fit each pipeline, let’s see what the RandomOverSampler does. To make it easier to see, I defined some toy text data below, along with the target sentiment value for each text.

sent1 = "I love dogs"
sent2 = "I don't like dogs"
sent3 = "I adore cats"
sent4 = "I hate spiders"
sent5 = "I like dogs"
testing_text = pd.Series([sent1, sent2, sent3, sent4, sent5])
testing_target = pd.Series([1,0,1,0,1])

My toy data has 5 entries in total, and the target sentiments are three positives and two negatives. In order to be balanced, this toy data needs one more entry of negative class.

One thing to note is that an oversampler cannot handle raw text data. The text has to be transformed into a feature space for the oversampler to work. I’ll first fit TfidfVectorizer, then oversample using the Tf-Idf representation of the texts.

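from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(stop_words=None, max_features=100000)
testing_tfidf = tv.fit_transform(testing_text)
ros = RandomOverSampler(random_state=777)
# fit_resample is the current name of this method; older imbalanced-learn releases called it fit_sample
X_ROS, y_ROS = ros.fit_resample(testing_tfidf, testing_target)
# get_feature_names_out is the current name; older scikit-learn releases used get_feature_names
pd.DataFrame(testing_tfidf.todense(), columns=tv.get_feature_names_out())

pd.DataFrame(X_ROS.todense(), columns=tv.get_feature_names_out())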

By running RandomOverSampler, we now have one more entry at the end. The last entry added by RandomOverSampler is exactly the same as the fourth one (index number 3) from the top. RandomOverSampler simply repeats some entries of the minority class to balance the data. If we look at the target sentiments after RandomOverSampler, we can see that the classes are now perfectly balanced, thanks to the one extra entry of the negative class.

y_ROS

lr_cv(5, df.clean_text, df.sentiment, ROS_pipeline, 'macro')

Compared to the model built with the original imbalanced data, the model now behaves in the opposite way. The precision for the negative class is around 47~49%, but the recall is much higher at 64~67%. Now we have a situation of high recall and low precision. This means the classifier thinks a lot of things are negative, including a lot of non-negative texts. So from our set of data we got a lot of texts classified as negative; many of them were actually negative, but a lot of them were non-negative.

But without resampling, the recall rate was as low as 28~30% for negative class, the precision rate for the negative class I get from oversampling is more robust at around 47~49%.

Another way to look at it is to look at the F1 score, which is the harmonic mean of precision and recall. The original imbalanced data had 66.51% accuracy and a 60.01% F1 score. With oversampling, however, we get a slightly lower accuracy of 65.95%, but a much higher F1 score of 64.18%.
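To make the harmonic mean concrete, here is a minimal sketch in plain Python. The precision and recall values are hypothetical, chosen only to resemble the ranges quoted above, and the function names are my own rather than anything from the notebook. The 'macro' option used in the cross validation corresponds to the unweighted mean of the per-class F1 scores.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1) / float(len(per_class_f1))


# Hypothetical values for one class, not results from the real dataset.
high_recall_f1 = f1_score(0.48, 0.65)
low_recall_f1 = f1_score(0.48, 0.29)

print(round(high_recall_f1, 4))
print(round(low_recall_f1, 4))
```

Because the harmonic mean is pulled towards the smaller of the two values, a class with very low recall ends up with a low F1 score even when its precision is reasonable.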

SMOTE (Synthetic Minority Over-Sampling Technique)

SMOTE is an over-sampling approach in which the minority class is over-sampled by creating “synthetic” examples rather than by over-sampling with replacement.

According to the original research paper “SMOTE: Synthetic Minority Over-sampling Technique” (Chawla et al., 2002), “synthetic samples are generated in the following way: Take the difference between the feature vector (sample) under consideration and its nearest neighbour. Multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This causes the selection of a random point along the line segment between two specific features. This approach effectively forces the decision region of the minority class to become more general.” What this means is that when SMOTE creates a new synthetic sample, it picks one existing minority sample as a starting point and looks at its k nearest neighbours. Then, in feature space, it creates values that lie between the original sample and one of its neighbours.
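The interpolation step described in the paper can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the imbalanced-learn implementation; the function name and the two feature vectors are hypothetical.

```python
import random


def smote_sample(sample, neighbour, rng):
    """Return a synthetic point on the line segment between sample and neighbour."""
    gap = rng.random()  # random number in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


rng = random.Random(777)
# Two hypothetical minority-class feature vectors (not the toy data).
a = [0.0, 0.6, 0.8, 0.0]
b = [0.0, 0.0, 0.5, 0.7]
synthetic = smote_sample(a, b, rng)
print(synthetic)
```

Each synthetic feature lies between the two original values, and a feature that is 0 in both originals stays 0.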

Once you see the example with the toy data, it will become clearer.

smt = SMOTE(random_state=777, k_neighbors=1)
X_SMOTE, y_SMOTE = smt.fit_sample(testing_tfidf, testing_target)
pd.DataFrame(X_SMOTE.todense(), columns=tv.get_feature_names())

The last entry is the data created by SMOTE. To make it easier to see, let’s see only the negative class.

pd.DataFrame(X_SMOTE.todense()[y_SMOTE == 0], columns=tv.get_feature_names())

The top two entries are original data, and the one on the bottom is synthetic data. You can see it didn’t just repeat original data. Instead, the Tf-Idf values are created by taking random values between the top two original entries. As you can see, if the Tf-Idf values for both original entries are 0, then the synthetic entry also has 0 for those features, such as “adore”, “cactus”, “cats”, because if two values are the same there are no random values between them. I specifically defined k_neighbors as 1 for this toy data because there are only two entries of the negative class: if SMOTE chooses one as a starting point, then only one other negative entry is left as a neighbour.

Now let’s fit the SMOTE pipeline to see how it affects performance.

lr_cv(5, df.clean_text, df.sentiment, SMOTE_pipeline, 'macro')

SMOTE sampling seems to have a slightly higher accuracy and F1 score compared to random oversampling. With the results so far, it seems like choosing SMOTE oversampling is preferable over original or random oversampling.

Downsampling

How about downsampling? With oversampling, we increased the data of the minority class. With downsampling, we instead reduce the data of the majority class so that the classes are balanced.

from imblearn.under_sampling import NearMiss, RandomUnderSampler

RUS_pipeline = make_pipeline(tvec, RandomUnderSampler(random_state=777),lr)
NM1_pipeline = make_pipeline(tvec, NearMiss(ratio='not minority',random_state=777, version = 1),lr)
NM2_pipeline = make_pipeline(tvec, NearMiss(ratio='not minority',random_state=777, version = 2),lr)
NM3_pipeline = make_pipeline(tvec, NearMiss(ratio=nm3_dict,random_state=777, version = 3, n_neighbors_ver3=4),lr)

RandomUnderSampler

Again, before we run the pipeline, let’s apply this to the toy data to see what it does.

rus = RandomUnderSampler(random_state=777)
X_RUS, y_RUS = rus.fit_sample(testing_tfidf, testing_target)
pd.DataFrame(X_RUS.todense(), columns=tv.get_feature_names())

pd.DataFrame(testing_tfidf.todense(), columns=tv.get_feature_names())

Compared with the original imbalanced data, we can see that downsampled data has one less entry, which is the last entry of the original data belonging to the positive class. RandomUnderSampler reduces the majority class by randomly removing data from the majority class.

lr_cv(5, df.clean_text, df.sentiment, RUS_pipeline, 'macro')

Now the accuracy and the F1 score have dropped significantly. But the characteristic of low precision and high recall is the same as with the oversampled data. Only the overall performance dropped.

NearMiss

According to the documentation of “imbalanced-learn”, “NearMiss adds some heuristic rules to select samples. NearMiss implements 3 different types of heuristic which can be selected with the parameter version. NearMiss heuristic rules are based on nearest neighbors algorithm.”

There is also a good paper on resampling techniques. “Survey of resampling techniques for improving classification performance in unbalanced datasets” (Ajinkya More, 2016)

I borrowed the explanation of three different versions of NearMiss from More’s paper.

NearMiss-1

In NearMiss-1, those points from the majority class are retained whose mean distance to the k nearest points in the minority class is lowest. This means it will keep the points of the majority class that are most similar to the minority class.
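As a rough illustration of this rule, here is a minimal sketch in plain Python. It is my own simplified version using Euclidean distance, not the imbalanced-learn code, and the points and function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def near_miss_1(majority, minority, n_keep, k=1):
    """Return indices of the n_keep majority points whose mean distance
    to their k nearest minority points is smallest."""
    def score(point):
        distances = sorted(euclidean(point, m) for m in minority)
        return sum(distances[:k]) / float(k)

    ranked = sorted(range(len(majority)), key=lambda i: score(majority[i]))
    return sorted(ranked[:n_keep])


# Hypothetical 2-D points: the third majority point is far from the minority class.
majority = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
minority = [[0.0, 1.0], [1.0, 0.0]]
kept = near_miss_1(majority, minority, n_keep=2, k=1)
print(kept)
```

The majority point that sits far away from every minority point is the one that gets dropped.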

nm = NearMiss(ratio='not minority',random_state=777, version=1, n_neighbors=1)
X_nm, y_nm = nm.fit_sample(testing_tfidf, testing_target)
pd.DataFrame(X_nm.todense(), columns=tv.get_feature_names())

pd.DataFrame(testing_tfidf.todense(), columns=tv.get_feature_names())

We can see that NearMiss-1 has eliminated the entry for the text “I adore cats”, which makes sense because the words “adore” and “cats” appear only in this entry, which makes it the most different from the minority class in terms of Tf-Idf representation in feature space.

lr_cv(5, df.clean_text, df.sentiment, NM1_pipeline, 'macro')

It seems like both the accuracy and F1 score got worse than random undersampling.

NearMiss-2

In contrast to NearMiss-1, NearMiss-2 keeps those points from the majority class whose mean distance to the k farthest points in minority class is lowest. In other words, it will keep the points of majority class that’s most different to the minority class.

nm = NearMiss(ratio='not minority',random_state=777, version=2, n_neighbors=1)
X_nm, y_nm = nm.fit_sample(testing_tfidf, testing_target)
pd.DataFrame(X_nm.todense(), columns=tv.get_feature_names())

pd.DataFrame(testing_tfidf.todense(), columns=tv.get_feature_names())

Now we can see that NearMiss-2 has eliminated the entry for the text “I like dogs”, which again makes sense because we also have a negative entry “I don’t like dogs”. The two entries are in different classes, but they share the same two tokens, “like” and “dogs”.

lr_cv(5, df.clean_text, df.sentiment, NM2_pipeline, 'macro')

Both accuracy and F1 score got even lower compared to NearMiss-1. And we can also see that all the metrics fluctuate from fold to fold quite a lot.

NearMiss-3

The final NearMiss variant, NearMiss-3, selects the k nearest neighbours in the majority class for every point in the minority class. In this case, the undersampling ratio is directly controlled by k. For example, if we set k to be 4, then NearMiss-3 will choose the 4 nearest neighbours of every minority class entry.

Then we’ll end up with either more or fewer samples of the majority class than the minority class, depending on the number of neighbours we set. For example, with my dataset, if I run NearMiss-3 with the default n_neighbors_ver3 of 3, it will complain, and the number of entries in the neutral class (which is the majority class in my dataset) will be smaller than in the negative class (which is the minority class in my dataset). So I explicitly set n_neighbors_ver3 to be 4, so that I’ll have at least as many majority class entries as minority class entries.

One thing I’m not completely sure about is what kind of filtering it applies when the data selected with the n_neighbors_ver3 parameter outnumbers the minority class. As you will see below, after applying NearMiss-3, the dataset is perfectly balanced. However, if the algorithm simply chose the nearest neighbours according to the n_neighbors_ver3 parameter, I doubt that it would end up with the exact same number of entries for each class.

lr_cv(5, df.clean_text, df.sentiment, NM3_pipeline, 'macro')

NearMiss-3 produced the most robust result within NearMiss family, but slightly lower than RandomUnderSampling.

from collections import Counter

nm3 = NearMiss(ratio='not minority',random_state=777, version=3, n_neighbors_ver3=4)
tvec = TfidfVectorizer(stop_words=None, max_features=100000, ngram_range=(1, 3))
df_tfidf = tvec.fit_transform(df.clean_text)
X_res, y_res = nm3.fit_sample(df_tfidf, df.sentiment)
print('Distribution before NearMiss-3: {}'.format(Counter(df.sentiment)))
print('Distribution after NearMiss-3: {}'.format(Counter(y_res)))

Result

5-fold cross validation result (classifier used for validation: logistic regression with default setting)

Based on the above result, the sampling technique I’ll be using for the next post will be SMOTE. In the next post, I will try different classifiers with SMOTE oversampled data.

Thank you for reading and you can find the Jupyter Notebook from the below link:

tthustla/yet_another_tiwtter_sentiment_analysis_part1

Yet Another Twitter Sentiment Analysis Part 1 — tackling class imbalance was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Sentiment Analysis with PySpark

Ricky Kim — Tue, 13 Mar 2018 16:12:45 GMT

Photo by Chris J. Davis on Unsplash

One of the tools I’m deeply interested in but haven’t had many chances to explore is Apache Spark. Most of the time, Pandas and Scikit-Learn are enough to handle the size of data I’m trying to build a model on. But that also means that I haven’t had a chance to deal with petabytes of data yet, and I want to be prepared in case I’m faced with real big data.

I have tried some basic data manipulation with PySpark before, but only to a very basic level. I want to learn more and be more comfortable in using PySpark. This post is my endeavour to have a better understanding of PySpark.

Python is great for data science modelling, thanks to its numerous modules and packages that help achieve data science goals. But what if the data you are dealing with cannot fit into a single machine? Maybe you can implement careful sampling to do your analysis on a single machine, but with a distributed computing framework like PySpark, you can efficiently implement the task for large datasets.

The Spark API is available in multiple programming languages (Scala, Java, Python and R). There are debates about how Spark performance varies depending on which language you run it on, but since the main language I have been using is Python, I will focus on PySpark without going into too much detail about which language I should choose for Apache Spark.

Image courtesy of DataFlair

Spark has three different data structures available through its APIs: RDD, Dataframe (this is different from Pandas data frame), Dataset. For this post, I will work with Dataframe, and the corresponding machine learning library SparkML. I first decided on the data structure I would like to use based on the advice from the post in Analytics Vidhya. “Dataframe is much faster than RDD because it has metadata (some information about data) associated with it, which allows Spark to optimize query plan.” You can find a comprehensive introduction from the original post.

And there’s also an informative post on Databricks comparing different data structures of Apache Spark: “A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets”.

Then I figured out that I need to use SparkML instead of SparkMLLib if I want to deal with Dataframe. SparkMLLib is used with RDD, while SparkML supports Dataframe.

One more thing to note is that I will work in local mode on my laptop. Local mode is often used for prototyping, development, debugging, and testing. However, as Spark’s local mode is fully compatible with cluster mode, code written locally can be run on a cluster with just a few additional steps.

In order to use PySpark in Jupyter Notebook, you should either configure PySpark driver or use a package called Findspark to make a Spark Context available in your Jupyter Notebook. You can easily install Findspark by “pip install findspark” on your command line. Let’s first load some of the basic dependencies we need.

*In addition to short code blocks I will attach, you can find the link for the whole Jupyter Notebook at the end of this post.

import findspark
findspark.init()
import pyspark as ps
import warnings
from pyspark.sql import SQLContext

The first step in any Spark programming is to create a SparkContext. SparkContext is needed when we want to execute operations in a cluster. SparkContext tells Spark how and where to access a cluster. It is the first step to connect with a Spark cluster.

try:
    # create SparkContext on all CPUs available: in my case I have 4 CPUs on my laptop
    sc = ps.SparkContext('local[4]')
    sqlContext = SQLContext(sc)
    print("Just created a SparkContext")
except ValueError:
    warnings.warn("SparkContext already exists in this scope")

The dataset I’ll use for this post is annotated Tweets from “Sentiment140”. It originated from a Stanford research project, and I used this dataset for my previous series of Twitter sentiment analysis. Since I already cleaned the tweets during the process of my previous project, I will use pre-cleaned tweets. If you want to know more in detail about the cleaning process I took, you can check my previous post: “Another Twitter sentiment analysis with Python-Part 2” .

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('project-capstone/Twitter_sentiment_analysis/clean_tweet.csv')
type(df)

df.show(5)

df = df.dropna()
df.count()

After successfully loading the data as Spark Dataframe, we can take a peek at the data by calling .show(), which is equivalent to Pandas .head(). After dropping NA, we have a bit less than 1.6 million Tweets. I will split this into three parts; training, validation, test. Since I have around 1.6 million entries, 1% each for validation and test set will be enough to test the models.

(train_set, val_set, test_set) = df.randomSplit([0.98, 0.01, 0.01], seed = 2000)

HashingTF + IDF + Logistic Regression

Through my previous attempt at sentiment analysis with Pandas and Scikit-Learn, I learned that TF-IDF with Logistic Regression is quite a strong combination, and showed robust performance, as high as Word2Vec + Convolutional Neural Network model. So in this post, I will try to implement TF-IDF + Logistic Regression model with PySpark.

By the way, if you want to know more in detail about how TF-IDF is calculated, please check my previous post: “Another Twitter sentiment analysis with Python — Part 5 (Tfidf vectorizer, model comparison, lexical approach)”

https://medium.com/media/123c2c0b2e4323b4d1975d376a5a18ff/href

from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(maxIter=100)
lrModel = lr.fit(train_df)
predictions = lrModel.transform(val_df)

from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
evaluator.evaluate(predictions)

0.86! That looks good, maybe too good. Because I already tried the same combination of techniques with the same data in Pandas and SKLearn, I know that the result for unigram TF-IDF with Logistic Regression is around 80% accuracy. There can be some slight difference due to the detailed model parameters, but still, this looks too good.

And by looking at the Spark documentation I realised that what BinaryClassificationEvaluator evaluates is by default areaUnderROC.

And for binary classification, Spark’s BinaryClassificationEvaluator doesn’t support accuracy as a metric. But I can still calculate accuracy by counting the number of predictions matching the label and dividing it by the total number of entries.
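To see why the two numbers can differ, here is a minimal sketch in plain Python that computes both metrics on the same predictions. The labels and scores are made up for illustration, and the functions are my own simplified versions, not Spark’s implementation. ROC-AUC is computed here as the probability that a randomly chosen positive is scored above a randomly chosen negative.

```python
def accuracy(labels, predictions):
    """Share of predictions that match the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / float(len(labels))


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
predictions = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(labels, predictions)
auc = roc_auc(labels, scores)
print("Accuracy: {0:.4f}, ROC-AUC: {1:.4f}".format(acc, auc))
```

ROC-AUC only looks at how well the scores rank positives above negatives, while accuracy depends on the 0.5 cut-off, so the same model can have a noticeably higher ROC-AUC than accuracy.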

accuracy = predictions.filter(predictions.label == predictions.prediction).count() / float(val_set.count())
accuracy

Now it looks more plausible. Actually, the accuracy is slightly lower than what I have seen from SKLearn’s result.

CountVectorizer + IDF + Logistic Regression

There’s another way that you can get term frequency for IDF (Inverse Document Frequency) calculation. It is CountVectorizer in SparkML. Apart from the reversibility of the features (vocabularies), there is an important difference in how each of them filters top features. In case of HashingTF it is dimensionality reduction with possible collisions. CountVectorizer discards infrequent tokens.
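The difference between the two approaches can be sketched in plain Python. This is a simplified illustration, not Spark’s implementation: zlib.crc32 stands in for whatever hash function Spark uses, and the function names and toy documents are hypothetical.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features):
    """Map each token to a bucket by hashing; different tokens may collide."""
    vector = [0] * num_features
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


def fit_vocabulary(documents, vocab_size):
    """Keep only the vocab_size most frequent tokens across all documents."""
    counts = Counter(token for doc in documents for token in doc)
    return [token for token, _ in counts.most_common(vocab_size)]


def count_vector(tokens, vocabulary):
    """Count occurrences of each vocabulary token; other tokens are dropped."""
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]


docs = [["like", "dogs"], ["like", "cats"], ["like", "dogs", "spiders"]]
vocabulary = fit_vocabulary(docs, vocab_size=2)
hashed = hashing_tf(docs[0], num_features=4)
counted = count_vector(docs[1], vocabulary)
print(vocabulary, hashed, counted)
```

With hashing, there is no vocabulary to look up, so a bucket index cannot be mapped back to a token. With the count-based approach, the vocabulary is kept, but the infrequent tokens are discarded.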

Let’s see if performance changes if we use CountVectorizer instead of HashingTF.

https://medium.com/media/8149adbb2979ac51b687d44f1b4ed2d0/href

It looks like using CountVectorizer has improved the performance a little bit.

N-gram Implementation

In Scikit-Learn, n-gram implementation is fairly easy. You can define a range of n-grams when you call TfIdf Vectorizer. But with Spark, it is a bit more complicated. It does not automatically combine features from different n-grams, so I had to use VectorAssembler in the pipeline, to combine the features I get from each n-gram.
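As a plain-Python illustration of what combining n-grams means, the sketch below builds unigrams, bigrams and trigrams from one tokenised sentence and concatenates them into a single feature list. The helper name and the example tokens are hypothetical, and this only mimics the idea, not the Spark pipeline itself.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["i", "do", "not", "like", "dogs"]
features_by_n = {n: ngrams(tokens, n) for n in (1, 2, 3)}

# Each n-gram size gives its own feature set; they have to be joined explicitly.
combined = features_by_n[1] + features_by_n[2] + features_by_n[3]
print(combined)
```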

I first tried to extract around 16,000 features from unigram, bigram, trigram. This means I will get around 48,000 features in total. Then I implemented Chi-Squared feature selection to reduce the number of features to 16,000 in total.

https://medium.com/media/d9012446ff48afd63c35b2c5bdc7a8fa/href

And now I’m ready to run the function I defined above.

%%time
trigram_pipelineFit = build_trigrams().fit(train_set)
predictions = trigram_pipelineFit.transform(val_set)
accuracy = predictions.filter(predictions.label == predictions.prediction).count() / float(val_set.count())
roc_auc = evaluator.evaluate(predictions)

# print accuracy, roc_auc
print("Accuracy Score: {0:.4f}".format(accuracy))
print("ROC-AUC: {0:.4f}".format(roc_auc))

Accuracy has improved, but as you might have noticed, fitting the model took 4 hours! And this is mainly because of ChiSqSelector.

What if I extract 5,460 features each from unigram, bigram, trigram in the first place, to have around 16,000 features in total in the end, without Chi Squared feature selection?

https://medium.com/media/71fa6dac2444f204d3367576beff6df1/href

%%time

trigramwocs_pipelineFit = build_ngrams_wocs().fit(train_set)
predictions_wocs = trigramwocs_pipelineFit.transform(val_set)
accuracy_wocs = predictions_wocs.filter(predictions_wocs.label == predictions_wocs.prediction).count() / float(val_set.count())
roc_auc_wocs = evaluator.evaluate(predictions_wocs)

# print accuracy, roc_auc
print("Accuracy Score: {0:.4f}".format(accuracy_wocs))
print("ROC-AUC: {0:.4f}".format(roc_auc_wocs))

This has given me almost the same result, marginally lower, but the difference is in the fourth digit. Considering it takes only 6 mins without ChiSqSelector, I will definitely choose the model without ChiSqSelector.

And finally, let’s try this model on the final test set.

test_predictions = trigramwocs_pipelineFit.transform(test_set)
test_accuracy = test_predictions.filter(test_predictions.label == test_predictions.prediction).count() / float(test_set.count())
test_roc_auc = evaluator.evaluate(test_predictions)

# print accuracy, roc_auc
print("Accuracy Score: {0:.4f}".format(test_accuracy))
print("ROC-AUC: {0:.4f}".format(test_roc_auc))

Final test set accuracy is 81.22% with ROC-AUC 0.8862.

Through this post, I have implemented a simple sentiment analysis model with PySpark. Even though it might not be an advanced-level use of PySpark, I believe it is important to keep exposing myself to new environments and new challenges. Exploring some basic functions of PySpark really sparked (no pun intended) my interest.

I am attending Spark London Meetup tomorrow (13/03/2018) for “Apache Spark: Deep Learning Pipelines, PySpark MLLib and models in Streams”. I can’t wait to explore deeper into PySpark world!!

Thank you for reading and you can find the Jupyter Notebook from the below link:

https://github.com/tthustla/setiment_analysis_pyspark/blob/master/Sentiment%20Analysis%20with%20PySpark.ipynb

Sentiment Analysis with PySpark was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Bayesball: Bayesian analysis of batting average

Ricky Kim — Sun, 04 Mar 2018 19:25:51 GMT

Photo by Joshua Peacock on Unsplash

*In addition to short code blocks I will attach, you can find the link for the whole Jupyter Notebook at the end of this post.

One of the topics in data science or statistics that I find interesting but have difficulty understanding is Bayesian analysis. During my General Assembly Data Science Immersive boot camp, I had a chance to explore Bayesian statistics, but I really think I need some review and reinforcement.

This is my personal endeavour to have a better understanding of Bayesian thinking, and how it can be applied to real-life cases.

For this post, I am mainly inspired by a Youtube series by Rasmus Bååth, “Introduction to Bayesian data analysis”. He is really good at giving you an intuitive understanding of Bayesian analysis, not by bombarding you with all the complicated formulas, but by providing you with a thought-process of Bayesian statistics.

The topic I chose for this post is baseball. To be honest, I’m not a big sports fan. I rarely watch sports. I am Korean, and baseball is the most famous sport in Korea, and I believe there are some Korean players in MLB as well. It’s a bit embarrassing to admit, but I have heard of Chan-Ho Park, and that’s about it.

Then why choose baseball?

“I don’t know whether you know it, but baseball’s appeal is decimal points. No other sport relies as totally on continuity, statistics, orderliness of these. Baseball fans pay more attention to numbers than CPAs.” — Sportswriter Jim Murray

They say baseball is probably the world’s best-documented sport. Baseball statistics have accumulated over the past hundred years of its history. However, having collected stats alone doesn’t make baseball interesting in terms of statistics. Maybe the more important aspect is the individual nature of the game. For example, during an at-bat, who is playing in the outfield has very little effect on whether or not the batter can hit a home run. In other sports, especially football and basketball, the meaning of individual statistics can be diluted by the importance of what is going on elsewhere on the field or the court. This is what makes baseball stats useful for player comparison.

Baseball stats consist of numerous metrics, some of them straightforward, some of them quite advanced. The metric I chose to take a look at is batting average (AVG). In baseball, the batting average is defined as the number of hits divided by the number of at-bats. It is usually reported to three decimal places.

There can be criticism on batting average, but according to C. Trent Rosecrans, “Still, what batting average does have over all the other statistics is history and context. We all know what a .300 hitter is, we know how bad a .200 hitter is and how great a .400 hitter is.”

It seems like the regular season hasn’t started yet, and will start soon (29th of March). But there’s spring training. In Major League Baseball (MLB), spring training is a series of practices and exhibition games preceding the start of the regular season.

The questions I would try to answer are as follows:

How should I interpret batting average from 2018 spring training?
How can I compare two players on their batting average?

Before I jump into code, I will briefly touch on what Rasmus Bååth explained in his videos.

We first need three things to implement Bayesian analysis.
1. Data
2. Generative Model
3. Prior

In my case, the data will be the batting average records from 2018 spring training. The data is simply what we observed.

A Generative Model is the model that generates data when given parameters as input. The parameters are values you’ll need to generate a distribution. For example, if you know the mean and the standard deviation, you can easily generate normally distributed data of your chosen size by running below code. We will see other types of distribution later to use in Bayesian analysis.

import matplotlib.pyplot as plt
import numpy as np

mu, sigma = 0, 0.1 # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)
plt.hist(s)

In the case of Bayesian analysis, we invert the generative model and try to infer the parameters from the observed data.

Image Courtesy of Rasmus Bååth, “Introduction to Bayesian data analysis part 1”

Finally, Prior is the information that the model has before seeing the data. Is any probability equally likely? Or do we have some prior data that we can utilise? Or is there any educated guess that we can make?

I will first define a function to scrape Fox Sports’ stats page for a player. I defined it so that it can extract batting stats for either spring training or the regular season.

import pandas as pd
import seaborn as sns
import requests
from bs4 import BeautifulSoup

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

def batting_stats(url,season):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    table = soup.find_all("table",{"class": "wisbb_standardTable tablesorter"})[0]
    table_head = soup.find_all("thead",{"class": "wisbb_tableHeader"})[0]
    if season == 'spring':
        row_height = len(table.find_all('tr')[:-1])
    else:
        row_height = len(table.find_all('tr')[:-2])
    result_df = pd.DataFrame(columns=[row.text.strip() for row in table_head.find_all('th')], index = range(0,row_height)) 
    
    row_marker = 0
    for row in table.find_all('tr')[:-1]:
        column_marker = 0
        columns = row.find_all('td')
        for column in columns:
            result_df.iat[row_marker,column_marker] = column.text.strip()
            column_marker += 1
        row_marker += 1
    return result_df

Now let’s see who I should choose for analysis.

The above screen is spring training stats page for NY Mets (As I have already admitted, I know little about baseball, and I chose NY Mets because I liked the logo). If you arrange the players by their batting average (AVG), you can see Dominic Smith (DS) as the first, and Gavin Cecchini (GC) as the second. Are they good players? I don’t know. But by looking at the AVG only, DS is the best with 1.000 AVG.

But by some googling, I found out that “In recent years, the league-wide batting average has typically hovered around .260”. If so, then the AVGs for DS and GC seem too high. By further looking at At-Bats (AB) and Hits (H) of both players, it is clear that DS had only 1 AB and GC had 7. And also by looking further at AB for other players, the highest AB is 13 for 2018, and in 2017 the maximum AB is 60 within NY Mets.

Scenario 1

Let’s assume that I know nothing about their past performance, and the only data I observed is 2018 spring training. And I don’t know what value range I should expect from AVG. Based on this, how should I interpret the stats from 2018 spring training?

Let’s scrape the spring training stats for DS.

ds_url_st = "https://www.foxsports.com/mlb/dominic-smith-player-stats?seasonType=3"
dominic_smith_spring = batting_stats(ds_url_st,'spring')
dominic_smith_spring.iloc[-1]

n_draw = 20000
prior_ni = pd.Series(np.random.uniform(0, 1, size = n_draw)) 
plt.figure(figsize=(8,5))
plt.hist(prior_ni)
plt.title('Uniform distribution(0,1)')
plt.xlabel('Prior on AVG')
plt.ylabel('Frequency')

The prior represents our beliefs before we see the data. In the above distribution, any probability is almost equally likely (there are slight differences due to the random generation). This means I know nothing about the player, and I don’t even have any educated guess to make about AVG. I assume that 0.000 AVG is as likely as 1.000 AVG or any other probability between 0 and 1.

Now the data we observed says there was 1 AB, and 1 H, hence 1.000 AVG. This can be represented by Binomial distribution. A random variable X that has a binomial distribution represents the number of successes in a sequence of n independent yes/no trials, each of which yields success with probability p. In case of AVG, AVG is the probability of success, AB is the number of trials, and H is the number of success.
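For reference, the probability of seeing exactly k hits in n at-bats under this model can be written out directly. The sketch below is a plain-Python version of the binomial probability mass function; the function name is my own, and the AVG values passed in are hypothetical.

```python
from math import factorial


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): n trials, success probability p."""
    combinations = factorial(n) // (factorial(k) * factorial(n - k))
    return combinations * p ** k * (1 - p) ** (n - k)


# Probability of 1 hit in 1 at-bat if the true AVG were 0.230 (hypothetical value).
p_one_for_one = binomial_pmf(1, 1, 0.230)

# The probabilities over all possible hit counts sum to 1.
total = sum(binomial_pmf(k, 10, 0.260) for k in range(11))
print(p_one_for_one, total)
```

With a single at-bat, the probability of one hit is simply the AVG itself, which is why one observation tells us so little.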

Keeping these in mind, we can define our inverse generative model.

We will randomly pick one probability value from the uniform distribution we defined, and use this value as a parameter for our generative model. Let’s say the value we randomly picked is 0.230, this means 23% chance of success in Binomial distribution. The number of trials is 1 (DS has 1 AB), and if the result of the generative model matches the result we observed (in this case, DS has 1 H), then we keep the probability value 0.230. If we repeat this generation and filtering, we will finally get a distribution of probability that has generated the same result as we observed.

This becomes our Posterior.

def posterior(n_try, k_success, prior):
    hit = list()
    for p in prior:
        hit.append(np.random.binomial(n_try, p))
    posterior = prior[list(map(lambda x: x == k_success, hit))]
    plt.figure(figsize=(8,5))
    plt.hist(posterior)
    plt.title('Posterior distribution')
    plt.xlabel('Posterior on AVG')
    plt.ylabel('Frequency')
    print('Number of draws left: %d, Posterior mean: %.3f, Posterior median: %.3f, Posterior 95%% quantile interval: %.3f-%.3f' % 
      (len(posterior), posterior.mean(), posterior.median(), posterior.quantile(.025), posterior.quantile(.975)))

ds_n_trials = int(dominic_smith_spring[['AB','H']].iloc[-1][0])
ds_k_success = int(dominic_smith_spring[['AB','H']].iloc[-1][1])
posterior(ds_n_trials, ds_k_success, prior_ni)

A 95% quantile interval of the posterior distribution is called a credible interval, and it should be interpreted slightly differently from a confidence interval in the Frequentist sense. There is another kind of credible interval you can use, and I will get back to this when I mention Pymc3.

One major distinction between Bayesian’s credible interval and Frequentist’s confidence interval is their interpretation. The Bayesian probability reflects a person’s subjective beliefs. Following this approach, we can make the claim that true parameter is inside a credible interval with measurable probability. This property is appealing because it enables you to make a direct probability statement about parameters. Many people find this concept to be a more natural way of understanding a probability interval, which is also easier to explain. A confidence interval, on the other hand, enables you to make a claim that the interval covers the true parameter. If we gather a new sample, and calculate the confidence interval, and repeat this many times, 95% of those intervals we calculated will have true AVG value within the interval.

Credible Interval: “Given our observed data, there is a 95% probability that the true value of AVG falls within the credible interval”

Confidence Interval: “There is a 95% probability that when I compute confidence interval from data of this sort, the true value of AVG will fall within the confidence interval.”

Note the difference: the credible interval is a statement of probability about the parameter value given fixed bounds. The confidence interval is a probability about the bounds given a fixed parameter value.
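The repeated-sampling interpretation of a confidence interval can be checked with a small simulation. This is a minimal sketch under my own assumptions: a hypothetical true AVG of 0.260, 500 at-bats per sample, and the simple normal-approximation (Wald) interval, none of which come from the scraped data.

```python
import math
import random


def wald_interval(hits, at_bats, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    p_hat = hits / float(at_bats)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / at_bats)
    return p_hat - half_width, p_hat + half_width


def coverage(true_avg, at_bats, n_repeats, rng):
    """Share of simulated samples whose interval contains the true AVG."""
    covered = 0
    for _ in range(n_repeats):
        hits = sum(1 for _ in range(at_bats) if rng.random() < true_avg)
        low, high = wald_interval(hits, at_bats)
        if low <= true_avg <= high:
            covered += 1
    return covered / float(n_repeats)


rng = random.Random(777)
share_covering = coverage(true_avg=0.260, at_bats=500, n_repeats=2000, rng=rng)
print(share_covering)
```

The share of intervals that contain the true value should come out close to 95%. The true AVG is fixed here, and it is the interval bounds that change from sample to sample.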

Often in real life, what we would like to know about is the true parameter, not the bounds, and in that case the Bayesian credible interval is the right way to go. Here, we are interested in the true AVG of the player.

With the above posterior distribution, I am 95% certain that DS’s true AVG will be somewhere between 0.155 and 0.987. But that is a very broad statement to make. In other words, I am not quite certain about the true AVG of DS after observing just one trial with no prior knowledge.

Scenario 2

For the second scenario, let’s assume that we know spring training stats from previous year.

dominic_smith_spring.iloc[-2:]

Now we have 2017 spring training stats, and our prior should reflect this knowledge. This is not a uniform distribution anymore since we know that in 2017 spring training, DS’s AVG was 0.167.

The Beta distribution is a continuous probability distribution with two parameters, alpha and beta. One of its most common uses is to model one’s uncertainty about the probability of success of an experiment. In particular, starting from a uniform prior, the distribution of the success probability X, conditional on having observed k successes out of n trials, is a Beta distribution with parameters alpha = k+1 and beta = n−k+1.
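The Beta update described above can be written down in a couple of lines. This is only a sketch: beta_update is a hypothetical helper, and the 2-for-12 line is a made-up example rather than DS's scraped stats.

```python
def beta_update(alpha, beta, n_trials, k_success):
    """Return the Beta parameters after observing k successes in n trials."""
    return alpha + k_success, beta + n_trials - k_success

# A uniform prior is Beta(1, 1); observing 2 hits in 12 at-bats (made-up numbers)
# gives Beta(k + 1, n - k + 1) = Beta(3, 11).
alpha_post, beta_post = beta_update(1, 1, n_trials=12, k_success=2)
posterior_mean = alpha_post / float(alpha_post + beta_post)
print(alpha_post, beta_post, round(posterior_mean, 3))
```

The same helper can be applied again with new data, which is what using last year's stats as the prior amounts to.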

n_draw = 20000
prior_trials = int(dominic_smith_spring.iloc[3].AB)
prior_success = int(dominic_smith_spring.iloc[3].H)
prior_i = pd.Series(np.random.beta(prior_success+1, prior_trials-prior_success+1, size = n_draw)) 
plt.figure(figsize=(8,5))
plt.hist(prior_i)
plt.title('Beta distribution(a=%d, b=%d)' % (prior_success+1,prior_trials-prior_success+1))
plt.xlabel('Prior on AVG')
plt.ylabel('Frequency')

posterior(ds_n_trials, ds_k_success, prior_i)

The 95% quantile region has narrowed compared to the posterior with the uniform prior in Scenario 1. Now I can say that I am 95% certain that the true AVG of DS lies between 0.095 and 0.340. However, considering that players with an AVG above 0.300 are often called the best hitters, the statement means that the player can be either the worst hitter or the best hitter. We need more data to narrow our region of credibility.

Scenario 3

For this scenario, let’s assume that I not only have stats from 2017 spring training, but also stats from 2017 regular season. How does this affect my statement after I get the posterior?

ds_url = "https://www.foxsports.com/mlb/dominic-smith-player-stats?seasonType=1"
dominic_smith_reg = batting_stats(ds_url,'regular')
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
dominic_smith = pd.concat([dominic_smith_reg, dominic_smith_spring.iloc[[3]]], ignore_index=True)
dominic_smith

ds_prior_trials = pd.to_numeric(dominic_smith.AB).sum()
ds_prior_success = pd.to_numeric(dominic_smith.H).sum()

n_draw = 20000
prior_i_02 = pd.Series(np.random.beta(ds_prior_success+1, ds_prior_trials-ds_prior_success+1, size = n_draw)) 
plt.figure(figsize=(8,5))
plt.hist(prior_i_02)
plt.title('Beta distribution(a=%d, b=%d)' % (ds_prior_success+1,ds_prior_trials-ds_prior_success+1))
plt.xlabel('Prior on AVG')
plt.ylabel('Frequency')

posterior(ds_n_trials, ds_k_success, prior_i_02)

Now I can say that I am 95% certain that the true AVG of DS lies between 0.146 and 0.258. It may not be pinpoint, but compared to Scenarios 1 and 2, the credible interval is much narrower now.

Scenario 4

I want to compare two players and see who is better in terms of AVG. The data I observed is the result from 2018 spring training, and the prior knowledge I have is from 2017 spring training and the 2017 regular season. Now I want to compare DS to GC.

Up until Scenario 3, I simulated the sampling by rejecting all the parameters that yielded a result different from what I observed. But this type of random sample generation and filtering is often computationally expensive and slow to run. Luckily, there are tools that let the sampler spend more time in regions of high probability, which raises efficiency. Probabilistic programming tools such as Pymc3 can handle the sampling procedure efficiently by making use of clever algorithms such as HMC-NUTS.
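For reference, the reject-and-keep idea used up to Scenario 3 can be sketched as follows. This is not the posterior function from earlier in the post; rejection_posterior is a hypothetical stand-in, and the 5-for-20 observation is made up for illustration.

```python
import numpy as np

def rejection_posterior(n_trials, k_success, prior_draws, rng):
    """Keep the prior draws whose simulated outcome equals the observed one."""
    # Simulate n_trials at-bats once for every candidate AVG.
    simulated_hits = rng.binomial(n_trials, prior_draws)
    # Reject every candidate that does not reproduce the observed hit count.
    return prior_draws[simulated_hits == k_success]

rng = np.random.default_rng(42)
prior_draws = rng.uniform(0, 1, size=200000)              # uniform prior on AVG
accepted = rejection_posterior(20, 5, prior_draws, rng)   # hypothetical 5-for-20
print("kept %d of %d draws" % (len(accepted), len(prior_draws)))
```

Most of the draws are thrown away, which is why this approach gets slow as the amount of observed data grows.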

Let’s first start by scraping stats for Gavin Cecchini from Fox Sports.

gc_url_st = "https://www.foxsports.com/mlb/gavin-cecchini-player-stats?seasonType=3"
gc_url_reg = "https://www.foxsports.com/mlb/gavin-cecchini-player-stats?seasonType=1"
gavin_cecchini_spring = batting_stats(gc_url_st,'spring')
gavin_cecchini_reg = batting_stats(gc_url_reg,'regular')
gc_n_trials = int(gavin_cecchini_spring.iloc[1].AB)
gc_k_success = int(gavin_cecchini_spring.iloc[1].H)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
gc_prior = pd.concat([gavin_cecchini_reg.iloc[[1]], gavin_cecchini_spring.iloc[[0]]])
gc_prior

gc_prior_trials = pd.to_numeric(gc_prior.AB).sum()
gc_prior_success = pd.to_numeric(gc_prior.H).sum()

def observed_data_generator(n_try,observed_data):
    result = np.ones(observed_data)
    fails = n_try - observed_data
    result = np.append(result, np.zeros(fails))
    return result

ds_observed = observed_data_generator(ds_n_trials,ds_k_success)
gc_observed = observed_data_generator(gc_n_trials,gc_k_success)

Now we are ready to fit a Pymc3 model.

import pymc3 as pm
with pm.Model() as model_a:    
    D_p = pm.Beta('DS_AVG', ds_prior_success+1, ds_prior_trials-ds_prior_success+1)
    G_p = pm.Beta('GC_AVG', gc_prior_success+1, gc_prior_trials-gc_prior_success+1)
    DS = pm.Bernoulli('DS', p=D_p, observed=ds_observed)
    GC = pm.Bernoulli('GC', p=G_p, observed=gc_observed) 
    DvG = pm.Deterministic('DvG', D_p - G_p)
    start = pm.find_MAP()
    trace = pm.sample(10000, start=start)

pm.plot_posterior(trace, varnames=['DS_AVG','GC_AVG','DvG'],ref_val=0)

If we plot the posterior distributions of DS_AVG, GC_AVG, and DvG (DS_AVG - GC_AVG) using the plot_posterior function in Pymc3, we see the term HPD instead of quantile. The Highest Posterior Density (HPD) interval is another type of credible interval we can use with posteriors. The HPD interval is the narrowest interval containing the chosen probability mass, which means it is made up of the values with the highest probability density, including the mode.
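A common way to estimate an HPD interval from posterior draws is to sort them and slide a window that holds 95% of the draws, keeping the narrowest one. The sketch below does that with NumPy; hpd_interval is a hypothetical helper and not Pymc3's own implementation, and the skewed Beta(2, 10) draws are only there to make the difference from the quantile interval visible.

```python
import numpy as np

def hpd_interval(draws, mass=0.95):
    """Narrowest interval that contains `mass` of the draws."""
    sorted_draws = np.sort(draws)
    n = len(sorted_draws)
    window = int(np.ceil(mass * n))          # number of draws inside the interval
    # Width of every interval that spans `window` consecutive sorted draws.
    widths = sorted_draws[window - 1:] - sorted_draws[:n - window + 1]
    start = int(np.argmin(widths))
    return sorted_draws[start], sorted_draws[start + window - 1]

rng = np.random.default_rng(0)
skewed_draws = rng.beta(2, 10, size=20000)   # right-skewed stand-in posterior

hpd_low, hpd_high = hpd_interval(skewed_draws)
q_low, q_high = np.percentile(skewed_draws, [2.5, 97.5])
print("HPD width: %.3f, quantile width: %.3f" % (hpd_high - hpd_low, q_high - q_low))
```

For a symmetric posterior the two intervals nearly coincide; the more skewed the posterior, the more the HPD interval shifts towards the mode.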

Again, I found that another post by Rasmus Bååth provides an easy-to-understand visual comparison of the quantile interval and the highest density interval. Below are the mode and the highest density intervals covering 95% of the probability density for six different posterior distributions.

Image Courtesy of Rasmus Bååth, “Probable Points and Credible Intervals, Part 1: Graphical Summaries”

The quantile interval is shown together with the median: the median has 50% of the probability to its left and 50% to its right, and the quantile interval leaves, say, 2.5% of the probability on either side (in the case of a 95% credible interval).

Image Courtesy of Rasmus Bååth, “Probable Points and Credible Intervals, Part 1: Graphical Summaries”

In the case of the batting averages of DS and GC, it looks like the mode and the median are not that different, and if so, the HPD interval will be similar to the quantile interval. Let’s see how they look.

pm.summary(trace)

We can see that for both DS and GC, the HPD interval and the quantile interval are either exactly the same or differ only slightly in the decimal places.

The question I wanted to answer was who is the better player in terms of AVG, and I should say I can’t be certain. At least, I can’t be 95% certain that these two players are different in terms of AVG. The difference I calculated and plotted (DS minus GC, so a positive DvG means DS is better and a negative DvG means GC is better) can be somewhere between -0.162 and 0.033.

This interval includes 0.000, which represents no difference between the two players’ AVG. Thus, there is some evidence that GC is better than DS (since the DvG posterior distribution has a larger region in the negative area than in the positive area), but I can’t be 95% certain that these two players are different in terms of AVG.
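The "larger region in the negative area" can be turned into a number by counting how many posterior draws of the difference fall above zero. The sketch below uses NumPy only and relies on the fact that a Beta prior with a binomial outcome gives a Beta posterior. All the counts are hypothetical stand-ins, not the scraped stats for DS and GC.

```python
import numpy as np

rng = np.random.default_rng(1)
n_draw = 20000

# Hypothetical prior and observed counts for two players (illustrative only).
player_a = dict(prior_h=30, prior_ab=170, obs_h=3, obs_ab=15)
player_b = dict(prior_h=25, prior_ab=100, obs_h=6, obs_ab=20)

def posterior_draws(p):
    # Beta(prior hits + 1, prior misses + 1) updated with the observed hits and misses.
    alpha = p["prior_h"] + 1 + p["obs_h"]
    beta = (p["prior_ab"] - p["prior_h"]) + 1 + (p["obs_ab"] - p["obs_h"])
    return rng.beta(alpha, beta, size=n_draw)

diff = posterior_draws(player_a) - posterior_draws(player_b)
prob_a_better = (diff > 0).mean()              # share of draws where A out-hits B
lower, upper = np.percentile(diff, [2.5, 97.5])
print("P(A > B) = %.3f, 95%% interval of the difference: %.3f to %.3f"
      % (prob_a_better, lower, upper))
```

With a Pymc3 trace, the same count can be taken on the sampled values of the difference variable.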

Maybe with more data, I might be able to be certain about their difference. After all, that is the essence of Bayesian thinking. It is not that the truth doesn’t exist, but that we can’t know it perfectly, and all we can hope to do is update our understanding as more and more evidence becomes available.

Thank you for reading, and you can find the whole Jupyter Notebook from the below link.

https://github.com/tthustla/Bayesball/blob/master/Bayesball.ipynb

Bayesball: Bayesian analysis of batting average was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Another Twitter sentiment analysis with Python — Part 11 (CNN + Word2Vec)

Ricky Kim — Fri, 23 Feb 2018 07:33:12 GMT

Photo by Mahdi Bafande on Unsplash

This is the 11th and the last part of my Twitter sentiment analysis project. It has been a long journey, and through many trials and errors along the way, I have learned countless valuable lessons. I haven’t decided on my next project. But I will definitely make time to start a new project. You can find the previous posts from the below links.

*In addition to short code blocks I will attach, you can find the link for the whole Jupyter Notebook at the end of this post.

Preparation for Convolutional Neural Network

In the last post, I aggregated the word vectors of each word in a tweet, either by summing them or by taking their mean, to get one vector representation of each tweet. However, in order to feed the data to a CNN, we have to not only feed each word vector to the model, but also keep the vectors in a sequence that matches the original tweet.

For example, let’s say we have a sentence as below.

“I love cats”

And let’s assume that we have a 2-dimensional vector representation of each word as follows:

I: [0.3, 0.5] love: [1.2, 0.8] cats: [0.4, 1.3]

With the above sentence, the dimension of the vector we have for the whole sentence is 3 X 2 (3: number of words, 2: number of vector dimension).

But there is one more thing we need to consider. A neural network model will expect all the data to have the same dimension, but in case of different sentences, they will have different lengths. This can be handled with padding.

Let’s say we have our second sentence as below.

“I love dogs too”

with the below vector representation of each word:

I: [0.3, 0.5], love: [1.2, 0.8], dogs: [0.8, 1.2], too: [0.1, 0.1]

The first sentence has a 3X2 dimension matrix, but the second sentence has a 4X2 dimension matrix. Our neural network won’t accept these as inputs. With padding, we decide the maximum number of words in a sentence, then fill the rest with zeros if the input is shorter than the designated length. If the input exceeds the maximum length, it is truncated either from the beginning or from the end. For example, let’s say we decide our maximum length to be 5.

Then by padding, the first sentence will have 2 more 2-dimensional vectors of all zeros at the start or the end (you can decide this by passing an argument), and the second sentence will have 1 more 2-dimensional vector of zeros at the beginning or the end. Now we have 2 same dimensional (5X2) vectors for each sentence, and we can finally feed this to a model.
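The padding of the two example sentences can be checked with a few lines of NumPy. This is a minimal sketch; pad_sentence is a hypothetical helper written for this example, not a Keras function, and it always truncates from the start when a sentence is too long.

```python
import numpy as np

def pad_sentence(vectors, max_len, pad_at_start=True):
    """Zero-pad (or truncate) a words x dims matrix to exactly max_len rows."""
    vectors = np.asarray(vectors, dtype=float)[-max_len:]   # keep the last max_len words
    padding = np.zeros((max_len - len(vectors), vectors.shape[1]))
    parts = [padding, vectors] if pad_at_start else [vectors, padding]
    return np.vstack(parts)

cats = [[0.3, 0.5], [1.2, 0.8], [0.4, 1.3]]                # "I love cats"
dogs = [[0.3, 0.5], [1.2, 0.8], [0.8, 1.2], [0.1, 0.1]]    # "I love dogs too"

padded_cats = pad_sentence(cats, max_len=5)
padded_dogs = pad_sentence(dogs, max_len=5)
print(padded_cats.shape, padded_dogs.shape)
```

Both results have 5 rows and 2 columns, so they can be stacked into one batch.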

Let’s first load the Word2Vec models to extract word vectors from. I saved the Word2Vec models I trained in the previous post, and they can easily be loaded with the “KeyedVectors.load” method in Gensim. I have two different Word2Vec models, one trained with the CBOW (Continuous Bag Of Words) model and the other with the skip-gram model. I won’t go into detail on how CBOW and skip-gram differ, but you can refer to my previous post if you want to know a bit more.

from gensim.models import KeyedVectors
model_ug_cbow = KeyedVectors.load('w2v_model_ug_cbow.word2vec')
model_ug_sg = KeyedVectors.load('w2v_model_ug_sg.word2vec')

By running the code block below, I am constructing a sort of dictionary from which I can extract the word vectors. Since I have two different Word2Vec models, “embeddings_index” below will hold the concatenated vectors of the two models. Each model gives a 100-dimensional vector representation of a word, so by concatenating them, each word will have a 200-dimensional vector representation.

embeddings_index = {}
for w in model_ug_cbow.wv.vocab.keys():
    embeddings_index[w] = np.append(model_ug_cbow.wv[w],model_ug_sg.wv[w])

Now we have our reference to word vectors ready, but we still haven’t prepared the data in the format I explained at the start of the post. Keras’ ‘Tokenizer’ will split each sentence into words, then we can call the ‘texts_to_sequences’ method to get a sequential representation of each sentence. We also need to pass ‘num_words’, which is the number of vocabulary words you want to use, and this is applied when you call the ‘texts_to_sequences’ method. This might be a bit counter-intuitive, because if you check the length of the word index, it will not be the number of words you defined; the actual screening happens when you call ‘texts_to_sequences’.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=100000)
tokenizer.fit_on_texts(x_train)
sequences = tokenizer.texts_to_sequences(x_train)
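The counter-intuitive part, that the word index keeps every word while the sequences only use the most frequent ones, can be imitated in plain Python. The sketch below is a simplified imitation, not Keras' implementation (no lowercasing or punctuation filtering); it follows the documented rule that only the num_words - 1 most common words are kept. The three toy sentences are made up.

```python
from collections import Counter

texts = ["i love cats", "i love dogs too", "i hate mondays"]
num_words = 4   # only indices 1, 2 and 3 survive the screening

# Index 1 goes to the most frequent word; index 0 is left free for padding.
counts = Counter(word for text in texts for word in text.split())
word_index = {word: rank + 1 for rank, (word, _) in enumerate(counts.most_common())}

# The screening happens here, not when the word index is built.
sequences = [
    [word_index[w] for w in text.split() if word_index[w] < num_words]
    for text in texts
]

print(len(word_index))   # the full vocabulary is still in the index
print(sequences)         # words ranked outside the limit are dropped
```

So checking the length of the word index tells you the full vocabulary size, not the value passed as num_words.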

Below are the first five entries of the original train data.

for x in x_train[:5]:
    print(x)

And the same data prepared as sequential data is as below.

sequences[:5]

Each word is represented as a number, and we can see that the number of words in each sentence matches the number of entries in “sequences”. We can later look up which word each number represents. But we still haven’t padded our data, so each sentence has a different length. Let’s deal with this.

length = []
for x in x_train:
    length.append(len(x.split()))
max(length)

The maximum number of words in a sentence within the training data is 40. Let’s decide the maximum length to be a bit longer than this, let’s say 45.

x_train_seq = pad_sequences(sequences, maxlen=45)
x_train_seq[:5]

As you can see from the padded sequences, all the data is now transformed to have the same length of 45, and by default, Keras zero-pads at the beginning if a sentence is shorter than the maximum length. If you want to know more, please check the Keras documentation on sequence preprocessing.

sequences_val = tokenizer.texts_to_sequences(x_validation)
x_val_seq = pad_sequences(sequences_val, maxlen=45)

There’s still one more thing left to do before we can feed the sequential text data to a model. When we transformed a sentence into a sequence, each word is represented by an integer number. Actually, these numbers are where each word is stored in the tokenizer’s word index. Keeping this in mind, let’s build a matrix of these word vectors, but this time we will use the word index number so that our model can refer to the corresponding vector when fed with integer sequence.

Below, I am defining the number of words to be 100,000. This means I will only care about the 100,000 most frequent words in the training set. If I don’t limit the number of words, the total vocabulary size will be more than 200,000.

num_words = 100000
embedding_matrix = np.zeros((num_words, 200))
for word, i in tokenizer.word_index.items():
    if i >= num_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

As a sanity check, let’s see whether the embedding matrix has been generated properly. Above, when I looked at the first five entries of the training set, the first entry was “hate you”, and the sequential representation of this was [137, 6]. Let’s see if the row at index 6 of the embedding matrix is the same as the vector for the word ‘you’.

np.array_equal(embedding_matrix[6] ,embeddings_index.get('you'))
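To see why the row position matters, here is a toy version of the lookup that an embedding layer performs: each integer in a padded sequence simply selects a row of the matrix. The two-word vocabulary and the 2-dimensional vectors are made up for illustration.

```python
import numpy as np

# Toy word index and vectors (hypothetical values, 2 dimensions instead of 200).
toy_index = {"hate": 1, "you": 2}
toy_vectors = {"hate": [0.9, 0.1], "you": [0.2, 0.7]}

# Row 0 stays all zeros, so the padding value 0 maps to a zero vector.
toy_matrix = np.zeros((len(toy_index) + 1, 2))
for word, i in toy_index.items():
    toy_matrix[i] = toy_vectors[word]

padded_sequence = np.array([0, 0, 1, 2])    # "hate you" padded to length 4
looked_up = toy_matrix[padded_sequence]     # one row per position in the sequence
print(looked_up.shape)
```

The result has one row per position and one column per vector dimension, which is the words-by-dimensions layout described at the start of the post.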

Now we are done with the data preparation. Before we jump into CNN, I would like to test one more thing (sorry for the delay). When we feed this sequential vector representation of the data, we will use the Embedding layer in Keras. With the Embedding layer, I can either pass a pre-defined embedding, which I prepared as ‘embedding_matrix’ above, or let the Embedding layer itself learn word embeddings as the whole model trains. Another possibility is to feed the pre-defined embedding but make it trainable, so that the values of the vectors are updated as the model trains.

In order to check which method performs better, I defined a simple shallow neural network with one hidden layer. For this model structure, I will not try to refine models by tweaking parameters, since the main purpose of this post is to implement CNN.

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Flatten
from keras.layers.embeddings import Embedding

https://medium.com/media/797309d313e17301c931fe5dcd2ac28f/href

https://medium.com/media/6da70277dd345021951a0d8448cb4afd/href

https://medium.com/media/0fdf6cdd597a510a93ef2a9ae0c20a63/href

As a result, the best validation accuracy is from the third method (fine-tune pre-trained Word2Vec) at 82.22%. The best training accuracy is the second method (learn word embedding from scratch) at 90.52%. Using pre-trained Word2Vec without updating its vector values showed the lowest accuracy both in training and validation. However, what’s interesting is that in terms of training set accuracy, fine-tuning pre-trained word vectors couldn’t outperform the word embeddings learned from scratch through the embedding layer. Before I tried the above three methods, my first guess was that if I fine-tune the pre-trained word vectors, it would give me the best training accuracy.

Feeding pre-trained word vectors for an embedding layer to update is like providing the first initialisation guideline to the embedding layer so that it can learn more efficiently the task-specific word vectors. But the result is somewhat counterintuitive, and in this case, it turns out that it is better to force the embedding layer to learn from scratch.

But premature generalisation could be dangerous. For this reason, I will compare three methods again in the context of CNN.

Convolutional Neural Network

You might have already seen how a Convolutional Neural Network (CNN) works on image data. There are many good sources from which you can learn the basics of CNN. In my case, the blog post “A Beginner’s Guide To Understanding Convolutional Neural Networks” by Adit Deshpande really helped me grasp the concept. If you are not familiar with CNN, I highly recommend his article, so that you will have a firm understanding of it.

Now I will assume you have an understanding of CNN in case of image data. How can this be applied to text data then? Let’s say we have a sentence as follows:

“I love cats and dogs”

With word vectors (let’s assume we have 200-dimensional word vectors for each word), the above sentence can be represented as a 5X200 matrix, one row for each word. Remember that we added zeros to pad a sentence above, where we prepared the data to feed to an embedding layer? If our chosen sentence length is 45, then the above sentence will be a 45X200 matrix, but with all zeros in the first 40 rows. Keeping this in mind, let’s take a look at how CNN works on image data.

Image courtesy of machinelearninguru.com

In the above GIF, we have one filter (kernel matrix) of 3X3 dimension convolving over the data (image matrix): it calculates the sum of the element-wise multiplication and records the result on a feature map (output matrix). If we imagine each row of the data is a word in a sentence, then this would not learn efficiently, since the filter only looks at a part of a word vector at a time. The above CNN is a so-called 2D Convolutional Neural Network, since the filter moves in 2-dimensional space.

What we do with text data represented as word vectors is make use of a 1D Convolutional Neural Network. If a filter’s column width is the same as the data’s column width, then it has no room to stride horizontally and can only stride vertically. For example, if our sentence is represented as a 45X200 matrix, then a filter will also have 200 columns, and its number of rows (height) plays a role similar to the n in n-gram. If the filter height is 2, the filter will stride through the document computing the calculation above over all the bigrams; if the filter height is 3, it will go through all the trigrams in the document, and so on.

If a 2X200 filter is applied with a stride size of 1 to a 45X200 matrix, we will get a 44X1 dimensional output. In the case of 1D convolution, the output width will be just 1 in this case (number of filters = 1). The output height can be easily calculated with the formula below (assuming that your data is already padded).

Output height = (H - Fh) / S + 1

where

H: input data height

Fh: filter height

S: stride size
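As a quick check of the numbers in this section, the output height can be computed with a small helper. It assumes the standard relation for a convolution without extra padding, (H - Fh) / S + 1, rounded down when the stride does not divide evenly. conv_output_height is a hypothetical name.

```python
def conv_output_height(input_height, filter_height, stride=1):
    """Output rows of a 1D convolution that adds no extra padding."""
    return (input_height - filter_height) // stride + 1

# A 2X200 filter with stride 1 over a 45X200 sentence matrix.
print(conv_output_height(45, 2, 1))
# A trigram filter over the same input.
print(conv_output_height(45, 3, 1))
```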

Now let’s try to add more filters to our 1D Convolutional layer. If we apply 100 2X200 filters with stride size of 1 to 45X200 matrix, can you guess the output dimension?

As I have already mentioned in the above, now the output width will reflect the number of filters we apply, so the answer is we will have 44X100 dimension output. You can also check the dimensions of each output layer by looking at the model summary after you define the structure.

from keras.layers import Conv1D, GlobalMaxPooling1D
structure_test = Sequential()
e = Embedding(100000, 200, input_length=45)
structure_test.add(e)
structure_test.add(Conv1D(filters=100, kernel_size=2, padding='valid', activation='relu', strides=1))
structure_test.summary()

Now if we add a Global Max Pooling layer, the pooling layer will extract the maximum value from each filter, and the output will be just a 1-dimensional vector whose length is the same as the number of filters we applied. This can be directly passed on to a dense layer without flattening.

structure_test = Sequential()
e = Embedding(100000, 200, input_length=45)
structure_test.add(e)
structure_test.add(Conv1D(filters=100, kernel_size=2, padding='valid', activation='relu', strides=1))
structure_test.add(GlobalMaxPooling1D())
structure_test.summary()
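The shapes reported by the two model summaries can also be reproduced by hand. The NumPy sketch below mimics a bigram convolution without padding, followed by global max pooling, on random stand-in data. It leaves out the bias term and is not how Keras implements the layers; conv1d_valid is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
sentence = rng.normal(size=(45, 200))      # 45 words x 200-dim vectors (random stand-in)
filters = rng.normal(size=(100, 2, 200))   # 100 filters, each 2 words tall and 200 wide

def conv1d_valid(sentence, filters):
    """Slide every filter down the sentence one word at a time, then apply ReLU."""
    n_filters, height, _ = filters.shape
    n_windows = sentence.shape[0] - height + 1
    out = np.empty((n_windows, n_filters))
    for start in range(n_windows):
        window = sentence[start:start + height]            # one n-gram, shape (height, 200)
        out[start] = (filters * window).sum(axis=(1, 2))   # one number per filter
    return np.maximum(out, 0)

feature_map = conv1d_valid(sentence, filters)   # one row per bigram, one column per filter
pooled = feature_map.max(axis=0)                # global max pooling: strongest bigram per filter
print(feature_map.shape, pooled.shape)
```

Global max pooling keeps one number per filter no matter how long the sentence is, which is why the result can go straight into a dense layer.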

Now, let’s define a simple CNN going through bigrams on a tweet. The output from global max pooling layer will be fed to a fully connected layer, then finally the output layer. Again I will try three different inputs, static word vectors extracted from Word2Vec, word embedding being learned from scratch with embedding layer, Word2Vec word vectors being updated through training.

https://medium.com/media/e194bbac4bd8c0005cb78e9cbce8c897/href

https://medium.com/media/95430a05f2f7113413b0ede4faaeed7d/href

https://medium.com/media/f37e0ba247fc4aab6b4aeed8860531da/href

The best validation accuracy comes from the word vectors updated through training, at epoch 3, with a validation accuracy of 83.25%. Looking at the training loss and accuracy, it seems that the word embedding learned from scratch tends to overfit the training data, while feeding pre-trained word vectors as the weight initialisation generalises somewhat better and ends up with a higher validation accuracy.

But finally! I have a better result than Tf-Idf + logistic regression model! I have tried various different methods with Doc2Vec, Word2Vec in the hope of outperforming a simple logistic regression model with Tf-Idf input. You can take a look at the previous post for detail. Tf-Idf + logistic regression model’s validation accuracy was at 82.91%. And now I’m finally beginning to see a possibility of Word2Vec + neural network outperforming this simple model.

Let’s see if we can do better by defining a bit more elaborate model structure. The CNN architecture I will implement below is inspired by Zhang, Y., & Wallace, B. (2015) “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification”.

Image courtesy of Zhang, Y., & Wallace, B. (2015) “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification”

Basically, the above structure implements what we have done above with bigram filters, but applies it not only to bigrams but also to trigrams and fourgrams. However, these are not linearly stacked layers but parallel layers. After the convolutional layer and the max pooling layer, it simply concatenates the max-pooled results from each of the bigram, trigram, and fourgram branches, then builds one output layer on top of them.
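The parallel structure can be pictured with a shape-only NumPy sketch: three independent branches read the same sentence matrix, each one is max-pooled to a single vector, and the three vectors are concatenated. The filters are random, and 100 filters per branch is an assumption made for this example, not a value taken from the paper or from the model defined in this post.

```python
import numpy as np

rng = np.random.default_rng(0)
sentence = rng.normal(size=(45, 200))   # padded sentence, random stand-in values

def ngram_branch(sentence, height, n_filters, rng):
    """One branch: convolution over n-grams of `height` words, ReLU, global max pooling."""
    filters = rng.normal(size=(n_filters, height, sentence.shape[1]))
    n_windows = sentence.shape[0] - height + 1
    windows = np.stack([sentence[i:i + height] for i in range(n_windows)])
    feature_map = np.maximum(np.einsum("whd,fhd->wf", windows, filters), 0)
    return feature_map.max(axis=0)

# Bigram, trigram and fourgram branches all read the same input.
branches = [ngram_branch(sentence, h, 100, rng) for h in (2, 3, 4)]
merged = np.concatenate(branches)
print(merged.shape)
```

Because every branch is pooled down to one value per filter, the branches can be concatenated even though their feature maps have different heights (44, 43 and 42 rows).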

The model I defined below is basically the same as the above picture. The differences are that I added one fully connected hidden layer with dropout just before the output layer, and that my output layer has just one output node with Sigmoid activation instead of two.

There is also another famous paper by Y. Kim(2014), “Convolutional Neural Networks for Sentence Classification”. https://arxiv.org/pdf/1408.5882.pdf

In this paper, he implemented a more sophisticated approach by making use of the “channel” concept. Not only does the model go through different n-grams, it also has multiple channels (e.g. one channel for static input word vectors, and another channel for input word vectors that are updated during training). But in this post, I will not go through the multi-channel approach.

So far I have only used the Sequential model API of Keras, and this worked fine with all the previous models I defined above, since the structures of the models were only linearly stacked. But as you can see from the above picture, the model I am about to define has parallel layers which take the same input and do their own computation, and then the results are merged. For this kind of neural network structure, we can use the Keras functional API.

The Keras functional API can handle multi-input, multi-output, shared layers, shared input, etc. It is not impossible to define these types of models with the Sequential API, but when you want to save the trained model, the functional API lets you simply save and load the model, whereas with the Sequential API it is difficult.

https://medium.com/media/9f51d2b7a028a239923b6c9715aad064/href

from keras.callbacks import ModelCheckpoint

filepath="CNN_best_weights.{epoch:02d}-{val_acc:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

model.fit(x_train_seq, y_train, batch_size=32, epochs=5,
                     validation_data=(x_val_seq, y_validation), callbacks = [checkpoint])

from keras.models import load_model
loaded_CNN_model = load_model('CNN_best_weights.02-0.8333.hdf5')
loaded_CNN_model.evaluate(x=x_val_seq, y=y_validation)

The best validation accuracy is 83.33%, slightly better than the simple CNN model with bigram filters, which yielded 83.25% validation accuracy. I could even define a deeper structure with more hidden layers, or even make use of multi-channel approach that Yoon Kim(2014) has implemented, or try different pool size to see how the performance differs, but I will stop here for now. However if you happen to try more complex CNN structure, and get the result, I would love to hear about it.

Final Model Evaluation with Test Set

So far I have tested the model on the validation set to decide the feature extraction tuning and model comparison. Now I will finally check the final result with the test set. I will compare two different models: 1. Tf-Idf + logistic regression, 2. Word2Vec + CNN. As another measure for comparison, I will also plot ROC curve of both models.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
tvec = TfidfVectorizer(max_features=100000,ngram_range=(1, 3))
tvec.fit(x_train)
x_train_tfidf = tvec.transform(x_train)
x_test_tfidf = tvec.transform(x_test)
lr_with_tfidf = LogisticRegression()
lr_with_tfidf.fit(x_train_tfidf,y_train)
yhat_lr = lr_with_tfidf.predict_proba(x_test_tfidf)
lr_with_tfidf.score(x_test_tfidf,y_test)

sequences_test = tokenizer.texts_to_sequences(x_test)
x_test_seq = pad_sequences(sequences_test, maxlen=45)
yhat_cnn = loaded_CNN_model.predict(x_test_seq)
loaded_CNN_model.evaluate(x=x_test_seq, y=y_test)

https://medium.com/media/0eca8c08ba43a764c814238f4ce5038c/href
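For readers who want to see what goes into a ROC curve, here is a minimal NumPy sketch that computes the curve points and the area under the curve from true labels and predicted scores. It is not the plotting code embedded above; roc_points is a hypothetical helper that ignores tied scores, and the four labels and scores are a toy example. With scikit-learn's predict_proba output, the scores would be the column for the positive class.

```python
import numpy as np

def roc_points(y_true, scores):
    """False and true positive rates as the threshold sweeps from high to low."""
    order = np.argsort(-scores)                # highest score first
    y_sorted = y_true[order]
    true_pos = np.cumsum(y_sorted)             # positives caught so far
    false_pos = np.cumsum(1 - y_sorted)        # negatives wrongly flagged so far
    tpr = np.concatenate([[0.0], true_pos / float(true_pos[-1])])
    fpr = np.concatenate([[0.0], false_pos / float(false_pos[-1])])
    return fpr, tpr

y_true = np.array([0, 0, 1, 1])                # toy labels
scores = np.array([0.1, 0.4, 0.35, 0.8])       # toy predicted probabilities

fpr, tpr = roc_points(y_true, scores)
# Area under the curve with the trapezoid rule.
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)
print(auc)
```

Plotting fpr against tpr for each model on the same axes gives the usual ROC comparison.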

And the final result is as below.

Thank you for reading. You can find the Jupyter Notebook from the below link.

https://github.com/tthustla/twitter_sentiment_analysis_part11/blob/master/Capstone_part11.ipynb

Another Twitter sentiment analysis with Python — Part 11 (CNN + Word2Vec) was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.