Identify Phishing & Spam Messages with Watson Natural Language Classifier

Photo credit: Christiaan Colen on Visual hunt / CC BY-SA

Update (5/31): Added a phishing email classifier demo and GitHub Repo

SPAM? Not to be mistaken for the canned meat, the unwanted texts, tweets, and emails that we receive come at a cost. Enterprises around the globe deal with spam messages; the Radicati Research Group stated that spam costs businesses $20.5B annually in lost productivity. Although traditional spam classifiers exist, the key differentiators required for enterprise solutions are scalability and the ability to own your own data.

So what can you do?

In this article, we will reference two code patterns, one by Carmine DiMascio and the other by Scott D’Angelo, to see how we can classify text messages and e-mails, respectively, using IBM’s Watson Natural Language Classifier.

Phishing e-mails

Visit the GitHub repo to get an overview of how you can replicate this app. Curious to test the accuracy? Try the demo.

Spam messages

Follow along below or visit the GitHub repo to build your text message classifier.

Try out the live web demo

Note: For any classifier, training data is key: more data generally means higher accuracy. If you happen to have your own dataset, follow the steps below with your own data.


Make Your Message Classifier

Note: To create an e-mail classifier, you can follow the same overall process with different training data.

To get started, you will need:

  • An IBM Cloud account: Log in or create a free one
  • A development environment supporting UNIX commands (Linux, macOS, etc.)
  • A plain text editor

Selecting the Natural Language Classifier Service

Head over to the IBM Cloud Catalog and select the Watson Natural Language Classifier.

Creating your own instance of Natural Language Classifier

Let's call our instance “Spam Classifier” and provision it by clicking the blue Create button.

Once you create your initial “Spam Classifier”, take note of your username and password. They will come in handy when we start training the classifier.

1. Copy your environment

Screenshot of Carmine’s Spam Classifier Demo (Click here and try it out)

Let's copy over Carmine’s WatsonNLCSpam project repository:

For the e-mail classifier:

Review the four items in the repository:

  • README.md describes the project
  • web contains the source code for the sample web application
  • spam.py will perform a basic accuracy test
  • data contains the training and test data sets, SpamHam-Train.csv and SpamHam-Test.json, respectively.

The training and testing data

When creating a classifier, you need both training data and test data. In the GitHub repo, SpamHam-Train.csv contains 90% of the original data. The remaining 10% is held back as test data in SpamHam-Test.json.
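As a rough illustration of the 90/10 split described above, the following sketch shuffles labeled rows and divides them into training and test sets. The function name split_dataset, the fixed seed, and the sample rows are assumptions made for illustration; the files in the repo were prepared by its author and may have been split differently.

```python
import random

def split_dataset(rows, train_fraction=0.9, seed=42):
    """Shuffle (text, label) rows and split them into train and test lists."""
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]

# Hypothetical sample data: 10 ham and 10 spam messages.
rows = [(f"message {i}", "ham" if i % 2 == 0 else "spam") for i in range(20)]
train, test = split_dataset(rows)
print(len(train), len(test))
```

Shuffling before the cut keeps the test set from being dominated by whatever happens to sit at the end of the original file.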

When training Watson Natural Language Classifier, training data must be presented in a .csv file. The file can hold up to 20,000 rows of training data, and each row should contain two or more columns, with the text first and its label after it: text,label.

Example: SpamHam-Train.csv:

"=Bring home some Wendys =D",ham
"100 dating service cal;l 09064012103 box334sk38ch",spam

In this example, we will be training a binary classifier. Each string of text is associated with one of two classes: spam or ham. Spam refers to unwanted text, while ham is the opposite: a legitimate message.

Note: Watson Natural Language Classifier does support multi-class classification.
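Before uploading, it can help to sanity-check the training file against the format described above (text,label rows, up to 20,000 of them). The sketch below is one way to do that with Python's csv module. check_training_csv is a hypothetical helper, not part of the repo, and the row limit is the figure quoted in this article rather than one verified against current service documentation.

```python
import csv
import io

MAX_ROWS = 20000  # limit quoted in this article (assumption)

def check_training_csv(lines):
    """Return a list of problems found in text,label training rows."""
    problems = []
    count = 0
    for count, row in enumerate(csv.reader(lines), start=1):
        if len(row) < 2:
            problems.append(f"row {count}: expected text and label, got {row!r}")
        elif not row[0].strip() or not row[-1].strip():
            problems.append(f"row {count}: empty text or label")
    if count > MAX_ROWS:
        problems.append(f"{count} rows exceeds the {MAX_ROWS}-row limit")
    return problems

# Two rows from the article's example plus one deliberately broken row.
sample = io.StringIO(
    '"=Bring home some Wendys =D",ham\n'
    '"100 dating service cal;l 09064012103 box334sk38ch",spam\n'
    '"a row with no label"\n'
)
print(check_training_csv(sample))
```

To check a real file, pass an open file handle instead of the StringIO object, for example open("SpamHam-Train.csv", newline="").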

2. Create and train

Now that we have the GitHub repo cloned and our IBM Cloud account set up, let's train our classifier. Open a terminal window and enter the following curl command:

Note: It will be easier to copy and paste this command into the terminal window than to type it. Replace username, password, and the URL in the command below with your own values.

curl -X POST -u username:password -F training_data=@SpamHam-Train.csv -F training_metadata="{\"language\":\"en\",\"name\":\"My Classifier\"}" "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers"

Pro Tip: Training the classifier can take up to 30 minutes. Go and grab a snack or see another cool NLC demo application. Before you step away, take note of the classifier_id in the response. We'll use it shortly.
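Escaping the JSON metadata by hand is the most error-prone part of the training command. As a sketch, the same request can be assembled in Python, letting json.dumps produce the metadata and shlex.join handle shell quoting. The helper name build_train_command is an assumption for illustration; the curl flags and the endpoint are the ones shown in the command above.

```python
import json
import shlex

def build_train_command(username, password, csv_path, name, url, language="en"):
    """Build the argument list for the training request shown above."""
    metadata = json.dumps({"language": language, "name": name})
    return [
        "curl", "-X", "POST",
        "-u", f"{username}:{password}",
        "-F", f"training_data=@{csv_path}",
        "-F", f"training_metadata={metadata}",
        f"{url}/v1/classifiers",
    ]

cmd = build_train_command(
    "username", "password", "SpamHam-Train.csv", "My Classifier",
    "https://gateway.watsonplatform.net/natural-language-classifier/api",
)
print(shlex.join(cmd))  # shell-quoted form that can be pasted into a terminal
# To send the request from Python instead: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string means no manual backslash escaping is needed, whichever way it is run.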

3. Validate Training

Now that you’ve had a snack and watched some YouTube videos, let's see if our classifier is ready to use. In your terminal window, call the Watson Natural Language Classifier endpoint by using the following GET request*. The classifier is ready once the response reports a status of Available:

curl -u username:password <url>/v1/classifiers/<classifier-id>

*Pro Tip: Remember to add your username, password, url, and classifier-id to the command above. Once done, copy and paste it into the terminal window.
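The status check above could also be scripted. The following sketch builds the same GET request with Python's standard library and interprets a parsed response. It assumes the response JSON carries a status field that reads "Training" while training is in progress and "Available" once it finishes; the helper names and the sample responses are hypothetical.

```python
import base64
import urllib.request

def build_status_request(url, classifier_id, username, password):
    """Build (but do not send) the GET request for a classifier's status."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(f"{url}/v1/classifiers/{classifier_id}")
    request.add_header("Authorization", f"Basic {token}")
    return request

def is_ready(status_response):
    """True when the parsed JSON response reports the classifier as trained."""
    return status_response.get("status") == "Available"

# Illustrative response bodies, not captured from a real service.
print(is_ready({"classifier_id": "abc123", "status": "Training"}))
print(is_ready({"classifier_id": "abc123", "status": "Available"}))
# To send the request: json.load(urllib.request.urlopen(request))
```

Wrapping the call in a loop with a sleep of a minute or so between attempts would turn this into a simple poller.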

4. Try out the spam classifier

And we’re (almost) done! Let’s see if everything works. Send the following POST request to the /classify endpoint:

curl -X POST -u username:password -H "Content-Type:application/json" -d "{\"text\":\"I love you mom\"}" <url>/v1/classifiers/<classifier_id>/classify
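The response to this request is JSON. As a sketch of how it might be read, the snippet below pulls out the winning class and its confidence. It assumes the response contains a top_class field and a classes list of class_name and confidence pairs; the sample body and its confidence values are made up for illustration.

```python
def summarize_classification(response):
    """Return (top_class, confidence) from a parsed /classify response."""
    top = response["top_class"]
    confidence = next(
        c["confidence"] for c in response["classes"] if c["class_name"] == top
    )
    return top, confidence

# Illustrative response body; the confidence values are invented.
sample_response = {
    "text": "I love you mom",
    "top_class": "ham",
    "classes": [
        {"class_name": "ham", "confidence": 0.98},
        {"class_name": "spam", "confidence": 0.02},
    ],
}
print(summarize_classification(sample_response))
```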

Pro Tip: If you forgot the classifier ID, you can retrieve it with the following curl command. This endpoint returns a list of all your classifiers:

curl -u username:password <url>/v1/classifiers
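If the account holds several classifiers, the ID can be picked out of the list by name. The sketch below assumes the list response wraps its entries in a classifiers array, each with classifier_id and name fields; the helper name and the sample IDs are hypothetical.

```python
def find_classifier_ids(list_response, name):
    """Return the IDs of all classifiers whose name matches."""
    return [
        c["classifier_id"]
        for c in list_response.get("classifiers", [])
        if c.get("name") == name
    ]

# Illustrative response body with invented IDs.
sample_list = {
    "classifiers": [
        {"classifier_id": "id-one", "name": "My Classifier"},
        {"classifier_id": "id-two", "name": "Another Classifier"},
    ]
}
print(find_classifier_ids(sample_list, "My Classifier"))
```

The function returns a list because nothing in the training request prevents two classifiers from sharing a name.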

5. Testing accuracy

Remember the test data we set aside in the beginning? Let’s use the provided Python script, spam.py, to test our classifier. The script sends the request described in the previous step for each test observation and counts the number of predictions that match the label. Accuracy is calculated by dividing the number of correct predictions by the total number of test observations.

  • Open spam.py and update the following with your specific information: YOUR_CLASSIFIER_ID, YOUR_CLASSIFIER_USERNAME, and YOUR_CLASSIFIER_PASSWORD
  • In the project directory, run the following command in your terminal window:

python spam.py

When the script completes, you should see output similar to the following:

accuracy: 0.993079584775
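The accuracy calculation described above can be sketched as follows. This is not the code from spam.py: the classify argument is a hypothetical stand-in for the call to the /classify endpoint, and the test observations are invented for illustration.

```python
def accuracy(observations, classify):
    """Fraction of (text, label) observations whose prediction matches."""
    if not observations:
        raise ValueError("no test observations")
    correct = sum(1 for text, label in observations if classify(text) == label)
    return correct / len(observations)

def keyword_classifier(text):
    """Toy stand-in for the Watson call: flags messages mentioning 'free'."""
    return "spam" if "free" in text.lower() else "ham"

# Hypothetical test observations; the last one is misclassified by the toy rule.
test_observations = [
    ("Win a FREE cruise now", "spam"),
    ("See you at dinner", "ham"),
    ("Free entry in a weekly draw", "spam"),
    ("Call this number to claim your prize", "spam"),
]
print(accuracy(test_observations, keyword_classifier))
```

Three of the four toy predictions match their labels, so the sketch reports 0.75; a trained classifier would be swapped in for keyword_classifier.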

Conclusion

Curious about what else you can do with this application? Check out additional sample use cases catered to industry verticals.

IBM’s Watson Natural Language Classifier delivers a powerful, scalable, and secure classification solution for users. Now that you’ve had a chance to build, train, and test a classifier, use your own data and see what you can do!