Identify Phishing & Spam Messages with Watson Natural Language Classifier
Update (5/31): Added a phishing email classifier demo and GitHub Repo
In this article, we will reference two code patterns, one by Carmine DiMascio and the other by Scott D’Angelo, to see how we can classify text messages and e-mails, respectively. Using IBM’s Watson Natural Language Classifier, we can create a simple way to classify e-mails and text messages.
SPAM? Not to be mistaken for the canned meat, these unwanted texts, tweets, and emails come at a cost. Enterprises around the globe deal with spam messages; the Radicati Research Group stated that spam costs businesses $20.5B annually due to decreased productivity. Although traditional spam classifiers exist, the key differentiators required for enterprise solutions are scalability and the ability to own your own data.
Phishing e-mails
Visit the GitHub repo to get an overview of how you can replicate this app. Curious to test the accuracy? Try the demo.
Spam messages
Follow along below or visit the GitHub repo to build your text message classifier.
Note: For any classifier, training data is key. More Data = Higher Accuracy. If you happen to have your own dataset, follow the steps below with your own data.
Make Your Message Classifier
Note: To create an e-mail classifier, you can follow the overall process, but we will be changing the training data used.
To get started, you will need:
- An IBM Cloud account: Log in or create a free one
- A development environment supporting UNIX commands (Linux, MacOS, etc.)
- A plain text editor
Head over to the IBM Cloud Catalog and select the Watson Natural Language Classifier.
Let’s call our instance “Spam Classifier” and provision it by clicking the blue Create button.
Once you create your initial “Spam Classifier”, take note of your username and password. It will come in handy when we start training the classifier.
1. Copying your environment
Let’s copy over Carmine’s WatsonNLCSpam project repository:
- To clone the Git repo: In your terminal, enter the following command:
git clone https://github.com/cdimascio/watson-nlc-spam
For the e-mail classifier:
- To clone the Git repo: In your terminal, enter the following command:
git clone https://github.com/IBM/nlc-email-phishing
Review the 4 repository contents:
- README.md describes the project
- web contains the source code for the sample web application
- spam.py will perform a basic accuracy test
- data contains the training and test data sets, SpamHam-Train.csv and SpamHam-Test.json, respectively
The training and testing data
When creating a classifier, you need both training and test data. In the GitHub repo, SpamHam-Train.csv contains 90% of the original data. The remaining 10% is the test data.
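If you bring your own dataset, the 90/10 split described above can be sketched as follows. This is a minimal illustration, not the procedure used to build the repo's files; the helper name `split_dataset`, the fixed seed, and the generated sample rows are all assumptions made for the example.

```python
import random


def split_dataset(rows, train_fraction=0.9, seed=42):
    """Shuffle the rows and split them into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical labeled rows standing in for a real dataset.
rows = [("message %d" % i, "spam" if i % 5 == 0 else "ham") for i in range(100)]
train_rows, test_rows = split_dataset(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before cutting matters when the source data is sorted by label; otherwise the test set could end up containing only one class.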
When training Watson Natural Language Classifier, training data must be presented in a .csv file. The file can hold up to 20,000 rows of training data, and each row should contain two or more columns: the text, followed by one or more labels (text,label).
Example: SpamHam-Train.csv:
"=Bring home some Wendys =D",ham
"100 dating service cal;l 09064012103 box334sk38ch",spam
In this example, we will be training a binary classifier. Each string of text is associated with one of two classes: Spam or Ham. Spam refers to unwanted text, while Ham is the opposite.
Note: Watson Natural Language Classifier does support multi-class classification.
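Before uploading your own data, it can help to check that the file matches the text,label layout described above. The sketch below is one possible pre-flight check using Python's standard csv module; the function name `check_training_csv` is hypothetical, the row limit is the figure quoted in this article, and the third sample row is deliberately malformed for illustration.

```python
import csv
import io

MAX_ROWS = 20000  # row limit quoted in this article


def check_training_csv(handle):
    """Return a list of problems found in a text,label CSV file."""
    problems = []
    count = 0
    for line_no, row in enumerate(csv.reader(handle), start=1):
        count += 1
        if len(row) < 2:
            problems.append("row %d: expected text plus at least one label" % line_no)
        elif not row[0].strip():
            problems.append("row %d: empty text" % line_no)
    if count > MAX_ROWS:
        problems.append("file has %d rows; limit is %d" % (count, MAX_ROWS))
    return problems


# Two rows from the article's example plus one row with a missing label.
sample = io.StringIO(
    '"=Bring home some Wendys =D",ham\n'
    '"100 dating service cal;l 09064012103 box334sk38ch",spam\n'
    '"a row with no label"\n'
)
problems = check_training_csv(sample)
print(problems)
```

To check a real file, pass `open("SpamHam-Train.csv", newline="")` instead of the in-memory sample.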
2. Create and train
Now that we have the GitHub repo cloned and our IBM Cloud account set up, let’s train our classifier. Open the terminal window and enter the following curl command:
Note: It will be easier to copy and paste this command into the terminal window than to type it out. Add your username, password, and url to the code below.
curl -X POST -u username:password -F training_data=@SpamHam-Train.csv -F training_metadata="{\"language\":\"en\",\"name\":\"My Classifier\"}" "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers"
Pro Tip: Training the classifier can take up to 30 minutes. Go and grab a snack or see another cool NLC demo application. Before you step away, take note of the classifier_id. We'll use it shortly.
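The escaped quotes in the training_metadata argument are easy to get wrong when typing. As a hedged alternative, the same curl invocation can be assembled in Python, letting json.dumps produce the metadata string. The helper name `build_train_command` is hypothetical, and the credentials shown are the same placeholders used in the command above.

```python
import json
import shlex


def build_train_command(username, password, url, csv_path, name, language="en"):
    """Build the curl argument list for the training request."""
    # json.dumps handles the quoting that the shell version escapes by hand.
    metadata = json.dumps({"language": language, "name": name})
    return [
        "curl", "-X", "POST",
        "-u", "%s:%s" % (username, password),
        "-F", "training_data=@%s" % csv_path,
        "-F", "training_metadata=%s" % metadata,
        "%s/v1/classifiers" % url.rstrip("/"),
    ]


cmd = build_train_command(
    "username",
    "password",
    "https://gateway.watsonplatform.net/natural-language-classifier/api",
    "SpamHam-Train.csv",
    "My Classifier",
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it without going through a shell, so no extra escaping is needed.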
3. Validate Training
Now that you’ve had a snack and watched some YouTube videos, let’s see if our classifier is ready to use. In your terminal window, call the Watson Natural Language Classifier endpoint by using the following GET request*:
curl -u username:password <url>/v1/classifiers/<classifier-id>
*Pro Tip: Remember to add your username, password, url, and classifier-id to the code above. Once done, copy and paste it into the terminal window.
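Rather than re-running the GET request by hand, the check can be repeated in a loop until training finishes. The sketch below assumes the response JSON carries a status field whose values include "Training", "Available", and "Failed"; confirm these against the service documentation. The function name `wait_until_available` and the `fetch_status` callable are hypothetical, and the demo uses canned statuses instead of a real network call.

```python
import time


def wait_until_available(fetch_status, interval_seconds=60, max_attempts=40,
                         sleep=time.sleep):
    """Poll fetch_status() until it reports "Available".

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        status = fetch_status()
        if status == "Available":
            return attempt
        if status == "Failed":
            raise RuntimeError("classifier training failed")
        sleep(interval_seconds)
    raise TimeoutError("classifier not available after %d checks" % max_attempts)


# Demo with canned statuses; a real fetch_status would issue the GET request
# shown above and return the "status" value from the JSON body.
canned = iter(["Training", "Training", "Available"])
attempts = wait_until_available(lambda: next(canned), sleep=lambda seconds: None)
print(attempts)
```

Injecting `fetch_status` and `sleep` keeps the loop testable without credentials or waiting.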
4. Try out the spam classifier
And we’re (almost) done! Let’s see if everything works. Send the following POST request to the /classify endpoint:
curl -X POST -u username:password -H "Content-Type:application/json" -d "{\"text\":\"I love you mom\"}" <url>/v1/classifiers/<classifier_id>/classify
Pro Tip: If you forgot the classifier ID, you can retrieve it by invoking the following endpoint with this curl command. This endpoint will return a list that contains all your classifiers:
curl -u username:password <url>/v1/classifiers
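The /classify endpoint answers with JSON. Assuming the response contains a top_class field and a classes list of class_name/confidence pairs, the winning label and its confidence can be pulled out as sketched below. The sample response and its confidence values are made up for illustration and are not real service output.

```python
import json

# Illustrative response body; the values are invented for this example.
sample_response = """
{
  "classifier_id": "<classifier_id>",
  "text": "I love you mom",
  "top_class": "ham",
  "classes": [
    {"class_name": "ham", "confidence": 0.98},
    {"class_name": "spam", "confidence": 0.02}
  ]
}
"""


def top_prediction(response_text):
    """Return (label, confidence) for the highest-ranked class."""
    body = json.loads(response_text)
    confidences = {c["class_name"]: c["confidence"] for c in body["classes"]}
    return body["top_class"], confidences[body["top_class"]]


label, confidence = top_prediction(sample_response)
print(label, confidence)
```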
5. Testing accuracy
Remember the sample data we saved in the beginning? Let’s use the provided Python script, spam.py, to test our classifier. The script invokes the request described in the previous step and counts the number of classified predictions that match the label. Accuracy is calculated by taking the number of correct predictions and dividing by the total number of test observations.
- Open spam.py and update the following with your specific information: YOUR_CLASSIFIER_ID, YOUR_CLASSIFIER_USERNAME, and YOUR_CLASSIFIER_PASSWORD
- In the project directory or terminal window, run the following command:
python spam.py
When the script completes, you should see the following output:
accuracy: 0.993079584775
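The accuracy figure is simply correct predictions divided by total test observations. The sketch below shows that calculation in isolation; it is not the code from spam.py. The `classify` argument stands in for a call to the /classify endpoint, and the keyword-based stub and the four test examples are hypothetical.

```python
def accuracy(examples, classify):
    """Fraction of (text, label) examples whose prediction matches the label."""
    correct = sum(1 for text, label in examples if classify(text) == label)
    return float(correct) / len(examples)


def stub_classify(text):
    # Stand-in for the real service call: flags any message mentioning "free".
    return "spam" if "free" in text.lower() else "ham"


# Hypothetical test observations.
examples = [
    ("Free entry now", "spam"),
    ("See you at lunch", "ham"),
    ("Call this number to claim your prize", "spam"),
    ("I love you mom", "ham"),
]
score = accuracy(examples, stub_classify)
print("accuracy:", score)
```

The stub misses the third message, so it scores three out of four here; swapping in a function that calls the trained classifier gives the real figure.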
Conclusion
Curious about what else you can do with this application? Check out additional sample use cases catered to industry verticals.
IBM’s Watson Natural Language Classifier delivers a powerful, scalable, and secure classification solution for users. Now that you’ve had a chance to build, train, and test a classifier, use your own data and see what you can do!
Useful Links & Blogs
Watson Natural Language Classifier Website | NLC Best Practices Guide | New Product Announcements | Watson YouTube Channel | Slack Community