# Spam detection using neural networks in Python

Neural networks are powerful machine learning algorithms. They can be used to transform the features so as to form fairly complex nonlinear decision boundaries. They are commonly used for classification problems.

A common classification problem is spam identification. We are given a set of emails (in raw or processed form) along with a label for each one (spam or not spam). We are then given a set of new emails (the test data) and have to label each email in the test set as spam or not spam.

You can read about neural networks and how they work on Wikipedia.

We have created a 3 layer neural network for spam detection:

• The first layer is the input layer which has 57 nodes, 1 node for each feature of the email
• The second layer is the middle (hidden) layer which has 4 nodes
• The final layer is the output layer which has just 1 node

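The input layer takes in the 57 features of the email as a vector and passes it to the middle layer, where each of the 4 nodes computes a weighted sum of the inputs and applies a sigmoid. Finally, the output layer does the same with the middle layer's values and outputs a real number in the interval (0, 1), which can be read as an estimate of the probability that the email is spam.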
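To make the shapes concrete, here is a minimal sketch of a single forward pass through a 57 → 4 → 1 network. The input vector and the weights are random placeholders (assumptions for illustration, not values from the dataset or from a trained model); only the layer sizes and the sigmoid come from the description above.

```python
import numpy as np


def sigmoid(x):
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)

# Hypothetical stand-in for one email: 57 made-up feature values.
email = rng.normal(size=(1, 57))

# Untrained, randomly initialized weights in [-1, 1).
weight0 = 2 * rng.random((57, 4)) - 1  # input layer -> middle layer
weight1 = 2 * rng.random((4, 1)) - 1   # middle layer -> output layer

hidden = sigmoid(email @ weight0)   # shape (1, 4)
output = sigmoid(hidden @ weight1)  # shape (1, 1)

print(hidden.shape, output.shape)
print(0.0 < output[0, 0] < 1.0)
```

Because the weights here are untrained, the output value itself is meaningless; the point is only that a 57-dimensional input ends up as a single number strictly between 0 and 1.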

Here is a link to a pre-processed email dataset (make sure to save it with the name ‘Train.csv’).

To get an idea of what each column in the dataset means, have a look at this link

Here is the Python code for our spam detector:

```python
from sklearn import preprocessing
import numpy as np


def derivative(x):
    # derivative of the sigmoid, written in terms of its output x = sigmoid(z)
    return x * (1.0 - x)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


X = []
Y = []

# read the training data
with open('Train.csv') as f:
    for line in f:
        curr = line.split(',')
        new_curr = [1]  # constant column prepended to the 57 features
        for item in curr[:-1]:
            new_curr.append(float(item))
        X.append(new_curr)
        Y.append([float(curr[-1])])

X = np.array(X)
# feature scaling (note: this turns the constant column into all zeros)
X = preprocessing.scale(X)
Y = np.array(Y)

# the first 2500 out of 3000 emails will serve as training data
X_train = X[0:2500]
Y_train = Y[0:2500]

# the remaining 500 emails will serve as testing data
X_test = X[2500:]
Y_test = Y[2500:]

# we have 3 layers: input layer, hidden layer and output layer
# input layer has 57 nodes (1 for each feature), plus the constant column
# hidden layer has 4 nodes
# output layer has 1 node
dim1 = len(X_train[0])
dim2 = 4

# randomly initialize the weight matrices with values in [-1, 1)
np.random.seed(1)
weight0 = 2 * np.random.random((dim1, dim2)) - 1
weight1 = 2 * np.random.random((dim2, 1)) - 1

# you can change the number of iterations
for j in range(25000):
    # first evaluate the output for each training email
    layer_0 = X_train
    layer_1 = sigmoid(np.dot(layer_0, weight0))
    layer_2 = sigmoid(np.dot(layer_1, weight1))

    # calculate the error
    layer_2_error = Y_train - layer_2

    # perform back propagation
    layer_2_delta = layer_2_error * derivative(layer_2)
    layer_1_error = layer_2_delta.dot(weight1.T)
    layer_1_delta = layer_1_error * derivative(layer_1)

    # update the weight matrices
    weight1 += layer_1.T.dot(layer_2_delta)
    weight0 += layer_0.T.dot(layer_1_delta)

# evaluation on the testing data
layer_0 = X_test
layer_1 = sigmoid(np.dot(layer_0, weight0))
layer_2 = sigmoid(np.dot(layer_1, weight1))

correct = 0

# if the output is > 0.5, then label as spam else no spam
for i in range(len(layer_2)):
    if layer_2[i][0] > 0.5:
        layer_2[i][0] = 1
    else:
        layer_2[i][0] = 0
    if layer_2[i][0] == Y_test[i][0]:
        correct += 1

# printing the output
print("total = ", len(layer_2))
print("correct = ", correct)
print("accuracy = ", correct * 100.0 / len(layer_2))
```

We have used the standard backpropagation algorithm to train the neural network.
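One way to check that the weight updates in the training loop really are backpropagation is a numerical gradient check. The sketch below reads the update as one step of gradient descent on half the summed squared error with a step size of 1 (that reading is an assumption about the intent; the code itself never names a loss). The tiny dataset, the layer sizes, and the helper names `loss` and `backprop_updates` are made up for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def loss(X, Y, w0, w1):
    """Half the sum of squared errors over all examples."""
    out = sigmoid(sigmoid(X @ w0) @ w1)
    return 0.5 * np.sum((Y - out) ** 2)


def backprop_updates(X, Y, w0, w1):
    """Weight updates computed the same way as in the training loop."""
    layer_1 = sigmoid(X @ w0)
    layer_2 = sigmoid(layer_1 @ w1)
    layer_2_delta = (Y - layer_2) * layer_2 * (1.0 - layer_2)
    layer_1_delta = (layer_2_delta @ w1.T) * layer_1 * (1.0 - layer_1)
    return X.T @ layer_1_delta, layer_1.T @ layer_2_delta


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                        # made-up inputs
Y = rng.integers(0, 2, size=(5, 1)).astype(float)  # made-up 0/1 labels
w0 = 2 * rng.random((3, 4)) - 1
w1 = 2 * rng.random((4, 1)) - 1

update0, update1 = backprop_updates(X, Y, w0, w1)

# Central-difference estimate of d(loss) / d(w1[0, 0]).
eps = 1e-6
w1_plus, w1_minus = w1.copy(), w1.copy()
w1_plus[0, 0] += eps
w1_minus[0, 0] -= eps
numeric_grad = (loss(X, Y, w0, w1_plus) - loss(X, Y, w0, w1_minus)) / (2 * eps)

# The update should equal the negative gradient (a downhill step).
print(np.isclose(update1[0, 0], -numeric_grad, atol=1e-6))
```

If the two values agree, adding the update to the weights moves them in the direction that reduces the squared error, which is exactly what the training loop does on every iteration.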

This simple network achieves an accuracy of 90% on the 500 test emails, which is a great start!
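The accuracy figure comes from thresholding each output at 0.5 and counting matches with the true labels. As a rough sketch, the same computation can be written in vectorized NumPy; the function name `evaluate` and the four scores and labels below are invented for illustration and are not results from the dataset.

```python
import numpy as np


def evaluate(scores, labels, threshold=0.5):
    """Threshold the network outputs and count correct predictions."""
    predictions = (scores > threshold).astype(float)
    correct = int(np.sum(predictions == labels))
    accuracy = 100.0 * correct / len(labels)
    return correct, accuracy


# Made-up network outputs and true labels (1 = spam, 0 = not spam).
scores = np.array([[0.9], [0.2], [0.6], [0.4]])
labels = np.array([[1.0], [0.0], [0.0], [1.0]])

correct, accuracy = evaluate(scores, labels)
print(correct, accuracy)  # 2 50.0
```

Unlike the evaluation loop, this version leaves the raw scores untouched, so they can be reused with a different threshold.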

Happy coding!