Python code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load the labelled dataset (v1 = label, v2 = message text);
# some copies of this dataset may need encoding='latin-1'
data = pd.read_csv("D:/JJ/4_Email Spam Detection with Machine Learning/archive (3)/spam.csv")

# Separate the text and labels, encoding 'ham' as 0 and 'spam' as 1
X = data['v2']
y = data['v1'].map({'ham': 0, 'spam': 1})
X = X.tolist()

# Convert the raw text into token counts, then into TF-IDF features
count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(X)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Train a Multinomial Naive Bayes classifier and predict on the test set
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Visualize the confusion matrix as a heatmap
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)

# Try the model on a few sample messages
sample_emails = [
    "Go until jurong point, crazy.. Available only in bugis n great world la e buffet...",
    "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
    "I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today."
]
sample_emails_counts = count_vectorizer.transform(sample_emails)
sample_emails_tfidf = tfidf_transformer.transform(sample_emails_counts)
sample_predictions = classifier.predict(sample_emails_tfidf)
sample_predictions = ['spam' if pred == 1 else 'ham' for pred in sample_predictions]
print("Sample Email Predictions:", sample_predictions)
Here’s a detailed explanation of each part of the code:
1. Importing Libraries: Importing the necessary Python libraries: numpy and pandas for data manipulation, matplotlib.pyplot for data visualization, seaborn for more polished plots, and the relevant modules from sklearn for machine learning.
2. Loading the Dataset: Loading the email spam dataset from a CSV file using pandas. The dataset is read into a DataFrame.
3. Data Preprocessing:
- Extracting the email text data (‘v2’) and labels (‘v1’) from the DataFrame.
- Converting the labels ‘ham’ and ‘spam’ into binary values (0 for ‘ham’ and 1 for ‘spam’).
- Converting the email text data to a list of strings to prepare it for feature extraction.
4. Feature Extraction (CountVectorizer): Using the CountVectorizer from sklearn to convert the text data into a matrix of token counts. This turns the text into numerical features representing how often each word appears in each email.
5. TF-IDF Transformation (TfidfTransformer): Applying the TF-IDF (Term Frequency-Inverse Document Frequency) transformation to the token-count matrix. This weights each word by how informative it is: words that appear in almost every message are down-weighted, while words concentrated in a few messages are up-weighted. (A toy vectorization example is sketched after this walkthrough.)
6. Splitting the Data: Splitting the dataset into a training set and a testing set. The training set is used to train the machine learning model, while the testing set is held back to evaluate it, giving an estimate of how well the model generalizes to unseen data. (A stratified variant of this split is sketched after this walkthrough.)
7. Training the Classifier (Multinomial Naive Bayes): Choosing the Multinomial Naive Bayes classifier, a common choice for text classification because it models word-frequency features directly. Training it on the training data lets it learn which words are associated with ‘ham’ and which with ‘spam’. (A sketch after this walkthrough shows how to inspect what the model has learned.)
8. Making Predictions: Using the trained model to make predictions on the testing set. The model assigns labels (0 for ‘ham’ and 1 for ‘spam’) to the emails in the test set.
9. Model Evaluation (Accuracy): Calculating the accuracy of the model by comparing the predicted labels with the actual labels in the testing set. Accuracy measures the percentage of correctly classified emails; because one class typically dominates in spam data, it is best read alongside the confusion matrix and classification report below.
10. Confusion Matrix: Generating a confusion matrix to break the model’s performance down into true positives, true negatives, false positives, and false negatives. This shows not just how often the model is right, but which kinds of mistakes it makes.
11. Visualization (Seaborn): Visualizing the confusion matrix as a heatmap with the seaborn library. The heatmap gives a clear picture of how well the model distinguishes ‘ham’ from ‘spam’ emails.
12. Classification Report: Generating a classification report, which summarizes precision, recall, F1-score, and support for both the ‘ham’ and ‘spam’ classes. Together with the confusion matrix, this gives a much fuller view of the model than accuracy alone. (The sketch after this walkthrough shows how these metrics relate to the confusion-matrix counts.)
13. Sample Email Predictions: Defining a list of sample emails to test the model. These emails are used to check how well the model predicts the labels for new, unseen data.
14. Predicting Sample Emails: Preprocessing the sample emails with the same fitted vectorizer and TF-IDF transformer, then predicting their labels with the trained model. The model assigns ‘ham’ or ‘spam’ to each sample email. (A small reusable helper for this is sketched after this walkthrough.)
15. Displaying Sample Email Predictions: Printing the model’s predictions for the sample emails, showing whether the model classifies each email as ‘ham’ or ‘spam.’
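A few optional sketches below expand on specific steps. First, for steps 4 and 5, here is a toy example on three made-up sentences (independent of the spam data) showing what CountVectorizer and TfidfTransformer actually produce; get_feature_names_out assumes a reasonably recent scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Three tiny made-up "emails", just to illustrate the shapes involved
docs = ["win a free prize now", "free prize waiting", "see you at home soon"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)     # sparse matrix: rows = documents, columns = vocabulary terms
print(vectorizer.get_feature_names_out())   # the learned vocabulary (alphabetical)
print(counts.toarray())                     # raw word counts per document

tfidf = TfidfTransformer()
weighted = tfidf.fit_transform(counts)      # same shape, but counts re-weighted by TF-IDF
print(weighted.toarray().round(2))          # words shared across documents (like 'free') get lower weight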
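Step 6 uses a plain random split. Because spam corpora usually contain far more ‘ham’ than ‘spam’, an optional refinement (not in the original code) is to stratify the split so both sets keep the same class ratio; stratify is a standard train_test_split parameter, and X_tfidf and y are the variables from the script above:

from sklearn.model_selection import train_test_split

# stratify=y keeps the ham/spam proportions identical in the train and test sets,
# which makes accuracy and the per-class metrics more stable across random seeds
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42, stratify=y
)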
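For step 7, one way to see what the Multinomial Naive Bayes model has learned is to compare its per-class log-probabilities. This is an exploratory sketch rather than part of the original script; feature_log_prob_ is a standard attribute of a fitted MultinomialNB:

import numpy as np

# feature_log_prob_ has one row per class (rows follow classifier.classes_, here [0, 1],
# so row 0 = ham and row 1 = spam) and one column per vocabulary term
feature_names = count_vectorizer.get_feature_names_out()
log_prob_ham, log_prob_spam = classifier.feature_log_prob_

# Words whose log-probability is much higher under 'spam' than under 'ham'
spam_scores = log_prob_spam - log_prob_ham
top_spam_words = feature_names[np.argsort(spam_scores)[-15:]]
print("Most spam-indicative words:", top_spam_words)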
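Steps 9 to 12 report accuracy, the confusion matrix, and the classification report separately, but they are all derived from the same four counts. The sketch below unpacks the confusion matrix and recomputes the headline metrics by hand; the variable names are mine, not from the original script:

# confusion_matrix returns [[TN, FP], [FN, TP]] when the labels are 0 (ham) and 1 (spam)
tn, fp, fn, tp = conf_matrix.ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)    # share of all messages classified correctly
precision = tp / (tp + fp)                     # of everything flagged as spam, how much really was spam
recall    = tp / (tp + fn)                     # of all real spam, how much was caught
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")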
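Steps 13 to 15 transform and classify the sample emails inline. For reuse, the same three calls can be wrapped in a small helper; classify_message is a hypothetical name (not from the original code or scikit-learn) and the example messages are made up:

def classify_message(text, vectorizer=count_vectorizer, transformer=tfidf_transformer, model=classifier):
    """Return 'spam' or 'ham' for a single message, using the already-fitted objects."""
    counts = vectorizer.transform([text])    # transform, not fit_transform, so the learned vocabulary is reused
    features = transformer.transform(counts)
    return 'spam' if model.predict(features)[0] == 1 else 'ham'

print(classify_message("Congratulations! You have won a free prize, text WIN to claim."))
print(classify_message("Are we still meeting for lunch tomorrow?"))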
The code follows a structured workflow, from data preprocessing to model evaluation and sample email predictions, to build an email spam detector using machine learning.
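To make that end-to-end workflow explicit, the same three components can also be chained into a single scikit-learn Pipeline. This is an equivalent, more compact formulation rather than what the original script does; a side benefit is that the vocabulary and IDF statistics are then fitted on the training messages only:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

spam_pipeline = Pipeline([
    ('counts', CountVectorizer()),     # raw text -> token counts
    ('tfidf', TfidfTransformer()),     # token counts -> TF-IDF weights
    ('nb', MultinomialNB()),           # TF-IDF features -> ham/spam prediction
])

# With a pipeline, the split is done on the raw text (X and y from the script above),
# so all feature statistics are learned from the training messages only
X_train_text, X_test_text, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
spam_pipeline.fit(X_train_text, y_train)
print("Test accuracy:", spam_pipeline.score(X_test_text, y_test))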