How Recent Machine Learning Tools and Open Source Changed the Course of Problem-Solving: A Spam Classification Example

Tolga Akiner · Published in The Startup · 12 min read · Jul 25, 2020

Disclaimer: My genuine computer scientist friend Tolun Tosun is the co-author of this article.

We are probably on the same page about how computation and coding have changed civilization and shaped the future of humanity. This story took its baby steps around 1900–1600 B.C. with the algebra of ancient Babylon, started to crawl with the geometrical advances of the Hellenistic period, acquired the name ‘algorithm’ from Arabic mathematicians around 800 A.D., and then found some of its first physics applications through Descartes, Leibniz, and Newton after the Renaissance. And when people figured out how to use mechanical (starting in the 19th century) and later electronic (in the 20th century) systems for calculations that are impossible to do by hand, a new era began. Since then, we have cured diseases and extended human life, discovered better ways of transportation and globalized the world, gone to space and monitored not only deep space but also the Earth and nearly all of its humans, started large-scale wars and killed millions of people, built enough nuclear weapons to destroy almost the entire civilization, invented the internet and spread information across the globe, and filled the planet with plastic and carbon emissions while executing most of these actions.

Computation has always aimed at finding answers that satisfy human curiosity and the urge to move forward by revealing the unknown. Now, with the availability of coding tools and the power of open source, we can solve problems in our bedrooms that were impossible for the world’s top research institutions a few decades ago, such as the problem we are going to use as an example today: spam classification.

This high availability of advanced computational tools has profoundly sparked the interest of problem solvers, creating immense demand and curiosity around one of the most valuable professional skills of the 21st century: coding. I recently stumbled upon a very cool company motto that relates to this part of the article: “Coding in 2016 is like reading in 1816”. According to Evans Data Corporation, the number of software developers in the world is projected to grow by 89% between 2018 and 2030, hitting 45 million, not to mention the millions of other jobs that involve a decent amount of coding. The backbone of this chain of interest is the open-source phenomenon, which has enabled the coding community to leverage one of the greatest traits of Homo sapiens: collaboration. Another amazing expression I came across in this blog: “One of the greatest benefits of open source is that it has created a model where smart people who disagree with each other can collaborate with each other. It’s easy to collaborate if we agree, but open source enables collaboration even when people disagree”. The reason I can write an email spam classifier in my bedroom today is that coders have been mutually enhancing each other’s projects without agreements and paperwork, and they sometimes decide to share all their code as a package so that another practitioner or developer can play with it and maybe offer another perspective. Now I just need to write a piece of code starting with the ‘import’ statement! This system has shown tremendously efficient progress, engaging the entire community and playing a crucial role in the new computation paradigm where we prefer letting algorithms find their own patterns rather than dictating thousands of rules to them through rigorous coding. Yes, some of you know what I’m talking about: this benchmarking of learning-based vs. rule-based approaches, and their availability to all of us, are the two main ideas of this article.

One of the game-changing methods in coding in recent years is definitely Machine Learning (ML), and this technology covers a larger portion of our lives every day. First, engineering and natural science schools embedded coding courses in their curricula, and we have now reached the point where any STEM student is exposed to some sort of ML algorithm somewhere in their education journey. Today, software developers, data scientists, ML engineers, and AI researchers (we may need another article to untangle these titles) find, clean, and process terabytes of data in each project, build sophisticated neural network architectures, utilize advanced CPU/GPU/TPU infrastructure, optimize multi-variable models, and visualize their results with fancy-looking figures, whereas people were running hundreds of if-else statements in semi-mechanical compilers only one generation ago.

As ML changes some of our computational paradigms, one question that arises is when to use this method and when not to, as described by a similar example in this blog. We also asked this question from a different angle: how would a problem-solver tackle one of the most common types of ML problem with more conventional computational tools, say, back when ML classifiers were not this popular or available? To articulate this comparison, we considered the spam classification problem, where a 2002 study showed the Naïve Bayes classifier outperforming the rule-based RIPPER algorithm on some test sets, while both methods reached similar accuracy on others. This problem forms a feasible foundation for addressing the question above and for seeing whether we really need ML in this particular case. As for the models of interest, the Naïve Bayes classifier has been a nice fallback for ML practitioners who haven’t gotten along with massive transfer-learning models for text classification yet, and the RIPPER algorithm is a good rule-based candidate for our purpose given its smart rule-selection feature.

In this article, we also wanted to give an example of how some previously impossible problems have become possible through ML approaches and the magic of open source, using one of the popular NLP challenges: spam email classification. Even though your spam folder might be working just fine and you may not need a model like this right now, and a spam detector alone won’t change the world, text classification addresses a much broader set of needs, such as sentiment analysis, enterprise-level text-labeling automation, and the controversial problem of hate speech detection. Beyond these popular examples, classifying a piece of text is a fundamental unit of how the human mind processes linguistic data, and it is an inspirational component of the growing conversational AI technologies that carry huge potential for cutting-edge NLP efforts.

A labeled spam email dataset of 3,017 emails was selected as our data, and we developed two models: a RIPPER classifier in Scala and a Naïve Bayes classifier in Python. Some people may just follow the crowd when choosing a programming language, but there is deeper reasoning behind these selections. Python’s popularity has grown with the emergence of data science, and it is the most popular programming language in the data science community, thanks to its capability and flexibility in numerical computation (like R). It does not require a strong programming background, so scientists from other disciplines can easily boost their work with the data science tools Python provides. Scala, on the other hand, is not as popular as Python in data science, although it is widely used in big data processing with frameworks like Hadoop and Spark. It is a well-known example of functional programming. For us, the main advantage of Scala is the ability to use Java’s open-source libraries with a much more developer-friendly syntax. Moreover, it runs on the Java Virtual Machine, so it preserves Java’s performance advantage over Python.

Combined with big data processing tools and Java’s open-source ecosystem, which has been developed over decades, Scala is a good candidate for solving a classification problem. But Python is the favorite of most of the data science community and the main tool of its practitioners. It offers the most powerful state-of-the-art solutions nowadays, with the availability of advanced algorithms such as the huge advantage in deep learning applications that Google’s TensorFlow provides. As a result, we present a recent ML approach in Python, while the course of solving a classification problem through Scala/Java is viewed as more traditional.

The email data is textually nasty, as you may imagine, so we dropped N/A rows, stop words, URL links, and punctuation, and lemmatized the corpus with the help of the NLTK, Regex, and BeautifulSoup packages. The pre-processed data can be found in our GitHub repository, under the Data section.

### Text Cleaning Functions ###
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet

# One-time NLTK resources for POS tagging and lemmatization
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

def lemmatization2(text):
    stemmer = WordNetLemmatizer()
    corpus_lem = []
    for row in text:
        document = row.split()
        lem_doc = []
        # Map each POS tag to the matching WordNet lemma type
        for token, tag in nltk.pos_tag(document):
            if tag.startswith('J'):
                lem_doc.append(stemmer.lemmatize(token, wordnet.ADJ))
            elif tag.startswith('V'):
                lem_doc.append(stemmer.lemmatize(token, wordnet.VERB))
            elif tag.startswith('R'):
                lem_doc.append(stemmer.lemmatize(token, wordnet.ADV))
            else:
                lem_doc.append(stemmer.lemmatize(token))
        corpus_lem.append(' '.join(lem_doc))
    return corpus_lem

def remove_stopwords(text):
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))
    corpus_sw = []
    for i in range(len(text)):
        review = [word for word in text[i].lower().split() if word not in stop_words]
        review = ' '.join(review)
        corpus_sw.append(review)
    return corpus_sw

def clean_text(texts):
    clean = []
    for text in texts:
        # Removing the @mentions
        text = re.sub(r"@[A-Za-z0-9]+", ' ', text)
        # Removing the URL links
        text = re.sub(r"https?://[A-Za-z0-9./]+", ' ', text)
        # Keeping only letters and basic punctuation
        text = re.sub(r"[^a-zA-Z.!?']", ' ', text)
        # Removing additional whitespaces
        text = re.sub(r" +", ' ', text)
        clean.append(text)
    return clean

And the data looks like:

import pandas as pd

# Read the data
raw_data = pd.read_csv('./spam_or_not_spam.csv')
# Drop the NaNs
raw_data = raw_data.dropna()
# Clean text
cleaned_text = clean_text(raw_data['email'].tolist())
# Remove stopwords
cleaned_text = remove_stopwords(cleaned_text)
# Lemmatize
cleaned_text = lemmatization2(cleaned_text)
# Pull the target variable (binary)
target = raw_data['label'].tolist()
# Quick glance at the dataset
raw_data.head()

And then the cleaned corpus has been converted into a TF-IDF matrix, which has been split into training and test sets. Finally, the Naive-Bayes classifier has been applied easily, all thanks to the mighty sklearn package. We should probably say a couple of words about this one of the ‘coolest’ classifiers (I know, very subjective, but still…): although it is naïve and assumes the predictors are independent of one another, my experience and Wikipedia both show that it yields a feasible set of results on a small amount of training data within a small amount of computation time.
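For intuition, here is a minimal sketch of that naive independence assumption at work; the word probabilities, priors, and tokens below are made-up numbers for illustration, not values from our dataset:

import math

# Hypothetical per-class word likelihoods, P(word | class)
p_word_given_spam = {'free': 0.05, 'bonus': 0.03, 'meeting': 0.001}
p_word_given_ham = {'free': 0.005, 'bonus': 0.001, 'meeting': 0.02}
p_spam, p_ham = 0.3, 0.7  # hypothetical class priors

def log_posterior(words, likelihoods, prior):
    # Naive Bayes scores a class as log P(class) + sum of log P(word | class),
    # treating the words as conditionally independent given the class
    return math.log(prior) + sum(math.log(likelihoods[w]) for w in words)

email = ['free', 'bonus']
spam_score = log_posterior(email, p_word_given_spam, p_spam)
ham_score = log_posterior(email, p_word_given_ham, p_ham)
print('spam' if spam_score > ham_score else 'ham')  # prints 'spam'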

### TF-IDF fit&transform ###
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

tfidfconverter = TfidfVectorizer(max_features=20000, min_df=5, max_df=0.75)
X = tfidfconverter.fit_transform(cleaned_text).toarray()
# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.20, random_state=0)
# Naive-Bayes Classifier
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
# Confusion Matrix
# cm = confusion_matrix(y_test, y_pred)
cl_report = classification_report(y_test, y_pred)
print(cl_report)

As for our second model, RIPPER is a rule-based classification algorithm with a pretty straightforward principle: it extracts a set of rules from the training data during the training phase, and these rules are simple Boolean expressions that are evaluated on the test data for classification. To understand RIPPER, it is a must to comprehend what a rule and a rule set are. A rule is a composition of conditions that can be directly checked against the data, connected by Boolean “AND”s; the rule set, in turn, connects these rules with Boolean “OR”s. A toy rule set looks more or less like this:

the e-mail is spam <=> (word “gift” appears in d) OR (word “free” in d AND word “pass” in d) OR (word “bonus” in d AND word “bet” in d)

Note that d refers to the document (an e-mail in the current context) to be classified. There are three rules in the rule set, connected by “OR”s. RIPPER iteratively grows the rules until some stopping condition is reached. At each iteration:

  • The training set is split into growing and pruning sets.
  • A rule is grown by greedily adding conditions to it. As the data science community knows from decision trees, the condition that maximizes FOIL’s information gain on the growing set is added to the rule.
  • The rule is then pruned (some conditions are deleted) using the pruning set. Again, this is a greedy process: the conditions to delete are chosen to maximize some pruning metric.

Iterations end whenever a proposed rule has precision lower than 0.5. For more details, we would suggest this easy-to-read paper or these slides.
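To make the rule and rule-set definitions concrete, here is a minimal sketch (our own illustration, not RIPPER itself) that evaluates the toy rule set above, plus one common form of FOIL’s information gain used in the growing step:

import math

# The toy rule set above: a rule is a set of words joined by AND,
# and the rules themselves are joined by OR
rules = [{'gift'}, {'free', 'pass'}, {'bonus', 'bet'}]

def is_spam(document, rule_set):
    words = set(document.lower().split())
    # OR over rules; within a rule, AND (all of its words must appear)
    return any(rule <= words for rule in rule_set)

print(is_spam('claim your free pass today', rules))  # True
print(is_spam('meeting moved to tuesday', rules))    # False

def foil_gain(p0, n0, p1, n1):
    # p0/n0: positives/negatives covered before adding a condition,
    # p1/n1: after; the gain rewards conditions that keep many positives
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))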

Weka is an open-source machine learning library fully implemented in Java, and we used it to run RIPPER on the spam dataset with the help of its JRip class. For non-programmers, it also offers a tool called “Weka Explorer” with a nice GUI for playing with data. In this library, data sources are represented by the Instances class, so the pre-processed CSV file must be read into an Instances object; the main helper for this purpose is the CSVLoader class. A Remove filter is also needed to select the necessary columns from the dataset.

import java.io.File
import java.util.Random
import weka.classifiers.Evaluation
import weka.classifiers.rules.JRip
import weka.core.converters.CSVLoader
import weka.filters.Filter
import weka.filters.unsupervised.attribute.{NumericToNominal, Remove}

val csv_loader = new CSVLoader()
csv_loader.setSource(new File(filename))
val data_set_csv = csv_loader.getDataSet()
// Select only the needed columns (invertSelection keeps the listed indices)
val remove_filter = new Remove()
remove_filter.setAttributeIndicesArray(Array(2, 3))
remove_filter.setInvertSelection(true)
remove_filter.setInputFormat(data_set_csv)
val filtered_data = Filter.useFilter(data_set_csv, remove_filter)
filtered_data.setClassIndex(0)

The class values need to be converted from numeric to nominal type; otherwise, JRip won’t be happy about it. Once that is done, the Weka JRip instance and the classifier are created with the help of Scala’s easy-to-use syntax.

// Convert the numeric class attribute into nominal form for JRip
val convert = new NumericToNominal()
convert.setAttributeIndicesArray(Array(0))
convert.setInputFormat(filtered_data)
val data_set = Filter.useFilter(filtered_data, convert)
// Build the RIPPER classifier with pruning enabled
val jRip = new JRip()
jRip.setUsePruning(true)
jRip.buildClassifier(data_set)

Here, data_set is the object holding our pre-processed dataset. Loading the CSV into JRip is a bit tricky, so I will discuss it at the end to keep the following steps simpler. One can write

jRip.setDebug(true)

to see what is going on. To evaluate the model, we create an Evaluation object, run 10-fold cross-validation, and then use the object’s class methods to print the results in a ‘Scala-ish’ way:

// Evaluate with 10-fold cross-validation, as described above
val eval = new Evaluation(data_set)
eval.crossValidateModel(jRip, data_set, 10, new Random(1))

// padding is an assumed format string for the metric rows
val padding = "%d\t%.3f\t%.3f\t%.3f"
println(" \tPrecision\tRecall\tF")
println(padding.format(0, eval.precision(0), eval.recall(0), eval.fMeasure(0)))
println(padding.format(1, eval.precision(1), eval.recall(1), eval.fMeasure(1)))

val conf_matrix = eval.confusionMatrix()
val total = conf_matrix.map(u => u.sum).sum
val succ = conf_matrix(0)(0) + conf_matrix(1)(1)
println("Accuracy: " + (succ / total))

Here is an example rule extracted by JRip, which simply means the algorithm marks an e-mail as spam if all the words in the rule are present in the e-mail:

1: (cleaned = low rate available term life insurance take moment fill online form see low rate qualify save number regular rate smoker accept url represent quality nationwide carrier act easily remove address list go please allow number number hour removal) => label=1

I think these model details give some idea of the implementation differences between the two languages. As for the results (NB stands for Naïve-Bayes and RP for RIPPER), they are summarized in the discussion below.

The question we always ask ourselves while reading an article is… so what? The NB classifier produces very acceptable accuracy metrics, whereas we can clearly see how one of the fanciest rule-based algorithms fails in various ways even though its accuracy seems reasonable. Looking at RIPPER’s high precision and low recall for class 1, we observe that it labels only 4% of the spam emails as spam. The reason is that NB, with its learning-based structure, separates the token clusters far better than the rule-based approach and identifies 73% of the spam emails. This is the example we wanted to walk through in this article: how an easily accessible, recent ML library significantly outperforms what was available before the golden age of ML on a very common, intuitive, and practical problem.

Key Takeaways:

  • We have reached an era in which we can tackle the challenging problems of the previous century on our personal devices within a short amount of time.
  • Supply of and investment in software resources created open source, and open source returned the courtesy with a tremendous enhancement of software technology, building an easy and efficient loop of human collaboration.
  • Open-source and Machine Learning have vastly changed how we attack problems, what tools/algorithms we choose and the resulting accuracy.
  • We have presented a model comparison for a simple and intuitive text classification example, spam detection, hoping that this is an important topic since there are two TED talks about the subject. The Naïve-Bayes and RIPPER algorithms were implemented in Python and Scala, respectively, and the results led us to highlight some conclusions around programming languages and model selection.
  • If you are also curious about the ‘so what’ of the ‘so what’: the NB classifier detected most of the spam with 0.94 accuracy, but some of Mark Zuckerberg’s emails, or a generous inheritance from a stranger, may have slipped through. On the other hand, the rule-based classifier has probably missed pretty much everything except a couple of subscription and phishing emails.

Thank you for reading up until here!!! You can find all the relevant code in our GitHub repository, and credit again goes to co-author Tolun Tosun! Also, please feel free to ping us with feedback and for further discussion.
