DSD Fall 2022: Quantifying the Commons (8B/10)

In this blog series, I discuss the work I’ve done as a DSD student researcher from UC Berkeley at Creative Commons.

Bransthre
12 min read · Nov 21, 2022

In this gigantic text post, I take my intermediate steps toward making a Machine Learning model for the Quantifying the Commons initiative, discussing the training, tuning, and final analyses of the models.

DSD: Data Science Discovery is a UC Berkeley data science research program that connects undergraduates, academic researchers, non-profit organizations, and industry partners into teams working on their technological developments.

Note: Reading Posts 7A, 7B, and 8A will provide helpful context for this report.

Covering Ground: How do Machine Learning Algorithms Actually Learn?

There is a general SOP for turning a machine learning algorithm into a model:

  1. Feed a set of data into the machine learning algorithm. Each algorithm has its own way of coming up with good formulas and patterns.
  2. Test the performance of the resulting model. If it is not good, adjust the numbers and datasets until it gets better.

In other words, an algorithm can be imagined as “machinery that follows a procedure to find the best formula for a dataset”, and that formula is the only thing the algorithm learns during its execution.

This learning process is known as “training”, and the dataset used during training is known as the “training set”.
But we need to test the model as well. This is where the other partition of our dataset, known as the “testing set”, is used to test the accuracy and performance of the model.

Ok, so we understand that the training set is for training, and the testing set is for testing.

But why set data aside for testing instead of training the model on all the data available?

This is because if a model relies too heavily on its training set, its formula will be tailored precisely to the slice of reality that the training set captures.
However, we have no way to tell whether the training set provides a good picture of what reality actually looks like.

For example, if my training set only has pictures of blue puppies and blue bagels, my test set is responsible for telling me that there exist many puppies and bagels that are not blue.

If a model only follows its training set and behaves excessively inaccurately on the testing set and real-world data, we say the model is “overfitting” the training set.
In other words, the model fits the training set too closely and lives in the world shaped by that training set, unaware of the truth regarding the distinction between puppies and bagels.

Therefore, one partition of the original dataset needs to be reserved for testing, and the other portion used for training. A popular choice of train-test split is 70%-30%.
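As a minimal sketch of what this split looks like in scikit-learn (using a synthetic stand-in for our document features and license labels, since the real preprocessing lives in the earlier posts):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for our document features (X) and 7 license labels (y).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=7, random_state=1)

# 70%-30% train-test split; stratify keeps the license proportions
# roughly the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)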

Defining Training Objectives

Generally, we can stop the model training process whenever we see fit, based on the objective of the model.

Since we are building a recommendation system, as long as the top 2 or top 3 most likely license types for a document are predicted well by realistic, industrial standards (70% to 90%), we can compromise by saying “the model works for its context, and it would work better under a more well-sampled dataset”.
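For concreteness, here is a small, hypothetical sketch of the “top-k” idea with scikit-learn’s top_k_accuracy_score (the numbers are made up for illustration, not our real results):

import numpy as np
from sklearn.metrics import top_k_accuracy_score

# Hypothetical predicted probabilities for 4 documents over 3 license classes.
y_true = np.array([0, 1, 2, 2])
y_score = np.array([[0.6, 0.3, 0.1],
                    [0.2, 0.5, 0.3],
                    [0.4, 0.4, 0.2],
                    [0.1, 0.3, 0.6]])

# A prediction counts as correct if the true license is among the
# two highest-probability classes for that document.
print(top_k_accuracy_score(y_true, y_score, k=2))  # 0.75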

Summary of Model Selection and Training Processes

Let’s go over how each model was selected, and how their training processes went during the modeling phase of this project. We will cover their performances in the next section, after these summaries.

Logistic Regression

Logistic Regression is a classifier algorithm that estimates the probability that a specific data entry belongs to one of two provided categories. Whichever of the two categories appears more likely becomes the result of the classification.

Probability is estimated and computed using what’s called a “sigmoid function”:

A quick peek at the graphs of a Sigmoid function.
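Concretely, the sigmoid function squashes any real-valued score into a probability between 0 and 1; a quick sketch:

import numpy as np

def sigmoid(z):
    # Maps any real-valued score z into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))   # 0.5: completely unsure
print(sigmoid(4))   # ~0.98: very likely in the class
print(sigmoid(-4))  # ~0.02: very likely not in the class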

Although this algorithm works for binary classification (that is, classifying between two categories), we can adapt it to our 7-class classification by internally performing a classification for each class: documents of that one class versus documents not of that class.

Here’s how it works.

To know the probability that a document is by-licensed, I run the binary classification problem between documents that are by-licensed and documents that are not by-licensed. I perform this process for all 7 classes, and whichever license gives the document the highest probability of belonging to it wins as the result of the classification.
For example, in a table like this:

A table of license types and their binary classification probability results

Then, the result of classification in the above table would be license D, since it has the highest probability of the document being under that license. The second-best result would then be license B.

This scheme for building multi-class classification out of binary classifiers is known as a “one-vs-all classifier”.
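As a sketch of this scheme on synthetic data (scikit-learn’s liblinear-based Logistic Regression already applies one-vs-rest internally, but the explicit wrapper below makes the idea visible):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=700, n_features=20, n_informative=10,
                           n_classes=7, random_state=1)

# One "this license vs. not this license" classifier is fit per class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# For one document, rank the 7 licenses by estimated probability:
# the first index is the predicted license, the second is the runner-up.
probs = ovr.predict_proba(X[:1])[0]
print(np.argsort(probs)[::-1])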

Now, let’s discuss the training process.
Overall, this algorithm benefited more from regularization than from cross-validation, and the resulting model after hyperparameter tuning (or fine-tuning) is as follows:

from sklearn.linear_model import LogisticRegression

LogisticRegression(
    penalty = 'l2',              # L2 regularization
    solver = "liblinear",
    class_weight = "balanced",   # compensate for the imbalanced license classes
    C = 0.1                      # stronger regularization than the default C = 1.0
)

The class_weight option was set to “balanced” to counter the imbalanced classification problem.

Support Vector Machine

The fundamental concept of a Support Vector Machine is to plot the data entries of the training set as points and find boundaries that separate data entries of different categories.

Then, classification is conducted based on which side of the boundary a data entry falls on.

An illustrative picture of SVM.

SVM benefited from the conciseness of data that SVD brings. At the same time, a linear kernel was found to perform better than any other alternative:

from sklearn.svm import SVC

SVC(
    C = 1.0,
    probability = True,        # enable probability estimates for the top-k metric
    kernel = "poly",
    degree = 1,                # a degree-1 polynomial kernel, i.e. effectively linear
    class_weight = 'balanced'
)

To support top-k probability as a metric for model performance, we switched to a probability-enabled version of the SVM model.
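With probability = True, the fitted SVC exposes predict_proba, which lets us rank the most likely licenses for each document; a rough sketch on synthetic stand-in data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=700, n_features=20, n_informative=10,
                           n_classes=7, random_state=1)

svc = SVC(C=1.0, probability=True, kernel="poly", degree=1,
          class_weight="balanced").fit(X, y)

probs = svc.predict_proba(X[:1])[0]   # probability of each of the 7 licenses
top_3 = np.argsort(probs)[::-1][:3]   # the 3 most likely licenses, best first
print(top_3, probs[top_3])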

Naïve Bayes Classifier

Fundamentally, the Naïve Bayes Classifier tries to maximize the conditional probability that an entry is of category C given that the entry has some specific word embedding x, and that probability is computed based on a famous probability principle called “Bayes’ Theorem”.

The category C whose conditional probability is largest is established as the result of the classification.
Conditional probability just means the probability of an event happening (the category is C) given that a condition has happened (the word embedding is x).

If you are interested, here is a very brief introduction to its mathematics.

Bayes’ Theorem: A Very Brief Introduction

Particularly, the probability being assessed is fundamentally:

P(y | x) = P(x and y) / P(x)

To rationalize the above:
The probability that a document is of category y, given that it has word embedding x, can be computed as a fraction of probabilities:

  • In the denominator, I place the probability that word embedding x occurs.
  • In the numerator, I place the probability that word embedding x occurs and category y simultaneously occurs.

This is equivalent to finding the probability that a document’s category is y among all sample documents whose embedding is x.

Then, it just happens that the numerator can also be calculated as the probability that the category is y times the probability that the word embedding is x given y:

P(x and y) = P(x | y) · P(y)

A figure demonstrating an intersection between events A and B, sourced here.

The mathematical combination of the results above,

P(y | x) = P(x | y) · P(y) / P(x),

is called “Bayes’ Theorem”.

P(A | B) is the conditional probability of A given that B has happened. For example, P(die roll is even | die rolled a 6) = 1, because if a die has already rolled a 6, then that result must be even.

Understanding Bayes’ Theorem is not essential to understanding this blog post. It is just a brief, helpful introduction for the content presented.
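To make the formula concrete, here is a tiny illustration with made-up numbers (they are purely hypothetical, not drawn from our dataset):

# Made-up numbers, purely to illustrate Bayes' Theorem:
# P(y | x) = P(x | y) * P(y) / P(x)
p_y = 0.3          # P(y): 30% of all documents are, say, "by"-licensed
p_x_given_y = 0.2  # P(x | y): 20% of "by" documents contain word pattern x
p_x = 0.1          # P(x): 10% of all documents contain word pattern x

p_y_given_x = p_x_given_y * p_y / p_x
print(p_y_given_x)  # 0.6: given pattern x, the document is "by"-licensed with probability 0.6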

This is not the most impressive model to work with, but some hyperparameter tuning (fine-tuning) was still done to optimize its performance:

from sklearn.naive_bayes import MultinomialNB

MultinomialNB(
    fit_prior = True,   # learn the class prior probabilities from the data
    alpha = 10          # heavy additive smoothing
)

Random Forest Classifier (Ensemble version)

To understand how random forests work, we must see what the trees of this forest are.

Decision trees are tree-structured models that attempt to classify data based on conditions.
To abstract away the details (because I really don’t want to make you learn more math than needed), just remember that each decision tree works exactly like the following diagram:

Rule-based decision model, which is what a decision tree is about.

Essentially, to decide whether a document is of category C, we follow the decision tree’s branches until we reach a result that represents a category.

A proper decision tree

During training, we split the training set into two (or more) partitions at every splitting node, and we decide which feature of the data entries the split should be based on using various metrics.
One notable metric, besides probability and accuracy, is entropy:

Entropy: how informationally nonuniform (chaotic) the training set of a partition is.

A drawer of only red and blue socks is much more orderly and less chaotic than a drawer of red, yellow, orange, green, and blue socks.
In that case, entropy is higher in the drawer of red, yellow, …, blue socks.

And the takeaway: the more informationally uniform a partition is, the more alike each data entry of that partitioned training set is, and so the easier it is to predict and classify.
The feature or rule by which we split a node is then chosen to minimize the entropy of the training set once partitioned by that rule.
Such is the underlying idea of the decision tree algorithm.
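A minimal sketch of how entropy quantifies the sock-drawer intuition above (this uses the standard Shannon entropy formula, not our project code):

import numpy as np

def entropy(labels):
    # Shannon entropy: 0 when all labels are identical, larger when they are more mixed.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(["red"] * 5 + ["blue"] * 5))                    # 1.0
print(entropy(["red", "yellow", "orange", "green", "blue"]))  # ~2.32: more chaotic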

The training set is used to decide the decision rules of a decision tree.
A random forest then builds many decision trees and has these trees “vote” on the result of the classification. That is the rough summary of the random forest algorithm.

Now, here are two common ways new decision trees can be constructed for such an ensemble (random forests rely on the first; a sketch of it follows this list):

  1. Bagging (Bootstrap): each decision tree is fit on a dataset resampled, with replacement, from the original training set.
  2. Boosting: force new decision trees to reflect more on the past errors of the old decision trees (this is the idea behind the Gradient Boosting Classifier discussed below).
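A quick sketch of the bagging (bootstrap) idea, with hypothetical document indices:

import numpy as np

rng = np.random.default_rng(1)
training_indices = np.arange(1000)   # stand-in for 1000 training documents

for tree_number in range(3):         # a real forest would build around 100 trees
    # Resample the training set with replacement; each tree fits on its own sample.
    bootstrap = rng.choice(training_indices, size=1000, replace=True)
    # On average only ~63% of the original documents appear, some of them repeatedly.
    print(tree_number, len(np.unique(bootstrap)))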

It is a bit tricky to tune the number of estimators (decision trees) in this algorithm. But the probabilistic, self-reinforcing nature of the ensemble still makes the Random Forest Classifier perform the best of all the models tried:

from sklearn.ensemble import RandomForestClassifier

RandomForestClassifier(
    class_weight = "balanced_subsample",   # re-balance classes within each bootstrap sample
    n_estimators = 100,                    # 100 decision trees
    random_state = 1
)

Gradient Boosting Classifier (Ensemble version)

Mathematically, this is a rather complicated concept to explain.

Fundamentally, this algorithm works by, on each new pass over the training set, adding one new term to its current formula for deciding the category of an item, and optimizing that term in combination with the prior formula. The first prediction starts off as a constant, and the nth prediction ends up as some n-term formula.

The way it optimizes its loss (a quantification of prediction error) is via an algorithm called “Gradient Descent”, hence the “Gradient” in its name.
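Here is a simplified sketch of that loop, written in a regression flavour for readability (the classifier version fits each new tree to gradients of a log-loss rather than to raw residuals; the data below is synthetic):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

prediction = np.full_like(y, y.mean())   # iteration 0: predict a constant
learning_rate = 0.1

for _ in range(50):
    residual = y - prediction                            # what we still get wrong
    weak_model = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * weak_model.predict(X)  # add one small new term

print(np.mean((y - prediction) ** 2))    # the training error shrinks each iteration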

This algorithm, just like Random Forest, is an ensemble model: it is built from many weaker machine learning models and relies on the collaboration of those weak models to become a stronger one.

A good summary of that philosophy is captured by the following famous ancient Chinese proverb:

Three humble cobblers (leather smiths) can still outdo a brilliant strategist.

For example, a random forest (strong) uses a lot of decision trees (weak) to make a good model, and the Gradient Boosting Classifier uses a lot of weak models (small formulas) to build a stronger model as it learns more about the training set with each iteration.

Given that framework, there is indeed a LOT of potential in this algorithm, but due to the time constraints of the DSD project, I did not have the opportunity to explore different loss functions for it.

The choice of loss function (the different ways to quantify the mistakes made when classifying) is a good selling point of this algorithm. It’s a shame not to have been able to experiment with it.

from sklearn.ensemble import GradientBoostingClassifier

GradientBoostingClassifier(
    n_estimators = 50,   # 50 boosting stages
    random_state = 1
)

Therefore, we have only offered a largely unmodified version of the Gradient Boosting Classifier. One major opportunity for model improvement would definitely be further experimentation with this algorithm’s modeling parameters.

BERT

BERT’s approach is fundamentally based on neural networks. However, our dataset is too small for a data-hungry architecture like a neural network to work robustly. Therefore, BERT performed far less well than expected, even after several regularizing and overfitting-prevention measures such as Dropout layers and a reduced learning rate.
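The actual notebook followed the tutorial referenced below; a rough, hypothetical reconstruction of that kind of setup with the Hugging Face transformers library (the model name, dropout value, and learning rate here are illustrative, not our exact configuration) would look something like:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=7,              # our 7 license classes
    hidden_dropout_prob=0.3,   # extra dropout against overfitting (illustrative value)
)
model.train()                  # enable dropout for fine-tuning

texts = ["placeholder webpage text", "another placeholder document"]
batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                  return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # reduced learning rate
outputs = model(**batch, labels=torch.tensor([0, 3]))       # forward pass returns a loss
outputs.loss.backward()
optimizer.step()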

Past neural network projects, both mine and those described in articles around the internet, all required much larger datasets (on the order of 15,000 to 30,000 datapoints) to function at good potential.

This dataset (1000 entries long) is simply insufficient.

An approximate summary of our BERT model, per tutorial referenced.

The complicated model and small dataset contributed very significantly to the overfitting situation, causing its accuracy to linger below that of any of the five simpler models mentioned above. This just means brute force is not a robust method for a delicate field like Machine Learning.

I talked about this with the friend who recommended BERT to me, and we decided to collectively laugh at the dataset for being beyond even the redemption of BERT.

Final Results of Models

After some fine-tuning of the preprocessing procedures and an expansion of the training dataset, we have settled on the following summary of model performances:

A boxplot for the training accuracies across models on top-k accuracy.
A boxplot for the testing accuracies across models on top-k accuracy.

There is a significant trend of overfitting across all models, and on top of that, a remarkably low testing accuracy for every model.

But having applied every possible optimization and overfitting inhibition, the last thing I can think of blaming is the dataset: it is too small, too irregular in its patterns, and perhaps too repetitive. Or perhaps license classification based on webpage content is simply not a workable prompt.

Regardless, let’s consider what we have stated when discussing training objectives for the modeling task:

Since we are building a recommendation system, as long as the top 2 or top 3 most likely license types for a document are predicted well by realistic, industrial standards (70% to 90%), we can compromise by saying “the model works for its context, and it would work better under a more well-sampled dataset”.

Several models are able to reach this standard.

Remarks on Modeling Process

Let me start with some clear criticisms (on myself, obviously).

Why is the dataset so bad?

But at the same time, the more I investigate the reasoning behind the dataset, the more I realize its shortcomings were inevitable given the development constraints.

Compared to other continuing projects led by academic researchers and machine learning veterans, here are the limitations our project faced in this phase:

  • Only a week of time to brainstorm and carry out the entire machine learning modeling phase (because of other ongoing work on my part and the project deliverables’ production schedule). In fact, the entire modeling process was completed in four days, far shorter than the seven-day-long visualization stage.
  • I have virtually no experience with machine learning.
  • I am a one-man team on this modeling task, responsible for every part from dataset sampling to model fine-tuning (which probably shows in the blog).
  • There were no good datasets or sampling methods to work with.
  • Our team has no one with a data science profession or background, so there was no one to consult about modeling.

None of which are valid excuses; there is still no denying the seeming failures of this process. And maybe there were no particularly ugly failures to begin with.

The project has yielded more value than a 50% accurate model.

It is true that our project was very much constrained, but even so, we faced this arduous, unfamiliar path of Machine Learning head-on under those circumstances, and managed to learn a wide breadth of knowledge through independent, hands-on experience.

We didn’t run away. We cherished every opportunity we could to learn and be creative with the commons.

Which brings me to the reflection that I’m very glad I was granted independence and complete freedom in my modeling task, which let me maximize what I learned by devoting myself to a difficult, open-ended challenge.

This modeling experience isn’t necessarily a failure. I don’t think it’s fair to characterize an experience by the obstacles it did not overcome. Not to mention, referring back to the training objective statement, the models already carry what they were supposed to succeed at. We’ve surpassed the capabilities and constraints that would have destined us to much less impressive efforts.

And now that we’re done with the modeling phase, it’s time to think about the future of Quantifying the Commons.

https://github.com/creativecommons/quantifying

“This work is licensed under a Creative Commons Attribution 4.0 International License”: https://creativecommons.org/licenses/by-sa/4.0/

CC BY-SA License image
