DSD Fall 2022: Quantifying the Commons (8A/10)

In this blog series, I discuss the work I’ve done as a DSD student researcher from UC Berkeley at Creative Commons.

Bransthre
8 min read · Nov 21, 2022

In this massive text post, I take my intermediate steps toward making a Machine Learning model for the Quantifying the Commons initiative, discussing the word embedding options for a model and the basics of training one.

DSD, Data Science Discovery, is a UC Berkeley Data Science research program that connects undergraduates, academic researchers, non-profit organizations, and industry partners into teams working on technological development.

Note: Reading Posts 7A and 7B will provide helpful context for this report.

Words to Numbers

As mentioned in previous posts, computer models cannot directly read human-level text. These texts must be converted into numeric features for the computer to make decisions with. Therefore, it is time to convert the preprocessed dataset of texts into numeric features so that we can feed it to the upcoming model.

Fortunately, there are several ways to turn words into numbers. Here, I will introduce the text-encoding methods that were considered and employed.

TF-IDF

We can characterize a word by how frequently it appears across different types of documents.
Here’s an example of how it might be helpful for classification.

Because by-nc-nd licensed works include more scientific documents, the word "science" likely appears a few more times inside by-nc-nd licensed documents. Therefore, documents with a high frequency of the word "science" are more likely to be by-nc-nd documents.

Since frequencies can be expressed numerically, characterizing a word by its frequency of appearance marks a good cornerstone for Word Embedding methods.

Among the Word Embedding methods that use “frequency” to decide what words are more important than others, TF-IDF is perhaps the most popular choice. Here’s how it works.

The measure of how frequently a term appears in a specific document of our dataset is known as "Term Frequency" (TF). This metric measures how core a word is to a document's content: the more frequently a word appears in a document, the more essential that word is to that document.

Meanwhile, the proportion of documents that contain a specific term is known as "Document Frequency" (DF). The more documents contain a term, the higher that term's document frequency, and the less useful the word is for identifying the specific context of a text passage.
For example, the word "is" ought to have a high document frequency, given how widely it is used across English texts.

The inverse of this measure is known as "Inverse Document Frequency" (IDF), which captures the rarity of a word across documents. The word "is" would thus have a low IDF.

Let's mark an observation here: words with higher IDF appear more important, because they may be specifically identifying some type of document. Meanwhile, common words appear in every document and don't really help classification.

For a word, if its frequency in a document is low (low TF), it doesn't appear essential to that document's content. Meanwhile, if a word has a high document frequency (low IDF), it is just a normal, widely used word that provides little context or hint about the document's content.

Following that logic, the product of TF and IDF forms a great metric that filters out both words unimportant to a document and words too common across documents.

Words with a high TF-times-IDF value are thus important words: vital within a document but not widespread across all documents.
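
For reference, the standard textbook formulation of this product is roughly the following (scikit-learn's TfidfVectorizer applies a smoothed, normalized variant of it):

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the number of times term t appears in document d, N is the total number of documents, and df(t) is the number of documents containing t.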

A code snippet for TF-IDF-based feature extraction from our dataset's document text contents:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def extract_text_features_tfidf(train, test, text_field="train_text", svd=True):
    # TF-IDF vectorizer: keep the 6500 most important 1- to 3-grams,
    # dropping English stop words and terms appearing in over 95% of documents
    tfidf_vectorizer = TfidfVectorizer(
        use_idf=True,
        stop_words="english",
        max_features=6500,
        max_df=0.95,
        sublinear_tf=True,
        ngram_range=(1, 3)
    )
    tfidf_vectorizer.fit(train[text_field].values)
    train_vectorized = tfidf_vectorizer.transform(train[text_field].values)
    test_vectorized = tfidf_vectorizer.transform(test[text_field].values)
    # Optionally compress the sparse TF-IDF vectors with truncated SVD
    svd_train, svd_test = None, None
    if svd:
        tsvd = TruncatedSVD(n_components=100)
        tsvd.fit(train_vectorized)
        svd_train = tsvd.transform(train_vectorized)
        svd_test = tsvd.transform(test_vectorized)
    return train_vectorized, test_vectorized, svd_train, svd_test

Above is the Python TF-IDF pipeline we employed in our model. Here, we can see that we are filtering out words whose document frequencies exceed 95% (meaning they are too common) and only taking the 6500 most important terms into consideration.
Other than that, we are using the sublinear term frequency option (which replaces a raw count with 1 + log(count), damping the effect of very frequent terms), as well as a wider ngram_range, both for optimization purposes.

In particular, adjusting ngram_range allows our algorithm to also consider frequencies of adjacent combinations of words. What we call "n-grams" are in essence multi-word phrases like "Among Us" (a 2-gram in this case), as opposed to individual words like "Among" and "Us", which are 1-grams.

There are good reasons to use n-grams.
For example, instead of considering just the 1-gram "science", I might also consider the importance of the phrases "scientific research" and "scientific literacy" across documents.

N-grams are useful when we want phrases as features; instead of "Creative" or "Commons", we might want to detect the importance of the phrase "Creative Commons". Via the ngram_range option, our algorithm is now open to more selectable features for constructing its model, as the sketch below illustrates.
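
To make this concrete, here is a minimal sketch (the toy sentence is my own, not part of the project's dataset) of what features a vectorizer produces once ngram_range=(1, 3) is enabled:

from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["creative commons licenses support open science"]
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
vectorizer.fit(toy_corpus)

# Features now include single words plus 2- and 3-word phrases,
# e.g. "creative", "creative commons", "creative commons licenses", ...
print(vectorizer.get_feature_names_out())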

We will discuss the "SVD" part seen in the code later.

Word2Vec

Word2Vec is a word embedding method that is now considered somewhat dated; it is mainly trained on "fill in the blank" problems. During the modeling process, I completely misunderstood it as a classification-centered technology.

Meanwhile, Gensim, the Python library for this Word Embedding technique, only offers a convenient scikit-learn API wrapper in an older version, which turned out to be incompatible with the newer C++ build tools on my computer. Therefore, Word2Vec was only considered and not used at all in the modeling task of this project.

Word2Vec and Gensim thus impacted virtually no part of this project, other than the hour they took away from the training process.

BERT

BERT, Bidirectional Encoder Representations from Transformers, not to be confused with the character from Sesame Street, is an NLP technology for Word Embedding that has gained huge attention in recent years for NLP-related machine learning efforts.

Note: NLP, Natural Language Processing, is a topic of Machine Learning research that focuses on human language text content. Fundamentally, the model we work on is also an NLP task.

# The BERT Model Architecture used for this project, if you're interested.
import tensorflow as tf
import tensorflow_hub as hub

# tfhub_handle_preprocess / tfhub_handle_encoder are TF Hub model URLs defined elsewhere
def build_classifier_model():
    text_input = tf.keras.layers.Input(
        shape=(),
        dtype=tf.string,
        name='parsed_cleaned_contents'
    )
    # Preprocessing layer: tokenizes raw strings into BERT's input format
    preprocessing_layer = hub.KerasLayer(
        tfhub_handle_preprocess,
        name='preprocessing'
    )
    encoder_inputs = preprocessing_layer(text_input)
    # Pre-trained BERT encoder from TF Hub, fine-tuned during training
    encoder = hub.KerasLayer(
        tfhub_handle_encoder,
        trainable=True,
        name='BERT_encoder'
    )
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.3)(net)  # Inhibit overfitting
    net = tf.keras.layers.Dense(
        1,
        activation=None,
        name='classifier'
    )(net)
    return tf.keras.Model(text_input, net)
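
For completeness, here is a hedged sketch of how such a model might be compiled and trained. The TF Hub handles, the placeholder train_texts/train_labels names, and the training settings below are assumptions for illustration, not the project's exact configuration:

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the ops required by the BERT preprocessing layer

# Assumed TF Hub handles for a small BERT encoder and its matching preprocessor
tfhub_handle_preprocess = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
tfhub_handle_encoder = "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1"

model = build_classifier_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    # Dense(1) with no activation outputs a single logit, hence a from_logits loss
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.BinaryAccuracy()],
)
# model.fit(train_texts, train_labels, validation_split=0.1, epochs=3)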

Without going into the details of this technology, the general consensus of the ML community is that BERT is a VERY significant advance in NLP. But as for exactly why it works so well, there doesn't seem to be a clear explanation yet.
Regardless, it could prove to be a good choice for this project if we simply want to brute-force a very strong classification model.

But it did not work.
I will explain the reasons as we discuss model choices in the next section.

Solving Problems other than Word Embedding

There are many more inherent problems with this model than I'd like to admit.
Of the ones I managed to alleviate, here are the two most significant changes I made to the modeling strategies and datasets.

Singular Value Decomposition

The result of TF-IDF Word Embedding (which proved to be the only useful one in this project's endeavors) is a 6500-element-long vector per document, since the maximum-features option was capped at 6500.

Now, here are two problems with each of our documents being a 6500-element-long array of numbers:

  1. It's too long. This is unkind to memory.
  2. It's too complex. The model might be extremely confused because there are too many things to consider (6500 features), especially if the model relies on plotting these vectors in its algorithmic process.

Issue 1 is "not too much of a problem," since I just switched to a new computer around the beginning of this summer.
Issue 2 is a big problem. That is why we need to compress the large 6500-element vectors into smaller vectors with fewer elements, so as to make it easier for models to work with the text data.

The compression technique we use here is known as Principal Component Analysis (PCA), which, simply put (because as much as I'd like to, I don't think most of us want to dive into college math all of a sudden), is a mathematical operation that transforms vectors into representative coordinates with far fewer elements (usually on a conversion scale like 80-to-3, and in our project 6500-to-125).

This "PCA" operation is powered by a mathematical procedure called "Singular Value Decomposition" (SVD), which decides how the compression happens.
This is why the term "SVD" appeared in our feature extraction function's code in the section above: it's there for data compression.
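
As a quick, hedged sanity check on how much information survives this compression (assuming tsvd and train_vectorized are the objects from the extract_text_features_tfidf function above), one can inspect the explained variance of the fitted TruncatedSVD:

# Fraction of the original TF-IDF variance retained by the kept components
print(tsvd.explained_variance_ratio_.sum())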

SMOTE

So when will the acronyms stop in Machine Learning? Probably never.

But what I'd like you to recognize is that our classifier faces an unbalanced classification problem: there are too many by-licensed documents, but too few documents under any of the non-distributive licenses (which also has to do with those websites' security mechanisms preventing our text sampling).

Meanwhile, the overall dataset was too small to make any machine learning model work until new data was added.
In summary, the total number of entries in the cleaned dataset, even after three rounds of expanding it, was still floating around 1000.

This is around the minimum for a model to function properly on tasks of this complexity. We need more data.

Specifically, I need to get more data for the underprivileged (minority) classes in this classification problem.

An effective approach is SMOTE, the Synthetic Minority Oversampling Technique.
In this method, we repeatedly synthesize new data points for the minority classes based on their existing ones. This addresses the unbalanced classification problem by providing a roughly equal number of samples for each class to be classified, if not a sufficient number already.

import numpy as np
from imblearn.over_sampling import SMOTE

# Target sample count per class: 80% of the smaller of
# the mean class count and 1.8x that class's current count
smote_strat = {
    k: int(
        0.8 * min(round(np.mean(dataset_counts.values)),
                  dataset_counts.iloc[k] * 1.8)
    )
    for k in range(1, 7)
}
# Keep a separate copy of the labels for the SVD-compressed feature set
svd_Y_train = Y_train.copy()

# Resample the TF-IDF features and the SVD-compressed features separately
smote = SMOTE(sampling_strategy=smote_strat)
X_train, Y_train = smote.fit_resample(X_train, Y_train)

smote_svd = SMOTE(sampling_strategy=smote_strat)
svd_X_train, svd_Y_train = smote_svd.fit_resample(svd_X_train, svd_Y_train)

Above is a code snippet of the SMOTE strategy employed during the final fine-tuning of the model.
This counters the unbalanced classification problem well enough for the model to function under the current constraints.
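
As a minimal, hedged check that the resampling actually rebalanced the training set (pandas is assumed here; Y_train holds the post-SMOTE labels), one can inspect the class counts afterwards:

import pandas as pd

# After resampling, minority classes should sit much closer to the target counts
print(pd.Series(Y_train).value_counts())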

To see the model selection and training process, please visit Post 8B as linked here!

https://github.com/creativecommons/quantifying

"This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License": https://creativecommons.org/licenses/by-sa/4.0/

CC BY-SA License image

