DSD Fall 2022: Quantifying the Commons (7B/10)

In this blog series, I discuss the work I’ve done as a DSD student researcher from UC Berkeley at Creative Commons.

Bransthre
10 min read · Nov 21, 2022

In this massive post, I take my first steps toward building a Machine Learning model for the Quantifying the Commons initiative, discussing: what question the model should answer, what data it should receive, and the preliminary engineering behind it.

Post 7A/10 is here.

DSD: Data Science Discovery is a UC Berkeley data science research program that connects undergraduates, academic researchers, non-profit organizations, and industry partners into teams working on technological developments.

Getting a Dataset

Creative Commons itself does not have a dataset that suits this model’s purpose.

I will have to find my own.

First Instinct: Google Custom Search API

My first instinct for collecting data was the Google Custom Search API.
As a recap, this API lets me retrieve a series of website URLs that fit a combination of custom rules, such as the country a webpage comes from and the license a webpage is under.

For more on how the project utilized this API in prior efforts, check the previous posts here, here, and here.

However, my model would need more than just a webpage’s URL.
It would also need the webpage’s contents. To keep this part brief: Python has good pipelines for capturing the text contents of a webpage.
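
As an illustration, here is a minimal sketch of such a pipeline using the requests and BeautifulSoup libraries; the function name and details are my reconstruction for this post, not the project’s exact code:

import requests
from bs4 import BeautifulSoup

def fetch_page_text(url, timeout=10):
    # Download the page; give up quietly on network errors or bad statuses.
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException:
        return None
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop script and style tags so only human-readable text remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)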

I also tuned the Programmable Search Engine (PSE) settings to filter out results that introduce, but do not use, the linked license.
However, upon using this as the main data collection method, an overwhelming number of webpages turned out to be either short sites with less than 500 characters of content, or un-decodable .pdf and .txt files.

That ruled this first instinct out.

Second Choice: Common Crawl

Common Crawl is a large collection of website contents gathered by the non-profit Common Crawl Foundation; note that its datasets are usually PB-level gigantic.

For context, 1 PB is roughly 1 million GB, meaning it would take around 3900 computers with 256 GB of storage (roughly what an average computer has today) to hold the entire dataset.

The alternative, then, is to access the dataset in the cloud, perform the computations on it there, and then retrieve only the resulting dataset onto the computer (which we call the “local machine”, in contrast to the remote place in the cloud that stores the PB-sized dataset).

There are two recommended methods of accessing Common Crawl: mostly relying on local resources, or completely relying on cloud resources.

For the local route, a framework called Spark is required. Spark itself requires installing 10+ libraries and 4 programming languages, and after 8 hours of reading documentation, StackOverflow threads, and GitHub issues, and setting up the computer for it, the source code for using Spark to access the Common Crawl dataset turned out to be simultaneously outdated, incompatible, and undebuggable.
The local method of access thus does not work.

On the other hand, the cloud resources that Spark needs to access Common Crawl are unavailable due to constraints on this project’s development cost (which happens to be $0) and my current technical knowledge not covering the billing of the AWS services involved.

This makes Common Crawl an unavailable resource.

Third Try: Google Custom Search API with Progressive Constraints

Eventually, I came back to Google Custom Search with additional constraints, mainly to:

  • Inhibit the number of .pdf and .txt files in search results, since they’re barely decodable
  • Inhibit webpages that introduce a license rather than use it

I then settled on a primary set of constraints:

  1. Invoke the “rights” parameter of the Google Custom Search API, whose byproduct happens to favor results that are webpages rather than PDFs (see the sketch after this list).
  2. Retune the Programmable Search Engine that the API’s searches are based on by reducing the priority of pages labeled “Copyright” and “License”, and by blocking search results from several dictionary and encyclopedia websites that explain the license without using it.
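
As an illustration of the first constraint, below is a minimal sketch of a call to the Custom Search JSON API with the “rights” parameter, using the requests library; the API key, search engine ID, query, and license value are placeholders rather than the project’s actual settings:

import requests

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def search_with_rights(api_key, cse_id, query, rights="cc_attribute"):
    # Query the Custom Search JSON API, filtering results by license rights.
    params = {
        "key": api_key,    # API key from the Google Cloud console
        "cx": cse_id,      # ID of the Programmable Search Engine
        "q": query,        # placeholder search query
        "rights": rights,  # license filter, e.g. "cc_attribute", "cc_sharealike"
        "num": 10,         # the API returns at most 10 results per request
    }
    response = requests.get(API_ENDPOINT, params=params, timeout=10)
    response.raise_for_status()
    return [item["link"] for item in response.json().get("items", [])]

Since the API returns at most 10 results per request and at most 100 results per query in total, paginating a single query gives exactly the 100-results-per-license ceiling discussed next.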

Now, since Google’s API can offer at most 100 search results per license type, and I have roughly 40 CC tools to explore, the maximum possible size of the dataset is 4000 entries, which is not bad for many classical Machine Learning algorithms.

However, deduplication will reveal that some web addresses are sampled repeatedly, and decoding will fail on many of the web addresses received, so the actual number of usable entries in the dataset will be smaller.
Not to mention, many modern websites employ security services that block code-based content retrieval, which prevents them from being sampled at all.

Still, the Google API is the only way forward now.

Afterwards, further layers of constraints were imposed on the retrieved websites to expand the dataset and provide more insightful training data.
The idea is essentially this: in each batch of data retrieval, we make the combination of rules that search results must fit stricter and different from the previous batches, so that each step of dataset construction involves a slightly different subset of the Internet than before. This is an attempt to gather as many different websites as possible.

The first layer of constraints is the one noted above.

Second layer of constraints:

  • Sort results by whether they contain the text “This work is licensed under:”
  • Reduce the priority of pages labeled “University”, since pages from and about university-produced content appeared with high frequency in the prior sampling effort.

Third layer of constraints:

  • Force the search engine not to sort the search results at all
  • Restrict the available webpages to those published within the past 12 months.

All the constraints above produce a sampling frame that deviates considerably from the population of websites on the Internet, since they sample webpages from very specific subsets of it.
Therefore, models produced from these datasets would not generalize well. They are worthy experiments, nonetheless.

(I would like to blame this on the inaccessibility of other better, well-managed datasets, but I also regret not being able to find a better way out.)

Each additional layer of constraints yields diminishing returns in content acquired, and two other, unmentioned layers of constraints were dropped from practice because the data they sampled disrupted, rather than helped, the training of our Machine Learning model.

Organizing a Dataset

Why is it not such a good idea to use .csv files for huge datasets?

Memory issue.
A .csv file naturally takes up huge space on the computer because its contents are not efficiently encoded (if encoded at all).
SQL databases, on the other hand, can handle and store large amounts of data, and their tabular data can be manipulated directly to form stronger, more helpful datasets.

SQL is an industry-standard option for storing data. A SQL database is essentially a computer-readable .csv, structured as tables of data, but it costs less space and offers more functionality; think of it as the advanced version of a .csv spreadsheet.
Strictly speaking, the software that manages such a database is what computer scientists and software engineers call a “Relational Database Management System” (RDBMS), and SQL is the language used to query it.

In summary, SQL proved to offer greater value for the following reasons:

  • Easier and less costly storage of data (as addressed in prior paragraphs)
  • Better integration with Python (via a SQLAlchemy-Pandas pipeline, which is incredibly useful for data scientists to engineer their datasets in)
  • Easy syntax for overall summary of datasets and detailed querying

For obtaining the entire dataset from our SQL database, a few simple lines of code will do:

A very concise way of retrieving the model’s dataset through a SQL-Python pipeline, compared to what I used to do here.
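
The original post shows this step as a screenshot; a minimal sketch of what such a SQLAlchemy-Pandas pipeline could look like is below, where the database path and table name are placeholders rather than the project’s actual ones:

import pandas as pd
import sqlalchemy

# Placeholder database path and table name.
engine = sqlalchemy.create_engine("sqlite:///license_samples.db")

# Pull the entire table into a DataFrame for exploration.
df = pd.read_sql("SELECT * FROM webpage_samples", engine)

# A quick summary: how many sampled pages exist per license type.
license_counts = pd.read_sql(
    "SELECT license, COUNT(*) AS n_pages FROM webpage_samples GROUP BY license",
    engine,
)
print(license_counts)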

Let’s perform some simple explorations on this dataset.

There is a total of 8609 webpages sampled throughout the three layers of constraints.
However, out of these 8609 webpages, only 1956 unique URLs exist, so we actually have fewer than 2000 data entries to work with.

Preprocessing

These are pre-machine-learning efforts to process our data.
This stage is about shaping our dataset into a computer-readable, model-friendly format that helps the model work better.
Essentially, we are annotating the data here and there so the model notices which portions of the webpage contents matter most to its classification efforts.

That brings us to an essential topic: Unfortunately, computers are illiterate. Computers don’t really read “alphabets”.

But they read numbers that represent “alphabets”.
What looks like an “A” to a human eye is represented as some string of numbers in the computer’s cognition.
Therefore, to feed text data into computer-based machine learning models, we must first express words as numeric features.

Designing this conversion process is what most of text data preprocessing would deal with; for example:

  • How do we represent words as numbers?
  • After numeric conversion, what words should be deleted based on what standards?
  • How important is a word to the classification of an object?
  • What are words? On the other hand, what are not words?

And the answers to these questions will decide how easily a model can distinguish between pieces of text data from the dataset. Therefore, it is VERY IMPORTANT to conduct preprocessing thoughtfully and to experiment bravely.

Abraham Lincoln, a famous American president, is often credited with a saying that captures this idea:

Give me six hours to chop down a tree and I will spend the first four sharpening the ax.

If building and training models is about chopping down trees, then preprocessing is about sharpening the ax.

We will address each stage of the model’s preprocessing of data as marked below:

Deduplication

As seen in the prior analysis, out of 8609 data entries, only 1956 unique web addresses exist. Therefore, the dataset should be reduced to just these 1956 webpages:

Deduplicate the dataset by only keeping the earliest item encountered for each unique URL.
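
The post shows this step as a screenshot as well; a minimal sketch of the same idea in pandas, assuming the DataFrame has columns named "url" and "license", might be:

# Keep only the first (earliest-encountered) row for each unique URL.
df_unique = df.drop_duplicates(subset="url", keep="first").reset_index(drop=True)

# Count the remaining entries per license type.
print(df_unique["license"].value_counts())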

After this data purge, we may see the count of data entries left in each license type as follows:

Now, see how the data left in this dataset is spread across the classes.

The data is rather unbalanced between different license types.
We will deal with this issue in the next post.

Tolerance for Unicode/Non-English characters

The Programmable Search Engine and the Google Custom Search API, even working together, still cannot strictly restrict the search results they provide to documents in English. Meanwhile, some documents use rather unorthodox encodings for some of their text.
Consequently, data entries like this can appear:

Undecrypted characters and nonsensical text

Data entries that contain more than a certain number of undecodable characters should be eliminated from the dataset as unhelpful outliers:

def remove_unicodes_aggressive(df, field_name="contents"):
    # Keep only the rows whose contents do NOT exceed the undecodable-character
    # threshold, i.e. drop the entries flagged by has_unicodes.
    return df.loc[~df[field_name].apply(has_unicodes), :]

Here, we aggressively remove all entries with more than 30 undecodable characters.
The threshold of 30 is the result of fine-tuning.
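
The helper has_unicodes is not shown here; a minimal sketch of one plausible implementation, assuming we simply count characters outside the ASCII range and compare against the threshold of 30, could be:

def has_unicodes(text, tolerance=30):
    # Count characters outside the ASCII range as stand-ins for undecodable characters.
    bad_characters = sum(1 for ch in text if not ch.isascii())
    return bad_characters > tolerance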

A more passive alternative is to simply strip out the undecodable characters and keep the rest of the parsed contents. But since the majority of an ill-decoded website’s contents are lost as undecodable characters anyway, what remains would not provide much value to the model.
In that case, it is better to remove the entry from the dataset entirely for the sake of model performance.

URL Removal

URLs don’t provide much value for license classification, so it’s better to leave them out of the text contents.
For this part, we replace all URL-resembling text patterns with spaces:

import re  # regular expressions, used in this and the following snippets

def remove_web_urls(s):
    # Replace anything that looks like a URL with a space.
    return re.sub(
        (
            r"[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b"
            r"([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
        ),
        " ",
        s,
    )

Non-word/Non-space Removal

We also remove all non-word and non-space characters from the text content of a webpage, so that the model only looks at the words and spaces of a document’s content.
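
The post does not show this step’s code; a one-function sketch that keeps only letters and whitespace (which also drops digits, consistent with the “3D” example later) might look like:

def remove_non_word_chars(s):
    # Replace everything that is not a letter or whitespace with a space.
    return re.sub(r"[^A-Za-z\s]", " ", s)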

Stopword Removal

Stopwords are common English words that appear everywhere precisely because of their generality.
Examples include “to”, “from”, “a”, “the”… none of which provide much valuable information for license classification. Therefore, we will remove English stopwords from the text contents of webpages as well:

import nltk  # the stopword list requires nltk.download("stopwords") on first use

def clear_stopwords(s):
    # Build the stopword set once instead of re-reading it for every word.
    stopwords = set(nltk.corpus.stopwords.words("english"))
    return " ".join(
        word for word in re.split(r"\s+", s) if word.lower() not in stopwords
    )

Non-English Word Removal

We also remove any non-English words, since they may distract the model from its analysis, or they can be leftovers of data cleaning that no longer register as English vocabulary.

For example, after the word “3D” is reduced to “D” (because we delete numbers from webpage text contents), the token no longer carries much meaning for the computer.

def remove_non_english(s):
    # The "words" corpus requires nltk.download("words") on first use.
    words = set(nltk.corpus.words.words())
    return " ".join(w for w in re.split(r"\s+", s) if w in words)

Short Word Removal

We also remove short words of two or fewer characters. These are potentially stopwords of other languages (e.g., de, la, si…) or abbreviations of someone’s middle name, and none of these possibilities provide important information for the model:

def remove_less_than_c_chars(s, tolerance=2):
    # Keep only words longer than `tolerance` characters.
    words = re.split(r"\s+", s)
    return " ".join([word for word in words if len(word) > tolerance])

Short Contents Removal

We also remove contents with fewer than 500 characters; such short contents might be due to access denial, or the document might be the leftover of a parsing effort that stripped away its spam or worthless contents.
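
This filter is also simple to express in pandas; a minimal sketch, assuming the text lives in a column named "contents", could be:

def remove_short_contents(df, field_name="contents", min_chars=500):
    # Drop rows whose contents are missing or shorter than min_chars characters.
    return df.loc[df[field_name].str.len().fillna(0) >= min_chars, :]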

Synthesizing all above Efforts

Here are five example entries of webpages after preprocessing:

Five entries truncated to their first 550 characters

An Overview into the Next Post

Now we have completed preprocessing the dataset, but the dataset is still represented as text. It is not quite ready to be fed into a Machine Learning algorithm yet.
For now, though, we have completed about half of what Abraham Lincoln would call “sharpening the ax”.

So, in the next post, we will work on the following portions of this modeling task:

  1. Encode text into numbers (this is usually known as “Word Embedding”)
  2. Decide algorithms to build Machine Learning models with
  3. Decide dataset(s) to build models with
  4. Fine-tune models and acquire more training data to increase model accuracy

Essentially, we are finishing the model within the next post.

https://github.com/creativecommons/quantifying

“This work is licensed under a Creative Commons Attribution 4.0 International License”: https://creativecommons.org/licenses/by-sa/4.0/

CC BY-SA License image


Bransthre

A Taiwanese student at UC Berkeley. This is where I put notes about my experiences in Cognitive, Computer Science, and UC Berkeley!