DSD Fall 2022: Quantifying the Commons (7A/10)

In this blog series, I discuss the work I’ve done as a DSD student researcher from UC Berkeley at Creative Commons.

Bransthre
7 min read · Nov 21, 2022

In this massive text-post, I take my first steps toward building a Machine Learning model for the Quantifying the Commons initiative, discussing: what question the model should answer, what data it should receive, and the brainstorming behind it.

DSD: Data Science Discovery, is a UC Berkeley Data Science research program that connects undergraduates, academic researchers, non-profit organizations, and industry partners into teams that work toward their technological developments.

What? Modeling?

was probably my most immediate reaction when I read that modeling was a portion of the DSD presentation requirements, while our project statement was about visualization and data extraction.

To model?
What is it to model?
Defining what “modeling” really is makes a good entrance into a discussion of what this project’s modeling task should be.

In general, modeling refers to the practice of constructing a system of mathematical machinery that imitates some real-world behavior.

Models are essentially programs that help solve problems in the following formats:

  • Regression, which is about predicting a value based on a combination of other values; for example, predicting hours of sleep based on the number of clubs a student participates in.
  • Classification, which is about sorting a value into a category based on its traits; for example, distinguishing puppies from bagels, which is more difficult for a computer than I thought (see the sketch below):
Image of this great dilemma (puppies vs. bagels)
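To make the two formats concrete, here is a minimal Python sketch using scikit-learn; every number and trait name in it (clubs, sleep hours, “roundness”, “fluffiness”) is invented purely for illustration:

```python
# A minimal sketch of the two formats with scikit-learn; every number and
# trait name here (clubs, sleep hours, "roundness", "fluffiness") is made up.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict hours of sleep from the number of clubs joined.
clubs = [[0], [1], [2], [3], [4]]   # feature: clubs a student participates in
sleep = [9.0, 8.2, 7.5, 6.9, 6.1]   # target: hours of sleep
reg = LinearRegression().fit(clubs, sleep)
print(reg.predict([[5]]))           # predicted sleep for a 5-club student

# Classification: label an image as puppy (1) or bagel (0) based on two
# hypothetical numeric traits: "roundness" and "fluffiness".
traits = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.3, 0.8]]
labels = [0, 0, 1, 1]               # 0 = bagel, 1 = puppy
clf = LogisticRegression().fit(traits, labels)
print(clf.predict([[0.25, 0.85]]))  # a round-ish, fluffy thing: likely a puppy
```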

In the world of Data Science, modeling almost always refers to Machine Learning, the hot topic of the moment.
What we call “AI” in technology today is a superset (parent field) of Machine Learning, and what the news commonly refers to as “AI” happens to mostly be about this “Machine Learning” thing.

But since this blog is targeted at a general audience, I should first introduce what Machine Learning IS.

Covering Ground: What do you mean, Machine Learning?

Let’s quote some famous definitions of “Machine Learning” from the field and perform some literary analysis on them:

Field of study that gives computers the ability to learn without being explicitly programmed

Machine Learning is a discipline, and more precisely, a subfield of Computer Science. It grants a computer some program or algorithm that allows it to “learn”, or to figure something out autonomously, mainly to solve regression or classification problems.
However, this level of “autonomy” hugely depends on the program and algorithm itself, which is written by humans.

So ultimately, machine learning is powered by human learning.
Those who do not understand machine learning cannot rely on the machine to perform everything.

Here’s another insight into the autonomy of Machine Learning programs:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

First of all, what is this “experience” for a program? A program is not a brain; how can it perceive experience? I’d like to discuss this notion through two different possible types of experience: experience that already exists and experience that is learned along the way.

Every machine learning algorithm requires a set of data to learn patterns from, usually called a dataset. We feed this dataset to a Machine Learning algorithm and train this little algorithm to identify patterns that separate puppies from bagels.

The dataset is the prior experience of a Machine Learning algorithm: it shows what looks like a puppy, and what does not, but rather looks like a bagel.

Then, while the algorithm runs, it will sometimes notice for itself that a learned pattern doesn’t really apply to other data.
For example, when distinguishing puppies and bagels, I might mistakenly judge that “everything orange is a bagel” just because I’ve seen too many orange bagels; but a lot of puppies happen to be orange too!

Once the algorithm learns that this pattern is wrong, it gains experience about what a wrong pattern looks like and avoids it in the future.
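To make that mistake-driven learning concrete, here is a minimal sketch of a perceptron-style update rule; the traits and the tiny dataset in it are hypothetical, not anything from our project:

```python
# A perceptron-style sketch of mistake-driven learning. The traits, data,
# and weights are all hypothetical; 1 = puppy, 0 = bagel.
weights = [0.0, 0.0]  # one weight per trait: [orange-ness, fluffiness]
bias = 0.0

# Some puppies are orange too, so "orange means bagel" is a wrong pattern
# the algorithm must unlearn.
data = [([0.9, 0.1], 0), ([0.8, 0.2], 0), ([0.9, 0.9], 1), ([0.2, 0.8], 1)]

for _ in range(20):  # several passes over the dataset
    for x, label in data:
        score = weights[0] * x[0] + weights[1] * x[1] + bias
        prediction = 1 if score > 0 else 0
        error = label - prediction  # nonzero only when the guess was wrong
        # Only a mistake changes the weights: the algorithm revises the
        # wrong pattern it had learned, which is the "experience" above.
        weights = [w + error * xi for w, xi in zip(weights, x)]
        bias += error

# Fluffiness ends up with the dominant positive weight; orange-ness does not.
print(weights, bias)
```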

After running a machine learning algorithm, we end up with a bunch of formulas that summarize the patterns the algorithm has learned for its problem. This result, this bunch of formulas that decides what is a puppy and what is a bagel, is what we officially call a “model”.

Then, the model’s performance can be measured by a lot of different metrics, which we call “statistics” (a statistic, in any context, can just mean “a measurement of some phenomenon”). In this case, there are a couple of statistics to consider: accuracy, error, number of false positives… etc.
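As a small illustration, here is a sketch that scores a batch of hypothetical predictions with scikit-learn’s metric helpers:

```python
# A sketch of the statistics above on a batch of hypothetical predictions.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # ground truth: 1 = puppy, 0 = bagel
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]  # what the model guessed

print(accuracy_score(y_true, y_pred))  # fraction correct: 0.75
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fp)  # false positives: bagels mistaken for puppies -> 1
print(fn)  # false negatives: puppies mistaken for bagels -> 1
```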

Risks and Weaknesses of Machine Learning

But from the above descriptions, we can see that an essential factor in the success of an algorithm is “how it deals with a dataset”.

Because what a model learns is decided by its dataset.

And, on the other hand, even if there were a supreme algorithm that picks everything right off a dataset and fully learns it, the model it produces would still not always be successful. The reason is that datasets may capture reality incompletely, and the model then comes to adopt this incomplete image of reality as its own.

This is how the development of machine learning has raised concerns about systematic discrimination and unintended consequences.

Let’s demonstrate this problem through a slightly more delightful example: once again, puppies and bagels.

  • I have overwhelmingly more pictures of puppies than of bagels in my dataset, so my model only learns about puppies and barely learns about bagels (this is called unbalanced classification, as sketched below).
  • The puppies in my dataset are all blue, so when the model is deployed in the real world, it cannot identify green puppies (a biased dataset).
  • My dataset might contain some unorthodox square bagels, making the model unable to identify a circular object as a possible bagel (outliers in the dataset).

All of these can produce confused models that cannot distinguish puppies from bagels.
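Here is a quick sketch of that first failure mode, on made-up numbers:

```python
# A sketch of unbalanced classification on made-up numbers: a model that
# only ever answers "puppy" looks accurate but never finds a single bagel.
from sklearn.metrics import accuracy_score, recall_score

y_true = [1] * 95 + [0] * 5  # 95 puppies, only 5 bagels in the dataset
y_pred = [1] * 100           # the lazy model's answer to everything

print(accuracy_score(y_true, y_pred))             # 0.95 -- looks great...
print(recall_score(y_true, y_pred, pos_label=0))  # 0.0 -- zero bagels found
```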

But apply this logic to some of the most controversial questions of the form “how do we identify whether an individual has a certain property,” and a moment’s thought will tell you how a model can end up unintentionally, or innocently, discriminatory.

And now, imagine this model becomes a popular decision-making tool on some wide social network…

With that in mind, on the topic of machine learning:

“With impressive power comes impacting responsibilities”.

AND: flawed datasets cause bad models, because algorithms are extremely dependent on the data they’re fed.

A Model Solves a Question. What does Quantifying the Commons ask?

Quantifying the Commons has put a lot of effort into extracting data on the “usage of Creative Commons products”, so it would be natural to continue along this axis and investigate the usage of Creative Commons tools on the Internet.

So now, here are some brief guiding questions I came up with to chart the course of deciding what the model will answer:

Classification or regression? (There are really just these two choices)

Looking at our dataset, we don’t really have a lot of numeric variables to play around with.

In fact, we have already done regression on the number of CC-licensed YouTube videos across tens of two-month periods, and it wouldn’t be intriguing for either the researcher or the audience to see us repeat the same experiment here.

Prior efforts on regression during the visualization phase of the project

That steers the direction of the question toward classification.

In other words, instead of asking: “What is the value of something given XYZ?”

We will ask: “What is the type of something given XYZ?”

What data can I use to build the model?

Currently, the dataset collected by Quantifying the Commons only concerns document counts. Meanwhile, the available items to classify involve license details (type, version, subcategories…), platform name, media…
So the crucial problem is that, in the equation

model(input features) = predicted category

my dataset consists almost entirely of right-hand-side variables (categories one could predict) and offers no input features. I don’t have anything to base my predictions on when it comes to classifying.
I need new datasets, new features, to build my model on.

And the contents of new datasets would then have to be decided by the question of:

What kind of answers are useful to the development of this project?

The usage of CC tools is plateauing.

At times like these, most for-profit organizations in the same shoes would ask consulting or SaaS firms to extract insights about product growth. One indispensable aspect of that process would be “classifying the plans/products a user chooses, and why”.

This can prove to be valuable information for CC itself, for making inferences about its user base and about which document types use specific license types, as well as being helpful to users who want recommendations on which licenses to use for their documents, based on the licenses similar documents are predicted to use.

Now that we have brainstormed what kind of model can be helpful for Creative Commons, let’s decide what question this upcoming model will answer.

And that question would be:

“What is the license typing of a webpage/web document given its content?”

To answer that, the data we need would be a bunch of webpage contents from websites that are protected under CC tools. The model we build should then classify a webpage under one of the seven major categories of CC tools.
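As a rough sketch of what such a classifier could look like, assuming we had webpage texts paired with their CC tool categories (we don’t yet), here is a TF-IDF plus logistic-regression pipeline; the example pages and labels are invented placeholders:

```python
# A rough sketch of the planned classifier: webpage text in, CC tool
# category out. The pages and labels below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pages = [
    "open dataset free to remix and share commercially",
    "photo archive noncommercial use only no derivatives",
    "music samples share alike noncommercial",
    "public domain scans no rights reserved",
]
licenses = ["BY", "BY-NC-ND", "BY-NC-SA", "CC0"]

# Turn raw text into word-weight features, then fit a classifier on them.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(pages, licenses)

print(model.predict(["remixable commercial-friendly research dataset"]))
```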

We have just walked through the thinking and brainstorming behind this model! To read about the next engineering steps of building it, visit the next post: Post 7B/10.

https://github.com/creativecommons/quantifying

“This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License”: https://creativecommons.org/licenses/by-sa/4.0/

