DSD Fall 2022: Quantifying the Commons (10/10)

In this blog series, I discuss the work I’ve done as a DSD student researcher from UC Berkeley at Creative Commons.

Bransthre
7 min read · Nov 27, 2022

DSD (Data Science Discovery) is a UC Berkeley data science research program that connects undergraduates, academic researchers, non-profit organizations, and industry partners into teams working on technological developments.

Preface

So what is Quantifying the Commons?

From 2014 to 2017, Creative Commons released a series of reports detailing the size and diversity of Creative Commons product usage on the Internet:

This series of reports is important: it can attract investors interested in the significance of Creative Commons, attract users encouraged by statistics showing a large, reliable set of products, and offer Creative Commons’ internal staff new insights on how to further promote and develop their projects and products.

However, this effort ceased in 2017 because the data extraction methods for measuring Creative Commons product usage had become unreliable and, in places, unavailable.

Therefore, the work of this semester’s student researchers was to reinvent and design a new data extraction method with higher availability and firmer reliability, while also producing the visualizations for this report to guide the future developments of Creative Commons.

The general delegation of our project is as follows:

The delegation of work throughout this project

Data Retrieval

The nature of our project calls for a discussion of its data retrieval process.

The key information we aim to find is:

How many documents on the Internet are protected by Creative Commons?

Fortunately, since each online document protected by Creative Commons contains a hyperlink to the tool/license it uses, we can use online platforms’ APIs to collect, detect, and count CC-protected documents across a range of online platforms.

Here is a general survey of the online platforms the researchers worked with this semester, as well as the philosophy of the API-centered data extraction method.

Such a method for measuring CC product usage can be outlined as follows (with a sketch of step 2 after the list):

  1. Acquire a list of CC tools to investigate (offered by Creative Commons)
  2. Generate API endpoint calls to detect and count Creative Commons–protected documents on each online platform within our delegation
  3. Create auxiliary subprocesses to facilitate each platform’s data collection
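To make step 2 concrete, here is a minimal sketch of one such API endpoint call, using the Google Custom Search JSON API to count indexed pages that link to a given CC legal tool. The API key, search engine ID, and query below are hypothetical placeholders, and the real pipeline in our repository covers more tools, platforms, and error handling.

```python
import requests

API_KEY = "YOUR_GOOGLE_API_KEY"   # hypothetical credential
PSE_ID = "YOUR_SEARCH_ENGINE_ID"  # hypothetical Programmable Search Engine ID

# Step 1: a small sample of the CC legal tool URLs to investigate.
CC_TOOL_URLS = [
    "https://creativecommons.org/licenses/by/4.0",
    "https://creativecommons.org/licenses/by-nc-nd/4.0",
    "https://creativecommons.org/publicdomain/zero/1.0",
]


def count_documents_linking_to(tool_url: str) -> int:
    """Step 2: ask the Custom Search API how many indexed pages link to a CC tool."""
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": API_KEY,
            "cx": PSE_ID,
            "q": "creative commons",  # illustrative keyword query
            "linkSite": tool_url,     # restrict results to pages linking to this tool
        },
        timeout=30,
    )
    response.raise_for_status()
    return int(response.json()["searchInformation"]["totalResults"])


for url in CC_TOOL_URLS:
    print(url, count_documents_linking_to(url))
```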

Upon detecting documents and collecting data, we perform some exploratory analyses, as detailed in this post (post 4 link). Furthermore, the data extraction methods must be revised in light of those discoveries.

It is not within my delegation to detail the exploration and retuning of Flickr’s data collection process, so I will leave that to my partner’s post; however, some suggestions from my side can be seen on the poster and in prior blog posts.

Having retuned the data extraction methods, we reviewed our code and pushed it to production in this GitHub repo.

Data Visualization

The visualizations we produce are not only communicative but also exhibitory: they showcase the significance of Creative Commons around the world to attract people, and they offer insights for guiding new policies and product developments.

For that, we applied some new principles of data visualization, some based on the contents of prior work and some in response to the lack of such implementations there.

For the engineering of the datasets, feel free to visit this post for more details. Essentially, the process involves preprocessing and merging datasets while translating numbers that are hard to interpret on their own into meaningful metrics suitable for visualization. Meanwhile, for APIs that report misleading counts (such as the YouTube Data API), missing and capped data are imputed alongside the original values.
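As a rough illustration of what that engineering looks like in code, here is a hedged pandas sketch; the column names, the cap value, and the imputation rule are placeholder assumptions rather than the project’s actual ones.

```python
import pandas as pd

# Illustrative raw counts per platform and CC tool; in the project these
# come from the per-platform extraction scripts.
raw = pd.DataFrame({
    "platform": ["Google", "Google", "YouTube", "YouTube"],
    "license": ["BY", "BY-NC-ND", "BY", "BY-SA"],
    "document_count": [1_200_000_000, 300_000_000, 1_000_000, 1_000_000],
})

# The YouTube Data API caps reported totals, so flag and impute capped values.
# The cap value and scaling factor here are placeholders, not the real ones.
YOUTUBE_CAP = 1_000_000
capped = (raw["platform"] == "YouTube") & (raw["document_count"] >= YOUTUBE_CAP)
raw.loc[capped, "document_count"] = (raw.loc[capped, "document_count"] * 1.5).astype(int)

# Translate hard-to-interpret absolute counts into a meaningful metric:
# each license's share of its platform's CC-protected documents.
merged = raw.groupby(["platform", "license"], as_index=False)["document_count"].sum()
merged["share_of_platform"] = (
    merged["document_count"]
    / merged.groupby("platform")["document_count"].transform("sum")
)
print(merged)
```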

At last, here are the yields of my visualization phase.
Some highlights are presented below:

There are now more than 2.7 billion webpages protected by Creative Commons on Google!
We can see that Attribution and Attribution-NoDerivatives are popular licenses among the 3 billion documents sampled across the dataset.
Roughly 45.3% of the documents under CC protection are covered by free-culture tools.
In particular, Western Europe and the Americas enjoy much more robust use of Creative Commons documents in terms of quantity; further adoption in Asia and Africa should be encouraged.

For more yields, analyses, and insights, please visit this post.

Modeling

A model exists to answer a question.

To decide on that question, I considered the following guiding questions:

  1. Regression or classification?
  2. Do I currently have more left-hand-side or right-hand-side variables available for classification? If not, should I collect new datasets?
  3. What would be useful for both CC and its users?

And I finally decided that:

Given a webpage’s content, the model should classify it into one of the seven major categories of Creative Commons tools.
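For concreteness, here is the label space that decision implies, under my own assumption (not an official definition) that the seven major categories are the six CC license families plus the CC0 public-domain dedication:

```python
# Assumed label space for the classifier: the six CC license families
# plus the CC0 public-domain dedication (an illustrative grouping).
CC_TOOL_CATEGORIES = [
    "BY", "BY-SA", "BY-NC", "BY-ND", "BY-NC-SA", "BY-NC-ND", "CC0",
]
```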

From here on, we move from this post’s progress to this post’s discussion.

Since Creative Commons did not offer a training dataset for the modeling task, I decided to find my own. After a bit of a round trip, I settled on using the Google Custom Search API in combination with a Programmable Search Engine’s customized set of rules, which allows me to target different, more constrained sampling frames as I expand my dataset.
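Below is a minimal, hedged sketch of what that collection step could look like: it queries the Custom Search JSON API through a Programmable Search Engine and stores the sampled URLs with a label into SQLite. The credentials, query, label, and table schema here are illustrative assumptions, not the project’s exact configuration.

```python
import sqlite3

import requests

API_KEY = "YOUR_GOOGLE_API_KEY"   # hypothetical credential
PSE_ID = "YOUR_SEARCH_ENGINE_ID"  # Programmable Search Engine tuned to CC-licensed pages


def sample_urls(query: str, label: str, pages: int = 3) -> list[tuple[str, str]]:
    """Collect (url, label) pairs from the Custom Search JSON API."""
    rows = []
    for page in range(pages):
        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={
                "key": API_KEY,
                "cx": PSE_ID,
                "q": query,
                "start": 1 + 10 * page,  # the API returns 10 results per page
            },
            timeout=30,
        )
        resp.raise_for_status()
        rows += [(item["link"], label) for item in resp.json().get("items", [])]
    return rows


# Store the sampled webpages in a small RDBMS table for later cleaning.
conn = sqlite3.connect("cc_training_samples.db")
conn.execute("CREATE TABLE IF NOT EXISTS samples (url TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    sample_urls('"creativecommons.org/licenses/by/4.0"', "BY"),
)
conn.commit()
conn.close()
```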

Afterwards, having stored my dataset in an RDBMS, it was time to remove text information that is of no interest to either the models or human readers. For more details on this pipeline, see this post. Then, for more details on how the text was converted into numeric features for the model to operate on, visit this article. Last but not least, to see the model training results and some feedback on the concurrent work of this project involving modeling, see the articles respectively linked under the subjects above.
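Since those posts carry the details, here is only a compact, hedged sketch of the overall shape of that pipeline, using scikit-learn’s TF-IDF vectorizer and logistic regression as stand-ins; the actual cleaning rules, feature settings, and model families in the project are more involved.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def clean_text(html_text: str) -> str:
    """Strip markup and noise that neither the model nor a human reader needs."""
    text = re.sub(r"<[^>]+>", " ", html_text)  # drop HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # keep alphabetic tokens only
    return re.sub(r"\s+", " ", text).strip().lower()


# Tiny illustrative corpus; in the project these come from the sampled webpages.
pages = [
    "<p>Photos shared for remixing with attribution.</p>",
    "<p>No derivatives of this article may be distributed.</p>",
]
labels = ["BY", "BY-ND"]

# Convert cleaned text into numeric TF-IDF features and fit a classifier
# over the CC tool categories.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit([clean_text(p) for p in pages], labels)
print(model.predict([clean_text("<p>Share alike, attribution required.</p>")]))
```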

The training results of my models, excluding the BERT neural-network architecture

Closing Remarks

I wrote this right before going to sleep; the comments are honest, even if the writing style is questionable.

Participating in this project is like jumping into a dark sea.

Much of the time, the directions that online research offers are not all that informative, nor do they always turn into streaks of hope that push the project forward. For instance, the modeling portion alone took 20 papers before I settled on a good modeling idea, and the API portion took even more reading, across API documentation and academic examples of similar projects.

The deeper the dive, the more the burden, the more time lost, the more excitement lost… but at last, some solution appears as the result of that dive, as a treasure gained and a new opportunity to dive deeper.

Meanwhile, the entirety of my work is independent of my teammate’s (though interconnected with my managers’ through revisions of the work), while I lent concepts and lines of code to my teammate. As a result, a large and remarkable portion of the deliverables is based on my work, placing more burden and weight on me. Hopefully, this explains why my work appears so prominently in the deliverables.

Now, let me clarify: jumping into a dark sea sounds scary, but isn’t so bad of an idea.

This is not just some masochistic rambling about the enjoyment and exhilaration of painfully difficult exploratory work. It is my honest voice.

I fully utilized the independence and flexibility this project offered me to learn and apply knowledge from concurrent coursework and beyond my usual reach, to learn new technologies, and to find more topics of conversation and discussion with friends who started their research experience at the same time I did.

I really enjoy the idea of quantifying for quality: as I quantify the quality of a product and organization, the organization can use that quantification to develop, promote, and evolve its qualities in response.

This is the concept of data-driven development, applied to both my skillset and the work of this project, as honestly reflected throughout the 13 posts of this series. Moreover, the search in the dark sea, and the helplessness of it, does not cancel out the knowledge learned and the transformations these three months of work brought.

Quantifying the Commons has been a worthwhile experience. To wrap this series up, I’m excited to jump into the next dark sea, and I hope to see Creative Commons on its way to do the same with what has been created this semester, for the common good of the Internet.

https://github.com/creativecommons/quantifying

“This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License”: https://creativecommons.org/licenses/by-sa/4.0/

CC BY-SA License image

