DSD Fall 2022: Quantifying the Commons (0/10)

In this blog series, I discuss the work I’ve done as a DSD student researcher from UC Berkeley at Creative Commons.

Nov 17, 2022

In this post, I define the main objective of our project and describe the preliminary work that lays out the path for future progress.

DSD (Data Science Discovery) is a UC Berkeley data science research program that connects undergraduates, academic researchers, non-profit organizations, and industry partners in teams that work on the partners' technological developments.

What is Quantifying the Commons?

Creative Commons is a nonprofit organization that helps overcome legal obstacles to the sharing of knowledge and content, working toward a better internet with a robust open culture that protects the rights of intellectual property owners.

From 2014 to 2017, Creative Commons released public reports detailing the growth, size, and usage of Creative Commons. The 2017 report is still online and can be found here. Unfortunately, the effort to quantify Creative Commons ceased the following year.

These reports are the predecessor of our current open-source project: Quantifying the Commons.

Quantifying the Commons is dedicated to quantifying the size and diversity of Creative Commons usage.
For example, did you know that over 1.4 billion online works were protected and promoted via Creative Commons licenses?

A record of the count of Creative Commons licensed works, as captured from https://stateof.creativecommons.org/

Woah! Wouldn’t you like to know what happened in 2022 then?

What platforms use our licenses? At what intensity?
What is the distribution of license usage across each license type? Or version?
What is the growth of Creative Commons license usage over time?
…

However, the project's prior data collection efforts suffered from unreliable data retrieval methods, ranging from web scraping to manual data logging.
These extraction methods are prone to breaking whenever a data source updates its website architecture, are not particularly rigorous, and are less available than a reliable code-based approach that, given enough development resources, can extract the desired CC usage data at any time.
This point will be revisited in the conclusion, when we compare the current work of Quantifying the Commons to its predecessor.

To continue and advance the work of quantifying the state of Creative Commons, DSD student researchers are tasked with designing and implementing reliable retrieval processes for the Creative Commons usage data used in previous reports.

There are many more questions regarding the “what” of Creative Commons licenses that we plan to answer at Quantifying the Commons as we seek ways to replicate the past efforts of its predecessor.

Week 0: Finding Sources of Data

To quantify the state of any system, we need data on that system. So, the essential questions of this very first week are:

1. Where do we find data on Creative Commons usage?

2. What data do we need to replicate past efforts? Where can it be found?

To find data on Creative Commons usage, we should first be able to tell whether an online webpage or document uses a Creative Commons tool.
The instructions for using a Creative Commons tool happen to answer this question: for a document to use a Creative Commons tool, the document’s author would

“Insert a hyperlink from your licensing text to the appropriate license (CC tool).”

To find the number of online documents that use a Creative Commons license, I just need to find the number of online documents that include a hyperlink to the text of that license.
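As a minimal sketch of this idea (not the project's actual code), the check for a single document could look like the following. The names LicenseLinkFinder, is_cc_tool_url, and find_cc_links are hypothetical, and I am assuming that a Creative Commons tool is identified by a link to creativecommons.org under /licenses/ or /publicdomain/.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def is_cc_tool_url(url):
    """Return True if the URL points at a Creative Commons legal tool.

    Assumption: legal tools live on creativecommons.org under
    /licenses/ or /publicdomain/.
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host == "creativecommons.org" and parsed.path.startswith(
        ("/licenses/", "/publicdomain/")
    )


class LicenseLinkFinder(HTMLParser):
    """Collects the href of every <a> or <link> tag that targets a CC tool."""

    def __init__(self):
        super().__init__()
        self.license_links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        href = dict(attrs).get("href")
        if href and is_cc_tool_url(href):
            self.license_links.append(href)


def find_cc_links(html):
    """Return all CC tool hyperlinks found in an HTML document."""
    finder = LicenseLinkFinder()
    finder.feed(html)
    return finder.license_links


# A made-up page with one license link and one unrelated link.
page = (
    '<p>Licensed under <a href="https://creativecommons.org/licenses/by/4.0/">'
    'CC BY 4.0</a>. See also <a href="https://example.com/">this site</a>.</p>'
)
print(find_cc_links(page))
```

A document counts as CC licensed under this sketch whenever find_cc_links returns a non-empty list. Checking documents one at a time does not scale to the whole internet, which is why the rest of this post turns to APIs that can do the counting for us.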

Note that all the data we seek exists on some online webpage. This makes APIs very suitable for our data collection approach.
An API, short for “Application Programming Interface”, is a code-based way for one application to request data from another.
Using an API, we can write code that requests information about a document from a search engine or website; the retrieved information can include any hyperlink the document contains to a Creative Commons license webpage.
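To make this concrete, here is a hedged sketch of what such a request might look like for the Google Custom Search JSON API, which accepts a linkSite parameter that restricts results to pages linking to a given URL. The helper names, the placeholder key and engine ID, the query term, and the sample response are all assumptions for illustration; the sketch only builds the request URL and parses a hand-written response, so it sends nothing over the network.

```python
import json
from urllib.parse import urlencode

# Endpoint of the Google Custom Search JSON API.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_request_url(api_key, engine_id, query, license_url):
    """Build a search URL restricted to pages that link to license_url."""
    params = {
        "key": api_key,           # API key (placeholder value below)
        "cx": engine_id,          # Programmable Search Engine ID (placeholder)
        "q": query,               # search terms; the value used below is an assumption
        "linkSite": license_url,  # only return results that link to this URL
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def total_results(response_text):
    """Read the estimated result count from a JSON response body."""
    payload = json.loads(response_text)
    # The API reports the count as a string, so convert it to an integer.
    return int(payload["searchInformation"]["totalResults"])


# Hand-written stand-in for a response body; this is not real data.
SAMPLE_RESPONSE = '{"searchInformation": {"totalResults": "12300"}}'

url = build_request_url(
    "MY_API_KEY", "MY_ENGINE_ID", "photo", "creativecommons.org/licenses/by/4.0"
)
print(url)
print(total_results(SAMPLE_RESPONSE))
```

Repeating a request like this once per license URL would give an estimated document count for each license, which is the kind of statistic the past reports presented.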

Furthermore, because APIs exist for many search engines and popular media-hosting websites, we have very accessible data sources from which to mine statistics on Creative Commons usage.

Following the project's overall direction of replicating past efforts, we will concentrate on retrieving the number of documents per platform that use a Creative Commons license.

Let us now turn to the next question: which platforms should we search with APIs to find Creative Commons licensed documents?
Looking at the usage reports from the past four years, and aiming to represent the diversity of platforms that collaborate with Creative Commons, we decided to narrow our search to the following platforms:

Deviantart, Source of idea for data extraction
Flickr, Source of idea for data extraction
Google, Source of idea for data extraction
Internetarchive, Source of idea for data extraction
Metmuseum, Source of idea for data extraction
Wikicommons, Source of idea for data extraction
Wikipedia, Source of idea for data extraction
Vimeo, Source of idea for data extraction
YouTube, Source of idea for data extraction

I am responsible for collecting data from every listed platform except Flickr.

With these data sources and their APIs in mind, we will start our data collection process with Google Custom Search.

https://github.com/creativecommons/quantifying

“This work is licensed under a Creative Commons Attribution 4.0 International License”: https://creativecommons.org/licenses/by-sa/4.0/

Written by Bransthre