DSD Fall 2022: Quantifying the Commons (2/10)

In this blog series, I discuss the work I’ve done as a DSD student researcher from UC Berkeley at Creative Commons.

Bransthre
4 min read · Nov 17, 2022

In this post, I discuss the implementation of our approach to extracting Google search data on how many Creative Commons-licensed webpages are findable on Google.

DSD, or Data Science Discovery, is a UC Berkeley data science research program that connects undergraduates, academic researchers, non-profit organizations, and industry partners into teams working toward their technological developments.

2–0: Script

The overall data collection process should be implemented as a script. The open-source nature of this project, and its intended role as a reliable pathway for replicating past data collection efforts, make a script a good medium for our code.

Because the code to be produced is a script, the overall architecture of the program should allow easy updating and editing by users who pull from our GitHub repo.

Meanwhile, since each user of the open-source code may have different client credentials, we provide a “query_secret” file for each API that requires a client credential, which both streamlines the user experience of the script and improves the secret-hiding mechanics of our own code.
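For illustration, here is a minimal sketch of how such a secrets file could be read at the start of the script, assuming a simple key=value format; the file name “query_secret” follows the text above, but the format, helper name, and credential names are hypothetical rather than the project’s exact code.

```python
def load_secrets(path="query_secret"):
    """Read API credentials (e.g. an API key and a PSE client ID) from a local file."""
    secrets = {}
    with open(path) as file:
        for line in file:
            if "=" in line:
                key, value = line.strip().split("=", 1)
                secrets[key] = value
    return secrets

# Hypothetical usage; the credential names are placeholders:
# secrets = load_secrets()
# API_KEY, PSE_CX = secrets["API_KEY"], secrets["PSE_CX"]
```

Keeping credentials in an untracked local file like this means the script itself can be published without exposing anyone’s keys.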

2–1: Setting up the API Credentials

The API credentials involve a PSE client ID and a GCS API key for the script user. Both credentials can be registered by clicking a few buttons on the respective web pages for PSE and for the GCS API.

2–2: Obtaining the List of CC Tools

As mentioned previously, the final list of all available CC tools on the Internet spans 652 items, which poses a significant inconvenience to the data collection process under our development cost constraints. The instinctive response is to keep only the most significant and encompassing types of licenses.

A quick glance at the list of licenses shows that many paths are jurisdictional versions of each license’s unported version, while the unported version of each license is the general URL for its respective type-version combination. Considering the relative significance of the unported/general URLs compared to that of the jurisdictional versions, it is sensible to extract only the items of the list that are general URLs.

To quickly extract an array of general license paths from the list of all license paths, I employed the data scientist’s favorite package: pandas.

I first parsed the .txt file containing all licenses into a pandas.Series object (essentially an indexed array), then came up with a general RegEx pattern for extracting the unported/general version of each listed license’s path, applied it to that Series, and finally took only the unique items of the resulting indexed array.
This brings us from 652 license items for data inspection to 52 general categories of licenses.
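As a rough sketch of this step, assuming the licenses are stored one URL per line in a plain .txt file (the file name and RegEx pattern here are illustrative, not the project’s exact ones):

```python
import pandas as pd

# Parse the .txt file of license paths into a pandas.Series (one URL per line).
licenses = pd.read_csv("license_list.txt", header=None, names=["path"])["path"]

# Keep only the general (unported) part of each path, dropping any trailing
# jurisdiction code, e.g. ".../licenses/by-sa/3.0/de/" -> ".../licenses/by-sa/3.0".
general = licenses.str.extract(
    r"^(.+?/(?:licenses|publicdomain)/[^/]+/[\d.]+)", expand=False
)

# Jurisdictional variants now collapse into their general URL; deduplicate them.
general_unique = general.dropna().unique()
print(len(general_unique))  # should land near the 52 general categories noted above
```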

2–3: Obtaining the List of Countries and Languages

Since there is no programmatic way to collect the list of all available country and language keywords for GCS API calls, the best we could do so far was to copy-paste the respective tables of values from the online documentation page into a .txt file each for country (cr) and language (lr).
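Loading those files back into the script is then straightforward; a minimal sketch, assuming one keyword per line and illustrative file names:

```python
def load_keywords(path):
    """Read one GCS parameter value (e.g. "countryDE" or "lang_de") per line."""
    with open(path) as file:
        return [line.strip() for line in file if line.strip()]

# Hypothetical file names for the two copied tables:
# countries = load_keywords("google_countries.txt")  # values for the cr parameter
# languages = load_keywords("google_languages.txt")  # values for the lr parameter
```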

2–4: Making the API Call

Making the API Call comes with three steps:

  1. Figuring out the API endpoint URL (the URL from which the response data can be requested, which we call the “endpoint”),
  2. Making the call,
  3. Retrieving the data.

Fortunately, the last two steps can be handled by the Python library “requests”, which, together with its underlying urllib3 retry utilities, supports exponential backoff and provides a method to retrieve the API call result as a JSON object.

For the first step, then, I built a function that assembles an API endpoint URL from any desired set of call parameters, so I can designate a custom response. This code will become the boilerplate for making API calls to every platform on which we investigate CC license usage.
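Putting the three steps together, here is a minimal sketch. The endpoint and parameter names follow the public Custom Search JSON API documentation, but the helper names, retry settings, and example query are illustrative rather than the project’s exact code.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

BASE_URL = "https://www.googleapis.com/customsearch/v1"

def get_endpoint(api_key, cx, query, **params):
    """Step 1: assemble the endpoint URL from the query and any extra parameters (e.g. cr, lr)."""
    prepared = requests.PreparedRequest()
    prepared.prepare_url(BASE_URL, {"key": api_key, "cx": cx, "q": query, **params})
    return prepared.url

def make_call(endpoint):
    """Steps 2 and 3: GET the endpoint with exponential backoff and return the parsed JSON."""
    session = requests.Session()
    retries = Retry(total=5, backoff_factor=1, status_forcelist=[429, 500, 503])
    session.mount("https://", HTTPAdapter(max_retries=retries))
    response = session.get(endpoint)
    response.raise_for_status()
    return response.json()

# Hypothetical usage: count pages referencing the CC BY-SA 4.0 deed, restricted to Germany.
# data = make_call(get_endpoint(API_KEY, PSE_CX, '"creativecommons.org/licenses/by-sa/4.0"', cr="countryDE"))
# total = data["searchInformation"]["totalResults"]
```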

2–5: Storing Retrieved Data

The retrieved data is then stored in a .csv file, whose column headers and the file itself are generated by the internal workings of the data collection script.

Any tabular data format would serve well here, but .csv files are easy to update and delete entries from, and they enjoy well-written read support from, once again, the data scientist’s favorite pandas package.

While .tsv files would have worked, since they share .csv’s favorable features, raw .tsv files are painstaking to read and much less concise than .csv.
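As a sketch of the storage step, with illustrative column names (the actual headers are generated by the script itself):

```python
import pandas as pd

# One row per (license, country, language) query; num_results stands in for the
# count returned by the API call above.
rows = [
    {"license": "by-sa/4.0", "country": "countryDE", "language": "lang_de", "num_results": None},
]
pd.DataFrame(rows).to_csv("gcs_results.csv", index=False)

# Reading the results back for later analysis is a one-liner:
# results = pd.read_csv("gcs_results.csv")
```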

https://github.com/creativecommons/quantifying

“This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License”: https://creativecommons.org/licenses/by-sa/4.0/

CC BY-SA License image
