DSD Fall 2022: Quantifying the Commons (1/10)

In this blog series, I discuss the work I’ve done as a DSD student researcher from UC Berkeley at Creative Commons.

Bransthre
6 min read · Nov 17, 2022

In this post, I discuss the theoretical aspects of our approach to extracting Google search data on how many Creative Commons licensed webpages can be found on Google.

DSD, short for Data Science Discovery, is a UC Berkeley data science research program that connects undergraduates, academic researchers, non-profit organizations, and industry partners into teams working on technological developments.

Selecting a Data Sampling Method

The Google Custom Search API will be used to investigate the number of documents on Google under each Creative Commons license type.

The main method we use is list; see its documentation.
In summary, this method performs a Google Search with the arguments specified in its API call and returns the results of a similarly conditioned Google Search.
The API is RESTful, which suits the project well: its query endpoint is relatively intuitive to call, while still allowing complex parameter combinations that constrain the search to a more targeted range of websites.
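
To make this concrete, here is a minimal sketch of what one such call could look like in Python against the documented REST endpoint. The API key and PSE ID are placeholders, and the helper name cse_list is ours for illustration, not part of the project's codebase.

```python
# A minimal sketch of one GCS API "list" call over its REST endpoint.
# API_KEY and CX_ID are hypothetical placeholders for a real API key
# and Programmable Search Engine ID.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
CX_ID = "YOUR_PSE_ID"     # placeholder (the PSE's ID, passed as "cx")

def cse_list(query, **extra_params):
    """Run one Custom Search query and return the parsed JSON response."""
    params = {"key": API_KEY, "cx": CX_ID, "q": query}
    params.update(extra_params)
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params=params,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```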

A working call to the Google Custom Search API has a few essential components, which we detail below for transparency of the data collection process: the Programmable Search Engine, the query keyword, and additional parameters (as expected of a REST API).
Note: We will also discuss the feasibility of this data collection process, its performance in a realistic context, and the updates it needed after its preliminary design.

API Call Settings

The Google Custom Search API, hereafter abbreviated as the GCS API, does not operate on Google’s search engine per se.
Rather, it operates on a Programmable Search Engine (PSE): a programmable search engine whose range and strictness of search results can be customized across a selected subset of the web.

The PSE’s ID is a mandatory parameter of any GCS API call. Provided a PSE’s ID, the GCS API performs its search on that Programmable Search Engine and, again, returns the search results.

Although our Programmable Search Engine is configured to search the entire web, according to Google, a PSE only accesses a subset of the webpages available through Google Search.
Regardless, that subset is already of significant size and includes most of the results available from the Google search engine.

Every API call also requires a query keyword, named “q” in the documentation for our GCS API method (once again, see the documentation linked above).
At this parameter’s level, Google Search operators like “AND”, “OR”, and “site:” still function. Intuitively, the “link:” operator therefore seemed like the optimal choice here: it lets us search for webpages that contain a link to the URL placed after the operator’s colon.
(Therefore, the query keyword for a license whose URL is X would just be “link:X”.)
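
For illustration, the query keyword for a single license could be assembled as below. This is a sketch reusing the hypothetical cse_list helper from earlier, and the license URL shown is just one example.

```python
# Example: build the "q" keyword for one license URL with the "link:" operator.
# cse_list() is the hypothetical helper sketched earlier.
BY_4_0_URL = "https://creativecommons.org/licenses/by/4.0/"  # example license URL

query = f"link:{BY_4_0_URL}"   # "link:X" asks for pages that link to URL X
result = cse_list(query)
```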

As a note of caution: this data collection method eventually turned out to malfunction.
Since these blog posts discuss the work in chronological order, and since this discovery had relatively small influence, we will discuss why in Post 4.

Finally, we will discuss the additional parameters attached to this call: parameters that select the region of documents we view.

In previous State of the Commons reports, we have recorded the usage of Creative Commons licenses across countries and languages, an effort I also set out to replicate.

In this case, the additional parameters of the GCS API call are Country (cr) and Document Language (lr). I decided to narrow the search range to 10 representative countries spread across the continents, and 8 of the most spoken languages that Google supports searching in. To see what these countries and languages are, check our GitHub repository.
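
As a sketch of how those parameters might be attached, the loop below uses Google’s documented value formats for cr (“countryXX”) and lr (“lang_xx”). The country and language values shown are illustrative and not the project’s exact selection, which lives in the GitHub repository.

```python
# Sketch: restrict the same "link:" query by country (cr) and language (lr).
# The lists below are illustrative, not the project's actual 10-country /
# 8-language selection.
COUNTRIES = ["countryUS", "countryIN", "countryBR"]  # e.g. US, India, Brazil
LANGUAGES = ["lang_en", "lang_es", "lang_fr"]        # e.g. English, Spanish, French

counts = {}
for cr in COUNTRIES:
    for lr in LANGUAGES:
        data = cse_list(f"link:{BY_4_0_URL}", cr=cr, lr=lr)
        counts[(cr, lr)] = int(data["searchInformation"]["totalResults"])
```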

Knowing how many Creative Commons documents come from each of these popular countries and languages helps replicate past efforts to show how license usage is distributed across geographical demographics.

Now, here are some limitations of, and different approaches to, the approximate calculations behind a “No Priori” search and an “all licensed works (all)” search.

A “No Priori” search attempts to count licensed works throughout the entire subset of webpages our Programmable Search Engine can access. The syntax of this API call is straightforward: simply omit the country and language parameters, so that no restraints are posed on the origin of search results.

On the other hand, there are two methods of computing an “all licensed works” search across each country, language, and other geographical demographics of webpage origin:

One method simply sets the linked URL to “creativecommons.org/licenses”, which measures the number of works that are under some license. This offers an approximate computation.

The other method is more precise: search for the number of documents under each Creative Commons tool’s coverage, and sum them all up. This yields a much larger value and is generally the better way of computing document counts under licenses.
However, it carries a high development cost if performed over every country that exists on Google (which is, once again, around 240), so we have only performed an “all” search on a selection of 10 arguably representative countries spread across five continents.
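
The sketch below contrasts the two approaches, again using the hypothetical cse_list helper; TOOL_URLS stands in for the set of license and tool URLs the project actually samples.

```python
# Sketch of the two "all licensed works" strategies described above.
def total_results(response):
    """Read Google's estimated hit count from one API response."""
    return int(response["searchInformation"]["totalResults"])

# Method 1 (approximate): one coarse query against the shared license path.
approx_all = total_results(cse_list("link:creativecommons.org/licenses"))

# Method 2 (finer-grained, more queries): one query per tool, summed.
TOOL_URLS = [
    "https://creativecommons.org/licenses/by/4.0/",
    "https://creativecommons.org/licenses/by-sa/4.0/",
    # ... remaining license/tool URLs
]
summed_all = sum(total_results(cse_list(f"link:{url}")) for url in TOOL_URLS)
```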

Discussion of Developmental Constraints

Now that we have outlined the theoretical aspects of our data collection process, we must discuss if it works in practice.

The first constraint of this collection method is the Google API’s daily usage quota: free users are allowed 100 queries per day.

Yet the original list of license tools to walk through involved 650+ Creative Commons tools to summarize and sample. To calculate such statistics across 250+ countries, 40+ languages, and 1 overall search result, we would need 180,000+ data queries, which would take at least 1,800 days to complete.

The immediate response is to shrink the number of data queries needed while still replicating past efforts.
In this case, we search over 10 countries rather than 250+ and 8 languages rather than the original 40+, but this still leaves us with 12,000+ required data queries.

The last resort is to shrink the number of Creative Commons tools to survey. In this case, we decided to sample only the 52 general, most significant license tools, whose URLs also encompass their sub-versions across different jurisdictions; RegEx was used to select these 52 tools.
This decision brings us from 180,000+, to 12,000+, and finally to about 1,000 needed data points.
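
For a rough sense of where these numbers come from, the back-of-the-envelope arithmetic looks like this (figures reconstructed from the counts quoted above):

```python
# Rough query-count arithmetic behind the figures above.
QUERIES_PER_DAY = 100                  # free-tier daily quota

full_sweep  = 650 * (250 + 40 + 1)     # all tools x (countries + languages + overall) -> 180,000+
reduced_geo = 650 * (10 + 8 + 1)       # 10 countries, 8 languages -> 12,000+
final_plan  = 52 * (10 + 8 + 1)        # 52 consolidated tools -> about 1,000

print(full_sweep, reduced_geo, final_plan)
print(full_sweep // QUERIES_PER_DAY, "days at the free-tier quota for the full sweep")
```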

To accelerate the data sampling process, we decided to manually create several API keys and alternate between them whenever one key’s quota is depleted, since this does not violate the Google API’s Terms of Service (we do, after all, aspire to be law-abiding, beneficial citizens).

But these workarounds hurt the user experience of the data collection process.

For example, alternating API keys is exhausting.
If someone has a list of 30 API keys to use, manually switching keys and logging breakpoints for data collection is not going to be efficient. We will discuss an update that patches and alleviates this displeasure in Post 4.
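
One way such rotation could be scripted is sketched below. This is not the project’s actual patch (that comes in Post 4), and the quota-related status codes handled here are an assumption.

```python
# Sketch: rotate through several API keys as each one's daily quota runs out.
# The API keys are placeholders, and treating 403/429 responses as "quota
# exhausted" is a simplifying assumption, not the project's exact logic.
import requests

API_KEYS = ["KEY_1", "KEY_2", "KEY_3"]  # hypothetical placeholders
CX_ID = "YOUR_PSE_ID"                   # hypothetical placeholder

def cse_list_rotating(query, keys=API_KEYS, **extra_params):
    """Try each key in turn, moving to the next when one runs out of quota."""
    for key in keys:
        params = {"key": key, "cx": CX_ID, "q": query, **extra_params}
        response = requests.get(
            "https://www.googleapis.com/customsearch/v1", params=params, timeout=30
        )
        if response.status_code in (403, 429):  # assumed quota-exhausted signals
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError("all API keys appear to have exhausted their daily quota")
```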

Preemptions on Possible Improvements

The results of this data collection process, however, cannot be very accurate.

First, any PSE can only access a subset of Google’s webpages, and all we know about this subset is that it is extremely large.

Second, the GCS API only approximates the number of search results, so the total count of documents using a license is itself an approximation.
The numerical result can be highly volatile, even under the exact same set of parameters for the API call.

Last but not least, the data collection process relies on an API but, at this point in time, does not adopt essential procedures for controlling and handling its request rate.
To defend servers against overload and attacks, API providers commonly block, or even ban, users who suddenly request too much information from their application, in addition to imposing a general daily usage quota.
Measures like exponential backoff should be implemented in this data collection method to improve the experience of using the API, making the sampling process safer and smoother.
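
A minimal backoff sketch, reusing the hypothetical cse_list helper, could look like the following; the retry counts and delays are illustrative only.

```python
# Sketch: exponential backoff around a single query.
# cse_list() is the hypothetical helper from earlier; delays are illustrative.
import time
import requests

def cse_list_with_backoff(query, max_retries=5, base_delay=1.0, **extra_params):
    """Retry a query with exponentially growing waits between attempts."""
    for attempt in range(max_retries):
        try:
            return cse_list(query, **extra_params)
        except requests.HTTPError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, 8s, ...
```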

In the next post, we will detail the implementations of the theoretical approaches and performance patches discussed here.

https://github.com/creativecommons/quantifying

“This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License”: https://creativecommons.org/licenses/by-sa/4.0/

CC BY-SA License image

