DSD Fall 2022: Quantifying the Commons (3/10)

In this blog series, I discuss the work I’ve done as a DSD student researcher from UC Berkeley at Creative Commons.

Bransthre
6 min read · Nov 17, 2022

Having established the data querying pipeline for the Google Custom Search API in the previous post, it is now time to apply variations of it to the other data sources.

DSD (Data Science Discovery) is a UC Berkeley data science research program that connects undergraduates, academic researchers, non-profit organizations, and industry partners into teams working on technological developments.

The Boilerplate Approach to Collecting Data on Platforms

Most other data sources employ a simpler data collection process, so the differences between the GCS API's collection process and those of other APIs mainly come down to the answers to the following questions:

  • Does this API require client credentials?
  • Does this API support querying data points across different languages?
  • Does this API report data features beyond document count?
  • How does this API list its licenses?

Otherwise, setting aside each platform's API differences, every API's data collection process can be described simply as the following steps (a generic sketch follows the list):

  1. Obtain the list of licenses.
  2. Set up a function that builds the API endpoint URL and applies the client credentials.
  3. Set up a function that performs the API call and stores the response in a map-like data structure.
  4. Record the data for each license by making API request(s) for each license.
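
Below is a minimal sketch of that boilerplate in Python. The function names mirror those defined for GCS later in this post, but their bodies here are simplified placeholders; the example.com endpoint and the CSV layout are my assumptions, not the project's actual code.

import csv
import requests


def get_license_list():
    # Step 1 (placeholder): return the license identifiers to query.
    return ["licenses/by/4.0", "licenses/by-sa/4.0"]


def get_request_url(license):
    # Step 2 (placeholder): build the endpoint URL, attaching client
    # credentials here if the platform requires them.
    return f"https://api.example.com/search?license={license}"


def get_response_elems(license, session):
    # Step 3: perform the API call and keep the relevant fields in a dict.
    response = session.get(get_request_url(license))
    response.raise_for_status()
    return {"totalResults": response.json().get("totalResults", 0)}


def record_all_licenses(path="data.csv"):
    # Step 4: one API request (or more) per license, one row per license.
    session = requests.Session()
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["LICENSE TYPE", "Document Count"])
        for license in get_license_list():
            elems = get_response_elems(license, session)
            writer.writerow([license, elems["totalResults"]])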

For conciseness, I will only detail the peculiar exceptions whose data collection process departs significantly from that of the other APIs.

These are Wikicommons and DeviantArt.

Notable Workarounds: DeviantArt

DeviantArt simply doesn't offer a user-friendly API for this data collection approach, so I adopted the GCS API approach with an additional parameter, relatedSite, which restricts the search results to those associated with the URL supplied to that parameter.
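
Concretely, that amounts to appending one extra query parameter to the GCS endpoint URL. A minimal sketch, assuming the get_request_url helper described later in this post and DeviantArt's domain as the related site:

# Build the usual GCS endpoint URL, then restrict results to pages
# associated with DeviantArt (the domain here is an assumption).
base_url = get_request_url(license="licenses/by/4.0")
deviantart_url = f"{base_url}&relatedSite=deviantart.com"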

Notable Workarounds: Wikicommons

In Wikicommons, whether (and how) a file is licensed is indicated by its “Category” field: the category's name encodes the license.

Meanwhile, these categories are organized as a directed graph, where a parent category has a directed edge towards each of its child categories; however, surprisingly, some child categories turn out to also be ancestors of their own parents, so the graph contains cycles.

Essentially, we are traversing a cyclic graph where each node holds a media category's information along with the number of files, pages, and subcategories under it.

This calls for the general outline of a graph traversal algorithm with visited-node tracking, which the code below follows.

The data collection process for Wikicommons is therefore recursive, employing a depth-first traversal. For each license path, its alias and the document count under it are recorded to a .csv file, and the code then recursively searches each subcategory of that license path.

If a subcategory has already been visited (traversed), it is ignored, which prevents the cycles from causing infinite recursion.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Cache of categories already visited, so cycles are not re-traversed.
license_cache = {}
session = requests.Session()
max_retries = Retry(
    total=5,
    backoff_factor=10,
    status_forcelist=[403, 408, 429, 500, 502, 503, 504],
)
session.mount("https://", HTTPAdapter(max_retries=max_retries))


# get_subcategories, record_license_data, and license_alias are defined
# elsewhere in the data collection script.
def recursive_traversing_subroutine(alias):
    # Replace commas so the alias stays safe to write into a .csv row.
    alias = alias.replace(",", "|")
    cur_category = alias.split("/")[-1]
    subcategories = get_subcategories(cur_category, session)
    if cur_category not in license_cache:
        record_license_data(cur_category, alias, session)
        license_cache[cur_category] = True
        for cats in subcategories:
            recursive_traversing_subroutine(f"{alias}/{cats}")


recursive_traversing_subroutine(license_alias)

If you peek into the .csv file for the Wikicommons data, you’ll see that the licenses captured there are not exactly the license paths offered by Creative Commons. This issue can be alleviated, or even eliminated, via some postprocessing with the pandas package, which I will address in the next post.

The Boilerplate Approach on Google Custom Search

Let us use Google Custom Search’s Creative Commons usage data collection process as a case study for the above boilerplate approach.

We first consider whether the API requires credentials in its calls.
In this case, the Google Custom Search API requires two:

  1. An API key
  2. A PSE (Programmable Search Engine) key.

Therefore, the following setup was used:

The query_secrets file for API keys and PSE key

There is a designated file to hold the secrets and credentials for this API's calls. To accommodate the practice of API key rotation mentioned in the theoretical discussion of prior post(s), this file provides a container for multiple API keys.
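
A minimal sketch of what such a query_secrets file might look like is below; the variable names match those used in the code later in this post, but the file itself is kept out of version control, so its exact contents are my assumption:

# query_secrets.py -- illustrative placeholder values, never real keys
API_KEYS = [
    "GOOGLE_API_KEY_1",  # multiple keys support rotation when a quota runs out
    "GOOGLE_API_KEY_2",
]
PSE_KEY = "PROGRAMMABLE_SEARCH_ENGINE_ID"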

This file is then imported into the main data collection script, and its contents are used as global variables:
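
A hedged sketch of that import; API_KEYS_IND, the index of the key currently in use, is assumed to be initialized alongside it in the main script:

from query_secrets import API_KEYS, PSE_KEY

API_KEYS_IND = 0  # index of the API key currently in use; advanced on quota errors

The get_request_url function below then reads these globals to assemble each endpoint URL.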

def get_request_url(license=None, country=None, language=None, time=False):
    """Provides the API Endpoint URL for specified parameter combinations.
    Args:
        license:
            A string representing the type of license, and should be a segment
            of its URL towards the license description. Alternatively, the
            default None value stands for having no assumption about license
            type.
        country:
            A string representing the country code of country that the search
            results would be originating from. Alternatively, the default None
            value or "all" stands for having no assumption about country of
            origin.
        language:
            A string representing the language that the search results are
            presented in. Alternatively, the default None value or "all" stands
            for having no assumption about language of document.
        time:
            A boolean indicating whether this query is related to video time
            occurrence.
    Returns:
        string: A string representing the API Endpoint URL for the query
        specified by this function's parameters.
    """
    try:
        api_key = API_KEYS[API_KEYS_IND]
        base_url = (
            r"https://customsearch.googleapis.com/customsearch/v1"
            f"?key={api_key}&cx={PSE_KEY}&q=_"
        )
        if time:
            base_url = f"{base_url}&dateRestrict=m{time}"
        if license != "no":
            base_url = f"{base_url}&linkSite=creativecommons.org"
            if license is not None:
                base_url = f'{base_url}{license.replace("/", "%2F")}'
            else:
                base_url = f'{base_url}{"/licenses".replace("/", "%2F")}'
        if country is not None:
            base_url = f"{base_url}&cr={country}"
        if language is not None:
            base_url = f"{base_url}&lr={language}"
        return base_url
    except Exception as e:
        if isinstance(e, IndexError):
            print("Depleted all API Keys provided", file=sys.stderr)
        else:
            raise e

API keys can then be rotated via exception handling: when one key's quota is depleted, the key index is advanced to the next key (as shown in get_response_elems further below).

Then we turn to a different aspect of the approach: obtaining and representing the list of all licenses, as well as their general versions' aliases.
In this case, our project received a file from Creative Commons listing the paths of all Creative Commons tools that have existed, deprecated or not. As mentioned in prior paragraphs and posts, we summarize these into general versions via a RegEx- and pandas-powered approach:

def get_license_list():
    """Provides the list of license from 2018's record of Creative Commons.
    Returns:
        np.array: An np array containing all license types that should be
        searched via Programmable Search Engine.
    """
    cc_license_data = pd.read_csv(f"{CWD}/legal-tool-paths.txt", header=None)
    license_pattern = r"((?:[^/]+/){2}(?:[^/]+)).*"
    license_list = (
        cc_license_data[0]
        .str.extract(license_pattern, expand=False)
        .dropna()
        .unique()
    )
    return license_list
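
To illustrate the RegEx (the example path below is hypothetical; the actual contents of legal-tool-paths.txt may differ), the pattern keeps only the first three path segments, i.e. the general version of a legal tool:

import re

# Hypothetical legal tool path; the pattern keeps the first three
# "/"-separated segments and discards the rest.
path = "licenses/by-nc-nd/2.0/jp/legalcode.ja"
print(re.search(r"((?:[^/]+/){2}(?:[^/]+)).*", path).group(1))
# licenses/by-nc-nd/2.0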

Afterwards, we follow the approach of building an API call URL, performing the call, retrieving its response, and storing it into a file. This is outlined in the following code snippet, which also handles exponential backoff:

def get_response_elems(license=None, country=None, language=None, time=False):
    """Provides the metadata for query of specified parameters
    Args:
        license:
            A string representing the type of license, and should be a segment
            of its URL towards the license description. Alternatively, the
            default None value stands for having no assumption about license
            type.
        country:
            A string representing the country code of country that the search
            results would be originating from. Alternatively, the default None
            value or "all" stands for having no assumption about country of
            origin.
        language:
            A string representing the language that the search results are
            presented in. Alternatively, the default None value or "all" stands
            for having no assumption about language of document.
        time:
            A boolean indicating whether this query is related to video time
            occurrence.
    Returns:
        dict: A dictionary mapping metadata to its value provided from the API
        query of specified parameters.
    """
    try:
        request_url = get_request_url(license, country, language, time)
        max_retries = Retry(
            total=5,
            backoff_factor=10,
            status_forcelist=[400, 403, 408, 500, 502, 503, 504],
            # 429 is Quota Limit Exceeded, which will be handled alternatively
        )
        session = requests.Session()
        session.mount("https://", HTTPAdapter(max_retries=max_retries))
        with session.get(request_url) as response:
            response.raise_for_status()
            search_data = response.json()
        search_data_dict = {
            "totalResults": search_data["searchInformation"]["totalResults"]
        }
        return search_data_dict
    except Exception as e:
        if isinstance(e, requests.exceptions.HTTPError):
            global API_KEYS_IND
            API_KEYS_IND += 1
            print(
                "Changing API KEYS due to depletion of quota", file=sys.stderr
            )
            return get_response_elems(license, country, language, time)
        else:
            print(f"Request URL was {request_url}", file=sys.stderr)
            raise e

At last, a main function lets the script run with a user-friendly exception handling process:

def main():
    set_up_data_file()
    record_all_licenses()


if __name__ == "__main__":
    try:
        main()
    except SystemExit as e:
        sys.exit(e.code)
    except KeyboardInterrupt:
        print("INFO (130) Halted via KeyboardInterrupt.", file=sys.stderr)
        sys.exit(130)
    except Exception:
        print("ERROR (1) Unhandled exception:", file=sys.stderr)
        traceback.print_exc(file=sys.stderr)
        sys.exit(1)
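
The set_up_data_file and record_all_licenses helpers called by main are not shown in this post. A rough sketch of what they could look like, assuming one .csv row per license; the output path and column names here are my assumptions:

DATA_WRITE_FILE = f"{CWD}/data_google_license.csv"  # assumed output path


def set_up_data_file():
    # Write the header row of the output .csv file.
    with open(DATA_WRITE_FILE, "w") as f:
        f.write("LICENSE TYPE,Document Count\n")


def record_all_licenses():
    # One query per license; append one row per license to the .csv file.
    for license_type in get_license_list():
        data = get_response_elems(license=license_type)
        with open(DATA_WRITE_FILE, "a") as f:
            f.write(f"{license_type},{data['totalResults']}\n")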

Transitioning to Next Phase

There were also a few more code-review-centered practices and revisions during these weeks, mainly guided by the project supervisor, along with some shell scripts for code file formatting. These are largely non-data-related parts of the workflow, so I won't detail them here.

Finally, having taken care of the exceptions addressed above in the data collection process, I have successfully retrieved data from all 8 delegated data sources.

https://github.com/creativecommons/quantifying

“This work is licensed under a Creative Commons Attribution 4.0 International License”: https://creativecommons.org/licenses/by-sa/4.0/

CC BY-SA License image
