DSD Fall 2022: Quantifying the Commons (4/10)

In this blog series, I discuss the work I’ve done as a DSD student researcher from UC Berkeley at Creative Commons.

Bransthre
11 min read · Nov 17, 2022

In this post, I perform some EDA on the data acquired so far in order to improve the sampling processes.

DSD (Data Science Discovery) is a UC Berkeley data science research program that connects undergraduates, academic researchers, non-profit organizations, and industry partners into teams working on technological developments.

In data science, asking questions is important. What questions should we ask, and why? We ask questions to scrutinize and investigate our current findings, and the first step of doing so is to explore the data we have acquired.

This process of asking questions and finding characteristics of a dataset is generally summarized as “EDA”: Exploratory Data Analysis.

Let us start off with performing EDA on our current datasets. Specifically, we will be investigating and observing the following qualities of the datasets gathered at this point:

1. Structure of Dataset: Shape? Format? Columns?

2. Granularity: What does each record of the dataset stand for?

3. Scope of Sampling: What is the scope at which we sampled CC-protected documents?

4. Temporality of Data: How up to date is the data? How is the data updated?

5. Faithfulness of Data: How reliable is the data, and in what sense is it reliable?

Note: Since the full EDA is a very long stretch of text and analysis, I have moved it to the appendix of this blog post, so as not to disrupt the flow of the ongoing discussion of exploratory analyses.

An Example of an Improvable Dataset

It would be most fitting for me to critique my own datasets first.

Let us observe what data is currently at our disposal when it comes to making visualizations. All datasets mainly focus on “Document Count”, and together the set of datasets (database) covers document counts across license type, version, freeness, and platform. This is good enough to replicate past efforts.

However, we would then sacrifice some valuable axes of product analysis, such as how product usage varies over time and across geographical regions.
Meanwhile, it is evident that the Google Custom Search API has the depth of information and the capability to gather the desired information listed above.

Surveying the faithfulness of my datasets, we immediately find that YouTube’s General Search Data is almost unusable. This will be revealed further by a cross-time analysis of YouTube’s video counts across two-month periods.

So far, the direction for expanding my datasets is thus:

1. Let Google Custom Search capture information on document license usage across geographical components, such as countries and continents.

2. Let Google Custom Search and the YouTube API capture information on document license usage across time, in units of years and months.

As an interlude (jumping ahead a little), let us look at a dataset from another teammate and perform a short EDA on it as well:

Then, a Reflection on Flickr License Data (cleaned_license9.csv)

The motivation for diagnosing the sampling method for Flickr comes from this original visualization that my teammate provided:

A clarifying visualization for https://github.com/creativecommons/quantifying by user SusannYY

To supplement the information lacking from the visualization:

  • The horizontal axis represents time and is a series of datetime objects noting the date each photo was taken.
  • The vertical axis represents the number of photos taken at the designated time.

As a preliminary note on the granularity of the data this visualization employs: it likely fails to capture the trend of photo counts over time.

Each point on the horizontal axis stands for one exact second, not a period of time (say, a month), so the visualization shows the development of photo counts over individual seconds rather than over meaningful periods (say, months).
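To make the granularity point concrete, here is a minimal pandas sketch of how the photos could instead be aggregated into monthly counts; “dateTaken” is a hypothetical column name standing in for the dataset’s actual timestamp field.

```python
import pandas as pd

# Minimal sketch: aggregate second-precision timestamps into monthly counts.
# "dateTaken" is a hypothetical column name for the photo's capture time.
photos = pd.read_csv("cleaned_license9.csv")
photos["dateTaken"] = pd.to_datetime(photos["dateTaken"], errors="coerce")

monthly_counts = (
    photos.set_index("dateTaken")
          .resample("M")    # one bin per calendar month instead of per second
          .size()
)
print(monthly_counts.tail())
```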

But what really matters, and appears suspicious, is not the above.

The visualization implies that a significant majority of photos under this license were taken between July and October of 2022, and furthermore, that almost no photos were taken at any time between 2016 and September of 2021.

To investigate the discrepancy between this visualization and common sense, I decided to run some pandas-driven EDA on a Flickr-specific license-document dataset from the teammate.

First, I tried to investigate why there seem to be so many photos taken at one very specific second out of the entire 40-year span the x-axis represents:

Investigative work on how many records (photos) exist per second, by the time each photo was taken.
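A minimal pandas sketch of that check, again assuming a hypothetical “dateTaken” column holding each photo’s capture time:

```python
import pandas as pd

# Count how many records (photos) share each exact "date taken" timestamp.
photos = pd.read_csv("cleaned_license9.csv")
per_second_counts = photos["dateTaken"].value_counts()

# The most repeated timestamps expose suspicious spikes,
# e.g. over a thousand photos sharing one exact second.
print(per_second_counts.head(10))
```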

The above code block presents the total number of photos taken at each specific timestamp, down to the one-second precision the dataset provides.
We can see that 1056 photos were taken during one exact second on October 15th, 2022, which is bizarre. But they might still just be pictures with different contents that happen to have been taken at the same time.

To investigate whether this is the case, let an analysis of two highly repeated timestamps serve as case studies:

Investigative work on the uniqueness of records taken at specific times with high photo counts.
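Here is a rough sketch of that check; “views”, “tags”, and “description” are hypothetical column names standing in for the real ones.

```python
import pandas as pd

# For the two most repeated timestamps, check whether the records actually differ.
photos = pd.read_csv("cleaned_license9.csv")
top_times = photos["dateTaken"].value_counts().index[:2]

for t in top_times:
    subset = photos[photos["dateTaken"] == t]
    print(t, len(subset),
          subset["views"].nunique(),
          subset["tags"].nunique(),
          subset["description"].nunique())
# If each column has only one unique value across 500+ rows, the rows are
# almost certainly duplicates rather than distinct photos taken at once.
```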

For the 500+ photos taken at each of the above timestamps, almost all photos have the exact same number of views and comments, the exact same set of tags, and the exact same description. These are highly likely to be duplicates.

Having identified duplicates in the Flickr dataset, I should attempt to identify the reason this has occurred. The reason appeared very soon.
On the Flickr API documentation page for the method my teammate chose, per quote:

“Please note that Flickr will return at most the first 4,000 results for any given search query.”

As for how my teammate corrected this issue, I will leave the implementation of the correction to my teammate’s report.

Collecting New Datasets

YouTube

I first worked on collecting YouTube’s count of Creative Commons licensed videos for each year.
However, YouTube has over a million videos under a Creative Commons license uploaded in almost every year of its existence. Therefore, I would have to reduce the time span covered by each query.

Eventually, with some trial and error, balancing the API querying load against dataset granularity, I arrived at the balance point of recording video counts over two-month periods. To counter the stream of capped YouTube API response values, I decided to make imputing the right number of CC-licensed videos a task for the visualization phase (addressed in the next post).
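As a rough sketch of what one such two-month query could look like (not the exact script; the API key and window boundaries are placeholders), using the YouTube Data API’s search endpoint with its Creative Commons license filter:

```python
from googleapiclient.discovery import build

# Sketch: count CC-licensed videos uploaded within one two-month window.
# "YOUR_API_KEY" is a placeholder; the real collection script may differ.
youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")

response = youtube.search().list(
    part="id",
    type="video",
    videoLicense="creativeCommon",         # only Creative Commons licensed videos
    publishedAfter="2022-01-01T00:00:00Z",
    publishedBefore="2022-03-01T00:00:00Z",
    maxResults=1,
).execute()

# Note: totalResults is itself an approximation and is capped by the API.
print(response["pageInfo"]["totalResults"])
```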

Google CSE API: General Search Data

Then, I revamped the Google CSE API’s sampling method. Per the description in Post 1, we were using the “link:” operator for search operations, but several sources (mainly this and this) indicate that this operator is deprecated.
Therefore, instead of using the “link:” operator, we have decided to go with the “linkSite” parameter in the API call, which searches for websites that contain a link to the address given as the linkSite parameter’s value.

In other words, and in summary, we are now using the reincarnation of the “link:” operator.
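A minimal sketch of such a call, going through the Custom Search JSON API over plain HTTP (the API key and search engine ID below are placeholders, and the queried license URL is just an example):

```python
import requests

# Sketch: count pages that link to a CC license deed, using the linkSite
# parameter instead of the deprecated "link:" search operator.
params = {
    "key": "YOUR_API_KEY",           # placeholder credentials
    "cx": "YOUR_SEARCH_ENGINE_ID",   # placeholder search engine ID
    "q": "creativecommons.org",
    "linkSite": "creativecommons.org/licenses/by/4.0",
}
response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
total = response.json()["searchInformation"]["totalResults"]
print(total)  # approximate number of pages linking to the BY 4.0 deed
```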

Google CSE API: Chronological data

Next, I worked on collecting Google’s count of CC-protected documents across six-month periods over the recent 10 years.

The Google API allows me to query the total count of documents fitting my customized parameters within an arbitrary window counting back from the current date, but not from one specific month to another (or, more likely, I just accidentally skimmed over that section of the documentation).

Therefore, the Google time-related query dataset in fact involves a cumulative sum of document counts across each six-month period counting back from now, which is not so helpful for visualization.
We will counter this issue during the visualization phase (addressed next week), but it is worth celebrating that we have acquired the data we desired from our previously stated directions of improvement.
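As a sketch of how such cumulative counts might be gathered, assuming the dateRestrict parameter (e.g. “m6” for the past six months) is the mechanism behind the trailing windows described above; treat that parameter choice as my assumption rather than the exact script:

```python
import requests

# Sketch: cumulative CC-document counts over trailing 6, 12, ..., 120 month
# windows, assuming the dateRestrict parameter ("m6", "m12", ...).
BASE = "https://www.googleapis.com/customsearch/v1"
cumulative = {}

for months in range(6, 121, 6):          # the recent 10 years, in 6-month steps
    params = {
        "key": "YOUR_API_KEY",           # placeholder credentials
        "cx": "YOUR_SEARCH_ENGINE_ID",
        "q": "creativecommons.org",
        "linkSite": "creativecommons.org/licenses/by/4.0",
        "dateRestrict": f"m{months}",
    }
    res = requests.get(BASE, params=params).json()
    cumulative[months] = int(res["searchInformation"]["totalResults"])

# Each value counts documents from `months` ago until today, so adjacent
# values will later need to be differenced to recover per-period counts.
print(cumulative)
```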

Google CSE API: Geographical data

At last, the counts of Google’s CC-protected documents across countries were collected via approximations in the API calls, which reduced the number of required API calls by a factor of at least 50. This offers a good enough measurement of how intensive the usage of CC tools is across geographical regions. We would then just need some geodata representation tool in Python to aid our analysis and visualizations.
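One plausible way to restrict each query to a single country is the API’s cr (country restrict) parameter; below is a sketch along those lines, though the actual approximation used in the collection script may differ:

```python
import requests

# Sketch: approximate per-country counts of CC-licensed documents via the
# cr (country restrict) parameter; country codes use the "countryXX" format.
BASE = "https://www.googleapis.com/customsearch/v1"
countries = {"United States": "countryUS", "Germany": "countryDE", "Japan": "countryJP"}
counts = {}

for name, cr_code in countries.items():
    params = {
        "key": "YOUR_API_KEY",           # placeholder credentials
        "cx": "YOUR_SEARCH_ENGINE_ID",
        "q": "creativecommons.org",
        "linkSite": "creativecommons.org/licenses/by/4.0",
        "cr": cr_code,
    }
    res = requests.get(BASE, params=params).json()
    counts[name] = int(res["searchInformation"]["totalResults"])

print(counts)
```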

Appendix 4A: EDA on Current Datasets

DeviantArt Dataset

For structure, the DeviantArt dataset is a csv file.
The columns of this dataset are as follows:

  • License Type: Qualitative Nominal, stands for general license typing.
  • Document Count: Quantitative Discrete, the number of documents under a license.

For the granularity of this dataset, each row of the dataset accounts for the record of one license typing and its document count.
The scope of the dataset is a representative subset of all DeviantArt media that can be found on Google.
The temporality of this dataset is good, as the data was collected within 1.5 months of the presentation.
The faithfulness of this dataset is fine. While Google Custom Search API accesses a subset of webpages and has some volatility in query results count, its results are quite acceptable as a rough estimator of the true document count.

Flickr Dataset (photos.json)

For structure, this Flickr dataset is a json file.
After some reorganization into a csv file, having extracted usable information for visualization, the columns of this dataset are as follows:

  • License Type: Qualitative Nominal, stands for general license typing.
  • Document Count: Quantitative Discrete, the number of documents under a license.

For the granularity of this dataset, each row of the dataset accounts for the record of one license typing and its document count.
The scope of the dataset is all public Flickr media that can be retrieved with API access.
The temporality of this dataset is good, as the data was collected within 1.5 months of the presentation.
The faithfulness of this dataset is fine, as the Flickr API is generally reliable for total result approximation.

Google Custom Search Dataset (General Search Data)

For structure, the Google Custom General Search Data dataset is a csv file working as a pivot table, where each column stands for a search range (documents limited to some country, some language, or with no prior restriction at all).
For the granularity of this dataset, each row of the dataset accounts for the record of one license typing and its document count.
The scope of the dataset is a representative subset of all webpages that can be found on Google.
The temporality of this dataset is good, as the data was collected within 1.5 months of the presentation.
The faithfulness of this dataset is fine. While Google Custom Search API accesses a subset of webpages and has some volatility in query results count, its results are quite acceptable as a rough estimator of the true document count.

Internetarchive Dataset

For structure, the Internetarchive dataset is a csv file.
The columns of this dataset are as follows:

  • License Type: Qualitative Nominal, stands for general license typing.
  • Document Count: Quantitative Discrete, the number of documents under a license.

For the granularity of this dataset, each row of the dataset accounts for the record of one license typing and its document count.
The scope of the dataset is a representative subset of all Internetarchive media that can be retrieved by its API.
The temporality of this dataset is good, as the data was collected within 1.5 months of the presentation.
The faithfulness of this dataset is fine, as the Internet Archive API is generally reliable for total result approximation.

MetMuseum Dataset

For structure, the MetMuseum dataset is a csv file.
The columns of this dataset are as follows:

  • License Type: Qualitative Nominal, stands for general license typing.
  • Document Count: Quantitative Discrete, the number of documents under a license.

For the granularity of this dataset, each row of the dataset accounts for the record of one license typing and its document count.
The scope of the dataset is a representative subset of all Metmuseum media that can be retrieved by its API.
The temporality of this dataset is good, as the data was collected within 1.5 months of the presentation.
The faithfulness of this dataset is fine, as the Metmuseum API is generally reliable for total result approximation.

Vimeo Dataset

For structure, the Vimeo dataset is a csv file.
The columns of this dataset are as follows:

  • License Type: Qualitative Nominal, stands for general license typing.
  • Document Count: Quantitative Discrete, the number of documents under a license.

For the granularity of this dataset, each row of the dataset accounts for the record of one license typing and its document count.
The scope of the dataset is a representative subset of all Vimeo media that can be retrieved by its API.
The temporality of this dataset is good, as the data was collected within 1.5 months of the presentation.
The faithfulness of this dataset is fine, as the Vimeo API is generally reliable for total result approximation.

Wikicommons Dataset

For structure, the Wikicommons dataset is a csv file.
The columns of this dataset are as follows:

  • License Type: Qualitative Nominal, stands for general license typing.
  • File Count: Quantitative Discrete, the number of files under a license.
  • Page Count: Quantitative Discrete, the number of pages under a license.

For the granularity of this dataset, each row of the dataset accounts for the record of one license typing with its file and page counts.
The scope of the dataset is a representative subset of all Wikicommons media that can be retrieved by its API.
The temporality of this dataset is good, as the data was collected within 1.5 months of the presentation.
The faithfulness of this dataset is fine, as the Wikicommons API is generally reliable for total result approximation.

Wikipedia Dataset

For structure, the Wikipedia dataset is a csv file.
The columns of this dataset are as follows:

  • Language: Language that the current query is in.
  • Page Count: The number of Wikipedia Pages under that language.
  • Article Count: The number of Wikipedia Articles under that language.
  • Edit Count: The number of Wikipedia Edits under that language.
  • Images: The number of Wikipedia Images under that language, possibly overlapping with Wikicommons.
  • Users: The number of Wikipedia Users in that language.
  • Active Users: The number of Wikipedia Active Users under that language.
  • Admins: The number of Wikipedia Admins under that language.
  • Queued Messages: The number of Wikipedia Queued Messages under that language.
  • CirrusSearch Article Words: The number in Wikipedia Cirrus Search metrics called “Article Words” under that language.

For the granularity of this dataset, each row of the dataset accounts for the record of one language and its associated counts.
The scope of the dataset is a representative subset of all Wikipedia media that can be retrieved by its API.
The temporality of this dataset is good, as the data was collected within 1.5 months of the presentation.
The faithfulness of this dataset is fine, as the Wikipedia API is generally reliable for total result approximation.

YouTube Dataset (General Search Data)

For structure, the YouTube General Search Data dataset is a csv file.
The columns of this dataset are as follows:

  • License Type: Qualitative Nominal, stands for general license typing.
  • Document Count: Quantitative Discrete, the number of documents under a license.

For the granularity of this dataset, each row of the dataset accounts for the record of one license typing and its document count.
The scope of the dataset is a representative subset of all YouTube media that can be retrieved by its API.
The temporality of this dataset is good, as the data was collected within 1.5 months of the presentation.
The faithfulness of this dataset is not fine for approximating total document counts, as the YouTube API caps the reported number of search results at 1,000,000:

The total number of results in the result set. Please note that the value is an approximation and may not represent an exact value. In addition, the maximum value is 1,000,000. — Documentation

This means we will be underestimating the count of YouTube’s CC-licensed videos very substantially with this dataset.

https://github.com/creativecommons/quantifying

“This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License”: https://creativecommons.org/licenses/by-sa/4.0/

CC BY-SA License image
