DSD Fall 2022: Quantifying the Commons (6/10)

In this blog series, I discuss the work I’ve done as a DSD student researcher from UC Berkeley at Creative Commons.

Bransthre
12 min read · Nov 17, 2022

In this post, I analyze and introduce the visualizations on Creative Commons product usage I have produced.

DSD (Data Science Discovery) is a UC Berkeley Data Science research program that connects undergraduates, academic researchers, non-profit organizations, and industry partners into teams that work toward their technological developments.

In this section, I will discuss the diagrams produced during the data visualization stage, and attempt to extract some insights on the size, diversity, and direction of growth for Creative Commons products.

Since I have never worked on the administrative side of Creative Commons, am not a Creative Commons employee, and had never worked with non-profit organizations on open source projects until now, nothing I subjectively infer from the visualizations should be taken as a credible recommendation; these are personal opinions and one individual’s interpretations.

Just as 100 economists have 100 different ideas, 100 data scientists may well come up with 100 different interpretations of the same set of data and circumstances.

Since we have mentioned the reliability of datasets and their sourcing methods in prior posts of the blog, we will not explain them in this post. We will step right into each of the visualizations.

Diagram 1: Number of Google Webpages Licensed over Time

In these diagrams, we discuss the growth of Creative Commons on Google over time.

Let us first observe the performance of Creative Commons’ public domain tools:

The growth looks essentially linear but has plateaued over the most recent 20 months (roughly 2 years). Growth was especially fast between 5 and 4 years ago. We may also see that the approximate number of documents protected with Creative Commons’ public domain tools is reaching 55.7 million, which is a sizeable figure.

Next, let us observe the performance of Creative Commons licenses, so that we can compare it with what we saw in Diagram 1A:

Compared to the growth of public domain tool usage, the growth of CC license usage is a much more stable linear trend, with a much higher usage count on the Internet. The growth of licenses is also observably faster between 5 and 4 years ago, and it also ends in a slight plateau. Compared to the plateau of public domain tool usage, however, the plateau for CC licenses is shorter.

Cumulatively, the performance of Creative Commons tools usage on the Internet can be visualized as follows:

This resembles a mixture of the stable linear growth seen in license usage and the decreasing rate of growth seen in public domain tool usage. Notably, an estimated 2.7 billion webpages were protected under CC tools by the end of October 2022!

This is, hopefully, an impressive exhibition of the size of Creative Commons on the internet.

Diagram 2: Density of Creative Commons Tools Usage across Countries

In this diagram, we discuss the intensity of Creative Commons tool usage across countries of the globe, where intensity is measured as the density of Creative Commons documents among all webpages on Google.
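As a minimal sketch of this density metric: the function name and all counts below are hypothetical and only illustrate the ratio, they are not taken from the project’s code or data.

```python
def cc_density(cc_pages: int, total_pages: int) -> float:
    """Share of a country's indexed webpages that carry a CC tool."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return cc_pages / total_pages


# Hypothetical (CC page count, total page count) pairs, for illustration only.
sample_counts = {
    "Country A": (1_200, 400_000),
    "Country B": (300, 50_000),
}

densities = {
    name: cc_density(cc, total) for name, (cc, total) in sample_counts.items()
}

# Rank countries from the densest CC usage to the sparsest.
ranked = sorted(densities, key=densities.get, reverse=True)
```

Note that Country B ranks first here even though it has fewer CC pages in absolute terms, which is the difference between density and raw count.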

The density map of usage is as follows:

Continent-wise, we may see that Africa and Asia as a whole show a lower usage of Creative Commons tools, while Western Europe and the Americas show a higher density of CC-protected documents, and Oceania has a moderately intensive usage compared to other continents. In particular, Western Europe enjoys a much more robust use of Creative Commons documents in terms of quantity.

Country-wise, some countries are simply absent from the data collection process, due to failing ISO alpha-3 country code lookups or a lack of data. The PRC, on the other hand, lacks CC-use data because Google is unavailable within the country, and an attempt to query data via Baidu (the PRC’s dominant search engine) was not successful.

Hopefully, this visualization gives some geographical direction for where the use of Creative Commons can be encouraged and developed further.

Diagram 3: Number of Works Licensed under Each License Type

In these diagrams, we will inspect the number of works protected under each Creative Commons license type, which in turn gives insight into how frequently CC licenses are used across levels of freedom.

Let us first demonstrate the size of Creative Commons across the Internet:

There are significantly more licensed pages than pages under public domain tools, by a factor of approximately 50.

Meanwhile, the visualization shows the overall count of works under CC tools to be roughly 3 billion.
This is impressive growth compared to the measurements of 2014 through 2017, where the total number of CC-protected works in 2017 was recorded as 1.4 billion on a relatively linear growth trend.

Then, let us see how the number of protected works is distributed across the different license types:

The visualization shows that the Attribution-NoDerivatives license (by-nd) and the Attribution license (by), the former being more restrictive and the latter quite free, both enjoy high usage on the Internet.

In general, the six main types of CC license are exactly the top six licenses in terms of document count. Tools that have a smaller target demographic or that already have an alternative option, such as devnations and publicdomain, are more niche and in turn receive less usage from the general Internet environment.

Note that in this visualization, the x-axis measures the document count on a base-10 logarithmic scale, meaning each unit of difference on the x-axis signifies a tenfold increase in value.
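To make the log-scale reading concrete, here is a small sketch with made-up counts (the license labels and numbers are hypothetical, not values from the diagram). A gap of 2 units on the axis corresponds to a 100-fold difference in document count.

```python
import math

# Hypothetical document counts, for illustration only.
counts = {
    "license A": 1_000_000_000,
    "license B": 10_000_000,
    "license C": 100_000,
}

# Position of each bar on a base-10 logarithmic x-axis.
log_counts = {name: math.log10(n) for name, n in counts.items()}

# Distance between two bars on the axis, and the ratio it represents.
gap = log_counts["license A"] - log_counts["license B"]
ratio = 10 ** gap
```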

Let us also inspect the usage of CC licenses across the top six entries in Diagram 3B, this time in order of freedom according to Creative Commons’ descriptions:

Once again, we see that Attribution and Attribution-NoDerivatives enjoy the highest usage.

In general, licenses with fewer restrictions seem to receive more attention than licenses with more restrictions, as the summed document count of the less restrictive licenses is larger.

It is also interesting to see that for each license that has a NoDerivatives counterpart, the document counts of that license and its counterpart are roughly equal, e.g. by and by-nd, or by-nc and by-nc-nd.

Finally, let us also observe how the Public Domain tools differ in terms of usage:

It can be seen that the CC0 tool (publicdomain/zero) receives higher usage than the Public Domain Mark (publicdomain/mark), by a factor of roughly 1.5.

Diagram 4: Number of Works Licensed under Each License Version

In this diagram, we demonstrate the usage of CC tools across license versions:

The intermediate versions of the licenses receive less attention: for example, Version 2.1 and Version 2.5. The data for these versions is prone to underestimation, because the data collection process ignored jurisdiction-specific versions of the licenses. Overall, though, this diagram shows that even if those relatively minor corrections were made to the data, the earliest and most recent versions of the licenses would still have the higher usage.

Notably, however, the importance of a license version cannot be evaluated solely by how much it is used. The version 3.0 licenses, albeit receiving less usage, are still responsible for the protection of Wikipedia content. There are also notable software licenses under the version 2.0 generation of Creative Commons tools, whose impact should not be underestimated because of slightly lower usage.

It is also plausible that some documents have migrated to a newer version of their license (for example, updating from 2.0 to 4.0), which would explain the lower usage of Version 2.0 to 3.0 tools in the current timeframe.

Diagram 5: Number of Works Licensed under Each Major License Category

In this diagram, we discuss the usage of licenses by each of their component elements (Attribution, ShareAlike, NonCommercial, and NoDerivatives), in terms of the usage of these licenses on the Internet:

There can be numerous reasons why the Attribution element is so overwhelmingly popular.

For one, most of the licenses include this element, as attribution is a fundamental way of crediting the author when a resource is used.

Meanwhile, NoDerivatives is slightly more popular than the other lower-usage elements, perhaps because it usefully binds the user of a work not to alter its contents in foul ways.

Diagram 6: Number of Works Licensed under Free and Non-Free Culture Tools

In this diagram, we inspect the number of works licensed under free and non-free culture tools:

Numerical calculation shows that roughly 45.3% of the documents under CC protection are covered by free-culture tools. These tools support the free culture movement on the Internet, and Creative Commons describes them as those that are

most readily used, shared, and remixed by others, and go furthest toward creating a commons of freely reusable materials. — Creative Commons

This statistic, however, was around 64% to 65% in the records of 2015 and 2016. Potentially, the drop implies a decline of free licenses; alternatively, our figure fails to show a similar level because our data collection covers a wider variety of platforms, some of which do not employ free licenses.
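The share itself is a simple ratio, sketched below. The counts are made up, and the set of tools treated as free-culture here (by, by-sa, and the public domain tools) is my assumption for illustration; the project’s own categorization may differ.

```python
# Assumption: which tools count as "free culture" for this illustration.
FREE_CULTURE = {"by", "by-sa", "publicdomain/zero", "publicdomain/mark"}


def free_culture_share(counts: dict[str, int]) -> float:
    """Fraction of all counted documents that fall under free-culture tools."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no documents counted")
    free = sum(n for tool, n in counts.items() if tool in FREE_CULTURE)
    return free / total


# Hypothetical document counts per tool, not project data.
sample = {
    "by": 300,
    "by-sa": 100,
    "by-nc": 200,
    "by-nd": 250,
    "by-nc-sa": 50,
    "by-nc-nd": 80,
    "publicdomain/zero": 20,
}

share = free_culture_share(sample)
```

Because the denominator is the full set of collected documents, adding platforms that mostly use non-free licenses lowers the share even if free-license usage itself has not declined.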

Diagram 8: Number of Works under Creative Commons Tools across Platforms

In this diagram, we break down the statistics by platform, as previously sampled, and inspect how widely Creative Commons is used on each platform this project has collected data from:

We see that DeviantArt presents the most data under Creative Commons licenses and tools, followed by Wikipedia and Wikicommons.
The count of media licensed on YouTube is an underestimate, as described in the prior posts on sampling methods.

The document count on Flickr appears unnaturally off: the most recent report, from 2017, stated that 415.1 million licensed media were found on Flickr, compared to the 0.6 million found by our data collection method.

Some investigative work should be performed on this collection method. Either a deviation has occurred in the data extraction process, or the past reports’ statistics for Flickr do not robustly reflect the number of licensed documents.

Diagram 9: Number of Webpages under Creative Commons across Countries

In this diagram, we investigate which countries are found to use Creative Commons on the most documents (as opposed to how intensively, which is portrayed in Diagram 2).

We can see that the United States and Canada, both countries in the Americas, have a large quantity of Creative Commons documents, as do many countries in Western Europe. This is consistent with the tendencies we saw in Diagram 2.

Since the computing method for this visualization is more precise but more costly, it can only be performed on selected countries. For the next diagram, we use a less costly but less accurate method to inspect the quantity of usage across all available countries beyond the selection.

From this, we may see that the US has a very large number of CC-protected search results, followed by several countries in the Americas and Western Europe. Notably, all countries of Northern America are present in this visualization, which covers the top 14 countries by count of Creative Commons tool usage.

Continent-wise:

North America leads the document count under CC tools coverage.

Interestingly, there were 7 CC-protected documents in Antarctica. Oceania and Africa are the main populated continents where Creative Commons has not yet advanced as far as it has in Western Europe.

Diagram 10: Number of Webpages under Creative Commons across Languages

In this diagram, we will look at the number of webpages under Creative Commons across historically popular languages for CC documents, with the scope of languages set at those available in the Google query parameters:

From here, we may also see that languages used in CC-popular regions (Western Europe, North America) generally have more CC documents written in them.

Diagram 11: Number of Videos under Creative Commons Licenses on YouTube

In this diagram, we observe the growth in number of YouTube videos protected under CC Licenses. Let us first inspect the count of new CC-Licensed videos in each 2-month period:

The blue line stands for the value that the API call response provided, which is capped by the system at 1,000,000 for total results (per the documentation).
The orange line, meanwhile, stands for the imputed count of new CC-licensed YouTube videos based on linear regression. I chose this imputation method because the growth of CC-protected document counts on most media also follows a linear trend.
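A minimal sketch of this kind of imputation, under my own assumptions: fit an ordinary least-squares line to the periods whose reported count is below the cap, then replace each capped period with the fitted value (never going below the cap). The function names and the sample series are hypothetical, and the project’s actual implementation may differ.

```python
CAP = 1_000_000  # cap on reported total results, as described above


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_capped(values, cap=CAP):
    """Replace capped entries with values predicted from the uncapped ones.

    Needs at least two uncapped periods to fit the line.
    """
    observed = [(i, v) for i, v in enumerate(values) if v < cap]
    slope, intercept = fit_line(
        [i for i, _ in observed], [v for _, v in observed]
    )
    return [
        v if v < cap else max(cap, slope * i + intercept)
        for i, v in enumerate(values)
    ]


# Hypothetical per-period counts of new videos; the last two hit the cap.
reported = [700_000, 800_000, 900_000, 1_000_000, 1_000_000]
imputed = impute_capped(reported)
```

The capped response is a lower bound on the true count, which is why the sketch never lets an imputed value fall below the cap.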

As for the cumulative count of YouTube videos under CC licenses over the two-month periods:

For now, this shows quadratic growth in both the original and the imputed cumulative counts (with the growth of the capped API response values resembling the shape of an ELU, or exponential linear unit, curve), while the imputed (estimated) growth reaches a final value 0.2 million higher than the lower bound given by the capped API responses.
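Why the cumulative curve looks quadratic can be seen from a small sketch: when the per-period count grows linearly, its running sum grows quadratically, and once the per-period count is stuck at the cap, the running sum only grows linearly. Both series below are hypothetical and are not the project’s data.

```python
from itertools import accumulate

# Hypothetical per-period counts of new CC-licensed videos.
capped = [700_000, 800_000, 900_000, 1_000_000, 1_000_000]
imputed = [700_000, 800_000, 900_000, 1_000_000, 1_100_000]

# Running totals over the periods.
cum_capped = list(accumulate(capped))
cum_imputed = list(accumulate(imputed))

# Difference between the estimate and the lower bound at the final period.
gap = cum_imputed[-1] - cum_capped[-1]
```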

Diagram 12: Number of CC-Licensed Vimeo Videos

In this diagram, we inspect the distribution of CC-Licensed videos within Vimeo:

Here we notice that the majority of this site’s CC-licensed media are by-nc-nd licensed videos, with the by license (Attribution only) once again making up a significant portion of the CC-licensed media.

Diagram 13: Number of CC-Licensed Media on Wikicommons

In this diagram, we discuss the number of CC-licensed media, mostly pictures, that exist on Wikicommons:

We discover that, while only three types of licenses are found in use on Wikicommons, all of which are Free Culture licenses, by-sa licenses are used much more heavily than the other types.

In particular, here is how the distribution of file count (as demonstrated in Diagram 13A) differs from the distribution of page count:

Diagram 14: Number of CC-Licensed Photos on Selected Platforms

In this diagram, we inspect the overall distribution of CC-protected photos across their CC tool of choice:

From this, we observe a trend very similar to how the number of works across the Internet is distributed over their choice of CC tools: the six major types of licenses still enjoy higher usage, at similar quantities, while niche licenses see far less.

Wrapping Up the Visualization Stage

To wrap up the visualization stage, I conducted some more analysis on Diagram 7 (the missing diagrams from the Flickr efforts), which turned out to contain some errors (as described in the previous post, and diagnosed with some simple pandas-driven EDA).

In addition, some calibrations were made to the visualizations to present a more accurate media count by removing platforms that cause overcounting. For example, all DeviantArt works selected for visualization were also counted as Google webpages, which would cause overcounting if DeviantArt were included in Diagrams 3 through 6.

Overall, the visualization stage was a good transition from Python script writing to pandas and seaborn engineering, and the completion of the visualizations officially marks the end of Quantifying the Commons as originally requested.

From now on, we move on to the newly noticed requirements of the DSD project that Quantifying the Commons did not originally plan to cover.
That is: modeling, machine learning, and statistical inference.

https://github.com/creativecommons/quantifying

“This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License”: https://creativecommons.org/licenses/by-sa/4.0/

CC BY-SA License image

