DSD Fall 2022: Quantifying the Commons (5/10)

In this blog series, I discuss the work I’ve done as a DSD student researcher from UC Berkeley at Creative Commons.

Bransthre
10 min read · Nov 17, 2022

In this post, I outline the theoretical aspects of our approach to generating visualizations of Creative Commons product usage.

DSD (Data Science Discovery) is a UC Berkeley data science research program that connects undergraduates, academic researchers, non-profit organizations, and industry partners into teams working on technological developments.

Motivations and Rules of Visualization

Visualizations exist for two general reasons:

  1. To facilitate the process of analysis and modeling.
  2. To communicate findings and conclusions.

The purpose of Quantifying the Commons leans towards the latter: the visualizations this project produces are, like their predecessors, designed as exhibitory devices to demonstrate the scale of Creative Commons’ presence on the Internet, as well as presentation tools in the Data Science Discovery deliverables at the end of this semester.

Therefore, the visualizations we produce must be tuned towards “exhibition”: comprehensible to a general audience, and both representative of and impressive in conveying the phenomenon of Creative Commons.

Summary of Required Engineering

Remember that the analyses and visualizations we conduct in the scope of this project should consider the following axes:

  • Geographical
  • Chronological
  • Platform
  • License Typing, License Version, License Freeness

The geographical axis is not necessarily limited to location-wise measurements; it can also involve aspects of civilization, such as language. After all, geography is

the study of the physical features of the earth and its atmosphere, and of human activity. — Oxford Languages

Analyses along the chronological axis, meanwhile, are largely limited to the development of document counts over time.

Keeping the objective of “exhibition” in mind, the visualizations should be kept as easily interpretable as possible.
This means the following principles, some of which are at odds with prior work, can be applied (a small illustration follows the list):

  1. Utilize length more than area. It has been shown that humans are better at interpreting length than area.
  2. Utilize the breadth of data to analyze across multiple aspects of a license.
  3. Utilize a stable baseline on bar plots, since past visualizations have involved zigzagging baselines that do not necessarily make the data more comprehensible.
  4. Utilize color more to indicate inclinations, such as document density, recency of versions, and freeness of license.

Many of the above principles are based on this material.
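To make principles 1, 3, and 4 concrete, here is a minimal, self-contained sketch (with made-up counts, not the project’s data) of a bar chart that encodes quantity as length from a stable zero baseline and uses color for an inclination such as license freeness:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up counts, for illustration only
demo = pd.DataFrame({
    "License": ["CC BY", "CC BY-SA", "CC BY-NC", "CC BY-ND"],
    "Document Count": [120, 80, 60, 30],
    "Freeness": ["free", "free", "non-free", "non-free"],
})

# Length encodes the count, every bar shares the same zero baseline,
# and hue marks a qualitative inclination (freeness of the license)
sns.barplot(data=demo, y="License", x="Document Count", hue="Freeness", dodge=False)
plt.title("Illustration: stable baseline, length encoding, color as inclination")
plt.tight_layout()
plt.show()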

Preliminary Dataset Engineering Process

To facilitate the development and production of visualizations, let us first engineer all datasets into the same structure.
To support this effort, I will first generate a DataFrame of general information for each license typing.

Consequently, we arrive at this:

The end result of engineering and creating a license information DataFrame

For each license, we note its:

  • Tool typing (license or public-domain)
  • Version
  • Jurisdiction (if not None)
  • Individual categories (BY, SA, NC, ND, or neither), in a style similar to one-hot encoding (a small constructive sketch follows this list).
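
A minimal constructive sketch of how such a license summary DataFrame could be assembled (the license paths, column names, and parsing here are assumptions for illustration, not the project’s exact code):

import pandas as pd

# Hypothetical license keys in the "tool/category/version[/jurisdiction]" style
license_paths = [
    "licenses/by/4.0",
    "licenses/by-sa/3.0/de",
    "publicdomain/zero/1.0",
]

rows = []
for path in license_paths:
    parts = path.split("/")
    tool, category, version = parts[0], parts[1], parts[2]
    jurisdiction = parts[3] if len(parts) > 3 else None
    row = {
        "LICENSE TYPE": path,
        "Tool Typing": tool,  # "licenses" or "publicdomain"
        "Version": version,
        "Jurisdiction": jurisdiction,
    }
    # One-hot style flags for the individual license elements
    for element in ["by", "sa", "nc", "nd"]:
        row[element.upper()] = int(element in category.split("-"))
    rows.append(row)

license_summary = pd.DataFrame(rows)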

I will then forge each dataset into a format that can later be merged with the license summary DataFrame above.
For example, looking at the Wikicommons dataset:

A snippet of Wikicommons dataset for CC product usage

The LICENSE TYPE column does not express each entry’s license per Creative Commons’ primary alias. However, this information can be parsed and organized into a more manageable format for merging with the license summary:

# Lowercase the license description, split it into path segments, and keep the first three
raw_wikicommons_license_types = raw_wikicommons_license_data["LICENSE TYPE"]\
    .str.lower()\
    .str.split("/", expand=True)
raw_wikicommons_license_types = raw_wikicommons_license_types.iloc[:, [0, 1, 2]]

# Segment 1 holds the "cc-<category>" token; strip the prefix and rewrite it as "licenses/<category>"
raw_wikicommons_license_types[1] = raw_wikicommons_license_types[1]\
    .str.extract(r"cc-(.*)")
raw_wikicommons_license_types\
    .loc[~raw_wikicommons_license_types[1].isna(), 1] = \
    raw_wikicommons_license_types\
    .loc[~raw_wikicommons_license_types[1].isna(), 1]\
    .map(lambda s: f"licenses/{s}")

# Segment 2 holds the version; assemble the "licenses/<category>/<version>" key
raw_wikicommons_license_types[2] = \
    raw_wikicommons_license_types[2].str.extract(r".*-(\d\.\d)")
raw_wikicommons_license_types[2] = \
    raw_wikicommons_license_types[1] + "/" + raw_wikicommons_license_types[2]

# CC0 entries belong to the public-domain tools, so rewrite their key accordingly
raw_wikicommons_license_types\
    .loc[raw_wikicommons_license_types[1] == "licenses/zero", 2] = \
    raw_wikicommons_license_types\
    .loc[raw_wikicommons_license_types[1] == "licenses/zero", 2]\
    .map(lambda x: "publicdomain/zero/1.0")

# Keep only the assembled key and name it LICENSE TYPE
raw_wikicommons_license_types = raw_wikicommons_license_types\
    .drop([0, 1], axis=1)\
    .rename(columns={2: "LICENSE TYPE"})

# The original LICENSE TYPE column becomes a human-readable LICENSE DESCRIPTION
raw_wikicommons_license_data_renamed = raw_wikicommons_license_data\
    .rename(columns={"LICENSE TYPE": "LICENSE DESCRIPTION"})

wikicommons_license_merge_data = pd.concat(
    [raw_wikicommons_license_data_renamed, raw_wikicommons_license_types],
    axis=1
)
wikicommons_license_merge_data.head()

# Trim the description down to the part after "Free_Creative_Commons_licenses/"
wikicommons_license_merge_data_modified = wikicommons_license_merge_data.copy()
wikicommons_license_merge_data_modified["LICENSE DESCRIPTION"] = \
    wikicommons_license_merge_data_modified["LICENSE DESCRIPTION"].str\
    .extract(r"Free_Creative_Commons_licenses/(.*)").dropna()
wikicommons_license_merge_data_modified.sample(5)
A snapshot of the pandas- and regex-based license type extraction from LICENSE DESCRIPTION (the alias)

The above pandas- and regex-driven transformation produces a more manageable version of the original dataset that merges cleanly with the license summary:

wikicommons_license_data = merge_with_license(
    wikicommons_license_merge_data_modified,
    withhold=True
)
wikicommons_license_data.head()
End result of license extraction and merging with license information DataFrame
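
merge_with_license is a project helper whose implementation is not shown in this post. A plausible minimal sketch of what it does, assuming the license summary DataFrame built earlier is named license_summary, the merge key is the LICENSE TYPE column, and withhold controls whether rows without a matching license are kept:

def merge_with_license(platform_data, withhold=False):
    # Hypothetical sketch: attach the license summary columns to a platform dataset.
    # A left merge with withhold=True keeps (withholds) rows that found no match.
    how = "left" if withhold else "inner"
    return platform_data.merge(license_summary, on="LICENSE TYPE", how=how)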

Repeat this process for all platforms the team collected data from, and we can move on to the dataset engineering and visualization development described in the next section!

Visualization Development

Each visualization follows the approximate pipeline of:

  1. Engineer the dataset to capture the information the visualization needs
  2. Concatenate the engineered datasets
  3. Call visualization methods from the Seaborn and Matplotlib package utilities
  4. Fine-tune and annotate the visualization for clarity, data precision, and easier interpretation

To keep the explanation concise, we will only discuss the production process of the visualizations corresponding to Diagrams 1, 2, 8, and 11 in the official report and presentation for this project.

Diagram 1: Number of Google Webpages Licensed over Time

The dataset engineering is as follows:

The time-based search result for CC product usage, with pandas-driven modifications for visualization

Originally, the dataset’s columns express the number of licensed documents within some number of months from now, which does not quite demonstrate the trend of document count growth within and across each 6-month period.
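
As a rough, self-contained sketch of the needed conversion (the column names and numbers here are hypothetical, not the project’s data): differencing adjacent “within N months” columns recovers the count added in each 6-month window, and accumulating those per-period counts yields the curve the trend chart plots.

import pandas as pd

# Hypothetical row of the raw dataset: cumulative counts of documents
# licensed within the last N months
within_n_months = pd.Series(
    {"Within 6 Months": 120, "Within 12 Months": 260, "Within 18 Months": 430}
)

# Differencing adjacent columns recovers the count added in each 6-month window
per_period = within_n_months.diff().fillna(within_n_months.iloc[0])

# Accumulating the per-period counts from the oldest window to the newest
# gives the growth curve that the trend chart plots
growth_curve = per_period[::-1].cumsum()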

Negative values have been observed for the document count in some half-year periods, possibly due to the volatility of Google’s API responses, as well as migration of licenses across online documents.

Let us attempt an explanation:

Perhaps, during some periods, many online documents migrated from one version of a license to another, or stopped using CC tools altogether in favor of some alternative, substitute, or more commercial product.

Now, a pandas-driven approach (pandas saves the day again) has been adopted to compute the number of licensed documents published to the Internet in each 6-month period, and the cumulative sum of that series yields the plotted trend chart:

google_time_diff_publicdomain = google_time_diff\
    .loc[google_time_diff["Tool Typing"] == "publicdomain"]\
    .iloc[:, 1:-9]\
    .sum(axis=0)

generate_diagram_one(
    google_time_diff_publicdomain,
    "Creative Commons Public Domain Tools",
    10 ** 7,
    graph_option="a"
)
plt.title("Diagram 1A: Trend Chart of Creative Commons Public Domain Tools Usage on Google")
plt.grid()
plt.rcParams['figure.facecolor'] = 'white'
Without the annotations, this trend chart would be rather confusing: it would merely show a line growing in some shape, without marking the milestones of that growth as past efforts attempted to do.

Therefore, to find the milestone points and annotate them (the annotations depend on the numeric scale of each plotted dataset), some general pipelines were developed.

Here is a rough draft of an example snippet:

if annot:
    google_time_all_milestone = [
        find_closest_neighbor(google_time_all_toplot, i)
        for i in range(
            min(google_time_all_toplot) // milestone_unit * milestone_unit,
            max(google_time_all_toplot),
            milestone_unit
        )
    ] + [google_time_all_toplot.index[-1]]
    sns.scatterplot(
        google_time_all_toplot.loc[google_time_all_milestone], ax=google_ax
    )
    if graph_option == "a":
        for i in google_time_all_milestone:
            plt.annotate(
                f"At {-i} month before: \n"
                f"Reach {google_time_all_toplot.loc[i]:,} Documents",
                (i - 6, google_time_all_toplot.loc[i] + milestone_unit * 0.2),
                ha='center',
                bbox=dict(
                    boxstyle="square,pad=0.3", fc="grey", alpha=0.5, lw=2
                )
            )
    elif graph_option == "b" or graph_option == "c":
        for i in google_time_all_milestone:
            if i == google_time_all_toplot.index[-1]:
                plt.annotate(
                    f"At {-i} month before: \n"
                    f"Reach {google_time_all_toplot.loc[i]:,} Documents",
                    (
                        i - 8,
                        google_time_all_toplot.loc[i] + milestone_unit * 0.1
                    ),
                    ha='center',
                    bbox=dict(
                        boxstyle="square,pad=0.3", fc="grey", alpha=0.5, lw=2
                    )
                )
            else:
                plt.annotate(
                    f"At {-i} month before: \n"
                    f"Reach {google_time_all_toplot.loc[i]:,} Documents",
                    (
                        i - 8,
                        google_time_all_toplot.loc[i] - milestone_unit * 0.1
                    ),
                    ha='right',
                    bbox=dict(
                        boxstyle="square,pad=0.3", fc="grey", alpha=0.5, lw=2
                    )
                )

This employs switch-case-style handling of the annotation settings.
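
The helper find_closest_neighbor referenced above is not shown in the snippet; a plausible minimal sketch, assuming it returns the index label of the data point whose value is closest to a given milestone value:

def find_closest_neighbor(series, target):
    # Hypothetical sketch: index label of the entry whose value is closest to `target`
    return (series - target).abs().idxmin()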

The final touches of this process are attaching titles, annotations, and visual aids to the rendered diagrams.

Diagram 2: Number of Google Webpages Licensed over Country

The dataset engineering of this visualization is slightly more complex, as we are dealing with geographical data.

Such data requires additional supporting packages, as well as some manual corrections. For example, country names across different ISO-code sources may deviate from each other; some representative examples:

  • Bolivia vs. Bolivia (Plurinational State of)
  • United Kingdom vs. United Kingdom of Great Britain and Northern Ireland
  • Taiwan vs. Taiwan, Republic of China

Some of these involve political controversies, and some are due to Google’s naming of countries that have disappeared for a while…, but that is not the main axis of this data analysis effort.
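
As a sketch of the kind of manual correction involved (the DataFrame and column names here are assumptions, and the real correction table is more extensive), deviating names can be mapped onto the preferred names before resolving Alpha-3 codes:

import pandas as pd

# Hypothetical corrections mapping deviating names onto the preferred names
country_name_corrections = {
    "Bolivia (Plurinational State of)": "Bolivia",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "Taiwan, Republic of China": "Taiwan",
}

# Illustrative stand-in for country names returned by the GCS API
google_country_names = pd.DataFrame(
    {"Country": ["Bolivia (Plurinational State of)", "Taiwan, Republic of China"]}
)
google_country_names["Country"] = google_country_names["Country"].replace(country_name_corrections)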

Merged dataset combining geographical data acquired from the GCS API with the global map dataset from GeoPandas

The Alpha-3 code column is the main key for merging the GCS API dataset with the GeoPandas dataset, and the geometry column contains the geometric object for each country participating in the data collection process, to be drawn on the global heatmap visualization of Diagram 2:

ax = geo_google_country_data_merged.plot(
    column='Log CC Density', scheme="EqualInterval", k=15,
    figsize=(20, 15),
    legend=True,
    legend_kwds={
        "framealpha": 0.5,
        'loc': 'upper right',
        'bbox_to_anchor': (1.14, 1),
        "fancybox": True,
        "title": "CC Document Density"
    },
    cmap='coolwarm'
)
# Rewrite the legend bounds from log densities back to the original densities (as percentages)
log_leg_entries = ax.get_legend().get_texts()
for entry in log_leg_entries:
    log_bounds = [
        f"{round(np.e ** float(elem) * 100, 3)}%"
        for elem in entry.get_text().split(",")
    ]
    entry.set_text(" to ".join(log_bounds))
plt.title(
    (
        "Diagram 2: "
        "Heatmap on Density of CC-Protected Google Webpages over Country"
    )
)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.rcParams['figure.facecolor'] = 'white'
Generation of a heatmap of the intensity of Creative Commons tools usage across the globe.

In this visualization, while the classification of each colored bin is based on the log CC document density (for the sake of a more plottable distribution), the legend is then edited to present the original CC document density for better interpretability.

Notably, some countries did not participate in the data collection process, whether due to foul data entries or Internet policies (say, countries like the PRC where Google is unavailable).
These issues are mostly countered via further engineering of the visualization annotations as shown above, or by marking the data as-is in the dataset with further annotations noting its abnormalities.

Diagram 8: Number of Licensed Work across Platforms

The dataset engineering for this diagram is rather straightforward: forge every dataset into one format, and concatenate them together:

all_platforms_concat_diag_8 = pd.concat(
    [
        all_platforms_concat_diag_3,
        deviantart_license_data_diag_3
    ]
)
all_platforms_concat_diag_8.sample(5)
Effort on internal grouping and external concatenation of datasets. This is a pre-correction version of the visualization’s code; deviations due to revision exist between this snapshot and the current progress.

Insisting on using length instead of area in the visualization, we employ a Log Document Count column to visualize each platform’s document count using bar plots.

Then, for clarity, the visualization is annotated to still present each platform’s original, pre-log document count.

A note can also be made in the general report and presentation that, on a logged axis, a difference of 1 unit stands for a factor of 10. This provides better clarity to the visualization’s audience.
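
A minimal sketch of such a log-scaled bar chart with the original counts annotated back onto the bars (the platform names and counts here are placeholders, not the project’s figures):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Placeholder platform counts, for illustration only
platforms = pd.DataFrame({
    "Platform": ["Platform A", "Platform B", "Platform C"],
    "Document Count": [1_200_000, 45_000, 800],
})
platforms["Log Document Count"] = np.log10(platforms["Document Count"])

ax = sns.barplot(data=platforms, x="Platform", y="Log Document Count", color="steelblue")

# Annotate each bar with the original, pre-log document count
for bar, count in zip(ax.patches, platforms["Document Count"]):
    ax.annotate(
        f"{count:,}",
        (bar.get_x() + bar.get_width() / 2, bar.get_height()),
        ha="center", va="bottom"
    )
plt.ylabel("Log10 Document Count (a 1-unit difference is a factor of 10)")
plt.show()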

Diagram 11: Number of Licensed YouTube Videos over Time

The visualizations for Diagram 11 involve imputing the development of CC-licensed YouTube videos and incorporating this predicted growth trend into the visualization. To include the imputed values, however, we should start by engineering the dataset so that it contains the imputed values of growth.

We can impute the count of CC-licensed videos based on past data; but what pattern should the growth follow?

Is it exponential?
Some imputation and interpolation tools in industrial packages would argue so, one being this interpolation tool.
However, exponential growth of the video count implies there would be 100 million CC-licensed videos in the newest two-month period, which is most likely at odds with reality; or at least, according to public estimates, it seems to be.

Meanwhile, the counts of other media under CC licenses follow linear growth, so employing a linear model to impute the count of new YouTube videos per two-month period qualifies as a good decision.
Fortunately, Python has many easy-to-learn options for linear regression.

In this visualization, the sklearn linear regression model was employed:

import sklearn.linear_model as lm

model = lm.LinearRegression()
index_limiter = raw_youtube_time_data_toplot["time coord"] < 2016
model.fit(
    X=raw_youtube_time_data_toplot[index_limiter][["time coord"]],
    y=raw_youtube_time_data_toplot[index_limiter]["Document Count"]
)
plt.plot(
    raw_youtube_time_data_toplot["time coord"],
    model.predict(raw_youtube_time_data_toplot[["time coord"]])
)
plt.scatter(
    data=raw_youtube_time_data_toplot,
    x="time coord", y="Document Count"
)
Dataset engineering effort on imputed values of YouTube’s video count growth.

Later, the visualization is annotated and colored to distinguish originally retrieved, capped API call results from the imputed values of video count:

youtube_time_licensed_linear_toplot_figs = plt.subplots(figsize=(20, 10))
ax = youtube_time_licensed_linear_toplot_figs[1]
youtube_time_licensed_linear_toplot = \
    youtube_time_licensed_linear_imputed.merge(time_coord_table, on="Time")
youtube_time_licensed_linear_toplot["Data Source"] = \
    ["Imputed Value of Response"] * youtube_time_licensed_linear_toplot.shape[0]
youtube_time_licensed_original = \
    youtube_time_licensed.merge(time_coord_table, on="Time")
youtube_time_licensed_original["Data Source"] = \
    ["Original API Response"] * youtube_time_licensed_original.shape[0]
youtube_time_toplot = pd.concat(
    [
        youtube_time_licensed_original
            .loc[:, ["time coord", "Document Count", "Data Source"]],
        youtube_time_licensed_linear_toplot
            .loc[42:, ["time coord", "Document Count", "Data Source"]]
    ]
)
sns.lineplot(
    data=youtube_time_toplot,
    x="time coord", y="Document Count", hue="Data Source",
    style="Data Source", markers=True, dashes=True
)

plt.title(
    (
        "Diagram 11A: Trend Chart of Number of CC Licensed YouTube Videos "
        "across Each Two-Months"
    )
)
plt.ticklabel_format(style='plain', axis='y')
plt.xlabel("Time (Year)")
plt.ylabel("Count of CC-Protected YouTube Videos per Time Period")
plt.ylim(top=plt.ylim()[1] * 1.05)
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()])
plt.xticks(
    np.arange(
        int(min(youtube_time_toplot["time coord"])),
        int(max(youtube_time_toplot["time coord"])) + 2
    )
)
plt.grid()
plt.rcParams['figure.facecolor'] = 'white'
Visualization generation efforts on cumulative count of CC-Licensed YouTube videos over two-month periods

Closing Remarks

The visualization phase was quite short (7 to 9 days) and did not involve many new technologies or skills beyond the popular options of Pandas, GeoPandas, Seaborn, and Matplotlib. This phase’s workload was heavier in terms of dataset engineering and exploration.

The rest of the time in this phase was used to write the report covered in the next post, where we discuss the resulting visualizations of these processes.

https://github.com/creativecommons/quantifying

“This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License”: https://creativecommons.org/licenses/by-sa/4.0/

CC BY-SA License image
