DSD Fall 2022: Quantifying the Commons (9/10)

In this blog series, I discuss the work I’ve done as a DSD student researcher from UC Berkeley at Creative Commons.

Bransthre
9 min readNov 27, 2022

In this post, I wrap up the work-explaining portion of this blog series with future expectations and possible next steps.

DSD (Data Science Discovery) is a UC Berkeley data science research program that connects undergraduates, academic researchers, non-profit organizations, and industry partners into teams working toward their technological goals.

Preface

From here on, since we will no longer be discussing the details of the work and code themselves (those have been addressed in the prior posts), allow me to first lay out the future possibilities of this project, then discuss expectations and hand-offs. In the last post, I will offer some personal reflections on this project. Given the nature of these topics, the remaining posts will take a slightly more casual tone.

As for how this work will shape the future of Quantifying the Commons, that is left for the future to decide. For now, we have jump-started efforts to wake the project from its five-year dormancy, and we hope this work will aid and guide Creative Commons' new development across different geographies, demographics, and policies.

But as to whether this work will influence the future of Quantifying the Commons and the staff's future decisions, we would confidently say yes.

Future Work on Retrieval

My supervisor expressed significant concerns and advice in this post.

In response, I wrote on Slack about three weeks ago with the following policies and thoughts:

How often should the data be gathered and analyzed/rendered? What is the strategy for gathering data over multiple days (due to query limits)?

We can set up a persistence strategy for the script. For example, once a quota limit is hit, record in a .txt file the license/query the script has collected up to that point. The next time the script launches, it resumes its progress from that .txt file. Alternatively, rotate between multiple API keys, which is currently implemented for google_custom_search. Or do both!

Another issue for multi-day data gathering is that the data files are currently named using the date of querying. Fortunately, this can be alleviated by the persistence system mentioned above.

Developing the persistence system would take, on average or even in the worst case, around 1–2 hours, and wouldn't be much of a problem once the system design is settled. (It also happens to coincide with an assignment from Berkeley's introductory coursework, lucky me!)
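
To make this concrete, here is a minimal sketch of such a checkpoint-based persistence system. Everything below is illustrative rather than the project's actual code: the checkpoint file name, the query_license helper, and the QuotaExceeded exception are all stand-ins.

```python
import json
from pathlib import Path

CHECKPOINT = Path("state_google_custom_search.json")  # hypothetical checkpoint file


class QuotaExceeded(Exception):
    """Placeholder for whatever quota-limit error the real API client raises."""


def query_license(row, api_key):
    """Placeholder for the real per-row query against the search API."""


def load_checkpoint() -> int:
    """Index of the last fully collected row, or -1 when starting fresh."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())["last_completed_row"]
    return -1


def save_checkpoint(row_index: int) -> None:
    """Record progress so the next run resumes after this row."""
    CHECKPOINT.write_text(json.dumps({"last_completed_row": row_index}))


def collect(rows, api_keys):
    """Resume from the checkpoint, rotate API keys, and stop cleanly when all are spent."""
    keys = list(api_keys)
    start = load_checkpoint() + 1
    for i, row in enumerate(rows[start:], start=start):
        while keys:
            try:
                query_license(row, keys[0])
                save_checkpoint(i)  # mark the row only once it has fully completed
                break
            except QuotaExceeded:
                keys.pop(0)  # this key is spent for the day; try the next one
        if not keys:
            return  # every key exhausted; tomorrow's run resumes from the checkpoint
```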

What is the strategy for ensuring automated updates do not result in broken/incomplete state if they don’t complete successfully?

As addressed above. In addition, prohibit the program from writing an update that contains an incomplete row entry. In summary, the program will always stop at the row it didn't finish querying on a prior day and resume from that row the next day.

The deficiency of this strategy is that if a single row requires more queries than the quota allows, there is no way to complete the entire data extraction process. This, however, is unlikely, since no row in the data files requires more than 40 queries and the smallest quota limit encountered so far is 100 requests.

Should scripts wait until completion to write data to file(s)?

This sounds like a good idea; let's implement it once we find spare time amidst the analysis stage. If "completion" here refers to the completion of multi-day collection, it would be slightly redundant under the current design plan, since we would then have to write the queried data down somewhere in the folder anyway. In that case, it amounts to the same thing as already having written the queried data to some other files.

However, we can adopt the reasoning behind this suggestion and render the .db file (SQLite) from intermediate .txt or .csv files once the multi-day collection process is complete.
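
As a rough sketch of that rendering step, assuming pandas and a folder of intermediate .csv files (the paths and table naming below are illustrative, not the project's actual layout):

```python
import sqlite3
from pathlib import Path

import pandas as pd


def render_database(csv_dir: str, db_path: str = "quantifying.db") -> None:
    """Load every intermediate .csv in csv_dir into one SQLite file, one table per source file."""
    conn = sqlite3.connect(db_path)
    try:
        for csv_file in sorted(Path(csv_dir).glob("*.csv")):
            df = pd.read_csv(csv_file)
            # Name each table after its file, e.g. google_custom_search.csv -> google_custom_search
            df.to_sql(csv_file.stem, conn, if_exists="replace", index=False)
    finally:
        conn.close()
```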

Can the various tasks be run using GitHub Actions?

I have not researched this yet, but here is a list of things GitHub Actions must be able to do to operate the script:

  • Run Python scripts
  • Read from files, write contents into files
  • Update the code repository automatically so the persistence system can work.

Should data be stored in a plaintext (immediately readable) or binary format (compressed, SQLite, etc.)?

SQL is very handy when we deal with possibly non-tabular data, or more generally whenever we want the help of a DBMS (Database Management System). Let's keep this in mind as we inspect our current data collections.

The individual datasets I worked on are mostly small .csv files that don't require SQL. However, the future dataset needed for the model's samples, as well as my colleague's dataset of (what appears to be a .json dump of) all Flickr photos under one license, are considerably large, and we do have multiple datasets to manage, which would benefit from a DBMS.

So overall, SQL would be a good idea once the scope of this project requires larger datasets. For now, the only large dataset I plan on using is an eventual dataset of websites for modeling, for which I already plan to adopt SQLite3 on my end.

I do have a SQLite3-to-DataFrame pipeline semi-ready from a prior project, if we would like to put it to use around the end of the semester, or at least have the resource available to the next team.

Below is a rough approximation of how the pipeline might look. A comment of mine from a few days later would then address an even better alternative: SQLAlchemy with pd.read_sql.
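
The snippet below is an illustrative reconstruction rather than the project's actual code; table and path names are placeholders. It shows both the plain sqlite3 route and the SQLAlchemy + pd.read_sql variant mentioned above.

```python
import sqlite3

import pandas as pd
from sqlalchemy import create_engine


def table_to_dataframe(db_path: str, table: str) -> pd.DataFrame:
    """Plain sqlite3: run a query and wrap the rows into a DataFrame by hand."""
    with sqlite3.connect(db_path) as conn:
        cursor = conn.execute(f"SELECT * FROM {table}")
        columns = [description[0] for description in cursor.description]
        return pd.DataFrame(cursor.fetchall(), columns=columns)


def table_to_dataframe_sqlalchemy(db_path: str, table: str) -> pd.DataFrame:
    """The tidier alternative: an SQLAlchemy engine plus pd.read_sql."""
    engine = create_engine(f"sqlite:///{db_path}")
    return pd.read_sql(f"SELECT * FROM {table}", engine)
```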

Are there opportunities for code deduplication?

Definitely. Many pipelines have very similar exception handling and data collection logic. However, platforms that collect multiple datasets (Google, YouTube) and platforms with a peculiar licensing method (Wikicommons) wouldn't benefit much from this. The only data collection process I have not played with is Flickr's, but a quick glance hints that its structure also wouldn't benefit from a generalized data collection process.

Generalizing the data collection process is commendable and helpful, but precisely because CC's data sources have diverse APIs, I'll hold off on this for now. The reason the code seems de-duplicable at the moment is most likely that the majority of the data collection processes were written by the same author, with the same code style, boilerplate, and documentation.
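
For illustration only, the shared piece that could be factored out might look something like the helper below; the retry policy and the use of requests are assumptions, and the platform-specific query construction would still live in each script.

```python
import logging
import time

import requests


def fetch_with_retries(url: str, params: dict, max_retries: int = 3, backoff: float = 2.0) -> dict:
    """Shared request and exception-handling boilerplate that each platform script could reuse."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as err:
            logging.warning("Attempt %d/%d failed: %s", attempt, max_retries, err)
            time.sleep(backoff * attempt)
    raise RuntimeError(f"All {max_retries} attempts failed for {url}")
```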

Future Work on Visualization

Here, I will present my expectations through comments on a colleague's visualization. This is not written as a critique of any individual's work; it is a remark on frequent visualization errors.

Visualizations are communicative artifacts, and by that token they must be comprehensible and interpretable without resources external to the visualization itself.

Put in conceptual terms: everything you need to understand a diagram should already be clearly on that diagram. Put in concrete, practical terms: let's go through frequent mistakes that have also appeared in this project, some still unrevised, such that you can spot them on our deliverables.

Let’s use the visualization from Post 4 EDA again:

What prevents me from understanding this visualization is mostly the lack of annotation on the x-axis and y-axis. I see a bunch of date-times on the x-axis, but does that mean each x-coordinate is a day? Or are x-coordinates split by seconds? Months? Meanwhile, the y-axis has values from 0 to beyond 3000, but what is the unit? Apples? Bananas?

The x-axis and y-axis both measure some unknown quantity in unknown units; being incomprehensible, the visualization is essentially meaningless.

But suppose we now ask the author to annotate the x-axis and y-axis: what, then, does the title of this visualization mean? For example, what is this "usage" measured in? And what is "license2"?

Some very significant problems with this project's visualizations concern not just a lack of information, but also a lack of context. For example, while we measure "flickr photos 1967–2022", what exactly does "1967–2022" mean, if Flickr launched in 2004?

Furthermore, as Post 4 noted, this visualization suffers from a significant duplication issue in its dataset. After deduplication work (done within three days of my writing the final blog post), the sampling frame of the Flickr API turned out to be limited to the first 4,000 (if not fewer in practice) search results for any license applicable on Flickr.

Why isn't this sampling-frame limitation written as a subtitle or annotation on the visualization? Is the plot guilty of misleading, albeit unintentionally?
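
As a hedged illustration of what such annotations could look like in matplotlib (the function name, data, and exact wording are hypothetical; the 4,000-result caveat reflects the sampling-frame issue described above):

```python
import matplotlib.pyplot as plt


def plot_license_usage(dates, counts):
    """Hypothetical example: every quantity a reader needs is written on the figure itself."""
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(dates, counts)
    ax.set_title("CC BY 2.0 photo uploads on Flickr, sampled via the Flickr API")
    ax.set_xlabel("Upload date (binned by month)")
    ax.set_ylabel("Number of photos returned")
    # State the sampling frame so the plot cannot silently mislead.
    fig.text(0.01, 0.01,
             "Note: the Flickr API returns at most the first 4,000 search results per license.",
             fontsize=8)
    fig.tight_layout()
    return fig
```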

Visualizations are important because communication around data depends hugely on them. Because of that, it is best to treat this aspect of the project diligently and precisely.

Missing information and ill-designed layouts in visualizations have already caused popular misconceptions in the news. It doesn't help to extend this issue to an educational project.

Speaking with less gravity: aware of how much visualizations shape the common interpretation of a subject, let us think thrice before we jump in and produce them.

Future Work on Modeling

Most of my expectations on modeling are written here because I was unable to complete these experiments before the deadline. I will address some possible developments for each model I worked with.

BERT

After some discussion with my friends, we agree that the training dataset might be too small for BERT to work optimally. Developing a BERT-based model on larger datasets would require hardware that, albeit not yet utilized in this project, is freely accessible from several cloud computing platforms that host iPython notebooks. Making BERT work would then come down to finding a better training dataset.

During my research on Spark, I came across a lot of papers pointing to the C4Corpus, and I believe replicating that method on recent crawl datasets should help in gathering a larger training dataset.

Logistic Regression

We can use libraries that help with model inspection to identify the words most useful for license classification; a good resource would be ELI5. Meanwhile, some hyperparameters, such as the decision threshold, were not explored during the modeling phase, and I'd like to experiment with those as well.
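
As a small sketch of both ideas, using plain scikit-learn as a stand-in for what ELI5's show_weights would surface (the toy documents, labels, and the 0.7 threshold below are placeholders, not project data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the scraped documents and their license labels.
texts = ["share alike attribution required", "public domain no rights reserved",
         "attribution required commercial use allowed", "no derivatives attribution only"]
labels = ["by-sa", "cc0", "by", "by-nd"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Words with the largest positive weight toward the first class (clf.classes_[0]).
feature_names = np.array(vectorizer.get_feature_names_out())
top = np.argsort(clf.coef_[0])[-10:]
print(feature_names[top])

# Exploring a decision threshold: only trust predictions above 70% probability.
probs = clf.predict_proba(X)
confident = probs.max(axis=1) >= 0.7
```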

SVM

Using SVD with a Support Vector Machine was unexpectedly helpful. Kernel choices have already been explored, and I found a simple linear kernel to be the most helpful for model performance. For this model, there is really much less to expand on.
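
A minimal sketch of that setup, assuming a TF-IDF front end (the component count and other hyperparameters are illustrative, not the values used in the project):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# TF-IDF features -> truncated SVD (LSA) -> linear-kernel SVM.
svm_pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000),
    TruncatedSVD(n_components=100),
    SVC(kernel="linear"),
)
# svm_pipeline.fit(train_texts, train_labels)
```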

Ensemble Methods

Ensemble methods have been the highest-performing in the modeling process, but neither ensemble method has been experimented with to its fullest potential. The Random Forest Classifier's decision-tree parameters have not been tuned, and the Gradient Boosting Classifier's choice of loss functions and other hyperparameters also went unexplored due to its training-time constraints.
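
A hedged sketch of the kind of sweep that remains to be done (the parameter grids are illustrative, not the project's settings):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Tree parameters for the Random Forest that were never tuned.
rf_search = GridSearchCV(
    RandomForestClassifier(),
    {"n_estimators": [100, 300], "max_depth": [None, 10, 30], "min_samples_leaf": [1, 5]},
    scoring="accuracy",
    cv=3,
)

# Gradient Boosting knobs (learning rate, tree depth, number of stages) left unexplored.
gb_search = GridSearchCV(
    GradientBoostingClassifier(),
    {"learning_rate": [0.05, 0.1, 0.2], "n_estimators": [100, 300], "max_depth": [2, 3, 5]},
    scoring="accuracy",
    cv=3,
)
# rf_search.fit(X_train, y_train); gb_search.fit(X_train, y_train)
```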

General Directions of Improvement for Modeling

This model was originally intended as a recommendation system that receives a document from a user and returns two to three recommended Creative Commons tools. This is why top-K accuracy became the main metric for evaluating model performance.

But for inference, we would still need to optimize the model to the point where plain accuracy itself is acceptable by realistic standards (70% or above).
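
For reference, scikit-learn already provides a metric for this recommender-style evaluation; here is a minimal, self-contained sketch with random placeholder data (not project results) comparing top-3 accuracy against plain accuracy:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import top_k_accuracy_score

# Random placeholder data: 200 samples, 10 features, 4 integer-encoded license classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 4, size=200)

clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X)

# The recommender-style goal: the correct tool only needs to appear in the top 3 suggestions.
top3 = top_k_accuracy_score(y, probs, k=3)
plain = (probs.argmax(axis=1) == y).mean()  # ordinary accuracy, for comparison
print(f"top-3 accuracy: {top3:.2f}, plain accuracy: {plain:.2f}")
```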

Submissions to Data Science Discovery

Submissions to the Data Science Discovery program, for competing in its awards, involve a poster, a video presentation, and the blog post series that I will conclude in the next article.

A link to our submission will be provided here as soon as its documents and visuals are finalized! For now, here's our posterboard presentation material as a sneak peek. The image's resolution has been compressed from 7200x5400 to what is presented below, so a link to the full-resolution picture will be attached here.

Quantifying the Common’s Final Draft for Posterboard Presentation

A remarkable portion of the poster and video presentation's contents originate from my side, perhaps because I am equipped with a broader and deeper understanding of the design policies this project employs.

This should serve to convince me that I have contributed a sufficient amount of effort and thought to the transformation of the data retrieval processes, the visualization policies, and the model experiments. In the next post, I'll wrap the blog series up with some personal reflections on remote research work, the overall delegation of tasks within this project, and the ways I learned to develop data science skills.

https://github.com/creativecommons/quantifying

“This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License”: https://creativecommons.org/licenses/by-sa/4.0/

CC BY-SA License image
