NLP for genre predictions on FFnet: an antithesis to utilitarianism

cryoptosis
15 min read · Jun 20, 2022

Background and problem statement

Have you ever wondered, ‘Gee, I’d like to make my own life harder by introducing unnecessary complications’? Well, that’s basically this idea in a nutshell; conceived in folly, and destined to end in a dustbin somewhere, languishing by the digital wayside, if you will. There are plenty of better ideas, and I have had quite a few of them whilst working on this project, but they were either too expansive, too heavy, or simply arrived too late for me to bring them to fruition. Nevertheless, it’s been quite the enjoyable ride, as my first foray into machine learning and admittedly one of my first semi-major (if I may allow myself the conceit to call it that) ventures in programming in general. Make no mistake; this isn’t a particularly difficult idea to execute, but I am one of those people who pride themselves on taking the easy path. Since sloth is, for me, both a sin and a matter of principle, I ask you to overlook this minor offense of mine. In any case, dallying will get us nowhere, so I shall strike forward to the point. (Get on with it already!)

The central idea of this project was born of my interest in developing an automated tagging system for AO3 (ao3.org). You may have heard of this site; if you haven’t, I’m afraid much of the rest of this post not explicitly related to the programming aspects of the project may be difficult to parse. Familiar users of the site may point out that the litany of ‘non-informative’ tags that permeate the site’s works would hinder the development of such a system. This may be true to some extent, but the major roadblocks I encountered while mulling over and trying out various modeling approaches did not include non-informative tagging. In fact, the two major issues one would face in developing an automated tagging system for AO3 can be set out quite simply as follows:

  1. The highly variable lengths of works on the site, spanning hundreds to millions of words, make it difficult to build an all-encompassing model. Relying on summaries instead could help, but would reduce accuracy drastically, as many summaries are similar in content to non-informative tags.
  2. The inconsistency of tagging for essentially any tag that is not either a relationship tag or what AO3 terms a ‘Warning’. Even some of these tend to wander into inaccuracy or plain nonexistence once they become peripheral enough to the work, though this is less of a problem than it may initially seem.

Though several NLP approaches have been developed recently that alleviate the first problem to some degree, I was not aware of most of them until quite late in the project and so was unable to make use of them. I may revisit this in the future, if nobody else has built such a system by then.

In any case, after considering these issues, I decided on a similar but probably simpler venture: a tag prediction system, based on the work summary, for another site, FFnet (fanfiction.net), which is similar to AO3 in purpose but quite different in its tagging philosophy. While AO3’s tags are, for the most part, user-generated and user-curated, FFnet’s tags are fixed in number, in the quantity that may be applied per work, and in definition. This massively simplifies the problem: with far fewer distinct tags, the probability that a work which technically falls into a given tag’s category is actually tagged as such by its author (call it tag awareness) is much higher. Tag application is therefore much more consistent (albeit still not perfectly so), which greatly simplifies any analysis of model accuracy.

The problem to be solved is thus: to develop a model that predicts the ‘genres’ that would be applied to a given FFnet work, given only its summary.

Metrics and baselines

Despite FFnet’s unconventional allowance of a maximum of two genre tags per work (which a fair majority of the site’s works stick to)[1], an analysis of model accuracy can be expressed fairly simply. Since the choice of the quantity of tags to use is fairly arbitrary among works in any case, the model will simply predict two genre tags in all cases, based upon the two classes with the highest probability scores in the output.

Explicitly stated, post-training calculations of overall model accuracy will consider the ratio of correctly predicted works (2 correct tags) to incorrectly predicted ones (0 correct tags). Half-correct predictions (1 correct tag) are also included, but out of necessity they enter the calculation under the assumption that all works in the initial data are tagged with two genres.

With:

  • W: Total number of works
  • x_2: Number of works with both tags correctly predicted
  • x_1: Number of works with one tag correctly predicted
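
The formula itself appeared as an image in the original post. Assuming half-correct predictions are given half weight, it presumably takes a form along the lines of:

Accuracy = (x_2 + x_1 / 2) / W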

It is worth noting, however, that this metric is inherently imprecise: for half-correct predictions it ignores the numerical probability scores output by the model, and it does not distinguish between cases where the work originally carried only one tag (in which case a fully correct prediction is impossible, since every summary is matched to the two most likely tags) and cases where the work did carry two tags but only one was correctly matched.

As for existing baselines: since this is (to put it lightly) a project without any financial, social, or humanitarian merit, it was unsurprising that I was unable to find any other projects with similar goals. As a substitute, however, one could feasibly compare the accuracy of the model to that of a human performing the same task. Since the model will necessarily be much faster, raising its accuracy to human levels (and perhaps beyond) should be one of the central goals.

Data collection and cleaning

Data scraping from FFnet was not helped by the lack of an official API. Searching GitHub for tools that could aid the process was ultimately fruitless; those I was able to find were either unsuited to my needs or simply outdated and no longer functioning due to changes in the design of the site. I therefore decided to create my own semi-manual scraping tools, which are available on GitHub. It is worth noting, however, that these tools have not been tested extensively, and the regular expressions that do the bulk of the initial data cleaning will most likely not work in all environments.
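
The real patterns live in the repo (under dangerzone/); the snippet below is only a simplified, hedged illustration of the kind of regex involved, assuming the pasted listing text contains FFnet’s usual metadata line of the form “Rated: T - English - Romance/Humor - Chapters: 12 - Words: 34,567 - …”. The function name and layout are mine, not the repo’s.

```python
import re

# Simplified stand-in for the cleaning regexes; assumes the metadata line format above.
META_RE = re.compile(
    r"Rated:\s*(?P<rating>K\+?|T|M)\s*-\s*English\s*-\s*"
    r"(?P<genres>[A-Za-z/\- ]+?)\s*-\s*Chapters:"
)

def extract_genres(meta_line: str) -> list[str]:
    """Pull the (up to two) genre tags out of a pasted FFnet metadata line."""
    m = META_RE.search(meta_line)
    if not m:
        return []                      # e.g. works with no genre tag at all
    # 'Hurt/Comfort' itself contains a slash, so shield it before splitting.
    raw = m.group("genres").replace("Hurt/Comfort", "Hurt-Comfort")
    return [g.strip().replace("Hurt-Comfort", "Hurt/Comfort")
            for g in raw.split("/") if g.strip()]

line = "Rated: T - English - Romance/Humor - Chapters: 12 - Words: 34,567 - Complete"
print(extract_genres(line))   # ['Romance', 'Humor']
```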

Additionally, though these tools do reduce the amount of time required to collect and clean the raw data, it is still necessary to access and copy-paste information from the site to a text file manually, as I was unable to find a way to automate this process (to anyone who could help develop a tool for scraping this information automatically: I thank you profusely, and would ask you to please drop a word in the comments!)

The process of data collection is fairly straightforward, though tedious: ⌘/Ctrl-A, -C, then -V’ing from each FFnet page into a single plaintext file. The choice of which pages to access is also somewhat arbitrary, as FFnet offers many filtering and sorting options. In any case, I settled on the following criteria for inclusion:

  1. Only English-language works;
  2. Only works rated K through T (the site’s default filter);
  3. Only complete works;
  4. Only works from media whose total number of derivative works exceeds roughly 50k (also a somewhat arbitrary cutoff).

This narrows down the results somewhat. The specific media that were sampled are listed in [2]. The initial scrape, used for the first rounds of training, applied two further constraints: results were sorted by number of favorites, and scraping began on the 20th page of results.

Exploratory data analysis

From the text file generated through this method[3], a pseudo-one-hot (strictly multi-hot, as the maximum is two tags, not one) DataFrame representation can be generated. An analysis of the tag distribution in the data, comprising 1756 separate works, revealed major imbalances, with more than half the tags being nearly unused. For the sake of simplicity, it was decided that only the major classes would be predicted for; here, the eight most frequent classes (the rightmost eight in the chart below). It is to be noted, however, that the absurdly high prevalence of the ‘Romance’ tag, combined with the difficulty of rebalancing the training data to an even distribution, means that the model’s predictions will inevitably show a strong preference for this tag.

The distribution of tags in the data; extremely skewed.

The removal of the near-unused tags (by way of simply removing any works that have such tags from the DataFrame), supplemented by the removal of works with zero tags (which occasionally occur, and which would not be helpful to a model trying to predict tagging), gives the final dataset that will be used in training the model.
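
Here is a hedged sketch of both steps (the multi-hot encoding and the subsequent filtering), using a toy frame with made-up column names and summaries in place of the real 1756-row scrape; the repo’s actual schema may differ.

```python
import pandas as pd

# Toy rows standing in for the parsed scrape; column names are illustrative.
df = pd.DataFrame({
    "summary": [
        "Two rivals are forced to share a dorm room.",
        "A quiet week at the shop turns into a ghost hunt.",
        "A collection of drabbles.",
    ],
    "genres": [["Romance", "Humor"], ["Supernatural", "Mystery"], []],
})

# Multi-hot ("pseudo one-hot") encoding: one column per genre, up to two 1s per row.
onehot = df["genres"].str.join("|").str.get_dummies()
data = pd.concat([df[["summary"]], onehot], axis=1)

# Tag distribution, sorted ascending so the major classes sit at the right of a bar chart.
tag_counts = onehot.sum().sort_values()

# Keep only the major classes (the eight most frequent in the real data), drop any
# work carrying a minor tag, then drop works left with zero tags.
major = tag_counts.index[-8:]
minor = tag_counts.index.difference(major)
clean = data[data[minor].sum(axis=1) == 0]
clean = clean[clean[major].sum(axis=1) > 0]
clean = clean[["summary", *major]]   # final frame: summary text + one column per major genre
print(clean)
```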

Modeling, validation, and data analysis

The simplest route I could find for building models for NLP tasks like this one was the HuggingFace ecosystem, so that is what I took and ran with. Using bert-base-uncased[4] as the pretrained model, I fine-tuned my own model on the dataset described above. The training hyperparameters can be viewed on HuggingFace[5], where the models are hosted.
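
For the curious, here is a minimal sketch of the kind of multi-label fine-tuning setup this involves. It is not the actual training script: the genre list, the toy rows, and the hyperparameter values are illustrative stand-ins, and the real hyperparameters are the ones recorded on the HuggingFace model pages[5]. The key detail is problem_type="multi_label_classification", which gives the classification head a per-label sigmoid and binary cross-entropy loss so that two genres can be active at once.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Illustrative label set standing in for the eight major genres kept above.
GENRES = ["Romance", "Humor", "Drama", "Adventure", "Friendship",
          "Hurt/Comfort", "Family", "Angst"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(GENRES),
    problem_type="multi_label_classification",  # per-label sigmoid + BCE loss
)

def tokenize(batch):
    enc = tokenizer(batch["summary"], truncation=True, max_length=128,
                    padding="max_length")
    # the multi-label loss expects float targets
    enc["labels"] = [[float(v) for v in row] for row in batch["labels"]]
    return enc

# Toy rows standing in for the cleaned multi-hot dataset built earlier.
raw = Dataset.from_dict({
    "summary": ["Two rivals are forced to share a dorm room.",
                "A quiet week at the shop turns into a ghost hunt."],
    "labels": [[1, 1, 0, 0, 0, 0, 0, 0],
               [0, 0, 1, 1, 0, 0, 0, 0]],
})
ds = raw.map(tokenize, batched=True, remove_columns=["summary"])

args = TrainingArguments(
    output_dir="ff_genre_model",
    num_train_epochs=3,               # illustrative values only; the real
    per_device_train_batch_size=8,    # hyperparameters are on the model card[5]
    learning_rate=2e-5,
)
Trainer(model=model, args=args, train_dataset=ds).train()
```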

After training and getting decently high accuracies on the initial dataset (training data included), I tried testing the model on other data samples; the results are shown in [Figure 1].

Fig. 1 ▸ Charts showing accuracy, in percentage (x axis) of works correctly/half-correctly/incorrectly tagged by different models (y axis).

This revealed an issue. Although I had trained the model for a decently long time, its accuracy on data outside the training set appeared unchanged (or worse, decreased.) Given the relatively small size of the training set compared to the amount of time the model had spent training, overfitting seemed like a real possibility. With that in mind, I decided to scrape together a larger dataset and to be more careful during training to avoid overfitting. In addition, I rewrote my accuracy visualization code so that a more granular analysis would be possible; instead of three vague classes, it is now possible to distinguish between all five classes of prediction/ground-truth comparison (a sketch of the categorization follows the list), namely those in which:

  1. The original work had two tags, both of which were correctly predicted (2t2r);
  2. The original work had one tag, which was correctly predicted (1t1r);
  3. The original work had two tags, one of which was correctly predicted (2t1r);
  4. The original work had one tag, which was not predicted (1t0r);
  5. The original work had two tags, none of which were correctly predicted (2t0r).
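
As a rough illustration (not the actual visualization code), a prediction/ground-truth pair can be sorted into these five categories along the following lines, assuming the ground truth is a set of one or two genres and the prediction is always the top-two set:

```python
def categorize(true_tags: set[str], predicted: set[str]) -> str:
    """Sort one prediction into the 2t2r / 1t1r / 2t1r / 1t0r / 2t0r scheme.

    `true_tags` holds the work's original genre tags (one or two of them);
    `predicted` holds the model's two top-scoring genres.
    """
    hits = len(true_tags & predicted)
    return f"{len(true_tags)}t{hits}r"

# e.g. a work tagged Romance/Humor for which the model predicts Romance + Drama
print(categorize({"Romance", "Humor"}, {"Romance", "Drama"}))   # '2t1r'
```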

This categorization system for prediction accuracy requires a new accuracy metric. This, too, is straightforward, though it remains somewhat arbitrary and imprecise, not unlike the metric described above.

With:

  • t_2r_2: Percentage of predictions in category {2t2r}
  • t_1r_1: Percentage of predictions in category {1t1r}
  • t_2r_1: Percentage of predictions in category {2t1r}
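
As before, the formula was shown as an image in the original post. Assuming the two fully correct categories receive full credit and {2t1r} receives half credit, it presumably looks something like:

Accuracy = t_2r_2 + t_1r_1 + t_2r_1 / 2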

Note that any works which lacked any tags altogether in the original scraped data were removed as a part of the data cleaning process, as mentioned above.

I managed to scrape a new dataset of 6443 items[6] and started training the model on this set.

The tag distribution (post-minor tag removal) of the second training dataset.

Unfortunately, it appeared that even this was too small a size for effective generalization, as after only three epochs, the model began to overfit, while its accuracy on non-training data remained roughly the same [Figure 2].

Fig. 2a and 2b ▸ Charts showing accuracy, in percentage (x axis), of predictions falling into the specific categories mentioned above, against the models that produced these predictions (y axis). Each model’s full name can be found in [7]. The chart on the top, 2a, shows the prediction accuracy statistics for a specific test set[8], while the chart on the bottom, 2b, shows statistics for the training set at [6]. Each model prefixed with ‘zdr5’ is one of the checkpoints for the model[9] used here. Consecutive zdr5 checkpoints are separated by 3 epochs, descending; for instance, ‘zdr5-train-o3’ is 3 epochs of additional training from ‘zdr5-train-o4’. Note that the latter 5 model checkpoints in 2a are the same 5 model checkpoints as in 2b.[7]

One can observe from the ballooning accuracy in [Figure 2b] contrasted with the nearly unchanged accuracy (ignoring ‘valh’, which is a different model entirely) in [Figure 2a] that the model is overfitting yet again, and failing to generalize to new data. In fact, the best-performing model (checkpoint) here appears to be zdr5-train-o4 (=zdr5-test-old-4), which was obtained after only 3 epochs of training with the new dataset[6], as successive zdr5 model checkpoints in [Figure 2a] actually see reduced accuracy.

Due to time constraints, I was unable to gather more data and continue training in the hope of alleviating the model’s constant overfitting (though I do intend to develop this further in the future.) For the moment, I decided to simply use the zdr5-o4 (zdr5-train-o4) checkpoint, as the model with the highest overall accuracy, to test whether I had achieved one of the major goals of my project: namely, whether the model could outperform a human in terms of accuracy.

After gathering another test set of 106 items[10], I asked the manual-classification participant to classify, to the best of their ability, each item into the same major genres that the model predicts for. The participant was also instructed to always assign two genres per summary, which ultimately yields the same form of output as the model (which returns numeric scores, but whose accuracy is judged not on these values but on the two highest-scoring guesses, as previously mentioned.)

Due to a lack of participants (and a lack of due diligence for finding them on my part as well), I decided to supplement the manual classification data with my own; I performed the same task as the participant, and used the results to augment the available quality-of-manual-classification data (although, if you’d permit me a little bit of humor here — a sample size of n=2 is not a significant improvement from n=1, though if one rephrases it to ‘a 100% increase’ it may sound more impressive.) Nevertheless, this was the best-quality data I was able to gather, because, as previously mentioned, this project, being completely useless in general, does not have any supporting data aside from what I have gathered here.

Fig. 3 ▸ Charts showing accuracy of predictions. zdr5-o4-ffd7 is zdr5-train-o4 (the best model, chosen for comparison) on the ffdump7[8] test dataset. zdr5-o4 is the same model on the final test set[10]. man1 is the first human on the final test set[10]. man2 is the second human, also on the final test set[10]. Precise numerical percentage data is available above the chart.

It can be observed from [Figure 3] (and the accompanying numerical data), by comparing man1 with zdr5-o4, that zdr5-o4 performs measurably better than the human classifiers in terms of accuracy, satisfying to some degree one of the goals of the project. Comparing the accuracy across the two test sets (zdr5-o4-ffd7 versus zdr5-o4) shows no significant differences between them: while the ‘2t2r’, ‘1t1r’, and ‘2t1r’ bars vary slightly in size, the effect on overall accuracy, as calculated with the new metric, is minor.

This suggests that the accuracy yielded by the model on the final test set is likely not a fluke; other, larger test sets yield similar accuracies as well. Furthermore, the model outperforms existing non-finetuned models [Figure 4] (though this is rather to be expected.)

Fig. 4 ▸ Charts showing accuracy of predictions, all on the ffd7[8] test dataset. Each of these models is hosted on HuggingFace [12].

Deployment

Deployment was fairly simple; I used Streamlit to create a webapp that gives immediate access to the model, letting users from any background try it out for themselves. The webapp is available here, and is very simple to use. Try it out!
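
The actual app code is linked in [11]; the snippet below is only a hedged sketch of what such a Streamlit front end might look like. The checkpoint name points at one of the hosted model repos[9] purely for illustration (the deployed app’s actual checkpoint may differ), and the label names come from whatever id2label mapping is stored in that model’s config.

```python
import streamlit as st
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# One of the hosted checkpoints[9], used here for illustration; a real app
# would also cache this load rather than repeating it on every rerun.
MODEL_REPO = "zdreiosis/ff_analysis_5"
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_REPO)

st.title("FFnet genre predictor")
summary = st.text_area("Paste a work summary:")

if st.button("Predict") and summary.strip():
    inputs = tokenizer(summary, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    probs = torch.sigmoid(logits)        # independent per-genre scores
    top = torch.topk(probs, k=2)         # the two highest-scoring genres
    id2label = model.config.id2label     # label names stored in the model config
    st.write("Predicted genres:",
             ", ".join(id2label[i.item()] for i in top.indices))
```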

Some examples can be found below, with appropriate credits provided (if you don’t want your work to be shown here, feel free to contact me and I’ll take it down!)

credit: https://www.fanfiction.net/s/13663441/1/Two-Months [by HibiscusAngel15]
credit: https://www.fanfiction.net/s/13507942/1/For-The-Life-You-Could-ve-Had [by LethanWolf]
credit: https://www.fanfiction.net/s/13239535/1/Microeconomics [by DearCat1]

The actual code behind the webapp is also available on GitHub.[11]

This and that

You may have noticed that this article was quite code-light for a topic like NLP; the reason for this, I will say frankly, is that I am terrible at coding. I am terrible at documentation, I am terrible at keeping track of what is what and where it is. My Jupyter notebooks are a mess, hodgepodges of spaghetti — nay, fettuccine code. I barely understand them anymore, and they run so incredibly slowly I legitimately considered rewriting the whole thing in the last week because I simply could not stand it. I wish anyone brave enough (and with enough free time on their hands) the best of luck in trying to understand and use my code (although really, why would you do that.)

Back to reality for a second: I’m not a developer. I had practically no coding experience before this project. Despite all this, and whether or not I succeed in persuading anyone of the nonexistent merits of my model, I think I’ve learned several valuable things from this project: one, the importance of doing work punctually and avoiding procrastination; two, that proper documentation and coding hygiene cannot be avoided if you want to build a project of any scale; and three, that organization at all levels is the best antidote to confusion.

I would like to thank all my mentors in AI Builders for their continued assistance and advice regarding my ridiculous project ideas. I would like to apologize in advance to anyone reading this hoping for something useful at the end; it’s more of the same nonsense, I’m afraid. Ultimately, this project was quite frankly a mess, and I’d be surprised if it gets a pass. However, I fully intend it to be the best-looking mess in this cordoned-off corner of the internet. (That’s supposed to be a positive thing… I think.) Here’s hoping. ■

Acknowledgments

Much of the training code is heavily based on this tutorial.

Special thanks to Neal (Twitter) for helping out with the manual classification data.

Come shout at me on Twitter!

Footnotes and references

All of my code is available on this GitHub repo. Files referenced as being used during development are in the dangerzone/ directory.

[1] Likely around 60%, based on the data analysis in [Figure 2b]: the ‘1t1r’ bar for zdr5-train-1 (representing predictions where the original work was tagged with only one genre, which the model predicted correctly) takes up nearly all of the space remaining after the ‘2t2r’ bar and has a width of approximately 40%.

[2] Specifically, Naruto, Inuyasha, Hetalia Axis Powers, Bleach, Fairy Tail, Yu-Gi-Oh, Dragon Ball Z, Harry Potter, Twilight, Percy Jackson and the Olympians, Lord of the Rings, Avatar the Last Airbender, Pokemon, Kingdom Hearts, Star Wars, Avengers, Supernatural, Glee, Doctor Who, Sherlock, Once Upon a Time, and Buffy the Vampire Slayer.

[3] rawwebtext-linklist-p20-to-22.txt, available on GitHub.

[4] https://huggingface.co/bert-base-uncased

[5] https://huggingface.co/zdreiosis/ff_analysis_4

[6] ff_dump_final2.csv, on GitHub.

[7] All on HuggingFace.

[8] ffdump7.csv, available on GitHub.

[9] https://huggingface.co/zdreiosis/ff_analysis_5

[10] manualtest1.txt, available on GitHub.

[11] https://github.com/zeiosis/ai-builders-deploy/blob/main/app.py

[12]
