Building a Tagging System for TechFlix: Part 1

The initial design

Published in

The Research Nest

8 min read · May 7, 2023

As I was building TechFlix (TL;DR: the Netflix for software engineering content), I ran into a problem statement.

I want to have tags assigned to the content that is curated. This should be uniform and as accurate as possible. These tags will also be used to create filter options/screens on the UI. They can also be used to pre-populate hashtags for tweeting content via a custom tweet button.

Given this context, I thought about having an extensible, modular, reusable, and scalable automatic tagging system developed for tech content tagging—a lot of fancy words.

(Scroll down to the design section to skip the initial research and perspective)

This is also a very standard problem statement engineers might come across when building social media or blogging platforms. My goal for this side project branching out of TechFlix would be to create an open-source tagging system that people can easily extend to use for anything they want, from tagging stuff to even creating datasets. I have never built something like this before. So, it will also be an interesting learning experience as I build it in public.

In my specific case, this tagging system is expected to take the content of an article as input and return a list of tags applicable to it.

Upon initial analysis, I found many methods to assign tags to articles automatically. Some of them are as follows:

Using a lookup table: This involves creating a predefined list of tags and their corresponding keywords or phrases that can be used to match with the articles. For example, the tag “Web Development” can be associated with keywords like “HTML,” “CSS,” “JavaScript,” “React,” etc. The method then scans the articles for these keywords and assigns the matching tags. This method is simple and fast, but it requires manual creation and maintenance of the lookup table, and it may not capture the nuances or variations in content.
Using a rule-based system: This involves creating a set of rules or criteria that can be used to assign tags to the articles based on their content or metadata. For example, the rule “if the article mentions ‘Python’ and ‘Django,’ then assign the tag ‘Python Web Framework’” can be used to tag articles that are about using Django for web development in Python. This method is more flexible and expressive than using a lookup table, but it also requires manual creation and maintenance of the rules, and it may not handle exceptions or conflicts well.
Using a machine learning model: This involves using a pre-trained or fine-tuned machine learning model that can learn to assign tags to the articles based on their content or features. For example, a text classification model like BERT or T5 can be fine-tuned on a labeled dataset of software engineering articles and their tags and then be used to predict tags for new articles. The method then feeds the articles to the model and assigns the tags based on the model’s output. This method is more powerful and adaptive than using a rule-based system or a lookup table. Still, it also requires a large and high-quality dataset for training, and it may not be interpretable or explainable.
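To make the first method concrete, here is a minimal sketch of a lookup-table tagger. The table contents, the `TAG_KEYWORDS` name, and the `tag_by_lookup` function are illustrative assumptions, not part of the TechFlix code.

```python
import re

# Hypothetical lookup table: tag -> keywords that suggest it.
TAG_KEYWORDS = {
    "Web Development": ["HTML", "CSS", "JavaScript", "React"],
    "Cloud Computing": ["AWS", "Google Cloud", "Azure"],
}


def tag_by_lookup(content, table=TAG_KEYWORDS):
    """Return the sorted tags whose keywords appear in the content as whole words."""
    tags = set()
    for tag, keywords in table.items():
        for keyword in keywords:
            # Word boundaries keep a keyword from matching inside a longer word.
            pattern = r"\b" + re.escape(keyword) + r"\b"
            if re.search(pattern, content, flags=re.IGNORECASE):
                tags.add(tag)
                break  # one hit is enough for this tag
    return sorted(tags)


print(tag_by_lookup("Styling a React app with CSS and deploying it on AWS"))
```

The maintenance cost mentioned above shows up right away: every new technology means another manual entry in the table.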

I would go with the ML method if I were a fancy AI startup with millions of dollars in funding and unlimited computation power. With models like GPT-4, we don’t need a custom fine-tuned model or any new dataset for this task. We can directly make an API call with the content and a prompt asking for tags, and get back the response we need with pretty high accuracy.

However, my rational side says that using AI for generating tags is overkill. It has a few other drawbacks.

It will be financially expensive.
It will be computationally expensive and consume more time.
It is unexplainable, and an AI model may sometimes give out too diverse, fragmented, or bizarre responses.

AI is a nice word to market some product or solution, but it is not necessarily the best way to do everything. I want a no-cost, ultra-fast, 100% explainable tagging system for my project: a proper rule-based tagging system.

And so, I began to design this system.

The Design

We need a framework of logical steps to structure this system.

Identify the article's main topic: For example, if the article is about building some websites, the relevant tag to assign would be “Web development.”
Analyze the text for keywords: The next step is to analyze the text for relevant keywords related to software engineering sub-domains. For instance, if the article has keywords like AWS or Google Cloud, we may assign a tag like “Cloud Computing” to it.
Determine the type of article: The type of article can also help in assigning relevant tags. For instance, if the article is a tutorial, it can be tagged as such. If the article is a system design case study, then the relevant tags could be software architecture, design patterns, etc.
Apply a set of rules: Once the above steps have been completed, a set of rules can be applied to assign relevant sub-domains. Based on what the algorithm finds, tags can be assigned.

This is just the basic overview to start with.

How do we actually build this system while living up to all the fancy keywords I mentioned before?

Since a Python script creates TechFlix content, I decided to build this system in Python itself. Once I have a functional module, I may explore other approaches and languages based on pros and cons.

The most straightforward approach would be to create a Python function that takes in the content, runs it through a bunch of if-else statements, and applies logic to find the main topic, keywords, and other such stuff to ultimately map them to the desired tags.

There are a bunch of problems with this. Rules and logic are hardcoded directly into the function, which makes it hard to add and test new rules in the future. It also makes it hard for other people to use it. It can get complicated quickly, and there is little flexibility. It is also clumsy if we want to run only some rules or make alterations.
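For illustration, the hardcoded version might look something like the sketch below. The function name `assign_tags_naive` and the specific conditions are made up for this example; the point is only that every rule lives inside the function body.

```python
def assign_tags_naive(content):
    """Hardcoded tagging: every rule is an if-statement inside this function."""
    text = content.lower()
    tags = []

    # Keyword check, done with plain substring matching.
    if "aws" in text or "google cloud" in text:
        tags.append("Cloud Computing")

    # Combined-keyword rule.
    if "python" in text and "django" in text:
        tags.append("Python Web Framework")

    # Article-type check.
    if "tutorial" in text or "step by step" in text:
        tags.append("Tutorial")

    return tags


print(assign_tags_naive("A Django tutorial in Python"))
# Substring matching also fires on unrelated words: "laws" contains "aws".
print(assign_tags_naive("New tax laws explained"))
```

Adding a rule, disabling one, or fixing the false positive in the second call all mean editing and retesting the function itself.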

How to make it better?

Let’s try to make everything modular. We can think of having multiple blocks in the system that have a very specific use case.

A block, just to define rules.
A block to store the information on tags that are to be assigned.
A block that can apply the rules defined in whatever way we want to return a list of tags for the given content.

Thinking in terms of “building blocks” is really helpful in general.

With this approach, we separate rules from the block that actually generates the tags. That way, we can easily modify the rules however we want. Someone else can come in and define an entirely new set of rules for their specific use case, and the tagging system should seamlessly work with those new rules.
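A rough sketch of that separation, under the assumption that a rule is just a tag plus a predicate over the content. The names `Rule`, `apply_rules`, `default_rules`, and `cooking_rules` are hypothetical and only meant to show the idea.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Rule:
    """Rule block: the tag to assign and a predicate that decides when to assign it."""
    tag: str
    matches: Callable[[str], bool]


def apply_rules(content: str, rules: Iterable[Rule]) -> List[str]:
    """Engine block: knows nothing about specific rules, it only runs them in order."""
    tags: List[str] = []
    for rule in rules:
        if rule.matches(content) and rule.tag not in tags:
            tags.append(rule.tag)
    return tags


# Rule sets live outside the engine, so they can be swapped freely.
default_rules = [
    Rule("Cloud Computing", lambda text: "kubernetes" in text.lower()),
    Rule("Tutorial", lambda text: "step by step" in text.lower()),
]
cooking_rules = [Rule("Baking", lambda text: "oven" in text.lower())]

print(apply_rules("Kubernetes step by step", default_rules))
print(apply_rules("Preheat the oven to 180 degrees", cooking_rules))
```

The same `apply_rules` function serves both rule sets without any change, which is the property we are after.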

We can use a configuration file (like .yaml) to store details about the rules and tags to be assigned.

Instead of creating the TaggingSystem object directly, we can use a factory function to create it. This way, we can encapsulate the creation logic and make it more flexible. For example, we can pass the configuration file name or a dictionary of rules as arguments to the factory function and initialize the system. This will be our TaggingService.

In short, the TaggingService will take in the configurations, rules, and the TaggingSystem and initialize it as required. Doing it this way can help us scale better and add additional improvements easily. For example, if you have many articles and rules to process, you may need a caching mechanism to store the most frequently accessed information. That can be done in the service class without disturbing the main tagging system logic.
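Here is one way this could be sketched. Everything below is an assumption for illustration: the `TaggingSystem` shown is a bare stand-in, the rules are assumed to be a dictionary of tag to keywords, and `create_tagging_service` is a hypothetical factory name. Loading rules from a YAML file is left out because it would need a third-party YAML parser.

```python
from functools import lru_cache


class TaggingSystem:
    """Stand-in for the core system: matches keywords and returns tags."""

    def __init__(self, rules):
        # Assumed shape: {"Tag name": ["keyword", ...]}
        self.rules = {tag: [k.lower() for k in kws] for tag, kws in rules.items()}

    def tag(self, content):
        text = content.lower()
        return sorted(
            tag for tag, kws in self.rules.items() if any(k in text for k in kws)
        )


class TaggingService:
    """Wraps a TaggingSystem and adds caching without touching its logic."""

    def __init__(self, system, cache_size=128):
        self._system = system
        # Per-instance cache; results are stored as tuples so they stay immutable.
        self._cached_tag = lru_cache(maxsize=cache_size)(self._tag_uncached)

    def _tag_uncached(self, content):
        return tuple(self._system.tag(content))

    def tag(self, content):
        return list(self._cached_tag(content))

    def cache_info(self):
        return self._cached_tag.cache_info()


def create_tagging_service(rules, cache_size=128):
    """Factory: hides how the system and service are wired together."""
    if not isinstance(rules, dict):
        raise TypeError("rules must be a dict of tag -> list of keywords")
    return TaggingService(TaggingSystem(rules), cache_size=cache_size)


service = create_tagging_service({"Cloud Computing": ["AWS", "Azure"]})
print(service.tag("Deploying to AWS"))
print(service.tag("Deploying to AWS"))  # second call is served from the cache
print(service.cache_info().hits)
```

Because callers only see the factory and `tag`, the caching strategy can change later without affecting the tagging logic or the calling code.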

Here’s the file structure I am expecting to have for the Tagging module.

tagging_system/
│
├── config/
│   └── rules.yml
│
├── __init__.py
├── rules.py
├── tagging_system.py
└── tagging_service.py

(Note: This may not be the final design, per se. Things will evolve as we explore further.)

The config directory contains the configuration files for the tagging system. In this example, we have a YAML file called rules.yml that will contain the default rules for tagging software engineering articles.

The __init__.py file is an empty file that indicates that the tagging_system directory is a package.

The rules.py file will contain the definitions of the Rule class and its subclasses used in the TaggingSystem class.

The tagging_system.py file will contain the implementation of the TaggingSystem class, which is responsible for loading the rules, matching the patterns with the article content, and generating tags.

The tagging_service.py file will contain the implementation of the TaggingService class, which uses the TaggingSystem object to tag articles and provide things like a caching mechanism to improve performance. (I probably won’t need it as I am operating at a much smaller scale)

This file structure separates the concerns of the tagging system into different modules and files, making it easier to understand and maintain. It also allows us to easily swap out or add new modules as needed.

What exactly will the rules look like?

This is open to discussion and brainstorming. So far, I have broadly come up with three types of rules:

A TopicRule that matches the article's main topic to the related domains.
A KeywordRule that matches the important keywords occurring in the article to the related domains.
A TypeRule that looks for the type of article (like a tutorial, for example) and matches it to the required tag.

Using all such rules, we can create an algorithm to determine the final set of tags that can be assigned. Let us try to visualize how it might actually work.

If a TopicRule matches the article's main topic as “Software Architecture,” it can assign related domains such as “System Design” as a potential tag.
If a KeywordRule finds that the article has keywords like “AWS” or “Azure,” it can assign “Cloud Computing” as a potential tag.
If a TypeRule finds that the article is a case study on the successful use of some technology, it can assign stuff like “Success Stories” or “Migration” as a potential tag.
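The three rule types could be sketched as subclasses of a common `Rule` base class. This is only a first guess at the shape: in particular, approximating the "main topic" by looking at the title, and detecting the article type through marker phrases, are simplifying assumptions made for this example.

```python
import re
from abc import ABC, abstractmethod


def contains(phrase, text):
    """Case-insensitive whole-phrase match."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", text, re.IGNORECASE) is not None


class Rule(ABC):
    def __init__(self, tags):
        self.tags = list(tags)

    @abstractmethod
    def matches(self, title, body):
        """Return True if this rule applies to the article."""

    def apply(self, title, body):
        # A rule contributes its tags only when it matches.
        return list(self.tags) if self.matches(title, body) else []


class TopicRule(Rule):
    """Assumption: the main topic is approximated by a phrase in the title."""

    def __init__(self, topic, tags):
        super().__init__(tags)
        self.topic = topic

    def matches(self, title, body):
        return contains(self.topic, title)


class KeywordRule(Rule):
    def __init__(self, keywords, tags):
        super().__init__(tags)
        self.keywords = list(keywords)

    def matches(self, title, body):
        return any(contains(k, body) for k in self.keywords)


class TypeRule(Rule):
    """Assumption: the article type is guessed from marker phrases."""

    def __init__(self, markers, tags):
        super().__init__(tags)
        self.markers = list(markers)

    def matches(self, title, body):
        text = title + "\n" + body
        return any(contains(m, text) for m in self.markers)


rules = [
    TopicRule("Software Architecture", ["System Design"]),
    KeywordRule(["AWS", "Azure"], ["Cloud Computing"]),
    TypeRule(["case study"], ["Success Stories"]),
]
title = "A Software Architecture Case Study"
body = "How we moved our services to AWS."
potential_tags = [tag for rule in rules for tag in rule.apply(title, body)]
print(potential_tags)
```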

We can have multiple rules under each category, and once we check for everything, we can consolidate the final list of tags based on some additional logic.
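The "additional logic" is still open at this point. As one possible placeholder, the sketch below counts how many rules voted for each tag and keeps the most supported ones. The `consolidate` function and its `max_tags` and `min_votes` parameters are hypothetical.

```python
from collections import Counter


def consolidate(potential_tags, max_tags=5, min_votes=1):
    """Merge potential tags from all rules into a final, de-duplicated list."""
    counts = Counter(potential_tags)
    # Most votes first; ties are broken alphabetically so the output is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, votes in ranked if votes >= min_votes][:max_tags]


votes = ["Cloud Computing", "System Design", "Cloud Computing", "Tutorial"]
print(consolidate(votes, max_tags=2))
```

Other strategies, such as weighting rule types differently, would slot into the same place.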

Next, we can try to think about the following:

The logic for defining and implementing these rules
The data structures to use, keeping performance in mind

I will be covering them and more of the design and implementation in the next part of this article series.

Link to the next part:

Building a Tagging System for TechFlix: Part 2

Feel free to suggest any ideas, approaches, or better ways to do things in the responses!

Written by XQ
