Building a Tagging System for TechFlix: Part 2

Initial implementations

XQ
Published in The Research Nest
4 min read · May 13

Photo by Kelly Sikkema on Unsplash

Following up on my previous article, here’s what we will explore in this one.

We will build a “tagging_module” in Python to assign tags to software engineering articles curated by TechFlix automatically.

I will implement a simple keyword-based rule to assign tags and package it into a reusable and extensible module. Without further ado, let’s get started.

The File Structure

Here’s the updated structure I am following.

tagging_module/
├── config/
│   └── rules.yml
├── __init__.py
├── tagging_system.py
└── tagging_service.py

My initial rules.yml file looks something like this. Note that this is just a sample, and the original file contains more information.

Web:
- web development
- front-end development
- web design
- html
- css
- javascript
- responsive design

Mobile:
- mobile app development
- ios development
- android development
- swift
- kotlin
- objective-c
- mobile user interface

Database:
- database management
- sql
- nosql
- relational databases
- database design
- data modeling

Cloud:
- cloud computing
- amazon web services
- aws
- microsoft azure
- azure
- google cloud platform
- infrastructure as a service

AI:
- artificial intelligence
- machine learning
- deep learning
- neural networks
- natural language processing
- nlp
- ai

DevOps:
- devops
- continuous integration
- continuous delivery
- version control

Security:
- cybersecurity
- encryption
- authentication
- network security
- secure coding practices
- vulnerability assessment

Backend:
- backend development
- server-side programming
- apis
- node.js
- ruby on rails
- python
- rust

It’s basically information related to the tags I want to assign and the corresponding keywords.

Ideally, a more comprehensive system would separate rules by type, with each rule carrying the data it needs. For example, you could specify a rule type for each tag and store different kinds of data for different rule types. Let’s keep it simple for now.
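To make the idea of rule types a bit more concrete, here is a minimal sketch of what such a structure might look like once loaded into Python. The `type` and `values` fields, the `keyword` and `phrase` type names, and the `rules_by_type` helper are all hypothetical. They are not part of the actual TechFlix rules file.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical richer rule format: each tag declares a rule type and its data.
# A more detailed rules.yml could load into a dictionary shaped like this.
rules = {
    "Web": {"type": "phrase", "values": ["web development", "responsive design"]},
    "Database": {"type": "keyword", "values": ["sql", "nosql"]},
    "AI": {"type": "keyword", "values": ["nlp", "ai"]},
}


def rules_by_type(rules: Dict[str, dict]) -> Dict[str, List[str]]:
    """Group tag names by rule type so each type can get its own handler."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for tag, rule in rules.items():
        grouped[rule["type"]].append(tag)
    return dict(grouped)


print(rules_by_type(rules))
# {'phrase': ['Web'], 'keyword': ['Database', 'AI']}
```

Grouping by type like this would let each kind of rule be evaluated by its own function later, without changing the file format again.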

The Tagging System

It can use a simple algorithm as follows:

  • Read the rules from the YAML file
  • Get the tags and the corresponding keywords
  • Check if any keyword exists in the article content and assign the corresponding tag
  • Return the list of all assigned tags

Pretty straightforward. Here’s what it looks like in the code.

# tagging_system.py

from typing import List

import yaml


class TaggingSystem:
    """
    A class for assigning tags based on a set of rules.
    """

    @staticmethod
    def assign_tags(content: str, rules_path: str) -> List[str]:
        """
        Assigns tags based on a set of rules specified in a YAML file.
        """
        if not content or not rules_path:
            raise ValueError("Invalid content or rules_path")

        try:
            # Load rules from YAML file
            with open(rules_path, 'r', encoding='utf-8') as f:
                rule_dict = yaml.safe_load(f)

            # Convert keywords to lowercase sets
            rule_dict = {
                tag: {str(key).lower() for key in keys}
                for tag, keys in rule_dict.items()
            }

            # Lowercase the content and strip punctuation around each word,
            # so that "html," still matches the keyword "html"
            words = {word.strip(".,;:!?()\"'") for word in content.lower().split()}
            tags = []

            # Check each rule. Only single-word keywords can match here,
            # because the content is compared word by word.
            for tag, keys in rule_dict.items():
                common_words = keys.intersection(words)
                if common_words:
                    tags.append(tag)
            return tags

        except Exception as e:
            # Raise an exception instead of printing and returning empty list
            raise ValueError(f"Error assigning tags: {e}") from e

Notice a few things in the code.

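  • The code is wrapped in a try block with exception handling. Catching the error and re-raising it with a descriptive message makes failures, such as a missing or malformed rules file, easier to debug, and it lets the caller decide how to handle them.
  • We are using the set intersection method to find common words and then assign the tag, instead of looping through the keywords manually and checking whether each one exists in the content. One caveat: because the content is compared word by word, only single-word keywords such as html or sql can match this way. Multi-word keywords such as web development need a phrase-level check.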

Right now, we have an unintelligent and raw approach to assigning tags. We check keywords; if they exist in the content, we assign the corresponding tag. However, this is more than enough to start with.
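Since many of the keywords in rules.yml are phrases, a phrase-aware variant is worth sketching. The snippet below is one possible approach, not the code TechFlix runs: it swaps the set intersection for a regular-expression search per keyword. The `match_tags` function and the `sample_rules` dictionary are hypothetical names used only for illustration.

```python
import re
from typing import Dict, List


def match_tags(content: str, rules: Dict[str, List[str]]) -> List[str]:
    """Return tags whose keywords appear in content as whole words or phrases."""
    text = content.lower()
    tags = []
    for tag, keywords in rules.items():
        for keyword in keywords:
            # The lookarounds require a non-word character (or the text edge)
            # on both sides, and re.escape keeps the dot in "node.js" literal.
            pattern = r"(?<!\w)" + re.escape(keyword.lower()) + r"(?!\w)"
            if re.search(pattern, text):
                tags.append(tag)
                break  # one matching keyword is enough for this tag
    return tags


sample_rules = {
    "Web": ["web development", "html"],
    "Mobile": ["ios development", "swift"],
    "Database": ["sql", "nosql"],
    "AI": ["ai", "nlp"],
}

print(match_tags("Notes on web development, SQL, and containers.", sample_rules))
# ['Web', 'Database']
```

Note that "ai" does not match inside "containers" here, which a plain substring check would get wrong. The trade-off is one regex search per keyword instead of a single set intersection.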

The Tagging Service

Here’s the code for my service class.

# tagging_service.py

import os
from typing import List, Optional

from tagging_module.tagging_system import TaggingSystem


class TaggingService:
    """
    A class for providing a tagging service.
    """

    rules_path = "tagging_module/config/rules.yml"

    def __init__(self, rules_path: Optional[str] = None):
        if rules_path is None:
            rules_path = TaggingService.rules_path
        if not os.path.isfile(rules_path):
            raise FileNotFoundError(f"Rules file not found: {rules_path}")
        self.rules_path = rules_path

    def assign_tags(self, content: str) -> List[str]:
        return TaggingSystem.assign_tags(content, self.rules_path)

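It’s a simple wrapper on top of the tagging system. Notice how the path to the rules.yml file is resolved and validated in the constructor. The default path is relative, so it resolves against the directory you run the code from. We can now use this service to assign tags by passing the article content.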

Also, we need to add the line below to the __init__.py file inside the tagging_module folder so that the service class can be imported directly from the package.

from .tagging_service import TaggingService

The code below initializes the tagging service with default parameters.

from tagging_module import TaggingService

tagging_service = TaggingService()

And now we are ready to use it wherever we want.

Testing

Here’s some test code to understand how it’s actually working.

from tagging_module import TaggingService

tagging_service = TaggingService()

content = "This is a sample content for tagging. It has keywords like web development, html, ios and sql"

tags = tagging_service.assign_tags(content)
print(tags)

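The output, with my full rules file (the trimmed sample above has no standalone ios keyword, so on its own it would not produce Mobile):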

['Web', 'Mobile', 'Database']

Pretty neat.

There are a lot of optimizations, enhancements, and new rules that can be added to this system. For example, think about detecting a “tutorial” or a “system design” article. Also, we need to think about patterns in the text beyond simple keyword matching.
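As a rough illustration of going beyond topic keywords, here is a small sketch that detects the type of an article from text patterns. The patterns, the `PATTERN_RULES` dictionary, and the `assign_pattern_tags` function are assumptions made up for this example, and real articles would need far more careful patterns.

```python
import re
from typing import Dict, List

# Hypothetical patterns that hint at an article's type rather than its topic.
PATTERN_RULES: Dict[str, List[str]] = {
    "Tutorial": [r"\bhow to\b", r"\bstep[- ]by[- ]step\b", r"\bgetting started\b"],
    "System Design": [r"\bsystem design\b", r"\bdesigning (a|an|the) \w+"],
}


def assign_pattern_tags(content: str) -> List[str]:
    """Return every tag that has at least one pattern matching the content."""
    return [
        tag
        for tag, patterns in PATTERN_RULES.items()
        if any(re.search(p, content, flags=re.IGNORECASE) for p in patterns)
    ]


print(assign_pattern_tags("A step-by-step guide: how to deploy a Flask app"))
# ['Tutorial']
```

Patterns like these are easy to over-match, so in practice they would likely be combined with other signals such as the title or the article length.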

How do we extend this module such that it is highly customizable and can work for a variety of rules that anyone can plug and play with?

I will explore more in that direction in the next article in this series!

I used this initial version to assign tags to all the articles curated at TechFlix v0.5, and it does a decent job overall.

You may check out the end results at xqbuilds.com/techflix
