Detecting Sports Content with Machine Learning

5 min read · Apr 28, 2024

📚Introduction

In the vast landscape of digital content, the ability to categorize and understand text is essential for efficient information retrieval and analysis. Whether it’s filtering news articles, organizing research papers, or personalizing recommendations, text classification plays a pivotal role in various applications. In this blog, we delve into the fascinating realm of text classification, focusing on a specific task: detecting sports-related content amidst a sea of diverse documents.

Through the lens of machine learning, we embark on a journey to unravel the intricacies of distinguishing sports-related documents from their non-sports counterparts. Armed with a small but illustrative dataset, we navigate the process of training a classifier to recognize the distinctive features of sports content, from the thrill of the game to the nuances of policy discussions surrounding sporting events.

Join us as we explore the methods, challenges, and insights from this exercise. From representing text with TF-IDF to training a linear support vector classifier and evaluating it with accuracy, precision, recall, and F1, this blog walks through a small end-to-end text classification example.

Whether you’re a seasoned data scientist, an aspiring machine learning enthusiast, or simply curious about the magic behind intelligent systems, this exploration into the realm of sports text classification promises to inform, inspire, and ignite your curiosity. So, let’s embark on this adventure together and discover the hidden patterns that lie within the words we read.

📚Table of Contents

Import library
Train and Test Data
Data Preprocessing
Train the classifier
Model Evaluation
Model performance improvement
Model Evaluation

📚Import library

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

📚Train and test data

# Train and test data. Both the full documents and their labels ("Sports" vs "Non Sports")
train_data = ['Football: a great sport', 'The referee has been very bad this season', 'Our team scored 5 goals', 'I love tennis',
              'Politics is in decline in the UK', 'Brexit means Brexit', 'The parlament wants to create new legislation',
              'I so want to travel the world']
train_labels = ["Sports","Sports","Sports","Sports", "Non Sports", "Non Sports", "Non Sports", "Non Sports"]

test_data = ['Swimming is a great sport',
             'A lot of policy changes will happen after Brexit',
             'The table tennis team will travel to the UK soon for the European Championship']
test_labels = ["Sports","Non Sports","Sports"]

📚Data Preprocessing

# Representation of the data using TF-IDF
vectorizer = TfidfVectorizer()
vectorised_train_data = vectorizer.fit_transform(train_data)
vectorised_test_data = vectorizer.transform(test_data)

📚Train the classifier


# Train the classifier given the training data
classifier = LinearSVC()
classifier.fit(vectorised_train_data, train_labels)

📚Model Evaluation


# Predict the labels for the test documents (not used for training)
print(classifier.predict(vectorised_test_data))
#['Sports' 'Non Sports' 'Non Sports']

However, the third case is wrongly classified. Why do you think that might be? Some common causes:

Matching problems (e.g., “car” is different from “Cars”)
Cases never seen before (e.g., the classifier has never seen the word “table”)
“Spurious” correlations and bias (e.g., “car” appears only in the positive category)
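
The "never seen before" case is easy to check. As a small sketch (the name `unseen_per_doc` is ours), we can tokenize each test document with the vectorizer's own analyzer and list the tokens that have no entry in the training vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_data = ['Football: a great sport', 'The referee has been very bad this season',
              'Our team scored 5 goals', 'I love tennis',
              'Politics is in decline in the UK', 'Brexit means Brexit',
              'The parlament wants to create new legislation',
              'I so want to travel the world']
test_data = ['Swimming is a great sport',
             'A lot of policy changes will happen after Brexit',
             'The table tennis team will travel to the UK soon for the European Championship']

vectorizer = TfidfVectorizer().fit(train_data)
analyzer = vectorizer.build_analyzer()  # same lowercasing and tokenization as the vectorizer
known = vectorizer.vocabulary_

# For each test document, collect the tokens the classifier has never seen
unseen_per_doc = [sorted({tok for tok in analyzer(doc) if tok not in known})
                  for doc in test_data]

for doc, unseen in zip(test_data, unseen_per_doc):
    print(f"{doc!r}\n  unseen tokens: {unseen}")
```

These unseen tokens are silently dropped by `transform`, so words like "table" and "championship" contribute nothing to the prediction for the third document.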

Let’s first measure the accuracy on the test set, and then look into how we are representing our documents.

from sklearn.metrics import accuracy_score

# Vectorize the test data using the same TF-IDF vectorizer used for training
vectorized_test_data = vectorizer.transform(test_data)

# Predict labels for the test data
predicted_labels = classifier.predict(vectorized_test_data)

# Calculate accuracy
accuracy = accuracy_score(test_labels, predicted_labels)
print("Accuracy:", accuracy)

Accuracy: 0.6666666666666666
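
Accuracy is simply the fraction of test documents whose predicted label matches the true label. As a quick sanity check, here is the same number computed by hand; the predictions are copied from the printed output above rather than recomputed, and the names `matches` and `wrong` are ours.

```python
test_labels = ["Sports", "Non Sports", "Sports"]
# Copied from the classifier output printed earlier in this post
predicted_labels = ["Sports", "Non Sports", "Non Sports"]

# True where the prediction agrees with the label
matches = [truth == pred for truth, pred in zip(test_labels, predicted_labels)]
accuracy = sum(matches) / len(matches)

# Positions of the misclassified documents
wrong = [index for index, ok in enumerate(matches) if not ok]

print("Accuracy:", accuracy)
print("Misclassified document indices:", wrong)
```

Two out of three documents are correct, which gives the 0.67 reported by `accuracy_score`.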

📚Model Performance Improvement

from pprint import pprint

from sklearn.feature_extraction.text import TfidfVectorizer

# Return the (term, TF-IDF weight) pairs for the non-zero features of a document
def feature_values(doc, representer):
    doc_representation = representer.transform([doc])
    features = representer.get_feature_names_out()
    return [(features[index], doc_representation[0, index]) for index in doc_representation.nonzero()[1]]

pprint([feature_values(doc, vectorizer) for doc in test_data])

[[('sport', 0.5773502691896258),
  ('is', 0.5773502691896258),
  ('great', 0.5773502691896258)],
 [('brexit', 1.0)],
 [('uk', 0.3466689227843291),
  ('travel', 0.3466689227843291),
  ('to', 0.29053561299308733),
  ('the', 0.6594480187891556),
  ('tennis', 0.3466689227843291),
  ('team', 0.3466689227843291)]]
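
Why does "the" get the largest weight in the third document? The sketch below recomputes the weights by hand, assuming scikit-learn's default settings: the raw term count is multiplied by a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, and the resulting vector is scaled to unit length. The helper names (`counts`, `raw`, `manual`) are ours.

```python
import math
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer

train_data = ['Football: a great sport', 'The referee has been very bad this season',
              'Our team scored 5 goals', 'I love tennis',
              'Politics is in decline in the UK', 'Brexit means Brexit',
              'The parlament wants to create new legislation',
              'I so want to travel the world']
doc = 'The table tennis team will travel to the UK soon for the European Championship'

vectorizer = TfidfVectorizer().fit(train_data)

# Count only the tokens that exist in the training vocabulary
tokens = vectorizer.build_analyzer()(doc)
counts = Counter(tok for tok in tokens if tok in vectorizer.vocabulary_)

# Raw weight = term count * idf learned on the training data
raw = {tok: counts[tok] * vectorizer.idf_[vectorizer.vocabulary_[tok]] for tok in counts}

# L2-normalise so the document vector has length 1
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
manual = {tok: value / norm for tok, value in raw.items()}

for tok, value in sorted(manual.items()):
    print(f"{tok:>7}  count={counts[tok]}  weight={value:.4f}")
```

"the" occurs three times in this sentence, so even though its idf is the lowest of the six terms, its count makes it dominate the vector. A word that carries no topical information ends up with the strongest signal, which is what motivates removing stop words.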

Let’s try again, with stop-word removal this time.


import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

# Load the list of (english) stop-words from nltk
stop_words = stopwords.words("english")

# Represent, train, predict
vectorizer = TfidfVectorizer(stop_words=stop_words)
vectorised_train_data = vectorizer.fit_transform(train_data)
vectorised_test_data = vectorizer.transform(test_data)
classifier = LinearSVC()
classifier.fit(vectorised_train_data, train_labels)
predicted_labels = classifier.predict(vectorised_test_data)

print(predicted_labels)
# Expected: [Sports, Non Sports, Sports]

['Sports' 'Non Sports' 'Sports']

📚Model Evaluation

from sklearn.metrics import f1_score, precision_score, recall_score

# Binary problem
binary_labels = [1, 0, 1]
binary_predictions = [1, 0, 0]

# Quality values (with respect to class 1 by default)
# Show our quality
precision = precision_score(binary_labels, binary_predictions)
recall = recall_score(binary_labels, binary_predictions)
f1 = f1_score(binary_labels, binary_predictions)
print("Binary quality numbers (positive class = 1)")
print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision,
                                                                     recall,
                                                                     f1))

# Same problem with string labels: the positive class must be named explicitly via pos_label
binary_labels = ["A", "B", "A"]
binary_predictions = ["A", "B", "B"]
precision = precision_score(binary_labels, binary_predictions, pos_label="A")
recall = recall_score(binary_labels, binary_predictions, pos_label="A")
f1 = f1_score(binary_labels, binary_predictions, pos_label="A")
print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision,
                                                                     recall,
                                                                     f1))
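
For reference, the three scores can be recomputed by hand from the counts of true positives (TP), false positives (FP), and false negatives (FN) for the positive class: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean. A minimal sketch using the same toy labels as above:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

binary_labels = [1, 0, 1]
binary_predictions = [1, 0, 0]

pairs = list(zip(binary_labels, binary_predictions))
tp = sum(1 for truth, pred in pairs if truth == 1 and pred == 1)  # correctly flagged
fp = sum(1 for truth, pred in pairs if truth == 0 and pred == 1)  # wrongly flagged
fn = sum(1 for truth, pred in pairs if truth == 1 and pred == 0)  # missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}")
print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

# The hand-computed values should agree with scikit-learn
print("Matches sklearn:",
      precision == precision_score(binary_labels, binary_predictions),
      recall == recall_score(binary_labels, binary_predictions),
      abs(f1 - f1_score(binary_labels, binary_predictions)) < 1e-12)
```

Here every document we flagged as positive really was positive (precision 1.0), but we only found one of the two positives (recall 0.5).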

Stay tuned for our upcoming articles where we will explore specific topics related to NLP in more detail!

Remember, learning is a continuous process. So keep learning and keep creating and sharing with others!💻✌️

Note: If you are an NLP expert and have suggestions to improve this blog, please share them in the comments.

👉📚GitHub Repository

👉 📝Notebook
