Building, Containerizing and Deploying a News Classifier App

Abhigyan · Analytics Vidhya · Feb 7, 2021

This article is an end-to-end walkthrough of a News Classifier project.
The app is built with Streamlit, containerized with Docker, and deployed on AWS using Fargate.

For this, we will be using the News Category Dataset from Kaggle. LINK

Create a .ipynb file

You can use Jupyter Notebook or Google Colab; I personally recommend Colab, as most of the packages come pre-installed.

→ Installing Libraries

!pip install autocorrect
import sys
!{sys.executable} -m pip install contractions
!pip install zeugma

→ Importing Libraries

import re
# --------------------------------------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# --------------------------------------------------------------
import nltk
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('punkt')
nltk.download("stopwords")
# --------------------------------------------------------------
import contractions
from autocorrect import Speller
# --------------------------------------------------------------
import pickle
# --------------------------------------------------------------
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,matthews_corrcoef
from sklearn.metrics import plot_confusion_matrix

→ Reading the data

data = pd.read_json("/Data/News_Category_Dataset_v2.json", lines=True)
data.head()

→ Data Cleaning

new_data = data.drop(['date','link'],axis = 1).copy()

Checking for the distribution of classes:

plt.figure(figsize = (10,5))
plt.xticks(rotation=90)
sns.countplot(new_data.category, order=new_data.category.value_counts().index)

Mapping Similar categories together:

new_data.category = new_data.category.map(lambda x: "WORLDPOST" if x == "THE WORLDPOST" else x)
new_data.category = new_data.category.map(lambda x: "EDUCATION" if x == "COLLEGE" else x)
new_data.category = new_data.category.map(lambda x: "ARTS & CULTURE" if x == "ARTS" else x)
new_data.category = new_data.category.map(lambda x: "FOOD & DRINK" if x == "TASTE" else x)
new_data.category = new_data.category.map(lambda x: "PARENTING" if x == "PARENTS" else x)
new_data.category = new_data.category.map(lambda x: "STYLE & BEAUTY" if x == "STYLE" else x)
new_data.category = new_data.category.map(lambda x: "WORLDPOST" if x == "WORLD NEWS" else x)
print(new_data.category.nunique())
print(new_data.category.value_counts())

Handling Null Values:

for i in new_data.columns:
    print(i)
    print(len(new_data[new_data[i] == ""]))
    print("--------------------")
new_data[new_data['headline'] == ""]
new_data = new_data[new_data['headline'] != ""]
new_data = new_data[new_data['short_description'] != ""]
  • Now that we have removed the rows with empty values, we can create a new data frame to move forward:
df = new_data[['headline','short_description','category']]
df = df.reset_index(drop=True)
df.head()

→ Feature Engineering

df['total'] = df['headline'] + " " + df['short_description']
df.head(2)

→ Data Preprocessing

df_cleaned = df[['total','category']]

Defining function to expand contractions:

def expand_contractions(text):
    sent = ""
    for word in text.split():
        sent = sent + " " + contractions.fix(word)
    return sent.lower()

Defining function to correct the spelling of words:

def spell_check(text):
    spell = Speller()
    sent = " "
    spells = [spell(w) for w in nltk.word_tokenize(text)]
    return sent.join(spells)

spell_check(expand_contractions(df_cleaned.total[20]))

Defining function to preprocess the whole dataset:

def text_preprocess(data, x):
    lemmatizer = WordNetLemmatizer()
    for i, row in data.iterrows():
        filter_Sentence = ''
        sentence = row[x]
        sentence = expand_contractions(sentence)
        #sentence = spell_check(sentence)
        sentence = re.sub(r'[^\w\s]', ' ', sentence)  # removing punctuation
        sentence = re.sub(r"\s+", " ", sentence, flags=re.I)  # removing extra space
        sentence = re.sub(r"\d", " ", sentence)  # removing digits
        sentence = re.sub(r"\s+[a-zA-Z]\s+", " ", sentence)  # removing single characters
        sentence = re.sub(r"[,@\'?\.$%_]", "", sentence, flags=re.I)  # removing special characters
        words = nltk.word_tokenize(sentence)
        #words = [w for w in words if not w in stop_words]
        for word in words:
            filter_Sentence = filter_Sentence + ' ' + str(lemmatizer.lemmatize(word))
        data.loc[i, x] = filter_Sentence  # store the lemmatized sentence
    return data
text_preprocess(df_cleaned,'total').head()
df_cleaned.to_csv('clean.csv', index = False)

→ Modeling and Evaluation

Reading the cleaned data:

df_cleaned = pd.read_csv("/content/clean.csv")
df_cleaned.head()

Using TFIDF Vectorizer to generate word vectors:

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df_cleaned['total'])
X.shape

Dumping the fitted TF-IDF vectorizer into a pickle file so it can be reused on unseen text:

pickle.dump(tfidf, open("tfidf.pkl","wb"))

Splitting the data into train and test set:

y = df_cleaned['category']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
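
Since the classes are imbalanced, an optional tweak (not part of the original walkthrough) is to stratify the split and fix the random seed so the results are reproducible:

x_train, x_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    stratify=y,       # keep class proportions equal in train and test
                                                    random_state=42)  # make the split reproducible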

Building a LogisticRegression model:

Lg = LogisticRegression(max_iter = 1000).fit(x_train,y_train)

Model Evaluation:

pred = Lg.predict(x_test)
print(accuracy_score(y_test,pred))
print(matthews_corrcoef(y_test, pred))
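
Because the categories are imbalanced, a per-class breakdown can be more informative than a single accuracy number. A quick way to get one (an optional addition, not in the original walkthrough):

from sklearn.metrics import classification_report
# precision, recall and F1-score for every category
print(classification_report(y_test, pred))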

Plotting Confusion Matrix:

plot_confusion_matrix(Lg, x_test, y_test, xticks_rotation='vertical', include_values=False)
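
Note that plot_confusion_matrix was removed in newer scikit-learn releases (1.2+). If you are on a recent version, ConfusionMatrixDisplay.from_estimator is the drop-in replacement:

from sklearn.metrics import ConfusionMatrixDisplay
# same plot, produced with the current scikit-learn API
ConfusionMatrixDisplay.from_estimator(Lg, x_test, y_test,
                                      xticks_rotation='vertical',
                                      include_values=False)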

→ Saving the model

pickle.dump(Lg,open("LogisticRegression_model.pkl","wb"))

Creating a Streamlit WebApp

  • For this we will create a .py file; you can use PyCharm, Spyder, or any IDE of your choice.
  • Create a folder and add all the pickled files to it.
  • Create a .py file for the Streamlit code in the same folder.
  • Open cmd from that folder by clicking on the address bar, typing cmd, and pressing Enter.
  • In the CLI, run:
    streamlit run <name>.py

→ Importing Libraries

import streamlit as st
import pandas as pd
import pickle
import requests
import numpy as np
import re
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
#import contractions
nltk.download('wordnet')
nltk.download('punkt')
nltk.download("stopwords")
stop_words = set(stopwords.words('english'))

→ Loading the pickled files

model = "LogisticRegression_model.pkl"
vocab = "tfidf.pkl"

→ Defining function to preprocess text

def text_preprocess(text):
    lemmatizer = WordNetLemmatizer()
    filter_Sentence = ''
    sentence = text.lower()
    #sentence = expand_contractions(sentence)
    #sentence = spell_check(sentence)
    sentence = re.sub(r'[^\w\s]', ' ', sentence)  # removing punctuation
    sentence = re.sub(r"\d", " ", sentence)  # removing digits
    sentence = re.sub(r"\s+[a-zA-Z]\s+", " ", sentence)  # removing single characters
    sentence = re.sub(r"\s+", " ", sentence, flags=re.I)  # removing extra spaces
    sentence = re.sub(r"[,@\'?\.$%_]", "", sentence, flags=re.I)  # removing special characters
    words = nltk.word_tokenize(sentence)
    words = [w for w in words if not w in stop_words]  # removing stopwords
    for word in words:
        filter_Sentence = filter_Sentence + ' ' + str(lemmatizer.lemmatize(word))
    return [filter_Sentence]

→ Defining function to vectorize clean text

def vectorize(vocab, text):
    vec = pickle.load(open(vocab, "rb"))
    tf_new_1 = TfidfVectorizer(vocabulary = vec.vocabulary_)
    return tf_new_1.fit_transform(text)
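
A note on this helper: re-fitting a TfidfVectorizer with only the saved vocabulary recomputes the IDF weights from the single input text, so the IDF learned on the training corpus is not used. A simpler variant (a sketch, not the author's code) reuses the fitted vectorizer directly:

def vectorize_fitted(vocab, text):
    # load the vectorizer fitted during training and reuse its learned IDF weights
    vec = pickle.load(open(vocab, "rb"))
    return vec.transform(text)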

→ Defining function to predict the category

def predict(text, model):
    Lg = pickle.load(open(model, "rb"))
    return Lg.predict(text)

→ Defining function that fetches news online

def retriev_news(secret, url, category, top_news):
    parameters = {
        'q': category,         # query phrase
        'pageSize': top_news,  # maximum is 100
        'apiKey': secret       # your own API key
    }
    # Make the request
    response = requests.get(url, params=parameters)
    # Convert the response to JSON format
    response_json = response.json()
    df = pd.DataFrame(response_json['articles'])
    for i in range(len(df)):
        st.title(df['title'][i])
        st.image(df['urlToImage'][i], width=750)
        st.subheader(df['description'][i])
        st.write(df['content'][i])
        #st.text(df['url'][i])
        link = df['url'][i] + "/"
        st.subheader("Watch full Story at:")
        st.markdown(link)

→ Defining the main function that uses all the functions created

def main():
    st.title("News Classifier and Online News Fetcher")
    st.image("https://raw.githubusercontent.com/AbhigyanSingh97/NewsClassifier_And_Online_news_fetcher/main/GIF/newspaper-clipart-black-and-white-8.jpg", width=150, use_column_width=True, clamp=True)
    secret = 'SecretAPI_KEY'
    url = 'https://newsapi.org/v2/everything?'
    category = []
    options = ['Only Predict', 'Predict and Search the Internet', 'Use Manual Input']
    us_ip = st.radio("Check News online by", options)
    if us_ip == 'Only Predict':
        text = st.text_area("Enter your news here")
        if st.checkbox("Predict"):
            if text != "":
                clean_text = text_preprocess(text)
                pred = predict(vectorize(vocab, clean_text), model)
                st.success(pred[0])
            else:
                st.subheader("Enter News to Classify in the Text Area!")
    if us_ip == 'Predict and Search the Internet':
        text = st.text_area("Enter your news here")
        if st.checkbox("Predict"):
            if text != "":
                clean_text = text_preprocess(text)
                pred = predict(vectorize(vocab, clean_text), model)
                st.success(pred[0])
                st.write("More online news for the", pred[0])
                top_news = st.slider("Select Number of News to be Displayed", 1, 20, 1)
                retriev_news(secret, url, pred[0], top_news=top_news)
            else:
                st.subheader("Enter News to Classify in the Text Area!")
    if us_ip == 'Use Manual Input':
        user_ip = st.text_input("Enter the category you wanna see.")
        if st.checkbox("Search Internet"):
            if user_ip != "":
                top_news = st.slider("Select Number of News to be Displayed", 1, 20, 1)
                retriev_news(secret, url, user_ip, top_news=top_news)
            else:
                st.subheader("Please Enter the Category first!")

if __name__ == '__main__':
    main()

→ Everything at once

import streamlit as st
import pandas as pd
import pickle
import requests
import numpy as np
import re
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
#import contractions
nltk.download('wordnet')
nltk.download('punkt')
nltk.download("stopwords")
stop_words = set(stopwords.words('english'))
#-------------------------------------------------------
model = "LogisticRegression_model.pkl"
vocab = "tfidf.pkl"
#-------------------------------------------------------
def text_preprocess(text):
    lemmatizer = WordNetLemmatizer()
    filter_Sentence = ''
    sentence = text.lower()
    #sentence = expand_contractions(sentence)
    #sentence = spell_check(sentence)
    sentence = re.sub(r'[^\w\s]', ' ', sentence)  # removing punctuation
    sentence = re.sub(r"\d", " ", sentence)  # removing digits
    sentence = re.sub(r"\s+[a-zA-Z]\s+", " ", sentence)  # removing single characters
    sentence = re.sub(r"\s+", " ", sentence, flags=re.I)  # removing extra spaces
    sentence = re.sub(r"[,@\'?\.$%_]", "", sentence, flags=re.I)  # removing special characters
    words = nltk.word_tokenize(sentence)
    words = [w for w in words if not w in stop_words]  # removing stopwords
    for word in words:
        filter_Sentence = filter_Sentence + ' ' + str(lemmatizer.lemmatize(word))
    return [filter_Sentence]
#-------------------------------------------------------
def vectorize(vocab, text):
    vec = pickle.load(open(vocab, "rb"))
    tf_new_1 = TfidfVectorizer(vocabulary = vec.vocabulary_)
    return tf_new_1.fit_transform(text)
#-------------------------------------------------------
def predict(text, model):
    Lg = pickle.load(open(model, "rb"))
    return Lg.predict(text)
#-------------------------------------------------------
def retriev_news(secret, url, category, top_news):
    parameters = {
        'q': category,         # query phrase
        'pageSize': top_news,  # maximum is 100
        'apiKey': secret       # your own API key
    }
    # Make the request
    response = requests.get(url, params=parameters)
    # Convert the response to JSON format
    response_json = response.json()
    df = pd.DataFrame(response_json['articles'])
    for i in range(len(df)):
        st.title(df['title'][i])
        st.image(df['urlToImage'][i], width=750)
        st.subheader(df['description'][i])
        st.write(df['content'][i])
        #st.text(df['url'][i])
        link = df['url'][i] + "/"
        st.subheader("Watch full Story at:")
        st.markdown(link)
#-------------------------------------------------------
def main():
    st.title("News Classifier and Online News Fetcher")
    st.image("https://raw.githubusercontent.com/AbhigyanSingh97/NewsClassifier_And_Online_news_fetcher/main/GIF/newspaper-clipart-black-and-white-8.jpg", width=150, use_column_width=True, clamp=True)
    secret = 'SecretAPI_KEY'
    url = 'https://newsapi.org/v2/everything?'
    category = []
    options = ['Only Predict', 'Predict and Search the Internet', 'Use Manual Input']
    us_ip = st.radio("Check News online by", options)
    if us_ip == 'Only Predict':
        text = st.text_area("Enter your news here")
        if st.checkbox("Predict"):
            if text != "":
                clean_text = text_preprocess(text)
                pred = predict(vectorize(vocab, clean_text), model)
                st.success(pred[0])
            else:
                st.subheader("Enter News to Classify in the Text Area!")
    if us_ip == 'Predict and Search the Internet':
        text = st.text_area("Enter your news here")
        if st.checkbox("Predict"):
            if text != "":
                clean_text = text_preprocess(text)
                pred = predict(vectorize(vocab, clean_text), model)
                st.success(pred[0])
                st.write("More online news for the", pred[0])
                top_news = st.slider("Select Number of News to be Displayed", 1, 20, 1)
                retriev_news(secret, url, pred[0], top_news=top_news)
            else:
                st.subheader("Enter News to Classify in the Text Area!")
    if us_ip == 'Use Manual Input':
        user_ip = st.text_input("Enter the category you wanna see.")
        if st.checkbox("Search Internet"):
            if user_ip != "":
                top_news = st.slider("Select Number of News to be Displayed", 1, 20, 1)
                retriev_news(secret, url, user_ip, top_news=top_news)
            else:
                st.subheader("Please Enter the Category first!")

if __name__ == '__main__':
    main()

Containerizing the App using Docker

If you are using Windows, follow this link to install Docker on your system.

  • First, create a requirements.txt file listing all the packages you have used, and save it in the folder that contains the app file and the pickled models, as in the example below.
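
A minimal requirements.txt for this app might look like the following (package names taken from the imports above; pin the exact versions you developed with):

streamlit
pandas
numpy
scikit-learn
nltk
requests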

→ Creating a Dockerfile

  • Open Notepad on your system and type the following:

FROM ubuntu

RUN apt-get update &&\
    apt-get install python3.7 -y &&\
    apt-get install python3-pip -y

WORKDIR /streamlit-docker

COPY requirements.txt ./requirements.txt
RUN pip3 install -r requirements.txt
COPY . .

EXPOSE 8501

# streamlit-specific commands for config
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
RUN mkdir -p /root/.streamlit
RUN bash -c 'echo -e "\
[general]\n\
email = \"\"\n\
" > /root/.streamlit/credentials.toml'
RUN bash -c 'echo -e "\
[server]\n\
enableCORS = false\n\
" > /root/.streamlit/config.toml'

CMD streamlit run app1.py
  • Make sure to name it “Dockerfile” without any extension.
  • Then choose "All Files" as the save-as type and save the file in the folder where all the other files are present.

→ Building an image

  • Go to the folder where all the files are kept.
  • Click on the address bar, type cmd, and press Enter.
  • Log in to Docker (docker login).
  • Build the image using the following command:
docker build . -f Dockerfile -t <name-of-the-image>

→ Running a container

  • Run the container using the following command:
docker run -d --rm -p 8080:8501 <image-id>
  • -d: tells Docker to run the container in detached mode, leaving the cmd free for other commands.
  • --rm: tells Docker to delete the container as soon as it is stopped.
  • -p: maps a host port to a container port; here host port 8080 is mapped to the container's port 8501, so the app is reachable at localhost:8080.

→ Accessing Docker container from browser

  • Go to your browser.
  • In the address bar, type localhost:8080 and press Enter.

→ Pushing Docker image to DockerHub

  • First, we need to tag the image with our Docker Hub username:
docker tag <name-of-image> <dockerhub-username>/<name-of-image>:latest
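  • Then push the tagged image to Docker Hub (assuming you are already logged in with docker login):
docker push <dockerhub-username>/<name-of-image>:latest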

Deploying the image using AWS Fargate

  • First, create a free AWS account.
  • Click on this link to go to the ECS console.

→ Configuring Custom container

  • Click on Configure in the custom container card.
  • Give the container a name.
  • In the image URI, you can directly provide your Docker Hub repository name; AWS ECS will pull the image from Docker Hub for you.
  • In the container port, enter 8501, as this is the port we exposed for our container in the Dockerfile.
  • After filling everything in, click on the Update button at the bottom right.

→ Editing Task Definition

  • Once you’ve clicked on the Update button, scroll down and click on the Edit button.
  • Inside, you can type the name you want for the task definition, or leave it as is.
  • Leave the rest of the settings as they are if you want to.
  • Click on Save at the bottom right corner.
  • Then click on the Next button at the bottom right of the page.
  • Then select Application Load Balancer.
    This is good practice when deploying containers: an application load balancer gives you a single endpoint that balances traffic between multiple instances of the service.
  • Click on Next that appears on the bottom right corner.
  • Now, give your cluster a name.
  • AWS will automatically create VPC ID and subnets for you.
  • Click on Next that appears on the bottom right corner.
  • Now, review your inputs and click on Create.
  • Now just wait for the service to be created for you.
  • Once the service is created, click on the View Service button at the top.
  • On the service page, scroll down and click on the Target Group Name under Load Balancing.
  • Then click on the link to the Load Balancer.
  • Copy the Public DNS.
  • Paste the DNS name in your browser, followed by a colon and the port.
  • Here you go! The app has been successfully deployed on AWS.

Happy Learning!!!

Check my GitHub Repository for everything.

Like my article? Give it a clap and share it, as that will boost my confidence.
Also, check out my other posts and stay connected for future articles on the basics of data science and machine learning.

Also, do connect with me on LinkedIn.

