Topic Modeling of War Articles and Other Articles -1-

Ozan ERTEK
4 min read · Apr 3, 2022


In this project, I used the Wikipedia API to get articles from Wikipedia. The aim of the project is to separate war articles from other articles using Natural Language Processing and Latent Dirichlet Allocation, and to create a Flask app that detects whether a written text is about war. I will explain everything I did, step by step. You can visit the project repository (HERE). The methodology of the project is given below.

First PART -1-

  • Getting data by using Wikipedia API
  • Adding all data to MongoDB
  • Connect Python to MongoDB to load dataset
  • NLP and LDA Process

Second PART -2-

  • Modeling And Flask App

1) Getting Data by using Wikipedia API

I used 40 topics (science, math, Ukraine-Russia war topics, etc.) for my project. For each topic I collected 500 texts, so in total I got 20,000 text documents.

First, you need to install the required packages:

!pip install wikipedia
!pip install wikipedia-api  # provides the wikipediaapi module imported below

then import the libraries:

import wikipediaapi
import wikipedia
import csv

Choose the language in which you want to get the data:

wiki_wiki = wikipediaapi.Wikipedia('en')
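Note: newer releases of the wikipedia-api package require a user agent string identifying your script. If the call above raises an error about a missing user agent, something along these lines should work (a sketch, with a placeholder contact address):

wiki_wiki = wikipediaapi.Wikipedia(user_agent='TopicModelingProject (example@example.com)', language='en')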

and scrape the data using the code below:

In this section, topic_name stands for the topic you want to collect data about; put the topic's name there. For example, if you want data about math, change topic_name to mathematics, and repeat this for every topic.

topic_name = {}
for item in wikipedia.search('topic', results=500):  # replace 'topic' with your search term, e.g. 'mathematics'
    topic_name[item] = wiki_wiki.page(item).summary

and save the data:

with open('topic_name.csv', 'w', newline='', encoding="UTF-8") as csvfile:
    header_key = ['Key', 'Text']
    new_val = csv.DictWriter(csvfile, fieldnames=header_key)
    new_val.writeheader()
    for new_k in topic_name:
        new_val.writerow({'Key': new_k, 'Text': topic_name[new_k]})

Your text data is now ready to use, saved as topic_name.csv.
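Repeating these steps by hand for 40 topics is tedious; a minimal sketch of how you could automate the whole scrape-and-save loop is shown below (the topics list and the file names are hypothetical examples, not the exact ones used in the project):

topics = ['mathematics', 'science', 'russo-ukrainian war']  # hypothetical list; extend it to all 40 topics

for topic in topics:
    # collect up to 500 page summaries for this topic
    pages = {}
    for item in wikipedia.search(topic, results=500):
        pages[item] = wiki_wiki.page(item).summary

    # write one CSV per topic, e.g. mathematics.csv
    file_name = topic.replace(' ', '_') + '.csv'
    with open(file_name, 'w', newline='', encoding='UTF-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['Key', 'Text'])
        writer.writeheader()
        for key, text in pages.items():
            writer.writerow({'Key': key, 'Text': text})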

After scraping all the text data this way, we are ready to add it to MongoDB.

2) Adding all data to MongoDB

After you have scraped the text data, you will have topic_name1.csv, topic_name2.csv, and so on. You need to upload all of these CSV files to MongoDB using MongoDB Atlas or MongoDB Compass. Atlas and Compass are free to use; you can download them from https://www.mongodb.com

Then create a cluster and upload all the CSV files to it.
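If you prefer to upload the files from Python instead of through the Atlas/Compass interface, a minimal sketch with pymongo and pandas could look like this (the database name wikipedia and collection name alldata match what is used later in this post, but adjust them and the placeholder connection string to your own setup):

import glob
import pandas as pd
import pymongo

client = pymongo.MongoClient('your_connection_string_here')  # placeholder connection string
db = client.wikipedia   # database name used later in this post
col = db['alldata']     # collection name used later in this post

# insert every scraped CSV file into the collection
for path in glob.glob('*.csv'):
    records = pd.read_csv(path).to_dict('records')
    if records:
        col.insert_many(records)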

Now you have added all the text data to MongoDB.

The next step is to connect Python to MongoDB and load the dataset.

3) Connect Python to MongoDB to load the dataset

In this step, we load the text data scraped from Wikipedia into Jupyter.

First, you have to import the libraries that we will use:

import pandas as pd
import re
import numpy as np
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import pymongo
import nltk

If any of them are not installed, install them with:

!pip install lib_name

Now we have to connect to the MongoDB cluster from the Jupyter notebook:

client = pymongo.MongoClient('mongodb+srv://user_name:password@clusterproject4.pjobe.mongodb.net/myFirstDatabase?retryWrites=true&w=majority')

user_name and password are the username and password you set earlier.
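Hard-coding credentials in a notebook is risky if you share the code. One common alternative (a sketch, assuming you have exported the hypothetical MONGO_USER and MONGO_PASSWORD variables in your environment) is to build the connection string at runtime:

import os

user = os.environ['MONGO_USER']          # hypothetical environment variable
password = os.environ['MONGO_PASSWORD']  # hypothetical environment variable
uri = f'mongodb+srv://{user}:{password}@clusterproject4.pjobe.mongodb.net/myFirstDatabase?retryWrites=true&w=majority'
client = pymongo.MongoClient(uri)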

print(client.list_database_names())

With this code, we can see the database name we created earlier for the Wikipedia text data.

Now you can select your database as db for reading the data:

db = client.wikipedia  # my database name is wikipedia
col = db['alldata']    # my collection name is alldata
# your database and collection names may be different

Use the code below to load all the data:

all_rec = col.find()
list_cursor = list(all_rec)
df = pd.DataFrame(list_cursor)
df.head()

Your dataset is now ready for processing.
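Documents loaded from MongoDB carry the automatically generated _id field, which is not needed for text processing; you may want to drop it (a small sketch):

# remove MongoDB's internal identifier column before text processing
df = df.drop(columns=['_id'])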

4) NLP and LDA Process

This is the final step. In this section, we clean and prepare our text data for modeling (steps like tokenizing, vectorizing, and gensim preprocessing).

With the code below, you can clean and prepare your text data (tokenizing, removing stopwords, etc.).

def remove_punc(txt):
    txt_nopunc = "".join([c for c in txt if c not in string.punctuation])
    return txt_nopunc

df['clean_text'] = df['Text'].apply(lambda x: remove_punc(x))

— —

stopwords = nltk.corpus.stopwords.words('english')  # run nltk.download('stopwords') once if the corpus is missing

def tokenize(txt):
    tokens = re.split(r'\W+', txt)
    return tokens

df['clean_text_tokenized'] = df['clean_text'].apply(lambda x: tokenize(x.lower()))

— —

def remove_stopwords(txt_tokenized):
    txt_clean = [word for word in txt_tokenized if word not in stopwords]
    return txt_clean

df['txt_no_sw'] = df['clean_text_tokenized'].apply(lambda x: remove_stopwords(x))

Now your text data is tokenized (each text is split into its individual words), and stopwords (common words such as he, she, it, is, are, in and on) have been removed.

from nltk.stem import PorterStemmer
ps = PorterStemmer()
stemmer = SnowballStemmer("english")

Stemming

def clean_text(text):
    text = "".join([c for c in text if c not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [word for word in tokens if word not in stopwords]
    return text

df['txt_nostop'] = df['Text'].apply(lambda x: clean_text(x.lower()))

def stemming(tokenized_text):
    text = [ps.stem(word) for word in tokenized_text]
    return text

df['txt_Stemmed'] = df['txt_nostop'].apply(lambda x: stemming(x))
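To get a feel for what stemming does, you can try it on a few individual words (a quick sketch; the outputs shown are typical Porter stemmer behaviour):

# stemming reduces words to a common root form
print(ps.stem('running'))  # -> 'run'
print(ps.stem('armies'))   # -> 'armi'
print(ps.stem('wars'))     # -> 'war'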

Lemmatizing

def lemmatize_stemming(text):
    # run nltk.download('wordnet') once if the WordNet corpus is missing
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize and lemmatize
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
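A quick way to check the preprocessing is to run it on a single sentence (a sketch; the exact output tokens depend on the stemmer and lemmatizer):

# each remaining token is lowercased, lemmatized and stemmed
sample = "The armies were fighting on the eastern front."
print(preprocess(sample))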

With the code below, you apply this preprocessing to every document to get your final clean text data:

process_txt = df['Text'].map(preprocess)

Then build a dictionary of all the words in your dataset and create the bag-of-words corpus:

dictionary = gensim.corpora.Dictionary(process_txt)
dictionary.filter_extremes(no_below=5, no_above=0.7, keep_n=100000)
bow_corpus = [dictionary.doc2bow(doc) for doc in process_txt]
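Each entry in bow_corpus is the bag-of-words representation of one document: a list of (token_id, token_count) pairs. You can inspect one like this (a small sketch):

# look at the first few (token_id, count) pairs of the first document
for token_id, count in bow_corpus[0][:5]:
    print(dictionary[token_id], count)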

Now you have clean text data, a dictionary of all the words in your dataset, and a bag-of-words corpus.

In the second part, we will talk about modeling to separate the two groups of topics, and about the Flask app.

— — — — — — — — — — — — — — — — — — —

Thanks for reading my article :)

Hope to see you again in my next article…

Ozan ERTEK
