C1 B1: Creating a Search Engine on Stack Overflow Questions Data (Processing the Data)

Vikalp Jain
5 min read · Oct 30, 2019


How many of you have ever visited the homepage of Stack Overflow? :D :P

Introduction:

Hi Scientist, if you are reading this blog then I can guess that you have read enough blogs about Data Science and ML/AI and now want to get your hands dirty. You want to solve a real-world problem, and you are fed up because no one gives an insight into what, how, and where to proceed! Keep reading this blog, where I will share my experience of building a search engine that searches Stack Overflow questions based on our queries.

Tags: Python, Machine Learning, Artificial Intelligence, Search Engine, Natural Language Processing, Data Science, AppliedAI Course, A26, DataTorch

Index:

  1. Get the data
  2. Process the data
  3. Store the data in a usable format.

Source of data: https://archive.org/details/stackexchange

Coding in: Python 3.x (Jupyter Notebook)

The machine I am using is a GCP instance with 16 GB of RAM.

To build the search engine we need to download the Stack Overflow dump; the files and their sizes are:

stackoverflow.com-Badges.7z — 233.9 MB

stackoverflow.com-Comments.7z — 4.1 GB

stackoverflow.com-PostHistory.7z — 24.4 GB

stackoverflow.com-PostLinks.7z — 82.1 MB

stackoverflow.com-Posts.7z — 13.9 GB

stackoverflow.com-Tags.7z — 787.6 KB

stackoverflow.com-Users.7z — 477.2 MB

stackoverflow.com-Votes.7z — 1.0 GB

Warning: “The files will grow roughly 5x in size once you extract them.”

For this task, we need the data in a format that is easily understood by most of the preprocessing libraries in Python.

Let’s write a Rough Algorithm:

We have to read the data with an XML parser, using the xml.etree.ElementTree library (“import xml.etree.ElementTree as ET”).

First, find the root tag. Then, for each row tag, find out which columns (attributes) we get, and store the data for all of those columns in a CSV file so that pandas can read it easily.
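As a minimal sketch of this inspection step (Tags.xml here is just an example of one of the extracted dump files):

import xml.etree.ElementTree as ET

# Peek at an extracted dump file to see the root tag and which
# attributes (our future CSV columns) each row carries.
parser = ET.iterparse("Tags.xml", events=("start",))

event, root = next(parser)              # the first event is the root element
print("root tag:", root.tag)            # e.g. 'tags'

event, node = next(parser)              # the first data row
print("columns:", list(node.attrib.keys()))
print("sample row:", node.attrib)       # dict of column name -> row value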

But…

Here are a few problems that anyone will face while processing this data:

If your machine has 16 GB of RAM, how can you process a 120 GB file? You need an XML streamer that extracts the data incrementally as it reads, without loading the complete file into memory.

Let’s say you are processing the XML as a stream: you still have to store the output somewhere. Since pandas will later have to load those CSVs back into our limited RAM, we cannot put all the processed data into one huge file; we have to split the output into chunks (say, 4 GB each), as sketched below.
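The size check itself is simple; a small sketch (the 4 GB limit and the names are illustrative):

import os

CHUNK_LIMIT = 4 * 1024 ** 3  # roughly 4 GB per output chunk

def chunk_is_full(path):
    # True once the current output chunk has grown past the limit,
    # which is the signal to roll over to a new file name.
    return os.path.exists(path) and os.path.getsize(path) > CHUNK_LIMIT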

Let’s talk about code.

The function get_file_size is just a helper to show a size in a human-readable format. I didn’t spend much time on it because it wouldn’t add much value towards our target.
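If you want to write your own, a quick sketch of such a helper could look like this (not my exact notebook code):

import os

def get_file_size(path):
    # Return the size of a file as a human-readable string, e.g. '13.9 GB'.
    size = float(os.path.getsize(path))
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return "%.1f %s" % (size, unit)
        size /= 1024
    return "%.1f TB" % size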

function xml_to_csv (the L-numbers below refer to lines of the code snippet in my notebook):

xml_filename: the input XML file name.

out_filename: the base file name where we want to store the CSV output.

major_part_no: a counter used to create a different name for each chunk of the output file.

L31: the csv writer that will write rows to the output file.

L32: note the start time so we can measure how long our code takes.

L33: print the file size of the input XML file.

L38: it is linked with L45; together they print the column names so we can reuse them later when we load the data into dataframes.

L39: the iterparse function is the XML parser; it works on the basis of events.

A “start” event is fired when the parser begins an element; the loop runs node by node through the file.

Each node contains the values for all of the columns.

To get more insight, try printing “node.attrib”, which returns a dictionary with column names as keys and row values as values.

L40: writer.writerow uses the writer we defined on L31, which writes to the file object from L30.

L41 & L42: These two lines are the main LOC(Line of Code) which are responsible for cleaning up the ram. Once we delete the node it will make sure to collect the linked object created by XML Parsers which are no use as of now.

L45, L46 & L47: These are the lines just to print the column name so in future, if we use the data frame we can name the columns. Though we can automate it by writing the column name in a metadata file and read it when we use chunk but we have to remember that our main target is to make a search engine not “process the data on a single click” we can add the enhancement for our coding practice in backlog but as of now this is fine. Column names are not changing in these files.

L48: we delete the node object once the processed node has been stored in the CSV file. This really helps in keeping RAM usage down.

L49: we continuously check whether the output file size has grown beyond 4 GB.

L51: if it has, we increment major_part_no by 1; it is used on L52 to create a new file name, which the CSV writer then writes into.

So we end up with output files named xyz.csv0, xyz.csv1, xyz.csv2, …, xyz.csvN.
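To make the walkthrough concrete, here is a rough sketch of such an xml_to_csv function. It is not the exact code from my notebook (so the line numbers above will not match line for line), it uses the common end-event pattern of iterparse, and the file names and the 4 GB limit are placeholders:

import csv
import os
import time
import xml.etree.ElementTree as ET

CHUNK_LIMIT = 4 * 1024 ** 3  # roll over to a new CSV chunk after roughly 4 GB

def xml_to_csv(xml_filename, out_filename, major_part_no=0):
    start_time = time.time()
    print("input file size:", os.path.getsize(xml_filename), "bytes")

    out_path = out_filename + str(major_part_no)
    out_file = open(out_path, "w", newline="", encoding="utf-8")
    writer = csv.writer(out_file)
    columns = None

    # Stream the XML instead of loading it: grab the root element first so
    # we can clear it as we go and keep memory usage flat.
    context = ET.iterparse(xml_filename, events=("start", "end"))
    _, root = next(context)

    for event, node in context:
        if event != "end" or node.tag != "row":
            continue
        if columns is None:
            # Fix the column order from the first row and print it, so we can
            # name the dataframe columns when we read the chunks back later
            # (attributes that never appear in the first row are skipped here).
            columns = list(node.attrib.keys())
            print("columns:", columns)
        # Some attributes are optional per row, so fall back to "".
        writer.writerow([node.attrib.get(col, "") for col in columns])

        # Free the processed nodes so they can be garbage-collected.
        node.clear()
        root.clear()

        # Start a new chunk once the current one passes the size limit.
        if os.path.getsize(out_path) > CHUNK_LIMIT:
            out_file.close()
            major_part_no += 1
            out_path = out_filename + str(major_part_no)
            out_file = open(out_path, "w", newline="", encoding="utf-8")
            writer = csv.writer(out_file)

    out_file.close()
    print("Time Elapsed to create a chunk:", int(time.time() - start_time), "seconds")

Calling it for one dump file would then look like xml_to_csv("Posts.xml", "posts.csv"), producing posts.csv0, posts.csv1, and so on, which pandas can read back with something like pd.read_csv("posts.csv0", header=None, names=columns).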

From the last output line we can see “Time Elapsed to create a chunk: 2567 seconds”, which is about 42 minutes.

What is missing? There are a few ways in which we can optimize our code.

Can we do it in parallel? Within a single file, probably not: the XML has to be read serially, and with only 16 GB of RAM there is limited room for parallel workers. But we can process multiple dump files at a time with much smaller chunks, which will utilize the whole processor.
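A sketch of that idea, assuming the xml_to_csv function sketched above and a few extracted dump files in the working directory (the file list is illustrative):

from multiprocessing import Pool

# Illustrative list of (input, output) pairs; each dump file is converted
# independently, so separate worker processes can handle them in parallel.
jobs = [
    ("Tags.xml", "tags.csv"),
    ("Users.xml", "users.csv"),
    ("Badges.xml", "badges.csv"),
]

def convert(job):
    xml_filename, out_filename = job
    xml_to_csv(xml_filename, out_filename)  # the function sketched above

if __name__ == "__main__":
    # Keep the worker count small on a 16 GB machine: the bottleneck is
    # RAM and disk, not CPU cores.
    with Pool(processes=2) as pool:
        pool.map(convert, jobs)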

That is how we call the function and parse the XML file to store it in chunks. The code above is self-explanatory. Please let me know how you processed the data, as I am not going to publish my own code for a few months. Please don’t forget to share my blog and upvote it to keep me motivated.
