File Processing in Big Data Systems: Which is Quicker? Which is Better?

A Comparative Analytics Study Benchmarking Popular Programming Languages and Execution Engines.

Thomas George Thomas
Geek Culture
6 min read · Feb 6, 2021


Photo by PAUL SMITH on Unsplash

Introduction

Have you ever wondered which programming languages and execution engines are the quickest or the slowest at processing files? Are you in a dilemma about which programming language you should use to solve your business problem efficiently? Well, look no further; here's your answer.

We take a look at popular languages like Python, Java, and Scala, and execution engines like Hadoop and Spark, to see how they fare at processing files and to benchmark them.

Methodology

We analyze and compare the execution times taken to compute the word count of input text files, ranging from very small to very large, in each programming language and execution engine individually.

We write sample word count programs to process these files and execute them. We then measure the time taken to process each file and gather the results. Finally, we collect our findings and observations and draw comparisons. All of the findings from the individual analyses are combined in a Google Colab notebook, where we plot graphs using matplotlib and draw conclusions.

Photo by Wesley Tingey on Unsplash

Files

For this experiment, we need two kinds of files: a large file that is text only, and a small text file. We keep in mind that each should be an appropriate size so that it doesn't skew the performance tests we are conducting.

Taking these constraints into consideration, and given that we are running a Data Science experiment (of sorts), we picked something relevant. For the large text file, we chose big.txt, which, if you are not familiar with it, comes from the chapter Natural Language Corpus Data of the book Beautiful Data (Segaran and Hammerbacher, 2009), where the authors discuss spelling correction. How apt!

For the small file, we use data that we scraped from the Apache Hadoop Wikipedia page, which is fitting, since Hadoop is one of the execution engines we are benchmarking in this experiment.
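The original scraping code is not shown in this article, but a minimal sketch of how such a scrape might look, assuming the requests and BeautifulSoup libraries (the output path mirrors the one used in the programs below):

# A minimal scraping sketch (assumption: requests + BeautifulSoup;
# the article's original scraping code is not shown)
import requests
from bs4 import BeautifulSoup

page = requests.get("https://en.wikipedia.org/wiki/Apache_Hadoop")
soup = BeautifulSoup(page.text, "html.parser")

# Keep only the readable paragraph text
text = "\n".join(p.get_text() for p in soup.find_all("p"))

with open("../Input-Files/apache-hadoop-wiki.txt", "w", encoding="utf-8") as f:
    f.write(text)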

Data Analysis

Enough talk. Let's look at the programming languages and execution engines and how we benchmark them.

Programming Languages

The approach we take is to write word count programs in the respective languages and parse both the large and small files that we have chosen. Since we repeat the same steps for Python, Java, and Scala, we take Python as our demonstration example:

Python

We begin our Python program by reading the file. We use the small text file as our example:

# Reading file
file = open("../Input-Files/apache-hadoop-wiki.txt", "r", encoding="utf-8")

Now for the actual word count, we make use of Python's built-in dictionary:

# Initializing Dictionary
dict = {}
# Counting the number of times each word comes up (in the dictionary)
for word in file.read().split():
    dict[word] = dict.get(word, 0) + 1
file.close()
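As an aside, Python's standard library offers collections.Counter, which performs the same counting in a single expression; a sketch of the equivalent (an alternative, not the version benchmarked in this article):

# Equivalent counting with the standard library's Counter
# (an alternative, not the approach benchmarked here)
from collections import Counter

with open("../Input-Files/apache-hadoop-wiki.txt", "r", encoding="utf-8") as f:
    word_count = Counter(f.read().split())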

We write the results that we have just computed:

# Writing the results to a file
fw = open("small-result-python.txt", "w", encoding="utf-8")
fw.write(str(dict))
fw.close()

A small snippet of our computed word count results:

{'Apache': 79, 'Hadoop': 201, 'From': 3, 'Wikipedia,': 1, 'the': 211, 'free': 1, 'encyclopedia': 1, 'Jump': 2, 'to': 122 }

Now, to benchmark our program, we import the time package and insert timing hooks at the start and end.

So our full program looks like this:

import time

start = time.time()

# Reading file
file = open("../Input-Files/apache-hadoop-wiki.txt", "r", encoding="utf-8")

# Counting the number of times each word comes up (in a dictionary)
dict = {}
for word in file.read().split():
    dict[word] = dict.get(word, 0) + 1
file.close()

# Writing the results to a file
fw = open("small-result-python.txt", "w", encoding="utf-8")
fw.write(str(dict))
fw.close()

end = time.time()
print("Execution time :", end - start)
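One refinement worth noting: for benchmarking, time.perf_counter() provides a higher-resolution, monotonic clock than time.time(); a minimal sketch of the same hooks:

# Same timing hooks using the higher-resolution monotonic clock
import time

start = time.perf_counter()
# ... the word count work to be timed ...
end = time.perf_counter()
print("Execution time :", end - start)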

Similarly, we collect the results for both the small and the large file and plot a graph for each language individually:

(Left) Bar chart depicting processing times for Python | (Middle) Bar chart depicting processing times for Java | (Right) Bar chart depicting processing times for Scala | Images by Author.

Execution Engines

We consider Hadoop and Spark as our execution engines and run each in standalone mode (a single-node cluster) for our experiment.
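The engine jobs themselves are not listed in this article, but for illustration, a minimal PySpark word count along the same lines might look like this (a sketch assuming a local standalone Spark installation, not the exact job behind the benchmark numbers):

# A minimal PySpark word count sketch (assumption: local standalone Spark;
# not the exact job used for the benchmark numbers)
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

start = time.time()
counts = (spark.sparkContext.textFile("../Input-Files/apache-hadoop-wiki.txt")
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("small-result-spark")
end = time.time()
print("Execution time :", end - start)

spark.stop()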

We repeat the same steps that we took for the programming languages, and the results look like this:

(left) Graph depicting processing time taken by Hadoop | (right) Graph depicting processing time taken by Spark | Images by Author

Comparative Analytics

Now that we have all the individual results for programming languages and execution engines, let's compare them and draw some useful insights.

We use Google Colab to plot the benchmark graphs in Python.

We start by importing the required libraries: pandas and matplotlib.

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.style.use('ggplot')

Preparing our language dataset

# Initialize list of languages
data_lan = [['java', 0.123, 1.179], ['scala', 10.172, 3.528]]

# Create the pandas DataFrame
df_languages = pd.DataFrame(data_lan, columns=['languages', 'small', 'big'])

# Print dataframe
df_languages

Since this is a growing experiment and, in the future, we could benchmark more languages and execution engines, we append rows this way:

# Appending data; use this method in the future when adding more languages
df_languages = df_languages.append({'languages': 'python', 'small': 0.006, 'big': 1.154}, ignore_index=True)
df_languages

# Set the languages as the index for our x axis
df_languages.set_index('languages', inplace=True)
df_languages
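A caveat for readers on newer pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the equivalent append step would use pd.concat:

# Equivalent row append for pandas 2.0+, where DataFrame.append was removed
new_row = pd.DataFrame([{'languages': 'python', 'small': 0.006, 'big': 1.154}])
df_languages = pd.concat([df_languages, new_row], ignore_index=True)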

Resultant data ready for graphing:

Dataset with clear X & Y axis demarcations | Image by Author

Plotting our Data:

ax = df_languages.plot(kind='bar', figsize=(10, 10))
plt.xlabel('Languages')
plt.title('Execution times for each language for both small and large files')
ax.set_facecolor('white')
ax.tick_params(axis='x', colors='black', labelsize=14)
ax.axhline(0, color='black')
ax.legend(facecolor='white', fontsize=14)
ax.tick_params(top=False, left=False, right=False, labelleft=False)

# Display the execution time in seconds above each bar
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.annotate('{} s'.format(height), (x, y + height + 0.1), fontsize=14)
plt.show()

Conclusion

(left) Graph comparing & benchmarking the file processing time for Programming languages | (right) Graph comparing & benchmarking the file processing time for Execution Engines | Images by Author

For programming languages, we observe that Python has the shortest execution time for both the small and the large file, while Scala has the longest. Interestingly, Scala takes about 7 seconds longer to process the small file than the large one.

For execution engines, we observe that Spark has the shortest execution time, while Hadoop's MapReduce engine has the longest. This is in line with the claim that Spark can be up to 100 times quicker than Hadoop.

I hope this experiment shed some light on which programming language or execution engine you should use for your next project. Overall, each programming language and execution engine has its pros and cons, but with this experiment, we now know which of them has the best performance benchmarks.

This project is growing, and if you would like to contribute more programming languages and execution engines, you can do so on GitHub. Thank you for reading.

References

  1. Natural Language Corpus Data, from Beautiful Data (Segaran and Hammerbacher, 2009)
  2. Apache Hadoop, Wikipedia

Thomas George Thomas

Data Analytics Engineering Graduate Student at Northeastern. Ex Senior Data Engineer & IBM Certified Data Scientist. https://thomasgeorgethomas.com