Python Vs Julia Data Frames (Part 1)

Abeer Yehia ElBahrawy
Published in Python Pandemonium
Dec 2, 2016

Python, Julia or R?!

This is the basic question that everyone tiptoeing into big data and data science asks, so I asked it too. For me the choice narrowed down to Python and Julia, since I have some knowledge of both, compared with a single online course in R. Then I thought, why miss the fun of either language? Let's do both! So I decided to replicate, as far as possible, the basic tasks I do in each language and evaluate both.

My assessment will focus on three main things. First, code elegance: you know that feeling you get when you look at a function and say, huh, that looks nice and readable. Second, how easy it was to go through the documentation and get help. Finally, and typically, performance.

In this series of posts (hopefully) I will show quick and easy implementations of some trivial analysis and manipulation of data frames, data retrieval and web scraping in both Julia and Python. I am not an expert here, but I will share the learning process with you; the point of all of this is, well, mostly my own pleasure. For Julia in particular, it is also an attempt to answer the question: is it doable in Julia, and how fast and easy is it?!

We will start here with basic data retrieval from an HTTP request. We are going to use Wikipedia HTTP requests to get data about both the views and the revisions of the “Donald Trump” Wikipedia page.

First things first: which libraries are we going to use for Python, and which packages for Julia? For Python we will use the following libraries:

For Julia we will use the following packages. One quick note on Julia packages: they are (usually) very easy to install. Just type Pkg.add("Package Name") at your Julia command line and you are good to go.

Now we will write a function that requests from Wikipedia the number of daily views for a specific page between a given start date and end date. The response comes in JSON format, so the important fields will be extracted from it and stored in a data frame.

pandas already has a built-in function, “read_json”, that reads JSON output into a data frame. However, it works best when your JSON is nicely formatted and the data is relatively easy to extract into columns and rows. In our case we will write the function to extract the data ourselves.
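The extraction step could look roughly like the sketch below. It is only a sketch under assumptions: I am assuming the Wikimedia REST pageviews endpoint and its "items", "timestamp" and "views" fields, and the helper names (build_pageviews_url, fetch_json, extract_daily_views) and the sample numbers are made up for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed endpoint: the Wikimedia REST API for per-article page views.
PAGEVIEWS_API = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "en.wikipedia/all-access/all-agents/{title}/daily/{start}/{end}"
)


def build_pageviews_url(title, start, end):
    """Build the request URL; start and end are dates written as YYYYMMDD."""
    safe_title = quote(title.replace(" ", "_"), safe="")
    return PAGEVIEWS_API.format(title=safe_title, start=start, end=end)


def fetch_json(url):
    """Download and decode a JSON document (needs network access)."""
    # The User-Agent value is a hypothetical placeholder.
    request = Request(url, headers={"User-Agent": "dataframe-demo/0.1"})
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_daily_views(payload):
    """Keep only the date and the view count from each item of the payload."""
    return [
        {"date": item["timestamp"][:8], "views": item["views"]}
        for item in payload.get("items", [])
    ]


# A hand-made payload with the assumed shape (the view counts are invented),
# so the parser can be tried without touching the network.
sample_payload = {
    "items": [
        {"article": "Donald_Trump", "timestamp": "2016110100", "views": 100},
        {"article": "Donald_Trump", "timestamp": "2016110200", "views": 250},
    ]
}
records = extract_daily_views(sample_payload)
```

With network access, extract_daily_views(fetch_json(build_pageviews_url("Donald Trump", "20161101", "20161130"))) would give the same kind of list for real data.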

I used two ways to construct the data frame here: in the first I construct it from a list of dictionaries, and in the other I construct it by appending to an empty data frame.

Append to data frame

As you can see, dataframe.loc[] accesses a specific index label in the data frame. If the length of the data frame is given as the label instead, the data frame expands to add a new row. It is important to note the difference between .iloc[] and .loc[] here, but we will leave that for another post.
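Here is a minimal sketch of that appending trick. The column names and the records are hypothetical stand-ins for the parsed Wikipedia response, and it relies on the frame keeping its default 0, 1, 2, ... index.

```python
import pandas as pd

# Hypothetical records standing in for the parsed Wikipedia response.
daily_records = [("20161101", 100), ("20161102", 250), ("20161103", 175)]

# Start from an empty frame that only declares its columns.
views_df = pd.DataFrame(columns=["date", "views"])

for date, views in daily_records:
    # len(views_df) is one past the last existing label,
    # so .loc creates a new row instead of overwriting one.
    views_df.loc[len(views_df)] = [date, views]
```

Every pass through the loop grows the frame by one row, which is what makes this approach slow for long inputs.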

The second implementation in Python builds the data frame from a list of dictionaries, which is of course much faster.

Data frame from dictionary
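A minimal sketch of this approach, assuming each record is a dictionary with hypothetical "date" and "views" keys and invented values:

```python
import pandas as pd

# Hypothetical records; in practice they would come from the JSON response.
daily_records = [
    {"date": "20161101", "views": 100},
    {"date": "20161102", "views": 250},
    {"date": "20161103", "views": 175},
]

# The whole frame is built in a single call; the dictionary keys become columns.
views_df = pd.DataFrame(daily_records)

# Turn the YYYYMMDD strings into real timestamps for later plotting or grouping.
views_df["date"] = pd.to_datetime(views_df["date"], format="%Y%m%d")
```

Collecting plain Python dictionaries first and converting once avoids resizing the frame on every row.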

Python has line-by-line profilers for performance and memory, which are very helpful for identifying bottlenecks in your code. timeit is a better fit for small code snippets.
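As a rough sketch, timeit can compare the two construction styles like this. The function names, the column names and the row count are arbitrary choices of mine, and the timings will differ from machine to machine.

```python
import timeit

import pandas as pd

N_ROWS = 200  # arbitrary size, just big enough to see a difference


def build_by_appending(n_rows=N_ROWS):
    """Grow the frame one row at a time with .loc."""
    frame = pd.DataFrame(columns=["day", "views"])
    for i in range(n_rows):
        frame.loc[len(frame)] = [i, i * 10]
    return frame


def build_from_dicts(n_rows=N_ROWS):
    """Collect plain dictionaries first, then build the frame once."""
    rows = [{"day": i, "views": i * 10} for i in range(n_rows)]
    return pd.DataFrame(rows)


# Run each builder three times and report the total elapsed seconds.
append_seconds = timeit.timeit(build_by_appending, number=3)
dicts_seconds = timeit.timeit(build_from_dicts, number=3)
print(f"append: {append_seconds:.4f}s, from dicts: {dicts_seconds:.4f}s")
```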

Back to the data: the resulting data frame will look like this:

In Julia, well, the code is just logical! Just as you append to Python lists (or push to arrays, in Julia's terms), you can also push to a data frame. Simple, huh?

By convention, a “!” at the end of a Julia function name means that the function modifies its argument, so the change happens on the same data frame. This is similar to passing the inplace=True parameter in pandas.
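On the Python side, the effect of inplace=True can be sketched with sort_values. The small frame below is made up for the illustration.

```python
import pandas as pd

views_df = pd.DataFrame({"date": ["20161102", "20161101"], "views": [250, 100]})

# Without inplace a new, sorted frame is returned and views_df is untouched.
sorted_copy = views_df.sort_values("date")
order_before = list(views_df["views"])

# With inplace=True the frame itself is reordered and the call returns None.
result = views_df.sort_values("date", inplace=True)
order_after = list(views_df["views"])
```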

If we compare the two similar functions that append to a data frame, Julia is much faster. However, building from a dictionary in Python improved performance dramatically. Finding a similarly straightforward solution in Julia was not that easy (please feel free to share one if you can). We may declare Python pandas the winner here.

Now we want to plot a simple graph of how the number of views per day evolves.
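A minimal sketch of such a plot in Python, assuming matplotlib as the plotting library and a small made-up frame with "date" and "views" columns; the output file name is arbitrary.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Made-up values standing in for the real daily views.
views_df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-11-01", "2016-11-02", "2016-11-03"]),
        "views": [100, 250, 175],
    }
)

fig, ax = plt.subplots()
ax.plot(views_df["date"], views_df["views"], marker="o")
ax.set_xlabel("Date")
ax.set_ylabel("Views per day")
ax.set_title("Daily page views")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
fig.savefig("daily_views.png")
```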

The same goes for Julia, and you can make it interactive too using the “Plots” package or “Interact”. Sure, you can do the same thing in Python using libraries such as “Plotly”.

To get the revision history we use almost the same function; only the structure of the JSON file is different.
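In Python, the different structure can be handled roughly as in the sketch below. I am assuming the MediaWiki action API (action=query with prop=revisions), where the revisions sit under query, then pages, then a page id. The helper names, the page id and the user names in the sample are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: the MediaWiki action API of the English Wikipedia.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_revisions_url(title, limit=500):
    """Build a query asking for the user and timestamp of recent revisions."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user",
        "rvlimit": limit,
    }
    return API_URL + "?" + urlencode(params)


def extract_revisions(payload):
    """Flatten query -> pages -> <page id> -> revisions into a list of rows."""
    rows = []
    for page in payload.get("query", {}).get("pages", {}).values():
        for revision in page.get("revisions", []):
            rows.append(
                {"user": revision.get("user"), "timestamp": revision.get("timestamp")}
            )
    return rows


# A hand-made payload with the assumed nesting; all values are invented.
sample_payload = {
    "query": {
        "pages": {
            "12345": {
                "title": "Donald Trump",
                "revisions": [
                    {"user": "ExampleUserA", "timestamp": "2016-11-01T10:00:00Z"},
                    {"user": "ExampleUserB", "timestamp": "2016-11-01T09:30:00Z"},
                ],
            }
        }
    }
}
revision_rows = extract_revisions(sample_payload)
```

The resulting list of dictionaries can be passed straight to the data frame constructor, as with the page views.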

In Julia

The data will look like this

What if we want to rank users by how often they contributed to Donald Trump’s page? In Python we can use the groupby function as follows:
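A minimal sketch of such a ranking, assuming the revisions frame has a "user" column; the user names below are invented.

```python
import pandas as pd

# Invented revision log: one row per edit, with the name of the editor.
revisions_df = pd.DataFrame(
    {"user": ["UserA", "UserB", "UserA", "UserC", "UserA", "UserB"]}
)

user_counts = (
    revisions_df.groupby("user")       # one group per editor
    .size()                            # number of revisions in each group
    .sort_values(ascending=False)      # most active editors first
    .reset_index(name="revisions")     # back to a regular two-column frame
)
```

The result has one row per user with a "revisions" count, ordered from the most to the least active editor.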

Julia data frames have the beautiful “by” function, which allows us to do the exact same thing.

Python help can be found literally everywhere; there is nothing you can’t find in the documentation, in blogs or on Stack Overflow. Julia, on the other hand, is a bit harder to get support for; however, the documentation is very well written and the code is just smooth and predictable.

That is it for now! If you are curious about the code, or something is not clear enough, drop a comment.

Some nice resources can be found here:
