Doing cool data science in Java: how 3 DataFrame libraries stack up

I have been exploring Java tools to perform easy data analysis of big datasets, since our production systems at are Java-based.

Data frames are popular tools for Data Science in R and Python (through pandas). A good data frame implementation makes it easy to import data, filter and map it, calculate new columns, create aggregations, and more.

I chose three Java open source dataframe libraries from Github: tablesaw, joinery, and morpheus (I also shortly looked at datavec, but couldn’t make it do what I wanted, more on that later). As a toy dataset I extracted some population data of European cities and countries from eurostat.

I used the libraries to answer two simple questions (about population and relative growth in the dataset). In this post, I will describe my experience with each library.

All code is available in a GitHub repository.

The data about cities that I extracted from Eurostat looks like this:

TIME,CITIES,INDIC_UR,Value,Flag and Footnotes
2008,Germany,"Population on the 1st of January, total",82217837,
2008,Germany,"Population on the 1st of January, male",40274292,
2008,Germany,"Population on the 1st of January, female",41943545,
2010,Graz,"Population on the 1st of January, total",:,
2010,Graz,"Population on the 1st of January, male",:,
2010,Graz,"Population on the 1st of January, female",:,
2010,Graz,"Population on the 1st of January, 0-4 years, total",:,
2010,Graz,"Population on the 1st of January, 0-4 years, male",:,

I tried to answer two questions:

Which cities / regions had the highest population in 2017?

Which cities / regions had the highest relative growth from 2010 to 2016?

Using the libraries, I needed to clean the data up a bit, as : represented missing values and some items were in the file twice. Then the data needed to be pivoted to group the regions and get all years as columns. Finally, a “growth” column needed to be calculated and the answer computed by filtering the “total” rows and sorting on the correct column.

In Python with pandas, these operations are quite easy to do (also see the notebook in the GitHub repository):

data = pd.read_csv('urb_cpop1_1_Data.csv')
filtered = data.drop(data[data.Value == ":"].index)
filtered['key'] = filtered['CITIES'] + ':' + filtered['INDIC_UR']
filtered['Value'] = pd.to_numeric(filtered['Value'])
cities = filtered.pivot_table(index='key', columns='TIME',
values='Value', aggfunc="mean")
# Top 10 cities by pop in 2017:
cities.filter(like='January, total',axis=0).
sort_values(by=[2017], ascending=False).head(10)
# Highest growth cities:
cities["growth"] = (cities[2016] / cities[2010] - 1) * 100
cities.filter(like='January, total',axis=0).
sort_values(by=["growth"], ascending=False).head(10)

Tablesaw seems to be the most actively developed of the three. The GitHub account is pretty active with frequent commits by different developers.

Update end of November 2018: @ljw_larry showed me how to simplify my code significantly and pointed me to some limitations that were removed in recent versions of tablesaw. For instance, it now supports integer-typed columns, pivoting a table, and I missed an essential part of the API which makes filtering on particular values very easy.

Internally, a data frame (called Table) is a collection of column objects that contain the data. It’s not very easy to add a partial row to the table, you need to manually iterate over all columns and add null values (or call appendMissing()) for all the columns that you don’t have a value for (yet). It would be nicer if the Table has a call appendRowthat would give access to a new, empty row object on which you can call setters.

(Before November:) I didn’t find a good call to do the pivot operation, so I implemented it myself by calling forEach on the Table to iterate over all rows and build the result in a new frame. Unfortunately, a call to filter rows on the “total” type wasn’t available so that also needed to be done by rebuilding the frame and doing the filtering manually. Also, there was no call to straightforwardly add the computed “growth” column.

Joinery is another open source data frame implementation that has a nice fluent API, featured on its first page of documentation. However, the joinery javadoc doesn’t have the usual side panel that allows you to explore the package and other classes. I found that inconvenient.

Joinery does have a pivotoperation but it took a bit of time to figure out how it works because it doesn’t have any documentation. After the pivot, the column names were of type Integer (the years), but some other functions in the library require the names to be of type String. The way to sort descending on a column is by prepending '-' to the column name, but sortBy("-2016") doesn’t work as the String "2016" doesn’t match the Integer 2016. The way to work around this was to address the column by index and make that negative (i.e. sortBy(-10)). Addressing columns by index was needed in a number of other places as well. Unfortunately, joinery doesn’t make it very easy to get the index of a column because it doesn’t expose a columnIndex(String colName) or similar method.

Another small gotcha was that the pivot operation put city and type information into a special row-index field which is not part of the regular table cells. This made it impossible to use nice functional constructs like select() to filter on them, and a forEach() over all rows was needed to filter the rows I wanted.

Joinery had, unlike tablesaw, a good API to add the calculated “growth” column.

Morpheus has the most functional and straightforward API of the three, and I ran into the fewest problems with it.

I found two small drawbacks: the pivot operation was not natively supported (although the code I had to write for it was still relatively short) and missing values aren’t well supported for primitive columns (i.e., the Integer and Double typed columns). All cells with a missing value were set to a default (0 usually). I still got the results that I wanted, but having full support for missing values would be nicer in many situations.

I also briefly tried datavec, which is a part of deeplearning4j. It has some great features: you can execute its data frame transformations locally, but it also has Apache Spark as a supported executor, so your code is easily scalable when needed. I stopped exploring this library when I got to the pivot step which I don’t think is possible with the built-in transformations. Custom transformation code seemed to require upwards of 100 lines of code instead of a couple of lines with lambda functions I was looking for.

All three libraries do what they’re supposed to do and didn’t take that long to get familiar with. In the end, I liked the API of morpheus the best and the code I created was the shortest ( in the github repository).

Update end of November 2018: The improved version of tablesaw is actually the best now, and that’s the library I would use.

However, it’s still a lot longer than the required pandas code: 50+ lines of code vs. 10 or so.

Python is not only shorter, but also has an interactive notebook experience readily available. At this moment it’s hard to make a case for using Java for data science. Maybe kotlin can change that, and projects like krangl look promising.

Building the top Android CPI network, AppTornado, AppBrain, Swiss Codemonkey