Kobotoolbox Data Issue: How we fixed with R?

Anjesh Tuladhar
5 min readAug 21, 2017

--

TL;DR
The following R code reads two XLS legacy files from kobotoolbox, replaces the codes based on mapping table and merges them and generates a new XLS file — ready for analysis.

Issue with KoboToolbox

We used http://www.kobotoolbox.org/ for data collection for one of our clients. Kobo has one unsolved issue — you get timeout when you try to download the large data file after certain size. After number of email exchange with one of the kobotoolbox support, we felt that this is not going to get solved.

We could download the XLS legacy file, but not XLS (the proper one).

Screenshot from kobotoolbox showing the export types

That means that our downloaded file looks like this with all the codes and these need to be mapped and changed to labels before we start working on the data.

Kobotoolbox XLS Legacy file contents — codes and the mapping table (codes -> labels)

Next we need to change these codes into appropriate labels. Each of these codes has values, maintained in a separate file.

For a single sheet with limited data, we could do search/replace or use some excel trick to do the replacing. However we are talking about huge number of data. There are 7 sheets in XLS legacy file with the followings number of rows and columns.

With that many number of columns, it’s hard to even scroll the excel horizontally. Copy-paste and excel trick is not going to be easy.

We were still receiving survey data, and we wanted to do quick analysis.

Quick solution with R

I had done a bit of R. At quick glance, this problem seems to be quickly solvable using R.

First I read these mapping codes into a dataframe.

choices <- read_excel("data/questions.xls",sheet="choices")

Next I have to write one liner to replace the codes to labels for each column and then write back to the dataframe. Not bad.

Replacing codes-to-labels in individual column

But there are files with many columns that need to be replaced. If i miss something, then i have to check the similar-named headings. Confusing.

Replacing codes-to-labels in individual column, when there are lots of columns to replace

I used multi-line editing in sublime to write the code to overcome the possibility of making mistakes in typing the same line of codes for different columns. That’s a real time saver.

Using multi-line editing in Sublime to write codes

I have to do this for all 7 sheets. You know the number of columns. It’s tedious, laborious and still prone to errors, despite multi-line editing. It would be better to automate the replacing feature.

Automating the column code replacing

The following function takes the sheet dataframe, goes though each column, reads first non-empty row value, and checks in the mapping table before it does the entire column code-label replacement.

Now i don’t have to worry about individual column as well as the number of sheets. I can just pass the sheet variable to this function and the codes get mapped and replaced with labels with this one liner sheet1 <- fixSheet(sheet1).

Reading excel file

readxl package is used to read excel files. I get the following warnings for some sheets when i read excel with the following command.

sheet1 <- read_excel(filename, sheet=1)

The result is that the values with warnings are not pulled into dataframe.

Warning during read_excel

If i read the same file with col_types, then i wouldn’t get that error.

read_excel(filename, sheet=1, col_types=c('text','text', ...)

But I need to expand ... in col_types=c('text','text', ...) to as many as the number of columns in the sheet to read everything as text.

This following function does that. I have to read the same file twice to ensure that i read the file as text in second reading.

Now i can convert the codes to labels for any number of sheets without worrying about making human-mistake.

sheet1 <- readSheetContentAsText(filename, 1)
sheet1 <- fixSheet(sheet1)
sheet2 <- readSheetContentAsText(filename, 2)
sheet2 <- fixSheet(sheet2)
....

The whole process takes less than a minute for our kobo-proclaimed huge file.

Then I used WriteXLS package for writing corrected excel data to get started with analysis.

WriteXLS(x=c("sheet1","sheet2","sheet3"), outfilename)

The codes get replaced to appropriate labels

Merging two excel data files

Now if you have multiple XLS legacy files with the same structure, you just have to add few extra lines and you are done.

Fixing two XLS legacy files and merging results

Code

The working code is available in github.

Insights

I remember the talk by Rich Hickey on “Simple Made Easy” — watch if you haven’t. The above procedure is simple but not easy. This might not be the simplest solution out there but the number of hours i spent on fixing this is quite less. My initial reaction would have been python if i wasn’t not aware of R and its features. Python (without Pandas) would have been easy but not simple.

--

--