What I use for data-driven journalism

People regularly ask what tools to use or what programming language to learn for data-driven journalism (ddj). There is no single right answer, especially since the technology and tools available in the field are evolving quickly.

Nathan Yau from FlowingData recently described how he works in data visualisation. His post applies perfectly to data-driven journalism tools:

“What tool should I learn? What’s the best?” I hesitate to answer, because I use what works best for me, which isn’t necessarily the best for someone else or the “best” overall.
If you’re familiar with a software set already, it might be better to work off of what you know, because if you can draw shapes based on numbers, you can visualize data.

I was curious to review my own toolkit. Spoiler alert: this post is code-centric and will mention R a lot, simply because I am familiar with it. I do not think everybody should necessarily use my workflow. I will not say much about Excel, Python, JavaScript, … although I am well aware that they are more commonly used in ddj.

Before I dive into my typical workflow and tools for 2016 so far, I should mention that I work as the sole data journalist in my newsroom. It is more common for news outlets to have data/visual journalism teams, with people specialised in specific sub-areas of data-driven journalism. My workflow is pretty much data journalism on a shoestring.

Also, by ideology and because I am a nerd, I use (nearly) only free, open-source tools. Again, this is just because they are what I am most familiar with. But if there were a proprietary framework with which I could do things faster and better, I would switch in a heartbeat.

Data acquisition, cleaning and formatting

Tabula: Sometimes you have to deal with the data journalist’s worst enemy: data trapped in a PDF. This simple tool, which requires no coding, makes getting data tables out of a PDF less painful.

OpenRefine: I usually work with raw data directly in R. But if your data is too messy, cleaning it by scripting or manually in a spreadsheet can get tedious. OpenRefine makes data cleaning interactive and reproducible. It brings the best of both worlds, scripting and manual cleaning.

LibreOffice / Google Sheets / MS Excel: The less I use spreadsheet software, the happier I am. Unfortunately, Excel is still a standard format for distributing data. I typically use a spreadsheet to inspect data and for basic data cleaning or reshaping.

R: I will come back later to my beloved Swiss Army knife language, R. R is a free, open-source statistical computing language. Does a statistical framework sound like overkill for publishing stories for the masses? Just think of it as one of the most popular programming languages for working with data. There are heaps of packages to extend its functionality, and it has a large, helpful user community.

You can scrape data with R (with rvest for instance, much as you would with Python’s Beautiful Soup) or get data directly from the APIs of open data portals (World Bank, Eurostat, …). But R really shines at shaping your data (merge, subset, aggregate, …) with packages such as tidyr and dplyr.
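To give an idea of what this shaping looks like in practice, here is a minimal sketch. The two small tables are invented purely for illustration, and `pivot_longer()` assumes tidyr 1.0.0 or later (older versions would use `gather()` instead).

```r
library(dplyr)
library(tidyr)

# Hypothetical data, invented for illustration only
population <- data.frame(
  country = c("A", "B", "C"),
  `2014` = c(10, 20, 30),
  `2015` = c(11, 21, 29),
  check.names = FALSE  # keep the year column names as they are
)
regions <- data.frame(
  country = c("A", "B", "C"),
  region = c("North", "North", "South")
)

# Reshape from wide (one column per year) to long (one row per country and year)
long <- population %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "pop") %>%
  mutate(year = as.integer(year))

# Merge with the lookup table, then aggregate by region and year
by_region <- long %>%
  left_join(regions, by = "country") %>%
  group_by(region, year) %>%
  summarise(pop = sum(pop), .groups = "drop")

# Subset: keep only the most recent year
latest <- filter(by_region, year == max(year))
```

The same pattern (reshape, join, group, summarise, filter) covers a large share of everyday data preparation, whatever the actual source of the data.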

Analysis

In data-driven journalism, it is critical to explore your data rapidly. This means querying your data with the questions you have, or looking for patterns and outliers.

A random ggplot2 example from “Cookbook for R” (http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/)

Data exploration is typically an iterative process where new questions or ideas arise as you dig into your data. To me, nothing beats R for exploratory data analysis. You can quickly reshape your data and produce a vast array of different graphics suited to any question you might have. The R package ggplot2 is particularly helpful for that.
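As a rough illustration of how little code a first exploratory look requires, here is a small sketch. It uses `mpg`, a sample dataset that ships with ggplot2, as a stand-in for real data; the choice of variables is arbitrary.

```r
library(ggplot2)

# `mpg` ships with ggplot2; it stands in for your own data here
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.6) +   # one point per car model
  facet_wrap(~ class) +       # one small panel per vehicle class
  labs(
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon"
  )

print(p)
```

Swapping the variable in `facet_wrap()` or the geom is a one-line change, which is what makes this kind of iteration fast.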

Furthermore, with R Markdown you can create sleek PDF or HTML reports mixing code and the resulting graphics. This is a great way to document your work, but also to publish your complete methodology along with your story. As with scientific papers, the idea is that the methods used in data-driven journalism should be explicit, transparent and reproducible.

Production graphics

Static data visualisations

R (ggplot2 + Inkscape/Illustrator): Default graphics produced with R might only appeal to engineers... With a few lines of code though, you can greatly improve the chart’s look and turn it into a reusable template (check for instance this ggplot2 graphic).
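One way to template a chart’s look is to wrap the styling in a small theme function and reuse it across graphics. The sketch below is only an assumption of what such a template could contain: `theme_newsroom()` is a hypothetical name and the settings are placeholders, not an actual house style.

```r
library(ggplot2)

# Hypothetical house style; the name and the settings are placeholders
theme_newsroom <- function(base_size = 14) {
  theme_minimal(base_size = base_size) +
    theme(
      panel.grid.minor = element_blank(),        # drop the minor grid lines
      plot.title = element_text(face = "bold"),  # bold chart title
      legend.position = "bottom"
    )
}

# Apply the template to a chart built from ggplot2's sample `mpg` data
p <- ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "grey30") +
  labs(title = "Cars per class in the mpg sample data", x = NULL, y = "Count") +
  theme_newsroom()
```

Once defined, the same function can be added to every chart, so the look stays consistent without repeating the styling code.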

It is often important to add text and explanations to your graphic. This can of course be done programmatically in R, but with a lot of annotations it gets tedious. R graphics can be saved as PDF or SVG and manually edited in Inkscape (free and open-source) or Adobe Illustrator to add an “annotation layer”. This is, for instance, how I created the graphic below.

Heatmap of the ATP ranking of all number 1 tennis players. Made with R + Inkscape.
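The export step that precedes the manual editing in Inkscape or Illustrator might look like the following minimal sketch. The chart, file name and dimensions are placeholders; note that saving to SVG with `ggsave()` requires the svglite package to be installed, so that line is left commented out.

```r
library(ggplot2)

# Placeholder chart built from ggplot2's sample `mpg` data
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Save as a vector PDF; width and height are in inches by default
out_pdf <- file.path(tempdir(), "chart.pdf")
ggsave(out_pdf, plot = p, width = 8, height = 5)

# For SVG output, ggsave() relies on the svglite package:
# ggsave(file.path(tempdir(), "chart.svg"), plot = p, width = 8, height = 5)
```

Both formats keep text and shapes as editable vector objects, which is what makes adding an annotation layer by hand practical.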

In the future, though, I aim to produce more static graphics using only R. If you are pursuing some kind of mobile-first strategy, you may want to use large interactive graphics sparingly and produce more “responsive” vector graphics. Vector (SVG-based) because you want your graphics to look crisp and pixel-perfect on any screen. Responsive because the layout should also adapt elegantly to different device sizes. For instance, this graphic made with R shows multiple maps. Depending on your screen size, a row holds many map boxes on a large screen and fewer on a small one.

Interactive data visualisations

Example of a choropleth map of Switzerland made with datawrapper (http://charts.swissinfo.ch/hOyS4/4/)

datawrapper: Data-driven journalism ≧ fancy dataviz. Data-driven journalism is much more than fancy data visualisations. Data stories do not always require innovative graphics to convey a message. A standard bar or line chart often works best to make a point. For that, I am fond of the charting tool datawrapper. It is open-source but offers inexpensive paid options for hosting responsive interactive charts. It is used across our newsroom by all journalists. We have a datawrapper chart layout that fits our website, so I am not tempted to spend time on minor design tweaks, as I typically do when I code a graphic. And it recently extended its chart options: choropleth, bubble map, faceted bar charts, bullet chart, …

R + rCharts / htmlwidgets: d3.js is hands-down THE library for interactive data visualisations. But I am not proficient in JavaScript/d3.js, and I have to create graphics in ten languages for the media I work for (including right-to-left Arabic), so coding data visualisations from scratch in d3.js is too time-consuming for me.

In my case, I found that creating interactive data visualisations from R using bindings to JavaScript (with packages such as rCharts or htmlwidgets) is a great alternative. Of course, this doesn’t offer the same visualisation freedom as coding with d3.js. But I consider that the time saved can be worth the limited customisation. Another workflow advantage is that your data analysis and production graphic code can live in one script based on the same underlying data. This makes reproducing and updating graphics a breeze.

Bindings from R to leaflet.js (maps) or Highcharts (a charting library), to name only two, offer a wealth of interactive graphic possibilities. Here are some examples (click on the thumbnails to see the interactive graphics):

Some interactive data visualisations produced from R
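For readers who have not seen such a binding, here is a minimal sketch using the leaflet package, one of the htmlwidgets-based bindings mentioned above. The map centre, zoom level and marker are arbitrary choices for illustration and are not taken from the graphics shown here.

```r
library(leaflet)

# Build an interactive map widget; coordinates are approximate and illustrative
m <- leaflet() %>%
  addTiles() %>%                                   # default OpenStreetMap tiles
  setView(lng = 7.45, lat = 46.95, zoom = 8) %>%   # roughly centred on Bern
  addMarkers(lng = 7.4474, lat = 46.9480, popup = "Bern")

# Printing `m` in an interactive session displays the map.
# To write it to an HTML file for embedding (self-contained output needs pandoc):
# htmlwidgets::saveWidget(m, "map.html")
```

The whole map is described in R, and the package generates the JavaScript behind the scenes.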

That’s about it. This post is longer than I expected, but I feel my workflow is more complicated to explain than to use. I would be curious to know how other ddj people and teams work, and to hear any tips for doing things faster or better.
