Getting into data visualization — where should I start?

I love data — and I broadcast that fact pretty widely.

[Chart: An altogether unnecessary analysis of whether picking 1st in fantasy football matters. It doesn’t.]

If you’ve attended a party with me recently, I apologize for talking your ear off about data visualization tools for the web or the cool R package I was playing with recently.

If you play fantasy sports with me, you’re welcome for the charts. So many charts.

That has perhaps unsurprisingly led to me hearing this question more and more frequently:

“Nick, I want to get into data analysis and visualization, where should I start?”

Unfortunately there is no perfect one-size-fits-all solution — everyone’s needs are a little different, and what folks already know can vary widely. One of the things I love about the current technical/educational/business climate is that smart people from very different career paths and backgrounds are saying to themselves “I need to get more into data.”

But amid those differences, and after many conversations, I have seen enough commonalities to put together what I think is a useful starter list. Obviously this isn’t for everyone. Are you an experienced engineer laughing at the idea of learning JavaScript or Python for the first time? Already know D3.js and wondering whether to learn one of these or roll your own chart library on top of D3? This isn’t for you. This is for the academic scientist, school teacher, research consultant, project manager, funemployed guy or MBA grad (all recent examples) who wants to begin from closer to scratch.

If that describes you, organized from “no coding” up through “I ♥ code”, this is where I think you should start.


No coding

First, if you haven’t pushed Excel’s boundaries, it’s worth doing. Seriously. Learn pivot tables at least. It may sound lame, but Excel can do a lot more than people expect. It can even make pretty charts if you try hard enough.

If you have some data already and just want a good tool to explore it visually or to export more compelling charts, Tableau is incredibly popular and powerful. There is a free public version and a very expensive paid version, which you can get for free as a student. It can publish to the web or export static graphics that you can include in research papers, post to Instagram or print out as giant wall-sized charts. The Tableau Public website has a lot of quality examples posted for you to get inspiration from.

Sadly, the next “No coding” tool I like to recommend, Infoactive, is shutting down…but on the bright side it is because they were acquired by Tableau. This hopefully means good things for Tableau Public in the future. I will plug a free book spearheaded by the Infoactive team that is useful background on data visualization design using any of the tools I cover here:

Data + Design
A simple introduction to preparing and visualizing information

Some coding

If I were picking one single programming language to use solely with data I would pick R. It’s free, supported by tons of ongoing development adding useful packages on top of the base language, and there are great free resources to learn it. First among those resources — I cannot recommend these Coursera classes highly enough:

Taking all of them might be overkill for a true beginner, but the track of classes walks a nice line from the introduction of key data science terms and ideas, through exploratory data analysis (which covers useful packages for R like ggplot2, a very popular visualization tool), all the way to adding interactivity, publishing to the web via Shiny and storytelling with data.

R is what I use most frequently for small, quick analyses and ad hoc visualization — if you’ve got a dataset that Excel is struggling with (too big, not flexible enough, poor visualizations), R is perfect for exploring quickly.

This is also the time for a quick “yes, you should probably learn some SQL.” SQL is very targeted in scope compared to R (really, it’s far from an apples-to-apples comparison). But if there are databases that you need to dive into to gather data for use with any of these other tools or languages, there is a good chance you’ll want to know SQL, and it will pay dividends in the long run.
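To give a flavor of the kind of SQL I mean, here is a minimal sketch that uses Python’s built-in sqlite3 module so it runs without installing a database. The table name draft_results, its columns and every number in it are invented for illustration; your own database will look different.

```python
import sqlite3

# Hypothetical data: (draft pick, season points). Invented for illustration.
ROWS = [
    (1, 1510.5),
    (1, 1388.0),
    (2, 1450.0),
    (2, 1502.5),
    (3, 1475.0),
]


def average_points_by_pick(rows):
    """Load rows into an in-memory SQLite table and average points per pick."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE draft_results (pick INTEGER, points REAL)")
        conn.executemany("INSERT INTO draft_results VALUES (?, ?)", rows)
        # The SQL itself: group the rows by draft pick and summarize each group.
        query = """
            SELECT pick, AVG(points) AS avg_points, COUNT(*) AS seasons
            FROM draft_results
            GROUP BY pick
            ORDER BY pick
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for pick, avg_points, seasons in average_points_by_pick(ROWS):
        print(pick, round(avg_points, 2), seasons)
```

The SELECT ... GROUP BY pattern in the middle is the part worth learning first. The rows it returns are exactly the sort of summarized table you would then hand off to R, Tableau or a charting library.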

I ♥ code

More often than not, the question of “where should I start?” comes in response to a fantastic interactive visualization presented on the web. I’m a huge fan of all the recent innovation in this area (see my in-depth survey of innovative work here).

Unfortunately, if you really love this piece:

…it can be disheartening to find out how much you have to learn to be able to build your own. It’s worth reiterating up front that “being as good as the New York Times” is a tough goal. A worthy one, but tough.

Fortunately, there are many great resources to help.

The library behind the interactive piece above, and behind many of the data visualizations running in the browser today, is D3.js, created by Mike Bostock. If you want to publish online or make interactives, D3.js is a great tool to learn. This does mean you’ll need to learn some JavaScript in general and then D3.js specifically.

Bostock’s website is a gold mine of examples and tutorials (you can’t beat learning from the creator of the library…). I’d also recommend Interactive Data Visualization for the Web by Scott Murray, which you can either buy from O’Reilly or work through for free online:

Interactive Data Visualization for the Web
This is a book about programming data visualizations for nonprogrammers. If you’re an artist or graphic designer with visual skills but no prior experience working with data or code, this book is for you. If you’re a journalist or researcher with lots of data but no prior experience working with visuals or code, this book is for you, too.

The online version is excellent: you write code snippets within the book itself, run them, and compare your output to interactive examples that also run right on the page. Murray also does a nice job of targeting the book at beginners, walking you through the basics of how web browsers work, HTML/CSS and JavaScript, before diving headlong into the details of D3.

One area to call out as a particular strength of D3 is geospatial visualizations. D3 is great at creating maps of many flavors, and there are nice dedicated tutorials available if that’s your area of focus:

D3 can be difficult to use directly, but there are many tools you can use on top of it to make your life a little easier. I’d recommend learning at least the basics of D3 rather than only using a more abstract plotting library, but if that proves intractable, a tool like Plot.ly can help make things feel more approachable.

Finally, if you really want to learn a do-it-all programming language that just happens to be great at data visualization, go with Python. Python is the most general purpose and powerful tool of anything I’ve listed, and it’s quite popular in the data science community.

I find Python very approachable as a multi-purpose programming language, but in truth it is probably overkill if all you want to do is explore and visualize data. YouTube is built with Python, for example…1 million lines of it. If you do go the Python route, the Codecademy course is a short (10–20 hours) and fun introduction to the language.
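For a sense of how approachable Python reads, here is a minimal sketch that averages weekly scores per team using only the standard library. The CSV contents, the team names and the average_points helper are all made up for illustration.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical CSV contents, invented for illustration.
CSV_TEXT = """team,week,points
Alpha,1,102.5
Bravo,1,88.0
Alpha,2,95.5
Bravo,2,112.0
"""


def average_points(csv_text):
    """Return {team: mean weekly points} from CSV text with team/points columns."""
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        scores[row["team"]].append(float(row["points"]))
    return {team: mean(values) for team, values in scores.items()}


if __name__ == "__main__":
    for team, avg in sorted(average_points(CSV_TEXT).items()):
        print(f"{team}: {avg:.1f}")
```

With a real file you would pass an open file handle to csv.DictReader instead of the io.StringIO wrapper used here.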

Much like D3.js for JavaScript or ggplot2 for R, there are many Python libraries dedicated to data visualization. Seaborn (which builds on an older popular library, matplotlib) and Bokeh are probably the best-in-class right now, but this is a quickly evolving and improving landscape. Both the Seaborn and Bokeh websites include galleries showing off the kinds of visualizations you can create with those tools.