Back in February, the Data@Urban editors were making comments on an early draft of the post “Why The Urban Institute Visualizes Data with ggplot2.” Author Aaron Williams made the statement:
“Point-and-click tools for data visualization are popular, but they are neither reproducible nor scalable.”
That statement prompted quite a lively debate among our editors. As interesting as it may have been to publish the full transcript of our “track changes” comment threads, we thought it might be more useful to record our thoughts via a live conversation.
So, inspired by 538’s Slack Chats, five of us sat down for a friendly debate! (The transcript has been lightly edited for clarity.)
Aaron (data scientist in the Income and Benefits Policy Center at the Urban Institute and author of the original blog post): Researchers should not use point-and-click tools like Microsoft Excel for research. Research should be accurate, reproducible (we should show our work!), and efficient. With tools like Microsoft Excel, meeting all of these goals is challenging, if not impossible.
Jon (senior fellow in the Income and Benefits Policy Center): You’re going right for it, eh?
Graham (chief data scientist): Whoa.
Jon: I think we need to set the stage just a little bit. My initial comment — and I think it will come through here as well — is that these point-and-click tools (like Excel and Tableau) are still the workhorse for a LOT of people. So while I agree with your list of reasons above, we also need to be realistic about how people (researchers) work.
Jessica (director of research programming): Perhaps, too, something to consider is what people learned as they completed their education. Is a component of this debate that younger researchers lean towards tools like R?
Ben (Ben Chartoff, director of data visualization): Great point, Jess. I think I agree with Aaron that these processes should be reproducible and accurate, and I would happily discuss whether or not Excel and others can meet those criteria (Jon thinks they can, I’m on the fence). In terms of being efficient, that comes down to the individual researcher, right? As Jess says, for a lot of folks at Urban, Excel might be waaay faster than R.
Jon: Right, efficiency is a bit of an “eye of the beholder” thing, right? If I’m a qualitative researcher and have never coded before, using Excel is going to be much faster. That being said, of course, learning a coding language is an upfront investment that can pay off later.
I think my primary rebuttal to Aaron is that so many people don’t know how to code, don’t really need to code, and may never need to code, that learning a programming language may not have a big payoff.
Aaron: I agree that many people are faster with Excel, but efficiency only matters if the research is accurate and reproducible. The first two principles can live without the third. The third can’t live without the first.
Jon: I think you can still be accurate in tools like Excel and Tableau, and at the same time, you can be a shoddy coder and make fundamental mistakes. So, I think I’m a bit less concerned about differences in accuracy across tools and a little more concerned about reproducibility.
Ben: Well I think they overlap, right? An error in code is inaccurate in a reproducible and documented way. If you slip up in copy and paste, there’s no record of that mistake.
Jon: Yes, but I shy away from too much copying and pasting. For example, in Excel, I use formulas and the Data Import features so that the data in Excel is linked to the output from my Stata code.
Aaron: Absolutely. Humans are fallible whether they’re using Excel or code. The goal should be to minimize the chance of a mistake and maximize the chance of catching a mistake. This is difficult when a mistake is buried in a cell or comes from clicking the wrong thing. Scripts create a record that can be easily reviewed. Excel review isn’t the same as code review.
Jon: In a tool like Tableau, you also have the link to the data and then, maybe, Calculated Fields, which, like formulas, are reproducible.
Graham: Wait, can we go back to Excel? I feel like it’s probably the most common “data analysis” tool and people’s first go-to.
Jon: Trying to get me back on track, Graham?
Graham: Cracking the whip.
Jon: I think there may be a couple of issues here. There’s the analysis side of working in Excel. Here, I wouldn’t suggest people use it for in-depth analysis or, especially, regression results. But done well, you can still track and review that kind of work. But, I admit, creating graphs in Excel/Tableau/Power BI is inherently drag-and-drop, and there you don’t have a record of the steps used to create it.
Aaron: In reality, more people are going to be convinced by a plot than a regression coefficient. The stakes are high. When Reinhart and Rogoff made their famous Excel mistake (the Excel Depression), countries adopted austerity and people suffered. Visualizations should be held to just as high a standard as summary statistics and regression coefficients. That requires accurate and reproducible code.
Ben: I was waiting for Reinhart and Rogoff to come up….
Jon: I was waiting for someone to shout out “But Reinhart and Rogoff!!” 😱
Aaron: Furthermore, summary statistics and the like are often more reproducible in point-and-click tools than visualizations are. The steps behind a data visualization are entirely hidden in Excel. The chances of cell reference mistakes and the like are HIGH.
Jessica: As a person who writes code a whole lot, unraveling logic and formulas inside Excel to back out what is happening is a real pain.
Jon: IMO, the way Reinhart-Rogoff used Excel was not a responsible way to use Excel. If they had used formulas, like SUMIF and COUNTIF, some of those errors — though not necessarily all — would have been caught. Some could have also occurred with poor coding techniques.
Back to the reproducibility of graphs in Excel — I agree that it can be hard (if not impossible) to back out how someone made a specific graph. But I also want to broaden the kind of Excel users we think about. I suspect a lot of people are creating your basic Excel graphs. And though they are still not reproducible in a coding way, they are still pretty simple to figure out.
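(Editors’ note: Jon’s point about SUMIF and COUNTIF can be sketched in code as well. The snippet below is a minimal illustration, not anyone’s actual analysis. The rows, the names `ROWS`, `mean_by_bucket`, and `mean_of_slice`, and the “high”/“low” labels are all made up for this example. It contrasts a condition-based average, which picks up every matching row, with an average over a fixed row range, which silently misses any matching row outside that range.)

```python
from collections import defaultdict

# Hypothetical rows: (country, debt_bucket, growth_pct). Invented for illustration.
ROWS = [
    ("A", "high", 1.0),
    ("B", "high", 3.0),
    ("C", "low", 4.0),
    ("D", "low", 2.0),
    ("E", "high", 5.0),  # a matching row that sits outside the first two rows
]


def mean_by_bucket(rows):
    """Average growth per bucket: the scripted analogue of SUMIF / COUNTIF."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _country, bucket, growth in rows:
        totals[bucket] += growth
        counts[bucket] += 1
    return {bucket: totals[bucket] / counts[bucket] for bucket in totals}


def mean_of_slice(rows, start, stop):
    """Average growth over a fixed row range, like averaging a hard-coded cell range."""
    picked = [growth for _country, _bucket, growth in rows[start:stop]]
    return sum(picked) / len(picked)


if __name__ == "__main__":
    print(mean_by_bucket(ROWS))       # condition-based: uses every "high" row
    print(mean_of_slice(ROWS, 0, 2))  # fixed range: leaves out row E
```

(With these made-up numbers, the condition-based average for the “high” group uses all three matching rows, while the fixed range covers only the first two. Either approach can be written in a spreadsheet or in a script; the difference is whether the selection rule is stated explicitly.)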
Graham: I also want to jump in and defend Excel a bit. I feel like we’re being a bit unfair. Excel’s super useful for so many things. Most data mistakes are corrected by looking at the data, not necessarily through statistical checks. And using Excel is an excellent way to understand the data, get a feel for it, and find mistakes you may have missed if you just used code (especially if you’re not an expert programmer).
Aaron: I’m not arguing that code is always accurate. I’m arguing that code minimizes the chances of mistakes and maximizes the chance of catching mistakes. MAYBE Reinhart-Rogoff would have caught the mistakes with different Excel logic. But that doesn’t mean it’s a best practice. This also highlights more advantages of code: version control and open science. It’s fundamentally easier to share and version code than Excel workbooks. And this is ignoring open source; Excel, Tableau, and the like are often proprietary. Giving as many people access as possible is important to crowd-sourcing accuracy! 👍
Jon: It’s possible that my hang-up here is that our perspective is from a big research institution. But it’s likely true that many, many people are doing work that is not as data-intense or complex, and thus using a coding environment isn’t necessary for their work. Your open science/source point is well taken, but it’s also the case that everyone knows how to use Excel. And thus, it’s easier to share in a tool like Excel rather than in code, which not everyone (both within and across organizations) will be able to do or read.
Graham: So then it seems like the question becomes, should programming always be required to do an analysis? Does every analysis have to be reproducible and involve a programmer? Are we now being gatekeepers and empowering fewer people to do data work?
Aaron: Not always, but almost always. Any number or figure intended for consumption by others (or your future self) should be created in a manner that is reproducible. Also, Excel isn’t that easy and simple coding is not that hard. Much time and money is spent teaching Excel.
Ben: Sure, but can these things be reproducible without code?
Jon: Wait, wait, wait. If we’re going to compare simple Excel with simple coding, Excel has got to win every time!
Ben: A handwritten list of instructions can be used to reproduce work. It might not seem like the best way to a programmer, but for a lot of processes, that’s what documentation means, right?
Aaron: Absolutely! It’s Excel that obscures the reproducibility!
Jon: I think Ben’s right. When I teach dataviz to business students, they have very little need (if at all) to learn to code. Maybe their organization has a group of data scientists who collect and collate the data, and their job is to build a dashboard, refresh the data, and show the results to their boss, colleagues, and clients.
Graham: Alright folks, we’re running out of time. Let’s get your final thoughts!
Aaron: I don’t think simple Excel is inherently easier than simple coding. It’s only after elementary school, middle school, high school, college, and work experience that Excel seems so simple. This will change because elementary school, middle school, high school, college, and work are changing.
Fundamentally, I’m more interested in where analysis should be and how we should get there. Yes, lots of people use Excel. I think analysis and the world will be better if some of those people abandon Excel and use open-source, reproducible tools.
Jon: Well, I certainly don’t disagree that coding (and statistics!) should be taught in early education (I’d drop calculus). But I also think there are still — and will be — advantages to using these point-and-click tools: for sharing and for quick data exploration and manipulation.
Aaron: Ok, let’s go watch basketball.
Ben: Stay tuned for our next post: How wrong is Jon about dropping calculus?