DataBasic: A suite of data tools for the beginner

Easy-to-use web tools that help data newbies (and the more experienced) grasp & learn the basics

By Matt Carroll <@MattatMIT>

I’m a data geek from way back at The Boston Globe. Naturally enough, I have a deep interest in data, data visualizations, and the tools used to build them.

That includes reviewing new data and data viz tools as they become available. This is my first review, and I’m looking at a suite of three tools for data beginners, released under the name DataBasic. The basic idea is to introduce “easy-to-use web tools for beginners that introduce concepts of working with data,” as the site explains.

The tools are: Word Counter, WTFcsv, and SameDiff. Each aims to solve a particular data problem, and they do their work well. But what’s of particular interest to me is what all three accomplish in a very deep way — they are easy to use.

That’s unlike many other data tools, which are geared for hard-core users. Not that there’s anything wrong with complicated tools for hard-core users. Sometimes the tools do complicated data work, and by necessity are fairly complex. I’ve used data tools for a long time, so am not put off (too much) by sharp learning curves and crappy user experiences, as long as the tool does what’s promised. Plus, frankly, I’m paid to put up with the pain, so I work through it.

But I also understand how new users can be intimidated and turned off by these types of tools and experiences. That’s really too bad, because there’s so much data available that can help people understand so much more about their community, politics, the environment, even their own finances. The possibilities are limitless. What’s needed are simple tools that can be mastered easily so more people can participate in data surfing.

That’s where the DataBasic tools fit in. They are beautiful and simple, easy and intuitive to use, and are great for beginners. Who are they useful for? Anyone who might want to dive into data, but is unsure where to start, including students, community groups, or journalists (that’s me). The tools were tested in classrooms and workshops to make sure they worked well and were easily understandable. The developers get that learning data tools can be a miserable experience. As the web site says: “These tools were born out of frustration with things we were trying to use in our undergraduate classes, so we feel your pain.”

They’re even determined to put some fun into data. For training, they include sample data sets such as song lyrics from Beyonce and a list of survivors from the Titanic.

Here’s a quick rundown, with a more detailed description below:

  • Word Counter analyzes your text and tells you the most common words and phrases. Putting on my journalist hat, right away I can see reporters using this as a standard tool for basic analysis of politicians’ speeches.
  • SameDiff compares two or more text files and tells the user how similar or different they are. Of the three, this tool offers the most intriguing possibilities. For example, it could be used to analyze how a politician’s stump speech evolves over time. Or how a witness’s statements have changed, or how any particular set of statements has changed.
  • WTFcsv is more traditional, in that it’s geared towards spreadsheets, the data maven’s basic tool. It’s designed for the data newbie who has no idea what to do with a spreadsheet. It helps a user peek behind all those columns and rows through some simple analysis. BTW, “Csv” refers to a common text file that can easily be imported into a spreadsheet program like Excel. “WTF” stands for … well, you know.

The tools were developed by two people with a deep interest in and knowledge of data and data visualizations: Rahul Bhargava, a research scientist at the MIT Media Lab’s Center for Civic Media, and Catherine D’Ignazio, an assistant professor in the Journalism Department at Emerson College in Boston. (BTW, big transparency alert: I’m hopelessly conflicted writing this review, as I’m friends with both Catherine and Rahul. Also, I’m Catherine’s partner in creating a cool photo engagement app for newsrooms called NewsPix, and I work with Rahul in Civic Media.)

I’m looking at these tools from the point of view of a beat reporter wondering how to use them to help find stories or dive deeper into material. And while they are designed for beginners, I can see Word Counter and SameDiff gaining traction with experienced reporters, as well.

Word Counter: A ‘word cloud’ is only the beginning

Word Counter takes text and analyzes it in several ways. It creates your basic word cloud, but it also runs several kinds of word counts.

For instance, let’s take Donald Trump’s speech in June when he announced his candidacy for president.

Word Counter breaks its analysis into a few different pieces: a word cloud, Top Words, Bigrams, and Trigrams.

Top Words is a basic word count. In Trump’s speech, the most common words were People (47 mentions), Going (44), Know (42), and Great (34).

Bigrams and Trigrams are counts of two- and three-word combos. Top Bigrams: Don’t (39), going to (38), I’m (37).

And top Trigrams: “I don’t” and “(We’)re going to”, tied at 11; and “that’s the,” “going to be,” and “in the world,” tied at 10.
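For readers curious about what sits behind counts like these, here is a minimal Python sketch of word, bigram, and trigram counting. It is not Word Counter’s actual code: the function names and the sample sentence are made up for illustration, and unlike the tool’s Top Words list, this sketch makes no attempt to filter out common words such as “the” or “to.”

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out words, keeping apostrophes."""
    return re.findall(r"[a-z'’]+", text.lower())


def ngram_counts(text, n=1):
    """Count every run of n consecutive words in the text.

    Note: runs are allowed to cross sentence boundaries, which is a
    simplification a real tool might handle differently.
    """
    words = tokenize(text)
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Made-up sample text, not a quote from any speech.
sample = "We are going to win. We are going to be great."

print(ngram_counts(sample, 1).most_common(3))  # single words
print(ngram_counts(sample, 2).most_common(3))  # bigrams
print(ngram_counts(sample, 3).most_common(3))  # trigrams
```

Swapping in the full text of a speech for `sample` would give lists roughly comparable to the tool’s, though the exact numbers will depend on how the text is split into words.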

A nice touch: Each of the lists of the Top Words, Bigrams and Trigrams can be downloaded as a csv.

It’s a nice tool that might help quickly analyze or provide some insight into someone’s speech, talk, or writing. Unfortunately, this tool didn’t help our aspiring data journo come up with a story.

Too bad there wasn’t a tool in this suite that would let us take Trump’s speech announcing his candidacy and compare it with Obama’s speech doing the same.

Oh wait — there is, and it’s called…

… SameDiff: Let’s compare speeches by Obama & Trump

I took the speeches Obama and Trump made announcing their candidacy for president and easily imported them as text documents into SameDiff.

The first thing SameDiff does is give us a report, which announces: “These two documents are pretty different.” A shocker, huh? Who would think that Obama and Trump might hold widely divergent views? (Yes, that’s a joke.) But that’s good. Big differences mean that there is a lot of contrast between the two speeches, which gives us something to write about.

SameDiff creates three columns: The first and third columns are basically word clouds noting the top words used by each candidate. (You can get the exact word count by hovering your cursor over a word.) The middle column lists the words in common.
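A comparison like this can be roughed out in a few lines of Python. The sketch below is not SameDiff’s code, and I am making two assumptions: that “how different” can be scored with cosine similarity (one common choice for comparing word counts, not necessarily the one the tool uses), and that the outer columns hold words used by only one speaker. The function names and the two sample sentences are made up.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count each word."""
    return Counter(re.findall(r"[a-z'’]+", text.lower()))


def cosine_similarity(a, b):
    """Score from 0 (no shared words) to 1 (same mix of words)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares nothing with anything
    return dot / (norm_a * norm_b)


def three_columns(a, b, top=5):
    """Return (top words only in A, top shared words, top words only in B)."""
    only_a = Counter({w: c for w, c in a.items() if w not in b})
    only_b = Counter({w: c for w, c in b.items() if w not in a})
    shared = Counter({w: a[w] + b[w] for w in a.keys() & b.keys()})
    return only_a.most_common(top), shared.most_common(top), only_b.most_common(top)


# Made-up sample sentences, not quotes from either speech.
doc_a = word_counts("jobs and health care for every family")
doc_b = word_counts("jobs and trade deals with china")

print(round(cosine_similarity(doc_a, doc_b), 2))
left, middle, right = three_columns(doc_a, doc_b)
print("Only A:", left)
print("Shared:", middle)
print("Only B:", right)
```

A low score with short shared lists is the numeric version of “these two documents are pretty different.”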

It’s interesting to see the differences in what the two men talk about. With Obama, top words include: Health, future, divided, opportunities, family and children.

With Trump, some top words: Going, China, Trump, Mexico, stupid, Obamacare.

What does this tell us? We can see that Obama in his speech was focused very much on individual needs, including healthcare, and was concerned about divisions in the country.

Trump focussed more on problems in the international arena (not to mention he likes talking about himself). And he’s not above throwing around derogatory terms like “stupid.”

The column that shows which words they used in common? Not that illuminating, in this case: Great, country, jobs… Words that you might find in most political speeches given by any random politician.

As a reporter, I can’t say I could write a story based on this analysis. But it does help shine a little light on the slant each candidate was pushing, which could help inform whatever story I do write. This tool was helpful.

WTFcsv: Helping the beginner database maven

The goal of WTFcsv is to show the data newbie what’s inside a spreadsheet, or a file that can be imported into a spreadsheet. Like the other apps, it is well designed and simple, with a clean interface.

It was simple and intuitive to import a file (it can take a csv, xls, or xlsx, which are three common spreadsheet file types).

Once a file is imported, it provides basic information about what’s in the file. To test it, I imported a US Census file on college graduation rates at the state level. The state-by-state data shows the education levels of state residents, 25 or older. There are 12 columns of information, which range from state names and populations to education levels, such as “Less than 9th grade,” and “Bachelor’s or higher.”

The analysis, provided in a nice card display, broke down each column. For instance, it looked at the percents of the 25 and older population with a bachelor’s degree or higher education. The WTFcsv histogram broke the info into ranges, showing, for example, that four states fell between 18 and 22 percent. (Ahem, and using my own analysis: For Bachelor’s or higher, Massachusetts led all states with 39.4 percent.)
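To show what a histogram like that is doing under the hood, here is a small Python sketch that reads a csv and counts how many rows land in each range. It is only an illustration, not how WTFcsv works internally: the column name, the fixed 5-point range width, and the sample rows are all made up (they are placeholders, not Census figures), and the tool picks its own ranges.

```python
import csv
import io
from collections import Counter

# Made-up placeholder rows, NOT real Census data.
SAMPLE_CSV = """region,bachelors_or_higher
A,19.5
B,21.0
C,24.3
D,27.8
E,31.2
F,39.4
"""


def histogram(values, bin_width):
    """Count how many values fall in each range [start, start + bin_width)."""
    bins = Counter()
    for value in values:
        start = int(value // bin_width) * bin_width
        bins[start] += 1
    return dict(sorted(bins.items()))


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
percents = [float(row["bachelors_or_higher"]) for row in rows]

for start, count in histogram(percents, 5).items():
    print(f"{start} to {start + 5} percent: {count} row(s)")
```

To try it on a real file, replace `io.StringIO(SAMPLE_CSV)` with an opened csv file and change the column name to match.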

It also provides a “What do I do next?” set of questions that can help prod the beginner.

Bottom line: Maybe of use for people who are new to or intimidated by spreadsheets, but that’s about it. It will be of some use for a while, but as data users become more experienced, they will turn to other, more powerful tools, such as pivot tables, to do the same types of analysis.

Overall, the DataBasic suite

I like all three tools. All are simple to use. The design and user experience are great. “Intuitive” is the key word. I had no trouble importing files. My favorite was SameDiff, because I can see how useful this would be for even an experienced reporter, but I can see how all three would benefit wanna-be data reporters.

Matt Carroll runs the Future of News initiative at the MIT Media Lab and writes the “3 to read” newsletter, which is a weekly report on a trio of stories and trends from across the world of media.