Learn more by doing less with spreadsheets

Asmir Avdicevic
qri.io
Published in
6 min readOct 23, 2020

We’ve all been there, generating a monthly report for our business, and doing 12 revisions to get to its final form. We finally turn it in and we’re then asked to get the previous month’s report, stored somewhere deep on our laptop. Once that’s turned that in (hopefully the correct version), we look at both side by side to see where and how things changed.

On the surface it seems simple, but for most businesses it’s vital to understand what’s happening month to month. How are our ads performing? How many customers did we gain? How loaded is our pipeline? And on and on.

Spreadsheets at the workplace

Spreadsheets are an invaluable tool at work and help us work with data on a daily basis and provide a human level interaction with it. Nearly every business depends on it, but managing templates and keeping track of files and revisions quickly gets cumbersome. Once those problems are dealt with, we still need to digest that data and get some insights from it.

To deal with it I’ve got a couple suggestions:

  • Use versioning software to track changes instead of keeping track of filenames
  • Use some basic analysis tools to understand how the data is changing over time at a glance

This becomes more important as the business grows and your operations become more complex. Eventually you may have to hire someone to do the reporting. And once you grow into it, tooling becomes your best friend to keep scaling with minimal overhead.

Stop doing `Final_Rev_3_Final_Finalest.xlsx`

Most popular spreadsheet tools now offer some sort of version control to highlight changes in a table so it’s easy to understand how data changed from version to version. Google Sheets allows you to browse changes by automatically generated snapshots or by explicitly named versions.

Now that we use versioning of our data for more clarity and less error prone workflows, we need to gain some insights from the data. While in-depth reports, read-throughs and detailed analyses are mandatory to make big decisions, we often want to quickly understand what is happening without spending hours or days sifting through the spreadsheet.

Understanding change

Spreadsheets have magnificent tooling to generate some quick reports and plot data. Coupled with version control this means that we can easily glance at a summary of a table just by selecting the appropriate version. Yet, it still doesn’t highlight changes between the versions, and we have to keep clicking back and forth between the versions looking at the same graph and data points any number of times. This often results in us missing the bigger picture as we are so focused on the fine grained details. All we wanted to know is what’s new this month.

There’s a whole subset of analytics that deals with understanding data at a glance, namely descriptive statistics are a great starting point to quickly summarize some data columns.

Without getting into too much detail and math, the purpose of this is to show you some basic numbers on some data so you can understand what is going on and decide whether to dive deeper into it or move on. With the world revolving around numbers, a quick example would be viewing the sum, average, count and distribution of values.

I took some visitor stats from some of my pages to make a quick demonstration.

Here’s a quick summary of active users for August:

  • 31 rows (for 31 days)
  • Sum: 743
  • Average: 23.97
  • Std. deviation: 8.04
  • Median: 23
  • Top 10% days (90th percentile): 34
  • Bottom 10% days (10th percentile): 13

Quick graph showing active users by day and distribution of days by count.

Now the same for September:

  • 30 rows (for 30 days)
  • Sum: 879
  • Average: 29.3
  • Std. deviation: 13.26
  • Median: 31.5
  • Top 10% days (90th percentile): 43.3
  • Bottom 10% days (10th percentile): 10.9

As you see, both give some very quick and meaningful info for the respective months. But how did we do with the months relative to each other?

A cursory view of the raw numbers shows some kind of improvement, but that’s neither quantitative, nor rigorous. We need more.

Setting the descriptive stats side by side offers plenty of insight without looking at the above two reports:

From the first graph we can see a ~20% increase in traffic from August to September. On a daily basis we performed ~5 active users better on average, but we also had a lot more oscillation. In general, all metrics lean towards the right and show slight to moderate growth, though our worst days in September are worse than the worst days in August.

There’s no need to look at either of the tables individually or parse through their charts. Furthermore not even the comparison bar chart is required to understand the change, the simple delta chart tells plenty.

The purpose of the above analysis is not to say you can learn more by taking a shallower dive into the data, only that you can learn a lot at a glance and thus make more decisions in less time. This becomes increasingly more useful as there are more and more types of reports you are generating and consuming.

Phew, okay. So I got my answer, but that required some work.

A Need for Better Tooling — A Data Differ

Software developers have one tool at their disposal that I don’t (yet) have as a data analyst — a differ, or an interface that shows me the differences (what changed) from one version of a file (in my world, a dataset) to the next.

The good news is, I work at Qri to try and improve the world of data with better tooling. We already have version control as a core principle in evolving a dataset over time.

Recently we put descriptive statistics and diffing onto our near term roadmap to further aid day to day workflows with datasets. We’re kicking around a few ideas on what that differ (or change report), would look like, how it’d be used, what it should tell you, etc., and we’re looking for feedback. Here’s what we’ve dreamt up so far:

A Data Differ Wires:

As you might recognize, this format is inspired by software code diff reports. However, there are MANY reasons data is different from code, and thus deserves its own considerations:

  • Data files are MUCH bigger than code.
  • All code can be shown as text and be reasonably interpreted. Data is often best displayed with a visualization (bar charts, graphs, etc), or in columns and rows (a spreadsheet!).
  • Data can be organized & sorted without truly ‘changing’
  • …and more.

These variations add complexity to the discussion on what comparison tools (differ) should do and look like.

Are there more considerations for a data differ? We’re inviting all data analysts, data engineers, data scientists & statisticians to weigh in.

How would you use a data differ? What would you want it to tell you about how a one file differs from another? How granular or high-level do you get when inspecting changes?

This post will be one in a series on this emerging feature, so please follow up!

--

--