Designer’s Field Guide

Understanding Data

How to talk about data with technical people

Tim Sheiner
Salesforce Designer


Photo by Aaron Burden on Unsplash, cropped from original

Learning to speak data

In 2001 the fallout from the dot-com bust was making it hard to find software development work anywhere in the Bay Area. I was out of work for seven months before a connection my father had made for me a few months earlier turned into a paying gig as Director of Design at Pharsight, a start-up that developed statistical software for Big Pharma.

My father was a scientific advisor to Pharsight; in fact, the company’s science was entirely based on theory and methods he had invented as a professor at UCSF. This was nepotism of a fairly benign sort: I needed a job, Pharsight needed a designer, and my dad was thrilled for me to be learning the “family” business. I had never actually understood what my dad did so I was also excited to have learning about his life’s work become a professional responsibility for me.

Eager to hit the ground running, I ran right into a wall: I could not understand what the people at Pharsight were talking about! It turned out the thing my dad had invented was a statistical technique called population pharmacokinetic modeling and simulation, and if the name wasn’t intimidating enough, the people who did that work were ridiculously smart, insanely technical and spoke a language I did not understand. It sounded like English, but many familiar words like normal, continuous, or uncertainty did not mean what I expected them to mean. The one thing I did understand, though, was that these people absolutely worshipped data. I realized that if I understood data, I would be able to decode their language and culture. I immersed myself, asked a lot of stupid questions, read a few books and eventually, slowly, began to understand what these rocket scientists were talking about.

Fast forward to now, and these things have changed:

  • Modeling and Simulation is now called AI
  • AI has gone mainstream
  • Statisticians are now called Data Scientists
  • Lots of companies have data scientists on their payrolls

The net net on all of this is that demand is high for designers who can create experiences that display data in useful and interesting ways. In my personal experience this became much, much easier to do once I’d learned to speak the crisp, precise and slightly odd language used by technical people for talking about data. The rest of this article is my attempt to share with you, as succinctly and simply as possible, my hard-won understanding of how to speak data.

The Model

For those of you who — like me — always want a map of the territory, here is the model for this essay.

Structured Data
  Observations vs Statistics
  Continuous vs Categorical
  Special Cases
    Time
    Scores and Identifiers
    Booleans
  Data Structures
    Name-Value Pairs
    Objects
    Arrays
    Matrices vs Tables
  Metadata
    Data Set
    Format
    Schema
Unstructured Data
  Examples
  Perceiving vs Parsing
  Takeaways

Let’s break this down.

Structured Data

The highest level breakdown I use for thinking about data is structured vs unstructured. Data is structured if you understand the structure; everything else is unstructured. Both kinds of data have value; the difference is that getting at the value of structured data is faster and less risky than unstructured. Unstructured data may be full of useful nuggets, but you won’t know for sure until after you’ve spent time and money decoding the structure.

I’ll start this discussion with structured data and once we’ve got that sorted, I’ll return to unstructured at the end.

Observations vs Statistics

Speaking the language of data means thinking about the world as a place that you understand by making observations: values obtained by measuring, recording or counting some quantity or behavior found in the world. Observations are as close to the absolute truth about the world as one can get. These things are usually recorded as numbers, but not always, and often have some kind of unit associated with them, even if that unit is not reported to you. The weight of a harvest of artichokes, the frequency of waves striking a shore or the favorite super hero reported by a class of kindergarteners are all observations. If you can sense it and record it, it’s an observation.

A statistic, on the other hand, is a value that is calculated from a set of observations. It is not an observation; it summarizes observations, taking many values and reporting them as just one value. The average of a set of numbers is a common example of a statistic.

The critical thing to remember about statistics is that they are opinions not facts. A statistic is a point of view, a particular approach to summarizing the world, that may be helpful but always contains less information than the observations from which it was derived.

A statistic is a point of view, a particular approach to summarizing the world, that may be helpful but always contains less information than the observations from which it was derived.

Here are examples showing the relationship between observations and statistics. Notice how the statistic is true but quite literally contains less information:

// gather weights of some people
// observations:
subject 1 = 145.6 lbs
subject 2 = 189.2 lbs
subject 3 = 220.5 lbs
subject 4 = 137.8 lbs
//statistic
average weight = 173.3 lbs
~~~~~~~~~~~~~~~~~~~~~~~~~
// what is your favorite fruit?
// observations:
subject 1 = apple
subject 2 = apple
subject 3 = orange
subject 4 = apple
subject 5 = kiwi
subject 6 = orange
//statistic
most preferred fruit = apple

Common synonyms for statistic are aggregation and summary.

Continuous vs Categorical

Data, made up of observations or statistics, comes in two flavors: continuous and categorical.

Continuous Data represents quantities that are measurable or observable across a range, where any value within that range is possible. The temperature of a room, the price of a stock, or the weight of a person are all examples of continuous data.

// the weights of humans are continuous data
subject 1 = 145.6 lbs
subject 2 = 189.2 lbs
subject 3 = 220.5 lbs
subject 4 = 137.8 lbs

Continuous data values are always numerical. Two common synonyms for continuous data are the words metrics and measures.

The standard way to visualize continuous data uses a line as the literal representation of the idea that, in between the data points that you have, is an infinite number of equally true data points that you didn’t happen to collect.

data from https://datahub.io/core/sea-level-rise, visualization by Plotly
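If you want to make a chart like this yourself, here is a minimal sketch in Python using Plotly Express (assuming the plotly package is installed); the sea-level values are invented for illustration, not taken from the data set above.

# a minimal line chart for continuous data (illustrative values only)
import plotly.express as px

years = [1990, 1995, 2000, 2005, 2010]
sea_level_mm = [10.2, 18.7, 24.1, 33.5, 41.9]  # hypothetical measurements

fig = px.line(x=years, y=sea_level_mm,
              labels={"x": "Year", "y": "Sea level change (mm)"})
fig.show()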

Categorical data describes things that fall into groups or categories. These things have specific, indivisible values. A person’s native language or favorite kind of fruit are examples of categorical data.

// reported fruit preference is categorical data
report 1 = apple
report 2 = apple
report 3 = orange
report 4 = apple
report 5 = kiwi
report 6 = orange

This kind of data is also often called discrete data, or in business applications, dimensions. Categorical data can be words or text, but to make a graph you may have to transform text values into some kind of number, such as a count.

The most common way to represent categorical data is with a bar chart. The breaks along the x axis between the bars represent the idea that there are no possible values between those listed. In other words, in terms of categories, the idea of ‘between’ makes no sense.

data from https://datahub.io/core/population, visualization by Plotly
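Here is the matching sketch for categorical data as a bar chart, again a rough illustration rather than the chart above; the counts come from the fruit-preference observations earlier in this article.

# a minimal bar chart for categorical data
import plotly.express as px

fruits = ["apple", "orange", "kiwi"]
reports = [3, 2, 1]  # counts from the fruit-preference example above

fig = px.bar(x=fruits, y=reports,
             labels={"x": "Favorite fruit", "y": "Number of reports"})
fig.show()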

Special Cases

There are 3 kinds of data that require special consideration when designing visualization or analytic experiences.

1. Time
Time is a special case of continuous data. It is hard to say precisely why it is special without getting too metaphysical, but the basic reasons are something like this:

  1. Time is universal; it affects everyone everywhere the same way.
  2. Time is a force like gravity; its behavior is directional, predictable, unalterable and never changes.
  3. Time, at least as far as life on this earth is concerned, is a quantity that is simultaneously linear and rhythmic; it progresses regularly and relentlessly into the future while at the same time bringing the same seasons around again and again, year after year.
  4. From a data analysis point of view, it is always valid, and often useful, to correlate quantities by time.

From a data analysis point of view, it is always valid, and often useful, to correlate quantities by time.

2. Scores and Identifiers
Numbers are usually continuous data but there are two kinds of numbers that are categorical: scores and identifiers.

A number is a score when it represents an opinion, preference or some other subjective thing. For example, the response to the following request, though numeric, is not actually a number; it is an opinion:

rate this restaurant: 1-2-3-4-5

You can see this by realizing that the request could be written:

rate this restaurant: terrible-poor-good-great-excellent

The point is to be careful calculating statistics with scores. For example, when reporting the “average” score think about using the mode, not the mean. After all, if there is no score between “poor” and “good” then, even if you can calculate it, should you really report a score between 2 and 3?
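To make that concrete, here is a small Python sketch comparing the mean and the mode of some hypothetical restaurant ratings:

# comparing the mean and the mode of score data (hypothetical ratings)
from statistics import mean, mode

ratings = [2, 3, 3, 4, 3, 2, 5]  # 1 = terrible ... 5 = excellent

print(mean(ratings))  # 3.142857..., a score no one can actually give
print(mode(ratings))  # 3, the most common answer, i.e. "good"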

A number is an identifier when it is really a name. Common examples are part numbers, zip codes, passport numbers, and serial numbers. These are all situations where a large number of things need names that follow a specific format and are unique. That is hard to guarantee if the names are words, so numbers are used instead. Of course, it doesn’t make sense to do any kind of statistical arithmetic with numbers that are really names!

Sometimes you’ll hear scores called ordinal data and identifiers called nominal data, which will clue you in that you are talking to a statistician.

3. Booleans
Boolean data is a special kind of categorical data that has only 2 possible values that are the opposite of each other. You can think of the possible values as true/false or yes/no or on/off or 0/1 or whatever makes the most sense to you. The restriction to two values makes it possible to represent logic as data. This epiphany underlies the entirety of computing, but that is another story for another time.
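As a tiny, hypothetical illustration of representing logic as data, here is a Boolean value used to filter a set of records in Python:

# booleans let you express logic as data (hypothetical records)
people = [
    {"name": "jane", "married": True},
    {"name": "ray", "married": False},
]

married_people = [p for p in people if p["married"]]
print(married_people)  # [{'name': 'jane', 'married': True}]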

Data Structures

Now that we have some words to describe data, let’s describe the most important structures we use to organize data.

The simplest useful representation of data is as a name-value pair.

Name-Value Pair
The simplest useful representation of data is as name-value pairs; in fact, this is the structure we’ve been using to represent data in all of the examples above. This concept is that data has two parts: the name of some quantity, feature, element or variable we care about and the value assigned to it. By convention, the name is listed first. Sometimes you will hear this kind of structure referred to as a key-value pair. Here are examples:

// name-value pairs
// often a colon is used as the separator instead, e.g. x:3
x = 3
pi = 3.14
color = green
height = 6'2"
city = "san francisco"

The name-value pair is the most granular kind of data structure. However, we seldom deal with just a single name-value pair so we need structures to manage collections of them.

Objects
If you work in interaction design, the object is the most important data structure after name-value pairs themselves. Objects are useful for packaging data up as a message to send from one piece of computational logic to another because they compactly represent both the data and its schema (a form of metadata we’ll discuss below) at the same time.

A valid and useful way to think about an object is as a physical object where the values describe the properties or attributes of the object.

// generically, an object is a set of name-value pairs
{
  "name": "jane",
  "age": 37,
  "gender": "female",
  "married": true, //note the boolean!
  "children": true
}
// that can itself have a name
"person": {
  "name": "jane",
  "age": 37,
  "gender": "female",
  "married": true,
  "children": true
}
// and can be nested, where any value can itself be an object
"person": {
  "name": "jane",
  "age": 37,
  "gender": "female",
  "married": true,
  "children": {
    "child1": {
      "name": "ray",
      "age": 6,
      "gender": "female",
      "married": false,
      "children": false
    },
    "child2": {
      "name": "dora",
      "age": 8,
      "gender": "female",
      "married": false,
      "children": false
    }
  }
}

The object notation used in the examples above is JSON, which has become a standard because it is easy for both humans and computers to read and write.
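Because JSON is so widely supported, most languages can read it directly. Here is a short sketch using Python’s built-in json module on a simplified version of the person object (simplified because strict JSON does not allow comments):

# parsing a JSON message into a native data structure
import json

message = '{"name": "jane", "age": 37, "married": true}'
person = json.loads(message)  # now a Python dictionary of name-value pairs

print(person["name"])     # jane
print(person["age"] + 1)  # 38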

Arrays
An array is an ordered structure for organizing multiple name-value pairs, except that the name part is implied.

// arrays of values for the same named quantity
AGE = [37,38,40,37,37]
NAME = ["jane", "keung", "sam", "vashti", "lenora"]
// array of values for different named quantities
PERSON = ["jane", 37, "female", true, false]
// like objects, arrays can be nested
PEOPLE = [
["jane", 37, "female", true, false],
["keung", 38, "male", false, false]
]

The values in an array are referenced (that is, programmatically read or written) using an index, an integer that gives a value’s position in the array, with the odd (and confusing!) caveat that positions are counted starting from zero, not one.

//array
PERSON = ["jane", 37, "female", true, false]
//select array values
PERSON[0] = "jane"
PERSON[2] = "female"
//nested array
PEOPLE = [
["jane", 37, "female", true, false],
["keung", 38, "male", false, false]
]
//select values by giving an index for both arrays, in order
PEOPLE[0][0] = "jane"
PEOPLE[1][0] = "keung"

Occasionally, rarely, you will hear an array called a vector, which will clue you that you are talking to a mathematician.

A rule of thumb for understanding the differences between objects and arrays is that arrays are more useful for computation while objects are more useful for communication.

A rule of thumb for understanding the differences between objects and arrays is that arrays are more useful for computation while objects are more useful for communication.

Matrices
A matrix is a two-dimensional representation of data made by combining arrays. By convention, the rows of a matrix represent records (also called instances or objects), and the columns represent dimensions (also called properties or attributes). So if you had data about a group of people, you’d have a row for each person and a column for each separate kind of property that you know about them, like this:

a table that is a matrix

Another convention is that the first column (from the left to right reading perspective) is the identifier, the unique “handle” for referencing a given record.

By comparing this matrix with the object and array examples above you can see that all three structures are capable of containing the exact same data; they just do it differently.
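Here is a sketch of that idea in Python using the pandas library (one common tool for working with matrices, where the structure is called a DataFrame); the rows are the same hypothetical people from the earlier examples.

# the same people data as a matrix: rows are records, columns are dimensions
import pandas as pd

people = pd.DataFrame(
    [
        ["jane", 37, "female", True, False],
        ["keung", 38, "male", False, False],
    ],
    columns=["NAME", "AGE", "GENDER", "MARRIED", "CHILDREN"],
)

print(people)
print(people["AGE"].mean())  # 37.5, a statistic calculated from a column of observations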

If you hear a matrix called a data grid you are talking to a business data scientist who uses Python and if you hear it called a data frame you are talking to an academic statistician who uses R.

Matrices are also often called tables. Sometimes this is a technical term, used to describe the organizing structures in a relational or flat file database. But more often table is used to describe something that is not truly a data structure, but instead a data visualization that uses typography, cell border decorations, observations and statistics to tell a story.

a table that isn’t a matrix

Metadata

Data is not useful if you don’t know what it means. Metadata is the generic name for data that explains what other data means.

Data Set
A data set is more of a conceptual structure than a physical one. When people talk about a data set they are describing a collection of data where all the name-value pairs are somehow related to each other (or, at least, where they believe them to be), such as

  • data collected from a particular experiment
  • data collected about a particular issue
  • data extracted from a particular source

The point of speaking about a chunk of data as a data set is to communicate that you believe a given chunk of data is relevant to solving a particular problem, or that what you need is a chunk of data relevant to solving a particular problem.

Format
Every useful data set has a defined data format. This means that the symbols used to represent each name-value pair are consistent and always mean the same things. Knowing the format is critical to being able to read the data. Here’s an example of a single date as four different name-value pairs, each formatted in a different way:

Date = 4/15/17
Date = 04/15/2017
Date = April 15, 2017
Date = 1492239600

As an English-speaking resident of the U.S.A., I can easily and correctly parse the first 3 of those 4 formats. A European reader could probably parse them too, but it would take a fraction longer because they would have to remember that the month and day are given in the opposite of the order they expect. Neither of us can get meaning from the last example, which is in a format called UNIX Epoch Time, unless we convert it to something we understand.
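If you ever need to decode that last format yourself, here is a short Python sketch; Unix epoch time is simply a count of seconds since January 1, 1970 UTC.

# converting Unix epoch time into a human-readable date
from datetime import datetime, timezone

print(datetime.fromtimestamp(1492239600, tz=timezone.utc))
# 2017-04-15 07:00:00+00:00, which is midnight on April 15, 2017 in U.S. Pacific time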

Schema
Every useful data set will also have a data schema. Schema and format often get used interchangeably. The difference is that the data format describes how each name-value pair is organized. The schema describes how all the name-value pairs in a data set are organized. A schema will always provide an ordered list of the names and might contain additional useful information:

// just the names
NAME
AGE
GENDER
MARRIED
CHILDREN
// name, format
NAME: string
AGE: integer
GENDER: string
MARRIED: boolean
CHILDREN: boolean
// name, format, meaning
NAME: string; first name of person
AGE: integer; age in years of person
GENDER: string; gender of person
MARRIED: boolean; true if currently married
CHILDREN: boolean; true if has children

If you hear a rich schema like the last example called a data dictionary that will clue you in that you are speaking with a data scientist.
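One practical use of a schema is checking whether a record actually conforms to it. Here is a minimal, hypothetical sketch of that idea in Python; the field names and types come from the schema example above.

# a schema expressed as data, and a check that a record conforms to it
schema = {"NAME": str, "AGE": int, "GENDER": str, "MARRIED": bool, "CHILDREN": bool}
record = {"NAME": "jane", "AGE": 37, "GENDER": "female", "MARRIED": True, "CHILDREN": True}

conforms = all(
    key in record and isinstance(record[key], expected_type)
    for key, expected_type in schema.items()
)
print(conforms)  # True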

Unstructured Data

Our discussion of structured data has been precise; our discussion of unstructured data will be more vague. This is because unstructured is the name for all the data that you do not know how to parse, so it is a broad and, well, unstructured group of things. ;-)

Examples
Unstructured data can be images, videos, sound, or the things people say. Don’t be fooled by the negative connotation of the word unstructured! Unstructured doesn’t mean messy and it certainly doesn’t mean meaningless because obviously pictures, videos and people’s words contain meaning. All unstructured really means is any data that is not formatted as name-value pairs. The reason why this distinction matters is that extracting meaning from unstructured data requires much more work than getting meaning from structured data.

All unstructured really means is any data that is not formatted as name-value pairs.

Let’s illustrate this with machine-generated logs, an example of a kind of unstructured data that comes up frequently in the context of analytic applications.

Here are some logs from the computer where I first began writing this essay:

// system logs
May 18 15:54:02 tsheiner-ltp osqueryd[19430]:
{"name":"processes","hostIdentifier":"47846BD9-8EC0-5A11-A45A-
88A77D6B2E66","calendarTime":"Thu May 18 22:54:02 2017 UTC","unixTime"
:"1495148042","columns":{"cmdline":"\/System\/Library\/PrivateFramewor
ks\/SyncedDefaults.framework\/Support\/syncdefaultsd","md5":"692124354
3d1b213e8ac479b7b3a36db","name":"syncdefaultsd","parent":"1","path":"\
/System\/Library\/PrivateFrameworks\/SyncedDefaults.framework\/Support
\/syncdefaultsd","pid":"22309","start_time":"59878"},"action":"removed
"} May 18 15:54:15 tsheiner-ltp com.apple.xpc.launchd[1]
(com.apple.quicklook[22336]): Endpoint has been activated through
legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in():
com.apple.quicklook May 18 15:54:20 tsheiner-ltp Console[22330]: BUG
in libdispatch client: kevent[EVFILT_MACHPORT] monitored resource
vanished before the source cancel handler was invoked

Obviously, there is meaning in those logs. But the collection is hard to read, and without some kind of schema, hard to understand.

Here is the same log information after I spent 10 minutes copy/pasting, adding some assumptions, a bit of swearing and messing with a JSON validator:

// I've made each log an object with name "log[n]",
// I'm pretty sure about the 'host' element,
// I'm guessing about the 'process' one,
// and lumping everything else into 'msg'
{
  "log1": {
    "timestamp": "May 18 15:54:02",
    "host": "tsheiner-ltp",
    "process": "osqueryd[19430]",
    "msg": {
      "name": "processes",
      "hostIdentifier": "47846BD9-8EC0-5A11-A45A-88A77D6B2E66",
      "calendarTime": "Thu May 18 22:54:02 2017 UTC",
      "unixTime": "1495148042",
      "columns": {
        "cmdline": "\/System\/Library\/PrivateFrameworks\/SyncedDefaults.framework\/Support\/syncdefaultsd",
        "md5": "6921243543d1b213e8ac479b7b3a36db",
        "name": "syncdefaultsd",
        "parent": "1",
        "path": "\/System\/Library\/PrivateFrameworks\/SyncedDefaults.framework\/Support\/syncdefaultsd",
        "pid": "22309",
        "start_time": "59878"
      },
      "action": "removed"
    }
  },
  "log2": {
    "timestamp": "May 18 15:54:15",
    "host": "tsheiner-ltp",
    "process": "com.apple.xpc.launchd[1] (com.apple.quicklook[22336])",
    "msg": "Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.quicklook"
  },
  "log3": {
    "timestamp": "May 18 15:54:20",
    "host": "tsheiner-ltp",
    "process": "Console[22330]",
    "msg": "BUG in libdispatch client: kevent[EVFILT_MACHPORT] monitored resource vanished before the source cancel handler was invoked"
  }
}

After doing the work to format the data into an object, it is easier for me to read, but there is still a lot I don’t understand. For example, I can read the text “BUG in libdispatch client” and I can understand that the text after it describes the bug, but I have no idea why I might or might not care about this information.

And that is exactly the issue with unstructured data: it takes an unpredictable amount of time and energy to get it into a structured form, and then even after you have done so, you still have to figure out if it matters to you. This limitation applies to computers just as it does to humans. Even though computers are amazingly fast, and electrons are amazingly small, moving billions and billions of them around to transform unstructured data into structured data costs time and money. Not to mention the cost of the engineers creating the code to do the parsing in the first place!
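To give a sense of what that parsing work looks like in code, here is a rough Python sketch of a regular expression that splits one syslog-style line into the timestamp, host, process and message fields I guessed at above; real logs vary enough that a pattern like this always needs tuning.

# a rough regular expression for splitting a syslog-style line into fields
import re

line = ("May 18 15:54:20 tsheiner-ltp Console[22330]: BUG in libdispatch client: "
        "kevent[EVFILT_MACHPORT] monitored resource vanished before the source "
        "cancel handler was invoked")

pattern = r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (\S+) (.+?): (.*)$"
match = re.match(pattern, line)
if match:
    timestamp, host, process, msg = match.groups()
    print({"timestamp": timestamp, "host": host, "process": process, "msg": msg})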

The issue with unstructured data is that it takes an unpredictable amount of time to get it into a structured form, and even after you have done so, you still have to figure out if it matters to you.

The tipping point is the difference between perceiving and parsing. Perceiving is that effortless moment of knowing. Parsing is that slog of arranging, reading and interpreting symbols. Sighted readers of this blog can look at the image of the flower at the top and immediately perceive a huge amount of information about how a flower is constructed. They don’t need to do any work to get that knowledge; they just look at the image and almost instantly know things. On the other hand, parsing the information in that image into some kind of structured data format is a lot of work. However, until that work has taken place — even though you have a ton of data in that flower image — you can’t analyze it, graph it or use it to compute statistics.

There are two takeaways here.

First, parsing is always possible but computationally expensive.
Even though unstructured data carries a lot of meaning that can help make good decisions, it tends to be valued less than structured data because it is harder to interpret and display. A good example of this is testimonials. Testimonials are people telling you how they feel about something. Talk about useful information! But most decision makers don’t think of testimonials as data or pay them the same amount of attention as they do metrics, because testimonials aren’t structured, so they are harder to access and harder to interpret.

Second, the pool of structured data is always getting larger.
This is what humans do and have always done: we structure data. This means that data that wasn’t structured and easily displayed yesterday for making a decision may be available tomorrow. For example, between the time I first started writing this essay (May 2017) until the time I finished playing with it (February 2019), Google image search became so much better that my example of a picture of a flower as unstructured data already verges on quaint.

Conclusion

As a UX designer, your need to understand how to use data as the raw material for your work will only increase in the coming years. The reason is that across the board, data analysis must become an intuitive, nearly unconscious part of people’s day-to-day interactions with their computers, tablets and smartphones. The virtual world in which we spend so much time will be a safer, more useful and fairer place to navigate if we are all able to use data to make decisions about the veracity of a piece of information, the trustworthiness of a feed or the wisdom of a particular investment decision.

Data analysis must become an intuitive, nearly unconscious part of people’s day-to-day interactions with their computers, smartphones and tablets.

Right now the ability to use data to make better decisions lies primarily with the large companies that collect our behavioral data as payment for services — admittedly incredibly useful services — they provide us for “free.” It has now become crystal clear that free is a myth, a one-way mirror hiding the risk, unfairness, and societal cost of this one-sided use of data.

Changes are coming and you will be part of that. My colleague Kathy Baxter has written eloquently about the need for companies to develop ethical frameworks to guide their product development and data usage roadmaps. I agree completely and add the idea that each designer also needs a personal ethical framework that guides us in doing all that we can to enable our customers to be successful, happy and well. I understand, of course, that your customers are not right now thinking “gee, I can’t wait to do more data analysis!” but this is precisely where you, as a UX designer, play a key role. You must use all your skills to create great data-driven interfaces that are easy to understand, intuitive to process, and help people make wiser, better decisions.

Thank you Alan Weibel, Ian Schoen, Raymon Sutedjo-The for your feedback, edits, and guidance.

Follow us at @SalesforceUX.
Want to work with us? Contact uxcareers@salesforce.com.
Check out the Salesforce Lightning Design System.
