Designing data for people

Kevin Schiroo
When I Work Data
Oct 22, 2019

Working in data, it is easy to feel separated from the realm of user experience. I don’t design experiences — by the time I get on scene, user experience is something that has already happened. Sure, we get the occasional dashboard to flex our visual muscles on, but that is nowhere near the majority of the work. However, this visual-centric thinking fails to recognize what I think is the largest and most meaningful aspect of user experience in data science: the data set.

If you make a product that has users, then you are creating a user experience. I make data sets and those data sets have users, so I am creating an experience for everyone who interacts with them. While this falls outside our normal classification, anyone who has worked with a variety of data sets knows it to be intuitively true. Some data sets are intuitive to work with, while others are just a slog.

So how do we design a data set to provide the best user experience?

1. Don’t just take it from the source

In the industrial context, many data sets originate from an application database, but don't make the mistake of letting the application database be the data set. Application databases are designed for one specific purpose: supporting the application. While it is sometimes possible to also use them to understand the application, that isn't the goal they were designed for. The application uses the database to keep its implemented state, but its logical state is usually derived from a mix of code and database. Implemented state is sometimes useful, but logical state is what users want.

2. Expand bitmaps

Bitmaps are an incredibly useful tool for packing a ton of information into a very small space. They take the binary encoding of an integer and use each bit to encode a single piece of boolean information, allowing a 1-byte integer to hold 8 boolean values. While this is awesome from the perspective of application engineering, it is kind of terrible from the perspective of user interaction.

Taking an example from When I Work, an account can have any combination of our three products: Scheduling, Attendance, and Hire. This could be represented as a bitmap field with 1 corresponding to Scheduling, 2 corresponding to Attendance, and 4 corresponding to Hire. If we carry this implementation detail forward into our data set it will look something like this:

account_id | products
---------------------
1 | 3 # 0b011
2 | 1 # 0b001
3 | 7 # 0b111

Give it a try: which accounts have which products? Was it easy? Consider all of the advanced mathematical concepts a user needs to know to work with this field successfully: binary encodings of integers, modular arithmetic, integer division, and bitwise logical operations. Beyond the mathematical knowledge, consider the outside knowledge the user needs to bring in. How does one discover which product maps to which bit? They can't discover it from the data.

Consider if we were to expand those fields:

account_id | has_scheduling | has_attendance | has_hire
-------------------------------------------------------
1 | True | True | False
2 | True | False | False
3 | True | True | True

Now give it another try: which accounts have which products? It is easier to see both what the fields logically mean and what their current values are. A user can be dropped into this data set with very little information or mathematical skill and still be able to figure it out.
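To make the expansion concrete, here is a minimal sketch of the transformation as it might run in a loading step. The field names and bit assignments are the ones from the example above; everything else is illustrative.

# Bit assigned to each product, matching the example above.
PRODUCT_BITS = {
    "has_scheduling": 0b001,
    "has_attendance": 0b010,
    "has_hire": 0b100,
}

def expand_products(products: int) -> dict:
    # Test each bit and emit one boolean field per product.
    return {field: bool(products & bit) for field, bit in PRODUCT_BITS.items()}

# expand_products(3) -> {"has_scheduling": True, "has_attendance": True, "has_hire": False}

Notice that the bit-to-product mapping now lives in exactly one place, instead of in the head of every user who queries the table.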

3. Spell out enumerations

No one in application programming is going around checking

row["message_type"] == 3

Why should your users need to? In application logic there is going to be code to abstract it to something like

row["message_type"] == MessageTypes.MISSED_SHIFT

Unfortunately for us, when designing a data set we don't get to insert such an abstraction between the data and our users, so we need to make the data speak for itself. As a consumer of data sets, which would you rather see:

message_id | message_type 
-------------------------
1 | 3
2 | 1
3 | 2

or

message_id | message_type
------------------------------
1 | "missed_shift"
2 | "shift_published"
3 | "swap_accepted"

Disk space is cheap, but human understanding is hard to come by. Expanding enumerations costs us in storage space, but dramatically increases the ease of interpreting both the data set itself and any queries that might be written against it.
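If the expansion happens at load time, it can be as simple as a lookup. A sketch, assuming the integer codes match the tables above; in practice the mapping would be copied from the application's enum definitions:

# Mapping mirroring the application's enum (codes match the tables above).
MESSAGE_TYPES = {
    1: "shift_published",
    2: "swap_accepted",
    3: "missed_shift",
}

def expand_message_type(code: int) -> str:
    # Fail loudly on an unknown code rather than loading garbage.
    return MESSAGE_TYPES[code]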

4. Repeat a field multiple ways if it communicates multiple things

Sometimes a field communicates more than one thing at the same time, and it is good to make all of that information clear to your users. One example of this is the sort of role a user might have in a business.

In the When I Work app a user can have one of four roles: general employee, supervisor, manager, or account admin. Each of these roles carries an increasing level of permission. In application logic this might be an enumeration with:

employee -> 1
supervisor -> 2
manager -> 3
account_admin -> 4

Following the previous rule we could just replace these integers with their expanded strings, but doing this would obfuscate information. The numbers don’t just convey a category, they also convey an ordering. To communicate these two ideas we can instead state it twice.

user_id | user_role_type | user_role_level
------------------------------------------------
1 | employee | 1
2 | account_admin | 4
3 | manager | 3

Having the role declared as a string allows the user to easily figure out what it means conceptually, and adding in the role's level allows them to easily see how the roles relate to each other.
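A sketch of stating the field twice at load time; the mapping mirrors the enumeration above:

# Each role id expands to both a readable type and an orderable level.
ROLES = {
    1: ("employee", 1),
    2: ("supervisor", 2),
    3: ("manager", 3),
    4: ("account_admin", 4),
}

def expand_role(role_id: int) -> dict:
    role_type, role_level = ROLES[role_id]
    return {"user_role_type": role_type, "user_role_level": role_level}

A query like user_role_level >= 3 now reads almost as clearly as user_role_type = 'manager', and each field answers the question it is best suited for.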

5. Make implicit data explicit

The main purpose of an application database is to persist the state of the application it serves, but this does not mean that it will hold that state explicitly. At When I Work someone might want to ask one simple question:

Which accounts are currently able to use our product?

This seems like it should be easy to answer until you start thinking about all the reasons someone might not be able to use our product. Did they hibernate their account? Have they paid their bill yet? Has their trial period expired? For the application it makes the most sense for this bit of state to be implied by the database, but not explicitly stated. Instead, the application holds the logic to check the various conditions and determine whether someone can use our product, but it isn’t written in any specific spot in the database.

When we find ourselves in this sort of situation the best thing we can do for our users is convert that implicit state into explicit state by doing all the computation and recording the result. This makes the data set easier to use and will very likely make it more reliable by preventing various attempts by your users to construct the implied state themselves.
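As a sketch of what that computation might look like, using the conditions named above (the attribute names here are hypothetical stand-ins for wherever this state actually lives):

def can_use_product(account) -> bool:
    # Collapse the scattered conditions into one explicit boolean.
    # Each check is a hypothetical stand-in for real application logic.
    if account.is_hibernated:
        return False
    if not account.has_paid_bill:
        return False
    if account.trial_expired:
        return False
    return True

The result gets written into the data set as its own field, so users query can_use_product directly instead of each re-deriving it, slightly differently, on their own.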

6. Pick a naming convention and stick with it

Fields and constants are going to need names. When it comes to writing them out there are lots of options to pick from: camelCase, snake_case, PascalCase, SCREAM_CASE, kebab-case. For each specific type of thing you are naming, pick one and stick with it. I might decide all my fields will be kebab case and all my categorical values will be camel case, or maybe I’ll just decide that everything should be snake case. A lot of options can work, but your users will thank you for making a consistent decision.

The reasoning behind this is pretty simple. People can remember words pretty easily. I can know the field I'm looking for is "first name", but if there is a mix of naming conventions in a data set I also need to remember whether it is firstName or first_name or FirstName, and that sort of nuance just doesn't lodge itself in a person's memory quite as well. By knowing that there is only one convention, all your users need to do is apply it.
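One lightweight way to hold the line is to check names mechanically as part of building the data set. A sketch, assuming snake case was the chosen convention:

import re
# Matches names like "first_name": lowercase words joined by underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

def nonconforming_fields(field_names):
    # Return any field name that breaks the convention.
    return [name for name in field_names if not SNAKE_CASE.match(name)]

# nonconforming_fields(["first_name", "firstName"]) -> ["firstName"]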

7. Names should always have the same meaning

If your data set is composed of multiple parts (often this will mean multiple tables), keep the meanings of names consistent over the entire data set. If the location field has been holding GPS coordinates, don't suddenly use it to hold a zip code — doing so would fundamentally change the meaning of what it refers to. Where before location represented a point, now it would represent an area.

Keeping the meaning of names consistent avoids confusion and allows users to transfer knowledge they developed working in one part of the data set to other less familiar parts.

8. Have a data grammar

By having a structured way of creating names, a user can read any name within the data set and have a pretty good idea of what it represents, as well as the rules it will follow. We've already seen some examples of this.

Earlier we saw three fields: has_scheduling, has_attendance, and has_hire. Now suppose we come across a new field named has_tasks. Without knowing anything else about it, what can we infer? We can expect that it is going to be a boolean field, tasks is likely a product or feature, and the question being answered by it is whether or not that product or feature is enabled.

A data grammar can extend beyond individual fields. If a data set has multiple parts, then the structured naming of those parts can allow users to more easily transfer knowledge between related parts. As an example, suppose I were creating a data set about user interactions with our apps. This data set should span both web and mobile interactions, but the types of interaction we can have on each platform are just different enough that they don’t really fit into a single table. Instead of trying to squeeze them into one table with swaths of nulls for the areas in which the two deviate, I can instead represent them as two tables with shared conventions: mobile_interactions and web_interactions. By having clear patterns between naming and conceptual relatedness users can quickly learn what to expect just by looking at a name.
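One way to make those shared conventions concrete is to define the common fields once and build both schemas from them. A sketch with hypothetical field names:

# Fields every *_interactions table shares, defined once so the two
# tables cannot quietly drift apart. Field names are illustrative.
SHARED_INTERACTION_FIELDS = ["user_id", "interaction_type", "occurred_at"]

TABLES = {
    "mobile_interactions": SHARED_INTERACTION_FIELDS + ["device_model", "os_version"],
    "web_interactions": SHARED_INTERACTION_FIELDS + ["browser", "referrer_url"],
}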

9. Take out the trash

If a field is garbage 20% of the time your users will quickly start thinking it is garbage 100% of the time. If the bad data can be identified, correct it. If it can’t be corrected, null it out. If there is no clear way of identifying the bad data, drop the field in its entirety from the data set.

When working in the field of data, trust is the hardest resource to come by. Knowingly allowing bad data to be part of a data set erodes the trust your users have in that data set as a whole. By removing known bad data you build trust with your users and enable them to spend their energy focusing on analysis over bushwhacking.
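As an illustration of nulling out what can't be corrected, here is a sketch using pandas; the rule itself is a made-up example:

import pandas as pd

def scrub_signup_dates(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical rule: the product launched in 2010, so any earlier
    # signup date is known garbage. Null it rather than ship it.
    bad = df["signup_date"] < pd.Timestamp("2010-01-01")
    df.loc[bad, "signup_date"] = pd.NaT
    return df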

10. Remember you are designing an API

A data set forms an interface between the system that generates it and the various systems that consume it, which makes it an API. This means that as a data set evolves over time, one should be cognizant of which sorts of changes are breaking and how they would be discovered. As professionals we have an obligation to avoid making breaking changes without good reason.

There are lots of reasons why one would need to introduce breaking changes. Sometimes we choose a poor name the first time around and really need to add clarification, like changing pay to annual_salary. Sometimes we realize a field we thought was good is actually garbage and should be dropped. These are examples of renames and deletes, and insofar as breaking changes in data go, they’re the good ones. Maintaining quality demands that these sorts of changes be made, but at least they tend to be obvious. Plan that they will need to occur and establish how you are going to manage them. This might be through release notes, an email blast, or maybe just a message in a chat room. The important thing is that users get an explanation of what happened and what they should do next.

There are some changes that should be avoided as much as possible, specifically changes in format and meaning. These are the sorts of changes that ruin a day, because they tend to break far from their source. If names used to be last, first and are now first last, the format of the field has changed, and some poor analyst or engineer is going to spend a couple of hours trying to figure out where their script got things flipped. If income used to be pre-tax and is now post-tax, the meaning has changed, and it will probably take a really long time for anyone to notice.

When the need arises to change meaning or format, the kind thing to do is instead to drop and replace the field. The format of name didn't change: the field was removed and replaced with first_name and last_name. The meaning of income didn't change: it was removed and post_tax_income was added. These will still be breaking changes, but at least they will be obvious changes rather than subtle ones.
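A sketch of the name example as a drop-and-replace transformation; the splitting logic is simplified for illustration:

def replace_name_field(row: dict) -> dict:
    # Drop the old "last, first" field and publish two new fields
    # instead of silently changing the format of the old one.
    last, first = row.pop("name").split(", ", 1)
    row["first_name"] = first
    row["last_name"] = last
    return row

Anyone still selecting name gets an immediate, obvious error instead of quietly flipped values.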

Ultimately, data sets are intended for humans; we should design them with that purpose in mind. Doing so will make our data sets more usable and ultimately more impactful to the larger world.
