We use words to create the data in our universe
Illustration by Jon Jacquet, PNDLM¹

Pay Attention To Your Words!

Important data definition lessons from projects gone wrong

Dean Black
Towards Data Science
10 min read · Jun 7, 2020

In today’s world we all want good data to work with and guide us, but the truth is, good data is hard to come by. It takes a lot of thought — and we think in language — so you’ll have greater odds of success creating good data if you really stop and zero in on the language you use to define your data.

Below I relate some brief stories about companies that suffered significantly when their designers missed important, subtle mistakes in their thinking. The truth is, these kinds of mistakes are easy for anyone to make, because we all use language reflexively every day: language is so natural, we simply don’t give our use of it much thought. I hope that reading these stories and considering their lessons will inspire you to review your company’s use of language carefully from time to time, in order to avoid similar suffering.

Before we dive into their nightmares, though, let’s take a short journey back to a more peaceful time.

“How many antelope did you see at the lake this morning?” Grok inquired.

“Seven,” Esmeralda responded, dejectedly.

“Hmm. I wonder where the others are?” Grok was working on a new adornment for the cave wall. Should he incorporate this recent trend somehow? He was also concerned. He could remember when the antelope came to the lake every morning in large numbers. He frowned because he hadn’t kept track of how many came in the years before — there had been so many, exact counts hadn’t seemed important.

Before we go on, did you notice something amazing? Not one computer or technologist anywhere in our story! There’s a common misperception these days that data is about technology, but it’s not. Data has been with us forever, and working with it is fundamental to human thinking. We all do data. It’s the business of everyone.

The Essential

The beauty of Grok and Esmeralda’s short scene above is that it shows how easily and naturally data analysis comes to us. It usually progresses like this:

  1. There’s something we’re interested in
  2. with a related problem, that leads to
  3. questions, which cause us to seek
  4. answers (that’s where data comes in), and, in the end, we
  5. consider how best to record and tell the story of what we learned

And it’s all so easy, even Grok can do it! The common thread in all five elements? Thinking—ideas—expressed in language.

Before you arrive at data, you have to navigate a lot of language. And to talk about the data later? More language.

In the historical march of technology, data became lumped into technology because Data Processing started to manipulate data with computers, but adding computers never altered the language basis of data itself. If you want to be great at data, you’ll need to be a master of your words.

The Words Really Matter

Numbers without words are meaningless. 42, anyone?² Some words, without context, are also basically meaningless. Thing. There, you feel better, right? You know exactly what I’m talking about.

We’ve all seen data context and meaning problems in action many times, and recently we lived through a master-class lesson as we watched the global pandemic unfold. Consider this article from early in the US outbreak: “The Official Coronavirus Numbers Are Wrong, and Everyone Knows It.” The article discusses the variances and limits of the available numbers, and declares that the data are untrustworthy because the processes used to collect them lacked unity of thought. It details how the CDC focused only on testing travelers, rather than testing more broadly. At the core, the CDC’s concept or definition of who needed to be tested to obtain the answers sought was poorly conceived and too narrowly defined. They hadn’t thought deeply enough about the problem we faced to create the right questions and, therefore, collected a poor data set. None of the problems discussed were technology problems; all of them were problems of thought, concept, and definition.

In most day-to-day conversations, quick or informal definitions don’t cause huge problems because we can usually correct misunderstandings with a little bit of extra conversation later. “Oh, you were counting the eggs with red spots! Got it.” However, when we’re responding to international crises, or planning to run a business for years based on the data collected, we simply can’t afford to operate off-the-cuff. Precise, well-thought-out words deeply matter.

So, Try to Avoid Some of These Mistakes

Nothing teaches like mistakes, so let’s distill a few lessons from the suffering of others.

Who Are Your Customers?

In 1999, expansion of the Internet in the US was booming, fueled by the introduction of new broadband technologies. One of the early startups was pushing its network nationwide, running as fast as it could to interconnect homes and businesses using DSL for the last mile. As a first-year new hire, I came onboard to manage the data team, just as the initial data systems were being rolled out by consultants.

Executives had proclaimed their simplified strategy to the consultants: to move fast, the company would only sell to Internet Service Providers (ISPs), relying on them as middlemen to bring in our end users. “The only type of customer we will have are ISPs, so set up the Customer Relationship Management and web ordering systems accordingly. We’re going to be a wholesaler. We won’t ever have any direct customers.” So the development team turned its back on the traditional, broader definition of a customer, which is anyone who buys something from your company, and programmed (Customer = only ISP) everywhere. There was no provision left in the systems for any individual person to purchase directly from us. You can probably guess where this is going.

Within a year, as the nationwide network neared full deployment, the company found it was burning too much cash and needed to accelerate its revenue growth to offset the burn-rate: we were simply bleeding too much money because ISPs couldn’t be on-boarded fast enough. Leadership decided the company had to pivot and add selling to retail customers directly. To accomplish this, though, all of the customer and order handling systems had to be seriously overhauled, at great expense and lost time.

Lesson 1: Stick with traditional, common definitions whenever possible. There are many reasons time-proven definitions, like Customer, exist. Picking the wrong data definition can seriously limit the flexibility and speed of your business.

Lesson 2: If you create your own custom definitions of common words, you run the risk of causing accidental misunderstandings later. For instance: when you work with new parties who don’t know you only mean “ISP” whenever you use the word “Customer.”
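To make Lessons 1 and 2 concrete, here is a minimal Python sketch of the alternative design: keep the broad, traditional meaning of Customer and treat "we only sell to ISPs" as a business policy that can change. The names CustomerType, Customer, can_place_order and the sample records are all hypothetical, invented for illustration; they are not taken from the company's actual systems.

```python
from dataclasses import dataclass
from enum import Enum


class CustomerType(Enum):
    ISP = "isp"
    BUSINESS = "business"
    INDIVIDUAL = "individual"


@dataclass(frozen=True)
class Customer:
    """Anyone who buys from the company; the type is an attribute, not the definition."""
    customer_id: int
    name: str
    customer_type: CustomerType


def can_place_order(customer: Customer, allowed_types: set[CustomerType]) -> bool:
    # The current sales strategy lives in a policy set, not in the meaning of "Customer".
    return customer.customer_type in allowed_types


# Hypothetical policies before and after the pivot to retail sales.
wholesale_only = {CustomerType.ISP}
after_pivot = wholesale_only | {CustomerType.BUSINESS, CustomerType.INDIVIDUAL}

isp = Customer(1, "Example ISP", CustomerType.ISP)
resident = Customer(2, "Pat Example", CustomerType.INDIVIDUAL)
```

With this shape, the pivot becomes a change to one policy set rather than an overhaul of every system that assumed Customer meant ISP.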

Accounts Are Not Customers

A short time later, I took an assignment with a start-up division of a large, well-known tech company. They were entering a new market and had been building systems for the venture. During my first couple of weeks, I noticed that project participants were using the words “Account” and “Customer” interchangeably, and I came to realize they had implemented their order tracking system as if the two were identical. I was a bit surprised that a mature tech company’s experienced development team had failed to catch such a distinction, but it shows how easily even experts can fall into subtle word traps if they are not careful.

I assume that in the example scenarios they had considered during early planning, their Customers only had one Account. Now, as they arrived at system and integration testing, they started running into real problems. “What do we do with large, multi-state customers who want separate accounts in each state? They’re all still the same customer and we want to book all of their revenue under a single company, so how can we handle that?” The way the system had been conceived and designed, they couldn’t.

Lesson 3: If you are using multiple synonyms interchangeably, pay attention! You most likely have different concepts in play, not a single idea. Explore the nuances of those synonyms thoroughly.
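As a rough illustration of Lesson 3, the sketch below models Customer and Account as two separate concepts, with one customer owning many accounts and revenue rolling up to the customer. It is only a sketch under assumptions: the class names, fields, and the "Acme" example data are hypothetical and do not describe the system in the story.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    account_id: str
    state: str            # e.g. one account per state for a multi-state customer
    revenue: float = 0.0


@dataclass
class Customer:
    name: str
    accounts: list[Account] = field(default_factory=list)

    def open_account(self, account_id: str, state: str) -> Account:
        account = Account(account_id, state)
        self.accounts.append(account)
        return account

    def total_revenue(self) -> float:
        # Revenue is booked per account but reported under the single customer.
        return sum(account.revenue for account in self.accounts)


# Hypothetical multi-state customer with a separate account in each state.
acme = Customer("Acme")
acme.open_account("ACME-WA", "WA").revenue = 100.0
acme.open_account("ACME-OR", "OR").revenue = 250.0
```

A design that treats the two words as one idea has nowhere to put the second account; separating them makes the multi-state question an ordinary case.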

When is a Pole not a Pole?

On assignment with a regional power company, I was asked to review their asset management system, which had evolved over the years and, of course, had automated a business created well before the dawn of computerization (electricity carried on power poles).

When they had constructed their system, they had created a Pole database table, in which they logically kept track of the power poles they had installed out in the field. All was pretty good, except, unfortunately, the designers never saw the pole as separate from the geographic location at which it was planted in the ground.

The result? A pole number identified not only a pole, but also the pole’s installation location on work orders, field maps, etc. The pole and its location were seen as one and the same thing, even though with a little contemplation, it’s clear they are not the same, just as you are not your home’s street address (even if you work from home all day long now!).

This (Pole = Location-of-pole) mistake caused problems in many ways. For example, in order to replace Pole 2000, they would need to send a work team to the pole’s field location and install a second “Pole 2000” (hmm… how do we put that into the system?), then try to issue a work order to move the power lines from (existing) Pole 2000 to (new) Pole 2000. The system simply couldn’t handle any of that, because it only permitted one pole at a location, and so they had to manage replacements with external paper record keeping and off-line, manual processing.

Lesson 4: Things and their locations are often mistaken for each other. We commonly refer to some things this way out of convenience — “The mole above your left eye” or “The Server at IP address 10.1.127.10” — but it does not make the thing and its location identical. If you examine the full lifecycle of your object and its location, you’ll likely find the important places where they part company.
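Here is one way the pole story might look if the pole and its location were modeled separately. This is a minimal sketch, not the power company's design: the identifiers, coordinates, and the transfer_lines helper are hypothetical, chosen only to show that two poles can briefly share one location during a replacement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    location_id: str
    latitude: float
    longitude: float


@dataclass
class Pole:
    pole_id: str
    location: Optional[Location] = None   # None when the pole is not in the ground
    carries_lines: bool = False


def poles_at(poles: list[Pole], location: Location) -> list[Pole]:
    return [pole for pole in poles if pole.location == location]


def transfer_lines(source: Pole, target: Pole) -> None:
    # Lines can only move between poles standing at the same location.
    if source.location is None or source.location != target.location:
        raise ValueError("poles must share a location to transfer lines")
    source.carries_lines, target.carries_lines = False, True


# Hypothetical replacement: the old and new poles coexist at one location for a while.
site = Location("LOC-2000", 47.61, -122.33)
old_pole = Pole("P-0001", site, carries_lines=True)
new_pole = Pole("P-0002", site)
poles = [old_pole, new_pole]

during_replacement = len(poles_at(poles, site))  # both poles are at the site
transfer_lines(old_pole, new_pole)
old_pole.location = None                         # old pole pulled from the ground
```

Because the location has its own identity, work orders and field maps can keep pointing at the same place while the poles standing there come and go.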

One note on the lesson above: this is also the root of a very nasty problem baked into today’s Internet, where the IP address serves as both the network location of a device and its identity. So even the original, really smart Internet wizards tripped over this same subtlety. The IP problem, it turns out, is perhaps the single biggest reason hackers can attack all of us with virtual anonymity these days — and the cost to fix this problem would be massive. Clearly, some up-front concept errors, if not caught, have huge consequences later! (Learn more about the IP problem.³)

Don’t Pick a Fight With Humans: You’ll Lose

A friend of mine was a senior architect for a large, multi-state insurance company’s data warehouse initiative. They spent years and millions of dollars building the data transformation “pipelines” between all of their regional systems, to bring the data into what they hoped would be a great new data warehouse. They were excited to get it all working and see the data start flowing…until they dug into the results.

The discovery: employees in the various regions had been using similarly named fields for completely different purposes. While the data had looked consistent and homogeneous to the design team (and the definitions had seemed consistent, too), in real use the data was quite inconsistent. In many cases, they found that the users of the systems had been unaware of (or chose to ignore) the official definitions. When the data was finally flowing into the warehouse, everyone discovered that it simply didn’t tell a reliable story and could not be used to make business decisions. After the years and millions spent, the outcome was a virtual disaster.

Lesson 5: Even if you write great data definitions, don’t expect the users to read them or adhere to them. They’ll see the name of the field and put in whatever they think belongs, and sometimes they’ll even knowingly enter inappropriate data if it helps them get today’s job done. Before you rely on any user-supplied data, ensure it passes good validation and sanity checks.
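The validation and sanity checks in Lesson 5 can start very small. The sketch below checks a single record before it is trusted; the field names (region, annual_premium, start_date, end_date), the region codes, and the rules themselves are assumptions made up for this example, not the insurance company's actual definitions.

```python
from datetime import date

# Hypothetical region codes; a real system would load its own reference list.
KNOWN_REGIONS = {"NW", "SW", "MW", "NE", "SE"}


def validate_policy_record(record: dict) -> list[str]:
    """Return a list of problems found; an empty list means the record passed."""
    problems = []

    if record.get("region") not in KNOWN_REGIONS:
        problems.append("unknown region code")

    premium = record.get("annual_premium")
    if isinstance(premium, bool) or not isinstance(premium, (int, float)):
        problems.append("annual_premium is not a number")
    elif premium <= 0:
        problems.append("annual_premium must be positive")

    start, end = record.get("start_date"), record.get("end_date")
    if not isinstance(start, date) or not isinstance(end, date):
        problems.append("start_date and end_date must both be dates")
    elif end < start:
        problems.append("end_date is before start_date")

    return problems


good = {"region": "NW", "annual_premium": 1200.0,
        "start_date": date(2020, 1, 1), "end_date": date(2020, 12, 31)}
bad = {"region": "Pacific Northwest, call Bob", "annual_premium": -1,
       "start_date": date(2020, 6, 1), "end_date": date(2020, 1, 1)}
```

Running checks like these on each regional feed before loading it would surface fields being used for "whatever gets today's job done" long before the warehouse is built on top of them.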

Final Notes to Business Leaders

As you read this, besides the lessons above, please take away these two additional thoughts:

  1. Data is never beyond any of us; it is simply smart use of language. Get deeply involved in driving the development of your company’s data. Future companies will need to have a strong data strategy led from the top.
  2. Even after you get fully involved, your company will likely do better if you get someone experienced to help you lead your data charge, and as we’ve seen, nothing teaches like mistakes! As you interview, look for a person who has made and/or worked through critical data mistakes in their career and will freely talk about what they learned. Like any seasoned, thoughtful designer, good data folks have many stories about things they would do differently next time! I personally wouldn’t hire any Chief Data Officer, Data Architect or Data Scientist who isn’t happy to tell you about the many mistakes they’ve made and what they learned from them.

[1] Illustration created specifically for this article by my colleague, Jon Jacquet. ©2020 PNDLM. Used by permission.

[2] What? You don’t recognize “the Answer to Life, the Universe and Everything,” as recorded by Douglas Adams in The Hitchhiker’s Guide to the Galaxy? Simple proof that all numbers need a good story to explain them!

[3] One good place to start learning about the IP address problem is RFC 4423, Section 4, “Background.” Paragraph three states: “IP numbers are a confounding of two namespaces, the names of a host’s networking interfaces and the names of the locations (‘confounding’ is a term used in statistics to discuss metrics that are merged into one with a gain in indexing, but a loss in informational value).” I added the bolding. Unfortunately, that’s pretty hard to understand. My translation: they mushed two ideas together, losing the separate meanings and important, related capabilities in the process! Just as you are not your home’s address, a device is not its IP address, either.

Perhaps the best book on this was written by one of the chief architects of Boeing’s solution to the problem: Richard H. Paine, Beyond Host Identity Protocol: The End to Hacking as We Know It (2009). The solutions he helped Boeing develop and describes in the book later became official technology standards. The RFC referenced above is one of the related standards documents. Unfortunately, many years later, those standards have been largely ignored — much to the glee of hackers and spies everywhere, no doubt.
