The Great Metadata Debate

A Sheriff and a Cyborg Argue About How to Keep Tabs on Your Records

Molly Sions
Capital One Tech
9 min read · Jun 3, 2019


One of the reasons that Ancient Greece continues to exert outsize influence on modern thought is the simple fact that the Greeks were really, really good at record keeping. Given that metadata is the records behind the records, it seems only right to hijack the Greekiest (the r in that word is doing a lot of heavy lifting) of literary forms to talk about this deceptively simple concept. As such, I’m going to get a little experimental and summon two opinionated characters to the blogosphere for an old-school Platonic dialogue.

On one side of the table (which is, of course, round) we have a time-traveler known simply as The Sheriff 🤠. The Sheriff comes from a law-ensconced upbringing in the Old West, the son of a judge in a 19th-century frontier settlement. While he has been yanked into our era, the climate in which he grew up still serves as his guiding compass. Having been raised by a man who saw trial after trial go awry due to legal loopholes, the Sheriff has very little confidence in complex systems, feeling that they simply do not function without human judgment to straighten them out.

Metadata, to him, means humans explaining data to other humans. There are system elements that have to be included, such as where the data came from, but the meat of it is in the descriptions — sentences that say, clear as can be, what each data point means.
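Before the gavel drops, here is the Sheriff’s ideal in miniature: a minimal Python sketch of a human-authored metadata record. The field names and sample values are invented for illustration, not drawn from any real standard.

```python
from dataclasses import dataclass

@dataclass
class MetadataRecord:
    """A human-authored record for one data element, Sheriff-style."""
    element_name: str
    source_system: str  # the system element that has to be included
    description: str    # the meat: a plain sentence saying what the data means
    steward: str        # the human accountable for keeping the record accurate

record = MetadataRecord(
    element_name="account_open_date",
    source_system="core_banking",
    description="Calendar date on which the customer's account was first opened.",
    steward="jane.doe",
)
```

The point of the sketch: in the Sheriff’s world, the `description` field is the product, and everything else is scaffolding around it.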

On the other side we have Bot the Cyborg 🤖. While human in speech patterns, he is an AI bot at his core, and lives in fear of human error hidden in his programming. He still remembers the time his brother, Byte the Cyborg, got caught in an infinite loop when he was younger, introduced by a software engineer who neglected to write ATDD covering that portion. The Cyborg has zero confidence in humans to remain organized and efficient, preferring the correctible imperfections of an automated solution over the catastrophic implications of human oversight.

Metadata, to him, means producer systems being transparent. He sees lineage and transformation records as a magic bullet whose importance humans have not recognized. The Cyborg hates poetry and fiction because of their relentless ambiguity, and finds metadata descriptions to be more of the same. Humans, the Cyborg believes, have a habit of thinking that there is such a thing as a non-ambiguous sentence. There isn’t, and the more that humans cling to the idea of one, the more deceived they will be.

Let’s begin.

🤠 Sheriff: Data is useless without metadata. It might as well be a string of random letters and numbers, or a pile of non-ASCII characters. In fact, data without metadata would be better off as an unintelligible mess, because then no one would be tempted to use it. Instead, we have humans trying to intuit their way to correct data usage.

That doesn’t work at scale.

When you capture metadata, you have to be complete. Otherwise, the data point you miss will proliferate, and by the time it is several consumers removed from the source, its meaning will be distorted. If that distorted meaning finds its way into the wrong model feature, it could end up impacting thousands of people.

🤖 Cyborg: I am in agreement.

🤠 Sheriff: And that is why we need to hold our data stewards to the highest of standards.

🤖 Cyborg: I am no longer in agreement. While your previous 101 words gravitate toward the correct conclusion, your ultimate decision is illogical.

The big data world is scaled to the level at which 99.99999999% reliability is necessary. Yet you state that all of that data can be tracked with human intervention. These two points conflict with one another. Please resolve.

Suggested resolution: Conclude that automation of metadata is necessary.

🤠 Sheriff: Guess I should’ve known better than to start a debate with a glorified bug zapper. Metadata is not a problem for automation to solve. Period. It’s what enables automation. You can’t automate without documentation, and when nothing is automated, you have humans.

That’s why we need to invest in metadata management UIs, strong incentives for stewardship, and a complete and total lockdown on producing data without metadata.

And we have conflict! To begin, it looks like we have a classic chicken-egg scenario on our hands.

The Sheriff feels that humans, the egg in this case, are responsible for automating processes in the first place. Therefore there will never be a problem with metadata if, each time a human goes to automate a process that will create data, they make the effort to write down the precise information about that data.

The Cyborg thinks the chicken has already run off and multiplied, and to think that we can reverse the course of time enough to track metadata on all the data we already have, much less the data we’re going to have, is naive. Our only hope is to treat a lack of metadata the way we’d treat any other bug: Build something that fixes it.

Let’s continue.

🤖 Cyborg: Consider two postulates:

  1. System A is reading data.
  2. System A does not know where the data is coming from.

These statements cannot be simultaneously true.

Humans can query a database without thinking about the nuts and bolts. Computers cannot. They need the lineage in order to establish a connection. A data element, traced back to its original source, with all of its transformations labeled, is transparent and meaningful. Today’s data environment is a near-infinite string of mutations and calculations. If 100% of systems record where their data came from and what operations were performed, then the ideal metadata system is a human-consumable display of lineage.
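The Cyborg’s picture can be sketched in a few lines of Python: each hop in a lineage chain records which system held the element and what operation produced it, and tracing back to the source labels every transformation along the way. The system and operation names here are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LineageNode:
    """One hop in a lineage chain: the system holding the element and the
    operation that produced it from its parent (None at the original source)."""
    system: str
    operation: Optional[str] = None
    parent: Optional["LineageNode"] = None

def trace_to_source(node: LineageNode) -> List[str]:
    """Walk back to the original source, labeling every transformation."""
    steps = []
    while node is not None:
        steps.append(f"{node.system}: {node.operation or 'original source'}")
        node = node.parent
    return list(reversed(steps))  # oldest first

# A data element that has been mutated twice since leaving its source:
source = LineageNode("core_banking")
masked = LineageNode("privacy_service", "tokenized account number", source)
joined = LineageNode("reporting_db", "joined with customer table", masked)

lineage = trace_to_source(joined)  # source first, each mutation labeled
```

In the Cyborg’s ideal world, `lineage` is produced automatically by every system a data element passes through, and the “metadata system” is just a readable display of it.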

Asking humans to document their data neither adequately captures mutations nor exhaustively captures lineage. They have neither the time nor the motivation to take these things into account as they produce metadata; instead, they write redundant paragraphs, all of which are extremely vulnerable to underlying shifts, and the longest of which offer a false sense of security among consumers. Asking humans to document their data is myopic.

🤠 Sheriff: Your argument comes undone like handcuffs on a magician. You’re saying that there are changes upon changes upon changes to every data element, and at the same time you’re saying that tracing those changes will be good enough for an audit? As a man of the law, I can tell you that won’t pass muster. Lineage is a good thing to have in addition to other things, chief among them being the descriptions written by every data producer. There is a difference between not being wrong and being right. Accurate, incomplete information is not wrong, but it’s not right, either. An expert is an expert because their information is both accurate and complete.

Furthermore, if the reason that good, intuitive metadata tools wouldn’t solve the problem of wonky data is that there are just too many data transformations happening, then producing even more data by automatically tracking those transformations will only exacerbate the problem. You’re fighting fire with fire, data with data.

🤖 Cyborg: Select a company at random and look at the five-year history of the stock price, charted twice daily, at the open and close of the stock exchange. When you look at that curve, you are processing 3,650 individual data points. Do you understand them, though, as an overwhelming amount of information, or as a relatively straightforward curve? It is only because the information is complete, detailed, and reliable that it can be summarized and digested efficiently. That is the backdrop against which anomalies become evident.

🤠 Sheriff: Having a clean summary does not solve the problem on its own, though. The summary might help someone get a bird’s eye view on where there are issues, but think about the effect that will have on the people encountering those issues.

Imagine a data element that has been passed around by twenty different systems. Then imagine using five data elements that have been passed around that many times. What you end up having is a data consumer combing through record after record, saying, “Okay, this data element was converted to a string here…I think that’s fine. And here is where it was tokenized by this algorithm, okay, wait, is that the same algorithm that tokenized the other data we’re using? Hold on, let me check that…”

Data consumers will get lazy, rushed, or both, and end up using, and proliferating, data that has been mutated. That problem will lead you right back to what I’ve been saying — data elements need to come with a concise description, authored by an expert.
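For what it’s worth, the cross-check the Sheriff’s weary consumer performs by hand is exactly the kind of thing the Cyborg would automate. A minimal sketch, assuming each element carries a machine-readable transformation history; the record shape and algorithm names are invented.

```python
def tokenizers_used(history):
    """Collect the tokenizer names appearing in one element's history."""
    return {step["algorithm"] for step in history if step["op"] == "tokenize"}

def same_tokenizer(history_a, history_b):
    """True if the two elements share at least one tokenization algorithm."""
    return bool(tokenizers_used(history_a) & tokenizers_used(history_b))

history_a = [
    {"op": "cast_to_string", "algorithm": None},
    {"op": "tokenize", "algorithm": "fpe_v2"},
]
history_b = [
    {"op": "tokenize", "algorithm": "fpe_v1"},
]

same_tokenizer(history_a, history_b)  # False: different algorithms were used
```

Whether this helps depends on which debater you believe: the Cyborg says the check above replaces the consumer’s “hold on, let me check that,” while the Sheriff says a one-sentence description would have answered the question before it was asked.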

Is it just me, or is the Cyborg not as down on humans as he seemed at first? He seems to be under the impression that humans are great at detective work when they can see the full picture. His core contention is that humans can consume information a lot more efficiently than they can produce it, and they are being arrogant by refusing to admit that.

The Sheriff, on the other hand, sees the Cyborg’s suggestion of a second layer of automation as irrelevant to the core problem. He diagnoses the root cause of metadata gaps as a process issue — metadata authoring takes more time than it should, so people give up on it. If reading the metadata takes more time than it should, the Sheriff contends, people will give up on that too. He sees the Cyborg showing bias toward the things that are easy for computers to understand rather than thinking through how humans actually use metadata.

🤖 Cyborg: Difficulty is signal. If a data element is hard to investigate because it has a long and complex lineage, then consumers will shy away from such data. Likely, if the lineage display is well-designed, consumers will click on the permutation highest up the tree and use that version, reducing reliance on transformed fields. The data that resides further upstream will proliferate more, having lots of consumers and uses, while the data that resides further downstream will tend to be used in more specialized cases.

🤠 Sheriff: That assumes a perfect, jolly world in which there is no functional difference between consuming Data Element A and Data Element Z, but that is not the world we live in. Data Element A might be pristine, but it also might live in a system that simply cannot afford to go down, meaning it would be a bad idea to burden that system by using it as a source.

Data Element Z, on the other hand, might be twenty-six rungs down the ladder, but live in a database to which a connection has already been built, making it extremely easy for the engineering team to consume. Delivery and velocity are important at every company; you cannot assume a person’s motivation will always center on data quality.

🤖 Cyborg: Your concern is valid only in the short term. Over time, the effect of lineage tracking will be to make the System Zs of the world into System Ms, and then into System Es. Tracking means metrics and metrics mean goals and goals mean change.

🤠 Sheriff: I just don’t think you’re right about this.

🤖 Cyborg: Likewise.

This is the part where I disappoint you. I know you want a declared winner, whose victory would reveal this post’s conclusion. Instead of telling you who “won,” let me ask you this:

Who did you want to win?

Notice I’m not asking who won, I’m asking who you were rooting for. Metadata is an inconvenient topic, and your conclusions on it are probably tied to your professional background, the same way the Cyborg’s and the Sheriff’s upbringing makes each of them tilt one way or the other. You have probably decided on a winner, but don’t forget to think about what you learned from the loser.


DISCLOSURE STATEMENT: These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2019 Capital One.
