Data Quality: a lesson from the myth behind Popeye the Sailor

What can a cartoon from the ’30s teach you?

Vinícius Mello
hurb.labs
8 min read · Feb 20, 2021


“Data is like garbage. You’d better know what you are going to do with it before you collect it” — Mark Twain. Image by QuoteFancy.

Businesses increasingly rely on data for decision-making. Until a few months ago, I had a different understanding of what data quality is; I wasn’t used to thinking about it. Recently, I started to learn more about the subject, and I found some interesting, and somewhat humorous, stories I’d like to share.

When I was a kid, I watched Popeye the Sailor and other TV cartoons like any ordinary kid from the ’90s. I grew up hearing my mom say, “You should eat more spinach to become stronger,” and she always pointed to Popeye, one of the most famous cartoon characters, created by Elzie Crisler Segar in 1929.

Image by Pixy.

If you’re not familiar with this character, let me introduce him to you. Popeye is a sailor with an odd accent and improbable forearms who instantly gains super-strength after ingesting an always-handy can of spinach, a sort of anti-Kryptonite. It gives him his strength and perhaps his distinctive speaking style. He takes on many daunting challenges, such as battling his brawny nemesis Bluto for the affections of his love interest Olive Oyl, who is often kidnapped by Bluto. Without any effort, he only needs some spinach to become stronger.

The Myth

There is no doubt that eating vegetables is good for your health [6] (if you don’t eat them, you definitely should!), but I recently discovered that this character was built on top of a Data Quality problem. Let me explain!

The story behind this starts over fifty years before the first publication of the strip. In 1870, Erich von Wolf, a German chemist, investigated the amount of iron in spinach. While transcribing data from his notebook, von Wolf unintentionally misplaced a decimal point, inflating the iron content of spinach by an order of magnitude. Although 100 grams of spinach actually contain only 3.5 milligrams of iron, the accepted figure became 35 milligrams. To put this in perspective, if the erroneous figure were correct, eating 100 grams of spinach would be like consuming a small piece of a paper clip.

Once this wrong number was published, the nutritional value of spinach became legendary. When Popeye was created, these misunderstood health properties led the studio executives to suggest that Popeye get his strength from spinach. In reality, if it were for the iron, he would have been better off eating the cans. It was only in the 1930s [1], more than 50 years after the original publication, that someone rechecked the numbers and finally corrected the mistake.

Nevertheless, Popeye helped increase American spinach consumption by a third [2]! But the harm was done. The myth spread and spread, and only recently fell by the wayside, probably aided by the relative obscurity of Popeye today. The mistake was so widespread that in 1981 the British Medical Journal published an article about the spinach case [2], trying its best to finally debunk it.

Fake!, a British Medical Journal article about the misinformation spread around spinach.

Despite the increase in vegetable consumption and the strip’s success, decisions were made based on this printing error. Luckily, nothing bad happened, but imagine if critical decisions had been based on it. Even though the error was corrected years later, it had already spread, and even today some people still believe the myth.

The Importance of Data Quality

Now, let’s change the context a little. Companies ingest data all the time and from various sources. Because of data quality problems, some of them are consuming their own “spinach”, and it is not immediately evident. The ones that haven’t been watching for it might have a long history of data-driven decisions based on poor-quality data.

Articles about Data Quality constantly cite one expression:

Garbage In, Garbage Out

Garbage In, Garbage Out effects. Image by the author.

You can build amazing dashboards or even complex machine learning models, but if the quality of the data they consume is not good enough, I’m sorry, but the outcome will not be good either. Your analyses and models are only as good as your data.

If you’re not familiar with this concept, here are a few definitions:

Data quality is a measure of the condition of data based on factors such as accuracy, completeness, consistency, reliability, validity, and whether it’s up to date.

Data Quality measures the condition of data across different perspectives and dimensions [3]:

  • Accuracy: How well does a piece of information reflect reality?
  • Completeness: Does it fulfill your expectations of what’s comprehensive?
  • Consistency: Does information stored in one place match relevant data stored elsewhere? Is everybody looking at the same data?
  • Timeliness: Is your information available when you need it? Is data refreshed on time and at the right cadence?
  • Validity: Is the information in a specific format, does it follow business rules, or is it in an unusable format?
  • Uniqueness: Is this the only instance in which this information appears in the database?
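To make the dimensions above concrete, here is a minimal sketch of how a few of them could be measured over a small dataset. The records, field names, and business rule (positive number of nights) are all hypothetical, for illustration only:

```python
# Toy dataset: hotel bookings (illustrative records, not real data)
bookings = [
    {"id": 1, "email": "ana@example.com", "nights": 3},
    {"id": 2, "email": None,              "nights": 2},   # incomplete
    {"id": 3, "email": "bob@example.com", "nights": -1},  # invalid
    {"id": 3, "email": "bob@example.com", "nights": -1},  # duplicate
]

# Completeness: share of records with no missing fields
complete = sum(all(v is not None for v in r.values()) for r in bookings)
completeness = complete / len(bookings)

# Validity: business rule -> nights must be a positive integer
valid = sum(isinstance(r["nights"], int) and r["nights"] > 0 for r in bookings)
validity = valid / len(bookings)

# Uniqueness: no two records should share the same id
ids = [r["id"] for r in bookings]
uniqueness = len(set(ids)) / len(ids)

print(completeness, validity, uniqueness)  # 0.75 0.5 0.75
```

Real checks would live in the pipeline and alert when these ratios drop below a threshold, rather than just printing them.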

Other authors include more dimensions; I’ll stick with these as an example, but I strongly recommend studying the subject further.

Data errors propagate and, after a few years, can be hard to correct. That is why businesses and data teams need to pay attention to them. We now live in a data-driven era in which data-based decisions are made every day. Many organizations are still going through their digital transformation to start benefiting from their data. But many that are already past this point want to build complex machine learning models without developing a good data quality policy. The algorithms by themselves cannot do magic and fix issues in the data unless they are explicitly trained for it.

How can an organization improve its Data Quality?

There is no one-size-fits-all rule. One thing you need to know is that you’re going to have data issues eventually. You can’t always avoid them, but you can create mechanisms that alert you when an error has occurred. Test the data, and start thinking about how you can guarantee the dimensions shown above. Is the data accurate? Is it valid according to the business rules? Is it refreshed at the right cadence, or are you making decisions based on stale data?
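The staleness question lends itself to a simple automated alert. A minimal sketch, assuming the dataset exposes a `last_refreshed` timestamp and a daily refresh cadence (both hypothetical):

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_refreshed: datetime, max_age: timedelta) -> bool:
    """Return True when the dataset missed its expected refresh cadence."""
    return datetime.now(timezone.utc) - last_refreshed > max_age

# Assumed cadence: this table should be refreshed at least once a day
last_refreshed = datetime.now(timezone.utc) - timedelta(hours=30)
if is_stale(last_refreshed, max_age=timedelta(hours=24)):
    print("ALERT: decisions may be based on stale data")
```

In practice, a check like this would run on a scheduler and page the data team instead of printing.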

Let the stakeholders know that errors can, and will, occur, but that you have measures in place to identify and fix them. That way, you give your stakeholders more confidence to make decisions based on reports, or even to accept predictions from machine learning models.

If 10% of your data isn’t accurate, it can change the overall picture. If that 10% of dirty, poor-quality data feeds into a decision without anyone knowing, the results can be strongly affected.
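A quick back-of-the-envelope illustration of how little dirty data it takes to distort an aggregate. The numbers are made up; the corruption mimics the spinach story, a misplaced decimal point turning 100.0 into 1000.0:

```python
clean = [100.0] * 100
corrupted = [100.0] * 90 + [1000.0] * 10  # 10% with a misplaced decimal point

mean_clean = sum(clean) / len(clean)              # 100.0
mean_corrupted = sum(corrupted) / len(corrupted)  # 190.0

# Just 10% bad rows nearly doubled the average.
print(mean_clean, mean_corrupted)
```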

The pipeline debt

Pipeline debt is technical debt in data pipelines, mostly due to the lack of tests and documentation. Traditional software engineering focuses on unit tests for application code, but testing becomes much more complex when data is involved.

A data pipeline extracts and transforms information from one or more sources (inside the organization or from partners), then loads it into another location. The code can be tested many times, but if the source data suddenly changes through an update to an upstream application, the schema can change and break everything. Worse, if a bug goes undetected and begins to populate a column with incorrect values, it can corrupt the downstream processes that depend on it.
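One way to catch that kind of upstream schema drift early is to validate every incoming record against the schema the pipeline expects. A minimal sketch, with hypothetical field names:

```python
# The schema this pipeline was built against (hypothetical fields)
EXPECTED_SCHEMA = {"id": int, "email": str, "nights": int}

def schema_errors(record: dict) -> list[str]:
    """Compare an incoming record against the expected schema."""
    errors = []
    for field, ftype in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return errors

ok = {"id": 1, "email": "ana@example.com", "nights": 3}
# Upstream change: id became a string and nights was dropped
bad = {"id": "2", "email": "bob@example.com"}

print(schema_errors(ok))   # []
print(schema_errors(bad))  # ['wrong type for id: str', 'missing field: nights']
```

Running this at ingestion time turns a silent breakage into an explicit, actionable error.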

So, dealing with data pipelines means constantly testing the code and continuously testing the data, because you never know when something may change. Great Expectations [4], an open-source project written in Python, can help you create expectations and assertions for different datasets. In an article, its authors explain more about pipeline debt and how the tool can help a data team improve the quality of its datasets. Below is an example:

Great Expectations being used to define that a column’s values must lie between 60 and 75 at least 95% of the time. Image by https://greatexpectations.io/

Great Expectations is a Python package, installable via pip or conda, that allows data teams to create expectations and validations for every dataset. Its core abstraction is the Expectation, a flexible, declarative syntax for describing the expected form of data. Expectations are an excellent medium for communicating, surfacing, and documenting latent knowledge about the shape, format, and quality of data, both during exploration and in production, where they become a powerful testing tool.
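To see what an expectation like the one in the image above actually asserts, here is a plain-Python sketch of its semantics (my own toy reimplementation, not the library’s API, and the values are invented): the check passes when at least a `mostly` fraction of the values fall inside the range.

```python
def expect_values_between(values, min_value, max_value, mostly=1.0):
    """Toy version of a 'values between' expectation: passes when at least
    `mostly` of the values fall inside [min_value, max_value]."""
    in_range = sum(min_value <= v <= max_value for v in values)
    return in_range / len(values) >= mostly

# 9 of these 10 hypothetical readings fall inside 60-75
heart_rates = [62, 64, 70, 74, 73, 68, 66, 71, 72, 90]

print(expect_values_between(heart_rates, 60, 75, mostly=0.95))  # False
print(expect_values_between(heart_rates, 60, 75, mostly=0.90))  # True
```

The `mostly` tolerance is what makes such checks practical: real data is rarely perfectly clean, and you want to alert on meaningful degradation, not on every stray row.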

Final Considerations

I hope I have shown why it’s essential to have a Data Quality culture within an organization. Despite being a simple decimal error, the myth behind Popeye shows that even minor errors can be hard to fix once they spread. Data teams need to continually check their code and, most importantly, check the data.

The ideal would be defining quality checks for all datasets, especially those feeding critical decisions. My recommendation is to determine which datasets are most important within the company and start creating quality checks for them. Great Expectations can be helpful here, but there are other tools out there, both open-source and paid.

Sometimes even these quality checks, the assertions that verify whether column values lie within a specific range, won’t be enough to guarantee trust, and you’ll need anomaly/outlier detection techniques [5]. Only the context you’re dealing with can tell whether these more advanced techniques are necessary. I will explain more about them in future posts.

References

  1. The Science News-Letter. “Spinach Over-Rated as Source of Iron.” Vol. 28, No. 749, Aug. 17, 1935, p. 110.
  2. Hamblin, Terence J. “Fake.” British Medical Journal (Clinical Research Ed.) 283.6307 (1981): 1671.
  3. Pipino, Leo L., Yang W. Lee, and Richard Y. Wang. “Data Quality Assessment.” Communications of the ACM 45.4 (2002): 211–218.
  4. https://medium.com/@expectgreatdata/down-with-pipeline-debt-introducing-great-expectations-862ddc46782a
  5. Hodge, V. J., and Austin, J. “A Survey of Outlier Detection Methodologies.” Artificial Intelligence Review 22 (2004): 85–126.
  6. The Nutrition Source. “Vegetables and Fruits.” [online] Available at: <https://www.hsph.harvard.edu/nutritionsource/what-should-you-eat/vegetables-and-fruits/>

Vinícius Mello
Head of Engineering & AI @ hurb.com | Passionate about technology, leadership, and martial arts