Messy Health Data: a Nielsen Analogy

Kelvin Chan
Unraveling Healthcare
4 min read · Jun 2, 2016

As messy and complex as data is in healthcare, these complexities are not unique to the industry. In fact, industries from finance to media have all dealt with (and continue to deal with) analogous issues.

I think it’s helpful to reflect on some of the challenges and solutions that other industries face. To that end, I’ve gone off on a little tangent here to describe an interesting analogy in the challenges that Nielsen, the almighty TV ratings provider, has faced and continues to face.

As I mentioned in this post, connected data is difficult for two main reasons:

  • Indirect measures: a proxy measurement used when you can’t directly measure the thing you are trying to assess.
  • Imprecise inputs: data entry modules that allow users to input data freely, resulting in inconsistencies.

Health Data Problem 1: Indirect Measures

Indirect measures are utilized when measuring from the source would prove impractical.

A good non-health analogy can be seen in TV Nielsen ratings. For decades, media companies have asked how many viewers are watching their shows. Nielsen set out to answer this question in the 1970s by distributing “Set Meters,” devices installed behind TVs to measure what you watch.

However, these “set meters” don’t measure viewer activity; they measure TV activity. A measure of TV activity may tell you that a given TV was tuned to CBS for 4 hours. Viewing activity would mean the viewer was actively watching CBS for those 4 hours (as opposed to moving around the house with the TV on).

Furthermore, it was difficult to tell who exactly was watching the show. Was it the 5-year-old child or the 60-year-old grandparent? This made demographic data even tougher to interpret.

Today’s emergence of new mediums highlights the meter’s inaccuracies. By using the TV as the indirect measurement for viewing activity, Nielsen misses out on the alternative mediums people use to consume shows. First people switched to cable, then satellite, and now online content across Netflix, YouTube, Amazon, and the like.

The “old world” of health data faced similar obstacles. How do payers know what procedures their members are getting? How do pharmaceutical companies know how many people are using their drug?

The industry approach thus far has been to measure drug sales at the wholesaler level (e.g., IMS data) and health service utilization through payer-level claims data.

Unfortunately, at this level of measurement, it’s difficult to derive a more granular analysis. Just like a link-less Wikipedia article, poor measures severely limit the types of questions you can ask. Can we see what age groups are taking drug X by measuring it at the wholesaler level? Do we know how these drugs are used? Can claims data tell us about the diagnoses that led to an operation? What if a member switches insurance coverage?

Much of this granular information remains locked away in the clinical and medical documents that your physician maintains. Despite the potential, these documents have their own set of complications.

Health Data Problem 2: Imprecise Inputs

Imprecise inputs are inputs that elicit responses that are not easy to characterize. “How old are you?” elicits a clean, numerical answer. “How do you feel?” can elicit multiple, unquantifiable answers that would trouble even the most accomplished statistician.

In parallel with “Set Meters,” Nielsen sent out surveys (known as “viewing diaries”) to supply supplemental demographic data and actual viewing activity. By linking these qualitative surveys to “Set Meter” data, one could answer who exactly in the household was watching and when.

In December 2015, after almost a century of use, Nielsen finally phased these surveys out. Replacing them are “People Meters,” devices with buttons to signal which person is watching a show.

One likely reason is that hand-filled surveys are imprecise, and thus impede efficient linkage to “Set Meter” data. Consider all the errors that can result from a survey: Did the viewer record the time accurately? Did they mark AM or PM? Did they write “Walking Dead” or “The Walking Dead”?

A design change from surveys to “People Meters” limits misspellings and incorrect entries, and, being easier to use, encourages more complete reporting.
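
To make the linkage problem concrete, here is a minimal sketch in Python, with made-up entries, of how exact matching between diary entries and a “Set Meter” log breaks on small inconsistencies, and how even basic normalization recovers part of the link:

```python
# Illustrative sketch with made-up entries: why hand-written diary entries
# are hard to link to machine-logged "Set Meter" records.
set_meter_log = {("The Walking Dead", "21:00"): "tuned in for 60 min"}

# Hypothetical diary entries describing the same viewing session
diary_entries = [
    {"show": "Walking Dead", "time": "9:00"},        # dropped "The", no AM/PM
    {"show": "the walking  dead", "time": "21:00"},  # casing and spacing differ
]

def normalize_title(title: str) -> str:
    """Collapse casing/whitespace and drop a leading article."""
    words = title.lower().split()
    if words and words[0] == "the":
        words = words[1:]
    return " ".join(words)

# Exact matching on (title, time) fails for both entries; normalizing the
# title recovers the link, though the AM/PM ambiguity in the first remains.
normalized_titles = {normalize_title(show) for (show, _) in set_meter_log}
for entry in diary_entries:
    exact = (entry["show"], entry["time"]) in set_meter_log
    fuzzy = normalize_title(entry["show"]) in normalized_titles
    print(entry, "| exact match:", exact, "| normalized title match:", fuzzy)
```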

Much of today’s push toward health data faces a similar set of problems. Clinical records also notoriously rely on imprecise inputs, producing messy, unstructured data. A commonly cited statistic (albeit from 2008) is that 60% of all clinical documents are unstructured and thus hard to use.

Payers, or health insurance companies, faced this issue in the 1970s. Physicians bill payers to get paid. Yet it would be nearly impossible to manually read through every bill that a physician submitted.

To address this, Medicare created a coding system that mapped numerical codes to different procedures. If a physician prescribed a drug or performed a procedure, s/he would then bill the payer with the code signifying that procedure (e.g., a CPT code).
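
As a rough sketch (the claim lines are hypothetical and the codes are used only as common illustrations), the appeal of coded billing is that a payer can process structured entries instead of free-text bills:

```python
# Minimal sketch of the idea behind procedure coding. The two codes below
# are common CPT-style examples used purely for illustration.
PROCEDURE_CODES = {
    "99213": "Office visit, established patient",
    "90658": "Influenza virus vaccine",
}

def bill_line(code: str, charge: float) -> dict:
    """Build a structured claim line from a code instead of free-text prose."""
    if code not in PROCEDURE_CODES:
        raise ValueError(f"Unknown procedure code: {code}")
    return {"code": code, "description": PROCEDURE_CODES[code], "charge": charge}

# A payer can aggregate and audit by code rather than parsing narrative bills.
claim = [bill_line("99213", 120.00), bill_line("90658", 45.00)]
for line in claim:
    print(f"{line['code']}  {line['description']}  ${line['charge']:.2f}")
```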

This coding system standardizes nomenclature for medical terminology, and in concept should lead to cleaner data.

But, this did not happen.

Rather, billing and “coding” became such a burdensome, complex task that procedures were often inaccurately coded. The software used to input codes was often difficult to use.

My goal in highlighting Nielsen is to illustrate that these challenges were, and continue to be, cross-industry. Through iterative changes in design, Nielsen has already overcome certain challenges. Nielsen ratings continue to be the de facto standard for TV ratings, a testament to how indirect measurements or imprecise inputs can still be “good enough.”

Unfortunately, “good enough” is not “enough” when it comes to decisions about healthcare.


Kelvin Chan
Unraveling Healthcare

Healthcare professional working on how data can help solve many of today’s current health problems. Former consultant in drug strategy. All views are my own.