Measure Seventy-Five Times, Cut Once: Further Blood Glucose Meter Testing

13 min read · Apr 5, 2016

Summary

Despite what manufacturers claim, home blood glucose meters are terribly inaccurate and show systematic bias when compared to each other. Here, ten meters are tested in a real-world setting to better understand these biases. The results show average differences exceeding 30% for some meter pairs, large enough to cause both acute and long-term harm through mistaken treatment decisions. Furthermore, the refusal of meter companies to acknowledge and publicize these differences demonstrates their lack of commitment to the diabetic community.

OneTouch Verio, OneTouch VerioSync, OneTouch UltraMini, and Bayer Contour Next Link blood glucose meters showing different results from the same blood sample.

Introduction

First things first, if you haven’t read the initial article, you should go have a look:

A Craftsman Blames His Tools: Blood Glucose Meter Accuracy & Long-Term Diabetes Control

Modern glucose meters have insufficient real-world accuracy for the treatment standards patients are expected to meet.

medium.com

That work explored the apparent differences between five common blood glucose meters, laying out a test methodology and posing some questions about the effects on people with diabetes. This piece adds more meters and samples to the dataset and explores a more general question: given a single blood sample, what relationship can we expect between any two glucose meters’ results?

To get a 30,000-foot view of the situation, let’s start with a hypothetical …

I have two glucose meters: Meter A and Meter B. When I look at the user manuals for these two meters, both of them tell me that they report a plasma-equivalent-calibrated glucose measurement using a capillary blood sample. When I look at the tables in the back of these manuals, they both report similar accuracies and precisions. Physically, I use them both in a similar fashion: wash hands, insert test strip, prick finger, apply sample. These two products are indistinguishable in most ways to me, the user, per my observation and interaction but also per the manufacturers’ claims. And, for that matter, per the FDA’s stamp of approval.

So I follow the standard procedure on each of the meters, but instead of sticking my finger twice and getting two blood samples, I use the same sample on both meters. Let’s say I’m a reasonable person and understand that using two different finger pricks can introduce other sampling issues.

Meter A shows my result as 200 mg/dL

Meter B shows my result as 260 mg/dL

Interesting. These things are supposedly “the same” in all the important ways I can imagine, but they are telling me two different things. So what’s the issue?

I do a half dozen more tests at different times of day …

Meter A: 100, 80, 120, 230, 60, 150 mg/dL

Meter B: 130, 104, 156, 299, 78, 195 mg/dL

So as a user, my Bayesian prior has now been updated: whereas before I was fairly certain that Meter A and Meter B were essentially the same, I now am fairly certain that they are not.
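The hypothetical readings above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are mine, and the numbers are the made-up Meter A and Meter B values from this example, not real test data.

```python
# Hypothetical readings from the example above (mg/dL).
# Each position is one blood sample applied to both meters.
meter_a = [200, 100, 80, 120, 230, 60, 150]
meter_b = [260, 130, 104, 156, 299, 78, 195]

# Ratio of Meter B to Meter A for each shared sample.
ratios = [b / a for a, b in zip(meter_a, meter_b)]
mean_ratio = sum(ratios) / len(ratios)

print([round(r, 2) for r in ratios])
print(f"Meter B reads {100 * (mean_ratio - 1):.0f}% higher on average")
```

Every ratio in this invented data set comes out to 1.3, which is what makes the pattern so hard to dismiss as noise.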

Am I correct? Are they not?

And, importantly, if I am correct and they are not functionally “the same,” what are the ramifications? Why are all the references I should trust to tell me whether they are or aren’t the same—the manufacturers, the FDA, my doctors—telling me otherwise?

When I go to my endocrinologist and I pick up a pamphlet or see a chart on the wall, it says something like “keep your glucose between 80 and 180 mg/dL”; never before have I seen a chart saying “if using Meter A, keep your glucose between 80 and 180 mg/dL; if using Meter B, keep your glucose between 104 and 234 mg/dL.”
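The meter-specific target range in that imaginary chart is just the standard range scaled by the average ratio between the two meters. A minimal sketch, assuming the 30% ratio from the hypothetical above (the function name `scale_range` is mine):

```python
def scale_range(low, high, ratio):
    """Scale a target range (mg/dL) from one meter to another by an average ratio."""
    return round(low * ratio), round(high * ratio)

# Meter A target of 80-180 mg/dL, with Meter B reading 30% higher on average.
print(scale_range(80, 180, 1.30))  # (104, 234)
```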

A key thing to point out is that none of what I just wrote relies at all on a comparison to a more accurate laboratory diagnostic tool. In fact, almost none of it requires any understanding of error characterization or sources, either. It’s a pretty basic set of observations that are confusing and upsetting and have no reasonable justification that I’ve found. Most assessments I’ve been offered are based around issues with finding an objectively “true” glucose measurement, which is extremely important in its own right but is not the issue at hand. When meter accuracy comes up in discussions, it nearly always goes back to error sources and lab comparisons. What I’ve never heard discussed is something more basic: why are these meters so different? When an A1c reading comes back 1.0% higher than a care team was targeting, how often do you think the question “which meter do you use” comes up? How often do you think they say “oh, you use Meter A? Well in that case we need to lower all of your targets”?

Just like in my previous post, I am not interested in determining which of these meters is objectively the most accurate. While that is certainly important, it is the concern of many other studies and requires a different set of tools. Instead, I am trying to provide one possible framework for comparing glucose meters and offer my own personal data as an example to start investigating.

Data and Analysis

Here is the link to the data discussed in this piece:

Glucose-Meter-Comparison.xlsx

www.dropbox.com

It’s an Excel file, so Dropbox tries to render a preview of it. I encourage you to download it, slice it up, import the raw stuff into your favorite SQL tool, and have a go at your own flavor of analysis.

The dataset contains readings from the following glucose meters:

LifeScan OneTouch UltraLink
LifeScan OneTouch UltraMini
LifeScan OneTouch Ultra2
LifeScan OneTouch Verio
LifeScan OneTouch VerioIQ
LifeScan OneTouch VerioSync
Abbott FreeStyle Lite
Abbott FreeStyle Precision Neo
Bayer Contour Next
Bayer Contour Next Link

The test procedure used was similar to that in the previous piece (and in fact, that data is included in this set), except that the number of meters in each test varied from two to five. This has some known effects on the analysis that are detailed in the Caveats and Shortcomings section below. Control tests were also performed on the meters for which I had the correct control solution. Where available, multiple strip lots were used as well. All of the test details, from lot number and expiration date of the strip used in each test to the serial number of the meters, are contained in that spreadsheet.

So as not to bury the lede any further, here is a summary of the results:

Average expected ratio of meter results (column divided by row) based on test results.

This table shows the relationship, on average, between any two of the ten meters. For example, a glucose of 150 mg/dL on a OneTouch Ultra2 is on average equivalent to a 168 mg/dL on a Bayer Contour Next (150 mg/dL x 112% = 168 mg/dL). Similarly, a glucose of 90 mg/dL on a OneTouch Verio is on average equivalent to a 68 mg/dL on a OneTouch UltraLink (90 mg/dL x 76% = 68 mg/dL).
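Those two conversions can be written as a small lookup. This sketch includes only the two ratios quoted in the paragraph above; the dictionary layout and the `convert` helper are illustrative assumptions, and the full table lives in the spreadsheet.

```python
# The two average ratios quoted in the text, keyed as (source meter, target meter).
RATIOS = {
    ("OneTouch Ultra2", "Bayer Contour Next"): 1.12,
    ("OneTouch Verio", "OneTouch UltraLink"): 0.76,
}

def convert(reading_mg_dl, source, target):
    """Expected average reading on `target` for a reading taken on `source`."""
    return reading_mg_dl * RATIOS[(source, target)]

print(round(convert(150, "OneTouch Ultra2", "Bayer Contour Next")))  # 168
print(round(convert(90, "OneTouch Verio", "OneTouch UltraLink")))    # 68
```

Keep in mind that these are averages: an individual pair of readings can land well away from the converted value.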

Another representation of this data can be done on a number line showing the relative difference between two meters. Below, the values have all been normalized to the OneTouch UltraLink, which reads the lowest (on average) of all the meters tested. This is similar to reading across the top row in the table above. You could of course arbitrarily normalize to any of the meters, but since the UltraLink was my meter of choice for so long, I stuck with it here.

Relative difference of meter results normalized to the OneTouch UltraLink.

So how were these values determined?

If there were just two meters involved, this would be quite simple. The easiest way would be to divide each reading on one meter by the corresponding reading on the other meter, then average those quotients. You could extend this further and instead do a regression analysis, selecting the form of the equation you think best fits the data. Either way, you end up with an equation relating the two meters’ results. But what if you have more than two meters, and not every test involved all of the meters (i.e. you tested them in pairs)? Or what if not every meter could be tested against every other meter? With ten meters, to get ten tests on each pair, it would take 450 blood samples (and 900 strips)! I have neither fingertips nor insurance that is so forgiving.
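For the two-meter case, both approaches fit in a few lines. The sketch below uses made-up readings and assumes the simplest regression form, a straight line through the origin; the function names are mine. It also checks the sample-count arithmetic for testing every pair directly.

```python
import math

def mean_of_ratios(a, b):
    """Average of the per-sample quotients b / a."""
    return sum(y / x for x, y in zip(a, b)) / len(a)

def slope_through_origin(a, b):
    """Least-squares slope k for the model b = k * a (no intercept)."""
    return sum(x * y for x, y in zip(a, b)) / sum(x * x for x in a)

# Hypothetical same-sample readings on two meters (mg/dL).
meter_a = [98, 121, 150, 204]
meter_b = [110, 133, 171, 226]

print(round(mean_of_ratios(meter_a, meter_b), 3))
print(round(slope_through_origin(meter_a, meter_b), 3))

# Cost of testing every pair of ten meters directly, ten times each.
pairs = math.comb(10, 2)   # distinct meter pairs
samples = pairs * 10       # one blood sample per pairwise test
strips = samples * 2       # two strips per sample
print(pairs, samples, strips)  # 45 450 900
```

The two estimates generally differ a little: the mean of ratios weights every sample equally, while the regression slope gives more weight to high readings.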

Instead, I used a system of equations to solve for each relationship. I took a shortcut and used Excel’s Solver tool (I wanted to keep this analysis in Excel so that it was accessible to a broader population), but the same result can be obtained with your favorite linear algebra tool. I also chose direct linear relationships to keep it simple, though any arbitrary function could be used. You can see where all of this is done in the spreadsheet.
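For anyone who would rather script this than use Solver, here is a minimal sketch of one way to pose such a system. It is not a reproduction of the spreadsheet: it assumes each meter has a single multiplier relative to a reference meter, fits those multipliers by least squares on the logarithms of the readings, and uses invented data. The names `fit_meter_scales` and `solve_linear` are mine.

```python
import math
from itertools import combinations

def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_meter_scales(tests, reference):
    """Estimate each meter's average multiplier relative to `reference`.

    tests: list of {meter_name: reading} dicts, one dict per blood sample.
    Assumed model: log(reading_a) - log(reading_b) = s_a - s_b, s_reference = 0.
    Every meter must be linked to the reference through some chain of tests,
    otherwise the system is singular and the solve fails.
    """
    meters = sorted({name for t in tests for name in t})
    free = [name for name in meters if name != reference]
    idx = {name: i for i, name in enumerate(free)}
    n = len(free)
    ata = [[0.0] * n for _ in range(n)]  # normal-equation matrix
    atb = [0.0] * n                      # normal-equation right-hand side
    for t in tests:
        for a, b in combinations(sorted(t), 2):
            diff = math.log(t[a]) - math.log(t[b])
            # Equation row: +1 for meter a, -1 for meter b (reference dropped).
            coeffs = [(idx[name], sign)
                      for name, sign in ((a, 1.0), (b, -1.0)) if name in idx]
            for i, si in coeffs:
                atb[i] += si * diff
                for j, sj in coeffs:
                    ata[i][j] += si * sj
    s = solve_linear(ata, atb)
    scales = {reference: 1.0}
    scales.update({name: math.exp(s[idx[name]]) for name in free})
    return scales

# Hypothetical tests: meters A and C are never tested on the same sample.
tests = [{"A": 100, "B": 110}, {"A": 200, "B": 220}, {"B": 110, "C": 130}]
scales = fit_meter_scales(tests, reference="A")
print({name: round(value, 2) for name, value in scales.items()})
```

Working in logarithms turns each ratio into a difference, which is what makes the problem a linear one. With noisy real data the fitted multipliers are a compromise across all tests rather than an exact match to any single one.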

This method allows for comparison between meters that weren’t tested directly. I didn’t have access to sufficient strips for some of the meters, so I used them all up early and was not able to test them against meters I acquired further on in the process. The table below shows how many direct tests were performed between each pair.

Number of direct, same blood sample tests performed on each pair of glucose meters.

Even with numerous zeros on the chart, relationships can be estimated based on all the other networked relationships. Of course, the more direct comparisons there are, the stronger the existing relationship will hold in the face of new evidence. Ideally, over time, the gaps will be filled in with additional testing. This is similar to how you might do computer rankings of sports teams, for example.

Discussion

So what to make of these numbers? The meters could more generally be grouped into three bins, each of which contains approximately equivalent meters:

Bin 1:

LifeScan OneTouch UltraLink
LifeScan OneTouch UltraMini
LifeScan OneTouch Ultra2
Abbott FreeStyle Precision Neo

Bin 2 (10–15% higher than Bin 1):

Abbott FreeStyle Lite
Bayer Contour Next
Bayer Contour Next Link

Bin 3 (25–35% higher than Bin 1):

LifeScan OneTouch Verio
LifeScan OneTouch VerioIQ
LifeScan OneTouch VerioSync

Of interest, no meters using the same test strips are contained in multiple bins. For example, all of the meters using OneTouch Ultra strips are in Bin 1 and all of the meters using OneTouch Verio strips are in Bin 3. Per FDA regulation, both of these meter bins could be meeting the minimum requirements by having 95% of their readings within 20% (or even 15%) of the “truth,” as determined by a laboratory standard. In that case, both meters are showing bias, but they are both also within acceptable limits. What I don’t quite understand is why the manufacturer would want this bias to persist.
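To see how two meters can both satisfy a percentage tolerance and still disagree by about 30%, consider the sketch below. The lab values and the 13% biases are invented for illustration, and the check is simplified to the percentage criterion exactly as described above.

```python
def fraction_within(meter, lab, tolerance):
    """Fraction of meter readings within +/- tolerance (as a fraction) of the lab value."""
    hits = sum(abs(m - ref) <= tolerance * ref for m, ref in zip(meter, lab))
    return hits / len(lab)

# Hypothetical lab reference values (mg/dL) and two biased meters.
lab = [80, 120, 160, 200, 250]
low_meter = [ref * 0.87 for ref in lab]   # reads 13% below the lab
high_meter = [ref * 1.13 for ref in lab]  # reads 13% above the lab

print(fraction_within(low_meter, lab, 0.15))   # 1.0
print(fraction_within(high_meter, lab, 0.15))  # 1.0

# Yet compared with each other, the meters differ by roughly 30%.
ratio = sum(h / l for h, l in zip(high_meter, low_meter)) / len(lab)
print(f"{100 * (ratio - 1):.0f}%")  # 30%
```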

The number presented on the screen is not a direct measurement. The current or voltage output measured is converted to an equivalent glucose reading using a characteristic equation, which itself contains one or more calibration constants. To make these meters read similarly, the calibration constants can be adjusted (by the manufacturer, not by the user). Even if the meters use different enzymes (glucose oxidase vs. glucose dehydrogenase, for example), and/or the equations can’t be made to match exactly over the entire range of possible glucose values, a consistent +30% bias of one over the other does not make any sense.
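As a toy illustration of that argument, suppose the characteristic equation were a straight line. That is purely an assumption for this sketch: the real equations and constants are not public, and the signal values and slope below are invented. Under that assumption, a consistent proportional bias disappears when one constant is rescaled.

```python
def glucose_from_signal(signal, slope, intercept=0.0):
    """Hypothetical linear characteristic equation: glucose = slope * signal + intercept."""
    return slope * signal + intercept

signals = [2.0, 4.0, 6.0]  # made-up sensor outputs, arbitrary units

# A meter whose made-up calibration slope makes it read 30% high.
high_reading = [glucose_from_signal(s, slope=39.0) for s in signals]

# Rescaling that single constant by 1/1.3 removes the proportional bias.
recalibrated = [glucose_from_signal(s, slope=39.0 / 1.3) for s in signals]

print([round(g) for g in high_reading])   # [78, 156, 234]
print([round(g) for g in recalibrated])   # [60, 120, 180]
```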

Remember also that these values are averages, which means their effect on treatment is best understood in the context of long-term impacts. By contrast, individual tests ranged widely, with the largest difference being +59% when the OneTouch VerioIQ read 156 mg/dL and the OneTouch Ultra2 read 98 mg/dL. For me, a bolus of 1.0 to 1.5 units would be appropriate for the VerioIQ reading, whereas that same bolus would put me in or near a coma using the Ultra2.
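The +59% figure is the relative difference between the two readings, taking the lower reading as the baseline. A one-function sketch (the function name is mine):

```python
def percent_difference(reading, baseline):
    """Percent by which `reading` exceeds `baseline`."""
    return 100 * (reading / baseline - 1)

# OneTouch VerioIQ read 156 mg/dL, OneTouch Ultra2 read 98 mg/dL.
print(f"{percent_difference(156, 98):+.0f}%")  # +59%
```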

So what is the response from the manufacturers? Well, their social media teams love seeing this pop up:

Tweet of one of the tests and response from OneTouch’s social media team.

Of course, the funny thing is, that’s exactly what I now expect from their products.

When it comes to confronting the companies, comparing one brand to another isn’t too useful because each company can simply point their finger at the other. But because I have so many OneTouch meters (in part because I have prescriptions for the strips), it’s easy to go a level deeper with them. To be clear, I’m not singling them out over their competitors; I simply don’t have access to enough supplies at this time.

I’ve spoken with OneTouch customer service on numerous occasions about this issue and can draw up an outline of their script from memory. After the formal event data collection, control solution testing, and assessment of potential harm incurred, they make some attempt to answer my questions. The salient point they try to convey is that they do not recommend testing meters against each other. They also emphasize that their meters pass numerous tests, that they aren’t hearing about this problem from other customers so it’s probably not a broader issue, and that I should choose which meter to use by calibrating it against a lab test. This last point is somewhat curious given the procedure they suggested, which involved checking my finger “within thirty minutes” of getting a lab blood draw, indicating a severe lack of understanding of how quickly blood glucose can change.

When pressed, they did go so far as to say that the Verio line of meters (all those that use Verio strips) does, anecdotally, tend to run a bit higher than the Ultra meters. As for why, I’ve heard everything from “smaller blood sample size” to “it does 500 measurements” to “it’s just newer technology.” So does this mean the Verio technology is more accurate than the Ultra technology? The answer according to OneTouch customer service is a resounding no. So somehow they’re different, but they are still equivalently accurate.

I don’t expect the first or second tier of their customer support group to be able to speak honestly about this issue, even if they know more than they’re letting on. They’re just pawns in this game, unfortunately, and they don’t have much to gain by stepping out of line. At the corporate level, however, I do have higher expectations. Whether this is legal or bureaucratic in nature, the fact that OneTouch has not acknowledged this officially is abhorrent. Do they not have enough data to make a statement about bias between their own product lines, or do they simply not feel obligated because they are technically following the rules? Either way, they are failing the diabetes community.

What to Do?

So given this state of home glucose meter technology, what can you do? Well, there are a number of ways to help yourself and help push the issue forward:

Test your meter periodically against a laboratory standard. You need to do the tests almost simultaneously (sometimes you can even use the same blood, if your phlebotomist is feeling generous) so as to avoid any time lag. If your meter is significantly off, consider other options.
I periodically post images of my tests on Twitter using the #samedrop hashtag, which I apparently reappropriated from people making fun of dubstep. Go ahead and do some side-by-side tests and post them with that hashtag, mentioning the manufacturers as well if you’d like.
If you’re feeling a bit more committed, run your own series of tests and share the data with the community. Even if you only have two meters, and even if they’re the same ones I have, it’s valuable to gather a broader set of data. Feel free to reach out if you have any questions on methodology.
Talk to your endocrinologist, primary care physician, or diabetes educator about meter selection. Why do they recommend the meters that they do? Is accuracy important to them? Are they aware of these potential differences?
Contact your meter manufacturers and ask them to explain the results that you see, assuming they aren’t up to your standard.
Contact the FDA and report an issue with your blood glucose meter via their portal (see links in the Resources section below).

Not everyone has the time or supplies available to do all this testing, which is why we typically entrust the manufacturers to do it and the FDA to ensure it is correct. Given these results, however, it’s clear that we need to do more on our own.

Caveats and Shortcomings

Neither the data, nor my analysis, nor my visualization choices are the final or only say in this matter. I subscribe to the philosophy that there is almost always someone out there who can do it better, so you might as well encourage them to do it. I do try to be explicit with the methods and assumptions I make so that there aren’t any misconceptions. To that end, here are a few potential issues with my work:

While I managed to test a few different strip lots for some of the meters, there are so many lots and so many types of strips that it is hard to cover a decent swath. Still, I didn’t see any appreciable difference between the lots I did manage to compare.
I only have a single meter of each model, so if there is an issue inherent to my specific meter, that would not be exposed. All of the meter/strip combinations passed the control tests (where performed), which should have caught any egregious issues.
I tried to get a wide variety of glucose readings, but the distribution is definitely biased towards the 80–180 mg/dL range. I wasn’t willing to artificially move my glucose to get out of range, but I did make an effort to run additional tests when I found myself there naturally.
I don’t have all of the meters here that I’d like to, which is mainly caused by my lack of access to test strips. Most glaringly, I don’t have any Roche Accu-Chek meters tested.
The effect of testing more than two meters at a time is that, if one of the meters produces an outlier (for whatever reason), it gets amplified during the regression analysis. Basically, this means that the relationships established in tests with multiple meters tend to reinforce themselves compared to tests with only two meters. The effect is reasonably small once you’ve done a number of tests, so I don’t believe it is skewing any of the analysis more than a few percentage points at worst. This is partly why I think the binning analysis is more useful than trying to be precise.
There could always be an error in the spreadsheet.
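One way to picture the multi-meter caveat above is to count pairwise comparisons. The sketch assumes every pair of meters in a test contributes one comparison to the fit, which is my simplification and may not match how the spreadsheet weights things. Under that assumption, a single bad reading in a five-meter test enters four comparisons at once, where in a two-meter test it would enter only one.

```python
import math

def pair_counts(meters_in_test):
    """Return (pairwise comparisons from one sample, comparisons touched by one outlier)."""
    total = math.comb(meters_in_test, 2)
    with_outlier = meters_in_test - 1
    return total, with_outlier

for k in (2, 3, 5):
    print(k, pair_counts(k))
# 2 (1, 1)
# 3 (3, 2)
# 5 (10, 4)
```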