Visualization Redesign: CDC Seroprevalence Surveys

Brian Gourd
Brian Gourd’s Portfolio
Oct 29, 2020

Since March of this year, the CDC has been partnering with commercial research laboratories to study the seroprevalence of SARS-CoV-2 antibodies in certain areas of the country. The study helps the CDC better understand the lifespan of the antibodies, and its results should be expected to track infection levels, since a person only acquires the antibodies after first being infected. While the study is still ongoing, the results from March 23rd to May 3rd have been posted to the CDC’s website and can be found in Figure 1 below.

Figure 1: Antibody Study results from the CDC. March 23 — May 3, 2020

When looking at this graph, the reader may notice some oddities that make the data harder to understand or discourage them from viewing it at all, since an unusual type of graph was used to display it. On closer study, the visualization presents both substantive and perceptual challenges to a viewer attempting to interpret the information.

At a perceptual level, one issue that jumps right out is the use of bubbles to represent the weight of each percentage. While this isn’t an issue in every scenario, it presents serious problems here because the bubbles sit on a set of axes that already have defined values. The viewer knows that 6.93% of individuals tested in the NYC Metro were found to have antibodies, so why does the bubble go above that on the y-axis? Centering the bubbles on their observed y-values makes them appear higher than they should, since the top of each bubble extends beyond its y-value. This visual trick also affects data points disproportionately based on their size. Larger values get a larger circle, and thus extend further from their observed y-value than smaller ones do, so two different values may appear further from one another than they should. Finally, at first glance, the bubbles can confuse the viewer about the x-axis as well. Each bubble spans multiple dates on the x-axis, yet its size is independent of when the study was conducted. The reader may accidentally interpret a bubble as containing only data from the dates covered by its width. In reality, the size of each bubble is proportional only to the percent of individuals that tested positive for antibodies, so the bubbles are unnecessary: this value is already displayed by the y-axis.
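The overshoot described above can be worked through with a little arithmetic. The sketch below assumes, purely for illustration, that a bubble's radius in y-axis units grows linearly with its value; the scale factor `radius_per_unit` and the comparison value of 1.0% are hypothetical, and only the 6.93% figure comes from the original graphic.

```python
def bubble_top(value, radius_per_unit):
    """Highest y-axis position reached by a bubble centered on `value`.

    Assumes (hypothetically) that the radius is value * radius_per_unit.
    """
    return value + value * radius_per_unit


def apparent_gap(a, b, radius_per_unit):
    """Vertical distance between the tops of two bubbles."""
    return abs(bubble_top(a, radius_per_unit) - bubble_top(b, radius_per_unit))


nyc = 6.93    # value shown for the NYC Metro in Figure 1
other = 1.0   # hypothetical smaller value, for comparison only
scale = 0.2   # hypothetical radius scale factor

true_gap = abs(nyc - other)
seen_gap = apparent_gap(nyc, other, scale)
print(f"top of NYC bubble: {bubble_top(nyc, scale):.3f}")
print(f"true gap: {true_gap:.3f}, gap between bubble tops: {seen_gap:.3f}")
```

Under this assumption, every bubble top sits a factor of `1 + radius_per_unit` above its true value, so the distance between any two tops is stretched by the same factor. A different sizing rule (for example, area proportional to value) would change the numbers but not the direction of the distortion.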

The data presented in Figure 1 also has issues with its substance. This doesn’t mean that the data being shown is wrong per se, just that it may not be telling the user what’s important. For example, the data that the graph covers was collected in different months, in different parts of the country, all of which were impacted by SARS-CoV-2 in unique ways. It’s unclear how someone would use this data in a meaningful way without limiting the study to a certain date or area. Displaying two independent variables (date and location) in a 2D space is almost pointless because we can’t see how each one affects the dependent variable. If we fixed one of these variables, then a “seroprevalence by date” or “seroprevalence by location” graph could be made and would no doubt be easier to understand (assuming it’s presented in an understandable way). Unfortunately, trying to break this up by both date and location gives the visualization little impact, since you cannot really draw connections between any data points. This is also because there is only one data point for each location. While the graph appears to show change over time, since the x-axis is by date, it cannot actually do so with only one measurement per location.

It seems like this visualization probably should never have been made, since it doesn’t seem to display any useful information. However, it does a pretty good job of misleading someone who just glances at it or can’t fully understand it, which is why the motivation behind this graphic should be discussed. It was found on the CDC’s Covid Data Visualizations webpage, and I barely had to scroll to find it. While it is not framed in a political way, it does come from a government agency, which may have a motive for presenting data in a certain way. The bias can be seen in the first perceptual issue discussed: the bubbles extending beyond their proper y-values. If a viewer were to quickly glance at the axes and then at the tops of the bubbles, they may come away with a skewed idea of the prevalence of antibodies in these areas. Obviously this is a bit ridiculous, since the percentage is shown within each bubble, but the point is that the graphic has the ability to twist a viewer’s perspective of the data towards a more positive light. This assumes a higher seroprevalence rate is considered a good thing, which really depends on personal concerns. Additionally, when clicking further into the study, I found another visualization containing more data from later dates, shown below in Figure 2.

Figure 2

Without a doubt, this visualization is more meaningful and easier to understand, and the fact that it wasn’t the one displayed on the main webpage again calls into question the motivations of the CDC. It also shows a clear drop-off in seroprevalence estimates in certain areas. This visual may be hidden deeper in the study because the drop-off may confirm some people’s fears about the lifespan of antibodies in our systems, and about the possibility of people catching the virus twice or of it not going away until a vaccine is found.

Recreating the visual from Figure 1 using only the data presented in it was not an easy task. There isn’t much data at all, and the purpose of the original visualization was unclear. Presenting it as a seroprevalence vs. time line graph would be misleading, since it would omit the fact that the data was collected from different areas of the country. The same goes for a seroprevalence by location bar graph, which would omit that the data is from different months of the year. In the end, the only clear and easy-to-understand way to present all the data from the original figure was to create three bar graphs, one for each sampling period. These graphs are shown below in Figure 3.

Figure 3: My redesign of Figure 1. Three simple bar graphs
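A layout like Figure 3 could be sketched roughly as follows. This is not the code used to produce the figure: the function names are hypothetical, the record labels and percentages are invented placeholders rather than CDC values, and the plotting step assumes matplotlib is installed.

```python
from collections import defaultdict


def group_by_period(records):
    """Map each sampling period to its list of (location, percent) pairs."""
    grouped = defaultdict(list)
    for period, location, percent in records:
        grouped[period].append((location, percent))
    return dict(grouped)


def plot_periods(grouped):
    """Draw one bar chart per sampling period (requires matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily so grouping works without it

    # squeeze=False keeps `axes` two-dimensional even for a single period
    fig, axes = plt.subplots(1, len(grouped), sharey=True, squeeze=False)
    for ax, (period, rows) in zip(axes[0], grouped.items()):
        locations = [loc for loc, _ in rows]
        percents = [pct for _, pct in rows]
        ax.bar(locations, percents)
        ax.set_title(period)
    axes[0][0].set_ylabel("Seroprevalence (%)")
    fig.tight_layout()
    return fig


# Placeholder records: labels and values are invented for illustration only
records = [
    ("Period A", "Site 1", 5.0),
    ("Period A", "Site 2", 2.0),
    ("Period B", "Site 3", 3.0),
    ("Period C", "Site 4", 1.0),
    ("Period C", "Site 5", 4.0),
]
grouped = group_by_period(records)
```

Calling `plot_periods(grouped)` would then produce three side-by-side panels. Because bars start at zero and end exactly at their value, nothing extends past the observed percentage the way the bubbles do. The shared y-axis is a design choice for this sketch, not something taken from Figure 3.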

Obviously, these graphs aren’t glamorous, and they still don’t really display much, but they seem like the best that can be done with the data given. I chose to simplify the original design by changing from bubbles to bars, since a bar graph seems to be what the original designer should’ve been going for in the first place in order not to mislead. Additionally, I broke the graphs up by sampling period because the dates in the original visualization had no way of connecting the data, since the sampling locations varied between dates. This is still the case; it is now just clearer that the different sampling periods are unrelated to one another. These visualizations also illustrate how little information is really presented in the original graphic. Nothing can be done with the information from the second sampling period without another location to compare it to, and the other sampling periods show differences between totally different, unrelated areas of the country. Without the data from Figure 2, nothing meaningful can be displayed, which again calls into question why Figure 1 was on the main page and Figure 2 was behind a link, at the bottom of the study.

In the end, while my redesign of Figure 1 certainly simplifies the information presented, it probably does a better job of showing why Figure 1 should’ve never been made in the first place. All the data points are unrelated in some way, and the information presented has no clear use, other than to mislead. It’s probably best described by saying that the original visualization was a complicated way of saying nothing.
