Data Overload: How to Deal With Too Much Information

Ed Rushton
Published in Data to Decision
Nov 15, 2022

Why the devil isn’t always in the detail

Organisations from every sector of the economy are facing a deluge of data, and there is an expectation that we should be using every last drop: decisions must always be data-driven, data-led, evidence-based. In many cases, this is too much — a recent survey found that 95% of employees were overwhelmed by data, and that people were increasingly relying on gut-feel due to the information overload.

Faced with complex problems and/or too much data to handle, what should we do? Ignore the data and rely on instinct; or stop procrastinating, accept that it’s going to be hard, and just dive into the detail?

Start small and simple

Keeping it simple doesn’t mean ignoring the detail, but it does mean stepping back to try to understand causal relationships: what is the big picture; which factors are most likely to affect the outcome significantly; and, crucially, which are not?

From the outset, we need to be able to see the wood for the trees, and then keep the wood in mind as we undertake detailed analysis, if detailed analysis turns out to be necessary at all.

Many analysts, researchers and data scientists (myself included) have a tendency to get sucked into the detail. We want to know how-stuff-works and how-people-behave at the micro level. What are the mechanics of the situation? And this extends to data — down to asking why we see specific values in certain rows in a database. This is valuable in lots of situations, especially in scientific research seeking to deepen our understanding of the world.

However, it’s also a recipe for getting lost in the weeds, and losing sight of the reasons for doing the analysis. This is especially important in the context of decision making. Before jumping down the rabbit hole of a line-by-line analysis we ought at least to believe it possible that the analysis will change the direction of the decision.

The high-level process set out below is designed to help you navigate complex problems and guide your data analysis to help reach decisive conclusions — and avoid analysis paralysis. The final section then considers a worked example of understanding article engagement (a field where I have already been down many rabbit holes): what actions should I take to improve engagement with this blog post?

Before reading on, get in touch to find out more about how we handle these problems at the Efficient Data Group.

The problem is defining the problem

  1. Define the questions you want to answer.
  2. Think causally about your problem. Without worrying about the data, what are the most likely causal explanations?
  3. Match these causal mechanisms to your data. What can you actually measure, and what data is already being collected? It can be useful to think in terms of the mechanisms which are likely to have generated the data: are these the same as the causal processes from the previous step?
  4. Review the original questions. Given your data, what questions can you answer without making too many assumptions?
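Steps 3 and 4 can be sketched as a simple lookup: for each hypothesised causal factor, record whether a measurable proxy exists in your data, then keep only the questions whose factors are all covered. This is a minimal illustration, not a real tool — the factor and proxy names below are taken from the worked example later in this article, and are purely illustrative.

```python
# Illustrative only: map each hypothesised causal factor to a
# measurable proxy, or None if nothing in our data covers it.
factor_proxies = {
    "article quality": None,          # no direct measurement exists
    "topic relevance": None,          # would need rich data about the user
    "device type": "user_agent",      # typically available in web analytics
    "engagement": "time_on_page",     # imperfect, but measurable
}

def answerable(question_factors, proxies):
    """A question is answerable only if every causal factor it
    depends on has some measurable proxy in the data."""
    return all(proxies.get(f) is not None for f in question_factors)

# "What drives engagement?" depends on quality and relevance: not answerable.
print(answerable(["article quality", "topic relevance"], factor_proxies))  # False
# "Does device type relate to engagement?" is answerable, for what it's worth.
print(answerable(["device type", "engagement"], factor_proxies))           # True
```

The honest part of steps 3 and 4 is filling in those `None` entries and accepting what they imply, rather than quietly substituting a proxy that doesn’t really measure the factor you care about.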

Uncomfortable honesty is required at these last two stages. Everyone wants a success, but we have to avoid simply willing it into existence. All too often, the questions you can answer are not the ones you would like to answer. Similarly, the things you can measure are not the things you are asking questions about.

Knowing when to stop

If you can answer the original questions, then great, dive into your analysis. But … only go into detail if you are convinced it could affect the answer to the original question. Once an answer becomes obvious, then stop.

This is easier said than done. Intellectual curiosity will always be tempting you into more detail: to try a new modelling technique, to clean some more input data. But you must resist! First convince yourself that the extra work is genuinely important to the decision-making exercise.

If, on the other hand, you conclude that you can’t answer the original question, then congratulations! This is a difficult conclusion to reach, but a vital one in conducting analyses which will lead to the right decision being made. In this case you have two options:

  • iterate, by redefining the goal with the questions you can answer, as long as it will still be useful to the final decision; or,
  • stop.

Again, honesty is required. Sometimes it’s important to admit that the data cannot answer your questions, no matter how hard you try. All too often, individuals and organisations will avoid this awkward conclusion, and keep plugging away at some analysis which is never going to answer the questions being asked.

Don’t stop monitoring

One final point. Although you might not be able to guide your current decisions using data, you may still be able to gain valuable insights in the future by looking back at historic data. So, don’t stop collecting the data.

Are you still reading this?

In this final section, I look at the question of reader engagement as a worked example of the process outlined above. Step one is to define the goal: I want to understand what I can do to improve engagement with my blog posts.

The second step requires that I try to understand the drivers of engagement. There are lots of possible reasons why users might or might not engage. For example, the quality of the article; the relevance of the topic to the user; the type of device they are using; their emotional state whilst reading; how the user found the article; and so on. The list of possibilities is endless. However, we can make a reasonable assumption that, on average, the first two listed are going to be very important, and the others less so.

Now comes the tricky part: matching what we have done so far to the data. Can we actually measure engagement, quality of the article, or relevance of the topic to the user? Simply put, no.

We can suggest a reasonable proxy for engagement such as the time spent on the article page, but we certainly don’t have anything on article quality, and it’s unlikely we would ever have sufficient information about a user to judge whether the topic was relevant (unless, of course, you are Google).

So, where does this leave us? First, we should redefine the goal to make it transparent that we are addressing a different question: what can I do to increase the time users spend on blog post pages? Laid bare, this doesn’t sound quite right. Longer articles are probably going to increase time on page, so perhaps we need to iterate and suggest a new proxy of something like time spent per word.
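The time-per-word proxy is just a normalisation: divide time on page by article length, so that long articles don’t win by default. A minimal sketch, assuming we have time on page in seconds and a word count for each article (the function name and figures are illustrative):

```python
def time_per_word(time_on_page_s: float, word_count: int) -> float:
    """Seconds spent on the page per word of article text — a crude
    length-normalised proxy for engagement."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return time_on_page_s / word_count

# A long article can rack up more raw time on page than a short one
# while actually being read less closely:
long_article = time_per_word(240.0, 2400)   # 0.1 s/word
short_article = time_per_word(90.0, 600)    # 0.15 s/word
```

On raw time on page the long article looks twice as engaging; per word, the short article comes out ahead. That inversion is exactly why the proxy needs iterating before we trust it.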

More fundamentally, I don’t have any way of measuring the two principal causes I identified: article quality, and topic relevance to the user. So, perhaps I just need to stop.

But wait! I can measure device type, and that might affect time on page. This is true, but I can’t realistically influence the devices that readers use, and so it isn’t directly relevant to helping me understand what I can do to increase time spent per word (as a proxy for engagement).

The conclusion is that the answer to the original question is not in my data. I’m happy that article quality and topic relevance are likely to be the most significant factors affecting user engagement. Based on this, I need to strive to write better articles, and try to get these articles published in appropriate places. These answers don’t come from the data, and I’m not going to find a better answer in the data.

This might feel like giving up. It really isn’t: it’s just being honest about where the data can and can’t help you, and avoiding doing unnecessary analysis.

However, the final caveat is that I could run experiments. As long as I continue collecting data, I can experiment with different changes and see how they affect my proxy performance metric. But, if you are still reading this, that will have to wait for another blog post.

Originally published at https://www.efficientdatagroup.com on November 15, 2022.


Data scientist & economist at The Efficient Data Group (https://www.efficientdatagroup.com/) — interested in AI ethics and consciousness