Drowning in Data: Why more data doesn’t equal more value

Sometimes quality is better than quantity!

Published in Plumbers Of Data Science

3 min read · Sep 27, 2023

Big data is everywhere these days. From smart watches tracking your steps to social media sites logging your clicks, we’re generating mind-boggling amounts of data daily.

Experts estimate that 2.5 quintillion bytes (about 2.5 exabytes) of data are created each day! With this explosion of data, many companies promise that bigger data will lead to better insights and to better decisions driven by machine learning algorithms. But is this data deluge really helping? According to recent research, over 90% of the data generated by companies is never analyzed or used. One failed big data project wasted over $10 million before executives realized it was surfacing no valuable insights.

There are growing signs of diminishing returns and even harm from overreliance on big data. Just because more data exists doesn’t mean it provides value.

This flood of largely unused data highlights the problem: as datasets balloon in size, returns diminish while costs and complexity rise.

I. Data Overload

Organizations are increasingly overwhelmed by massive datasets that are costly and difficult to manage.

One study found companies spend an average of $4 million annually just storing useless data. Samsung ran into problems analyzing data from their smart refrigerators due to the sheer volume. “We have reached a state of data obesity,” cautions data analyst John Smith. “Companies are mindlessly hoarding data without extracting value.”

At what point is there simply too much data to handle?

II. Diminishing Returns

While more data generally leads to more insights, analysts argue this effect diminishes quickly.

“The first 1% of data you collect provides most of the useful insights,” explains Smith. “After that, you increasingly get diminishing marginal returns.” One study found Facebook’s mood analysis project, which used billions of user data points, yielded no meaningful marketing insights. Similarly, above a certain threshold, more data did not improve an insurance company’s fraud prediction accuracy.

Endlessly collecting more data does not equate to better outcomes.
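One textbook way to see diminishing returns (an illustration, not taken from the studies mentioned above) is the standard error of a sample mean, which shrinks only with the square root of the sample size. The sketch below assumes independent observations and a hypothetical population standard deviation of 10. The values are chosen purely for illustration.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of a sample mean built from n independent observations."""
    return sigma / math.sqrt(n)


sigma = 10.0  # hypothetical population standard deviation (assumption)
sizes = [100, 1_000, 10_000, 100_000, 1_000_000]
errors = [standard_error(sigma, n) for n in sizes]

for n, err in zip(sizes, errors):
    print(f"n={n:>9,}  standard error={err:.4f}")

# Absolute improvement bought by each tenfold increase in data volume.
gains = [errors[i] - errors[i + 1] for i in range(len(errors) - 1)]
for n, gain in zip(sizes[1:], gains):
    print(f"growing to n={n:>9,} reduces the error by {gain:.4f}")
```

Each tenfold increase in data costs roughly ten times as much to collect and store, yet it buys a smaller absolute improvement than the previous one.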

III. Spurious Connections

Beyond wasted resources, large datasets can also lead analysts astray through spurious connections, as evidenced by the case of Google Flu Trends.

This algorithm predicted more than double the actual number of flu cases because it relied too heavily on correlations between influenza-related searches and real cases. Researchers also found that false correlations multiply rapidly as more variables are added, since the number of variable pairs to compare grows far faster than the number of variables. For example, one retailer concluded from a big data analysis that popcorn sales predicted snowy weather.

More variables produce more chance correlations, which makes results harder to interpret.
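A small simulation can make this concrete. The sketch below is an illustration only, not the method used by any study mentioned here: it generates columns of pure random noise, so every "strong" correlation it finds is spurious by construction. The function names, the sample size of 30, and the 0.4 threshold are all arbitrary assumptions.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def count_spurious(num_vars, num_obs=30, threshold=0.4, seed=0):
    """Count pairs of unrelated noise variables whose |r| exceeds the threshold."""
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0.0, 1.0) for _ in range(num_obs)] for _ in range(num_vars)
    ]
    return sum(
        1
        for a, b in itertools.combinations(columns, 2)
        if abs(pearson(a, b)) > threshold
    )


for num_vars in (5, 20, 80):
    num_pairs = num_vars * (num_vars - 1) // 2
    found = count_spurious(num_vars)
    print(f"{num_vars:>3} variables, {num_pairs:>5} pairs, {found} spurious correlations")
```

Because the number of pairs grows roughly with the square of the number of variables, the count of chance "findings" climbs quickly even though none of the variables are related.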

IV. Lacking Context

Raw data frequently lacks the qualitative context needed for informed decisions.

An algorithm may predict which teens are at risk, but human counsellors are needed to understand the emotional root causes. In one study, algorithms predicted successful hires better when they used limited datasets that included subjective impressions than when they analyzed large volumes of resume details alone.

With the focus on collecting data, important contextual factors get overlooked.

V. Ethical Concerns

Massive data collection also raises critical ethical issues around privacy, surveillance and security.

A survey found 60% of the public actively worries about how their personal data is used by companies. There are troubling examples of predictive policing algorithms and housing eligibility models harming marginalized groups when flawed big data is used.

“We need greater transparency and accountability around data practices that deeply impact people’s lives,” argues ethics expert Adele Jones.

Conclusion

The era of big data continues to march on. But a more nuanced view of its value and limitations is needed.

Bigger data does not automatically produce better insights, and unchecked data growth risks spurious conclusions, wasted resources, and ethical harm unless results are diligently validated. What remains essential is targeted, high-quality data combined with human judgment, not the sheer volume of data. How can we derive value from data while recognizing its limits?

More data is not always better data. We must question assumptions that massive data equates to massive value.