The Stupidity of the Crowds

Several well-meaning and important clients have clung to the following fallacy: collect as much data as you can, pour it all into a pot, and let the machine learning algorithms return the gold.

This shaky foundation is perpetuated by the popularity of the notion of “the wisdom of crowds”, which is downright false in many cases. It is not the case that adding more humans to your crowd, or more data to your database, automatically makes for better predictions. In certain situations it can, but not always. How best to combine information to make a decision, when you don’t know in advance how accurate your advisors are, is a mathematical problem that has been solved, and the answer is that adding more advisors may or may not help:

If you use the right algorithm, your prediction error has an upper bound which is, very approximately:

your prediction error ≤ prediction error of the best advisor + log(number of advisors combined) [1]

So adding advisors who have no chance of being the best advisor can only make things less accurate: each one inflates the log term without improving the first. When advisors each hold a piece of the puzzle, i.e. when each advisor has real but partial knowledge of the task, then combining their knowledge is worthwhile. But choosing the right advisors matters.

And to be clear, this holds whether the “advisors” are real people or additional data streams. Look at the formula: adding extra, less information-rich data streams can decrease the predictive power of your machine learning efforts.
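To make the advisor-combination point concrete, here is a minimal sketch of the Weighted Majority Algorithm of Littlestone and Warmuth mentioned in [1]. The setup is an illustrative assumption, not a figure from the text: one 90%-accurate advisor is mixed with twenty coin-flip advisors, and the algorithm halves the weight of any advisor that guesses wrong, so the noise is quickly demoted.

```python
import math
import random

def weighted_majority(rounds, outcomes, beta=0.5):
    """Combine binary advisor predictions with the Weighted Majority
    Algorithm. rounds[t][i] is advisor i's 0/1 prediction at round t;
    outcomes[t] is the true 0/1 outcome. Returns the number of
    mistakes the combined predictor makes."""
    weights = [1.0] * len(rounds[0])
    mistakes = 0
    for preds, truth in zip(rounds, outcomes):
        vote_for_1 = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_for_0 = sum(weights) - vote_for_1
        if (1 if vote_for_1 >= vote_for_0 else 0) != truth:
            mistakes += 1
        # demote every advisor that got this round wrong
        weights = [w * beta if p != truth else w
                   for w, p in zip(weights, preds)]
    return mistakes

# Illustrative setup: one 90%-accurate advisor plus twenty coin-flippers.
random.seed(0)
T = 500
outcomes = [random.randint(0, 1) for _ in range(T)]

def make_advisor(accuracy):
    return [o if random.random() < accuracy else 1 - o for o in outcomes]

advisors = [make_advisor(0.9)] + [make_advisor(0.5) for _ in range(20)]
rounds = [[a[t] for a in advisors] for t in range(T)]

mistakes = weighted_majority(rounds, outcomes)
best_m = min(sum(p != o for p, o in zip(a, outcomes)) for a in advisors)

# Guaranteed bound for beta = 1/2: mistakes <= 2.41 * (best_m + log2(n))
bound = 2.41 * (best_m + math.log2(len(advisors)))
print(mistakes, best_m, bound)
```

The guarantee is the whole point: the combined predictor is never much worse than the best advisor, but every extra coin-flip advisor inflates the log term, so piling on uninformative data streams buys you nothing.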

Google Flu Trends Disaster

As a concrete example of more not necessarily meaning better, consider one of the most public failures of big data: Google Flu Trends.

Google Flu Trends is a service, launched in 2008, that uses the frequencies of 45 or more Google search terms to forecast the prevalence of flu cases in doctors’ surgeries. Some hailed it as the ultimate way to forecast the future, but it has been significantly off, not just once, but for the last three years running. So far off, in fact, that a straight trend line drawn from the preceding few weeks would have been more accurate [2].

There is one other important figure: 80–90% of patients incorrectly self-diagnose, so it is no surprise that the information density of all those searches for flu-related keywords was pretty low.

Google Flu Trends provides a direct parable for social trend monitoring, and for predicting any media campaign from historical big data. The immediate takeaway: keep your input data as clean as possible, and if it isn’t, make sure to combine it appropriately.

Future tech is even bigger data

So we have to hunt out quality data sources: sources that tell us a lot about what we want to predict. Secondarily, we should hunt out complementary data sources: data that tells us about different, non-overlapping aspects of what we want to predict.

And this understanding is very relevant as we stand on the edge of even bigger data.

Data from your mobile is sold by carriers as market intelligence, anonymised into blocks of indistinguishable records. It includes demographics, location, time and websites accessed. From a marketing point of view, it lets you do things like see which sites are being checked where. For example, if you run a Marks & Spencer store, you can find out which other clothing stores your customers are checking from their mobiles while in-store. This is yesterday’s data.

Your heart rate, increasingly accurate in-store location data, facial expression recognition: in aggregated, anonymised form, these data streams are almost certain to arrive at marketers’ disposal soon.

Internet of things = even bigger data.

We will have wristbands which pulse in real time with the emotions of our significant others [3]. And there will be data providers, anonymising, analysing and selling the insights.

But if you’re data-mining, make sure you mine at the right coal face, where the insights actually are. More does not necessarily mean better. Information density is the foundation of value; without it, no amount of money spent on a new machine learning algorithm will make a difference.


Dr. Finn Macleod runs a smart targeting and verification service to grow your Twitter following. It uses data from outside of Twitter to make sure the accounts you follow are real. You can check it out here:

[1] See the Aggregating Algorithm (the work of V. Vovk) and the Weighted Majority Algorithm (Littlestone and Warmuth).

[2] “The Parable of Google Flu: Traps in Big Data Analysis”, David Lazer, Ryan Kennedy, Gary King and Alessandro Vespignani.

[3] See for example some work by Roz Picard:
