4 Pain Points of Big Data
and how to solve them
by Teresa Nick
Big data analytics is an amazing tool at the epicenter of the digital revolution. But it’s not foolproof. Here’s how successful companies deal with its potential drawbacks.
Big data has a lot to teach us, and we have a lot to learn. While pioneering ventures into big data have resulted in remarkable success, there have been a few high-profile failures as well. Misleading information about lead levels in Flint, Michigan, drinking water resulted in a slow response to the crisis. In 2009, a sophisticated flu-detection algorithm missed an unseasonal outbreak. And polls have predicted that something wouldn’t happen, only for it to happen.
These failures are shocking at first, but when we dig a little deeper, we find that all technical failures in big data analytics can be explained by a simple model with four failure modes. This means that if we ask the right questions and pay attention, we can avoid making costly mistakes moving forward.
The bird’s-eye view
Business requirements, data, bias-variance trade-offs, and people all have an impact on analytics systems. But for brevity, this article focuses on technical failures that we can prevent.
Technical error encompasses both bias, which indicates how far the model is off from reality, and variance, which obscures the data signal.
Bias can originate as a business decision that leads to data-interpretation errors when the business case does not fit requested functional information. Bias can also originate as a technical decision that leads to analytics failures when a bad choice of model poorly fits the raw data.
In order to be successful, both business leaders and data scientists need to agree on (1) the required functional information for translation of raw data into actionable business insights and (2) the quality of the data for determining confidence in those insights.
A simple schematic of a complex system
Big-data analytics is an iterative process that progresses from the identification of a business need, to question formulation, to model design, to data acquisition/analyses, to addressing the business need with a business solution.
Here is a diagram of the simplified big data analytics system:
Simplified diagram of the big data analytics building and monitoring process.
The four key stages in the loop are:
1. Form the question and answer
Starting from the upper left of the figure, business leaders identify a business need and pose a question that functional information should answer. The question and answer may be reformed based on new functional information from within the analytics loop.
2. Form the model
Moving right in the figure above, we next design and build a machine-learning model that provides the requested information from the available data.
3. Acquire and analyze data
Raw, unprocessed data are acquired in a variety of ways, from electronic sensors to surveys. Once the data are in a standard form, they can be analyzed and used iteratively to train the model (lower right).
4. Address the business need
Ideally, the analysis results in functional information that can address a business need (lower left). This information may be used, for example, to place an ad for a specific product based on user clicks in a browser screen, or to trigger an alert on an app based on motion in a driveway.
Bias arises in stages 1 and 2, the upper part of the loop, and affects how data are interpreted. Variance arises in stages 3 and 4, the lower part of the loop. It clouds incoming data signals, leads to overfitting of models, produces poor information, and reduces the ability to make accurate business decisions.
Real life reveals key failure points
Failures tend to occur during four key decision points of the data-analytics model (highlighted with yellow boxes in the figure). Here are examples of each, with some recommended safeguards:
1. Are the data high quality?
The old adage “garbage in, garbage out” (GIGO) never rang truer than in this era of big data. Poor-quality input will always produce faulty output.
Example: The water crisis in Flint, Michigan. Improper treatment of river water produced drinking water that contained high levels of lead and may have caused an outbreak of Legionnaires’ disease. According to authorities involved in the event, misleading information led to bad decisions and slowed the response to the city-wide water crisis.
Example: Social-media bots and the uptick in ‘fake news.’ Numerous recent headlines have resulted from the activities of bots on social media. The proliferation of fake news, content pollution, “astroturfing,” and the like has added much noise to social-media data, which makes it difficult to test hypotheses and develop models.
Evaluate the data. Most organizations will give their data a cursory check, but you need to really examine the data source to evaluate whether it’s credible. That includes understanding its context. Those providing misleading information in Flint likely felt that their personal livelihoods were at stake. Understanding this fact may have led the organizational leaders to question the data and dig deeper. Instances of deliberate skewing of social-media posts using bots present another example when data must be interpreted in the context of much misleading noise. A true quality-control capability is needed.
Build analytics skills in leadership. To prevent bad decisions based on bad data, leaders need a basic level of data-analytics education to help teams evaluate data. Companies could also create a C-level “analytics expert” position, such as a chief data officer or a chief analytics officer, who won’t necessarily review each piece of data that comes in, but will have the experience to put protocols and processes in place to determine if the data are good.
Independently verify. Because of the ubiquity of noise in data, independent verification can help qualify data and reveal its underlying truth.
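As a rough illustration of independent verification, the sketch below compares values reported by one source against a second, independent set of measurements and flags the records that disagree by more than a chosen tolerance. The function name, the 20% tolerance, and the sample readings are all hypothetical, invented only to show the idea.

```python
def flag_discrepancies(reported, independent, tolerance=0.2):
    """Return the keys whose two measurements differ by more than
    `tolerance`, expressed as a fraction of the larger value."""
    flagged = []
    for key, reported_value in reported.items():
        check_value = independent.get(key)
        if check_value is None:
            continue  # no independent measurement available for this record
        scale = max(abs(reported_value), abs(check_value))
        if scale > 0 and abs(reported_value - check_value) / scale > tolerance:
            flagged.append(key)
    return flagged


# Made-up readings, for illustration only.
reported = {"site_a": 4.0, "site_b": 3.0, "site_c": 5.0}
independent = {"site_a": 4.2, "site_b": 12.0}

suspect = flag_discrepancies(reported, independent)
print(suspect)
```

A flagged record does not say which source is wrong, only that the two disagree enough to justify a closer look.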
Avoid “big data hubris.” More data does not always correlate with better models or more reliable findings. Make sure to test models with new data.
2. Does the model fit the data?
While machine-learning model issues come in an almost infinite set of permutations, they essentially boil down to a model being too simple, too complex, or fit to the wrong data. If models are fed training data that are sparse or not representative of the data they will see in production, they typically will be “underfitted” and will make mistakes. If models are trained on bad data or “overfitted” to a specific data set during training, they will make mistakes or perform in ways their creators did not anticipate.
Example: Missed epidemics. A sophisticated flu-detection algorithm used searches for flu-related terms to predict epidemics. It overestimated flu outbreaks, most likely because it failed to account for changing inputs to the model due to regular improvements in the main search algorithm (i.e., the training data differed from the data that the algorithm received in production).
Example: A chatbot that hung with the wrong crowd. A chatbot was shut down on Twitter after one day due to offensive comments. It had the ability to repeat users’ phrases and “learn” from them. Some users exploited this vulnerability to have the bot learn to make inflammatory statements. This is an example of a model that was essentially “manipulated” by data it wasn’t designed to filter.
Check for common machine-learning pitfalls. Overfitting and underfitting are well-described pitfalls of machine learning that can be detected by comparing new, non-training data with the model. Discrepancies likely mean the model needs to be updated. In dynamically shifting environments, it’s critical to employ iterative testing of models, as well as cross-validation and regularization techniques.
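To make the comparison with non-training data concrete, here is a minimal sketch of k-fold cross-validation in plain Python. It contrasts a simple straight-line fit with a deliberately overfitted "memorizer" (a 1-nearest-neighbour lookup). The data are synthetic, and every function name is an assumption made for this illustration, not part of any particular library.

```python
import random
import statistics


def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k interleaved folds."""
    for fold in range(k):
        test = list(range(fold, n, k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return lambda x: mean_y + slope * (x - mean_x)


def fit_memorizer(xs, ys):
    """1-nearest-neighbour lookup: memorizes every training point, noise included."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mean_squared_error(predict, xs, ys):
    return statistics.fmean([(predict(x) - y) ** 2 for x, y in zip(xs, ys)])


def cross_validated_errors(fit, xs, ys, k=5):
    """Return (mean training error, mean held-out error) over k folds."""
    train_errors, test_errors = [], []
    for train, test in k_fold_indices(len(xs), k):
        train_x, train_y = [xs[i] for i in train], [ys[i] for i in train]
        test_x, test_y = [xs[i] for i in test], [ys[i] for i in test]
        model = fit(train_x, train_y)
        train_errors.append(mean_squared_error(model, train_x, train_y))
        test_errors.append(mean_squared_error(model, test_x, test_y))
    return statistics.fmean(train_errors), statistics.fmean(test_errors)


# Synthetic data (invented for illustration): a straight line plus noise.
rng = random.Random(0)
xs = [i / 40 for i in range(400)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]

line_train, line_test = cross_validated_errors(fit_line, xs, ys)
memo_train, memo_test = cross_validated_errors(fit_memorizer, xs, ys)
print(f"line:      train={line_train:.2f}  held-out={line_test:.2f}")
print(f"memorizer: train={memo_train:.2f}  held-out={memo_test:.2f}")
```

The memorizer reproduces its training data perfectly, so its training error is zero, yet it does worse than the simple line on data it has not seen. A large gap between training and held-out error is the warning sign to look for.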
Minimize preprocessing shifts or retrain the model. Tech companies cannot be asked to stop improving their main search algorithms. However, if possible, shifts in the inputs to machine-learning algorithms should be avoided after the model is trained. If the system needs to be more dynamic, then iteratively retraining the model can help optimize performance.
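One simple way to notice that model inputs have shifted since training is to compare a feature's average in production with its average in the training data, measured in units of the training standard deviation. The sketch below assumes a single numeric feature. The 0.5 threshold and the sample values are hypothetical and would need tuning for a real system.

```python
import statistics


def mean_shift_in_std_units(train_values, live_values):
    """How far the live mean has moved from the training mean,
    in units of the training standard deviation."""
    train_mean = statistics.fmean(train_values)
    train_std = statistics.stdev(train_values)
    return abs(statistics.fmean(live_values) - train_mean) / train_std


def needs_retraining(train_values, live_values, threshold=0.5):
    """Flag the model for retraining when the input shift exceeds the threshold."""
    return mean_shift_in_std_units(train_values, live_values) > threshold


# Made-up feature values, for illustration only.
train_values = [10, 12, 11, 13, 9, 10, 12, 11]
live_stable = [11, 10, 12, 11]
live_shifted = [18, 19, 17, 20]

print(needs_retraining(train_values, live_stable))
print(needs_retraining(train_values, live_shifted))
```

A check like this catches only shifts in the mean. Changes in spread or in the relationship between features would need additional tests.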
Minimize algorithm dynamics or monitor the outputs. Minimizing the dynamics of an algorithm makes it more predictable. However, in a learning algorithm, minimizing algorithm dynamics is not possible because the algorithm undergoes changes as it “learns.” In this case, monitoring the outputs and responding quickly are viable options. Another approach would be to successively train and test the algorithm on wider and wider user bases.
3. Should the business have confidence in the data?
Getting the right information to answer the right business question fundamentally relies on communication between business and technical units. While section 1 focused on determining data quality, the focus here is on whether technical teams communicate data-quality information to business leaders or others using the data.
Example: Sampling methods skew political polls. The recent influx of wrong predictions made by political pollsters is a good example of questionable information that was poorly communicated. Most polls predicted Conservatives and Labour in a dead heat for the 2015 UK general election; the results were a strong Conservative win. Most polls predicted approval for the Colombian peace deal with FARC rebels; it was defeated. Most polls said that UK voters would vote against the UK leaving the European Union; the results were for Brexit — leave. A study by the Market Research Society and British Polling Council found that, for the 2015 UK election, “the polling miss was caused by unrepresentative samples.” Issues with sampling and their detrimental effects on analysis results were not well communicated running up to the election.
Example: Analytical models overinterpret poll data. In many US states, the votes for Clinton vs Trump were within the margin of error, yet some pollsters reported that their models predicted the result with high confidence. The high variance of the underlying data was not reflected in the reported confidence levels.
Communicate. For any analytics data set, it’s critical to communicate how well the sampled data reflect reality, i.e., to provide a grounded confidence reading on the output. This will help the business better calibrate decisions.
Embrace the margin of error. If findings are within the margin of error, then they are within the margin of error. No mathematical model is going to make the data cleaner or more reliable. It is critical to communicate levels of confidence in the data and model to the business team before they make decisions.
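As a worked example of what "within the margin of error" means, the sketch below estimates the 95% margin of error on the lead between two poll options measured in the same sample. It assumes a simple random sample, which real polls often are not, so the true uncertainty is usually larger. The shares and sample size are invented for illustration.

```python
import math


def lead_margin_of_error(share_a, share_b, sample_size, z=1.96):
    """Approximate 95% margin of error on the lead (share_a - share_b)
    when both shares come from the same simple random sample."""
    lead = share_a - share_b
    variance = (share_a + share_b - lead ** 2) / sample_size
    return z * math.sqrt(variance)


def describe_lead(share_a, share_b, sample_size):
    """Say whether the observed lead is distinguishable from a tie."""
    lead = abs(share_a - share_b)
    margin = lead_margin_of_error(share_a, share_b, sample_size)
    if lead <= margin:
        return "within the margin of error"
    return "outside the margin of error"


# Hypothetical poll of 1,000 respondents.
print(describe_lead(0.48, 0.46, 1000))
print(describe_lead(0.55, 0.40, 1000))
```

Note that the margin on the lead is roughly twice the margin usually quoted for a single share, which is one reason a two-point lead in a poll of this size says very little.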
Be skeptical. Machine-learning tools should pull findings out of the data — not create them out of thin air. Be cautious if there are dramatically different results with repeated sampling, or when splitting a sample and comparing results of the subsamples.
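The split-sample check described above can be sketched in a few lines: shuffle the sample, split it in half, compute the same statistic on each half, and repeat. If the halves often disagree by a lot, the finding is fragile. The function name, the number of repeats, and the example values are assumptions made for this illustration.

```python
import random
import statistics


def split_half_gaps(values, estimator=statistics.fmean, repeats=200, seed=0):
    """Repeatedly split the sample into two random halves and return the
    absolute difference between the estimator computed on each half."""
    data = list(values)
    if len(data) < 2:
        raise ValueError("need at least two observations to split")
    rng = random.Random(seed)
    half = len(data) // 2
    gaps = []
    for _ in range(repeats):
        rng.shuffle(data)
        gaps.append(abs(estimator(data[:half]) - estimator(data[half:])))
    return gaps


# Made-up measurements, for illustration only: mostly similar, two outliers.
measurements = [10, 11, 9, 10, 12, 10, 11, 9, 60, 75]

gaps = split_half_gaps(measurements)
print(f"typical gap between halves: {statistics.median(gaps):.1f}")
print(f"largest gap between halves: {max(gaps):.1f}")
```

Large gaps relative to the size of the effect being reported suggest that a handful of observations are driving the result.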
4. Does answering the question help the company?
If you’ve taken the previous steps, you have good data, you’ve got a great model that fits the data, and the information answers, with confidence, the question that you asked. So what could go wrong? Sometimes, big data analytics may not address the business need. Even worse, they may hurt the company more than they help.
Example: Premature pregnancy predictions. A retailer applied big data analytics to customer data for the prediction of pregnancy. The idea was to send targeted ads to expectant parents in order to build brand loyalty early. As it turned out, the model was better at predicting pregnancy than were immediate family members, which resulted in some bewildered would-be grandparents and negative press because of privacy concerns.
Example: Racial profiling for ads. A social media company enabled advertisers to use ethnicity as a filter for ad displays. If the advertisers were selling homes or recruiting employees, then the filtering was actually illegal in the US. Several news outlets reported negatively on this story.
Keep privacy concerns top of mind. Privacy continues to impact consumer acceptance of big data analytics and the Internet of Things. Regulatory bodies are particularly concerned about privacy issues, with laws varying by geography.
Consult customers. Consulting with customers can help with issues of data value and privacy in big data analytics. For example, feedback from focus groups could help guide the usage of data that may be considered private.
So, as you continue to solve big challenges with big data, don’t forget to ask the right questions and build the right methods. Though big data analytics are often taken as gospel, the truth is, humans still need to lead the way.