Predicting On Real World Data
The great majority of data analysis is concerned with discovering relationships in data by fitting it to a line. How does increasing rainfall affect the amount of coffee sold in convenience stores in the Pacific Northwest? What are the effects of higher afterschool funding on long-term academic success? These are relationships that one could reasonably expect to follow a simple trajectory. But when you increase the complexity by adding more data, the way each variable affects your outcome changes.
Data analysis tools today are built mostly for discovering these simple relationships. Whether the relationship is linear or nonlinear, the most common approach is to fit it to a line or a curve. With proper tweaking and weighting, these tools are very effective for making predictions on data containing simple relationships.
But what do you do if the relationships in your data aren’t linear, if the data just can’t be fit to a curve, if there are interaction effects or if the data becomes complex and it’s not clear how all of the different variables are affecting an outcome? How do you know what data to work into your model? Or what model to use for your data?
Predicting on this data requires the ability to see across the complex relationships that exist in real-world data. A linear relationship fails to describe most data, and the many different factors that affect your outcome can’t be displayed in just two dimensions. You could work around this by breaking your data into pieces and solving each part separately, but that loses the big-picture view across all of those data points. The effort of creating those data mash-ups, coupled with the potential for data discovery across wider data, are good arguments for maintaining the integrity of complex data.
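To make the point about interaction effects concrete, here is a minimal sketch in Python using NumPy. Everything in it is an assumption made for illustration: the data is synthetic, and the names (`x1`, `x2`, `r_squared`) are hypothetical rather than taken from any particular tool. The outcome depends only on the product of two variables, so a straight-line fit on each variable separately explains almost nothing, while the same fit with an added interaction term explains nearly all of it.

```python
import numpy as np

# Synthetic data (an assumption for illustration): the outcome is driven
# entirely by the interaction of two variables, plus a little noise.
rng = np.random.default_rng(0)
n = 1000
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
y = x1 * x2 + rng.normal(0, 0.05, n)


def r_squared(features, target):
    """Fit ordinary least squares with an intercept and return R^2."""
    X = np.column_stack([np.ones(len(target)), *features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    return 1 - residuals.var() / target.var()


# Additive model: each variable contributes its own straight-line effect.
r2_additive = r_squared([x1, x2], y)

# Same model with an explicit interaction term added.
r2_interaction = r_squared([x1, x2, x1 * x2], y)

print(f"R^2 without interaction term: {r2_additive:.3f}")
print(f"R^2 with interaction term:    {r2_interaction:.3f}")
```

In this toy case you know which interaction to add because you built the data yourself. With dozens or hundreds of real-world variables, the number of candidate interactions grows quickly, which is exactly where fitting by hand stops being practical.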
Today, companies looking to create a 360-degree view of their customers or to understand the interactions of thousands of networked machines are at the forefront of this kind of complex analysis. In the future, greater access to data and increased competition will mean everyone will face data complexity. Knowing more about your customers will become the burden of small and medium enterprises, and as more devices are networked, the companies that make those devices will have to ensure they are tracked and monitored. As these problems move downstream, those small and medium businesses will also need the skills and resources to tackle their own data complexity.
The path forward is tools that can automatically detect relationships in your data and show you how to predict from those interactions, no matter how complex. Because our world is not linear. And we do not have the time or skills to figure out every data set by hand.
Would love to hear your thoughts. Keep them coming!