Data Science is a lot more than Predictive Modeling

Published in State of Analytics, Apr 9, 2016

Statisticians, like artists, have the bad habit of falling in love with their models — George Box

In the corporate world, Data Science is often loosely equated with Predictive Modeling. This is understandable for a number of reasons. Two of the most popular books on the topic have prediction in their title (see here and here). Nate Silver, arguably the world’s most famous Data Scientist, rose to fame after correctly making predictions (on a massive scale). And one of the most popular Data Science websites is essentially a forum for competitive prediction. However, there are many flavors of Data Science.

In this article I will give a brief description of 6 different types of analysis that could be considered Data Science. This isn’t meant to be an exhaustive or mutually exclusive list, but it highlights the diversity of techniques and applications of Data Science. For each category, I will summarize what it refers to, illustrate how it could be used with an anecdotal application, and list some analytical techniques that fall within that particular area.

Predictive Modeling

Predictive modeling (also referred to as Supervised Machine Learning) generally refers to estimating a quantity or category of interest based on historical correlations. For example, if a company wants to forecast their sales into the future based on the amount of money they spent on advertising, they could use predictive modeling to accomplish this. Technically, there are many different modeling approaches, from general linear modeling, to tree-based approaches, to ARIMA, to neural networks, all of which come with their own set of strengths and weaknesses.
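
As a rough illustration of the advertising example, here is a minimal sketch of the simplest possible predictive model, a one-variable least squares line. The spend and sales figures are made up for illustration, and the function names (fit_line, predict) are hypothetical helpers, not part of any library.

```python
# Hypothetical data: monthly advertising spend and sales (same currency units).
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [24.0, 47.0, 63.0, 86.0, 105.0]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance = sum((xi - mean_x) ** 2 for xi in x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x_new):
    """Estimate sales for a spend level the model has not seen."""
    return intercept + slope * x_new


intercept, slope = fit_line(ad_spend, sales)
forecast = predict(intercept, slope, 60.0)  # sales estimate at a spend of 60
```

A real forecast would usually involve more inputs, a holdout set to check accuracy, and possibly one of the other model families mentioned above.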

Network Analysis

Network analysis refers to the analysis of connections between entities as a method of uncovering information about those entities. For example, if a company wants to better serve their customers, they might use network analysis to help connect their customers with each other or identify which of their customers carry the most influence. Technically, network analysis often comes down to the choice of definitions for connections and metrics used for defining the relevance of those connections.
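
To make the idea of "influence" concrete, here is a small sketch that uses degree centrality, one of many possible relevance metrics. The customer names and referral links are invented, and treating a referral as an undirected connection is an assumption of this example.

```python
from collections import defaultdict

# Hypothetical referral links between customers (treated as undirected edges).
connections = [
    ("ana", "ben"),
    ("ana", "cara"),
    ("ana", "dev"),
    ("ben", "cara"),
    ("dev", "eli"),
]


def build_graph(edges):
    """Adjacency sets: each customer maps to the customers they are linked to."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


graph = build_graph(connections)
centrality = degree_centrality(graph)
most_influential = max(centrality, key=centrality.get)
```

Choosing a different definition of a connection (purchases, messages, shared addresses) or a different metric could rank the same customers quite differently.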

Simulation

Simulation refers to the representation of a system or process that is defined by known relationships. One reason you would use a simulation is to estimate quantities that are otherwise too complicated to estimate. For instance, if a company wants to open up a new office, and they know the relationship between sales and economic indicators for the population of the city, they may utilize simulation to help them find the optimal location. One of the most popular simulation procedures is called Monte Carlo Simulation, which can be thought of as the process of repeatedly drawing values from distributions as inputs to a set of models that then provide a range of possible output values.
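
The Monte Carlo procedure described above can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the two candidate cities, the income distributions, and the toy sales model (sales per resident equals income times an uncertain spending share).

```python
import random
import statistics


def simulate_sales(mean_income, income_sd, n_draws, rng):
    """Draw uncertain inputs and push each draw through an assumed sales model."""
    outcomes = []
    for _ in range(n_draws):
        income = rng.gauss(mean_income, income_sd)  # uncertain economic indicator
        spend_share = rng.uniform(0.01, 0.03)       # uncertain share of income spent
        outcomes.append(income * spend_share)
    return outcomes


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical candidate locations: (mean income, standard deviation of income).
cities = {"city_a": (50_000, 5_000), "city_b": (60_000, 20_000)}

results = {}
for name, (mean_income, income_sd) in cities.items():
    draws = sorted(simulate_sales(mean_income, income_sd, 10_000, rng))
    results[name] = {
        "mean": statistics.mean(draws),
        "p05": draws[int(0.05 * len(draws))],  # pessimistic outcome
        "p95": draws[int(0.95 * len(draws))],  # optimistic outcome
    }
```

The output is a range per city rather than a single number, so the comparison can weigh expected sales against how wide the spread of outcomes is.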

Recommendation Engines

Recommendation Engines are tools that make suggestions to human users. If a website wants to personally guide their readers to content they might like, they could implement a recommendation engine to accomplish this. Technically, approaches to recommendation engines can be classified into two general categories: collaborative filtering, which is based on the behavior of users, and content-based filtering, which is based on matching user preferences with product characteristics.
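
Here is a minimal sketch of the collaborative approach, assuming a tiny made-up table of reader ratings. It scores unread articles by the ratings of other readers, weighted by how similar those readers are to the target reader. The cosine similarity measure and the function names are choices made for this example, not a standard API.

```python
import math

# Hypothetical reader-by-article ratings (1-5); a missing key means "not yet read".
ratings = {
    "reader_1": {"article_a": 5, "article_b": 4, "article_c": 1},
    "reader_2": {"article_a": 4, "article_b": 5, "article_d": 5},
    "reader_3": {"article_c": 5, "article_d": 1, "article_e": 4},
}


def cosine_similarity(u, v):
    """Similarity of two readers, treating unrated articles as zeros."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(target, ratings, top_n=1):
    """Rank articles the target has not read by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        sim = cosine_similarity(ratings[target], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


suggestions = recommend("reader_1", ratings)
```

A content-based version would instead compare each article's own attributes (topic, author, length) against what the reader has liked before.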

Clustering

Cluster analysis is all about identifying and representing membership in groups without knowing beforehand what makes those groups similar. For instance, a company might want to identify customer segments to better market their products. They could use clustering to identify these groups based on what they know about their customers, such as demographic attributes and purchasing behavior. Beyond customer segmentation, clustering can be useful for a wide variety of things: identifying outliers, compressing data, and even creating inputs to predictive models are all valid uses of cluster analysis. Technically, two of the most popular clustering techniques are K-means clustering and hierarchical agglomerative clustering.
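
For a sense of how K-means works, here is a bare-bones sketch of its usual iterative procedure (often called Lloyd's algorithm) on a handful of invented customers. The starting centroids are picked by hand for simplicity, and the features are left unscaled, which a real analysis would normally correct.

```python
import math


def kmeans(points, centroids, n_iter=10):
    """Alternate between assigning points to the nearest centroid and updating centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical customers: (age, purchases per year).
customers = [(22, 2), (25, 3), (24, 1), (58, 20), (61, 18), (60, 22)]
labels, centroids = kmeans(customers, centroids=[customers[0], customers[3]])
```

The number of clusters is an input here, not something the algorithm discovers, which is one reason hierarchical methods are sometimes preferred for exploration.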

Natural Language Processing

Natural language processing refers to a large collection of tasks all relating to the extraction of meaning from textual data. When an online business wants to identify common phrases in their product reviews, they would use natural language processing techniques to extract this meaning from unstructured textual information. Some common natural language processing applications are sentiment analysis, topic modeling, and document classification. Natural language processing is an extremely hot area in academia and is only starting to be used widely within the mainstream of corporate America.
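
The product review example can be sketched with nothing more than word-pair counting. The reviews below are invented, and the simple tokenizer is an assumption of this example. Real pipelines typically add steps such as stop-word removal and stemming.

```python
import re
from collections import Counter

# Hypothetical product reviews.
reviews = [
    "Great battery life and fast shipping.",
    "The battery life is great, but the screen scratches easily.",
    "Fast shipping! Battery life could be better.",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def bigrams(tokens):
    """Adjacent word pairs within one review."""
    return list(zip(tokens, tokens[1:]))


counts = Counter()
for review in reviews:
    # Count pairs per review so phrases never span two different reviews.
    counts.update(bigrams(tokenize(review)))

top_phrases = counts.most_common(2)
```

Counting phrases is only a starting point. Sentiment analysis or topic modeling would build on the same tokenized text.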

The State of Data Science

Looking at the state of Data Science within the corporate mainstream, some of these techniques are more widely used than others. For instance, most companies produce some sort of forecast. But many companies are not analyzing their textual data in a meaningful way, or thinking about how to empower their employees with recommendation engines.

One prediction that I’ll make: the upcoming years will be an exciting time to work in Data Science.