Part 1: Data Visualization Throughout the Data Science Workflow (Article 1)
We want to think that the data speaks for itself, but a picture is worth a thousand words.
This is the first part in a three-part series entitled Visualizing Data: Why, When, and How. Part 1, Article 2 can be found here. Part 2, When Is Data Visualization a Good Choice?, focuses on determining when visualizing your data is an appropriate approach for communicating information. Part 3, The Importance of Integrity, consists of three articles that focus on factors that affect effective and honest communication of a data story.
Scientists, including data scientists, often focus on numerical methods and analyses, glossing over visualization and communication. We want to believe that “the data speaks for itself.”¹ However, data visualization is an essential tool in a data scientist’s toolbox. Data visualization allows you to see patterns that would be invisible — or inefficiently visible — from looking at numbers alone. It can help you more efficiently make sense of data, facilitating good decision-making.
This is true throughout the data science workflow. Beyond exploratory data analysis, data visualization can help you build better statistical and machine learning models, and it can speed up your understanding of the patterns in your data. And ultimately, there is no point in doing good science if you’re not able to communicate the results, whether you are a natural scientist contributing to the foundation of scientific understanding or a data scientist making insight actionable.
Much has been said on the importance of data visualization. In this article, rather than just focusing on its importance, we will walk through some simple, accessible examples of where it serves its purpose in the data science workflow (thus visually demonstrating the importance of visualization!²).
Across the two articles of Part 1, we will explore several main (sometimes overlapping) applications to which data visualization can bring efficiency or effectiveness:
Key Applications of Data Visualization
Data Science Process
Data visualization can help you develop intuition about your analytical process and enable you to apply data science techniques appropriately.
We’ll go through two examples here: applying statistics appropriately and choosing model parameters.
Applying statistics appropriately
It is essential to look at your data before applying any sort of statistical analysis, to make sure that your data fit the assumptions of the model you’ve chosen. One of the best classic examples of this is Anscombe’s Quartet.
Anscombe’s Quartet is a set of four datasets that have nearly identical descriptive statistics, despite the fact that the patterns of data are actually very different. These datasets were developed by the statistician Francis Anscombe in 1973 for the very purpose of demonstrating the value of plotting data. Each dataset is a set of 11 x and y values. All four datasets have exactly the same mean and sample variance of x, and nearly the same mean and variance of y, correlation between x and y, linear regression line, and coefficient of determination of the linear regression, to two to three decimal places.
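To check these claims numerically, here is a minimal sketch in Python using only the standard library. The x and y values are the ones published by Anscombe; the labels a through d follow the naming used in this article, and the helper name summarize is ours.

```python
from statistics import mean, variance

# Anscombe's (1973) published values; datasets a, b, and c share the same x.
x_abc = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "a": (x_abc, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "b": (x_abc, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "c": (x_abc, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "d": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return descriptive statistics and the least-squares line for one dataset."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx, "var_x": variance(x),      # sample variance (n - 1)
        "mean_y": my, "var_y": variance(y),
        "corr": sxy / (sxx * syy) ** 0.5,        # Pearson correlation
        "slope": slope, "intercept": my - slope * mx,
    }

summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

The four printed rows come out nearly indistinguishable, which is exactly why the summary statistics alone cannot tell you which dataset you are looking at.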
Despite these similarities, it’s clear from plots of the data that a blindly applied linear regression would be a very poor choice.
From looking at the plots, it appears that linear regression is a reasonable choice for modeling dataset a. Linear regression would also be reasonable for dataset c; however, the point at x = 13 is a clear outlier, and this point should be further investigated and/or removed from the overall model.
In contrast, linear regression would be the wrong choice for datasets b and d. Dataset b would be better modelled by polynomial regression, and there is no apparent relationship between x and y in dataset d, for which linear regression only appears to be reasonable because of the outlier at x = 19.
To take this point a step further, data visualization can not only show you whether your chosen model is appropriate, it can also help you determine what kind of model to use or what sorts of transformations to use on your features. For example, let’s say that you’re working with data that looks like the figure below:
Without looking at a plot of the data, you might have reason to assume that y is linearly dependent on x, and try linear regression. The resulting model wouldn’t be terrible, but it wouldn’t be optimally describing the actual pattern in the underlying data. A better approach would be to use polynomial regression or to log-transform x.
The calculated RMSE values (printed on the plots) would, even without the plots, indicate that linear regression is not the best of the three models for this dataset. However, without visualizing the data, you might have decided that linear regression was sufficient and never considered the other two approaches.
This is a simple example with clean data. Nonetheless, the point is still applicable with messy, more complex data, whether applied to individual variables or to many: Visualizing your data can help you assess its overall relationship, and thus help you determine how to appropriately apply statistical or predictive models.
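The comparison described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the analysis behind the original figures: the data are synthetic (y is generated as a logarithmic function of x plus a little noise), and the helper names fit_line and rmse are ours. It compares only two of the options, a straight line in x and a straight line in log(x).

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def rmse(y, y_hat):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

# ASSUMPTION: synthetic data standing in for the article's dataset.
rng = random.Random(0)
x = list(range(1, 51))
y = [2 * math.log(xi) + rng.gauss(0, 0.1) for xi in x]

# Model 1: straight line in x.
slope, intercept = fit_line(x, y)
rmse_linear = rmse(y, [slope * xi + intercept for xi in x])

# Model 2: straight line in log(x), i.e. a log-transformed feature.
log_x = [math.log(xi) for xi in x]
slope_log, intercept_log = fit_line(log_x, y)
rmse_log = rmse(y, [slope_log * li + intercept_log for li in log_x])

print(f"RMSE, linear in x:      {rmse_linear:.3f}")
print(f"RMSE, linear in log(x): {rmse_log:.3f}")
```

The second model should show the lower error here, but note that the numbers only tell you which model won; a plot of the data is what would have suggested trying the log transform in the first place.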
Choosing model parameters
Visualization can help you choose hyperparameters for a machine learning model. To demonstrate this point, let’s work with the classic iris dataset, one of the best known datasets for testing or learning classification models. This dataset has four features (Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width) with which to predict iris species (setosa, versicolor, and virginica).
When conducting a grid search to tune the hyperparameters for a machine learning model, you have to choose the range of values over which to search.
With one parameter, it’s easy to see whether or not you’ve examined enough values. For example, here are the results of a grid search for the mtry parameter of a random forest model predicting species using a training data set from the iris data. (mtry sets how many features are randomly sampled as candidates at each split of each decision tree in a random forest.)
Take a look at the table at the bottom of the output that shows the results for accuracy and kappa for each value of mtry.
With four features, mtry can only be between 1 and 4 (inclusive). It is clear that accuracy increases with increasing mtry until 3, and then levels out. Thus, the optimal value is mtry = 3. Had you stopped searching at mtry = 2 or even mtry = 3, it would be clear that you should continue to try higher values to be certain about the optimal value.
This is also clear from a plot of the results.
In this case, visualizing the data through a plot makes it a little easier to see the pattern, but it’s not absolutely necessary — it is relatively straightforward to understand what’s happening from the numbers alone.
With two parameters, however, it’s harder to see this pattern directly just by looking at the numbers. To demonstrate this point, we will work with an artificially generated dataset using the function mlbench.threenorm from the mlbench package. This particular dataset was generated with five features and 300 observations. Plotted against the first two features, the two-class target variable looks like this:
Let’s try to use an SVM (support vector machine) to separate the classes. Here, the SVM requires tuning two parameters, cost and gamma.
Here are the results of a grid search for the gamma and cost parameters (over cost = c(1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4) and gamma = c(1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1)), using 10-fold cross validation. The output shown below includes the function output and the first 10 rows of the table showing the calculated error given each pair of parameter values tested.
The tuning algorithm found an optimum at cost = 1 and gamma = 0.1. Did you pick the right range to search over? Should you extend your search in any particular direction, or look at smaller step sizes? To what extent does your choice of parameters affect the result?
One way to quickly assess these questions is to plot a heatmap of your results.
From this plot, you can quickly see that the optimal solution (cost = 1, gamma = 0.1) is on the edge of your tested values: 0.1 is the largest gamma in the grid. Other nearby combinations of parameters have error rates close to that of this solution, so it would be worthwhile to extend your search by checking larger values of gamma while keeping cost constant.
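The same "is the optimum on the edge of the grid?" check can also be done programmatically, as a complement to the heatmap. The sketch below is in Python and uses the cost and gamma grids listed earlier, but the error surface is a made-up function (toy_error) and not a real SVM or the results from this article; the helper on_edge is likewise hypothetical.

```python
import math
from itertools import product

# Grids taken from the search described in the article.
costs = [1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4]
gammas = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]

def toy_error(cost, gamma):
    """ASSUMPTION: made-up error surface, not a real SVM.

    It is lowest near cost = 1 and gamma = 1, so the true optimum in
    gamma lies just outside the tested grid.
    """
    return 0.10 + 0.02 * abs(math.log10(cost)) + 0.03 * abs(math.log10(gamma))

# Evaluate every (cost, gamma) pair, as a grid search would.
results = {(c, g): toy_error(c, g) for c, g in product(costs, gammas)}
best = min(results, key=results.get)

def on_edge(value, tested):
    """True if the chosen value is the smallest or largest value tested."""
    return value in (min(tested), max(tested))

edges = {"cost": on_edge(best[0], costs), "gamma": on_edge(best[1], gammas)}
print("best (cost, gamma):", best)
print("on edge of grid:", edges)
```

A flag of True for a parameter is a hint to extend the grid in that direction. It does not replace the heatmap, which also shows how flat or steep the error surface is around the optimum.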
In this case, a visualization is quick and easy, and allows for intuitive understanding of results. Visualization can also be used to explore the modeling process itself, for example how sensitive the results are to the way the data are split.
The first step when building a machine learning model is to split your data into a training set and a test set. Data may be split randomly and/or based on the distribution of some target variable. While the exact split will not, ideally, affect the eventual model substantially, it will have some effect.
Algorithms that split data between training and test sets often have some element of randomness in them. In the context of a computer algorithm, randomness is typically only pseudorandom; the computer has to start somewhere. To ensure that they are working with the same dataset each time the code is run, data scientists will sometimes do something called setting the seed. This effectively starts the random number generator in the same place each time, leading to the same split.
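Here is a minimal sketch of what setting the seed does to a train/test split, written in Python with only the standard library. The function stratified_split, the 80/20 proportion, and the balanced two-class labels are all assumptions made for illustration; the article does not state which splitting function or proportion was used.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction, seed):
    """Split row indices into train/test sets, class by class.

    The seed fixes the starting point of the pseudorandom generator,
    so the same seed always yields the same split.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# ASSUMPTION: 300 observations in two equally sized classes.
labels = [1] * 150 + [2] * 150

split_a = stratified_split(labels, 0.8, seed=222)
split_b = stratified_split(labels, 0.8, seed=222)
split_c = stratified_split(labels, 0.8, seed=111)

print("same seed, same split:     ", split_a == split_b)
print("different seed, same split:", split_a == split_c)
```

Because each class is shuffled and cut separately, both sets keep the class proportions of the full dataset, while the seed alone decides which rows end up on which side.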
In the above example, data was split based on the target (i.e., to ensure even distribution of classes across the train and test sets), after setting the seed to 222. How does the choice of SVM parameters depend on the choice of seed?
The plots below show the results of the grid search for cost and gamma, as performed above, but with several different seeds. They make it clear that changing the seed when splitting training and testing data has an effect on the results of the grid search, and can even affect the final conclusion in terms of the optimal parameters.³
The best parameters for each seed are as follows:
While the overall pattern is similar across all dataset splits, the results of different analyses point to different values for cost and gamma. This demonstrates that the optimal choices for cost and gamma depend on the exact data, and thus how the dataset is split.
The heatmaps make it easier to understand intuitively how error varies across different combinations of cost and gamma, and thus how different splits of the dataset could lead to different conclusions. Additionally, you can see that using, for example, the best parameters from the fourth split (seed = 444; cost = 1000, gamma = 0.01) when building a model with data from the first split (seed = 111) would still lead to a model with relatively low error. This might give you more confidence in picking model parameters despite the observed variation across different data splits.
This is an example of model variance, in that variability in the starting dataset affects the resulting model. Collecting more data should help reduce this, theoretically decreasing the effect on selection of model parameters.
The examples above demonstrate ways that data visualization can help you work through the data science process more efficiently, particularly in terms of appropriately applying statistics or machine learning models and choosing model parameters. However, the usefulness of data visualization throughout your analytical process extends to other applications as well: it is much easier to quickly make sense of patterns in an image than in columns of numbers!
In the second article of Part 1, we will explore ways that data visualization can be useful for two other aspects of the data science process: drawing insight from your data, and communicating data insights to others.
¹ Apologies to those of you who also had “data is” corrected to “data are” throughout your many years in graduate school: “the data speak for themselves.”
² Visualization inception: we cease to exist.
³ The seed is the same for all runs of the tuning algorithm (grid searches); only the seed for the initial data split differs across the four examples.
Originally published at www.t4g.com on February 28, 2018.