Part 1: Data Visualization Throughout the Data Science Workflow (Article 2)
This is the second half of the first part in a three-part series entitled Visualizing Data: Why, When, and How.
In the first half of Part 1, Data Visualization Throughout the Data Science Workflow, we worked through some straightforward, accessible examples of data visualization and looked at where it serves a purpose in the data science workflow.
Part 2, When Is Data Visualization a Good Choice?, focuses on determining when visualizing your data is an appropriate approach for communicating information.
Part 3, The Importance of Integrity, focuses on factors that affect effective and honest communication of a data story: How Plot Parameters Influence Interpretation, How Color Choice Influences Interpretation, and Maps — Potentials & Pitfalls.
Key Applications of Data Visualization
Data Science Insight
Beyond helping you work through the data analysis process, data visualization can also help you draw insight from the results of your analysis. In this use case, visualization is about presenting quantitative information in a way that makes sense to an analyst or other technical audience. The goal is to make patterns more visible than they are when looking at the numbers or summary statistics themselves.
For example, visualization is often essential in analysis of networks. Networks are made up of nodes and edges. Nodes represent individual things or actors, whereas edges connect the nodes and typically represent some sort of connection or co-occurrence. Networks might be used to describe systems like social networks, where nodes are people and edges are friendships or connections; or shipping routes, where nodes are ports and edges are routes; or perhaps the connections between different departments in an organization, where nodes are departments and edges represent email exchanges between departments.
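As a minimal sketch of how such a network might be held in code, the example below stores the departmental email case as a mapping from each node to the set of nodes it shares an edge with. The department names and the `build_adjacency` helper are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn an undirected edge list into a node -> set-of-neighbors mapping."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


# Hypothetical data: departments (nodes) linked by email exchanges (edges).
email_edges = [
    ("Sales", "Marketing"),
    ("Sales", "Finance"),
    ("Engineering", "Finance"),
]
network = build_adjacency(email_edges)

# Departments that exchange email with Sales.
print(sorted(network["Sales"]))
```

The same structure would serve for the social network or shipping route examples; only the meaning of the nodes and edges changes.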
The initial goals of examining networks are often around understanding their overall structure: whether most nodes are equally connected to each other, or whether there are multiple groups or clusters of nodes forming strongly interconnected communities; and whether or how these communities are connected. Networks can be described and compared with summary metrics that indicate characteristics of their overall structure. However, the structures of these kinds of systems are not easily represented by simple tables or plots.
Networks can be described by metrics that indicate their size and density. In particular, a network’s diameter is the longest geodesic distance between any pair of its nodes, where the geodesic distance between two nodes is the smallest number of edges necessary to connect them. Its edge density is the proportion of all possible network edges that are present in the network.
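To make these two definitions concrete, here is a small sketch that computes both for an undirected, connected network stored as a node-to-neighbors mapping. The function names and the four-node example network are assumptions for illustration, not the tooling used for the networks discussed in this article.

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: geodesic distance from source to each reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def diameter(adjacency):
    """Longest geodesic distance over all node pairs (assumes a connected graph)."""
    return max(
        max(shortest_path_lengths(adjacency, node).values())
        for node in adjacency
    )


def edge_density(adjacency):
    """Edges present divided by edges possible, n * (n - 1) / 2 when undirected."""
    n = len(adjacency)
    edge_count = sum(len(neighbors) for neighbors in adjacency.values()) // 2
    return edge_count / (n * (n - 1) / 2)


# Hypothetical chain of four nodes: A - B - C - D.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(diameter(chain))      # 3: the geodesic from A to D uses three edges
print(edge_density(chain))  # 0.5: three of the six possible edges are present
```

Adding an edge between A and D would turn the chain into a ring, raising the density and shrinking the diameter at the same time, which is the pattern that separates the two example networks described in this article.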
Nodes can be assessed in terms of their relative “importance” to the overall network using measures of centrality. For example, betweenness centrality is the number of shortest paths (geodesics) between other nodes that pass through a given node (how many shortest paths use that node as a bridge?). Closeness centrality is the inverse of the total geodesic distance from a given node to all other nodes (is this node very close to all other nodes? If so, its closeness is higher). Graph-level centrality scores can be calculated by centralization, which is effectively the sum of the differences between the highest node-level centrality score and each node’s score, normalized by dividing by the maximum value of that sum for a theoretical graph of the same size and parameters.
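The sketch below shows one way these measures could be computed for a small undirected, connected network. It makes several assumptions that are not from the original analysis: betweenness uses the standard convention of splitting credit when a pair of nodes has several shortest paths, centralization is computed in the Freeman style as the summed gap between the top score and every node's score, and a star graph of the same size stands in for the theoretical maximum. All function names are illustrative.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Geodesic distance from source to every reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def closeness(adjacency):
    """Inverse of the total geodesic distance to all other nodes."""
    return {
        node: 1 / sum(bfs_distances(adjacency, node).values())
        for node in adjacency
    }


def betweenness(adjacency):
    """Shortest paths through each node (Brandes-style accumulation)."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        order = []                                   # nodes in BFS visit order
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        path_count[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    path_count[neighbor] += path_count[node]
                    predecessors[neighbor].append(node)
        # Walk back from the farthest nodes, passing credit to predecessors.
        credit = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                credit[pred] += share * (1 + credit[node])
            if node != source:
                scores[node] += credit[node]
    # Each undirected pair was visited from both of its endpoints.
    return {node: score / 2 for node, score in scores.items()}


def centralization(scores, reference_scores):
    """Summed gap to the top score, scaled by the same sum for a reference graph."""
    def gap(values):
        top = max(values)
        return sum(top - value for value in values)
    return gap(scores.values()) / gap(reference_scores.values())


def star(n):
    """Star graph on n nodes: node 0 is the hub, every other node is a leaf."""
    adjacency = {0: set(range(1, n))}
    for leaf in range(1, n):
        adjacency[leaf] = {0}
    return adjacency


# Hypothetical chain of five nodes: A - B - C - D - E.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"},
         "D": {"C", "E"}, "E": {"D"}}
node_betweenness = betweenness(chain)
print(node_betweenness["C"])  # 4.0: the middle node bridges four pairs
print(centralization(node_betweenness, betweenness(star(len(chain)))))
```

A centralization near 1 means one node dominates, as in a star; a value near 0 means the node-level scores are all similar.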
As an example of the importance of visualization for networks, let’s look at two networks that represent employee skill sets. In this case, nodes are different skills (e.g., techniques, programming languages, etc.), and edges indicate co-occurrence of skills between different employees in the same department. Closer nodes and stronger edges indicate more frequent co-occurrence of the skill in any one individual. These networks have the following metrics:
The first network is denser, with a smaller diameter. Its betweenness is lower, so fewer nodes act as consistent bridges for shortest paths between other nodes. However, closeness is high, so the paths to other nodes are generally short.
The second network is less dense, with a larger diameter. The higher betweenness indicates greater variation in terms of whether nodes act as bridges for shortest paths between other nodes — some do, and some do not. The lower closeness centrality indicates greater distances from other nodes.
From these metrics alone, you can begin to form a mental picture of what the two networks look like, but that picture is hard to build from numbers. Visualizing the network graphs directly will help your understanding and intuition of what these metrics mean.
These two network graphs are plotted below. In the plots, nodes are also colored by clusters. These clusters were determined using a “cluster walktrap” function, which looks for densely connected subgraphs via random walks, assuming that short random walks will typically connect nodes in the same cluster.
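The random-walk assumption can be illustrated with a short sketch. This is not the walktrap function itself, which goes on to merge nodes into clusters hierarchically; it only computes where a short random walk is likely to end up from each starting node, then compares those distributions with a degree-weighted distance of the kind walktrap-style methods rely on. The example graph (two triangles joined by one edge), the function names, and the choice of three steps are all assumptions for illustration.

```python
import math


def walk_distribution(adjacency, start, steps):
    """Exact probability of being at each node after `steps` random-walk steps."""
    probs = {node: 0.0 for node in adjacency}
    probs[start] = 1.0
    for _ in range(steps):
        next_probs = {node: 0.0 for node in adjacency}
        for node, p in probs.items():
            # The walker moves to a uniformly chosen neighbor.
            for neighbor in adjacency[node]:
                next_probs[neighbor] += p / len(adjacency[node])
        probs = next_probs
    return probs


def walk_distance(adjacency, a, b, steps=3):
    """Degree-weighted distance between the walk distributions of a and b."""
    dist_a = walk_distribution(adjacency, a, steps)
    dist_b = walk_distribution(adjacency, b, steps)
    return math.sqrt(sum(
        (dist_a[k] - dist_b[k]) ** 2 / len(adjacency[k]) for k in adjacency
    ))


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2 - 3.
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
within = walk_distance(graph, 0, 1)  # two nodes in the same triangle
across = walk_distance(graph, 0, 5)  # nodes in different triangles
print(within < across)  # True: walks from the same cluster end up in similar places
```

Nodes whose short walks land in similar places are candidates for the same cluster, which is why the two triangles would be natural groupings here.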
As was indicated by the metrics, the first network looks like a denser network with a shorter overall diameter, whereas the second is less dense with a longer diameter. Seeing the networks, however, quickly builds an intuition for how they are structured that the metrics alone do not provide. This sort of visualization-facilitated intuition not only helps you understand the data you’re working with, but also facilitates the process of asking further questions to explore this data.
In the first network, there are many clusters of co-occurring skills, but also a substantial amount of overlap in skill sets. This could lead you to ask questions around whether and how this overlap contributes to departmental functioning. Is this structure ideal or intentional, or would it be helpful for some employees to specialize further?
In the second network, there are several clear skill set clusters that are connected by one or a few nodes. This may lead you to ask questions around those nodes with high betweenness. For example, which are the skills that bridge other skill sets, and why are they limited? Do the gaps between skill sets make sense for the departmental goals and tasks, or are there opportunities for more overlap between these skill sets that, if fostered, would improve employee effectiveness? Do employees take advantage of these different skill sets through collaboration?
While these structures could be described by different metrics, actually seeing the graphs enables you to glean insights and delve deeper into understanding the system that you are exploring.
Data Science Communication
Data visualization for data science communication can require a different set of skills than the tasks described above, in that these sorts of visualizations need to be interpretable to a wider range of audiences and skill levels. In this case, visualizations need to be accessible, potentially containing contextual information and/or multiple sources of data. However, they also need to be digestible and interpretable, meaning that the amount of data presented should be constrained, and the visualization itself should be clean, with any noise stripped away.
These sorts of visualizations are the ones that are deployed to non-technical audiences, to help them make real-world decisions. For example, after Hurricane Harvey hit Texas in August 2017, relief organizations, often with limited resources and access to information, needed a way to determine which areas to target for their efforts. Catholic Charities USA worked with DataKind, MapBox and ATTOM Data Solutions to build the Catholic Charities USA Disaster Operations Map, which combines data on social vulnerability and natural hazard risk to assess different regions around Houston (and the rest of the USA).
The result integrates data from multiple sources, creating insight — and a useful tool — that is greater than the sum of its parts.
The resulting visualization takes advantage of spatial data by presenting aggregated findings on a map. This allows users to more easily make use of the data, and find spatial patterns and relationships, than if it were in a table of risks and vulnerabilities listed by county.
In this case, the spatial aspect of the data makes it particularly important to visualize the data in order to communicate it well. However, maps are not the only contexts for which visualization is important for communication, or for which it is beneficial to take advantage of the spatial structure that visualization applies. Recall that the network graphs demonstrated above did not describe spatial data, but plotting them in two-dimensional space helped our human minds make sense of them. The same approach applies for any dataset with internal relationships — which hopefully is what you will always be able to work with!
The importance of data visualization during exploratory data analysis is often emphasized, but visualization can also be an important tool throughout the data science workflow. The more you look at your data, the better your intuition around the secrets it holds and what questions you should ask of it next!
Next in this series is Part 2: When Is Data Visualization a Good Choice?, which focuses on determining when visualizing your data is an appropriate approach for communicating information.
Originally published at www.t4g.com on March 19, 2018.