Network graph analysis (Part 2 of 2): Visualizing the characters of Hamilton as a social network
This is the second of a two-part article series on network graphs (please check out Part 1 here). In this article, I demonstrate how to visualize a network graph using the lyrics from Hamilton, as well as analyze it using graph centrality measures.
Getting the data
“In data science, 80 percent of time spent is preparing data, 20 percent of time is spent complaining about the need to prepare data.”
Data scientists may not agree on everything — but we agree that the most difficult part of any project is getting the data. Lucky for us, that part is behind us for the purposes of this article. There is a nice clean dataset of Hamilton lyrics readily available on Kaggle that you can simply download and start graphing.
Exploratory analysis
This is what the Hamilton dataset looks like. There is one record per song, per speaker, per line of lyric.
- Title refers to the name of the song.
- Speaker refers to the character who is singing a given line.
- Lines refers to the particular line of lyric within the song.
I use Microsoft Power BI for my exploratory data analysis (EDA). With just a few drag-and-drop clicks I’m able to quickly explore the dataset:
No surprises here: The titular Hamilton and his nemesis Burr have the highest number of lines in the musical.
The fast-paced song “The Room Where It Happens” has the highest number of lines of all the songs — followed by, somewhat fittingly, “Non-Stop.”
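For readers without Power BI, the same counts can be roughed out in base R. This is only a sketch: the `lyrics` data frame below is a made-up stand-in that borrows the three column names described above, not the real Kaggle file.

```r
# Toy stand-in for the dataset; songs and lines are invented for illustration
lyrics <- data.frame(
  Title   = c("Song A", "Song A", "Song A", "Song B", "Song B"),
  Speaker = c("HAMILTON", "BURR", "HAMILTON", "BURR", "ELIZA"),
  Lines   = c("line 1", "line 2", "line 3", "line 4", "line 5"),
  stringsAsFactors = FALSE
)

# Number of lines per speaker and per song, largest first
lines_per_speaker <- sort(table(lyrics$Speaker), decreasing = TRUE)
lines_per_song    <- sort(table(lyrics$Title), decreasing = TRUE)

# Drill-down: lines per speaker within each song (rows = songs, columns = speakers)
lines_by_song_speaker <- table(lyrics$Title, lyrics$Speaker)

print(lines_per_speaker)
print(lines_per_song)
print(lines_by_song_speaker)
```

The two-way table is a rough, non-interactive counterpart to drilling down by song and then by speaker.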
So what else can we do beyond creating static bar charts? Power BI has an interactive visual called a decomposition tree to drill down into count(lines) by both song and speaker. This is a nifty feature that can be used to analyze an aggregate measure across multiple dimensions.
For example, I can click on the song “Satisfied” and look at which speakers have the highest number of lines within the song (and if you’re a Hamilton fan, you already know the answer!) — and look at which lines they have sung as well.
Even this screenshot doesn’t do it justice; the decomposition tree is one of my favorite Power BI visuals. It can be used for many kinds of root cause analysis, for example breaking down sales and adoption by country and industry.
Building an adjacency matrix
In order to build a network graph of all the Hamilton speakers, the following must be defined:
- Nodes (list of speakers)
- Edges (to connect each pair of speakers)
- Incidence function that maps each edge to the pair of vertices it connects (with an optional weight)
The weight I’ve chosen is the number of songs in which each pair of speakers appears together. My assumption is that the more songs two characters appear in together, the stronger their relationship.
Weight{speaker_x, speaker_y} = number of songs that feature both speaker_x and speaker_y
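To make the weight concrete, here is a small worked example in base R. It is a sketch on invented data (the songs and the `lyrics` data frame are hypothetical), and it uses a cross-product shortcut rather than the dplyr pipeline from the repository.

```r
# Toy data: which speaker sings a line in which song (invented for illustration)
lyrics <- data.frame(
  Title   = c("Song A", "Song A", "Song A", "Song B", "Song B", "Song C", "Song C"),
  Speaker = c("HAMILTON", "BURR", "HAMILTON", "HAMILTON", "BURR", "HAMILTON", "ELIZA"),
  stringsAsFactors = FALSE
)

# One row per (song, speaker), so repeated lines by one speaker count once
appearances <- unique(lyrics[, c("Title", "Speaker")])

# Song-by-speaker indicator matrix: 1 if the speaker appears in the song
appears_in <- unclass(table(appearances$Title, appearances$Speaker))

# t(appears_in) %*% appears_in counts the songs shared by each pair of speakers
weights <- crossprod(appears_in)
diag(weights) <- 0  # a speaker is not paired with themselves

print(weights)
```

In this toy data HAMILTON and BURR share two songs, HAMILTON and ELIZA share one, and BURR and ELIZA share none, so the corresponding weights are 2, 1, and 0.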
Using R’s dplyr, I am able to transform my original dataset into an {src, dest, weight} entity, and then convert that into an adjacency matrix. I can then use graph.adjacency in R’s igraph package (named graph_from_adjacency_matrix in newer igraph releases) to create a “graph object” from this adjacency matrix, which I can then use for plotting and other analyses.
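The conversion from an {src, dest, weight} table to a graph object might look roughly like the following. The edge list and its weights are made up, the matrix is filled with base R indexing instead of dplyr, and `graph_from_adjacency_matrix` is used as the current name for `graph.adjacency`.

```r
library(igraph)

# Hypothetical {src, dest, weight} edge list; the values are invented
edges <- data.frame(
  src    = c("HAMILTON", "HAMILTON", "BURR"),
  dest   = c("BURR", "ELIZA", "ELIZA"),
  weight = c(3, 2, 1),
  stringsAsFactors = FALSE
)

# Square matrix with one row and one column per speaker
speakers <- sort(unique(c(edges$src, edges$dest)))
adj <- matrix(0, nrow = length(speakers), ncol = length(speakers),
              dimnames = list(speakers, speakers))

# Fill both triangles so the matrix is symmetric (the graph is undirected)
adj[cbind(edges$src, edges$dest)] <- edges$weight
adj[cbind(edges$dest, edges$src)] <- edges$weight

# weighted = TRUE stores the matrix entries in an edge attribute named "weight"
graph_obj <- graph_from_adjacency_matrix(adj, mode = "undirected", weighted = TRUE)

print(adj)
print(E(graph_obj)$weight)
```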
Visualizing the network plot
The graph_obj can be visualized using the plot.igraph function. This function has many layouts to choose from, so I start by rendering the graph using the “star” layout.
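A minimal sketch of a star-layout plot, assuming a small invented edge list in place of the real Hamilton graph. Calling `plot()` on an igraph object dispatches to `plot.igraph`, and `layout_as_star` is igraph's star layout function.

```r
library(igraph)

# Invented edge list standing in for the real graph
edges <- data.frame(
  src    = c("HAMILTON", "HAMILTON", "HAMILTON", "BURR"),
  dest   = c("BURR", "ELIZA", "ANGELICA", "ELIZA"),
  weight = c(4, 3, 1, 2),
  stringsAsFactors = FALSE
)
graph_obj <- graph_from_data_frame(edges, directed = FALSE)

# One row of (x, y) coordinates per vertex; by default the first vertex
# (here HAMILTON) is placed in the centre and the rest on a circle around it
star_layout <- layout_as_star(graph_obj)

plot(graph_obj, layout = star_layout)
```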
The result is technically a network plot. But is it possible to do even better? The chart above seems to suggest that all vertices and edges have equal importance — but that undermines the whole point of visualizing a social network. Some characters are indeed more “significant,” and some speakers have stronger relationships relative to others. How can this graph reflect that?
This is where edge weight and vertex degree come into play. I start by playing around with the parameters of the plot.igraph function to make edge.width (i.e., the thickness of the edge in the plot) relative to the weight, and vertex.label.cex (i.e., the font size of the vertex labels) relative to degree.
Much better! Characters with a higher degree are visually larger, and the distinction between strong and weak relationships is also apparent from the darkness of the lines. This iteration is much more intuitive and lets the viewer immediately grasp the relationships among characters.
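The scaling described above can be sketched as follows. The edge list is invented, and the constants used to scale the label size (0.8 and 0.7) are arbitrary choices of mine, not the values behind the plot in this article.

```r
library(igraph)

# Invented edge list standing in for the real graph
edges <- data.frame(
  src    = c("HAMILTON", "HAMILTON", "HAMILTON", "BURR"),
  dest   = c("BURR", "ELIZA", "ANGELICA", "ELIZA"),
  weight = c(4, 3, 1, 2),
  stringsAsFactors = FALSE
)
graph_obj <- graph_from_data_frame(edges, directed = FALSE)

# Degree = number of edges touching each vertex
deg <- degree(graph_obj)

# Thicker edge = more shared songs; larger label = higher degree
edge_width <- E(graph_obj)$weight
label_cex  <- 0.8 + 0.7 * deg / max(deg)

plot(
  graph_obj,
  layout           = layout_as_star(graph_obj),
  edge.width       = edge_width,
  vertex.label.cex = label_cex
)
```

On a real graph with large weights, the widths would likely need rescaling (for example dividing by the maximum weight) to keep the thickest edges from swamping the plot.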
What else is possible? While plot.igraph is great, its use is restricted to static graphs. So, I used the visNetwork library in R to make an interactive network graph. The library makes it possible to zoom in and out of multiple parts of the graph (especially useful with a particularly large graph), and has support for Shiny.
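A minimal visNetwork sketch, again on invented data. visNetwork expects a nodes data frame with an `id` column and an edges data frame with `from` and `to` columns; the `value` column and the two `visOptions` settings are optional extras I chose for illustration.

```r
library(visNetwork)

# Invented nodes and edges standing in for the real graph
nodes <- data.frame(
  id    = c("HAMILTON", "BURR", "ELIZA"),
  label = c("HAMILTON", "BURR", "ELIZA"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from  = c("HAMILTON", "HAMILTON", "BURR"),
  to    = c("BURR", "ELIZA", "ELIZA"),
  value = c(3, 2, 1),  # visNetwork scales edge width by "value"
  stringsAsFactors = FALSE
)

# Build the widget, then add click-to-highlight and a node selection dropdown
widget <- visNetwork(nodes, edges)
widget <- visOptions(widget, highlightNearest = TRUE, nodesIdSelection = TRUE)

# In an interactive session this opens the graph in the viewer or browser
if (interactive()) print(widget)
```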
Centrality measures
In Part 1, I briefly discussed various centrality measures used to quantify the relative significance of the nodes. I can use igraph’s degree(), betweenness(), and eigen_centrality() functions to get these results:
It looks like Aaron Burr has the highest betweenness centrality (the “bridge”) in our graph, while Hamilton has the highest eigenvector centrality (the “influencer”). Make what you will of that.
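To see how a “bridge” and an “influencer” can be different nodes, here is a sketch on a hypothetical unweighted toy graph: two triangles joined through a single vertex `x`. It is not the Hamilton graph. Note that on a weighted graph, igraph's `betweenness()` treats the `weight` attribute as a distance, which is worth keeping in mind when the weights represent relationship strength.

```r
library(igraph)

# Two triangles (a, b, c) and (d, e, f) connected through the bridge vertex x
edges <- data.frame(
  from = c("a", "a", "b", "c", "x", "d", "d", "e"),
  to   = c("b", "c", "c", "x", "d", "e", "f", "f"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges, directed = FALSE)

# One row per vertex, one column per centrality measure
centrality <- data.frame(
  degree      = degree(g),
  betweenness = betweenness(g),
  eigenvector = eigen_centrality(g)$vector
)

print(round(centrality, 2))
```

Here `x` has the highest betweenness because every shortest path between the two triangles passes through it, while `c` and `d` have the highest eigenvector centrality because they sit inside a triangle and also connect to the bridge.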
Conclusion
In this series of articles, I have explored the usefulness of network graphs in visualizing relationships among entities, and I’ve shared an example of building a network graph from a flat dataset. It’s important to note, however, that network graphs are not without drawbacks. For example, they can be resource intensive: as with any matrix operation, scalability and performance can take a hit as the number of nodes grows. There is also a “cold start” problem: if your dataset is too sparse or there are few relationships among entities, a network graph is not an effective solution. Used correctly and in the right context, however, they can be valuable to a business.
Code
• https://github.com/iswaryam/hamilton/
• Dataset credit: https://www.kaggle.com/lbalter/hamilton-lyrics#
Resources
• https://www.rdocumentation.org/
• https://docs.microsoft.com/en-us/power-bi/
• https://igraph.org/r/
• https://en.wikipedia.org/wiki/Network_theory
• https://gephi.org/
• http://datastorm-open.github.io/visNetwork/