Survivor: Entity Extraction and Network Graphs in Python

Published in The Startup, Aug 19, 2020

During the lockdown, I watched and re-watched copious amounts of television. I turned around and saw that people were *gasp* learning new skills. So it occurred to me — why not combine the two for some enjoyable productivity?

Survivor is one of my favourite shows. In this hit US reality show, a group of people live under primitive conditions, compete in challenges, and periodically have to vote someone out. In the end, the winner is decided by a jury of the booted contestants. It’s such a great mix of survival, strategy and social dynamics!

Can we understand Survivor social dynamics based on what contestants say about each other? I will be using confessionals, which are basically when players speak to the camera in private. Specifically, I am going to analyse how often players mention each other.

In this article, I lay out my exploratory analysis step-by-step:

1. Identify data sources
2. Prep confessional data
3. Visualise by plotting network graphs
4. Draw insights

Since this is very free-form, I will be working in a Jupyter notebook. Key packages include pandas for ETL, spaCy for entity extraction, and networkx for visualising.

I’m conscious that the snippets here don’t always adhere to the highest coding standards, but bear in mind that this is a fun side project. I also want to show the real working process. Any comments on the thinking, coding and analysis are welcome!

1. Identify data sources

The first step — finding usable data.

At this point, I had some ideas but hadn’t decided what to do with Survivor data yet. This step is crucial because the availability of good data informs my problem statement. A little extra research can also save lots of data handling time!

Luckily, some googling quickly identified sources of Survivor data:

Performance data inc. challenges, jury votes and tribal council record
Confessionals transcribed

Personally, I’m more interested in doing some natural language processing (NLP) on the confessionals, so I am going to focus on this and put the performance data to one side.

The confessionals data come in Google Sheets by season. I picked Millennials vs. Gen X (S33, filmed 2016) because it was the latest in my re-watch.

One sheet looks like this. Each row in column B is one confessional. The numbers, like (1/5), indicate that it’s the 1st of 5 confessionals that episode. Note that they’re not labelled with the actual episode number.
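To make the (1/5) markers usable, they can be pulled out of each cell with a regular expression. This is only a minimal sketch: it assumes the marker appears once per cell in exactly the form (k/n), and `split_marker` is a hypothetical helper name rather than something from the original notebook.

```python
import re

# Assumed marker format: "(k/n)" somewhere in the cell, e.g. "(1/5)"
MARKER = re.compile(r"\((\d+)/(\d+)\)")


def split_marker(cell):
    """Return (text, index, total); index and total are None if no marker is found."""
    match = MARKER.search(cell)
    if match is None:
        return cell.strip(), None, None
    # Cut the marker out and keep whatever text surrounds it
    text = (cell[:match.start()] + cell[match.end():]).strip()
    return text, int(match.group(1)), int(match.group(2))
```

Since the rows are not labelled with episode numbers, keeping the index and total alongside the text at least preserves the within-episode ordering for later use.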

In a different file, there is a summary showing the number of confessionals by contestant and episode. If only this also included their finish — who won, who was a finalist, and who made the jury?

At a glance the data looks imperfect but well maintained (there are even guidelines on what gets transcribed). Kudos to the maintainers, I am definitely using these!

2. Prep confessional data

Now I need to read and clean this data so it’s ready to be analysed! There are two parts: ‘data’, i.e. the text of the confessionals, and ‘summary’ which has metadata on how the contestant did.

Since I don’t own these Google Sheets, I decided to save the relevant tabs as CSV files. In ‘summary’, I also manually added a ‘Finish’ column to indicate ‘winner’, ‘finalist’ etc. as there is no way to work this out from the data alone.
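A minimal sketch of the loading step with pandas. The column names (`Contestant`, `Confessional`, `Finish`), the `other` label and the sample rows are made up for illustration, so the real tabs may well be laid out differently. In practice the `io.StringIO` objects would be replaced by the paths of the saved files.

```python
import io

import pandas as pd

# Made-up stand-ins for the two saved tabs (assumed layout, invented text)
DATA_CSV = io.StringIO(
    "Contestant,Confessional\n"
    'Adam,"(1/2) I think David is playing hard."\n'
    'Adam,"(2/2) Zeke and David are both threats."\n'
    'David,"(1/1) I am worried about Zeke."\n'
)
SUMMARY_CSV = io.StringIO(
    "Contestant,Finish\n"
    "Adam,winner\n"
    "David,jury\n"
    "Mari,other\n"
)


def load_season(data_file, summary_file):
    """Read both tabs and apply some light cleaning."""
    data = pd.read_csv(data_file)
    summary = pd.read_csv(summary_file)
    # Trim stray whitespace, then drop rows without any confessional text
    data["Contestant"] = data["Contestant"].str.strip()
    data["Confessional"] = data["Confessional"].str.strip()
    data = data.dropna(subset=["Confessional"])
    data = data[data["Confessional"] != ""].reset_index(drop=True)
    # Normalise the manually added Finish labels
    summary["Contestant"] = summary["Contestant"].str.strip()
    summary["Finish"] = summary["Finish"].str.strip().str.lower()
    return data, summary


data, summary = load_season(DATA_CSV, SUMMARY_CSV)
```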

Now we’re ready to do some analysis. In order to understand how players talk about each other, I need to identify the names mentioned in confessionals.
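One way to pick out names is sketched here. Note the swap: this uses spaCy’s rule-based `entity_ruler` on a known cast list instead of a pretrained statistical NER model, so it runs without downloading a model. Whether the original notebook did it this way is not stated. The helper names and the choice to ignore self-mentions are assumptions.

```python
from collections import Counter

import spacy


def build_name_matcher(names):
    """Blank English pipeline that tags the given names as PERSON entities."""
    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns([{"label": "PERSON", "pattern": name} for name in names])
    return nlp


def count_mentions(confessionals, names):
    """confessionals: iterable of (speaker, text) pairs.

    Returns a Counter keyed by (speaker, mentioned player).
    """
    nlp = build_name_matcher(names)
    counts = Counter()
    for speaker, text in confessionals:
        for ent in nlp(text).ents:
            # Assumption: players talking about themselves are not counted
            if ent.label_ == "PERSON" and ent.text != speaker:
                counts[(speaker, ent.text)] += 1
    return counts
```

A rule-based matcher like this only finds exact names, so nicknames or misspellings in the transcripts would need extra patterns.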

3. Visualise by plotting network graphs

My first instinct is to plot a heatmap of mention counts, with speakers on one axis and the players they mention on the other, so we can easily spot where the big numbers are. This is easily done with seaborn.
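A rough sketch of the heatmap step, assuming the mention counts are already available as (speaker, mentioned, count) triples. The function names are hypothetical, and the plotting libraries are imported inside `plot_heatmap` so that the matrix step works on its own.

```python
import pandas as pd


def mention_matrix(edges):
    """edges: list of (speaker, mentioned, count). Rows are speakers, columns are mentions."""
    df = pd.DataFrame(edges, columns=["speaker", "mentioned", "count"])
    matrix = df.pivot_table(index="speaker", columns="mentioned",
                            values="count", aggfunc="sum", fill_value=0)
    # Use the same names on both axes so the matrix is square
    names = sorted(set(df["speaker"]) | set(df["mentioned"]))
    return matrix.reindex(index=names, columns=names, fill_value=0).astype(int)


def plot_heatmap(matrix):
    """Draw the matrix as an annotated heatmap and return the axes."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(8, 6))
    sns.heatmap(matrix, annot=True, fmt="d", cmap="Blues", ax=ax)
    ax.set_xlabel("Mentioned")
    ax.set_ylabel("Speaker")
    return ax
```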

Unfortunately the heatmap is a bit hard to read… even if we filter down to only the 20 players, it’s still difficult to match labels to the values.

Instead, I think we need a network graph showing who spoke about whom. The players will be nodes, with arrows in between for mentions. More mentions = bigger arrows!
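The idea can be sketched with a networkx `DiGraph`, storing the mention count as an edge weight and reusing it as the arrow width. This assumes the same (speaker, mentioned, count) triples as input. The layout, colour and `scale` parameter are arbitrary choices for illustration.

```python
import networkx as nx


def build_graph(edges):
    """edges: iterable of (speaker, mentioned, count) triples."""
    G = nx.DiGraph()
    for speaker, mentioned, count in edges:
        # The arrow points from the speaker to the player being mentioned
        G.add_edge(speaker, mentioned, weight=count)
    return G


def draw_graph(G, scale=1.0):
    """Draw the graph with arrow widths proportional to mention counts."""
    import matplotlib.pyplot as plt

    pos = nx.spring_layout(G, seed=42)  # fixed seed for a repeatable layout
    widths = [scale * G[u][v]["weight"] for u, v in G.edges()]
    nx.draw_networkx(G, pos, width=widths, arrows=True,
                     node_color="gold", with_labels=True)
    plt.axis("off")
```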

There’s a bit of fidgeting with formatting, as always… but this is in the right direction now.

Immediately it’s very noticeable that David and Zeke are very central, alongside winner Adam. For Zeke, this is probably a reflection of the infamous incident where another player outed him as transgender in front of his whole tribe. With David, this is probably a reflection of his popularity (therefore more of his confessionals getting airtime).

Just looking at the graph, it’s obvious that Zeke is more central than, say, Mari (bottom left, yellow dot). But it’s hard to tell if he is more central than David or Adam. Let’s calculate this properly, so that I can easily check hypotheses like ‘the winner is always the most central of all finalists’.
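The article does not say which centrality measure was used, so this sketch assumes networkx’s `degree_centrality`. That measure counts incoming plus outgoing arrows and ignores the mention counts on the edges. The helper names and the toy players are placeholders, not real season data.

```python
import networkx as nx


def rank_by_centrality(G):
    """Return [(player, centrality)] sorted from most to least central."""
    centrality = nx.degree_centrality(G)  # unweighted: in-degree + out-degree
    return sorted(centrality.items(), key=lambda item: item[1], reverse=True)


def winner_tops_finalists(G, finish):
    """finish: dict mapping player -> 'winner', 'finalist', 'jury', ...

    True if the winner is the most central among the winner and finalists.
    """
    centrality = nx.degree_centrality(G)
    group = [p for p in G.nodes if finish.get(p) in ("winner", "finalist")]
    best = max(group, key=centrality.get)
    return finish[best] == "winner"


# Toy example with placeholder players
toy = nx.DiGraph([("A", "B"), ("B", "A"), ("C", "A"), ("D", "A"), ("B", "C")])
toy_finish = {"A": "winner", "B": "finalist", "C": "finalist", "D": "jury"}
ranking = rank_by_centrality(toy)
holds = winner_tops_finalists(toy, toy_finish)
```

A weighted alternative would be to rank on `G.degree(weight="weight")` instead, which would let heavily mentioned players count for more.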

The winner, Adam, is indeed the most central of all finalists. However, he only ranks 4th out of 20 players overall.

Clearly, I need to do this with multiple seasons to find generic Survivor patterns (if they exist). So I tidied up my code a little and made it slightly more flexible.

In my final notebook, I set things up so that new graphs can be drawn with one line of code. It’s a little slow, but it does the trick for what I need!

4. Draw insights

Here are networks from 3 different seasons. Winners are in red, finalists in blue, jury members green, and the rest yellow.
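The colour scheme can be expressed as a small lookup from the manually added finish label to a node colour. This is a sketch: the label strings and the helper name are assumptions, and any player without a recognised label falls back to yellow.

```python
# Colour per finish label; anything else is treated as "the rest"
FINISH_COLOURS = {"winner": "red", "finalist": "blue", "jury": "green"}
DEFAULT_COLOUR = "yellow"


def node_colours(nodes, finish):
    """Return one colour per node, in the same order as `nodes`."""
    return [FINISH_COLOURS.get(finish.get(node), DEFAULT_COLOUR) for node in nodes]
```

The resulting list can be passed as `node_color` to `nx.draw_networkx`, together with `nodelist=list(G.nodes)` so that colours and nodes stay aligned.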

Winners

Winners (red) are always the most central of all finalists (blue). I wonder if this holds through all 40 seasons!

By design, winners are in the game for the maximum number of days, so there are more episodes where their confessionals can be aired. It might be interesting to see this analysis by episode, doing something like this.

However, it’s not easy to identify winners just based on mentions. Although Rob (RI, middle) and Tony (WaW, right) are both the most central by far in their winning seasons, Adam (MvG, left) breaks the pattern. Rob and Tony are both big characters — and that’s not true of all winners.

“Boston Rob” Mariano (left), still married to Amber who bested him in Survivor: All Stars; and Tony Vlachos (right), two-time winner for Cagayan and Winners at War

Finalists

This is much less consistent. Finalists (blue) aren’t even always in the upper half — look at Natalie in WaW!

Natalie’s non-centrality is driven by the edge of extinction twist. She was actually the first player voted out and only re-joined the game at final-6. Obviously, that meant fewer mentions from fewer players.

In RI, Phillip is much more central than Natalie, driven by many mentions from Rob. However, in MvG, the non-winning finalists have very similar levels of centrality. In all three seasons, prominent jurors (green) come out more central: David, Andrea, Ben.

My conclusion: finalists as a group are not very homogeneous.

So… what did I learn?

I realised that confessionals are telling, but you can’t draw hard conclusions based on them alone.

For starters, confessionals are heavily edited by production to sculpt interesting characters and build a storyline. Players also have very different styles, and even winners come in many shapes and forms. So confessionals paint an incomplete picture, in the same way that you can’t analyse the game using only challenge wins.

Having said that, this is a very basic piece of analysis that only scratches the surface.

Some ideas for further analyses:

Full picture: analysing centrality for all 40 seasons, possibly segmenting by the twists in play (with vs without immunity idols, with vs without potential redemption after being voted out)
Episode by episode view: this could show anomalies like controversial incidents, medical evacuations, and classic moments such as Erik giving away his individual immunity.
Signal spotting: Which way will a challenge or a vote go? Fans in early seasons noticed that during suspenseful votes, host Jeff Probst tended to first read the name of the player not going home. Perhaps there are also tell-tale signs from confessionals.
Sentiment analysis: differentiating between positive, neutral and negative mentions might show us alliances and voting blocs. This can be done using tools like this.
Returnees: any relationship between confessionals and whether players come back in a future season?
Predictive power: combining this with challenge performance and voting outcomes, how well can we predict players’ finish? How important a predictor would the confessionals be?

I had fun picking up networkx for the first time while thinking about one of my favourite shows. What cool Survivor network graphs would you like to see? What about for other TV shows, movies, and books?