Analyzing the shoulders of giants: Breaking Bad (1)

Walter Hartwell White, the main protagonist of Breaking Bad.

Well, let’s face one of the most important axioms in this world: everyone loves Breaking Bad.

I must say, I have met a lot of people who do not like it. Well, OK, fair enough, but I strongly believe that anyone with a lil’ bit of serious TV series background cannot deny that Breaking Bad is one of the best series ever made. I’d like to point out that I’m not a Breaking Bad fanboy (Scrubs FTW!), although I find myself agreeing with that last sentence. BB is precisely directed, everything has a specific place in the screenplay and there are no plot holes. The dialogue is never flat and each character has a particular and detailed role. The series is an intricate network of human nature, criminality and sin.

Why are we here? Well, suddenly, after a rewatch, the computer scientist soul that devours me came up from hell and said erggggh, we niiidd to do someeeefhiing. I had to listen to it, so I started thinking about whether there was something interesting I could get out of BB by analyzing it.

Introduction (or WTF am I doing?)

In order to analyze BB I need to access and process its content, the show itself; that means processing its video, audio or script. Now, there are some issues with the first two sources of information:

  • video: one of the most common approaches would be deep learning, but I can’t afford to build a neural network and train it on the whole series;
  • audio: I really do not know enough about audio processing to do that :(

Thus, I decided to go the script way. Unfortunately, I didn’t find any script of the series (or at least a complete one covering every episode), so I went for subtitles. Plenty of subtitles are available on the Internet and finding those for BB was an easy job: I downloaded them from here. For each episode, there are various versions of the same subtitle from different releasers. In a preprocessing step, I tried to select subtitles from the same releaser whenever possible.

In the following sections, I’ll briefly explain what I did, show you some cute graphs and try to come up with some explanation. This is the first part of a series of posts in which I try to analyze BB (and maybe other TV shows).

Obviously, I don’t claim any scientific significance here, it’s just for killing the cat!

I DON’T CARE, gimme da code!

All of the code I used in each preprocessing step is available on GitHub. The whole point of this post was to improve my expertise with some Python libraries such as matplotlib, scikit, etc. I don’t think I will clean up the code or make it fast for now, due to a lack of time.

Well, let’s start with this first part!

Distribution of talking time

I remember another series, Lost, in which the characters seemed to talk A LOT! They used to explain everything happening on the island and, most of the time, everyone (including the viewer) understood NOTHING. So, I was curious to find out the talking time of BB.

I started by mining the talking time for each episode from the subtitles. Each row of a subtitle (which is a line, or part of a line, spoken by a character) has a starting time point (STP) and an ending time point (ETP). Given these values, the length of a row is the difference between ETP and STP, so the talking time for an episode is the sum of all row lengths in its subtitle.
For the running time of each episode, I simply scraped the web.
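The row-length sum described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code from the GitHub repository: it assumes the subtitles are in SRT format (timing lines like `00:00:01,000 --> 00:00:03,500`), and the name `talking_time` and the sample text are made up for the example.

```python
import re
from datetime import timedelta

# Matches an SRT timing line such as "00:01:02,500 --> 00:01:05,000".
# Assumption: the subtitles are SRT files; other formats need another pattern.
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*"
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})"
)


def talking_time(srt_text: str) -> timedelta:
    """Sum (ETP - STP) over every timed row found in the subtitle text."""
    total = timedelta()
    for match in TIMING.finditer(srt_text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
        stp = timedelta(hours=h1, minutes=m1, seconds=s1, milliseconds=ms1)
        etp = timedelta(hours=h2, minutes=m2, seconds=s2, milliseconds=ms2)
        total += etp - stp
    return total


# Made-up two-row subtitle, just to exercise the function.
sample = """1
00:00:01,000 --> 00:00:03,500
First line of dialogue.

2
00:00:04,000 --> 00:00:06,000
Second line of dialogue.
"""

print(talking_time(sample).total_seconds())  # 4.5
```

Note that this counts the time a subtitle stays on screen, which is only a proxy for the time someone is actually speaking.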

Talking time vs Running time for each Breaking Bad episode.

According to the data, the average talking time for a BB episode is about 24 minutes, and the average running time of an episode is about 47 minutes. The talk density, defined as talking time divided by running time, is 0.5047, which means about 50% of the series is talking! I actually expected this result, given the kind of show.

Moreover, the minimum talking time is at episode 4x10 with ~18 minutes and the maximum is at episode 5x4 with ~31 minutes.

We can sum the data by season, in order to check whether the talk density holds for each season:

Indeed, we have 48.16% for the first season, 47.18% for the second, 47.93% for the third, 51.02% for the fourth and 55.77% for the fifth one. Note that the first season consists of only seven episodes while the second season has thirteen, yet the talk density for the first season is greater than that of the second one!
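A per-season talk density can be computed along these lines. This is a hedged sketch: the post does not say whether the per-season figures come from summed totals or from averaging per-episode densities, so the version below assumes summed totals. The function name `season_density` and the numbers in `demo` are invented for illustration and are not the real BB data.

```python
from collections import defaultdict


def season_density(episodes):
    """Return {season: total talking time / total running time}.

    `episodes` is an iterable of (season, talking_minutes, running_minutes).
    Assumption: density is computed on per-season totals.
    """
    talk = defaultdict(float)
    run = defaultdict(float)
    for season, talking, running in episodes:
        talk[season] += talking
        run[season] += running
    return {season: talk[season] / run[season] for season in talk}


# Invented numbers, only to show the shape of the input and output.
demo = [(1, 20.0, 50.0), (1, 30.0, 50.0), (2, 24.0, 48.0)]
print(season_density(demo))  # {1: 0.5, 2: 0.5}
```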

Polarity per episode

In sentiment analysis, the polarity of a sentence or, more generally, of a text object is the degree of positivity or negativity of the object itself. Before giving the results of this kind of analysis, I must say with high confidence that they are only for fun and curiosity. Indeed, to compute the polarity of a dialogue between two or more characters, which is practically what the script of an episode is, we need (at least)

  • to map each line to a character: using only subtitles, this cannot be achieved;
  • to structure the text in a dialogue-like form: again, using only subtitles, this is a huge issue; subtitles mix up lines from various characters because they primarily serve a viewer who is watching and reading at the same time.

Anyway, here are the results!

Given that the results are not reliable at all, we see that the minimum polarity corresponds to the third episode of the first season (we all know what happens there :P) while the maximum corresponds to the tenth episode of the second season.
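For readers who have never seen a polarity score computed, here is a toy sketch of the general idea. It is not the method behind the results above: it uses a tiny hand-written lexicon (`LEXICON` and `polarity` are hypothetical names invented for this example), whereas a real analysis would rely on a proper sentiment lexicon or library.

```python
import re

# Hypothetical toy lexicon: +1 for positive words, -1 for negative ones.
LEXICON = {
    "good": 1.0, "great": 1.0, "love": 1.0,
    "bad": -1.0, "kill": -1.0, "hate": -1.0,
}


def polarity(text: str) -> float:
    """Average score of the lexicon words found in `text`, in [-1, 1].

    Returns 0.0 when no lexicon word occurs in the text.
    """
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


print(polarity("This is a good day, a great day."))           # 1.0
print(polarity("I hate this bad, bad plan but I love you."))  # -0.5
```

Applied to a whole episode, the text would simply be all subtitle rows joined together, which is exactly why the speaker and dialogue-structure issues listed above make the score unreliable.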

Some sort of conclusion

Well, I had fun writing this but I’m not completely satisfied with some of the results: the time-related ones were too simple, while the polarity one needs a lot more research in order to understand how well natural language processing can be applied here.

In any case, I hope you enjoyed this excursus! Feel free to leave comments, critiques or ideas!