Analyzing Star Trek scripts with Python

Published in SkyTech · 4 min read · Feb 4, 2023

Recently, I came across a dataset on Kaggle with every line of dialogue in the Star Trek TV franchise. I decided to answer some questions. To start with, which characters had the most lines?

# Top 10 characters by line count.
# `data` is assumed to be the Kaggle dataset loaded as nested dicts:
# data[series][episode][character] -> list of that character's lines
line_counts = {}
for series in data:
    for episode in data[series]:
        for character, lines in data[series][episode].items():
            # add this episode's line count to the character's running total,
            # starting from 0 the first time we see the character
            line_counts[character] = line_counts.get(character, 0) + len(lines)
top_characters = sorted(
    line_counts.items(),
    key=lambda item: item[1],
    reverse=True,
)[:10]
print(top_characters)

[('PICARD', 12438),
 ('JANEWAY', 11358),
 ('KIRK', 10461),
 ('SISKO', 8664),
 ('ARCHER', 8648),
 ('RIKER', 7654),
 ('DATA', 6304),
 ('WORF', 5895),
 ('EMH', 5760),
 ('CHAKOTAY', 5595)]

If you’re a fan, you’ll find it unsurprising that the captain of each series tops the list. What might be surprising is that fan favorite Mr. Spock didn’t make the cut. Where’s Mr. Spock? There’s a simple explanation: the original Star Trek series only ran for 3 seasons, as opposed to the 7 seasons that later series enjoyed.
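For anyone following along, the tally above can also be written with `collections.Counter`. This is only a sketch on a tiny made-up dataset: `toy_data`, its episode keys, and its lines are invented for illustration and merely mimic the nested series, episode, character, lines shape that the loops above rely on.

```python
from collections import Counter

# Hypothetical stand-in for the real dataset (invented lines, same nesting):
# toy_data[series][episode][character] -> list of that character's lines
toy_data = {
    "TOS": {
        "ep1": {"KIRK": ["Report.", "Spock?"], "SPOCK": ["Fascinating."]},
        "ep2": {"KIRK": ["Energize."], "SPOCK": ["Indeed."]},
    },
    "TNG": {
        "ep1": {
            "PICARD": ["Engage.", "Make it so.", "Tea.", "Come."],
            "DATA": ["Intriguing."],
        },
    },
}

def count_lines(data):
    """Return a Counter mapping each character to their total number of lines."""
    counts = Counter()
    for episodes in data.values():
        for characters in episodes.values():
            for character, lines in characters.items():
                counts[character] += len(lines)
    return counts

toy_counts = count_lines(toy_data)
# most_common(n) returns the n highest counts, largest first
print(toy_counts.most_common(2))
```

On the real data, `count_lines(data).most_common(10)` should produce the same kind of top 10 list as the sorted dictionary.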

Let’s see which series had the most lines:

series_counts = {}
for series in data:
    for episode in data[series]:
        for character, lines in data[series][episode].items():
            # add this character's line count in this episode to the series total
            series_counts[series] = series_counts.get(series, 0) + len(lines)
# sort from most to fewest lines (a new name, so the loop variable isn't reused)
series_ranked = sorted(series_counts.items(), key=lambda item: item[1], reverse=True)

print(series_ranked)
#[('VOY', 68335),
# ('DS9', 67096),
# ('TNG', 62605),
# ('ENT', 35097),
# ('TOS', 29195),
# ('TAS', 4326)]

So… Star Trek: Voyager is our most dialogue-heavy series, with Star Trek: Deep Space Nine close behind.

Another question we can ask is, how often do captains speak compared to their crew? For this, we can divide the number of times the captain speaks in each series by the total dialogue count for each series.

series_captains = {
    "TOS": "KIRK",
    "TAS": "KIRK",
    "ENT": "ARCHER",
    "TNG": "PICARD",
    "DS9": "SISKO",
    "VOY": "JANEWAY"
}
captain_counts = {}
for series in data:
    for episode in data[series]:
        for character, lines in data[series][episode].items():
            if character == series_captains[series]:
                captain_counts[series] = captain_counts.get(series, 0) + len(lines)
# fraction of each series' lines spoken by its captain
percentage_talk = {}
for series in series_captains:
    percentage_talk[series] = captain_counts[series] / series_counts[series]

print(percentage_talk)
#{'TOS': 0.3120054803904778,
# 'TAS': 0.3069810448451225,
# 'ENT': 0.24398096703421945,
# 'TNG': 0.19821100551074194,
# 'DS9': 0.12912841302015024,
# 'VOY': 0.16607887612497257}

Sure enough, Captain Kirk does a lot of talking! In the Original Series, the captain has 31% of the lines. Compare this to Deep Space Nine, where Sisko has only about 13% of the lines.
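The division behind these percentages can be wrapped in a small helper. The following is a minimal sketch rather than the notebook's code: `captain_share` is a hypothetical function name, and the counts passed in are made up so the arithmetic is easy to check by hand.

```python
def captain_share(captain_counts, series_counts):
    """Return each series' captain line count as a fraction of all its lines."""
    return {
        series: captain_counts.get(series, 0) / total
        for series, total in series_counts.items()
        if total > 0  # skip series with no dialogue to avoid dividing by zero
    }

# Made-up counts for two imaginary series, NOT values from the dataset
example_captains = {"AAA": 25, "BBB": 10}
example_totals = {"AAA": 100, "BBB": 80}

shares = captain_share(example_captains, example_totals)
for series, share in shares.items():
    # the ".1%" format multiplies by 100 and appends a percent sign
    print(f"{series}: {share:.1%}")
```

Formatting the fraction as a percentage at print time keeps the stored values unrounded, which avoids quoting a truncated figure by mistake.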

Now here’s a question the computer shouldn’t be able to answer: which Star Trek character is happiest?

To answer this question, we can turn to a Python library for emotion detection called text2emotion. Libraries like these look for keywords and phrases to try to detect the emotion in a sentence. This library isn’t by any means the best available and it had some trouble detecting basic emotions, so I wouldn’t recommend it, but it is still pretty neat.

Here’s an example of a phrase it registered as angry:

import text2emotion as te

te.get_emotion("I can't believe you")
# {'Happy': 0.0, 'Angry': 1.0, 'Surprise': 0.0, 'Sad': 0.0, 'Fear': 0.0}

However, the phrase “Over my dead body!” is not recognized as angry.

te.get_emotion('over my dead body!')
#{'Happy': 0, 'Angry': 0, 'Surprise': 0, 'Sad': 0, 'Fear': 0}

So we should take the results in the next section with a grain of salt.
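To see why an idiom can slip through, here is a toy keyword-based scorer. It is not how text2emotion is implemented; the tiny `TOY_LEXICON` and the `toy_emotion` function are invented purely to illustrate the general keyword-lookup idea.

```python
import re

# Invented mini-lexicon: each keyword maps to one emotion label
TOY_LEXICON = {
    "happy": "Happy", "glad": "Happy",
    "furious": "Angry", "hate": "Angry",
    "afraid": "Fear", "sorry": "Sad",
}
EMOTIONS = ["Happy", "Angry", "Surprise", "Sad", "Fear"]

def toy_emotion(text):
    """Score text by the share of matched keywords belonging to each emotion."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    scores = {emotion: 0.0 for emotion in EMOTIONS}
    for emotion in hits:
        # each matched keyword contributes an equal share of the total
        scores[emotion] += 1 / len(hits)
    return scores

print(toy_emotion("I hate this, I am furious"))
print(toy_emotion("Over my dead body!"))
```

The second phrase contains no word from the lexicon, so every score stays at zero. A scorer built this way can only react to words it already knows, which is one plausible reason figurative phrases go undetected.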

Nevertheless, let us continue. The first step is to tally up each character’s total score for each emotion. I ran this only against Deep Space Nine, and it still took hours, so if you’re following along, make sure you have some free time.


# `top_15` is assumed to be defined earlier in the notebook as the collection
# of character names to score (presumably the 15 characters with the most lines)
sums = {'Happy': 0, 'Angry': 0, 'Surprise': 0, 'Sad': 0, 'Fear': 0}
output = {}
for episode, character_lines in data['DS9'].items():
    for character, lines in character_lines.items():
        if character in top_15:
            if character not in output:
                # running emotion totals plus a line count for this character
                output[character] = {**sums, 'Count': 0}
            for line in lines:
                # score each line once; get_emotion is the slow step
                result = te.get_emotion(line)
                output[character]['Count'] += 1
                for key in result:
                    output[character][key] += result[key]

This gives us something like the following, which we’d normalize by dividing the score for each character by their line count:

# note - the actual line count and numbers are higher
results = {
"SISKO": {'Happy': 352.5899999999997,
  'Angry': 184.1900000000002,
  'Surprise': 343.3099999999999,
  'Sad': 622.3799999999997,
  'Fear': 947.7399999999998,
  'Count': 4599}
}

# average the results: divide each emotion total by the character's line count
adjusted = {}
for char in results:
    adjusted[char] = results[char].copy()
    # remove the count first so only the emotion scores remain
    count = adjusted[char].pop('Count')
    if count > 0:
        for emotion in adjusted[char]:
            adjusted[char][emotion] = results[char][emotion] / count

We can then load these results into a pandas DataFrame and play around with them:

Dataframe showing DS9 main character by happiness in ascending order
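A table like this could be built roughly as follows. This is a sketch with placeholder scores, not real results: `toy_adjusted` and the `CHAR_*` names are made up and only stand in for the `adjusted` dictionary computed above.

```python
import pandas as pd

# Placeholder averages, NOT real results; same shape as `adjusted`
toy_adjusted = {
    "CHAR_A": {"Happy": 0.10, "Angry": 0.05, "Surprise": 0.08, "Sad": 0.15, "Fear": 0.20},
    "CHAR_B": {"Happy": 0.07, "Angry": 0.09, "Surprise": 0.06, "Sad": 0.12, "Fear": 0.25},
    "CHAR_C": {"Happy": 0.12, "Angry": 0.04, "Surprise": 0.10, "Sad": 0.11, "Fear": 0.18},
}

# orient="index" turns each outer key (a character) into a row
df = pd.DataFrame.from_dict(toy_adjusted, orient="index")

# sort characters by their "Happy" score, lowest first
by_happiness = df.sort_values("Happy", ascending=True)
print(by_happiness)
```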

And make some bar graphs using matplotlib as ‘plt’:
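A bar graph of one emotion could be drawn roughly like this. Again, this is a sketch: the scores are placeholders rather than real results, and `plot_emotion` is a hypothetical helper name.

```python
import matplotlib.pyplot as plt

# Placeholder "Happy" averages, NOT real results
toy_happy = {"CHAR_A": 0.10, "CHAR_B": 0.07, "CHAR_C": 0.12}

def plot_emotion(scores, emotion):
    """Draw one bar per character, sorted from lowest to highest score."""
    ordered = sorted(scores.items(), key=lambda item: item[1])
    names = [name for name, _ in ordered]
    values = [value for _, value in ordered]
    fig, ax = plt.subplots()
    ax.bar(names, values)
    ax.set_ylabel(f"Average {emotion} score")
    ax.set_title(f"{emotion} score by character")
    return ax

ax = plot_emotion(toy_happy, "Happy")
# in a notebook the figure renders automatically; in a script, call plt.show()
```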

Here we can see that the happiest-sounding characters among the main cast of Star Trek: Deep Space Nine are Garak, Quark, and Jake, while the angriest-sounding are Dukat, O’Brien, and Dax. Garak and Quark are also among the most sarcastic characters, so a further question is whether these “Happiness” scores are genuine or the algorithm is missing the sarcasm in some of the text.

That’s it for today in analyzing Star Trek. All in all, it was pretty easy to quickly get insights from this dataset. All I had to do was a quick Google search for “star trek scripts dataset”, think of some questions that weren’t too difficult to answer, and get cracking.

The full code for this article is available as a Jupyter notebook at https://github.com/skyfox93/trek_analytics

Written by Skylar S