This new Python package makes text analysis fun

By: Brad Lindblad
LinkedIn | Github | Blog | Twitter


Following the success of the {schrute} R package, many requests came in for the same dataset in Python. The schrute and schrutepy packages serve one purpose only: to load the complete transcripts of The Office, so you can perform NLP, text analysis, or whatever else you like with this fun dataset.

Quick start

Install the package with pip:

pip install schrutepy

Then import the dataset into a dataframe:

from schrutepy import schrutepy
df = schrutepy.load_schrute()

That’s it. Now you’re ready.
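As a first sanity check on the loaded data, it can help to count how many lines each character speaks. The sketch below is a minimal illustration, not part of schrutepy: it runs on hypothetical placeholder rows rather than the real transcripts, and it assumes each row exposes a `character` column (the name shown for the R version of the dataset). The helper `lines_per_character` is made up for this example.

```python
from collections import Counter

# Placeholder rows for illustration only; the real rows come from
# schrutepy.load_schrute(). The "character" and "text" column names
# are assumed to match the R dataset.
sample_rows = [
    {"character": "Michael", "text": "Placeholder line one."},
    {"character": "Jim", "text": "Placeholder line two."},
    {"character": "Michael", "text": "Placeholder line three."},
]


def lines_per_character(rows):
    """Count how many rows (spoken lines) belong to each character."""
    return Counter(row["character"] for row in rows)


counts = lines_per_character(sample_rows)
print(counts)
```

With the real dataframe, something like `lines_per_character(df.to_dict("records"))` should give the same kind of tally, assuming the column names line up.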

Long example

Now we’ll quickly work through some common elementary text analysis functions.

from schrutepy import schrutepy
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import nltk
from nltk.corpus import stopwords
from PIL import Image
import numpy as np
import collections
import pandas as pd
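The imports above point toward a common workflow: drop stop words, count the remaining words, and feed the counts to a word cloud. Since the rest of the walkthrough is not shown here, the following is only a rough sketch of the counting step using the standard library. The tiny `STOP_WORDS` set stands in for nltk's much longer list, and the sample sentences and the `word_counts` helper are invented for illustration.

```python
import collections
import re

# Tiny stand-in stop-word list (assumption); nltk.corpus.stopwords
# provides a far more complete one.
STOP_WORDS = {"the", "a", "is", "to", "and", "of"}


def word_counts(texts, stop_words=STOP_WORDS):
    """Lowercase each text, split it into words, drop stop words, and count."""
    counter = collections.Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counter.update(t for t in tokens if t not in stop_words)
    return counter


# Made-up sentences, not real transcript lines.
sample_texts = [
    "The paper is the product.",
    "Paper and more paper.",
]

top = word_counts(sample_texts).most_common(1)
print(top)  # [('paper', 3)]
```

On the real data you would pass the text column instead of `sample_texts`, assuming it is named `text` as in the R version. The resulting counter is a plain word-to-count mapping, which is the shape `WordCloud.generate_from_frequencies` accepts.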


Probably with whiskey


How many times have you looked back on code you wrote a few months back and thought, “what the hell was that?” I regularly scratch my head at code I wrote a few days prior, especially if I rationalized my spaghetti code away as a “scratch file.”

There is objectively good code and bad code, in the same way there is good writing and bad writing. …


The Jimothy of text datasets


GitHub Repo

This package contains only one thing: the complete transcripts of every episode of The Office (US version)!

Use this data set to master NLP or text analysis. Let’s scratch the surface of the subject with a few examples from the excellent Text Mining with R book, by Julia Silge and David Robinson.

First, install the package from CRAN:

# install.packages("schrute") 
library(schrute)

The schrute package contains only one data set; assign it to a variable:

mydata <- schrute::theoffice

Take a peek at the format:

dplyr::glimpse(mydata)
#> Observations: 55,130
#> Variables: 7
#> $ index <int> 1, 358, 715, 1072, 1429, 1786, 2143, 2500, 2857...
#> $ season <chr> "01", "01", "01", "01", "01", "01", "01", "01",...
#> $ episode <chr> "01", "01", "01", "01", "01", "01", "01", "01",...
#> $ episode_name <chr> " Pilot", " Pilot", " Pilot", " Pilot", " Pilot...
#> $ character <chr> "Michael", "Jim", "Michael", "Jim", "Michael", ...
#> $ text <chr> " All right Jim. Your quarterlies look very goo...
#> $ text_w_direction <chr> " All right Jim. …

About

Brad Lindblad

I do data science, machine learning, fly drones, paint and write.
