Following the success of the {schrute} R package, many requests came in for the same dataset ported over to Python. The schrute and schrutepy packages serve one purpose only: to load the complete transcripts of The Office, so you can perform NLP, text analysis or whatever else you like with this fun dataset.

Quick start

Install the package with pip:

pip install schrutepy

Then import the dataset into a dataframe:

from schrutepy import schrutepy
df = schrutepy.load_schrute()

That’s it. Now you’re ready.

Long example

Now we’ll quickly work through some common elementary text analysis functions.

from schrutepy import schrutepy
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import nltk
from nltk.corpus import stopwords
from PIL import Image
import numpy as np
import collections
import pandas as pd
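
A word-frequency count is one of the most basic analyses these imports point toward. The block below is a minimal sketch using only the standard library, not the package author's own walkthrough. The rows are made-up placeholders rather than real transcript lines, the `character` and `text` field names are assumed to match the columns of the R data set, and the stop-word set is a tiny hand-picked stand-in for the fuller list in `nltk.corpus.stopwords`. The helper names `tokenize` and `word_counts` are hypothetical.

```python
import collections
import re

# Made-up placeholder rows, NOT real transcript lines. With the package
# installed, the rows would come from schrutepy.load_schrute() instead.
# Assumption: the Python dataframe has "character" and "text" columns.
SAMPLE_ROWS = [
    {"character": "Michael", "text": "The paper is the best paper in town."},
    {"character": "Jim", "text": "That is a lot of paper."},
    {"character": "Michael", "text": "Sell the paper, Jim."},
]

# Tiny hand-picked stop-word set, used here only to keep the sketch
# self-contained.
STOP_WORDS = {"the", "is", "in", "a", "of", "that"}


def tokenize(text):
    """Lowercase the text and return its words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(rows, character=None):
    """Count non-stop-words, optionally restricted to one character's lines."""
    counts = collections.Counter()
    for row in rows:
        if character is not None and row["character"] != character:
            continue
        counts.update(w for w in tokenize(row["text"]) if w not in STOP_WORDS)
    return counts


all_counts = word_counts(SAMPLE_ROWS)
michael_counts = word_counts(SAMPLE_ROWS, character="Michael")
print(all_counts.most_common(3))
```

A `Counter` like this is a plain word-to-count mapping, so it should be usable with `WordCloud.generate_from_frequencies` to draw a word cloud once the real data is loaded.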

The Jimothy of text datasets


This package contains only one thing: the complete transcripts of every episode of The Office (US version).

Use this data set to master NLP or text analysis. Let’s scratch the surface of the subject with a few examples from the excellent Text Mining with R book, by Julia Silge and David Robinson.

First, install the package from CRAN:

# install.packages("schrute") 

The schrute package contains only one data set; assign it to a variable:

mydata <- schrute::theoffice

Take a peek at the format:

dplyr::glimpse(mydata)
#> Observations: 55,130
#> Variables: 7
#> $ index            <int> 1, 358, 715, 1072, 1429, 1786, 2143, 2500, 2857...
#> $ season           <chr> "01", "01", "01", "01", "01", "01", "01", "01",...
#> $ episode          <chr> "01", "01", "01", "01", "01", "01", "01", "01",...
#> $ episode_name     <chr> " Pilot", " Pilot", " Pilot", " Pilot", " Pilot...
#> $ character        <chr> "Michael", "Jim", "Michael", "Jim", "Michael", ...
#> $ text             <chr> " All right Jim. Your quarterlies look very goo...
#> $ text_w_direction <chr> " All right Jim. …


