This new Python package makes text analysis fun

By: Brad Lindblad

After the success of the {schrute} R package, many requests came in to port the same dataset to Python. The schrute and schrutepy packages serve one purpose only: to load the complete transcripts of The Office, so you can practice NLP, text analysis, or anything else with this fun dataset.

Quick start

Install the package with pip:

Then import the dataset into a dataframe:
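A minimal sketch of both steps, under two assumptions worth checking against the package README: that the package installs from PyPI under the name schrutepy, and that it exposes a load_schrute() function returning a pandas DataFrame. The helper name load_office_transcripts is hypothetical and used only for illustration.

```python
# Step 1 (run in a shell, not in Python); assumes the PyPI name is "schrutepy":
#   pip install schrutepy


def load_office_transcripts():
    """Return The Office transcripts as a dataframe.

    Assumption: schrutepy exposes `schrutepy.load_schrute()`.
    The import is done lazily so this file can be loaded even
    when the package is not installed yet.
    """
    from schrutepy import schrutepy  # third-party; installed in step 1

    return schrutepy.load_schrute()
```

Calling df = load_office_transcripts() and then df.head() should show the first few rows, assuming the install succeeded and the data can be downloaded.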

That’s it. Now you’re ready.

Long example

Now we’ll quickly work through some common elementary text analysis functions.
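As a rough illustration of what "elementary text analysis" can mean here, the sketch below counts words and lines per speaker using only the standard library. The rows are invented placeholder lines, not real transcript data, and the "character" and "text" field names are assumptions about how the dataset is laid out; swap in the real dataframe columns once you have them.

```python
import re
from collections import Counter

# Invented placeholder rows (NOT real transcript lines); the field names
# "character" and "text" are assumptions for this sketch.
rows = [
    {"character": "Michael", "text": "I love this paper company."},
    {"character": "Dwight", "text": "Paper is the future of this company."},
    {"character": "Jim", "text": "I sell paper to people."},
    {"character": "Michael", "text": "The people love paper."},
]

# A tiny hand-picked stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "is", "i", "to", "of", "and", "this"}


def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(rows, stop_words=STOP_WORDS):
    """Count every non-stop-word across all rows."""
    counts = Counter()
    for row in rows:
        counts.update(w for w in tokenize(row["text"]) if w not in stop_words)
    return counts


def lines_per_character(rows):
    """Count how many lines each character speaks."""
    return Counter(row["character"] for row in rows)


if __name__ == "__main__":
    print(word_counts(rows).most_common(3))
    print(lines_per_character(rows).most_common())
```

With a real dataframe, the same idea would apply after converting it to records (for example with df.to_dict("records") in pandas).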


Probably with whiskey

How many times have you looked back on code you wrote a few months back and thought, “what the hell was that?” I regularly scratch my head at code I wrote a few days prior, especially if I rationalized my spaghetti code away as a “scratch file.”

There is objectively good code and bad code, in the same way there is good writing and bad writing. Writing good code isn’t all that different from writing good prose. …


The Jimothy of text datasets

Github Repo

This package does only one thing: it provides the complete transcripts of every episode of The Office (US version).

Use this data set to master NLP or text analysis. Let’s scratch the surface of the subject with a few examples from the excellent Text Mining with R book, by Julia Silge and David Robinson.

First, install the package from CRAN:

There is only one data set in the schrute package; assign it to a variable:

Take a peek at the format:


I’m proud to announce the release of an R package that scratches one of my own itches: pulling and working with USDA data, specifically Quick Stats data from NASS. tidyUSDA is a minimal package for doing just that. The following is excerpted from the package vignette, which you can find here: https://github.com/bradlindblad/tidyUSDA

Why tidyUSDA?

Why do we need yet another “tidy” package? Why do I have to install so many geospatial dependencies?

Valid questions. If you work with USDA data, you know that it is difficult at times to find what you need, when you need it. The sheer…


The “Great Restructuring” of our economy is underway. That’s the official name for what we know is happening: the best are rising to the top, and the mediocre are sinking to the bottom. It’s the Matthew Principle in motion.

In their 2011 book Race Against the Machine, Brynjolfsson and McAfee detail how this New Economy will favor those who have the skill set to work with new technologies such as deep learning and robotics, or the capital to invest in them, as these technologies become more ubiquitous every day.

Cal Newport’s Deep Work outlines two core abilities for thriving in this new economy:

1…


Follow these simple steps to install custom themes in RStudio (themes included!)

When my brain is tapped out after a grueling morning of data sciencing, my browser will invariably point to a smorgasbord of websites that RescueTime will classify as “entertainment.” These range from the Hot Network Questions page on Stack Overflow (“Question: why do demons heal faster than angels?”) to Hadley’s commits over the last six months.

After I feel like I’ve seen the entire internet, I’ll sometimes start fiddling with my IDE theme, as if changing the hex color of my function calls will actually improve my work. …


Working with Wood is Fun.

Security cam footage of my office

The era of the specialist is over.

Many adept technologists agree that being good at one thing is the same as being good at none: “two is one and one is none.” Those who can work across many disciplines, the polymaths, will dominate the future of business. As technology advances at an exponential rate, new industries are emerging from the cross-pollination of existing disciplines. Blockchain is a good example. …



tl;dr: If you are impatient and want to go straight to the fun stuff, here is a link to the interactive web dashboard that I built to supplement this analysis. Best viewed on a desktop PC.

On a recent Monday afternoon that I would normally have spent slumped over my desk gingerly pecking out bits of code, I stumbled across an article about my hometown that made me do a double-take: “Fargo violent crime tops U.S. national average for the first time.” …


Brad Lindblad

I do data science, machine learning, fly drones, paint and write.
