Stories by Greg Rafferty on Medium

The Strategic Data Scientist: What 200 Years of Automation Can Teach Us About Thriving in the AI…

Greg Rafferty — Wed, 23 Jul 2025 21:18:21 GMT

The Strategic Data Scientist: What 200 Years of Automation Can Teach Us About Thriving in the AI Era

Image generated by Gemini

The data scientists who survive won’t be the ones who code better than ChatGPT — they’ll be the ones who think strategically

This article is an excerpt from my new book about how data scientists can not just survive the AI wave, but use it to level up their careers. If you’d like to learn more, please check it out on Amazon!

In 2023, ChatGPT wrote its first (successful) SQL queries. By 2024, it was debugging Python code, generating data visualizations, and explaining statistical concepts better than most junior analysts. Today, generative AI can automate much of what we’ve traditionally called “data science work”: the technical execution that used to fill our days and justify our salaries.

If you’re a data scientist watching AI tools get better at your daily tasks, you’re probably asking the same question that skilled workers have asked for centuries: Will technology make me obsolete?

History has an answer. And it’s more nuanced than you’d expect. The patterns that emerge from 200 years of technological disruption reveal not just what’s coming, but exactly how to position yourself ahead of it.

The Historical Pattern: Automation Eats the Bottom First

This isn’t the first time technology has reshaped entire professions. Every major technological revolution follows a predictable pattern: automation starts with the most routine, codifiable tasks and works its way up the skill ladder. Understanding this pattern reveals where we’re headed and where the safe harbor lies.

The Industrial Revolution (1760–1840s): From Craft to Factory

Before the Industrial Revolution, operating a loom was a demanding craft. A master weaver coordinated hands and feet with perfect rhythm, maintained even tension across threads, and executed complex patterns from memory. Different fibers required different techniques. A single mistake could ruin yards of fabric. Weaving was a precise, responsive art demanding years of practice.

Then came mechanization. Textile production was broken down into discrete, repeatable steps. The complex craft of weaving was reduced to tending machines that worked faster and more consistently than human hands. Skilled weavers became machine operators overnight.

The same pattern played out across industries. Blacksmiths who once forged custom tools became factory workers feeding raw materials into standardized processes. The artisan’s comprehensive knowledge was replaced by the factory worker’s specialized efficiency.

Most data science work today follows this same pattern. We take complex business questions, break them down into standardized analytical steps, and execute them using established tools and frameworks. The craft of data analysis has been industrialized.

The Second Industrial Revolution (~1870–1914): Electrification and Mass Production

Henry Ford’s innovation wasn’t just building cars; it was breaking car manufacturing into tiny, repeatable tasks that could be performed by workers with minimal training. Complex processes were decomposed into simple, measurable steps. Training time dropped from years to weeks.

Modern data science teams often work exactly like Ford’s factories. Junior analysts handle data cleaning and basic aggregations. Mid-level data scientists focus on specific modeling techniques. Senior practitioners review outputs and communicate results. Each person handles their piece of the analytical assembly line, but few understand or control the entire process.

The Computer Revolution (1940s–1980s): Automating Calculation

The computer revolution targeted information processing. Before computers, large organizations employed rooms full of “human computers”: people whose job was to perform calculations and process information by hand. These weren’t unskilled workers; they often had strong mathematical backgrounds and were the data scientists of their era.

But computers didn’t just make calculation faster; they made it scalable. One machine could do the work of dozens of human computers, working 24/7 without breaks, errors, or salary negotiations. The human computers thought their mathematical skills made them irreplaceable. They were wrong.

Today’s data scientists often make the same mistake, thinking our technical skills — our ability to write SQL, manipulate data, and build models — make us indispensable. But these skills are becoming as automatable as the mathematical calculations that once required human computers.

The Internet and Open-Source Era (1990s–2010s): Access, Scale, and Commoditization

The internet democratized access to information and tools. Knowledge that once required expensive training became freely available. Complex technical capabilities were packaged into user-friendly tools that didn’t require deep expertise to use.

This era was about commoditization. Tasks that once required specialized expertise became accessible to anyone willing to spend a weekend learning the basics. Data analysis, web scraping, and statistical modeling became commoditized skills.

Most techniques that feel “advanced” in data science today are actually mature, well-documented processes packaged into plug-and-play libraries. Exploratory data analysis, feature engineering, model selection, and performance evaluation all follow established patterns that can be systematized and automated.

The Gen AI Leap (2020s–): Automating Thought Templates

This brings us to today’s revolution, which is fundamentally different. Previous waves replaced physical labor, calculation, and information processing. Gen AI is targeting the patterns of thought itself.

For the first time, we have tools that don’t just store knowledge. They mimic expertise. They can take a business question, reason through analytical approaches, generate code, interpret results, and communicate findings. They’re automating the analytical reasoning process that most data scientists consider their core competency.

What This Means for Data Scientists Today

History doesn’t repeat, but it rhymes. Every wave of technology reshuffles the labor market, but the pattern is consistent. Automation starts with routine work and gradually moves up the skill ladder. What’s different this time is that AI is targeting cognitive work: the patterns of thought we assumed were uniquely human.

But the workers who thrived weren’t the ones who competed with the machines. They were the ones who learned to work with them strategically.

The most successful textile workers during the Industrial Revolution weren’t the ones who tried to out-weave the machines. They were the ones who became pattern designers, quality supervisors, and production managers. They moved from executing craft to directing strategy.

The same principle applies today. The data scientists who will thrive aren’t the ones who will write better SQL than ChatGPT. They’re the ones who will learn to work strategically — asking better questions, making better decisions, and driving better outcomes.

What Strategic Data Science Looks Like in Practice

During my research into how data scientists can thrive in the age of AI, I saw the same pattern again and again: the skills that protect data scientists from AI automation are the same behaviors that differentiate senior individual contributors from their junior colleagues.

This isn’t a coincidence. The strategic thinking that makes you hard for AI to replace also makes you promotion-ready. As AI compresses skill requirements upward, the L4 mindset won’t be sustainable at the L4 level, and the L5 way of working won’t cut it for L5 roles.

Here are four key areas where strategic data scientists consistently outperform their peers:

1. Strategic Question Framing and Problem Selection

While AI can answer almost any analytical question you give it, it can’t determine which questions are worth asking in the first place. Strategic data scientists understand that the right question is more valuable than the perfect analysis of the wrong question.

In practice, this means:

Developing business acumen to identify which problems actually matter
Learning to scope initiatives that are both impactful and feasible
Anticipating second-order effects of your recommendations

Example: Instead of asking “What are our customer churn rates by segment?” a strategic data scientist asks “Which customer behaviors predict churn early enough for us to intervene, and what interventions are most cost-effective?”

When everyone can run the analysis, the competitive advantage shifts to understanding which analysis to run.

2. Cross-Functional Leadership and Influence

The biggest shift from tactical to strategic thinking is moving from answering questions to shaping strategy. Strategic data scientists don’t just analyze what happened; they influence what happens next through leadership without formal authority.

In practice, this means:

Developing cross-functional leadership skills to drive initiatives across teams
Building stakeholder management abilities to create consensus around data-driven decisions
Executing high-impact projects by working through influence rather than hierarchy

Example: Instead of simply reporting that Feature A has higher engagement than Feature B, a strategic data scientist identifies why, proposes how to apply those insights to other features, and leads the cross-functional effort to implement the changes.

The most successful data scientists understand that impact comes from driving initiatives, not just delivering analyses.

3. Strategic Communication and Executive Presence

AI can generate charts and summaries, but it can’t tell a story that changes minds. Instead of merely presenting findings, strategic data scientists craft narratives that drive decision-making at the executive level.

In practice, this means:

Mastering executive communication styles and preferences
Framing analytical insights in terms of business impact
Moving from “here’s what the data shows” to “here’s what we should do about it”

Example: Instead of presenting a chart showing declining engagement metrics, a strategic data scientist presents a story: “Our engagement drop correlates with the new feature rollout. Here’s a three-step plan to fix it, including the expected timeline and resource requirements.”

The most successful data scientists spend as much time crafting their communication as they do running their analyses.

4. Intelligent AI Integration and Workflow Optimization

Rather than competing with AI, strategic data scientists learn to work with it as a force multiplier. This means understanding how to delegate routine tasks effectively, knowing when to trust AI outputs and when to dig deeper, and maintaining strategic oversight.

In practice, this means:

Using AI for data cleaning, initial exploration, and code generation
Maintaining quality control and strategic direction over AI outputs
Focusing human effort on problem definition, method selection, and solution design

Example: Let AI generate your initial data exploration scripts and basic visualizations, but you determine which patterns are worth investigating, when to test versus model, and how insights connect to business objectives.

The key is treating AI as a powerful junior analyst that needs clear direction and quality oversight while you focus on the strategic decisions that drive impact.

The Path Forward: From Execution to Strategy

The transformation from tactical executor to strategic partner isn’t just about learning new skills; it’s about fundamentally changing how you approach your work. The analytical skills you’ve already developed are the perfect foundation for strategic thinking. You already know how to break down complex problems, evaluate evidence, and draw logical conclusions. Now it’s time to apply those same capabilities to business strategy, stakeholder management, and cross-functional leadership.

The data scientists who will thrive in the next decade won’t be the ones who resist this change. They’ll be the ones who embrace it and use it to do more strategic, more impactful work. They’ll learn to work with AI as a strategic co-pilot, delegating execution while focusing on the higher-level thinking that drives real impact.

Most importantly, they’ll understand that their value doesn’t come from their ability to write perfect code or build flawless models. It comes from their ability to identify the right problems, ask the right questions, and drive the right outcomes.

The question isn’t whether AI will change your job. It’s whether you’ll change your job before AI does it for you. The automation wave is here, and it’s time to decide: will you let it sweep you away, or will you learn to ride it to the top?

Did this post ignite your curiosity about becoming a more strategic data scientist? Buy The Strategic Data Scientist: Level Up and Thrive in the Age of AI. Learn the frameworks, mindsets, and tactics Strategic Data Scientists use to drive impact without managing people, and discover how to work with AI as a strategic co-pilot, not a replacement.

Getting Started with Facebook Prophet

Greg Rafferty — Wed, 31 Mar 2021 23:18:39 GMT

Getting Started with Prophet

Everything you need to know to build your first model with Meta’s advanced forecasting tool

Forecasting Time Series Data with Prophet, available at https://amzn.to/42xTkOb

This post contains an excerpt from my new book all about Prophet. The book is available for purchase on Amazon. The full contents of the book are listed at the end of this post!

If you find this article useful and would like to use Prophet to improve your forecasts, please consider buying the full book at https://amzn.to/42xTkOb.

Building a simple model in Prophet

The longest record of direct measurements of CO₂ in the atmosphere was started in March of 1958 by Charles David Keeling of the Scripps Institution of Oceanography. Keeling was based in La Jolla, California, but had received permission from the National Oceanic and Atmospheric Administration (NOAA) to use their facility located two miles above sea level on the northern slope of Mauna Loa, a volcano on the island of Hawaii, to collect carbon dioxide samples. At that elevation, Keeling’s measurements would be unaffected by local releases of CO₂ such as by nearby factories.

In 1961, Keeling published the data he had collected thus far, establishing that there was strong seasonal variation in CO₂ levels and that they were rising steadily, a trend that later became known as the Keeling Curve. In May 1974, NOAA began its own parallel measurements and has continued them since. The Keeling Curve graph is as follows:

The Keeling Curve, showing the concentration of carbon dioxide in the atmosphere

With its seasonality and increasing trend, this curve makes a good candidate to try out Prophet. This data set contains over 19,000 daily observations across 53 years. The unit of measurement for CO₂ is PPM, or parts per million, a measure of CO₂ molecules per million molecules of air.

To begin our model, we need to import the necessary libraries, pandas and Matplotlib, and import the Prophet class from the prophet package (the package was named fbprophet in releases before 1.0).

import pandas as pd

import matplotlib.pyplot as plt

from prophet import Prophet

As input, Prophet always requires a pandas DataFrame with two columns:

ds, for datestamp, should be a datestamp or timestamp column in a format expected by pandas.
y, a numeric column containing the measurement we wish to forecast.

Here, we use pandas to import the data, in this case a CSV file [Note: This CSV can be downloaded at https://git.io/JYG6T], and then load it into a DataFrame. Note that we also convert the date column (renamed to ds afterward) to a pandas datetime format, to ensure that pandas correctly identifies it as dates and does not simply load it as an alphanumeric string.

df = pd.read_csv('co2-ppm-daily_csv.csv')

df['date'] = pd.to_datetime(df['date'])

df.columns = ['ds', 'y']
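To try the same preparation without downloading the file, here is a minimal sketch that runs it on a few rows held in memory. The second column name (value) and the sample figures are assumptions made up for illustration; they are not taken from the real CSV.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV; the values are made up.
sample_csv = io.StringIO(
    "date,value\n"
    "1958-03-30,316.16\n"
    "1958-03-31,316.40\n"
    "1958-04-02,317.67\n"
)

sample = pd.read_csv(sample_csv)
sample["date"] = pd.to_datetime(sample["date"])  # parse strings into datetimes
sample.columns = ["ds", "y"]                     # the two names Prophet expects

print(sample.dtypes)
```

Printing the dtypes is a quick way to confirm that ds was parsed as a datetime and y as a number before fitting.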

If you’re familiar with the scikit-learn (sklearn) package, you’ll feel right at home in Prophet because it was designed to operate in a similar way. Prophet follows the sklearn paradigm of first creating an instance of the model class before calling the fit and predict methods.

model = Prophet()

model.fit(df)

In that single fit command, Prophet analyzed the data and isolated both the seasonality and trend without requiring us to specify any additional parameters. It has not yet made any future forecast though. To do that, we need to first make a DataFrame of future dates and then call the predict method. The make_future_dataframe method requires us to specify the number of days we intend to forecast out. In this case, we will choose ten years, or 365 days times 10.

future = model.make_future_dataframe(periods=365 * 10)

forecast = model.predict(future)
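To make the shape of that future DataFrame concrete, the sketch below mimics the default behavior with plain Python dates: it keeps the historical dates and appends the requested number of further days. The function name extend_daily_dates and the three-day history are hypothetical; this is a simplified stand-in, not Prophet's own code.

```python
from datetime import date, timedelta


def extend_daily_dates(history, periods):
    """Return the sorted historical dates followed by `periods` further days."""
    ordered = sorted(history)
    last = ordered[-1]
    future = [last + timedelta(days=offset) for offset in range(1, periods + 1)]
    return ordered + future


# Hypothetical three-day history, extended by 365 * 10 days as in the text.
history = [date(2010, 1, 1), date(2010, 1, 2), date(2010, 1, 3)]
all_dates = extend_daily_dates(history, periods=365 * 10)

print(len(all_dates), all_dates[-1])
```

With this history the last date is January 1, 2020, two days short of ten calendar years after January 3, 2010, because 365 * 10 does not count leap days. For a forecast of this length the difference is negligible.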

At this point, the forecast DataFrame contains Prophet’s prediction for CO₂ concentrations going ten years into the future. We will explore that DataFrame in a moment, but first let’s plot the data using Prophet’s plot functionality. The plot method is built upon Matplotlib; it requires a DataFrame output from the predict method (our forecast DataFrame in this example).

We’re labeling the axes with the optional xlabel and ylabel arguments, but just sticking with the default for the optional figsize argument. Note that I am also adding a title using raw Matplotlib syntax; because the Prophet plot is built upon Matplotlib, anything you can do to a Matplotlib figure can be performed here as well. Also, don’t be confused by the odd ylabel text with the dollar signs; that just tells Matplotlib to use its own TeX-like engine to make the subscript in CO₂.

fig = model.plot(forecast, xlabel='Date', ylabel=r'CO$_2$ PPM')

plt.title('Daily Carbon Dioxide Levels Measured at Mauna Loa')

plt.show()

The graph is as follows:

Prophet Forecast

And that’s it! In roughly a dozen lines of code, we have arrived at our ten-year forecast.

Interpreting the forecast DataFrame

Now, let’s take a look at that forecast DataFrame by displaying the first three rows (I’ve transposed it here, in order to better see the column names on the page) and learn how these values were used in the above chart:

forecast.head(3).T

After running that command, you should see the following table print out:

The forecast DataFrame

The following is a description of each of the columns in the forecast DataFrame:

‘ds’ — Datestamp or timestamp which values in that row pertain to
‘trend’ — Value of the trend component alone
‘yhat_lower’ — Lower bound of the uncertainty interval around the final prediction
‘yhat_upper’ — Upper bound of the uncertainty interval around the final prediction
‘trend_lower’ — Lower bound of the uncertainty interval around the trend component
‘trend_upper’ — Upper bound of the uncertainty interval around the trend component
‘additive_terms’ — Combined value of all additive seasonalities
‘additive_terms_lower’ — Lower bound of the uncertainty interval around the additive seasonalities
‘additive_terms_upper’ — Upper bound of the uncertainty interval around the additive seasonalities
‘weekly’ — Value of the weekly seasonality component
‘weekly_lower’ — Lower bound of the uncertainty interval around the weekly component
‘weekly_upper’ — Upper bound of the uncertainty interval around the weekly component
‘yearly’ — Value of the yearly seasonality component
‘yearly_lower’ — Lower bound of the uncertainty interval around the yearly component
‘yearly_upper’ — Upper bound of the uncertainty interval around the yearly component
‘multiplicative_terms’ — Combined value of all multiplicative seasonalities
‘multiplicative_terms_lower’ — Lower bound of the uncertainty interval around the multiplicative seasonalities
‘multiplicative_terms_upper’ — Upper bound of the uncertainty interval around the multiplicative seasonalities
‘yhat’ — Final predicted value; a combination of ‘trend’, ‘multiplicative_terms’, and ‘additive_terms’

If the data contains a daily seasonality, then columns for ‘daily’, ‘daily_upper’, and ‘daily_lower’ will also be included, following the pattern established with the ‘weekly’ and ‘yearly’ columns. Later chapters will include discussion and examples of both the additive/multiplicative seasonalities and of the uncertainty intervals.
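As a rough illustration of how the component columns relate to yhat, the sketch below combines them for a single row. The formula reflects my understanding of Prophet's predict step (the trend scaled by the multiplicative terms, plus the additive terms); the function name combine_components and the numbers are hypothetical.

```python
def combine_components(trend, additive_terms=0.0, multiplicative_terms=0.0):
    """Combine forecast components into one predicted value (assumed formula)."""
    return trend * (1.0 + multiplicative_terms) + additive_terms


# Made-up values for one forecast row; with purely additive seasonality
# the multiplicative terms are zero.
row = {"trend": 400.0, "additive_terms": 2.5, "multiplicative_terms": 0.0}
yhat = combine_components(
    row["trend"], row["additive_terms"], row["multiplicative_terms"]
)

print(yhat)
```

With purely additive seasonality, as in this model, yhat reduces to the trend plus the sum of the seasonal components.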

In the forecast plot above, the black dots represent the actual recorded y values we fit on (those in the df[‘y’] column) whereas the solid line represents the calculated yhat values (the forecast[‘yhat’] column). Note that the solid line extends beyond the range of the black dots where we have forecasted into the future. The lighter shading notable around the solid line in the forecasted region represents the uncertainty interval, bound by forecast[‘yhat_lower’] and forecast[‘yhat_upper’].

Now let’s break down that forecast into its components.

Understanding components plots

In Chapter 1, History and Development of Time Series Forecasting, Prophet was introduced as an additive regression model. Figures 1.4 and 1.5 showed how individual component curves for the trend and the different seasonalities are added together to create a more complex curve. The Prophet algorithm essentially does this in reverse; it takes a complex curve and decomposes it into its constituent parts. The first step towards greater control of a Prophet forecast is to understand these components so that they can be manipulated individually. Prophet provides a method plot_components to visualize these.
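The additive idea can be sketched with two made-up components: a linear trend and a yearly cycle. Adding them gives the more complex curve, and subtracting a known component recovers the other. The starting level, slope, and amplitude below are assumptions chosen only to resemble the Mauna Loa data; Prophet estimates such components from the data rather than being given them.

```python
import math


def trend(day):
    """Hypothetical linear trend: 315 PPM at day 0, rising 1.5 PPM per year."""
    return 315.0 + 1.5 * day / 365.25


def yearly(day):
    """Hypothetical yearly cycle with an amplitude of 3 PPM."""
    return 3.0 * math.sin(2.0 * math.pi * day / 365.25)


days = range(0, 3 * 365)

# Compose: the observed-looking curve is the sum of its components.
curve = [trend(d) + yearly(d) for d in days]

# Decompose: with the seasonal part known, subtracting it leaves the trend.
recovered_trend = [value - yearly(d) for d, value in zip(days, curve)]

print(round(curve[0], 2), round(recovered_trend[-1], 2))
```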

Continuing on with our progress on the Mauna Loa model, plotting the components is as simple as running these commands:

fig2 = model.plot_components(forecast)

plt.show()

As you can see in the output plot, Prophet has isolated three components in this data set: the trend, a weekly seasonality, and a yearly seasonality:

Mauna Loa components plot

The trend constantly increases but seems to have a steepening slope as time progresses — an acceleration of CO₂ concentration in the atmosphere. The trend line also shows slim uncertainty intervals in the forecasted year. From this curve, we learn that atmospheric CO₂ concentrations were about 320 PPM in 1965. This grew to about 400 by 2015 and we expect about 430 PPM by 2030. However, these exact numbers will vary depending upon the day of the week and the time of year, due to the existence of the seasonality effects.

The weekly seasonality shows that by days of the week, values will vary by about 0.01 PPM — an insignificant amount and most likely due purely to noise and random chance. Indeed, intuition tells us that carbon dioxide levels (when measured far enough away from human activity, as they are on the high slopes of Mauna Loa) do not care much what day of the week it is and are unaffected by it.

We will learn in Chapter 4, Seasonality, how to instruct Prophet not to fit a weekly seasonality, as is prudent in this case. In Chapter 10, Uncertainty Intervals, we will learn how to plot uncertainty for seasonality and ensure that a seasonality such as this can be ignored.

Now looking at the yearly seasonality reveals that carbon dioxide rises throughout the winter and peaks in May or so, while falling in the summer with a trough in October. Measurements of carbon dioxide can be 3 PPM above or 3 PPM below what the trend alone would predict, based upon the time of year. If you refer back to the original data, in the Keeling Curve, you will be reminded that there was a very obvious cyclic nature to the curve, captured with this yearly seasonality.

As simple as that model was, that is often all you need to make very accurate forecasts with Prophet! We used no parameters other than the defaults and yet achieved very good results.

This excerpt is from Chapter 2 of Forecasting Time Series Data with Facebook Prophet, available now on Amazon. The book has more than 250 pages of examples, lessons, and descriptions covering every aspect of Prophet. More than 10 instructive datasets, with fully working code, demonstrate Prophet functionality from the simple to the advanced and help you learn how to perfect your forecasts.

The full book contains the following chapters:

1. The History and Development of Time Series Forecasting
2. Getting Started with Facebook Prophet
3. Non-Daily Data
4. Seasonality
5. Holidays
6. Growth Modes
7. Trend Changepoints
8. Additional Regressors
9. Outliers and Special Events
10. Uncertainty Intervals
11. Cross-Validation
12. Performance Metrics
13. Productionalizing Prophet

If you enjoyed this Medium post, please consider ordering the book here: https://amzn.to/42xTkOb. If you do read the book, I would be thrilled to hear your thoughts!

Getting Started with Facebook Prophet was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

COVID-19 dashboard

Greg Rafferty — Mon, 23 Mar 2020 18:56:09 GMT

I built a web-based dashboard using Dash to visualize the pandemic

I’m Greg Rafferty, a data scientist in the Bay Area. The code for this project is available on my GitHub and the dashboard is live at https://covid-19-raffg.herokuapp.com/.

I built a web dashboard using Python and Dash, with charts made in Plotly. The data is provided by the Johns Hopkins Center for Systems Science and Engineering, and the dashboard updates automatically each evening at 5:30 pm Pacific time.

Focus selection

The dashboard can show the pandemic globally or focus on either the United States or Europe, using the radio buttons at the top:

This selection changes the underlying data for each displayed chart to reflect the selected region.

Each evening at roughly 5pm, Johns Hopkins updates their data source with new cases from the day. My dashboard automatically runs an ETL script to download the new data, process it into the format required by the dashboard, and upload it to Heroku. The headline here declares when the data was most recently updated.

Components

There are five main components of the dashboard: the indicators, the infection rates for the selected region, the case analysis by sub-region, the infection map, and the trajectory chart.

Indicators

There are four indicators, each consisting of the current value for the indicator, in red, and a percent change from yesterday, in green.

CUMULATIVE CONFIRMED is the running total of all cases tested and confirmed in the selected region.
CURRENTLY ACTIVE measures only the cases active today. It is calculated as ACTIVE = CONFIRMED - DEATHS - RECOVERED.
DEATHS TO DATE measures the running total of all COVID-19-related deaths.
RECOVERED CASES is the number of cases in which the patient is deemed to have recovered from the illness and is no longer infected or contagious.
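The indicator arithmetic above can be sketched in a few lines. The counts are made up, and the percent change is assumed to be measured relative to yesterday's value; the dashboard's actual code may differ.

```python
def active_cases(confirmed, deaths, recovered):
    """ACTIVE = CONFIRMED - DEATHS - RECOVERED."""
    return confirmed - deaths - recovered


def percent_change(today, yesterday):
    """Percent change from yesterday; None when yesterday's value is zero."""
    if yesterday == 0:
        return None
    return 100.0 * (today - yesterday) / yesterday


# Made-up counts for two consecutive days.
yesterday = {"confirmed": 1000, "deaths": 20, "recovered": 180}
today = {"confirmed": 1100, "deaths": 22, "recovered": 198}

active_yesterday = active_cases(**yesterday)
active_today = active_cases(**today)
change = percent_change(active_today, active_yesterday)

print(active_today, change)
```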

Infections

The infections chart displays the totals for CONFIRMED, ACTIVE, RECOVERED, and DEATHS for the selected region, by date. Hovering the mouse over the chart will reveal the counts for each of these measures on the specific date. Using the mouse, you can zoom in and out or click and drag to select a box to zoom in on. Additionally, hovering over the chart (or any chart on the dashboard) will make visible several control buttons in the top right of the chart. There are slightly different options for each chart, but of particular usefulness is the ability to reset the chart back to original zoom level.

Cases by Sub-Region

The cases graphic displays a line chart by sub-region of either CONFIRMED, ACTIVE, RECOVERED, or DEATHS, selectable with the radio buttons below the chart. If the selected region is Worldwide or Europe, the sub-regions displayed are countries. If the selected region is United States, the sub-regions are the states. On hover, the exact count of the selected metric is displayed for the sub-region the mouse is over.

By default, it displays sub-regions which were of particular interest when this dashboard was created. The dropdown-bar on the bottom allows you to select different sub-regions for display, either countries for the Worldwide and Europe focus or states for the United States focus. Typing in the dropdown-bar will allow you to search for sub-regions.

As with the other two line charts on this dashboard, clicking on an item in the legend will temporarily remove that item from the chart. Clicking again will add it back. Double-clicking an item will remove all other items and isolate that singular item on the chart. Double-clicking again will add back all items.

Infection Map

The infection map features a circular marker over each sub-region. The size of the marker is relative to the square root of the CONFIRMED cases within that sub-region and the color indicates the percentage of those cases which were newly confirmed within the previous 7 days. Essentially, the size of the marker is a measure of how many people have caught the virus within that sub-region since the outbreak began and the color is a measure of how active the virus currently is, with dark red indicating the virus is actively spreading and white indicating that it is more under control. Hovering over a marker will reveal the sub-region name and the exact value of the two measures. As with the other charts, the map is zoomable and draggable. Below the chart is a slider bar controlling the date at which the map displays data. By default it is set for the most recent date available but by dragging to the left you can see the spread of the pandemic through time.
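The two marker measures can be sketched as follows. The function names, the scale factor, and the sample series are hypothetical; only the square-root sizing and the 7-day share come from the description above.

```python
import math


def marker_size(confirmed, scale=1.0):
    """Marker size proportional to the square root of confirmed cases."""
    return scale * math.sqrt(confirmed)


def recent_share(cumulative, window=7):
    """Percentage of the latest total that was confirmed in the last `window` days."""
    latest = cumulative[-1]
    if latest == 0:
        return 0.0
    earlier = cumulative[-1 - window] if len(cumulative) > window else 0
    return 100.0 * (latest - earlier) / latest


# Made-up cumulative confirmed counts, one entry per day.
cumulative = [0, 10, 20, 40, 80, 160, 320, 640, 1000]

print(marker_size(10000), recent_share(cumulative))
```

Taking the square root keeps the largest outbreaks from visually swamping the map while still ordering markers by case count.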

Trajectory

This chart displays the trajectory of the pandemic within sub-regions. The x-axis displays the cumulative confirmed count by sub-region and the y-axis displays the count of cases which were confirmed in the previous week. With this visualization, once a sub-region has managed to control the pandemic to some extent, the line should suddenly drop down, as China (green) and South Korea (orange) have in the image. Although date is not on either of the axes, the data is still plotted by date; hovering over any line will display the date on which that data point was recorded. Additionally, the date slider on the bottom also controls this chart; so along with the map, the change throughout time of the trajectories can be inspected.
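A minimal sketch of how the trajectory points could be derived from a cumulative series is shown below. The function name and the sample counts are hypothetical; the axes follow the description above, with the cumulative total on x and the cases confirmed in the previous week on y.

```python
def trajectory(cumulative, window=7):
    """Return (total, new cases in the prior `window` days) for each day."""
    points = []
    for i, total in enumerate(cumulative):
        earlier = cumulative[i - window] if i >= window else 0
        points.append((total, total - earlier))
    return points


# Made-up series: rapid growth that then levels off.
cumulative = [1, 2, 4, 8, 16, 32, 64, 128, 130, 131, 131, 131, 131, 131, 131, 131]
points = trajectory(cumulative)

print(points[7], points[-1])
```

Once new cases stop arriving, x stays put while y collapses toward zero, which is the sudden drop the chart is designed to reveal.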

COVID-19 dashboard was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

What’s the Most Wonderful Time of the Year? Hint: It’s not what The Economist says

Greg Rafferty — Thu, 27 Feb 2020 01:45:17 GMT

Analyzing Spotify’s valence score with Python and Matplotlib

Photo by Marcela Laskoski on Unsplash

I’m Greg Rafferty, a data scientist in the Bay Area. The code for this project is available on my GitHub.

In the Feb 8th, 2020 edition of The Economist, the Graphic Detail section briefly discussed an analysis of Spotify data suggesting that July is the happiest month on average (Sad songs say so much: Data from Spotify suggest that listeners are gloomiest in February). I attempted to duplicate their study and came to some different conclusions, and made some new discoveries along the way.

The Data

There are two sources of data needed for such an analysis. The first is the Top 200 most streamed songs, by day and by country, which Spotify makes available on spotifycharts.com. Because I didn’t want to select each individual country and date from drop-down menus and manually download the almost-70,000 daily chart csv files, I built a scraper to do it all for me.

The second set of necessary data is the valence scores for each song in those charts. Spotify makes this data available via their developer API. To access this data, you’ll need to sign up for developer credentials. Thankfully, Spotify doesn’t make this too difficult, so you shouldn’t have any problems. I built a second scraper which goes through each unique song from the Top 200 charts and downloads its feature vector. There are several features available here, but what The Economist used was the valence score, a decimal between 0 and 1 which describes the “happiness” of the song.

This score was originally developed by a music intelligence and data platform called Echo Nest, which was acquired by Spotify in 2014. A (now dead, but available via the Wayback Machine) blog post only has this to say about the score:

We have a music expert classify some sample songs by valence, then use machine-learning to extend those rules to all of the rest of the music in the world, fine tuning as we go.

Other features available via the API include tempo, energy, key, mode, and danceability, among others, and it has been speculated that these features play a role in the valence score. At any rate, it’s a bit of a black box how the valence score is arrived at, but it does seem to match up with the songs very well. However, as the training data is highly likely to favor popular music, I wonder if classical, jazz, or non-Western musical styles are not scored as accurately.

The Analysis

My analysis showed a very similar distribution of data compared with The Economist’s analysis:

source: https://www.economist.com/graphic-detail/2020/02/08/data-from-spotify-suggest-that-listeners-are-gloomiest-in-february

Despite the different appearances of our two charts, the shape of the two kernel density estimations is very similar (if you’re not familiar, a kernel density estimation, or KDE, is pretty much just a smooth histogram with the area under the curve summing up to 1). The locations of those key songs along the valence axis also match up. You can see that on average, Brazilians listen to “happier” music than the rest of the world, and in the United States listeners stream music which on average is less happy than the rest of the world. As we’ll see, the music of Latin America typically scores very high in valence.

It’s the second chart from The Economist where I see some key differences.

source: https://www.economist.com/graphic-detail/2020/02/08/data-from-spotify-suggest-that-listeners-are-gloomiest-in-february

First, let’s look at that ten-day moving average chart in the upper right corner. It finds that February displays the lowest average valence and July the highest. Here is what I found:

In my analysis, December has the highest average valence, followed by August, and then in third place is July. I initially found some very different average valence scores for February as well, and so investigated why our data sets could be different. The Economist used data from January 1st, 2017 (the earliest available on Spotify Charts) until January 29th, 2020 (presumably, when they performed their scraping). I had all of that data, plus almost all of February 2020 as well. Without aggregating by month, as the above plot does, and also not performing a moving average, I saw a much wider variance in the same month from year to year than I expected:

In any given year, I did see December as the highest. However, in 2018 (a particularly sad year?) the summer featured lower valence scores than February. Additionally, the inclusion of 2020 data, which is much higher than the previous years, acted to inflate February’s average across time. Therefore, I chose to exclude 2020 data. Compare these two charts, one with 2020 included and the other without:

On the left: including Jan/Feb 2020; On the right: excluding Jan/Feb 2020

This doesn’t change the charts a great deal. However, one key point The Economist noted in their chart was that New Zealand, in the southern hemisphere, also experiences a dip in valence in February despite the reversal of its summers and winters compared to the northern hemisphere. When I include February 2020, I see the opposite effect to what The Economist noticed; when I exclude February 2020, I do see the dip, although it is much less pronounced than what The Economist saw. Following The Economist’s convention of including all available data, though, I would include February 2020 and, because 2020 was so much happier than previous years, the data would then support the opposite conclusion. Any cutoff does, unfortunately, seem somewhat arbitrary to me: what do we call an appropriate cutoff point? Including an equal number of months from each year seems reasonable to me, though, so excluding the 2020 data is slightly less arbitrary. I’ll be curious to rerun this analysis at the end of the year and see what it looks like then.

A key finding of mine though which is most certainly different than what The Economist found is that December is the happiest month, not July as they found.

I also sorted the countries into continents to look at wider trends. As The Economist found, Latin American countries do indeed stream much happier music than the rest of the world. Also to note is that in every continent except Europe, December is the happiest month; and February the saddest everywhere except Africa and Australia:

Additionally, I looked at mood by day. As I had expected, I found Saturday to be the happiest.

I also looked at this chart for just the US and New Zealand. I found Friday to be the saddest day in the United States and Sunday to be the happiest. Any theories why this may be? New Zealand exhibits the behavior I’d probably most expect, with Monday being the saddest and Saturday the happiest.

Finally, I did spend a bit of time looking into the other features available via the Spotify API and made one last chart, danceability for each country:

Here, I found Fridays to feature more danceable music than Mondays, as I was expecting. Furthermore, the ordering of the countries shifts around a lot from the valence chart. For instance, the United States is at the sadder end of the valence spectrum but at the more danceable end in this chart. I also noted that the lowest countries as ranked by danceability are all Asian countries with traditional music styles which do not fit the “standard” western 12-tone, 4-beats-to-a-bar system. I wonder if the algorithm has a tough time predicting danceability when the structure is so different than the majority of training samples.

I also noted that on Sundays in the Netherlands, their danceability score plummets! Norway and Sweden also experience this phenomenon, although to a lesser extent. Religious Sundays may be one explanation for this result, although the Netherlands, Norway, and Sweden don’t seem to me to be particularly more religious than many other countries which don’t exhibit this behavior.

Just for fun, I looked up the saddest song I know (tragically missing from the Top 200 charts), the 3rd symphony by Henryk Górecki (titled, very appropriately, Symphony of Sorrowful Songs). The 2nd movement features a valence score of just 0.0280, well below Adele’s Make You Feel My Love at 0.0896. This score would place it second-to-last on the valence chart for all 68,000+ songs in the Top 200, just slightly above Tool’s Legion Inoculant, the saddest song in the charts with a valence of 0.0262 (although, in my opinion, this composition by Tool really stretches the definition of “song” beyond reasonableness).

I also looked up Pharrell Williams’ Happy, expecting to find a peak score, and was disappointed to see that it’s “only” 0.9620. Compare that to the almost comically happy September by Earth, Wind & Fire, with a valence score of 0.982 (the highest in the charts).

The top 10 happiest songs in the United States’ top 200 rankings:

Earth, Wind & Fire - September
Gene Autry - Here Comes Santa Claus (Right Down Santa Claus Lane)
The Beach Boys - Little Saint Nick - 1991 Remix
Logic - Indica Badu
Chuck Berry - Johnny B. Goode
Shawn Mendes - There's Nothing Holdin' Me Back
Foster The People - Pumped Up Kicks
Tom Petty - I Won't Back Down
OutKast - Hey Ya!
Aretha Franklin - Respect

And the top 10 saddest songs:

TOOL - Legion Inoculant
Joji - I'LL SEE YOU IN 40
Trippie Redd - RMP
Drake - Days in The East
Drake - Jaded
Lil Uzi Vert - Two®
TOOL - Litanie contre la Peur
Russ - Cherry Hill
2 Chainz - Whip (feat. Travis Scott)
Rae Sremmurd - Bedtime Stories (feat. The Weeknd) - From SR3MM

So, it seems that Andy Williams is correct: December actually is the Most Wonderful Time of the Year (valence score: 0.7240).

What’s the Most Wonderful Time of the Year? Hint: It’s not what The Economist says was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

A/B testing — Is there a better way? An exploration of multi-armed bandits

Greg Rafferty — Thu, 23 Jan 2020 00:10:57 GMT

Using the algorithms of Epsilon-Greedy, Softmax, UCB, Exp3, and Thompson Sampling

Photo by Benoit Dare on Unsplash

I’m Greg Rafferty, a data scientist in the Bay Area. The code for this project is available on my GitHub.

In this post, I’ll simulate a traditional A/B test and discuss its shortcomings, then I’ll use Monte Carlo simulations to examine some different multi-armed bandit algorithms, which can alleviate many of the problems with a traditional A/B test. Finally, I’ll discuss the termination criteria for the specific case of Thompson Sampling.

Part 1: Traditional A/B testing

Websites today are meticulously designed to maximize one or even several goals. Should the “Buy Now!” button be red or blue? What headline attracts the most clicks to that news article? Which version of an advertisement has the highest click-through rate? To determine the optimal answer to these questions, software developers employ A/B tests — a statistically sound technique to compare two different variants, version A and version B. Essentially, they’re trying to determine whether the mean value in the blue distribution below is actually different than the mean value of the red distribution, or is that apparent difference actually just due to random chance?

Also, an example of the Central Limit Theorem

In a traditional A/B test, you start by defining what minimum difference between the versions is meaningful. In the above distributions, version A (usually, the current version) has a mean of 0.01. Let’s say this is a 1% click-through rate, or CTR. In order to change our website to version B, we want to see a minimum relative improvement of 5%, or a CTR of at least 1.05%. Next, we set our confidence level, the statistical confidence that our observed results are due to a true difference as opposed to random chance. Typically, this is set to 95%, which corresponds to a significance level, called alpha, of 0.05. In order to determine how many observations to collect, we use power analysis to determine the required sample size. If alpha is the acceptable rate of making a Type I error (False Positive), then beta is the acceptable rate of making a Type II error (False Negative), and power is 1 - beta.

Many statisticians believe a Type I error is 4x as costly as a Type II error. Put another way: Your eCommerce website is currently running fine. You believe you’ve identified a change that will increase sales so you implement the change, only to find out that the change actually hurt the website. This is a Type I error and has lost you sales. Now imagine that you consider making a change but decide it won’t improve things, even though in reality it would have, and so you don’t make the change. This is a Type II error, and cost you nothing but potential opportunity. So if we set our confidence level to 95%, that means that we’re willing to accept a Type I error in only 5% of our experiments. If a Type I error is 4x as costly as a Type II error, that implies that we set our power to 80%; we’re willing to be conservative and ignore a potentially positive change 20% of the time.

Version A is what’s currently running. So we have historical data and can calculate a mean CTR and corresponding standard deviation. We’ll need these values for version B though, which doesn’t yet exist. For the mean, we’ll use that 5% improvement value, so mean_b = 1.05 * mean_a. Standard deviation though will need to be estimated. This can be a severe downside to traditional A/B testing when this estimation is difficult. In our case though, we’ll just assume version B will have the same standard deviation as version A. With sigma standing in for standard deviation and d being the difference between our two means, we’ll need to look up z-scores for both alpha and beta, and calculate our sample size with this equation:

With that, we simply run our A/B test until the required sample size is obtained. We randomly show visitors to our site either version A or version B and record the CTR for each version. You then either use a stats package or t-test calculations and a t-test table to arrive at a p-value; if the p-value is less than your alpha, 0.05 in this case, then you can state with 95% confidence that you have observed a true difference between version A and version B, not one due to chance.
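A minimal sketch of the sample-size calculation described above. It assumes the common two-sample form n = 2 * sigma^2 * (z_alpha/2 + z_beta)^2 / d^2 with a two-sided alpha, and it assumes the variance of a Bernoulli click (p * (1 - p)) for version A, reused for version B. The function name and defaults are illustrative, and a different variance estimate or formula variant will give a somewhat different number.

```python
import math
from statistics import NormalDist


def required_sample_size(mean_a, lift=0.05, alpha=0.05, power=0.80):
    """Approximate per-version sample size to detect a relative lift in a rate."""
    mean_b = mean_a * (1 + lift)          # hypothesized version B mean
    d = mean_b - mean_a                   # minimum difference worth detecting
    # Assumption: Bernoulli variance of version A, reused for version B.
    sigma_sq = mean_a * (1 - mean_a)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    return math.ceil(2 * sigma_sq * (z_alpha + z_beta) ** 2 / d ** 2)


# 1% baseline CTR, 5% relative improvement, 95% confidence, 80% power
print(required_sample_size(mean_a=0.01))
```

For these inputs the result is a little over 600,000 observations per version. Halving the detectable difference roughly quadruples it, because d is squared in the denominator.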

Drawbacks to the traditional A/B test

The greatest drawback to a traditional A/B test is that one version may be vastly inferior to another, and yet you must continue to offer that version to visitors until the test is complete, thus losing sales. As stated earlier, you also must arrive at an estimate of standard deviation for your version B; if your guess is incorrect, you may not collect enough samples and fail to achieve statistical power; that is, even if version B truly is better than version A and even if your experiment demonstrates this fact, you do not have enough samples to declare the difference statistically significant. You’ll be forced into a False Negative.

It would be great if there were a way to run an A/B test without wasting time on an inferior version B (or C, D, and E…).

Part 2: Multi-Armed Bandits

Those old slot machines with the single lever on the side, which always take your money, are called one-armed bandits. Imagine a whole bank of those machines lined up side-by-side, all paying out at different rates and values. This is the idea of a multi-armed bandit. If you’re a gambler who wants to maximize your winnings, you obviously want to play the machine with the highest payout. But you don’t know which machine this is. You need to explore the different machines over time to learn what their payouts are, but you simultaneously want to exploit the highest paying machine. A similar scenario is Richard Feynman’s restaurant problem. Whenever he goes to a restaurant, he wants to order the tastiest dish on the menu, but he has to order everything available to find out which dish is best. This balance of exploitation, the desire to choose an action which has paid off well in the past, and exploration, the desire to try options which may produce even better results, is what multi-armed bandit algorithms were developed for.

How do they do this? Let’s take a look at several algorithms. I won’t spend too much time discussing the mathematics of these algorithms, but I will link to my Python implementations of each of them on my GitHub, which you can refer to for further details. I’ve used the same notation for each algorithm so the select_arm() and update() functions should fully describe the math.

Epsilon-Greedy

The Epsilon-Greedy algorithm balances exploitation and exploration in a fairly basic way. It takes a parameter, epsilon, between 0 and 1, as the probability of exploring the options (called arms in multi-armed bandit discussions) as opposed to exploiting the current best variant in the test. For example, say epsilon is set at 0.1. Every time a visitor comes to the website being tested, a number between 0 and 1 is randomly drawn. If that number is greater than 0.1, then that visitor will be shown whichever variant (at first, version A) is performing best. If that random number is less than 0.1, then a random arm out of all available options will be chosen and provided to the visitor. The visitor’s reaction will be recorded (a click or no click, a sale or no sale, etc.) and the success rate of that arm will be updated accordingly.
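The procedure just described can be sketched in a few lines. This is a minimal illustration and not necessarily identical to the code in the GitHub repo: the class name, the attribute names, and the simulated arm probabilities are assumptions for the example, while select_arm() and update() follow the notation mentioned earlier.

```python
import random


class EpsilonGreedy:
    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # pulls per arm
        self.values = [0.0] * n_arms  # running success rate per arm

    def select_arm(self):
        if random.random() > self.epsilon:
            # Exploit: show the arm with the best observed success rate
            return max(range(len(self.values)), key=lambda i: self.values[i])
        # Explore: pick any arm uniformly at random
        return random.randrange(len(self.values))

    def update(self, arm, reward):
        # Incremental mean of the rewards seen on this arm
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Simulated visitors; each arm succeeds with a fixed (hypothetical) probability
random.seed(0)
probs = [0.1, 0.25, 0.5, 0.75, 0.9]
algo = EpsilonGreedy(len(probs), epsilon=0.1)
for _ in range(5000):
    arm = algo.select_arm()
    reward = 1 if random.random() < probs[arm] else 0
    algo.update(arm, reward)
print(algo.counts)
```

Ties are broken in favor of the lowest-numbered arm, which is why the first visitors all see the first variant until exploration finds something better.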

There are a few things to consider when evaluating multi-armed bandit algorithms. First, you could look at the probability of selecting the current best arm. Each algorithm takes a bit of time to stabilize and find the best arm, but once stabilization is reached epsilon-Greedy should select the best arm at a rate of (1-epsilon) + epsilon/(number of arms). This is because a fraction (1-epsilon) of the time it will automatically select the best arm, and the remaining time it will select all arms at an equal rate. With epsilon = 0.1 and 5 arms, for example, that is 0.9 + 0.1/5 = 0.92. For different values of epsilon, this is what the accuracy looks like:

In all these trials, I’ve simulated 5 arms with success probabilities of [0.1, 0.25, 0.5, 0.75, 0.9]. These values span a far wider range than would typically be seen in a test like this, but they allow the arms to display their behavior after simulating far fewer iterations than would otherwise be required. Each graph is the result of averaging 5000 experiments with a horizon of 250 trials.

Low values of epsilon correspond to less exploration and more exploitation; therefore it takes the algorithm longer to discover which is the best arm, but once found, it exploits it at a higher rate. This can be seen most clearly with the blue line starting off slowly but then passing the others and stabilizing at a higher rate.

When there are many arms at play, all roughly similar in expected reward, it can be valuable to look at the average reward of an algorithm. The following chart again shows a handful of values for epsilon compared:

Both of these approaches, however, focus on how many trials it takes to find the best arm. An evaluation approach which looks at cumulative reward will treat algorithms which focus upfront on learning more fairly.

Clearly, choosing the value of epsilon can matter a great deal and is not trivial. Ideally, you would want a high value (high exploration) when the number of trials is low, but would transition to a low value (high exploitation) once learning is complete and the best arm is known. A technique called annealing does exactly this: it decreases epsilon as the number of trials increases. I won’t go into much detail here (again, see my GitHub code), but it is pretty simple. Using the annealed epsilon-Greedy algorithm and plotting the rate of selecting each arm looks like this:

With these (admittedly extreme) values for each arm, the algorithm very quickly settles on arm_0 as the best and selects the remaining arms a fraction of the time.
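As a rough sketch of annealing, epsilon can be computed from the number of trials seen so far instead of being fixed. The logarithmic schedule below is only one possible choice and is an assumption for illustration; the function names are hypothetical.

```python
import math
import random


def annealed_epsilon(total_pulls):
    """Exploration rate that starts at 1.0 and decays as trials accumulate."""
    # Assumed schedule: 1 / ln(t + e), so t = 0 gives exactly full exploration
    return 1.0 / math.log(total_pulls + math.e)


def select_arm(values, total_pulls):
    """Epsilon-greedy selection using the annealed epsilon."""
    if random.random() > annealed_epsilon(total_pulls):
        return max(range(len(values)), key=lambda i: values[i])  # exploit
    return random.randrange(len(values))                         # explore


for t in (0, 10, 100, 1000, 100000):
    print(t, round(annealed_epsilon(t), 3))
```

A logarithmic decay is slow, so some exploration continues even late in the experiment; a faster schedule would commit to the leading arm sooner at the risk of locking in a mistake.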

One of the greatest advantages of multi-armed bandit approaches is that you can call off the test early if one arm is clearly the winner. In these experiments, each single trial is a Bernoulli trial — the outcome is either success (an ad click, a sale, an email sign-up) or a failure (the user closes the webpage with no action). These trials in aggregate can be represented with Beta distributions. Look at the following graphic. At first, each arm has an equal probability of any outcome. But as more and more trials accumulate, the probability of success of each arm becomes more and more focused on its actual, long-term success probability. Note that the y-axis is the probability density and is increasing in each frame; I’ve omitted it for clarity so just remember that the area under each curve is always exactly 1. As the curves get more narrow, they correspondingly get taller to maintain this constant area.

Notice how the peaks of each arm start to center around their actual payout probabilities of [0.1, 0.25, 0.5, 0.75, 0.9]. You can use these distributions to run statistical analyses and stop your experiment early if you reach statistical significance. Another way to look at these changes statically:

This shows a single experiment with a horizon of 1,000,000 trials (as opposed to the average results of 5000 experiments with a horizon of 250), and with more realistic values of [0.01, 0.009, 0.0105, 0.011, 0.015] (in this case, I’ve simulated click-through rate, CTR). But what I want to point out is that arm_1, the best arm, is used much more frequently due to the way epsilon-Greedy favors it. The 95% confidence interval (shaded area) around it is much tighter than the other arms. Just as in the gif above, where the best arm has a much tighter and taller bell curve, representing a more precise estimate of the value, this chart shows that using a multi-armed bandit approach allows you to exploit the best arm while still learning about the others, and reach statistical significance earlier than in a traditional A/B test.
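To make the Beta-distribution view of each arm concrete, here is a minimal sketch that summarizes an arm’s distribution from its success and failure counts. It assumes a uniform Beta(1, 1) starting point (every outcome equally likely before any data), and the function name is illustrative.

```python
def beta_posterior(successes, failures):
    """Return (a, b, mean, std) of the Beta distribution for one arm."""
    a = 1 + successes  # assumed uniform prior: Beta(1, 1)
    b = 1 + failures
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return a, b, mean, variance ** 0.5


# The same observed 50% success rate, with more and more trials behind it
for trials in (0, 10, 100, 10000):
    a, b, mean, std = beta_posterior(trials // 2, trials - trials // 2)
    print(f"trials={trials:>5}  Beta({a}, {b})  mean={mean:.3f}  std={std:.4f}")
```

The mean stays put while the standard deviation shrinks, which is the narrowing and growing taller of the curves described above: an arm that is pulled more often gets a tighter estimate.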

Softmax

An obvious flaw in epsilon-Greedy is that it explores completely at random. If we have two arms with very similar rewards, we need to explore a lot to learn which is better and so choose a high epsilon. However, if our two arms have vastly different rewards (and we don’t know this when we start the experiment, of course), we would still set a high epsilon and waste a lot of time on the lower paying reward. The Softmax algorithm (and its annealed counterpart) attempts to solve this problem by selecting each arm in the explore phase roughly in proportion to the currently expected reward.

The temperature parameter has a purpose similar to epsilon in the epsilon-Greedy algorithm: it balances the tendency to either explore or exploit. At the extremes, a temperature of 0.0 will always choose the best performing arm. A temperature of infinity will randomly choose any arm.
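A minimal sketch of Softmax arm selection, showing how the temperature reshapes the selection probabilities. The function names and the example values are assumptions for illustration; the temperature must be strictly positive here, so the "temperature of 0.0" case is approached as a limit.

```python
import math
import random


def softmax_probabilities(values, temperature):
    """Selection probability of each arm, proportional to exp(value / temperature)."""
    scaled = [v / temperature for v in values]
    top = max(scaled)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def select_arm(values, temperature):
    """Draw one arm according to the softmax probabilities."""
    probs = softmax_probabilities(values, temperature)
    return random.choices(range(len(values)), weights=probs, k=1)[0]


observed = [0.1, 0.5, 0.9]  # hypothetical observed success rates
for temperature in (0.01, 0.1, 1.0, 1000.0):
    probs = softmax_probabilities(observed, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At very low temperatures nearly all of the probability sits on the best observed arm, and at very high temperatures the arms are chosen almost uniformly.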

What I want you to observe when comparing these algorithms is their different behavior with regards to the explore/exploit balance. This is the crux of the multi-armed bandit problem.

UCB1

Whereas the Softmax algorithm takes into account the expected value of each arm, it’s certainly plausible that by sheer random chance a poorly performing arm will at first have several successes in a row and thus be favored by the algorithm during the exploit phase. These algorithms will under-explore arms which may have a high payout, even though they don’t have enough data to be confident. Thus, it seems reasonable to take into account how much we know about each arm and encourage an algorithm to slightly favor those arms whose behavior we are less confident about, so that we can learn more. The Upper Confidence Bound class of algorithms was developed for this purpose; here, I’ll demonstrate two versions, UCB1 and UCB2. They operate similarly.

UCB1 doesn’t display any randomness at all (you can see in my code on GitHub that I never import the random package at all!). It is fully deterministic, as opposed to both epsilon-Greedy and Softmax. Also, the UCB1 algorithm does not have any parameters needing tuning, which is why the below charts show only one variant. The key to the UCB1 algorithm is its “curiosity bonus”. When selecting an arm, it takes the expected reward of each arm and then adds a bonus which is calculated in inverse proportion to the confidence of that reward. It is optimistic about uncertainty. So lower confidence arms are given a bit of a boost relative to higher confidence arms. This causes the results of the algorithm to swing wildly from trial to trial, especially at the early phases, because each new trial provides more information to the chosen arm and so the other arms will essentially be favored more in the coming rounds.
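A minimal sketch of the “curiosity bonus” idea, using the standard UCB1 bonus of sqrt(2 * ln(total pulls) / pulls of this arm). The class and attribute names are assumptions for illustration and may differ from the GitHub code.

```python
import math


class UCB1:
    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select_arm(self):
        # Try every arm once before any bonus can be computed
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        # Expected reward plus a bonus that shrinks as an arm gets more pulls
        scores = [
            value + math.sqrt(2 * math.log(total) / count)
            for value, count in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=lambda i: scores[i])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# An arm with little data can outrank a better-looking, well-explored arm
algo = UCB1(2)
algo.counts, algo.values = [100, 1], [0.5, 0.4]
print(algo.select_arm())
```

There is no call to a random number generator anywhere: given the same history, the same arm is always chosen.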

UCB2

The UCB2 algorithm is a further development of UCB1. The innovation with UCB2 is to ensure that we trial the same arm for a certain period before trying a new one. This also ensures that in the long term, we periodically take a break from exploiting to re-explore the other arms. UCB2 is a good algorithm to use when rewards are expected to change over time; in the other algorithms, once the best arm is discovered it is heavily favored until the end of the experiment. UCB2 challenges that assumption.

UCB2 has a parameter, alpha, which tunes the length of UCB2’s periods of favoring certain arms.

Exp3

Finally, we have the Exp3 algorithm. The UCB class of algorithms is considered the best performing in a stochastic setting; i.e., the results of each trial are fully random. The Exp3 algorithm, in contrast, was developed to handle scenarios where the trials are adversarial; that is, we want to consider the possibility that the expected outcome of future trials may be changed by the results of previous trials. A good example of when the Exp3 algorithm might be good is with the stock market. Some investors see a stock listed at a low price per share and buy it up even though its current return isn’t that great, but the very act of buying up the stock in high volume causes its price to surge and its performance as well. The expected earnings of that stock are changing as a result of our algorithm predicting one thing or another. As such, in these experiments here, Exp3 appears to be performing much worse than the other algorithms. I’m running each trial fully randomly; this is a stochastic setting, which Exp3 was never developed to be strong in.
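For reference, a minimal sketch of the textbook Exp3 update, assuming rewards in [0, 1] and an exploration parameter gamma (a name and default chosen here for illustration). Each arm carries a weight, and only the pulled arm’s weight is updated, using a reward divided by the probability of having pulled it.

```python
import math
import random


class Exp3:
    def __init__(self, n_arms, gamma=0.1):
        self.gamma = gamma              # share of probability spread uniformly
        self.weights = [1.0] * n_arms

    def probabilities(self):
        total = sum(self.weights)
        k = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

    def select_arm(self):
        probs = self.probabilities()
        return random.choices(range(len(probs)), weights=probs, k=1)[0]

    def update(self, arm, reward):
        # Importance-weighted reward: rarely chosen arms get a larger correction
        estimated = reward / self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * estimated / len(self.weights))


algo = Exp3(3)
algo.update(0, 1.0)
print([round(p, 3) for p in algo.probabilities()])
```

Because every arm always keeps at least gamma / (number of arms) probability, the algorithm never stops exploring. That hedges against rewards that shift over time, but it costs performance when the arms are in fact stationary. Over very long runs the weights should be rescaled periodically to avoid overflow.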

Comparing Algorithms

So now, let’s take a look at all those algorithms together. As you can see, in this short set of trials, UCB2 and epsilon-Greedy look like they’re running together, with UCB2 taking many more opportunities to explore. However, UCB2 is improving slightly more quickly and indeed, over a longer timeframe, it surpasses epsilon-Greedy. Softmax tends to peak quite early, indicating that it continues to explore at the expense of exploiting its knowledge of the best arm. UCB1, being an early version of UCB2, trails its more innovative brother as would be expected. Exp3 is an interesting one; it appears to be the lowest performing by far, but this is to be expected as these trials are not what Exp3 is good at. If instead the trial environment were adversarial, we would expect Exp3 to be much more competitive.

There’s one algorithm in those charts which we haven’t discussed yet, Thompson Sampling. This algorithm is fully Bayesian. It generates a vector of expected rewards for each arm from a posterior distribution and then updates the distributions.

At each trial, the algorithm draws a random sample from each arm’s current posterior distribution and pulls the lever of the arm with the largest draw. Thompson Sampling learns very quickly which is the best arm and heavily favors it going forward, at the expense of exploration. Just look at the uncertainty (the width of the shaded area) in all the other arms! That’s the result of nearly pure exploitation and little exploration.
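A minimal sketch of Thompson Sampling for success/failure rewards, assuming a uniform Beta(1, 1) prior on each arm. The class name, attribute names, and simulated arm probabilities are illustrative assumptions.

```python
import random


class ThompsonSampling:
    def __init__(self, n_arms):
        self.successes = [0] * n_arms
        self.failures = [0] * n_arms

    def select_arm(self):
        # One random draw from each arm's Beta posterior; play the largest draw
        draws = [
            random.betavariate(1 + s, 1 + f)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(draws)), key=lambda i: draws[i])

    def update(self, arm, reward):
        # Updating the counts is the posterior update for a Beta distribution
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


random.seed(0)
probs = [0.1, 0.25, 0.5, 0.75, 0.9]  # hypothetical true success rates
algo = ThompsonSampling(len(probs))
for _ in range(2000):
    arm = algo.select_arm()
    algo.update(arm, random.random() < probs[arm])
pulls = [s + f for s, f in zip(algo.successes, algo.failures)]
print(pulls)
```

Exploration happens only through the spread of the posteriors: an arm with little data has a wide distribution and occasionally produces the largest draw, while a well-measured poor arm almost never does.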

When do you end your multi-armed bandit experiment?

Because Thompson Sampling draws from arms at the frequency given by their beta distribution, we can use many statistical techniques of uncertainty to know when one arm surges ahead due to superiority as opposed to chance. Google Analytics has developed a sound solution using what they call Potential Value Remaining. They first check for three criteria to be met, and then check that a “champion” arm has emerged, and call the experiment complete. Those three criteria are:

There is active daily traffic on the website.
The experiment has run a minimum of two weeks, in order to cancel out any weekly periodicity.
Potential Value Remaining is less than 1%.

At that point, once any arm is selected at least 95% of the time, the experiment concludes.

Potential Value Remaining is also called “regret” in the literature. It describes how much a metric such as CTR may still improve above the leader. When another arm is likely to beat the leader by no more than about 1%, that meager improvement isn’t worth continuing the test.

Potential value remaining in the test is calculated as the 95th percentile of the distribution of (θₘₐₓ - θ*) / θ*, where θₘₐₓ is the largest value of θ in a row of simulated draws (one draw per arm), and θ* is the value of θ in that row for the variation that is most likely to be optimal, with θ being the value drawn from each arm’s beta distribution. As before, for the details of the math please refer to my GitHub repo, or read the original paper published by a Google engineer.
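The calculation above can be sketched with a small Monte Carlo simulation. This is an illustrative reading of the description, not Google’s implementation: it assumes a uniform Beta(1, 1) prior, takes the “champion” to be the arm that wins the most simulated rows, and uses a simple sorted-index percentile. The function name and the example counts are hypothetical.

```python
import random


def potential_value_remaining(successes, failures, n_draws=10000, percentile=0.95):
    """Return (champion arm, potential value remaining) from Beta posteriors."""
    # Each row holds one draw of theta per arm
    rows = [
        [random.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        for _ in range(n_draws)
    ]
    # The champion is the arm that is largest in the most rows
    wins = [0] * len(successes)
    for row in rows:
        wins[row.index(max(row))] += 1
    champion = wins.index(max(wins))
    # Relative amount by which the best draw in each row beats the champion
    remaining = sorted((max(row) - row[champion]) / row[champion] for row in rows)
    return champion, remaining[int(percentile * (n_draws - 1))]


random.seed(1)
# Hypothetical counts: two arms with similar click-through rates
champion, pvr = potential_value_remaining([100, 105], [9900, 9895])
print(champion, round(pvr, 4))
```

The experiment would keep running while this value stays above the 1% threshold. Once the champion’s lead is clear, most rows contribute exactly zero and the 95th percentile collapses toward zero.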

In the traditional A/B test I described at the beginning of this article, with a confidence level of 95%, power of 80%, version A CTR of 1%, and hypothesizing a version B CTR of 1.05%, the required sample size was a minimum of 635,829 sample draws from each version. In my experiment, I rounded up to 700,000 draws for each arm.

When using Thompson Sampling with Google’s termination criteria, I simulated 100 experiments and averaged the results. The algorithm determined that version B was 5% better with an average of 5,357 pulls on the inferior version A arm, and 6,353 pulls on the superior version B.

In the traditional A/B test, I would have served my customers an inferior version of my website over 600,000 times, whereas Thompson Sampling required just over 5,000 to learn the same thing. That’s 120x fewer mistakes!

So which bandit is best? It really depends upon your application and needs. They all have their benefits and drawbacks and suitability to specific cases. Epsilon-Greedy and Softmax were early developments in the field and tend not to perform as well as, in particular, the Upper Confidence Bound algorithms. In the realm of web testing, the UCB algorithms do seem to be used most frequently although Thompson Sampling offers the benefits of termination criteria and is the algorithm used by Google Analytics’ Optimize tool. If your context is not stochastic though, you may want to try the Exp3 algorithm which performs better in an adversarial environment.

A/B testing — Is there a better way? An exploration of multi-armed bandits was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Forecasting in Python with Facebook Prophet

Greg Rafferty — Tue, 26 Nov 2019 03:45:00 GMT

How to tune and optimize Prophet using domain knowledge to add greater control to your forecasts.

Update: I’ve written a book about Prophet which has been published by Packt Publishing! The new and updated Second Edition is available for purchase on Amazon.

The book covers every detail of using Prophet starting with installation through model evaluation and tuning. Over a dozen datasets have been made available and used to demonstrate Prophet functionality from the simple to the advanced with fully working code. If you enjoy this Medium post, please consider ordering it here: https://amzn.to/42xTkOb! At more than 250 pages, it covers far more material than can be taught on Medium!

Thank you so much for supporting my book!

I’m Greg Rafferty, a data scientist in the Bay Area. The code for this project is available on my GitHub.

In this post, I’ll explain how to forecast using Facebook’s Prophet and demonstrate a few advanced techniques for handling trend inconsistencies by using domain knowledge. There are a lot of Prophet tutorials floating around the web, but none of them went into any depth about tuning a Prophet model, or about integrating analyst knowledge to help a model navigate the data. I intend to do both of those with this post.

https://www.instagram.com/p/BaKEnIPFUq-/

In a previous story about forecasting in Tableau, I used a modification of the ARIMA algorithm to forecast the number of passengers on commercial flights in the United States. The ARIMA approach works decently well with stationary data and when forecasting short time frames, but Facebook’s engineers have built a tool for those cases which ARIMA can’t handle. Prophet is built with its backend in Stan, a probabilistic programming language. This allows Prophet to have many of the advantages offered by Bayesian statistics, including seasonality, the inclusion of domain knowledge, and confidence intervals to add a data-driven estimate of risk.

I’m going to look at three sources of data to illustrate how to use, and some of the advantages of, Prophet. If you want to follow along, you’ll first need to install Prophet; Facebook’s documentation provides simple instructions. The notebook I used for this article provides the full code to build the models discussed.

Air Passengers

Let’s start out with something simple: the same Air Passengers data from my previous article. Prophet requires time series data to have a minimum of two columns: ds, which is the timestamp, and y, which holds the values. After loading our data, we need to format it as such:

import pandas as pd

passengers = pd.read_csv('data/AirPassengers.csv')

df = pd.DataFrame()
df['ds'] = pd.to_datetime(passengers['Month'])
df['y'] = passengers['#Passengers']

With just a few lines, Prophet can make a forecast model every bit as sophisticated as the ARIMA model I built previously. Here, I’m calling Prophet to make a 6-year forecast (frequency is monthly, periods are 12 months/year times 6 years):

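# Prophet 1.0+ installs as `prophet`; older releases used the name `fbprophet`
from prophet import Prophet
from prophet.plot import add_changepoints_to_plot

prophet = Prophet()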
prophet.fit(df)
future = prophet.make_future_dataframe(periods=12 * 6, freq='M')
forecast = prophet.predict(future)
fig = prophet.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), prophet, forecast)

Number of passengers (in the thousands) on commercial airlines in the US

Prophet has included the original data as the black dots and the blue line is the forecast model. The light blue area is the confidence interval. Using the add_changepoints_to_plot function added the red lines; the vertical dashed lines are changepoints Prophet identified where the trend changed, and the solid red line is the trend with all seasonality removed. This plot format is what I’ll be using throughout this article.

With that simple case out of the way, let’s move on to more complicated data.

Divvy bike share

Divvy is a bike share service in Chicago. I did a project previously where I analysed their data and correlated it with weather information scraped from Weather Underground. I knew this data exhibited strong seasonality so thought it would be a great demonstration of Prophet’s ability.

The Divvy data is on a per-ride level so to format the data for Prophet, I aggregated to the daily level and created columns for the mode of the “events” column per day (i.e., the weather conditions: 'not clear', 'rain or snow', 'clear', 'cloudy', 'tstorms', 'unknown'), the count of rides, and the mean of temperature.
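As a rough illustration of that aggregation, here is a minimal sketch on a handful of made-up rides. The column names (start_time, events, temperature) and the values are assumptions for the example, not the actual Divvy schema. It also turns the daily weather label into 0/1 columns, which is one way to get per-condition columns like the ones used as regressors later on.

```python
import pandas as pd

# Toy per-ride data; column names and values are illustrative assumptions.
rides = pd.DataFrame({
    'start_time': pd.to_datetime(['2016-07-01 08:05', '2016-07-01 17:40',
                                  '2016-07-01 18:10', '2016-07-02 09:15',
                                  '2016-07-02 13:30']),
    'events': ['clear', 'cloudy', 'cloudy', 'rain or snow', 'rain or snow'],
    'temperature': [71.0, 80.0, 78.0, 64.0, 66.0],
})

# Aggregate rides to one row per calendar day.
daily = (rides
         .groupby(rides['start_time'].dt.floor('D'))
         .agg(y=('events', 'size'),                            # count of rides
              temp=('temperature', 'mean'),                    # mean temperature
              events=('events', lambda s: s.mode().iloc[0]))   # most common weather
         .rename_axis('ds')
         .reset_index())

# One 0/1 column per weather condition seen in the data.
daily = pd.concat([daily, pd.get_dummies(daily['events']).astype(int)], axis=1)
print(daily)
```

The result has the ds and y columns Prophet expects, plus one numeric column per candidate regressor.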

Once formatted, let’s look at the number of rides per day:

So there’s clearly a seasonality to the data, and the trend appears to be increasing with time. With this data set, I want to demonstrate how to add additional regressors, in this case the weather and temperature. Let’s look at the temperature:

It looks a lot like the previous chart, but without the increasing trend. And this similarity makes sense because bicycle riders are going to ride more often when the weather is sunny and warm, so both plots should rise and fall in tandem.

In order to create a forecast with the addition of another regressor, it is necessary that the additional regressor have data for the forecasted period. For this reason, I’m cutting the Divvy data short a year so I can predict that year with the weather information. You can see I’m also adding Prophet’s default holidays for the US:

prophet = Prophet()
prophet.add_country_holidays(country_name='US')
prophet.fit(df[df['ds'] < pd.to_datetime('2017-01-01')])
future = prophet.make_future_dataframe(periods=365, freq='d')
forecast = prophet.predict(future)
fig = prophet.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), prophet, forecast)
plt.show()
fig2 = prophet.plot_components(forecast)
plt.show()

The above code block creates the trend plot as described before in the Air Passengers section:

Divvy trend plot

And the components plot:

Divvy component plot

The components plot consists of 3 sections: the trend, the holidays, and the seasonality. In fact, the sum of those 3 components accounts for the entirety of the model. The trend is simply what the data is showing if you subtract out all of the other components. The holidays plot shows the effect of all of the holidays included in the model. Holidays, as implemented in Prophet, can be thought of as unnatural events when the trend will deviate from the baseline but return once the event is over. Additional regressors, as we’ll explore below, are like holidays in that they cause the trend to deviate from the baseline, except that the trend will stay changed after the event.

In this case, the holidays all result in reduced ridership, which again makes sense if we realize that a lot of these riders are commuters to work. The weekly seasonality component shows that ridership is pretty constant throughout the week, but with a steep decline on the weekend. This is the evidence that supports the theory that most riders are commuters.

The final thing I want to note is that the yearly seasonality plot is quite wavy. These curves are built from Fourier series, essentially stacked sine waves. Clearly, the default in this case has too many degrees of freedom. In order to smooth out the curve, I’ll next create a Prophet model with the default yearly seasonality turned off and a custom yearly seasonality added in its place, but with fewer degrees of freedom. I’m also going to go ahead and add in those weather regressors in this model as well:

prophet = Prophet(growth='linear',
                  yearly_seasonality=False,
                  weekly_seasonality=True,
                  daily_seasonality=False,
                  holidays=None,
                  seasonality_mode='multiplicative',
                  seasonality_prior_scale=10,
                  holidays_prior_scale=10,
                  changepoint_prior_scale=.05,
                  mcmc_samples=0
                 ).add_seasonality(name='yearly',
                                    period=365.25,
                                    fourier_order=3,
                                    prior_scale=10,
                                    mode='additive')

prophet.add_country_holidays(country_name='US')
prophet.add_regressor('temp')
prophet.add_regressor('cloudy')
prophet.add_regressor('not clear')
prophet.add_regressor('rain or snow')
prophet.fit(df[df['ds'] < pd.to_datetime('2017')])
future = prophet.make_future_dataframe(periods=365, freq='D')
future['temp'] = df['temp']
future['cloudy'] = df['cloudy']
future['not clear'] = df['not clear']
future['rain or snow'] = df['rain or snow']
forecast = prophet.predict(future)
fig = prophet.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), prophet, forecast)
plt.show()
fig2 = prophet.plot_components(forecast)
plt.show()
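One caveat about the `future['temp'] = df['temp']` style of assignment above: pandas lines the values up by row index, not by date, so it only works when both frames share the same ordering and index. A more explicit alternative is to merge on ds. The sketch below uses a toy history and a hand-built stand-in for the frame that make_future_dataframe returns; the values are made up. As far as I know, Prophet raises an error when a regressor column contains missing values, so the final check is worth keeping.

```python
import pandas as pd

# Toy history with one regressor; values are made up for illustration.
df = pd.DataFrame({
    'ds': pd.date_range('2016-12-29', periods=6, freq='D'),
    'y': [10, 12, 9, 11, 13, 12],
    'temp': [30.0, 28.0, 25.0, 27.0, 31.0, 33.0],
})

# Stand-in for prophet.make_future_dataframe(): just a column of dates.
future = pd.DataFrame({'ds': pd.date_range('2016-12-29', periods=6, freq='D')})

# Shuffle the history's row order to show that a merge still lines up by date.
shuffled = df.sample(frac=1, random_state=0)

# Attach the regressor by date rather than by row position.
future = future.merge(shuffled[['ds', 'temp']], on='ds', how='left')

# Every future date should have found a regressor value.
assert future['temp'].notna().all()
print(future)
```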

The trend plot looks very similar so I’ll only share the components plot:

Divvy component plot with smooth annual seasonality and weather regressors
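To make the "degrees of freedom" point concrete, here is a small sketch of the sine and cosine columns that a Fourier-series seasonality is built from. The function name fourier_features is my own, and this is an illustration of the idea rather than Prophet's internal code. Each additional order adds one sine and one cosine column, so an order of 3 gives the model 6 curves to combine, while Prophet's default yearly order of 10 gives it 20.

```python
import numpy as np

def fourier_features(t_days, period, order):
    """Return sin/cos columns for harmonics 1..order of the given period."""
    t = np.asarray(t_days, dtype=float)
    columns = []
    for n in range(1, order + 1):
        angle = 2.0 * np.pi * n * t / period
        columns.append(np.sin(angle))
        columns.append(np.cos(angle))
    return np.column_stack(columns)

t = np.arange(730)  # two years of daily time steps
smooth = fourier_features(t, period=365.25, order=3)
wiggly = fourier_features(t, period=365.25, order=10)

print(smooth.shape)  # 6 columns
print(wiggly.shape)  # 20 columns
```

Fewer columns means fewer high-frequency wiggles the fitted curve can express, which is why lowering fourier_order smooths the yearly component.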

The last year of the trend is upwards in this plot, not downwards as in the previous one! The explanation is that the last year of data had lower average temperatures, which reduced ridership more than would otherwise have been expected. We also see that the yearly curve is smoothed out and there’s an additional plot: the extra_regressors_multiplicative plot. This shows the effect of the weather. What we’re seeing is to be expected: ridership is increased in the summer and decreased in winter, and a lot of that variability is accounted for by the weather. I want to see one more thing, just for a demonstration. I ran the above model yet again but this time only included the regressor for rain or snow. Here’s the components plot:

Divvy component plot of just the effect of rain or snow

This shows that when it’s raining or snowing, there will be about 1400 fewer rides per day than otherwise. Pretty cool, right!?

Lastly, I wanted to aggregate this dataset by hour to create one more component plot, the daily seasonality. Here’s what that plot looks like:

Divvy component plot for daily seasonality

As Rives noted, 4am is the worst possible hour to be awake. Clearly, Chicago’s bicycle riders agree. There’s a local peak just after 8am though: the morning commuters; and a global peak around 6pm: the evening commuters. I also see that there’s a small peak just after midnight: I like to think that this is people heading home from the bars. That’s it for Divvy data! Let’s move on to Instagram.

Instagram

Facebook developed Prophet to analyze its own data. It only seems fair therefore to test out Prophet on a fitting data set. I scoured Instagram for a few accounts exhibiting interesting trends which I wanted to explore and then I scraped the service for all the data for three accounts: @natgeo, @kosh_dp, and @jamesrodriguez10.

National Geographic

https://www.instagram.com/p/B5G_U_IgVKv/

In 2017, I was working on a project where I noticed an anomaly in National Geographic’s Instagram account. For the month of August in 2016, the number of likes per photo suddenly and inexplicably increased dramatically, but then returned to the baseline as soon as the month was over. I wanted to model this spike as due to a marketing campaign during the month to increase likes, and then see if I could predict the effect of a future marketing campaign.

Here’s what Natgeo’s likes per post chart looks like. The trend is obviously increasing and there’s also increased variance over time. There are a lot of outliers with dramatically high likes, but there’s that spike in August 2016 where all photos posted during that month had likes which were much higher than the surrounding posts:

I don’t want to speculate why this could be, but for the sake of this model let’s just pretend that Natgeo’s marketing department performed some month-long campaign specifically aimed at increasing likes. First, let’s build a model ignoring this fact so we have a baseline to which we can compare:

Natgeo likes per photo over time

Prophet seems to be confused with that spike. It’s attempting to add it to the yearly seasonality component, as can be seen by the August spikes each year in the solid blue line. Prophet wants this to be a recurring event. In order to tell Prophet that something special occurred in 2016 which is not repeating in other years, let’s create a holiday for this month:

promo = pd.DataFrame({'holiday': "Promo event",
                      'ds' : pd.to_datetime(['2016-08-01']),
                      'lower_window': 0,
                      'upper_window': 31})
future_promo = pd.DataFrame({'holiday': "Promo event",
                      'ds' : pd.to_datetime(['2020-08-01']),
                      'lower_window': 0,
                      'upper_window': 31})

promos_hypothetical = pd.concat([promo, future_promo])
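Each row in these dataframes marks a single date in ds, and the window columns stretch the event by whole days on either side of it. Going by Prophet's documentation, the window runs from ds + lower_window through ds + upper_window, with both ends included. The helper below, holiday_dates, is a hypothetical function of my own that only mirrors that rule so the covered dates can be inspected; it is not part of Prophet.

```python
import pandas as pd

def holiday_dates(ds, lower_window, upper_window):
    """Dates covered by one holiday row: ds + lower_window .. ds + upper_window, inclusive."""
    start = pd.Timestamp(ds) + pd.Timedelta(days=lower_window)
    end = pd.Timestamp(ds) + pd.Timedelta(days=upper_window)
    return pd.date_range(start, end, freq='D')

covered = holiday_dates('2016-08-01', lower_window=0, upper_window=31)
print(covered[0].date(), covered[-1].date(), len(covered))
```

Under that reading, a start of August 1 with upper_window=31 runs through September 1; a value of 30 would stop at August 31.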

The promo dataframe contains just the August 2016 event, and the promos_hypothetical dataframe contains an additional promo which Natgeo is hypothetically considering for August 2020. When adding a holiday, Prophet allows for a lower window and an upper window, essentially days to include with the holiday event if you, for example, want to include Black Friday with Thanksgiving, or Christmas Eve with Christmas. I’ve added 31 days after the “holiday”, to include the whole month in the event. Here’s the code and the new trend plot. Note that I’m just passing holidays=promo when instantiating the Prophet object:

prophet = Prophet(holidays=promo)
prophet.add_country_holidays(country_name='US')
prophet.fit(df)
future = prophet.make_future_dataframe(periods=365, freq='D')
forecast = prophet.predict(future)
fig = prophet.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), prophet, forecast)
plt.show()
fig2 = prophet.plot_components(forecast)
plt.show()

Natgeo likes per photo over time, with a marketing campaign in August 2016

Fantastic! Now Prophet is not adding that silly August bump annually but is indeed showing a nice spike in just 2016. So now let’s run the model again, but using that promos_hypothetical dataframe, to estimate what would happen if Natgeo were to run an identical campaign in 2020:

Natgeo likes per photo over time with a hypothetical marketing campaign upcoming in 2020

This demonstrates how to forecast behavior when adding in an unnatural event. Planned merchandise sales could be modeled this way, for instance. Now let’s move on to the next account.

Anastasia Kosh

https://www.instagram.com/p/BfZG2QCgL37/

Anastasia Kosh is a Russian photographer who posts whimsical self-portraits to her Instagram and makes music videos for YouTube. We were neighbors on the same street back when I lived in Moscow a few years ago; she had about 10,000 Instagram followers back then but in 2017 her YouTube account went viral in Russia and she has become something of a celebrity among tweens in Moscow. Her Instagram account has grown exponentially and is quickly approaching 1 million followers. This exponential growth seemed like a good challenge for Prophet.

This is the data we’re going to model:

It’s the classic hockey stick shape of optimistic growth, except that in this case it’s real! Modelling it with linear growth, the same way we did the other data above, results in unrealistic forecasts:

Anastasia Kosh likes per photo over time, with linear growth

That curve will just keep going on to infinity. Obviously, there’s an upper limit to how many likes a photo on Instagram can get. Theoretically, this would be equal to the number of unique accounts on the service. But realistically, not every account will see, nor like, the photo. This is where a little bit of domain knowledge from the analyst will come in handy. I decided to model this with logistic growth, which requires that Prophet be told a ceiling (Prophet calls it a cap) and a floor:

cap = 200000
floor = 0
df['cap'] = cap
df['floor'] = floor

Through my own knowledge of Instagram and a little bit of trial and error, I decided upon the ceiling of 200,000 likes, and a floor of 0 likes. It’s important to note that Prophet does allow these values to be defined as functions of time, so they needn’t be constant. In this case, constant values were exactly what I needed:

prophet = Prophet(growth='logistic',
                  changepoint_range=0.95,
                  yearly_seasonality=False,
                  weekly_seasonality=False,
                  daily_seasonality=False,
                  seasonality_prior_scale=10,
                  changepoint_prior_scale=.01)
prophet.add_country_holidays(country_name='RU')
prophet.fit(df)
future = prophet.make_future_dataframe(periods=1460, freq='D')
future['cap'] = cap
future['floor'] = floor
forecast = prophet.predict(future)
fig = prophet.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), prophet, forecast)
plt.show()
fig2 = prophet.plot_components(forecast)
plt.show()

I defined the growth to be logistic, turned off all seasonality (there didn’t appear to be much of it in my plots), and adjusted a few of the tuning parameters. I also added the default holidays for Russia, as that is where the majority of Anastasia’s followers are located. When calling the .fit method on the df, Prophet sees the cap and floor columns and knows to include them in the model. It’s very important though that when you create your forecast dataframe, you add these columns to it (that’s the future dataframe in the code block above). We’ll walk through this again in the next section. But now our trend plot looks a lot more realistic!

Anastasia Kosh likes per photo over time, with logistic growth
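For intuition about why the cap keeps the forecast realistic, here is a sketch of a plain logistic curve running between a floor and a cap. The function logistic_trend and the growth rate k and midpoint m are illustrative assumptions of mine; Prophet's own trend is more involved (it allows the rate to change at changepoints), so treat this as the basic shape only.

```python
import numpy as np

def logistic_trend(t, cap, floor, k, m):
    """Saturating growth between floor and cap; k is the growth rate, m the midpoint."""
    t = np.asarray(t, dtype=float)
    return floor + (cap - floor) / (1.0 + np.exp(-k * (t - m)))

t = np.arange(0, 2000, 250)  # days since the first post (made-up scale)
likes = logistic_trend(t, cap=200000, floor=0, k=0.01, m=1000)
print(np.round(likes))
```

The curve climbs quickly around the midpoint and then flattens as it approaches the cap, instead of heading off to infinity the way the linear trend does.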

Finally, let’s look at our last example.

James Rodríguez

https://www.instagram.com/p/BySl8I7HOWa/

James Rodríguez is a Colombian soccer player who was a standout performer in both the 2014 and 2018 World Cups. His Instagram account has had steady growth since its inception; but while working on a previous analysis, I noticed that during the two World Cups his account saw sudden and lasting spikes in followers. In contrast to the spikes in National Geographic’s account, which could be modeled as a holiday, Rodríguez’s growth did not return to the baseline after the two tournaments but redefined a new baseline. This is fundamentally different behavior and will require a different modelling approach to capture it.

This is what James Rodríguez’s likes per photo looks like throughout the account lifetime:

This is going to be difficult to model cleanly with only the techniques we’ve used so far in this tutorial. He experienced an increase in the trend baseline during the first World Cup in the summer of 2014, and then a spike, and potentially a changed baseline, during the second World Cup in the summer of 2018. Modelling this behavior with the default model doesn’t quite work:

James Rodríguez likes per photo over time

It’s not a terrible model; it just doesn’t neatly model the behavior around those two World Cup tournaments. If, as we did with National Geographic’s data above, we model those tournaments as holidays, we do see an improvement in the model:

wc_2014 = pd.DataFrame({'holiday': "World Cup 2014",
                      'ds' : pd.to_datetime(['2014-06-12']),
                      'lower_window': 0,
                      'upper_window': 40})
wc_2018 = pd.DataFrame({'holiday': "World Cup 2018",
                      'ds' : pd.to_datetime(['2018-06-14']),
                      'lower_window': 0,
                      'upper_window': 40})

world_cup = pd.concat([wc_2014, wc_2018])

prophet = Prophet(yearly_seasonality=False,
                  weekly_seasonality=False,
                  daily_seasonality=False,
                  holidays=world_cup,
                  changepoint_prior_scale=.1)
prophet.fit(df)
future = prophet.make_future_dataframe(periods=365, freq='D')
forecast = prophet.predict(future)
fig = prophet.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), prophet, forecast)
plt.show()
fig2 = prophet.plot_components(forecast)
plt.show()

James Rodríguez likes per photo over time, with holidays added for the World Cups

I still don’t like how slow the model is to adapt to the changed trendline, especially around the 2014 World Cup. It’s just too smooth of a transition. By adding additional regressors though, we can force Prophet to consider an abrupt change.

In this case, I’m defining two periods for each tournament, during and after. Modelling it this way assumes that before the tournament, there will be a certain trend line, during the tournament there will be a linear change to that trend line, and after the tournament, there will be yet another change. I define these periods as either 0 or 1, on or off, and let Prophet train itself on the data to learn the magnitudes:

df['during_world_cup_2014'] = 0
df.loc[(df['ds'] >= pd.to_datetime('2014-05-02')) & (df['ds'] <= pd.to_datetime('2014-08-25')), 'during_world_cup_2014'] = 1
df['after_world_cup_2014'] = 0
df.loc[(df['ds'] >= pd.to_datetime('2014-08-25')), 'after_world_cup_2014'] = 1

df['during_world_cup_2018'] = 0
df.loc[(df['ds'] >= pd.to_datetime('2018-06-04')) & (df['ds'] <= pd.to_datetime('2018-07-03')), 'during_world_cup_2018'] = 1
df['after_world_cup_2018'] = 0
df.loc[(df['ds'] >= pd.to_datetime('2018-07-03')), 'after_world_cup_2018'] = 1
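Since the same four columns have to exist in both the training dataframe and the future dataframe, the pattern can be wrapped in a small helper to keep the two in sync. The function add_event_flags below is a hypothetical convenience of my own, shown on a toy date range; it uses the same comparisons as the code above (the end date counts as both "during" and "after").

```python
import pandas as pd

def add_event_flags(frame, name, start, end):
    """Add 0/1 columns marking dates during [start, end] and on or after end."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    out = frame.copy()
    out[f'during_{name}'] = ((out['ds'] >= start) & (out['ds'] <= end)).astype(int)
    out[f'after_{name}'] = (out['ds'] >= end).astype(int)
    return out

# Toy frame standing in for either df or future.
demo = pd.DataFrame({'ds': pd.date_range('2018-06-01', '2018-07-10', freq='D')})
demo = add_event_flags(demo, 'world_cup_2018', '2018-06-04', '2018-07-03')
print(demo[['during_world_cup_2018', 'after_world_cup_2018']].sum())
```

Calling the same helper on both frames with the same dates avoids the two copies of the logic drifting apart.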

Note where I’m updating the future dataframe to include these “holiday” events below:

prophet = Prophet(yearly_seasonality=False,
                  weekly_seasonality=False,
                  daily_seasonality=False,
                  holidays=world_cup,
                  changepoint_prior_scale=.1)

prophet.add_regressor('during_world_cup_2014', mode='additive')
prophet.add_regressor('after_world_cup_2014', mode='additive')
prophet.add_regressor('during_world_cup_2018', mode='additive')
prophet.add_regressor('after_world_cup_2018', mode='additive')

prophet.fit(df)
future = prophet.make_future_dataframe(periods=365)

future['during_world_cup_2014'] = 0
future.loc[(future['ds'] >= pd.to_datetime('2014-05-02')) & (future['ds'] <= pd.to_datetime('2014-08-25')), 'during_world_cup_2014'] = 1
future['after_world_cup_2014'] = 0
future.loc[(future['ds'] >= pd.to_datetime('2014-08-25')), 'after_world_cup_2014'] = 1

future['during_world_cup_2018'] = 0
future.loc[(future['ds'] >= pd.to_datetime('2018-06-04')) & (future['ds'] <= pd.to_datetime('2018-07-03')), 'during_world_cup_2018'] = 1
future['after_world_cup_2018'] = 0
future.loc[(future['ds'] >= pd.to_datetime('2018-07-03')), 'after_world_cup_2018'] = 1

forecast = prophet.predict(future)
fig = prophet.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), prophet, forecast)
plt.show()
fig2 = prophet.plot_components(forecast)
plt.show()

James Rodríguez likes per photo over time, with additional regressors

Here, the blue line is what we should be looking at. The red line shows just the trend, with the influence of the additional regressors and holidays subtracted out. Look how the blue line takes sharp jumps during the World Cups. That’s exactly the behavior our domain knowledge tells us would happen! After Rodríguez scored his first World Cup goal, suddenly thousands of new followers arrived on his account. Let’s take a look at the component plot, just to see the specific effect of these additional regressors:

James Rodríguez component plot for the World Cup regressors

This tells us that in 2013 and the beginning of 2014, the World Cup had no effect on Rodríguez’s likes per photo. During the 2014 World Cup, there was a dramatic uptick in his average likes per photo which continued after the tournament was over (this can be explained because he gained so many active followers during the event). There was a similar, but less dramatic, event during the 2018 World Cup, presumably because by this point there weren’t as many soccer fans left to discover his account and follow him.

Thanks for sticking around for this whole post! I hope you now understand how to use holidays, linear vs. logistic growth rates, and additional regressors to enrich your Prophet forecasts significantly. Facebook has built an incredibly valuable tool with Prophet, making what was once a very difficult exercise of probabilistic forecasting into a simple set of parameters with enormous latitude for tuning. Good luck with your forecasting!

Forecasting in Python with Facebook Prophet was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Basic NLP on the Texts of Harry Potter: Sentiment Analysis

Greg Rafferty — Fri, 21 Dec 2018 17:45:40 GMT

Sentiment Analysis on the Texts of Harry Potter

With bonus tutorial of Matplotlib advanced features!

I’m Greg Rafferty, a data scientist in the Bay Area. You can check out the code for this project on my github. Feel free to contact me with any questions!

In this series of posts, I’m looking at a few handy NLP techniques, through the lens of Harry Potter. Previous posts in this series on basic NLP looked at Topic Modeling with Latent Dirichlet Allocation, Regular Expressions, and text summarization.

What is sentiment analysis?

In a previous post I looked at topic modeling, which is an NLP technique to learn the subject of a given text. Sentiment analysis exists to learn what was said about that topic — was it good or bad? With the growing use of the internet in our daily lives, vast amounts of unstructured text are being published every second of every day, in blog posts, forums, social media, and review sites, to name a few. Sentiment analysis systems can take this unstructured data and automatically add structure to it, capturing the public’s opinion about products, services, brands, politics, etc. This data holds immense value in the fields of marketing analysis, public relations, product reviews, net promoter scoring, product feedback, and customer service, for example.

I’ve been demonstrating a lot of these NLP tasks using the text of Harry Potter. The books are rich in emotionally charged experiences that the reader can viscerally feel. Can a computer capture that feeling? Let’s take a look.

VADER

I used C.J. Hutto’s VADER package to extract the sentiment of each book. VADER, which stands for Valence Aware Dictionary and sEntiment Reasoning, is a lexicon and rule-based tool that is specifically tuned to social media. Given a string of text, it outputs a decimal between 0 and 1 for each of negativity, positivity, and neutrality for the text, as well as a compound score from -1 to 1 which is an aggregate measure.

A complete description of the development, validation, and evaluation of the VADER package can be read in this paper, but the gist is that the package’s authors first constructed a list of lexical features correlated with sentiment and then combined the list with some rules that describe how the grammatical structure of a phrase will intensify or diminish the sentiment. When tested on tweets, VADER actually outperformed individual human raters at classifying sentiment, scoring 96% to their 84%.

VADER works best on short texts (a couple sentences at most), and applying it to an entire chapter at once resulted in extreme and largely worthless scores. Instead, I looped over each sentence individually, got the VADER scores, and then took an average of all sentences in a chapter.
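The sentence-by-sentence averaging can be sketched roughly as follows. This is not the notebook's code: the sentence splitter is a naive regular expression, the helper names are my own, and the scorer is passed in as a function so the averaging logic can be tried without VADER installed. The vader_compound_scorer helper assumes the vaderSentiment package and uses its SentimentIntensityAnalyzer.polarity_scores method.

```python
import re
from statistics import mean

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def chapter_score(text, score_sentence):
    """Average a per-sentence score over every sentence in the text."""
    scores = [score_sentence(s) for s in split_sentences(text)]
    return mean(scores) if scores else 0.0

def vader_compound_scorer():
    """Build a scorer from VADER; requires the vaderSentiment package."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()
    return lambda sentence: analyzer.polarity_scores(sentence)['compound']

# Usage, if vaderSentiment is installed:
# score = chapter_score(chapter_text, vader_compound_scorer())
```

A proper sentence tokenizer would handle abbreviations and dialogue better than the regular expression here; the point is only the score-then-average structure.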

Goblet of Fire, Chapter 16: Harry discovers the missionary position

By plotting the VADER compound score for each chapter of each book, we can clearly mark events in the books. The three greatest spikes in the chart above are Harry being chosen by the Goblet of Fire around chapter 70 of the series, Cedric Diggory’s death at about chapter 88, and Dumbledore’s death at about chapter 160.

Here’s the code to produce that chart (the full notebook is available on my Github). The data exists in a dictionary with each book’s title as a key; the value for each book is another dictionary with each chapter number as a key. The value for each chapter is a tuple consisting of the chapter title and the chapter text; the plotting code below also reads a third element from each tuple, the dictionary of VADER scores for that chapter. I defined a function to calculate the moving average of the data, which essentially smooths out the curve a bit and makes it easier to see long multi-chapter arcs throughout the stories. In order to plot each book as a different color, I created a dictionary called book_indices with each book’s title as the key and the values being a 2-element tuple of the book’s starting chapter number and ending chapter number (as if all the books were concatenated with chapters numbered sequentially throughout the entire series). I then plotted the story arc in segments based upon their chapter numbers.

import matplotlib.pyplot as plt
import numpy as np

# Use FiveThirtyEight style theme
plt.style.use('fivethirtyeight')

# Moving Average function used for the dotted line
def movingaverage(interval, window_size):
    window = np.ones(int(window_size))/float(window_size)
    return np.convolve(interval, window, 'same')

length = sum([len(hp[book]) for book in hp])
x = np.linspace(0, length - 1, num=length)
y = [hp[book][chapter][2]['compound'] for book in hp for chapter in hp[book]]

plt.figure(figsize=(15, 10))
for book in book_indices:
    plt.plot(x[book_indices[book][0]: book_indices[book][1]],
             y[book_indices[book][0]: book_indices[book][1]],
             label=book)
plt.plot(movingaverage(y, 10), color='k', linewidth=3, linestyle=':', label = 'Moving Average')
# xmin/xmax are axes fractions (0 to 1); the defaults already span the full width
plt.axhline(y=0, alpha=.25, color='r', linestyle='--', linewidth=3)
plt.legend(loc='best', fontsize=15)
plt.title('Emotional Sentiment of the Harry Potter series', fontsize=20)
plt.xlabel('Chapter', fontsize=15)
plt.ylabel('Average Sentiment', fontsize=15)
plt.show()
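The plotting code above assumes book_indices already exists. One way it could be built from the nested dictionary is sketched below, using a toy stand-in for hp. The function name is my own, and I am assuming the end position is exclusive so that slices like x[start:end] cover every chapter of a book; the notebook may define it differently.

```python
# Toy stand-in for the hp dictionary: {book title: {chapter number: (title, text)}}.
hp_demo = {
    'Book One': {1: ('First', 'chapter text'), 2: ('Second', 'chapter text'),
                 3: ('Third', 'chapter text')},
    'Book Two': {1: ('Fourth', 'chapter text'), 2: ('Fifth', 'chapter text')},
}

def build_book_indices(books):
    """Map each title to (start, end) positions in the series-wide chapter list."""
    indices = {}
    start = 0
    for title, chapters in books.items():
        end = start + len(chapters)  # end is exclusive, matching slice semantics
        indices[title] = (start, end)
        start = end
    return indices

book_indices_demo = build_book_indices(hp_demo)
print(book_indices_demo)
```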

I also made this same chart using the TextBlob Naive Bayes and Pattern analyzers with worse results (see the Jupyter notebook on my Github for these charts). The Naive Bayes model was trained on movie reviews which must not translate well to the Harry Potter universe. The Pattern analyzer worked much better (almost as well as VADER); it is based on the Pattern library, a rule-based model very similar to VADER.

Emotion Lexicon

I also looked at emotions by using a lexicon created by the National Research Council of Canada of over 14,000 words, each scored as either associated or not-associated with any of two sentiments (negative, positive) or eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust). They kindly provided me access to the lexicon, and I wrote up a Python script which loops over each word in a chapter, looks it up in the lexicon, and outputs whichever emotions the word was associated with. Each chapter was then assigned a score for each emotion corresponding to a ratio of how many words associated with that emotion the chapter contains compared to the total word count in the chapter (this basically normalizes the scores).
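The scoring described above can be sketched in a few lines. The three-word lexicon here is made up purely for illustration (the real NRC lexicon is far larger and is not reproduced), and the function name and tokenizing rule are my own assumptions rather than the script from the notebook.

```python
import re

# Tiny made-up lexicon for illustration; not the real NRC data.
toy_lexicon = {
    'dark': {'fear', 'sadness'},
    'friend': {'joy', 'trust'},
    'scream': {'fear', 'anger'},
}

def emotion_ratios(text, lexicon, emotions):
    """Share of words in the text associated with each emotion."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = {emotion: 0 for emotion in emotions}
    for word in words:
        for emotion in lexicon.get(word, ()):
            if emotion in counts:
                counts[emotion] += 1
    total = len(words)
    return {e: (counts[e] / total if total else 0.0) for e in emotions}

ratios = emotion_ratios('A scream in the dark woke his friend',
                        toy_lexicon, ['anger', 'fear', 'joy', 'sadness', 'trust'])
print(ratios)
```

Dividing by the total word count is what makes a long chapter and a short chapter comparable.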

Here are plots of the ‘Anger’ and ‘Sadness’ sentiments. I find it interesting that anger always exists with sadness, but sadness can sometimes exist without anger:

Wow, Voldemort. You really pissed off Harry when you killed the adults in his life

Those mood swings hit hard during puberty

Again, take a look at the Jupyter notebook on my Github to see detailed charts for all sentiments. Here’s a condensed version:

Let’s see how I made all those subplots:

length = sum([len(hp[book]) for book in hp])
x = np.linspace(0, length - 1, num=length)

fig, ax = plt.subplots(4, 3, figsize=(15, 15), facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace = .5, wspace=.1)
fig.suptitle('Sentiment of the Harry Potter series', fontsize=20, y=1.02)
fig.subplots_adjust(top=0.88)

ax = ax.ravel()

for i, emotion in enumerate(emotions):
    y = [hp_df.loc[book].loc[hp[book][chapter][0]][emotion] for book in hp for chapter in hp[book]]
    for book in book_indices:
        ax[i].plot(x[book_indices[book][0]: book_indices[book][1]],
                 y[book_indices[book][0]: book_indices[book][1]],
                 label=book, linewidth=2)

    ax[i].set_title('{} Sentiment'.format(emotion.title()))
    ax[i].set_xticks([])

fig.legend(list(hp), loc='upper right', fontsize=15, bbox_to_anchor=(.85, .2))
fig.tight_layout()
fig.delaxes(ax[-1])
fig.delaxes(ax[-2])
plt.show()

But it really becomes interesting to see how all the sentiments compare to each other. Overlaying 10 lines with this much variance quickly became a mess, so I again used the moving average:

It’s interesting to see contradicting emotions acting counter to each other, most obviously the pink and brown lines above for ‘Positive’ and ‘Negative’ sentiment. Note that, due to the moving average window size of 20 data points, the first 10 and last 10 chapters have been left off the plot.

I removed the y-axis because those numbers are meaningless to us (mere decimals: the ratio of words of that emotion to total words in the chapter). I also removed the horizontal and vertical chart lines to clean up the plot. I don’t particularly care to mark regular chapter numbers but I do want to mark the books; therefore, I added those vertical dotted lines. The legend has been reversed in this plot, which isn’t really necessary for readability or anything but I did it for consistency with the area and column charts coming up next.

Here’s how I made it:

# use the Tableau color scheme of 10 colors
# (plt.get_cmap avoids a separate `import matplotlib`)
tab10 = plt.get_cmap('tab10')

length = sum([len(hp[book]) for book in hp])
window = 20

# use index slicing to remove data points outside the window
x = np.linspace(0, length - 1, num=length)[int(window / 2): -int(window / 2)]

fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(1, 1, 1)

# Loop over the emotions with enumerate in order to track colors
for c, emotion in enumerate(emotions):
    y = movingaverage([hp_df.loc[book].loc[hp[book][chapter][0]][emotion] for book in hp for chapter in hp[book]], window)[int(window / 2): -int(window / 2)]
    plt.plot(x, y, linewidth=5, label=emotion, color=(tab10(c)))
    
# Plot vertical lines marking the books
for book in book_indices:
    plt.axvline(x=book_indices[book][0], color='black', linewidth=2, linestyle=':')
plt.axvline(x=book_indices[book][1], color='black', linewidth=2, linestyle=':')

plt.legend(loc='best', fontsize=15, bbox_to_anchor=(1.2, 1))
plt.title('Emotional Sentiment of the Harry Potter series', fontsize=20)
plt.ylabel('Relative Sentiment', fontsize=15)

# Use the book titles for X ticks, rotate them, center the left edge
plt.xticks([(book_indices[book][0] + book_indices[book][1]) / 2 for book in book_indices],
           list(hp),
           rotation=-30,
           fontsize=15,
           ha='left')
plt.yticks([])

# Reverse the order of the legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], loc='best', fontsize=15, bbox_to_anchor=(1.2, 1))

ax.grid(False)

plt.show()

I also made an area plot to show the overall emotive qualities of each chapter. This is again a moving average in order to smooth out the more extreme spikes and to show the story arc better across all books:

The books seem to start with a bit of trailing emotion from the previous story but quickly calm down during the middle chapters only to pick back up again at the end.

length = sum([len(hp[book]) for book in hp])
window = 10
x = np.linspace(0, length - 1, num=length)[int(window / 2): -int(window / 2)]

fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(1, 1, 1)

y = [movingaverage(hp_df[emotion].tolist(), window)[int(window / 2): -int(window / 2)] for emotion in emotions]

plt.stackplot(x, y, colors=(tab10(0),
                            tab10(.1),
                            tab10(.2),
                            tab10(.3),
                            tab10(.4),
                            tab10(.5),
                            tab10(.6),
                            tab10(.7),
                            tab10(.8),
                            tab10(.9)), labels=emotions)

# Plot vertical lines marking the books
for book in book_indices:
    plt.axvline(x=book_indices[book][0], color='black', linewidth=3, linestyle=':')
# After the loop, `book` still holds the last title, so this marks the end of the series
plt.axvline(x=book_indices[book][1], color='black', linewidth=3, linestyle=':')

plt.title('Emotional Sentiment of the Harry Potter series', fontsize=20)
plt.xticks([(book_indices[book][0] + book_indices[book][1]) / 2 for book in book_indices],
           list(hp),
           rotation=-30,
           fontsize=15,
           ha='left')
plt.yticks([])
plt.ylabel('Relative Sentiment', fontsize=15)

# Reverse the legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], loc='best', fontsize=15, bbox_to_anchor=(1.2, 1))

ax.grid(False)

plt.show()

Note how in this chart, reversing the legend became necessary for readability. By default, the legend items are listed from top to bottom in the order they were plotted (here, alphabetical), but the data is stacked from the bottom up. So the colors of the legend and the area plot run in opposite directions, which to my eye is quite confusing and difficult to follow. So with ‘Anger’ plotted at the bottom, I also wanted it to be on the bottom of the legend and likewise with ‘Trust’ at the top.

And lastly, a stacked bar chart to show the weights of the various sentiments across the books:

Naturally, words associated with any of the positive emotions would also be associated with the ‘Positive’ sentiment, and likewise for ‘Negative’, so it shouldn’t come as a surprise that those two sentiments carry the bulk of the emotive quality of the books. I find it notable that the emotions are relatively consistent from book to book, with just slight differences in magnitude but consistent weights, except for the ‘Fear’ emotion in red; it seems to exhibit the most variance across the series. I also would have expected the cumulative magnitude of sentiments to increase throughout the series as the stakes became higher and higher. However, although the final book is indeed the highest, the other six books don’t show this gradual increase but almost the opposite, with a steady decline starting with book 2.

books = list(hp)
margin_bottom = np.zeros(len(books))

fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(1, 1, 1)

for c, emotion in enumerate(emotions):
    y = np.array(hp_df2[emotion])
    plt.bar(books, y, bottom=margin_bottom, label=emotion, color=(tab10(c)))
    margin_bottom += y

# Reverse the legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], loc='best', fontsize=15, bbox_to_anchor=(1.2, 1))

plt.title('Emotional Sentiment of the Harry Potter series', fontsize=20)
plt.xticks(books, books, rotation=-30, ha='left', fontsize=15)
plt.ylabel('Relative Sentiment Score', fontsize=15)
plt.yticks([])
ax.grid(False)
plt.show()

The tricky bit in this plot is using the margin_bottom variable to stack each of the columns. Other than that, it just uses a couple tricks from the previous plots.
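
The stacking logic can be seen without any plotting at all. This is a small sketch with made-up scores for three emotions across two books (the numbers are hypothetical, not taken from hp_df2); it only prints where each layer would start.

```python
# Hypothetical per-book scores: one list entry per book
scores = {
    "anger": [0.2, 0.3],
    "fear": [0.1, 0.4],
    "joy": [0.3, 0.1],
}

margin_bottom = [0.0, 0.0]   # where the next layer starts, per book
for emotion, heights in scores.items():
    print(emotion, "starts at", margin_bottom)
    # this is the point where plt.bar(..., bottom=margin_bottom) would be called
    margin_bottom = [b + h for b, h in zip(margin_bottom, heights)]

print("total column heights:", margin_bottom)
```

Each layer sits on the running total of the layers drawn before it, which is all that margin_bottom keeps track of.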

Basic NLP on the Texts of Harry Potter: Sentiment Analysis was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Text Summarization on the Books of Harry Potter

Greg Rafferty — Wed, 12 Dec 2018 22:23:19 GMT

A comparison of several algorithms

“Harry sat with Hermione and Ron in the library as the sun set outside, tearing feverishly through page after page of spells, hidden from one another by the massive piles of books on the desk in front.”

I’m Greg Rafferty, a data scientist in the Bay Area. You can check out the code for this project on my github. Feel free to contact me with any questions!

Hermione interrupted them. “Aren’t you two ever going to read Hogwarts, A History?”

How many times throughout the Harry Potter series does Hermione bug Harry and Ron to read the enormous tome Hogwarts, A History? Hint: it’s a lot. How many nights do the three of them spend in the library, reading through every book they can find to figure out who Nicolas Flamel is, or how to survive underwater, or preparing for their O.W.L.s? The mistake they’re all making is to try to read everything themselves.

Remember when you were in school and stumbled upon the CliffsNotes summary of that book you never read but were supposed to write an essay about? That’s basically what text summarization does: provide the CliffsNotes version for any large document. Now, CliffsNotes have been written by rather well-educated people who are familiar with the book they’re summarizing. But this is the twenty-first century, aren’t computers supposed to be putting humans out of work? I looked into a few text summarization algorithms to see if we’re ready to put poor old Clifton out of a job.

There are two types of text summarization algorithms: extractive and abstractive. All extractive summarization algorithms attempt to score the phrases or sentences in a document and return only the most highly informative blocks of text. Abstractive text summarization actually creates new text which doesn’t exist in that form in the document. Abstractive summarization is what you might do when explaining a book you read to your friend, and it is much more difficult for a computer to do than extractive summarization. Computers just aren’t that great at the act of creation. To date, there aren’t any abstractive summarization techniques which work suitably well on long documents. The best performing ones merely create a sentence based upon a single paragraph, or cut the length of a sentence in half while maintaining as much information as possible. Often, grammar suffers horribly. They’re usually based upon neural network models. This post will focus on the much simpler extractive text summarization techniques.

Most of the algorithms I’ll present are packaged together in the sumy package for Python but I also use one summarizer from the Gensim package and one other technique I wrote myself using LDA topic keywords to enrich the sumy EdmundsonSummarizer. All examples output five-sentence summaries of the first chapter of Harry Potter and the Sorcerer’s Stone. See my Jupyter notebook for complete code. And a word of caution: don’t judge the results too harshly. They’re not great… (text summarization does seem to work better on drier works of non-fiction)

LexRank Summarizer

LexRank is an unsupervised approach that gets its inspiration from the same ideas behind Google’s PageRank algorithm. The authors say it is “based on the concept of eigenvector centrality in a graph representation of sentences”, using “a connectivity matrix based on intra-sentence cosine similarity.” Ok, so in a nutshell, it treats every sentence as a node in a graph, links sentences that use similar words, and selects the sentences that are most central, i.e. most similar to many of the other sentences.

"The Potters, that's right, that's what I heard —" "— yes, their son, Harry —" Mr. Dursley stopped dead.
Twelve times he clicked the Put-Outer, until the only lights left on the whole street were two tiny pinpricks in the distance, which were the eyes of the cat watching him.
Dumbledore slipped the Put-Outer back inside his cloak and set off down the street toward number four, where he sat down on the wall next to the cat.
"But I c-c-can't stand it — Lily an' James dead — an' poor little Harry off ter live with Muggles —" "Yes, yes, it's all very sad, but get a grip on yourself, Hagrid, or we'll be found," Professor McGonagall whispered, patting Hagrid gingerly on the arm as Dumbledore stepped over the low garden wall and walked to the front door.
Dumbledore turned and walked back down the street.
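
The graph idea behind LexRank can be sketched in a few lines of plain Python. This is a simplified illustration, not sumy's implementation: it uses raw word counts where the paper uses TF-IDF weights, and the function names, the damping value of 0.85, and the toy sentences are all assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    # Edge weights: similarity between every pair of distinct sentences
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    # PageRank-style power iteration over the similarity graph
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * sim[j][i] / out_weight[j] for j in range(n) if out_weight[j])
            for i in range(n)
        ]
    return scores


sentences = [
    "the cat sat on the wall",
    "the cat watched the street",
    "dumbledore sat on the wall next to the cat",
    "owls flew overhead",
]
scores = rank_sentences(sentences)
best = max(range(len(sentences)), key=scores.__getitem__)
print(sentences[best])
```

A sentence that shares no words with the rest (the owls, here) receives no votes from its neighbours and ends up with the lowest score.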

Luhn Summarizer

One of the first text summarization algorithms was published in 1958 by Hans Peter Luhn, working at IBM Research. Luhn’s algorithm is a naive approach based on word frequency (a forerunner of TF-IDF) that looks at the “window size” of non-important words between words of high importance. It also assigns higher weights to sentences occurring near the beginning of a document.

It was now reading the sign that said Privet Drive — no, looking at the sign; cats couldn't read maps or signs.
He didn't see the owls swooping past in broad daylight, though people down in the street did; they pointed and gazed open-mouthed as owl after owl sped overhead.
No one knows why, or how, but they're saying that when he couldn't kill Harry Potter, Voldemort's power somehow broke — and that's why he's gone."
"But I c-c-can't stand it — Lily an' James dead — an' poor little Harry off ter live with Muggles —" "Yes, yes, it's all very sad, but get a grip on yourself, Hagrid, or we'll be found," Professor McGonagall whispered, patting Hagrid gingerly on the arm as Dumbledore stepped over the low garden wall and walked to the front door.
G'night, Professor McGonagall — Professor Dumbledore, sir."
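
Here is a rough sketch of the “window” scoring idea: important words that sit close together form a cluster, and a sentence is scored by its best cluster as (important words in the cluster)² divided by the cluster's length. The function name, the max_gap default of 4, and the hand-picked set of important words are assumptions for illustration; sumy's LuhnSummarizer may differ in the details.

```python
def luhn_score(words, significant, max_gap=4):
    """Best (significant words in cluster)^2 / (cluster length) over all clusters."""
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:   # few enough filler words: same cluster
            count += 1
        else:                           # gap too wide: close the cluster, start a new one
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))


sentence = "he did not see the owls swooping past in broad daylight".split()
significant = {"owls", "daylight"}   # hypothetical; normally chosen by frequency
score = luhn_score(sentence, significant)
print(round(score, 3))
```

In this example the two important words are four filler words apart, so they form a single cluster of length six.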

LSA Summarizer

Latent Semantic Analysis is a relatively new algorithm which combines term frequency with singular value decomposition.

He dashed back across the road, hurried up to his office, snapped at his secretary not to disturb him, seized his telephone, and had almost finished dialing his home number when he changed his mind.
It seemed that Professor McGonagall had reached the point she was most anxious to discuss, the real reason she had been waiting on a cold, hard wall all day, for neither as a cat nor as a woman had she fixed Dumbledore with such a piercing stare as she did now.
He looked simply too big to be allowed, and so wild — long tangles of bushy black hair and beard hid most of his face, he had hands the size of trash can lids, and his feet in their leather boots were like baby dolphins.
For a full minute the three of them stood and looked at the little bundle; Hagrid's shoulders shook, Professor McGonagall blinked furiously, and the twinkling light that usually shone from Dumbledore's eyes seemed to have gone out.
A breeze ruffled the neat hedges of Privet Drive, which lay silent and tidy under the inky sky, the very last place you would expect astonishing things to happen.

TextRank Summarizer

TextRank is another text summarizer based on the ideas of PageRank, and was developed at about the same time as LexRank, though by a different group of people. TextRank is a bit more simplistic than LexRank; although both algorithms are very similar, LexRank applies a heuristic post-processing step to remove highly redundant sentences.

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.
They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.
Mr. Dursley was the director of a firm called Grunnings, which made drills.
He was a big, beefy man with hardly any neck, although he did have a very large mustache.
Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors.

Edmundson Summarizer

In 1969, Harold Edmundson developed the summarizer bearing his name. Edmundson’s algorithm was, along with Luhn’s, one of the seminal text summarization techniques. What sets the Edmundson summarizer apart from the others is that it takes into account “bonus words”, words stated by the user as of high importance; “stigma words”, words of low importance or even negative importance; and “stop words”, which are the same as used elsewhere in NLP processing. Edmundson suggested using the words in a document’s title as bonus words. Using the chapter title as bonus words, this is what Edmundson outputs:

The Dursleys shuddered to think what the neighbors would say if the Potters arrived in the street.
When Dudley had been put to bed, he went into the living room in time to catch the last report on the evening news: "And finally, bird-watchers everywhere have reported that the nation's owls have been behaving very unusually today.
Twelve times he clicked the Put-Outer, until the only lights left on the whole street were two tiny pinpricks in the distance, which were the eyes of the cat watching him.
Dumbledore slipped the Put-Outer back inside his cloak and set off down the street toward number four, where he sat down on the wall next to the cat.
He couldn't know that at this very moment, people meeting in secret all over the country were holding up their glasses and saying in hushed voices: "To Harry Potter — the boy who lived!"

Another addition I made was to use LDA to extract topic keywords, and then add those topic keywords back in as additional bonus words. With this modification, the Edmundson results are as follows:

At half past eight, Mr. Dursley picked up his briefcase, pecked Mrs. Dursley on the cheek, and tried to kiss Dudley good-bye but missed, because Dudley was now having a tantrum and throwing his cereal at the walls.
When Dudley had been put to bed, he went into the living room in time to catch the last report on the evening news: "And finally, bird-watchers everywhere have reported that the nation's owls have been behaving very unusually today.
Twelve times he clicked the Put-Outer, until the only lights left on the whole street were two tiny pinpricks in the distance, which were the eyes of the cat watching him.
One small hand closed on the letter beside him and he slept on, not knowing he was special, not knowing he was famous, not knowing he would be woken in a few hours' time by Mrs. Dursley's scream as she opened the front door to put out the milk bottles, nor that he would spend the next few weeks being prodded and pinched by his cousin Dudley.
He couldn't know that at this very moment, people meeting in secret all over the country were holding up their glasses and saying in hushed voices: "To Harry Potter — the boy who lived!"
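
The bonus/stigma/stop idea can be sketched as a simple word count. This is only an illustration of the cue-word part: Edmundson's full method combines several signals with tunable weights, and the function name, the plus-one/minus-one weighting, and the word lists below are assumptions rather than sumy's EdmundsonSummarizer.

```python
def cue_score(sentence, bonus, stigma, stop):
    """Count bonus words minus stigma words, ignoring stop words."""
    words = [w.strip('.,"!?').lower() for w in sentence.split()]
    words = [w for w in words if w and w not in stop]
    return sum(w in bonus for w in words) - sum(w in stigma for w in words)


bonus = {"boy", "lived"}            # from the chapter title, THE BOY WHO LIVED
stigma = {"perhaps"}                # hypothetical low-importance word
stop = {"the", "who", "to", "a"}

high = cue_score('"To Harry Potter, the boy who lived!"', bonus, stigma, stop)
low = cue_score("Perhaps it was a cat.", bonus, stigma, stop)
print(high, low)
```

Adding the LDA topic keywords to the bonus set simply gives more words a chance to raise a sentence's score.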

SumBasic Summarizer

The SumBasic algorithm was developed in 2005 and uses only the word probability approach to determine sentence importance. Sorry, but it’s pretty bad on this document.

Mr. Dursley wondered.
"Harry.
The cat was still there.
"It certainly seems so," said Dumbledore.
"Yes," said Professor McGonagall.

Wow. That’s terrible. Ahem, moving on…
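
Before moving on, though, here's a rough sketch of the word-probability idea, which may hint at why the output is so terse. Each sentence is scored by the average probability of its words, the best one is picked, and the probabilities of the words it used are squared so later picks favor new words. The function name and toy sentences are made up, and the sketch leaves out some details of the published algorithm.

```python
from collections import Counter


def sumbasic(sentences, n_sentences):
    # Assumes every sentence contains at least one word
    tokenized = [s.lower().split() for s in sentences]
    total = sum(len(words) for words in tokenized)
    counts = Counter(w for words in tokenized for w in words)
    prob = {w: c / total for w, c in counts.items()}

    chosen = []
    remaining = list(range(len(sentences)))
    while remaining and len(chosen) < n_sentences:
        # Score = average word probability of the sentence
        best = max(remaining, key=lambda i: sum(prob[w] for w in tokenized[i]) / len(tokenized[i]))
        chosen.append(best)
        remaining.remove(best)
        # Down-weight words that are now covered by the summary
        for w in set(tokenized[best]):
            prob[w] **= 2
    return [sentences[i] for i in sorted(chosen)]


sentences = [
    "the cat was still there",
    "the cat",
    "owls flew past the window in broad daylight",
]
print(sumbasic(sentences, 1))
```

Because the score is an average, a very short sentence made of common words can beat a long, informative one, which is at least consistent with the clipped fragments above.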

KL Summarizer

The KLSum algorithm is a greedy method which adds sentences to the summary as long as the KL divergence (a measure of how much the summary’s word distribution differs from the document’s) is decreasing.

It was on the corner of the street that he noticed the first sign of something peculiar — a cat reading a map.
It grew steadily louder as they looked up and down the street for some sign of a headlight; it swelled to a roar as they both looked up at the sky — and a huge motorcycle fell out of the air and landed on the road in front of them.
He looked simply too big to be allowed, and so wild — long tangles of bushy black hair and beard hid most of his face, he had hands the size of trash can lids, and his feet in their leather boots were like baby dolphins.
"But I c-c-can't stand it — Lily an' James dead — an' poor little Harry off ter live with Muggles —" "Yes, yes, it's all very sad, but get a grip on yourself, Hagrid, or we'll be found," Professor McGonagall whispered, patting Hagrid gingerly on the arm as Dumbledore stepped over the low garden wall and walked to the front door.
He clicked it once, and twelve balls of light sped back to their street lamps so that Privet Drive glowed suddenly orange and he could make out a tabby cat slinking around the corner at the other end of the street.
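
The greedy step can be sketched as follows. This is a simplified illustration under a few assumptions: it stops after a fixed number of sentences rather than watching for the divergence to stop decreasing, it uses a small additive smoothing constant so the summary distribution is never zero, and the function names and toy sentences are made up.

```python
import math
from collections import Counter


def kl_divergence(p, q_counts, vocabulary, smoothing=0.01):
    """KL(P || Q), where Q is built from smoothed summary word counts."""
    q_total = sum(q_counts.values()) + smoothing * len(vocabulary)
    return sum(
        p[w] * math.log(p[w] / ((q_counts[w] + smoothing) / q_total))
        for w in vocabulary
    )


def kl_sum(sentences, n_sentences):
    tokenized = [s.lower().split() for s in sentences]
    doc_counts = Counter(w for words in tokenized for w in words)
    total = sum(doc_counts.values())
    p = {w: c / total for w, c in doc_counts.items()}
    vocabulary = list(p)

    chosen, summary_counts = [], Counter()
    while len(chosen) < min(n_sentences, len(sentences)):
        candidates = [i for i in range(len(sentences)) if i not in chosen]
        # Greedy step: add the sentence that leaves the summary closest to the document
        best = min(
            candidates,
            key=lambda i: kl_divergence(p, summary_counts + Counter(tokenized[i]), vocabulary),
        )
        chosen.append(best)
        summary_counts += Counter(tokenized[best])
    return [sentences[i] for i in sorted(chosen)]


sentences = ["cat cat cat", "cat owl", "owl owl owl"]
print(kl_sum(sentences, 1))
```

The document here is half “cat” and half “owl”, so the one sentence that mirrors that mix is the one the greedy step picks.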

Reduction Summarizer

The Reduction algorithm is another graph-based model which values sentences according to the sum of the weights of their edges to other sentences in the document. This weight is computed in the same way as it is in the TextRank model.

Mrs. Potter was Mrs. Dursley's sister, but they hadn't met for several years; in fact, Mrs. Dursley pretended she didn't have a sister, because her sister and her good-for-nothing husband were as unDursleyish as it was possible to be.
It seemed that Professor McGonagall had reached the point she was most anxious to discuss, the real reason she had been waiting on a cold, hard wall all day, for neither as a cat nor as a woman had she fixed Dumbledore with such a piercing stare as she did now.
Dumbledore took Harry in his arms and turned toward the Dursleys' house.
"But I c-c-can't stand it — Lily an' James dead — an' poor little Harry off ter live with Muggles —" "Yes, yes, it's all very sad, but get a grip on yourself, Hagrid, or we'll be found," Professor McGonagall whispered, patting Hagrid gingerly on the arm as Dumbledore stepped over the low garden wall and walked to the front door.
G'night, Professor McGonagall — Professor Dumbledore, sir."

Gensim Summarizer

The Gensim package for Python includes a summarizer which is a modification of the TextRank algorithm. Gensim’s approach modifies the sentence similarity function.

It was now reading the sign that said Privet Drive — no, looking at the sign; cats couldn't read maps or signs.
Dumbledore slipped the Put-Outer back inside his cloak and set off down the street toward number four, where he sat down on the wall next to the cat.
They're a kind of Muggle sweet I'm rather fond of." "No, thank you," said Professor McGonagall coldly, as though she didn't think this was the moment for lemon drops.
All this 'You-Know-Who' nonsense — for eleven years I have been trying to persuade people to call him by his proper name: Voldemort." Professor McGonagall flinched, but Dumbledore, who was unsticking two lemon drops, seemed not to notice.
Professor McGonagall shot a sharp look at Dumbledore and said, "The owls are nothing next to the rumors that are flying around.
It seemed that Professor McGonagall had reached the point she was most anxious to discuss, the real reason she had been waiting on a cold, hard wall all day, for neither as a cat nor as a woman had she fixed Dumbledore with such a piercing stare as she did now.

So which algorithm do you think provides the best summary? There are a lot of parameters to tweak before making a final judgement. Some might perform better on shorter summaries, some on longer summaries. The style of writing could make a difference (I mentioned before that I’ve had better luck summarizing non-fiction than fiction). But if you’re Harry just entering Hogwarts’ library with the intent to answer some question and are intimidated by “the sheer size of the library; tens of thousands of books; thousands of shelves; hundreds of narrow rows”, then it couldn’t hurt to have a couple summaries available with a few mouse clicks. If only wizards used such silly muggle creations…

Text Summarization on the Books of Harry Potter was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Regex on the Texts of Harry Potter

Greg Rafferty — Sun, 09 Dec 2018 23:55:42 GMT

A deep-dive case study

Regular Expressions: more mind-boggling than arithmancy

I’m Greg Rafferty, a data scientist in the Bay Area. You can check out the code for this project on my github. Feel free to contact me with any questions!

In this series of posts, I’m looking at a few handy NLP techniques, through the lens of Harry Potter. My previous post in this series on basic NLP looked at Topic Modeling with Latent Dirichlet Allocation, and my next post will look at text summarization.

Throughout this NLP project, I’ve been using as my text corpus the collection of seven books about Harry Potter. But before I could send these books through my algorithms, I first had to save the pdfs as txt files and then extract the chapters as separate documents. To do this, I used regular expressions. For a good intro to regex in Python, check out Google’s quick course.

Regular expressions can be thought of as wildcard search on crack. They allow an incredible amount of flexibility to describe patterns in text strings. If you’ve never seen them before, a regular expression pattern can look like a jumbled, senseless mess. But there is order in that chaos, and a great deal of power. Let’s explore this with the pattern I used on these texts:

pattern = r"((?:[A-Z-][ ]){9,}[A-Z])\s+([A-Z \n',.-]+)\b(?![A-Z]+(?=\.))\b(?![a-z']|[A-Z.])(.*?)(?=(?:[A-Z][ ]){9,}|This book)"

When applied with re.findall to a Harry Potter book saved as a text file, the output is a list of 3-element tuples. Each tuple in the list corresponds to a chapter in the book and takes the form (‘C H A P T E R O N E’, ‘THE BOY WHO LIVED’, ‘Mr. and Mrs. Dursley, of number four, Privet Drive, blah, blah, blah...’) and contains the text of the entire chapter. We need to extract that from this:

C H A P T E R O N E

THE BOY WHO LIVED

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such...

Let’s walk through that pattern above and see how it does this. I’ll describe the logic behind this expression but for more details about the precise syntax, follow along here for a character-by-character description. This railroad diagram displays the pipeline:

Source

Group #1 is this part: ((?:[A-Z-][ ]){9,}[A-Z]). That’s the first box on the left in the diagram above. The outer parentheses tell Python to capture what’s inside as Group 1 (this will end up being 'C H A P T E R O N E'). The inner parentheses group a sub-pattern, and the ?: instructs Python not to capture it as a second group. The sub-pattern, (?:[A-Z-][ ]){9,}[A-Z], means any UPPERCASE letter (or a hyphen, which appears in numbers such as twenty-one) followed by a space, repeated 9 or more times, and ending with a final UPPERCASE letter. This is how the chapter numbers are captured.

It is followed by \s+, which is not in parentheses, so not captured, and instructs Python to look for white space (line breaks, tabs, spaces, etc.), one or more times. This essentially skips down to the chapter title, captured in Group #2, the next set of parentheses: ([A-Z \n',.-]+). This again means any UPPERCASE letter, space, line break (\n), apostrophe, comma, period, or dash, one or more times. Some of the chapter titles have line breaks or the other various punctuation marks, for instance:

Prisoner of Azkaban, Chapter 18: MOONY, WORMTAIL, PADFOOT, AND PRONGS

Goblet of Fire, Chapter 21: THE HOUSE-ELF LIBERATION FRONT

Order of the Phoenix, Chapter 8: THE WOES OF MRS. WEASLEY

However, there’s one problem with this: The fourth chapter of book 1 starts with

BOOM. They knocked again…

So by looking for all UPPERCASE letters, that “BOOM.” would be captured as part of the title. We need to use what’s called a negative lookahead, declared with ?!: \b(?![A-Z]+(?=\.))\b. The \b signs mean word boundaries. So the chapter title capturing string above captures all UPPERCASE letters unless the next word is also all UPPERCASE but immediately followed by a period, declared with a positive lookahead ?=. Great! Now we’re able to find that “BOOM.” and exclude it from our capture. But this introduces another problem. One of the chapters I mentioned in the previous paragraph, “THE WOES OF MRS. WEASLEY”, also contains a word followed by a period so now we’d only capture the chapter title as “THE WOES OF”. We need another negative lookahead, (?![a-z']|[A-Z.]), to make sure that all UPPERCASE words followed by a period are not also followed by lowercase words (chapter text), or are not the last word in the UPPERCASE string (because although the chapter title may contain a period, it never ends in one).

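Group #3 is the easiest one: (.*?). This means capture any character, any number of times, but be lazy (non-greedy) about it: the trailing ? makes the match grow one character at a time and stop as soon as the next part of the pattern can match. Note that . only matches line breaks when the re.DOTALL flag is set.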
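
The difference between .*? and .* is easy to see on a toy string (the string and the start/stop markers are made up for illustration):

```python
import re

text = "start AAA stop BBB stop"

# Lazy: stop at the first place the rest of the pattern can match
lazy = re.search(r"start(.*?)stop", text).group(1)

# Greedy: run as far as possible, then back off to the last "stop"
greedy = re.search(r"start(.*)stop", text).group(1)

print(repr(lazy))
print(repr(greedy))
```

With the lazy version, each chapter's text ends at the first chapter heading that follows it instead of swallowing the rest of the book.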

The last part of the regular expression tells Python at what point to stop capturing text: (?=(?:[A-Z][ ]){9,}|This book). This is another lookahead, so nothing it matches is captured. It provides instructions to stop capturing text once a sequence of UPPERCASE letter and space is repeated at least 9 times (because thus begins the next chapter, “C H A P T E R T W O”), or once the string This book is reached. This final string marks the end of the book, since each of the seven Harry Potter books ends with this text:

This book
was art directed by
David Saylor and designed by Becky
Terhune. The art for both the jacket and interior was
created using pastels on toned printmaking paper. The text was
set in 12-point Adobe Garamond, a typeface based on the sixteenth-
century type designs of Claude Garamond redrawn by Robert
Slimbach in 1989. The book was printed and bound
at RR Donnelley & Sons, Willard, Oh.
The production was supervised by
Angela Biola
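
To see the whole pattern at work, here is a small sketch that runs it on a made-up, two-chapter stand-in for a book file. The sample text is invented for the demo, and the re.DOTALL flag is an assumption (it lets . cross line breaks); the flags actually used for the books aren't stated above.

```python
import re

pattern = r"((?:[A-Z-][ ]){9,}[A-Z])\s+([A-Z \n',.-]+)\b(?![A-Z]+(?=\.))\b(?![a-z']|[A-Z.])(.*?)(?=(?:[A-Z][ ]){9,}|This book)"

# Tiny invented sample in the same layout as the text files
sample = (
    "C H A P T E R O N E\n\n"
    "THE BOY WHO LIVED\n\n"
    "Mr. and Mrs. Dursley were proud to say that they were perfectly normal.\n\n"
    "C H A P T E R F O U R\n\n"
    "THE KEEPER OF THE KEYS\n\n"
    "BOOM. They knocked again.\n\n"
    "This book\nwas art directed by"
)

chapters = re.findall(pattern, sample, re.DOTALL)
for number, title, text in chapters:
    print(repr(number), repr(title), repr(text.strip()))
```

Note that “BOOM.” lands in the chapter text rather than in the title, which is exactly what the first negative lookahead is there for.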

Phew! I hope that all made sense! I’d highly encourage you to visit this link to see this all in action with even more detail and a character-by-character description of what everything is doing.

Regex on the Texts of Harry Potter was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

LDA on the Texts of Harry Potter

Greg Rafferty — Fri, 07 Dec 2018 01:12:44 GMT

Topic Modeling with Latent Dirichlet Allocation

“Hmm,” said a small voice in his ear. “Difficult. Very difficult. Plenty of courage, I see. Not a bad mind either. There’s talent, oh my goodness, yes — and a nice thirst to prove yourself, now that’s interesting. . . . So where shall I put you?”

I’m Greg Rafferty, a data scientist in the Bay Area. You can check out the code for this project on my github. Feel free to contact me with any questions!

In this post, I’ll describe topic modeling with Latent Dirichlet Allocation and compare different algorithms for it, through the lens of Harry Potter. Upcoming posts will cover two more NLP tasks: text summarization and sentiment analysis.

Recently, I was put on a new project with a team who were unanimously shocked and disappointed that I’d never read nor seen the movies about a certain fictional wizard named Harry Potter. In order to fit in with the team, and obviously save my career from an early end, it quickly became evident that I was going to have to take a crash-course in the happenings at Hogwarts. Armed with my ebook reader and seven shiny new pdf files, I settled in to see what the fuss was all about. Meanwhile, I had started working on a side project composed of a bunch of unrelated NLP tasks. I needed a good sized set of text documents and I thought all of these shiny new pdfs would be a great source.

And somewhere around the middle of the third book, I suddenly realized that LDA was basically just an algorithmic Sorting Hat.

LDA, or Latent Dirichlet Allocation, is a generative probabilistic model of (in NLP terms) a corpus of documents made up of words and/or phrases. The model consists of two tables; the first table is the probability of selecting a particular word in the corpus when sampling from a particular topic, and the second table is the probability of selecting a particular topic when sampling from a particular document.

Here’s an example. Let’s say I’ve got these four (rather nonsensical) documents:

document_0 = "Harry Harry wand Harry magic wand"
document_1 = "Hermione robe Hermione robe magic Hermione"
document_2 = "Malfoy spell Malfoy magic spell Malfoy"
document_3 = "Harry Harry Hermione Hermione Malfoy Malfoy"

Here’s the term-frequency matrix for these documents:
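
The counts are easy to reproduce with a few lines of Python. This is a sketch using only collections.Counter; the column order (sorted vocabulary) is a choice made here and may not match the figure.

```python
from collections import Counter

documents = [
    "Harry Harry wand Harry magic wand",
    "Hermione robe Hermione robe magic Hermione",
    "Malfoy spell Malfoy magic spell Malfoy",
    "Harry Harry Hermione Hermione Malfoy Malfoy",
]

counts = [Counter(doc.split()) for doc in documents]
vocabulary = sorted(set(word for doc in counts for word in doc))

# One row per document, one column per vocabulary word
tf_matrix = [[doc[word] for word in vocabulary] for doc in counts]

print(vocabulary)
for i, row in enumerate(tf_matrix):
    print(f"document_{i}", row)
```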

Just from glancing at this, it seems pretty obvious that document 0 is mostly about Harry, a little bit about magic, and partly about wand. Document 1 is also a little bit about magic, but mostly about Hermione and robe. And document 2 is again partly about magic, but mostly about Malfoy and spell. Document 3 is equally about Harry, Hermione, and Malfoy. It’s usually not so easy to see this because a more practical corpus would consist of thousands or tens of thousands of words, so let’s see what the LDA algorithm chooses for topics:

Data Science with Excel!

And that’s roughly what we predicted just by going with term frequencies and our gut. The number of topics is a hyperparameter you’ll need to choose and tune carefully, and I’ll go into that later, but for this example I chose 4 topics to make my point. The upper table shows words versus topics and the lower table shows documents versus topics. Each column in the upper table and each row in the lower table must sum to 1. These tables are telling us that if we were to randomly sample a word from Topic 0, there’s a 70.9% chance we’d grab “Harry”. If we chose a word from Topic 3, it’s near certain that we’d pick “magic”. If we sampled Document 3, there’s an equal chance that we would pick Topic 0, 1, or 2.
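
Reading the two tables as a recipe for generating text looks like this in code: pick a topic according to the document's row, then pick a word according to that topic's column. The probabilities below are hypothetical round numbers chosen for illustration, not the fitted values from the tables, and generate_word is a made-up helper name.

```python
import random

# Hypothetical word-topic table: each topic's word probabilities sum to 1
word_given_topic = {
    "topic_0": {"Harry": 0.7, "wand": 0.3},
    "topic_1": {"Hermione": 0.6, "robe": 0.4},
    "topic_2": {"Malfoy": 0.6, "spell": 0.4},
    "topic_3": {"magic": 1.0},
}

# Hypothetical document-topic table: each document's topic probabilities sum to 1
topic_given_document = {
    "document_0": {"topic_0": 0.8, "topic_3": 0.2},
    "document_3": {"topic_0": 1 / 3, "topic_1": 1 / 3, "topic_2": 1 / 3},
}


def generate_word(document, rng):
    """Sample a topic for the document, then a word from that topic."""
    topics = topic_given_document[document]
    topic = rng.choices(list(topics), weights=list(topics.values()))[0]
    words = word_given_topic[topic]
    return rng.choices(list(words), weights=list(words.values()))[0]


rng = random.Random(0)
print([generate_word("document_3", rng) for _ in range(6)])
```

Fitting LDA is the reverse problem: given only the words in each document, find the two tables that make those documents most plausible.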

It’s up to us as smart humans who can infer meaning from a bag of words to name these topics. In these examples with a very limited vocabulary, the topics quite obviously correspond to single words. If we had run LDA on, say, a couple thousand restaurant descriptions, we might find topics corresponding to cuisine type or atmosphere. It’s important to note that LDA, unlike typical clustering algorithms such as K-means, allows a document to exist in multiple topics. So in those restaurant descriptions, we might find one restaurant placed in the “Italian”, “date-night”, and “cafe” topics.

So how is all of this like the Sorting Hat? All new students at Hogwarts go through a ceremony when they arrive on day one to determine which house they’ll be in (I’m probably the only person who didn’t know this up until a few weeks ago). The Sorting Hat, once placed on someone’s head, understands what is in their thoughts, dreams, and experiences. This is a bit like LDA building the term-frequency matrix and understanding what words and N-grams are contained within each document.

The Sorting Hat then compares the student’s attributes with the attributes of the various houses (bravery goes to Gryffindor, loyalty to Hufflepuff, wisdom to Ravenclaw, and sneaky, shifty sleazeballs go to Slytherin (ok, just a quick aside — can ANYONE explain to me why Slytherin has persisted for the thousand-year history of this school? It’s like that one fraternity which finds itself in yet another ridiculously obscene scandal every damn year!)). This is where LDA creates the word-topic table and begins to associate the attributes of the topics.

Harry’s placement was notably split between Gryffindor and Slytherin due to his combination of courage, intelligence, talent, and ambition, but Gryffindor just slightly edged out ahead and Harry Potter became the hero of an entire generation of young millennials instead of its villain. This is where LDA creates the document-topic table and finally determines which is the dominant topic for each document.

OK, so now that we know roughly what LDA does, let’s look at two different implementations in Python. Check out my Github repo for all of the nitty-gritty details.

First of all, one of the best ways to determine how many topics you should model is with an elbow plot. This is the same technique often used to determine how many clusters to choose with clustering algorithms. In this case, we’ll plot the coherence score against the number of topics:

You’ll generally want to pick the lowest number of topics where the coherence score begins to level off. This is why it’s called an elbow plot — you pick the elbow between steep gains and shallow gains. In this case (and it’s a remarkably spiky case; usually the curves are a little bit smoother than this), I’d go with somewhere around 20 topics.
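If you’d rather not eyeball the elbow, one rough heuristic is to take the smallest number of topics whose coherence is within a small tolerance of the best score seen. The sketch below assumes you already have one coherence score per candidate topic count (for example from Gensim’s CoherenceModel). The scores here are invented for illustration and are not the ones from my plot, and `pick_elbow` and its `tolerance` parameter are hypothetical names of my own:

```python
def pick_elbow(scores, tolerance=0.01):
    """Pick the smallest topic count that is 'close enough' to the best.

    scores: dict mapping number of topics -> coherence score.
    tolerance: how far below the best score still counts as leveled off.
    """
    best = max(scores.values())
    candidates = [k for k, score in scores.items() if score >= best - tolerance]
    return min(candidates)


# Made-up, deliberately spiky coherence scores (NOT real results).
example_scores = {5: 0.24, 10: 0.28, 15: 0.30, 20: 0.32, 25: 0.318, 30: 0.325, 35: 0.321}

print(pick_elbow(example_scores))
```

This is only one way to formalize “where the curve levels off”; with a very spiky curve it is still worth looking at the plot before trusting the number.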

The first model I used is Gensim’s ldamodel. At 20 topics, Gensim had a coherence score of 0.319. This is not great; indeed the Mallet algorithm, which we’ll look at next, almost always outperforms Gensim’s. However, one really cool thing that pairs with Gensim is pyLDAvis, an interactive chart you can run in a Jupyter notebook. It plots the topics along two principal components and shows the proportion of words in each topic:

Harry Potter and the Allocation of Dirichlet

The next implementation I looked at was Mallet (MAchine Learning for LanguagE Toolkit), a Java-based package put out by UMass Amherst. The difference between Mallet and Gensim’s standard LDA is that Gensim uses a Variational Bayes inference method, which is faster but less precise than Mallet’s Gibbs sampling. Fortunately for those who prefer to code in Python, Gensim has a wrapper for Mallet: Latent Dirichlet Allocation via Mallet. In order to use it, you need to download the Mallet Java package from http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip and also install the Java Development Kit. Once everything is set up, implementing the model is pretty much the same as with Gensim’s standard model. Using Mallet, the coherence score for the 20-topic model increased to 0.375 (remember, Gensim’s standard model output 0.319). It’s a modest increase, but it usually persists across a variety of data sources, so although Mallet is slightly slower, I prefer it for the better results.
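To give a feel for what Gibbs sampling is doing under the hood, here is a toy collapsed Gibbs sampler for LDA in plain Python. This is a teaching sketch and not Mallet’s implementation: the function name `gibbs_lda`, the hyperparameter defaults `alpha` and `beta`, and the tiny example corpus are all assumptions of mine, and it is far too slow for real data.

```python
import random


def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of word tokens.
    Returns (doc_topic counts, topic_word counts, vocabulary list).
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                 # document-topic table
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # word-topic table
    topic_total = [0] * num_topics                               # words per topic
    assignments = []                                             # topic of every token

    # Start by assigning every token to a random topic.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                k = assignments[d][i]
                # Take this token out of the counts...
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # ...then resample its topic: how much does this document
                # like topic t, times how much does topic t like this word?
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return doc_topic, topic_word, vocab


# Tiny made-up corpus, just to show the shapes of the outputs.
toy_docs = [
    ["wand", "spell", "wand", "charm"],
    ["broom", "quidditch", "snitch", "broom"],
    ["wand", "broom", "spell", "snitch"],
]
doc_topic, topic_word, vocab = gibbs_lda(toy_docs, num_topics=2)
for d, counts in enumerate(doc_topic):
    print(f"document {d}: topic counts {counts}")
```

Each pass revisits every single word and redraws its topic, which is why Gibbs sampling tends to be slower than the variational approach, and also why it can settle on sharper topics given enough iterations.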

Finally, I built a Mallet model on the 192 chapters of all 7 books in the Harry Potter series. Here are the top 10 keywords the model output for each latent topic. How would you name these clusters?

https://medium.com/media/e8171bb4c7b021db298ef7b732ec0beb/href

LDA on the Texts of Harry Potter was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.