Is it brunch time?

Background

Ben Jacobson
Jun 24, 2016 · 6 min read

Hypothesis

Twitter is a platform that allows users to “get real-time updates about what matters to you.”[2] If we assume people generally tweet about things while they are doing them, then tweets containing “brunch” are generally posted while that person is actually having brunch. Therefore, if we collect enough tweets over a long enough period and analyze the times at which they were posted, we can infer the specific time range in which brunch falls.

The Data

We begin by using the Twitter Streaming API. This API allows us to subscribe to search terms, for example “brunch”, and have any tweet matching that term sent to our program in real time. We collected not only “brunch” tweets but also tweets containing “breakfast”, “lunch”, and “dinner” to use as controls (which we will review later). We let the program run from 2015-06-01 to 2016-05-31, which yielded 100M+ tweets for analysis. Twitter is a global platform, so we have to do some additional work to understand the local time of day at which a specific tweet was posted. For each tweet, we looked at the timezone of the person tweeting at the time of the tweet and any attached geolocation data (when available). Using this data, we then made an informed estimate of the localized hour for each tweet.
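The localization step can be sketched as follows. This is a minimal illustration, not the code used for the analysis: it assumes each tweet carries a UTC timestamp and that the tweeter's timezone has already been reduced to a fixed offset from UTC in seconds. The function name `local_hour` and the parameter `utc_offset_seconds` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def local_hour(created_at_utc, utc_offset_seconds):
    """Return the local hour of day (0-23) for a UTC timestamp.

    utc_offset_seconds is an assumed input: the tweeter's inferred
    offset from UTC, in seconds (for example -18000 for UTC-5).
    """
    if created_at_utc.tzinfo is None:
        # Treat naive timestamps as UTC.
        created_at_utc = created_at_utc.replace(tzinfo=timezone.utc)
    local_tz = timezone(timedelta(seconds=utc_offset_seconds))
    return created_at_utc.astimezone(local_tz).hour


# Made-up example: a tweet posted at 16:30 UTC by someone at UTC-5.
tweet_time = datetime(2016, 1, 10, 16, 30, tzinfo=timezone.utc)
print(local_hour(tweet_time, -18000))
```

A fixed offset ignores daylight saving changes, so a fuller version would resolve a named timezone for the date of each tweet.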

[Histogram: tweets containing the term “brunch” by hour of day, with the hour localized based on the inferred timezone of the tweeter]

Solution 1: The Naive Solution

Simply looking at the histogram above, one will quickly notice that 11am is the most popular hour to tweet about brunch, which would mean 11am to 12pm must be brunch time, right?
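In code, the naive reading amounts to counting tweets per localized hour and picking the fullest bin. The sketch below uses a handful of made-up hours rather than the real data set, and `most_popular_hour` is a name chosen for illustration.

```python
from collections import Counter


def most_popular_hour(local_hours):
    """Return (peak_hour, counts) for a list of localized hours (0-23)."""
    counts = Counter(local_hours)
    # Highest count wins; ties go to the earlier hour.
    peak_hour, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return peak_hour, counts


# Made-up sample, not the collected tweets.
sample_hours = [9, 10, 10, 11, 11, 11, 12, 12, 13, 15]
peak, counts = most_popular_hour(sample_hours)
print(f"Naive brunch time: {peak}:00 to {peak + 1}:00")
```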


Solution 2: Probability Density Function

Since we are visualizing a histogram, it is logical for us to jump to a distribution function for further analysis. And since our data has a positive skew, we will look to a probability density function (PDF). To start, we calculate the lognormal distribution[5] PDF (the red line below) and locate the maximum point of that function (the mode). We chose the lognormal over the normal distribution because it appeared to fit the data better. We then identify the range, centered around that maximum, in which a large portion of all tweets occur, let’s say 1/4 (25%):
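One way to reproduce this calculation is sketched below, under several assumptions that are not stated in the text above: each hourly bin is represented by its midpoint (hour + 0.5), the lognormal parameters are estimated from the count-weighted mean and standard deviation of the log of those midpoints, and "centered" is read as a window extending equally far on both sides of the mode. The hourly counts are made up.

```python
import math


def fit_lognormal(hour_counts):
    """Estimate (mu, sigma) from {hour: count} using bin midpoints.

    Needs counts in at least two different hours, otherwise sigma is 0.
    """
    total = sum(hour_counts.values())
    logs = {h: math.log(h + 0.5) for h in hour_counts}
    mu = sum(c * logs[h] for h, c in hour_counts.items()) / total
    var = sum(c * (logs[h] - mu) ** 2 for h, c in hour_counts.items()) / total
    return mu, math.sqrt(var)


def lognormal_cdf(x, mu, sigma):
    """Cumulative distribution function of the lognormal."""
    if x <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))


def lognormal_mode(mu, sigma):
    """Location of the PDF maximum: exp(mu - sigma^2)."""
    return math.exp(mu - sigma ** 2)


def window_around_mode(mu, sigma, mass=0.25):
    """Find [mode - d, mode + d] that contains `mass` of the probability."""
    mode = lognormal_mode(mu, sigma)

    def covered(d):
        return lognormal_cdf(mode + d, mu, sigma) - lognormal_cdf(mode - d, mu, sigma)

    lo, hi = 0.0, 1.0
    while covered(hi) < mass:  # grow the bracket until it holds enough mass
        hi *= 2.0
    for _ in range(60):  # bisection on the half-width d
        mid = (lo + hi) / 2.0
        if covered(mid) < mass:
            lo = mid
        else:
            hi = mid
    d = (lo + hi) / 2.0
    return mode - d, mode + d


# Made-up hourly counts, not the collected tweets.
hour_counts = {9: 5, 10: 20, 11: 30, 12: 22, 13: 12, 14: 6, 15: 3, 16: 2}
mu, sigma = fit_lognormal(hour_counts)
start, end = window_around_mode(mu, sigma)
print(f"mode={lognormal_mode(mu, sigma):.2f}h, window={start:.2f}h to {end:.2f}h")
```

Other readings of "centered" are possible, for example taking equal probability on each side of the mode, and would give a slightly different window on a skewed curve.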


Solution 3: Spline Interpolation

Splines allow us to create a smooth curve through many different points. Let’s make a curve through each hour and graph it:
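To make the idea concrete, here is a small sketch that fits a curve through hourly counts and finds the minute at which the curve peaks. It assumes a natural cubic spline, which is one common choice and may not be the spline used for the graph in this post. The counts are made up, and `natural_cubic_spline` is an illustrative name.

```python
import bisect


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys).

    xs must be strictly increasing and contain at least two points.
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = (3.0 * (ys[i + 1] - ys[i]) / h[i]
                    - 3.0 * (ys[i] - ys[i - 1]) / h[i - 1])
    # Forward sweep of the tridiagonal system for the second-order coefficients.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2.0 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    # Back substitution; c[0] and c[n] stay 0 (the "natural" end conditions).
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2.0 * c[j]) / 3.0
        d[j] = (c[j + 1] - c[j]) / (3.0 * h[j])

    def evaluate(x):
        # Pick the segment containing x, clamped to the first/last segment.
        j = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate


# Made-up hourly counts for hours 0..23, not the collected tweets.
hours = list(range(24))
counts = [2, 1, 1, 0, 0, 1, 2, 5, 12, 30, 62, 90,
          78, 55, 30, 18, 10, 8, 6, 5, 4, 4, 3, 2]
curve = natural_cubic_spline(hours, counts)

# Sample the curve once per minute and take the highest point.
peak_minute = max(range(23 * 60 + 1), key=lambda m: curve(m / 60))
print("Curve peaks at %d:%02d" % divmod(peak_minute, 60))
```

Sampling once per minute keeps the sketch short; solving for where the derivative of each cubic segment is zero would give the peak exactly.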

Brunch Point: the exact time of day at which brunch maximally occurs

  • It uses a spline, which will remain very consistent even if the x-axis is translated.
  • It is based on concepts that we are very familiar with: acceleration and deceleration.

Comparing Data

For comparison let’s look at the splines for “breakfast” and “lunch” tweets:

  • Brunch: 10:01am to 1:40pm
  • Lunch: 11:40am to 2:25pm


The Startup

Medium's largest active publication, followed by +479K people. Follow to join our community.

