How To Analyze Your Medium Stats With R
Answering every question you’ve ever had about your writing with statistics
As a blogger with a background in science research, the first time I had a question about blogging, I naturally turned to coding to find the answer. Specifically, I wanted to know how many of my earning stories were written this month, and how many were written previously. Could it be that old stories were earning more money? If so, how much?
This gave rise to new questions, like:
- How much money is each Medium fan worth?
- Do longer stories earn more money?
- Does putting a story in a publication help or harm that story’s stats?
- How do views, as opposed to reads, influence a story’s earnings?
Medium’s stats page is both incredibly useful and incredibly opaque: There’s a wealth of information there, but it’s very difficult to drag out the kind of information I was interested in. So I took a few hacky detours until I could answer every single question on my list and more.
Then it occurred to me that others might find this kind of thing useful as well, so I thought I’d share!
This is a tutorial for how to analyze your Medium writing stats using R, my favorite statistical language. Anyone, no matter how basic their knowledge, will be able to follow along.
The short answer for impatient folks: I’ve included the script I used to run the analysis as a Google doc. Here’s the link!
Table of Contents
1. The basics: downloading R and RStudio Desktop
2. Getting your Medium stats and payment data
2.1. Downloading Medium engagement stats
2.2. Downloading Medium pay stats
3. Setting up your environment in RStudio
4. Cleaning the data
5. Joining the data
6. Analyzing the data
6.1. How much money did each story earn me?
6.2. How much of earnings come from low-value stories?
6.3. When were most high-value stories published?
6.4. Do publications make a story earn more?
6.5. Do longer stories earn more?
6.6. Which statistic most accurately predicts earnings?
The Basics: Downloading R and RStudio Desktop
Before you start, you might need to download R and RStudio Desktop. Both are open source and free, and so it’s super easy to download them online. Here’s where you can get R, and here’s where you can get RStudio.
For newbies who might not know, R is a programming language, primarily used by statisticians. RStudio is what’s called an “integrated development environment” — it sits on top of R and makes it super easy to edit your code, debug, and create plots.
I learned to use them in university and find myself turning to them whenever I have a question and some data — which is often.
Once you have those downloaded, we can get into the next part: the Medium data.
Getting Your Medium Stats and Payment Data
As I mentioned, Medium’s stats aren’t fabulous for analysis in their raw form. I suspect this is intentional to stop folks from “gaming” the system, but we can get around that.
We’re interested in two datasets, which annoyingly don’t come together: payment and engagement. Medium pays based on what they term “engagement.” So we’re going to want the payment numbers, or how much each story earned, and the engagement numbers, or the numbers of views, reads, and fans each story has to date.
Downloading Medium engagement stats
Firstly, we need a way to download data like views, claps, reads, and fans. Luckily, someone’s already done the hard work of building a Chrome extension to do this, and they even wrote a Medium article about it! Thanks to murraygm for that. Head on over and download the Medium Stats Grabber.
It’s super easy to use. Go to your Medium stats page and just click the little button top right. It’ll trigger a download of your stats in CSV format.
Open this up in Numbers, Excel, or Google Sheets, and then export or save it as an Excel file. The reason for this is that if we try to read it into R as a CSV, it will only import the first column, and we want all the columns.
Make sure you scroll all the way down to the bottom of your stats page to ensure it’s able to grab all your stats. Medium uses lazy loading, and so if you don’t go all the way down, your CSV will be truncated wherever you stopped scrolling.
Also, I do this on a Sunday. Why? Although Medium releases payment info on Wednesdays, the pay period actually ends on the Sunday before. If you grab your stats on Sunday, your views, fans, reads, and so on will correctly correspond to the amount of money earned for them. If you grab them on a Wednesday, you’ll have extra views, reads, and fans that didn’t go into pay period calculations.
Finally, it’s impossible to select a certain time period. When I download my September numbers, I’m looking at the data since the beginning: how many views, claps, and so on a story has accumulated since September 2018, when I started writing. For instance, if I see a story has 1,000 views and was published in May 2019, those 1,000 views aren’t from this month alone; they’re everything since that story was published. This makes it a little tricky to draw conclusions, because our engagement stats cover the lifespan of the story, while the payment stats are only for this month.
If you do this every month, you can subtract one month from another to see how much a story got that month, versus how much it has all-time. Or, if you’re low on time, you can just look at stories published only in this month.
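That subtraction is easy to script. Here is a minimal base-R sketch, using two toy monthly snapshots in place of real downloads (the story names and numbers are made up):

```r
# Two cumulative snapshots of the same stories, one per month (toy data).
aug <- data.frame(title = c("Story A", "Story B"), views = c(100, 50))
sep <- data.frame(title = c("Story A", "Story B"), views = c(160, 70))

# Join on title, then subtract: the difference is that month's views.
monthly <- merge(sep, aug, by = "title", suffixes = c("_sep", "_aug"))
monthly$views_this_month <- monthly$views_sep - monthly$views_aug
```

The same pattern works for reads, fans, or any other cumulative stat.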
Downloading Medium pay stats
I do this at the end of every month’s pay period! It’s not exactly downloading — I couldn’t find a nice extension that would do this for me, and Medium unhelpfully does not include that stat in the main stats page. So we’re going to have to do a little copy-pasting.
First, go to your Medium payment page and click on whatever month you’re interested in analyzing.
Then, you’re going to highlight and copy all the stories and their payment. After that, paste it into a Google spreadsheet. It might look a little strange because the story titles are in four columns merged together, so you’ll have to select all four columns and hit “unmerge.”
Delete columns B through F once you’ve unmerged so you just have two columns. You’ll want to make sure your columns are titled something useful because we’ll be referring to these later in R. I call mine “title” and “payment.”
Once that’s done, all that’s left to do is download. Go to File, Download, and pick the option for Excel.
OK, we have all the data we need! Let’s start importing and cleaning it.
Setting Up Your Environment in RStudio
We’re going to need to download a few packages to be able to do everything we need to in RStudio. Open up RStudio and go to File -> New File -> R Script. Now we can start coding.
Again, it’s super easy to do all this and doesn’t cost a thing. Here are the packages you’ll need:
install.packages(c("readxl", "dplyr", "tidyr", "ggplot2", "GGally"))
These packages are basically to be able to import, wrangle, and plot your numbers. Our numbers are going to come in kind of funky, and so we’re going to need to do some manipulation to make sure we can correctly match up story stats with story payment.
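One thing install.packages doesn’t do is load the packages; it only downloads them. At the top of each session, you’ll also want to load them (assuming all five installed cleanly):

```r
library(readxl)   # read_excel()
library(dplyr)    # filter(), group_by(), joins
library(tidyr)    # unite(), separate()
library(ggplot2)  # all the plots below
library(GGally)   # ggpairs()
```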
Set your working directory. This is where R will look for the things you tell it you want to import. For me, I’m setting it as my downloads folder because both of my datasets have been downloaded.
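For example (the Downloads path here is an assumption; swap in wherever your two files actually live):

```r
# Point R at the folder containing the downloaded files; adjust as needed.
setwd("~/Downloads")
```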
Now import your data. I stick to the convention of [month][type of dataset], so this month that was:
medstats <- read_excel("septstats.xlsx")
paystats <- read_excel("septmoney.xlsx")
What that code is saying is basically “Please import these numbers from my computer, and call these data sets this.”
Cleaning the Data
Ok, we have our data in, our environment is ready, and we’re ready to crunch some numbers! Just kidding, we have a bit of tidying to do beforehand.
As I said, there are quite a few issues with the numbers so we can’t interrogate the data just yet. Luckily, the tidyverse makes it incredibly simple to clean it right up.
First, let’s open up our medstats dataset in RStudio. Once it’s been imported, R will show that dataset in the top-right window. Clicking it will show you what the dataset looks like.
OK, that looks wild. But don’t panic, this is all easy to fix.
You can see every time there was a comma in the story title — for example on row three — the stats grabber split that section and put it into a new column. This is because the type of file we downloaded is a CSV — comma-separated values. Every comma indicates a split.
And if you look at the column name, it’s all the stats we’re interested in, but all bunched together and separated with a |.
The bad news? This took me forever to figure out how to tidy. The good news? It won’t take you anywhere near as long.
First, we’re going to paste all the columns together into one column, which is going to make it easier to separate into discrete columns and values by ‘|’ later. Here’s the code:
medstats <- unite(medstats, "cols", c("mediumID|title|link|publication|mins|views|reads|readRatio|fans|pubDate|liveDate", "...2", "...3"))
What that line of code is saying is basically, “Paste all these columns from the medstats dataset together,” and it refers to them specifically by column name. When we imported our CSV file, R didn’t know what to call columns 2 and 3, so it defaulted to “…2” and “…3”. You may need to alter the code based on what your columns are called in R.
Now, we’re going to separate them all by |. Here’s the code:
data <- medstats %>%
  separate(col = "cols",
           into = c("mediumID", "title", "link", "publication", "mins", "views", "reads", "readRatio", "fans", "pubDate", "liveDate"),
           sep = "\\|")
What that code is saying is: “Please take medstats and separate it into columns. Here’s what you’re going to call the columns. And here’s how you know when to separate them: just look for a |. When you’re done with all that, call this new dataset data.”
The reason I give it a new name is that if I make a mistake later on while manipulating data, I can go back to medstats without any issue.
This is starting to look more like what we need:
But there are still some problems. If you look at that top one, you’ll notice it has an underscore where the comma used to be. This wouldn’t be an issue, except that later we’re going to want to match up this title with its stats, to the title with the payment earned. And those titles will have no underscores, so R wouldn’t know to put them together.
Not a problem — we can remove them:
data$title <- gsub("_", "", data$title)
This line of code is telling our data frame that any underscore should be replaced with nothing, but only for our “title” column in that data frame.
Now we can start doing a few cosmetic fixes. I want to drop some columns I don’t really care about, drop some characters from a column so we can treat it as strictly numeric, do a bit of polishing so we can easily join this dataset to our payment dataset, and ask it the questions I want to be answered.
First, let’s drop the columns we don’t care about.
data <- select(data, -c(link, mediumID))
Now, let’s get rid of the words “min read” in read time column so it’s only numbers in there.
data$mins <- gsub("[a-zA-Z ]", "", data$mins)
Now we have to ensure all our columns are in the correct formats. The reason is that later, if I ask R to plot X against Y and it thinks my X column is a string of characters, it’ll freak. So we calm it down by telling it exactly what kind of column it is. In this case, we’re saying, “Columns 3 through 7 in the dataset data are numeric, not characters or anything else weird.”
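Here is a minimal sketch of that coercion, run on a tiny toy data frame standing in for data; on the real dataset, the equivalent would be something like data[, 3:7] <- lapply(data[, 3:7], as.numeric):

```r
# Toy stand-in for the imported stats: numbers arrive as character strings.
df <- data.frame(views = c("120", "45"), reads = c("80", "30"),
                 stringsAsFactors = FALSE)

# Coerce every column to numeric in place.
df[] <- lapply(df, as.numeric)
```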
Ok, now we can start adding to our data frame. For example, I’m interested in how many days a story has been active, so I’ll make our last two columns dates:
data[,8] <- as.Date(unlist(data[,8]))
data[,9] <- as.Date(unlist(data[,9]))
Now I’m going to put in an end date — the last day of payment — so we can do a little subtraction and determine how many days it’s been live from publishing to the end of the pay period.
data$current <- as.Date("2019-09-30")
data$dayslive <- as.numeric(data$current - data$liveDate)
Finally, we’re going to drop our columns we don’t want:
data <- select(data, -c(pubDate:current))
The colon tells R to remove not just those two columns, but all those in between, too.
Great! That’s our engagement dataset, data, cleaned up. Now onto paystats.
Luckily, because the paystats dataset is only two columns, it’s super easy to clean.
First, we’re going to get rid of our empty rows:
paystats <- na.omit(paystats)
Then, we can take out the dollar sign and force the column to be read as numeric:
paystats$payment <- as.numeric(gsub("\\$", "", paystats$payment))
What that line of code is doing is just saying, “For the payment column in paystats, I want to replace the ‘$’ with nothing, and then I want to force it to be numeric.”
Now we’re going to do the same, but with commas in the title. Remember, our data dataset has no commas, and we need to be able to match the two datasets up by title later.
paystats$title <- gsub(",", "", paystats$title)
Joining the Data
OK, our two datasets are beautiful and clean. Now we can go about joining them together.
First, we’re going to remove the spaces in the title names. Because of the uneven commas, |’s, and so on, it’s just simpler to make sure the story titles in both the datasets are the same by removing all spaces:
data$title <- gsub(" ", "", data$title)
paystats$title <- gsub(" ", "", paystats$title)
Those two lines of code each say: “Take the dataset (data or paystats), look at the column ‘title,’ and replace any spaces with nothing.”
This gives us a kind of weird-looking set of names, but it’ll make it much easier for R to match up to our payments with our other engagement stats.
Now, all that’s left to do is tell R to look through both data and paystats and match up whichever stories have the same title.
merge <- inner_join(paystats, data, by = c("title" = "title"))
This line of code tells R to perform an inner join on our two datasets, paystats and data. It’s going to look in the columns we specify: “title” and “title” respectively. If our story title column were called something else, for example “story name,” this is where we would tell R which two columns to join by.
So R is churning away, going through the rows on our two datasets, figuring out which match and which don’t. If any of those rows share the same name, R is going to neatly paste them together.
I’m also going to do this, but only for stories published in September, which I can do by filtering in only those that have been live for fewer than 31 days:
Sept_merge <- subset(merge, dayslive < 31)
Bonus: You can also look at the stories that R wasn’t able to match up. This way, you can run quality-control checks and make sure no story was excluded by mistake or some kind of coding error. When I do this, I find that 40 stories are left out of my merge, where R couldn’t find a match. But by checking out that dataset, I can see they’re all deleted stories, which naturally couldn’t be matched up by title. This is because, for whatever reason, when you delete a story, Medium wipes it completely:
anti_join <- anti_join(paystats, data, by = c("title" = "title"))
But it’s only 69 cents out, so I’m not too concerned. Let’s move on to our analysis.
Analyzing the Data
We have one dataset, merge, which has all the information we need. Now we can ask some questions.
I like to start with visualizing my data, which in turn can start generating the questions.
For this, you’ll want to crack out ggplot2. This is my favorite package for data visualization because it’s just so easy to show what you need. Let’s have a look at what we can see in the data.
Plotting the distribution of how much stories earn
ggplot(merge, aes(payment)) +
  geom_histogram() +
  theme_classic()
This code says: Create a plot using the merge dataset, with the payment column as my X coordinate. What kind of plot? A histogram. And at the end, I made it a “classic” theme just so it looks pretty.
However, this isn’t that helpful because it’s so skewed. I have one $500 story, and then the vast majority are in my $0-$20 bin, so I can’t really see the variety in that $0–$20 bracket. Let’s get a bit fancier and break that bucket down a bit more.
Plotting distribution of only lower-value stories
Here, I narrowed down our window by defining my x-axis as 0–50 dollars, and my y-axis as 0–30 frequency.
ggplot(merge, aes(payment)) +
  geom_histogram(bins = 50) +
  xlim(0, 50) + ylim(0, 30) + theme_classic()
So we can see that the vast majority of my stories earned between $0-$10. In fact, I know that exactly 344 out of my 382 stories earned less than 10 bucks. However, add those all up, and the total comes to $525.91.
lowpay <- merge %>%
  filter(payment < 10)
sum(lowpay$payment)
That code says, “Take merge and create a new dataset called lowpay. Filter merge so only stories that had a payment of less than $10 go into the new dataset. Then total up that new dataset’s payment column.”
This makes up 30.2% of my earnings! How wild is it that 90% of my stories made up only about 30% of my earnings, while the other 10% of my writing was responsible for nearly 70% of my income.
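For anyone checking the arithmetic, those shares come straight from the two totals above:

```r
# Share of total September earnings from the sub-$10 stories, and the rest.
low_share <- 525.91 / 1737.59 * 100
high_share <- 100 - low_share
round(low_share, 1)
```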
I’m now interested in looking at my higher earners. How many of my top 10% of earning stories were published this month?
Analyzing date published with earning amount
First, let’s make a new dataset that only includes my top 10% of earning stories (anything that earned over $10) and call it highpay:
highpay <- merge %>%
  filter(payment > 10)
Now, let’s look at when they were published. We’re going to plot earnings against days live, with a vertical line at 30 days, roughly the 31st of August.
ggplot(subset(highpay, payment < 500), aes(dayslive, payment)) +
  geom_point() +
  geom_vline(xintercept = 30) +
  theme_classic()
This code creates a plot of the highpay dataset, mapping the number of days a story has been live against how much it earned me. I made it a scatterplot (that’s geom_point) and then added in the vertical line at 30 days. I also excluded that ludicrously high $500 earner so it doesn’t skew my data:
So, yes, there’s definitely a cluster of younger stories that have only been earning for a month or so. But there’s a surprising chunk of them that have been much older and still earning a lot. Look at that one on the far right, nearly a year old, that’s just earned me $25 this month alone.
So stories published this month earned me $954.05, which includes that hefty $500 story from Forge. But stories published longer ago earned me $789.05, which isn’t bad at all.
But maybe it’s publications that help a story to earn more. Let’s investigate.
Analyzing how publications influence payment
Let’s see how much stories in each publication earned me this month.
I’m going to sum up the amount of money each publication earned me and put them into a bar chart.
First, we group our data by publication and add up the story totals:
pub_sum <- merge %>%
  group_by(publication) %>%
  summarise(total = sum(payment),
            mean = mean(payment))  # mean is used for the next plot
This code says: “Take merge, group by publication, and add a new column called ‘total,’ which is the sum of the payments for its stories.”
Then we can put it on a graph like this:
ggplot(pub_sum, aes(publication, total)) +
  geom_bar(stat = "identity") +
  theme_classic()
So it looks like the stories not in any publication earned me the most. However, this doesn’t take into account the number of stories in each one. For example, if I have over 200 stories not in a publication, it would make sense that they have earned me the most money overall.
Let’s take the mean instead — that is, the mean of how much a story earns per publication. I’m going to drop the Forge story because it’s skewing my graph.
ggplot(pub_sum, aes(publication, mean)) +
  geom_bar(stat = "identity") +
  theme_classic()
Now we start to see how stories not in a publication earn less than those in certain publications.
This isn’t an entirely fair representation either, though, because this includes stories written months ago that may only be earning small amounts.
What I’ll do is the same analysis, but only looking at stories published this month, again removing Forge so we can clearly see the results:
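pub_sum2 can be built the same way as pub_sum. Here is a base-R sketch on toy data standing in for Sept_merge, assuming that dropping Forge means filtering out its rows by publication name:

```r
# Toy stand-in for Sept_merge: publication and payment per story.
sept <- data.frame(
  publication = c("None", "Forge", "PubA", "PubA"),
  payment     = c(5, 500, 10, 20))

# Drop Forge, then take the mean payment per publication.
no_forge <- subset(sept, publication != "Forge")
pub_sum2 <- aggregate(payment ~ publication, data = no_forge, FUN = mean)
names(pub_sum2)[2] <- "mean"
```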
ggplot(pub_sum2, aes(publication, mean)) +
  geom_bar(stat = "identity") +
  theme_classic()
Onto the next question!
Analyzing how much money each Medium fan is worth
To me, of course, every Medium fan is priceless. However, I want to look into the actual financial value per fan.
My total number of fans in September was around 2,500 (this statistic is not given to me unless I go to my stats page and add all the columns up by hand!). My total money earned in September was $1737.59. Basic math tells me that in September, every time someone clapped for a story, I earned on average $0.70.
But this is including a mishmash of old stories, new stories, and a whole host of other factors I can’t control. So I’m going to only look at the money and fans earned for September stories.
Let’s take our Sept_merge dataset and look at it more closely.
First, I know already that stories published this month earned me $954.05. A quick sum(Sept_merge$fans) shows me that stories published this month got 1,629 fans. So this tells me that in September, if someone clapped for one of the stories I published this month, I got $0.58. Not too shabby!
Interestingly, this means we know that for the ~900 fans that clapped for stories published before September who earned me $789.05, those fans were worth more: nearly $0.88 each. So the longer a story continues getting money, the more money each fan is worth.
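The per-fan arithmetic behind those two numbers:

```r
# Dollars earned per fan: this month's stories vs. older ones.
new_per_fan <- 954.05 / 1629  # fans on stories published in September
old_per_fan <- 789.05 / 900   # approximate fan count on older stories
```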
This might be for any number of reasons, but my guess is that people who clap for older stories clap less in general. Medium doles out a portion of your $5 membership fee to the authors you clap for — if you clap for 100 authors, each gets 1/100th of that money. If you clap just once, that author gets 100% of that money. So it might be that older stories get those rarer clappers.
Analyzing the correlation between story length and story value
This question has a lot of speculation around it. Because Medium pays us based on “engagement” and supposedly people engage with longer stories for longer, it’s presumed they earn more.
We can have a look and see if longer stories earned more this month.
First, let’s look at the distribution of story length, again excluding the Forge story as I think it will skew my numbers.
ggplot(subset(Sept_merge, payment < 500), aes(mins)) +
  geom_histogram(binwidth = 1) + theme_classic()
This is a simple enough distribution: one story all the way out at eight minutes, a few at only three, but the vast majority between four and six.
Let’s see how much each story length is worth on average. I’ll group the stories by read time, look at the number of total stories in that bracket, and look at the mean earnings:
length <- Sept_merge %>%
  group_by(mins) %>%
  summarise(number = n(),
            mean = mean(payment))
It’s tough to draw conclusions as, for example, my eight-minute story earned $18, but it’s the only one in its bracket, and we don’t know how many fans each story had.
Why don’t we do something a little more sophisticated and look at how much each fan was worth per story length?
That is, if someone claps on a three-minute story, will that be worth less than on a five-minute story, all else being equal?
What I did here was sum up the total number of fans and the total money earned, then divide, so we know how much each fan is worth.
length <- Sept_merge %>%
  group_by(mins) %>%
  summarise(number = n(),
            mean = mean(payment),
            fan_worth = sum(payment) / sum(fans))
Interestingly, it seems it does increase: fans on six-minute reads are worth more than fans on five-minute reads. We drop back down again at the seven- and eight-minute read times, but I happen to know my eight-minute story wasn’t curated, so it might just be that.
For now, it looks like this theory might hold water — that longer stories earn more — but it’s hard to say with so little data.
Analyzing which statistic best predicts how much a story will earn
Writers on Medium (myself included) will obsess over a lot of things — views, reads, claps, fans, highlights, comments. There’s a lot to obsess over.
However, I decided early on that fans would be my number one concern. Remember, Medium pays on engagement, and it seems like clapping is the best engagement metric there is.
But what if I’m wrong?
With my September stats, I know how much each story earned, and how many views, reads, and fans it received. Which one of those best predicts the earnings?
What we’re going to do is look at how each variable affects the others using a great package called GGally. The code I used:
ggpairs(data=Sept_merge, columns=c(2,5:6,8), title="story stats")
This says, “Take Sept_merge, specifically columns 2, 5, 6, and 8, and compare each to the others in a 4x4 matrix. Call this matrix ‘story stats.’”
Let’s start by looking at the graphs we’re interested in: the very first column, which shows us how payment changes with an increase in views, reads, and fans. Obviously, the more something is viewed, the more reads and fans it will tend to have, so there’s bound to be some overlap between these variables, but the relationship is far from perfect.
Immediately, you can start to pick out some patterns: The fans graph is more of a straight line, while views and reads kind of lose definition the higher they go, which means the relationship is weaker. If you look at the views x payment graph, you can see that even some stories with a lot of views earned relatively little.
Plus, the graph itself tells us that fans are more strongly correlated with payment (0.905) than either reads (0.662) or views (0.63).
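If you just want the Pearson coefficients without the plot matrix, base R’s cor() computes the same numbers. The vectors below are toy stand-ins for the real payment, fans, and views columns:

```r
# Toy stand-ins: payment tracks fans closely, views only loosely.
payment <- c(1, 3, 5, 20, 60)
fans    <- c(2, 5, 9, 40, 110)
views   <- c(100, 900, 300, 2000, 2500)

cor(payment, fans)   # near-perfect linear relationship
cor(payment, views)  # noticeably weaker
```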
This confirms what I thought: Fans are the most important statistic when it comes to getting paid.
This extremely long article represents just the iceberg’s tip of the information there is to learn about stories and how they earn money on Medium. Someone with more sophisticated coding ability would no doubt be able to tease out even more.
But even with so much left to do, this analysis gives us a few key takeaways:
- Old stories earn a lot of money. It’s easy to think new stories give the bulk of your earnings, but it’s a mistake to think that’s entirely it.
- Publications might influence payment. This runs counter to what I thought, but it seems that putting your stories in publications might help them earn more, though it’s hard to tell with my small sample size. On a per-story basis, stories not in a publication earned the least.
- Each fan is worth around $0.58 in September. But looking at older stories, those fans are worth more. Hard to say exactly why, but interesting nonetheless!
- Longer reads might be worth more money. Again, this may become clearer with more data, but it would appear longer read-times correlate to fans being worth “more.”
- Fans are the best predictor for income. This confirms my previous belief that fans, more than views or reads, are a signifier of how much that story will earn.