Publishing Data Science on Medium
A Step-by-Step Process Using GitHub
In my experience, the biggest difference between a “Data Analyst” and a “Data Scientist” is rigor. The days of ad hoc, fly-by-night analysis are gone, replaced by responsible, systematic, and reproducible analysis.
This post provides a framework for using GitHub to support publishing analysis on Medium. So that we aren’t speaking in generalities, I will use my analysis of the Electoral College and its supporting GitHub repo as a specific example. It involved a trivial amount of data, which actually makes it easier to focus on process.
Brief aside: Having been a part of this whole data business for the last 10 years, let me say how excited I am to see data science really come into its own lately. But this suddenness means that our methods are ever evolving, so I am eager to get your feedback on all this. Please comment with reckless abandon!
The Process
The process I propose isolates each step in the data analysis pipeline and emphasizes openness and reproducibility:
- Use a public GitHub repo for each analysis/article
- Isolate raw data and the data extraction process from analysis
- Isolate data transformation and analysis from reporting
- Keep reporting as simple and straightforward as possible (bonus: no pie charts)
- Use prose to structure the narrative around the reporting
- Finally, cite your data and your analysis at the end
Let’s take each step individually.
1. Use a Public GitHub Repo for Each Analysis/Article
James Ball wrote a fantastic article in The Guardian about the recent entrants into data journalism, and in it he highlights the need for data scientists to be accountable and show their work. I could not agree more.
Rarely do you find yourself publishing a public article or blog post using private data. So it follows that if you’re using public data, you may as well publish the raw data alongside your process. What we lose in proprietary process we more than make up for in legitimacy and collaboration. As Simon Rogers says, “Liberate your data!”
So, let’s begin by pairing each article with its own public GitHub repo. The repo can and should contain everything other people would require to:
- Reproduce your results
- Correct any errors they discover (via fork, branch, and pull request)
- Elaborate and expand, building different cuts of the data or enhanced models to improve accuracy, and so on
In addition to creating a repo, I suggest including a simple README such as this, describing the steps to reproduce your results. As with my example, it doesn’t have to be lengthy, just enough to give new users their bearings. My script uses node, so I mentioned that along with the steps to install dependencies.
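For illustration, here is a minimal README sketch in that spirit; the commands and the analyze.coffee file name are hypothetical, not taken from my actual repo:

```
# Electoral College Analysis

Steps to reproduce the results (requires [node](https://nodejs.org)):

1. Install dependencies: `npm install`
2. Extract the raw data: `coffee extract.coffee` (writes to the data folder)
3. Run the analysis: `coffee analyze.coffee` (writes to the analysis folder)
```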
2. Isolate Raw Data and the Data Extraction Process
There’s nothing more frustrating to public debate than a fundamental disagreement on the basic facts. Please do what you can to combat this by starting from pristine, untouched data.
If you’ve gathered data manually from a primary source, be sure to cite it specifically. Better yet, if you’ve captured data programmatically through an API or web scraping routine, save that script along with the data so that the extraction can be reproduced without debate. And, though I risk beating a dead horse here, avoid transforming the data during extraction where possible.
In my example we see a standalone extract.coffee that pulls data from 2 sources and drops it into a data folder. I tried my best to follow a literate programming style, explaining the process in comments so that someone new to either CoffeeScript or data science would be able to follow along.
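To make the shape of such a script concrete, here is a minimal CoffeeScript sketch; it is not the actual extract.coffee, and the source URL and file names are placeholders:

```coffeescript
# A minimal extract sketch (not the article's actual extract.coffee).
# Fetch raw data and save it untouched under data/; no transformation here.
https = require 'https'
fs    = require 'fs'

# Placeholder source URL: cite your real primary source instead.
SOURCE = 'https://example.com/electoral-college.csv'

fs.mkdirSync 'data' unless fs.existsSync 'data'

https.get SOURCE, (res) ->
  chunks = []
  res.on 'data', (chunk) -> chunks.push chunk
  res.on 'end', ->
    # Write the bytes exactly as received, so the raw data stays pristine.
    fs.writeFileSync 'data/raw.csv', Buffer.concat(chunks)
    console.log "Saved #{SOURCE} to data/raw.csv"
```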
3. Isolate Data Transformation and Analysis from Reporting
The infamous “black box” in which data analysts do their work typically involves all manner of ad hoc “jiggery-pokery” (or “skullduggery” for the pirates in the crowd) via Excel or a database platform. A thousand temp tables soon deleted, Excel formulas strewn all about, queries run once that modify data, data copied and pasted as values or copied from the output of a query into a new sheet. [Cringe]
Worst of all, when shaking the analysis tree doesn’t yield the insight fruit they’re looking for, suddenly the analysts start getting “creative.” If you haven’t heard of the legendary Cave of Unreported Exceptions, I suggest you check this out.
Instead, I suggest we stick to a reproducible script that stands separate from both data extraction and reporting. I used this. Raw data goes from the data folder through the meat grinder and into the analysis folder in a repeatable, clear, commented way. You might disagree with my approach to the analysis, but that’s good. Fork it, fix it, and send it my way via pull request!
By the way, I am not particular about tools. The script can be in R, Python, CoffeeScript, Excel VBA…it wouldn’t matter to me, provided I can see how the data went from raw input to useful output.
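For concreteness, here is a sketch of that raw-to-analysis flow in CoffeeScript; it is not my actual analysis script, and the column names are hypothetical:

```coffeescript
# A minimal transform/analysis sketch (hypothetical columns, not the
# article's actual script): data/raw.csv in, analysis output out.
fs = require 'fs'

rows = fs.readFileSync('data/raw.csv', 'utf8').trim().split('\n')
# First row is the header; the rest are data lines.
[header, lines...] = rows

# Example transformation: population per elector, by state.
out = for line in lines
  [state, population, electors] = line.split ','
  perElector = Math.round(parseInt(population) / parseInt(electors))
  "#{state},#{perElector}"

fs.mkdirSync 'analysis' unless fs.existsSync 'analysis'
fs.writeFileSync 'analysis/population-per-elector.csv',
  ['state,population_per_elector', out...].join('\n') + '\n'
```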
4. Keep Reporting as Simple and Straightforward as Possible
Once you have your analysis in hand and believe it supports a thesis you intend to communicate publicly, the goal of reporting is to illustrate that analysis and thesis as clearly as possible.
I have heard “no pie charts” from many people over the years, and I have said it myself. The biggest reason is the difficulty of comparing adjacent items. This is true whether it’s an ugly 3D pie chart from Excel 95 or a crazy-sexy thin/flat version.
That said, if you show any article or presentation to 10 consultants or data scientists, you’re invariably going to get 10 different viewpoints about different or better ways to visualize the information. That’s fine, and most of the time the feedback is useful, since we as authors already know what insights we’re trying to communicate; new readers do not.
Since we cannot embed JavaScript charts in Medium posts, I think Excel is as good a tool as any for static charting. I used a standalone Excel file with tabs that simply import my analysis outputs. I used a separate tab for each chart so that I could make them visually consistent, and then I saved each chart as an image for import into Medium. Like so. If you open it, you’ll notice the Excel file has no formulas whatsoever, since my analysis is captured elsewhere; Excel in this case is used just for charting.
Stylistically, I suggest the following:
- Always use a chart title with a simple, clear message
- Ensure that font sizes emphasize the appropriate parts of the chart (e.g. 24pt chart title, 20pt axis titles and legend, 14pt y-axis labels for critical information, 8pt x-axis labels for less critical details, etc.)
- Eliminate useless parts of the chart (e.g. no legend if the chart title fully describes the series)
- Try to use the simplest chart types that communicate your message most clearly (e.g. 2 lines to demonstrate correlation, columns for single series charting, areas for comparable quantities in time series, etc.)
5. Use Prose to Structure the Narrative around the Reporting
In my opinion, Nate Silver is so highly regarded not simply for his statistics and data modeling skills, but more for his ability to craft the story around the data. To my amazement and admiration, he manages to explain particularly complex models to the masses time and again. Above all, this is the hard part.
While there are plenty of other platforms for publishing research and analysis, Medium stands out to me because the narrative comes first. This is why data journalism is starting to take hold with outlets like The Upshot and Nate’s 538. It’s the tale that counts.
In my article, I tried to show 2 problematic point-in-time views into the current state of the Electoral College, and then a historical view aimed at establishing that the issues I surfaced are not new. Then I did my best to turn attention to what I believe is the best solution, a bill currently in progress.
Every thesis requires a different approach, and this part isn’t easy, but it is where we data scientists can really impact the broader debate. The goal is to contextualize the information and show why it matters. I even identified a specific group of under-represented voters with the power to make change, and I included a “next step” for them to take if I managed to sway them.
6. Finally, Cite Your Data and Your Analysis at the End
This is simple, but it should not be forgotten. If you’re going to the trouble of doing all this, please make sure the rest of us can find it. Cite your raw data sources. Cite your GitHub repo.
I also like to credit the font author/foundry if I am not using a system font. But that’s because I am a nerd.
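As a sketch, a simple footer like this covers it (every link and name below is a placeholder):

```
Sources

- Raw data: [your primary source, with a direct link]
- Code and analysis: github.com/your-username/your-analysis-repo
- Charts set in [font name] by [author/foundry]
```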
If you found the article useful, I would humbly appreciate your recommendation, your sharing it with others, and even your comments if you think I’ve gotten something wrong or would like me to clarify.