This was first posted on my personal blog.

A thing that I do when I cook is to re-write the recipes I’m using (whether they’re from a cookbook or my own invention) onto a piece of paper in a very specific way. I think the approach I use is handy, so I’m describing it here in case you’d like to use it. (Or in case you need more evidence about how weird I am.)

There are 4 ideas that I think are important:

  1. Copying the recipe, by hand, forces me to think through the steps and keeps me from not…


For many years, long before Medium came around, I wrote personal and professional blogs on Blogspot and WordPress. My web domain, harlan.harris.name, has been pointed at a WordPress site since 2008. But it’s time for a change. As of this afternoon, I’m using Blogdown, a tool written in R for generating Markdown-based static web sites.

http://www.harlan.harris.name/

My first (new) post there is up, about the migration itself.

I have no intention of leaving Medium, but I may write certain posts (those with inline code or data visualization) there first, then post links/highlights here, where more people will read them, and where there are comments. In the other direction, I’ll probably continue writing some stuff here and sync’ing it there as a backup.

Hopefully this will increase my somewhat embarrassing blog post rate a bit…


There’s recently been some interesting opinionated writing in the R statistical programming community about how and when to teach the abstracted, easy-to-use approaches to solving problems, versus the underlying nitty-gritty. David Robinson, Data Scientist at Stack Overflow, wrote a blog post recently called Don’t teach students the hard way first. The primary example was on the data-manipulation tools in the tidyverse, versus the underlying methods in base R, but the discussion was mostly about principles in pedagogy. Some highlight quotes from the original article (which I recommend reading!):

  • that phrase keeps popping up: “I teach them X just to show…


I recently attended two small conferences — the ISBIS (International Society for Business and Industrial Statistics) 2017 conference, held at IBM Research in Westchester County, and the Domino Data Lab Popup, held in West SoHo. I was invited to speak at ISBIS (slides here, if you’re curious), but for this post, I want to summarize some insights from other people’s talks.

In chronological (to me) order… First a few talks from ISBIS that I particularly liked (note that I only saw a fraction of all the talks):

  • Merlise Clyde from Duke talked about Bayesian Model Averaging, which is interesting and…


Occasionally when chatting with other data scientists, especially with others who are interested in integrating predictive models into production software systems, the word “scaling” comes up.

Not this. Although some West Coast data scientists are into this kind of scaling too.

I think this is a great question, but it’s a little underspecified. There seem to be at least three qualitatively different notions of “scaling” in data science, and it’s worth the effort to clarify each of them, and address how people tackle them.

Specifically, I think the real questions that underlie “scaling” are: “what happens when you have a lot more training data?”, “what happens when you have to make a lot more predictions?”…


A particularly good way to get a little more out of professional conferences is to blog about your experiences, I think. It makes you focus your thoughts on things like “what’s the big take-away here,” and “what should I be asking people in the hallways?” Rather than just summarizing what you saw, or making snarky Twitter comments (also worth doing!), a great conference blog post is synthesis — combining insights from multiple presentations and conversations into a coherent new whole that helps clarify ideas.

I recently returned from the INFORMS Analytics 2017 conference. INFORMS is the professional society of Operations…


A particularly good talk at Strata NY last year was by Brett Goldstein, former CIO of Chicago, who talked about accountability and transparency in predictive models that affect people’s lives. This struck a strong chord with me, so I wanted to take some time to write down some thoughts. (And a rather longer time to publish those thoughts…) I’m sure others have thought about this more and have better takes on it, so please comment and provide links!

A slide from Goldstein’s Strata presentation.

There has been a lot of discussion recently about accountability in predictive models, and the failures of certain systems to avoid troublesome…


When building a complex system, it’s often helpful to think about the design of that system using patterns and abstractions. Architects and software engineers do so frequently, and the experience of implementing predictive modeling pipelines has recently led to a variety of patterns and best practices. For instance, when dealing with large amounts of streaming data, some organizations use the Lambda Architecture to handle both real-time and computationally-intensive use-cases.

I recently attended the Strata conference here in NYC, where one of the better presentations was by Jon Morra, at the time at eHarmony — A Generalized Framework for Personalization. …


You’re a data scientist, and you’ve got a predictive model — great work! Now what? In many cases, you need to hook it up to some sort of large, complex software product so that users can get access to the predictions. Think of LinkedIn’s People You May Know, which mines your professional graph for unconnected connections, or Hopper’s flight price predictions. Those started out as prototypes on someone’s laptop, and are now running at scale, with many millions of users.

Metaphor (source)

Even if you’re building an internal tool to make a business run better, if you didn’t build the whole app…


Yesterday was the 2016 National Day of Civic Hacking, a Code for America event that encourages people with technology and related skills to explore projects related to civil society and government. My friend Josh Tauberer wrote a thoughtful post earlier about the event called Why We Hack, on what the value of this sort of event might be. Please read it.

For my part, this year I worked on one of the projects he discusses, understanding the impact of DC’s rent stabilization laws and what potential policy changes might yield. As Josh noted, we discovered that it’s a hard…

Harlan Harris

Data Scientist; co-founder of Data Community DC and the Data Science DC Meetup; Brooklyn, NY.
