When I cook, I re-write the recipes I’m using (whether they’re from a cookbook or my own invention) onto a piece of paper in a very specific way. I think the approach is handy, so I’m describing it here in case you’d like to use it. (Or in case you need more evidence about how weird I am.)
There are four ideas that I think are important:
For many years, long before Medium came around, I wrote personal and professional blogs on Blogspot and Wordpress. My web domain, harlan.harris.name, has been pointed at a Wordpress site since 2008. But it’s time for a change. As of this afternoon, I’m now using Blogdown, a tool for generating Markdown-based static web sites, written in R.
My first (new) post there is up, about the migration itself.
I have no intention of leaving Medium, but I may write certain posts (those with inline code or data visualization) there first, then post links/highlights here, where more people will read them, and where there are comments. In the other direction, I’ll probably continue writing some stuff here and syncing it there as a backup.
Hopefully this will increase my somewhat embarrassing blog post rate a bit…
There’s recently been some interesting opinionated writing in the R statistical programming community about how and when to teach the abstracted, easy-to-use approaches to solving problems, versus the underlying nitty-gritty. David Robinson, Data Scientist at Stack Overflow, wrote a blog post recently called Don’t teach students the hard way first. The primary example was on the data-manipulation tools in the tidyverse, versus the underlying methods in base R, but the discussion was mostly about principles in pedagogy. Some highlight quotes from the original article (which I recommend reading!):
I recently attended two small conferences — the ISBIS (International Society for Business and Industrial Statistics) 2017 conference, held at IBM Research in Westchester County, and the Domino Data Lab Popup, held in West SoHo. I was invited to speak at ISBIS (slides here, if you’re curious), but for this post, I want to summarize some insights from other people’s talks.
In chronological (to me) order… First a few talks from ISBIS that I particularly liked (note that I only saw a fraction of all the talks):
Occasionally when chatting with other data scientists, especially with others who are interested in integrating predictive models into production software systems, the question of “scaling” comes up.
I think this is a great question, but it’s a little underspecified. There seem to be at least three qualitatively different notions of “scaling” in data science, and it’s worth the effort to clarify each of them, and address how people tackle them.
Specifically, I think the real questions that underlie “scaling” are: “what happens when you have a lot more training data?”, “what happens when you have to make a lot more predictions?”…
Blogging about your experiences is, I think, a particularly good way to get a little more out of professional conferences. It makes you focus your thoughts on things like “what’s the big take-away here?” and “what should I be asking people in the hallways?” Rather than just summarizing what you saw, or making snarky Twitter comments (also worth doing!), a great conference blog post is synthesis: combining insights from multiple presentations and conversations into a coherent new whole that helps clarify ideas.
I recently returned from the INFORMS Analytics 2017 conference. INFORMS is the professional society of Operations…
A particularly good talk at Strata NY last year was by Brett Goldstein, former CIO of Chicago, who talked about accountability and transparency in predictive models that affect people’s lives. This struck a strong chord with me, so I wanted to take some time to write down some thoughts. (And a rather longer time to publish those thoughts…) I’m sure others have thought about this more and have better takes on it; please comment and provide links!
When building a complex system, it’s often helpful to think about the design of that system using patterns and abstractions. Architects and software engineers do so frequently, and the experience of implementing predictive modeling pipelines has recently led to a variety of patterns and best practices. For instance, when dealing with large amounts of streaming data, some organizations use the Lambda Architecture to handle both real-time and computationally-intensive use-cases.
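To make the Lambda Architecture idea concrete, here is a minimal, hypothetical sketch of its query-time merge, using a toy event-counting use case. The `LambdaView` class and its method names are my own illustration, not tied to any specific framework the post might discuss:

```python
# Minimal sketch of the Lambda Architecture's serving-layer merge,
# assuming a toy event-count use case (not any specific framework).
from collections import Counter

class LambdaView:
    """Serves counts by merging a precomputed batch view with a speed layer."""

    def __init__(self, batch_view):
        # Batch layer: immutable view, recomputed periodically over all data.
        self.batch_view = dict(batch_view)
        # Speed layer: incremental counts for events since the last batch run.
        self.speed_view = Counter()

    def ingest(self, event_key):
        # Real-time path: only the cheap speed layer is updated per event.
        self.speed_view[event_key] += 1

    def query(self, event_key):
        # Serving layer: merge batch and real-time views at read time.
        return self.batch_view.get(event_key, 0) + self.speed_view[event_key]

view = LambdaView({"page_a": 100})  # batch job counted 100 prior events
view.ingest("page_a")
view.ingest("page_b")
print(view.query("page_a"))  # 101: batch result plus one recent event
print(view.query("page_b"))  # 1: seen only by the speed layer so far
```

The design choice the pattern captures is visible even at this scale: the expensive, comprehensive computation (the batch view) and the cheap, low-latency one (the speed layer) are kept separate and reconciled only when a query arrives.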
You’re a data scientist, and you’ve got a predictive model — great work! Now what? In many cases, you need to hook it up to some sort of large, complex software product so that users can get access to the predictions. Think of LinkedIn’s People You May Know, which mines your professional graph for unconnected connections, or Hopper’s flight price predictions. Those started out as prototypes on someone’s laptop, and are now running at scale, with many millions of users.
Even if you’re building an internal tool to make a business run better, if you didn’t build the whole app…
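The hand-off from laptop prototype to product described above can be sketched, very roughly, as wrapping the model behind a validated interface that the larger application calls. This is a hypothetical illustration with a stand-in scoring function; it is not how LinkedIn or Hopper actually serve predictions:

```python
# Hypothetical sketch: exposing a prototype model to a product via a
# small, validated handler. The scoring function is a stand-in.

def score(features):
    # Stand-in for a trained model's prediction (e.g., a fare estimate).
    return 0.5 * features["days_until_departure"] + features["base_fare"]

def predict_handler(request):
    """Validate a request dict and return a prediction or an error."""
    required = ("days_until_departure", "base_fare")
    missing = [k for k in required if k not in request]
    if missing:
        # In production, bad input is the common case; fail loudly but safely.
        return {"error": f"missing fields: {missing}"}
    return {"prediction": score(request)}

print(predict_handler({"days_until_departure": 10, "base_fare": 200.0}))
# → {'prediction': 205.0}
```

The point of the sketch is the boundary, not the model: once predictions sit behind an interface like this, the surrounding software product can evolve (or scale out) independently of the data scientist’s prototype.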
Yesterday was the 2016 National Day of Civic Hacking, a Code for America event that encourages people with technology and related skills to explore projects related to civil society and government. My friend Josh Tauberer wrote a thoughtful post earlier about the event, called Why We Hack, on what the value of this sort of event might be; please read it.
For my part, this year I worked on one of the projects he discusses, understanding the impact of DC’s rent stabilization laws and what potential policy changes might yield. As Josh noted, we discovered that it’s a hard…
Data Scientist; co-founder of Data Community DC and the Data Science DC Meetup; Brooklyn, NY.