Paper Recaps: An Introduction

TL;DR: I’ll be reading academic papers about data cleaning and putting much more approachable summaries / explanations here.

A question to my software engineer friends: why isn’t it a common thing to read academic papers? A lot of very cool ideas that we use today were first shown off in journals and conference proceedings. MapReduce and Hadoop came from an OSDI paper by Google in 2004 (ten years ago now!), Postgres was shown off at SIGMOD (1986), and its predecessor Ingres in Transactions on Database Systems (1976), to name just a few famous examples. Don’t the dates seem surprisingly early? (If they don’t, let me remind you that in 2004 the iPhone didn’t exist. Yeah. We had Hadoop before the iPhone.) It takes a while for ideas to gain mainstream adoption. And these are the better cases: the example papers above were all about already existing systems, not theoretical ideas.

VLDB 2015 (the International Conference on Very Large Data Bases) has happened, and the papers have been out for a while now. I always skim through the papers that look more relevant to me (who are we kidding, nothing I do is that cool). This time I’m especially excited because there are a few on data cleaning specifically. Instead of just skimming papers, I want to start reading them in more detail, writing down my thoughts on them, and having discussions, all from the perspective of a practitioner.

Reading papers is usually a daunting task for practitioners like myself, because of the heavy jargon and — in my opinion — dense writing style. After all, the audience is other people who are read up on the literature and are used to information-dense writing where reading and rereading sentences and passages until you understand them is considered the right way to do things.

As a result of all this, neat ideas might not get the interest they deserve until someone goes out and writes production-quality software that implements them, which is sad. A tight feedback loop between industry and academia is a great thing and is mutually beneficial. Databases, operating systems, and other systems-related software have benefited greatly from this, and now we’re seeing similar developments in machine learning.

On the flip side, I think it could be useful for academics to get further confirmation and ideas from the real world. First, you never really know who could be using your ideas in what crazy way you didn’t think of initially. Academia even has this issue within itself — you’ll find distinct fields approaching very similar problems in different domains from different angles, only to discover each other later on and benefit from the insight.

Figure 1: The Academia — Industrial Complex

Second, a huge part of legitimizing an idea in academia is finding a relevant experiment that you think models the real world and then showing that your idea works well in that context. But often I’m left with questions like “what about situation X though?”. The answer is that we won’t know until someone in the academic world cares about it enough to do the work and write a paper, and only if there is funding and it is a publishable idea in the first place. Yet there can still be value in knowledge that may not be a significant enough contribution to the field and that may not make it into a journal. Negative results, implementation trivia, and small clarifications do still matter.

I recently saw a thread on Reddit discussing teaching styles: a “this thing exists” style of explanation is destined to make people’s eyes glaze over, whereas a “we need a thing, how do we do it” style of explanation is engaging because it tells the story in a way that gets people personally invested in the subject. Like high school education, I think academic papers can tend towards the former style too. The abstract says “we attempt to add feature K to the XYZ system”. Yes, but why?

Simon Peyton Jones, who’s the author of the slide above, understands my plight and has some great advice for fellow researchers on this (“you want to infect the mind of your reader with your idea, like a virus”). People agree whenever I mention it, but I don’t see it put into practice as often as I thought I would.

I don’t think there exists an inherent reward for academic writing to be engaging in that way. If you’re in a field, it is your duty and responsibility to have read the papers in your field. If you’re a journal reviewer, you’re not going to (consciously) dock points for an especially boring paper that’s otherwise reasonable.

So I’ve been convinced that there is room for a middle ground of material that is more newbie friendly and more educational, and yet is still domain specific and applications focused. I’ve been inspired by some great examples:

  • Papers We Love is an enthusiastic mixture of people who are curious enough to get together after work, read some papers, and help each other understand the details. They also have a YouTube channel.
  • Julia Evans is one of my heroes in terms of taking complicated ideas and explaining enough to whet the appetite.
  • The People’s Science is an effort by a friend of mine that aims to bring the science community and the public closer together.
  • Adrian Colyer writes “The Morning Paper” which is the inspiration for all this.

All I have right now is some basic experience reading papers, and some more experience doing data cleaning work in the field. I’m not really an expert in any of the subjects I’ll be covering, so I hope you’ll excuse any mistakes I make when trying to decipher the papers. I’m certain I’ll make many, and I appreciate comments pointing me in the right direction. I’m also not the best writer or educator, but I’m hoping that I have enough perspective to explain things in a way that’ll be useful to people, and to see how our experiences mesh with what academia has been working on.

If this sounds interesting, don’t forget to follow me here or on Twitter!