An Interview with Jeff Leek, Author of The Elements of Data Analytic Style
published Sep 28, 2015 by Len Epp
Jeff Leek is Associate Professor of Biostatistics and Oncology at the Johns Hopkins Bloomberg School of Public Health. He’s the author of the popular Leanpub book, The Elements of Data Analytic Style, with over 37,000 readers. In this interview, Leanpub co-founder Len Epp talks with Jeff about his career, the origins of his interest in data science, and the importance and nature of data science generally.
This interview was recorded on June 6, 2015.
Len Epp: Hi I’m Len Epp from Leanpub, and in this Lean Publishing Podcast, I’ll be interviewing Jeff Leek. Jeff is Associate Professor of Biostatistics and Oncology at the Johns Hopkins Bloomberg School of Public Health in Baltimore, Maryland. He is also co-director of the Johns Hopkins Specialization in Data Science, the largest data science program in the world that has enrolled more than 1.76 million people. He writes for the blog “Simply Statistics” and can be found on Twitter @simplystats. Jeff is the author of the Leanpub book, “The Elements of Data Analytic Style.” His book is focused on the details of data analysis that sometimes fall through the cracks in traditional statistics classes and textbooks. In this interview, we’re going to talk about Jeff’s professional interests, his book, his experiences using Leanpub, and ways we can improve Leanpub for him and other authors. So thank you Jeff for being on the Lean Publishing Podcast.
Jeff Leek: No problem, thank you very much for having me.
E: I usually like to start these interviews by asking people for more or less their origin story. Do you think you can tell me how you first became interested in biostatistics and what led you to where you are?
L: Yeah sure, so at the time that I first got interested in biostatistics, I was an undergraduate at Utah State University out west, and I was studying mathematical ecology. I was studying mountain pine beetle outbreaks: how they’re coordinated and how they attack trees. It’s actually kind of a major problem. You read about it in National Geographic a lot, about how these mountain pine beetles are devastating huge swathes of the forest out in the western United States. And so when I first started working as an undergraduate student, I got a research assistantship, where basically I got paid to camp. It was the best job I’ve ever had probably, including the one I have now, and I like my job a lot. I would go out and I would collect this mountain pine beetle outbreak data; basically we’d count how many beetles hit which trees at which time. And then I started analyzing it a little bit, and got into that. And then, at the time there was a statistician who was on the faculty in the math department where I was working, and he suggested biostatistics. And when I applied to graduate school, I applied to both math departments and biostat departments, and the biostat department seemed like they had happier students, and so that I went into biostat was sort of serendipitous. Then did my graduate work studying genomics, studying human genomes and the data around human genomes. And then did a postdoc, and then ended up here as a faculty member. So that was how I got started. So beetles is what sort of led to data.
E: I’ve actually got a question to ask you about that in a minute, but just before we do that, can you explain a little bit about just what biostatistics is for those who might not be familiar with it?
L: Oh sure, absolutely. Biostatistics is a field that applies the ideas of statistics, which is basically statistics you might have heard of, or you might think of as this boring subject. I often hear that when I tell people at parties or whatever. But it’s actually quite a fascinating subject. It’s basically, how do you take a small amount of information, or a large amount of information in the form of data, and turn it into some kind of knowledge you can use, whether that’s through a clinical trial and trying to decide if a drug works, or whether it’s analyzing the human genome and trying to figure out which genetic variants are associated with which diseases. Or now more — in a more modern sense, how do you decide which links will people click on in a website? All of that is data, and so statistics is involved in analyzing that data and trying to figure out answers to questions. Biostatistics mostly applies to — How do you do that in the context of clinical trials? How do you do that in a context of images of, say, your brain or your heart? Or, how do you do that in the context of data we’ve collected about your genome? So it’s, How do you take that information that we collect and turn it into decisions about your health? So that’s what I’ve been working on for a long time.
E: Can you give an example of how biostatistics would be used in oncology?
L: Yeah, so a really common example is — so you might want to detect, for example, there are certain genetic variants that if you have them, certain chemotherapies work better for you. We know that if you have certain variations in your genome, then certain kinds of chemotherapies that target those variations will work better. And so, how did we figure that out? Basically through a statistical analysis. We took a whole bunch of patients, figured out how they responded to chemotherapy, measured stuff about their genomes, tried to associate those two things together, and filter out which are the parts that give us information about, how does that chemotherapy work? That’s one example, there’s a lot of other examples. Every time you hear about — if you ever read in the news that some new drug has been approved by the FDA, that was the result of a biostatistician analyzing the data set. They did a trial, they randomized some people to get the drug, some other people to get a different drug. And then a statistician analyzed that data and tried to detect which one worked better. And so that question, that decision is made by a statistician. That happens a lot, not just in oncology but in every aspect of human health.
E: Thanks, that’s a great answer. Just back to the mountain pine beetle for a moment. When I was an undergrad, my summer job was, for the spring season, tree planting in British Columbia.
L: Oh nice.
E: So I spent a lot of time in camps, and also just loved that very much, being paid to be outside working in the mountains. I just wanted to ask you, I’m sure you still follow it, but what’s the current state of affairs there? And just for anyone listening, it’s been truly devastating to the forests up in the north, northwest you can say of the United States, and the southwest of Western Canada. There’s these beetles that are just sort of going through from west to east, devastating forests. One hears stories about areas the size of Germany being devastated. Can you tell me what the current state of affairs is with that?
L: Yeah I mean, I do follow it. Not as a researcher now, but mostly as an interested amateur. But I do know that– So, I grew up in Idaho, I still go back to the northwest a lot to see my family and so forth. And you’ll go into the Sawtooth National Recreation Area, it’s a great place to go. But you’ll go into parts of the Sawtooth National Recreation Area — now all the trees are either grey or red because they’ve been– Huge swathes of the forest have been knocked out, and so I think it’s not going as well as you would hope. It’s going pretty badly I think, in the sense that as temperatures are warming, the beetles get more and more habitat that they can survive in. And so, as that’s happening, you’re seeing them sweep across. It’s moving more and more north into Canada actually. There was already quite a bit of damage in the United States, and now more and more you are seeing it in Canada, actually, because basically the climate is becoming conducive to the pine beetle surviving there. So I think that it’s a pretty serious ecological problem that, as far as I know, there’s no solid answer to how to resolve the mountain pine beetle problem. Because, at least at the time when I was doing the research, basically the only way to prevent the beetles from taking out a tree was to spend quite a bit of money. I think it was maybe a few dollars a tree or something like that — to spray the tree down in order to prevent the beetles from coming there. And that’s not — that gets pretty expensive when you’re talking about areas the size of Germany, right?
L: So I don’t know, I mean — again now I’m just an interested amateur, you’d have to ask the researchers in the field. But all the maps I see, and all the times I go visit those areas, it’s pretty grim right now.
E: I’m sorry to hear that. The last thing I’d read had been better news. But I do know that a cold winter is what you need, a really cold winter.
L: Oh and that happened, so I haven’t heard anything this year.
E: Yeah, I just know anecdotally that we had a cold winter up in Western Canada anyway.
L: Oh well, hopefully that’s true. The last I heard about it was like last year, I read a report I think in National Geographic, and it was sounding pretty grim.
E: Oh no.
L: But that could’ve been the way National Geographic was portraying it too, I don’t know.
E: You mentioned you’re a scientist and you’re sort of watching this, just not as a researcher, but just as someone interested. And you had a blog post recently on Simply Statistics about that issue actually. You cited Jon Stewart talking to someone, I think it was a physicist who was asked to comment about climate change or fossil fuels or something like that. Can you explain a little bit what you were getting at in that, because it was a really interesting post.
L: Yeah so, I think Lisa Randall was the physicist that Stewart was talking to. And he asked her basically, “Why haven’t we solved the fossil fuel crisis?” She’s a physicist, quite a theoretical physicist, and she answered the question about as well as you could possibly answer it, by saying that while she knows a lot — she’s clearly very well qualified, that’s not her area of expertise, and she couldn’t answer that question. And so, I think it’s a current problem in society that it’s very hard to tell the credentials of somebody. You hear so and so PhD, but what does that mean? You could get your PhD in a lot of things, right? And just because you have your PhD in literature, doesn’t mean you’re qualified to tell people about their health or– I wouldn’t be qualified to tell anybody anything about history necessarily. And so, I call that residual expertise — where it’s sort of, you get your main expertise, and then you look kind of expert to everybody else just by virtue of the fact that you’re a PhD or an MD or whatever. And I think that sort of residual expertise is being used in lots of different political ways now. I think that’s kind of an interesting — whether it’s, you’ve lined up experts against some idea you don’t like, whether it’s evolution or whether it’s the link between autism and vaccines, or whatever it is. The best way to get experts — quote unquote experts — to talk about your idea in whichever way you want, is to pick people that aren’t necessarily expert in that area. And then you can kind of — they don’t know as much. Their opinions might not be as well formed. So that’s why I hesitate to try to — I try to qualify when I say things that aren’t in my scientific area of expertise, just so that they don’t get interpreted as like– I don’t actually have expertise in the area of mountain pine beetles now. I can tell you only about what I read in National Geographic.
E: Yeah I know, I understand. I really take your point there. And one of the more, I mean from my perspective, one of the more pernicious examples I see of that is people who’ve been successful in business and then claim expertise in the economy.
E: And you know, running a business and understanding the economy are actually completely different things.
E: But nonetheless, because money’s involved in both, people will associate managing a group of people who are doing work, with understanding interest rates and currency valuations, you know?
L: And certainly that’s true. The more that you can draw any kind of a common thread between what you used to do before and what you’re trying to talk about, the more people will believe. Like being a data scientist right now, in the sense that I analyze data, a very particular type of data — but could very easily try to adapt that to a bunch of different– Similarly people that deal with money could. So lots of things deal with money. You may know more or less about some of those things. So yeah, I do think that’s really interesting. I hadn’t thought about the example of business people and the economy, but I think that’s probably true.
E: I used to be an investment banker, and it’s very frustrating to me to see people who are managers play the role of being economists. It’s just not the same thing at all. But yeah, moving on — I wanted to ask you, when it comes to data analysis, in your book you say, “Data analysis is at least as much art as it is science.” I was wondering if you could explain what you meant by that?
L: Yeah, that’s a great question. It’s very interesting to me — when you’d learn about data analysis in school, typically you learn it in the context of say, a statistics class or maybe the end of an econometrics class or something like that, where they start to teach you how to actually work with real data. So usually you learn — it comes kind of from the history of the field, as a field that used statistics, and analyzing data kind of grew out of very mathematical fields. So there’s this idea that you can always write down an equation for how the data are going to behave. And that’s almost never true. Data is, the data that you get out of almost any system is complicated, and there’s all sorts of reasons why it’s messed up. An example would be, from the pine beetle case, there was a day I slept in — don’t tell my old bosses — and missed the counts for that day. And so, you put in — the way you would mark it is you’d just put N/A or whatever. But then somebody going back and analyzing that data has to account for the fact that a graduate student slept in that day. Which is not something you can nicely model with an equation or anything like that. You just have to deal with the fuzziness of the data. And so, whenever those sorts of things happen, you have to make lots of basically arbitrary decisions. Do you skip that day? Do you try to impute the missing values using some information, predict what they were? Do you — what decision you make about that, is basically a human behavioral choice. And it depends a lot on where you were trained. Certain places, like if you went to school at a certain place, they’ll tell you to do one thing. And if you went to school at a different place, you probably got taught to do something else. And also just your own perception. So that’s the art of data analysis. 
Basically, anything beyond — there’s these beautiful equations that you can use to describe how a linear model fits or– Any of the standard statistical ideas of how to calculate a P value, the central limit theorem and all that. But in real data, most of it is these series of somewhat arbitrary decisions that mostly you only learn how to make them well, after having had experience doing it.
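As a rough illustration of the missing-day point (a sketch, not something from the interview or the book): the counts below are made up, and the three functions are hypothetical names for three defensible ways of handling the day nobody recorded. Each choice gives a different season total, which is the kind of judgment call being described.

```python
from statistics import mean

# Hypothetical daily beetle counts; None marks the day nobody recorded a count.
daily_counts = [12, 15, None, 30, 28, 9, 11]


def total_skipping_missing(counts):
    """Choice A: drop the missing day and sum only what was observed."""
    return sum(c for c in counts if c is not None)


def total_with_mean_imputation(counts):
    """Choice B: fill the missing day with the mean of all observed days."""
    observed = [c for c in counts if c is not None]
    fill = mean(observed)
    return sum(fill if c is None else c for c in counts)


def total_with_neighbor_imputation(counts):
    """Choice C: fill a missing day with the mean of its observed neighbors.

    Assumes every missing day has at least one observed neighbor.
    """
    filled = []
    for i, c in enumerate(counts):
        if c is None:
            neighbors = [
                counts[j]
                for j in (i - 1, i + 1)
                if 0 <= j < len(counts) and counts[j] is not None
            ]
            c = mean(neighbors)
        filled.append(c)
    return sum(filled)


print(total_skipping_missing(daily_counts))
print(total_with_mean_imputation(daily_counts))
print(total_with_neighbor_imputation(daily_counts))
```

None of the three totals is the "right" one; which is most reasonable depends on what is known about how the counts behave from day to day.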
E: It’s interesting, that seems to be a theme. I mean in your book, it’s addressed to trying to find standards for dealing with issues like that. And you have a really interesting section called, “Common Mistakes.” And one of those, in one sub-section, you write about the conflation of inferential and causal analysis, spurious correlations and causation creep. And I was wondering if you could go into a little bit about what causation creep is and why it’s a problem?
L: Yeah, so there’s a few ideas packed in there. They’re all related to each other though. So causation creep is, usually when you’re analyzing a data set, it’s very hard to, even if you find that two variables are related, the data is related… The most common example of this is, if you plot how many ice cream cones people buy and how many murders occur in a city, those two things will be correlated with each other in almost every city in the world. And so, that’s not because ice-cream-eating causes murderous intent or anything like that, right? It’s just because in the hot months, people will eat more ice cream, and also more murders occur in hot months, because people are out and they’re interacting more, or whatever. So that’s an example where there’s a correlation between those two variables, but you can’t say that ice-cream-eating causes murder. Similarly, in almost any analysis you do of data, say in the medical field, if you don’t take very careful steps, you can detect correlations between all sorts of variables. Like, a headline I saw once was, “Facebook causes cancer.” You read a lot of Facebook, so you’ll get cancer. That’s probably not true. They probably just analyzed this gigantic data set that wasn’t carefully curated, and found a correlation and reported it. The way that that happened likely is the original authors of the study probably were very careful not to say that it was a causal relationship that Facebook causes cancer. They probably said something like, “We observed a correlation between Facebook and cancer in this population.” But then somebody, either them or maybe the editor in an editorial wrote, “Well, it looks like Facebook might cause cancer.” And then somebody says, “Facebook causes cancer.” You can see the progression of the language from, “Oh, we observed an interesting correlation” to “This causes that.” So that’s kind of causal creep, at least as I define it.
It’s basically, the creeping of causal language into a description of an analysis that really can’t tell you which one caused which.
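The ice cream example can be sketched as a small simulation. Everything here is an assumption made for illustration: the temperature range, the slopes, the noise levels, and the variable names. In this toy model temperature drives both quantities and neither affects the other, yet they come out strongly correlated overall. Looking only at days with nearly the same temperature makes most of that correlation disappear.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical daily temperatures (degrees C) for a simulated city.
temps = [rng.uniform(0, 35) for _ in range(5000)]

# Both quantities depend on temperature plus independent noise.
# Neither one appears in the formula for the other.
ice_cream_sales = [10 * t + rng.gauss(0, 40) for t in temps]
violent_incidents = [0.2 * t + rng.gauss(0, 1.5) for t in temps]

overall_r = pearson(ice_cream_sales, violent_incidents)

# Hold the confounder roughly fixed: keep only days between 20 and 22 degrees.
band = [
    (s, v)
    for t, s, v in zip(temps, ice_cream_sales, violent_incidents)
    if 20 <= t < 22
]
within_band_r = pearson([s for s, _ in band], [v for _, v in band])

print(f"correlation over all days:        {overall_r:.2f}")
print(f"correlation at similar temperature: {within_band_r:.2f}")
```

Real observational data rarely comes with the confounder measured this cleanly, which is part of why the causal language is so hard to justify.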
E: I was just going to say, I imagine reading the sort of popular science news sections on websites must really be frustrating for you. Those stories seem to be, I mean, half of them seem to be based on that kind of — a journalist just taking the opportunity to say something that they based on some research that they kind of read a summary of.
L: Yeah, I think that’s certainly frustrating. And the one thing that’s been really frustrating is — as a parent, I have two young children, and there’s always news about, “If you do this for your kids or do that for your kids they’ll be fine,” or, “They’ll turn into malformed mutants,” or whatever it is. And so, if you don’t know how to look into the details of the study, then you can be snowed by information that clearly isn’t true. One of my favorite pet peeves is this often discussed connection between breast feeding and IQ. If you breastfeed your kids for longer, they’ll have higher IQs. But that’s one of these notoriously hard things to study well, because it’s very hard to randomize women to breastfeed or not breastfeed. And so, you often will see claims about breast feeding that are based on observational data, which is data that makes it very hard to make real causal claims. That introduces a whole set of ideas that make it really hard to understand what’s really happening. But there’s understandably a lot of emotion tied to what that answer might be. So that one comes up a lot.
E: You also have a section on data dredging, where you quote the British economist, Ronald Coase, saying — and I just love this quote — “If you torture the data enough, nature will always confess.” Can you explain what he meant by that?
L: Yeah, so it kind of goes back to that same idea, the art idea of data science or data analysis. Since most of data analysis boils down to a series of decisions that have to be made by a human, if you’re nefarious for example, if you really want the data to say something, you can make all those decisions in such a way that you’ll get the answer you want. Here’s an example: suppose I take all the data for two stocks, and I want them to be correlated with each other, like I want the prices to be correlated with each other. But it turns out, when I take the data and I look at them, they’re not correlated at all. Well, suppose I take all the observations, all the times where stock #1 is high and stock #2 is low, and I just throw those out, and throw out all the times when stock #2 is high and stock #1 is low. So now, I only have the data points where they’re both high at the same time, or both low at the same time. Then they’ll be very correlated. That’s an extreme example of course, but there are also subtler ways in which, by making a lot of intermediate choices, you can arrive at the final answer you’re trying to get. So there is sort of a concern in the scientific community, the data analysis community that we need to be careful about knowing what all the intermediate steps were. Knowing how many times you’d fit a predictive model. Did you try every possible combination of variables until you found one that worked? Or did you, very carefully, hold out a data set to check your predictions on and make sure it worked? So there’s lots of ways in which you can manipulate the data, either if you mean to do it, or even by accident. I think the more common reason is not nefarious, it’s just by accident. People try a bunch of things, and then they just stop when it gives them the answer that they want. They weren’t trying to be bad, they just got to the answer they wanted, and so they quit.
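The two-stock example can be sketched in a few lines. The data are simulated, independent random numbers standing in for two stocks (an assumption for illustration, with "high" and "low" taken to mean above or below each series' own average). Discarding the days where one is high and the other is low manufactures a sizeable correlation out of nothing.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Two hypothetical, completely independent series.
stock_1 = [rng.gauss(0, 1) for _ in range(2000)]
stock_2 = [rng.gauss(0, 1) for _ in range(2000)]

honest_r = pearson(stock_1, stock_2)

# "Torture" the data: keep only days where both are high or both are low.
mean_1 = sum(stock_1) / len(stock_1)
mean_2 = sum(stock_2) / len(stock_2)
kept = [
    (a, b)
    for a, b in zip(stock_1, stock_2)
    if (a > mean_1) == (b > mean_2)
]
tortured_r = pearson([a for a, _ in kept], [b for _, b in kept])

print(f"all {len(stock_1)} days:     r = {honest_r:.2f}")
print(f"after dropping {len(stock_1) - len(kept)} days: r = {tortured_r:.2f}")
```

Reporting how many observations were dropped, and why, is what lets a reader catch this kind of filtering.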
E: Yeah it’s really fascinating, especially as you’re bringing up stocks. I mean, the 2000 tech bubble happened to coincide with people having the internet and personal computers and then charts. And if you want to see how creative people can be interpreting data, just give them a stock chart and let them go.
E: It’s just incredible to see what people will do, how they’ll find different charts and put them against each other. And then they’ll bring the knowledge that they have, of what’s going on in the world, and also their interests and their desires to it. And it’s just amazing how, if you just describe something to somebody, they’d be like, “That’s an interesting story, but what are the facts?” But if you put it in a chart, it’s like, oh, and numbers. I mean, nothing confers more validity upon the wildest claims, than just putting numbers to them, right? It almost has a magical effect on people. And actually you have a line in your book as well, I think on your landing page, where you talk about how the dramatic change in the price and accessibility of data demands a new focus on data analytic literacy. I was wondering, is that related to this? That people are exposed to data more than they used to be and, again because of computers, they can interact with data in a way they couldn’t even like 30 years ago. Is that what you were getting at, or was it something else?
L: Yeah there’s different — there’s two levels at which I think about that. Problem #1 is exactly what you’re talking about, which is that basically computing, free computing even, has made it really easy for anybody to make charts of two variables and plot them against each other. And once you do that, inevitably you’ll run into some things that will look like there’s a relationship even when there’s not. So, there’s certainly the fact that data analysis training isn’t something that we give most people, right? Only a certain subset of the population gets trained on how to be aware of the fact that two stocks might look correlated even when they’re not, right? We don’t teach people that in grade school. We teach them reading and writing and arithmetic, but then we don’t teach them, “Oh if you make too many charts, eventually you’ll find one that’s a false relationship” — which is something we don’t teach people but maybe should. Even more than that though, I think it’s even more subtly ingrained into everything you do in life, even if you don’t think about it — now, these days, more than it used to be. I wake up in the morning and I look at the weather and what it’s going to be throughout the day and whether I should bring a raincoat or not. And then I have to assign some credibility to the app that tells me what the weather is going to be. I’m making decisions based on that. If you watch any sports games, they’re always talking about, “This is the first person since so and so in 1973 to have this many goals and this many assists at this time half way through the first period.” Basically you’re saturated with people talking about statistics and numbers and trying to, exactly like you said, give themselves an aura of credibility by talking about numbers, especially precise numbers. And we’re not equipped, we haven’t been trained in general as a population to be skeptical or to identify what are the potential flaws with those numbers. 
I think it allows people to get away with stuff, whether it’s in speeches — politicians giving speeches and making wild claims with numbers that if you just step back and think about the actual claim they’re making, you think “no way that’s true.” But they said, “Oh it’s a 3.2% increase.” You think, “Wow, 3.2%.” You know, he clearly calculated that number, when maybe they even haven’t. So I think that’s what I mean by that. I think everyone has to [learn] — whether it’s from conceptual to making their first charts to whatever. Almost everyone is doing some form of data analysis every day now. But we don’t have it as part of a standard, daily life curriculum of how do you deal with that, is what I think I meant.
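The "make too many charts and you’ll find a false relationship" point can be sketched with pure noise. The setup is an assumption chosen for illustration: 40 unrelated random series of 20 points each, with every possible pair compared. None of the series has anything to do with any other, but among the 780 pairs the strongest one looks like a real relationship.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(2)  # fixed seed so the sketch is repeatable

n_series, n_points = 40, 20

# Hypothetical "variables": independent random noise, unrelated by construction.
series = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)]

# "Chart" every variable against every other one.
pairs = list(combinations(range(n_series), 2))
correlations = {(i, j): pearson(series[i], series[j]) for i, j in pairs}

best_pair = max(correlations, key=lambda p: abs(correlations[p]))
strongest = abs(correlations[best_pair])

print(f"pairs examined: {len(pairs)}")
print(f"strongest correlation found: {strongest:.2f} (series {best_pair})")
```

Knowing how many comparisons were made before the striking one was found is the piece of context that the chart alone never shows.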
E: I know you are engaged in a pretty wide effort to help educate people, I mean more the sort of — people with more specialized or advanced knowledge. But you’re doing this with your colleagues through the specialization in data science on Coursera.
E: Which, as I said in the introduction has 1.76 million people at last count participating in it. That’s obviously extremely popular, and I was wondering if you could explain a little bit about the specialization in data science and maybe why — I mean, what it is. And that it’s free for example, and why it’s been such a success.
L: I’m just as surprised as anyone else that that many people wanted to learn data science. But I remember a very specific conversation with Roger where, when we originally launched our first Coursera classes, we were talking to each other and we said, “Wouldn’t it be cool if like 2,000 people took our stats classes?” Obviously we’re a couple of orders of magnitude bigger than that now. I think what has happened is that there are a lot of– so, the data science classes cover basically how to use the R programming language to do the whole data analysis, data science process: from getting the data, to cleaning it up, to analyzing it and making reports. I think that particular skill is in such high demand right now. It wasn’t in demand, and became in demand over a very short period of time, and there weren’t that many resources available. If you didn’t know how to analyze data, you could’ve gone back to school to learn it. This is one of the first freely available, always available resources for learning how to do that. The classes are free; if you want a certificate that you can put on LinkedIn, you pay a small [fee], I think it’s like $50 a class or something like that through Coursera, to get the certificate. I think the fact that it was really available, the fact that it was timed well, got people excited about it. It makes me happy that there are lots of people that are interested in learning how to do data analysis and data science right. I think the trend is positive in the sense that you’re seeing people making the decision to learn how to do that.
E: I remember, I was looking at one of your talks online, just some slides. And I think you showed an email exchange between yourself and Roger Peng, your colleague, where you said, “I’ve got 7000 students.” And he replied, in more colorful language, “You’re screwed.”
E: I was just wondering, does the fact that there are 1.76 million people affect your workload? Or is the way Coursera’s set up just so efficient that that doesn’t really impact anything? So it could be, you know, n people?
L: No it definitely affects workload, but maybe not as dramatically as it would if they were in person. First of all, Coursera and we have recruited people from the classes’ communities who are amazing, outstanding folks who answer lots of questions on discussion boards. That helps us answer questions. We answer questions on the discussion boards, but at this point the classes have been running for a long time, so the same questions come up over and over again. So, you can kind of anticipate the usual set of questions that will happen. We have 9 classes in the sequence, they all run every single month. We’re getting a lot of data back on which parts are hard and which parts are easy. And we can take advantage of that I think. And then it isn’t quite the same experience. Taking a class online, you definitely don’t get as big a– Imagine that many people going through, each one doesn’t get as big a fraction of my attention as say, when I teach a class here in person for 10 people, it’s much easier for me to give everybody personalized attention. It’s much harder at scale, and so you see that, just by virtue of there being a lot of people taking it, it makes it harder. So people tend to work a little bit more independently on the Coursera platform than I think in person. So it has added a workload in a sense that Roger, Brian and I didn’t think it was going to turn into this huge enterprise, but now that it has, it’s got a life of its own, and its own obligations and responsibilities that are added to our lives. But it’s been such a, kind of a rocket ship and it has been fun to take on those new challenges.
E: That’s fantastic, congratulations on that. I think it’s just great. I wanted to ask you just a couple of questions about Leanpub. I’m curious how you found out about Leanpub and why you chose to use it for your book?
L: So I think it was Brian Caffo actually who led it. Brian’s a colleague of mine that teaches in the same specialization. He was writing a book, and was looking around for the right place to do it. He’s kind of a tools geek, so he’s always checking out like what’s the latest, coolest, easiest or hardest or whatever way to do something. And he got really into Leanpub. He wrote his book and released it first on Leanpub. And we wrote all of our lectures actually in Markdown, for the specialization. So we were all really comfortable with Markdown.
L: And he said, “I found this awesome tool, Leanpub. You can write in Markdown, you can even take some of the material from your lecture notes and convert it more easily into the start of a book chapter.” And then he told us about how, I think the system, of how simple it is to write it and turn it into all the different [book] types, and then launch the book without having to go through all the usual… You know, all of us had worked on academic books, and the publishing process — that can be a bit long and tedious and everything. The speed with which we could do things was very exciting for us. And so, he launched his book and it was pretty successful, and we were very excited about it. And then I launched mine and Roger launched his. And we’ve all been just blown away by, we all really like the Leanpub system, we’ve all been — I don’t think any of us will ever go back to publishing any other way. So we’re pretty excited about it, yeah.
E: Well thanks, thanks a lot for that. I wanted to ask you specifically about academic publishing, and I mean publishing about scientific matters and technical matters. It’s obviously a cliche to say the pace of change is accelerating these days, and the speed with which people can communicate their ideas is changing. Do you think that the conventional academic publishing model, which can take (I mean, it can take a year to get an article in a journal, just in a journal), do you think that there’s a fundamental mismatch between those two?
L: Yeah, I do. Right now, the way I’m seeing that manifest in journal publishing is that people are using pre-prints pretty extensively. It’s very common now, and my group does this a lot. You’ll write a paper with a student, and when you submit it to the journal, you also post it on a public-facing web server. One of the most famous ones is arXiv, and bioRxiv is the biology one. You just post your paper there, and everybody starts reading it while the paper is being peer reviewed. When it’s on those sites, it’s not peer reviewed yet, and so everybody knows to take it with a grain of salt. Things might not be totally worked out yet. But you don’t want to put something up there that’s embarrassing either. You know your scientific colleagues are all going to read it, so you don’t just post anything. So that’s improved the speed that way. In terms of publishing books, that’s still a pretty slow process and a pretty hard process. It’s not clear that that’s caught up with the internet age either. I think certainly, in their publishing policies, academics are very slow adopting internet-style communication. It’s starting to accelerate. I would say over the last 3 years I’ve seen a lot more posting things online, tweeting about it, that sort of thing. But that seems pretty new. That hasn’t been going on for a long time.
E: It’s interesting, that process that you’re talking about. In a way, the most important part of the process is putting the text out there, and then having the community engage with it, whether it’s in a formal peer review through a journal, or the informal peer review that happens as soon as you post it to arXiv. I hadn’t heard of bioRxiv before. With your Leanpub book, was that important to you, to have that kind of interaction with people, or was that a different type of project? I mean, obviously not peer-review-style engagement, but just people telling you even something that’s minor, like “it’s a typo,” or “I wish you’d written about this or that.”
L: Yeah definitely, I’ve been getting a huge amount of [feedback about] especially typos. Speed has been a hallmark of the things that we’ve done around here, which doesn’t mean we always catch every typo. So I’ve gotten a long list of typos which I’m slowly working my way through. One thing I like about the platform is that once I get through those, I’m going to release a new version of the book, and everybody gets it again. I like that component of the process: the ability to release it quickly, make edits, and not feel like I am hurting the people that paid for it. Because you don’t want to release something too quickly and then it turns out that there are typos and things like that in it. So mostly it’s just typos. So far nobody’s found any errors, thank goodness. But it is something where I do think that the iterative nature of it is nice. It makes it a lot easier to feel like you don’t have to put out that perfect product the first time. Which is very hard to do, especially for something the length of a book, I find. You’re almost guaranteed to have a few typos in there. So it’s easier when there’s a thousand people reading it and checking for typos, than if you only have one editor or one person, like a friend, looking it over.
E: Yeah that’s great, I mean that’s obviously — that’s our premise, right? A better way to write books than sitting in stealth mode, working alone in a cabin for a couple of years and then releasing something that’s supposed to be finished, is to just get it out there earlier and start interacting with people and getting their feedback and comments. And that actually will improve your book dramatically.
L: Yeah I mean, that’s certainly been true for me. I think Roger, Brian and I have the advantage of already having a built-in, large group of people that might be interested. Given that we teach these good classes and the things that we talk about in our books are related to those classes, we have a built-in big audience. That’s made typo identification a very quick process in the sense that I was very quickly informed of all the typos in the book. I’m the bottleneck. The identification of typos wasn’t the bottleneck, it’s me having time to correct them all and release the new version. I think certainly when you have a built-in audience especially, it’s just so much more efficient to get the typos corrected post-publication than pre-publication.
E: Is there anything about Leanpub that we could improve for you? You can be as blunt as you want. I’m the type of person who shouts at the computer when I’m using things and they don’t work. Was there any kind of “shout at the computer” thing about Leanpub that you encountered?
L: I didn’t have too much, I mean mine is a little bit easier. I think maybe some of the code and equation stuff was a little bit more challenging for my friends Roger and Brian who had to do more code and equations than I did. The one thing I wish Leanpub had was a really easy hard copy publishing approach. Something like CreateSpace-style.
E: Right, yeah.
L: I’ve gotten quite a few requests now for hardcopies of the book, and so far I’ve just been deferring that, because I haven’t had a good approach. I’ve looked into some other hardcopy publishing approaches and they’re not as slick as– The thing I liked about Leanpub was how it made it very easy. Given that we’re running these classes on the side and we’re professors, and we have all the other things in our lives, I don’t have a huge amount of time to devote to getting my publishing software to work the way–
E: Yeah, fair enough.
L: That has been maybe the thing that’s best, the thing I wish we could copy for everything we’re doing. We’ve made suggestions to Coursera and other places about ways they could make their system more Leanpub-like, in the sense that it would be easier for people to upload things and stuff like that. And I think that’s certainly true, that that is the part that is the killer feature as far as we’re concerned.
E: Okay, thanks a lot for that. I mean we do — we’ve tried to accommodate that need as much as we can, without actually doing it ourselves, by having a print version output option. That’s sort of optimized for — like, you’d just write your Leanpub book the normal way, but you can also export a version that’s optimized for uploading to Lulu and things like that. I think that Leanpub getting into the production of paper books thing is probably a long ways away.
L: Even if it was just an agreement where, if there was just a push button, you just sent it to one of those organizations instead of us.
E: Oh that’s really interesting.
L: You know what I’m saying? Where you don’t even have to build the infrastructure yourself. But there’s a certain amount of stuff [to export] and upload, then [you have to] get a new account on this new system. That’s a bit of a pain.
E: That’s really interesting, thanks, we’ll talk about that. That would actually be — that’s a really great– I mean, obviously behind the magic…
L: Oh, there’s going to be a huge amount of work.
E: Behind the magic “world peace” button, as programmers say. I’ve heard programmers talk about, “The client goes, ‘And now I want a button that causes world peace.’” It’s like, “Well I can make a button with the words ‘world peace’ on it,” but actually, anyway — that’s a really good suggestion, and I’ll take that to the team and we’ll think about it.
L: But honestly, I don’t think, I mean, given that that’s my only suggestion, you can tell, for the most part, I’m very pleased with the platform. Things for me went very smoothly. I really didn’t have any problems. I’m already starting on my second Leanpub book, and so I’m sort of — I find no problems with the platform that’s currently created.
E: Actually, that was going to be my last question. I see that you have an unpublished book called, “We are all statisticians now.” Is that the book you were referring to?
L: I have that one and then I– Sorry, I’m a person that starts things relentlessly, and maybe doesn’t finish all of them. I have two. I don’t know if the one I’m working on mostly right now is up on the Leanpub site. Maybe it’s not up there yet. But the one I’m working on right now is about that issue of, “How do you deal with health information in your day to day life?” So, my wife and I are both statisticians.
E: Oh okay.
L: So whenever we talk about health headlines, there’s a language we use about them, how we determine whether we believe them. Like, if it says we should be giving our kid more sweet potatoes or whatever, then there’s a series of questions. If I read that headline and I tell my wife, there’s a series of questions she’ll ask me before she’ll believe that it’s true. And so I’m working on a book that I’m hoping I can eventually convince her to collaborate with me on, where we talk about, basically, what are the questions that two statisticians ask each other when reading about health news. How do you evaluate critically whether the study behind the headline is really something you should be paying attention to, or whether you should just ditch it. So that’s the one that we’re working on right now.
E: Okay, that sounds like a really great idea. I just wanted to say, thanks very much for being on the Lean Publishing Podcast, and for being a Leanpub author.
L: Alright, thank you very much, and I really appreciate you taking the time to talk to me.
This interview has been edited for conciseness and clarity.
– Posted by Len Epp
Originally published at leanpub.com.