An Interview with Roger Peng, Author of R Programming for Data Science

Published Jun 11, 2015 by Len Epp

Roger Peng is an Associate Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health and co-founder of the Johns Hopkins Data Science Specialization on Coursera and the Simply Statistics blog, where he writes about statistics for the general public.

This interview was recorded on May 27, 2015.

The full audio for the interview is here. You can subscribe to this podcast in iTunes or add the following podcast URL directly: http://leanpub.com/podcast.xml.

Len Epp: Hi, I’m Len Epp from Leanpub. And in this Lean Publishing Podcast interview, I’ll be interviewing Dr. Roger Peng. Roger is an Associate Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health. He is also a co-founder of the Johns Hopkins Data Science Specialization on Coursera, which has enrolled over 1.5 million students, and he’s a co-founder of the popular Simply Statistics blog, where he writes about statistics for the general public. Roger’s research interests include the study of air pollution and health risk assessment, and statistical methods for environmental data. He is also a leader in the area of methods and standards for reproducible research, and is the reproducible research editor for the Journal of Biostatistics.

In addition to being the author of more than a dozen software packages, implementing statistical methods for environmental studies, he has also given workshops, tutorials and short courses in statistical computing and data analysis. Roger recently published his first Leanpub book, “R Programming for Data Science”, which uses material developed as part of the Johns Hopkins Data Science Specialization. The book is available for free, with a suggested price of $15, and already has over 17,000 readers. The book can also be bought along with lecture videos and datasets.

In this interview, we’re going to talk about Roger’s professional interests, his book, his experiences using Leanpub, and ways we can improve Leanpub for him and other authors. So, thank you Roger for sitting through that introduction and for being on the Lean Publishing Podcast.

Roger: No problem, thanks for having me. And I just want to warn you that my building is right next to the hospital here, so you may hear the occasional siren.

E: That’s okay, it’ll just give some background color.

P: Yeah.

E: So I’d like to start with a couple of biographical questions. Can you tell us how you first became interested in statistics generally, and in biostatistics specifically?

P: That’s a great question. It’s kind of a weird path that I took. I studied math in college, and I think that’s how a lot of people get involved in statistics initially. Part of my math requirements required that I take a course in statistics, so I took a — I think it was probability theory. And I really enjoyed it, and so I kind of kept going down that road and taking more and more statistics classes — and ended up being kind of like a minor in that area. And so I just naturally thought about applying to graduate school. My older brother had gone to graduate school, so I figured that was the right thing to do. And it was kind of funny — so I graduated from college in 1999, and basically it was the dot com craziness. Everyone was going to the software companies, and here I was going to apply to graduate school. So I kind of bucked the trend there.

Anyway, so I applied to graduate school in statistics, because I thought that’s what I wanted to do. Seemed like a fun field. And so, I went to UCLA and got my PhD there. And I originally didn’t learn any biostatistics per se. I wasn’t really working in biomedical sciences. And so I was looking for a job, and my adviser, his grad school roommate was a Professor at Johns Hopkins. I had no intention of really applying there, because I didn’t really think I was doing biostatistics. And then my adviser said, “My old roommate’s there, he says he really loves the place, so you should check it out.” So I applied and I interviewed, and I really liked the people there. And I thought, “Okay, well even if I’m not specialized in this topic area, it seems like a great environment, a great institution.” So I got the job here and I came.

So it was kind of weird, because it wasn’t necessarily directly what my training was. But I think for me, a lot of decisions I make, in terms of what to do or where to go, are based on, what people are involved in it? Are there good people involved? And if I like being with them, then that’s the bottom line for me.

E: That’s interesting how you bring up the startup world. I mean, that’s how a lot of decisions are made in startup-land as well, right? It’s like, we’ve got lots of options, but we’ve got lots of ideas. But what startup should we work with? The people that you’re going to be involved with are often a driving factor there.

P: Yeah, because I think things always change, and the people need to be able to deal with it. And you’ve got to make sure that you’re with the right people when things go wrong.

E: I have this specific question about the work that you’re doing now. It’s on your website that you’re working on environmental biostatistics, and how air pollution and climate change affect human health. Can you give us a little information about how you would use statistics to study those effects?

P: There’s a couple of areas. My biggest area is probably looking at outdoor air pollution and population health. This work directly informs national level type regulation on air pollution standards. So what we do is we look at the study in the US where the US Environmental Protection Agency monitors air pollution all across the country, in all the major cities. The idea is that we want to see how the levels of air pollution that are changing in the air, are related to different population health metrics. So we might, say, look at the number of people who have been hospitalized for a heart attack on a given day, or the number of people who were hospitalized for respiratory infection — something we think is linked to air pollution exposure. So we have these very long time series of daily levels of pollutants, and from day to day, things go up and down.

So, you would imagine that if pollution is linked to health, that as pollution’s going up and down, the various health metrics also should be going up and down. But the problem is that teasing out that signal is really hard. Because it’s not the kind of signal that — air pollution’s not the kind of thing that knocks you over as soon as you walk outside, right? Well at least not in the United States, right? And so, there are all kinds of other competing factors that are a risk for your health. Teasing out the signal that air pollution contributes to either morbidity or mortality risk is really where statistical models are needed.

Back in the old days, in the 40’s and the 50’s, when pollution was just out of control, you didn’t need fancy statistical models to see that it was affecting people’s health. You just had to go out on the street and see people having problems. But now that pollution levels are lower, it’s not so obvious to see those kinds of problems. But nevertheless, we still do see pretty strong associations between changes in pollution levels and various health outcomes.
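[Editor’s note] The difficulty Roger describes, where competing factors hide a small pollution signal, can be illustrated with a toy simulation. This is only a sketch and not the model used in his studies: it assumes made-up numbers, a single seasonal confounder, and Gaussian noise in place of real hospitalization counts. The helper names `slope`, `residuals`, and `simulate` are hypothetical and were written for this illustration. The code compares a naive regression slope with one estimated after the seasonal pattern has been removed from both series.

```python
import math
import random


def slope(x, y):
    """Least-squares slope of y on x (fit includes an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """What is left of y after removing its least-squares fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]


def simulate(days=730, true_effect=0.5, seed=1):
    """Simulated daily series; every number here is an illustrative assumption."""
    rng = random.Random(seed)
    season, pollution, admissions = [], [], []
    for t in range(days):
        s = math.cos(2 * math.pi * t / 365)      # shared seasonal cycle
        p = 20 + 5 * s + rng.gauss(0, 4)         # pollutant level
        a = 50 + 10 * s + true_effect * p + rng.gauss(0, 3)  # admissions
        season.append(s)
        pollution.append(p)
        admissions.append(a)
    return season, pollution, admissions


season, pollution, admissions = simulate()

# Naive estimate: ignores that season drives both series.
naive = slope(pollution, admissions)

# Adjusted estimate: remove the seasonal pattern from both series first.
adjusted = slope(residuals(season, pollution), residuals(season, admissions))

print(f"naive slope:    {naive:.2f}")
print(f"adjusted slope: {adjusted:.2f}")
```

In this simulated data the naive slope overstates the built-in effect of 0.5, because pollution and admissions both rise in the same season. Removing the seasonal pattern from both series gives an estimate much closer to the value that was built in.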

E: I imagine it must be even more complex when you factor in climate change?

P: Yeah, climate change is an aspect that affects how we think about things in the longer term, right? There are different time scales in which you could think about air pollution problems. One of them is the day to day level. But another one’s how things change over time, and are things improving as air pollution levels go down? Climate change can affect that in a variety of ways. One is affecting the weather, which has an interaction with air pollution levels. And the other is that, as we implement policies to deal with climate change, that has a direct effect on air pollution levels too. So, for example, we want to deal with climate change by shutting down some power plants. Then that will also affect the direct levels of pollution. So there’s lots of interactions between the different things there. And so, statistical modeling is useful for integrating all the different kinds of data that you come across. So there’s climate data or air pollution data or health data. And it’s also useful for teasing out these small signals that we have to detect.

E: And are you always working with a national dataset, or do you focus on a specific region, or say, urban versus rural, or something like that?

P: My work focuses on national-level studies. We get data on the pollution side from the entire US EPA monitoring network. Also, we get health data from really large administrative claims databases like Medicare, and Medicaid, which are these large national insurance programs. So we can look at insurance records and see every time someone was hospitalized — we can mark that up, and then see if it’s related to changes in air pollution levels from the monitoring network.

E: That’s really interesting. This reminds me of stories in the media in the last year or so about Paris shutting down half the cars on the streets, when you can only drive a car if your license plate ends in like an even number, or an odd number, to cut down on car pollution. As you were saying, pollution levels have gone down generally in the States in the last couple of decades — do you see any problem like that emerging in an American city in the next 10 or 15 years?

P: Any problem like what, sorry?

E: Like there is in Paris.

P: If you look across the nation, things have gotten much better over the last few decades. But there are still cities that have very high levels of pollution and still have problems. For example, if you look at the, I think it was in 1996 — the Atlanta Olympics, Summer Olympics. They implemented a scheme like that in terms of traffic, just traffic control. I think just because they envisioned lots of people coming and things like that. But I think there are cities that are still beyond the regulations here in the United States that need to improve their levels. And so, although conditions are generally much better, it’s not a solved problem yet.

E: Just switching gears slightly, and just talking about data science more generally, I looked at the Johns Hopkins Data Science Lab website, and I’m going to read a quote and ask you to explain a little bit about what it’s talking about. It says, “The revolution in measurement, and the resulting deluge of data has made data science the most important field of study in the world today.” So can you explain a little bit about why data science is so important generally? Just for people who might not be familiar with it.

P: Sure yeah. So I think, if you look around — if you just look around yourself, everything that you look at is essentially generating data. And if it’s not generating data itself, we have some device that can collect data from it. So everywhere you go in the world, and in your life today, there is information that’s being generated, kind of spewing out into the world. And a lot of it we can’t collect — there’s just too much. But we can collect more and more of it as time goes on, because of improvements in technology and in computing power.

And I think if you looked back many, many years, say 100 years, the biggest issue was collecting the data, because it was very expensive, and you had to be very careful not to waste a lot of resources collecting data. And then once you got the data — assuming you did it right — the analysis is pretty straightforward, because there’s maybe 10 data points. But now we have the kind of reverse situation, where the collection of the data is very routine, in fact almost too routine sometimes. I mean, the data is just happening. It’s being collected whether we like it or not. You look at some server web logs, the data’s just being collected, it just is. And so now the analysis actually has become much more complicated, and much more difficult to do, because of the volume and the complexity and the heterogeneity of all the data that’s just being generated automatically.

So the difficulties and the skills that are required have really flipped. Whereas before, you had to be really careful about optimizing your study design and making sure that you’re not wasting things. I mean, you still have to worry about that. But now the skills for data analysis are really necessary. And there are a lot of fields that didn’t emphasize the data analysis part. And they’re realizing now, “Oh, actually we’ve really got to train people in this area,” whereas in the past we did not have to. So that’s why I say that I think data science is, in many ways, taking over every area of either science or business or whatever. Because everywhere there’s data. And so the skills to analyze those data are becoming increasingly valuable.

E: I imagine that the presence of this data is unlocking new areas of study. For example, in the past, people weren’t all wearing Fitbits and clocking their steps every day and stuff like that. So now that things are being tracked that weren’t being tracked before, it must open up new areas for study.

P: Yeah, there are new areas of study being created. Things like wearable computing — that didn’t exist a couple of years ago. There’s just new kinds of data that we’ve never seen before, and so we don’t even know how to characterize it. If you have an accelerometer that you’re wearing, how do we even get any information from that data? How do we know what you’re doing with the accelerometer? That’s where statisticians like me, and many other people that I work with, earn our money, because we have to figure out ways to look at the data, to characterize it, to understand what’s happening.

E: Okay, I’m not sure if my next question is related to that. But you do say on your website that you have a special interest in reproducible research. I was wondering if you could explain a little bit about what reproducible research means, and why it’s so important to you and to your field?

P: Reproducible research is like the scientific analogue of open source software. The basic idea is that, for your work to be reproducible — and there’s a lot of confusion sometimes about the terminology — the idea is that you want your data and your software code to be available for others to look at. So, the idea is that I can take your data, and I can take the software that you used to analyze it. And I can reproduce the numbers that you’ve published, or the graph or the plots that you made. Or whatever it is that the result was. It’s not that I’m going to redo your whole study, I’m just going to take your data and produce what you produced.

And the funny thing is that, this was not really that important many years ago. Because the data, it was so small — there was no software, right? There was nothing to provide. If you really want to know whether an experiment was valid or not, you would just do it again, right? You’d do it yourself, right? But now the problem is a lot of studies are so big, and they involve such large quantities of data that the collection of it — like I said before, the collection of the data’s not really the challenging part. It’s really the analysis, and how they analyze data to come to a conclusion about nature or whatever it is they’re studying. That’s really where all the difficulty is.

So we really need to know those details, and the problem is that the publishing infrastructure is not really designed to let people know what they are. So the bottom line way to know what those details are is to see the code and to see the data. But again, because a lot of this happened very quickly, there isn’t the kind of infrastructure there for allowing people access to data, giving people access to software code. And so a lot of that has to be built. So I was interested early on in getting this idea across to people, and convincing them that it had to happen in order for science to then move forward.

E: It’s really interesting. I mean, I know that in the popular science media there have been articles over the last year or so about how the interests of scientists aren’t necessarily aligned with actually reproducing research. Because you get the headline or you get the promotion or you get the patent or something like that for actually doing the original research. And so, often articles will be published, and I guess there’s some very low proportion of experiments or studies that are actually reproduced at all.

P: Yeah, yeah.

E: I mean, is that true in your specialty as well? That there’s less of an incentive to just work on reproducing someone’s results than there is to do your own original work?

P: Yeah, I mean I think that’s a general phenomenon, and it’s a general aspect of our culture. I think the emphasis is on discovery, and I think analyzing a published data set and coming to the same results is not what might be considered as discovery, right? On the other hand, analyzing someone else’s data set and finding out something that was wrong, well that is discovery, right? So there are some people interested in doing that. But in general, just reproducing another finding is difficult. It’s sometimes difficult to, for example, get funding for, or to get published in a high profile journal.

However, when it comes to something that’s really high impact, something that’s really interesting to that subfield, to that field, it will get reproduced. If it’s something that’s really surprising or something that could have an impact on the entire field, people are going to want to know whether it’s true or not. And the only way you determine whether it’s true or not, is to have multiple people do the same experiment independently.

E: Right.

P: If no one cares about it, then it’s kind of hard to justify reproducing it. And there’s a lot of scientific publications out there. And it’s not like every single one of them is going to be reproduced. It’s just not physically possible.

E: Right, okay.

P: But there are many examples in the recent popular press, either of things that were faked or things that weren’t reproduced. And you realize that that’s kind of how the system is supposed to work. There’s this example with stem cells I think. It was a very surprising result, right? So what happened? Well, immediately ten labs went to reproduce it, and they couldn’t do it. None of them could do it. So they knew it was wrong. And so, there is an interest in reproducing things, but probably weighted more on the things that are surprising or really exciting.

E: Fair enough. Speaking of original research, I read in the preface to your book that your first experience with R — and I’ll ask you a question about that in just a minute — involved an analysis of word frequencies in classic texts like Shakespeare and Milton, to see if you could identify authorship based on word frequency. And I was wondering if you could explain a little bit about how you got into that, and what your results were, if you can remember that far back?

P: To be honest, I can’t remember how I got into it. I needed a senior project when I was in college, and I think my advisor pointed me to this paper, published in the 60’s by two statisticians who had analyzed the Federalist Papers, because there was some controversy over who wrote which of the Federalist Papers. And so they did a little statistical analysis, and I adopted the same approach that they took. The question was, “Can you identify certain written works based on the rate at which they use what are called function words?” Words you don’t really think about, like “the”, “and”, “he”, “she”. You probably don’t spend a lot of thought on how frequently you’re going to use that word. And so, the idea is that it reflects your natural personal style.

And so the analysis involves taking these texts that we downloaded out of Project Gutenberg, and you’d use a Perl script to divide up the text into words, and to count how often they used a certain subset of these function words. And then, from that — I don’t know how technical you want to get, but we used a basic linear discriminant analysis to see if we could separate one author from another. And it was pretty straightforward actually. It was surprising how well the authors separated. Granted, we picked a group of people who were pretty different from each other. But you could see that authors that wrote in the same time period, they were closer to each other than authors that were writing in very different time periods. And so, there was a kind of logic to what we found — it did seem that a lot of the books or plays from these famous writers were identifiable from these patterns of how frequently they used these kinds of meaningless words.
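[Editor’s note] The counting step Roger describes can be sketched in a few lines. The original work used a Perl script and a linear discriminant analysis; the Python below is only an illustration, and it swaps in a much simpler nearest-centroid rule as a stand-in for the discriminant analysis. The word list contains just the four examples mentioned in the interview, the sample sentences are invented, and the function names (`function_word_rates`, `centroid`, `nearest_author`) are hypothetical.

```python
import re
from collections import Counter

# Only the four examples from the interview; a real study would use a longer list.
FUNCTION_WORDS = ["the", "and", "he", "she"]


def function_word_rates(text, words=FUNCTION_WORDS):
    """Occurrences of each function word per 1,000 words of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return [0.0] * len(words)
    counts = Counter(tokens)
    return [1000 * counts[w] / total for w in words]


def centroid(vectors):
    """Component-wise mean of a list of equal-length rate vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def nearest_author(rates, centroids):
    """Name of the author whose centroid is closest (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda name: dist2(rates, centroids[name]))


# Invented sample texts standing in for known works by two authors.
known = {
    "Author A": ["the king and the queen saw the ship"],
    "Author B": ["she said she knew and she left"],
}
centroids = {
    name: centroid([function_word_rates(t) for t in texts])
    for name, texts in known.items()
}

unknown = "the storm broke the mast and the sail"
guess = nearest_author(function_word_rates(unknown), centroids)
print(guess)
```

A nearest-centroid rule ignores how the word rates vary and co-vary within each author, which is the information a discriminant analysis uses, so this should be read as a simplified picture of the idea, not a reconstruction of the published analysis.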

E: And did you have — I’m just curious if you had any kind of response from people on the humanities end of things?

P: Well I would be surprised if you met anyone who’s actually read that paper. I mean it was published in a statistics journal, so I’m not sure how often they’re pulling that off the shelf.

E: Well it’s interesting, because there are some pretty big controversies about authorship, specifically around Shakespeare, right? Like did Shakespeare write the plays? Or was it a group of people, or was it in fact someone else? And so I was just curious when I read about your experience with that — if anyone had kind of gone, “Aha, here’s another tool for me to make my argument that it was actually not Shakespeare who wrote the plays,” or something like that.

P: No, I have not gotten any emails or issues along those lines. But I think you’re right though. Authorship is always a very interesting topic to people. And even not just in literature, but in many different areas. It is interesting to think about how you would characterize numerically something like — for example, a piece of music or whatever, and then be able to separate it between two different people or so. But I’ve not been enlisted in that.

E: Yeah, it’s a fascinating space. I mean, because there are often biographical things that people will try to pull out of an author’s writings. Like, were they hiding something? And people will, I mean, I know this from my experience in the humanities, that sometimes people will try to tease those things out. But it’s been quite a while since I’ve heard of anyone trying to do a statistical analysis of word use like that.

P: Yeah.

E: Anyway, moving on, to talk about your book more specifically. You explain in the book that the R programming language has become the de facto programming language for data science. Can you explain a little bit about what R is, and why that’s happened?

P: Yeah sure, I’ll try. R is a language that was started in the early 90’s. It was created by two statisticians from New Zealand. Originally they wanted to create a statistical language that was free, and that could run on very lightweight computers — I think they were using old Macintoshes — and they wanted to use it to teach statistics. That was their goal. They didn’t have any grand aspirations at that time, I think. But, one of the issues — so, back then, open source software was still in some sense controversial. There were really no statistical packages of any quality that were free or open source. And so, you had to pay a lot of money to use these statistical packages to analyze any data. Unless you were at a big company or at a university, you didn’t really have access to this kind of stuff.

And so they put R up on the web in the later 90’s. And it was really one of the first open source statistical packages out there that you could really use to do serious data analysis. I found it just because I didn’t really want to pay the money for all these expensive packages. And so, I found it, and I started using it pretty early on actually. It’s a language that’s in some sense a clone of an earlier language called S+, which was one of the ones that cost a lot of money. A lot of people, including myself, had been trained on S+, and so it was easy to go over to R. It had a similar syntax — things like that. So that’s how I started out using it.

And I think very quickly, as many successful open source projects I think experience, a big community developed around it — all over the world, in Europe, the US and Latin America, Central America. A lot of people gathered around it, I think initially because it was free, and eventually I think because the community itself becomes a reason to use the package. People started building add-on packages that you could load up, and it became this thing where all of a sudden R was better at some things than a lot of the commercial packages. And I think from then there was no turning back.

I got involved somewhat early on in the history of the language, and saw it develop and become popular. And now it’s actually hard for me to comprehend sometimes, how popular it’s become. I always thought it would be this niche academic thing. But now it’s in business everywhere. There’s companies developed around selling and consulting in R, and there are a lot of data science companies using it for analysis. Its capabilities have just really expanded in so many different directions. One of the things that makes it great is that it has this ability to be very customizable. Anyone can implement a procedure that they want to use to analyze their data. It’s a very flexible and powerful programming language, and it has a great community behind it, an enormous community now, to support and to learn new things.

E: Speaking about size of community, I noticed from the description of your book, that the Coursera course it’s partially based on has 1.5 million people who have participated in it. I’m wondering just if you could explain a little bit about that, or if I’ve got it slightly wrong?

P: The 1.5 million is not for the one course. We have a sequence of courses that they call a specialization on Coursera, and the sequence has nine courses. That’s our data science specialization. And it kind of follows the pipeline from, how do you get data, to how do you clean it, to how do you kind of analyze it? Then how do you make data products? So R programming is just 1 of the 9 courses. It’s a month-long course that runs every month. It typically gets on the order of 40,000 to 45,000 students enrolled per month.

E: Wow.

P: The other 8 courses are not all quite as popular as that. Across the 9 courses, they all run every month, we get about 170,000 people enrolled per month. And so we have a lot of students. There’s a lot of interest obviously in this area, and the R programming class is one of the more popular ones in the sequence.

E: Do you know if your students are coming from all around the world? Are they concentrated in certain areas?

P: They are coming from all over for sure. Less than half come from the US. Something like 30 to 40% come from the US, and then the rest come from all over. We get a lot of people from China, Brazil, India, the UK. So it’s kind of all over the map, yeah.

E: That’s amazing. Are they mostly people who are studying in a university somewhere, who get directed towards one of the nine courses, or towards the entire specialization?

P: The people who are studying in university are a big group. But they’re not the majority. A lot of the people that we get are working full time, and are kind of looking to upgrade their skills. To learn something new, and to maybe look to change careers or change positions in whatever they’re doing. I think our sequence is structured nicely for those kinds of people, because it’s a month long. It runs very frequently, so you can take it whenever you’re ready. It’s a lot of working professional type of people.

E: I imagine that the courses are taught mostly by video?

P: Yes, we have lecture videos. And then, depending on the nature of the course, we may have quizzes. My course has programming assignments that you have to complete, and they are graded by a unit testing framework. Some of the courses have projects where you have to do a data analysis, or create some software. So yeah, we have all kinds of things like that.
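[Editor’s note] The interview does not describe how the course’s grader works, so the following is only a generic sketch of grading a programming assignment with a unit testing framework. It uses Python’s standard `unittest` module, not the course’s own tooling, and the assignment (`column_mean`), the test class, and the `grade` helper are all hypothetical.

```python
import unittest


def column_mean(values):
    """Hypothetical student submission: mean of the values, skipping missing (None) entries."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no non-missing values")
    return sum(present) / len(present)


class GradeColumnMean(unittest.TestCase):
    """Checks that a grader might run against the submission."""

    def test_plain_values(self):
        self.assertAlmostEqual(column_mean([1, 2, 3]), 2.0)

    def test_missing_values_are_skipped(self):
        self.assertAlmostEqual(column_mean([1, None, 3]), 2.0)

    def test_all_missing_is_an_error(self):
        with self.assertRaises(ValueError):
            column_mean([None, None])


def grade(test_case):
    """Run every check in test_case and return (passed, total)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    result = unittest.TestResult()
    suite.run(result)
    failed = len(result.failures) + len(result.errors)
    return result.testsRun - failed, result.testsRun


passed, total = grade(GradeColumnMean)
print(f"{passed}/{total} checks passed")
```

Grading this way scales to tens of thousands of submissions a month because each check is run automatically and the score is simply the count of checks that pass.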

E: And is having accompanying books, is that a conventional thing for a Coursera course? Or is that something that you and your colleagues are innovators on?

P: That’s a good question. It’s hard to say what’s conventional, for something, for a phenomenon that’s like 3 years old.

E: Very good point.

P: But I don’t know, I don’t think it’s that common, unless you are someone who already had a textbook. But we felt like it was a natural thing to do. And I think we would have done it sooner had we learned about Leanpub sooner. We were looking, but we couldn’t quite find the right mechanism. We didn’t want to use a regular publisher, and I think the nature of the courses that we teach — it’s very low cost, it’s hopefully accessible. And you can take it for free, so it’s hopefully accessible to as many people as possible. We wanted to layer on something like a textbook, using a similar kind of model. And the traditional publisher really was not the way to do that.

E: No, no, fair enough! Do you use the ability to update your book quite a bit? Is that something you do once every couple of weeks?

P: No, well — initially I did update it a little bit. But the courses are fairly mature at this point, so they don’t change that much. And I wanted the book to match the course material somewhat closely, not exactly. My plan is that the books will evolve. So maybe not on a very frequent basis, but on a regular basis things will be added to the course, things will be added to the book, and so some things will be updated.

E: And what were some of the reasons you didn’t want to go with a traditional publisher?

P: Well, so I’ve done it before. I have another book through a traditional academic publisher. The bottom line is that they don’t really hit the right audiences, in my opinion. And also they — you have to charge a lot of money to make it worth your while. From an author’s point of view, you’re going to have to charge a lot of money to make it worth your while. It’s also a very slow process, because the publisher really doesn’t do anything for you. You have to do all the formatting, everything.

E: Oh really, you have to do the formatting as well?

P: Oh yeah, I mean for an academic book, unless you’re writing something that’s guaranteed to be a bestseller, you have to do everything. They do a little bit of marketing for you, and then they go to the conferences and stand at the exhibit booth for you. But there’s not much else that happens. And they do a little editing. So it’s a lot of work to go through, and then to have to sell it at such a high price. The number of people who are going to see this book is very limited from the get-go. I had that experience already with one of my other books. And so I was looking for something different, something that we could price low but still make it worth our while. And I think Leanpub just kind of hit all those points. And in addition, I think the authoring process I found really attractive.

E: Oh great.

P: Writing in Markdown, but still being able to do all the mathematics and the code and everything. It was just, it hit the right kind of balance I think.

E: So you were familiar with Markdown before you came to Leanpub?

P: Yeah, in fact we teach it in one of our courses.

E: Oh, great. Just talking shop a little bit — can you tell us how you found out about Leanpub? Was it just kind of searching around for a publishing platform?

P: Yeah, so I actually heard about it from one of my colleagues. My colleague, Brian Caffo, who teaches the specialization with me. He’s one of these guys that he’s like — his brain is kind of connected to the Internet. So he’s always aware of what the latest things are. And I think he found it, and he published a book, it’s called, “Statistical Inference.” And he just raved about it, so to the point where I said, “Okay, if I don’t do this myself, I’m just going to have to keep listening to him talk about it.” So I just signed up and started the book. And once I just got going, I realized, this is just like — it feels like, I don’t know, it kind of feels like Leanpub has just hit every pain point that I had about publishing, like simultaneously — I don’t know, like maybe you guys are living in my bedroom or something — figured out every problem that I had with the publishing process, you just solved it. And so, it was just a weird coincidence I think.

E: Well that’s very nice to say, and I’m very glad to hear that. I mean, Leanpub’s been around for a couple of years already, and customer development has been really important to us. So a lot of what you’re seeing in Leanpub is other people like you who’ve been kicking the tires for quite some time, and giving us feedback. And it is one of the pleasures of working with people who are doing something serious and sustained, like writing a book — is that, they like to give you feedback, and they like to write. And they like to analyze things. So I’m really glad to hear that, because if you find something in Leanpub that you’re like, “Oh my God, I can’t believe that was there, but that’s exactly what I needed,” that’s probably because someone just like you was there at some point when it didn’t exist, and was like, “You know what would be really great, would be if we had this.”

On that note actually, I would like to know if there’s anything you think that we could do to improve? Or if there was anything you saw that was missing? If you could have your one wish feature built for you, what would that be?

P: There probably is something, but I can’t — it’s one of those things where like, when someone asks you, you don’t remember. Right now it’s really quite good for me. And I think actually, it’s quite good for academic publishing. If you’re writing, if you’re a different kind of writer, I don’t know how good it is for you. But for people like me, who are doing academic publishing, I think it’s just the right tool and it’s just the right model for that style. Unfortunately I don’t have my wish list in front of me.

E: That’s okay, that’s okay. If you ever think of anything, please get in touch.

P: Yeah, but I think — I really am serious though when I say that you really hit all the major points. And so I think you’re at least 90% of the way there, so there’s another 10%, we’ll figure it out.

E: Well thanks very much for that actually. I do have just one more question about academic publishing specifically. It’s something we’ve been thinking about for quite some time now. And I was wondering, one of the big questions about academic publishing is that people are often looking — I mean, if they’re tenure track, but they don’t have tenure yet, that’s kind of the most important promotion point. And there’s often very specific, in fact even calculated methods for saying what the value is of getting a publication in a certain journal. I don’t know how much it’s like this in the States, but definitely in the UK. They have this thing called the “Research Assessment Exercise,” which actually kind of quantifies your contribution to the field. And often this is based on rankings of journals or university presses for example. And so getting a monograph published with The Oxford University Press or something like that is worth more than one from somewhere else. I’m curious about what you might think about that when it comes to academic publishing in the future. Do you think this is something that’s going to change, where for example, if you published an academic book on Leanpub, it’s hard to know how it would fit in with that ranking, where people are looking for quantified professional development?

P: Yeah, I think that’s a short-term issue. So people today may have an issue, may have a problem, because we’re in transition. Ebooks are still kind of new. But I think in a couple of years, it won’t even come up. And the idea that you’re self-publishing in a way, or whatever, is not such a big deal, because I think with books in particular, the publishing process is not like when you’re writing a journal article, which is peer reviewed. With books, there are peer reviewers, but you have much more control, and it’s much more your thing. And so, it’s much more of a personal statement when you write a book, I think, than if you write a journal article, a research article.

I think with books, what it comes down to is not so much like, “Oh is this publisher good or not?” It’s more about — well it’s a really big commitment of time to write a good book. And if you’re a junior professor, you’re looking to get promoted — you’re going to think, “Oh, what’s the trade-off here? I could spend this amount of time to write this book, or I could spend the same amount of time to write two research articles.” Because there’s a huge commitment of time, and there’s a trade off: “I’ve got to do one or the other, I can’t do both.”

And I think one thing that’s nice about something like Leanpub, and a lot of these other tools out there — is that, it really decreases the amount of time that’s not spent just directly producing content. Because time is the one resource that is the most important resource. If you can minimize the amount of time doing things that are really not that important, like emailing back and forth with the publisher or whatever, and just really focus on writing content and writing your book — Again, that’s one of the beauties of Markdown, right? You’re just focused on writing the content. I think that is a major plus. And that’s what I tell people now too. The tools are developed such that you don’t have to waste time figuring out, how do you format things correctly, or how to get things — how to produce things. You just focus on writing. And I think that’s the kind of thing that I would worry about most in terms of the time trade-off for writing a book versus not writing a book. I think it’s not so much an issue, like, “Oh, should I go with this publisher or that publisher?”

E: Thanks very much for that. I really appreciate you giving us your time today. Unless you have any questions for me, I’d just like to say thanks for being on the Lean Publishing Podcast, and for being a Leanpub author.

P: Well thanks for having me, I’m really enjoying it.

E: Thanks.

This interview has been edited for conciseness and clarity.

– Posted by Len Epp


Originally published at leanpub.com.
