An Interview with Brian Caffo, Author of Statistical inference for data science

Published Nov 24, 2015 by Len Epp

Brian Caffo is a professor in the Department of Biostatistics at Johns Hopkins University’s Bloomberg School of Public Health. In this interview he talks with Leanpub cofounder Len Epp about how he first became interested in biostatistics, why it’s such an important and growing field, and about his research interests and initiatives.

This interview was recorded on July 21, 2015.

The full audio for the interview is here. You can subscribe to this podcast in iTunes or add the following podcast URL directly: http://leanpub.com/podcast.xml.

Len Epp: Hi, I’m Len Epp from Leanpub, and in this Lean Publishing Podcast, I’ll be interviewing Brian Caffo. Brian is a professor at the Department of Biostatistics at the Bloomberg School of Public Health at Johns Hopkins University, and director of the graduate program at JHU Biostatistics.

Brian works in the fields of computational statistics and neuroinformatics, and is a co-founder of the SMART Working Group at Johns Hopkins Biostatistics, which specializes in medical, and especially neurological, imaging and biosignals, such as polysomnography and wearable computing. In 2011, he was among the recipients of the Presidential Early Career Award for Scientists and Engineers, and the first statistician to receive such an award. Brian has also received the Bloomberg School of Public Health, Golden Apple, and Amtra Teaching Awards.

Brian is the author of two Leanpub books, Statistical inference for data science and Regression Models for Data Science in R [note: since we conducted this interview, Brian has published two more books]. Each book offers a brief but rigorous treatment of statistical inference and regression models respectively, and is intended for practicing data scientists. Both books are companions to classes offered as part of the Data Science Specialization on Coursera, a ten-course program offered by Brian and his colleagues at Johns Hopkins, Jeff Leek and Roger Peng.

In this interview, we’re going to talk about Brian’s professional interests — his books, his experiences using Leanpub, and ways we can improve Leanpub for him and other authors.

So thank you, Brian, for being on the Lean Publishing Podcast.

Brian Caffo: Thank you. It’s great to talk with you and meet you.

E: Thanks. So, I usually like to start these interviews by asking people for their origin story, to learn how they got to where they are in their careers, and how they developed their interests. So I’m wondering, specifically, how you first became interested in biostatistics and why you decided to pursue a career in academia?

C: Well, the long version of this story is, I was actually a swimmer in college. I wasn’t a terribly good swimmer, but I was on a big team. I was at the University of Florida, which has a great swimming program. And I was an art major at the time, and I was spending so much time training that I didn’t have a ton of time to actually put into being an art major — which is a surprisingly difficult major, especially in terms of time. And I had an aptitude for mathematics. So I kept taking math classes, maybe a little bit lower level than I needed to, but then just kept incrementally doing it to fill out my hours, so that I could have some classes to be able to manage swimming and trying to do the art major.

Finally, after doing this for long enough, I talked to a guidance counselor, and they said, “You know, we typically don’t get too many art majors who are taking differential equations, and linear algebra, and these sorts of subjects.” And she said, “You seem to be actually doing better in those than you are doing in your art classes. You can always do it in your spare time.” So, from there I switched over to become a math major. And from mathematics, I spent some time working with the Children’s Oncology Group, which was then centered in Gainesville, at the University of Florida, where I was at.

From there I just really fell in love with working with data, and the kind of computing and mathematics that goes along with statistics. Just staying in academics for me, was in a lot of ways a no-brainer. I really loved the things that I was doing, and I loved the kind of research that I was doing. I was very fortunate to get a position here at Hopkins, where I have such amazing access to great data, great researchers, great medical research. So that’s the long version of my origin story.

E: Okay, thanks — that’s very good. It’s an interesting path. I was wondering if you could explain some of the reasons you co-founded the SMART Working Group, and what the purpose of the group is?

C: Yeah, so originally this was co-founded with a collaborator here, Ciprian Crainiceanu. We had a lot of similar interests, in terms of how we approach modeling and statistics. We were getting a lot of people coming to us, talking to us about some new kind of measurement that they were collecting. In my case, it was mostly brain imaging measurements. You might think that there’s only a couple of ways that you can measure the brain, and that idea couldn’t possibly be more wrong. Just even with one type of scanner, a magnetic resonance imaging scanner, there are so many different ways you can tweak an MRI scanner to give you different kinds of signals in the brain, that you can barely count them.

At any rate, both in terms of brain imaging, but also things like sleep studies and polysomnograms, as well as other kinds of wearable computing things, we were constantly getting people coming to us and talking to us about, “How do I analyze this kind of data?” Especially because we’re here at a school of public health, we’re here at a medical institution, people wanted to relate these measurements to disease. They wanted to create preventions and prognoses. Being at a school of public health, people wanted to relate it to large populations. And they didn’t know how to do it. It involves a lot of computing, mathematics, statistics. And so we noticed a common thread of some biological signals, or biosignals, and we founded the group out of it. Initially it was a “group” with air quotes around it, and then after enough faculty joined in, and enough students joined in, we started having alumni from the group, and postdocs and things like that. It’s now become a rather large entity. When we have a full group meeting — which we don’t have that often anymore because it’s gotten so unwieldy — maybe 35 or 40 people will show up.

E: Wow, that’s great. Can you explain a little bit about what a biosignal is, and maybe give an example of one?

C: A biosignal, basically, is any biological or medical signal that is used to create a diagnosis, or to create a measurement that is then used for research purposes. That’s a very broad definition. An important class of biosignals that we don’t really delve into too much in our work is the field of computational genomics and high throughput bioinformatics. So we don’t do too much of that. There’s a lot of different interesting kinds of measurements that go on there, but we broadly classify it. We did it that way because around here, and around Hopkins, there were large developed bioinformatics groups, because there was so much excitement over the sequencing of the human genome. But a lot of these other technological revolutions were getting left behind, in terms of analysis skills. So we sort of lumped them all together. And so yeah, I agree that things like biosignals are kind of a vague term. But now that we’ve gotten big enough, we maybe need to make them more precise, and define ourselves a little better.

E: Oh, no, fair enough. I was just wondering. So, for example it’s like something people might be familiar with, like rapid eye movement? Does that count as a biosignal? Or like measuring eye movements?

C: Yeah, so there you’re talking about in sleep. Sleep is typically measured — if you get a rigorous sleep study, that’s called a polysomnogram. The collection of biosignals that they would collect in that case would be an electroencephalogram — they’d put electrodes on your head; that collects brain activity. They have things like a myogram, that they would put for example on your chest — that would detect breathing, and that’s detecting motion a little bit. They might have some motion sensor that they’re putting on your leg for restless leg syndrome. They might have something that measures oxygen that they would put on your finger, and they might put an EKG on. In most of the sleep studies we were looking at, they were very interested in cardiac outcomes associated with different kinds of sleep disorders, and so they might have an EKG on.

So, specifically when you talk about REM, what you’re talking about is the collection of measurements. For a diagnostic sleep study, they have things that are going to measure things about your breathing. REM is a specific classification of a sleep state. And that arises from subsets of these biosignals, that they use to then classify different kinds of deep sleep — different stages of sleep in REM. That’s an important aspect of the measurement. Usually they have to take them and pass them through a human to get the staging like that.

As an example, we have several papers on analyzing the percent of the time that you spend in REM over the course of an evening; they call things like that “sleep architecture”. So we often think about how, if you’re tired for several days straight, you might think you can catch up on sleep or something like that. That relates to things like sleep efficiency, and sleep architecture, and the sorts of things that we get out of those signals. So, going from those signals, to these measurements, and relating them to diseases in the population is really what we try to do.
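
[note: the following is a minimal sketch of the kind of summary Brian describes above, not code from his papers. It assumes a hypnogram stored as one stage label per fixed-length epoch; the labels, the function name `sleep_summary`, and the example night are hypothetical. It also assumes REM percent is taken as a share of sleep epochs and sleep efficiency as the share of recorded epochs spent asleep, which are common conventions but may differ from the definitions used in any particular study.]

```python
from collections import Counter

# Hypothetical stage labels for illustration: "W" is wake, the rest are sleep.
SLEEP_STAGES = {"N1", "N2", "N3", "REM"}


def sleep_summary(hypnogram):
    """Summarize a hypnogram given as one stage label per fixed-length epoch."""
    if not hypnogram:
        raise ValueError("hypnogram must contain at least one epoch")
    counts = Counter(hypnogram)
    asleep = sum(counts[stage] for stage in SLEEP_STAGES)
    # Assumption: REM percent is REM epochs as a share of all sleep epochs.
    rem_percent = 100.0 * counts["REM"] / asleep if asleep else 0.0
    # Assumption: sleep efficiency is sleep epochs as a share of recorded epochs.
    efficiency = 100.0 * asleep / len(hypnogram)
    return {"rem_percent": rem_percent, "sleep_efficiency_percent": efficiency}


# A made-up ten-epoch night, purely to show the call.
night = ["W", "W", "N1", "N2", "N2", "N3", "N3", "REM", "REM", "W"]
summary = sleep_summary(night)
```

[note: in this made-up night, 7 of the 10 epochs are sleep and 2 of those 7 are REM, so the summary reports the REM share of sleep and the share of the recording spent asleep. Real sleep architecture analyses work from scored polysomnograms and relate these summaries to health outcomes.]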

E: And you’ve done work with signals that come directly from the brain, where you’ve implanted electrodes directly on the brain, as well. Is that correct?

C: Yeah, the one kind of measurement like that that I’ve dealt with is called electrocorticography. That’s something they do for people with pretty severe epilepsy, where different medications and other kinds of treatments have failed, and they’re left with nothing, other than brain surgery. They’d saw off the top of their head, and as a measure to help inform the surgery, they’d place this electrode sheet directly on the cortex. Then, usually there’s some time between the early parts of the surgery, and then the actual brain surgery part, so people often let some amount of experimentation go on — in terms of maybe playing sounds, or having them do things, and then recording the brain activity while that’s going on.

So yeah, in some rare cases you can actually collect human measurements from otherwise healthy humans who have some severe disorder, like epilepsy, where the measurements are directly implanted on the brain. I don’t personally do this, but there’s also a lot of people around here who study mice, and monkeys, and other things where they actually implant the electrodes. So these aren’t implanted, they just rest on top of the cortex. There are other people who do things — there’s a fascinating field called machine-brain interface, where people are implanting electrodes in monkey brains, and they’re getting the monkeys to feed themselves with a robotic arm — just the robotic arm being controlled with the electrodes directly implanted in the brain. There’s tons of neat areas where there’s actually a direct implantation of the electrodes, but that only occurs in these more invasive things that people do on animals. I almost exclusively work on human data.

E: I’ve read also that the SMART Working Group uses brain imaging for prediction, and I was wondering if you could explain a little bit about what brain imaging is, and how it can be used to predict behaviors?

C: So, the kind of brain imaging that I work on is called functional magnetic resonance imaging. In that, you don’t get a static image, you get a dynamic image that represents, hopefully, brain activity–localized brain activity–over time. There’s a lot of different ways that people might use both this kind of measurement, and other kinds of brain measurements for prediction.

As an example, a colleague that I’m working with right now wants to use brain activity as measured by fMRI, plus some structural measurements in people who are in comas, to try and predict when they’ll come out of it, or if they’ll come out of it and the prognosis. So that’s an example of using brain imaging as a biomarker to predict some outcome.

A colleague of mine in the SMART Working Group — someone we managed to successfully recruit to the university — an extremely well-known fMRI researcher, Martin Lindquist, works on actually trying to predict what’s going on in your head, with the information from the scanner, at that moment. In particular, he works on pain. So he tries to predict, they have people in the scanner, and they actually deliver pain to them by a hot plate or something that’s resting on their wrist. It actually stings a little bit, and he tries to predict how hot it was on the plate, just based exactly on the brain signal and things like that.

That has implications for trying to understand how we can get a better measurement of pain, right? When people just say, “Oh something hurts,” a physician doesn’t know what that means, but if they can calibrate it…. So, he’s working toward the idea of actual prediction of pain. That’s another way that you could use these kinds of measurements for prediction — and there’s quite a few.

I tend to more focus on the public health-y type prediction-type things, where we try to predict whether or not a person has a disease; whether or not they’ll come out of the coma is another example — these sorts of things, where the image is just a collection, a part of the measurement. A really big one that everyone’s working on right now is trying to predict who will get Alzheimer’s disease, the reason being that, if you can detect Alzheimer’s disease early, then you have a much better chance of being able to develop an effective treatment.

E: Generally on the subject of gathering data, I was wondering if you had any comments about the impact that wearable computing is going to have on public health generally, and perhaps on your field specifically?

C: It’s going to be huge; it’s going to be huge. The SMART Group does a lot of wearable computing work. Personally, I don’t do too much. But my colleague Ciprian who co-founded the group with me, along with some others in the group, have really dove into wearable computing in a big way.

There’s so many different types. When you think of wearable computing, people tend to think of things like Fitbit and stuff like that. But for research purposes, there’s a million different kinds of sensors and measurements that people can take, that are now small enough and portable enough, and it is just amazing.

I think it’s going to revolutionize public health, in terms of our ability to get accurate measurements for lots of different things. The key bottleneck is having enough people who know how to analyze this stuff, where our ability to collect data is just so vastly outstripping our ability to analyze it. So we see that actually the bigger problem is not the development of the sensors and stuff like that, because lots of smart people are working on that, and they’re developing great stuff; but then so much data gets produced, and the real bottleneck now is people to analyze the data. So I would say to anyone who’s an aspiring young machine learner, or computer statistician, or biostatistician — or anything like that, it’s a great field to get into.

E: I was going to ask on that subject — generally speaking, statistics seems to be sort of enjoying a cultural moment. With the popularity of sports and election statistics, most commonly associated in North America with people like Nate Silver, I was wondering if you think that this specifically is going to inspire more people to get into things like data science? And if data literacy generally will improve, as we go forward — say if we’re all wearing Fitbits, or things like that?

C: Well, I think this cultural revolution for sure is helping. Moneyball, the book, is a great example — yeah like Nate Silver, that cultural revolution is great, and is really going to help out our field in this closely-related field.

But I think the bigger impetus for drawing people into the field is the demand for jobs in the field. I think the fact that there are so — it’s one of the relatively few sectors where there’s an enormous amount of job growth, and that there’s way more demand than supply. There’s no apparent harvesting that’s happening. There’s no leveling off that seems to be eventually the case. And new data oriented companies seem to be popping up every day, and giant companies — all your Googles, and Facebooks, and Twitters, and everyone, they’re ostensibly data companies at some level.

And so, I think the major draw for people into this field will be the fact that it is going to be one of the principal jobs of the future. Which, when I got into it, when I was a lowly art major trying to figure out what to do — there weren’t all the different kinds of options. It’s interesting now, we see our students — the amount of options that our students have now is truly remarkable. Some of them go into finance, some of them go into technology and move off to Silicon Valley — and some of them stay in academics. Some of them do biostatistics, some of them go to mathematics. The number of options they have now is remarkable.

E: I read on your website that you have a particular interest in large scale open access education. And I know, of course, that you’ve been successful with the data specialization course on Coursera. I was wondering what inspired your interest in open access education, and what plans you might have going forward?

C: Well, Leanpub fits really, really well into our vision, and I’ll get to that in a second.

So, initially it was really just kind of fortuitous. I had wanted to flip my classroom, which is the process where students watch videos of the lecture at home. During the class period, they actually get more of my time and the TA’s time, actually doing problems. There’s a lot of work so far that’s showing that that’s a more effective way to teach people, and that the old sage on the stage lecture model is not the optimal way to do things.

So, when I contacted some people to do some recording in our school, they mentioned that we had just struck a deal with this open access open education company called Coursera and asked whether I’d like to be one of the people on the launch. And so I agreed. I was really enthusiastic about it, and I happily went and talked to Roger and Jeff, who are my two colleagues here, who were very interested in it as well. I think my class was okay; their classes were just blockbusters.

From then on, our interest was really piqued, and for a variety of reasons, one being, this idea of delivering low cost or free education is very appealing to people in academics. So, I think that the books, Leanpub in particular, has really helped us in terms of really fitting into that model. Our Coursera model for all of our courses is: everything’s free. The lecture notes are all posted on GitHub, and you can see the full development process.

The videos are all free, both on Coursera, and you can get them off of YouTube from most of us as well. And it just makes sense too, that the textbook — if there is a textbook that existed for these classes — that that should also have a free option, or a variable — or something that, some new way for doing the pricing so that it conformed to this new model. And it conforms great, Leanpub for a textbook, especially — especially because we can give the students edition updates, and things like that, and a lot of things that people would complain about in university textbooks, just get all solved all at once.

So that fit actually pretty well with our open education mission. I don’t know specifically how I got interested in it, other than the series of events. In that it kind of always kind of fell well within my kind of personal ethic. And I think the same with Roger and Jeff. I would also mention that our school was a very early pioneer for open education. Well before Coursera, and before Khan Academy, and things like that — there was MIT Open Courseware, and our school was a participant in MIT Open Courseware. I was in on some of those meetings where they were deciding to do it, and I was super enthusiastic about it at that time, as well. This was quite a while ago. So I think the School of Public Health here has really been on the vanguard of open education, and being part of that culture kind of seeps in as well.

E: It’s really interesting, I know there are some voices out there that respond relatively conservatively to the idea of open education. And in particular, they’ll invoke the possibility that it might be a competitor or a threat to conventional university education, and I was wondering what your response might be to that criticism?

C: For sure, well it’s possible it might be a threat to certain financial models, for certain departments, for certain topics. But by and large, the question is whether or not these are coming up with new markets, or they’re poaching existing markets. I think the vast majority of the ways in which the students take these classes are new markets. They might be university students — but, a student might sign up for a machine learning class that they may not have taken at their university otherwise. I think the majority of the student engagement at Coursera, edX, Udemy, these other sites, is probably new users.

But for sure there is a certain amount of poaching that also has to occur. That student that was going to take that elective class elects to just take it on Coursera, or something like that. I’m sure to some extent that is happening a little bit. But it’s not all that dissimilar from the correspondence courses, and other historic attempts over time to broaden access to education in different ways for people who have different circumstances.

I think a lot of that focus on how online education is going to disrupt universities is a laser-beamed focus on 18 to 22 year olds in undergraduate education. But the people who take these classes exist well into their work life, well beyond their university life. And so I think it’s a lot more complex than that. I think there is a certain amount of disruption that’s occurring because of it. But I think that it’s — in a lot of ways, good disruption.

I think that one way which the kind of disruption that you’re talking about might occur, is any place that really has a revenue model where they use large introductory classes as a revenue generating component — with adjuncts to generate the revenue that they use for other things. If they don’t have a very diverse kind of financial model, that might get disrupted. But even that, I don’t see that much. People like in-person classes, so I just see how much — I do think most of it is new users and new content. But I am a university professor at some level and dependent on brick-and-mortar learning. Not on some level, on exactly every level.

E: On the subject of generating revenue and textbooks in academia, and also journal articles, I was wondering how you see academic publishing evolving say over the next few years, and if there’s going to be a shift perhaps along the lines of what’s happening towards open access education with video tutorials and things like that?

C: I feel a little bit more confident that something like Leanpub is going to disrupt traditional book publishing, than I am about what will happen with traditional universities with respect to online teaching. Because our experience with Leanpub has just shown that if you have your own channels to get your book out there, then it is really an ideal circumstance.

If you’re willing to publish something purely as an ebook, or mostly as an ebook, then you can do all sorts of interesting things with embedded video. In my stat inference book, not all, but most of the homework problems have links to YouTube videos that actually give the solutions, with me working them out; I have a little tablet here that I hand write out the solutions on as I record myself doing them. That kind of disruption seems pretty inevitable to me at this point.

For journal publishing, I’m less and less certain about it. I don’t know a lot about the journal publishing business. I guess I don’t really know much about the book publishing business either. From my experience as an author, I can’t even imagine contacting a traditional publisher at this point. But for journal publishing, that’s — I mean I still submit my best articles to what I think are the best journals that’ll take them. And I don’t know about disruption… There’s a lot of discussion in academics right now about disruption of academic publishing, and there’s a lot of new models that are coming out. I must say many of them are really impressive — Frontiers is an example of one that I’ve worked with, that I think is quite impressive. And PLOS is another example that’s quite impressive. But it’s less clear in my mind how that will shake out.

E: I know that in both of your books on Leanpub, you mention that people are invited to send you errata with pull requests on GitHub and things like that, and I was wondering if engaging directly with people who’ve already bought your book is important to you? And if there’s anything we could do at Leanpub to help you engage better with your readers online?

C: I like the fact that if I decide to add an extra section or chapter or new set of problems or something like that, that Leanpub allows me to contact everyone who’s bought the book, and tell them. So, sometimes I’ll republish the book, and I’ll send out an email to everyone that just says, “I’m republishing the book, this is my minor stuff, don’t bother reloading, re-downloading and putting on your devices for this.” But then sometimes I’ll put a whole new section or chapter in, and I’ll be like, “This is probably worth re-downloading if you’re still actively engaged in the book.”

So the ability to contact people, I think, is quite useful. And people who have bought the inference book might want to get the regression one. So the ability to email them out and say to do that is great. So yeah, I think that’s a nice aspect of having direct access to the customer.

And the GitHub integration… So, the students submit pull requests on GitHub, then I’ll correct that error. So it’s not like they’re emailing me errata, they’re submitting it as a pull request. I’ll accept the pull requests. I have one of the switches on GitHub that — whenever I check something in, it automatically re-compiles it on Leanpub. So they’ll submit a pull request, I’ll accept the pull request when I recommit the repository, repush the repository, then it automatically gets recompiled on Leanpub.

Then I have to republish the book, which I’ll only do every now and then when it’s big enough, a big enough set of changes. But that’s actually really great, because it’s not like I’m getting a ton of emails. I’m getting and working through on GitHub, which is fantastic.

E: And specifically with respect to you emailing readers, we actually do a kind of, a sort of double blind, where we don’t reveal the reader’s email address to the author, or the author’s email address to the reader.

C: That’s right, yeah.

E: We’re kind of a middleman. Do you see that as a good thing, or does it bother you that you can’t see the emails — the actual email addresses unless they opt into that?

C: I think it’s sort of irrelevant to me. I mean if I had a list of emails, I wouldn’t do anything with it. If I actually had to manage the emails, that would be kind of an annoyance. So it’s kind of useful. But then also as a — from the consumer side, it seems like a nice protection of their identity and their information.

E: Okay.

C: So that seems pretty reasonable.

E: Okay.

C: And I don’t have any need for their email otherwise. So yeah, it seems like the right way to do it.

E: Okay, great. I was wondering if there’s anything specific that you think we could improve in your workflow, or for the way you like to write, and to contact readers, for example, if there’s anything specifically that’s come up?

C: Well so, for the data science specialization, we actually wrote all the lecture notes in Markdown. So the ability to convert the lecture notes into kind of a… well, I think of the stuff I’m doing as more kind of like lecture text, right? So it’s sort of like lecture text, it’s highly connected to the class, right? It’s sort of a standalone book, but it’s mostly connected to the class. But the ability to kind of create that kind of entity in Markdown — because with Leanpub, the authoring is in Markdown — it was very seamless. And that helped a lot.

One instance where I think things could be improved is if you could author directly in LaTeX. I noticed that Leanpub apparently does Markdown and then probably does Pandoc to create the LaTeX file, which is then — I’m pretty confident I’m seeing in the log file, standard LaTeX compiling. So if there’s some way you could actually author in LaTeX, for the maths, science, computer science, stat crowd. LaTeX is the lingua franca of the community. People already know it. That’s one thing.

The other thing would be access to the log files after it gets compiled. That would be very useful. Because when it comes in an email — that’s just a timing thing.

I think another useful thing would be an offline compiler, or something that was close enough to where you could compile it offline. So a combination of like a Pandoc — a collection of Pandoc commands or something like that. That said, “Oh yeah, take this, do this to your Markdown file” — and this will approximate pretty well whether or not you’ll get an error. I think that — so that you can kind of write and rapidly debug.

I’ve noticed the only time I ever get an error while compiling is with equations. That’s it. So for certain subjects, that’s no problem. But for like — for inference it was a little — that part was a little hard. Fortunately I already knew the math was already typeset without errors for LaTeX to begin with. So that was helpful. But for writing a de novo book, something that would allow you kind of a quicker ability to debug errors in the math — math typesetting, that would be helpful.

E: Okay well thanks very much for that. We’re actually going to be working on an author app, and hopefully a lot of that will be incorporated into it when we do that. Because we want to make it easier for people to work on their own, on their own machines.

C: Sure, and the other thing — I’ve never tried it, but another solution in my particular case would just be to convert it to EPUB myself, and then upload it as an EPUB file.

E: Oh okay.

C: I noticed you guys just accept an EPUB file by itself.

E: Yes that’s correct.

C: Yeah, so I could author it in LaTeX, Pandoc it to an EPUB file, and then just submit the EPUB file. I’ve never tried that solution.
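
[note: the following is a minimal sketch of the LaTeX-to-EPUB route Brian mentions, which he says he has never tried. It is not Leanpub's build pipeline. It assumes Pandoc is installed and on the PATH; the file names and the helper names `pandoc_command` and `convert` are hypothetical. Pandoc infers the input and output formats from the file extensions.]

```python
import shutil
import subprocess


def pandoc_command(source, target):
    """Build a pandoc invocation; formats are inferred from the file extensions."""
    return ["pandoc", source, "-o", target]


def convert(source, target):
    """Run pandoc if it is installed; return True only if the conversion succeeded."""
    if shutil.which("pandoc") is None:
        # Assumption: pandoc is expected on the PATH; report failure if it is not.
        return False
    result = subprocess.run(
        pandoc_command(source, target), capture_output=True, text=True
    )
    if result.returncode != 0:
        # Show pandoc's own error output so problems can be debugged locally.
        print(result.stderr)
    return result.returncode == 0


# Hypothetical file names: a LaTeX manuscript converted to an EPUB file.
command = pandoc_command("book.tex", "book.epub")
```

[note: calling `convert("book.tex", "book.epub")` would run the command above. The same helper with a Markdown source and a PDF target, for example `convert("chapter.md", "chapter.pdf")`, sends the math through a local LaTeX engine, which may give a rough offline check for equation errors. That is only an approximation and may not match the errors Leanpub's own compiler reports.]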

E: I just have one last question. Are you working on any other books right now that you plan on publishing?

C: I don’t know how often you look in people’s accounts, I’ve got about seven queued up. I teach three classes as part of the specialization. I’ve done two of them — inference and regression. I’ve got to finish up regression. I think on the slider of Leanpub, it says maybe 70%, I’d like to get it up to 100%. And then, after I’ve finished with regression, I teach a third class in specialization called, “Developing Data Products.” I’d like to finish that one. Because it’s a computer-oriented book, that will be really ideal for the Leanpub authoring and setting. And then after that, I have two Coursera classes that are coming out that I hope to create.

My new strategy is to write the Leanpub book, then record the videos, and then release the class. The reason being is that the Leanpub book will almost then serve as a script for the videos and then for the class. So we’re thinking of that in terms of the workflow.

E: That’s just really interesting. I’m so glad to hear that you’re using Leanpub that way. I mean it’s one of the things we always hoped it would be used for.

I’d just like to say, before we go, thank you very much Brian for being on the Lean Publishing Podcast and for being a Leanpub author.

C: Thank you. We’ve had a blast. It’s been really fun working with the platform, and I really — I’m a big proselytizer for Leanpub at this point. Our experience has been great, so thank you guys for creating such a great product.

E: Thanks very much.

This interview has been edited for conciseness and clarity.

– Posted by Len Epp


Originally published at leanpub.com.