The following interview is from a series featuring faculty involved in natural language processing education. Johns Hopkins University Professor of Computer Science Jason Eisner chatted with us about courses he’s taught, which include Natural Language Processing (601.465/665) and Machine Learning: Linguistic and Sequence Modeling (601.765). Jason was previously interviewed by JHU’s Center for Educational Resources in February 2012 about teaching, and he’s also posted his “Statement of Teaching Accomplishments and Goals” on his webpage. He won JHU’s Robert B. Pond, Sr. Excellence in Teaching Award in 2005 and Alumni Association Excellence in Teaching Award in 2013. Jason’s research aims to formalize and model linguistic structure, and his work uses statistics and computer science to address areas including phonology, syntax, morphology, and semantics.
Q: How many times have you taught your NLP class at this point?
Almost every fall, I guess 17 times. Even when I was on sabbatical, I moved it to the following spring.
Q: On your teaching statement you say that you add and drop certain topics every year. How have you decided what to change over the past 17 years?
Actually, a fair number of things have stayed on. Some things have just evolved. For example, take modeling grammaticality, which is lecture two on the syllabus. Originally I did that as a blackboard lecture, and then eventually decided it would go faster if I had slides. Also, originally I didn’t teach log-linear modeling, but now I have a whole online exercise for it, including an interactive visualization.
Q: At what tipping point do you decide to add new information to your class?
When it no longer feels valuable to have an NLP class without it. At the moment, I still don’t have much by way of deep learning, which is sort of increasingly not viable, but I do have a second course now.
Q: Why does your new course focus on sequence labeling?
So, I’m working on course notes for that class at the moment. I kind of improvised the class. For every lecture, I figured out before the lecture what I was going to talk about, but I didn’t have slides for the most part. I used an electronic whiteboard. I recorded those lectures and have screen captures of the whiteboard, and that’s enabling me to turn it into notes. Next year, there’ll probably be slides as well as stuff to read.
So, why sequence labeling? It turns out that if you think about a state machine, it’s a good unifying formalism for quite a lot of algorithmic stuff in NLP, such as dynamic programming, reinforcement learning, and RNNs. The question is, what is your state space, is it finite or infinite, is it like a trie where it just keeps branching, or is there re-entrancy where things come back together, in which case you get some advantage from dynamic programming. What I wanted was to show that lots of methods could be tied together with a common notation as being ways of mixing and matching various tricks. Again, the field can seem very big to somebody on the outside, but to somebody who’s been doing this stuff for a while, it feels like there’s a few ideas that keep coming back.
I was thinking some years ago, like 10 years ago, of writing a quirky textbook called “Thirteen Ways of Looking at an HMM,” which is an allusion to the Wallace Stevens poem, “Thirteen Ways of Looking at a Blackbird.” The idea is that you could take an HMM as a jumping off point for graphical models, as a jumping off point for grammars by way of regular expressions, or as a jumping off point for canonical dynamic programming. So, every chapter would be like, “Okay let’s start with this new thing you could do with HMMs. Oh and by the way, you know, this opens up this other piece of intellectual territory.”
Q: It seems like a lot of your class focuses on computer science and algorithmic ideas. What role does linguistics play in the classroom?
So the way that I describe things in the first lecture is that there’s this triangle whose corners are linguistics, statistics, and algorithms. I’m trying to give them a sense of how those things are connected, so sometimes we’ll take a more linguistic perspective. In fact, my first assignment is grammar writing. Did you find the competitive grammar writing paper?
David: Actually I ran it in my classes this semester.
Jason: I’ll be running it Monday afternoon at the JSALT summer school.
David: Students were scratching their heads at first and then their insights kick in as they realize what kind of structure there is. The non-native speakers seem to do a little better with it at first.
Jason: That’s very interesting. Why do you think that is?
David: I think the native speakers had a more implicit notion of the grammar, but the non-native speakers are able to externalize the structures they learned in courses.
Jason: Well yeah, they probably have had some kind of formal grammar that the native speakers might not have had when they were learning English, such as direct objects. I did find that the non-native speakers and maybe some of the native speakers were sort of uncertain about whether they were getting the correct terminology, and I had to tell them, “You can make up your own terminology.”
David: It was good to have a mixture of both in each group. There was a lot of insight from that.
Jason: My second class is more of a machine learning class, and the kinds of things that I want them to understand are things such as the relation between generative and conditional models and different methods for inference. Language is the source of many of the examples, but that class is less linguistics and more machine learning, more designing deep learning systems and so forth.
Q: We were looking at the assignments for your NLP course, which were amazing. What’s the ratio that you aim to have for programming versus math?
The homeworks are not very explicit about that. You may have noticed that I refactored them to be a homework followed by a reading handout. Some students in the class have suggested I should go even farther, and make it into a one-page homework where they just get something working, and then the rest is hints. Students do find the homeworks to be excessively long and detailed. If we move more to a leaderboard-based activity, then as long as the program runs and can be autograded, the students wouldn’t have to do it my way if they didn’t want to. On the other hand, I kind of want them to learn particular things and not just get something running using their own special sauce. The nice thing about the leaderboard is that students have freedom and they can go look stuff up, and I do something like this on homework four where they write a probabilistic parser. I’m like, okay, you have to get it running and it also has to run moderately fast, and here are a bunch of things that you could put in it and you can pick. Maybe we should do more of that. I could have open-ended assignments for things like language modeling and say, okay, here are ten things you could try, and many students would probably at least read these suggestions. However, there is a question about whether it would be fair for me then to examine them on, like, trick number eight which nobody tried.
David: I tried doing this for a simple word2vec model in my class and put a bunch of optional things that students could try out. Students ended up converging to what’s the easiest thing or what has the cleanest instructions.
Jason: Well, it is possible of course to grade people differently depending on how well they do so they get some benefit from trying something new. You can even credit them for diversity, whether it works or not. Students could get extra points if fewer people are doing that particular suggestion. I was working with Gradescope last year on adding leaderboard functionality, and I wanted an information board rather than a leaderboard so it’s less competitive.
David: We ended up using Kaggle in the class for the fall for doing this but I didn’t want to remove the competitive aspect for students who enjoy that.
Jason: The way we were doing it was you could submit your program as often as you like and you have major and minor version numbers. The leaderboard shows the last minor version of each major version number, if I remember right. If you’re trying different things then all of those versions show up and everybody can see what the speeds and accuracies are and you can put notes saying what you did. Some are just bug fixes, but if you’re trying genuinely different things such as converting it to Cython which speeds it up 10 times, everybody can see what worked. It’s not necessarily that you want to be at the top of the leaderboard; it’s that everybody can see how well your thing is doing given the choices that you made. Of course, some people are competitive and there’s some pride.
Q: How much bias from your own research comes into what you choose to teach?
In general the class is probabilistic modeling. Going back to the triangle I mentioned earlier, I want students to understand what the linguistic phenomena are so that they could write down probability models, which capture those linguistic insights, and they can design algorithms for doing inference and learning under those probability models. If it turns out that there’s no efficient way to do that, then they have two choices. You could either make inference approximate if you can’t get exact inference, or simplify your model so you can still do exact inference. I talk about this some in the NLP class, although there’s not a lot of approximate inference on these assignments. There is some in the follow-up course.
I try to give students the idea that I’m trying to turn linguistics into a probability distribution, which also motivates my research. What I want is a grand probability distribution that ties all of linguistics together and allows us to replicate the kind of reasoning that linguists do, or that humans do when they are subconsciously doing linguistics. That means understanding the phenomenon of language, understanding how to do mathematical modeling, and being able to do algorithms.
Q: If you had one more week is there a particular topic you would want to cover? Does the second, new course cover topics that spill over from the first course?
The new course is covering a bunch of stuff that historically I’ve had to teach students one-by-one on my whiteboard when they need to know it. For the first course, there probably should be some more neural stuff in it, though I don’t know what I’d remove. The other thing I’d add is approximation algorithms — the furthest that we get toward that is pruning — and things like iterative deepening, which are similar, and it’s where you have something like A*, but your heuristic is not admissible. In the second course I do things like mean field methods, and I’ll probably phase in belief propagation as well. Matt Gormley and I taught our tutorial on that at ACL a couple of times so we’ve got all the materials.
Q: You start your first class talking about n-grams and language models, which is a common way for NLP classes to begin. Why?
The first class is just about ambiguity, like, “Wow, there’s more going on under the surface in language than I would have realized because I just speak it without thinking twice.” My class is actually a little unconventional in this regard. My class starts out with ambiguity, and then says go home and write a grammar. The way I think about the class is that we’re starting with a tree structure. I think n-grams are kind of boring. I want people to be thinking about language in a way which accords more with how linguists think about it. When they think about the structure of sentences, they should be thinking about more than n-grams, and n-grams are a cheap trick.
Q: How do the students respond to having to write a grammar on the first day?
Well they get a couple of weeks to do it. They enjoy it and they learn a lot. As a programming assignment, it’s easy since they just have to sample from the grammar. For homework one on designing CFGs, at some point I post on Piazza something about alias sampling, which is a beautiful algorithm.
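Editorial sketch of what "sampling from the grammar" can look like. The toy grammar, its weights, and the function name `sample` are assumptions made up for illustration; they are not taken from the course materials. The sketch draws each expansion with Python's `random.choices` rather than the alias method mentioned above.

```python
import random

# Hypothetical toy grammar: nonterminal -> list of (weight, right-hand side).
# Any symbol that is not a key of GRAMMAR is treated as a terminal word.
GRAMMAR = {
    "S": [(1.0, ["NP", "VP"])],
    "NP": [(2.0, ["Det", "N"]), (1.0, ["NP", "PP"])],
    "VP": [(2.0, ["V", "NP"]), (1.0, ["VP", "PP"])],
    "PP": [(1.0, ["P", "NP"])],
    "Det": [(1.0, ["the"]), (1.0, ["a"])],
    "N": [(1.0, ["linguist"]), (1.0, ["grammar"]), (1.0, ["parser"])],
    "V": [(1.0, ["wrote"]), (1.0, ["sampled"])],
    "P": [(1.0, ["with"]), (1.0, ["near"])],
}


def sample(symbol, rng):
    """Expand `symbol` top-down and return the list of words it yields."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: emit the word itself
    weights, expansions = zip(*GRAMMAR[symbol])
    # Pick one right-hand side with probability proportional to its weight.
    rhs = rng.choices(expansions, weights=weights, k=1)[0]
    words = []
    for child in rhs:
        words.extend(sample(child, rng))
    return words


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(" ".join(sample("S", rng)))
```

The recursive rules carry lower weight than the non-recursive ones, so expansion terminates with probability 1; a grammar weighted the other way could keep growing.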
I have a ton of Piazza posts in this course. I reuse Piazza posts from year to year so they’re a back channel for teaching. Another back channel is office hours, which are right after class. Students who want to stay can do so, and often it just kind of continues as an impromptu lecture. People have questions and I’ll go off into more detail in ways that connect much more closely to modern research or my own research, and so I’m filling in a lot of stuff like, “Okay, here’s the real scoop on what happens,” in office hours, as well as answering questions which are not curiosity questions.
Q: One comment from your students is that your assignments are often real-world problems rather than toy problems. Could you give an example of a particularly challenging assignment to design that you think students learned a lot from?
I don’t know whether I agree that they are real-world problems, but I think what they mean is that there’s a lot of moving parts to the assignment. A real-world problem would be to take something that hasn’t been done before, such as, look at all tweets about my restaurant chain and figure out which branches need management attention or something. That’s real world in the sense that there’s a downstream customer. There’s a lot of papers that you see at ACL, NAACL, and EMNLP that are real-world problems. I have thought sometimes it would be good to devote some sessions near the end of the course when I start getting into applied things, where I’d say, “Okay, here’s an applied problem, and let’s brainstorm good ways to solve it.”
Q: Do you think your classes are orienting students more towards fundamentals or software engineering?
It’s definitely more fundamental. Something we verge onto in the first course and we hit pretty much at the beginning of the second course is the CRF-CFG (Conditional Random Field Context Free Grammar). This is a probabilistic model, you can do dynamic programming on it, and the dynamic programming is interesting because it’s not just like Viterbi. Training involves marginalization, it’s an example of a log linear model with complicated features, and you can also do arbitrary feature design. You can apply an LSTM to a CRF-CFG, and then there’s neural nets helping you extract features, which is pretty much state-of-the-art at this point. CRF-CFGs bring together almost all of the key ideas in NLP: there’s grammars and annotation, there’s dynamic programming, there’s feature design, there’s neural nets. You can’t do unsupervised learning on it because it’s a CRF, but if you had incomplete or unlabeled trees, then you could do unsupervised learning.
Q: NLP is a rapidly changing field. How do you manage to prepare students for the present but also to become leaders in the field for the future?
I’m trying to do that by focusing on fundamentals rather than tasks. I’m not sure that this is the right approach because it’s possible that by working on tasks, you learn something about design, but what I’m trying to do is give people the design elements. I’m figuring that if they can read other people’s papers, then they can understand how to put those elements together, but if they don’t have the fundamentals they won’t be able to read people’s papers. I do have some focus on things like evaluation, which is something that you need to understand to read people’s papers, such as what does this measure, what is a train, dev, test split, what are hyperparameters, and what does it mean to tune the smoothing method. Understanding what smoothing is for, what dynamic programming is for, and what evaluation is for is maybe more important than a particular way of doing it. I haven’t spent too much time looking at other people’s courses, but I do have some sense that a lot of courses are like, “Okay, today we’re going to build one of these, then next week we’re going to build one of those,” and in general the field seems very task-oriented to me these days. I think it’s not necessarily healthy because it’s not emphasizing the common thinking underlying all of those tasks. Things are coming back together now to some extent because of neural nets being a universal solvent. I think it’s easier now for people to move from one task to another.
Q: What sort of advice would you give someone structuring their own NLP course for the first time?
I start with the notion of grammars with the grammar writing assignment, with the idea being that there is structure to language and it’s structure that is to some extent accessible to either introspection or exploratory data analysis. So, it’s something that you might be able to get on a rational basis, and there’s this whole field of linguistics that’s trying to define that structure. That structure might be useful across many tasks, such as compositional semantics, machine translation, and so on. I like to get this intuition out there at the beginning. The only reason that we then go back to n-grams (and I keep apologizing for doing n-grams) is to say this is the simplest thing that we could do, these are the probability axioms, and P is a function. Technically speaking it’s a conditional measure, but P is a function, and you can write down any function as long as it satisfies these axioms. I emphasize that this function is supposed to be capturing your understanding of the domain, or the objects in the outcome space. What are the possible outcomes including all the latent variables? What conditional independence assumptions are you making? You get to design that pretty much freely. In fact, there’s a general way of doing it, namely a Gibbs distribution or Boltzmann distribution where you just take any scoring, any function for scoring outcomes, exponentiate and renormalize and look, there’s your probability distribution. So, you’re in the driver’s seat.
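Editorial sketch of the "exponentiate and renormalize" recipe over a small finite outcome space. The outcomes, the scoring function, and the name `gibbs_distribution` are hypothetical, chosen only to illustrate the idea; real outcome spaces in NLP are usually far too large to enumerate like this.

```python
import math


def gibbs_distribution(outcomes, score):
    """Turn any real-valued scoring function into a probability distribution."""
    scores = [score(x) for x in outcomes]
    top = max(scores)  # subtracting the max avoids overflow; the ratios are unchanged
    weights = [math.exp(s - top) for s in scores]  # exponentiate
    z = sum(weights)                               # normalizing constant
    return {x: w / z for x, w in zip(outcomes, weights)}  # renormalize


# Hypothetical outcome space and hand-written scoring function.
outcomes = ["the cat sleeps", "cat the sleeps", "sleeps cat the"]


def score(sentence):
    words = sentence.split()
    # Reward a sentence-initial determiner and a sentence-final verb.
    return 2.0 * (words[0] == "the") + 1.0 * (words[-1] == "sleeps")


p = gibbs_distribution(outcomes, score)

if __name__ == "__main__":
    for sentence, prob in p.items():
        print(f"{prob:.3f}  {sentence}")
```

Changing the scoring function changes the distribution but never breaks the axioms: the probabilities are positive and sum to one by construction.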
I want people to understand that their job is to do mathematical modeling of something that really has regularity and patterns. I want them to understand that first, and I want them to understand probability as a tool for automating reasoning. I channel E.T. Jaynes at them, and he’s a great proponent of Bayesian statistics. His magnum opus is called Probability Theory: The Logic of Science, and it was published posthumously in 2003. When it came out, I was looking through it and I was like, “Oh yeah, this is the book I need to recommend to everybody because this is how I think too.” I have a question on an assignment that gives an example of some logical reasoning and asks the student to then turn it into probabilistic reasoning, which is just a generalization of logical reasoning. Basically, Jaynes’s argument, which I think is completely correct, is that any defensible reasoning mechanism is essentially probabilistic reasoning and that includes things like abduction as well as deduction. Even when I’m teaching them n-grams I show them multiple ways of building the model. One could have latent parts of speech and dependence on position in the sentence, and the other could have words as units, and another could be a trigram character model. It’s showing them that there’s different ways of thinking about this stuff, and these models have different independence assumptions and different pros and cons. What I’m trying to teach them is you should think twice about how you’re modeling. There’s a lot of stuff going on, and you’re going to decide what you capture.
When I do unsupervised learning, I initially teach forward-backward. I show students a forward-backward spreadsheet in “the ice-cream lesson,” which shows that depending on how you start, the model hits different local optima. Each optimum of a particular model can reveal something useful and shows the tradeoff between the number of states and the need for more data to fit the parameters. This kind of qualitative thinking is something that I really want to put across: thinking about how decisions you make when you’re modeling are going to affect your bias or variance and your runtime. There’s a lot of questions on my past exams which are essentially qualitative thinking, such as designing a solution for some new problem or figuring out how to fix somebody’s broken solution.
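Editorial sketch of the forward-backward computation on a two-state HMM in the spirit of the ice-cream setting. All parameter values and names here are assumptions invented for illustration, not the numbers from the spreadsheet. The sketch covers only the posterior computation (the E-step); re-estimating the parameters and repeating from different starting points, which is where the local optima appear, is left out.

```python
# Hypothetical parameters: hidden weather (H = hot, C = cold) and the
# observed number of ice creams eaten per day (1, 2, or 3).
STATES = ["H", "C"]
START = {"H": 0.5, "C": 0.5}
TRANS = {"H": {"H": 0.8, "C": 0.2}, "C": {"H": 0.2, "C": 0.8}}
EMIT = {"H": {1: 0.1, 2: 0.2, 3: 0.7}, "C": {1: 0.7, 2: 0.2, 3: 0.1}}


def forward_backward(obs):
    """Return per-day posteriors over states and the sequence likelihood."""
    n = len(obs)
    alpha = [dict() for _ in range(n)]  # alpha[t][s] = p(obs[0..t], state_t = s)
    beta = [dict() for _ in range(n)]   # beta[t][s]  = p(obs[t+1..] | state_t = s)

    # Forward pass.
    for s in STATES:
        alpha[0][s] = START[s] * EMIT[s][obs[0]]
    for t in range(1, n):
        for s in STATES:
            alpha[t][s] = EMIT[s][obs[t]] * sum(
                alpha[t - 1][r] * TRANS[r][s] for r in STATES
            )

    # Backward pass.
    for s in STATES:
        beta[n - 1][s] = 1.0
    for t in range(n - 2, -1, -1):
        for s in STATES:
            beta[t][s] = sum(
                TRANS[s][r] * EMIT[r][obs[t + 1]] * beta[t + 1][r] for r in STATES
            )

    z = sum(alpha[n - 1][s] for s in STATES)  # likelihood of the whole sequence
    posteriors = [
        {s: alpha[t][s] * beta[t][s] / z for s in STATES} for t in range(n)
    ]
    return posteriors, z


if __name__ == "__main__":
    post, z = forward_backward([3, 3, 1, 1])
    for day, dist in enumerate(post, start=1):
        print(day, {s: round(v, 3) for s, v in dist.items()})
```

This version multiplies raw probabilities, which is fine for a handful of days; longer sequences would need log-space arithmetic or rescaling to avoid underflow.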
David: I’m trying to do this on my exams too. They’re fun to write but they can be challenging for students.
Jason: It is challenging. The means on my exams, which may also reflect how we grade, are generally around 60%, with people ranging from about 30 to 85. Some of that is just as we develop the rubric, we try to get a reasonable spread, and so the median is around half the points. People are clearly finding these qualitative questions hard, but they also learn something from the exam, and I use past exam questions as practice problems. Every week we have a discussion session, and what we’re doing there is breaking into little groups, solving past exam problems, and then discussing them.
Q: Thanks for talking with us today. Any last thoughts on teaching NLP?
We have another course in the department called Deep Learning. We have several deep learning courses and NLP isn’t even one of them, although for a while NLP was the only machine learning course in the department. This kind of shaped the construction of the class; I was teaching probability axioms and bias-variance tradeoff and things like that because nobody else was. We got a machine learning course starting in 2009 or something, and now we have five or six courses with machine learning in the title as well as a lot of courses in the applied math and stats department that people take. Now there are a lot of undergrads as well as grad students who are loading up on machine learning. I also wanted to teach linguistics from the beginning. I think one reason frankly that I’ve won teaching awards is that I’m awakening computer scientists to a whole new area that they find to be intellectually fascinating and never realized could be treated with the tools of computer science and mathematics. It’s just opening new doors to them, whereas if my class were just a straight deep learning engineering class, they would feel like they could cast some new spells, but not like they’ve been surprised and delighted in the same way.