NLP Pedagogy Interview: Dan Jurafsky (Stanford)
(This interview is part of a series of interviews on the pedagogy of NLP)
The first interview in our series is with Dan Jurafsky, Professor of Linguistics and Computer Science at Stanford University, whose research resides at the intersection of NLP, linguistics, and the social sciences. Dan was a fitting starting point for this blog series because Lucy’s journey in NLP began when she was a sophomore in his class, CS 124/LING 180: From Languages to Information. This course acts as the entry to a suite of upper-level classes: CS 224N (NLP with deep learning), 224W (social networks), 224U (natural language understanding), 276 (information retrieval), 224S (speech and dialog processing), and 246 (data mining). Currently, it is a flipped classroom, where some of the lectures are watched outside of class and in-class time is dedicated to activities. According to archives of the Stanford Bulletin, Dan has been teaching the intro NLP course at Stanford since the 2004–2005 school year. He is also known for co-authoring the popular textbook Speech and Language Processing with James H. Martin, whose third edition is currently in development.
Q: We’re going to start off by talking about your sophomore-level class, CS 124: From Languages to Information. Why did you choose the flipped classroom setup for your class?
Dan: I was convinced first by Daphne Koller and then by Andrew Ng, and then I started reading the education literature, especially results from physicists, that it’s just a better way to teach. It’s partly that they convinced me that I should do it, and partly laziness and easiness. Chris Manning and I had done the NLP MOOC, and once we had those lectures, it was obvious that we could just use some of them for CS 124, but we also used CS 124 to pilot some of the MOOC lectures.
Q: How has OpenCourseWare changed your approach to teaching? Do you think there’s a long-term impact of having your material online for future students?
Dan: Currently I give 8 of the 20 lectures live, and the rest are online on EdX, and then I use the rest of the in-class time slots for group exercises and labs, designed to add conceptual understanding.
I see two big benefits. One is that the flipped class forces me to think about each topic in terms of 8 minute chunks, with a clear learning goal. This really helps structure what I want students to take away. Another is that it’s made me think really hard about active learning for the in-classroom part: what are the conceptual things they need to know, and how can I get them doing conceptual thinking in class in teams, which will then get them to learn those things. The flipped class in that way is good.
The minus is that, since preparing, recording, and editing a single recorded lecture takes 20–30 hours, it’s easier to just let old material stay in the course, and just tell students “oh, ignore that last bit of that video for now,” which is probably bad. The 8 lectures I give live are material that is either completely new, or has changed a lot over the years, and those 8 get updated most every year.
This year I’m working on redesigning two of the live lectures and replacing one of the recorded lectures, with the goal of getting deep learning and embeddings earlier in our curriculum.
Certainly having lectures on the web has had a great impact, I get mail all the time from students who found the lectures online and learned from them.
Q: You use your textbook as the basis for your class, but there are more topics covered in the textbook than you can cover in ten weeks. How do you pick which topics to cover?
Dan: Our class isn’t just an NLP class because of our weird setup at Stanford. It’s the intro to the grad classes that cover NLP, but also acts as our intro to social networks class, to IR, to recommendation systems. Each of those topics has a different textbook, and right now I use my textbook for the NLP parts and draw on other textbooks for the other parts.
It was Chris Manning’s original idea to create my course to draw people in both to Stanford’s information/data science track and to Stanford’s AI/NLP track. I think that was a really successful, interesting idea, but it’s a little bit specific to Stanford. If I were just building an undergrad NLP class, I wouldn’t do collaborative filtering, I might not do all of IR, and I certainly wouldn’t do social networks. I could see the current setup being applicable in other places (like an Information Science school), or you might just have separate undergrad classes for each topic.
Q: Do you include your own research in what you choose to teach?
Dan: Lately, not much at all. Except, I try to do one lecture on NLP for Social Good, and then I usually ask my postdocs and students to present their work there. The big exception is that obviously I have particular tastes in presentation that come from textbook writing, so I certainly use my textbook chapters. My perspective from the textbook definitely comes through in class, but my research papers rarely do because it’s an undergrad class.
Q: The deep learning NLP course at Stanford, CS 224D, and the graduate NLP course, CS 224N, merged in the last two years. Has this impacted your planning for CS 124?
Dan: In general, the field has changed so an NLP course has to have deep learning! Stanford currently has no general undergraduate AI course, instead we have 3 separate courses: vision, language, and robotics, and right now the students don’t get to deep learning until grad school which is just crazy.
But also I really want the course to be accessible to my target audience: sophomores and juniors. So I’m working on rebuilding the course. This summer I’m working on writing the deep learning chapters of the textbook, so that in the fall and winter I can write CS 124 lectures on deep learning that rely on the chapters. The current plan is to try to do this next winter: add 3 deep learning lectures plus an embedding lecture, then replace the spell-checking homework with logistic regression and the QA homework with a deep learning version, probably just feed-forward networks, and save recurrent nets for the grad course.
The hard part has been that we don’t have GPUs for all the students, and I don’t want to be begging companies for GPUs every year. Also, I don’t want students to just spend the whole quarter doing hyperparameter tuning; that’s more appropriate for a machine learning grad course. I want them to understand the intuition for classifiers in deep learning, so it’s figuring out a homework that’s doable and fun, hopefully one in which deep learning is actually better than logistic regression. It turns out that if you can’t use GPUs, logistic regression is better than most deep learning things. The homework has to be something where they don’t go, “Hey, how come deep learning works worse than regression? Why don’t you give us more GPUs so we could work better?”
Anyhow, we’ll see how that goes over the next six months!
Q: Are there any major differences between what you want to teach and what the students want to learn?
Dan: Not so far, I think the major current problem is that deep learning needs to be in the course. Maybe in a perfect world I could have gotten the new chapters and lectures done in time for last year’s course!
Q: Other than your current plans to introduce deep learning into the existing content, if you could extend the class to cover an extra NLP topic in CS 124, what would it be?
Dan: With infinite time, I would love to put in at least some of the core NLP stuff: part of speech tagging, named entity recognition, parsing, and MT. That’s four new topics, though I don’t know what order I’d put them in. Everybody likes MT because it’s fun and you get to look at languages, so if I could only do one of them I’d probably do MT. If I could do two, I would do part of speech tagging and named entity tagging to help build an understanding of some fundamentals of words and word groups. Then, if I had room, I’d add parsing.
Q: Looking across lots of syllabi for NLP classes, n-grams and regular expressions are a very common way to start an NLP class, including your class. Do you have any sense of why?
Dan: Well, I suspect it was natural for people to teach regular expressions first because it was the first thing in the textbook! And we put it first originally because it was a natural lead into finite state automata, and in those days finite state automata were a big part of NLP; people don’t teach them as often these days, but I suspect they may return! Both Chris Manning and I like the Ken Church UNIX tools, and the day we spend in class on UNIX tools like grep and regular expressions was maybe the single most practical thing about language that students could take away. That tutorial day was incredibly useful to their later careers, so regular expressions have just locked themselves in. Also, the fact is, dialogue systems are still mostly ELIZA plus slots and fillers, more regular expressions, so it’s a valuable industry tool.
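Those UNIX one-liners (e.g. the classic `tr … | sort | uniq -c` word-count idiom from Church’s tutorials) translate directly into a few lines of Python. A toy sketch, with invented text, of the same regex-tokenize-and-count pattern:

```python
import re
from collections import Counter

# Toy sketch (text invented) of the Unix-style "tokenize, count,
# sort by frequency" pipeline, using a crude regex tokenizer.
text = "The cat sat on the mat. The cat slept."
tokens = re.findall(r"[a-z']+", text.lower())  # lowercase word tokens
counts = Counter(tokens)
print(counts.most_common(2))  # -> [('the', 3), ('cat', 2)]
```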
We started with n-grams too because they’re a great, simple way to teach students probability theory: they’re very intuitive about counting and dividing, and so is naive Bayes. So we use them to get people completely rock-solid in probabilities, to the point where they deeply and intuitively understand them, and then you can go straight from there to neural language modeling.
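That counting-and-dividing intuition fits in a few lines. A minimal sketch (corpus invented for illustration, not course material) of the maximum-likelihood bigram estimate:

```python
from collections import Counter

# Toy corpus, invented for illustration.
corpus = "i am sam sam i am i do not like green eggs".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    """Maximum-likelihood P(w2 | w1): count the bigram, divide by the unigram."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_bigram("am", "i"))  # count("i am") / count("i") = 2/3
```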
Q: Do you think that language modeling is still an important thing to teach even though you have these neural methods at this point?
Dan: Great question! I’ve been thinking about this a lot because all the research — including our own — is focusing on neural language models (LMs), which are way more powerful. However, for many tasks n-gram LMs are nonetheless better than neural LMs. Neural LMs are way better with the same amount of data, but it’s very slow to train huge neural LMs, while you can learn huge old-fashioned n-grams. So huge old-fashioned n-grams end up being what people still use in big systems. They’re something that’s not taught in a (non-NLP) machine learning course so it’s kind of unique to language.
Bottom line, I think now, yes, I would still do language modeling, but I wouldn’t do the advanced smoothing stuff: just do Stupid Backoff and skip all the Kneser–Ney and Good–Turing. Will people stop using n-gram LMs in a few years, perhaps once neural language models become fast enough? Maybe. In that case, that chapter may go away and I’ll have to figure out how to reorder things. Maybe use naive Bayes for probability and then go straight to neural language modeling? The problem is that even for naive Bayes for text classification, bigrams are still a really useful feature. Having seen language models, students are used to thinking about bigrams and trigrams, so language modeling teaches them the idea of two- and three-word chunks. I think the answer is, I’ll make its role in the curriculum shorter and shorter but still do it.
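Stupid Backoff itself is only a few lines. A minimal bigram-to-unigram sketch, with an invented toy corpus and the fixed 0.4 discount from the original Brants et al. formulation (note the scores are relative frequencies, not normalized probabilities):

```python
from collections import Counter

# Toy corpus, invented for illustration.
corpus = "the cat sat on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)
ALPHA = 0.4  # fixed discount, no tuning -- hence "stupid"

def sb_score(w2, w1):
    """Stupid Backoff: use the bigram if seen, else a discounted unigram score."""
    if bigrams[(w1, w2)] > 0:
        return bigrams[(w1, w2)] / unigrams[w1]
    return ALPHA * unigrams[w2] / N

print(sb_score("cat", "the"))  # seen bigram: 1/2 = 0.5
print(sb_score("mat", "cat"))  # unseen: backs off to 0.4 * count("mat")/N
```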
Q: When you’re designing your homeworks, how do you decide how much math and programming to involve?
Dan: The quizzes are math and the homeworks are programming. In CS 124, there’s a weekly quiz which is multiple-choice. The quizzes are for conceptual understanding and making sure you work through the math by hand. The programming homeworks are so that you know how to build tools like naive Bayes and language models; we want you to lock into your fingers the knowledge of how to build these.
Q: How has your perspective on designing this class changed since you started?
Dan: The class is now way bigger. When I started it was 20 people and now it’s 350. At the beginning, it was a little more NLP-ish and included the NLP stuff we didn’t do in Chris Manning’s grad class. Back then, Chris taught parsing, MT, and information extraction, so I did everything else. I covered lexical semantics, co-reference, discourse, and dialogue acts, and I had them build a chatbot. So, all the pieces that weren’t in the grad course were in the undergrad class. I made the coursework slightly easier for undergrads and there was a clearly different structure in our curriculum.
However, then I created a graduate-level Natural Language Understanding course, which took care of some of the missing content, and then I created a graduate-level Dialogue course. That meant CS 124’s role changed, and now it serves as an introduction to multiple topics outside of NLP, so we still have to talk about a little of everything, but we don’t want any of the homeworks to overlap exactly with the grad courses. It’s also the case that every time the grad courses change, our homeworks get impacted. For example, 3 of the courses have homeworks on embeddings now and there’s a little too much overlap. But embeddings are very central to everything, so that’s okay as long as you don’t have the exact same homeworks.
Q: How much do you think linguistics should play a part in an NLP class?
Dan: I definitely try to get some linguistics in there, partly because so often at the end of my class students are like, “I had no idea you could study language systematically” or “I didn’t know about gender or sentiment.” I end up getting a lot of Linguistics majors and a lot of Symbolic Systems majors who were originally going to be CS majors and had taken a bunch of systems-y courses. CS 124 ends up being the first course with some human stuff in it that they ever saw. For the AI students, it’s still the most human-centered of all the courses in Stanford’s AI curriculum. Through the course, my role definitely has been to increase the number of humanities and social science courses that CS students take after my course. For that reason, I cover linguistics when I can, especially sociolinguistics lately. In years when I have MT in the class, I also do a lot of typology and language variation and language differences in morphology because that mattered for MT. I try to get them to think systematically about language.
Q: What do you think is the right relationship, then, between an NLP course and a machine learning course?
Dan: This is a good question because there’s a lot of overlap between the two; in a lot of these topics, we use machine learning as a tool. Partly, that’s just a personal decision to make, but also that can change; in the past I left gradient descent to the machine-learning courses, but I’m now adding it to the textbook and so I’ll probably add it to my class.
In general, in my classes, I don’t do any proofs, and students don’t spend a lot of time building machine learning algorithms from scratch (SVMs, LSTMs). In my course, those are supposed to be tools, so you just have to understand them, but you’re not going to build all the pieces by hand.
Part of it depends on whether deep learning takes over all possible NLP algorithms so that there’s just one kind of machine learning for everything. That seems unlikely to happen, though. That’s what we thought would happen in the 1990s, and it didn’t happen then. It turns out that vision, language, and robotics just have their own parochial constraints and their own kind of biases.
David: Which machine learning was it in the 90s that everyone thought would take over? SVM?
Dan: Oh, in 1988, it was gonna be unsupervised clustering, or unsupervised learning. Everyone thought you would just induce linguistic structure completely unsupervised using EM. Everyone thought EM would take over the field, and there were these early papers proposing EM for learning part of speech tagging, and then it turned out that just having a tiny bit of training data helped. We learned that you could do better than completely unsupervised EM over enormous amounts of data if you had just a thousand labeled observations or something crazy like that, which you could label in an hour or two. Everything switched very quickly to supervised machine learning, and then all the research was focused on architectures and features, but the actual machine learning algorithms were just standard regression or SVMs, so there was nothing research-y to teach about how to build an SVM or a CRF. It was just applications of them and how to build the features.
Q: Many common NLP techniques are now pre-implemented in packages, so how much do students end up understanding the details of the techniques they learn about in the class if everything is already built for them?
Dan: It’s a mix. In this class, up til now I’ve required that students build everything, so they can’t use libraries; for example, they implement naive Bayes from scratch and play around with it, which gives them a really intuitive understanding of Bayesian thinking, priors, likelihoods, and so on. But I think as I’m adding deep learning, I’ll have to move to more library use, since you don’t have time in 10 weeks to have homeworks on the machine learning fundamentals of everything.
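A from-scratch naive Bayes of the kind students build here can be sketched in a screenful. A minimal multinomial version with log priors, log likelihoods, and add-one smoothing, over an invented four-document sentiment corpus:

```python
import math
from collections import Counter, defaultdict

# Tiny invented training set: (document, label) pairs.
train = [("great fun great", "pos"), ("boring plot", "neg"),
         ("fun fun film", "pos"), ("boring boring boring", "neg")]

class_docs = Counter(label for _, label in train)  # class prior counts
word_counts = defaultdict(Counter)                 # per-class word counts
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    """Pick the class maximizing log prior + add-one-smoothed log likelihoods."""
    scores = {}
    for label in class_docs:
        score = math.log(class_docs[label] / len(train))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("fun film"))  # -> pos
```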
Q: You’re working on the third edition of your textbook right now. How do you decide what topics to include in each edition?
Dan: Well, partly we always do a thing where we search for the syllabi of everyone who’s teaching the book and just see which chapters they require. You can see very quickly which things to drop. Like, no one taught the Chomsky hierarchy from our textbook — literally, I think one person in the world required our Chomsky hierarchy chapter as part of an NLP course. So we dropped it. You can tell right away what people are using, and everybody does parsing and everyone did n-grams in the olden days. Now, obviously, everybody’s teaching deep learning, so they’re either using Yoav Goldberg’s book — which is really great — or people will use some combination of our book and Yoav’s book. That’s what told us we need to write the neural net chapters. However, for what to take out, you asked earlier about whether to teach n-grams. Is it time to get rid of n-grams? I’m not sure. In my case, no, but maybe I’ll shorten it yet again. Or for another example, if all NLP people are shifting to dependency parsing, do we still have to have constituency parsing? I’m still gonna put it in because people might choose one versus the other. Some labs will pick one approach, but probably what we really should be doing is a survey and see which chapters of the third edition are getting used. If it turns out that no one’s teaching word sense disambiguation or something, then maybe that goes away.
Q: Just projecting out to the future, what would the fourth edition include that isn’t yet in the third edition?
Dan: I don’t know; that’s a great question. The third edition won’t be done ’til next year and I still can’t decide on the current topics. For example, Jim is writing the sequence modeling chapter right now using LSTMs, but of course the way people build sequence models will change: maybe Attention Is All You Need is the answer, or maybe it will turn out we should have used dilated convolutions, or something else. So I’m not positive that the simplest, most general algorithm won’t be something else by next year. Sequence-to-sequence models are something that’s changed a lot over time, from HMMs, to MEMMs, to CRFs, to RNNs… Or maybe it’ll turn out that really simple feed-forward networks that just walk across the input will work better, because somebody might come up with some simplification that makes it do so.
Q: NLP is changing pretty fast. How do you make sure you prepare students for the near future as well as ten or twenty years from now?
Dan: You can’t do 20 years, but you can try. You try to teach the students big ideas, like training sets and test sets, supervised machine learning, looking at your data, and thinking about language. You hope that those things are general and will be there in ten years, but you have no idea.
Q: What advice would you give to someone designing a new graduate-level course in NLP?
Dan: Obviously a modern grad course will be deep-learning based. But you also need to decide which areas of NLP you really want to cover, and that’s a tough decision. Do you cover dialog, or put that in another course? Historically, dialog had different math (POMDPs in the olden days, deep reinforcement learning now). How much semantics do you do? Are you going to cover both lexical semantics (very natural now using embeddings) and formal semantics (very common now with semantic parsing)? You’ll also want to make sure to cover important areas even if the best algorithms are pre-neural; for now it’s still important to make sure the students know the non-neural baselines like n-grams and TF-IDF.
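For reference, the TF-IDF baseline mentioned here is itself only a few lines. A minimal sketch over invented toy documents, using the common tf × log(N/df) weighting (one of several standard variants):

```python
import math
from collections import Counter

# Invented toy corpus of three tiny "documents".
docs = [d.split() for d in ["the cat sat", "the dog sat", "the cat ran home"]]
N = len(docs)
df = Counter(w for doc in docs for w in set(doc))  # document frequency

def tfidf(doc):
    """Weight each term by raw term frequency times log(N / df)."""
    tf = Counter(doc)
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

weights = tfidf(docs[0])
print(weights["the"])  # in every document, so idf = log(3/3) = 0.0
print(round(weights["cat"], 3))  # in 2 of 3 documents: log(3/2), about 0.405
```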
The above interview has been edited for clarity.