Emily M. Bender

As the next installment in our series on NLP pedagogy, we interviewed University of Washington Linguistics Professor Emily M. Bender. She has taught many courses and seminars in the past two decades, including Introduction to Syntax for Computational Linguistics (Ling 566), Introduction to Computational Linguistics (Ling/CSE 472), and Knowledge Engineering for NLP (Ling 567) multiple times. Emily is also Faculty Director of the Computational Linguistics Master’s program and an Adjunct Professor in Computer Science and Engineering. She is a passionate advocate for the use of linguistics in NLP, having authored the book Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax and acting as LSA’s delegate to the ACL. Her research interests include computational semantics, multilingual grammar engineering, variation across and within languages, and ethics and NLP. She is also known for the “Bender Rule” that requires researchers to list which languages their paper or system is tested on. Earlier we had chatted with another faculty member at UW, Prof. Yejin Choi, who teaches NLP targeted towards CSE students. We hope that Emily’s Q&A can provide a complementary perspective to teaching different courses at the same institution. For those interested in resources she used for teaching ethics in NLP, there is an “Ethics in NLP” section under the Teaching section of the ACL Wiki.

Q: What differentiates UW’s intro NLP class from your intro CL class?

Emily: That’s actually something we’ve been working on together with Yejin and the other CS faculty. The class that nominally I own, though it’s actually often taught by a graduate student in linguistics, is the oldest one of these classes at the University of Washington. Introduction to Computational Linguistics was and still is cross-listed in Linguistics and Computer Science, and so when the NLP group grew in Computer Science, we had to figure out how to differentiate the classes so that ideally somebody could take both of them and get something out of it.

The question is, well, how do we do that? One big difference is in the prereqs. The NLP classes, as computer science classes, have a lot of computer science as prereqs, and we don’t want to do that in Linguistics because we want the class to be a way in. Someone coming through linguistics into computational linguistics ought to be able to take that class and decide, “Yes, that’s what I want to do. Now I’m gonna go get the programming skills that I need to go further with this.” At the same time it’s a cross-listed class, so it has to be something that brings interest to computer scientists and not be effectively an introduction to programming disguised as computational linguistics, because that’s not going to help anybody.

What we’ve been grappling with is bringing the linguistics to computational linguistics. I’m talking about it in terms of, what is the linguistic knowledge required here? We try to have every single assignment have questions that require you to know something about the structure of language, or require you to learn these things as well as the algorithms or formal structures.

One thing that we changed just this past year that I’m really excited about is the final project. When I first inherited the class, there was a final exam, and I thought, well, maybe that’s not so interesting. So I made, in my exuberance as a junior faculty member, a three-way choice: you could do an exam, a paper, or a project. Then I got tired of creating the exam, so it became paper or project, and then it just became project. The projects were very open-ended and it was meant to be this chance to explore and build something you’re interested in. Just about everybody did a project on language identification, which in some contexts is still an interesting problem, but in many other contexts, really isn’t. I got bored of reading reports on language ID, so we refocused the project on error analysis. I think this change was extremely valuable. Now, the project in that class is to go find somebody else’s system with its training data and its test data, train it up if need be, run it, and then take not only the metrics that it does on the test data but actually go through the test data by hand and do an error analysis. I think that error analysis is one of the places that linguistic knowledge really can inform NLP and not just computational linguistics. So, building the project around that brings a nice linguistic flavor to the class.
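As an illustration of the first step of such an error analysis, here is a minimal Python sketch. It is not taken from the course materials: the function names and the toy sentiment data are made up, and it assumes the system's predictions and the gold labels are available as parallel lists. It gathers the misclassified test items and tallies which confusions are most common, so that a person can then read through them by hand.

```python
from collections import Counter


def collect_errors(examples, gold, predicted):
    """Return (example, gold, predicted) triples where the system was wrong."""
    return [
        (x, g, p) for x, g, p in zip(examples, gold, predicted) if g != p
    ]


def confusion_counts(errors):
    """Count how often each (gold, predicted) confusion occurs."""
    return Counter((g, p) for _, g, p in errors)


# Made-up sentiment outputs, for illustration only.
examples = ["not bad at all", "great", "I expected more", "fine, I guess"]
gold = ["pos", "pos", "neg", "neg"]
predicted = ["neg", "pos", "pos", "pos"]

errors = collect_errors(examples, gold, predicted)
for (g, p), n in confusion_counts(errors).most_common():
    print(f"gold={g} predicted={p}: {n}")
```

The tally only says where to look. The linguistic part of the work is reading the collected examples and asking what they have in common, such as negation or hedging in the toy data above.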

Q: It seems like the majority of your students are in Linguistics. What’s the makeup of the background of students in your introductory class?

Emily: In that class, which is an undergraduate class, I’d say most of the students or maybe at least three-quarters have linguistics among their majors. There’ll be a lot of double majors too: we’ll get people double majoring in linguistics and CS. Some computer science students also come and take it, and there’ll also be people who are into just linguistics as well. Though the class has a bit of a reputation for being a challenge, we really try to keep it accessible.

Q: How has your class size changed over time?

Emily: That class has stayed fairly stable in the mid 20s, which is a bit surprising because everything that’s adjacent to AI is getting totally hammered. What has been explosive over time is the applications to our master’s program. In the last cycle for students starting in fall 2018, we had 187 applicants for what ended up being I think around 45 offers of admission, and probably 35 to 40 of those people are joining us. One of the things that’s been really interesting is that when we first started, our admissions process involved asking, can this applicant manage what’s going to happen in this program? Just yes or no, and we took everyone who was a yes and that was most of the people who applied. This is because in 2005 when our program started, hardly anyone knew about computational linguistics so if you knew enough about it to go apply for a master’s program, you probably had the background you needed. Even now, with 187 applications, it’s not that only 45 people were excited about the field and 142 people were not qualified for the program. It was more like 150 people who would have done just fine and then 37 more people who needed some more background. So, it’s really amazing the level of preparation that we’re seeing and the level of interest and excitement in the field.

(Editor’s note: This interview was conducted in 2018. Emily sent us an update for the 2019 admissions cycle: “For 2019, we had 231 applicants, and offered admission to about 65, of whom probably at least 50 will be joining us in the fall. And just like last year, it was an incredibly strong batch of applicants!”)

Q: Are your classes more oriented towards preparing students to apply CL in industry or research?

Emily: The undergraduate class is meant to be exposure to the field, answering questions such as: what is computational linguistics, what are the standard techniques used, what are the kinds of problems it addresses? There’s a big emphasis on evaluation and how do you know how well something is working. This distinguishes it from many other subfields of linguistics and from the non-machine learning bits of computer science. I’m not a computer scientist, but my impression is that the non-machine learning parts of computer science don’t have the same notion of training and testing. In a lot of computer programming, either it works or it doesn’t, rather than it works better than that other one. Oftentimes in computer science it’s about how fast does it go, not is it getting the right answer or what percentage of the time is it getting the right answer. Here we have a big emphasis on evaluation which is relevant for both research and industry.

The class is sort of prior to the research-industry distinction because it’s not directly preparing anyone for either of those things; it’s exposure to the field. Another thing that we’ve added to it recently, which I think is really important no matter how you’re going to use the knowledge, is a couple of sessions on ethics in natural language processing.

Q: How do you decide what changes or stays the same in your syllabi throughout the years?

Emily: Honestly some of that is just professors’ choice. It’s like, “You know what, I really don’t like doing the unit on text-to-speech and it’s always the hardest one for me to do because it’s far away from what I research. Something’s got to go. Sorry text-to-speech!” (For the record, text-to-speech is still in the introductory course; it was the first example that came to mind.) Definitely things like this happen, though sometimes it depends also on the textbook. We use Jurafsky & Martin as the main text and it’s a really valuable resource. But at least in the second edition, the morphology chapters are a mess, and so until I found something else that I liked for doing computational morphology, we skipped it because the material is not great. So a combination of factors like that comes into play.

This is sort of the privilege of teaching a survey course. It’s not Computational Linguistics 101, which is like, here’s the basic foundational things that you’re going to need to build on for everything else. It’s more like, here’s a sampling of what’s all out there, what are the commonalities that we can see, how does linguistic knowledge play into this, and what do you need to know about computers and computer programming to do this. You can illustrate those points with a variety of different specific tasks or levels of structure.

Q: Is there a difference between what your students want to learn and what you want to teach?

Emily: I certainly get people who are disappointed that they’re not learning how to do machine learning in the class, and I basically blame that on AI hype.

Q: What’s the ratio of programming to math to linguistics in your introductory CL class?

Emily: It’s always a tricky balancing act. For the most part, the programming is extremely scaffolded. We have a couple of assignments, including one called Eliza-like which is basically an assignment about regular expressions and string replacement. It has this Python script that does the ELIZA loop and the students set the regular expressions. It sort of gets your hands dirty, but in this very scaffolded fashion. The final project is where there’s more room to go do whatever coding you want to do. This was more true with the previous “go build something” kind of project and less true now with the “run someone else’s software and do an error analysis” project. In the introductory class, you have to be able to deal with computers, but it’s not really a programming heavy class. It’s way different in the master’s program.
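To give a sense of what an ELIZA-style exercise built on regular expressions and string replacement can look like, here is a minimal Python sketch. It is not the actual course script: the rules, the default reply, and the function names are all hypothetical. The loop is the fixed scaffold, and the list of pattern and template pairs is the part a student would fill in.

```python
import re

# Hypothetical rules for illustration; not the actual course assignment.
# Each rule pairs a pattern with a reply template that reuses group 1.
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), r"Why do you say you are \1?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), r"How long have you felt \1?"),
    (re.compile(r"\bmy (\w+)", re.IGNORECASE), r"Tell me more about your \1."),
]
DEFAULT_REPLY = "Please go on."


def respond(utterance: str) -> str:
    """Return the reply for the first rule whose pattern matches."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Fill the captured text into the reply template.
            return match.expand(template)
    return DEFAULT_REPLY


def eliza_loop() -> None:
    """Read lines from the user until they type 'quit'."""
    while True:
        line = input("> ")
        if line.strip().lower() == "quit":
            break
        print(respond(line))
```

Calling `eliza_loop()` starts an interactive session. Even this small version raises linguistic questions, for example that echoing "my job" back to the speaker would require swapping "my" for "your".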

Q: How is it different in the master’s program?

Emily: The master’s program is nine courses which students, because we’re on the quarter system here, can pack into one year plus a capstone project. In the nine courses there are two linguistics required courses and one linguistics elective, and none of that has programming. Then there’s a four course sequence in computational linguistics, and all of those have heavy hands-on programming. Then, the students have room for two more electives in computational linguistics or a related field and those also usually involve hands-on programming projects.

For the master’s program, we have, as a prereq for that course sequence and many of the electives, introductory coursework in computer science and probability and statistics. Usually this involves a two course programming sequence, the data structures and algorithms course, and probability and statistics for computer science. All that has to be under their belts before they can do the core courses. We really want to be open to people coming primarily from linguistics, because one of the big strengths of our program is that we get students working together and helping each other and you do need both kinds of knowledge. We didn’t want to say we’re only taking computer science majors, so we said this specific part of what you would get in a computer science major are the prereqs for our program.

For a while it was sort of touch-and-go to get people with that background, and later we posted the prereqs on our web page. We said, here’s some advice on how you can get the computer science prereqs if you are a current student in linguistics, how to get the linguistic prereqs if you’re in computer science, and what are the ways you can go get these if you’ve already graduated. Once we had that webpage up for two years, which is about the time it would take to fulfill those requirements, we started getting people who had that preparation being the majority of the applicant pool. If you tell people what they need to do, they can know to go do it.

Q: What makes teaching intro courses and seminars different for you?

Emily: In intro courses, there’s this responsibility to say, “Here’s the first steps into this field and I am going to guide you towards it.” You feel more responsibility for staying at least a step ahead of the students: “I know this and I can show you the way.” However, I like to tell people such as TAs or people who are starting to be TAs that there’s three magic words that can help you get out of any situation, which are “I don’t know.” Then you can follow up with, “What do you think?” or “I’ll go find out,” or “How would we find out?”. There’s a lot of modeling of scholarship that can happen there, and so I’m comfortable saying “I don’t know” in an intro class if I don’t know.

For the seminars, I tend to pick something that I want to know more about and collect the things that I would need to read to know more about it. Then I get these smart motivated people coming along and doing that with me, and it’s just wonderful. So, for a seminar it’s more about articulating the research questions, asking why is it that we’re reading these things, figuring out what do we hope to get out of them, and structuring the discussion. That can be a lot of fun.

My entrée into worrying about ethics and NLP was through a seminar just like that. Lesley Carmichael, who’s at Microsoft and had finished her PhD in this department, is on the advisory board for our master’s program. She said years ago, “You know, you really need to get ethics into the curriculum.” So as faculty director of the master’s program, I looked around. I tried to get someone to come give a guest lecture for us or come to our lab meeting and talk, and even though there are people at my university who specialize in that, I wasn’t able to connect with them. So I thought, “Okay, fine, I guess I’m gonna teach this.” So I spent about six months as a back-burner project collecting various things that have been written, and then I structured the seminar by putting them into thematic groups, and we just went week by week through them. It was far too much for any one person to read, though, so we did what I like to call “divide and share,” where pretty much every week everyone read two papers out of the full collection which may have been 20 for each week. I had a set of common questions that everyone was going into the reading with, and then we would have discussions where we talked about how each of the readings spoke to those questions. These were really valuable, wonderful discussions and at the end of the quarter on the student evaluations, I discovered there was a hidden benefit I hadn’t anticipated. One of the students said that they felt comfortable asking questions about the readings which they didn’t usually do, because they usually felt if they asked a question, it would show that they hadn’t read it or hadn’t understood it, but in this context no one was supposed to have read everything and so everyone could ask whatever questions they wanted. It went really really well.

David: Wow, we have a week on ethics in my course, but not a whole course on it.

Emily: I think it’s important to do both because if you only have a whole course on it, then people have to take that course instead of taking something else. If you can weave ethics into every course, you reach more people.

Q: Going back to course content, much of NLP has moved towards more deep learning approaches. What benefits do traditional models and methods still have?

Emily: I see linguistics as slower moving, and that’s a good thing compared to the sort of rapid churn that we see in machine learning in general and in the machine learning part of NLP. The advantage of working in my side of the field is that I don’t have to stay on top of arXiv. Yes, there are relevant papers that come up for me from time to time in arXiv, but I don’t have to worry about making sure I read everything that’s posted there so that my stuff is current. These questions of how does language work are not changing so fast. What is changing is how can we take knowledge of how language works and work it into machine learning NLP to improve it. Prior to deep learning, one of the main avenues was feature engineering. Your feature engineering was basically predicated on understanding what makes an interesting feature, what’s an interesting way to represent these strings of words so that the machine learning algorithm can grab a hold of what it needs to about them, and to go beyond n-grams. The whole deal about deep learning is, well, we don’t have to do feature engineering anymore.

So where does the linguistics fit in? I think it fits in a few places. One is in error analysis, and another is in task design. If we’re trying to learn about how well a machine can learn to do X, our representation of X in the form of a task has to be informed by people who know something about what X is. This is true not just for language, but for anything where you’re doing machine learning. There’s far too much work in machine learning where people don’t seem to be coming back and talking to the experts who know how these things work, and so they go ahead and set up these tasks that are toy tasks or utterly gameable.

At Anton van den Hengel’s ACL 2018 invited talk about deep learning and vision, there was this wonderful example with a series of pictures. There was a picture with two horses in it, and if you ask the visual QA system, “How many horses are there?”, it will say two. Then if you take the same picture and ask, “How many unicorns are there?”, it says two. Then, you photoshop in a third horse and you ask, “How many horses are there?” Two. It turns out that, according to his talk, the most frequent answer to any question with “How many?” is two and so the system wasn’t doing anything combining vision and language. It was just learning this pattern in that particular linguistic data. I have nothing to say really about vision, but on the language side of things, you absolutely want people involved in task design and error analysis understanding the training and test data who can spot things like that and say, well this isn’t really showing what you think it’s showing because the computer is just effectively cheating in this way.
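One way to check for this kind of shortcut is a question-only baseline that never looks at the image. The Python sketch below illustrates the idea and is not the analysis from the talk: the function names and the toy question-answer pairs are invented. It learns the most frequent answer for each question prefix, such as "how many", and then measures how often that answer alone is correct.

```python
from collections import Counter, defaultdict


def question_prefix(question: str, n_words: int = 2) -> str:
    """Lowercased first n_words of the question, e.g. 'how many'."""
    return " ".join(question.lower().split()[:n_words])


def train_prior_baseline(pairs):
    """Map each question prefix to its most frequent training answer."""
    counts = defaultdict(Counter)
    for question, answer in pairs:
        counts[question_prefix(question)][answer] += 1
    return {prefix: c.most_common(1)[0][0] for prefix, c in counts.items()}


def accuracy(baseline, pairs, fallback="yes"):
    """Share of test questions answered correctly without seeing the image."""
    correct = sum(
        baseline.get(question_prefix(q), fallback) == a for q, a in pairs
    )
    return correct / len(pairs)


# Toy, made-up data for illustration only.
train = [
    ("How many horses are there?", "two"),
    ("How many dogs are there?", "two"),
    ("How many people are there?", "three"),
    ("What color is the horse?", "brown"),
]
test = [
    ("How many unicorns are there?", "zero"),
    ("How many cats are there?", "two"),
]

baseline = train_prior_baseline(train)
print(baseline["how many"], accuracy(baseline, test))
```

If a full system scores only a little above a baseline like this on real data, that suggests the benchmark rewards learning answer priors rather than combining vision and language.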

There’s also the point of multilingualism. One of the soap boxes that I always carry around is English is not the same as language in general. I’m constantly annoyed by the fact that we will write all these papers in NLP and not mention the language that’s being worked on, and it’s always English. That just sort of sets up this context in which we assume that English is just completely typical of all languages. So, there’s always room for people who understand something about typology and language variation and within-language variation, such as sociolinguistics, to bring that to the table and say, okay, your system that supposedly is speaker-independent speech recognition, is it really dialect independent? How similar do those speakers have to be for it to still work? What kinds of dimensions are you looking at when you’re picking your test speakers?

So coming back to your question, I think the answer to that is there is always value in making sure that the people building the systems that are trying to learn a thing actually understand the thing. If we teach someone the nitty-gritty details of finite state machines and how that can be used to model morphology, or CKY parsing algorithms and context-free grammars and what that has to do with syntax, then that puts them in a better position to understand what’s going on when they’re looking at a deep learning paper that’s doing these auxiliary tasks to try to get the machine to pay attention to syntactic structure. They’ll have a better sense of what that is if they’ve gotten into it and worked with it with their own hands.

Q: How does your own research impact what you choose to teach? It seems that your knowledge engineering for NLP class is closely related to your own work.

Emily: “Knowledge engineering for natural language processing” is actually a terrible name for that class. It’s a grammar engineering class, and it got named that way because at that point I was looking at things as a contrast between machine learning and knowledge engineering of language itself. But now I understand knowledge engineering in NLP often refers to world knowledge, and that’s a much harder problem if you ask me. Grammar engineering is building linguistically precise grammars by hand in a machine readable form, and that’s where I focus my research. So, it’s a great thrill to teach that class every year.

I have this project called the Grammar Matrix which makes it faster to start new grammars. It’s a cross-linguistic shared core grammar paired with a bunch of typologically informed libraries. You start by answering a questionnaire about the language, and it outputs a small starter grammar that can do both parsing and generation, and the semantic representations are in Minimal Recursion Semantics. It’s on a broad enough foundation that it can be built out to broader coverage. I have students in that class each year work in pairs on languages that we haven’t treated yet in this framework, and we’re up over a hundred languages now. Students build test suites and use the customization system to get a starter grammar, and then they keep on building and get to more interesting phenomena over the course of the quarter. That’s loads and loads of fun, and it also then feeds into the Grammar Matrix project itself because I get to learn about where it’s not yet ready to handle all kinds of interesting things from all kinds of languages.

That course is an elective in the program. When I designed the master’s program, I said, you know, this is really fun and really interesting, but it’s not the main thing that employers are looking for right now and so I’m not going to drag everybody through it. It’s also a very time-consuming class because it ends up being very open-ended. Anybody who is excited about a language will know the grammar’s not quite working because there’s also this variant and that variant, and this makes it really hard to stop. So, I put a lot of energy into trying to help the students stop and get the writing done so they can get on to the next assignment. It’s the kind of class that works best as an elective. You want students in there who are excited about that kind of computational linguistics.

Q: Earlier you mentioned how CL moves slower than machine learning, which has accelerated in the past few years. But even so, CL still changes and sometimes the methods that you use today might not be the popular ones in ten or twenty years. How do you prepare students for the now and the future at the same time?

Emily: That’s actually something in the master’s program that we’ve put as a design principle. We want people to be relying on or benefiting from the training they get in our program not just when they get their first job right after graduating, but ten and maybe even twenty years further. One of the ways that we address that is by putting an emphasis not on specific tasks but on natural language processing components. What are the various sub problems that you often have to solve again and again? You might say that all of the end-to-end approaches with today’s seq2seq or whatever obviate the need for understanding components, but I’m not seeing that yet really. In the big picture in industry, if you look at the commercial products that we’re working with, they have many components inside of them. So, we need to focus on those subtasks, and understand the different structures and different problems that are raised: how does ambiguity work, where does that cause problems, how does it actually benefit systems, and how do we deal with it?

We have these cross-cutting themes that we introduce in our program’s orientation at the start of the year. One of them is ambiguity. Another one is multilingualism. What happens if you try the next language? How do you understand which languages are good ones to really stress test the system for being language independent? The third theme is evaluation. How do you construct a good evaluation system? How do you tell if an evaluation metric is appropriate? How do you handle training data and test data? Then, a fourth one that we’ve added is ethics in NLP. The idea is that by emphasizing these cross-cutting themes that go across the classes, it can help the students organize what they’re learning in our program into these fundamental principles that I think are going to remain relevant even if the target linguistic structures or machine learning techniques keep changing over time. You’re still gonna have to be able to evaluate, you’re still gonna have to ask or be able to ask if something is specific to English or going to work in another language, ambiguity is always going to be with us, and ethical considerations are always important.

Q: What advice would you give someone else who is just starting to craft their own graduate level course in computational linguistics?

Emily: So the first thing I would say is it’s important to answer some questions. Who’s the audience and what’s the context that defines the learning objectives? Is this a class that is then going to feed into more specialized NLP classes? Is it going to be people’s only exposure to NLP? Is it a specific NLP topic amongst a very large catalogue? So maybe we can take David’s case. Who’s the audience and what are the contextual factors?

David: In my case, UMich has one NLP course and there’s the occasional NLP graduate seminar but it’s a part of the upper advanced curricula. My course is for those who know roughly something about machine learning or some other advanced programming. So, the audience has students who want to get involved in language but don’t know anything else about it.

Emily: Yes, I love to talk to these kinds of people. I would say that an NLP class there should have some specific sample tasks or subtasks that are not presented as the comprehensive catalogue but as case studies. Those case studies should illustrate things like, how do we do evaluation in NLP, and make apparent to the students the complexity of linguistic structure that they’re working with. In my 100 things book that I did on morphology and syntax, thing number two is “Morphosyntax is the difference between a sentence and a bag of words.” I think a lot of people coming at language from computer science don’t understand the richness of the structure of language, because our experience with language as speakers is well, there’s the words that we see and we understand the meaning of these words. But there’s way more to it than that and that’s only apparent if you start studying linguistics. So, I would want an NLP class that is the only or first exposure to language to have computer science students engage with these issues. I think that for that reason it would be nice to have some of the tasks that actually target things like syntactic structure or involve compositional semantics or morphology, and especially doing those tasks in a way that brings in multiple different languages so we can see that languages can be very different to English.

Now, that’s a tall order to ask someone who’s in the computer science department to go do this, but there are some wonderful resources out there now. There’s the Universal Dependencies treebanks, so you could do something around dependency parsing and trying out a couple of the different dependency parsing algorithms on a few different languages. Then, you could also go downstream from that and say, well let’s take a look at a sentiment analysis system where we can throw in some dependency features and see if that improves things. You could try it maybe in a couple different languages if you’ve got tagged sentiment data for those other languages and hopefully we would, especially with something like Yelp where you can just use the stars as gold standard labels. If you were going to do it from more of a deep learning perspective, you could say, well, could we improve the sentiment analysis by using maybe dependency parsing as an ancillary task? You could bring these ideas regardless of whether you’re doing feature engineering.

Q: Say you had an extra week in your intro class or some other class. What topic would you cover?

Emily: If an extra week pops up and I get to throw a topic at it, I might go for ethics. It would get students to engage with questions such as who are the people in the world who can be impacted by these systems, what are the ways in which harm can be done, and what are the strategies we can apply to try to mitigate that harm? In fact, I already include a week on this in my undergrad class, but it’s a deep and important topic and could definitely use more time!

Q: Is there anything else you think we should know about teaching computational linguistics?

Emily: It’s an enormous field, and the enormity of it is sort of belied by the way it is included in both linguistics and computer science. So, within linguistics it’s just another subfield. You have phonetics, phonology, morphology, syntax, semantics, pragmatics, historical linguistics, sociolinguistics, psycholinguistics, and computational linguistics, even though computational linguistics basically crosscuts all of that. Then, over in computer science, you have this long list of things, including vision and NLP, as if NLP is a small little thing.

This enormity makes it exciting, and I think that one of the jobs that we have as instructors of computational linguistics is to help students understand that when they’ve opened the door into our class, they’ve basically stepped through the closet into Narnia. What they thought was a small space is actually a big space, but still an exciting one.

The above Q&A has been edited for clarity.

David Jurgens is an Assistant Professor in the School of Information at the University of Michigan.