How to get into Natural Language Processing?

Dibya Chakravorty
Published in Broken Window
Nov 17, 2016

Introduction

If you are a developer looking to get started with Natural Language Processing (NLP), you are probably wondering which books to read and whether there are good online courses. In this article, we share an eclectic list of resources that gives you a bird’s-eye view of your options. The list includes books, MOOCs, YouTube videos, university courses, software packages, academic papers, APIs and user groups.

Note that these suggestions come from the awesome users of Hacker News, which makes the list quite authoritative. There’s also an accompanying GitHub repository for this post.

Introductory books on NLP

1. Jurafsky and Martin

The classic, standard textbook on speech and language processing. A pre-publication draft of the 3rd edition is available online for free.

2. The NLTK book

An application-oriented book, where the examples are in Python. This book accompanies the Python package NLTK and is a great resource for beginners who like learning by doing.

This book is available online for free. A paperback edition is also available.

3. Taming Text

Another example-oriented book, where the examples are in Java. It covers getting started, feature extraction and preprocessing, search, clustering, classification, string heuristics and Named Entity Recognition, and finishes off with a simple Question Answering system.

4. Manning and Schütze

A classic book that dives deep into the implementation of the statistical methods of NLP. Good choice if you want to eventually implement a tagger or parser yourself.

5. Handbook of NLP

A complete and authoritative treatment of NLP that starts from the historical roots and ends with the modern methods of NLP.

6. Statistical Machine Translation

This introductory text to statistical machine translation provides all of the theories and methods needed to build a statistical machine translator, such as Google Language Tools and Babelfish.

7. Introduction to Information Retrieval

Another book co-authored by Manning and Schütze, this time with Prabhakar Raghavan. If you find services like Google Search and Google News fascinating, this book might give you an insight into how they work. Starting from basic concepts, it deals with web-era information retrieval, including web search and the related areas of text classification and text clustering.

8. Prolog and Natural-Language Analysis

This book covers the implementation of basic NLP algorithms in Prolog. For those with an interest in logic programming, this book is the right way to get into NLP.
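To give a flavour of the "basic concepts" behind search that Introduction to Information Retrieval (book 7 above) starts from, here is a minimal sketch of an inverted index with boolean AND queries. It is not taken from the book: the function names and the toy corpus are made up for illustration, and real search engines add ranking, stemming and much more.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical toy corpus, invented for this example.
docs = {
    1: "Web search engines rank pages",
    2: "Text classification and text clustering",
    3: "Search engines index text from the web",
}
index = build_index(docs)

print(search(index, "web search"))  # [1, 3]
print(search(index, "text"))        # [2, 3]
```

The index trades memory for speed: each query only touches the posting sets of its own terms instead of scanning every document.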

NLP MOOCs

1. Coursera MOOC offered by University of Michigan

This course, instructed by Dragomir R. Radev, provides an introduction to the field of NLP. It includes relevant background material in Linguistics, Mathematics, Probabilities, and Computer Science. Some of the topics covered in the class are Text Similarity, Part of Speech Tagging, Parsing, Semantics, Question Answering, Sentiment Analysis, and Text Summarization. The preferred programming language for the course is Python.

2. Discontinued Coursera MOOC offered by Columbia University

Columbia University used to offer a MOOC on NLP on Coursera. This MOOC has unfortunately been discontinued. Luckily, the materials can still be accessed here.
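Text Similarity, one of the topics listed for the Michigan course above, is easy to try out yourself. The sketch below is not course material; it assumes the simplest possible setup, where each text becomes a bag of word counts and two texts are compared by the cosine of the angle between their count vectors. The helper names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two texts: 0.0 means no shared words, 1.0 means identical word proportions."""
    va, vb = bag_of_words(a), bag_of_words(b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


similar = cosine_similarity("the cat sat on the mat", "the cat lay on the rug")
unrelated = cosine_similarity("the cat sat on the mat", "stock prices fell sharply")
print(similar > unrelated)
```

Word counts ignore word order and meaning, which is exactly the kind of limitation the courses and books in this list go on to address.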

YouTube video series

1. Video lecture series by the legendary Dan Jurafsky and Chris Manning.

Dan Jurafsky and Chris Manning have written multiple classic textbooks in NLP, so this video lecture series is definitely going to be great.

2. Video series on the application of Deep Learning in NLP (CS224D at Stanford)

Deep Learning is the state of the art in many areas of Machine Learning. This video series, which is a part of the Stanford CS224D course, discusses how Deep Learning is applied in the field of NLP.

3. Video series on NLP with Python and NLTK

An example-oriented video series where NLTK is used to perform the most common NLP tasks.

Online University Courses

1. Machine Translation course at University of Pennsylvania

This course offered at the University of Pennsylvania deals with Machine Translation. The course materials are available online.

Software Packages to play with

Playing around with software packages can teach you a lot about the technology and can be a fun way to get introduced to the possibilities of NLP. Here are some suggested packages that you can try fiddling with. They all come with online demos, in case you want to try out the functionality before installing them on your system.

1. NLTK

The Natural Language Toolkit (NLTK) is one of the most popular Python libraries for NLP. It contains classes that implement most of the functionality that you will need in a typical NLP project. It also has wrapper classes that can hook onto external libraries like Stanford CoreNLP. It comes with excellent documentation and a book (remember “The NLTK book” from the books section?). A demo (unofficial) of some basic functionality can be accessed online.

2. Stanford CoreNLP

Stanford CoreNLP is a fast and feature-rich NLP library written in Java. You can access a demo here.

3. spaCy

spaCy is another Python library for NLP, which describes itself as industrial grade and production ready. It offers “one best algorithm” for NLP tasks like tagging and parsing instead of offering many implementations. This helps developers keep a uniform API while still taking advantage of state-of-the-art methods and implementations. They have some very cool visualizations here. They also have a blog.

4. Apache Tika

Apache Tika is a toolkit that can extract and parse text and metadata from many different file types (PDF, PPT etc.) using a unified interface. This makes it useful for tasks like search engine indexing, content analysis, translation, and much more.

Academic Papers

If you are feeling extra academic and want to read some academic papers on the state of the art in NLP (or just want to know how they sound), then the following resource is for you.

1. A compilation of papers on the application of Deep Learning in NLP

Learning by doing

Contributing to friendly Open Source projects is a great way to learn by doing. You can also make friends with other NLP enthusiasts this way (project maintainers and contributors) and this will help you learn the subject much faster.

1. Betty

Betty is quite an interesting open source project with both real-life use and practical NLP considerations, and is looking for new maintainers. This is the link to the GitHub repository.

Making something yourself is also a great way of learning a skill. If you are interested in making something fun, then consider the following.

2. Interactive Fiction/Parser based fiction

Interactive Fiction (IF) is a kind of video game where the player’s interactions primarily involve text. A recent FLOSS podcast episode with folks from the IF Tech Foundation on this subject was pretty interesting and illuminating.
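If you want to try making parser-based fiction, the core of a classic text-adventure parser is small enough to write in an afternoon. The sketch below is a hypothetical starting point, not how any particular IF system works: it assumes commands of the form "verb [article] noun", with a made-up synonym table and stopword list.

```python
# Hypothetical synonym table: every accepted word maps to a canonical verb.
VERBS = {
    "take": "take", "get": "take", "grab": "take",
    "go": "go", "walk": "go",
    "look": "look", "examine": "look",
}

# Filler words that carry no meaning for this toy parser.
STOPWORDS = {"the", "a", "an", "at", "to"}


def parse_command(line):
    """Return (verb, noun) for a player command, or None if it does not start with a known verb."""
    words = [w for w in line.lower().split() if w not in STOPWORDS]
    if not words or words[0] not in VERBS:
        return None
    verb = VERBS[words[0]]
    # Everything after the verb is treated as the noun phrase; None if absent.
    noun = " ".join(words[1:]) or None
    return verb, noun


print(parse_command("Take the lamp"))     # ('take', 'lamp')
print(parse_command("grab a brass key"))  # ('take', 'brass key')
print(parse_command("dance wildly"))      # None
```

Extending this to handle punctuation, prepositions ("put key in box") or ambiguous nouns quickly runs into the tagging and parsing problems covered by the books above, which is what makes it a good learning project.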

APIs

If you want to build cool applications by leveraging the power of NLP (but not implement anything yourself), then you can hook onto some services that offer an intuitive API for NLP functionality.

1. IBM Watson developer cloud

IBM Watson is a famous question answering system that beat humans to win first prize in Jeopardy! Since 2014, IBM has offered some modular AI functionality (including NLP) through a public API, which they call the IBM Watson Developer Cloud. They have a free tier or free trial for most of their APIs, so you can start experimenting with this functionality in your apps without paying anything up front.

User Groups

Books, videos etc. are great resources, but if you are looking for some face to face human contact on topics related to NLP, then the ACM special interest group in AI might be worth a shot.

Other “how to get into NLP” guides

While the resources in this guide are great starting points, there are some other guides on this topic that mention resources that are beyond the scope of this article. You should definitely check them out too.

1. Quora question “How do I learn Natural Language Processing?”

2. A GitHub repository called Awesome-NLP has lots of resources on NLP

Conclusion

This wraps up all the community suggested resources on getting started with NLP. I hope it helps you start your exploration of this amazing technology.

Remember, there’s a GitHub repository with these resources. If you think that an important beginner resource is missing in this article, you can create a pull request in the repository. This way we can create an even more exhaustive resource list on this topic.

Acknowledgements

Thanks to all HN members who contributed to this Ask HN thread. I would also like to thank Scott Bell from Y Combinator for his valuable suggestions.
