How to get into Natural Language Processing?
Introduction
If you are a developer looking to get started with Natural Language Processing, you are probably wondering which books to read and whether there are good online courses for NLP. In this article, we share an eclectic list of resources that will give you a bird’s eye view of your options. The list includes books, MOOCs, YouTube videos, academic papers, user groups, and more.
Note that these suggestions come from the awesome users of Hacker News, which makes the list quite authoritative. There’s also an accompanying GitHub repository for this post.
Introductory books on NLP
1. Jurafsky and Martin
The classic, standard textbook, Speech and Language Processing. A pre-publication draft of the 3rd edition is available online for free.
2. The NLTK book
An application-oriented book, where the examples are in Python. This book accompanies the Python package NLTK and is a great resource for beginners who like learning by doing.
This book is available online for free. A paperback edition is also available.
3. Taming Text
Another example-oriented book, where the examples are in Java. It covers getting started, feature extraction and preprocessing, search, clustering, classification, string heuristics, and Named Entity Recognition, and finishes off with a simple Question Answering system.
4. Manning and Schütze
A classic book that dives deep into the implementation of the statistical methods of NLP. Good choice if you want to eventually implement a tagger or parser yourself.
5. Handbook of NLP
A complete and authoritative treatment of NLP that starts from the historical roots and ends with the modern methods of NLP.
6. Statistical Machine Translation
This introduction to statistical machine translation covers the theory and methods needed to build a statistical machine translation system, such as Google Language Tools or Babelfish.
7. Introduction to Information Retrieval
Another book by Manning and Schütze, this time with Prabhakar Raghavan. If you find services like Google Search and Google News fascinating, this book might give you an insight into how they work. It covers web-era information retrieval, including web search and the related areas of text classification and text clustering, starting from basic concepts.
8. Prolog and Natural-Language Analysis
This book covers the implementation of basic NLP algorithms in Prolog. For those with an interest in logic programming, this book is the right way to get into NLP.
NLP MOOCs
1. Coursera MOOC offered by University of Michigan
This course, instructed by Dragomir R. Radev, provides an introduction to the field of NLP. It includes relevant background material in Linguistics, Mathematics, Probabilities, and Computer Science. Some of the topics covered in the class are Text Similarity, Part of Speech Tagging, Parsing, Semantics, Question Answering, Sentiment Analysis, and Text Summarization. The preferred programming language for the course is Python.
2. Discontinued Coursera MOOC offered by Columbia University
Columbia University used to offer a MOOC on NLP on Coursera. This MOOC has unfortunately been discontinued. Luckily, the materials can still be accessed here.
YouTube video series
1. Video lecture series by the legendary Dan Jurafsky and Chris Manning.
Dan Jurafsky and Chris Manning have written multiple classic textbooks in NLP, so this video lecture series is definitely going to be great.
2. Video series on the application of Deep Learning in NLP (CS224D at Stanford)
Deep Learning is the state of the art in Machine Learning. This video series, which is a part of the Stanford CS224D course, discusses how Deep Learning is applied in the field of NLP.
3. Video series on NLP with Python and NLTK
An example-oriented video series where NLTK is used to perform the most common NLP tasks.
Online University Courses
1. Machine Translation course at University of Pennsylvania
This course offered at the University of Pennsylvania deals with Machine Translation. The course materials are available online.
Software Packages to play with
Playing around with software packages can teach you a lot about the technology and can be a fun way to get introduced to the possibilities of NLP. Here are some suggested packages that you can try fiddling with. They all come with online demos, in case you want to try out the functionality before installing them on your system.
1. NLTK
The Natural Language Toolkit (NLTK) is one of the most popular Python libraries for NLP. It contains classes that implement most of the functionality that you will need in a typical NLP project. It also has wrapper classes that can hook onto external libraries like Stanford CoreNLP. It comes with excellent documentation and a book (remember “The NLTK book” from the books section?). A demo (unofficial) of some basic functionality can be accessed online.
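To give a feel for what fiddling with NLTK looks like, here is a minimal sketch that tokenizes a sentence, stems the words, and counts the stems. The sample text is made up for illustration. The sketch deliberately sticks to parts of NLTK that, as far as we know, work without downloading any extra corpora or models, and it assumes the nltk package is already installed.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Made-up sample text, lowercased so that "The" and "the" count as one word.
text = "The cats chased the mice. The mice kept running."

# Regex-based tokenizer: splits runs of word characters from punctuation.
tokens = wordpunct_tokenize(text.lower())

# Reduce each alphabetic token to its Porter stem (e.g. "running" -> "run").
stemmer = PorterStemmer()
stems = [stemmer.stem(tok) for tok in tokens if tok.isalpha()]

# FreqDist is a counter specialised for token frequencies.
freq = FreqDist(stems)

print(tokens)
print(freq.most_common(3))
```

Tasks such as part-of-speech tagging follow the same pattern, but they rely on trained models that have to be fetched separately with nltk.download.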
2. Stanford CoreNLP
Stanford CoreNLP is a fast and feature-rich NLP library written in Java. You can access a demo here.
3. spaCy
spaCy is another Python library for NLP, which bills itself as industrial grade and production ready. It offers “one best algorithm” for NLP tasks like tagging and parsing, instead of offering many implementations. This helps developers keep a uniform API while still taking advantage of state-of-the-art methods and implementations. They have some very cool visualizations here. They also have a blog.
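The uniform API can be seen even without a trained model. The following minimal sketch assumes the spacy package is installed and uses a blank English pipeline, which only tokenizes. Tagging and parsing need a trained pipeline (for example en_core_web_sm, loaded with spacy.load), which is not used here. The sample sentence is made up.

```python
import spacy

# Blank English pipeline: tokenizer only, so no model download is required.
nlp = spacy.blank("en")

# Calling the pipeline on a string returns a Doc, a sequence of Token objects.
doc = nlp("spaCy parses text quickly.")

# Every token exposes the same attributes, whichever pipeline produced it.
for token in doc:
    print(token.i, token.text, token.is_alpha, token.is_punct)

# Keep only the non-punctuation tokens.
words = [token.text for token in doc if not token.is_punct]
print(words)
```

With a trained pipeline loaded instead of the blank one, the same loop could also read attributes such as token.pos_ and token.dep_.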
4. Apache Tika
Apache Tika is a toolkit that can extract and parse text and metadata from many different file types (PDF, PPT etc.) using a unified interface. This makes it useful for tasks like search engine indexing, content analysis, translation, and much more.
Academic Papers
If you are feeling extra academic and want to read some academic papers on the state of the art in NLP (or just want to know how they sound), then the following resource is for you.
1. A compilation of papers on the application of Deep Learning in NLP
Learning by doing
Contributing to friendly Open Source projects is a great way to learn by doing. You can also make friends with other NLP enthusiasts this way (project maintainers and contributors) and this will help you learn the subject much faster.
1. Betty
Betty is an interesting open source project with both real-life use and practical NLP considerations, and is looking for new maintainers. This is the link to the GitHub repository.
Making something yourself is also a great way of learning a skill. If you are interested in making something fun, then consider the following.
2. Interactive Fiction/Parser based fiction
Interactive Fiction (IF) is a kind of video game where the player’s interactions primarily involve text. A recent FLOSS podcast episode with folks from the IF Tech Foundation on this subject was pretty interesting and illuminating.
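If you want to try this yourself, the core of a parser-based game is a function that turns a typed command into an action. Here is a minimal sketch of a two-word (verb plus object) parser. The vocabulary, the names STOP_WORDS, VERB_SYNONYMS and parse_command, and the overall design are hypothetical and chosen only for illustration. Real IF systems are considerably more sophisticated.

```python
# Hypothetical vocabulary for a toy game; extend it as the game grows.
STOP_WORDS = {"the", "a", "an", "to", "at"}
VERB_SYNONYMS = {
    "take": "take", "get": "take", "grab": "take",
    "look": "look", "examine": "look",
    "go": "go", "walk": "go",
}


def parse_command(line):
    """Return (verb, noun) for a recognised command, or None otherwise."""
    # Lowercase, split on whitespace, and drop filler words.
    words = [w for w in line.lower().split() if w not in STOP_WORDS]
    if not words:
        return None
    # Map the first word onto a canonical verb, if we know it.
    verb = VERB_SYNONYMS.get(words[0])
    if verb is None:
        return None
    # Everything after the verb is treated as the object (may be absent).
    noun = " ".join(words[1:]) or None
    return verb, noun


for command in ["Grab the brass lamp", "look", "dance wildly"]:
    print(command, "->", parse_command(command))
```

A natural next step is to swap the synonym table for a stemmer or tagger from one of the packages listed earlier.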
APIs
If you want to build cool applications by leveraging the power of NLP (but not implement anything yourself), then you can hook onto some services that offer an intuitive API for NLP functionality.
1. IBM Watson developer cloud
IBM Watson is a famous question answering system that beat human champions to win first prize on Jeopardy! Since 2014, IBM has offered some modular AI functionality (including NLP) through a public API, which they call the IBM Watson Developer Cloud. They have a free tier or free trial for most of their APIs, so you can start experimenting with this functionality in your apps without paying anything up front.
User Groups
Books, videos etc. are great resources, but if you are looking for some face-to-face human contact on topics related to NLP, then the ACM special interest group in AI might be worth a shot.
Other “how to get into NLP” guides
While the resources in this guide are great starting points, there are some other guides on this topic that mention resources that are beyond the scope of this article. You should definitely check them out too.
1. Quora question “How do I learn Natural Language Processing?”
2. A GitHub repository called Awesome-NLP has lots of resources on NLP
Conclusion
This wraps up all the community suggested resources on getting started with NLP. I hope it helps you start your exploration of this amazing technology.
Remember, there’s a GitHub repository with these resources. If you think that an important beginner resource is missing from this article, you can create a pull request in the repository. This way we can create an even more exhaustive resource list on this topic.
Acknowledgements
Thanks to all HN members who contributed to this Ask HN thread. I would also like to thank Scott Bell from Y Combinator for his valuable suggestions.