Two minutes NLP — 20 Learning Resources for Information Retrieval
Articles, tutorials, and popular libraries
Hello fellow NLP enthusiasts! Since there will soon be an NLPlanet Discord server for networking between NLP practitioners, I’m working on a first organization of its channels. I’m planning to add learning resources for many NLP areas, so this article is a step towards preparing that content. If you’re interested in the Discord server, follow NLPlanet on Medium, LinkedIn or Twitter to stay updated on its release. Enjoy! 😄
Below is my first curated draft of NLPlanet’s Information Retrieval learning resources. As a draft, this list will be improved with feedback from the community.
This article is part 5 of a series of articles about learning resources.
What is Information Retrieval
Information Retrieval (IR) is the process of answering a user query by examining a collection of documents and returning a ranked list of documents, ordered by how relevant each one is to the query. More broadly, it’s the activity of obtaining information resources relevant to an information need.
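To make this definition concrete, here is a minimal sketch of ranked retrieval using TF-IDF weights and cosine similarity, written in plain Python. The function names (`tokenize`, `build_tfidf`, `to_vector`, `cosine`, `search`), the smoothed IDF formula, and the three-document corpus are illustrative assumptions, not the method of any specific library or of the articles listed later.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; real systems also handle punctuation and stemming.
    return text.lower().split()

def to_vector(tokens, idf):
    # Sparse TF-IDF vector as a dict; terms unknown to the corpus are dropped.
    tf = Counter(tokens)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}

def build_tfidf(docs):
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption of this sketch) so every weight stays positive.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [to_vector(tokens, idf) for tokens in tokenized]
    return idf, vectors

def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, docs, idf, vectors):
    # Score every document against the query and return the matches, best first.
    query_vector = to_vector(tokenize(query), idf)
    scored = [(cosine(query_vector, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(docs[i], score) for score, i in scored if score > 0.0]

DOCS = [
    "the cat sat on the mat",
    "dogs and cats are popular pets",
    "stock markets fell sharply today",
]

idf, vectors = build_tfidf(DOCS)
for doc, score in search("cat on a mat", DOCS, idf, vectors):
    print(f"{score:.3f}  {doc}")
```

Note that the query term “cat” does not match “cats” in the second document, so that document gets a score of zero: pure keyword matching has no notion of meaning.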
A popular type of Information Retrieval is Semantic Search. Semantic Search is a search technique in which the system aims not only to match keywords but also to determine the intent and contextual meaning of the words a person uses in a query.
Information Retrieval applications and use cases
- Search engines for text documents, images, videos, and so on.
- Question answering over a set of documents (e.g. with a chatbot or a smart speaker).
- Recommender systems.
- Summarization of a set of documents.
Articles and tutorials
- Create A Simple Search Engine Using Python: Information retrieval using cosine similarity and term-document matrix with TF-IDF weighting.
- TF-IDF from scratch in python on a real-world dataset: Document retrieval using TF-IDF cosine similarity.
- Introduction to Information Retrieval [Kaggle notebook]: This tutorial covers the basics of Information Retrieval concepts and focuses on Boolean and TF-IDF Ranked Retrieval models. At the end, it presents ways to evaluate an IR system using a benchmark dataset and an algorithm that ships with modern Lucene-based search engines.
- Semantic search with embeddings: index anything: Smart search, encoding pipeline, search pipeline, and open-source solutions.
- How To Create Natural Language Semantic Search For Arbitrary Objects With Deep Learning: An end-to-end example of how to build a system that can search objects semantically, using embeddings.
- Quick Semantic Search using Siamese-BERT Networks: Using the S-BERT library to create fixed-length sentence embeddings suitable for semantic search on a large corpus. Article with code.
- How to Build a Semantic Search Engine With Transformers and Faiss: How to build a vector-based search engine with sentence Transformers and Faiss, with code.
- Semantic Search with S-BERT is all you need: Building a Semantic Search Engine from scratch, using S-BERT.
- Billion-scale semantic similarity search with FAISS+SBERT: Building a prototype of smart search with Faiss and S-BERT.
- Learning to Rank for Information Retrieval: A Deep Dive into RankNet: A look at state-of-the-art ranking systems that can be used for Information Retrieval.
- Relevance, Ranking and Search: A list of traditional Information Retrieval models.
- Supercharging Elasticsearch with Haystack’s Question Answering: Question answering with Haystack.
- Getting started with Elasticsearch in Python: Setting up Elasticsearch and accessing it with Python.
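Two ideas that recur in the tutorials above are Boolean retrieval and the evaluation of a ranked result list. The following is a minimal sketch of both in plain Python, assuming a toy corpus and hypothetical helper names (`build_inverted_index`, `boolean_and`, `boolean_or`, `precision_at_k`, `reciprocal_rank`); it is not taken from any of the linked articles.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, terms):
    # Documents that contain every query term.
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)

def boolean_or(index, terms):
    # Documents that contain at least one query term.
    return set().union(*(index.get(term.lower(), set()) for term in terms))

def precision_at_k(ranked_ids, relevant_ids, k):
    # Fraction of the top-k results that are relevant.
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k

def reciprocal_rank(ranked_ids, relevant_ids):
    # 1 / rank of the first relevant result, or 0.0 if none is found.
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

CORPUS = [
    "information retrieval with python",
    "python web development",
    "neural information retrieval",
]

index = build_inverted_index(CORPUS)
print(sorted(boolean_and(index, ["information", "retrieval"])))
print(sorted(boolean_or(index, ["python", "neural"])))
print(precision_at_k([0, 1, 2], {0, 2}, k=2))
```

Boolean retrieval returns an unordered set of matching documents, which is why ranked models such as TF-IDF are usually layered on top of it. Averaging `reciprocal_rank` over many queries gives the Mean Reciprocal Rank (MRR) metric.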
Popular libraries
- Elasticsearch: Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. Elasticsearch is built on Apache Lucene.
- Jina: Jina is a neural search framework that empowers anyone to build SOTA and scalable neural search applications.
- Milvus: Milvus is an open-source vector database built to power embedding similarity search and AI applications.
- Haystack: Haystack is an end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases. Whether you want to perform Question Answering or semantic document search, you can use the state-of-the-art NLP models in Haystack to provide unique search experiences and allow your users to query in natural language.
- Faiss: Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM.
- Weaviate: Weaviate is a vector search engine and vector database. Weaviate uses machine learning to vectorize and store data and find answers to natural language queries.
- Vector Hub: Vector Hub is a library for publication, discovery, and consumption of state-of-the-art models to turn data into vectors, such as Text2Vec, Image2Vec, Video2Vec, Face2Vec, Bert2Vec, Inception2Vec, Code2Vec, LegalBert2Vec, etc.
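Several of the libraries above revolve around similarity search over dense vectors. As a rough illustration of the underlying operation, here is a sketch of exact (brute-force) cosine similarity search in plain Python. The `FlatIndex` class is hypothetical and is not the API of Faiss, Milvus, or Weaviate, and the three-dimensional vectors are hand-made; in practice the embeddings would come from a model such as S-BERT.

```python
import math

def normalize(vec):
    # Scale a vector to unit length so that a dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

class FlatIndex:
    """Toy exact index: compares the query against every stored vector."""

    def __init__(self, dim):
        self.dim = dim
        self.ids = []
        self.vectors = []

    def add(self, item_id, vec):
        if len(vec) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vec)}")
        self.ids.append(item_id)
        self.vectors.append(normalize(vec))

    def search(self, query, k):
        # Linear scan: cost grows with the number of stored vectors.
        q = normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), item_id)
            for vec, item_id in zip(self.vectors, self.ids)
        ]
        scored.sort(key=lambda pair: -pair[0])
        return [(item_id, score) for score, item_id in scored[:k]]

index = FlatIndex(dim=3)
index.add("doc-a", [1.0, 0.0, 0.0])
index.add("doc-b", [0.0, 1.0, 0.0])
index.add("doc-c", [0.9, 0.1, 0.0])

for item_id, score in index.search([1.0, 0.0, 0.0], k=2):
    print(item_id, round(score, 3))
```

A linear scan like this is fine for small collections. Dedicated vector search libraries exist largely because it becomes too slow at scale, which is where approximate indexing and compression techniques come in.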
Conclusion
If you know of any other good resources for learning about Information Retrieval, please let me know so that I can share them with the community.
Other NLP areas that will need learning resources of their own include chatbots, language models, question answering, and speech.