Information Retrieval with Document Re-ranking Using BM25 and BERT
Retrieving relevant information from a huge corpus of documents is a challenging problem, and it places different requirements on the model architecture than typical NLP tasks. There is growing interest in developing scalable retrieval models trained end-to-end, bypassing the typical document retrieval step. In this blog we present a scalable solution for information retrieval.
To achieve this, we use a two-step approach for retrieving relevant information from a huge corpus of documents.
Step 1: A retrieval phase
Step 2: A re-ranking phase
Retrieval phase
In the retrieval phase, we search the document corpus to get the top 100 or 200 results using an information retrieval method like TF-IDF or BM25. BM25 often works much better than TF-IDF since it takes different document lengths into consideration.
The first step in any information retrieval method is storing documents in a proper format. Text search is very different from traditional computing tasks, so it calls for its own kind of data structure: the inverted index. The index is "inverted" because we usually think of words as being part of documents, but if we invert this idea, documents are associated with words.
For example, suppose we have the sentences below to be stored in inverted index format.
S1 COUNTRY KITCHEN HAMBURGER BUNS 8CT
S2 LA BUNTING 8 WHOLE WHEAT TORT 10CT
S3 GREAT VALUE HAMBURGER BUNS
S4 OLD TYME HAMBURGER ROLLS 12 S
S5 COUNTRY HEARTH 100 WHOLE WHEAT BREAD
S6 HOLSUM HOT DOG ROLLS 8 S
S7 COUNTRY KITCHEN HAMBURGER BUNS 8CT
S8 QUALITY 10CT FLOUR BURRITO TORTILLA
Each entry in the index maps a term to the documents containing it, in the form document:term-frequency.
COUNTRY → S1:1, S5:1, S7:1
HAMBURGER → S1:1, S3:1, S4:1, S7:1
…
BUNS → S1:1, S3:1, S7:1
…
Additionally, the index can contain other information such as the positions at which the words occur in the document and the field in which they occur.
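The inverted index above can be sketched in a few lines of Python. This is only a toy illustration of the data structure, not how Elasticsearch stores its indexes internally:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings dict of {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

# A subset of the example sentences above
docs = {
    "S1": "COUNTRY KITCHEN HAMBURGER BUNS 8CT",
    "S3": "GREAT VALUE HAMBURGER BUNS",
    "S5": "COUNTRY HEARTH 100 WHOLE WHEAT BREAD",
    "S7": "COUNTRY KITCHEN HAMBURGER BUNS 8CT",
}
index = build_inverted_index(docs)
```

Looking up `index["COUNTRY"]` now returns the postings for that term directly, without scanning every document.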
The second step is scoring the search results using a ranking function like BM25. Below is the scoring/ranking formula used by the BM25 algorithm.
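The formula itself did not survive here; based on the variable definitions that follow, this is the standard BM25 ranking function (as presented, e.g., in Croft, Metzler and Strohman's Search Engines textbook):

score(Q, D) = Σ_i log[ ((r_i + 0.5)/(R − r_i + 0.5)) / ((n_i − r_i + 0.5)/(N − n_i − R + r_i + 0.5)) ] · ((k_1 + 1)·f_i)/(K + f_i) · ((k_2 + 1)·qf_i)/(k_2 + qf_i)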
Where:
N — Size of the Collection of documents
ni — Number of documents in the collection containing query term ti
R — Relevant set size (i.e., number of documents judged relevant in collection)
ri — Number of judged relevant documents containing ti
fi — The frequency of term ti in the document
qfi — The frequency of term ti in the query
k1 — Determines how the tf (term frequency) component of the term weight changes as fi increases.
k2 — Determines how the tf (term frequency) component of the term weight changes as qfi increases.
K = k1 · ((1 − b) + b · dl/avgdl), where b is a length-normalization parameter, dl is the length of the document, and avgdl is the average length of a document in the collection.
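As a sanity check, the formula can be implemented directly. This is a toy scorer assuming no relevance judgments (R = ri = 0, so the first factor reduces to the familiar IDF term) and commonly used defaults k1 = 1.2, k2 = 100, b = 0.75:

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, N, avgdl,
               k1=1.2, k2=100, b=0.75):
    """Score one document against a query with BM25 (no relevance info)."""
    dl = len(doc_terms)
    K = k1 * ((1 - b) + b * dl / avgdl)
    score = 0.0
    for t in set(query_terms):
        n = doc_freq.get(t, 0)        # documents in the collection containing t
        if n == 0:
            continue
        f = doc_terms.count(t)        # frequency of t in the document
        qf = query_terms.count(t)     # frequency of t in the query
        idf = math.log((N - n + 0.5) / (n + 0.5))   # R = r_i = 0 case
        score += idf * ((k1 + 1) * f) / (K + f) * ((k2 + 1) * qf) / (k2 + qf)
    return score
```

A document that actually contains the query terms scores higher than one that does not, and longer documents are penalized through K.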
We will be using Elasticsearch for information retrieval, since it implements the BM25 algorithm and scales to a large number of records. Moreover, Elasticsearch can run on a distributed system.
1. Download Elasticsearch from https://www.elastic.co/downloads/elasticsearch
Run the engine by unzipping the downloaded file and executing "./elasticsearch" from the bin directory.
If it starts successfully, a startup message will be visible on the console.
The same can also be verified by opening http://localhost:9200 in the browser address bar.
2. Install the Python wrapper for Elasticsearch:
pip install elasticsearch
When searching the index we can control how many top hits we need; for our use case we retrieve the top 100 search results. The "size" parameter below specifies the desired number of results.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def search(text, index, field):
    res = es.search(index=index,
                    body={"query": {"match": {field: {"query": text,
                                                      "operator": "or",
                                                      "fuzziness": 0}}}},
                    size=100, request_timeout=20)
    return [hit["_source"]["text"] for hit in res["hits"]["hits"]]
Re-Ranking Phase
In the second step, we apply a more complex model: this re-ranker receives as input (query, hit1), (query, hit2), …, (query, hit100). For each pair, it outputs a value between 0 and 1 indicating how relevant the pair is.
How BERT works
BERT makes use of the Transformer, an attention-based architecture consisting of an encoder that reads the text input and a decoder that produces a prediction for the task; since BERT's goal is to learn a language representation, only the encoder is needed. Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words based on the context provided by the other, non-masked words in the sequence.
The BERT-large model consists of 24 layers, 1024 hidden units, and 16 attention heads, for 340M parameters, and is trained on lower-cased English text.
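The masking step described above can be sketched as follows. This is a toy illustration of the pre-training input, ignoring BERT's actual WordPiece tokenization and the 80/10/10 replacement details:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=42):
    """Replace ~15% of tokens with [MASK], as in BERT pre-training."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    for p in positions:
        masked[p] = "[MASK]"
    return masked, positions   # the model must predict the original tokens

tokens = "the model predicts the original value of the masked words".split()
masked, positions = mask_tokens(tokens)
```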
Method: The job of the re-ranker is to estimate a score si of how relevant a document di is to a query q. We use Sentence-BERT (SBERT) as our re-ranker, since we focus on sentence-level retrieval. Out of the box, BERT produces rather poor sentence embeddings. Sentence-BERT fine-tunes BERT with a siamese or triplet network structure to produce semantically meaningful sentence embeddings that can be used in unsupervised scenarios: semantic textual similarity via cosine similarity, and semantic search.
With the above method we obtain a cosine distance for each (query, document) pair, which can then be used to select the top result. The document with the smallest cosine distance is considered the top match.
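Once the query and the retrieved hits are embedded (e.g. via an SBERT model's encode call), the re-ranking step reduces to sorting by cosine distance. The toy vectors below stand in for real SBERT embeddings:

```python
import numpy as np

def rerank_by_cosine_distance(query_vec, doc_vecs):
    """Return document indices sorted by ascending cosine distance to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    distances = 1.0 - d @ q            # cosine distance = 1 - cosine similarity
    return np.argsort(distances)       # smallest distance (best match) first

# Toy embeddings standing in for SBERT output
query_vec = np.array([1.0, 0.0, 1.0])
doc_vecs = np.array([[0.9, 0.1, 1.1],   # close to the query
                     [0.0, 1.0, 0.0],   # orthogonal to the query
                     [1.0, 0.2, 0.8]])  # fairly close
order = rerank_by_cosine_distance(query_vec, doc_vecs)
```

The first index in `order` is the top match; in a full pipeline this would be applied to the 100 hits returned by Elasticsearch.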
We can further use MRR (Mean Reciprocal Rank) to evaluate the end-to-end model, but this requires ground-truth labels. These can be obtained from a sample of queries for which the rank of the correct document has been manually verified.
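Given such ground truth, MRR is simply the average of 1/rank of the correct document over all queries (a minimal sketch; ranks are 1-based, and a query whose correct document was not retrieved contributes 0):

```python
def mean_reciprocal_rank(ranks):
    """ranks[i] = 1-based rank of the correct document for query i (None if not retrieved)."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# e.g. the correct document ranked 1st, 3rd, and was missing for three queries
mrr = mean_reciprocal_rank([1, 3, None])
```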