What I built is a simple information retrieval system using a pretrained BERT model and Elasticsearch. Elasticsearch recently announced text similarity search with vector fields in a blog post. The idea is to convert each text into a fixed-length vector and save it in an Elasticsearch index; at query time, the cosine similarity metric is used to find the most similar content in the index. That is the overall workflow of the system.
I only had to connect a couple of things to make it work, and I will explain how I did it.
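To make the metric concrete, here is a minimal NumPy sketch of cosine similarity between two dense vectors; Elasticsearch computes the same quantity server-side over the stored vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the two vectors
    # divided by the product of their L2 norms.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, so higher scores mean more similar texts.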
- Clone the repo https://github.com/kelvin-jose/elasticbert
I created a minimal corpus manually by copying some titles and abstracts from researchgate.net; you can find it in a file named example.csv inside the elastic/ folder. A search term is run against this corpus to retrieve the most similar abstracts from the index. Basically, it does the job of a search engine.
1. Download the pre-trained BERT model.
wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
unzip cased_L-12_H-768_A-12.zip
cp -r cased_L-12_H-768_A-12 bert/model
2. Set up the BERT docker image.
docker build -t bert-server .
3. Set up the Elasticsearch docker image.
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.5.2
4. Start docker containers.
docker run -d --net="host" bert-server
docker run -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.5.2
- Make sure both containers are up and running by checking docker ps.
5. Install dependencies.
pip install argparse
- Note: argparse is part of the Python standard library since 3.2, so this install is usually a no-op.
pip install elasticsearch
pip install bert-serving-client
6. Create the Elasticsearch index.
python3 elastic/create_index.py --index researchgate --config elastic/index_config.json
- The create_index.py script creates an index in Elasticsearch.
- The --index and --config arguments specify the name of the Elasticsearch index and the schema of the target index, respectively.
- You can verify the index by checking http://127.0.0.1:9200/researchgate
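For reference, the mapping in index_config.json needs something like the following so that Elasticsearch stores the 768-dimensional BERT vectors as a dense_vector field. The field names here are my own assumptions; check the file in the repo for the exact schema:

```json
{
  "mappings": {
    "properties": {
      "title": {"type": "text"},
      "abstract": {"type": "text"},
      "text_vector": {"type": "dense_vector", "dims": 768}
    }
  }
}
```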
7. Create the documents.
python3 elastic/create_document.py --index researchgate --csv elastic/example.csv --output example.json1
- This script writes an example.json1 file in the format Elasticsearch prescribes, which will then be indexed in the next step.
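To illustrate that format, here is a minimal sketch of how CSV rows plus their BERT vectors could be turned into bulk-style JSON lines. The function and field names are my own for illustration, not necessarily those used by create_document.py:

```python
import json

def to_bulk_lines(rows, index_name):
    # Convert (title, abstract, vector) rows into Elasticsearch
    # bulk format: an action line followed by a source line.
    lines = []
    for title, abstract, vector in rows:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps({
            "title": title,
            "abstract": abstract,
            "text_vector": vector,
        }))
    return lines
```

Each document thus occupies two lines in the output file, ready to be fed to the bulk indexing step.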
8. Index the documents.
python3 elastic/index_documents.py --data example.json1
- This script indexes the documents into Elasticsearch.
- Verify it by checking http://127.0.0.1:9200/researchgate/_search
9. Test the engine.
- It returns the top match for the target query. The query is hard-coded as “machine learning” on line number 13, and you are free to change it.
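Under the hood, searching against dense vectors in Elasticsearch 7.x uses a script_score query with the built-in cosineSimilarity function. A sketch of how such a request body can be built (the text_vector field name is an assumption matching the mapping above, not necessarily the repo's):

```python
def build_query(query_vector, size=1):
    # script_score query ranking documents by cosine similarity
    # between the stored vector and the query vector. The "+ 1.0"
    # shifts scores to be non-negative, which script_score requires.
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, "
                              "doc['text_vector']) + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }
```

The returned dict can be passed as the body of an elasticsearch-py search call against the researchgate index, with the query vector produced by the BERT server.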
I hope this can be scaled up to serve millions of records at blazing speed. All we need to do is build a larger corpus and add more configuration to the Elasticsearch schema, such as clusters and shards.
Have fun guys.