Export an Entire Pandas DataFrame as a Document to Elasticsearch
I have long been searching for a way to export an entire Pandas DataFrame to Elasticsearch. I had tried my hand at it using the helper function described in this post. A problem with that method, however, is that it is difficult to get the DataFrame back intact: you can only extract one row at a time, which doesn’t bode well for tasks where you need to scan the entire index. So I started looking for a way to retrieve the whole DataFrame at once, instead of tediously pulling each row individually.
I recently stumbled upon a library called espandas, and it was a complete game-changer for me. A huge shout-out to David Shaub for creating this library.
First of all, for any of your Elasticsearch-based solutions to work, you need to have Elasticsearch running in the background.
Make sure you have Elasticsearch downloaded on your system, then start it from your terminal by navigating to the folder where you have it installed and running:
./bin/elasticsearch
(On Windows, use bin\elasticsearch.bat instead.) That will get it up and running, and as confirmation, Elasticsearch prints its startup log to your terminal.
Now you are all set to go.
Once you have Elasticsearch running in the background, import the Elasticsearch client and helpers into your notebook like so:
import numpy as np
import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch import helpers
Begin an Elasticsearch session as follows:
es = Elasticsearch(http_compress=True)
The argument http_compress enables compression of HTTP traffic; I have enabled it by setting it to True. Now your session is active.
For this demo I used the FAQs from the Uber website. I did some web scraping, split the text into questions and answers, and saved them in a DataFrame with two columns: 'Question' and 'Answer'.
In order for espandas to do its job, you need to add a unique identifier column with the name indexId. Let us create this column:
data['indexId'] = 'ind' + data.index.astype(str)
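If you want to follow along without scraping anything, here is a minimal sketch with made-up questions and answers (the data is purely illustrative, not the actual Uber FAQ); it builds the two-column DataFrame and adds the indexId column exactly as above:

```python
import pandas as pd

# Hypothetical stand-in for the scraped FAQ data
data = pd.DataFrame({
    'Question': ['How do I request a ride?', 'How do I pay?'],
    'Answer': ['Open the app and enter a destination.',
               'Add a payment method in the app.'],
})

# Unique identifier column required by espandas:
# 'ind' + the row's positional index, e.g. ind0, ind1, ...
data['indexId'] = 'ind' + data.index.astype(str)

print(data['indexId'].tolist())  # ['ind0', 'ind1']
```

Because the identifiers are derived from the row index, you can regenerate the same keys later when you want to read the rows back.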
Now you are good to go: create an Espandas instance and push your data into Elasticsearch with just a few lines of code:
from espandas import Espandas

INDEX = 'uber_faq'
TYPE = '_doc'
esp = Espandas()
esp.es_write(data, INDEX, TYPE)
Keep in mind that in order to save a document in Elasticsearch, you need to identify it with a unique index name and also specify the document type. I have indexed my document under the name uber_faq and given it the _doc type (I wasn’t really sure which document type to use, so I went with the default).
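Conceptually, a bulk write to Elasticsearch boils down to one action dictionary per row, keyed by index, type, and document id. The sketch below is my own illustration of that shape (not espandas internals); it builds such actions from the DataFrame without contacting a cluster:

```python
import pandas as pd

# Toy frame with the unique-identifier column espandas expects
data = pd.DataFrame({'Question': ['q1', 'q2'], 'Answer': ['a1', 'a2']})
data['indexId'] = 'ind' + data.index.astype(str)

INDEX = 'uber_faq'
TYPE = '_doc'

# One bulk-index action per row; _id comes from indexId, so each
# document can later be fetched back by the same key
actions = [
    {'_index': INDEX, '_type': TYPE, '_id': row['indexId'],
     '_source': row.to_dict()}
    for _, row in data.iterrows()
]

print(actions[0]['_id'])  # ind0
```

A list like this is what you would hand to `helpers.bulk(es, actions)` if you were doing the ingestion by hand instead of through espandas.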
Congratulations! Your DataFrame has been successfully ingested into Elasticsearch. If you want to verify or retrieve your saved data, use the es_read method. Here I am extracting the first 10 records:
k = ['ind' + str(i) for i in range(10)]
res = esp.es_read(k, INDEX, TYPE)
This returns the first 10 question-and-answer pairs as a DataFrame.
And that’s it! Now you can push entire Pandas DataFrames into Elasticsearch and retrieve them whole as well!