Getting Started with Elasticsearch in Python

Abhimanyu Singh
4 min read · Sep 5, 2018


While working with Elasticsearch and Python for the past few weeks, I had a hard time figuring out how to get raw data out of the Elastic cluster and dump it into pandas. In this post I will describe several ways to pull Elastic cluster data into Python as a DataFrame. Once we have the raw data, we can do any kind of analysis, including analysis that is limited in Kibana.

The time it takes to download the data will vary with the interval for which we are downloading it. Since queries can return a huge number of hits, it is very important to filter the query and download only the data that matters for our analysis. We can add date and time filters to narrow the download. In my case the indices are named by date, so each index holds one day of data and I don't have to worry about a date filter; a time filter alone is enough before parsing the data.

There are different ways of downloading the data depending on the need, so I will explain each of them and you can use whichever fits your situation.

1.) The first and most important step is to import all the required libraries.

2.) There are several ways to define the Elastic cluster; I have used the one below.

3.) Once the cluster is defined, I can find out what indices it holds.

4.) Elasticsearch by default gives us only 10 hits, and if you need more than that, you have to declare the size in the query. So, let's go through the various ways to get the data:
a.) If you just want a quick look at all the variables, you can use the code below. Since we are not defining the size, we will get only 10 hits.

b.) If you want more than 10 hits, the code below fetches up to 10,000 in a single request. Sometimes you will get a request timeout error because of the network; in that case you can change the size to 9,999 or less.

c.) Both of the methods above fetch data without any filtering. What if we need to put filters in the query?
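One way to express a time filter is a `bool` query with a `range` clause; the `@timestamp` field name and the one-hour window below are assumptions, so adjust them to your mapping. The body is passed to `es.search` exactly as before:

```python
# A filtered query: only documents whose @timestamp falls inside the
# window are returned. Field name and window are example values.
query = {
    "size": 10000,
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {
                    "gte": "2018-09-05T00:00:00",
                    "lt": "2018-09-05T01:00:00",
                }}}
            ]
        }
    },
}
```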

5.a.) Getting just 10,000 hits won't always help, as we need data for a whole time period to do our analysis. In that case we have the scan and search method, through which we can get any amount of data. We have to be very cautious while running it, because it will dump all the matching data regardless of quantity, whether it's millions or billions of documents.

b.) After getting done with the above, we can start loading the data into pandas.
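A sketch of that conversion. The `hits` list below is a tiny stand-in for what the scan step returns, so the snippet runs on its own; each item carries the usual Elasticsearch envelope, with the document itself under `"_source"`:

```python
import pandas as pd

# Stand-in for the list of hits returned by the scan step.
hits = [
    {"_index": "logs-2018.09.05", "_id": "1",
     "_source": {"status": 200, "bytes": 512}},
    {"_index": "logs-2018.09.05", "_id": "2",
     "_source": {"status": 404, "bytes": 0}},
]

# Keep only the document bodies and build the DataFrame.
result_df = pd.DataFrame([hit["_source"] for hit in hits])
print(result_df.shape)  # → (2, 2)
```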

6.) After you are done creating the final DataFrame, which we are calling result_df, you can dump all the data into a CSV file or any other destination. Here I have written it to CSV:
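The write itself is a one-liner with pandas; a tiny stand-in frame is used below so the snippet runs on its own, and the filename is a made-up example:

```python
import pandas as pd

# Stand-in for the result_df built in the previous step.
result_df = pd.DataFrame({"status": [200, 404], "bytes": [512, 0]})

# index=False keeps pandas' row numbers out of the file.
result_df.to_csv("elastic_dump.csv", index=False)
```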

7.) Precautions:

a.) You will not get the data back in sorted order.

b.) Be very cautious while downloading if your Elastic cluster holds a huge amount of data.

c.) Never dump all the data with the scan method without a filter, as it will return the whole data set, whether it's in the millions or billions of documents.

I have just started my career as a Data Scientist and am learning every day, so please bear with any typos or small mistakes.
