Elasticsearch Tutorial Part 3: Ingesting Documents

Abhishek Bairagi
5 min read · Dec 29, 2023


In the last blog, we got Elasticsearch up and running and even created our own ‘library’ index. Now, let’s get hands-on with the real action: ingesting documents into an Elasticsearch index. This tutorial is all about putting some books into our ‘library’ so we can search and explore them. Let’s dive in and see how it’s done!

A quick recap of the mapping we created in the last tutorial:

 "mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "standard"
},
"author": {
"type": "text",
"analyzer": "standard"
},
"description": {
"type": "text",
"analyzer": "standard"
},
"published_date": {
"type": "date"
},
"purchase_url": {
"type": "keyword"
}
}
}

To ingest our documents, we start with JSON data containing values for the fields we defined in the mapping (title, author, etc.). Note that the field names must match the mapping exactly, so the link goes into purchase_url. Here is one sample document:

{
  "title": "The Catcher in the Rye",
  "author": "J.D. Salinger",
  "description": "A classic novel about teenage angst.",
  "published_date": "1951-07-16",
  "purchase_url": "https://example.com/book/catcher-in-the-rye"
}

Once you have the data, there are two ways of ingesting documents:

  1. Single Document Ingestion: Send individual documents to Elasticsearch one at a time.
  2. Bulk Ingestion: Send multiple documents in a single request for improved efficiency.

Single Document Ingestion

To ingest a single document, send a POST request to the Elasticsearch endpoint, typically using the /index_name/_doc path. Hence our URL becomes http://localhost:9200/library/_doc, and our body will be the sample document shown above. Let’s see a curl example:

curl -X POST "http://localhost:9200/library/_doc" -H 'Content-Type: application/json' -d '
{
  "title": "The Catcher in the Rye",
  "author": "J.D. Salinger",
  "description": "A classic novel about teenage angst.",
  "published_date": "1951-07-16",
  "purchase_url": "https://example.com/book/catcher-in-the-rye"
}'
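
If the request succeeds, Elasticsearch acknowledges it with a small JSON response. The exact fields vary a little between versions, and the auto-generated _id shown here is just a placeholder:

{
  "_index": "library",
  "_id": "x1b2c3...",
  "_version": 1,
  "result": "created",
  "_shards": {"total": 2, "successful": 1, "failed": 0}
}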

You can also do it using Python:

from elasticsearch import Elasticsearch

# Connect to Elasticsearch
es = Elasticsearch("http://localhost:9200")

# Define the document
document = {
    "title": "The Catcher in the Rye",
    "author": "J.D. Salinger",
    "description": "A classic novel about teenage angst.",
    "published_date": "1951-07-16",
    "purchase_url": "https://example.com/book/catcher-in-the-rye"
}

# Index the document without specifying an ID (Elasticsearch generates one)
index_name = 'library'
result = es.index(index=index_name, body=document)
print(result['result'])  # "created"

What if, for one of the documents, I don’t have a value for one of the fields, let’s say author?

That’s okay: Elasticsearch won’t stop you from ingesting it. Just make sure that your data doesn’t contain any extra fields that weren’t defined in the mapping, and that the field names have no spelling or casing mistakes.
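
For example, here is a minimal sketch, reusing the es client from the snippet above, that indexes a book with no author field (the book details are illustrative):

# Document without an "author" field; Elasticsearch accepts it as-is
document_no_author = {
    "title": "Beowulf",
    "description": "An Old English epic poem of unknown authorship.",
    "purchase_url": "https://example.com/book/beowulf"
}

result = es.index(index='library', body=document_no_author)
print(result['result'])  # "created", even though "author" is missing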

Bulk Ingestion

To get started with bulk ingestion, we first need to prepare our data. The bulk API expects newline-delimited JSON (NDJSON), not a JSON array: each document is preceded by an action line telling Elasticsearch what to do with it, every object sits on its own line, and the body must end with a newline. Let’s take a look at four captivating books we’re adding to our library:

{"index": {"_index": "library"}}
{"title": "The Catcher in the Rye", "author": "J.D. Salinger", "description": "A classic novel about teenage angst.", "published_date": "1951-07-16", "purchase_url": "https://example.com/book/catcher-in-the-rye"}
{"index": {"_index": "library"}}
{"title": "To Kill a Mockingbird", "author": "Harper Lee", "description": "A powerful exploration of racial injustice.", "published_date": "1960-07-11", "purchase_url": "https://example.com/book/to-kill-a-mockingbird"}
{"index": {"_index": "library"}}
{"title": "1984", "author": "George Orwell", "description": "A dystopian vision of the future.", "published_date": "1949-06-08", "purchase_url": "https://example.com/book/1984"}
{"index": {"_index": "library"}}
{"title": "The Great Gatsby", "author": "F. Scott Fitzgerald", "description": "An American classic capturing the Roaring Twenties.", "published_date": "1925-04-10", "purchase_url": "https://example.com/book/the-great-gatsby"}

Now that our data is ready, let’s proceed with bulk ingestion. Using curl or the Elasticsearch Python client, we can efficiently send these documents to Elasticsearch in a single request.

You need to send a POST request to localhost:9200/index_name/_bulk (or just localhost:9200/_bulk, since each action line already names the index) with the above data as the body and the Content-Type header set to application/x-ndjson.

Here’s how you do it using curl:

curl -X POST "http://localhost:9200/library/_bulk" -H 'Content-Type: application/x-ndjson' --data-binary '{"index": {"_index": "library"}}
{"title": "The Catcher in the Rye", "author": "J.D. Salinger", "description": "A classic novel about teenage angst.", "published_date": "1951-07-16", "purchase_url": "https://example.com/book/catcher-in-the-rye"}
{"index": {"_index": "library"}}
{"title": "To Kill a Mockingbird", "author": "Harper Lee", "description": "A powerful exploration of racial injustice.", "published_date": "1960-07-11", "purchase_url": "https://example.com/book/to-kill-a-mockingbird"}
{"index": {"_index": "library"}}
{"title": "1984", "author": "George Orwell", "description": "A dystopian vision of the future.", "published_date": "1949-06-08", "purchase_url": "https://example.com/book/1984"}
{"index": {"_index": "library"}}
{"title": "The Great Gatsby", "author": "F. Scott Fitzgerald", "description": "An American classic capturing the Roaring Twenties.", "published_date": "1925-04-10", "purchase_url": "https://example.com/book/the-great-gatsby"}
'
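
If everything went well, the response reports "errors": false and contains one entry per document in items; trimmed down, it looks roughly like this:

{
  "took": 12,
  "errors": false,
  "items": [
    {"index": {"_index": "library", "_id": "...", "result": "created", "status": 201}},
    ...
  ]
}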

And here’s how you do it using Python; the client’s bulk helper takes care of the NDJSON formatting for us:

from elasticsearch import Elasticsearch, helpers

# Connect to Elasticsearch
es = Elasticsearch("http://localhost:9200")

# Define the documents as plain Python dicts
books = [
    {"title": "The Catcher in the Rye", "author": "J.D. Salinger", "description": "A classic novel about teenage angst.", "published_date": "1951-07-16", "purchase_url": "https://example.com/book/catcher-in-the-rye"},
    {"title": "To Kill a Mockingbird", "author": "Harper Lee", "description": "A powerful exploration of racial injustice.", "published_date": "1960-07-11", "purchase_url": "https://example.com/book/to-kill-a-mockingbird"},
    {"title": "1984", "author": "George Orwell", "description": "A dystopian vision of the future.", "published_date": "1949-06-08", "purchase_url": "https://example.com/book/1984"},
    {"title": "The Great Gatsby", "author": "F. Scott Fitzgerald", "description": "An American classic capturing the Roaring Twenties.", "published_date": "1925-04-10", "purchase_url": "https://example.com/book/the-great-gatsby"}
]

# helpers.bulk builds the action/document NDJSON pairs for us
actions = [{"_index": "library", "_source": book} for book in books]
success_count, errors = helpers.bulk(es, actions)

print(f"{success_count} documents ingested successfully into the 'library' index.")

So this is how you ingest structured data into an Elasticsearch index.

But what do you do if your data is in some other structure, or is unstructured?

Usually, when your data is in some other structure, for example an Excel file, you define a mapping based on your needs, create an Elasticsearch index, and then use a bit of code to convert each row of the file into a JSON document and ingest it.
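
As a minimal sketch, assuming a hypothetical books.xlsx whose columns match our mapping’s field names (pandas is an extra dependency here, not something the tutorial has set up), you could turn each row into a document and reuse the bulk helper:

import pandas as pd
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# books.xlsx is a hypothetical file with columns: title, author,
# description, published_date, purchase_url
df = pd.read_excel("books.xlsx")

# Each row becomes one JSON document
actions = [
    {"_index": "library", "_source": row.to_dict()}
    for _, row in df.iterrows()
]

success_count, _ = helpers.bulk(es, actions)
print(f"Ingested {success_count} rows from books.xlsx")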

In the case of unstructured data, like data scraped from a website, you can define a basic mapping with fields like title and content, and then populate the data.
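
For instance, here is a minimal sketch of that idea; the page URL, the parsing logic, and the “articles” index (with text fields title and content) are all illustrative assumptions:

import requests
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical page to scrape
resp = requests.get("https://example.com/some-article")
soup = BeautifulSoup(resp.text, "html.parser")

doc = {
    "title": soup.title.string if soup.title else "",
    "content": soup.get_text(separator=" ", strip=True),
}

# Assumes an "articles" index with text fields "title" and "content"
es.index(index="articles", body=doc)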

Awesome! Now we have five books in our library. Good to see we have come this far. This blog ends here; in the next part, we will see how to search documents in Elasticsearch. See you in the fourth part!

Update!! Part 4 is here: Elasticsearch Tutorials Part 4
