Search text from PDF files stored in an S3 bucket

Mixpeek
Jul 27, 2022

Does your application allow users to upload PDFs? Maybe they upload resumes, waivers, agreements or signed documents. What if they need to search the contents of these PDFs?

As a developer, you have 3 options:

  1. Search by Filename: Lookup by key/value like filename [Native]
  2. Search by Metadata: Store the metadata in a separate database to perform queries [Database add-on]
  3. Full-Text Search: Extract the contents into a search engine [OCR, Database, Search add-on]

Full-text search provides the most intuitive user experience, but it’s also the most challenging to build, maintain, and enhance.

PDF -> ML -> Database -> Search Engine -> API

In this tutorial, we’ll walk you through best practices for PDF file upload, content extraction via OCR (Optical Character Recognition), and searching, so you can add full-text PDF search to your application with ease.

Bonus: A GitHub repository is linked at the end so you can import the code directly into your application.

Store the file

First we need a function to download the file locally in order to run our OCR extraction logic:

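import boto3

# Replace the placeholder credentials and region with your own values
s3_client = boto3.client(
    's3',
    aws_access_key_id='aws_access_key_id',
    aws_secret_access_key='aws_secret_access_key',
    region_name='region_name'
)

# bucket_name and s3_file_name are assumed to be defined elsewhere
# Download the object into a local file with the same name
with open(s3_file_name, 'wb') as file:
    s3_client.download_fileobj(
        bucket_name,
        s3_file_name,
        file
    )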
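Since the text above asks for a function, here is a minimal sketch of one. The name `download_pdf` and its `dest_dir` parameter are our own, not part of boto3. It assumes `s3_client` is a boto3 S3 client like the one created above, and it keeps only the last segment of the key so that keys containing `/` still map to a valid local filename.

```python
import tempfile
from pathlib import Path, PurePosixPath


def download_pdf(s3_client, bucket_name, s3_key, dest_dir=None):
    """Download one S3 object to a local file and return the local path.

    s3_client is assumed to be a boto3 S3 client; any object exposing
    download_fileobj(bucket, key, fileobj) works the same way.
    """
    # Use the given directory, or create a fresh temporary one
    dest = Path(dest_dir) if dest_dir else Path(tempfile.mkdtemp(prefix="pdf-search-"))
    # S3 keys may look like "uploads/2022/resume.pdf"; keep only "resume.pdf"
    local_path = dest / PurePosixPath(s3_key).name
    with open(local_path, "wb") as file:
        s3_client.download_fileobj(bucket_name, s3_key, file)
    return str(local_path)
```

Calling `download_pdf(s3_client, bucket_name, s3_file_name)` would then return the path to hand to the extraction step.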

Extract the contents

We’ll use the open-source Apache Tika library. Its AutoDetectParser class detects the file type and extracts the text, and it can run OCR (optical character recognition) on scanned pages when Tesseract is installed:

from tika import parser

parsed_pdf_content = parser.from_file(s3_file_name)['content']
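The `content` value can be `None` when Tika finds no text, and extracted text often carries long runs of blank lines. A small cleanup step before indexing might look like the sketch below. The helper name `clean_tika_content` is hypothetical, and it only assumes the result is a dict with an optional `content` key.

```python
import re


def clean_tika_content(parsed):
    """Return normalized text from a Tika result dict, or "" if there is none."""
    # 'content' may be missing or None when no text could be extracted
    content = parsed.get("content") or ""
    # Collapse every run of whitespace (including newlines) into one space
    return re.sub(r"\s+", " ", content).strip()


print(clean_tika_content({"content": "\n\n  Take two\n\ntablets  daily \n"}))
```

In the pipeline this would wrap the call above: `clean_tika_content(parser.from_file(s3_file_name))`.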

Insert contents into a search engine

We’re using a self-managed OpenSearch node here, but you can use Lucene, Solr, Elasticsearch or Atlas Search.

Note: if you don’t have OpenSearch installed locally, install it first (here via Homebrew), then run it:

brew update
brew install opensearch
opensearch

OpenSearch will now be accessible here: http://localhost:9200. Let’s build the index and insert the file contents:

from opensearchpy import OpenSearch

os = OpenSearch("http://localhost:9200")
index_name = "pdf-search"

# The document holds the filename and the extracted text
doc = {
    "filename": s3_file_name,
    "parsed_pdf_content": parsed_pdf_content
}

# refresh=True makes the document searchable immediately
response = os.index(
    index=index_name,
    body=doc,
    id=1,
    refresh=True
)
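With a fixed `id=1`, every new file overwrites the previous document. One possible way around that, sketched here under our own assumptions, is to derive the id from the filename. The helper `build_index_request` is hypothetical; it only builds the keyword arguments that the `index()` call above already uses.

```python
import hashlib


def build_index_request(index_name, s3_file_name, parsed_pdf_content):
    """Build keyword arguments for the index() call, with one id per file."""
    # Same filename -> same id, so re-indexing a file updates it in place
    doc_id = hashlib.sha1(s3_file_name.encode("utf-8")).hexdigest()
    return {
        "index": index_name,
        "id": doc_id,
        "body": {
            "filename": s3_file_name,
            "parsed_pdf_content": parsed_pdf_content,
        },
        "refresh": True,
    }


request_args = build_index_request("pdf-search", "prescription.pdf", "SEARCH_TERM")
print(request_args["id"])
```

The result can be passed straight to the client as `os.index(**request_args)`.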

Creating a PDF search API

We’ll use Flask to create a microservice that searches terms:

from flask import Flask, jsonify, request
from opensearchpy import OpenSearch
from config import *  # assumed to provide index_name

app = Flask(__name__)
os = OpenSearch("http://localhost:9200/")


@app.route('/search', methods=['GET'])
def search_file():
    # search term from the request's query string
    query = request.args.get('q', default=None, type=str)
    # query payload in JSON for OpenSearch
    payload = {
        'query': {
            'match': {
                'parsed_pdf_content': query
            }
        }
    }
    # run search query
    response = os.search(
        body=payload,
        index=index_name
    )
    return jsonify(response)


if __name__ == '__main__':
    app.run(host="localhost", port=5011, debug=True)
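The route above passes `q` to OpenSearch even when it is missing, so `query` can be `None`. A small guard, sketched below, builds the same `match` payload but rejects empty input first. The function name `build_search_payload` and the default `size` of 10 are our own choices.

```python
def build_search_payload(query, field="parsed_pdf_content", size=10):
    """Build a match-query payload, rejecting missing or blank search terms."""
    if query is None or not query.strip():
        raise ValueError("missing search term: pass ?q=...")
    return {
        "size": size,  # maximum number of hits to return
        "query": {"match": {field: query.strip()}},
    }


print(build_search_payload("  prescription "))
```

Inside the route, the `ValueError` could be caught and turned into a 400 response, for example `return jsonify(error=str(exc)), 400`.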

Now we can call the API via:

GET: http://localhost:5011/search?q=SEARCH_TERM

{
  "_shards": {
    "failed": 0,
    "skipped": 0,
    "successful": 1,
    "total": 1
  },
  "hits": {
    "hits": [
      {
        "_id": "1",
        "_index": "pdf-search",
        "_score": 0.29289162,
        "_source": {
          "filename": "prescription.pdf",
          "parsed_pdf_content": "SEARCH_TERM"
        }
      }
    ],
    "max_score": 0.29289162,
    "total": {
      "relation": "eq",
      "value": 1
    }
  },
  "timed_out": false,
  "took": 40
}
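Most callers only need the matching filenames and their scores, not the full response. As a rough sketch, the hypothetical helper below (`summarize_hits` is not part of any library) reduces a response shaped like the one above to a short list.

```python
def summarize_hits(response):
    """Reduce a search response to a list of {filename, score} dicts."""
    hits = response.get("hits", {}).get("hits", [])
    return [
        {"filename": hit["_source"]["filename"], "score": hit["_score"]}
        for hit in hits
    ]


# Trimmed copy of the sample response shown above
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "1",
                "_index": "pdf-search",
                "_score": 0.29289162,
                "_source": {
                    "filename": "prescription.pdf",
                    "parsed_pdf_content": "SEARCH_TERM",
                },
            }
        ],
        "max_score": 0.29289162,
        "total": {"relation": "eq", "value": 1},
    },
    "timed_out": False,
    "took": 40,
}

print(summarize_hits(sample_response))
```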

Woohoo, we did it! We’ve successfully created an API that offers full-text PDF search.

You can download the repo here: https://github.com/mixpeek/pdf-search-s3

So what’s next?

  • Queuing: Ensuring concurrent file uploads are not dropped
  • Security: Adding end-to-end encryption to the data pipeline
  • Enhancements: Including more features like fuzzy matching, highlighting and autocomplete
  • Rate Limiting: Building thresholds so users don’t abuse the system
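To give a feel for the enhancements item, here is a hedged sketch of a query payload that adds fuzzy matching and highlighting. It uses the `fuzziness` option of the `match` query and a top-level `highlight` section. The function name `build_enhanced_payload` is our own, and autocomplete is left out because it usually needs changes to the index mapping.

```python
def build_enhanced_payload(query, field="parsed_pdf_content"):
    """Build a match query that tolerates typos and returns highlighted snippets."""
    return {
        "query": {
            "match": {
                field: {
                    "query": query,
                    # "AUTO" scales the allowed edit distance with term length
                    "fuzziness": "AUTO",
                }
            }
        },
        # Ask for highlighted fragments of the matched field
        "highlight": {"fields": {field: {}}},
    }


print(build_enhanced_payload("prescripton"))
```

The payload can be sent with the same `os.search(body=..., index=index_name)` call used in the API. Each hit should then carry a `highlight` entry next to `_source`.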

Everything collapsed into just 2 API calls

If this feels like too much for you to build, maintain, and enhance, Mixpeek has you covered.

Upload

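import requests

url = "https://api.mixpeek.com/upload"

# multipart form field: (filename, file object, content type)
with open('FILE_NAME.pdf', 'rb') as pdf_file:
    files = [
        ('file', ('FILE_NAME.pdf', pdf_file, 'application/pdf'))
    ]
    response = requests.request("POST", url, files=files)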

Search

import requests

url = "https://api.mixpeek.com/search?q=SEARCH_QUERY"
response = requests.request("GET", url)
print(response.text)

If this was helpful to you, please upvote this answer on StackOverflow.
