Search text from PDF files stored in an S3 bucket

Mixpeek
3 min read · Jul 27, 2022

Does your application allow users to upload PDFs? Maybe they upload resumes, waivers, agreements or signed documents. What if they need to search the contents of these PDFs?

As a developer, you have three options:

  1. Search by filename: Look up files by key/value, such as the filename [Native]
  2. Search by metadata: Store the metadata in a separate database and query it [Database add-on]
  3. Full-text search: Extract the contents into a search engine [OCR, Database, Search add-on]

Full-text search provides the most intuitive user experience, but it’s also the most challenging to build, maintain, and enhance.

PDF -> ML -> Database -> Search Engine -> API

In this tutorial, we’ll walk you through best practices for PDF file upload, content extraction via OCR (Optical Character Recognition), and searching, so you can add full-text PDF search to your application with ease.

Bonus: At the end, you’ll find a link to a GitHub repository so you can import the code directly into your application.

Store the file

First, we need to download the file locally so that we can run our OCR extraction logic on it:

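import boto3

# Placeholder values: replace with your own bucket name and object key
bucket_name = 'bucket_name'
s3_file_name = 'file_name.pdf'

s3_client = boto3.client(
    's3',
    aws_access_key_id='aws_access_key_id',
    aws_secret_access_key='aws_secret_access_key',
    region_name='region_name'
)

# Download the object into a local file with the same name
with open(s3_file_name, 'wb') as file:
    s3_client.download_fileobj(
        bucket_name,
        s3_file_name,
        file
    )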
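The download snippet opens a local file named after the S3 key. S3 keys often contain slashes (for example `resumes/2022/jane.pdf`), in which case `open()` fails unless the matching local directories exist. Here is a minimal sketch of one way around that, assuming a hypothetical helper named `local_path_for_key` and a flat download directory; neither is part of the original code:

```python
import os
import tempfile


def local_path_for_key(s3_key, download_dir=None):
    """Map an S3 object key to a flat local file path (hypothetical helper)."""
    # Fall back to the system temp directory when none is given
    download_dir = download_dir or tempfile.gettempdir()
    # Keep only the last component of the key, e.g. "a/b/c.pdf" -> "c.pdf"
    filename = s3_key.rstrip('/').split('/')[-1]
    if not filename:
        raise ValueError(f'cannot derive a filename from key: {s3_key!r}')
    return os.path.join(download_dir, filename)
```

You could then open `local_path_for_key(s3_file_name)` for writing and pass the same path to the extraction step. Note that two keys with the same final component would map to the same local file, so this only suits simple cases.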

Extract the contents

We’ll use the open-source Apache Tika library. Its AutoDetectParser class detects the file type and extracts the text; for scanned PDFs, Tika can apply OCR (optical character recognition) through Tesseract when it is installed:

from tika import parser

parsed_pdf_content = parser.from_file(s3_file_name)['content']
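Extracted text tends to be full of stray newlines, and the `content` value can be `None` when no text was found in the file. A small sketch of a cleanup step, assuming a hypothetical helper named `clean_extracted_text` that you would run before indexing:

```python
import re


def clean_extracted_text(content):
    """Normalize text returned by the extraction step (hypothetical helper)."""
    # Treat a missing result as an empty document
    if content is None:
        return ''
    # Collapse runs of whitespace, including newlines, into single spaces
    return re.sub(r'\s+', ' ', content).strip()
```

For example, `clean_extracted_text(parsed_pdf_content)` would give you a single-line string that is safe to index even when extraction returned nothing.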

Insert contents into a search engine

We’re using a self-managed OpenSearch node here, but you can use Lucene, Solr, Elasticsearch, or Atlas Search.

Note: if you don’t have OpenSearch installed locally, install it first and then run it:

brew update
brew install opensearch
opensearch

OpenSearch will now be accessible here: http://localhost:9200. Let’s build the index and insert the file contents:

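from opensearchpy import OpenSearch

os = OpenSearch("http://localhost:9200")
index_name = "pdf-search"

# Document to index: the file name plus its extracted text
doc = {
    "filename": s3_file_name,
    "parsed_pdf_content": parsed_pdf_content
}

# Index the document; refresh=True makes it searchable right away
response = os.index(
    index=index_name,
    body=doc,
    id=1,
    refresh=True
)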
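The indexing call uses a fixed `id=1`, so indexing a second file would overwrite the first. One possible approach, sketched here with a hypothetical helper named `document_id`, is to derive a stable ID from the bucket name and object key:

```python
import hashlib


def document_id(bucket_name, s3_key):
    """Build a stable document ID from the bucket and object key (hypothetical helper)."""
    # Hash "bucket/key" so the ID is a fixed-length hex string
    raw = f'{bucket_name}/{s3_key}'.encode('utf-8')
    return hashlib.sha256(raw).hexdigest()
```

Passing `id=document_id(bucket_name, s3_file_name)` to the index call would mean that re-indexing the same file replaces its own document, while different files get different IDs.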

Creating a PDF search API

We’ll use Flask to create a microservice that searches for terms:

from flask import Flask, jsonify, request
from opensearchpy import OpenSearch
from config import *  # assumed to define index_name

app = Flask(__name__)
os = OpenSearch("http://localhost:9200/")


@app.route('/search', methods=['GET'])
def search_file():
    # search term from the "q" query-string parameter
    query = request.args.get('q', default=None, type=str)
    if not query:
        # a match query needs a value, so reject requests without one
        return jsonify({'error': 'missing query parameter "q"'}), 400

    # query payload in JSON for OpenSearch
    payload = {
        'query': {
            'match': {
                'parsed_pdf_content': query
            }
        }
    }

    # run search query
    response = os.search(
        body=payload,
        index=index_name
    )
    return jsonify(response)


if __name__ == '__main__':
    app.run(host="localhost", port=5011, debug=True)

Now we can call the API via:

GET: http://localhost:5011/search?q=SEARCH_TERM

{
  "_shards": {
    "failed": 0,
    "skipped": 0,
    "successful": 1,
    "total": 1
  },
  "hits": {
    "hits": [
      {
        "_id": "1",
        "_index": "pdf-search",
        "_score": 0.29289162,
        "_source": {
          "filename": "prescription.pdf",
          "parsed_pdf_content": "SEARCH_TERM"
        }
      }
    ],
    "max_score": 0.29289162,
    "total": {
      "relation": "eq",
      "value": 1
    }
  },
  "timed_out": false,
  "took": 40
}
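The raw response carries more detail than most clients need. As a rough sketch, a hypothetical helper named `summarize_hits` could reduce it to filename and score pairs; the sample dictionary below is a trimmed copy of the response shown above:

```python
def summarize_hits(response):
    """Return (filename, score) pairs from a search response dict (hypothetical helper)."""
    # Matching documents sit under response["hits"]["hits"]
    hits = response.get('hits', {}).get('hits', [])
    return [(hit['_source']['filename'], hit['_score']) for hit in hits]


# Trimmed copy of the sample response
sample = {
    'hits': {
        'hits': [
            {
                '_id': '1',
                '_index': 'pdf-search',
                '_score': 0.29289162,
                '_source': {
                    'filename': 'prescription.pdf',
                    'parsed_pdf_content': 'SEARCH_TERM',
                },
            }
        ]
    }
}

print(summarize_hits(sample))  # [('prescription.pdf', 0.29289162)]
```

Returning only these fields from the API would also avoid sending the full extracted text back with every hit.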

Whoo, we did it! We’ve successfully created an API that offers full-text PDF search.

You can download the repo here: https://github.com/mixpeek/pdf-search-s3

So what’s next?

  • Queuing: Ensuring concurrent file uploads are not dropped
  • Security: Adding end-to-end encryption to the data pipeline
  • Enhancements: Including more features like fuzzy matching, highlighting, and autocomplete
  • Rate Limiting: Building thresholds so users don’t abuse the system
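To give a feel for the enhancements item, here is a minimal sketch of a query body that adds fuzzy matching and highlighting on the `parsed_pdf_content` field. The helper name `build_search_payload` and its flags are assumptions for illustration, not part of the original code; `fuzziness` and `highlight` are standard options in the OpenSearch query DSL:

```python
def build_search_payload(query, fuzzy=True, highlight=True):
    """Build a search body for the parsed_pdf_content field (hypothetical helper)."""
    match_options = {'query': query}
    if fuzzy:
        # "AUTO" lets the engine choose an edit distance based on term length
        match_options['fuzziness'] = 'AUTO'

    payload = {'query': {'match': {'parsed_pdf_content': match_options}}}
    if highlight:
        # Ask for matching fragments of the field to be returned with each hit
        payload['highlight'] = {'fields': {'parsed_pdf_content': {}}}
    return payload
```

The result could be passed as the `body` argument of the search call in place of the plain match query, so a typo such as "perscription" would still have a chance of matching "prescription".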

Everything collapsed into just 2 API calls

If this feels like too much for you to build, maintain, and enhance, Mixpeek has you covered.

Upload

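import requests

url = "https://api.mixpeek.com/upload"

# Open the PDF in binary mode and close it once the upload finishes
with open('FILE_NAME.pdf', 'rb') as pdf_file:
    files = [
        ('file', ('FILE_NAME.pdf', pdf_file, 'application/pdf'))
    ]
    response = requests.request("POST", url, files=files)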

Search

import requests

url = "https://api.mixpeek.com/search?q=SEARCH_QUERY"
response = requests.request("GET", url)
print(response.text)

If this was helpful to you, please upvote this answer on StackOverflow.
