Build a PDF Text Extractor with FastAPI & MongoDB

Saverio Mazza
3 min read · Dec 11, 2023

To build a FastAPI application that extracts text from a PDF and saves it to MongoDB, you need two libraries besides FastAPI itself: PyPDF2 for PDF text extraction and pymongo for talking to MongoDB. The example also uses requests to download PDFs and loguru for logging. Let's create the complete code:

First, you’ll need to install the necessary packages:

pip install "fastapi[all]" PyPDF2 pymongo loguru requests

Then, here is the complete FastAPI application code:

from fastapi import FastAPI, Request, UploadFile, File
import PyPDF2
import pymongo
from loguru import logger
from io import BytesIO
import requests
import os

app = FastAPI()

# MongoDB setup
mongo_uri = os.getenv("MONGO_URI", "your_default_mongodb_uri")
client = pymongo.MongoClient(mongo_uri)
db = client.get_database("your_database")
collection = db.get_collection("your_collection")

@app.post("/process_pdf_url/")
async def process_pdf_url(request: Request):
    message = await request.json()
    logger.info(f"Received message: {message}")

    # Assuming the message contains the URL of the PDF file
    pdf_url = message.get("pdf_url")

    # Download and process the PDF
    return await process_pdf(pdf_url)

@app.post("/process_pdf_file/")
async def process_pdf_file(file: UploadFile = File(...)):
    # Save file locally for processing
    contents = await file.read()
    with open(file.filename, 'wb') as f:
        f.write(contents)

    # Process saved file
    return await process_pdf(file.filename, is_local_file=True)

async def process_pdf(pdf_source, is_local_file=False):
    # Open the PDF from a URL (downloaded into memory) or from a local file.
    # Note: requests.get is a blocking call.
    file = BytesIO(requests.get(pdf_source).content) if not is_local_file else open(pdf_source, 'rb')

    # Extract text from PDF (PyPDF2 3.x API: PdfReader, .pages, extract_text)
    pdf_reader = PyPDF2.PdfReader(file)
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text() or ""

    if is_local_file:
        file.close()

    # Save to MongoDB
    pdf_document = {"source": pdf_source, "text": text}
    collection.insert_one(pdf_document)

    return {"status": "Processing completed"}
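The upload endpoint above writes the file to the working directory under the client-supplied name and never deletes it. One possible alternative, sketched here with only the standard library, is to write the bytes to a uniquely named temporary file and remove it once processing is done. The helper name temporary_pdf is hypothetical and not part of the original code.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_pdf(contents: bytes):
    """Write uploaded bytes to a uniquely named temp file; delete it on exit.

    Hypothetical helper: the name and the ".pdf" suffix are illustrative.
    """
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        # mkstemp returns an open file descriptor; wrap it so it gets closed.
        with os.fdopen(fd, "wb") as f:
            f.write(contents)
        yield path
    finally:
        # Clean up even if processing raised an exception.
        if os.path.exists(path):
            os.remove(path)
```

Inside process_pdf_file, this could be used as `with temporary_pdf(contents) as path:` followed by `return await process_pdf(path, is_local_file=True)`. The "source" stored in MongoDB would then be the temporary path, so you may prefer to pass file.filename along separately.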
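To run the application in a container, save the code above as main.py, list the same packages in a requirements.txt file, and add this Dockerfile:

# Use an official Python runtime as a parent image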
FROM python:3.8-slim

# Set the working directory in the container
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install required packages
RUN pip install --no-cache-dir -r requirements.txt

# Install Uvicorn for running the application
RUN pip install uvicorn

# Make port 8080 available to the world outside this container
EXPOSE 8080

# Define environment variable
ENV PORT=8080

# Run uvicorn when the container launches
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

A few important notes about this code:

  1. MongoDB Connection: Set the MONGO_URI environment variable to your MongoDB URI (or replace the "your_default_mongodb_uri" fallback), and replace "your_database" and "your_collection" with your actual database and collection names.
  2. PDF Retrieval: The /process_pdf_url/ endpoint expects a JSON body with a "pdf_url" key and downloads the PDF from that URL. The /process_pdf_file/ endpoint accepts a direct file upload instead.
  3. Text Extraction: PyPDF2 extracts the text layer of each page. It can return incomplete or badly ordered text for PDFs with complex layouts, and it finds little or nothing in scanned documents, which contain page images rather than text and need OCR instead.
  4. Error Handling: This example lacks detailed error handling. In a production environment, you should add try-except blocks to manage potential errors in downloading the file, extracting text, or interacting with MongoDB.
  5. Logging: The loguru library is used for logging; the URL endpoint logs each incoming request body.
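To act on the text-extraction caveat in note 3, one option is to check how many pages came back empty before saving. The sketch below assumes you collect the per-page strings from the extraction loop into a list; the function names and the 0.5 threshold are hypothetical choices for illustration.

```python
from typing import List, Optional


def pages_without_text(page_texts: List[Optional[str]]) -> List[int]:
    """Return 1-based page numbers whose extracted text is empty or whitespace."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()
    ]


def probably_scanned(page_texts: List[Optional[str]], threshold: float = 0.5) -> bool:
    """Guess that a PDF is scanned when at least `threshold` of its pages are empty."""
    if not page_texts:
        return False
    empty_ratio = len(pages_without_text(page_texts)) / len(page_texts)
    return empty_ratio >= threshold
```

A document flagged this way could be stored with a marker field or routed to an OCR step instead of being saved with an empty "text" value.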

Adjust and expand this basic example to fit your requirements and the structure of your application.
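As a starting point for the error handling mentioned in note 4, here is a minimal, framework-independent sketch. It wraps each step (download, extract, save) so that a failure records which stage broke and a suggested HTTP status. The class and function names, and the status codes 502, 422, and 503, are assumptions chosen for illustration and not part of the original code.

```python
from typing import Any, Callable, Dict


class PdfStageError(Exception):
    """Raised when one stage of the PDF pipeline fails (hypothetical helper)."""

    def __init__(self, stage: str, status_code: int, detail: str):
        super().__init__(f"{stage} failed: {detail}")
        self.stage = stage
        self.status_code = status_code
        self.detail = detail


def run_stage(stage: str, status_code: int, func: Callable[..., Any], *args: Any) -> Any:
    """Run one pipeline step and tag any failure with its stage name."""
    try:
        return func(*args)
    except Exception as exc:  # in real code, catch the library-specific exceptions
        raise PdfStageError(stage, status_code, str(exc)) from exc


def process(
    download: Callable[[str], bytes],
    extract: Callable[[bytes], str],
    save: Callable[[Dict[str, str]], Any],
    source: str,
) -> Dict[str, str]:
    """Chain the three steps; each one is a plain callable supplied by the caller."""
    data = run_stage("download", 502, download, source)
    text = run_stage("extract", 422, extract, data)
    run_stage("save", 503, save, {"source": source, "text": text})
    return {"status": "Processing completed"}
```

In a FastAPI route, you could catch PdfStageError and re-raise it as `HTTPException(status_code=e.status_code, detail=e.detail)`, so the client sees which step failed instead of a generic 500 response.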
