Build a PDF Text Extractor with FastAPI & MongoDB
Dec 11, 2023
To create a FastAPI application that extracts text from a PDF and saves it to MongoDB, you’ll need a few additional Python libraries: PyPDF2 to handle PDF text extraction and pymongo to interact with MongoDB. The example also uses requests to download PDFs from a URL and loguru for logging. Let's create the complete code:
First, install the necessary packages (the quotes keep shells such as zsh from interpreting the square brackets):
pip install "fastapi[all]" PyPDF2 pymongo loguru requests
Then, here is the complete FastAPI application code:
from io import BytesIO
import os

from fastapi import FastAPI, Request, UploadFile, File
from loguru import logger
import PyPDF2
import pymongo
import requests

app = FastAPI()

# MongoDB setup: the URI is read from the MONGO_URI environment variable.
# The default below is only a placeholder and must be replaced with a
# valid mongodb:// or mongodb+srv:// URI.
mongo_uri = os.getenv("MONGO_URI", "your_default_mongodb_uri")
client = pymongo.MongoClient(mongo_uri)
db = client.get_database("your_database")
collection = db.get_collection("your_collection")

@app.post("/process_pdf_url/")
async def process_pdf_url(request: Request):
    message = await request.json()
    logger.info(f"Received message: {message}")
    # Assuming the message contains the URL of the PDF file
    pdf_url = message.get("pdf_url")
    # Download and process the PDF
    return await process_pdf(pdf_url)

@app.post("/process_pdf_file/")
async def process_pdf_file(file: UploadFile = File(...)):
    # Save file locally for processing; basename() drops any directory
    # components from the client-supplied filename
    contents = await file.read()
    local_path = os.path.basename(file.filename or "upload.pdf")
    with open(local_path, 'wb') as f:
        f.write(contents)
    # Process saved file
    return await process_pdf(local_path, is_local_file=True)

async def process_pdf(pdf_source, is_local_file=False):
    # Open the PDF from a local file, or download it from the URL
    if is_local_file:
        file = open(pdf_source, 'rb')
    else:
        response = requests.get(pdf_source, timeout=30)
        response.raise_for_status()
        file = BytesIO(response.content)
    # Extract text from PDF (PyPDF2 3.x API: PdfFileReader, numPages,
    # getPage and extractText were removed in PyPDF2 3.0)
    try:
        pdf_reader = PyPDF2.PdfReader(file)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()
    finally:
        file.close()
    # Save to MongoDB
    pdf_document = {"source": pdf_source, "text": text}
    collection.insert_one(pdf_document)
    return {"status": "Processing completed"}
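The URL endpoint above passes whatever it finds under "pdf_url" straight to the downloader, so a missing key or a non-web URL only fails later, inside the download call. As a minimal sketch of checking the request body first, the helper below uses only the standard library. The name extract_pdf_url and the exact checks are assumptions for illustration, not part of the original application.

```python
from urllib.parse import urlparse


def extract_pdf_url(message: dict) -> str:
    """Return the "pdf_url" value from a request body, or raise ValueError.

    Hypothetical helper: the name and the checks are illustrative only.
    """
    pdf_url = message.get("pdf_url")
    if not isinstance(pdf_url, str) or not pdf_url.strip():
        raise ValueError('request body must contain a non-empty "pdf_url" string')

    pdf_url = pdf_url.strip()
    parsed = urlparse(pdf_url)
    # Only accept web URLs; this rejects file://, ftp:// and bare paths.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"unsupported PDF URL: {pdf_url!r}")
    return pdf_url
```

Inside the endpoint, a ValueError raised here could be converted into a 400 response, for example by raising fastapi.HTTPException(status_code=400, detail=str(exc)).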
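To run the application in a container, save the code above as main.py, list the same packages in a requirements.txt file, and add a Dockerfile:
# Use an official Python runtime as a parent image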
FROM python:3.8-slim
# Set the working directory in the container
WORKDIR /app
# Copy the current directory contents into the container at /app
COPY . /app
# Install required packages
RUN pip install --no-cache-dir -r requirements.txt
# Install Uvicorn for running the application
RUN pip install uvicorn
# Make port 8080 available to the world outside this container
EXPOSE 8080
# Define environment variable
ENV PORT=8080
# Run uvicorn when the container launches
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
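The Dockerfile defines a PORT environment variable, but the CMD line hard-codes 8080, so changing PORT has no effect on its own. One possible way to honor the variable is to start uvicorn from Python. This is a sketch under assumptions: get_port and serve are hypothetical names, and "main:app" assumes the application lives in main.py, as the CMD line does.

```python
import os


def get_port(default: int = 8080) -> int:
    """Read the listening port from the PORT environment variable."""
    return int(os.getenv("PORT", str(default)))


def serve() -> None:
    """Start uvicorn on the configured port (hypothetical entry point)."""
    # Imported here so get_port() can be used without uvicorn installed.
    import uvicorn

    uvicorn.run("main:app", host="0.0.0.0", port=get_port())
```

If these functions were added to main.py with a call to serve() under an `if __name__ == "__main__":` guard, the container could be started with `python main.py` instead of the uvicorn command line.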
A few important notes about this code:
- MongoDB Connection: Replace "your_default_mongodb_uri", "your_database", and "your_collection" with your actual MongoDB URI, database name, and collection name. The URI can also be supplied through the MONGO_URI environment variable.
- PDF Retrieval: This code assumes the request’s JSON contains a "pdf_url" key with the URL of the PDF file. It downloads the PDF from this URL.
- Text Extraction: PyPDF2 extracts text from the PDF file. Note that PyPDF2 has some limitations and might not work perfectly with all PDF files, especially those with complex layouts. Scanned documents contain only page images, so they need OCR before any text can be extracted.
- Error Handling: This example lacks detailed error handling. In a production environment, you should add try-except blocks to manage potential errors in downloading the file, extracting text, or interacting with MongoDB.
- Logging: The loguru library is used for logging.
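As a rough illustration of the error-handling note, the three stages (download, extract, save) can each be wrapped so that the response says which one failed. The sketch below is not the application's code: run_pipeline is a hypothetical helper that takes the stage functions as arguments, which keeps it independent of requests, PyPDF2, and pymongo.

```python
from typing import Callable


def run_pipeline(
    source: str,
    download: Callable[[str], bytes],
    extract: Callable[[bytes], str],
    save: Callable[[dict], None],
) -> dict:
    """Run download -> extract -> save and report which stage failed.

    Hypothetical helper: the stage functions are supplied by the caller.
    """
    try:
        data = download(source)
    except Exception as exc:  # broad on purpose in this sketch
        return {"status": "error", "stage": "download", "detail": str(exc)}

    try:
        text = extract(data)
    except Exception as exc:
        return {"status": "error", "stage": "extract", "detail": str(exc)}

    try:
        save({"source": source, "text": text})
    except Exception as exc:
        return {"status": "error", "stage": "save", "detail": str(exc)}

    return {"status": "Processing completed"}
```

In real code it is usually better to catch narrower exceptions per stage, such as requests.RequestException for the download, PyPDF2.errors.PdfReadError for extraction, and pymongo.errors.PyMongoError for the insert, and to log each failure with logger.exception.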
Please adjust and expand this basic example according to your specific requirements and the structure of your application.