How to Use Document Intelligence Studio and Azure to Extract a JSON file from a PDF

Masego
May 20, 2024


In today’s digital age, managing and extracting valuable information from documents is crucial for various tasks, from data analysis to decision-making. In this article, I’ll walk you through how to use Microsoft’s Document Intelligence Studio and Azure to extract a JSON file from a PDF.

We’ll use the Patriotic Alliance Manifesto as an example.

The Patriotic Alliance is a political party based in South Africa. Their manifesto for the 2024 South African elections presented a challenge: the text in the PDF could not be read by standard text processing packages in R. That made it a good test case for Azure AI Document Intelligence (the service formerly named Form Recognizer) and its text extraction models.

This guide aims to be straightforward and easy to follow, making it perfect for beginners.

Introduction

Extracting text from PDFs can be challenging, especially when dealing with complex documents like manifestos. Basic R functions can fall short, but fortunately, Document Intelligence Studio combined with Azure provides a powerful solution.
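Before reaching for Azure, it can help to confirm that the PDF really has no usable text layer. The sketch below is only an illustration under assumptions: it supposes the page text came from a package such as pdftools (the article does not say which packages were tried), and `text_layer_summary` is a hypothetical helper name.

```r
# Summarise how much text each page yielded.
# page_text: character vector with one element per page, for example the
# value returned by pdftools::pdf_text() (an assumed choice of package).
text_layer_summary <- function(page_text) {
  chars <- nchar(trimws(page_text))
  data.frame(
    page = seq_along(page_text),
    characters = chars,
    empty = chars == 0
  )
}

# Usage, assuming pdftools is installed and the path is yours:
# page_text <- pdftools::pdf_text("path/to/manifesto.pdf")
# text_layer_summary(page_text)
```

Pages flagged as empty suggest a scanned or image-only PDF, which is the kind of document an OCR-based service is meant for.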

What You Will Need

  1. Microsoft Azure Account: If you don’t have one, you can sign up for a free account.
  2. Document Intelligence Studio: Available as part of Azure AI services.
  3. A PDF document: For this guide, we’ll use the Patriotic Alliance Manifesto.

Step-by-Step Guide

Step 1: Setting Up Your Azure Account

First, ensure you have an Azure account. If you don’t, go to Azure Sign Up and create one. Once logged in, navigate to the Azure portal.

Step 2: Create a Document Intelligence Resource

1. Navigate to Azure AI Services: In the Azure portal, search for AI Services and select Document Intelligence.
2. Create a New Resource: Click on Create a resource and fill in the necessary details (like resource group, region, and pricing tier — I used the paid tier).
3. Review and Create: Review your settings and click “Create”.

Step 3: Upload Your PDF

  1. Access Document Intelligence Studio: Once your resource is created, go to the Document Intelligence Studio.
  2. Upload the PDF: Click the Read model under the Document Analysis section, then choose Browse for files and select the PDF file.

Step 4: Extract and Download JSON

Click Run Analysis. Once the analysis has completed, select Results and download the JSON file.
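The clicks in Steps 3 and 4 can also be scripted against the service’s REST interface. This is a rough sketch and not part of the Studio workflow above. It assumes the v3.1 route (`/formrecognizer/documentModels/prebuilt-read:analyze` with `api-version=2023-07-31`), the httr and jsonlite packages, and your own endpoint and key. The function names are hypothetical, and the route and version should be checked against the current Azure documentation.

```r
# Build the analyze URL for a Document Intelligence resource.
# The route and api_version default are assumptions; verify them in the docs.
build_analyze_url <- function(endpoint, model_id = "prebuilt-read",
                              api_version = "2023-07-31") {
  endpoint <- sub("/+$", "", endpoint)  # drop any trailing slash
  paste0(endpoint, "/formrecognizer/documentModels/", model_id,
         ":analyze?api-version=", api_version)
}

# Submit a PDF, poll until the analysis ends, and return the raw JSON text.
analyze_pdf <- function(pdf_path, endpoint, key,
                        poll_seconds = 2, max_polls = 60) {
  auth <- httr::add_headers(`Ocp-Apim-Subscription-Key` = key)
  submit <- httr::POST(
    build_analyze_url(endpoint), auth,
    body = httr::upload_file(pdf_path, type = "application/pdf")
  )
  httr::stop_for_status(submit)
  # The service replies with the URL to poll for the result.
  result_url <- httr::headers(submit)[["operation-location"]]
  for (i in seq_len(max_polls)) {
    Sys.sleep(poll_seconds)
    poll <- httr::GET(result_url, auth)
    httr::stop_for_status(poll)
    result <- httr::content(poll, as = "text", encoding = "UTF-8")
    status <- jsonlite::fromJSON(result)$status
    if (status %in% c("succeeded", "failed")) return(result)
  }
  stop("Analysis did not finish within the polling limit")
}

# Usage (placeholders, not real values):
# json_text <- analyze_pdf("path/to/manifesto.pdf",
#                          "https://<your-resource>.cognitiveservices.azure.com",
#                          Sys.getenv("DOCINTEL_KEY"))
# writeLines(json_text, "path/to/your/downloaded_file.json")
```

Reading the key from an environment variable (here the made-up name `DOCINTEL_KEY`) keeps it out of the script itself.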

Step 5: Using the Extracted JSON in R

Now that you have the JSON file, you can easily use it in your R projects. Here’s a simple example of how to load and view the JSON data in R:

# Load necessary library
library(jsonlite)
# Read the JSON file
json_data <- fromJSON("path/to/your/downloaded_file.json")
# Inspect the top-level structure (printing the whole object can flood the console)
str(json_data, max.level = 2)

This script loads the JSON data into R, making it ready for analysis or further processing.
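To get from the loaded object to plain text, here is a minimal sketch. It assumes the downloaded file follows the usual analyze-result layout: an `analyzeResult` object with a `content` string and a `pages` array whose `lines` each carry `content`. The sample JSON is made up for illustration, so inspect your own file’s structure before relying on these field names.

```r
library(jsonlite)

# Tiny stand-in for the downloaded file (hypothetical content, assumed layout).
sample_json <- '{
  "status": "succeeded",
  "analyzeResult": {
    "content": "Our plan\\nJobs for all",
    "pages": [
      {"pageNumber": 1, "lines": [{"content": "Our plan"}]},
      {"pageNumber": 2, "lines": [{"content": "Jobs for all"}]}
    ]
  }
}'

# Keep the nested list structure so fields can be reached by name.
parsed <- fromJSON(sample_json, simplifyVector = FALSE)

# Some exports may not have the analyzeResult wrapper, so handle both cases.
result <- if (!is.null(parsed$analyzeResult)) parsed$analyzeResult else parsed

# The whole document as one string.
full_text <- result$content

# One row per recognised line, tagged with its page number.
page_lines <- do.call(rbind, lapply(result$pages, function(p) {
  lines <- vapply(p$lines, function(l) l$content, character(1))
  data.frame(
    page = rep(p$pageNumber, length(lines)),
    text = lines,
    stringsAsFactors = FALSE
  )
}))

print(page_lines)
```

For a real file, pass its path to `fromJSON()` in place of `sample_json`. The text can then be saved with `writeLines(full_text, "manifesto.txt")`.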

Troubleshooting Tips

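- File Upload Issues: Ensure your PDF is not corrupt or password-locked and is under the maximum file size for your pricing tier (the free tier’s limit is much lower than the paid tier’s).
- Extraction Errors: Check that Document Intelligence Studio is pointed at the right subscription and Document Intelligence resource, and that the Read model is selected before you run the analysis.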
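The file checks above can be done in R before uploading. This is a small sketch, not an official validation routine: `check_pdf` is a hypothetical helper, and the 500 MB default is an assumption that you should replace with the limit documented for your tier.

```r
# Quick sanity checks on a PDF before uploading it.
# max_mb is an assumed default; set it to your tier's documented limit.
check_pdf <- function(path, max_mb = 500) {
  if (!file.exists(path)) return("file not found")
  size_mb <- file.size(path) / 1024^2
  if (size_mb > max_mb) {
    return(sprintf("file is %.1f MB, over the %s MB limit", size_mb, max_mb))
  }
  # A PDF starts with the bytes "%PDF-".
  header <- readBin(path, what = "raw", n = 5)
  if (!identical(header, charToRaw("%PDF-"))) {
    return("file does not start with a PDF header")
  }
  "ok"
}

# Usage (path is a placeholder):
# check_pdf("path/to/manifesto.pdf")
```

This only catches missing, oversized, or obviously non-PDF files. A file can pass these checks and still be damaged further in.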

Conclusion

Using Document Intelligence Studio and Azure makes extracting text from PDFs and converting it into a structured format like JSON straightforward and efficient. This powerful combination helps overcome the limitations of basic text extraction methods.

I hope this guide helps you with your document processing tasks and shows how simple it is to extract text from PDFs with the low-code/no-code solution that Azure provides in Document Intelligence Studio.

Happy extracting!
