Receipt Vision: Build a receipt reader using AWS Textract and Comprehend

Hassan Ijaz
6 min read · Jun 28, 2024

This tutorial is part 2 of the Machine Learning on AWS series. You can see the previous tutorial here.

Pre-Requisites

  • Basic Python knowledge
  • An AWS account (if you don’t have one, go to aws.amazon.com and sign up for a free account)
  • Basic AWS knowledge (optional but recommended)

Project Overview

In this project we will use the OCR (optical character recognition) capabilities of AWS Textract to read user-uploaded images of receipts, and AWS Comprehend to parse the extracted text.

Other than S3 and Lambda, which we have used previously, we will be using two new AWS services for this project:

  1. AWS Textract is a machine learning service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) by also identifying the structure of documents, such as forms and tables, making it easy to convert paper-based data into digital formats for analysis and processing.
  2. AWS Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. It can perform tasks such as entity recognition, sentiment analysis, key phrase extraction, language detection, and text categorization.
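
To get a feel for the two services before we wire them into Lambda, here is a minimal boto3 sketch of the calls this project is built on; the bucket and object names are hypothetical placeholders:

import boto3

textract = boto3.client('textract')
comprehend = boto3.client('comprehend')

# Textract: plain OCR on an image stored in S3 (placeholder bucket/key)
ocr = textract.detect_document_text(
    Document={'S3Object': {'Bucket': 'receipt-uploads', 'Name': 'receipt.jpg'}}
)
lines = [b['Text'] for b in ocr['Blocks'] if b['BlockType'] == 'LINE']

# Comprehend: named-entity detection on the extracted text
entities = comprehend.detect_entities(Text='\n'.join(lines), LanguageCode='en')
for e in entities['Entities']:
    print(e['Type'], e['Text'])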

Set Up S3 Buckets

We will create two S3 buckets: one to upload receipt images to, and a second that will store the parsed receipt data.

  • Sign in to the AWS Management Console.
  • In the AWS Management Console, go to the “Services” menu and select “S3” under “Storage”.

Create Receipt Uploads Bucket

  • Click on the “Create bucket” button.
  • Enter a unique bucket name like receipt-uploads (bucket names must be unique across all existing bucket names in Amazon S3).
  • Choose the same region where you will deploy your Lambda function.
  • Ensure all options are checked to block public access.
  • Click on the “Create bucket” button at the bottom.
  • Click on the “Permissions” tab and, under “Bucket policy”, enter the following policy so that Textract can access images in this bucket (replace receipt-uploads with your bucket name):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "textract.amazonaws.com"
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::receipt-uploads/*"
    }
  ]
}
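
If you prefer to script this step, a minimal boto3 sketch of attaching the same policy might look like this (the bucket name is a placeholder for your own):

import json
import boto3

s3 = boto3.client('s3')
bucket = 'receipt-uploads'  # placeholder: use your bucket name

# Same policy as above, built as a Python dict
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "textract.amazonaws.com"},
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/*"
    }]
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))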

Create the Result Bucket

  • Follow the same steps to create another bucket with a name like receipt-results, except this bucket does not need a bucket policy attached.

Set Up IAM Role

We need to set up an IAM role so that our Lambda function can access S3, CloudWatch, Comprehend, and Textract. This time we will attach a custom policy that is more restrictive.

Create Custom Policy

  • Navigate to the IAM Console.
  • Click on “Policies”, then click the “Create policy” button.
  • Switch to the JSON tab and paste in the following policy (replace the bucket names with your own):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "comprehend:DetectEntities",
        "comprehend:DetectSentiment",
        "comprehend:DetectSyntax"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "textract:AnalyzeDocument",
        "textract:DetectDocumentText"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::hijaz-receipt-uploads/*",
        "arn:aws:s3:::hijaz-receipt-results/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:*:*:log-group:/aws/lambda/*"
      ]
    }
  ]
}
  • Give the policy a name like “LambdaTextractS3CloudWatchPolicy” and click “Create policy”.

Create Role

  • In IAM Console, click on “Roles” and click “Create Role” button.
  • Choose “AWS Service” and select “Lambda” as the service.
  • Click “Next” to go to “Permissions”
  • Search for and attach the custom policy we just created “LambdaTextractS3CloudWatchPolicy”
  • Enter the role name, “lambda-receipt-processor-role”.
  • Review the permissions and click “Create role”
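
If you script your infrastructure instead, here is a minimal boto3 sketch of the equivalent role setup; the account ID in the policy ARN is a placeholder:

import json
import boto3

iam = boto3.client('iam')

# Trust policy that lets the Lambda service assume this role
trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(
    RoleName='lambda-receipt-processor-role',
    AssumeRolePolicyDocument=json.dumps(trust)
)

# Attach the custom policy created above (replace the placeholder account ID)
iam.attach_role_policy(
    RoleName='lambda-receipt-processor-role',
    PolicyArn='arn:aws:iam::123456789012:policy/LambdaTextractS3CloudWatchPolicy'
)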

Set Up the Lambda Function

  • Go to the AWS Lambda console.
  • Click “Create function”.
  • Function name: ReceiptProcessor
  • Runtime: Python 3.x
  • Permissions: Choose the IAM role created earlier (lambda-receipt-processor-role).
  • Click “Create function”.
  • Enter the following code for your function (replace the result bucket name in the code with your own):
import json
import boto3
import logging
import urllib.parse
import uuid
from dateutil import parser
import re

# Set up logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Initialize AWS clients
s3_client = boto3.client('s3')
textract_client = boto3.client('textract')
comprehend_client = boto3.client('comprehend')

def lambda_handler(event, context):
    try:
        # Log the event received from S3
        logger.info(f"Received event: {json.dumps(event)}")

        # Get the bucket name and document key from the event
        bucket = event['Records'][0]['s3']['bucket']['name']
        key = event['Records'][0]['s3']['object']['key']
        logger.info(f"Bucket: {bucket}, Key: {key}")

        # S3 event notifications deliver URL-encoded object keys, so decode
        # the key to handle spaces and special characters
        decoded_key = urllib.parse.unquote_plus(key)
        logger.info(f"Decoded Key: {decoded_key}")

        # Skip processing if the file is in the result bucket
        if bucket == 'your-result-bucket':  # Replace with your result bucket name
            logger.info("Skipping processing for output bucket.")
            return

        # Call Textract to analyze the document
        response = textract_client.analyze_document(
            Document={'S3Object': {'Bucket': bucket, 'Name': decoded_key}},
            FeatureTypes=['TABLES', 'FORMS']
        )

        # Extract the text lines in order
        text_blocks = extract_text_blocks(response)
        combined_text = "\n".join(text_blocks)

        # Call Comprehend to detect entities in the text
        comprehend_response = comprehend_client.detect_entities(
            Text=combined_text,
            LanguageCode='en'
        )

        # Extract the structured data
        extracted_data = extract_data(text_blocks, comprehend_response)
        logger.info(f"Extracted Data: {extracted_data}")

        # Save the extracted data to the result S3 bucket
        save_to_s3(extracted_data, decoded_key)

        return {
            'statusCode': 200,
            'body': json.dumps('Receipt processed successfully!')
        }

    except Exception as e:
        logger.error(f"An error occurred: {str(e)}")
        return {
            'statusCode': 500,
            'body': json.dumps('An internal error occurred.')
        }

def extract_text_blocks(response):
    # Collect the text of every LINE block returned by Textract
    text_blocks = []
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            text_blocks.append(block['Text'])
    return text_blocks

def extract_data(text_blocks, comprehend_response):
    data = {'ReceiptId': str(uuid.uuid4())}
    entities = comprehend_response['Entities']
    vendor_name = None
    total_amount = None

    # Take the first ORGANIZATION entity as the vendor name;
    # the date is handled separately by extract_date below
    for entity in entities:
        if entity['Type'] == 'ORGANIZATION' and vendor_name is None:
            vendor_name = entity['Text']

    # Extract the total amount from lines containing the word "total"
    for line in text_blocks:
        if 'total' in line.lower():
            parts = line.split()
            for part in parts:
                if part.replace('.', '', 1).isdigit():
                    total_amount = part
                    break

    data['Vendor'] = vendor_name if vendor_name else 'N/A'
    data['Total'] = total_amount if total_amount else 'N/A'
    data['Date'] = extract_date(text_blocks)

    return data

def extract_date(text_blocks):
    date_patterns = [
        r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b',  # Matches dates like MM/DD/YY or MM/DD/YYYY
    ]
    date_regex = re.compile('|'.join(date_patterns))

    for line in text_blocks:
        match = date_regex.search(line)
        if match:
            date_str = match.group(0)
            try:
                date = parser.parse(date_str, fuzzy=False)
                logger.info(f"Parsed date: {date.strftime('%Y-%m-%d')} from line: {line}")
                if date.year > 1900 and date.year < 2100:
                    return date.strftime('%Y-%m-%d')
            except ValueError:
                logger.info(f"Failed to parse date from line: {line}")
                continue
    return 'N/A'

def save_to_s3(data, original_key):
    result_bucket = 'hijaz-receipt-results'  # Replace with your result bucket name
    result_key = 'results/' + original_key.split('/')[-1].replace('.jpg', '.json').replace('.png', '.json')

    s3_client.put_object(
        Bucket=result_bucket,
        Key=result_key,
        Body=json.dumps(data),
        ContentType='application/json'
    )
  • Set the timeout and memory settings according to your needs (e.g., timeout of 5 minutes and 512 MB of memory).
  • Click “Save”.

Add Trigger

  • In the “Configuration” tab, click on “Triggers”.
  • Click “Add trigger”.
  • From the “Trigger configuration” dropdown, select “S3”.

Set Up the Trigger

  • Bucket: Select your receipt-uploads bucket.
  • Event type: Choose All object create events.
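
To sanity-check the handler without waiting on a real upload, you can invoke it locally with a hand-built event shaped like the one S3 delivers. The bucket and key below are placeholders, and the referenced image must exist in S3 (with AWS credentials configured locally) for the Textract call to succeed:

# Assuming the handler code above is saved locally as lambda_function.py
from lambda_function import lambda_handler

# Hand-built event shaped like an S3 "object created" notification;
# bucket and key are placeholders
fake_event = {
    "Records": [{
        "s3": {
            "bucket": {"name": "receipt-uploads"},
            "object": {"key": "receipt.jpg"}
        }
    }]
}

print(lambda_handler(fake_event, None))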

Add CloudWatch Log Group

  • Go to the CloudWatch Console.
  • In the navigation pane, click on “Logs”.
  • Create the log group /aws/lambda/ReceiptProcessor (if it does not already exist)
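
If you prefer to create the log group programmatically, a minimal boto3 sketch looks like this:

import boto3

logs = boto3.client('logs')

# Create the log group the Lambda function writes to; ignore it if it already exists
try:
    logs.create_log_group(logGroupName='/aws/lambda/ReceiptProcessor')
except logs.exceptions.ResourceAlreadyExistsException:
    pass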

Test the Service

  • Get a .jpg or .png image of a receipt and upload it to the uploads S3 bucket you created (a scripted alternative is sketched after this list)
  • Verify in CloudWatch logs that there are no errors.
  • In the results S3 bucket you should see the parsed JSON data from the receipt you uploaded:
{
  "ReceiptId": "99dc4740-a2e4-423a-906b-8dc302092f3e",
  "Vendor": "Walmart",
  "Total": "23.19",
  "Date": "2017-11-13"
}
  • The code can be made more complete by reading individual line items and tax, and by adapting it to different receipt formats.
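
If you would rather test from a script than the console, here is a minimal boto3 sketch that uploads an image and then fetches the parsed result; the bucket names, file path, and wait time are placeholders:

import json
import time
import boto3

s3 = boto3.client('s3')

# Placeholders: use your bucket names and a real local image file
upload_bucket = 'receipt-uploads'
result_bucket = 'receipt-results'

s3.upload_file('receipt.jpg', upload_bucket, 'receipt.jpg')

# Give the Lambda a few seconds to run, then read the parsed result
time.sleep(15)
obj = s3.get_object(Bucket=result_bucket, Key='results/receipt.json')
print(json.dumps(json.loads(obj['Body'].read()), indent=2))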

In the next part of this series we will predict Bitcoin prices using SageMaker DeepAR+.

Title Background: “200520” by takawo (http://openprocessing.org/sketch/857874), licensed under Creative Commons Attribution-NonCommercial-ShareAlike 3.0 (https://creativecommons.org/licenses/by-nc-sa/3.0).
