Receipt Vision: Build a receipt reader using AWS Textract and Comprehend

Hassan Ijaz
6 min read · Jun 28, 2024

This tutorial is part 2 of the Machine Learning on AWS series. You can see the previous tutorial here.

Pre-Requisites

  • Basic Python knowledge
  • An AWS account (if you don’t have one, go to aws.amazon.com and sign up for a free account)
  • Basic AWS knowledge (optional but recommended)

Project Overview

In this project we will use the OCR (optical character recognition) capabilities of AWS Textract to read user-uploaded images of receipts, and AWS Comprehend to parse the extracted text.

Other than S3 and Lambda, which we have used previously, we will be using two new AWS services for this project:

  1. AWS Textract is a machine learning service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) by also identifying the structure of documents, such as forms and tables, making it easy to convert paper-based data into digital formats for analysis and processing.
  2. AWS Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. It can perform tasks such as entity recognition, sentiment analysis, key phrase extraction, language detection, and text categorization.
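
To get a feel for the two services before we wire them into Lambda, here is a minimal boto3 sketch of the calls this project is built on; the bucket and object names are hypothetical placeholders:

import boto3

textract = boto3.client('textract')
comprehend = boto3.client('comprehend')

# Textract: plain OCR on an image stored in S3 (placeholder bucket/key)
ocr = textract.detect_document_text(
    Document={'S3Object': {'Bucket': 'receipt-uploads', 'Name': 'receipt.jpg'}}
)
lines = [b['Text'] for b in ocr['Blocks'] if b['BlockType'] == 'LINE']

# Comprehend: named-entity detection on the extracted text
entities = comprehend.detect_entities(Text='\n'.join(lines), LanguageCode='en')
for e in entities['Entities']:
    print(e['Type'], e['Text'])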

Set Up S3 Buckets

We will create two S3 buckets: one to upload receipt images to, and a second that will store the parsed receipt data.

  • Sign in to the AWS Management Console.
  • In the AWS Management Console, go to the “Services” menu and select “S3” under “Storage”.

Create Receipt Uploads Bucket

  • Click on the “Create bucket” button.
  • Enter a unique bucket name like receipt-uploads (bucket names must be unique across all existing bucket names in Amazon S3).
  • Choose the same region where you will deploy your Lambda function.
  • Ensure all options are checked to block public access.
  • Click on the “Create bucket” button at the bottom.
  • Click on the “Permissions” tab and, under “Bucket policy”, enter the following policy so that Textract can access images in this bucket (replace receipt-uploads with your bucket name):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "textract.amazonaws.com"
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::receipt-uploads/*"
    }
  ]
}
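
If you prefer to script this step, a minimal boto3 sketch of attaching the same policy might look like this (the bucket name is a placeholder for your own):

import json
import boto3

s3 = boto3.client('s3')
bucket = 'receipt-uploads'  # placeholder: use your bucket name

# Same policy as above, built as a Python dict
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "textract.amazonaws.com"},
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/*"
    }]
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))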

Create the Result Bucket

  • Follow the same steps to create another bucket with a name like receipt-results, except this bucket does not need a bucket policy attached.

Set Up IAM Role

We need to set up an IAM role so that our Lambda function can access S3, CloudWatch, Comprehend, and Textract. This time we will attach a custom policy that is more restrictive.

Create Custom Policy

  • Navigate to the IAM Console.
  • Click on “Policies”, then click the “Create policy” button.
  • Switch to the JSON tab and paste in the following policy (replace the bucket names with your own):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "comprehend:DetectEntities",
        "comprehend:DetectSentiment",
        "comprehend:DetectSyntax"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "textract:AnalyzeDocument",
        "textract:DetectDocumentText"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::hijaz-receipt-uploads/*",
        "arn:aws:s3:::hijaz-receipt-results/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:*:*:log-group:/aws/lambda/*"
      ]
    }
  ]
}
  • Give the policy a name like “LambdaTextractS3CloudWatchPolicy” and click “Create policy”.

Create Role

  • In IAM Console, click on “Roles” and click “Create Role” button.
  • Choose “AWS Service” and select “Lambda” as the service.
  • Click “Next” to go to “Permissions”
  • Search for and attach the custom policy we just created “LambdaTextractS3CloudWatchPolicy”
  • Enter the role name, “lambda-receipt-processor-role”.
  • Review the permissions and click “Create role”
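
If you script your infrastructure instead, here is a minimal boto3 sketch of the equivalent role setup; the account ID in the policy ARN is a placeholder:

import json
import boto3

iam = boto3.client('iam')

# Trust policy that lets the Lambda service assume this role
trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(
    RoleName='lambda-receipt-processor-role',
    AssumeRolePolicyDocument=json.dumps(trust)
)

# Attach the custom policy created above (replace the placeholder account ID)
iam.attach_role_policy(
    RoleName='lambda-receipt-processor-role',
    PolicyArn='arn:aws:iam::123456789012:policy/LambdaTextractS3CloudWatchPolicy'
)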

Set Up the Lambda Function

  • Go to the AWS Lambda console.
  • Click “Create function”.
  • Function name: ReceiptProcessor
  • Runtime: Python 3.x
  • Permissions: Choose the IAM role created earlier (lambda-receipt-processor-role).
  • Click “Create function”.
  • Enter the following code for your function (replace the result bucket name in the code with your own):
import json
import boto3
import logging
import urllib.parse
import uuid
from dateutil import parser
import re

# Set up logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Initialize AWS clients
s3_client = boto3.client('s3')
textract_client = boto3.client('textract')
comprehend_client = boto3.client('comprehend')

def lambda_handler(event, context):
    try:
        # Log the event received from S3
        logger.info(f"Received event: {json.dumps(event)}")

        # Get the bucket name and document key from the event
        bucket = event['Records'][0]['s3']['bucket']['name']
        key = event['Records'][0]['s3']['object']['key']
        logger.info(f"Bucket: {bucket}, Key: {key}")

        # S3 event notifications deliver URL-encoded object keys, so decode
        # the key to handle spaces and special characters
        decoded_key = urllib.parse.unquote_plus(key)
        logger.info(f"Decoded Key: {decoded_key}")

        # Skip processing if the file is in the result bucket
        if bucket == 'your-result-bucket':  # Replace with your result bucket name
            logger.info("Skipping processing for output bucket.")
            return

        # Call Textract to analyze the document
        response = textract_client.analyze_document(
            Document={'S3Object': {'Bucket': bucket, 'Name': decoded_key}},
            FeatureTypes=['TABLES', 'FORMS']
        )

        # Extract the text lines in order
        text_blocks = extract_text_blocks(response)
        combined_text = "\n".join(text_blocks)

        # Call Comprehend to detect entities in the text
        comprehend_response = comprehend_client.detect_entities(
            Text=combined_text,
            LanguageCode='en'
        )

        # Extract the structured data
        extracted_data = extract_data(text_blocks, comprehend_response)
        logger.info(f"Extracted Data: {extracted_data}")

        # Save the extracted data to the result S3 bucket
        save_to_s3(extracted_data, decoded_key)

        return {
            'statusCode': 200,
            'body': json.dumps('Receipt processed successfully!')
        }

    except Exception as e:
        logger.error(f"An error occurred: {str(e)}")
        return {
            'statusCode': 500,
            'body': json.dumps('An internal error occurred.')
        }

def extract_text_blocks(response):
    # Collect the text of every LINE block returned by Textract
    text_blocks = []
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            text_blocks.append(block['Text'])
    return text_blocks

def extract_data(text_blocks, comprehend_response):
    data = {'ReceiptId': str(uuid.uuid4())}
    entities = comprehend_response['Entities']
    vendor_name = None
    total_amount = None

    # Take the first ORGANIZATION entity as the vendor name;
    # the date is handled separately by extract_date below
    for entity in entities:
        if entity['Type'] == 'ORGANIZATION' and vendor_name is None:
            vendor_name = entity['Text']

    # Extract the total amount from lines containing the word "total"
    for line in text_blocks:
        if 'total' in line.lower():
            parts = line.split()
            for part in parts:
                if part.replace('.', '', 1).isdigit():
                    total_amount = part
                    break

    data['Vendor'] = vendor_name if vendor_name else 'N/A'
    data['Total'] = total_amount if total_amount else 'N/A'
    data['Date'] = extract_date(text_blocks)

    return data

def extract_date(text_blocks):
    date_patterns = [
        r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b',  # Matches dates like MM/DD/YY or MM/DD/YYYY
    ]
    date_regex = re.compile('|'.join(date_patterns))

    for line in text_blocks:
        match = date_regex.search(line)
        if match:
            date_str = match.group(0)
            try:
                date = parser.parse(date_str, fuzzy=False)
                logger.info(f"Parsed date: {date.strftime('%Y-%m-%d')} from line: {line}")
                if date.year > 1900 and date.year < 2100:
                    return date.strftime('%Y-%m-%d')
            except ValueError:
                logger.info(f"Failed to parse date from line: {line}")
                continue
    return 'N/A'

def save_to_s3(data, original_key):
    result_bucket = 'hijaz-receipt-results'  # Replace with your result bucket name
    result_key = 'results/' + original_key.split('/')[-1].replace('.jpg', '.json').replace('.png', '.json')

    s3_client.put_object(
        Bucket=result_bucket,
        Key=result_key,
        Body=json.dumps(data),
        ContentType='application/json'
    )
  • Set the timeout and memory settings according to your needs (e.g., timeout of 5 minutes and 512 MB of memory).
  • Click “Save”.

Add Trigger

  • In the “Configuration” tab, click on “Triggers”.
  • Click “Add trigger”.
  • From the “Trigger configuration” dropdown, select “S3”.

Set Up the Trigger

  • Bucket: Select your receipt-uploads bucket.
  • Event type: Choose All object create events.
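
To sanity-check the handler without waiting on a real upload, you can invoke it locally with a hand-built event shaped like the one S3 delivers. The bucket and key below are placeholders, and the referenced image must exist in S3 (with AWS credentials configured locally) for the Textract call to succeed:

# Assuming the handler code above is saved locally as lambda_function.py
from lambda_function import lambda_handler

# Hand-built event shaped like an S3 "object created" notification;
# bucket and key are placeholders
fake_event = {
    "Records": [{
        "s3": {
            "bucket": {"name": "receipt-uploads"},
            "object": {"key": "receipt.jpg"}
        }
    }]
}

print(lambda_handler(fake_event, None))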

Add CloudWatch Log Group

  • Go to the CloudWatch Console.
  • In the navigation pane, click on “Logs”.
  • Create the log group /aws/lambda/ReceiptProcessor (if it does not already exist)
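
If you prefer to create the log group programmatically, a minimal boto3 sketch looks like this:

import boto3

logs = boto3.client('logs')

# Create the log group the Lambda function writes to; ignore it if it already exists
try:
    logs.create_log_group(logGroupName='/aws/lambda/ReceiptProcessor')
except logs.exceptions.ResourceAlreadyExistsException:
    pass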

Test the Service

  • Get a .jpg or .png image of a receipt and upload it to the uploads S3 bucket you created (a scripted alternative is sketched after this list)
  • Verify in CloudWatch logs that there are no errors.
  • In the results S3 bucket you should see the parsed JSON data from the receipt you uploaded:
{
  "ReceiptId": "99dc4740-a2e4-423a-906b-8dc302092f3e",
  "Vendor": "Walmart",
  "Total": "23.19",
  "Date": "2017-11-13"
}
  • The code can be made more complete by reading individual line items and tax, and by adapting it to different receipt formats.
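
If you would rather test from a script than the console, here is a minimal boto3 sketch that uploads an image and then fetches the parsed result; the bucket names, file path, and wait time are placeholders:

import json
import time
import boto3

s3 = boto3.client('s3')

# Placeholders: use your bucket names and a real local image file
upload_bucket = 'receipt-uploads'
result_bucket = 'receipt-results'

s3.upload_file('receipt.jpg', upload_bucket, 'receipt.jpg')

# Give the Lambda a few seconds to run, then read the parsed result
time.sleep(15)
obj = s3.get_object(Bucket=result_bucket, Key='results/receipt.json')
print(json.dumps(json.loads(obj['Body'].read()), indent=2))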

In the next part of this series we will predict Bitcoin prices using SageMaker DeepAR+.

Title Background: “200520” by takawo (http://openprocessing.org/sketch/857874), licensed under Creative Commons Attribution-NonCommercial-ShareAlike 3.0 (https://creativecommons.org/licenses/by-nc-sa/3.0).
