“Data is the new code” was a phrase heard a lot in 2018, and I expect it to get a lot more attention in the future. Many companies use some form of content management system to help wrangle their information, storing files and images to be retrieved later. Retrieval is typically based on a small amount of metadata: a user ID, account number, or email address. This means that the interesting “stuff”, the data itself, remains trapped. Getting access to it is the first step toward making use of something like machine learning or AI.

Cloud providers are now offering a range of tools that can be used to help free up the trapped information.

Amazon

AWS is continuing to expand its offerings in AI and machine learning. My example focuses on using AWS tools to gather the data.

Rekognition

This service provides image and video analysis. It can identify objects, people, text, scenes, and activities, which makes it useful for categorizing image data: is this a picture of a vehicle, an animal (puppy or kitten), a person? As fun as it is to look at puppies and kittens, you may not want to spend the day looking through thousands of images to find one of a firetruck. If the images are associated with other content such as PDFs or emails, then searching for an image associated with the content may be more efficient.

Elasticsearch

Elasticsearch offers text searching and indexing. On AWS it can be run on an EC2 instance or as a managed service. Depending on the use case, ES could be indexing specific fields or large text files.

Support services

To support Rekognition and Elasticsearch I used three AWS core services: S3, Lambda, and DynamoDB.

The diagram shows how I have laid out the pieces.

Since this effort is a POC I have kept it very simple. The process is as follows (a minimal sketch of the handler appears after the list):

1. A user drops an image on S3.

2. A trigger launches a Lambda process.

3. The process passes the image to Rekognition three times: once each for labels, text, and faces.

4. The results for each are stored in DynamoDB and indexed by Elasticsearch.
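To make that flow concrete, here is a minimal sketch of the on-demand handler. The details are filled in by the snippets later in the post, so treat the exact structure here as illustrative rather than the repo's actual code:

import boto3

def OnDemand(event, context):
    """Sketch of the S3-triggered handler; details appear later in the post."""
    rekognition = boto3.client('rekognition')

    # Steps 1-2: the S3 PUT event tells us which image arrived.
    record = event['Records'][0]['s3']
    image = {"S3Object": {"Bucket": record['bucket']['name'],
                          "Name": record['object']['key']}}

    # Step 3: three Rekognition passes -- labels, text, faces.
    labels = rekognition.detect_labels(Image=image, MinConfidence=90)
    text = rekognition.detect_text(Image=image)
    faces = rekognition.detect_faces(Image=image, Attributes=['ALL'])

    # Step 4: store the results in DynamoDB and index them in Elasticsearch
    # (shown in the snippets below).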

This first process handles images that are added dynamically. But there are a lot of images already stored; for those I would use a background process triggered by CloudWatch.

Storing feature data

Using DynamoDB provides a durable way of storing the information. Should the need arise, the data could be re-indexed from the database, as sketched below.
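A rough illustration of that re-indexing path, assuming the endpoint, credentials, table, and index names used elsewhere in this post; the scan is unpaginated and labels-only for brevity:

import boto3
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

# Placeholder credentials/endpoint; see the connection code later in the post.
awsauth = AWS4Auth('ACCESS_KEY', 'SECRET_KEY', 'us-east-1', 'es')
esClient = Elasticsearch(hosts=[{'host': 'my-es-endpoint', 'port': 443}],
                         use_ssl=True, http_auth=awsauth,
                         verify_certs=True,
                         connection_class=RequestsHttpConnection)

table = boto3.resource('dynamodb').Table('rekognitionTable')

# Rebuild the 'labels' index from what DynamoDB already holds.
# (A real version would paginate the scan and cover text/faces too.)
for item in table.scan()['Items']:
    indexLabels = {label: item['primaryValue'] for label in item.get('labels', [])}
    if indexLabels:
        esClient.index('labels', doc_type='image', body=indexLabels)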

DynamoDB table: features added to the database.

Labels for a kitten are:
[ { "S" : "Animal" }, { "S" : "Mammal" }, { "S" : "Kitten" }, { "S" : "Cat" }, { "S" : "Pet" } ]

Labels for a car are:

[ { "S" : "Transportation" }, { "S" : "Automobile" }, { "S" : "Car" }, { "S" : "Vehicle" }, { "S" : "Convertible" }, { "S" : "Tire" }, { "S" : "Machine" }, { "S" : "Spoke" } ]

Text found in the car image is the license plate:

[ { "S" : "NDV 9IF" }, { "S" : "NDV" }, { "S" : "9IF" } ]

The item layout consists of a primary key, a datetime, and the feature attributes: labels, detected text, and detected faces. The primary key I used was a simple UUID. In a real implementation the key would be something like a user ID or email address. This ID would be associated with whatever other content was stored in the CMS.
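As an illustration, a stored item might look like the following Python dict. Only the label path is shown in full later in the post, so the 'text' and 'faces' attribute names here are my assumption:

import datetime
import uuid

# Illustrative item shape; 'text' and 'faces' attribute names are assumptions.
item = {
    'primaryValue': str(uuid.uuid4()),
    'datetime': str(datetime.datetime.now()),
    'labels': ['Animal', 'Mammal', 'Kitten', 'Cat', 'Pet'],
    'text': ['NDV 9IF', 'NDV', '9IF'],
    'faces': [],  # detect_faces output is a richer nested structure
}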

Searching on the ID is not likely to provide any improvement from the user’s point of view. Instead, what they’d likely want is to search for “Car” or “Kitten” and get a list of relevant IDs. To achieve this, the next step would be to create several Global Secondary Indexes, one for each feature type (face, label, or text).
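A minimal sketch of such a query, assuming a hypothetical GSI named 'label-index' keyed on a scalar label attribute (a GSI cannot key on the labels list directly, so each label would need to be written as its own item or attribute):

import boto3

dynamodb = boto3.client('dynamodb')

# Query the hypothetical 'label-index' GSI for every image tagged 'Kitten'.
response = dynamodb.query(
    TableName='rekognitionTable',
    IndexName='label-index',
    KeyConditionExpression='#l = :label',
    ExpressionAttributeNames={'#l': 'label'},
    ExpressionAttributeValues={':label': {'S': 'Kitten'}},
)
ids = [item['primaryValue']['S'] for item in response['Items']]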

Elasticsearch

An alternative to adding more indexes to the database is to index the feature data with Elasticsearch. This keeps the database simple and offers a way to perform a wider range of queries. I used Elasticsearch as an AWS managed service since it is easy to set up, though deploying ES in an EC2 environment may be a better long-term solution. It’s not clear how AWS manages ES plugins. Something to better understand.

For the rest of this application I used SAM, but I deployed ES via the AWS console. Make sure to select an instance of t2.small.elasticsearch. The default is large and not in the free tier; I got charged $5 in less than 24 hours and I wasn’t even using it!

Access to ES from Lambda is via IAM. Putting ES in a VPC meant I could not use Kibana locally, so instead I chose to make it public and restrict it by my IP address. The one thing that I didn’t like is having to use AWS4Auth with user credentials. For many, this may pose a significant security issue; many production environments restrict the direct use of credentials.

Rekognition features for the baby picture as seen by Kibana.

The application

I used the AWS free tier to do this work. To create the Lambda, S3, and DynamoDB instances I used a very basic SAM template. For a “real” application there would be a lot of security issues to address.

Since I did this around Christmas, the project is called ‘rudolph’ and the S3 bucket is ‘my-reindeer-dropzone’. I manually created the S3 bucket (my-rudolph-package) for the Python package deployment. There are two Lambda functions in place but only one (ondemand) is actually functional.

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
  sam-app
  SAM Template for sam-app

Globals:
  Function:
    Timeout: 3

Resources:
  SrcBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: my-reindeer-dropzone

  DynamoRekognitionTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: rekognitionTable
      AttributeDefinitions:
        - AttributeName: primaryValue
          AttributeType: S
        - AttributeName: datetime
          AttributeType: S
      KeySchema:
        - AttributeName: primaryValue
          KeyType: HASH
        - AttributeName: datetime
          KeyType: RANGE
      ProvisionedThroughput:
        ReadCapacityUnits: 5
        WriteCapacityUnits: 5

  BackgroundFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: s3://my-rudolph-package/backgroundpackage.zip
      Handler: background.processData
      Runtime: python3.6
      Policies: AWSLambdaBasicExecutionRole

  ProcessOnDemandFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: s3://my-rudolph-package/ondemandpackage.zip
      Handler: ondemandfunction.OnDemand
      Runtime: python3.6
      Policies:
        - AWSLambdaBasicExecutionRole
        - AmazonDynamoDBFullAccess
        - AmazonS3FullAccess
        - AmazonESFullAccess
        - AmazonRekognitionFullAccess
        - S3CrudPolicy:
            BucketName: my-reindeer-dropzone
      Events:
        S3CreateObject:
          Type: S3
          Properties:
            Bucket: !Ref SrcBucket
            Events: s3:ObjectCreated:*

Outputs:
  BackgroundFunction:
    Description: "Background Function ARN"
    Value: !GetAtt BackgroundFunction.Arn
  ProcessOnDemandFunction:
    Description: "ProcessOnDemand Function ARN"
    Value: !GetAtt ProcessOnDemandFunction.Arn

The Lambda code is in Python 3.6. There are plenty of examples of how to include Python packages. I simply do pip install <package> -t <rootfolder> and then zip everything to be deployed.

Stuff you might need:

import sys
import json
import datetime
import uuid

import boto3
import botocore
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

Getting set up:

ENDPOINT = 'ES endpoint from the console'  # don't include the https://
awsauth = AWS4Auth(access, 'secret', 'us-east-1', 'es')
# es is another Python module that does the ES work
esClient = es.connectES(awsauth, ENDPOINT)
session = boto3.Session()
rekognition = session.client('rekognition')
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('rekognitionTable')
primaryKey = str(uuid.uuid4())
dt = str(datetime.datetime.now())
bucket_name = event['Records'][0]['s3']['bucket']['name']
filename = event['Records'][0]['s3']['object']['key']

Get the labels. I set the confidence level to 90. This can be confusing at first since it greatly reduces what gets returned. If you want to see all the features, leave this attribute out.

rekognition_response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": bucket_name, "Name": filename}},
    MaxLabels=100,
    MinConfidence=90,
)

Processing the results. I created two containers, one for DynamoDB and one for ES. The ES document maps each label to the primaryKey. For the database, it’s just a list of labels:

{ "S" : "Animal" }, { "S" : "Kitten" }, { "S" : "Cat" }, { "S" : "Pet" }, { "S" : "Mammal" }

indexLabels = {}
labels = []
for label in rekognition_response['Labels']:
    labels.append(label['Name'])
    indexLabels[label['Name']] = primaryKey

Store the data and add the index. (A code review might suggest checking for results right after the Rekognition call instead of at this point.)

if len(labels) > 0:
    response = table.put_item(
        Item={'primaryValue': primaryKey, 'datetime': dt, 'labels': labels}
    )
    es.indexDataElement(esClient, 'labels', 'image', indexLabels)

es.py

def connectES(awsauth, esEndPoint):
    print('Connecting to the ES Endpoint {0}'.format(esEndPoint))
    try:
        esClient = Elasticsearch(
            hosts=[{'host': esEndPoint, 'port': 443}],
            use_ssl=True,
            http_auth=awsauth,
            verify_certs=True,
            connection_class=RequestsHttpConnection)
        return esClient
    except Exception as E:
        print("Unable to connect to {0}".format(esEndPoint))
        print(E)
        exit(3)


def indexDataElement(esClient, index, type, indexData):
    try:
        esClient.index(index, doc_type=type, body=indexData)
    except Exception as E:
        print("Unable to Create Index {0}".format(index))
        print(E)
        exit(4)

This is repeated for text and faces. Processing face results is more complicated (https://docs.aws.amazon.com/rekognition/latest/dg/faces-detect-images.html). The code could use some rework!
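For reference, a minimal sketch of the text and face passes, reusing the rekognition, bucket_name, and filename variables from the setup code above; the DynamoDB and ES steps mirror the label code:

# Text pass: TextDetections includes both full lines and individual words.
text_response = rekognition.detect_text(
    Image={"S3Object": {"Bucket": bucket_name, "Name": filename}})
detectedText = [t['DetectedText'] for t in text_response['TextDetections']]

# Face pass: each FaceDetail is a nested structure (bounding box, age range,
# emotions, ...), which is why face processing needs more work.
face_response = rekognition.detect_faces(
    Image={"S3Object": {"Bucket": bucket_name, "Name": filename}},
    Attributes=['ALL'])
for face in face_response['FaceDetails']:
    print(face['BoundingBox'], face['AgeRange'], face['Gender']['Value'])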

This effort was a chance for me to learn how these services play together. The next step is to see how to use Elasticsearch to index the content itself: full-text search.

I envision using the data indexed in Elasticsearch to train a neural network. The result could be used to analyze or make predictions based on new data. If an auto repair facility had images of damaged cars and their repair costs, could they use this to help estimate new repairs? There are already cases where this sort of work is being done in health care.

Code: https://github.com/rickerg0/rudolph