Utilizing Machine Learning in the Palm of Your Hand With iOS’s ML Frameworks

Capital One Tech

By Kevin Ferrell, Lead Software Engineer, Capital One

Earlier this year, Apple introduced new Machine Learning (ML) frameworks in iOS 11 — Core ML and Vision — that make it easy and intuitive for developers to incorporate ML models into their apps, improving the speed, power, and overall experience for users. As a developer on the internal mobile app team at Capital One, I wanted to try out a use case for implementing these new frameworks that would tie into my focus on improving our employees’ workplace experience.

The Use Case

Like many large companies with open workspaces, Capital One sees heavy demand for conference rooms, and impromptu meetings often lead to last-minute scrambles to secure a space. Today, this requires employees to either use a desktop application to find an available room, or to “squat” in what appears to be an empty room and hope that another team hasn’t already reserved it.

To fix this problem, our team is building a new mobile app that makes reserving a conference room simpler and more streamlined. One of the planned features, Reserve a Room by Photo, will enable employees to snap a picture of a conference room sign to determine whether it’s immediately available, and then reserve the room directly from their phone (no more getting kicked out of conference rooms!).

To enable this new feature, we needed to create two functions that are difficult to implement using traditional programming methods:

  1. Recognizing areas of text in the conference room sign picture taken by the user; and
  2. Converting these text areas to text strings so they can be used to reserve the conference room.

This is where Apple’s new ML frameworks come in.

ML Frameworks in Action

The Core ML and Vision frameworks make it easy to incorporate ML into iOS apps.

Core ML enables developers to use pre-trained ML models directly within an app through a native object interface, while Vision rides on top of Core ML to perform ML-driven operations that incorporate the device’s camera, like face detection, object tracking, and text identification.
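As a quick illustration of that interface, here’s a minimal sketch of classifying an image with a pre-trained model through Vision. It assumes a compiled model such as Apple’s downloadable MobileNet has been added to the project (the generated MobileNet class comes from that setup, not from this article):

```swift
import UIKit
import CoreML
import Vision

// A minimal sketch of classifying a UIImage with a pre-trained Core ML model.
// Assumes a compiled model (e.g. Apple's downloadable MobileNet.mlmodel) has been
// added to the project, which generates the `MobileNet` class used below.
func classify(_ image: UIImage) {
    guard let cgImage = image.cgImage,
          let visionModel = try? VNCoreMLModel(for: MobileNet().model) else { return }

    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        guard let best = (request.results as? [VNClassificationObservation])?.first else { return }
        print("\(best.identifier) (confidence: \(best.confidence))")
    }

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}
```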

Vision supports the following operations out of the box:

  • Face and facial feature (mouth, nose, eyes, etc.) detection
  • Barcode detection and processing
  • Image alignment
  • Horizon detection
  • Object tracking
  • Text detection
  • Custom Core ML models

Text detection is the operation we’ll be using in our code example. Another important point: Vision will detect the areas of text in an image, but it won’t convert that text into strings that can then be processed. To get from image to string, we’ll need another ML model that recognizes the text characters.

Here’s the broader workflow for our Reserve a Room by Photo feature:

  1. The user takes a photo of the sign outside of a conference room.
  2. The photo is run through a VNDetectTextRectanglesRequest to identify areas of text in the photo.
  3. The areas of text are extracted into individual images for conversion.
  4. Those individual text images are converted to String objects using the SwiftOCR library.
  5. The converted strings are processed to identify the room number being reserved.
  6. The room is reserved in our resource system and is ready to be used!

Step-by-Step Code

1. User Takes the Photo
The first step is straightforward and uses a simple UIImagePickerController to allow the user to take a photo of the room sign. That photo is then passed to the RoomPhotoViewController, which uses the Vision framework.
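Here’s a rough sketch of that flow; RoomPhotoViewController and its roomImage property stand in for the app’s own types, which aren’t shown in this post:

```swift
import UIKit

// Sketch: present the camera and hand the captured photo to the next screen.
// RoomPhotoViewController and its roomImage property are placeholders for the
// app's own types.
class ReserveRoomViewController: UIViewController,
                                 UIImagePickerControllerDelegate,
                                 UINavigationControllerDelegate {

    @IBAction func takePhotoTapped(_ sender: Any) {
        let picker = UIImagePickerController()
        picker.sourceType = .camera
        picker.delegate = self
        present(picker, animated: true)
    }

    func imagePickerController(_ picker: UIImagePickerController,
                               didFinishPickingMediaWithInfo info: [UIImagePickerController.InfoKey: Any]) {
        picker.dismiss(animated: true)
        guard let photo = info[.originalImage] as? UIImage else { return }

        let roomPhotoVC = RoomPhotoViewController()   // placeholder initialization
        roomPhotoVC.roomImage = photo                 // assumed property
        navigationController?.pushViewController(roomPhotoVC, animated: true)
    }
}
```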

2. Detect Areas of Text in the Image
Next, we’ll use Vision to identify areas of text in the photo. We start by creating a VNImageRequestHandler and passing in the photo; the request handler is responsible for executing Vision requests against that image. (Vision also provides a VNSequenceRequestHandler for operating on video streams instead of static images, if and when a use case requires it.)

We then create a VNDetectTextRectanglesRequest, which accepts a completion handler to process the results. The completion handler receives the finished request, whose results contain the VNTextObservation set identifying all areas of text within the image.
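Put together, the detection step looks roughly like this (detectText(in:) is a helper name of our own, not part of the Vision API):

```swift
import UIKit
import Vision

// Sketch: ask Vision for the rectangles it believes contain text.
func detectText(in image: UIImage) {
    guard let cgImage = image.cgImage else { return }

    let request = VNDetectTextRectanglesRequest { request, error in
        guard error == nil,
              let observations = request.results as? [VNTextObservation] else { return }
        // Each VNTextObservation carries a normalized bounding box for one run of text.
        print("Found \(observations.count) areas of text")
        // ...hand the observations off for outlining and extraction (next step)
    }
    // Also report individual character boxes (not strictly needed for this use case).
    request.reportCharacterBoxes = true

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    DispatchQueue.global(qos: .userInitiated).async {
        do {
            try handler.perform([request])
        } catch {
            print("Text detection failed: \(error)")
        }
    }
}
```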

3. Outline and Extract Text Images
Now that we have an array of VNTextObservation results, we can process them to prepare for conversion to text (the VNTextObservation results are essentially an array of rectangles that identify the parts of the photo that Vision thinks contain text).

Our first processing step outlines these areas on the photo with a green box, making it easy to visually confirm that the right text areas were identified.
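A sketch of that outlining step, assuming the photo is displayed edge to edge in a UIImageView so Vision’s normalized coordinates can be scaled directly onto the view:

```swift
import UIKit
import Vision

// Sketch: outline a detected text area with a green box on top of the image view.
// Assumes the photo fills `imageView` edge to edge.
func drawOutline(for observation: VNTextObservation, in imageView: UIImageView) {
    let box = observation.boundingBox   // normalized (0...1), origin at the bottom-left

    // Flip the y-axis and scale the normalized rect into view coordinates.
    let frame = CGRect(x: box.minX * imageView.bounds.width,
                       y: (1 - box.maxY) * imageView.bounds.height,
                       width: box.width * imageView.bounds.width,
                       height: box.height * imageView.bounds.height)

    let outline = CALayer()
    outline.frame = frame
    outline.borderWidth = 2
    outline.borderColor = UIColor.green.cgColor
    imageView.layer.addSublayer(outline)
}
```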

Next, we’ll extract images that contain the text for each of the VNTextObservation items.
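Here’s a sketch of the extraction step; the extractTextImages(from:observations:) helper name is our own, not part of the Vision API:

```swift
import UIKit
import Vision

// Sketch: crop the original photo down to each detected text area so the cropped
// images can be handed to an OCR library in the next step.
func extractTextImages(from photo: UIImage,
                       observations: [VNTextObservation]) -> [UIImage] {
    guard let cgImage = photo.cgImage else { return [] }
    let width = CGFloat(cgImage.width)
    let height = CGFloat(cgImage.height)

    return observations.compactMap { observation in
        let box = observation.boundingBox   // normalized, origin at the bottom-left
        // Convert the normalized rect into pixel coordinates, flipping the y-axis.
        let rect = CGRect(x: box.minX * width,
                          y: (1 - box.maxY) * height,
                          width: box.width * width,
                          height: box.height * height)
        guard let cropped = cgImage.cropping(to: rect) else { return nil }
        return UIImage(cgImage: cropped)
    }
}
```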

Note that here we’re extracting each contiguous area of text in the photo (i.e. entire lines of text). The VNTextObservation results also contain individual character boxes, but those aren’t needed for this use case.

4. Convert Images to Strings
Now that we have an array of images, we’ll loop over each one and convert it to a String. Here we’re using SwiftOCR, an open source library that uses a neural-network model to perform character recognition on images, to handle the String conversion. (Other libraries can be used as well, including the Tesseract OCR engine maintained by Google.)
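A minimal sketch of that conversion, using SwiftOCR’s asynchronous recognize call (the convertToStrings helper name is ours):

```swift
import UIKit
import SwiftOCR

// Sketch: run each cropped text image through SwiftOCR and collect the results.
// The recognize call is asynchronous, so a dispatch group gathers the strings;
// note the results may not come back in input order.
func convertToStrings(_ textImages: [UIImage],
                      completion: @escaping ([String]) -> Void) {
    let swiftOCR = SwiftOCR()
    let group = DispatchGroup()
    let resultQueue = DispatchQueue(label: "ocr.results")   // serializes appends
    var recognizedStrings = [String]()

    for image in textImages {
        group.enter()
        swiftOCR.recognize(image) { recognizedString in
            resultQueue.async {
                recognizedStrings.append(recognizedString)
                group.leave()
            }
        }
    }

    group.notify(queue: .main) {
        completion(recognizedStrings)
    }
}
```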

5. Identify the Room Number and Reserve the Room
At this point, we have a list of text Strings that have been extracted from the user’s photo. Now we’ll use this information to identify the room number and reserve the room. The code below uses a simplistic method for identifying the room number based on the percentage of numerals to alphabetic characters in each string. Essentially, the function picks the room number based on which text area contains the most numbers.
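A minimal sketch of that heuristic, with identifyRoomNumber(from:) as a hypothetical helper name:

```swift
// Sketch of the simplistic heuristic described above: treat the recognized string
// with the highest proportion of numeric characters as the room number.
func identifyRoomNumber(from recognizedStrings: [String]) -> String? {
    func numericRatio(of text: String) -> Double {
        let characters = text.filter { !$0.isWhitespace }
        guard !characters.isEmpty else { return 0 }
        let digits = characters.filter { $0.isNumber }
        return Double(digits.count) / Double(characters.count)
    }

    return recognizedStrings
        .filter { numericRatio(of: $0) > 0 }
        .max { numericRatio(of: $0) < numericRatio(of: $1) }
}
```

The returned string can then be matched against the reservation system’s room directory before the room is booked.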

While this is an overly simplistic method for identifying room numbers, we’re using it here in an attempt to clearly demonstrate the ML aspects of this process.

Tying It All Together

The demo shows the entire process end to end. Vision correctly identifies the areas of text, which are outlined in green boxes; this is impressive given the odd perspective from which the picture was taken. Next, SwiftOCR correctly converts the images to strings, and we select the room number from the group of text areas.

Key Takeaways

Developing our new app feature prototype left me with several takeaways on Core ML, Vision, and the incorporation of machine learning into apps more broadly:

  • Don’t expect to drop an ML model into your app and see it come alive. Think of the Core ML result as a single data point in your app’s decision analysis.
  • Training ML models is difficult. I was able to use existing ML models for this prototype, but training new models is complex. Core ML makes it simple to incorporate a trained model, but it doesn’t help with training new ones.
  • Machine learning can frequently get the answer wrong. Ensure the user has a way to override the ML result, and collect those corrections so the feedback can be reincorporated into the model.
  • Ensure your test set covers appropriate test cases. You can use automated testing to evaluate changes to your ML model and guard against regressions as you continue refining your training set (a sketch of one such test follows this list).
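Here’s a minimal sketch of a regression-style test for that last point. It assumes a hypothetical recognizeRoomNumber(in:) wrapper around the detection and OCR steps above, and the labeled sample photos bundled with the test target are made up for illustration:

```swift
import UIKit
import XCTest

// Sketch: run labeled sample photos through the recognition pipeline and assert a
// minimum accuracy, so a model or training-set change that hurts accuracy fails the
// build. `recognizeRoomNumber(in:)` and the sample file names are hypothetical.
class RoomRecognitionRegressionTests: XCTestCase {

    // Labeled test set: image file in the test bundle -> expected room number.
    let labeledSamples = [
        "room-1204.jpg": "1204",
        "room-0317.jpg": "0317",
        "room-2250.jpg": "2250"
    ]

    func testRoomNumberAccuracyDoesNotRegress() {
        var correct = 0
        for (fileName, expected) in labeledSamples {
            guard let url = Bundle(for: type(of: self)).url(forResource: fileName, withExtension: nil),
                  let data = try? Data(contentsOf: url),
                  let image = UIImage(data: data) else {
                XCTFail("Missing test image \(fileName)")
                continue
            }
            if recognizeRoomNumber(in: image) == expected {
                correct += 1
            }
        }
        let accuracy = Double(correct) / Double(labeledSamples.count)
        XCTAssertGreaterThanOrEqual(accuracy, 0.9, "Room recognition accuracy regressed")
    }
}
```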

It’s important to realize that while Apple has introduced powerful new ML tools, alternatives do exist and can help to fill in the gaps.

All of the code for this prototype is available under the MIT license on GitHub and can be freely reused and adapted.

We look forward to getting these new app features up and running at Capital One, and welcome your thoughts, feedback, and contributions!

These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2017 Capital One.
