Building a self-driving Headless Browser using Faster R-CNN Object Detection

Jiayi Wang · Published in June.ai
Oct 15, 2018 · 8 min read

This is one of the more exciting projects I’ve been fortunate to work on.

At June.ai, our mission is to use our passion for machine learning to build the future of communication and benefit society at large.

The idea came to us a while back when we were looking at what the OpenAI and DeepMind teams were doing.

After reading about agents teaching themselves to play Atari games, we saw an interesting opportunity to explore in how we process email communication.

OpenAI Gym

Reinforcement learning is a very general framework, and it has started to achieve very good results in many difficult environments. In a nutshell, these agents learn by completing sequences of actions that reward or punish them as they cycle through different states.

Usually these agents have a finite set of controls, such as up-down-left-right plus some auxiliary buttons, to play these Atari games. We wondered: what if we provided the entire keyboard and mouse and hooked the agent up to a browser?

We wanted to see if we could build a self-driving web browser.

There are a number of challenges here, so we just broke it down:

The first problem we wanted to tackle was, can we use computer vision to identify web elements?

First up, Google Cloud Vision. Taking an image of a fully rendered web page and running it through the API, here is what you get:

An email that we ran through Google Cloud Vision
We were able to detect the logo and at least partially extract the text
Another promo email we ran through Google Cloud Vision

Initial results from Google Cloud Vision were impressive, but it really seemed designed more for detecting general objects. We liked that it was able to detect the logo and extract the text.

AWS

We ran the same emails through AWS Rekognition and got back another set of general object detections.

And finally, we tried one more promising image recognition API, Clarifai, which also returned general object detections.

Another set of email images run through Clarifai

Looking at the results, it was impressive that these services could accurately detect objects in the images and turn them into words that described them well. However, you can easily tell that these generalized models weren’t really meant for the web; they were built more for general photos. When humans see objects in a browser or an email, we look for buttons, give more weight to larger words, and have specific reactions to different web elements.

So we decided the first step would be to build our own model, more specific to what we wanted to do.

We started small, picking a few items like headlines, subheaders, and buttons, and trained a Faster R-CNN model. We just wanted to see if we could connect the two technologies: a Selenium headless browser streaming video into a CNN model that could detect the objects we had trained it on.

We already know that convolutional neural networks have driven huge breakthroughs in the field of computer vision. Unlike traditional methods, deep CNNs learn relatively small pieces of information in their early layers and integrate them deeper in the network, and are thus able to handle vast amounts of variation in images.

Driven by the success of region proposal methods and region-based CNNs, object detection is widely applied to surveillance, vehicle detection, manufacturing inspection, and more. Building an object detection model is slightly different from the traditional CNN followed by a fully connected layer: since the number and class of objects is not fixed, the length of the output varies. Thus we need to select different regions within each image and use a CNN model to predict whether an object exists in each region.
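To make this concrete, here is a toy Python sketch of the brute-force version of the idea: slide a window across the image and score every crop with a classifier. This is purely illustrative (the classify function, window size, and threshold are placeholders, not our actual code); the whole point of the R-CNN family is to replace this exhaustive loop with learned region proposals.

def propose_regions(image, size=128, stride=64):
    """Yield (x, y, w, h) candidate boxes from a simple sliding window.

    image is an H x W x 3 NumPy array.
    """
    height, width = image.shape[:2]
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            yield (x, y, size, size)

def detect(image, classify):
    """Score each candidate region; classify(crop) -> (label, score)."""
    detections = []
    for (x, y, w, h) in propose_regions(image):
        label, score = classify(image[y:y + h, x:x + w])
        if score > 0.8:  # keep only confident regions
            detections.append((label, score, (x, y, w, h)))
    return detections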

(Source: https://arxiv.org/pdf/1506.01497.pdf)

Faster R-CNN (proposed by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun) is composed of two modules: a deep fully convolutional network that proposes regions, and a Fast R-CNN detector that uses the proposed regions. Instead of using selective search as the region proposal method, a second network is trained to predict the regions. The RoI pooling layer then takes the convolutional features together with the predicted bounding boxes.

Here is a brief overview of how we trained our own object detection model.

We first collected 1,500+ real-world emails and converted them from HTML into RGB images encoded as JPEGs. We generated the images by rendering each email’s HTML in a dockerized Selenium headless Chrome browser. To label the objects in an email, we need a table of bounding boxes with coordinates that define the class region of each object.
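As a rough sketch of the rendering step (the paths and window size here are made up, and the options keyword may differ between Selenium versions), it looks something like this. Note that save_screenshot writes a PNG, so we re-encode to JPEG with Pillow:

from PIL import Image
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--window-size=800,2000")

# For a dockerized browser you would use webdriver.Remote(...) instead.
driver = webdriver.Chrome(options=options)
try:
    driver.get("file:///data/emails/promo_001.html")
    driver.save_screenshot("/tmp/promo_001.png")
finally:
    driver.quit()

# Re-encode the PNG screenshot as an RGB JPEG for the training set.
Image.open("/tmp/promo_001.png").convert("RGB").save(
    "/data/images/promo_001.jpg", "JPEG")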

We chose LabelImg, a graphical image annotation tool, to manually label all the email images with objects like headline, button, image, discount, etc. The bounding box coordinates of each object are then saved in an XML file (LabelImg writes these in Pascal VOC format).

After hand-labeling the images, we convert the XML files into a TFRecord file, which serves as the input data for training. With the TensorFlow Object Detection API, we are able to train our own object detection model with custom labels.
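Here is a condensed sketch of that conversion for a single image, loosely following the feature names the TensorFlow Object Detection API expects in its TFRecords (TF 1.x APIs; our real label map and paths differ):

import xml.etree.ElementTree as ET
import tensorflow as tf

LABEL_MAP = {"headline": 1, "subheader": 2, "button": 3, "discount": 4}  # illustrative

def voc_to_example(xml_path, jpeg_path):
    """Turn one LabelImg (Pascal VOC) XML file into a tf.train.Example."""
    root = ET.parse(xml_path).getroot()
    width = float(root.find("size/width").text)
    height = float(root.find("size/height").text)
    xmins, xmaxs, ymins, ymaxs, labels, texts = [], [], [], [], [], []
    for obj in root.findall("object"):
        name = obj.find("name").text
        box = obj.find("bndbox")
        xmins.append(float(box.find("xmin").text) / width)   # normalized to [0, 1]
        xmaxs.append(float(box.find("xmax").text) / width)
        ymins.append(float(box.find("ymin").text) / height)
        ymaxs.append(float(box.find("ymax").text) / height)
        labels.append(LABEL_MAP[name])
        texts.append(name.encode("utf8"))
    with tf.gfile.GFile(jpeg_path, "rb") as f:
        encoded = f.read()

    def bytes_feat(values):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

    def float_feat(values):
        return tf.train.Feature(float_list=tf.train.FloatList(value=values))

    return tf.train.Example(features=tf.train.Features(feature={
        "image/encoded": bytes_feat([encoded]),
        "image/format": bytes_feat([b"jpeg"]),
        "image/object/bbox/xmin": float_feat(xmins),
        "image/object/bbox/xmax": float_feat(xmaxs),
        "image/object/bbox/ymin": float_feat(ymins),
        "image/object/bbox/ymax": float_feat(ymaxs),
        "image/object/class/label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=labels)),
        "image/object/class/text": bytes_feat(texts),
    }))

with tf.python_io.TFRecordWriter("train.record") as writer:
    writer.write(voc_to_example("promo_001.xml", "promo_001.jpg").SerializeToString())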

Training an object detection model with a CNN can be time consuming and computationally expensive, so we chose to train it on Google Cloud ML Engine. We set up the config with standard GPUs and 5 workers:

trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
Google Cloud ML Engine Jobs Console

Here you can see we had already reached a low loss after 20,000 steps.

After our model has been trained, we export one of its checkpoints to a TensorFlow graph proto (a frozen inference graph).

Next, we just need to apply the exported graph to a video stream. This is done with OpenCV, an open source computer vision library for processing images and videos.
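A minimal version of that inference loop looks roughly like this (TF 1.x style; the tensor names are the standard outputs of graphs exported by the TensorFlow Object Detection API, while the file names are placeholders):

import cv2
import numpy as np
import tensorflow as tf

# Load the exported frozen inference graph.
graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

cap = cv2.VideoCapture("browser_stream.mp4")  # or a live capture source
with tf.Session(graph=graph) as sess:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # The graph expects a batch of RGB images; OpenCV gives us BGR.
        rgb = np.expand_dims(frame[..., ::-1], axis=0)
        boxes, scores = sess.run(
            ["detection_boxes:0", "detection_scores:0"],
            feed_dict={"image_tensor:0": rgb})
        h, w = frame.shape[:2]
        for box, score in zip(boxes[0], scores[0]):
            if score < 0.5:  # confidence threshold
                continue
            ymin, xmin, ymax, xmax = box  # normalized coordinates
            cv2.rectangle(frame, (int(xmin * w), int(ymin * h)),
                          (int(xmax * w), int(ymax * h)), (0, 255, 0), 2)
        cv2.imshow("detections", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()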

And we got this:

Very cool! We were able to detect web elements that are far more specific to how we actually look at web pages and emails.

This technology and process is very similar to that of self-driving cars, where the model detects very specific objects. For example, Tesla Autopilot ignores the clouds in the sky, whereas a very general model might have picked them up.

What about a real world example?

We’re using a ton of natural language processing algorithms, but those algorithms are all about text or code. Humans, as visual creatures, find importance in the way buttons are shown, the fonts and sizes of titles, and the placement of images.

One particular example: a lot of promotional emails contain images that say “70% off” in big, huge letters but never say it in the text (or it’s very difficult to filter out because of the noise). So we trained the model to detect anything with a “%” inside promotional images.

After rerunning the same process above, we were able to successfully detect “discount” tags on images.

Admittedly, there are a number of issues with this approach.

First, you’d have to render the entire email as a web page and save that giant image somewhere. Then you run the image through the model, label it, parse out the text, and save that text to the database. The biggest issue was latency: all of our other microservices respond in under 500 ms, while this particular microservice would be a lot slower due to images being transferred and processed.

We did have success with the ability to click unsubscribe buttons for you. Currently the easiest way to build an “unsubscribe” feature is to look at the email headers and reply to the address listed as the unsubscribe contact, asking them to take you off the list. Even though there are anti-spam laws, because email is highly decentralized, people send emails without these headers and include just an unsubscribe link. For the small percentage of emails that do not have the unsubscribe headers, we click the unsubscribe links for you. We’ve dockerized a headless Chrome browser to be ready to go to the link and even click “unsubscribe” buttons for you. The great thing is that even though latency is high, it isn’t really an issue, since the user isn’t waiting for anything after they unsubscribe from an email.
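Sketched out, that flow looks roughly like this. The header parsing follows the standard List-Unsubscribe format (RFC 2369); the XPath selector in the fallback is just an illustration of the browser-driving path, not our production logic:

import email
from selenium import webdriver

def unsubscribe(raw_message, fallback_link=None):
    """Prefer the List-Unsubscribe header; otherwise drive a browser."""
    msg = email.message_from_string(raw_message)
    header = msg.get("List-Unsubscribe")
    if header:
        # The header holds mailto: and/or http(s) URIs, e.g.
        # <mailto:unsub@example.com>, <https://example.com/unsub?id=123>
        return [uri.strip(" <>") for uri in header.split(",")]
    if fallback_link:
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        driver = webdriver.Chrome(options=options)  # dockerized in practice
        try:
            driver.get(fallback_link)
            # Click the first element whose text mentions "unsubscribe".
            driver.find_element_by_xpath(
                "//*[contains(translate(text(), 'UNSUBSCRIBE', 'unsubscribe'),"
                " 'unsubscribe')]").click()
        finally:
            driver.quit()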

The eventual goal of this self-driving browser project would be to start performing human level tasks that people don’t want to do, but machines are good at.

We think this is very doable: focus on the most repetitive, high-volume tasks that users perform, then expand to more generalized tasks. Although we’re still working through applying this technology to more difficult problems, we’re glad to have shared some of these concepts with you.

A big thank you to John Jung, Allie Sutton, Dan Radenkovic, and Kaitlin Walker for helping me put this post together!

Connect with us, give us a shout, let’s build something new

Twitter: @junedotai

Medium: June.ai

Facebook: June.ai
