Solving CAPTCHAs with TensorFlow and Ruby

At the Blacklight event at DevFest Siberia, the organizers set a really interesting task that involved breaking CAPTCHAs. In this post, I will present the interesting ideas to be learned from the contest and show how to apply them in Ruby. This post was originally inspired by the work of Natalie Pistunovich on the Gopher Academy blog.

Everyone hates CAPTCHAs — those annoying images that contain text you have to type in before you can access a website. CAPTCHAs were designed to prevent computers from automatically filling out forms by verifying that you are a real person. But with the rise of deep learning and computer vision, they can now often be defeated.

The challenge was this: you have to break into a room without the surveillance cameras capturing your break-in attempt. Disabling the camera requires entering a four-digit security PIN into a CAPTCHA-protected form.

Provided were a TensorFlow SavedModel in the binary ProtoBuf format, trained to recognize that particular captcha (tensorflow-savedmodel-captcha.zip, 27.7 MB), and a link to the camera interface. I want to help you understand how to solve the underlying problem in this blog post.

Being handed a TensorFlow model means doing some TensorFlow!

A Few Words about TensorFlow

TensorFlow is an extraordinary open source software library for numerical computation using data flow graphs. TensorFlow runs computations involving tensors.

A tensor is a generalization of vectors and matrices to potentially higher dimensions. Internally, TensorFlow represents tensors as n-dimensional arrays of base datatypes.

A tensor is defined by the data type of the values it holds and its shape, which is the number of dimensions and number of values per dimension.
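To make the idea of a shape concrete, here is a small sketch in plain Ruby. The helper shape_of is hypothetical (it is not part of tensorflow.rb) and assumes the nested arrays are rectangular, so it only looks at the first element at each level.

```ruby
# Hypothetical helper (not part of tensorflow.rb): infer the shape of a
# rectangular nested Ruby array by walking down its first elements.
def shape_of(value)
  shape = []
  while value.is_a?(Array)
    shape << value.length   # number of values along this dimension
    value = value.first     # descend one dimension
  end
  shape
end

scalar = 7                        # rank 0: no dimensions
vector = [1.0, 2.0, 3.0]          # rank 1: three values
matrix = [[1, 2, 3], [4, 5, 6]]   # rank 2: two rows of three values

p shape_of(scalar)  # => []
p shape_of(vector)  # => [3]
p shape_of(matrix)  # => [2, 3]
```

The length of the returned array is the number of dimensions, and each entry is the number of values along that dimension.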

The “Flow” part of the name describes the fact that the graph (model) is essentially a set of nodes (operations), and the data (tensors) “flows” through those nodes, undergoing mathematical manipulation. You can look at, and evaluate, any node of the graph. If you want to have some fun with pre-built TensorFlow models or explore some interesting visualizations, go play around with the Embedding Projector.
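The data-flow idea can be illustrated with a toy graph in plain Ruby. This is only a sketch of the concept, not how TensorFlow is implemented; the Node struct and its evaluate method are made up for this example.

```ruby
# Toy data-flow graph (illustration only, unrelated to TensorFlow internals).
# Each node holds an operation and the nodes whose outputs it consumes.
Node = Struct.new(:name, :op, :inputs) do
  # Evaluate the input nodes first, then apply this node's operation.
  def evaluate
    op.call(*inputs.map(&:evaluate))
  end
end

a   = Node.new("a",   -> { 2.0 }, [])               # constant
b   = Node.new("b",   -> { 3.0 }, [])               # constant
sum = Node.new("sum", ->(x, y) { x + y }, [a, b])   # a + b
out = Node.new("out", ->(x) { x * 10 }, [sum])      # (a + b) * 10

# Any node can be evaluated, not just the last one.
puts sum.evaluate  # prints 5.0
puts out.evaluate  # prints 50.0
```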

A Few Words about TensorFlow Ruby

TensorFlow comes with an easy-to-use Python interface and a C++ interface to build and execute your computational graphs. However, it was not available in Ruby, and given the strong interest from the Ruby community, I took an interest in porting it. I started working on the Ruby API with support from Somatic.io and the SciRuby foundation and came across some cool things along the way. You can read my previous blog post if you would like to explore TensorFlow Ruby. Aside from this, you can take a look at the tutorials on image recognition and protocol buffers in tensorflow.rb.

TensorFlow Ruby is a very interesting project for ML, but for now it is most suitable for running pre-trained models. Its capabilities are currently limited, but we are working on adding many new features over time.

Let’s dive right in…

The interface I was facing seemed pretty close to a standard CAPTCHA-protected form. Here’s how you can go about solving the problem:

  1. Inspect the model
  2. Load the model and the image
  3. Specify the operations and run the model

1. Inspecting the model binary

SavedModel is the universal serialization format for TensorFlow models. 
 — TensorFlow documentation

After downloading the SavedModel provided by the Blacklight team, the first step is to understand what the file is and what it does. So we used the TensorFlow SavedModel CLI to learn more about the model.

Here’s the command and its output:

$ saved_model_cli show --dir ./tensorflow_savedmodel_captcha --all
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
signature_def['serving_default']:
The given SavedModel SignatureDef contains the following input(s):
inputs['input'] tensor_info:
dtype: DT_STRING
shape: unknown_rank
name: CAPTCHA/input_image_as_bytes:0
The given SavedModel SignatureDef contains the following output(s):
outputs['output'] tensor_info:
dtype: DT_STRING
shape: unknown_rank
name: CAPTCHA/prediction:0
Method name is: tensorflow/serving/predict

We learn two things here:

i) The model takes a single input, CAPTCHA/input_image_as_bytes, so it expects the input image as raw bytes.

ii) It outputs the prediction as a string, in CAPTCHA/prediction.

We can use this pre-trained model as a black box to solve our problem.
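If you would rather not copy the tensor names by hand, they can be pulled out of the CLI output with a few lines of plain Ruby. This is a rough sketch that assumes the output has the same layout as the listing above; the helper tensor_names is hypothetical and not part of any TensorFlow tooling.

```ruby
# Excerpt of the saved_model_cli output shown above.
cli_output = <<~TEXT
  inputs['input'] tensor_info:
  dtype: DT_STRING
  shape: unknown_rank
  name: CAPTCHA/input_image_as_bytes:0
  outputs['output'] tensor_info:
  dtype: DT_STRING
  shape: unknown_rank
  name: CAPTCHA/prediction:0
TEXT

# Hypothetical helper: collect the "name:" value that follows each
# inputs[...] or outputs[...] entry.
def tensor_names(text)
  names = { "inputs" => [], "outputs" => [] }
  section = nil
  text.each_line do |line|
    line = line.strip
    if (m = line.match(/\A(inputs|outputs)\['[^']+'\]/))
      section = m[1]                 # remember which kind of entry we are in
    elsif section && (m = line.match(/\Aname:\s*(\S+)/))
      names[section] << m[1]         # tensor name belonging to that entry
    end
  end
  names
end

tensor_names(cli_output).each do |kind, list|
  puts "#{kind}: #{list.join(', ')}"
end
```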

2. Load the model and the image

Here is the reference image that we want to break.

This CAPTCHA image was built using the Python module named captcha. We have to load this image in binary format and transform it into a tensor.

As for the model, we can load the model with LoadSavedModel. The full signature is:

def LoadSavedModel(exportDir, tags, options)

The function takes three arguments: the export directory, tags, and session options. Explaining tags and options could easily take up the entire post and would shift the focus, so for now you only need to know that serve is the tag used to serve TensorFlow models, and that session options are not required in our case.

Then we load the image file in binary format and specify a string tensor corresponding to the image.
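Reading the image as raw bytes needs nothing beyond Ruby's standard library. The sketch below uses a hypothetical helper, read_image_bytes, and a temporary file containing only the PNG signature so that it runs without the real CAPTCHA image; in practice you would pass the path of the downloaded image instead. Wrapping the bytes in a string tensor is left to the binding and is not shown here.

```ruby
require "tempfile"

# Hypothetical helper: read a file as raw bytes, with no newline or
# encoding conversion. File.binread returns an ASCII-8BIT (binary) String.
def read_image_bytes(path)
  File.binread(path)
end

# Stand-in for the real CAPTCHA image: a file holding only the 8-byte
# PNG signature, so the example is runnable on its own.
png_signature = "\x89PNG\r\n\x1A\n".b

Tempfile.create("captcha") do |file|
  file.binmode
  file.write(png_signature)
  file.flush

  bytes = read_image_bytes(file.path)
  puts bytes.encoding   # ASCII-8BIT, i.e. a binary string
  puts bytes.bytesize   # number of bytes read
end
```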

3. Specify the operations and run the model

To run a prediction we need to supply inputs, called feeds (operations to feed our data to, mapped to tensors containing the data), and outputs, called fetches (operations to fetch the data from).

We only have one feed (input):

  • the feed operation is CAPTCHA/input_image_as_bytes
  • the feed tensor is a string containing the CAPTCHA image as bytes.

Likewise, there is only one fetch (output): CAPTCHA/prediction.

After we run the model with our feeds and fetches, we receive the output — the captcha prediction.
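As a rough sketch of how the feeds and fetches fit together: tensor names in the SignatureDef have the form "operation:output index", so CAPTCHA/prediction:0 is output 0 of the operation CAPTCHA/prediction. The helper split_tensor_name and the variable names below are illustrative only, and the placeholder bytes stand in for the real image. The actual call that runs the session depends on the binding, so it is deliberately left out.

```ruby
# Hypothetical helper: split "<operation>:<output index>" into its parts.
def split_tensor_name(name)
  op, _, index = name.rpartition(":")
  [op, Integer(index)]
end

# Placeholder bytes standing in for the real CAPTCHA image.
image_bytes = "\x89PNG".b

input_op,  _input_index  = split_tensor_name("CAPTCHA/input_image_as_bytes:0")
output_op, _output_index = split_tensor_name("CAPTCHA/prediction:0")

# Feeds map each input operation to the data we supply for it.
feeds = { input_op => image_bytes }

# Fetches list the operations whose results we want back.
fetches = [output_op]

puts "feed:  #{feeds.keys.first} (#{image_bytes.bytesize} bytes)"
puts "fetch: #{fetches.first}"
```

A session run would take these two structures and return one result per fetch, which here is the predicted CAPTCHA text.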

To Wrap This Up

The full code, with everything put together, is available on GitHub. Please go ahead and try it out. The pre-trained model provided for the challenge is based on emedvedev/attention-ocr.

Hackathons and programming challenges are fun, and we should all try to take part in them even if we aren’t sure of our own programming abilities. The main goal of attending any such event is to learn from people who are more accomplished and to become better developers ourselves.

Acknowledgements

Thanks to Kaustubh Hiware, Natalie Pistunovich and Edward from the Blacklight Team (Twitter) for helping me with the post.

Have a great day!!!