Brave little head-on with OpenCV at a startup
I work at this startup where we make some “edu-tainment” games for kids, fusing shiny technologies like Augmented Reality and Computer Vision with traditional physical toys. Example — an Android app that detects and tracks an image stuck on, let’s say, a flash card (flat), or a cylindrical surface, or a sphere, or any surface really, as long as the image being tracked is decently visible (that’s a different topic). Tracking the image means the app knows what it is looking at — whether it’s the cartoon of a tiger, a brand’s logo, or a complicated piece of machinery. Once the app knows what is being shown to it via the device’s camera, it can play a video, render a 3D object, open up a tutorial, do whatever the platform supports — the possibilities are endless; it’s no surprise big names are bullish about AR, and so are we at PlayShifu 😀. In addition to maintaining and updating our current games, we’re planning an AR globe game called Orboot, do check it out 😉.
Of course, we make (and release) ‘some’ games for kids — and brainstorm, prototype and test A LOT of them — part of the philosophy that startups (and not only startups?) should experiment a lot to build great products.
One such quick prototype we brainstormed and wanted to try hacking up was a simple alphabet game for kids — say, a game where the child learns spelling by placing small chips, each printed with a letter of the alphabet, in the right order. Or much simpler — a “match the following” sorta game where the kid is shown a letter (say, ‘A’), searches for the ‘A’ card/chip in the play kit and shows it to the app, which then responds with a “right/wrong answer” prompt. A key technical capability of the app is to identify individual letters of the alphabet captured in the camera feed.
We found, of course, myriad ways we could hack this up — depending on the exact requirements, things like Haar classifiers, SIFT, etc. that work for some cases and not for others, and, OTOH, more robust approaches like neural nets that work for a larger variety of cases. Amid all the reading and the multiple tabs of research papers that were eventually let go (read, “closed, unread” 😜), the lazy ass in me wanted to try a shortcut.
Initial reading
Robust methods like machine learning (and neural nets) work well (in fact, thrive and shine) on large data-sets. Brute force methods, OTOH, while unusable on large data, aren’t too bad on small data (of, let’s say, just 26 elements? Yep, that’s the hint). So I thought, what if we skip all the jazz and simply feed our computer 26 images — one for each capital letter of the alphabet — and, when the app is run, subtract each chip image seen in the camera feed from those training images. Wherever the difference is (nearly) zero — that’s where the answer lies. It’s like telling a friend to think of a number between 1 and 26, then asking them “the number you thought of minus 1 equals?”, “minus 2 equals?”, and so on right up to 26 — the question whose answer comes out to zero gives the number away.
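To make that “subtract and look for (near) zero” idea concrete, here’s a minimal sketch of it using OpenCV’s Java bindings (my illustration for this post, not the prototype’s actual code). It assumes the chip image and the 26 training images are already thresholded, single-channel Mats of the same size, and it counts differing pixels with `Core.countNonZero` for brevity; the method and variable names are made up.

```java
import org.opencv.core.Core;
import org.opencv.core.Mat;

// Hypothetical sketch of the brute-force idea: diff the chip against all 26
// training images and pick the one that differs the least. Assumes everything
// is already a thresholded, single-channel Mat of the same size.
static char guessLetter(Mat chip, Mat[] trainingImages) {
    Mat diff = new Mat();
    int bestIndex = 0;
    int fewestDifferingPixels = Integer.MAX_VALUE;
    for (int i = 0; i < trainingImages.length; i++) {
        Core.absdiff(chip, trainingImages[i], diff);   // pixel-wise |chip - training image|
        int differing = Core.countNonZero(diff);       // how many pixels disagree
        if (differing < fewestDifferingPixels) {
            fewestDifferingPixels = differing;
            bestIndex = i;
        }
    }
    return (char) ('A' + bestIndex);  // trainingImages[0] -> 'A', ..., trainingImages[25] -> 'Z'
}
```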
With all the robust methods and detailed algorithms typically used for purposes like text recognition, the idea (of simply subtracting images to detect the letter) initially seemed outlandish — but because I couldn’t pin down exactly why it wouldn’t work, the inquisitive ass in me found no other way but to actually try it out.
Before this, I had zero OpenCV experience. I just had a vague idea about the suite of image processing capabilities the API provided, and that an image’s pixels are structured as a matrix (`Mat`, or `cv::Mat` in C++, being the name of the class that represents this matrix data structure). However, because my usual tendency is to (sometimes unnecessarily) dive into too much detail prematurely, this time I decided to somehow quickly try out the approach I had in mind, without thinking about optimisation or better usage of the available API functions. This meant finding out — at a very high level — how to use the bare minimum number of functions to perform the relevant chip image subtraction (and to initially store the training images to be subtracted against). In other words, I wanted to start learning some basics of OpenCV by doing; I decided to collide with it head-on.
The diff approach
My parallel web search on relevant literature led me to Prof Arnab Nandi’s blog post, which not only allowed me to collide with OpenCV head-on but also gave me the confidence to proceed with the hack. Huge thanks to you, Prof Nandi! With the help of that blog post, and not without stumbling along the way, I learned, one by one, commonly used image processing concepts like image registration, pre-processing, artifact removal, thresholding, contours, perspective transforms, etc. Much like the approach Prof Nandi’s blog described, I followed these steps:
- From each frame of the cam feed, we’re interested in extracting only the alphabet chips visible. This was done by finding contours that approximated a 4-sided polygon (we were working with little square cards). Of course, finding contours was preceded by pre-processing steps — Gaussian blur and thresholding. Further, contours of interest were filtered by a range check on their area.
- Each contour that looked like a 4-sided polygon was then passed through a perspective transform. The “looked like” check was an `approxPolyDP` followed by a `checkVector`. (These two steps are sketched in the first code block after this list.)
- Each “perspectively-transformed” contour was then resized to 20x20 and `Core.absdiff`ed against the 20x20s from the “training image” set (26 of them, one for each letter of the alphabet).
- How “different” the cam-feed image is from each training image was measured by simply counting the number of white pixels in the corresponding diff. (Remember that diffing two thresholded images yields a thresholded image.) The API I used for this — `Core.sumElems`. (The second sketch after this list shows this scoring step.)
- The one where the number of white pixels is least is likely the chip shown to the cam.
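For the curious, here’s roughly what the “find the square chips and warp them” steps could look like with OpenCV’s Java API. Treat it as a sketch under assumptions: the frame is already grayscale; the blur kernel, Otsu thresholding, area range and `approxPolyDP` epsilon are placeholder choices of mine; and the corner ordering needed before the perspective transform is glossed over. It is not a copy of our prototype code.

```java
import org.opencv.core.*;
import org.opencv.imgproc.Imgproc;
import java.util.ArrayList;
import java.util.List;

// Rough sketch: pre-process a grayscale frame, find contours that reduce to
// 4 corners, and warp each candidate into a canonical 20x20 binary chip.
static List<Mat> extractChipCandidates(Mat gray) {
    Mat blurred = new Mat();
    Mat binary = new Mat();
    Imgproc.GaussianBlur(gray, blurred, new Size(5, 5), 0);
    Imgproc.threshold(blurred, binary, 0, 255,
            Imgproc.THRESH_BINARY_INV | Imgproc.THRESH_OTSU);

    List<MatOfPoint> contours = new ArrayList<>();
    Imgproc.findContours(binary.clone(), contours, new Mat(),
            Imgproc.RETR_EXTERNAL, Imgproc.CHAIN_APPROX_SIMPLE);

    List<Mat> chips = new ArrayList<>();
    for (MatOfPoint contour : contours) {
        double area = Imgproc.contourArea(contour);
        if (area < 1000 || area > 50000) continue;  // placeholder area range

        // Approximate the contour; keep only those that reduce to 4 corners.
        MatOfPoint2f curve = new MatOfPoint2f(contour.toArray());
        MatOfPoint2f approx = new MatOfPoint2f();
        double epsilon = 0.04 * Imgproc.arcLength(curve, true);
        Imgproc.approxPolyDP(curve, approx, epsilon, true);
        if (approx.total() != 4) continue;

        // Warp the quad to a canonical 20x20 square.
        // (Real code would first sort the 4 corners into a consistent order.)
        MatOfPoint2f square = new MatOfPoint2f(
                new Point(0, 0), new Point(19, 0),
                new Point(19, 19), new Point(0, 19));
        Mat transform = Imgproc.getPerspectiveTransform(approx, square);
        Mat chip = new Mat();
        Imgproc.warpPerspective(binary, chip, transform, new Size(20, 20));
        chips.add(chip);
    }
    return chips;
}
```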
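And the scoring step: `Core.sumElems` returns the per-channel sum of a Mat, so on a single-channel diff of two binary (0/255) images, dividing that sum by 255 gives the white-pixel count. Again, just an illustrative sketch:

```java
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Scalar;

// Score = number of white (differing) pixels between a warped 20x20 chip and a
// 20x20 training image, both binary (0/255) and single-channel. Lower is better.
static double diffScore(Mat warpedChip, Mat trainingImage) {
    Mat diff = new Mat();
    Core.absdiff(warpedChip, trainingImage, diff);   // pixels that disagree end up as 255
    Scalar total = Core.sumElems(diff);              // per-channel sum; single channel => val[0]
    return total.val[0] / 255.0;                     // divide by 255 to get a pixel count
}
```

Run that against all 26 training Mats and pick the smallest score; that’s the “where the difference is (nearly) zero” idea from earlier.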
Results
Results? 😜 Surprise! Android Java running on ART on a Lenovo tablet gave “playable game” speeds. Add some game concept, scoring and UI, and we’re ready to go! Not that it didn’t take several iterations, bouts of documentation-hunting and runtime errors before we had a working prototype. Up to 8 different letters could be detected in a frame before the frame rate started to hiccup slightly.
Further challenges
- Lighting effects/variations: Lighting variations were causing inconsistent thresholding, which disturbed the detection. At one place, I changed from a standard `threshold` to an `adaptiveThreshold` to mitigate disturbances due to lighting variations (a sketch of this swap follows the list).
- All four angles: We wanted to be able to detect an alphabet chip shown to the cam at any angle. The perspective transform solved that by ensuring that any orientation of the visible letter gets rotated to one out of a maximum of 4 possible orientations. For a highly symmetric letter shape like that of the letter ‘I’, only 2 orientations are enough — one “horizontal” and one “vertical”.
- Collisions: Certain orientations of certain letters collided with each other. For example, a 90-degree-rotated ‘Z’ looked like an ‘N’; a 180-degree-rotated ‘M’ looked like a ‘W’; and so on. We tried to resolve these by choosing fonts in which the colliding letters either differ in shape or carry distinct accents, so that the collisions go away.
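For the lighting fix in particular, the change was essentially swapping one call for another. Here’s a hedged sketch of that swap; the block size and the constant are placeholder values I picked for illustration, not whatever we actually tuned them to.

```java
import org.opencv.core.Mat;
import org.opencv.imgproc.Imgproc;

// Binarize a grayscale frame. A single global threshold struggles when lighting
// varies across the frame; adaptiveThreshold picks a threshold per neighbourhood.
static Mat binarize(Mat gray) {
    Mat binary = new Mat();
    // Before (global threshold, one cutoff for the whole frame):
    // Imgproc.threshold(gray, binary, 127, 255, Imgproc.THRESH_BINARY_INV);

    // After (adaptive threshold, computed over each 11x11 neighbourhood):
    Imgproc.adaptiveThreshold(gray, binary, 255,
            Imgproc.ADAPTIVE_THRESH_GAUSSIAN_C, Imgproc.THRESH_BINARY_INV, 11, 2);
    return binary;
}
```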
Conclusion
Overall, it was a pretty fun experiment. 😀 We’re yet to sell something that uses this experiment, but needless to say, we’ve preserved the code, the app builds and the training images. It’ll be interesting to see what else will be required to make this whole experiment robust and “ready to ship”.