Original image of car: https://www.pexels.com/photo/white-mercedes-benz-cars-120049/

Using image processing to extract the perfect shot out of a screenshot

Kevin Vo
Published in carsales-dev
Dec 1, 2019 · 17 min read

When sellers advertise their items online, they try to upload nice-looking photos in order to attract potential customers. There are situations where sellers are unable to find suitable images for their items, and so opt to upload screenshots instead. Screenshots are not ideal, as they distract the user and make the content harder to see. We wanted to help sellers showcase their items regardless of the media they uploaded, and we wanted to improve the purchasing journey for customers. To do this, we created a tool which could detect screenshots and, if found, extract the content from the screenshot in order to produce a better-looking image for buyers and sellers.

The process

1. Screenshot detection (determining if an image was a screenshot or not)

2. Screenshot extraction (retrieving the content out of a screenshot)

In order to extract the perfect shot out of a screenshot, we needed to solve two problems. The first was how to determine whether an image was a screenshot in the first place. We needed our method to detect screenshots accurately, otherwise we would be transforming other images for no reason. The second was how to extract the “shot”, or content, out of the screenshot itself. As screenshots came in many different shapes and sizes, we needed a generic method which could work on any uploaded screenshot.

Thus, we started with our first problem: finding a way to classify screenshots.

Detecting whether or not an image is a screenshot

The different parts of a screenshot: the screenshot space, the screenshot content and (sometimes) the screenshot user interface/elements.

Screenshots usually contained three things: the content, blank screenshot space and UI elements around the page. Screenshots shared a number of key characteristics, such as:

  • Content area being segregated into its own section (usually found in the center)
  • Content area being rectangular in shape
  • Screenshot space being solid in colour

The first thing we tried was to detect areas of screenshot space. If there were two distinct rectangles of screenshot space which touched the bounds/edges of the image, this could have been an indicator for screenshots. This was fine when there were no UI elements, but as soon as there were some obstructions in the screenshot (like scrollbars or status bars), this no longer worked.

Detecting areas of screenshot space. The top rectangle touches three bounds of the image, whereas the bottom rectangle only touches one edge. Using this method, we needed both rectangles to touch three edges of the image in order for it to be a screenshot.

Pixel sampling

We tried another method: pixel sampling. In this approach, we sampled a number of points from the input image and kept track of the occurrences for each colour.

An example of sampling a 4 pixel image. From sampling, we can see that the image is composed of 2 black pixels, 1 yellow pixel and 1 red pixel.

We tried sampling every pixel in the image to not lose any data, but this process turned out to be extremely slow (an image with dimensions 2000x2000 (width*height) resulted in 4 million pixels sampled).

Instead of sampling every point, we tried sampling a grid of points. We drew x lines vertically and x lines horizontally, spaced evenly across the image, and every time the lines intersected we sampled that point.

Sampling a 6x6 grid of points at the intersections.

By using this approach, we were able to generate an approximation of the image very quickly. We chose x = 50, meaning every image was sampled 50*50 = 2500 times regardless of the image's size. Here is an example of what the sampling looked like:

From left to right: original image, image sampled in 40 pixel increments (until the edge of the image is hit), image sampled in a 50x50 grid (equalling 2500 sampled points)

Even though the image was now reduced to 2500 pixels, we could still see a rough outline of the image. Printing out the number of occurrences for each colour revealed something interesting (note that we only printed colours that appeared 5 or more times):

// (rgb_colour) -> times_appeared_in_sample
(255, 255, 255) -> 1254
(247, 247, 247) -> 430
(0, 0, 0) -> 96
(246, 246, 246) -> 52
(247, 246, 246) -> 46
(247, 246, 247) -> 22
(246, 246, 247) -> 20

We can see that white (255, 255, 255) occurred 1254 times in the sample, and light gray (247, 247, 247) occurred 430 times. Together, this was 1684 points or 67% of the entire image that was made up from two colours! The next most frequent colour was black (0, 0, 0) which only occurred 96 times (3.84%), so we can see how much screenshot space affects sample counts in the image.

But why is this the case? The RGB colour space actually consists of about 16.7 million unique colours. Most images use a wide variety of colours, so the chance of a single colour appearing many times is quite low. It is even more unusual for a single colour to make up 20% of an image, unless the image has screenshot space in it! We decided that if a colour appeared at least 250 times (10% of the total sample count), it would be considered a “suspicious colour”, as it was probably a screenshot space colour.
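To make this concrete, here is a rough sketch of the grid sampling and the suspicious colour rule using Pillow. This is an illustrative sketch rather than our exact production code, and the function names are made up:

from collections import Counter
from PIL import Image

GRID_SIZE = 50           # a 50x50 grid = 2500 sampled points per image
SUSPICIOUS_RATIO = 0.10  # a colour is suspicious if it covers >= 10% of the samples

def sample_grid_colours(image_path, grid_size=GRID_SIZE):
    """Sample an evenly spaced grid of pixels and count how often each colour appears."""
    image = Image.open(image_path).convert("RGB")
    width, height = image.size
    pixels = image.load()  # pixel access object, indexed as pixels[x, y]

    counts = Counter()
    for i in range(grid_size):
        for j in range(grid_size):
            x = int(i * (width - 1) / (grid_size - 1))
            y = int(j * (height - 1) / (grid_size - 1))
            counts[pixels[x, y]] += 1
    return counts

def suspicious_colours(counts, grid_size=GRID_SIZE, ratio=SUSPICIOUS_RATIO):
    """Return the colours that appear often enough to look like screenshot space."""
    threshold = ratio * grid_size * grid_size  # 250 samples for a 50x50 grid
    return [colour for colour, count in counts.items() if count >= threshold]

counts = sample_grid_colours("upload.jpg")
print(suspicious_colours(counts))  # e.g. [(255, 255, 255), (247, 247, 247)]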

But this logic was flawed. We noticed that in over-exposed or very dark images, there was also an abnormally high sample count for single colours.

// (rgb_colour) -> times_appeared_in_sample
(40, 40, 40) -> 685
(41, 41, 41) -> 408
(39, 39, 39) -> 159
(42, 42, 42) -> 109
(43, 43, 43) -> 29
(44, 44, 44) -> 23

In this example, the two shades of dark gray, (40, 40, 40) and (41, 41, 41), from the person’s back were considered suspicious, but this was merely a false positive. This image was not a screenshot; it was just an image with sub-optimal lighting!

Horizontal/vertical line sampling

This image was classified as a screenshot by our algorithm, but we can clearly see it wasn’t. The most distinguishing thing about this image was its lack of screenshot space: there were no large rectangular areas of solid colour.

We needed a way to detect screenshot space, as this was a good indicator for screenshots. One way to do this was to sample the image line by line and make sure the entire line was the same pixel colour. If the line was the same colour, we considered it to be suspicious (and potentially screenshot space). Here is an example of what the line sampling looked like, where red lines were not suspicious (due to the line containing multiple colours) and where green lines were suspicious (as the entire line was the same colour).

In the original screenshot, there were a lot of green lines in the screenshot space whereas in the false positive, there wasn’t a single green line due to the lack of screenshot space. We made the assumption that images containing same coloured lines were suspicious because in normal images, the same colour did not usually appear across the entire image. To keep consistency with our pixel sampling rule, we decided that images needed at least 5 lines (10% of the total line sample count) of the same colour in order to be suspicious.
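A rough sketch of the line sampling rule, again with Pillow. This is illustrative only; we assume 25 horizontal and 25 vertical lines are sampled (50 lines in total), so that 5 solid lines is 10% of the sample:

from PIL import Image

LINES_PER_DIRECTION = 25  # assumption: 25 horizontal + 25 vertical = 50 sampled lines
MIN_SOLID_LINES = 5       # 10% of the total line sample count

def row_is_solid(pixels, width, y):
    """True if every pixel across the row at height y is exactly the same colour."""
    first = pixels[0, y]
    return all(pixels[x, y] == first for x in range(width))

def column_is_solid(pixels, height, x):
    """True if every pixel down the column at position x is exactly the same colour."""
    first = pixels[x, 0]
    return all(pixels[x, y] == first for y in range(height))

def count_solid_lines(image_path, lines=LINES_PER_DIRECTION):
    image = Image.open(image_path).convert("RGB")
    width, height = image.size
    pixels = image.load()

    solid = 0
    for k in range(lines):
        y = int(k * (height - 1) / (lines - 1))
        x = int(k * (width - 1) / (lines - 1))
        solid += row_is_solid(pixels, width, y)
        solid += column_is_solid(pixels, height, x)
    return solid

def has_screenshot_space(image_path):
    return count_solid_lines(image_path) >= MIN_SOLID_LINES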

We combined the two rules to determine whether an image was a screenshot. We kept the original pixel sample rule as it could potentially detect more suspicious colours (since it was just an abstraction of the whole image), as opposed to the line rule, which specifically checked for screenshot space.

Extracting the content out of a screenshot

It was time to move onto the next step, actually extracting the content out of the screenshot.

We explored many different techniques to do this. Our first approach was to crop the image from the four edges (going inwards). If there was a solid slab of colour that started from one edge of the image, we could find the dimensions of this slab/rectangle relative to the content area and crop it out. The problem with this approach was that if there were any obstructions between the edge of the image and the content itself (for example a status bar), the process would break.

Cropping an image from its edges. In the top example, this works as the content is surrounded by two slabs of solid colour. In the bottom example, cropping from the bottom edge works but cropping from the top edge fails as there is a status bar in the way.

An alternative approach was to draw as many rectangles of solid colour as we could, slice the rectangles out and then re-paste the remaining content onto a new canvas. This would work for our previous example, however the result would look as though we had just squished the screenshot itself.

We could potentially detect the two rectangle areas of screenshot space in the left screenshot. If we cropped out those two rectangles, we would end up with a “squished” screenshot seen on the right.

These methods had flaws because we were trying to analyse a part of the screenshot that could change, the screenshot space. What was something we knew would appear in every screenshot, something static? The content itself. We knew from our previously defined rules that there would always be content in the center of the screenshot, so we started focussing on finding the content area bounds to crop out the image.

The first algorithm we used to find the content area bounds was breadth-first search (BFS). BFS is a pathfinding algorithm commonly used to map out a path from a starting point to any other given point in an area. In BFS, we keep exploring in multiple directions until everything has been visited. If we treated the suspicious colours like a fence/wall around our content area, BFS would keep exploring and eventually discover all the points inside of these bounds.

Breadth-first search exploration in the image, where red dots indicate content area points, and blue dots indicate suspicious points (i.e. not part of the content area).

We could see that the red points acted as an indicator for the content area, and the blue points acted as a wall of suspicious colours. By drawing a rectangle around all of the discovered points (red), we could approximate where the content area was.
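Sketched in code, the flood fill looks roughly like this. Here is_suspicious(colour) stands in for whichever colour check is in use (exact matching at this stage, colour similarity later), and pixels is a Pillow pixel access object as in the earlier sampling sketch. For simplicity this version visits every pixel, whereas stepping in larger increments also works:

from collections import deque

def bfs_content_points(pixels, width, height, is_suspicious):
    """Flood fill from the centre of the image, treating suspicious colours as walls.

    Returns the set of discovered content points (the red dots in the figure above).
    """
    start = (width // 2, height // 2)
    visited = {start}
    content = {start}
    queue = deque([start])

    while queue:
        x, y = queue.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if (nx, ny) in visited or not (0 <= nx < width and 0 <= ny < height):
                continue
            visited.add((nx, ny))
            if not is_suspicious(pixels[nx, ny]):
                content.add((nx, ny))   # content pixel: keep exploring from here
                queue.append((nx, ny))
            # suspicious pixels form the wall and are never expanded

    return content

def bounding_box(points):
    """The smallest rectangle containing all of the discovered content points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)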

A green line surrounding the discovered points, representing where we think the content area is.

This was what we were hoping for, but it still needed some fine tuning. One problem was the edges of the bounding box, and how sections of our content area were being cropped out (seen for the left, right and bottom edges). We wanted to try and pinpoint the real edge of the screenshot, something more precise than our BFS algorithm.

We knew where the content area was, and we knew where the wall of suspicious colours started. This meant that the real content area boundary was somewhere in-between these two points.

We chose to use divide and conquer to find this edge, as it was more efficient than iterating through all the points between the two edges. Divide and conquer works by continuously halving the distance between the two points (by moving one of the two points to the halfway point every iteration). For example, if we checked the line halfway between the current content bounds and the wall, and that line was not suspicious, our new content bounds would move out to the midway point. If the midway line was suspicious, we would instead move the wall in to it. By continuously doing this, we would eventually converge on the real edge.
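For a single edge this is essentially a binary search between the BFS bound and the wall. As a rough sketch for the bottom edge (the line_is_suspicious(y) helper, which would check whether the entire horizontal line at height y is a suspicious colour, is assumed):

def refine_bottom_edge(content_y, wall_y, line_is_suspicious):
    """Binary search for the true content edge between the BFS bound and the wall.

    content_y: a row known to contain content (inside the BFS bounding box)
    wall_y:    a row known to be screenshot space (the wall), below content_y
    """
    while wall_y - content_y > 1:
        mid = (content_y + wall_y) // 2
        if line_is_suspicious(mid):
            wall_y = mid      # midway line is still screenshot space: pull the wall in
        else:
            content_y = mid   # midway line is content: push the content bound down
    return content_y          # the last row that still contains content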

After applying divide and conquer, the edge of our content area bounding box is refined to include all of the content.

This was the final bounding box used to crop the screenshot. The bottom, left and right edges were perfectly aligned with the content (thanks to divide and conquer). The top edge also improved, but was not perfect: at the top of the bounding box, a few points of screenshot space had been picked up by BFS.

Here is a crop of the image zoomed in closer.

The bounding box includes a few points above the image which, to the human eye, look white. In reality, these three white points are very slightly different colours.

To the human eye, these points looked white, but in reality they were only slightly different to what we detected as a suspicious colour (in this case, (254, 254, 254) vs (255, 255, 255)). To illustrate this point, this is what the screenshot looked like with all the suspicious points filled in red:

The original screenshot overlaid with red wherever a suspicious colour was detected; white areas are considered content.

Any section that was red was screenshot space. There was still a lot of white space around the content though, and we determined this was due to artefacts introduced by JPEG compression.

A lot of artefacting appears around the edges of the content area.

We can see here that our algorithm was detecting most points, but around the content area there were some artefacts which were considered content (i.e. not suspicious).

We knew these were almost the same colour (especially to the naked eye), but our algorithm didn’t differentiate between similar colours, only exact colours. Thus, we started to look into comparing similar colours.

Colour similarity/differentiation

Two screens could display RGB colours in different ways, so working in the RGB colour space was not feasible for comparing two colours. CIE L*a*b* (CIELAB) is a colour space that aims to solve this by trying to accurately portray the colour our eyes see (independent of device). By converting our colours to this colour space, we could also take advantage of existing colour similarity algorithms such as CIEDE2000.

scikit-image already included functionality to convert colours from RGB to CIELAB, and also had an inbuilt CIEDE2000 algorithm, so we decided to integrate this into our current process.

CIEDE2000 compares two CIELAB colours and returns a value back to the user. This value represents how different the two colours are (the distance between them), where a value of <= 1 represents a difference that is indistinguishable to the human eye.
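As a rough sketch of how such a comparison looks with scikit-image (the threshold of 1 follows the rule of thumb above, and the helper name is ours, not part of the library):

import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000

def colours_are_similar(rgb_a, rgb_b, threshold=1.0):
    """Return True when two RGB colours are (near) indistinguishable to the eye."""
    # rgb2lab expects floats in [0, 1] with the colour channels on the last axis
    lab_a = rgb2lab(np.array([[rgb_a]], dtype=float) / 255.0)
    lab_b = rgb2lab(np.array([[rgb_b]], dtype=float) / 255.0)
    return deltaE_ciede2000(lab_a, lab_b)[0, 0] <= threshold

print(colours_are_similar((255, 255, 255), (254, 254, 254)))  # True: visually identical
print(colours_are_similar((255, 0, 0), (0, 255, 0)))          # False: red vs green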

We substituted all our exact colour comparisons with this new comparison algorithm, and this is the result:

Our algorithm was now detecting most of the screenshot space in the image and what was left behind was the screenshot content. Although the similar colour algorithm didn’t detect all the artefacts in the image, we were happy with the result as it was close enough to what we wanted.

Speedups

Now that we had our core algorithm implemented, it was time to optimise it. This is the time it took to run each section of our algorithm (for a single image):

Pixel Sample: 11ms
Breadth First Search (BFS): 710ms
Divide And Conquer (D&C): 232ms
TOTAL: 953ms

Caching suspicious colours (-370ms, 39% reduction)

Profiling our algorithm revealed that the bulk of the time was spent converting and comparing colours (CIELAB and CIEDE2000). We were running these comparisons without caching the results, so by storing the outcomes in a dictionary and reusing them, the times dropped to:

Pixel Sample: 11ms
BFS: 575ms (-135ms, 19% reduction)
D&C: 8ms (-224ms, 97% reduction)
TOTAL: 583ms (-370ms, 39% reduction)

BFS took 19% less time to run, but the amazing improvement was in divide and conquer, going from 232ms to 8ms! In divide and conquer, most of the points we were checking were suspicious colours so with caching, most of the comparisons were now free!
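The cache itself was nothing fancy; conceptually it was along these lines (a sketch, reusing the colours_are_similar helper from the earlier snippet):

_similarity_cache = {}

def cached_colours_are_similar(rgb_a, rgb_b):
    """Memoise colour comparisons: repeated pairs of colours become a dictionary lookup."""
    key = (rgb_a, rgb_b)
    if key not in _similarity_cache:
        _similarity_cache[key] = colours_are_similar(rgb_a, rgb_b)
    return _similarity_cache[key]

Because RGB tuples are hashable, something like functools.lru_cache would achieve the same effect.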

Less NumPy (-65ms, 11% reduction)

Pixel Sample: 10ms (-1ms, 9% reduction)
BFS: 500ms (-75ms, 13% reduction)
D&C: 8ms
TOTAL: 518ms (-65ms, 11% reduction)

There were a number of places in our code which used NumPy for very simple operations. Because NumPy calls into a C runtime, the overhead of crossing that boundary for small operations can outweigh simply doing the work in Python, so we switched to built-in Python functionality (e.g. using list slicing instead of np.resize).

Colour similarity fallback (CIE76) (-253.5ms, 49% reduction)

Pixel Sample: 10ms
BFS: 248ms (-252ms, 50% reduction)
D&C: 6.5ms (-1.5ms, 19% reduction)
TOTAL: 264.5ms (-253.5ms, 49% reduction)

CIEDE2000 was accurate at comparing colours, but it was slow because the algorithm itself is mathematically complex. What could we do to make it faster? If we took two colours, green and red, we knew they would never be similar; regardless of the algorithm used, the similarity value would always indicate a large difference. Thus, we opted to run a cheaper algorithm first (CIE76) to catch these clearly differing colours. Only if CIE76 thought two colours were similar would we fall back to the slower CIEDE2000 algorithm for a more accurate calculation.
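CIE76 is just the Euclidean distance between two CIELAB colours, which makes the fast path very cheap. A sketch of the fallback, where the cut-off of 10 for “clearly different” is an illustrative value and delta_e_ciede2000 stands for whichever CIEDE2000 implementation is in use:

import math

def delta_e_cie76(lab_a, lab_b):
    """CIE76: plain Euclidean distance in CIELAB space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(lab_a, lab_b)))

def colours_are_similar_fast(lab_a, lab_b, threshold=1.0, cie76_cutoff=10.0):
    """Run cheap CIE76 first; only fall back to CIEDE2000 when colours might be similar."""
    if delta_e_cie76(lab_a, lab_b) > cie76_cutoff:
        return False  # clearly different (e.g. red vs green): skip the expensive check
    return delta_e_ciede2000(lab_a, lab_b) <= threshold  # accurate comparison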

Inlining function references (-6.5ms, 2% reduction)

Pixel Sample: 3.5ms (-6.5ms, 65% reduction)
BFS: 248ms
D&C: 6.5ms
TOTAL: 258ms (-6.5ms, 2% reduction)

When a function is called in Python with a dot notation (e.g. obj.testfunc()), the reference to that function must first be found before it can be executed. Although this overhead is small, it can add up and be detrimental for performance, especially in loops (where the function reference must be retrieved many times).

One way to bypass this is to cache the function reference before the loop, so it is only fetched once and reused:

# non-cached function reference
for i in range(100):
    obj.testfunc(i)

# --- vs ---

# cached function reference
testfunc = obj.testfunc
for i in range(100):
    testfunc(i)

Using Pillow’s getpixel function suffered from the same kind of overhead (the function has to perform several background checks on every call, e.g. to ensure the image is loaded). Because we needed to grab many pixels, we instead retrieved the entire pixel array from the image once, cached that and used it for pixel retrieval later on.
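The change was roughly this (both getpixel and the pixel access object returned by load are standard Pillow APIs):

from PIL import Image

image = Image.open("upload.jpg").convert("RGB")

# Slower in a loop: each call goes through Pillow's method lookup and internal checks
colour = image.getpixel((10, 20))

# Faster: fetch the pixel access object once, then index it directly
pixels = image.load()
colour = pixels[10, 20]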

This speedup was quite small with respect to the entire process (2% reduction), but it was still very important as screenshot detection ran on every image (fixed cost), whereas screenshot extraction only ran on detected screenshots (infrequent).

Custom CIELAB and CIEDE2000 implementation (-111.5ms, 43% reduction)

Pixel Sample: 3.5ms
BFS: 140ms (-108ms, 44% reduction)
D&C: 3ms (-3.5ms, 54% reduction)
TOTAL: 146.5ms (-111.5ms, 43% reduction)

We re-profiled our algorithm, and by far the most time was still spent in the scikit-image colour functions. This may have been because our method of converting individual pixels was not the intended use of the library. scikit-image’s default behaviour is to take a 3D or 4D array and convert all the RGB pixels to CIELAB in one go. At first, we tried passing the entire image array, but this was very slow as the calculations for identical colours were not cached. Passing single pixels was much faster, but still relatively slow because of the overhead in going to a C runtime (which, in turn, removed most of the performance gains of using C in the first place).

We decided to write our own implementation of CIELAB and CIEDE2000 (using http://brucelindbloom.com as a reference) in pure Python to remove this overhead and adapt it to our single-pixel use case. This pure Python implementation by itself gave a massive 43% reduction in time for the same result!
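To give a flavour of that rewrite, here is a minimal pure Python sRGB to CIELAB conversion following the formulas on brucelindbloom.com (D65 reference white). The CIEDE2000 difference formula itself is longer and omitted here, and our production version differs in the details:

# D65 reference white
REF_X, REF_Y, REF_Z = 0.95047, 1.00000, 1.08883

def _linearise(c):
    """Undo the sRGB gamma curve (c is a channel value in [0, 1])."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def _f(t):
    """Helper used by the XYZ -> Lab conversion."""
    return t ** (1.0 / 3.0) if t > (6 / 29) ** 3 else t / (3 * (6 / 29) ** 2) + 4 / 29

def rgb_to_lab(rgb):
    """Convert an 8-bit sRGB tuple to CIELAB (pure Python, no NumPy/scikit-image)."""
    r, g, b = (_linearise(c / 255.0) for c in rgb)

    # sRGB -> XYZ (D65)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    fx, fy, fz = _f(x / REF_X), _f(y / REF_Y), _f(z / REF_Z)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)

print(rgb_to_lab((255, 255, 255)))  # roughly (100.0, 0.0, 0.0)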

Moving from BFS to Wall Follower (-134ms, 91% reduction)

Pixel Sample: 3.5ms
Wall Follow / Content Extraction: 6ms (-134ms, 96% reduction)
D&C: 3ms
TOTAL: 12.5ms (-134ms, 91% reduction)

When we looked at our approach, we noticed that most of the points we explored were actually unused. The goal of BFS was to explore all the points within the bounds, but what we actually wanted was just the edges of the content area.

After removing all the inner points discovered by BFS, the resultant bounding box is exactly the same.

Even after removing all the inner points from BFS, we still had the same bounding box because the outer points were all we cared about!

But how could we achieve this? There were two main steps. First, we had to find a point on one of the edges of the bounding box / content area; from there, we could follow along the edge to find the rest of the content area.

The easiest way was to start at the center of the image and walk in one direction (e.g. north). When we hit a suspicious colour, we assumed that we were at the edge of the content area (more about why this was not 100% foolproof later).

We had one point on the edge; now we needed a way to find all the other points on the edges of the content area. What we needed to do was stick our left hand out, touch the wall and keep walking. As long as we didn’t let go of the wall (and as long as the area was closed), we would eventually loop back to our starting point. Wherever we walked would be the content area bounds.

This process of sticking your left hand out and hugging the wall is also known as the wall follower algorithm, commonly used to find an exit to a maze. If the maze was closed or had no exit, the algorithm would eventually lead the user back to the start.

An example of the different situations encountered in the Wall Follower algorithm. The basic concept is to always look left of the direction you last moved in: for example, if you last moved North, you would first look West (your left). If you can walk to your left, walk there. Otherwise, turn 90 degrees clockwise and check that direction, repeating until you find a direction you can walk in, and then continue the process.
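In code, the wall follow step looks roughly like this. It is a sketch: is_suspicious is the colour similarity check from before, the starting point is assumed to be a content pixel sitting directly below the wall we hit while walking north (so we begin facing East with that wall on our left), and the real version also handles the edge cases described further down:

# Directions in clockwise order: North, East, South, West (x grows right, y grows down)
DIRECTIONS = [(0, -1), (1, 0), (0, 1), (-1, 0)]

def wall_follow(pixels, width, height, start, is_suspicious):
    """Hug the wall with our 'left hand' and return the boundary points we walk over."""

    def walkable(x, y):
        # A point is walkable if it is inside the image and not screenshot space.
        return 0 <= x < width and 0 <= y < height and not is_suspicious(pixels[x, y])

    boundary = [start]
    position = start
    direction = 1  # facing East, with the wall we just hit to the North (our left)

    while True:
        # Look to the left of our current direction first, then keep rotating
        # clockwise (straight, right, back) until we find somewhere we can walk.
        for turn in range(4):
            candidate = (direction - 1 + turn) % 4
            dx, dy = DIRECTIONS[candidate]
            nx, ny = position[0] + dx, position[1] + dy
            if walkable(nx, ny):
                direction = candidate
                position = (nx, ny)
                boundary.append(position)
                break
        else:
            return boundary  # completely boxed in: nowhere left to move

        if position == start:
            return boundary  # we have looped back to where we began

Drawing a rectangle around the returned boundary points gives the same bounding box as before.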

We tried this process and “hugged the wall with our left hand”, and this was the result:

An example of the new wall follow algorithm, detecting the edge from the center of the image and following the left wall to find the content area bounding box.

This is exactly what we wanted from the algorithm. All the inner points in the image were skipped (except for the few points needed to travel to the edge), and the final bounding box was still the same as before.

There were a few edge cases we had to fix with this process. If we stumbled upon a patch of suspicious colour before the actual content edge, the wall follower would walk around that patch, which was not what we wanted. We added a condition that the resultant bounding box had to loop around the original starting point; if it didn’t, it wasn’t the content area bounding box, so we skipped past the patch and kept moving north.

Another edge case was if our starting point itself was a suspicious colour. This would break the process, because the wall follower could get stuck and be unable to move anywhere. If we encountered this, we would keep moving the starting point south-west (towards the bottom left) until it was no longer suspicious. We chose south-west because we wanted to avoid moving along common axes which might follow a pattern in the image (i.e. moving straight south might still break if the image had an unusual pattern).

An example of the wall follow algorithm, where sections of the screenshot have been removed to simulate “bad” screenshots.

We cut some sections out of the image in order to test our algorithm on “weird” photos. Notice how the algorithm only walked along the edge, skipping most of the points that BFS visited. This new algorithm allowed us to visit far fewer points and sample far fewer colours (most comparisons were against suspicious colours, which were already cached). Both of these factors combined led to a significant improvement in performance, bringing the extraction time from 140ms down to 6ms, much faster than we had initially anticipated.

The result

We now had an algorithm which could take an image, determine if it was a screenshot or not and return the content area bounding box all in under 15ms. This algorithm boiled down to three main ideas: pixel/line sampling to determine screenshot/not screenshot, using colour similarity algorithms (CIELAB, CIEDE2000) to catch suspicious colours and wall follower to find the content bounds.

Together, these three ideas gave us a way to identify and handle screenshots uploaded to our website. This feature will not only allow sellers to better showcase their vehicles irrespective of the media they upload, but will also help buyers browse and sift through listings to find the cars they really want, improving the experience for everyone using carsales! There are still many different projects planned to improve the car purchasing/selling journey for both sellers and buyers, and this is just one of the many improvements we have in store.

Thanks to my fellow peers at carsales for helping me through this process! Special shoutout to the pxcrush team (David Poirier, Abhinay Kathuria and Anthony Paes) for helping me develop and refine this algorithm to get it to this state!

At carsales we are always looking for fun and talented people to work with. If you liked what you read and are interested in joining, please check out the positions we have available in our careers section.
