Earlier this year I ran about 800 digitized zines from the “Solidarity! Revolutionary Center and Radical Library” collection through some Google computer vision APIs. I then built a little tool to evaluate the results. The APIs are very straightforward to use and I wanted to see what metadata could be enriched with this low-barrier process. I began to evaluate the results but was really let down by the image labeling API. This task takes an image and returns what it thinks is being depicted. A classic machine learning application. I was fully expecting to spend a day reviewing, smiling, saying “haha, google thinks this is a dog, and this is a dog, and this is a dog…” but no, not even that. Almost all of the images (these are full page scans) returned vague results like “black and white, cartoon, font, drawing, line, text, shape, pattern, brand.” Not exactly metadata utopia. This totally makes sense: zines are complex multimedia documents. You can’t send something crazy like a full page of a zine and expect magic.
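The label calls themselves really are simple. A minimal sketch, assuming the `google-cloud-vision` Python client is installed and credentials are configured in the environment (`label_page` and `confident_labels` are my own illustrative names, not part of the client library):

```python
def label_page(path):
    """Send one scanned page to the Vision label-detection API.

    Assumes the google-cloud-vision package is installed and
    application credentials are configured in the environment.
    """
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open(path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.label_detection(image=image)
    return [(label.description, label.score)
            for label in response.label_annotations]


def confident_labels(annotations, min_score=0.7):
    """Keep only (description, score) pairs above a confidence cutoff."""
    return [desc for desc, score in annotations if score >= min_score]
```

Filtering by score helps a little, but it doesn’t fix the underlying problem: the labels for a full page scan are vague to begin with.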
So I did a little test: if I reduced the complexity of the page, could the Google label API figure out some of the images? I went from the full page, to a page with no text, and finally to just the embedded image:
Of course, to get the labeling to work to a useful degree you need to send it just the image, not the whole page. So we need to figure out where the images are on the page. And surprise, this is not an easy task. My first thought was “oh, someone extracted a bazillion images from Internet Archive books and put them on flickr, right? I’ll just use what they did.” I finally tracked down how that project worked: it basically used the ABBYY OCR XML file, as that tool marks up text and image blocks in a document. ABBYY OCR is a commercial tool and I don’t have it, so there had to be another way of doing this.
Enter the field of document layout analysis. I read papers where I saw folks doing amazing things. I knew this was possible, but I didn’t know what I was doing. I compiled customized forked versions of Tesseract based on random forum comments, I tried every free tool I could find, I even tried emulating a Windows program. But none of it worked; nothing would reliably generate a layout analysis pointing out the images in the document.
So I worked backwards. I knew where the text was, based on the OCR API, so if I could remove all the text from the page, then presumably all that would be left would be the images. I looked into image saliency detection, which led me, via a blog post, to measuring pixel clusters with scikit-image. Meaning I could successfully detect the images if they were the only things left on the document page.
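The whiteout step can be sketched with plain NumPy, assuming the OCR output has already been massaged into pixel bounding boxes (`whiteout_text` and the box format here are my own illustration, not the exact script):

```python
import numpy as np


def whiteout_text(page, text_boxes):
    """Paint every OCR text block white so only the graphics remain.

    page: 2-D grayscale array (0 = black ink, 255 = white paper).
    text_boxes: iterable of (min_row, min_col, max_row, max_col)
        pixel coordinates derived from the OCR output.
    """
    cleaned = page.copy()
    for r0, c0, r1, c1 in text_boxes:
        cleaned[r0:r1, c0:c1] = 255  # overwrite the text block with white
    return cleaned
```

Working on a copy keeps the original scan intact, so the same page can be re-processed with different box sets.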
I wrote a script that whiteouts all the text in a page and then looks for clusters of pixels over a certain size. In these examples I put a black border around the text blocks to visualize the bounding boxes, but in practice the page becomes very sparse with the text removed. More traditional page layouts work better with this approach; text overlapping images, or large faint images, often causes a picture to be broken up into multiple regions. But there is still some success with even the most complex zine aesthetic.
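The cluster-measuring half can be done with scikit-image’s connected-component tools. This is my own sketch of the idea, assuming a grayscale page with the text already whited out; `find_image_regions` and both thresholds are illustrative, not the script’s actual names or values:

```python
import numpy as np
from skimage import measure


def find_image_regions(page, ink_threshold=250, min_area=5000):
    """Find large clusters of dark pixels on a mostly-white page.

    page: 2-D grayscale array with the text already removed.
    Returns (min_row, min_col, max_row, max_col) bounding boxes for
    every connected ink region larger than min_area pixels.
    """
    ink = page < ink_threshold    # binarize: dark pixels count as ink
    labeled = measure.label(ink)  # connected-component labeling
    return [region.bbox
            for region in measure.regionprops(labeled)
            if region.area >= min_area]
```

Tuning `min_area` is what separates actual images from leftover specks of noise, and it is also why faint or fragmented images get split into multiple regions.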
I created a video showing this process on about 3,000 zine pages:
You’ll see a lot of scenarios where it breaks down, and a lot of edge cases. With some more work, or better computer vision chops, I’m sure this could be improved. But for now it gives me more material to run through the API, hopefully returning better metadata for the graphical components of the zines.
In the next part we will look at the results of evaluating all this metadata, including the OCR, the output of this image labeling process, and more.