Image Preprocessing with an ASCII Art Generator?

Ryan Xu
2 min read · Jan 29, 2023


A random train of thought.

I read this paper recently, Image-and-Language Understanding from Pixels Only, that does VQA¹ using only a vision encoder instead of separate vision and text encoders. They do this by rendering the question’s text as an image and passing it as an input to the vision network.

¹VQA stands for visual question answering, a task in which a network is expected to answer natural language questions about a given image.

Sample VQA question: Is the house suspended from the ground? Answer: Yes

Quite frankly, this seemed ridiculous, and made me want to create something in response — something like “VQA, but using only text encoders².” I decided that passing the image through an ascii art generator would probably be a good idea.

²though all of this is already pretty ironic because the encoders used in the paper are vision transformers, which themselves are basically a crude repurposing of an architecture originally designed for text.

For those of you looking for results, I regret to share that I never went through with running this experiment, almost entirely because I found out how stupidly simple ascii art generators are. Most of them just find the character whose brightness value is closest to the average brightness in the image patch. The nice ascii art that you normally encounter is usually carefully designed by people³.

³side note, ChatGPT is hilariously bad at generating ascii art.

Ascii artists, your jobs are safe!
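For the curious, the brightness-matching scheme those generators use can be sketched in a few lines. This is a minimal illustration, not any particular tool’s implementation; the per-character “ink density” values below are made up for the example rather than measured from a real font.

```python
# Approximate "ink density" per character: 0.0 = blank, 1.0 = darkest.
# These values are illustrative assumptions, not font measurements.
CHAR_DENSITY = {" ": 0.0, ".": 0.1, ":": 0.2, "-": 0.3, "=": 0.45,
                "+": 0.55, "*": 0.65, "#": 0.8, "%": 0.9, "@": 1.0}

def patch_mean(img, r, c, h, w):
    """Average brightness (0 = black, 1 = white) of an h x w patch."""
    vals = [img[i][j] for i in range(r, r + h) for j in range(c, c + w)]
    return sum(vals) / len(vals)

def to_ascii(img, patch_h=2, patch_w=1):
    """img: 2D list of floats in [0, 1]. Returns an ASCII art string.

    For each patch, pick the character whose ink density is closest to
    the patch's darkness (1 - brightness) -- exactly the "closest
    brightness value" trick described above.
    """
    rows = []
    for r in range(0, len(img) - patch_h + 1, patch_h):
        row = []
        for c in range(0, len(img[0]) - patch_w + 1, patch_w):
            darkness = 1.0 - patch_mean(img, r, c, patch_h, patch_w)
            ch = min(CHAR_DENSITY,
                     key=lambda k: abs(CHAR_DENSITY[k] - darkness))
            row.append(ch)
        rows.append("".join(row))
    return "\n".join(rows)
```

Feeding it a half-white, half-black image yields a row of blanks over a row of `@`s — which is about as far as the “art” goes.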

Anyways, all of this got me thinking. I actually really liked the concept of turning images into good ascii art, where the segmentations of the objects are well represented by the curves of the ascii characters. In this case, learning the character embeddings is in some sense just learning to embed the shape of the characters.

More generally, this made me wonder whether you could do something similar in normal networks — some variation of discretizing and learning an embedding for the discrete forms. Optimistically, discretized forms deep in the network may even begin to represent objects that are easier to symbolically manipulate and interpret.
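One existing flavor of this “snap to a discrete form, then embed it” idea is vector quantization, as used in VQ-VAE-style models: a continuous feature vector is replaced by its nearest entry in a learned codebook, and downstream layers only ever see that codebook embedding. A toy sketch (the codebook here is hypothetical, and real models would learn it by gradient descent):

```python
def quantize(vec, codebook):
    """Snap a continuous vector to its nearest codebook entry.

    Returns (index, embedding): the index is the "discrete form",
    and the embedding is what downstream layers would consume.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))
    return idx, codebook[idx]

# Illustrative 2-entry codebook; a trained model would learn these vectors.
codebook = [[0.0, 0.0], [1.0, 1.0]]
```

In the ascii analogy, the codebook entries play the role of the characters, and learning them is learning to embed their “shapes.”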

That’s about it for now — just a train of thought that I had recently.

Honestly, I’m not really sure who the target audience is for this, but I hope that at the very least, you found something pretty, amusing, or informative.



Ryan Xu

Interested in random things and writing about them. Currently doing applied machine learning research for eBay and grokking grokking on the side :)