Translating vision into sound

A deep learning perspective

Viktor Tóth
Mindsoft
21 min read · Apr 21, 2019

--

Shortcuts: Github repo, Thesis, Videos

The ever-increasing population of 36 million blind individuals has barely any access to vision rehabilitation procedures, due to their price, limited availability and risky nature. Certain blinding diseases, like advanced glaucoma, still lack a remedy, while recent invasive advances in the treatment of retinal infections are complicated, expensive, and provide only limited restoration of vision. As the causes of vision impairment vary vastly, there seems to be no single solution to the problem.

Paul’s cool idea

In 1969, Paul Bach-y-Rita introduced the ingenious concept of sensory substitution: the transfer of sensory information of one modality through another one; for example, blind people may hear or touch images. Essentially, the impaired sense is substituted with an active, functioning one. Bach-y-Rita experimented with tactile devices strapped to the tongue of blind subjects. Visual imagery was translated to patterns of electrical activation, then administered to the dense sensory array on the tongue.

Paul Bach-y-Rita with his electro-tactile stimulation device. Connected to a camera, it’s able to translate images to spatial patterns of tongue stimulations. Photo by Phillipe Psaila, Science Photo Library.

But why tactile? What about sound? Can we translate vision into both auditory and tactile stimuli? As you can imagine, the visual information we receive in each moment far exceeds the depth of our combined tactile and auditory sensations. In fact, about 1 million neural fibers enter the brain from the optic nerve, while the auditory nerve incorporates only about 30 thousand. Tactile information is even more scarce, especially when considering just the tongue. So if the aim is to substitute vision, we need to get hold of as much neural bandwidth as possible. However, as it turns out, the concurrent stimulation of multiple modalities is much harder to learn. Hence, our best bet is sound, and it has actually been for the last couple of decades.

Visual-to-auditory (V2A) sensory substitution devices are designed to convert images to sound. They have a substantial advantage over other visual rehabilitation methods like retinal implants: they are dirt cheap and non-invasive, thus posing no financial burden or medical risk. One only needs a camera-equipped smartphone and headphones to listen to the image-inspired, so-called soundscapes. The sensory substitution research community has come up with plenty of devices and corresponding V2A conversion algorithms, but most are merely reiterations, with slight improvements and experimental features, of a solution developed in 1992: The vOICe. Among all the approaches, The vOICe is the most practical and widely used, clearly demonstrated by its 100k+ downloads and around 1300 currently active users on Android alone.

However, a couple of thousand is far from the potential millions of visually impaired people who could take advantage of this technology. So what’s the issue? For the time being, here’s some hand-waving: sensory substitution is quite complex, and problems pile up, from the non-linear neural coding of soundscapes to the lack of standardized training procedures; even the UI can trouble the users. Most importantly, it’s not the magical solution to blindness that much of the mainstream news coverage tends to portray it as. Learning the mapping of sound or touch to images is extremely difficult for those who have had very limited experience, if any, with such a spatial arrangement of reality. It takes months of rigorous training to reach a practical level, and lots of physiological stars have to align for one to reach quasi-visual sensing, which has actually happened in two miraculous cases. I don’t want to sell snake oil here. For most blind people, sensory substitution is not a viable way to rehabilitate sight. Yet, we may be able to lower the bar ever so slightly, so more can take advantage of the tech. And that’s where I came in, starting off hot-headed…

Hold my beer: an engineer’s approach

What was clear from the beginning, intuitively, is that the shorter the soundscape, the faster the learning; I later found out that solid research in the field of perceptual learning points in the same direction. By decreasing the length of the sound representation, we increase the frame rate (FPS) of the audio-translated images. The more images we hear in a second, the faster our reaction times, and thus the more we can interact with the environment and correlate it with the synthesized audio. Also, if a truck is speeding towards you at 60 km/h, you’re better off with a frame rate of 10 than just 1; and The vOICe works best in the range of 0.5–2 FPS. So what the hell, let’s just speed it up!

My first V2A conversion logic. It breaks up the image into k-pixel-long (k=3 here) horizontal pixel blocks. All non-overlapping blocks are sonified, superimposed and hence played at the same time. Horizontal position is converted to sound azimuth (direction from which the sound is heard), vertical position to pitch and brightness to sound amplitude, similarly to how The vOICe functions. Once I implemented and tested this (in the Unity game engine), the issue with such an approach became blatantly clear.
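
To make the idea concrete, here is a minimal numpy sketch of this first conversion scheme; the sine carriers, the equal-power panning and the frequency range are my own illustrative choices, not the original Unity implementation.

```python
import numpy as np

def naive_v2a(image, duration=0.009, sr=44100, block=3, f_lo=200.0, f_hi=8000.0):
    """First-idea sketch: every k-pixel horizontal block becomes a short sine
    tone; all tones are superimposed and played at once. Row -> pitch,
    column -> azimuth (stereo pan), brightness -> amplitude."""
    h, w = image.shape
    t = np.arange(int(duration * sr)) / sr
    left = np.zeros_like(t)
    right = np.zeros_like(t)
    for row in range(h):
        # top rows get higher pitch, on a logarithmic scale
        freq = f_lo * (f_hi / f_lo) ** (1.0 - row / (h - 1))
        for col in range(0, w, block):
            chunk = image[row, col:col + block]
            amp = chunk.mean()                    # brightness -> loudness
            if amp < 1e-3:
                continue
            pan = (col + chunk.size / 2) / w      # 0 = far left, 1 = far right
            tone = amp * np.sin(2 * np.pi * freq * t)
            left += np.sqrt(1.0 - pan) * tone     # equal-power panning
            right += np.sqrt(pan) * tone
    stereo = np.stack([left, right], axis=1)
    peak = np.abs(stereo).max()
    return stereo / peak if peak > 0 else stereo  # normalize to avoid clipping

# e.g. naive_v2a(np.random.rand(16, 16)) -> a ~9 ms stereo buffer at 44.1 kHz
```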

With the conversion method depicted above, I could represent any image in a 9 ms long soundscape, which translates to 111 FPS. I could even expand the blocks further to downgrade to 30 FPS, which is more than enough. Sounds good, except that by superimposing hundreds of soundstreams, I produced audio samples so convoluted that no human (or machine) could differentiate between the auditory representations of different images. At this point, I’m kinda embarrassed that I considered this a viable approach at all.

In essence, there is a trade-off between soundscape length and the perceivable visual information it can convey. Pin-pointing our accuracy in differentiating between pitch, azimuth and amplitude values is easy and well-documented. How you distribute the visual information along the time dimension is the non-trivial part, and I realized that’s where previous sensory substitution devices resorted to ad hoc solutions. For instance, The vOICe takes each pixel column of the image one by one, and superimposes the soundstreams generated from the pixels within. Columns are played consecutively from left to right, each taking 5 ms, so in total the produced soundscape is 5×image_width ms long. Why left to right? Aren’t the resulting sound bites still too convoluted? One can rationalize the left-to-right scanning, but it definitely doesn’t capture how sighted people perceive visually. Also, mundane, plain parts of the visual field consume the same amount of time in the auditory domain as the more semantically relevant sections, which is clearly undesirable.
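
For contrast, here is a similarly hedged sketch of a vOICe-style scan, where columns are played one after another instead of all at once; the parameter values are again just illustrative.

```python
import numpy as np

def voice_like_scan(image, col_ms=0.005, sr=44100, f_lo=500.0, f_hi=5000.0):
    """vOICe-style sketch: pixel columns are sonified one after another, left to
    right (5 ms each by default); within a column, the row tones are superimposed.
    Total soundscape length is col_ms * image_width."""
    h, w = image.shape
    t = np.arange(int(col_ms * sr)) / sr
    freqs = f_lo * (f_hi / f_lo) ** (1.0 - np.arange(h) / (h - 1))  # top row -> high pitch
    cols = []
    for col in range(w):
        tones = image[:, col, None] * np.sin(2 * np.pi * freqs[:, None] * t)
        cols.append(tones.sum(axis=0))       # superimpose the rows of this column
    scape = np.concatenate(cols)
    peak = np.abs(scape).max()
    return scape / peak if peak > 0 else scape
```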

Sensory substitution devices spread along the axes of substitution delay and visual space abstraction.

On the figure above, substitution delay covers the duration of time from the point an image is taken until the corresponding soundscape is played in its entirety, including the computational time of the conversion method. Visual space abstraction loosely delineates the amount of visual detail removed from the images before they are translated into audio. Square symbols represent explicit, circles implicit conversion approaches, which I describe in more detail in my thesis. The size of these symbols, R, displays the amount of perceivable, conveyed visual space, i.e. the dimensions of the converted images. The Vibe and RTEVRUAS devices superimpose the sound sequences of every pixel simultaneously, which leads to convoluted soundscapes, and thus to limited perceivable visual information. Visual space abstraction spans colored pixels, clustered coloring, grayscale, contour (contrast edges), up to the representation of single objects or scenes. The proposed conversion logic, AEV2A, extracts the contour information of images before transforming them into soundstreams. Ideally, we would aim for as low a substitution delay and as much conveyed visual space as possible. In V2A sensory substitution, a compromise needs to be made between these two factors, which the figure above attempts to depict. TV stands for The vOICe; Deep-See is a machine learning solution that detects a set of objects in the scene before vocalizing their names, preserving some positional information.

Let’s take a step back, and examine the limitations of human sound perception.

Biological reality: build on first principles

We’ve nailed down that the auditory pathway is the best candidate for the substitution of vision. Now we need to make sure that the substituting audio signal is at least encoded in the brain. More precisely, we need to assemble a set of soundscapes such that each excites a unique neural pattern when presented as a stimulus. That is, the audio representations of a vertical line and a slightly rotated line should have distinct auditory neural encodings. Note that, perceptually, we would want these two audio samples to be very similar too, as they correspond to almost identical visual features, but we’ll get to that later.

Up the auditory pathway

Sound is a wave. While we humans like to make much more of it, waves have amplitude (~loudness), frequency (~pitch) and spatial position (~binaural source location). Everything I wrote in parentheses here is what we perceive, which is far from the physical description of waves. For instance, the pitch of complex sound sequences is not solely defined by frequency; the envelope of the wave contributes to the sensation as well. Furthermore, pitch and loudness are both perceived logarithmically: we can differentiate between small increments at low frequencies, while at higher regimes the same amount of contrast is not recognized. And have I mentioned timbre or harmonics?

Well, auditory processing is quite complicated; mother nature managed to optimize it to such an extent that we still struggle to computationally simulate even the initial stages of the pathway, like the cochlea. And man, the cochlea causes a lot of inconvenience in our quest to derive our limitations (and accuracy) in sound perception. It introduces non-linear transformations and compression on the received waves, suppresses some parts of them, and emphasizes others through feedback processes of neural origin. The cochlea is responsible for the logarithmic perception of pitch and loudness, for spectral masking (the attenuation of consecutive sounds having close fundamental frequencies), and for distortion products. In fact, I don’t believe we can really define the V2A conversion logic without a solid model of the cochlea filtering our audio signal, so the remaining information at its output can be assessed.

Once we arrive at higher areas of the auditory pathway, where neural rather than mechanical sequences encode the sound stimulus, the exact pitch and amplitude begin to matter less and less, and the modulation of these variables over time starts to dominate, especially in the auditory cortex. Amplitude and frequency modulation (AM, FM) relate to the shifts we recognize in loudness and pitch over time. Understanding of speech has a lot to do with AM and FM, for example: it doesn’t matter whether I say a word an octave higher or lower, it’s the same word. The hearing of birds is finely tuned to AM and FM; moreover, our communication devices also utilize them extensively. The beauty of AM and FM popping up in every instance where information has to travel reliably from A to B reveals a fundamental physical principle. But let’s not divert from the problem at hand.

Here’s an amplitude, frequency and spatially modulated stereo soundstream. In the middle, the audio played in the left channel shows decreasing amplitude and frequency at the same time.

Binaural sound localization refers to our ability to integrate sound waves arriving at our two ears. The integration can be temporal (what the left ear hears first is likely to be towards the left) and spectral (once the sound has literally traveled through our head, certain frequencies get attenuated, so the brain can compare the frequency distributions received on the two sides, similarly to hearing mostly the bass when a wall stands between you and the speakers). It surely gets more complicated than this, but what matters is that we get less accurate at telling sound source locations towards the periphery: we can perceive 1° angle differences in front of us, but the error can reach 15° when the sound arrives straight from the sides.

When you merge the AM, FM and spatial properties of a sound sequence, you get an auditory stream. Streams may be broken up or fused together, depending on whether we perceive their source as a single entity: the footsteps of a predator walking behind the bushes are neurally encoded as one thing in the auditory cortex, until we see two pairs of eyes peeking at us.

What further complicates the quantification of our hearing limitations is the fact that the above-mentioned variables interact in a highly non-linear manner. For example, our threshold of hearing changes with frequency, as does our accuracy at locating the source. Higher amplitude sounds tend to mask other sounds with a similar frequency distribution. And the list goes on. But for now, let’s just get down to the numbers!

  • Frequency: the hearing range is between 20 Hz and 20 kHz, but above 8 kHz we can barely discriminate between tones (besides those notes being rather irritating); if x ∈ [0,1] linearly represents the pitches we perceive, then frequency = A(10^(⍺x) - k), where ⍺=2.1, k=0.85 and A=165.4 for humans; by decreasing A, we can shrink the range of frequencies
  • Amplitude: when measured in SPL, our hearing threshold is 0 dB, auditory nerve firing rates are well correlated with amplitude in the 30–50 dB range, 130 dB is the pain threshold (roughly a jet aircraft 15 meters away), but everything above 70 dB is considered annoying; higher amplitudes tend to occupy more fibers, spreading to nerves that are supposed to encode audio even as far as an octave away; amplitude = loudness^(1/⍺), where ⍺=0.3 and loudness ∈ [0,1]

Note that I defined the equations of frequency and amplitude with power functions instead of logarithms. In terms of scaling, it’s the same, but at the boundaries power functions behave more reliably (think about log versus power values at 0). Moreover, I converted the perceivable variables to their physical equivalents (loudness to amplitude, pitch to frequency), because that’s what I need later to synthesize soundscapes of a given perceived quality. These are called inverse tuning functions: the tuning function describes the difference in neural patterns/perception along a property of the stimulus (e.g. frequency); the inverse tuning function shows the change in the stimulus property as the corresponding perception shifts.
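
Here is a small sketch of these two inverse tuning functions, using the constants from the bullet points above; note that a perceived pitch of 0.4 lands at roughly 1004 Hz, in line with the 1003 Hz figure quoted later on.

```python
import numpy as np

# Constants taken from the bullet points above: a Greenwood-style frequency map
# and a power-law loudness function, with the quoted human defaults.
ALPHA_F, K_F, A_F = 2.1, 0.85, 165.4
ALPHA_L = 0.3

def pitch_to_frequency(x):
    """Inverse tuning function: perceived pitch x in [0, 1] -> frequency in Hz."""
    return A_F * (10.0 ** (ALPHA_F * np.asarray(x)) - K_F)

def loudness_to_amplitude(loudness):
    """Inverse tuning function: perceived loudness in [0, 1] -> linear amplitude."""
    return np.asarray(loudness) ** (1.0 / ALPHA_L)

print(float(pitch_to_frequency(0.4)))      # ~1004 Hz, cf. the 1003 Hz quoted later
print(float(loudness_to_amplitude(0.5)))   # ~0.1: halving perceived loudness cuts
                                           # the physical amplitude about tenfold
```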

  • Binaural localization: the localization error of azimuth (the horizontal dimension, measured in degrees around the head, 0° being the direction the nose points) stays below 5° between -45° and +45° azimuth; it reaches 15° at the ears, at around 90° azimuth; azimuth perception is more accurate than elevation discrimination and exploits more of the corresponding neural resources; furthermore, at elevations higher or lower than nose level, both our elevation and azimuth localization errors shoot up
  • FM: the human auditory cortex is sensitive to frequency shifts as rapid as 64 octaves/sec; the 88 keys of a piano span a bit more than 7 octaves, which means you could neurally encode the sound of someone sliding through all the keys, up and down, more than 4 times in one second, at a finger-wrenching speed of 704 keys/sec; you get the point
  • AM: when it comes to fast consecutive rising and falling amplitude ramps, we can perceive modulations as fast as 1000 Hz; when we need to discriminate between different shapes of sinusoidal ramps, we only reach about 100 Hz

Now that we’ve approximately mapped human hearing capabilities, we have an idea about the perceptual auditory dimensions along which we can distribute our set of soundscapes. To recap, we want to capture as much of the bandwidth of hearing as possible, so we can encode as much visual information in audio as possible without stretching the signal much in time. Let’s be bold and say that at this point, the synthesized vision is encoded at the level of the cortex. What happens next? That’s where the blind’s brain function starts to radically diverge from the sighted’s: multiple brain imaging studies have shown that the auditory cortex delegates some of its high-level sound processing responsibilities to the visual cortex, “simply” because the visual brain regions don’t have much to work with otherwise.

Visual cortex as a barely-tapped computational resource

We call these neural relations between sensory areas cross-modal connections. Depending on the onset of blindness, such neural pathways function either directly or indirectly (through feedback mechanisms from higher multisensory areas). Here’s what we know:

  1. Cross-modal connections most likely start at A1 and terminate at V1, the primary auditory and visual cortices, respectively. Other connections exist, but they are more function specific: when using sensory substitution devices, the part of the visual cortex originally responsible for spatial movement perception (the dorsal stream) seems to encode the spatial information present in the audio; similar examples arise for the delegation of depth and shape processing
  2. V1 performs higher-level auditory computations, like the discrimination of complex sound sequences of utterances, and avoids the simple sound quality analysis that is already done in the auditory system
  3. For blind people, there is a correlation between the extent to which V1 is utilized in auditory processing and the accuracy/performance with which certain auditory tasks are carried out
  4. Transcranial magnetic stimulation delivered to the visual cortex impaired the ability of a long-time The vOICe user to execute a sensory substitution exercise; the same individual is one of the two visually impaired people who claimed to experience coarse, but actual, vision through sound

According to these points, here’s what I assume: by specifically tailoring soundscapes that can be picked up by the visual cortex of the blind, we may speed up sensory substitution learning, and get one tiny step closer to actual visual perception through audio. This is quite a radical assumption; I don’t expect you to believe it. Nevertheless, let’s play around with the idea!

First of all, how would one synthesize soundscapes of such quality? Points 1 and 2 indicate that low-level auditory features won’t make it to V1, only information about soundstreams: amplitude, frequency and spatially modulated audio. Points 3 and 4 imply that the computational power of V1 is recruited in a meaningful manner in the processing of V2A sensory substitution stimuli. Furthermore, we know that V1 simple cells respond to contrast lines, or edges, within their receptive field. Hence, by assigning soundstreams of different flow properties to visual edges of distinct position and angle, we might excite V1 neurons with auditory signals carrying visual information similar to what such cells would react to in a sighted person’s cortex. Anyone coming from the field of neuroscience would point out that I also presume ideal conditions of cross-modal plasticity here, and probably a lot more. But yeah, that’s my wild hypothesis: just assign soundstreams to contour edges extracted from an image, and expect faster sensory substitution learning outcomes by virtue of a more direct acquisition of V1 computational resources. I intentionally left out the nuances of how cross-modal connections emerge in congenitally and late blind people, and how the atrophy and rearrangement of the blind’s visual cortex challenge the whole story; for more, check out the thesis.

Sensory substitution as a compression problem

Getting back to my inner engineer, I wanted to know how the psychoacoustical and brain imaging findings, which I briefly described in the previous section, can be incorporated into a sensory substitution device. How could one account for hearing limitations beyond just the logarithmic nature of frequency and amplitude perception? Or, more specifically, how could one researcher, or a group of researchers, explicitly define the pixel-to-sound function, as has been done so far, while also taking into account the massively non-linear mechanisms present in the auditory system? How would one lay out the distribution of sound features along the time dimension, when our hearing is so sensitive to frequency and amplitude shifts in one instance but completely ignores them in another, offering no clear way to categorically distinguish between the two?

After months of contemplation, my humble opinion is: no way dude. And where humans fail, the machines shall prevail.

V2A sensory substitution can be framed as a compression problem: visual features need to be compressed into (perceivable) audio with an acceptable amount of loss. When designing compression methods, we tend to look for repeating patterns in the uncompressed domain, in order to replace them with a single symbol in the compressed space. As the auditory bandwidth is orders of magnitude lower than the visual one, drastic, nonlinear measures have to be taken; that’s why I ended up choosing autoencoders to perform the compression.

Stock image of an autoencoder. In our case, the input and output are both images, the hidden code z is sound.

Autoencoders can be trained in an unsupervised fashion: they only require the images, no labels. The network has two parts: the encoder translates visual features to an audio representation, while the decoder does the inverse. Autoencoders are trained to reproduce the input at their output as accurately as possible, so the inner representation (the bottleneck) needs to cram in as much information about the encoded image as possible for the decoder to feed from.
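
As a toy illustration of the idea, here is a plain dense autoencoder in Keras with the bottleneck standing in for the audio code; this is not the AEV2A architecture, just the general pattern it builds on.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Toy autoencoder: 64x64 contour images squeezed through a 32-unit bottleneck.
# In AEV2A the bottleneck is a sequence of soundstream parameters rather than a
# plain dense code, but the training objective is the same: reconstruct the
# input image from the compressed representation.
inp = layers.Input(shape=(64, 64, 1))
x = layers.Flatten()(inp)
x = layers.Dense(512, activation="relu")(x)
code = layers.Dense(32, activation="sigmoid", name="bottleneck")(x)  # "sound" stand-in
x = layers.Dense(512, activation="relu")(code)
x = layers.Dense(64 * 64, activation="sigmoid")(x)
out = layers.Reshape((64, 64, 1))(x)

autoencoder = Model(inp, out)
encoder = Model(inp, code)   # this trained half would act as the V2A converter
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(images, images, ...)   # unsupervised: the input is the target
```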

By electing the (trained) encoder half as my V2A conversion function, I brought about two major problems. First, and most importantly, this method won’t work for all images, as previous sensory substitution devices did to an extent, but only for a subset we choose to train the autoencoder on. Second, the bottleneck sound representation has to distribute the visual information so that the visual-audio correspondence is intuitive and perceivable by humans. While the latter issue can be contended with, the former remains a serious limitation.

The (ideal) big picture

I shamelessly present here a depiction of how sensory substitution should be modeled when incorporating the auditory system, cross-modal plasticity, the computational resources of the blind’s visual cortex, and perceptual learning.

Each image xᵢ is converted to a sequence of soundstreams aᵢ,ₜ then decoded to drawings on the canvas cᵢ,ₜ. Soundstreams concatenated together along t amount to a soundscape.

A camera takes snapshots, which are encoded into sequences of soundstreams, each decoded back to visual features drawn on a canvas in an iterative manner. The embedded hearing model aims to emulate the nonlinear mechanisms along the auditory pathway, so the audio content arriving at the decoder is analogous to the neural information entering the cortex. The decoder half reconstructs the image step by step, which corresponds, at a very high level, to the cross-modal connections and the ensuing neural processing between the auditory and visual cortices. Auditory and visual areas receive feedback from higher associative regions, feedback that is fueled by the somatosensory and motor cortices. That is, the combined tactile and motor information serves as the ground truth for the visually impaired to decode the sound into quasi vision.

Riding the inverse tuning functions

Even if we take the picture painted above for granted, the hard problem remains: defining the encoder, as sound, and any temporal pattern in general, has been difficult to generate using neural networks.

If we chose to construct the audio sample by sample, akin to WaveNet and its variations, we’d have to teach the network to embed the distribution of (most possible) soundscapes that are humanly apprehensible and sufficiently spread out, so different visual features could be represented as discernible auditory stimuli. Not to mention how slow the sound generation would be, which would increase the substitution delay unfavorably.

Without further ado, I made a custom sound synthesizer engine that spits out soundstreams when given vectors of amplitude, frequency and spatial modulation values. More specifically, the input consists of an offset value for each modulated property and a sequence of delta values. Offsets are in the range [0,1], while modulation values are in [-1,1]. By separating offset from modulation, both qualities can be controlled, so e.g. the synthesizer can constrain both the range and scale of amplitude and the change in amplitude. Obviously we can’t talk about sounds at a frequency of 0.4; these input values represent our perception, which would span between 0 and 1 in this case, 1 being the highest, 0 the lowest pitch/loudness allowed.
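
Here is a minimal sketch of how such offset-plus-delta inputs could be turned into per-frame trajectories before they hit the inverse tuning functions; the function name and the delta scale are hypothetical.

```python
import numpy as np

def build_trajectory(offset, deltas, delta_scale=0.05, lo=0.0, hi=1.0):
    """Turn an offset plus a sequence of deltas in [-1, 1] into a per-frame
    trajectory of the perceived property (pitch, loudness or azimuth).
    delta_scale caps how fast the property may drift between frames."""
    steps = np.concatenate([[offset], delta_scale * np.asarray(deltas, dtype=float)])
    return np.clip(np.cumsum(steps), lo, hi)

pitch_track = build_trajectory(0.4, [1, 1, 0, -1, -1])          # rises, then falls back
azimuth_track = build_trajectory(0.0, [1, 1, 1], lo=-1, hi=1)   # drifts to the right
```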

That’s where the above-mentioned inverse tuning functions come in handy, providing a close enough approximation of how the perceptual landscape is indexed by soundstream properties. In practice, logarithmic scaling is applied to the input values, so a pitch of 0.4 becomes 1003 Hz.

Modeling of the binaural (azimuth) localization is handled separately. First of all, both offset and modulation values are within the range [-1,1], -1 being 90° to the left of head direction. These vectors are translated to interaural time differences (ITD) and interaural level differences (ILD). ITD stands for the amount of delay introduced to the left audio channel compared to the right. ILD is a frequency-dependent difference in sound amplitude between the two stereo channels.
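
To make the two cues concrete, here is a rough sketch translating azimuth values to ITD and ILD; the textbook Woodworth approximation stands in for the ITD and a crude frequency-dependent gain for the ILD, which is certainly simpler than what the actual model uses.

```python
import numpy as np

HEAD_RADIUS = 0.0875    # meters, average human head
SPEED_OF_SOUND = 343.0  # m/s

def azimuth_to_itd(az):
    """Woodworth approximation: azimuth in [-1, 1] (1 = 90 degrees right) ->
    interaural time difference in seconds (positive = left channel delayed)."""
    theta = az * np.pi / 2                      # map to [-pi/2, pi/2] radians
    return HEAD_RADIUS / SPEED_OF_SOUND * (theta + np.sin(theta))

def azimuth_to_ild_db(az, freq_hz):
    """Crude ILD: the level difference grows with azimuth and with frequency,
    since high frequencies are shadowed by the head far more than low ones."""
    return 20.0 * az * np.clip(freq_hz / 8000.0, 0.0, 1.0)

print(azimuth_to_itd(1.0) * 1e6)   # ~656 microseconds for a source at the right ear
```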

As already mentioned, binaural sound localization drastically worsens towards the periphery. To account for that, I introduced Gaussian noising on the azimuth values at the output of the encoder; the amount of added noise depends on the azimuth value:

Visualization of the amount of Gaussian random noise added at different azimuths of sound source location.
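
A minimal sketch of such a noising step follows, with a hypothetical noise schedule (the exact curve shown in the figure may differ); in AEV2A this sits inside the TensorFlow graph rather than in numpy.

```python
import numpy as np

def noisy_azimuth(az, base_std=0.01, peripheral_std=0.15):
    """Add more Gaussian noise to azimuth values near the periphery (|az| -> 1)
    than to frontal ones, mimicking how human localization degrades sideways.
    The std schedule is hypothetical; only its overall shape matters here."""
    az = np.asarray(az, dtype=float)
    std = base_std + (peripheral_std - base_std) * np.abs(az)
    return np.clip(az + np.random.normal(0.0, std), -1.0, 1.0)
```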

The above trick successfully drove the autoencoder to rely more on central locations than on azimuths towards the ears:

Evolution of azimuth offset distributions from the back to the front as the autoencoder was trained.

All in all, by riding the inverse tuning functions to synthesize sound, combined with value-dependent noising, one can construct a sound synthesis system that more or less excites different perceptual states given different input variables. Furthermore, modulation rates can be constrained to comply with the 100 Hz amplitude and 64 octave/sec human limits. Last but not least, the synthesizer is blazing fast and completely parallel by design. Here’s how the soundstream equations look for the left and right channels:

Aₜ is the amplitude modulation vector (offset + modulation), fₜ represents FM, and ITDₜ and ILDₜ are the binaural cues. More complex soundscapes may be constructed by overlapping multiple soundstreams, though it’s hard to tell whether such sound bites are sufficiently encoded in the brain.
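
Since the equations themselves are shown as an image, here is a hedged numpy reconstruction of what such a synthesizer boils down to: a phase-accumulated sine carrier with per-sample amplitude, an ITD realized as a delay on one channel and an ILD as a gain. The author’s exact formulation (e.g. the frequency dependence of the ILD) is more involved.

```python
import numpy as np

def synth_soundstream(amp, freq_hz, itd_s, ild_db, sr=44100):
    """Hedged reconstruction: per-sample amplitude (AM) and frequency (FM) tracks
    drive a phase-accumulated sine carrier; the ITD delays one channel and the
    ILD attenuates it. Not the author's exact equations."""
    phase = 2 * np.pi * np.cumsum(freq_hz) / sr        # FM via phase accumulation
    mono = np.asarray(amp) * np.sin(phase)
    gain = 10.0 ** (-abs(ild_db) / 20.0)               # level cue (simplified, broadband)
    delay = int(round(abs(itd_s) * sr))                # time cue, in samples
    delayed = np.concatenate([np.zeros(delay), mono])[:len(mono)]
    if itd_s >= 0:                                     # source on the right: the left
        left, right = gain * delayed, mono             # channel arrives later and softer
    else:
        left, right = mono, gain * delayed
    return np.stack([left, right], axis=1)

# a 250 ms stream whose loudness ramps up while the pitch glides from 400 to 800 Hz
n = 11025
stream = synth_soundstream(np.linspace(0.1, 0.8, n), np.linspace(400, 800, n),
                           itd_s=3e-4, ild_db=4.0)
```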

AutoEncoded V2A

AEV2A is a recurrent variational autoencoder that samples from the contour representation of the input image, translates it to audio, then iteratively reconstructs the contour image by drawing on a canvas. It inherits its network structure from DRAW, sprinkling it with the sound synthesizer and the binaural Gaussian noising layer, the latter of which can be considered an implicit hearing model. As of now, no other hearing models have been developed for inclusion, though I wrote a TensorFlow implementation of CARFAC, which, due to its massive use of feedback connections, is overly time and memory consuming to use.

The AEV2A model structure unfolded for two iterations. x is the input (contour) image, cₜ is the state of the canvas at iteration t, and hₜ is the hidden state of either the encoder or the decoder recurrent network. a is the audio representation, which can be either the raw audio or the AM, FM and SM vectors described above. μₜ and σₜ are the mean and standard deviation of the Normal distribution used to randomly sample the hidden state zₜ; this is the variational ingredient of the autoencoder, which allows for more distributed soundstreams.
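
The variational sampling mentioned in the caption is the standard reparameterization trick; a minimal TensorFlow sketch:

```python
import tensorflow as tf

def sample_latent(mu, sigma):
    """Reparameterization trick used in variational autoencoders such as DRAW:
    z_t = mu_t + sigma_t * eps keeps the sampling step differentiable."""
    eps = tf.random.normal(tf.shape(mu))
    return mu + sigma * eps
```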

Similarly to DRAW, the reader and writer modules apply Gaussian attention patches to the image, so only parts of the input are read and parts are drawn on the canvas. I also implemented a line drawing module that imprints edges of different angles and positions on the image, which is somewhat easier to grasp than the default grid drawing solution. To get some intuition on how the visual features can correspond to sound, check this video of soundscapes and hand postures drawn from them.
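
As for the line drawing module, here is a hedged sketch of one way to imprint an edge of a given position, angle and length onto a canvas by summing small Gaussian blobs along the segment; the actual AEV2A writer is parameterized differently.

```python
import numpy as np

def draw_line(canvas, x0, y0, angle, length, thickness=1.0, intensity=1.0, n=64):
    """Imprint an edge of a given position, angle and length onto the canvas by
    summing small Gaussian blobs along the segment, yielding a soft, paintable line."""
    h, w = canvas.shape
    ys, xs = np.mgrid[0:h, 0:w]
    out = canvas.astype(float).copy()
    for t in np.linspace(0.0, 1.0, n):
        cx = x0 + t * length * np.cos(angle)
        cy = y0 + t * length * np.sin(angle)
        out += intensity / n * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2)
                                      / (2.0 * thickness ** 2))
    return np.clip(out, 0.0, 1.0)

canvas = draw_line(np.zeros((64, 64)), x0=10, y0=50, angle=-0.8, length=45)
```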

Let’s test it: blindfolded for 5 days

As is well known and studied, blind people generally enjoy advantages in auditory discrimination. They are able to recognize rapid pitch and pitch-timbre transitions, being 10 times faster in such tasks. The earlier the onset of blindness, the better the pitch discrimination performance. They also localize sounds more reliably in the peripheral fields. Similar auditory enhancements were shown to manifest in blindfolded subjects only 3 days after being deprived of visual information.

Based on these findings, I figured there would be no better way to test AEV2A than to blindfold myself for 5 days non-stop, training on a sensory substitution task in the meantime.

Even though I think of myself as someone who has a good grip on his state of mind, even in adverse situations, I have to say, the experience was quite depressing. I barely wanted to get out of bed, I couldn’t plan ahead for longer than an hour, and I needed my friends around me every day. I was not constantly wishing for it to end, as I knew it was temporary, but I kept reassuring myself never to blindfold myself again. Apart from the depressive aspects, I was lucky to have friends ‘watching’ sitcoms with me, taking me out for walks, and, on the last night, even to a rave party.

I trained on two separate sensory substitution tasks. The first tested whether categorical shape discrimination is possible: I learnt to associate the generated soundscapes with images of my hand in different postures. In the second experiment, I had to identify and pick up objects from a table, again blindfolded, listening to the audio-encoded live video of the table in front of me. I managed to discriminate the hand postures of others (obviously, in the test phase I could not use my own hands) and grasp objects accurately, significantly better than chance, after only a few hours of training on the sensory substitution prototype I built.

What now?

It seems that the conversion logic works, to the extent that I was able to perform the mentioned simple tasks after a relatively low amount of training time (7 hours in total). The developed conversion logic is far from perfect. My thesis is meant to open a line of inquiry into machine learning approaches to sensory substitution design; hence, it was an exploratory work, with an abundance of assumptions, conjectures and experimentation.

AEV2A can be applied where visual information and fast reactions are paramount, such as in video games. However, as the model is trained on a set of images, one would need separate models for different visual environments: for instance, one model to get around an apartment, another to translate Nokia’s Snake game into audio. As of now, to achieve sub-second substitution delay, the visual environment has to be quite simple, low in variance. This limitation should improve with a more sophisticated audio synthesizer that e.g. covers dimensions of timbre, thus widening the bottleneck of the neural network.

Three-dimensional embedding of sound features. Labels depict the corresponding decoded images.

Spending 5 days blindfolded turned out to be an insightful experiment that I’m glad I went through. The extent to which my hearing got boosted due to the loss of sight is questionable at best, as I did not equip the case studies with controls in this respect. A clear next step would be to test the conversion logic with actual blind people, transforming an Atari game into pure sound.

Although further development invested in sensory substitution should ease and enrich the lives of the visually impaired, I tend to think that the new generation of brain interfaces, including Kernel, Openwater, Neuralink and others, will up the game and drive neural activation in correlation with the visual information. Such solutions should lead to more direct, faster perceptual learning, skipping the insanely complicated routes of the auditory pathway and cross-modal connections. Until then, I hope to see deep learning models invade the field of sensory substitution.

In case you want to play around with the model, you can find the code repository here. I went out of my way and actually wrote a detailed but short documentation for the project, so you can effortlessly generate your own dataset of images from videos, then feed it to the model. After training, the synthesized sounds and decoded images can be inspected in TensorBoard. Have fun, turn one of your favorite old games into the sound domain; I’m pretty sure sighted people would enjoy some audio challenges, too!
