The Ancient Secrets of Computer Vision 2 by Joseph Redmon - Condensed
Human Vision - How have we evolved to see the world?
In order to understand Computer Vision, we must first understand how we have evolved to see the world. It is important to investigate not only how we see but also why our sight evolved this way.
What advantages should we ensure we build into our Computer Vision systems?
We use Computer Vision in some of our solutions at Wallscope, so it was important to start from the beginning and ensure I had a solid understanding.
In case you missed the introduction to this series, Joseph Redmon released a series of 20 lectures on Computer Vision in September (2018). As he is an expert in the field, I wrote a lot of notes while going through his lectures. I am tidying my notes for my own future reference but am posting them on Medium also in case these are useful for others.
I highly recommend watching this lecture on Joseph’s Youtube Channel here.
- The Evolution of Eyes
- How Do Our Eyes Work?
- The Brain - Our Visual Processor
- 3D Vision
- Recreating Colour on a Screen
The Evolution of Eyes
To begin with, we need to consider why we have eyes in the first place. The obvious answer is of course to see the world but, in order to fully understand this, we must start by investigating the most basic form our eyes took.
Simple eyes, named eyespots, are photosensitive proteins with no other surrounding structure. Snails for example have these at the tip or base of their tentacles.
Our vision evolved from eyespots, which can only really detect light and give a very rough sense of direction. No nerves or brain processing is required as the output is so basic, but snails, for example, can use these to detect and avoid bright light to ensure they don’t dry out in the sun.
Importantly, eyespots have extremely low acuity as light from any direction hits the same area of proteins.
Visual Acuity: The relative ability of the visual organ to resolve detail.
Slightly more complex are pit eyes: essentially eyespots in a shallow, cup-shaped pit. These have slightly more acuity, as light from one direction is blocked by the edge of the pit, increasing directionality. If only one side of the cells is detecting light, then the source must be to one side.
These eyes still have low acuity as they are relatively simple but are very common in animals. Most animal phyla (28 of 33) developed pit eyes independently. This is due to the fact that recessed sensors are a simple mutation and increased directionality is such a huge benefit.
Phylum (singular of phyla): In Biology, a level of taxonomic rank below Kingdom and above Class.
Many different complex eye structures now exist as different animals have evolved with various needs in diverse environments.
Pinhole Eyes are a further development of the pit eye as the ‘pit’ has recessed much further, only allowing light to enter through a tiny hole (much like some cameras). This tiny hole lets light through which is then projected onto the back surface of the eye or camera. As you can see in the diagram above, the projected image is inverted but the brain handles this (post-processing) and the benefits are much more important. If the hole is small enough, the light hits a very small number of receptors and we can therefore detect exactly where the light is coming from.
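The projection geometry can be sketched in a few lines of Python. This is a toy similar-triangles model rather than anything from the lecture, and the numbers are purely illustrative:

```python
def pinhole_project(x, y, z, d=1.0):
    """Project a point at depth z in front of a pinhole onto a screen
    a distance d behind the hole. Similar triangles scale the point by
    d/z; the negative signs capture the inversion of the image."""
    return (-d * x / z, -d * y / z)

# A point up and to the right in the world...
u, v = pinhole_project(x=2.0, y=1.0, z=10.0)
# ...lands down and to the left on the back surface; more distant
# points (larger z) project closer to the centre.
print(u, v)  # -0.2 -0.1
```

Shrinking the hole doesn’t change this geometry; it just narrows the bundle of rays reaching each point on the back surface, which is where the acuity comes from.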
As mentioned, eyespots have basically no acuity and pit eyes have only very low acuity, as just some light is blocked by the edges of the ‘pit’. Complex eyes however have very high acuity, so what advantages pushed our eyes to evolve even further than pinhole eyes?
Humans have Refractive Cornea Eyes, which are similar to pinhole eyes but curiously evolved to have a larger hole… To combat the loss of acuity that this difference causes, a cornea and lens are fitted within the opening.
The high acuity of the pinhole eye was a result of the fact that only a tiny amount of light could get through the hole and therefore only a few receptors in the retina get activated. As you can see in the diagram above, the lens also achieves this by focusing the incoming light to a single point on the retina.
The benefit of this structure is that high acuity is maintained, to allow accurate direction, but a lot more light is also allowed in. More light hitting the retina allows more information to be processed which is particularly useful in low-level light (hence why species tend to have at least a lens or cornea). Additionally, this structure gives us the ability to focus.
Focusing incoming light onto the retina is mainly done by the cornea but its focus is fixed. Re-focusing is possible thanks to our ability to alter the refractive index of each lens. Essentially, we can change the shape of the lens to refract light accurately from different sources onto single points on the retina.
This ability to change the shape of our lenses is how we can choose to focus on something close to us or in the distance. If you imagine sitting in a train and looking at some houses in the distance, you would not notice a hair on the window. Conversely, if you focused on the hair (by changing the refractive index of your lenses) the houses in the distance would be blurry.
Therefore, focusing at one depth sacrifices acuity at other depths.
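Refocusing can be sketched with the standard thin-lens equation 1/f = 1/d_o + 1/d_i. The eye's image distance (lens to retina) is fixed, so the focal length must change with object distance; the 17 mm figure and the object distances below are rough illustrative numbers, not measurements from the lecture:

```python
def required_focal_length_mm(object_distance_mm, image_distance_mm=17.0):
    """Focal length (mm) needed to focus an object at the given distance
    onto a retina a fixed distance behind the lens: 1/f = 1/d_o + 1/d_i."""
    return 1.0 / (1.0 / object_distance_mm + 1.0 / image_distance_mm)

far_focus = required_focal_length_mm(10_000.0)  # houses out the window
near_focus = required_focal_length_mm(250.0)    # a hair on the glass

# Focusing nearby needs a shorter (stronger) focal length, which is
# why the lens bulges when you look at something close.
assert near_focus < far_focus
print(round(far_focus, 2), round(near_focus, 2))
```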
As you may have noticed, complex eyes have all evolved with the same goal - better visual acuity. Only 6 of the 33 animal phyla have complex eyes but 96% of all known species have them so they are clearly very beneficial. This is of course because higher acuity increases the ability to perceive food, predators and mates.
How Do Our Eyes Work?
From above we now know that light passes through our cornea, humours and lens, which refract it so that it focuses on our retina. We also know this has all evolved to increase acuity while letting in lots of light for information, but what next?
Once light hits the retina, it is absorbed by photosensitive cells that emit neuro-transmitters through the optical nerve to be processed by our visual cortex.
Unlike cameras, our photosensitive cells (called rods and cones) are not evenly distributed or even the same as each other.
Rods and Cones
There are around 126 million photosensitive cells in the retina that are found in different areas and used for very different purposes.
Cones are predominantly found in the centre of the retina, called the fovea, and rods are mainly in the peripherals. There is one spot of the retina that contains neither as this is where the optic nerve connects to the retina - commonly known as the blind-spot.
Interestingly, octopuses have very similar eyes but do not have a blind-spot. This is because our optic nerve fibres run in front of the retina and must pass back through it to exit the eye, whereas the optic nerves of an octopus exit behind the retina. Light cannot pass through nerves, hence we have a blind-spot.
Rods, predominantly found in our peripherals as mentioned, make up the significant majority of our photosensitive cells as we have roughly 120 million of them in each eye!
We use rods predominantly in low-light conditions, and for this reason they do not see colour. They respond even if hit by only a single photon, so they are very sensitive, but they respond slowly: they take a relatively long time to absorb light before emitting a response to our brain, so rods work together, pooling information from multiple rods into batches that get transmitted.
Rods are so adapted for low-light vision that they are unfortunately very poor in bright light, because they saturate very quickly. This is why it takes so long for our eyes to adjust from bright to low light.
If you have ever gone stargazing for example and then glanced at your phone screen, you will notice that it takes 10 to 15 minutes for your ‘night vision’ to return. This is because the phone light saturates your rods and they have to go through the chemical process to desaturate the proteins for them to absorb light again.
Cones on the other hand are found in the fovea and are much rarer as each eye only contains around 6 million of them. This is a lot less than the number of rods but our cones are a lot more concentrated in the centre of our retina for the specific purpose of fine grained, detailed colour vision (most of our bright and colourful day to day lives).
Our cones can see quick movement and have a very fast response time (unlike rods), so they are brilliant in the quickly changing environments that we live in.
The fovea is where all the cones are concentrated, but it is only 1.5mm wide; the cones there are therefore very densely packed, with up to 200,000 cones/mm².
This concentration of cones makes the fovea the area of the retina with the highest visual acuity which is why we move our eyes to read. To process text, the image must be sharp and therefore needs to be projected onto the fovea.
Our Peripheral Vision contains few cones, reducing acuity, but the majority of our rods. This is why we can see shapes moving in our peripherals but not much colour or detail. Try reading this with your peripherals for example, it is blurry and clearly does not have the same level of vision.
The advantage as mentioned above is ‘night vision’ and this is clear when stargazing as stars appear bright when looking at them in your peripheral vision, but dim when you look directly at one. Pilots are taught to not look directly at other planes for exactly this reason, they can see plane lights better in their peripherals.
There are other differences between peripheral and foveal vision. Look at this illusion and then stare at the cross in the centre:
If you look directly at the change in the purple dots, you can clearly see that the purple dots are simply disappearing for a brief moment in a circular motion.
If however you stare at the cross, it looks like all the purple dots disappear and a green dot is travelling in a circle… why?
When using your foveal vision, you are following the movement with your eyes. When fixating on the cross however, you are using your peripheral vision. The important difference is the fact that you are fixating!
The purple light is hitting the exact same points on your retina because you are not moving your eyes. The photoreceptors at those points therefore adapt to the purple so you stop seeing the dots (hence they appear to disappear), and this adaptation makes the grey look green.
Our eyes adjusting and losing sensitivity over time when you look directly at something could cause major problems so how do we combat this?
Fixational Eye Movement
There are many ways that we compensate for this loss in sensitivity over time but they all essentially do the same thing - expose different parts of the retina to the light.
There are a couple of large shifts (large being used as a relative term here) and a much smaller movement.
Microsaccades (one of the large movements) are sporadic and random small versions of saccades.
Saccade: (French for jerk) a quick, simultaneous movement of both eyes between two or more phases of fixation in the same direction.
You don’t notice these happening but these tiny short movements expose new parts of the retina to the light.
Ocular Drift is a much slower movement than microsaccades, more of a roaming motion in conjunction with what you are fixating on. This is a random but constant movement.
This image illustrates the constant ocular drift combined with sporadic microsaccades.
Finally, Microtremors are tiny vibrations that are so small that light doesn’t always change which receptor it’s hitting, just the angle at which it hits it. Amazingly, these microtremors are synced between eyes to vibrate at the exact same speed.
These three fixational eye movements allow us to see very fine grained detail!
In fact, the resolution of our fovea is not as high as you might expect; microsaccades, ocular drift and microtremors help our brain build a more accurate mental model of what is happening in the world.
The Brain - Our Visual Processor
All the information we have discussed so far gets transmitted through our optical nerves but then what?
Our brain takes all of these signals and processes them to give us vision!
It is predominantly thought that our brains developed after our eyes. Jellyfish for example have very complex eyes that connect directly to their muscle tissue for quick reactions.
There is very little point in having a brain without sensory input so it is probable that we developed brains because we had eyes as this allows complex responses beyond just escape reactions.
There are roughly 1 million ganglia in each eye that transmit information to the brain. We know that there are far more rods than there are ganglia, so compression must take place at this point and our photoreceptors must complete some pre-processing.
Retinal Ganglion Cell: A type of neuron that receives visual information from photoreceptors.
There are two types of ganglia: M-cells and P-cells.
Magnocellular cells transmit information that help us perceive depth, movement, orientation and position of objects.
Parvocellular cells transmit information that help us perceive colour, shape and very fine details.
These different types of ganglia are connected to different kinds of photoreceptors depending on what they’re responsible for but then all connect to the visual cortex.
The visual cortex contains at least 30 different substructures but we don’t know enough to build a coherent model. We do know however that the information from the ganglia is passed to the primary visual cortex followed by the secondary visual cortex.
V1 - Primary Visual Cortex:
This area of the visual cortex performs low level image processing (discussed in part 1) like edge detection for example.
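As a rough illustration of the kind of low-level processing meant here, edge detection can be done by convolving the image with a small derivative kernel. This is a standard Sobel filter in plain Python, not a model of V1 itself:

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]  # responds to horizontal intensity changes

def convolve(image, kernel):
    """Valid-mode 2D correlation over nested lists of numbers."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(kernel[u][v] * image[i + u][j + v]
                 for u in range(kh) for v in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

# Dark left half, bright right half: a vertical edge down the middle.
image = [[0, 0, 1, 1]] * 4
print(convolve(image, SOBEL_X))  # [[4, 4], [4, 4]]: strong response at the edge
```

A flat image produces all zeros, so the filter only "fires" where intensity changes, much like an edge detector should.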
V2 - Secondary Visual Cortex
Following V1, this area of the visual cortex helps us recognise object sizes, colours and shapes. It is often argued that visual memory is stored in V2.
From V2, the signals are sent to V3, V4, V5 but also fed back to V1 for further processing.
It is theorised (and generally accepted) that the information passes through V1, then through V2, and is then split and streamed to both the ventral and dorsal systems for two very different purposes.
The Ventral Dorsal Hypothesis
Instead of listing the differences between the two systems, I have cut the slide from Joseph Redmon’s lecture:
The ventral system is essentially our conscious, fine-grained, detailed sight that we use for recognition and identification. This system takes the high-detail foveal signals as we need to consciously see in the greatest detail possible. As we need such high detail (and most of this detail comes from the brain’s visual processing), the processing speed is relatively slow when compared to the dorsal system.
Why would we need unconscious vision? If someone threw a ball at you right now, you would move your head to dodge it very quickly, yet the ventral system has a slow processing speed. We dodge first and look for the thrown object afterwards because we did not consciously see what was thrown: we reacted quickly thanks to the very fast, unconscious vision of our dorsal system.
We also use this ‘unconscious vision’ while walking and texting. Your attention is on your phone screen yet you can avoid bins, etc… on the street.
We use both systems together to pick objects up, like a glass for example. The ventral system allows us to see and locate the glass, the dorsal then guides our motor system to pick the glass up.
This split is really seen when sections of the brain are damaged!
If people damage their dorsal system, they can recognise objects without a problem but struggle to then pick objects up for example. They find it really difficult to use vision for physical tasks.
The majority of the information in the dorsal system isn’t consciously accessible, so damage to the ventral system renders a person consciously blind. Interestingly, however, even though they cannot consciously see or recognise objects, they can still do things like walk around obstacles.
This man walks around obstacles in a corridor even though he cannot see and later, when questioned, is not consciously aware of what was in his path:
Our brain and vision have co-evolved and are tightly knit. The visual cortex is the largest system in the brain, accounting for 30% of the cerebral cortex and two thirds of its electrical activity. This tightly knit, complex system is still not fully understood so it is highly researched and new discoveries are made all the time.
3D Vision

We have covered a lot of detail about each eye but we have two. Do we need two eyes to see in three dimensions?
Short answer: No.
There are in fact many elements that help our brain model in three dimensions with information from just a single eye!
Focusing for example provides a lot of information on depth like how much the lens has to change and how blurry parts of the image are.
Additionally, movement also helps, as a nearby car moves across our field of vision much faster than a plane (which is travelling much faster) in the distance. Finally, if you are moving (on a train for example), this parallax effect of different objects moving at different speeds still exists. We saw this being used to create 3D images in part 1.
All of this helps us judge depth using each eye individually! It is of course widely known however that our ability to see in three dimensions is greatly assisted by combining the information from both eyes.
What we all mainly consider depth perception is called stereopsis. This uses the differences in the images from both eyes to judge depth. The closer something is to you, the bigger the difference in visual information from each eye. If you hold a finger up in front of you for example and change the distance from your eyes while closing each eye individually - you will see this in action.
If you move your finger really close to your face, you will go cross-eyed. The amount your eyes have to converge to see something also helps with depth perception.
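The relationship stereopsis exploits can be sketched with the standard stereo relation depth = focal length × baseline / disparity; the eye-spacing and focal-length figures below are illustrative stand-ins, not physiological measurements:

```python
def depth_from_disparity(disparity_m, baseline_m=0.065, focal_m=0.017):
    """Depth of a point from the offset (disparity) between the two
    images: the closer the point, the larger the disparity."""
    return focal_m * baseline_m / disparity_m

near = depth_from_disparity(0.002)    # large disparity -> close object
far = depth_from_disparity(0.0001)    # small disparity -> distant object
assert near < far
print(near, far)  # depth shrinks as disparity grows
```

The inverse relationship also explains why stereopsis works best up close: at large distances the disparity becomes too tiny to measure.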
All of this information is great but our brain has to tie it all together, add in its own considerations and build this world model.
In a similar fashion to stereopsis and parallax sight, our brain perceives kinetic depth. Essentially your brain infers the 3D shape of moving objects. This video illustrates this amazingly:
Our brains can also detect occlusion, such as “I can only see half a person because they are behind a car”. We know the object that is obstructed is further away than the object that is obstructing. Additionally, our brain remembers the general size of things that we are familiar with, so we can judge whether a car is near or far based on how big it appears.
This is quite a famous illusion that plays with our brain’s understanding of occlusion.
Finally, our brains also use light and shadows to build our 3D model of the world. This face is a good example of this:
We can judge the 3D shape of this person’s nose and philtrum (between the nose and upper lip) solely based on the highlights and shadows created by the light.
Tying this all together, we are very skilled at perceiving depth.
As mentioned earlier, we don’t fully understand our visual processing; we only recently found out that our eyes reset orientation when we blink! (Our eyes rotate a little when watching a rotating object, and blinking resets this.)
We have such complex eyes that use a huge amount of our resources which is likely down to how beneficial vision is to us. Without sight, we would not exist as we do in the world and without light, we couldn’t have sight (as we know it).
All light is electromagnetic radiation, made up of photons that behave like particles and waves…
The wavelength of ‘visible light’ (what our eyes perceive and therefore what we see) is around 400 to 700 nanometres. Thankfully, that is also the range in which sunlight is most intense, and of course this is no coincidence: we evolved to see sunlight.
We do not see x-rays because the sun doesn’t shoot many x-rays at us; it predominantly sends ‘visible light’.
We see a combination of waves of different wavelengths and in the modern age (now that we have light bulbs and not just the sun), these are quite diverse.
As you can see, sunlight contains all wavelengths whereas bulbs have high amounts of more particular wavelengths.
We see objects as a colour based on which wavelengths are reflected off them. A red bottle absorbs most wavelengths but reflects red, hence we see it as red.
The colour of an object therefore depends on the light source. An object cannot reflect wavelengths that did not hit it in the first place, so its colour will appear different in the sun. Our brain judges the light source and compensates for this a little, which is what made this dress so famous!
Dive into that page (linked in the image source) and check out the scientific explanation discussing chromatic adaptation.
Colour differences are particularly strange when objects are under fluorescent light as it appears to us as white light. Sunlight appears white and contains all wavelengths whereas fluorescent light appears white but is missing many wavelengths which therefore cannot be reflected.
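This ‘reflected = illuminant × reflectance’ idea is easy to sketch numerically. The five coarse wavelength bands and all of the spectra below are made-up illustrative values:

```python
# Five coarse bands spanning roughly 400-700 nm.
bands = ["violet", "blue", "green", "yellow", "red"]
sunlight = [1.0, 1.0, 1.0, 1.0, 1.0]        # all wavelengths present
fluorescent = [0.2, 0.9, 0.3, 0.8, 0.1]     # spiky; many bands weak
red_bottle = [0.05, 0.05, 0.05, 0.2, 0.9]   # reflects mainly red

def reflected(illuminant, reflectance):
    """Spectrum leaving the object: elementwise product per band."""
    return [i * r for i, r in zip(illuminant, reflectance)]

print(reflected(sunlight, red_bottle))     # strong red band survives
print(reflected(fluorescent, red_bottle))  # barely any red to reflect
```

Under the fluorescent source the bottle cannot look as red, because the red light was never there to be reflected.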
Colour Perception (Rods and Cones)
The photoreceptors in your eyes have different response curves. There is only one type of rod but three types of cone, each with its own response curve, which is why cones can distinguish colours and rods cannot.
There are three types of cones, short, medium and long which correspond to short (blue), medium (green) and long (red) wavelengths.
Long cones respond mainly to wavelengths very close to green but extend to red, this is why we can see more shades of green than any other colour. We evolved this historically to spot hunting targets and dangers in forests and grasslands.
Our perception of colour comes from these cones. Each cone has an output that is roughly calculated by multiplying the input wave by the response curve and integrating to get the area under the resulting curve. The colour we see is then the relative activation of these three types of cones.
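That calculation (multiply the incoming spectrum by each cone's response curve, then take the area under the result) reduces to a simple dot product once the curves are sampled. The response curves and spectrum below are crude made-up samples, not real sensitivity data:

```python
wavelengths = [450, 500, 550, 600, 650]   # nm, coarse sampling
response_s = [0.9, 0.3, 0.05, 0.0, 0.0]   # short cone, blue-ish peak
response_m = [0.1, 0.6, 0.9, 0.4, 0.1]    # medium cone, green-ish peak
response_l = [0.05, 0.4, 0.8, 0.9, 0.5]   # long cone, shifted toward red

def cone_activation(spectrum, response):
    """Discrete version of 'multiply by the response curve and take the
    area under the result': a sum of per-band products."""
    return sum(s * r for s, r in zip(spectrum, response))

greenish_light = [0.1, 0.5, 1.0, 0.4, 0.1]
lms = [cone_activation(greenish_light, r)
       for r in (response_s, response_m, response_l)]
print(lms)  # the colour we see is the *relative* size of these three
```

With this greenish input, the medium and long cones fire strongly while the short cone barely responds, and that ratio is what the brain reads as colour.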
We have many more red and green cones than blue (another reason why we see a lot more shades of green than any other colour) and this is why green also appears brighter than other colours. You can also see from the image above that there are very few blue cones in the fovea (centre of the image).
This is important to bear in mind when designing user interfaces as it can sometimes have a significant effect. Reading green text on a black background, for example, is much easier than reading blue.
Most humans have these three cones but there is a lot of variation in nature. Some animals have more and can therefore perceive even more colours than we can!
Every additional cone type allows the eye to perceive roughly 100 times as many colours as before.
As mentioned, rods don’t really contribute to our perception of colour. They are fully saturated during the day so don’t contribute to our day vision at all. This does not mean they are useless by any means, they just serve very different purposes.
Colourblindness is generally caused by a missing type of cone or a variant in a cone’s wavelength sensitivity. For example, if the red and green cones’ sensitivities are even more similar than usual, it becomes very difficult for the person to distinguish between red and green (a very common form of colourblindness).
Recreating Colour on a Screen
If printers and TVs had to duplicate the reflected wavelengths of a colour exactly, they would be extremely hard to make! They instead use metamers that are easier to produce.
Metamerism: A perceived matching of colours with different spectral power distributions. Colours that match this way are called metamers.
Finding easy-to-produce metamers allows us to recreate colours by selectively stimulating cones.
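A metamer pair can be sketched directly from the cone-activation idea: two physically different spectra that excite all three cone types identically. The toy response matrix below is constructed (not real data) so that its two middle wavelength bands are indistinguishable to every cone, making the swap a metamer by design:

```python
# Rows: S, M, L cone responses; columns: four wavelength bands.
# The two middle columns are identical in every row on purpose.
responses = [
    [0.8, 0.1, 0.1, 0.0],
    [0.1, 0.7, 0.7, 0.2],
    [0.0, 0.5, 0.5, 0.9],
]

def activations(spectrum):
    """(S, M, L) activations for a spectrum: one dot product per cone."""
    return tuple(sum(r * s for r, s in zip(row, spectrum))
                 for row in responses)

spectrum_a = [0.2, 0.9, 0.1, 0.3]
spectrum_b = [0.2, 0.1, 0.9, 0.3]  # energy swapped between middle bands

# Physically different light, same cone activations: the two look identical.
assert spectrum_a != spectrum_b
assert all(abs(x - y) < 1e-12
           for x, y in zip(activations(spectrum_a), activations(spectrum_b)))
```

Real cone curves overlap heavily, which is exactly what gives screens so much freedom to fake colours with only three primaries.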
To show that metamers could be created, a group of subjects were gathered and given primary light controls. These consisted of three dials that modified the amount of red, green and blue light (RGB), and the subjects were given a target colour. The task was of course to see whether the subjects could faithfully reconstruct the target colour by controlling only three primary colours. This was easy for many colours but a touch more complicated for others, as negative red light had to be added to recreate some colours (in practice, red light was added to the target side instead).
It was concluded that, given three primary light controls, people can match any colour and, additionally, people choose similar distributions to match the target colour. This means that colour can be reliably reproduced using combinations of just a few wavelengths.
Using this information, a map of all humanly visible colours was then made. To represent colours on a screen, however, your images need to be represented in a colour space (of which there are many). The most commonly used is sRGB, which was developed by HP and Microsoft in 1996, but wider colour spaces have been developed since then.
Adobe RGB was developed two years later and is used in tools such as Photoshop. ProPhoto RGB was created by Kodak and is the largest current colour space, which even extends beyond what our eyes can see, so why don’t we all use this?
If you want to store your image as a JPEG, view it in a browser or print it on a non-specialist printer, you will have to use sRGB. ProPhoto RGB is simply too specialised for day-to-day use, so standard equipment and workflow tools do not support it. Even Adobe RGB images viewed in a browser will often be converted to sRGB first, which is why sRGB is still the most used today.
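One concrete piece of the sRGB standard is its transfer function, which maps linear light intensity to the stored 0–1 channel value (linear near black, a 2.4-exponent power law elsewhere). A minimal sketch:

```python
def linear_to_srgb(c):
    """Encode one linear-light channel value in [0, 1] as an sRGB value,
    using the standard piecewise sRGB transfer function."""
    if c <= 0.0031308:
        return 12.92 * c
    return 1.055 * c ** (1 / 2.4) - 0.055

print(linear_to_srgb(0.0))  # black stays black
print(linear_to_srgb(0.5))  # ~0.735: half the photons, but well above mid-grey
print(linear_to_srgb(1.0))  # white stays white
```

This non-linearity spends the limited 8 bits of a JPEG channel where our eyes are most sensitive: in the dark tones.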
Images are represented by pixels and each pixel’s colour by an RGB triple, so there are colours that we can see that cannot be recreated on a screen.
Printers use more primaries, but some colours still cannot be reproduced! Unless in an illusion:
Finally, people have mapped colour spaces into cubes:
and (more human-like as hue, value, saturation) cylinders:
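The cylinder is the familiar HSV (hue, saturation, value) model, and Python’s standard library can convert between it and RGB via the colorsys module:

```python
import colorsys

# Pure red sits at hue 0 with full saturation and value.
print(colorsys.rgb_to_hsv(1.0, 0.0, 0.0))  # (0.0, 1.0, 1.0)

# A dim, washed-out blue: hue about 2/3 of the way round the cylinder,
# half saturated, low value.
h, s, v = colorsys.rgb_to_hsv(0.2, 0.2, 0.4)
print(round(h, 3), round(s, 3), round(v, 3))
```

Separating hue from saturation and value is why HSV feels more human-like: you can dim or wash out a colour without changing *which* colour it is.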
Hopefully you are convinced that sight is incredible and Computer Vision is no straightforward challenge!
In the next post in this series, I will cover Joseph’s lecture on basic image manipulation.