Beyond Sight: AI and the Future of Computer Vision

Marsel
3 min read · Jan 13, 2024


People with visual challenges often sense the world more deeply than people with full vision. Thoughts?

The human sensory experience is the fascinating medium through which we navigate and interpret our surroundings. However, the senses of touch, smell, taste, and hearing are often overpowered by visual stimuli.

Ever wondered how these sensations form the building blocks of the imagined world of people with visual challenges?

How would you describe the vast expanse of the ocean, the dance of its waves, and the canvas of the setting sun to someone who can’t see?

Powered by DALL-E 3

Imagine a world where video is seamlessly transformed into speech by advanced AI devices like Humane’s AI Pin. There, people with visual challenges could access the visual world like never before. We are just testing the waters; video will play an even more significant role in our daily lives.

Visual challenges affect millions of people worldwide: as of 2024, roughly 36 million people are blind.

The magic of speech-enabled video lies in the integration of multi-modal AI technology. Traditional video relies on visual input alone, but speech-enabled video pairs the visual content with auditory information. AI analyzes the video and converts it into speech, allowing a seamless transition from visual to auditory content.
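To make that pipeline concrete, here is a minimal sketch. Every tool in it is my own assumption rather than something named above: OpenCV for frame grabbing, the open-source BLIP captioning model via Hugging Face’s transformers for the visual-to-text step, and pyttsx3 for offline text-to-speech.

```python
# A minimal video-to-speech sketch: sample frames from a video, caption
# each sampled frame with an image-captioning model, and speak the result.
import cv2                                    # frame extraction
import pyttsx3                                # offline text-to-speech
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
tts = pyttsx3.init()

def describe_video(path: str, every_n_frames: int = 60) -> None:
    """Caption roughly one frame every couple of seconds and read it aloud."""
    cap = cv2.VideoCapture(path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n_frames == 0:
            # OpenCV decodes frames as BGR; the captioning model expects RGB.
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            inputs = processor(images=image, return_tensors="pt")
            output = model.generate(**inputs, max_new_tokens=30)
            caption = processor.decode(output[0], skip_special_tokens=True)
            tts.say(caption)
            tts.runAndWait()
        index += 1
    cap.release()

describe_video("sunset_over_the_ocean.mp4")  # hypothetical file name
```

A real system would also describe motion across frames and skip near-duplicate captions; this sketch simply narrates individual frames.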

Speech-enabled video has the potential to revolutionize accessibility. By converting video into speech, people with visual impairments can navigate visual content with unprecedented ease and reclaim their independence: rather than relying on others to interpret what is on screen, they can simply listen to the narration the AI generates. That independence can greatly enhance quality of life and enable full participation in a wide range of activities.

Speech-enabled video not only benefits people with visual challenges but also showcases what human-technology interaction can become. AI models can process video in real time, enabling live speech translation. This opens up new possibilities for applications such as video conferencing, remote assistance, and language translation, making communication easier and more efficient.

In 2024, still the early days of the AI revolution and its most debated topic, I find myself exploring Orange Data Mining (orangedatamining.com), an open-source machine learning and data visualisation platform.

It is an interactive data exploration tool for rapid qualitative analysis with clean visualisations. The GUI (graphical user interface) lets you focus on exploratory data analysis instead of coding: place widgets on the canvas, connect them, load your datasets, and harvest the insights!

I ran a simple test to classify images of Buildings, Forest, Glacier, Mountain, Sea, and Street. In under five minutes, I had the results of a Logistic Regression model and a kNN model on this image data, to assess which model would suit the classification better.
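Orange is GUI-first, but it also ships a Python scripting layer, so a rough script-level equivalent of that widget workflow might look like the sketch below. The file name is hypothetical, and I am assuming the image features were already extracted to a table (in the GUI, the Image Embedding widget handles that step).

```python
# A script-level sketch of the widget workflow: load a table of image
# features, cross-validate two learners, and compare their scores.
import Orange

# Hypothetical file: image embeddings exported from the GUI's
# Image Embedding widget, with the six scene labels as the class.
data = Orange.data.Table("image_embeddings.tab")

learners = [
    Orange.classification.LogisticRegressionLearner(),
    Orange.classification.KNNLearner(n_neighbors=5),
]

# 5-fold cross-validation, mirroring the Test and Score widget.
cv = Orange.evaluation.CrossValidation(k=5)
results = cv(data, learners)

aucs = Orange.evaluation.AUC(results)
cas = Orange.evaluation.CA(results)  # classification accuracy
for learner, auc, ca in zip(learners, aucs, cas):
    print(f"{learner.name}: AUC={auc:.3f}, CA={ca:.3f}")
```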

Test and Score

An AUC of 0.97 indicates that the model achieves a high true positive rate and a low false positive rate across a range of classification thresholds.
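If you want to reproduce a number like that outside the GUI, multiclass AUC is typically computed one-vs-rest and macro-averaged. A tiny sketch with scikit-learn’s roc_auc_score (my choice of tool, not one the post uses):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy illustration with three classes. Each row of y_prob is a classifier's
# predicted class-probability vector for one sample.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.3, 0.2],   # a class-1 sample ranked poorly, so AUC < 1
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
])

# One-vs-rest, macro-averaged AUC: the usual multiclass generalisation.
print(roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"))
# prints about 0.979
```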

Confusion matrix for the model
Classification workflow in Orange

Today, this tool could prove beneficial for photographers managing vast numbers of photos stored in the cloud. Deeper research in this domain holds real opportunity: improving accuracy and addressing the many use cases associated with computer vision.

With the integration of multi-modal AI technology, people with visual challenges will experience a new level of accessibility and independence. This groundbreaking innovation holds immense promise for improving human-technology interaction.
