Building a web app to read out the text in images, Part 2

Evans Ehiorobo
Published in How-Tos, Oct 15, 2019

I have often wanted to read something but could not, either because I was busy with something else or because I could not use my phone or computer at that moment (for example, while on the road).

To solve this problem, I decided to create an application that will extract text from images and then read out the text to me. This way, I can listen to it while moving around and doing other things.

Recap

I began by creating a C++ application that extracts text from an image using Tesseract, an open-source Optical Character Recognition (OCR) tool. Here is an image that was fed to the application:

Test image

and its output was:

1.1 What is computer vision?
As humans, we perceive the three-dimensional structure of the world around us with apparent
ease. Think of how vivid the three-dimensional percept is when you look at a vase of flowers
sitting on the table next to you. You can tell the shape and translucency of each petal through
the subtle patterns of light and Shading that play across its surface and effortlessly segment
each flower from the background of the scene (Figure 1.1). Looking at a framed group por-
trait, you can easily count (and name) all of the people in the picture and even guess at their
emotions from their facial appearance. Perceptual psychologists have spent decades trying to
understand how the visual system works and, even though they can devise optical illusions!
to tease apart some of its principles (Figure 1.3), a complete solution to this puzzle remains
elusive (Marr 1982; Palmer 1999; Livingstone 2008).

Next, I explained how to write the server code using Express.js, which you can find here. In this article, I will explain how to build the web client using React.js. I will assume you have some experience with React.

The web client

We want to create a React.js application that will:

  • Allow the user to upload an image to the server.
  • Receive the text sent back by the server (the text in the image).
  • Read out the text.

We will begin by creating a basic React app with an input field for selecting an image and a button for uploading it:

    import React from 'react';
    import './App.css';

    function App() {
      // DOM functions
      function show(id) {
        document.getElementById(id).setAttribute("class", "visible");
      }

      function hide(id) {
        document.getElementById(id).setAttribute("class", "hidden");
      }

      // Event handlers
      let image;

      function handleFileSelect(e) {
        show("submitBtn");
        hide("selectImageSpan");
        image = e.target.files[0];
      }

      function handleClick(e) {
        hide("uploadInput");
        hide("submitBtn");
        show("uploadingSpan");
      }

      // JSX uses className rather than class
      return (
        <div className="App">
          <header className="App-header">
            <p>
              <span id="selectImageSpan">Select an image</span>
              <br /><br />
              <input type="file" id="uploadInput" name="image" onChange={handleFileSelect} />
              <button id="submitBtn" onClick={handleClick} className="hidden">SUBMIT</button>
              <span id="uploadingSpan" className="hidden">Uploading the image...</span>
            </p>
          </header>
        </div>
      );
    }

    export default App;

In the code above, the two functions show() and hide() make an element visible or invisible by setting its class. The markup consists of a span that displays the message “Select an image”, an input field for selecting the image, a button that will submit the image to the server, and a second span that displays the text “Uploading the image…”.

The button and the last span are hidden at first. When an image is selected, the button appears, the “Select an image” message is hidden, and the file is stored in the image variable. When the button is clicked, the input field and the button are hidden and the last span tells the user that the image is being uploaded.
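The visibility rules just described can also be written down as a small lookup table. The sketch below is only an illustration: the stage names and the visibility() helper are assumptions and do not appear in the project's code, although the element ids are the ones used above.

```javascript
// Hypothetical stage names ("select", "ready", "uploading"); the article's
// code does not define them. Each stage lists the element ids that are visible.
const VISIBLE = {
  select: ["selectImageSpan", "uploadInput"], // before a file is chosen
  ready: ["uploadInput", "submitBtn"],        // after handleFileSelect runs
  uploading: ["uploadingSpan"],               // after handleClick runs
};

// Returns "visible" or "hidden", the two class names set by show() and hide().
function visibility(stage, id) {
  const ids = VISIBLE[stage];
  if (!Array.isArray(ids)) {
    throw new Error(`Unknown stage: ${stage}`);
  }
  return ids.includes(id) ? "visible" : "hidden";
}

console.log(visibility("select", "submitBtn"));      // hidden
console.log(visibility("ready", "submitBtn"));       // visible
console.log(visibility("uploading", "uploadInput")); // hidden
```

In a component, the stage could live in React state and each element could take its class from visibility(stage, id), which would replace the direct DOM calls. That is one possible design, not what the code above does.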

This is the output:

Select an image
test_image.jpg selected
Submit button clicked

Next, we will use axios to send the image to the server in a POST request. As described in this article on the back-end code, the endpoint for uploading images is /upload, so the event handler for the button becomes:

    import axios from "axios";

    function handleClick(e) {
      hide("uploadInput");
      hide("submitBtn");
      show("uploadingSpan");

      const data = new FormData();
      data.append('image', image);

      const config = {
        headers: {
          'content-type': 'multipart/form-data'
        }
      };

      axios.post("http://localhost:5000/upload", data, config)
        .then(res => {
          // Do something
        });
    }

In the code above, clicking the button creates a FormData object and adds the image to it under the field name “image”. Next, we create a config object that holds the request headers and sets content-type to “multipart/form-data”, the encoding that multer, which the server uses, expects. Finally, we call axios.post() with the data and the config to upload the image to the server.
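The handler above only covers the case where the upload succeeds. As a rough sketch, assuming you also want to report a failed upload to the user, the request can be wrapped in a helper that returns both outcomes. The names uploadImage and post are hypothetical; post stands in for axios.post so the helper can be tried without a running server, and the example assumes an environment where FormData and Blob are available globally (browsers and recent Node.js versions).

```javascript
// Hypothetical helper, not part of the original project.
// `post` is expected to behave like axios.post: (url, data, config) => Promise<{ data }>.
async function uploadImage(image, post, url = "http://localhost:5000/upload") {
  const data = new FormData();
  data.append("image", image); // "image" is the field name used in the article's code

  const config = { headers: { "content-type": "multipart/form-data" } };

  try {
    const res = await post(url, data, config);
    return { ok: true, text: res.data };
  } catch (err) {
    // With axios, network errors and (by default) non-2xx responses reject the promise.
    return { ok: false, error: err.message };
  }
}

// Usage with a stand-in for axios.post that always succeeds.
const fakePost = async () => ({ data: "text from the image" });
uploadImage(new Blob(["fake image bytes"]), fakePost).then((result) => {
  console.log(result.ok, result.text);
});
```

In the app itself, the call would look like uploadImage(image, axios.post), and the error branch could update the “Uploading the image…” span with a failure message.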

Text to speech

To convert the text received from the server to speech, we will use the SpeechSynthesis interface of the Web Speech API. With it, we can make the computer speak, and pause and resume the speech, as shown below:

    // Text to speech
    const synth = window.speechSynthesis;

    function speak(text) {
      const utterThis = new SpeechSynthesisUtterance(text);
      synth.speak(utterThis);
    }

    function pause() {
      hide("pauseBtn");
      show("resumeBtn");
      synth.pause();
    }

    function resume() {
      hide("resumeBtn");
      show("pauseBtn");
      synth.resume();
    }

    ...

    <button id="pauseBtn" onClick={pause}>PAUSE</button>
    <button id="resumeBtn" onClick={resume} className="hidden">RESUME</button>

    ...

In the code above, we have added the function speak(), which takes the text to say as its argument and passes it to the speech API to be read out. We have also added the functions pause() and resume(), along with two buttons that call them while the computer is speaking. Only one of the two buttons is visible at a time.
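A SpeechSynthesisUtterance also has properties such as rate and lang, and an onend event handler. Here is a minimal sketch, assuming you want to change the speaking rate and know when the reading has finished. The helper speakWith and its options are hypothetical. The synthesizer and the utterance constructor are passed in as arguments; in the browser these would be window.speechSynthesis and SpeechSynthesisUtterance.

```javascript
// Hypothetical helper, not part of the original project.
// In the browser:
//   speakWith(text, { rate: 0.9 }, window.speechSynthesis, SpeechSynthesisUtterance)
function speakWith(text, options, synth, Utterance) {
  const utterance = new Utterance(text);
  utterance.rate = options.rate ?? 1;       // 1 is the default speaking rate
  utterance.lang = options.lang ?? "en-US"; // assumed language of the image text
  if (options.onEnd) {
    utterance.onend = options.onEnd;        // called when the utterance has been spoken
  }
  synth.cancel();                           // drop anything still queued from an earlier image
  synth.speak(utterance);
  return utterance;
}
```

Passing the two browser objects in, rather than reading them from window inside the function, is a design choice made here so the helper can be exercised with stand-ins outside a browser.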

We will now call speak() in the then() callback of the axios POST request, so that the computer reads out the response as soon as the server returns it:

    ...
    axios.post("http://localhost:5000/upload", data, config)
      .then(res => {
        ...
        speak(res.data);
        ...
      });
    ...
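The OCR output shown in the recap contains hard line breaks and a word split across two lines (“por-” and “trait”), which a speech engine may read awkwardly. The following is a minimal sketch, assuming the server returns the text as a plain string and you want to tidy it before reading it out. The helper normalizeForSpeech is hypothetical and not part of the original project.

```javascript
// Hypothetical helper, not part of the original project.
function normalizeForSpeech(text) {
  return text
    .replace(/(\p{L})-\r?\n(\p{L})/gu, "$1$2") // rejoin words hyphenated across a line break
    .replace(/\s+/g, " ")                       // collapse line breaks and repeated spaces
    .trim();
}

console.log(normalizeForSpeech("a framed group por-\ntrait, you can\neasily count"));
// a framed group portrait, you can easily count
```

The call in the callback would then be speak(normalizeForSpeech(res.data)). Note that this simple rule also joins genuinely hyphenated compounds that happen to break at the end of a line, so treat it as a starting point.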

Wrapping up

We have written a React.js application that lets users upload an image, receives the text in the image from the server and reads it out. You can find the full code for the whole project here. Here is a partial recording of the computer reading out the text in the test image above:

Thanks for following this far. If you have any comments or suggestions, please drop them below.
