Let’s create a screen reader!

Tomasz Jakut
Published in Content Uneditable
Feb 23, 2017

Some time ago a colleague asked me why I hadn't created my own screen reader yet. Honestly, I'd never thought about doing it. Screen readers are so complex that getting them right is extremely difficult (which becomes obvious when we compare how the biggest screen readers on the market, e.g. VoiceOver and JAWS, interpret the same web content differently). But the main reason why CSR (Comandeer's Screen Reader ;) is still not here is that until recently web technologies simply didn't allow us to read text aloud.

But this has changed: there's now the Web Speech API (which includes speech synthesis) for translating written text into speech. Support is surprisingly good, especially for a specification that's still not an official standard. So if we can now implement the main part of a screen reader in practically every browser, we might as well do some experimenting. I've created an example implementation (I'm sure you can guess the one environment in which it doesn't work), which I'm going to describe as we go along.

Speech synthesis is the easiest part of creating a screen reader, as it's just a matter of calling the speechSynthesis.speak method. But before we can do that, we have to:

  • identify elements that could be read aloud;
  • select the appropriate element;
  • specify how to read that element to the user.

Let’s take a closer look at these activities.

Fetching readable elements

What do we want to read to the user? "Everything" is not a satisfactory answer. Not everything on a page is meant to be read, e.g. elements in head (except title, of course). The user wouldn't be delighted to listen to the content of meta tags. It seems wise to restrict reading to the content of body.

However, fetching everything from body is still not enough, because we could encounter e.g. script, which is irrelevant to a screen reader user. Additionally, thanks to the RDFa standard, we could also encounter meta[property]. That's why we should skip such elements. But even that's not enough, as some elements are intentionally invisible: they use display: none, visibility: hidden or [hidden]. There's also [aria-hidden=true], which forces a given element to be skipped by a screen reader.

If we look at the above list of elements, we could divide them into 3 groups:

  • invisible elements;
  • elements with [aria-hidden=true] (visible in a “normal” browser, invisible to a screen reader);
  • "normal" elements.

Only the last group is of interest to us; the others can be safely omitted. The easiest way to get all relevant elements seems to be fetching all elements without the [aria-hidden=true] attribute from body and then filtering out all the hidden ones. That's exactly what the createFocus function does:
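A minimal sketch of such a function, reconstructed from the description below (assuming a browser environment; the original listing may differ in details):

```javascript
// Hypothetical reconstruction; the real createFocus may differ in details.
function createFocus() {
    // The html element represents the entire page; everything else comes
    // from body, minus elements explicitly hidden from assistive technology.
    return [
        document.documentElement,
        ...document.querySelectorAll('body *:not([aria-hidden=true])')
    ].filter((element) => {
        const styles = getComputedStyle(element);

        // Computed styles also catch elements hidden by the browser itself
        // (script, meta, [hidden] etc.), not only by the page author.
        return styles.display !== 'none' && styles.visibility !== 'hidden';
    });
}
```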

It gets all elements from body that don't have the [aria-hidden=true] attribute using document.querySelectorAll, then converts this collection to an array with the help of the spread operator and filters it using Array.prototype.filter. The filter condition is a simple check whether display: none or visibility: hidden is part of the element's computed styles. Computed styles are the actual styles of the element, after the browser's default styles and the styles defined by the page author have been applied. Thanks to this we can filter out hidden elements that use inline styles, classes or attributes like [hidden], but also elements that are not rendered by the browser at all (meta, script etc.; yes, browsers hide them using display: none, and yes, it's possible to override that). Additionally, we also fetch the html element, which will represent the entire page.

It's a naive approach which doesn't take more complex cases into account (nested section elements, treating p > span as just a part of a paragraph, [role=presentation] etc.); however, it's quite enough to start experimenting on simple sites. If we wanted to develop this project further, we'd also need to handle such complex cases. At the end of the day, the structure of a real web page is never that simple.

Selecting the element

Now that we have the list of elements that can be read, it's time to think about how to allow the user to move between them and read them. In a "normal" browser, moving between interactive elements is done with the Tab key, so I came to the conclusion that the screen reader could stick to [screen reader modifier key] + Tab (in the demo, the modifier key is set to Alt). What's more, the screen reader can't be restricted to interactive elements only; it should allow navigating between all readable elements. That's why moving one element forward inside focusList after pressing Alt + Tab seems to be the easiest way to implement it. Similarly, pressing Alt + Shift + Tab should move one element backwards inside focusList (just like Shift + Tab moves focus to the previous interactive element).

It's done using a simple moveFocus function, which takes an offset as its parameter and moves inside focusList accordingly (if the number is negative, the function moves backwards):
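A sketch of what such a function could look like (focusList and the focus helper are assumed to exist elsewhere; the wrap-around behaviour is my assumption):

```javascript
// A sketch; focusList and the focus helper are assumed to exist elsewhere.
let currentIndex = 0;

function moveFocus(offset) {
    const { length } = focusList;

    // Wrap around both ends of the list, so moving forward from the last
    // element lands on the first one (and vice versa).
    currentIndex = ((currentIndex + offset) % length + length) % length;

    focus(focusList[currentIndex]);
}
```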

However, it's also wise to handle pressing Tab or Shift + Tab on their own. In that case we know which element should gain focus, but we don't know its position on the list, so we have to find it first and only then select the element itself:
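A possible shape of that handler (the focusElement name and the early return for untracked elements are my assumptions):

```javascript
// A sketch; focusList, currentIndex and the focus helper are assumed
// to exist elsewhere.
function focusElement(element) {
    const index = focusList.findIndex((found) => found === element);

    // Tab can land on an element we don't track; in that case leave
    // the screen reader's cursor where it was.
    if (index === -1) {
        return;
    }

    currentIndex = index;
    focus(element);
}
```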

I’m using Array.prototype.findIndex for this.

Let’s look at the focus function:
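It might be sketched roughly like this (announceElement is assumed to be defined elsewhere; clearing the previous element's attribute inside this function is one possible place for it):

```javascript
// A sketch; announceElement is assumed to be defined elsewhere.
function focus(element) {
    // Clear the cursor indicator from the previously selected element.
    const previous = document.querySelector('[data-sr-current]');

    if (previous) {
        previous.removeAttribute('data-sr-current');
    }

    // Mark the new element (CSS uses this attribute to draw the border),
    // move the native focus to it and read it aloud.
    element.setAttribute('data-sr-current', 'true');
    element.focus();
    announceElement(element);
}
```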

This function sets the element's [data-sr-current] attribute, which is used mainly to style it (adding a thick, black border around it), focuses the element itself and starts reading it aloud (announceElement).

If we restricted ourselves to just setting [data-sr-current], our screen reader would read the element, but the focus could be in a totally different place. That could lead to a situation where the user hears about a link but, when trying to activate it, actually activates a button in a totally different area of the site. That's why it's necessary to move the native focus along with the screen reader's "cursor". A simple yet effective solution.

However, there's still another issue: Chrome loses focus and doesn't remember which element was the last active one, especially if it wasn't interactive. If the user alternates between Alt + Tab and Tab alone, it could mess up the interaction with the page. Fortunately, there's a simple remedy for this problem, which we'll apply when creating focusList:
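The workaround could be sketched as follows (the fixTabIndex name is mine; the idea is simply to reflect the already-computed tabIndex property back as an explicit attribute):

```javascript
// A sketch; the fixTabIndex name is an assumption. Reflecting the computed
// tabIndex property back as an explicit attribute is enough for Chrome
// to keep focus on non-interactive elements.
function fixTabIndex(focusList) {
    focusList.forEach((element) => {
        element.setAttribute('tabindex', element.tabIndex);
    });
}
```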

Yep, Chrome (just like other browsers) correctly sets the element's tabIndex property, yet it also requires the [tabindex] attribute to be set explicitly in order to work properly. Fortunately, this little trick fixes the issue.

There are only two things left: removing the selection from the previous element when moving to the next one, and setting the cursor position to the currently focused element after switching the screen reader on. The first is done by simply calling element.removeAttribute, and the latter is just moving the cursor to document.activeElement.

Once again: a naive approach which (surprisingly) works really well.

Reading the element

The biggest issue with reading elements is how to actually read them. That's why I came to the conclusion that it would be best to… not read elements at all. No, I'm not mad: I've just done it the way described by the W3C standards!

I've replaced elements with roles. The ARIA standard defines many roles that can be applied to HTML elements, such as banner (the page's main header), navigation (an obvious one), main (the main content of the page) or button and link. Every element that is of interest in terms of accessibility has its own role, representing it inside the accessibility tree. Other, irrelevant elements are transparent and often represented just by their content (Chrome usually represents them as group, the most generic role). There is also a detailed list of default roles for all defined HTML elements.

Since, e.g., a p element can be turned into a heading using [role=heading], a reading algorithm based purely on HTML elements is a big no-no. Sooner or later we would be forced to switch to roles, so why not just ditch HTML completely? All we need is a map binding default roles to HTML elements and we're done! Of course, we can't forget about the ability to override such a binding using [role]:
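Such a map might look like the deliberately tiny sketch below (a real one would cover all HTML elements, and e.g. treat a as a link only when it has an href):

```javascript
// A deliberately tiny sketch of such a map; a real one would cover all
// HTML elements (and e.g. treat a as a link only when it has an href).
const defaultRoles = {
    a: 'link',
    button: 'button',
    h1: 'heading',
    h2: 'heading',
    header: 'banner',
    html: 'document',
    main: 'main',
    nav: 'navigation'
};

function getRole(element) {
    // An explicit [role] attribute always wins over the default binding;
    // elements without any role fall back to the generic group.
    return element.getAttribute('role') ||
        defaultRoles[element.tagName.toLowerCase()] ||
        'group';
}
```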

For every single role we can implement a different reading function, which allows us to convey extra information for certain elements, e.g. information about keyboard usage for buttons or about the level of headings (the number in the h1–h6 tag name):
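For instance (a sketch; say, computeAccessibleName and getRole are assumed helpers, and the default heading level of 2 follows the ARIA spec):

```javascript
// A sketch; say, computeAccessibleName and getRole are assumed helpers.
const readers = {
    button(element) {
        say(`Button, ${computeAccessibleName(element)}. To press it, use Space or Enter.`);
    },
    heading(element) {
        // The level comes from [aria-level] or the h1–h6 tag name;
        // 2 is the default level defined by the ARIA spec.
        const level = element.getAttribute('aria-level') || element.tagName[1] || 2;

        say(`Heading, level ${level}, ${computeAccessibleName(element)}.`);
    },
    link(element) {
        say(`Link, ${computeAccessibleName(element)}.`);
    }
};

function announceElement(element) {
    // Roles without a dedicated reader are read just by their name.
    const read = readers[getRole(element)] || ((el) => say(computeAccessibleName(el)));

    read(element);
}
```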

You're probably curious what the computeAccessibleName function does. According to the ARIA specification, elements have a so-called accessible name: the text used by the browser to identify the element inside the accessibility tree. In the case of input it's the content of the associated label element, in the case of img it's the [alt] attribute, etc. In most cases the element is identified by its own content. However, this can be changed using [aria-label] or [aria-labelledby].

The computeAccessibleName function looks like this:
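A simplified sketch of it could be:

```javascript
// A simplified sketch of the accessible name computation.
function computeAccessibleName(element) {
    // [aria-label] has a higher priority than [alt]; the element's own
    // text content is the last resort.
    return element.getAttribute('aria-label') ||
        element.getAttribute('alt') ||
        element.textContent.trim();
}
```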

Another very simple implementation, which omits many details and uses only [aria-label], [alt] and the element's content (fun fact: [aria-label] has a higher priority than [alt]).

It should suffice for a basic implementation. However, for a more polished one, we should also check whether any part of the element should be omitted (because, e.g., it's hidden via [aria-hidden=true]) or whether the element has an additional description provided by [aria-describedby], etc.

Speaking

And now, after all these preparations, we can say something out loud:
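A sketch of such a say helper (the optional onEnd callback parameter is my assumption):

```javascript
// A sketch; the optional onEnd callback is an assumption.
function say(text, onEnd) {
    const utterance = new SpeechSynthesisUtterance(text);

    // React to the moment when the whole text has been read out.
    if (onEnd) {
        utterance.addEventListener('end', onEnd);
    }

    // Stop whatever is being read right now (e.g. a long list of links)
    // before starting to read the new element.
    speechSynthesis.cancel();
    speechSynthesis.speak(utterance);
}
```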

The text we want to read has to be wrapped in an instance of the SpeechSynthesisUtterance class. This allows us to react to various events; e.g. we attach a listener for the end event, which fires when a given element has been fully read. Calling speechSynthesis.cancel, on the other hand, stops whatever is currently being read (so we can skip reading a long list of links when we're already on the next element).

At last! We're saying something out loud using the say method. Wow!

Creating a primitive screen reader is not a big deal. However, doing it right, especially using the DOM, is not an easy task. We're forced to manually parse all the elements and extract information from them that the browser already exposes in the accessibility tree. On the other hand, we can also fetch information that browsers don't expose yet.

The biggest drawback of such an attempt at creating a screen reader is that it works only inside a web page, so it's not a real competitor for VoiceOver or JAWS. What it could be, though, is a very nice experiment, or some sort of tool for testing a page's accessibility and comparing other screen readers' behaviour or browsers' accessibility tree implementations (provided it's done properly).

This would require much time, so I’ll end this article with a question: is anyone willing to help me out?

Note: This article was originally posted in Polish on Comandeer’s blog.
