BSpeak: An Accessible Voice-Based Crowdsourcing Marketplace for Low-Income Blind People
About 90% of the world’s 285 million visually impaired people live in low-income settings. India, a fast-developing economy, is home to over 63 million visually impaired people, the largest share among all countries. Over 70% of India’s visually impaired population lives in rural regions, almost 50% are illiterate, and a majority live in poverty with limited job opportunities.
In recent years, mainstream crowdsourcing marketplaces like Amazon Mechanical Turk (MTurk) and CrowdFlower have enhanced the earning potential of people by incentivizing them to perform time-sensitive microtasks such as image and keyword tagging, translation, and transcription. Unfortunately, visually impaired people in India are often unable to use these online marketplaces because of constrained access to internet-connected computers, lack of access to formal banking services, and limited English language skills. Even in the United States, where most of these structural limitations are absent, Zyskowski et al. reported that blind MTurk workers face inordinate accessibility barriers: incomplete task descriptions, inaccessible task features, time limits that can lead to poor ratings when accessibility challenges cause the allotted time to expire, and visual CAPTCHAs in the sign-up and task selection process.
In this work, we focused on the assets of low-income, low-literate blind people (e.g., access to mobile phones instead of computers, fluent listening and speaking skills instead of typing) to design a new accessible phone-based crowdsourcing marketplace for them. We drew inspiration from the voice-based design of Respeak — our prior work published at CHI 2017 — to create BSpeak, an accessible phone-based microtasking platform for speech transcription. BSpeak uses a five-step process to obtain a transcription for an audio file, as illustrated in Figure 1.
1. Audio Segmentation: Based on the speaking rate and occurrence of natural pauses in an audio file, the BSpeak engine segments the audio file to yield short segments, typically three to six seconds in length, that are easier for crowd workers to remember.
2. Distribution to Crowd Workers: Each audio segment is randomly distributed to multiple BSpeak application users.
3. Transcription by Crowd Workers using ASR: A BSpeak user listens to a segment and then re-speaks the content into the application. The application uses the built-in Android ASR engine to convert the audio input into a transcript that is read aloud to the user. If the transcript is similar to the audio segment, the user submits it and receives a new segment. The transcript is expected to contain several errors, since the user may not fully understand the audio content or the ASR engine may recognize some words incorrectly.
4. First-Stage Merging: Once a predefined number of users have submitted their individual transcripts for a particular segment, the BSpeak engine merges their transcripts using multiple string alignment and a majority voting process to obtain a best estimation transcript. If speech recognition errors are randomly distributed, aligning transcripts reduces the errors since the correct word is recognized for the majority of the users. The BSpeak engine then compares individual transcripts to the best estimation transcript to determine users’ reward for that particular task. Once the cumulative amount earned by a user reaches 10 INR, a digital payment using Paytm is sent to the user.
5. Second-Stage Merging: The BSpeak engine concatenates the best estimation transcripts of all segments into one large file to yield the final transcript for the original audio file.
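
The two merging steps can be illustrated with a small Python sketch. It is not the BSpeak engine's implementation: the text above names multiple string alignment and majority voting but gives no further detail, so this sketch substitutes a simpler pivot-based alignment built on Python's standard difflib module. The function names (align_to_pivot, merge_by_majority, similarity, concatenate_segments) are hypothetical, and the similarity score is only an assumed stand-in for whatever comparison the engine uses to determine rewards.

```python
from collections import Counter
from difflib import SequenceMatcher


def align_to_pivot(pivot, other):
    """For each word position in `pivot`, return the word `other` has there (or None)."""
    aligned = [None] * len(pivot)
    matcher = SequenceMatcher(a=pivot, b=other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            # Pair words position by position as far as both sides reach.
            for k in range(min(i2 - i1, j2 - j1)):
                aligned[i1 + k] = other[j1 + k]
        # "delete": `other` has no word here, so the slot stays None.
        # "insert": extra words in `other` have no pivot slot and are ignored
        # (a simplification of true multiple string alignment).
    return aligned


def pick_pivot(word_lists):
    """Pick the transcript that is most similar, in total, to all transcripts."""
    def total_similarity(words):
        return sum(
            SequenceMatcher(a=words, b=other, autojunk=False).ratio()
            for other in word_lists
        )
    return max(word_lists, key=total_similarity)


def merge_by_majority(transcripts):
    """First-stage merge (assumed approach): vote word by word across users."""
    word_lists = [t.lower().split() for t in transcripts]
    pivot = pick_pivot(word_lists)
    columns = [align_to_pivot(pivot, words) for words in word_lists]
    merged = []
    for pos in range(len(pivot)):
        votes = Counter(column[pos] for column in columns)
        word, _count = votes.most_common(1)[0]
        if word is not None:  # a majority of None means "no word here"
            merged.append(word)
    return " ".join(merged)


def similarity(transcript, best_estimate):
    """Word-level similarity in [0, 1]; an assumed basis for a user's reward."""
    return SequenceMatcher(
        a=transcript.lower().split(),
        b=best_estimate.lower().split(),
        autojunk=False,
    ).ratio()


def concatenate_segments(best_estimates):
    """Second-stage merge: join per-segment best estimates in segment order."""
    return " ".join(best_estimates)


# Three users re-speak the same segment; each ASR output has at most one error.
submissions = [
    "the weather is nice today",
    "the whether is nice today",
    "the weather is mice today",
]
best = merge_by_majority(submissions)
final = concatenate_segments([best, "we will go outside"])
```

In this example each recognition error appears in only one of the three submissions, so the vote at every position recovers the intended word. That is the situation the first-stage merging step relies on when errors are randomly distributed across users.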
We evaluated the accessibility and usability of BSpeak and compared its affordances and limitations with those of MTurk. To do so, we designed accessible speech transcription and information retrieval tasks on MTurk and BSpeak and asked 15 low-income blind people to complete them on both platforms. Our mixed-methods analysis indicated that even though the MTurk task had accessible features and clear instructions, participants experienced several accessibility and usability barriers due to MTurk’s inaccessible underlying implementation. A poorly designed user interface (UI), unlabeled UI elements, improper use of HTML headings, and the absence of landmarks worsened their user experience. In contrast, participants found BSpeak significantly more accessible and usable. They completed significantly more tasks in less time, with higher performance and lower mental demand, frustration, and effort. They commended its easy-to-understand layout and considered the ability to transcribe audio files by speaking to be BSpeak’s key strength. They also preferred generating transcripts by speaking because it avoids the spelling errors that typing might introduce given the constrained space on the phone’s keyboard.
Finally, we conducted a two-week field deployment with low-income blind people from rural and peri-urban India to examine BSpeak’s acceptability and its feasibility as a way to supplement their income. During the deployment, 24 BSpeak users collectively completed over 16,000 tasks to earn ₹7,310 (~USD 111). The expected payout per hour of application use was ₹36, comparable to the average hourly wage rate in India. The BSpeak engine produced transcriptions with about 90% accuracy. The cost of transcription was USD 1.20 per minute, one-fourth of the industry standard for transcription of audio files in local languages and accents. Our mixed-methods analysis indicated that BSpeak enhanced the earning potential of blind people. Users also self-reported improvements in general knowledge and language skills.