Respeak: Voice-based Crowd-powered Speech Transcription System

Recent years have seen the rise of crowdsourcing marketplaces like Amazon Mechanical Turk and CrowdFlower that provide people with additional earning opportunities. However, low-income, low-literate people in resource-constrained settings are often unable to use these platforms because they face a complex array of socioeconomic barriers, literacy constraints, and infrastructural challenges. For example, 97% of households in India do not have access to an Internet-connected computer, 47% of the population does not have access to a bank account, and around 72% are not literate in English.

To provide additional earning opportunities to low-income people with limited language and digital literacy skills, we designed, built, and evaluated Respeak, a voice-based, crowd-powered speech transcription system that combines the benefits of crowdsourcing with automatic speech recognition (ASR) to transcribe audio files in local languages like Hindi and in localized accents of well-represented languages like English. Respeak allows people to use their basic spoken language skills, rather than typing skills, to transcribe audio files. Respeak processes each audio file through a sequence of segmentation and merging steps:

  • Segmentation: The Respeak engine segments an audio file into utterances that are each three to six seconds long (see the pause-based segmentation sketch below).
  • Distribution to Crowd Workers: Each audio segment is sent to multiple Respeak smartphone application users who listen to the segment and re-speak the same words into the application in a quiet environment.
  • Transcription using ASR: The application uses Google’s built-in Android speech recognition API to generate an instantaneous transcript for the segment, albeit with some errors. The user then submits this transcript to the Respeak engine (a stand-in sketch of the recognition step appears below).
  • First-stage Merging: For each segment, the Respeak engine combines the transcripts submitted by different users into one best estimation transcript using multiple string alignment and majority voting (see the merging sketch below). If errors are randomly distributed, aligning transcripts generated by multiple people reduces the word error rate (WER). Each submitted transcript earns a Respeak user a mobile talktime reward that depends on the similarity between the transcript they submitted and the best estimation transcript generated by Respeak. Once a user’s cumulative earnings reach 10 INR, a mobile talktime transfer of the same value is sent to them.
  • Second-stage Merging: Finally, the engine concatenates the best estimation transcript for each segment to yield a final transcript.
Respeak System Overview
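The paper describes the segmentation step only at this level of detail. The snippet below is a minimal sketch of how pause-based segmentation of this kind could look in Python using the pydub library; the file name, silence threshold, and 300 ms pause length are illustrative assumptions rather than values from the Respeak engine.

```python
# Sketch of pause-based segmentation (an assumption, not the Respeak engine's code):
# cut the audio at natural pauses so that each segment stays within roughly 3-6 seconds.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

MAX_SEGMENT_MS = 6000   # upper bound on segment length (6 seconds)
MIN_SILENCE_MS = 300    # a pause of at least 300 ms counts as a natural break (assumed)

def segment_audio(path, silence_thresh_db=-40):
    audio = AudioSegment.from_file(path)
    # [start, end] millisecond ranges of speech between pauses
    spans = detect_nonsilent(audio, min_silence_len=MIN_SILENCE_MS,
                             silence_thresh=silence_thresh_db)
    segments, cur_start, cur_end = [], None, None
    for start, end in spans:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= MAX_SEGMENT_MS:
            cur_end = end                          # extend the segment across the pause
        else:
            segments.append(audio[cur_start:cur_end])
            cur_start, cur_end = start, end
    if cur_start is not None:
        segments.append(audio[cur_start:cur_end])
    # Note: a speech span longer than the cap is kept whole here; a real engine
    # would need an additional rule to split such spans.
    return segments

# Example with a hypothetical file:
# clips = segment_audio("lecture_hindi.wav")
```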
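The app performs recognition on-device through Android’s speech recognition API. As an illustration outside Android, the following stand-in shows an equivalent recognition call from Python via the SpeechRecognition package’s Google Web Speech wrapper; the segment file name and the hi-IN language code are assumptions.

```python
# Stand-in for the app's ASR step: send one re-spoken audio segment to Google's
# web speech recognizer and get back a (possibly imperfect) transcript.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("segment_042.wav") as source:   # hypothetical re-spoken segment
    audio = recognizer.record(source)             # read the whole file

try:
    transcript = recognizer.recognize_google(audio, language="hi-IN")
except sr.UnknownValueError:                      # speech was unintelligible
    transcript = ""
except sr.RequestError:                           # recognition service unreachable
    transcript = ""

print(transcript)  # what a user would review and submit to the Respeak engine
```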
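The paper specifies multiple string alignment and majority voting for first-stage merging but not an implementation. Below is a simplified sketch that aligns each submission against the first one with Python’s difflib and takes a per-position majority vote, plus a similarity score of the kind that could drive the talktime reward. A production ROVER-style multiple alignment would be more involved; the function names and example submissions are hypothetical.

```python
# Simplified stand-in for first-stage merging: align each user's transcript against the
# first submission with difflib, then take a per-position majority vote over the words.
from collections import Counter
from difflib import SequenceMatcher

def merge_transcripts(transcripts):
    """transcripts: list of strings submitted by different users for one audio segment."""
    tokenized = [t.split() for t in transcripts]
    backbone = tokenized[0]
    votes = [Counter([word]) for word in backbone]        # one ballot box per position
    for other in tokenized[1:]:
        matcher = SequenceMatcher(a=backbone, b=other, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag in ("equal", "replace"):
                # vote this transcript's aligned words into the matching positions
                for k in range(min(i2 - i1, j2 - j1)):
                    votes[i1 + k][other[j1 + k]] += 1
    return " ".join(ballot.most_common(1)[0][0] for ballot in votes)

def similarity(user_transcript, best_estimate):
    """Word-level similarity in [0, 1]; a score like this could scale the talktime reward."""
    return SequenceMatcher(a=user_transcript.split(), b=best_estimate.split()).ratio()

# Hypothetical submissions for one segment:
submissions = [
    "the river flows to the east",
    "the river floats to the east",
    "the river flows to the east",
]
best = merge_transcripts(submissions)   # -> "the river flows to the east"
print(best, similarity(submissions[1], best))
```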

We conducted three cognitive experiments with 24 students in India to evaluate:

  • How audio segment length affects content retention and cognitive load experienced by a Respeak user
  • The impact on content retention and cognitive load when segments are presented in a sequential vs. random order
  • Whether speaking or typing is a more efficient and usable output medium for Respeak users

The experiments revealed that audio files should be partitioned at natural pauses into segments less than six seconds long. These segments should be presented sequentially to ensure higher content retention and lower cognitive load on users. Lastly, speaking outperformed typing not only on speed but also on WER, suggesting that users should complete micro-transcription tasks by speaking rather than typing.
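For reference, WER is the word-level edit distance (substitutions, insertions, and deletions) between a hypothesis transcript and the reference, normalized by the number of reference words. A minimal computation sketch with a made-up example:

```python
# Minimal word error rate (WER) computation: word-level edit distance between
# hypothesis and reference, divided by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# wer("the river flows to the east", "the river floats to east")  -> 2/6 ≈ 0.33
```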

We then deployed Respeak in India for one month with 25 low-income college students. The Respeak engine segmented 21 widely varying audio files in Hindi and Indian English into 756 short segments. Collectively, Respeak users completed 5464 microtasks to transcribe 55 minutes of audio content and earned USD 46. Respeak produced transcriptions with an average WER of 10%, at a transcription cost of USD 0.83 per minute. In addition to providing airtime, Respeak also improved users’ vocabulary and pronunciation skills. The expected payout for an hour of a user’s time was 76 INR (USD 1.16), one-fourth of the average daily wage rate in India. Since voice is a natural and accessible medium of interaction, Respeak has strong potential to be an inclusive and accessible platform. We are conducting more deployments with low-literate people and blind people to examine the effectiveness of Respeak.

For more details, please read our full paper published at CHI 2017.

Aditya Vashistha, University of Washington

Pooja Sethi, University of Washington

Richard Anderson, University of Washington