Warping Time for More Effective Real-Time Crowdsourcing

REAL-TIME TRANSCRIPTION is a vital accommodation for deaf and hard of hearing people in their daily lives. Professional captioning is typically expensive because of the years of training it requires.

LEGION:SCRIBE introduced a method that uses multiple non-experts to caption audio with high quality and low latency, at far lower cost. We recently developed TimeWarp to make the task easier for individual workers without hurting collective performance. TimeWarp makes each captionist’s job easier by selectively slowing down and speeding up the playback of the audio.

Warping time with 4 workers
An example of time warping using 4 workers. The workers each individually hear their segment (matched by color) in slow motion. However, the crowd can collectively still keep up in real-time.

OFFLINE CAPTIONISTS OFTEN SLOW DOWN AUDIO to make it easier to caption. However, this necessarily puts the worker behind real-time. That’s fine for offline captioning, but it means a single captionist can’t use the technique and still keep up with real-time speech.

TimeWarp relies on:

  • People’s ability to hear faster than they can type
  • Scribe’s need for workers to only caption a small part of what they hear

For the parts of the audio workers are asked to type, the audio is played more slowly. To catch back up with real time, the audio is played slightly faster during the parts in between, where the worker only listens for context.
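The catch-up rate follows from a simple constraint: each worker's slowed segment plus the sped-up context segments must together fit in real time. As a minimal sketch (the function name and equal-length-segment assumption are mine, not from the paper), with n workers rotating through equal segments of real duration d, playing one segment at rate r_slow and the other n−1 at rate r_fast requires d/r_slow + (n−1)·d/r_fast = n·d:

```python
def catch_up_rate(n_workers: int, slow_rate: float) -> float:
    """Rate for the (n-1) context segments so one full rotation
    still fits in real time.

    Assumes equal-length segments: each worker hears their own
    segment at slow_rate (< 1.0) and must hear the remaining
    n-1 segments fast enough to catch back up.
    """
    if not 0 < slow_rate < 1:
        raise ValueError("slow_rate must be between 0 and 1")
    denom = n_workers - 1 / slow_rate
    if denom <= 0:
        raise ValueError("slow_rate too low to catch up with this many workers")
    return (n_workers - 1) / denom

# With 4 workers and half-speed typing segments,
# the context segments play at 1.5x:
print(catch_up_rate(4, 0.5))  # → 1.5
```

Note the trade-off this exposes: with few workers or a very aggressive slowdown, the required catch-up rate grows quickly, so the slowdown a worker enjoys is bounded by how fast the context audio remains intelligible.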


  • 12.6% mean improvement in accuracy
  • 11.4% mean improvement in coverage
  • 19.1% mean reduction in latency

The surprising improvement in latency comes from workers being able to keep up with each word as it is said, rather than holding words in memory and typing them later.


For more, see our full paper, Warping Time for More Effective Real-Time Crowdsourcing.
Walter S. Lasecki, University of Rochester
Christopher D. Miller, University of Rochester
Jeffrey P. Bigham, University of Rochester



  • This is such a great example of how the crowd can function as more than the sum of its parts (i.e. doing something with n humans that’s much better than just combining the efforts of n humans)!

    Can you think of other transient phenomena (such as audio to be captioned) which could utilize a similar approach?

    • There are lots of potential uses even just in the media/format space. For instance, one of our goals is to be able to use a similar approach for video in activity recognition (Legion:AR) and blind assistance (Legion:View).

      I think that there are also other uses of the more general take-away that there are workflow aspects of real-time groups that can let them do things that single users can’t alone. One such domain might also be privacy for instance (lightly explored in our CSCW paper for Legion:AR).

      And of course lots more to come, I’m sure!