As crowdsourcing evolves, the crowds are evolving too. Mechanical Turk is a different population than it was a few years ago, and there are different crowds at different times of day. Different platforms also host different crowds, each better or worse suited to a given application: CrowdFlower, MobileWorks, or even a crowd of employees within a company or students within a school.
In particular, how can researchers and developers cooperate to collect aggregate data about system properties (e.g. latency, throughput, noise), demographics (gender, age, socioeconomic level), and human performance (motor, perceptual, attention) for the various crowds that they use?
We started exploring this question in a weekend CrowdCamp hackathon at CSCW 2013. Some concrete steps and discoveries included:
- We gathered 25 datasets from a wide variety of Mechanical Turk experiments run by many different researchers between 2008 and 2013. The sample contained 30,000 unique workers, and in the most recent datasets, 20% to 40% of the workers had also contributed to earlier datasets. So at least on MTurk, the crowd is stable enough that benchmarking across researchers is a viable idea.
- We trawled the recent research literature for possible benchmarking tasks, including affect detection, image tagging, and word sense disambiguation.
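The worker-overlap analysis above can be sketched as a simple set computation: walk the datasets in chronological order and, for each one, measure what fraction of its workers appeared in any earlier dataset. The dataset names and worker IDs below are hypothetical placeholders, not the actual data from the hackathon.

```python
def overlap_with_previous(datasets):
    """For each (name, workers) dataset in chronological order, report
    the fraction of its workers seen in any earlier dataset."""
    seen = set()
    fractions = {}
    for name, workers in datasets:
        workers = set(workers)
        if seen and workers:
            fractions[name] = len(workers & seen) / len(workers)
        seen |= workers
    return fractions

# Hypothetical example: three studies with partially overlapping workers.
datasets = [
    ("study_2008", ["A1", "A2", "A3"]),
    ("study_2010", ["A2", "A4", "A5", "A6"]),
    ("study_2013", ["A4", "A6", "A7", "A8", "A9"]),
]
print(overlap_with_previous(datasets))
# → {'study_2010': 0.25, 'study_2013': 0.4}
```

With real data, each dataset would be the set of worker IDs recorded in one experiment's results file; the 20%-40% figures reported above correspond to these per-dataset overlap fractions.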
We also discovered that Mechanical Turk worker IDs are not as anonymous as researchers generally assume. For benchmarking that shares information among researchers, it will be necessary to take additional steps to protect worker privacy while preserving the ability to connect the same workers across studies.
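One way to protect worker privacy while preserving cross-study linkability is to replace raw worker IDs with keyed-hash pseudonyms before sharing data. This is a sketch of that idea, not a scheme the hackathon adopted; the secret key here is a placeholder, and in practice it would be held by whoever coordinates the shared benchmark rather than published.

```python
import hashlib
import hmac

# Hypothetical shared secret; in practice, generated and guarded by the
# benchmarking coordinator so raw IDs cannot be recovered from pseudonyms.
SECRET_KEY = b"replace-with-a-real-shared-secret"

def pseudonymize(worker_id: str) -> str:
    """Map a raw worker ID to a stable pseudonym via HMAC-SHA256.
    The same worker gets the same pseudonym in every study, but the
    raw ID is unrecoverable without the key."""
    digest = hmac.new(SECRET_KEY, worker_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability

# Deterministic across studies; distinct workers stay distinct.
assert pseudonymize("A1EXAMPLEID") == pseudonymize("A1EXAMPLEID")
assert pseudonymize("A1EXAMPLEID") != pseudonymize("A2EXAMPLEID")
```

A plain unkeyed hash would not suffice: worker IDs are short enough that anyone could enumerate candidate IDs and reverse the mapping, which is why the keyed variant is used here.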
Saeideh Bakshi, Georgia Tech
Michael Bernstein, Stanford University
Jeff Bigham, University of Rochester
Jessica Hullman, University of Michigan
Juho Kim, MIT CSAIL
Walter Lasecki, University of Rochester
Matt Lease, University of Texas Austin
Rob Miller, MIT CSAIL
Tanushree Mitra, Georgia Tech