By An T. Nguyen, Matthew Halpern, Byron C. Wallace, Matthew Lease
University of Texas at Austin and Northeastern University
Imagine you are a graphic designer working on a logo for a major corporation. You have a preliminary design in mind, but now you need to hammer out the fine-grained details, such as the exact font, sizing, and coloring to use. In the past you might have solicited feedback from colleagues or run in-person user studies, but you have recently heard about the power of the “crowd” and you are intrigued. You launch an Amazon Mechanical Turk task asking workers to rate different logo permutations on a 1-5 scale, where 1 and 5 stars are the least and most satisfactory ratings, respectively.
You now have a rich multivariate dataset of workers’ opinions, but did all workers undertake the task in good faith? For example, a disinterested worker could have just picked a score at random, while a more malicious worker could have intentionally picked the opposite of their honest score. You would like to detect any such cases and filter them out of your data. You might even want to extrapolate from the collected scores to other logo variants that were never scored. How could you do any of this?
Accurately modeling data quality is key, but this is a challenging, partially-subjective rating scenario in which each response is open-ended yet partially constrained. In the logo example, the worker may select any score in the range 1-5, and personal opinions may differ significantly on the score for a specific logo design. This is in contrast to conventional crowdsourcing tasks, such as image tagging or finding email addresses, where one expects a single, correct answer to each question being asked. While a great deal of prior work has proposed ways to model data for such objective tasks, far less work has considered modeling data quality and worker performance in more subjective task scenarios.
The key observation we make in our paper is that worker data for these partially-subjective tasks, where worker labels are ordinal (e.g., scores from one to five), are heteroscedastic in nature: the variability of scores can itself vary across instances. We therefore propose a probabilistic, heteroscedastic model in which both the means and the variances of worker responses are modeled as functions of instance attributes. Consider the ratings as the font size of a logo is varied. We would expect most workers to give the logos with the smallest and largest font sizes low scores, whereas scores for the middle range of font sizes would be more varied.
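To make the idea concrete, here is a minimal sketch of a heteroscedastic fit, not the paper’s actual model: both the mean and the log-variance of ratings are treated as quadratic functions of a single instance attribute (say, font size), and the parameters are learned by maximizing a Gaussian likelihood. The synthetic data, the quadratic functional forms, and all variable names are illustrative assumptions.

```python
# Illustrative heteroscedastic fit: mean AND variance depend on an attribute.
# (A sketch under simplifying assumptions, not the paper's model.)
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic "font size" attribute, scaled to [0, 1].
x = rng.uniform(0, 1, 500)
# True mean peaks mid-range; true spread (disagreement) also peaks mid-range.
true_mean = 1 + 8 * x * (1 - x)
true_std = 0.3 + 1.5 * x * (1 - x)
y = rng.normal(true_mean, true_std)  # simulated worker ratings

def nll(params):
    """Negative Gaussian log-likelihood with input-dependent mean/variance."""
    a0, a1, a2, b0, b1, b2 = params
    mu = a0 + a1 * x + a2 * x**2          # mean as a function of x
    log_var = b0 + b1 * x + b2 * x**2     # log-variance as a function of x
    return 0.5 * np.sum(log_var + (y - mu) ** 2 / np.exp(log_var))

res = minimize(nll, np.zeros(6), method="BFGS")
a0, a1, a2, b0, b1, b2 = res.x

def predicted_std(xq):
    """Fitted rating standard deviation at attribute value xq."""
    return np.exp(0.5 * (b0 + b1 * xq + b2 * xq**2))
```

After fitting, `predicted_std(0.5)` comes out larger than `predicted_std(0.05)`: the model has learned that mid-range font sizes draw more varied scores, which a homoscedastic (constant-variance) model cannot express.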
We demonstrate the effectiveness of our model on a large dataset of nearly 25,000 Mechanical Turk ratings of user experience when viewing videos on smartphones with varying hardware capabilities. Our results show that our method is effective both at predicting user ratings and at detecting unreliable respondents, which is particularly valuable and little studied for subjective tasks, where there is no clear, objective answer.
The link to our full paper is below, along with links to shared data and code, and we welcome any comments or suggestions!
An Thanh Nguyen, Matthew Halpern, Byron C. Wallace, and Matthew Lease. Probabilistic Modeling for Crowdsourcing Partially-Subjective Ratings. In Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2016. 10 pages. [ bib | pdf | data | sourcecode ]