CrowdCamp Report: Finding Word Similarity with a Human Touch

Semantic similarity and semantic relatedness are features of natural language that contribute to the challenge machines face when analyzing text. Although semantic relatedness remains a complex challenge, only a few ground-truth data sets exist. We argue that the available corpora used to evaluate the performance of natural language tools do not capture all elements of the phenomenon. We present a set of simple interventions that illustrate that 1) framing effects influence similarity perception, 2) the distribution of similarity across multiple users is important, and 3) semantic relatedness is asymmetric.

A number of metrics in the literature attempt to model and evaluate semantic similarity in natural languages. Semantic similarity has applications in areas such as semantic search and text mining. Semantic similarity has long been considered a more specific concept than semantic relatedness: semantic relatedness, since it includes antonymy and meronymy, is the more general notion.

Different approaches have been attempted to measure semantic relatedness and similarity. Some methods use structured taxonomies such as WordNet; alternative approaches define relatedness between words using search engines (e.g., based on Google counts) or Wikipedia. All of these methods are evaluated based on their correlation with human ratings. Yet only a few benchmark data sets exist, one of the most widely used being the WS-353 data set [1]. As the corpus is very small and the sample size per pair is low, it is arguable whether all relevant phenomena are in fact present in the provided data set.

In this study, we aim to understand how human raters perceive word-based semantic relatedness. We argue that even simple word-based semantic similarity questions raise issues that are beyond the scope of existing test sets. Our hypotheses in this paper are as follows:

(H1) The framing effect influences similarity rating by human assessors.
(H2) The distribution of similarity rating does not follow a normal distribution.
(H3) Semantic relatedness is not symmetric. The relatedness between words (e.g., tiger and cat) yields different similarity ratings depending on word order.

To verify our hypotheses, we collected similarity ratings on word pairs from the WS-353 data set. We randomly selected 102 word pairs from WS-353 and collected similarity ratings on them through Amazon Mechanical Turk (MTurk). We collected five data sets for these 102 pairs. Each collection used a different task design and was separated into two batches of 51 word pairs each. Each batch received ratings from 50 unique contributors, so that each pair of words received 50 ratings in each condition.

The ways in which the questions were posed to the crowd workers are shown in the following figure. Each question was framed under four different conditions. The first two of these are “How is X similar to Y?” (sim) and “How is Y similar to X?” (inverted-sim). We further repeated them asking for the difference between both words (dissim and inverted-dissim, respectively). Since the scale is reversed in dissim and inverted-dissim, the dissimilarity ratings were converted into similarity ratings for comparison.

The different ways of framing each question.

We compared the distributions of similarity ratings in the original WS-353 data set and in our data set in order to confirm the framing effect. The mean of the 50 ratings was calculated for each pair in our data set and compared with the original similarity ratings in the WS-353 data set. We used exactly the same 102 word pairs from WS-353 to ensure consistency between the two settings. The distributions are significantly different (p < 0.001, paired t-test).
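
As a concrete illustration, the comparison above boils down to a paired t-test over per-pair means. The sketch below assumes (hypothetically) two CSV files with word1, word2, and rating/score columns; it is not our exact analysis script.

```python
# Minimal sketch of the comparison described above. File and column names are
# hypothetical; each crowd pair has 50 ratings that we average before testing.
import pandas as pd
from scipy.stats import ttest_rel

crowd = pd.read_csv("crowd_ratings.csv")   # assumed columns: word1, word2, rating
ws353 = pd.read_csv("ws353_subset.csv")    # assumed columns: word1, word2, score

# Mean of the 50 crowd ratings collected for each pair.
crowd_means = (crowd.groupby(["word1", "word2"])["rating"]
                    .mean()
                    .reset_index())

merged = crowd_means.merge(ws353, on=["word1", "word2"])
t, p = ttest_rel(merged["rating"], merged["score"])
print(f"paired t-test: t = {t:.3f}, p = {p:.4g}")
```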

Our preliminary results show that similarity ratings for some word pairs in the WS-353 data set do not follow a normal distribution. Some of the distributions reveal that there are different perceptions of similarity, highlighted by multiple peaks. A possible explanation is that the lower peak can be attributed to individuals who are aware of the factual difference between a “sun” or “star” and an actual planet orbiting a “star”, while the others are not.

We compared the difference between the similarity ratings of sim (dissim) and those of inverted-sim (inverted-dissim) to verify the third hypothesis. Scatter plots of the similarity ratings under the two word orders, for both the similarity and the dissimilarity questions, show that semantic relatedness under different orders does not take the same mean values, indicating that semantic relatedness is asymmetric. The asymmetric relationship appears consistently across the two question types (i.e., similarity and dissimilarity). The results show a remarkable difference between the similarity of “baby” to “mother” and the similarity of “mother” to “baby”. This indicates that the asymmetric relationship between mother and baby is reflected in the subjective similarity ratings.
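
The per-pair asymmetry check can be sketched in a few lines. The snippet below is a hypothetical illustration (file and column names are assumed, and both files are assumed to store pairs in the same canonical column order), not our analysis code.

```python
# Hypothetical sketch: compare each pair's mean rating under the sim framing
# with its mean rating under inverted-sim (word order reversed at display time,
# but stored in the same canonical word1/word2 columns here).
import pandas as pd

sim = pd.read_csv("sim_ratings.csv")            # assumed columns: word1, word2, rating
inv = pd.read_csv("inverted_sim_ratings.csv")   # same assumed columns

sim_means = sim.groupby(["word1", "word2"])["rating"].mean()
inv_means = inv.groupby(["word1", "word2"])["rating"].mean()

asymmetry = (sim_means - inv_means).sort_values()
print(asymmetry.tail(5))  # pairs such as (baby, mother) should show the largest gaps
```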

To measure inter-rater reliability, we computed Krippendorff’s alpha for both the original data set and the one we obtained in the current analysis. Krippendorff’s alpha is a statistical measure of the agreement achieved when coding a set of units of analysis in terms of the values of a variable.
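
For readers who want to reproduce this kind of reliability check, a minimal sketch using the open-source krippendorff Python package (an assumption on our part, not necessarily the tool we used) looks like this:

```python
# Sketch of the reliability computation, assuming the `krippendorff` package
# and a matrix with one row per rater and one column per word pair
# (np.nan where a rater did not judge a pair). Data below is a toy example.
import numpy as np
import krippendorff

ratings = np.array([
    [4.0, 2.5, np.nan, 5.0],
    [3.5, 2.0, 1.0,    4.5],
    [4.5, np.nan, 1.5, 5.0],
])  # 3 raters x 4 word pairs

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="interval")
print(f"Krippendorff's alpha = {alpha:.3f}")
```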


References

[1] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1), 2002.


For more, see our full paper, Possible Confounds in Word-based Semantic Similarity Test Data, published in CSCW 2017.

Malay Bhattacharyya
Department of Information Technology
Indian Institute of Engineering Science and Technology,
Shibpur
malaybhattacharyya@it.iiests.ac.in

Yoshihiko Suhara
MIT Media Lab, Recruit Institute of Technology
suharay@recruit.ai

Md Mustafizur Rahman
Information Retrieval & Crowdsourcing Lab
University of Texas at Austin
nahid@utexas.edu

Markus Krause
ICSI, UC Berkeley
markus@icsi.berkeley.edu

CrowdCamp Report: Protecting Humans – Worker-Owned Cooperative Models for Training AI Systems

Artificial intelligence is widely expected to reduce the need for human labor in a variety of sectors [2]. Workers on virtual labor marketplaces unknowingly accelerate this process by generating training data for artificial intelligence systems, putting themselves out of a job.

Models such as Universal Basic Income [4] have been proposed to deal with the potential fallout of job loss due to AI. We propose a new model where workers earn ownership of the AI systems they help to train, allowing them to draw a long-term royalty from a tool that replaces their labor [3]. We discuss four central questions:

  1. How should we design the ownership relationship between workers and the AI system?
  2. How can teams of workers find and market AI systems worth building?
  3. How can workers fairly divide earnings from a model trained by multiple people?
  4. Do workers want to invest in AI systems they train?

Crowd workers gain ownership shares in the AI they help train and reap long-term monetary gains, while requesters benefit from lower initial training costs.

AI Systems Co-owned by Workers and Requesters

  • Current model (requester-owned): Under the terms of platforms like Amazon Mechanical Turk [1], the data produced (and trained AI systems that result) are owned entirely by requesters in exchange for a fixed price paid to workers for producing that data.
  • Proposed model (worker-owned): In a cooperative model for training AI systems, workers can choose to accept a fraction of that price in exchange for shares of ownership in the resulting trained system (smaller fractions = increased ownership; a toy illustration follows this list). We can imagine interested outside investors (or even workers themselves) participating in such co-ops as well, bankrolling particular projects that have a significant chance of success.
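
As a toy illustration of the trade-off in the worker-owned bullet above (our own sketch, not a mechanism from the paper), forgone pay could be converted into ownership shares at a fixed rate:

```python
# Toy illustration (not from the paper): the pay a worker forgoes is converted
# into ownership shares of the trained system at an assumed rate.
def ownership_share(full_price: float, accepted_price: float,
                    shares_per_dollar: float = 1.0) -> float:
    """Return the number of shares earned by forgoing part of the fixed price."""
    forgone = full_price - accepted_price
    if forgone < 0:
        raise ValueError("accepted price cannot exceed the full task price")
    return forgone * shares_per_dollar

# A worker paid $0.60 instead of the full $1.00 earns shares worth the $0.40 forgone.
print(ownership_share(full_price=1.00, accepted_price=0.60))  # -> 0.4
```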

Finding and Marketing AI Systems

  • Bounties vs. marketplaces: Platforms like Kaggle and Algorithmia allow interested parties to post a bounty (reward) for a trained AI system. Risks under this model include (1) the poster may not accept their solution, (2) the poster may choose another submission over their solution, or (3) the open call may expire. Alternately, Algorithmia also provides a marketplace enabling AI systems to earn money on a per-use basis. Risks here include identifying valuable problem domains with high earning potential.
  • Online vs. offline training models: In an online payment model, workers provide answers initially, and as the AI gains confidence in its predictions, work shifts from the crowd to the AI (sketched below). In an offline payment model, the model can be marketed once it achieves sufficiently accurate predictions, or workers could market a data set rather than a fully trained AI system.
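
The online model in the last bullet can be sketched as a simple confidence-based router; the names and threshold below are illustrative assumptions, not a real platform API.

```python
# Minimal sketch of the online payment model: items are answered by the AI once
# its confidence passes a threshold, and are otherwise routed to the crowd.
from typing import Callable, Tuple

def route_task(item,
               model_predict: Callable[[object], Tuple[str, float]],
               ask_crowd: Callable[[object], str],
               confidence_threshold: float = 0.9) -> Tuple[str, str]:
    """Return (answer, source), where source is 'ai' or 'crowd'."""
    label, confidence = model_predict(item)
    if confidence >= confidence_threshold:
        return label, "ai"           # no crowd cost; royalties flow to co-op owners
    return ask_crowd(item), "crowd"  # crowd is still paid while the model is uncertain
```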

Fairly Dividing Earnings from AI Systems

  • Assigning credit: How to optimally assign credit for individual training examples is an open theoretical question. We see the opportunity for both model-specific and black-box solutions (a black-box sketch follows this list).
  • Measuring improvement: Measuring improvement to worker owned and trained AI systems will require methods that incentivize workers to provide the most useful examples, not simply ones that they may have gathered for a test set.
  • Example selection: Training examples could be selected by the AI system (active learning) or by workers. What are fair payment schemes for various kinds of mixed-initiative systems?
  • Data maintenance: Data may become stale over time, or change usefulness. Should workers be responsible for maintaining data, and what are fair financial incentives?
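
Here is the black-box credit-assignment sketch promised above: a leave-one-worker-out estimate of each worker's contribution, using scikit-learn purely as an illustrative stand-in for whatever model the co-op trains.

```python
# Black-box, leave-one-out sketch of credit assignment (our own illustration):
# a worker's credit is the drop in validation accuracy when their examples are
# removed from the training set. Inputs are assumed to be NumPy arrays.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def leave_one_worker_out_credit(X, y, worker_ids, X_val, y_val,
                                base_model=LogisticRegression(max_iter=1000)):
    full = clone(base_model).fit(X, y)
    full_acc = full.score(X_val, y_val)
    credit = {}
    for w in np.unique(worker_ids):
        keep = worker_ids != w                       # drop this worker's examples
        reduced = clone(base_model).fit(X[keep], y[keep])
        credit[w] = full_acc - reduced.score(X_val, y_val)
    return credit  # larger values = this worker's data mattered more
```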


Do Workers Want to Invest in AI Systems?

We launched a survey on Mechanical Turk (MTurk) to gauge interest, and got feedback from 31 workers.

  • On average, workers were willing to give up 25% of their income if given the chance to double it over one year. Only 3 participants said they’d not be willing to give up any of their earnings, and age doesn’t seem to be a factor here.
  • When given a risk factor, over 48% chose to give up some current payment for a future reward.
  • In order to give up 100% of their current earnings, workers needed to be able to make back 3 times their invested amount.
  • 45% of workers reported not being worried at all about AI taking over their jobs.


References

[1] Amazon Mechanical Turk. 2014. Participation Agreement. Retrieved November 4, 2016 from https://www.mturk.com/mturk/conditionsofuse.

[2] Executive Office of the President National Science and Technology Council Committee on Technology. October 2016. Preparing for the Future of Artificial Intelligence.

[3] Anand Sriraman, Jonathan Bragg, Anand Kulkarni. 2016. Worker-Owned Cooperative Models for Training Artificial Intelligence. Under review.

[4] Wikipedia. Basic Income. https://en.m.wikipedia.org/wiki/Basic_income

Anand Sriraman, TCS Research – TRDDC, Pune, India
Jonathan Bragg, University of Washington, USA
Anand Kulkarni, University of California, Berkeley, USA

CrowdCamp 2016: Understanding the Human in the Loop

Report on CrowdCamp 2016: The 7th Workshop on Rapidly Iterating Crowd Ideas, held in conjunction with AAAI HCOMP 2016 on November 3, 2016 in Austin, TX.

Organizers: Markus Krause (UC Berkeley), Praveen Paritosh (Google), and Adam Tauman Kalai (Microsoft Research)

Human computation and crowdsourcing as a field investigates aspects of the human in the loop. Consequently, we tend to use metaphors of computer science to describe human phenomena. These phenomena, however, have been studied by other fields, such as sociology and psychology, for a very long time. Ignoring these fields not only blocks our access to valuable information but also results in simplified models that we then try to satisfy with artificial intelligence.

We focused this Crowdcamp on methodologically recognizing the human in the loop, by paying more attention to human factors in task design, and borrowing methodologies from scientific fields relying on human instruments, such as survey design, psychology, and sociology.

We believe that this is necessary for and will foster: 1) raising the bar for AI research, by facilitating more natural human data sets that capture human intelligence phenomena more richly, 2) raising the bar for human computation methodology for collecting data via human instruments, and 3) improving the quality of life and unleashing the potential of crowd workers by taking into consideration human cognitive, behavioral, and social factors.

This year’s CrowdCamp featured some new concepts. Besides having a theme, we also held a pre-workshop social event. The idea of the event was to get together and discuss ideas in an informal and cheerful setting. We found this very helpful for breaking the ice, forming groups, and preparing ideas for the camp. It helped keep us focused on the tasks without sacrificing social interactions.

We think the pre-workshop social event really helped inspire participants to get to work right away the next day. We are aware of at least one work-in-progress paper submitted within 24 hours of the workshop! We are sure there are even more great results in the individual group reports published on this blog.

We expect to publish all of the data sets we collected in the next week or so, so please check back in a few days to see more of the results of our workshop. A forthcoming issue of AAAI magazine will include an extended version of this report. If you have feedback on the theme of this year’s CrowdCamp, you might find some further points in there to ruminate about. Feel free to share feedback directly or by commenting on this blog post.

Thanks to the many awesome teams that participated in this year’s CrowdCamp, and stay tuned as blog posts from each team describing their particular project will immediately follow this workshop overview post in the coming days.

Report: HCOMP 2016 Workshop on Mathematical Foundations of Human Computation

The HCOMP Workshop on Mathematical Foundations of Human Computation was held at HCOMP 2016 in Austin, Texas on November 3, 2016. The goal of the workshop was to bring together researchers across disciplines to discuss the future of research on the mathematical foundations of human computation, with a particular emphasis on the ways in which theorists can learn from the existing empirical literature on human computation and the ways in which applied and empirical work on human computation can benefit from mathematical foundations.

There has been great progress in human computation in the past decade. However, human computation is still not well understood from a foundational mathematical perspective.  Mathematical foundations are important to influence and shape the future of human computation. They provide frameworks for formalizing desirable properties of systems (e.g., correctness, optimality, scalability, privacy, fairness), predicting the impact of design decisions, designing systems with provable guarantees, and performing counterfactual analysis, which is notoriously hard to do empirically.

The workshop engaged researchers across different disciplines. The workshop featured an opening talk to set the stage for the day and four invited talks covering broad topics including a model of human conscious computation with applications to password generation, meta-unsupervised-learning methods, the design of a programming framework for humans in the loop, and how to increase productivity using (potentially crowdsourced) microtasks. The program also included six contributed talks and ample time reserved for general discussion. The slides for many of the talks are available on the workshop website.

The workshop concluded with an open problem session in which participants delivered short informal pitches of open problems related to the workshop theme. The main themes of the open problems that were proposed include:

  • The characterization of the important properties of computational models of human cognition. For example, how do we classify or quantify the hardness of problems in human computation?
  • Evaluation metrics for human computation problems. What properties do we want human computation to achieve?
  • The design of models that accurately incorporate human behavior. At a minimum, this requires diving into the literature from other disciplines, such as psychology and behavioral economics. Ideally it should be explored in an interdisciplinary manner.
  • How to develop theory and design algorithms that are robust to errors in modeling and experiments.


Workshop organizers:

Shuchi Chawla, University of Wisconsin – Madison
Chien-Ju Ho, Cornell University
Michael Kearns, University of Pennsylvania
Jenn Wortman Vaughan, Microsoft Research
Santosh Vempala, Georgia Tech

Humans, Machines and the Future of Work


Automation, driven by technological progress, has been increasing inexorably for the past several decades. Two schools of economic thinking have been engaging in a debate about the potential effects of automation on jobs, employment and human activity: will new technology spawn mass unemployment, as the robots take jobs away from humans? Or will the jobs robots take over release or unveil – or even create – demand for new human jobs?

The debate has flared up again because of recent technological achievements, such as a Google software program called AlphaGo beating Go world champion Lee Sedol, a task considered even harder than beating the world’s chess champions.

Ultimately the question boils down to this: are today’s technological innovations like those of the past, which made the job of buggy maker obsolete but created the job of automobile manufacturer? Or is there something about today that is markedly different? This is not a new concern: at least as far back as the Luddites of early 19th-century Britain, new technologies have caused fear about the inevitable changes they bring.

There is considerable evidence that this concern may be justified. Erik Brynjolfsson and Andrew McAfee of MIT recently wrote:

For several decades after World War II the economic statistics we care most about all rose together here in America as if they were tightly coupled. GDP grew, and so did productivity — our ability to get more output from each worker. At the same time, we created millions of jobs, and many of these were the kinds of jobs that allowed the average American worker, who didn’t (and still doesn’t) have a college degree, to enjoy a high and rising standard of living. But … productivity growth and employment growth started to become decoupled from each other.

As the decoupling data show, the U.S. economy has been performing quite poorly for the bottom 90 percent of Americans for the past 40 years. Technology is driving productivity improvements, which grow the economy, but most people are not seeing any benefit from this growth. While the U.S. economy is still creating jobs, it is not creating enough of them. The labor force participation rate, which measures the active portion of the labor force, has been dropping since the late 1990s.

An in-depth discussion of these issues will take place at a conference on Humans, Machines, and the Future of Work, Dec. 5-6, 2016.

Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments

When collecting human ratings or labels for items, it can be difficult to measure and ensure data quality control, due to both task subjectivity (i.e., lack of a gold standard against which answers can be easily checked), as well as lack of transparency into how individual judges arrive at their submitted ratings. Using a paid microtask platform like Mechanical Turk to collect data introduces further challenges: crowd annotators may be inexpert, minimally trained, and effectively anonymous, and only rudimentary channels may be provided to interact with workers and observe work as it is performed.

To address this challenge, we investigate a very simple idea: requiring judges to provide a rationale supporting each rating decision they make. We explore this idea for the specific task of collecting relevance judgments for search engine evaluation. A relevance judgment rates how relevant a given web page is to a particular search query. We define a rationale as a direct excerpt from a web page (e.g., which a worker might easily provide by copy-and-paste or highlighting).

To illustrate this idea, we show a simple example below in which a user is assumed to be searching for information about symptoms of jaundice, and a worker is asked to rate the relevance of the Wikipedia page on jaundice. The worker might rate the page as relevant and support his/her decision by highlighting an excerpt describing jaundice symptoms, as shown.

What are the symptoms of jaundice?

While this idea of rationales is quite simple and easy to implement — good things for translating ideas from research to practice! — rationales appear to be remarkably useful, offering a myriad of benefits:

  • Rationales enhance transparency and value. While traditional labor models involve high-quality interpersonal interactions and transparency, this comes at significant expense. On the other hand, microtasking typically greatly improves efficiency and costs, but at the cost of low transparency and poor communication channels. Rationales represent a middle-ground between these extremes, with textual excerpts comprising a very light-weight form of communication from worker to requester to enhance transparency. Moreover, because rationales enhance data interpretability, they increase the value of collected data.  This makes rationales useful to collect from traditional annotators as well as crowd annotators, and beyond ensuring initial data quality, rationales increase the enduring value of collected data for all future users as well.
  • Rationales enhance reliability and accountability. Identifying a focused context for explaining annotator decisions as they are made is generally helpful and especially so for subjective tasks without clear answers. By making work more verifiable, arbitrary answers can no longer be given without justification, making annotators more accountable for their work and decreasing any temptation to rush or cheat. In addition, rationales let us establish alternative truths, e.g., in cases of disagreement between raters, or when rating decisions seem surprising or unintuitive without explanation. Moreover, just as rationales allow requesters to understand worker thought processes, they also enable application of sequential task design in which workers refine or refute one another’s reasoning. Such sequential design is critical for minimizing reliance on “experts” to verify data quality when scaling up data collection.
  • Rationales enhance inclusivity. We hypothesized that the improved transparency and accountability of rationales would enable us to remove all traditional barriers to participation without compromising data quality. By allowing anyone interested to work, we support greater scalability of data collection, greater diversity of responses, and equal opportunity access to income. Our experiments on Mechanical Turk imposed no worker qualifications restricting who could perform our tasks, and we were still able to collect high-quality data.

It gets even better. Despite the seemingly extra work required to collect rationales in addition to relevance judgments, in practice rationales do not require extra time (and therefore cost) to collect! Over a series of experiments collecting nearly 10,000 relevance judgments, we found that workers produce higher quality relevance judgments simply by being asked to provide a rationale, and prolific workers require virtually no extra time to provide rationales in addition to ratings. Intuitively, rationales appear to merely capture explicitly the implicit cognitive work inherent in providing ratings: workers already have to find relevant text to perform the rating task, and asking them to simply report that text requires essentially no extra work.

Finally, it’s intuitive that workers who select similar rationales are likely to produce similar relevance judgments, and we can exploit such overlap between worker rationales in aggregating worker judgments in order to further improve data quality.
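
As a rough illustration of that last point (a simplified stand-in, not the aggregation method from the paper), one could weight each worker's vote by how much their rationale overlaps with the other rationales for the same item:

```python
# Simplified sketch: weight each worker's vote by the average token overlap
# (Jaccard similarity) of their rationale with the other workers' rationales.
from collections import defaultdict

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def aggregate(judgments):
    """judgments: list of (label, rationale) tuples for one query-document pair."""
    weights = []
    for i, (_, r_i) in enumerate(judgments):
        others = [r_j for j, (_, r_j) in enumerate(judgments) if j != i]
        weights.append(sum(jaccard(r_i, r_j) for r_j in others) / max(len(others), 1))
    votes = defaultdict(float)
    for (label, _), w in zip(judgments, weights):
        votes[label] += w
    return max(votes, key=votes.get)

print(aggregate([("relevant", "yellowing of the skin and eyes"),
                 ("relevant", "yellow skin and eyes are symptoms"),
                 ("not relevant", "history of the wikipedia article")]))  # -> relevant
```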

In sum, our results suggest that rationales may be a remarkably simple yet powerful tool for crowdsourced data collection, particularly for difficult or subjective tasks. There’s much more to our study, so we encourage you to read our full paper! The link to our paper is below, along with links to the presentation slides and our collected data.

Tyler McDonnell, Matthew Lease, Mucahid Kutlu, and Tamer Elsayed. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments. In Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), pages 139-148, 2016. Best Paper Award. [ Slides & Data ]

Machine Learning for Dummies

By Eric Holloway and Robert J. Marks II

Humans are slow and sloppy, so why do we want human-guided machine learning?

Since the 1970s, we’ve known that humans can find approximate solutions to NP-complete (really hard) problems more efficiently than the best algorithms (Krolak 1971). The best algorithms scale quadratically, while humans scale linearly (Dry 2006). We also know that many of the most widely used and successful machine learning algorithms are NP-hard to train optimally (Dietterich 2000). This suggests a human/machine hybrid can produce better models than machine learning alone.

Polynomial regression of human solution times against problem sizes (Dry 2006).

However, there are problems with human interaction. The hardest problem is visualization: it is hard to visualize data with more than 3 dimensions. So, we perform dimension reduction by projecting the data onto two dimensions. We then collect many weak models (humans draw boxes) from multiple projections to build a strong model, a technique known as boosting in machine learning.

User interface for Amazon Mechanical Turk HIT.

Combining crowdsourcing and boosting, we use Amazon’s Mechanical Turk to collect the models. The models transform the data into a feature space, and we then use linear regression to classify new data (a toy sketch follows the figure below).

Linear regression classification of human-produced features.
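
Here is the toy sketch referenced above of how box-drawing could feed a linear model; the box coordinates, data, and use of scikit-learn are illustrative assumptions, not the authors' actual pipeline.

```python
# Sketch: each crowd-drawn box on a 2-D projection becomes a binary feature,
# and a linear model is fit on the resulting feature space.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each box: (dim_x, dim_y, x_min, x_max, y_min, y_max), as a worker might draw
# on one 2-D projection of the high-dimensional data. Values are illustrative.
boxes = [(0, 3, -1.0, 0.5, 0.2, 2.0),
         (1, 4,  0.0, 2.0, -1.5, 0.0)]

def box_features(X, boxes):
    feats = []
    for dx, dy, x0, x1, y0, y1 in boxes:
        inside = ((X[:, dx] >= x0) & (X[:, dx] <= x1) &
                  (X[:, dy] >= y0) & (X[:, dy] <= y1))
        feats.append(inside.astype(float))   # 1.0 if the point falls inside the box
    return np.column_stack(feats)

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)  # toy data

model = LinearRegression().fit(box_features(X_train, boxes), y_train)
X_new = rng.normal(size=(5, 10))
pred = (model.predict(box_features(X_new, boxes)) > 0.5).astype(int)    # threshold at 0.5
print(pred)
```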

We test the human/machine hybrid on one artificial dataset and four real-world datasets, all with ten or more dimensions. The hybrid is competitive with machine-only linear regression on the untransformed data.

Results of linear regression classification using the machine alone and using the human/machine hybrid.

You can read more about our work in the HCOMP 2016 paper High Dimensional Human Guided Machine Learning.

A demo is available for a limited time.

References

Krolak, P., Felts, W., & Marble, G. (1971). A man-machine approach toward solving the traveling salesman problem. Communications of the ACM, 14(5), 327-334.

Dry, M., Lee, M. D., Vickers, D., & Hughes, P. (2006). Human performance on visually presented traveling salesperson problems with varying numbers of nodes. The Journal of Problem Solving, 1(1), 4.

Dietterich, T. G. (2000, June). Ensemble methods in machine learning. In International workshop on multiple classifier systems (pp. 1-15). Springer Berlin Heidelberg.


Rethinking experiment design as algorithm design

As experimentation in the behavioral and social sciences moves from brick-and-mortar laboratories to the web, techniques from human computation and machine learning can be combined to improve the design of experiments. Experimenters can take advantage of the new medium by writing complex computationally mediated adaptive procedures for gathering data: algorithms.

In a paper to be presented at CrowdML’16, we consider this algorithmic approach to experiment design. We review several experiment designs drawn from the fields of medicine, cognitive psychology, cultural evolution, psychophysics, computational design, game theory, and economics, describing their interpretation as algorithms. We then discuss software platforms for efficient execution of these algorithms with people. Finally, we consider how machine learning can optimize crowdsourced experiments and form the foundation of next-generation experiment design.

Consider the transmission chain, an experimental technique that, much like the children’s game Telephone, passes information from one person to the next in succession. As the information changes hands, it is transformed by the perceptual, inductive, and reconstructive biases of the individuals. Eventually, the transformation leads to erasure of the information contained in the input, leaving behind a signature of the transformation process itself. For this reason, transmission chains have been particularly useful in the study of language evolution and the effects of culture on memory.

When applied to functional forms, for example, transmission chains typically revert to a positive linear function, revealing a bias in human learning. In each row of the following figure, reprinted from Kalish et al. (2007), the functional relationship in the leftmost column is passed down a transmission chain of 9 participants, in all four instances reverting to a positive linear function.

Transmission chains can be formally modeled as a Markov chain by assuming that perception, learning, and memory follow the principles of Bayesian inference. Under this analysis, Bayes’ rule is used to infer the process that generated the observed data. A hypothesis is then sampled from the posterior distribution and used to generate the data passed to the next person in the chain. When a transmission chain is set up in this way, iterated learning is equivalent to a form of Gibbs sampling, a widely-used Markov chain Monte Carlo algorithm. The convergence results for the Gibbs sampler thus apply, with the prior as the stationary distribution of the Markov chain on hypotheses. This equivalence raises the question of whether other MCMC-like algorithms can form the basis of new experiment designs.
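
A small simulation makes the equivalence concrete. The sketch below (our own illustration with a toy coin-bias hypothesis space, not code from the paper) runs a chain of Bayesian learners who sample from their posteriors; the distribution of hypotheses along the chain approaches the prior, as the Gibbs-sampling analysis predicts.

```python
# Iterated learning as Gibbs sampling: each "participant" sees data generated
# from the previous hypothesis, forms a posterior, samples a hypothesis, and
# passes new data down the chain. The chain's hypotheses converge to the prior.
import numpy as np

rng = np.random.default_rng(1)
hypotheses = np.linspace(0.1, 0.9, 9)                       # possible coin biases
prior = np.array([1, 2, 3, 4, 5, 4, 3, 2, 1], dtype=float)  # a peaked (non-uniform) prior
prior /= prior.sum()
n_flips, n_generations = 5, 20000

h = hypotheses[0]            # start the chain far from the prior's mode
visited = []
for _ in range(n_generations):
    data = rng.binomial(1, h, size=n_flips)                 # data shown to the next learner
    k = data.sum()
    likelihood = hypotheses**k * (1 - hypotheses)**(n_flips - k)
    posterior = likelihood * prior
    posterior /= posterior.sum()
    h = rng.choice(hypotheses, p=posterior)                 # next learner samples a hypothesis
    visited.append(h)

# Empirical distribution of hypotheses along the chain approaches the prior.
values, counts = np.unique(visited, return_counts=True)
print(dict(zip(values.round(1), (counts / counts.sum()).round(2))))
```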

For more, attend CrowdML at NIPS in Barcelona or see our full paper: Rethinking experiment design as algorithm design.

Jordan W. Suchow, University of California, Berkeley
Thomas L. Griffiths, University of California, Berkeley

Probabilistic Modeling for Crowdsourcing Partially-Subjective Ratings

By An T. Nguyen, Matthew Halpern, Byron C. Wallace, Matthew Lease
University of Texas at Austin and Northeastern University

Imagine you are a graphics designer working on a logo for a major corporation. You have a preliminary design in mind, but now you need to hammer out the fine-grained details, such as the exact font, sizing, and coloring to use. In the past you might have solicited feedback from colleagues or run in-person user studies, but you recently heard about the power of the “crowd” and you are intrigued. You launch an Amazon Mechanical Turk task asking workers to rate different logo permutations on a 1-5 scale, where 1 and 5 stars are the least and most satisfactory ratings, respectively.

You now have a rich multivariate dataset of the workers’ opinions, but did all workers undertake the task in good faith? For example, a disinterested worker could have just picked an arbitrary score, and a more malicious worker could have intentionally picked the opposite of their true opinion. You would like to detect any such cases and filter them out of your data. You might even want to extrapolate from the collected scores to other versions not scored. How could you do any of this?

While the need to accurately model data quality is key, this is a challenging partially-subjective rating scenario where each response is open but partially constrained. In the logo example, the worker is allowed to select any score in the range 1-5. Personal opinions may differ significantly on a score for a specific logo design. This is in contrast to conventional crowdsourcing tasks, such as image tagging and finding email addresses, where one expects a single, correct answer to each question being asked. While a great deal of prior work has proposed ways to model data for such objective tasks, far less work has considered modeling data quality and worker performance under more subjective task scenarios.

The key observation we make in our paper is that the worker data for these partially-subjective tasks, where worker labels are partially ordered (i.e., scores from one to five), are heteroscedastic in nature. We therefore propose a probabilistic, heteroscedastic model in which the means and variances of worker responses are modeled as functions of instance attributes. In other words, the variability of scores can itself vary across the different parameters. Consider the results as the font size of a logo is varied: we would expect most workers to give the logos with the smallest and largest font sizes low scores, whereas the range of scores for the middle range of font sizes will be more varied.
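
A stripped-down version of the heteroscedastic idea (not the full model in the paper) can be written as a maximum-likelihood fit in which both the mean and the log standard deviation of ratings are functions of a single instance attribute:

```python
# Simplified heteroscedastic sketch: mean and variance of ratings both depend
# on an instance attribute x (e.g., font size or hardware capability).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 500)                               # instance attribute
true_mean = 1 + 4 * np.exp(-((x - 0.5) ** 2) / 0.05)     # mid-range instances score highest
true_std = 0.3 + 1.2 * np.exp(-((x - 0.5) ** 2) / 0.05)  # and are also most varied
y = np.clip(rng.normal(true_mean, true_std), 1, 5)       # simulated 1-5 ratings

def neg_log_lik(params):
    a0, a1, a2, b0, b1, b2 = params
    mean = a0 + a1 * x + a2 * x**2            # quadratic mean function
    std = np.exp(b0 + b1 * x + b2 * x**2)     # log-quadratic (always positive) std
    return -norm.logpdf(y, mean, std).sum()

fit = minimize(neg_log_lik, x0=np.zeros(6), method="Nelder-Mead",
               options={"maxiter": 20000})
a0, a1, a2, b0, b1, b2 = fit.x
grid = np.array([0.1, 0.5, 0.9])
print("estimated std at x = 0.1, 0.5, 0.9:",
      np.exp(b0 + b1 * grid + b2 * grid**2).round(2))     # largest in the middle
```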

We demonstrate the effectiveness of our model on a large dataset of nearly 25,000 Mechanical Turk ratings of user experience when viewing videos on smartphones with varying hardware capabilities. Our results show that our method is effective at both predicting user ratings and in detecting unreliable respondents, which is particularly valuable and little studied in this domain of subjective tasks where there is no clear, objective answer.

The link to our full paper is below, along with links to shared data and code, and we welcome any comments or suggestions!

An Thanh Nguyen, Matthew Halpern, Byron C. Wallace, and Matthew Lease. Probabilistic Modeling for Crowdsourcing Partially-Subjective Ratings. In Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2016. 10 pages. [ bib | pdf | data | sourcecode ]

Pairwise, Magnitude, or Stars: What’s the Best Way for Crowds to Rate?

IS THE UBIQUITOUS FIVE STAR RATING SYSTEM IMPROVABLE?
We compare three popular techniques of rating content: five star rating, pairwise comparison, and magnitude estimation.

We collected 39,000 ratings on a popular crowdsourcing platform, allowing us to release a dataset that will be useful for many related studies on user rating techniques.
The dataset is available here.

METHODOLOGY
We ask each worker to rate 10 popular paintings using 3 rating methods:

  • Magnitude: Using any positive number (zero excluded).
  • Star: Choosing between 1 to 5 stars.
  • Pairwise: Pairwise comparisons between two images, with no ties allowed.

We run 6 different experiments (one for each possible ordering of the three rating methods), with 100 participants in each. We can thus analyze the bias introduced by the order of the rating systems, and obtain results free of order bias by using the aggregated data.

At the end of the rating activity in the task, we dynamically build the three painting rankings induced by the choices of the participant, and ask them which of the three rankings best reflects their preference (the ranking comparison is blind: there is no indication of how each ranking has been obtained, and their order is randomized).

Graphical interface to let the worker express their preference on the ranking induced by their own ratings

WHAT’S THE PREFERRED TECHNIQUE?
Participants clearly prefer the ranking obtained from their pairwise comparisons. We notice a memory bias effect: the ranking induced by the last technique used is the most likely to be chosen as the best description of the real user preference. Despite this, the pairwise comparison technique obtained the largest number of preferences in all cases.

Number of expressions of preference for the ranking induced by each of the three techniques, grouped by the order in which the tests were run

EFFORT
While the pairwise comparison technique clearly requires more time than the other techniques, it would become comparable in time if a dynamic test system were used, since a comparison-based ranking needs only on the order of N log N judgments (a sketch follows the figure below).

Average time per test, grouped by the order in which the tests were run
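
Here is the sketch mentioned above of why a dynamic test system needs only on the order of N log N judgments: a comparison-based sort whose comparison function would, in practice, be answered by crowd workers (stubbed out here with an alphabetical stand-in).

```python
# Sketch: rank N items with a comparison sort whose comparisons are answered by
# crowd workers, needing O(N log N) judgments instead of all N(N-1)/2 pairs.
import functools
import random

def ask_crowd(a: str, b: str) -> int:
    """Placeholder for a crowdsourced comparison: negative if a should rank before b."""
    return -1 if a < b else 1            # alphabetical order stands in for worker votes

comparisons = 0
def counted_comparison(a, b):
    global comparisons
    comparisons += 1
    return ask_crowd(a, b)

paintings = [f"painting_{i}" for i in range(10)]
random.seed(0)
random.shuffle(paintings)

ranking = sorted(paintings, key=functools.cmp_to_key(counted_comparison))
print(f"{comparisons} comparisons for {len(paintings)} items "
      f"(vs. {len(paintings) * (len(paintings) - 1) // 2} for all pairs)")
```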

WHAT DID WE LEARN?

  • Star rating is confirmed to be the most familiar way for users to rate content.
  • Magnitude is unintuitive with no added benefit.
  • Pairwise comparison, while requiring a higher number of low-effort user ratings, best reflects intrinsic user preferences and seems to be a promising alternative.

For more, see our full paper, Pairwise, Magnitude, or Stars: What’s the Best Way for Crowds to Rate?

Alessandro Checco, Information School, University of Sheffield
Gianluca Demartini, Information School, University of Sheffield