Pairwise, Magnitude, or Stars: What’s the Best Way for Crowds to Rate?

IS THE UBIQUITOUS FIVE STAR RATING SYSTEM IMPROVABLE?
We compare three popular techniques for rating content: five-star rating, pairwise comparison, and magnitude estimation.

We collected 39,000 ratings on a popular crowdsourcing platform, allowing us to release a dataset that will be useful for many related studies on user rating techniques.
The dataset is available here.

METHODOLOGY
We ask each worker to rate 10 popular paintings using 3 rating methods:

  • Magnitude: Using any positive number (zero excluded).
  • Star: Choosing between 1 to 5 stars.
  • Pairwise: Pairwise comparisons between two paintings at a time, with no ties allowed.

We run six experiments (one for each possible ordering of the three rating methods), with 100 participants in each. This lets us analyze the bias introduced by the order of the rating systems and, by aggregating over all orderings, obtain results free of order bias.
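
To make the counterbalanced design concrete, here is a minimal Python sketch of the six orderings (the method names are ours, purely illustrative, and not taken from the task interface):

    from itertools import permutations

    # The three rating methods; the names here are illustrative.
    methods = ["magnitude", "stars", "pairwise"]

    # One experimental condition per ordering: 3! = 6 conditions,
    # each run with 100 participants, so order effects can be averaged out.
    for i, order in enumerate(permutations(methods), start=1):
        print(f"Experiment {i}: {' -> '.join(order)}")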

At the end of the rating activity, we dynamically build the three painting rankings induced by the participant's choices and ask them which of the three rankings best reflects their preference. The comparison is blind: there is no indication of how each ranking was obtained, and the rankings are shown in random order.
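
As an illustration of how such rankings can be induced, here is a minimal Python sketch with hypothetical per-participant data. For the pairwise comparisons it uses a simple win-count aggregation, which is one possible choice rather than necessarily the exact procedure used in the paper:

    from collections import Counter

    # Hypothetical per-participant data (the real task used 10 paintings).
    stars = {"A": 5, "B": 3, "C": 4}                 # 1-5 stars per painting
    magnitude = {"A": 120.0, "B": 15.0, "C": 40.0}   # any positive number
    pairwise = [("A", "B"), ("A", "C"), ("C", "B")]  # (winner, loser), no ties

    def ranking_from_scores(scores):
        # Rank paintings from highest to lowest score (works for stars and magnitude).
        return sorted(scores, key=scores.get, reverse=True)

    def ranking_from_pairs(pairs):
        # Rank paintings by number of pairwise wins (one simple aggregation choice).
        wins = Counter(winner for winner, _ in pairs)
        items = {item for pair in pairs for item in pair}
        return sorted(items, key=lambda item: wins[item], reverse=True)

    print(ranking_from_scores(stars))      # ['A', 'C', 'B']
    print(ranking_from_scores(magnitude))  # ['A', 'C', 'B']
    print(ranking_from_pairs(pairwise))    # ['A', 'C', 'B']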

Graphical interface to let the worker express their preference on the ranking induced by their own ratings

WHAT’S THE PREFERRED TECHNIQUE?
Participants clearly prefer the ranking obtained from their pairwise comparisons. We notice a memory bias effect: the technique used last is more likely to produce the ranking judged closest to the participant's real preference. Despite this, the pairwise comparison technique obtained the largest number of preferences in all conditions.

Number of expressions of preference for the ranking induced by each of the three techniques, grouped by the order in which the tests have been run
Number of expressions of preference for the ranking induced by each of the three techniques

EFFORT
While the pairwise comparison technique clearly requires more time than the other techniques, an adaptive (dynamic) test system that chooses which pairs to present, requiring on the order of N log N comparisons rather than all N(N-1)/2 pairs, would make it comparable in time to the other techniques.
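
To give a rough idea of why an adaptive scheme helps, the sketch below compares the number of judgments needed for an exhaustive pairwise test with an N log N sort-style bound (these counts are illustrative asymptotics, not measured results from our experiments):

    from math import comb, ceil, log2

    def exhaustive_pairs(n):
        # Static test over all distinct pairs: N(N-1)/2 comparisons.
        return comb(n, 2)

    def adaptive_pairs(n):
        # Rough bound for a sort-based (adaptive) scheme: about N * log2(N) comparisons.
        return ceil(n * log2(n))

    for n in (10, 20, 50):
        print(n, exhaustive_pairs(n), adaptive_pairs(n))
    # 10 items: 45 vs ~34; 20 items: 190 vs ~87; 50 items: 1225 vs ~283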

Average time per test, grouped by the order in which the tests have been run
Average time per test

WHAT DID WE LEARN?

  • Star rating is confirmed to be the most familiar way for users to rate content.
  • Magnitude estimation is unintuitive and brings no added benefit.
  • Pairwise comparison, while requiring a larger number of low-effort judgments, best reflects intrinsic user preferences and seems to be a promising alternative.

For more, see our full paper, Pairwise, Magnitude, or Stars: What’s the Best Way for Crowds to Rate?

Alessandro Checco, Information School, University of Sheffield
Gianluca Demartini, Information School, University of Sheffield

About the author

Alessandro Checco

Alessandro Checco graduated from the University of Rome "Tor Vergata", Italy, in 2010.
He received his Ph.D. in mathematics from the Hamilton Institute, NUI Maynooth in 2014, where his research was focused on resource allocation in wireless networks and decentralised algorithm design.
In 2015 he worked on recommender systems as a postdoctoral researcher at Trinity College Dublin.
He is currently a postdoctoral researcher at the Information School, University of Sheffield, UK, where his main research interests are web information retrieval, human computation, and data privacy.


2 Comments

  • Thanks so much for releasing the dataset.

    This is a really interesting problem. I assume that pairwise ratings are more expensive to get a full ranking from because you need a lot more of them. How would you compare the overall cost-benefit trade-off of pairwise vs. absolute ranking?

    • Thanks for your comment!

      That is true, but only partially: it seems that pairwise comparisons are less noisy, being scale invariant and less subject to time-bias effects (e.g., a new item is better than a previous one to which I already gave 5 stars, so I end up flattening my future ratings), so what we lose in the number of ratings might be recovered in precision. It is definitely worth exploring!