Let’s Agree to Disagree: Fixing Agreement Measures for Crowdsourcing

In the context of micro-task crowdsourcing, each task is usually performed by several workers. This allows researchers to leverage measures of agreement among workers on the same task to estimate the reliability of the collected data and to better understand the participants’ answering behaviors.

While many measures of agreement between annotators have been proposed, they are known to suffer from several problems and anomalies. In this work, we identify the main limitations of existing agreement measures in the crowdsourcing context, using both toy examples and real-world crowdsourcing data, and propose a novel agreement measure based on probabilistic parameter estimation that overcomes these limitations. We validate the new measure and show its flexibility compared to existing agreement measures.

CURRENT AGREEMENT MEASURES ARE INADEQUATE

Most agreement measures are borrowed from data reliability theory, where the reliability of a set of grouped measurements is assessed by comparing the inter-group and intra-group variability, and where the judgments are typically made by a fixed set of assessors. In the context of crowdsourcing, these measures suffer from several problems when they are used to estimate agreement rather than data reliability:

  1. The variability of judgments is typically higher when the judgments concentrate around the center of the scale. This problem is intrinsic to finite-scale judgments and can lead to overestimating disagreement on items whose true value lies near the scale boundaries.
  2. The value around which judgments concentrate (if any) can differ from item to item. This can lead to overestimating the expected disagreement and thus to a higher chance of the data being considered random.
  3. For some items a ground truth (e.g., ‘gold questions’ in crowdsourcing) might be available, i.e., a value around which judgments are expected to concentrate. Classic agreement measures typically do not use this information.
  4. The correction for chance based on global variability leads to many idiosyncrasies in the existing measures, making them hard to use in a crowdsourcing setting.

Our goal in this paper is to address these issues and to build a framework better suited to estimating worker agreement over a group of tasks in a crowdsourcing context.

OUR MODEL

The intuition behind Φ is tied to the definition of agreement: we treat as agreement the amount of concentration around a single value. Conversely, if the data do not concentrate around a value we have disagreement (negative agreement in our measure), which can be more or less strong depending on how polarized the different opinions are. In more detail, our approach can be described as fitting a distribution to the histogram of the judgments and then measuring the dispersion of that distribution.

It is important to note that the fitted distribution has to be general enough to capture the main behaviors that can occur: flat (random judgments), bell-shaped (agreement), J-shaped (agreement around a value on the boundary of the scale), and U-shaped (disagreement), as shown in the following figure.

Agreement model examples

At the same time, the distribution should have as few parameters as possible, to avoid overfitting. For this reason, we use a Beta distribution to perform the fit: Φ is a transformed parameter of the Beta distribution fitted to the histogram of the collected answers. This parameter is related to the standard deviation of the fitted distribution, with the difference that here we account for the finiteness of the rating scale, and thus adjust for the tendency toward lower dispersion when the data concentrate around a value at the boundaries of the scale. For example, if we imagine a scenario where assessors add random Gaussian noise to the ground truth when making a judgment, the dispersion will be minimal when the ground truth lies on the boundary of the scale, because any noise that would push a judgment outside the scale gets clipped.
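To make the fitting step concrete, here is a minimal sketch, assuming a 1-5 rating scale and plain maximum likelihood: it is not the estimator from the paper and omits the boundary adjustment described above, but it shows how the raw dispersion of a fitted Beta can be read off a single item’s judgments. The function name fit_beta is ours.

```python
# Minimal sketch (not the paper's estimator): rescale finite-scale judgments
# to (0, 1), fit a Beta by maximum likelihood, and read off its dispersion.
import numpy as np
from scipy import stats

def fit_beta(judgments, scale_min=1, scale_max=5, eps=1e-3):
    """Rescale ordinal judgments to the open interval (0, 1) and fit a Beta."""
    x = (np.asarray(judgments, dtype=float) - scale_min) / (scale_max - scale_min)
    x = np.clip(x, eps, 1 - eps)               # keep values strictly inside (0, 1)
    a, b, _, _ = stats.beta.fit(x, floc=0, fscale=1)
    return a, b

judgments = [4, 4, 5, 4, 3, 4, 5]              # judgments concentrating near the top
a, b = fit_beta(judgments)
print(f"mean={a / (a + b):.2f}, sd={stats.beta.std(a, b):.2f}")  # low sd = high agreement
```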

The strength of our approach becomes apparent when it is applied to a group of items to be judged: in relevance judgment tasks, each item i is allowed to have a different average relevance value, while the agreement among workers is defined as the common Φ that best explains the judgment data.

This solves the problems that arise in the other agreement measures when they correct for chance by using the dispersion of the whole dataset as a normalizing factor.
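The joint estimation can be sketched as follows, under the simplifying assumption that each item’s judgments follow a Beta distribution with its own mean and a single shared precision (the inverse of dispersion). This is an illustrative stand-in for the probabilistic estimation of Φ, not the actual implementation; the names shared_precision and neg_log_lik are ours.

```python
# Illustrative joint fit: every item i has its own mean mu_i, while one shared
# precision k (high k = low dispersion = high agreement) explains all items.
import numpy as np
from scipy import optimize, stats

def neg_log_lik(params, items):
    k = np.exp(params[0])                      # shared precision, kept positive
    mus = 1 / (1 + np.exp(-params[1:]))        # per-item means in (0, 1)
    nll = 0.0
    for mu, x in zip(mus, items):
        a, b = mu * k, (1 - mu) * k            # Beta with mean mu and precision k
        nll -= stats.beta.logpdf(x, a, b).sum()
    return nll

def shared_precision(items):
    """items: list of arrays of judgments already rescaled to (0, 1)."""
    x0 = np.zeros(1 + len(items))              # start at log k = 0, logit mu_i = 0
    res = optimize.minimize(neg_log_lik, x0, args=(items,), method="Nelder-Mead")
    return np.exp(res.x[0])

items = [np.array([0.70, 0.75, 0.80, 0.72]), np.array([0.20, 0.25, 0.22, 0.30])]
print(f"shared precision k = {shared_precision(items):.1f}")  # one agreement level for both items
```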

EXAMPLE

In the following figure we show the inference results for the judgments of 17 documents. We generated a small synthetic dataset in which the first document has an outlier on the right boundary, while the other 16 documents show clear central agreement (documents 2-5 are replicated four times to obtain 16 high-agreement documents). The model is forced to find the single agreement level (the dispersion of the Beta distribution) that collectively explains all the data: while document 1 alone would have been fitted with a high-disagreement (U-shaped) Beta, the most probable Beta for the whole dataset is one in which the first document simply has an outlier. This reflects the way humans perceive agreement, especially with a small set of data samples, and yields a robust estimate of agreement for a group of documents.

example
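For illustration, a hedged reconstruction of such a synthetic dataset is sketched below; the exact judgment values are assumptions, only the structure (one document with a boundary outlier, 16 documents with tight central agreement) comes from the description above. A list like this could be fed to a joint fit such as the shared-precision sketch earlier.

```python
# Assumed judgment values: doc 1 has central agreement plus a boundary outlier,
# docs 2-5 have tight central agreement and are replicated four times (16 docs).
import numpy as np

base_docs = [
    [0.45, 0.50, 0.55, 0.50],    # doc 2
    [0.40, 0.50, 0.50, 0.60],    # doc 3
    [0.50, 0.55, 0.50, 0.45],    # doc 4
    [0.50, 0.50, 0.60, 0.40],    # doc 5
]
doc1 = [0.45, 0.50, 0.55, 0.99]  # central agreement plus an outlier at the right boundary
dataset = [np.array(doc1)] + [np.array(d) for d in base_docs * 4]
print(len(dataset))               # 17 documents in total
```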

ONLINE TOOL

You can test our tool (as shown in the snapshot below) and access the source code at this link.

Online tool

For more information, see our full paper, Let’s Agree to Disagree: Fixing Agreement Measures for Crowdsourcing.

Alessandro Checco, Information School, University of Sheffield

Pairwise, Magnitude, or Stars: What’s the Best Way for Crowds to Rate?

IS THE UBIQUITOUS FIVE-STAR RATING SYSTEM IMPROVABLE?
We compare three popular techniques for rating content: five-star rating, pairwise comparison, and magnitude estimation.

We collected 39,000 ratings on a popular crowdsourcing platform, allowing us to release a dataset that will be useful for many related studies on user rating techniques.
The dataset is available here.

METHODOLOGY
We ask each worker to rate 10 popular paintings using three rating methods:

  • Magnitude: Using any positive number (zero excluded).
  • Star: Choosing between 1 and 5 stars.
  • Pairwise: Comparing two images at a time, with no ties allowed.

We run six experiments (one for each possible ordering of the three methods), with 100 participants in each. This allows us to analyze the bias introduced by the order of the rating methods and, by aggregating the data, to obtain results free of order bias.

At the end of the rating activity, we dynamically build the three painting rankings induced by the participant’s choices and ask them which of the three rankings best reflects their preference (the comparison is blind: there is no indication of how each ranking was obtained, and their order is randomized).

Graphical interface to let the worker express their preference on the ranking induced by their own ratings
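As an illustration of the ranking-induction step, here is a minimal sketch for a single worker; the data shapes and variable names are assumptions, not the actual task code. Stars and magnitudes are sorted directly, while pairwise choices are aggregated by counting wins.

```python
# Induce the three rankings for one worker (assumed data shapes).
paintings = ["P1", "P2", "P3"]
stars     = {"P1": 4, "P2": 5, "P3": 2}                  # 1-5 stars
magnitude = {"P1": 30.0, "P2": 80.0, "P3": 5.0}          # any positive number
pairwise  = [("P2", "P1"), ("P2", "P3"), ("P1", "P3")]   # (winner, loser), no ties

star_rank = sorted(paintings, key=lambda p: -stars[p])
magn_rank = sorted(paintings, key=lambda p: -magnitude[p])

wins = {p: 0 for p in paintings}
for winner, _ in pairwise:
    wins[winner] += 1
pair_rank = sorted(paintings, key=lambda p: -wins[p])    # rank by number of wins

print(star_rank, magn_rank, pair_rank)                   # all give ['P2', 'P1', 'P3'] here
```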

WHAT’S THE PREFERRED TECHNIQUE?
Participants clearly prefer the ranking obtained from their pairwise comparisons. We notice a memory bias effect: the technique used last is more likely to be judged the most accurate description of the user’s real preference. Despite this, the pairwise comparison technique obtained the largest number of preferences in all cases.

Number of expressions of preference for the ranking induced by each of the three techniques, grouped by the order in which the tests have been run

EFFORT
While the pairwise comparison technique clearly requires more time than the other techniques, its time cost would become comparable if a dynamic test system were used, reducing the number of comparisons to order N log N (see the back-of-the-envelope sketch after the figure below).

Average time per test, grouped by the order in which the tests have been run
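The following is a back-of-the-envelope illustration of that argument, not a measurement from the experiments: exhaustive pairwise rating needs N(N-1)/2 comparisons, while a sort-based dynamic test needs roughly N·log2(N).

```python
# Comparison counts: exhaustive pairwise vs. an (assumed) sort-based dynamic test.
import math

for n in (10, 50, 100):
    exhaustive = n * (n - 1) // 2          # every pair rated once
    sort_based = round(n * math.log2(n))   # rough cost of a sorting-style procedure
    print(f"N={n:>3}: exhaustive={exhaustive:>5}, sort-based~{sort_based:>4}")
```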

WHAT DID WE LEARN?

  • Star rating is confirmed to be the most familiar way for users to rate content.
  • Magnitude estimation is unintuitive and brings no added benefit.
  • Pairwise comparison, while requiring a higher number of low-effort user ratings, best reflects intrinsic user preferences and seems to be a promising alternative.

For more, see our full paper, Pairwise, Magnitude, or Stars: What’s the Best Way for Crowds to Rate?

Alessandro Checco, Information School, University of Sheffield
Gianluca Demartini, Information School, University of Sheffield

How to Best Serve Micro-tasks to the Crowd when there Is Class Imbalance

DOES CLASS IMBALANCE IN RELEVANCE JUDGMENTS AFFECT PERFORMANCE?
We study how the dominance of one class in the data that crowd workers process affects their efficiency and effectiveness. We aim to understand whether seeing many negative examples biases workers in the identification of positive labels.

We run comparative experiments in which we measure label quality and work efficiency over different class distribution settings, varying both label frequency (i.e., one dominant class) and ordering (e.g., positive cases preceding negative ones).

Order of the document classes in each batch, in blue for ‘relevant’ and red for ‘non-relevant’

We used data from TREC-8. To measure the effects of class imbalance, we used two different relevant/non-relevant ratios in each batch of judging tasks: 10%-90% and 50%-50%.
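As a hedged illustration of how such batches can be laid out (the batch size and the exact orderings used in the experiments are assumptions here), one can fix the class ratio and then vary the position of the relevant documents:

```python
# Lay out a batch with a given relevant/non-relevant ratio in different orders
# (illustrative helper; batch size and orderings are assumptions).
def make_batch(n_docs, relevant_ratio, order):
    n_rel = round(n_docs * relevant_ratio)
    rel, nonrel = ["R"] * n_rel, ["N"] * (n_docs - n_rel)
    if order == "relevant_first":
        return rel + nonrel
    if order == "relevant_last":
        return nonrel + rel
    if order == "mixed":                         # spread relevant docs evenly
        batch = list(nonrel)
        step = max(1, len(nonrel) // max(1, n_rel))
        for i, doc in enumerate(rel):
            batch.insert(i * (step + 1), doc)
        return batch
    raise ValueError(order)

print(make_batch(10, 0.1, "relevant_first"))     # ['R', 'N', 'N', 'N', ...]
print(make_batch(10, 0.5, "mixed"))              # alternating 'R' and 'N'
```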

RESULTS
When the relevant documents are shown before the non-relevant ones we obtain the highest precision, while the worst precision is obtained when they are shown at the end of the batch.
Moreover, in batch 2 we observe few true positives and a large number of false positive judgments by the workers, which shows how the 90% of non-relevant documents shown at the beginning of the batch bias the workers’ notion of relevance.

Mean judgment accuracy, precision and recall for each setting

When classes are balanced, there is no statistically significant difference in performance between the different orders. On the other hand, seeing a similar number of positive and negative documents leads to good performance, with more than 60% accuracy in all three order settings.

WHAT DID WE LEARN?
When most of the documents are non-relevant and the few relevant ones are presented first, workers perform better. This is a positive result that can easily be applied in practice, as in real IR evaluation settings most of the documents to be judged are non-relevant.

Placing documents known to be relevant in the first positions will both prime workers on relevance and allow for training.

While in a real setting it is not possible to put the relevant documents first (they are not known in advance), it is still possible to order documents by attributes that indicate likely relevance (e.g., retrieval rank, number of IR systems retrieving the document), thus presenting to the workers first the documents with the highest probability of being relevant, as in the sketch below.
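A minimal sketch of that heuristic follows; the field names retrieval_rank and n_systems are assumptions used for illustration.

```python
# Order the judging queue by signals that correlate with relevance
# (field names are illustrative assumptions).
docs = [
    {"id": "d1", "retrieval_rank": 12, "n_systems": 3},
    {"id": "d2", "retrieval_rank": 1,  "n_systems": 9},
    {"id": "d3", "retrieval_rank": 5,  "n_systems": 6},
]
# More retrieving systems first, then better (lower) retrieval rank.
queue = sorted(docs, key=lambda d: (-d["n_systems"], d["retrieval_rank"]))
print([d["id"] for d in queue])    # ['d2', 'd3', 'd1']
```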

For more, see our paper, The Effect of Class Imbalance and Order on Crowdsourced Relevance Judgments.

Rehab K. Qarout, Information School, University of Sheffield
Alessandro Checco, Information School, University of Sheffield
Gianluca Demartini, Information School, University of Sheffield