Let’s Agree to Disagree: Fixing Agreement Measures for Crowdsourcing

In the context of micro-task crowdsourcing, each task is usually performed by several workers. This allows researchers to leverage measures of the agreement among workers on the same task, to estimate the reliability of collected data and to better understand answering behaviors of the participants.

While many measures of agreement between annotators have been proposed, they are known to suffer from many problems and abnormalities. In this work, we identify the main limits of the existing agreement measures in the crowdsourcing context, using both toy examples and real-world crowdsourcing data, and propose a novel agreement measure based on probabilistic parameter estimation which overcomes these limits. We validate our new agreement measure and show its flexibility compared to the existing agreement measures.


The majority of agreement measures are borrowed from data reliability theory, where the reliability of a set of grouped measurements is assessed by comparing the inter-group and the intra-group variability, and where the judgments are typically made by a fixed set of assessors. In the context of crowdsourcing, these measures suffer from many problems when they are used for estimating agreement instead of data reliability:

  1. The variability of judgments is typically higher when the judgments concentrate around the center of the scale. This problem is intrinsic to finite scale judgments and can lead to overestimating disagreement over items where the truth concentrates around the scale boundaries.
  2. The values around which judgments concentrate (if any) can differ from item to item. This can lead to overestimating the expected disagreement, and thus makes it more likely that the data will be considered random.
  3. For some items a ground truth (e.g., ‘gold questions’ in crowdsourcing) might be present, that is, a value around which judgments are expected to concentrate. This information is typically not used by classic agreement measures.
  4. The global variability-based correction by chance leads to many idiosyncrasies in the existing measures, making them hard to use in a crowdsourcing setting.
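
To make the first point concrete, suppose judgments are rescaled to the interval [0, 1]. The rescaling and the symbols X and μ are our notation for this illustration, not taken from the paper. A standard bound then shows how much room for variability each position on the scale leaves:

```latex
X \in [0,1],\quad \mathbb{E}[X] = \mu
\;\Longrightarrow\;
\operatorname{Var}(X) = \mathbb{E}[X^2] - \mu^2 \;\le\; \mathbb{E}[X] - \mu^2 = \mu(1-\mu) \;\le\; \tfrac{1}{4}
```

The inequality uses only that X² ≤ X on [0, 1]. The bound peaks at μ = 1/2 and shrinks to zero as μ approaches either boundary, so judgments concentrated near a boundary necessarily have a small variance, whatever the workers' behavior.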

Our goal in this paper is to address the aforementioned issues, and to build a framework more suitable to estimate worker agreement over a group of tasks in a crowdsourcing context.


The intuition behind Φ is connected with the definition of agreement: we consider as agreement the amount of concentration around a data value. Conversely, if the data does not concentrate around a value then we have disagreement (negative agreement in our measure), which can be stronger or weaker depending on how polarized the different opinions are. In more detail, our approach can be described as fitting a distribution to the histogram of the judgments and then measuring the dispersion of that distribution.

It is important to notice that the fitting distribution has to be general enough to capture the main behaviors that might occur: flat (random judgments), bell-shaped (agreement), J-shaped (agreement around a value on the boundary of the scale), and U-shaped (disagreement), as shown in the following figure.

Agreement model examples
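
The four shapes correspond to simple regions of the Beta parameter space. The sketch below names the shape of a Beta(a, b) density. The function names and the mean/concentration parametrization (concentration = a + b) are chosen here for illustration and are not the definition of Φ used in the paper.

```python
def beta_shape(a: float, b: float) -> str:
    """Name the qualitative shape of a Beta(a, b) density on (0, 1)."""
    if a <= 0 or b <= 0:
        raise ValueError("Beta parameters must be positive")
    if a == 1 and b == 1:
        return "flat"          # uniform density: random judgments
    if a > 1 and b > 1:
        return "bell-shaped"   # single interior mode: agreement
    if a < 1 and b < 1:
        return "U-shaped"      # mass piles up at both ends: disagreement
    return "J-shaped"          # monotone density: agreement at a boundary


def from_mean_concentration(mean: float, concentration: float) -> tuple[float, float]:
    """Convert (mean, concentration = a + b) into the usual (a, b) parameters."""
    return mean * concentration, (1.0 - mean) * concentration


# Hypothetical parameter values, one per shape.
for mean, concentration in [(0.5, 2.0), (0.5, 10.0), (0.9, 3.0), (0.5, 0.5)]:
    a, b = from_mean_concentration(mean, concentration)
    print(f"mean={mean}, concentration={concentration}: {beta_shape(a, b)}")
```

With the mean held at the center of the scale, a single dispersion-like parameter is enough to move from disagreement (concentration below 2) through random judgments (exactly 2) to agreement (above 2).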

At the same time, the desired distribution has to have a minimal number of parameters, to avoid overfitting. For this reason, we use a Beta distribution to perform the fit: Φ is a transformed parameter of the Beta distribution fitted to the histogram of the collected answers. This parameter is related to the standard deviation of the fitted distribution, with the difference that here we account for the finiteness of the rating scale, and thus we adjust for the tendency toward lower dispersion when the data concentrates around a value on the boundaries of the rating scale. For example, if we imagine a scenario where assessors add random Gaussian noise to the ground truth when making a judgment, we can immediately see that the dispersion will be minimal when the ground truth is on the boundary of the scale, because any Gaussian noise that would result in a judgment outside the boundary is clipped.
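
The clipping effect in this scenario can be checked with a short simulation. This is a sketch under assumed settings (a continuous 1 to 5 scale, unit noise standard deviation, 20,000 simulated judgments per ground-truth value), none of which come from the paper.

```python
import random
import statistics


def simulate_judgments(truth, noise_sd, low, high, n, rng):
    """Draw n judgments: ground truth plus Gaussian noise, clipped to [low, high]."""
    return [min(high, max(low, rng.gauss(truth, noise_sd))) for _ in range(n)]


rng = random.Random(0)                       # fixed seed for reproducibility
LOW, HIGH, NOISE_SD, N = 1.0, 5.0, 1.0, 20000  # assumed scale and noise level

spread_by_truth = {}
for truth in (1.0, 2.0, 3.0, 4.0, 5.0):
    judgments = simulate_judgments(truth, NOISE_SD, LOW, HIGH, N, rng)
    spread_by_truth[truth] = statistics.pstdev(judgments)
    print(f"truth={truth}: standard deviation = {spread_by_truth[truth]:.2f}")
```

The assessors behave identically for every ground-truth value, yet the measured standard deviation is smallest at the two ends of the scale and largest in the middle. A raw dispersion measure would therefore report more agreement on boundary items without any change in worker behavior.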

The strength of our approach becomes apparent when applied to a group of items to be judged: in the case of relevance judgment tasks, each item i is allowed to have a different average relevance value, while the agreement among workers is defined as the common Φ that best explains the judgment data.

This lets us solve the problems that arise in the other agreement measures when they correct for chance by using the dispersion of the whole dataset as a normalizing factor.


In the following figure we show the inference results for the judgments of 17 documents. We generated a small synthetic dataset in which the first document has an outlier on the right boundary and the other 16 documents show clear central agreement (as the figure shows, documents 2-5 are replicated four times to obtain these 16 higher-agreement documents). The model is forced to find the single agreement level (dispersion of the Beta distribution) that collectively explains all the data: while document 1 alone would have been fitted with a high-disagreement (U-shaped) Beta, the most probable Beta for the whole dataset is one in which the first document has an outlier. This reflects the way we perceive the agreement level as humans, especially with a small set of data samples, and yields a robust estimate of agreement for a group of documents.
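
The joint fit can be imitated on a small scale. The sketch below is not the paper's inference procedure: it fixes each document's mean at its sample mean and grid-searches one shared Beta concentration (a + b) by maximum likelihood, reporting that concentration rather than Φ. The ratings, the mapping of a 5-point scale to (0, 1), and all function names are hypothetical.

```python
import math


def beta_log_pdf(x, a, b):
    """Log-density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x) - log_norm


def log_likelihood(documents, concentration):
    """Sum of Beta log-likelihoods: per-document mean, one shared concentration."""
    total = 0.0
    for judgments in documents:
        mean = sum(judgments) / len(judgments)
        a, b = mean * concentration, (1.0 - mean) * concentration
        total += sum(beta_log_pdf(x, a, b) for x in judgments)
    return total


def best_concentration(documents, grid):
    """Grid value of the shared concentration with the highest likelihood."""
    return max(grid, key=lambda c: log_likelihood(documents, c))


def rescale(rating, levels=5):
    """Map a rating in 1..levels to the midpoint of its bin inside (0, 1)."""
    return (rating - 0.5) / levels


# Hypothetical data: one document with an outlier on the right boundary,
# plus four consistent documents replicated four times (17 documents in all).
outlier_doc = [rescale(r) for r in (2, 2, 2, 5)]
central_docs = [[rescale(r) for r in doc]
                for doc in ((3, 3, 4, 3), (2, 3, 3, 3), (3, 4, 4, 3), (2, 3, 3, 2))]
documents = [outlier_doc] + central_docs * 4

grid = [0.5 + 0.1 * i for i in range(600)]   # candidate concentrations
alone = best_concentration([outlier_doc], grid)
pooled = best_concentration(documents, grid)
print(f"document 1 alone: concentration = {alone:.1f}")
print(f"all 17 documents: concentration = {pooled:.1f}")
```

On data like these, the concentration fitted to the whole collection comes out above the one fitted to the first document alone, because the sixteen consistent documents outweigh the single outlier.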



You can test our tool (as shown in the snapshot below) and access the source code at this link.

Online tool

For more information, see our full paper, Let’s Agree to Disagree: Fixing Agreement Measures for Crowdsourcing

Alessandro Checco, Information School, University of Sheffield


Report: Second GroupSight Workshop on Human Computation for Image and Video Analysis

What would be possible if we could accelerate the analysis of images and videos, especially at scale? This question is generating widespread interest across research communities as diverse as computer vision, human computer interaction, computer graphics, and multimedia.

The second Workshop on Human Computation for Image and Video Analysis (GroupSight) took place in Quebec City, Canada on October 24, 2017, as part of HCOMP 2017. The goal of the workshop was to promote greater interaction between this diversity of researchers and practitioners who examine how to mix human and computer efforts to convert visual data into discoveries and innovations that benefit society at large.

This was the second edition of the GroupSight workshop to be held at HCOMP. It was also the first time the workshop and conference were co-located with UIST. A website and blog post on the first edition of GroupSight are also available.

The workshop featured two keynote speakers in HCI doing research on crowdsourced image analysis. Meredith Ringel Morris (Microsoft Research) presented work on combining human and machine intelligence to describe images to people with visual impairments (slides). Walter Lasecki (University of Michigan) discussed projects using real-time crowdsourcing to rapidly and scalably generate training data for computer vision systems.

Participants also presented papers along three emergent themes:

Leveraging the visual capabilities of crowd workers:

  • Abdullah Alshaibani and colleagues at Purdue University presented InFocus, a system enabling untrusted workers to redact potentially sensitive content from imagery. (Best Paper Award)
  • Kyung Je Jo and colleagues at KAIST presented Exprgram (paper, video). This paper introduced a crowd workflow that supports language learning while annotating and searching videos. (Best Paper Runner-Up Award)
  • GroundTruth (paper, video), a system by Rachel Kohler and colleagues at Virginia Tech, combined expert investigators and novice crowds to identify the precise geographic location where images and videos were created.

Kurt Luther hands the best paper award to Alex Quinn.

Creating synergies between crowdsourced human visual analysis and computer vision:

  • Steven Gutstein and colleagues from the U.S. Army Research Laboratory presented a system that integrated a brain-computer interface with computer vision techniques to support rapid triage of images.
  • Divya Ramesh and colleagues from CloudSight presented an approach for real-time captioning of images by combining crowdsourcing and computer vision.

Improving methods for aggregating results from crowdsourced image analysis:

  • Jean Song and colleagues at the University of Michigan presented research showing that tool diversity can improve aggregate crowd performance on image segmentation tasks.
  • Anuparna Banerjee and colleagues at UT Austin presented an analysis of ways that crowd workers disagree in visual question answering tasks.

The workshop also had break-out groups where participants used a bottom-up approach to identify topical clusters of common research interests and open problems. These clusters included real-time crowdsourcing, worker abilities, applications (to computer vision and in general), and crowdsourcing ethics.

A group of researchers talking and seated around a poster board covered in sticky notes.

For more, including keynote slides and papers, check out the workshop website: https://groupsight.github.io/

Danna Gurari, UT Austin
Kurt Luther, Virginia Tech
Genevieve Patterson, Brown University and Microsoft Research New England
Steve Branson, Caltech
James Hays, Georgia Tech
Pietro Perona, Caltech
Serge Belongie, Cornell Tech


Crowdsourcing the Location of Photos and Videos

How can crowdsourcing help debunk fake news and prevent the spread of misinformation? In this paper, we explore how crowds can help expert investigators verify the claims around visual evidence they encounter during their work.

A key step in image verification is geolocation, the process of identifying the precise geographic location where a photo or video was created. Geotags or other metadata can be forged or missing, so expert investigators will often try to manually locate the image using visual clues, such as road signs, business names, logos, distinctive architecture or landmarks, vehicles, and terrain and vegetation.

However, sometimes there are not enough clues to make a definitive geolocation. In these cases, the expert will often draw an aerial diagram, such as the one shown below, and then try to find a match by analyzing miles of satellite imagery.

An aerial diagram of a ground-level photo, and the corresponding satellite imagery of that location.

Source: Bellingcat

This can be a very tedious and overwhelming task – essentially finding a needle in a haystack. We proposed that crowdsourcing might help, because crowds have good visual recognition skills and can scale up, and satellite image analysis can be highly parallelized. However, novice crowds would have trouble translating the ground-level photo or video into an aerial diagram, a process that experts told us requires lots of practice.

Our approach to solving this problem was right in front of us: what if crowds also use the expert’s aerial diagram? The expert was going to make the diagram anyway, so it’s no extra work for them, but it would allow novice crowds to bridge the gap between ground-level photo and satellite imagery.

To evaluate this approach, we conducted two experiments. The first experiment looked at how the level of detail in the aerial diagram affected the crowd’s geolocation performance. We found that in only ten minutes, crowds could consistently narrow down the search area by 40-60%, while missing the correct location only 2-8% of the time, on average.


In our second experiment, we looked at whether to show crowds the ground-level photo, the aerial diagram, or both. The results confirmed our intuition: the aerial diagram was best. When we gave crowds just the ground-level photo, they missed the correct location 22% of the time – not bad, but probably not good enough to be useful, either. On the other hand, when we gave crowds the aerial diagram, they missed the correct location only 2% of the time – a game-changer.

Bar chart showing the diagram condition performed significantly better than the ground photo condition.

For next steps, we are building a system called GroundTruth (video) that brings together experts and crowds to support image geolocation. We’re also interested in ways to synthesize our crowdsourcing results with recent advances in image geolocation from the computer vision research community.

For more, see our full paper, Supporting Image Geolocation with Diagramming and Crowdsourcing, which received the Notable Paper Award at HCOMP 2017.

Rachel Kohler, Virginia Tech
John Purviance, Virginia Tech
Kurt Luther, Virginia Tech

Call for Participation: GroupSight 2017

The Second Workshop on Human Computation for Image and Video Analysis (GroupSight) will be held on October 24, 2017 at AAAI HCOMP 2017 in Québec City, Canada. This promises an exciting mix of people and papers at the intersection of HCI, crowdsourcing, and computer vision.

The aim of this workshop is to promote greater interaction between the diversity of researchers and practitioners who examine how to mix human and computer efforts to convert visual data into discoveries and innovations that benefit society at large. It will foster in-depth discussion of technical and application issues for how to engage humans with computers to optimize cost/quality trade-offs. It will also serve as an introduction to researchers and students curious about this important, emerging field at the intersection of crowdsourced human computation and image/video analysis.

Topics of Interest

Crowdsourcing image and video annotations (e.g., labeling methods, quality control, etc.)
Humans in the loop for visual tasks (e.g., recognition, segmentation, tracking, counting, etc.)
Richer modalities of communication between humans and visual information (e.g., language, 3D pose, attributes, etc.)
Semi-automated computer vision algorithms
Active visual learning
Studies of crowdsourced image/video analysis in the wild

Submission Details

Submissions are requested in the following two categories: Original Work (not published elsewhere) and Demo (describing new systems, architectures, interaction techniques, etc.). Papers should be submitted as 4-page extended abstracts (including references) using the provided author kit. Demos should also include a URL to a video (max 6 min). Multiple submissions are not allowed. Reviewing will be double-blind.
Previously published work from a recent conference or journal can also be considered, but the authors should submit an unrevised copy of their published work. Reviewing of these submissions will be single-blind. Email submissions to groupsight@outlook.com.

Important Dates

August 23, 2017 (extended from August 14): Deadline for paper submission (5:59 pm EDT)
August 25, 2017: Notification of decision
October 24, 2017: Workshop (full-day)