Two Tools are Better Than One: Tool Diversity as a Means of Improving Aggregate Crowd Performance

Human intelligence provides a new means of scaffolding the creation and training of AI systems. Recently, the rise of crowdsourcing marketplaces has made it possible to access human intelligence more scalably and flexibly than ever before. However, one of the biggest concerns with crowdsourcing is that the contributed work is often unreliable.

To increase reliability, prior work has frequently used task decomposition to generate smaller, simpler, less error-prone microtasks. Additionally, since microtasks are easier to obtain agreement over, consensus-based aggregation can be used to output a reasonable single answer from diverse individual worker responses. However, there are limits to how much a task can be decomposed into smaller pieces. Furthermore, there has been little research on how to deal with systematic error biases that can be induced by a tool’s design. Systematic error biases are defined as shared error patterns among workers using a single tool, which can be problematic because they can persist even after decomposition or aggregation of a task.

In this paper, we propose to leverage tool diversity to overcome the limits of microtasking. We contribute the insight that, given a diverse set of tools, aggregating answers across tools can improve collective performance by offsetting systematic biases.

Our recent study shows the effectiveness of leveraging tool diversity, particularly in semantic image segmentation tasks, where the task is to find a requested object in a scene and draw a tight boundary around it to demarcate it from the background. This is an important problem in computer vision that allows systems to be trained to better understand scenes.

For our experiments, we built four different image segmentation tools and evaluated their segmentation quality in three different aggregation conditions: 1) single-tool aggregation using majority voting, which serves as a baseline; 2) two-tool aggregation using majority voting; and 3) two-tool aggregation using expectation maximization (EM), to see whether this well-known optimization method can effectively integrate answers across different tools.
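To make the three conditions concrete, here is a minimal sketch in Python. It is not the implementation from the paper: it assumes each worker response is a flattened binary mask (1 = object pixel, 0 = background), and the EM variant is a simplified "one-coin" model that estimates a single accuracy per worker. The names `majority_vote`, `em_aggregate`, `tool_a`, and `tool_b` are hypothetical and chosen only for illustration.

```python
from typing import List

Mask = List[int]  # flattened binary mask: 1 = object pixel, 0 = background


def _clamp(x: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Keep probabilities away from 0 and 1 so likelihoods never vanish."""
    return min(max(x, lo), hi)


def majority_vote(masks: List[Mask]) -> Mask:
    """Mark a pixel as object when strictly more than half of the masks do."""
    n = len(masks)
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*masks)]


def em_aggregate(masks: List[Mask], iterations: int = 20) -> Mask:
    """Simplified one-coin EM (an assumption, not the paper's exact model)."""
    n = len(masks)
    # Initialise soft labels with the fraction of masks voting "object".
    posterior = [sum(votes) / n for votes in zip(*masks)]
    for _ in range(iterations):
        # M-step: a worker's accuracy is its expected agreement with the
        # current soft labels; the prior is the expected object fraction.
        accuracies = []
        for mask in masks:
            agree = sum(p if m == 1 else 1 - p for m, p in zip(mask, posterior))
            accuracies.append(_clamp(agree / len(mask)))
        prior = _clamp(sum(posterior) / len(posterior))
        # E-step: recompute each pixel's probability of being object,
        # weighting every vote by the accuracy of the worker who cast it.
        new_posterior = []
        for votes in zip(*masks):
            like_obj, like_bg = prior, 1 - prior
            for v, acc in zip(votes, accuracies):
                like_obj *= acc if v == 1 else 1 - acc
                like_bg *= acc if v == 0 else 1 - acc
            new_posterior.append(like_obj / (like_obj + like_bg))
        posterior = new_posterior
    return [1 if p > 0.5 else 0 for p in posterior]


# Hypothetical responses: two workers on tool A, one worker on tool B.
tool_a = [[1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0]]
tool_b = [[1, 1, 0, 0, 0, 1]]

single_tool = majority_vote(tool_a)           # condition 1 (baseline)
two_tool_mv = majority_vote(tool_a + tool_b)  # condition 2
two_tool_em = em_aggregate(tool_a + tool_b)   # condition 3
```

In this sketch, the two-tool conditions simply pool the masks from both tools before aggregating. Majority voting weights every response equally, whereas the EM variant down-weights workers who disagree with the emerging consensus.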

Overall, two-tool aggregation improved F1 scores (the harmonic mean of recall and precision) compared to single-tool aggregation, especially when the paired tools had precision and recall trade-offs. In most cases, EM-based aggregation significantly improved the performance of the tool pairs compared to uniform majority voting. F1 scores for the different tool pairings are summarized in the figure below. Our results suggest that both the individual tools’ designs and the aggregation method can affect the performance of multi-tool aggregation.
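As a small illustration of the metric and of a precision/recall trade-off, the sketch below computes pixel-wise precision, recall, and F1 for binary masks. The masks are made-up toy data, not results from the study, and `precision_recall_f1` is a hypothetical helper name.

```python
from typing import List, Tuple


def precision_recall_f1(predicted: List[int], truth: List[int]) -> Tuple[float, float, float]:
    """Pixel-wise precision, recall, and F1 for flattened binary masks."""
    tp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


truth = [0, 0, 1, 1, 1, 1, 0, 0]

# A mask that spills past the object: perfect recall, lower precision.
over_segmented = [0, 1, 1, 1, 1, 1, 1, 0]
# A mask that stays inside the object: perfect precision, lower recall.
under_segmented = [0, 0, 0, 1, 1, 0, 0, 0]

p_over, r_over, f1_over = precision_recall_f1(over_segmented, truth)
p_under, r_under, f1_under = precision_recall_f1(under_segmented, truth)
# over_segmented:  precision 2/3, recall 1.0, F1 0.8
# under_segmented: precision 1.0, recall 0.5, F1 2/3
```

The two toy masks err in opposite directions, which is the kind of trade-off where combining answers across tools has room to help.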

Our findings open up new opportunities and directions for gaining a deeper understanding of how tool designs influence aggregate performance on diverse crowdsourcing tasks, and introduce a new way of thinking about decomposing tasks: based on tools instead of subtasks. We suggest that system designers consider the following when trying to leverage tool diversity, that is, when using multiple tools for their applications:

  • The expected errors (or biases) from human participants should be distributed differently across tools. This way, the diverse tool set can complement a broad range of error (or bias) types.
  • The task should have an objectively correct answer that is tractable enough for workers to provide. For example, live captioning or text annotation may be amenable to the tool diversity approach. On the other hand, tasks like creative writing are unlikely to benefit from our approach because the expected answer is subjective.
  • The task should tolerate imperfections in workers’ answers. For example, the live captioning task used in Scribe tolerates imperfections because typos or a few missing words do not cause complete task failure.

Future work may investigate ways to generalize methodologies for leveraging tool diversity in other domains, such as video coding, annotation of fine-grained categories, and activity recognition. Furthermore, this approach may open a new way of optimizing the combined effort of humans and computers, treating them as different resources with different systematic error biases, to leverage the best of both worlds.

For more details, please read our full paper, Two Tools are Better Than One: Tool Diversity as a Means of Improving Aggregate Crowd Performance, which received the Best Student Paper Award Honorable Mention at IUI 2018.

Jean Y. Song, University of Michigan, Ann Arbor
Juho Kim, KAIST, Republic of Korea
Walter S. Lasecki, University of Michigan, Ann Arbor