Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments

When collecting human ratings or labels for items, it can be difficult to measure and ensure data quality, due both to task subjectivity (i.e., the lack of a gold standard against which answers can be easily checked) and to a lack of transparency into how individual judges arrive at their submitted ratings. Using a paid microtask platform like Mechanical Turk to collect data introduces further challenges: crowd annotators may be inexpert, minimally trained, and effectively anonymous, and only rudimentary channels may be provided for interacting with workers and observing work as it is performed.

To address these challenges, we investigate a very simple idea: requiring judges to provide a rationale supporting each rating decision they make. We explore this idea for the specific task of collecting relevance judgments for search engine evaluation. A relevance judgment rates how relevant a given web page is to a particular search query. We define a rationale as a direct excerpt from the web page (e.g., one that a worker can easily provide by copying and pasting or by highlighting text).

To illustrate the idea, we show a simple example below in which a user is assumed to be searching for information about the symptoms of jaundice, and a worker is asked to rate the relevance of the Wikipedia page on jaundice. The worker might rate the page as relevant and support that decision by highlighting an excerpt describing jaundice symptoms, as shown.

[Example task: the query "What are the symptoms of jaundice?" shown alongside the Wikipedia page on jaundice, with a supporting excerpt highlighted.]
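
To make the unit of collected data concrete, here is a minimal sketch (in Python) of what a single rationale-backed relevance judgment might look like. The field names and the excerpt text are purely illustrative assumptions; the paper does not prescribe any particular schema:

```python
# A minimal sketch of one rationale-backed relevance judgment.
# Field names and the excerpt are illustrative, not the schema used in the paper.
judgment = {
    "query": "What are the symptoms of jaundice?",
    "page_url": "https://en.wikipedia.org/wiki/Jaundice",
    "relevance": "relevant",       # the worker's rating
    "rationale": (                 # verbatim excerpt copied/highlighted from the page
        "Jaundice is a yellowish pigmentation of the skin and the whites of the eyes."
    ),
    "worker_id": "W123",
}
```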

While this idea of rationales is quite simple and easy to implement — good things for translating ideas from research to practice! — rationales appear to be remarkably useful, offering a myriad of benefits:

  • Rationales enhance transparency and value. Traditional labor models offer high-quality interpersonal interaction and transparency, but at significant expense. Microtasking, on the other hand, greatly improves efficiency and reduces cost, but at the price of low transparency and poor communication channels. Rationales represent a middle ground between these extremes: textual excerpts constitute a very lightweight form of worker-to-requester communication that enhances transparency. Moreover, because rationales make the data more interpretable, they increase the value of the collected data. This makes rationales useful to collect from traditional annotators as well as crowd annotators, and beyond ensuring initial data quality, rationales increase the enduring value of the collected data for all future users.
  • Rationales enhance reliability and accountability. Identifying a focused context that explains annotator decisions as they are made is generally helpful, and especially so for subjective tasks without clear answers. Because rationales make work more verifiable, arbitrary answers can no longer be given without justification; this makes annotators more accountable for their work and reduces any temptation to rush or cheat. In addition, rationales let us recognize legitimate alternative answers, e.g., in cases of disagreement between raters, or when rating decisions would seem surprising or unintuitive without explanation. Moreover, just as rationales allow requesters to understand worker thought processes, they also enable sequential task designs in which workers refine or refute one another's reasoning (see the sketch after this list). Such sequential designs are critical for minimizing reliance on "experts" to verify data quality when scaling up data collection.
  • Rationales enhance inclusivity. We hypothesized that the improved transparency and accountability provided by rationales would let us remove all traditional barriers to participation without compromising data quality. By allowing anyone interested to work, we support greater scalability of data collection, greater diversity of responses, and equal-opportunity access to income. Our experiments on Mechanical Turk imposed no worker qualifications restricting who could perform our tasks, and we were still able to collect high-quality data.
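
As a concrete illustration of the sequential design mentioned above, here is a minimal Python sketch of a two-stage workflow in which a second worker reviews the first worker's rating and rationale and either confirms or overrides it. The stage logic and names are assumptions made for illustration, not the exact task design from the paper:

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    relevance: str    # e.g., "relevant" / "non-relevant"
    rationale: str    # verbatim excerpt supporting the rating
    worker_id: str

def review_stage(initial: Judgment, reviewer_rating: str,
                 reviewer_rationale: str, reviewer_id: str) -> Judgment:
    """A second-stage worker sees the first worker's rating and rationale,
    then either confirms it or overrides it with their own judgment."""
    if reviewer_rating == initial.relevance:
        # Agreement: the reviewer confirms the original judgment.
        return initial
    # Disagreement: the reviewer's judgment, backed by a new rationale,
    # replaces the original one.
    return Judgment(reviewer_rating, reviewer_rationale, reviewer_id)
```

The key property is that the reviewer reasons about a concrete excerpt rather than a bare label, so verification does not depend on access to a relevance expert.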

It gets even better. Despite the extra work seemingly required to collect rationales in addition to relevance judgments, in practice rationales require no extra time (and therefore no extra cost) to collect! Over a series of experiments collecting nearly 10,000 relevance judgments, we found that workers produce higher-quality relevance judgments simply by being asked to provide a rationale, and prolific workers need virtually no extra time to provide rationales alongside their ratings. Intuitively, rationales appear to simply make explicit the implicit cognitive work inherent in providing ratings: workers already have to find relevant text to perform the rating task, and asking them to report that text requires essentially no extra work.

Finally, it’s intuitive that workers who select similar rationales are likely to produce similar relevance judgments, and we can exploit such overlap between worker rationales when aggregating judgments to further improve data quality.
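
As one concrete illustration, below is a minimal Python sketch of how rationale overlap could be folded into aggregation: each judgment's vote is weighted by how much its rationale (as a set of tokens) overlaps with the other workers' rationales for the same query/page pair, and the label with the largest total weight wins. This is a sketch under those assumptions, not the exact aggregation scheme evaluated in the paper:

```python
from collections import defaultdict

def _tokens(text: str) -> set:
    """Crude tokenization: lowercase, whitespace-split."""
    return set(text.lower().split())

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def aggregate(judgments):
    """judgments: list of (label, rationale_text) tuples for one query/page pair.
    Returns the label whose supporting rationales overlap most with the others'."""
    token_sets = [_tokens(rationale) for _, rationale in judgments]
    scores = defaultdict(float)
    for i, (label, _) in enumerate(judgments):
        # Weight each vote by its average rationale overlap with the other votes.
        overlaps = [jaccard(token_sets[i], token_sets[j])
                    for j in range(len(judgments)) if j != i]
        weight = sum(overlaps) / len(overlaps) if overlaps else 1.0
        scores[label] += weight
    return max(scores, key=scores.get)
```

Under this kind of scheme, several judgments whose rationales quote the same passage reinforce one another, while an outlier rating backed by an unrelated excerpt carries little weight.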

In sum, our results suggest that rationales may be a remarkably simple yet powerful tool for crowdsourced data collection, particularly for difficult or subjective tasks. There’s much more to our study, so we encourage you to read our full paper! The link to our paper is below, along with links to the presentation slides and our collected data.

Tyler McDonnell, Matthew Lease, Mucahid Kutlu, and Tamer Elsayed. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments. In Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), pages 139-148, 2016. Best Paper Award. [ Slides & Data ]