Collective human computation – presenting objective questions to multiple humans and collecting their judgments – is a powerful and increasingly popular paradigm for performing computational tasks beyond the reach of today’s algorithms. From image classification to data validation, the human computer is making a comeback.
But how should we measure the performance of a human doing a computational task? Speed without accuracy is worthless, and accuracy itself is hard to measure in classification or estimation tasks in which a close-to-correct judgment still has value.
I assert that the value of a judgment is the amount by which it reduces the surprise of learning the correct answer to a question. This is a basic concept in information theory: the pointwise mutual information between the judgment and the answer.
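As a minimal sketch of this idea (the function name and probabilities here are illustrative, not from the paper): the pointwise mutual information of a judgment with the answer is the difference between the surprisal of the answer before and after seeing the judgment.

```python
import math

def pmi_bits(p_answer: float, p_answer_given_judgment: float) -> float:
    """Pointwise mutual information in bits: the reduction in surprise
    about the answer after seeing the judgment.
    Surprise before: -log2 P(a); surprise after: -log2 P(a|j)."""
    return math.log2(p_answer_given_judgment) - math.log2(p_answer)

# Four equally likely categories: prior P(a) = 0.25.
# A judgment that pins down the answer exactly is worth log2(1/0.25) = 2 bits.
print(pmi_bits(0.25, 1.0))   # 2.0
# A judgment that only narrows the answer to P(a|j) = 0.5 is worth 1 bit.
print(pmi_bits(0.25, 0.5))   # 1.0
```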
For example, a classification problem with four equally likely categories has an entropy of 2 bits per question. If you correctly classify a series of objects, you’re giving the full 2 bits of information for each. If you’re a spammer giving judgments that are statistically independent of the correct categories, you’re giving zero information no matter what your spamming strategy is.
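The two extremes above can be checked numerically. This is a small sketch under assumed joint distributions (not code from the paper): a perfect classifier over four equally likely categories yields 2 bits of mutual information per judgment, while a statistically independent spammer yields zero.

```python
import math

def mutual_information_bits(joint):
    """Mutual information I(J; A) in bits, from a joint distribution
    P(judgment, answer) given as a 2-D list (rows = judgments)."""
    pj = [sum(row) for row in joint]            # marginal over judgments
    pa = [sum(col) for col in zip(*joint)]      # marginal over answers
    mi = 0.0
    for j, row in enumerate(joint):
        for a, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (pj[j] * pa[a]))
    return mi

# Perfect classifier: judgment always equals the answer.
perfect = [[0.25 if j == a else 0.0 for a in range(4)] for j in range(4)]
# Spammer: judgments independent of the answer.
spammer = [[0.25 * 0.25 for _ in range(4)] for _ in range(4)]
print(mutual_information_bits(perfect))  # 2.0
print(mutual_information_bits(spammer))  # 0.0
```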
Thus, the net value of a contributor’s judgments is the total amount of information they give us, a well-defined extensive quantity that we can measure in bits (or nats or digits, if you please).
This metric has the advantages of being naturally motivated, task- and model-agnostic, and free of tuning parameters, and it plugs easily into any resolution algorithm that models contributors and answers as random variables.
Expected values (i.e., entropy) can be used to predict a contributor’s performance on a given question, conditioned on what’s already known about that question. Contributors can then be preferentially given the questions on which they’re likely to be most informative. Applying this technique to data from Galaxy Zoo 2 (a crowdsourced deep-field galaxy classification project, part of the Zooniverse program), I demonstrated a substantial improvement in accuracy over random assignment of questions to contributors.
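A routing step along these lines might look like the following sketch. The prior and the per-contributor confusion matrices are hypothetical, and this is not the paper's actual assignment algorithm; it simply sends a question to whichever contributor has the highest expected information given the current posterior over the answer.

```python
import math

def expected_information_bits(prior, confusion):
    """Expected information (bits) a contributor's judgment carries about
    a question, given the current posterior over answers (`prior`) and the
    contributor's confusion matrix P(judgment | answer). This is the
    mutual information I(J; A) under that prior."""
    n_j = len(confusion[0])
    pj = [sum(prior[a] * confusion[a][j] for a in range(len(prior)))
          for j in range(n_j)]
    mi = 0.0
    for a, pa in enumerate(prior):
        for j in range(n_j):
            p = pa * confusion[a][j]
            if p > 0:
                mi += p * math.log2(p / (pa * pj[j]))
    return mi

# Hypothetical binary question, currently a coin flip:
prior = [0.5, 0.5]
accurate = [[0.9, 0.1], [0.1, 0.9]]   # usually right
spammer  = [[0.5, 0.5], [0.5, 0.5]]   # independent of the answer
# Route the question to the contributor expected to tell us the most.
best = max([accurate, spammer],
           key=lambda c: expected_information_bits(prior, c))
```

Note that as the posterior over a question's answer sharpens, every contributor's expected information on that question shrinks, so the same rule naturally stops spending judgments on questions that are already resolved.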
Finally, we can measure the cost-effectiveness of the judgment collection process or the information efficiency of the resolution algorithm in terms of the total information received from contributors. Related metrics can be used to measure the overlap in information between two contributors or the information wasted by collecting redundant judgments.
The metrics I present can be mixed into any human computation resolution algorithm that uses a statistical model to turn judgments into answers: use the model’s estimated parameters to compute a set of conditional probabilities, then plug these into the definitions of the information-theoretic quantities. The paper includes worked examples for several models.
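For a flavor of what that plumbing looks like, here is a sketch under an assumed confusion-matrix model (the function and parameter values are illustrative, not one of the paper's worked examples): the model's prior over answers and the contributor's estimated confusion matrix yield the conditional probabilities needed to score a single judgment.

```python
import math

def judgment_value_bits(prior, confusion, judgment, true_answer):
    """Realized value of one judgment in bits: the pointwise mutual
    information between the judgment and the (later-revealed) answer,
    computed from a model's estimated parameters.
    prior[a]        -- model's P(answer = a)
    confusion[a][j] -- model's P(judgment = j | answer = a)."""
    p_j = sum(prior[a] * confusion[a][judgment] for a in range(len(prior)))
    # PMI(j; a) = log2( P(j | a) / P(j) )
    return math.log2(confusion[true_answer][judgment] / p_j)

# Hypothetical binary task: uniform prior, contributor right 80% of the time.
prior = [0.5, 0.5]
confusion = [[0.8, 0.2], [0.2, 0.8]]
correct = judgment_value_bits(prior, confusion, judgment=0, true_answer=0)
wrong = judgment_value_bits(prior, confusion, judgment=1, true_answer=0)
# A correct judgment earns positive information; a wrong one costs bits.
```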
For more, see the full paper:
Pay by the Bit: An Information-Theoretic Metric for Collective Human Judgment
Tamsyn P Waterhouse, Google Inc.