Paid Crowdsourcing as a Vehicle for Global Development

Indian Turker at Computer
There are reasons to believe that paid crowdsourcing could be of particular benefit to low-income workers in developing countries. Unlike most employment opportunities, online marketplaces do not require geographic co-location between employer and employee and the only criterion for employment is the ability to complete the task at hand successfully.


Sources of Variability and Adaptive Tasks

The quality of the judgments obtained through crowdsourcing can vary substantially. One source of variability is between-worker variability, which arises from individual differences (e.g., not all workers have the same skills). Much research has focused on this aspect, harnessing aggregation techniques to reduce the variability. For example, a voting scheme can be used to aggregate multiple judgments. This approach reduces sensitivity to outliers (e.g., poor judgments), which consequently reduces the total variance.
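As a minimal sketch of the voting scheme just described (an illustration, not the authors' actual pipeline), aggregating redundant judgments by majority vote might look like:

```python
from collections import Counter

def majority_vote(judgments):
    """Aggregate redundant labels for one item by majority vote.

    Ties are broken arbitrarily by Counter ordering; a real system
    would request extra judgments or use a smarter tie-break.
    """
    counts = Counter(judgments)
    label, _ = counts.most_common(1)[0]
    return label

# Three workers label the same item; the single outlier is outvoted.
print(majority_vote(["ham", "spam", "ham"]))  # -> ham
```

Because a single poor judgment can no longer determine the output, the aggregated label is far less sensitive to outliers than any individual worker's answer.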

Another type of variability, which has received less attention but which we have observed on Amazon Mechanical Turk (MTurk), is within-worker variability: the quality of an individual worker's judgments varies over time. For example, the following figure shows the quality of the judgments (as measured with the kappa statistic) obtained from MTurk for one of our classification tasks.
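For readers unfamiliar with the quality measure, here is a small sketch of Cohen's kappa, the standard chance-corrected agreement statistic, computed between a worker's labels and a reference labeling (a simplified illustration, not the authors' exact evaluation code):

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label lists.

    Returns 1.0 for perfect agreement and about 0.0 when agreement is
    no better than chance. Undefined when expected agreement is 1.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence of the two labelings.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories
    )
    return (observed - expected) / (1 - expected)
```

Tracking this statistic over a sliding window of a worker's judgments is one way to plot quality over time, as in the figure below.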

This figure indicates that the quality of judgments obtained cannot be considered to be fixed over time*. On one hand, this variation can be explained by the presence of a learning curve: the workers must first learn how to complete the task. On the other hand, the boredom effect can partly explain why the quality of the judgments tends to decrease over time.

We believe that a better understanding of the different factors behind within-worker variability could allow us to design adaptive interfaces that improve the quality of the results. For example, an interface could provide more guidelines and feedback to a worker who is at the bottom of the learning curve. Another example would be an interface that reduces the boredom effect by modifying the way content is presented (or even the content itself!). If you are interested in a more detailed description of the different sources of variability, or if you would like to read more about some ideas for adaptive interfaces, please read our position paper.

Gabriel Parent is a Masters student at the Language Technologies Institute at Carnegie Mellon University. His research focus is human-computer interaction through spoken dialog systems. He is also interested in human-based computation and incorporates crowdsourcing throughout his research.

*The time of day can also have an impact on quality, since different pools of workers are active at different times of the day (e.g., USA vs. India). However, this cannot completely explain the differences observed in our figure, since we looked at results for 25 consecutive hours, thus spanning one entire day.

Interactive Crowdsourcing

My work is about making the world more accessible through technology. Unfortunately, the building blocks aren’t quite there for me to build the tools that I want to create.

In my future, people with disabilities can live and work more independently, supported by reliable intelligent technology that helps them get things done. I imagine a blind person shopping by himself, receiving guidance from his smartphone as he finds and buys the products he wants; I imagine a deaf person taking part not only in a mainstream classroom but also in serendipitous conversations with her hearing peers when a professional interpreter is not available; and I imagine an older, physically-impaired grandmother living in her home longer because an inexpensive assistive robot fluidly adapts to help with new tasks as the environment and her needs change.

Getting there requires strong AI; or maybe just the crowd — likely a bit of both.

I care about two dimensions of crowdsourcing that I think need work and aren’t quite ready yet:

1) Real-Time Human Computation — An OCR program that can only accurately read the text of a menu an hour after a blind patron has left the restaurant is not useful; an assistive robot that must wait minutes between decisions cannot provide timely assistance. Yet, such delays are the norm with human computation today. We need to make human computation faster.

2) Value Sensitive Design — Technology needs to match the expectations of those using it, the workers making it possible, and the bystanders that might unwittingly find themselves in the lens of VizWiz. Tools should give users choices and feedback regarding latency, confidentiality, cost, etc.

I and others are working to enable real-time human computation. Our first project in this area was VizWiz, an iPhone application that lets blind people take a picture, speak a question, and receive answers back from Mechanical Turk in less than 30 seconds. I’m sure that we can make this even faster.

The latter is less clearly defined, more contextually dependent on the app in which it is used. Historically, most services for people with disabilities have adopted strict codes of confidentiality to deal with situations like this. As an example, sign language interpreters have a code, laid out by their professional organization, the Registry of Interpreters for the Deaf, that prevents them from interjecting their own comments into the conversation and from repeating information in conversations that they have interpreted. In fact, all the major professions that provide services to people with disabilities have developed codes of ethics that include confidentiality, respect for the customer, and responsibility to take on only jobs for which he or she has the necessary skills. Maybe we can learn from them.

I want to close with a picture that inspires me:

Pictures Taken By VizWiz Users

These are pictures taken by blind users of VizWiz, along with the questions they asked and the answers they received. VizWiz could be improved on a whole slew of dimensions, but I like this picture because it’s the first example of a technology that lets blind people ask whatever they want and reliably produces intelligent results.

Its answers were, of course, crowdsourced.

Jeffrey P. Bigham is an assistant professor at the University of Rochester. His research is about improving access for everyone, which leads him on adventures in HCI, AI, Human Computation, Disability Studies, and Systems. He tweets from @jeffbigham.

Universal human computation: crowdsourcing crowdsourcing

As a theoretical computer scientist, I like to think about human computation as a new class of computing in the vein of quantum and probabilistic computing. Unlike in almost all other frontiers of computer science, the state of practice in human computation has leapfrogged the development of appropriate algorithmic models. This is a great indication of crowdsourcing’s success as a discipline — after all, silicon computers took two decades to be constructed after mathematical computers were discovered, but we’re building working crowd-powered algorithms today even without good models of how the underlying systems should be represented.  This rapid progress has built up a theoretical gap that’s screaming for attention.

Where can theory inform crowdsourcing?  Most obviously, better mathematical abstractions can help us analyze the computational efficiency of the systems we build, independent of problematic effects like time of day or search order that confound our study of existing crowdsourcing platforms. Sometimes, picking the right mathematical models can tell us the right strategy to apply in designing crowdsourcing systems for free; for instance, probabilistic oracles and error-correcting codes are well-studied concepts with standard answers (on paper, at least!) for what constitutes an optimal solution and have a natural analogy with crowd platforms.  These areas deserve more attention, especially in areas like queueing theory and mechanism design, where we are beginning to see early successes in applying these notions to crowdsourcing.
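The "probabilistic oracle" analogy can be made concrete with a short calculation (my own illustration, not from the post): if each worker independently answers correctly with probability p, the chance that a majority of n workers is correct follows a binomial tail, exactly as in repetition codes.

```python
from math import comb

def majority_correct_prob(n_workers, p_correct):
    """P(a strict majority of n independent workers answers correctly),
    modeling each worker as a probabilistic oracle with accuracy p."""
    return sum(
        comb(n_workers, k)
        * p_correct**k
        * (1 - p_correct) ** (n_workers - k)
        for k in range(n_workers // 2 + 1, n_workers + 1)
    )

# Redundancy amplifies a 70%-accurate oracle to ~84% accuracy:
print(round(majority_correct_prob(5, 0.7), 3))  # -> 0.837
```

This is the standard answer "on paper": redundancy buys reliability at a known cost, which is exactly the kind of optimality analysis that error-correcting codes already provide.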

Another interesting idea is that the patterns of thinking used in theoretical computer science can guide us to new practical discoveries in human computation as we search for their analogues. In theoretical computer science, we make frequent use of the notion of universal models of computation.  Do analogous notions exist in human computation? Are there algorithms for crowdsourcing that are universal — in other words, capable of crowdsourcing anything that can be crowdsourced?

Along with collaborators Bjoern Hartmann and Matt Can, I’ve been implementing an attempt at a universal algorithm for crowdsourcing on Amazon’s Mechanical Turk as a system called Turkomatic.  The underlying idea of the algorithm is simple: we recursively treat the problems of task planning and decomposition — central challenges in human computation — as problems that can themselves be crowdsourced.  This is a potentially powerful approach; by abstracting away much of the complexity of task design, it reduces a wide variety of problems in crowdsourcing to one: how do we crowdsource the task design problem?  This question is challenging and requires careful control over how much of the design task is given to individual workers.  At the same time, it enables us to attack a potentially wider variety of tasks.
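The recursive idea can be sketched in a few lines. This is a hypothetical illustration of the general pattern, not Turkomatic's actual implementation; the `ask_crowd_*` callables are placeholders standing in for posted HITs.

```python
# Hypothetical sketch of recursive crowd task decomposition.
# Each ask_crowd_* callable represents a HIT posted to workers.

def solve(task, ask_crowd_is_simple, ask_crowd_solve,
          ask_crowd_split, ask_crowd_merge):
    """Recursively crowdsource planning, solving, and merging."""
    if ask_crowd_is_simple(task):
        return ask_crowd_solve(task)       # a worker does the task
    subtasks = ask_crowd_split(task)       # a worker decomposes it
    results = [
        solve(s, ask_crowd_is_simple, ask_crowd_solve,
              ask_crowd_split, ask_crowd_merge)
        for s in subtasks
    ]
    return ask_crowd_merge(task, results)  # a worker combines results
```

Note that planning (`ask_crowd_split`) and merging are themselves crowdsourced, which is what reduces the many task-design problems to a single one.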

Above: part of a crowd-generated workflow for a complex task, in this case creating a new blog on the web, with minimal requester intervention.

A side effect of having a general-purpose crowdsourcing interface available to us has been to make consulting the crowd dangerously (expensively?) fast and easy — I’ve found myself posting quick, one-off queries to the crowd for sometimes-trivial tasks that emerge in my day-to-day life. I hope that tools like Turkomatic encourage crowdsourcing researchers to begin to use crowdsourcing wherever possible to support our everyday work and lives — in this case, eating our own cooking is as much a way to explore the limits of crowdsourcing as a validation of a certain vision for the future impact of the field.

It’s also my hope that many more folks coming from theoretical CS will choose to look at the rich interplay that’s possible between mathematics and crowdsourcing.  After all, we’re witnessing the birth of a new subfield of computing  — why should HCI have all the fun?

Anand Kulkarni is a fifth-year PhD student at the University of California, Berkeley, where he works on topics in crowdsourcing and theoretical computer science.  For more on the theory-crowdsourcing connection, Turkomatic, and a new “fair trade” crowdsourcing platform, take a look at his position paper.

The Impossible Canal: Reflections on Crowds and Crowdsourcing

Tom Erickson is a social scientist and interaction designer at IBM Research. His chief interest is in how to support productive conversation and collaboration across distributed groups both online and in the world.

In the late 17th Century, under the leadership of Pierre-Paul Riquet, the Canal du Midi was built in southern France. It stretched 240 kilometers from the Atlantic to the Mediterranean, used a hundred locks to manage the 189 meter elevation change, and was beset with problems ranging from lack of appropriate construction techniques to difficulties in managing water flow and dealing with local soil types. Building the canal required knowledge and techniques that considerably surpassed the existing state of the art. In fact, building the Canal du Midi should have been impossible.

The Impossible Canal. Photo by Peter Gugerell, Vienna, Austria. CC By-SA. [Note 1]

The fact that the canal was built is a testimony to the power of the crowd, although the crowd in question was of a different character than the crowds often imagined by crowdsorcerers. As described in Chandra Mukerji’s fascinating book, Impossible Engineering [2], the canal was a product of distributed cognition, à la Ed Hutchins [1]. While the surveyors and engineers that one would expect to be instrumental in such a project played important roles, “formal knowledge of hydraulics was of no use without local knowledge of the region along with its terrain, soils, weather and the sea… [and] without artisans who could realize the plans with local materials.” [2, p 10] Interestingly, this knowledge was largely distributed among locals, particularly peasant women from Pyrenees villages who drew upon the ‘common sense’ techniques and knowledge they used to build and maintain community irrigation systems. This knowledge, passed down over generations because of its practical utility, had originally been developed more than a millennium before by Roman engineers who used local labor to construct baths and aqueducts.

The Canal du Midi example raises a number of issues that could be interesting to take up in the workshop:
•    I like this example because it offers a counterpoint to the view of ‘the crowd’ as a set of people with low-level, generic skills. What we get a glimpse of in the Canal du Midi example is a crowd that is useful not for its generic abilities, but for the specialized place-bound, historically-contingent knowledge that is spread across a geographically distributed crowd. This is what makes me particularly interested in examples of situated crowdsourcing, as exemplified by Cyclopath [3, 4], in which members of the crowd are tapped for their unique expertise – their local knowledge. (Other types of “situated crowdsourcing” are discussed in my position paper [Note 2].)
•    More generally, this leads me to ponder the stance designers take towards crowdsourcing. Is there a danger of instituting a sort of cognitive Taylorization, turning knowledge work into a variant of the Fordist assembly line? Is ‘the Crowd’ like ‘the Cloud’ – a source of cheap, commoditized cognitive labor on demand? How might we design crowdsourcing systems that don’t simply extract labor from the crowd, but leave members more knowledgeable than when they began? How might we design them so that they take account of the fact that crowd members are embodied agents situated in complex environments, rather than the analog of transistors on a chip?
•    Mukerji’s invocation of Ed Hutchins’s work [1] also interests me. It seems to me that crowdsourcing is under-theorized, and that Hutchins’s vision of distributed cognition as the operation of functional systems that transform and propagate states across representational media offers an interesting way of thinking about crowdsourcing systems. It is this that I try to explore, albeit in a more concrete way, in the discussion of the ‘table of contents organization’ example in my position paper [Note 2].

1. This file is licensed under the Creative Commons Attribution 2.5 Generic license. Accessed April 13, 2011.
2. My position paper is at

1.    Edwin Hutchins. Cognition in the Wild. MIT Press, 1995.
2.    Chandra Mukerji. Impossible Engineering: Technology and Territoriality on the Canal du Midi. Princeton University Press, 2009.
3.    Reid Priedhorsky and Loren Terveen. The computational geowiki: what, why, and how. Proc. CSCW ’08, 267–276. ACM Press, 2008.
4.    Reid Priedhorsky. Wiki, Absurd Yet Successful: A Position Paper for CHI 2011 Workshop on Crowdsourcing and Human Computation

Crowdsourcing Customization with Socially-Adaptable Interfaces

For many reasons, modern desktop applications are packed with more functionality than is required by any given user, and particularly for any given task. We are developing socially-adaptable interfaces to help users manage all this functionality. In our approach, the creation of task sets—task-specific interface customizations—is crowdsourced to the application’s user community. Any user can create a task set, and when they do it is made instantly available to all users of the application. The result is that a user can sit down at the application, type a few keywords describing their intended task, and get an interface customized to that task.

System diagram for a Socially-Adaptable Interface

To explore this concept, we have developed AdaptableGIMP, a modified version of the GIMP image editor and an associated wiki that provide users with a socially-adaptable interface. We plan to officially release AdaptableGIMP within the next week or so, and we’re excited to see what users will do with it.

Installing a task set in AdaptableGIMP

In terms of Bernstein’s design simplex, socially-adaptable interfaces use a complementary combination of AI and the Crowd. The crowd explicitly creates task sets, gives them names, and documents them. They also provide implicit information about task sets when they use them (e.g. the keyword searches that they perform or the task sets that they choose to install). All of this information is then synthesized using AI techniques to provide users with relevant task sets in response to their searches.
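One way to picture the synthesis step is a simple keyword-overlap ranker over the crowd-created task sets (my own toy sketch; AdaptableGIMP's actual ranking presumably also weighs the implicit usage signals described above):

```python
def rank_task_sets(query, task_sets):
    """Rank task sets by overlap between the user's query keywords
    and the names/keywords the community attached to each task set."""
    query_terms = set(query.lower().split())

    def score(task_set):
        return len(query_terms & task_set["keywords"])

    return sorted(task_sets, key=score, reverse=True)

# Hypothetical community-created task sets for an image editor.
sets = [
    {"name": "Red-eye removal", "keywords": {"red", "eye", "photo"}},
    {"name": "Black and white", "keywords": {"black", "white", "desaturate"}},
]
best = rank_task_sets("make photo black and white", sets)[0]
print(best["name"])  # -> Black and white
```

A production system would fold in implicit signals (installs, search click-throughs) as additional features, but the core idea of matching user vocabulary to crowd-authored task sets is the same.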

For more information, check out our position paper or

Affective Crowdsourcing

by Rob Morris (Massachusetts Institute of Technology)

Can we gauge the affective state of the crowd? Is the crowd happy? Stressed out? Bored to tears? For many crowdsourcing applications, these questions are crucial. So, how do we answer them? Well, we could always just ask the crowd outright, and administer questionnaires. Or, we could use implicit, behavioral measures, such as those described by Michael Toomim. But we might also benefit from tools that measure affective features directly, in real-time.

Thanks to advances in affective computing, such tools are now at our disposal. In many cases, all we really need is a webcam. Indeed, with just an ordinary webcam, we can now monitor a vast array of affective features, such as posture, facial expression, and heart-rate.

In the Affective Computing Group at MIT, we are applying these techniques to a variety of real-world crowdsourcing applications. For instance, Dan McDuff, a PhD student in our group, is currently tracking facial expressions from thousands of people online while they watch Super Bowl advertisements. As part of an art installation, Javier Hernandez-Rivera and Ehsan Hoque are tracking smiles at different locations throughout the MIT campus. With their system, we can start to answer an age-old MIT question – who’s happier: computer science students or media arts students?

MIT Mood Meter

Both of these examples use crowdsourcing techniques to acquire massive amounts of naturalistic affective data. But, we could just as easily turn these examples on their head, and use the same tools to reflect back on our crowdsourcing designs and methodologies – how do our designs affect the emotions of the crowd?

As crowdsourcing researchers, we must remember that emotions are a fundamental part of what makes us human. And, since crowdsourcing is essentially a humanistic field, it behooves us to at least consider the affective states of the crowd. Doing so will help us design better technologies, and it will open doors to many new research areas (several of which I describe in my position paper). Emotions, and indeed all affective phenomena, have a lot to offer our field.

Rob Morris is a 2nd-year PhD student in the Affective Computing Group at MIT. He is currently interested in how crowdsourcing technologies might foster new forms of nearly real-time, emotion regulatory support.

Crowd-tailored algorithms

by Jakob Rogstadius (University of Madeira)

Successful emergency response relies heavily on access to timely, accurate and relevant information about a complex ongoing event. However, currently available systems are not sufficient to meet the needs of victims and responders. In this post, I will outline a key issue in this application domain that is highly relevant to the crowdsourcing community, and how this challenge should drive research towards new forms of crowdsourcing.

Algorithms vs. crowdsourcing
There are currently two main approaches for building real-time information systems. Purely automated news aggregators, such as Google News, already perform quite well at the task of gathering and clustering articles related to an event. Some go even further, such as the EMM NewsBrief which also extracts metadata such as locations, people and quotes from the clusters. However, these systems offer generic approaches that are unable to gather and present knowledge in a manner tailored to the characteristics, needs and priorities of a specific event or disaster.

Other systems more specialized for emergency use, such as Ushahidi, Sahana and others mentioned by Kate Starbird in her post, adopt an almost purely crowdsourced approach by relying on individuals to submit reports containing all necessary metadata; data which is then presented using default or, in some cases, event-adapted interfaces. While these systems are designed to be much more adaptive than the news aggregators, they are in turn unable to integrate the vast but largely unstructured knowledge base that social and traditional media build up around a particular disaster.

Crowd-in-the-loop analysis
The limitations of both fully automated and fully crowdsourced information processing systems motivate the need for solutions that combine the scalability of algorithmic computation with the unique human capabilities to adapt to new situations, prioritize information, infer knowledge, estimate trust and question sources. One possible system design (located somewhere in the empty lower-center of Michael Bernstein’s Design Simplex) is presented in the figure below and with further details in our position paper. This design relies on an analysis loop which algorithmically gathers data related to a major event, stores it and presents the knowledge base to users. Those users then add to the information by sharing their conclusions and inferences with others through social media. In addition, a crowd-powered clarification process continuously operates on the knowledge base, detecting and resolving flaws in the information.
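The analysis loop described above can be caricatured in a few lines of code. This is purely my own schematic of the proposed design, with `gather_data`, `detect_flaws` and `ask_crowd_to_resolve` as placeholders for the algorithmic and crowd-powered components:

```python
# Schematic of the crowd-in-the-loop analysis cycle: algorithms
# gather and store event data; the crowd clarifies flaws in it.

def analysis_loop(knowledge_base, gather_data, detect_flaws,
                  ask_crowd_to_resolve, max_iterations=10):
    """Alternate algorithmic gathering with crowd-powered clarification."""
    for _ in range(max_iterations):
        knowledge_base.update(gather_data())       # algorithmic step
        for flaw in detect_flaws(knowledge_base):  # e.g. gaps, conflicts
            key, fixed_value = ask_crowd_to_resolve(flaw)
            knowledge_base[key] = fixed_value      # crowd-powered step
    return knowledge_base
```

The interesting research questions live precisely at the two handover points in this loop: deciding which flaws to route to the crowd, and feeding the resolved values back into the algorithmic components.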

Proposed system design

The missing piece from this design, and the main source of interest from a crowdsourcing perspective, is the handover point and bridge between crowd and algorithm. I argue that we need to understand how to extensively integrate an adaptive and intelligent crowd with cheap and accessible traditional computation, so that we enable systems to be built that are vastly more powerful than those seen today. To do this, we need:

  • Resource management strategies for allocation of computation, workers and worker payments.
  • Predictable crowdsourcing results, in terms of time, cost and quality.
  • New classification and prediction algorithms, for continuous learning from streaming data, and designed to include logical handovers to and from the crowd.

Task Decomposition and Human Computation in Graphics and Vision

by Dan Goldman and Joel Brandt

Dan is a senior research scientist at Adobe Systems, working at the intersection of computer graphics, computer vision, and human-computer interaction.

Joel is research scientist at Adobe Systems, working in the field of human-computer interaction. His research primarily focuses on understanding and supporting the programming process.

In projects like Soylent, the researchers themselves created the sequence of steps for the crowds to execute. Can this sort of reconfigurable process automation be extended to virtual crowds? Some individuals may excel at organizing work into structured processes, while others may work best at executing predetermined processes. Thus far, research into crowdsourcing has focused primarily on the latter category. There are far fewer examples in which the crowd itself participates in task decomposition.

To dip our toe in the waters of self-assembling human computation, we conducted a quick (and very unscientific) study. We posted several HITs to Mechanical Turk that asked turkers to break down complex tasks into small pieces. We began with three large tasks of varying specificity: “Write a story about a great man whose pride brings about his downfall,” “Prepare a romantic dinner,” and “Build a house.” Working simultaneously on the same document using PiratePad, participants for all three tasks were asked to identify which tasks were achievable in 5 minutes or less, or to break up tasks that failed that test into sequences of shorter tasks. Output from the “Build a house” task is shown below.
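The procedure the turkers followed amounts to a recursive test-and-split loop, which might be sketched as follows (a hypothetical simulation; `estimate_minutes` and `ask_crowd_to_split` stand in for the human judgments the study actually collected):

```python
# Hypothetical simulation of the 5-minute-test decomposition procedure.

def decompose(task, estimate_minutes, ask_crowd_to_split, limit=5):
    """Flatten a task into steps each judged to take <= `limit` minutes."""
    if estimate_minutes(task) <= limit:
        return [task]                 # short enough: keep as one step
    steps = []
    for subtask in ask_crowd_to_split(task):
        steps.extend(decompose(subtask, estimate_minutes,
                               ask_crowd_to_split, limit))
    return steps
```

In the study, of course, both the time estimate and the split were made by (different, concurrent) workers editing a shared document, which is what made the process self-correcting.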

Crowd Breakdown of “Build a house” Task
Steps resulting from 9 Mechanical Turk users who were asked to break down the task “build a house” into 5-minute tasks.

Some participants were more careless than others: for example, they marked very complex subtasks as trivial, or skipped major obvious steps.  But the process turned out to be somewhat self-correcting, as later participants corrected errors committed by earlier ones.  None of the turkers used conditional statements or loops, not even in the instructional style (e.g., “repeat steps 3 through 5 until X”).  Yet some subtasks were surprisingly specific and technical, like this subtask from the house-building task: “Mark cap plate of wall framing for roof trusses.”  In this case, the one knowledgeable participant seemed to gum up the process for later turkers, who were unable to assess the length or difficulty of such a task.  It seems likely that this could be ameliorated either with differentiation of expertise, or with some type of feedback or undo mechanism to back up to a more comprehensible set of instructions.

Although our experiment was too simplistic to draw any actionable conclusions, it suggests that even on Mechanical Turk, where there is no threshold for participation at all, participants were capable of and willing to break down large tasks into smaller ones. Given additional prompting, we expect they could also be encouraged to employ the conditional and looping constructs necessary for flexible task breakdown.

We foresee that the current paradigm of manually crafted human programming may quickly give way to self-assembling human computation. Someday, the creator of crowdsourcing tasks may be free to use broad guidelines and allow the crowd to assist in providing the specificity normally required for mechanical computation. To achieve this, such systems must be self-debugging, including corrective mechanisms to revise a program if the outcome does not meet objectives. Designing and building methodologies and platforms for these self-assembling and self-debugging human computations will be a significant challenge for the decades to come.

Leveraging Crowdsourced Technical Documentation: Building a Command Thesaurus

People routinely rely on search engines to support their use of software and other interactive systems. Our work examines how we can leverage search query logs, and the web pages queries resolve to, in order to create a mapping from the user’s vocabulary to the (sometimes highly technical) terminology utilized in a system’s user interface. We call this mapping a “command thesaurus”.

To build a command thesaurus, we begin by exploiting the fact that search query logs provide an excellent view of the vocabulary with which users conceive their use of interactive systems. Meanwhile, the terminology employed in a system’s user interface can be extracted from the interface itself. We can then relate user terminology to system terminology by exploring term co-occurrences in relevant online resources (e.g., forum postings, FAQs, etc.). An example illustrates this point:

The query “gimp how to make black and white” is quite common amongst users of the GNU Image Manipulation Program (GIMP). When processing this query, Google returns webpages which mention the GIMP command “desaturate” at a rate which is much higher than can be attributed to chance alone. As such, we may associate “make black and white” and “desaturate” to one another, and record this relationship in GIMP’s command thesaurus.
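One standard way to quantify "higher than can be attributed to chance alone" is pointwise mutual information over document co-occurrence counts. The sketch below is my own illustration of the idea with made-up counts, not the authors' actual scoring method:

```python
from math import log

def pmi(count_both, count_phrase, count_command, total_docs):
    """Pointwise mutual information between a user phrase and a UI
    command, from document co-occurrence counts.

    Positive PMI: the pair co-occurs more often than chance predicts,
    so the association is a candidate thesaurus entry.
    """
    p_both = count_both / total_docs
    p_phrase = count_phrase / total_docs
    p_command = count_command / total_docs
    return log(p_both / (p_phrase * p_command))

# Toy counts: "black and white" and "desaturate" co-occur far more
# often than chance would predict, so PMI is strongly positive.
print(pmi(count_both=80, count_phrase=100,
          count_command=120, total_docs=10_000) > 0)  # -> True
```

Pairs whose association score clears a threshold would then be recorded in the application's command thesaurus.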

Once a command thesaurus is available, it enables a wide range of novel affordances and interactions which may improve feature discoverability in software applications. For example, the thesaurus might enable “live tooltips” which indicate how people are using system commands on a day-to-day basis. Command thesauri might also enable search-driven interfaces, where users are presented with relevant commands after typing a few keywords describing the task they would like to perform. Finally, it may be possible to relate the terminology of one application to the terminology of a similar application by comparing their command thesauri. This could enable the translation of commands from one application to the other.

Importantly, command thesauri are generated automatically, and they leverage both user generated web content and day-to-day web searches. As such, command thesauri evolve alongside a system’s user community.

Related presentation at CHI 2011:

Characterizing the Usability of Interactive Applications Through Query Log Analysis – Tuesday, 4:00pm