Real-time Crowd Control of Existing Interfaces

What if the crowd acted like a single (really awesome) worker?

We already know that the crowd can do amazing things. And, Michael Bernstein and I have already advocated on this blog that it’s important to get this work done quickly for all sorts of cool interactive applications.

But, what do you do once you have your crowd in 2 seconds? I argue that one of the most interesting things would be to get the crowd to act like a single high-quality worker.

Crowds aren’t reliable (take as witness the litany of work trying to make them more reliable). The work of individual crowd workers can be of low quality for a host of reasons, and you can’t count on any particular high-quality worker you’ve identified to be around when you need them — let alone for your whole job.

I just want one really high-quality worker recruited on-demand for the duration of my job!


An individual worker could execute a long-term strategy, and respond to changes in the application in real time (without having to wait for a quorum of other workers to vet their actions). My really awesome worker could do complex control tasks, with the interfaces I already have: “Hey, boss, want me to drive your robot, complete that Excel spreadsheet, or fill in for you on WoW while you’re taking a break? No problem!”

But, I also want my worker to have the advantages of the crowd — to be available on-demand all the time, and to benefit from collective intelligence. I want to choose the crowd that’s in my worker’s head — maybe paid workers, but maybe my friends or family instead.


That’s what our paper is about: the Legion system that we introduce turns the unreliable dynamic crowd into really awesome workers that you can recruit on-demand to reliably control your existing interfaces in real-time.

The fact that Legion works on existing interfaces is important. Many of us in this area build substantial one-off systems and support infrastructure so the crowd can do our tasks, but Legion allows new crowd-powered systems to be created from the interfaces you already have. We’ve used Legion to create a robot that follows natural language commands, to outsource bits of office work, to make an assistive keyboard more accurate, and, yes, even to fill in for us in video games.

To use Legion, users first select a portion of their desktop interface that they would like the crowd to control, provide a natural language description of the task for the crowd to perform, and offer a price that they are willing to pay. Legion forwards a video feed of the interface to the crowd and forwards key presses and mouse clicks made by the crowd back to the interface (think VNC). To improve reliability, multiple workers are recruited to collaboratively complete the task. A fundamental question is how to effectively mediate crowd work to balance reliability with the desire for real-time, closed-loop control of the interface. Legion coordinates task completion by recruiting crowd workers, distributing the video feed, and providing a flexible mediation framework to synthesize the input of multiple crowd workers.

So, how do we decide which input actually makes it to the interface being controlled? We use crowd agreement to figure out which input is likely to be the best, and forward it on. We evaluated a number of these mediation strategies, and (spoiler alert!) the best ended up being a mediator that used crowd agreement to dynamically select a leader who would temporarily (but unwittingly) assume full control.
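
To make the leader idea concrete, here is a minimal sketch in Python. It is not Legion's actual code: the fixed time windows, the `decay` parameter, and the function names are assumptions made for illustration. In each window, the worker with the best record of agreeing with the crowd gets their input forwarded, and then everyone's record is updated.

```python
from collections import Counter


def update_weights(weights, inputs, decay=0.5):
    """Blend each present worker's old weight with their agreement in this window."""
    counts = Counter(inputs.values())
    total = len(inputs)
    new_weights = {}
    for worker, action in inputs.items():
        agreement = counts[action] / total  # share of the crowd that chose the same action
        new_weights[worker] = decay * weights.get(worker, 0.0) + (1 - decay) * agreement
    return new_weights  # workers who left the crowd are dropped


def mediate(windows, decay=0.5):
    """Return the action forwarded to the interface for each time window."""
    weights = {}
    forwarded = []
    for inputs in windows:
        known = [w for w in sorted(inputs) if w in weights]
        if known:
            # The leader is the present worker with the best agreement record so far.
            leader = max(known, key=weights.get)
            forwarded.append(inputs[leader])
        else:
            # No track record yet (hypothetical fallback): use the plurality action.
            forwarded.append(Counter(inputs.values()).most_common(1)[0][0])
        weights = update_weights(weights, inputs, decay)
    return forwarded
```

Because the leader is chosen from past agreement, their input can be forwarded right away without waiting for a vote, which is the point of giving one worker temporary control.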

Here are some visual results of these mediation strategies on a robot navigation task:

Robot Driving Results for Different Mediators

Work in human-computer interaction usually assumes either a single user, or groups of users collaborating in the same virtual space, each in control of a personal cursor. Legion advances a new model in which a diverse and dynamic group collectively acts as a single operator, which introduces all kinds of interesting problems regarding feedback, quality control, and reliability.

We’re all really excited about this project, and think it brings up a whole slew of interesting research questions. The paper outlines a number of angles future work might explore, such as how to give good feedback to workers when the actions of each worker might not be taken, the potential of what we call desktop mash-ups, and how Legion might enable the crowd to be programmed by demonstration.

Check out our paper and video, and leave us your comments!

This blog post brought to you by Walter Lasecki, Kyle Murray, Samuel White, Rob Miller, and Jeffrey Bigham.

Jeffrey P. Bigham is an assistant professor at the University of Rochester. His research is about improving access for everyone, which leads him on adventures in HCI, AI, Human Computation, Disability Studies, and Systems. He tweets from @jeffbigham.

Crowds in Two Seconds: Enabling Realtime Crowd-Powered Interfaces

Crowds are already powering novel interactive systems like word processors and question answering systems, but their reach is too limited: crowds are reasonable choices only when the user can wait a minute or more for a response. Users, of course, hate waiting — they abandon interfaces that are slow to react. Imagine a crowd-powered copy/paste mechanism: it becomes much less useful when you need to wait 50 seconds between copying and pasting.

Our goal is realtime crowdsourcing: completing non-trivial crowdsourced computation within seconds of the user’s request. In our upcoming UIST paper, we develop techniques that recruit crowds in two seconds and use them to complete complex search tasks in ten seconds. We use these techniques to create a quick-response crowd-powered camera called Adrenaline, which quickly finds the best moment to snap a photo. Check out a video of the system here.

Adrenaline Camera

We believe that synchronous crowds are the key to realtime crowdsourcing. In synchronous crowds, all members arrive and work simultaneously. With synchronous crowds, systems can dynamically adapt tasks by leveraging the fact that workers are present at the same time.

There are two barriers to using synchronous crowds: recruitment and “crowd control” algorithms. To recruit synchronous crowds, we introduce the retainer model and a set of empirical guidelines for using it. The retainer model pays workers a small wage to wait and respond quickly when asked. Our most effective design results in a majority of workers returning within two seconds and 75% within three seconds. To guide crowds quickly, we then developed rapid refinement, which observes early signs of agreement in synchronous crowds and dynamically narrows the search space to focus on promising directions.

Rapid Refinement

This approach produces results that, on average, are of more reliable quality and arrive faster than even the fastest crowd member working alone.
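
As a rough illustration of rapid refinement, the sketch below narrows a search range (say, positions on a video timeline) as soon as enough workers cluster in one region. The `shrink` and `quorum` values and the function name are assumptions for illustration, not the parameters used in the paper.

```python
import math


def refine(lo, hi, positions, shrink=0.25, quorum=0.33):
    """Narrow (lo, hi) to a sub-window if enough workers are clustered inside it.

    positions: current slider position of each worker, within [lo, hi].
    shrink:    width of the candidate sub-window as a fraction of the range (assumed).
    quorum:    fraction of workers that must fall inside the sub-window (assumed).
    """
    width = (hi - lo) * shrink
    needed = max(2, math.ceil(quorum * len(positions)))
    best_start, best_count = None, 0
    for start in sorted(positions):
        count = sum(1 for p in positions if start <= p <= start + width)
        if count > best_count:
            best_start, best_count = start, count
    if best_count >= needed:
        return best_start, min(hi, best_start + width)
    return lo, hi  # no early agreement yet, keep the current range
```

Calling `refine` repeatedly as worker positions change keeps shrinking the range until only a small region is left, so the system does not have to wait for every worker to finish.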

The retainer model and rapid refinement open the door to powerful new interfaces like Adrenaline. But more importantly, they open the door to many other systems that need synchronous crowds or fast results. For example, we developed A|B: a low-latency voting platform that returns a poll of five workers within about five seconds. Here, the sans-serifed font won 5-to-0.

A quick voting platform

We also see opportunities for realtime requester feedback and creative support. To explore these ideas, we developed Puppeteer, which allows crowds to help create lots of interestingly-varying alternatives to a photo. Here’s what it looks like when we want to create a dancing crowd:

Joint work with Joel Brandt at Adobe, and Rob Miller and David Karger at MIT.

CrowdForge: Crowdsourcing Complex Tasks

Work marketplaces like MTurk are great for accomplishing small, well defined nuggets of work, such as labeling images and transcribing audio, but terrible for many more complex and labor intensive real world tasks. Over the last two years, Robert Kraut, Niki Kittur, Susheel Khamkar and I formalized a general process of solving such complex problems using MTurk. We proposed three basic types of tasks and explored them in several experimental applications. To facilitate these experiments, I implemented CrowdForge, a Django framework that takes output from MTurk HITs and uses it to create new MTurk HITs.

CrowdForge proposes the following task breakdown, roughly inspired by the MapReduce programming paradigm:

  • partition tasks split a problem into sub-problems (one to many)
  • map tasks solve a small unit of work (one to one)
  • reduce tasks combine multiple results into one (many to one)
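
The three task types above compose into a simple flow. The sketch below is only an illustration in Python: plain functions stand in for the MTurk HITs that real workers would complete, and `run_flow` and the stand-in functions are hypothetical names, not CrowdForge's API.

```python
def run_flow(problem, partition, map_task, reduce_task):
    """Run one partition step, one map per sub-problem, and one reduce."""
    subproblems = partition(problem)                # one to many
    results = [map_task(s) for s in subproblems]    # one to one, per sub-problem
    return reduce_task(results)                     # many to one


# Stand-ins for work that turkers would do in separate HITs.
def outline(topic):
    return [f"{topic}: history", f"{topic}: geography"]


def collect_fact(heading):
    return f"A fact about {heading}."


def merge(facts):
    return " ".join(facts)


article = run_flow("New York City", outline, collect_fact, merge)
```

Flows can also nest: a map task can itself be another `run_flow` call, so that a sub-problem is solved by its own partition, map and reduce steps.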


Using the CrowdForge approach, we got turkers to write articles on a given topic, to conduct car purchase research, and even write articles for a scientific journal.


In the first experiment, turkers generated encyclopedia-style articles on a given subject. The approach we took was to first generate an article outline using a partition task; then, for each heading in the outline, to collect facts on the topic; next, to combine these facts into a paragraph; and finally, to merge all of the paragraphs into a final article. The following diagram illustrates this process and excerpts from an article on New York City:


On average, the articles produced by the group were rated on par with the Simple Wikipedia article on the same topic, and higher than those produced individually.

Product comparisons

Next, we used CrowdForge to create purchase decision matrices to assist consumers looking to buy a new car. Given a short description of a consumer need, we created two partition tasks: one to decide which cars might be appropriate to consider, and one to decide which features the consumer cares most about. This double partition resulted in an empty product comparison chart. Each cell in the chart then spawned a map task to collect related facts. Next, these facts are reduced into a sentence, resulting in a product comparison chart. Here’s an excerpt from a car recommendation for a hypothetical suburban family that drives their two children to and from school and enjoys occasional road trips:


The entire task was completed in 54 different HITs for a total cost of $3.70. When we tried to compare to the individual case, we had no success getting individuals to generate a similar product comparison chart, even when offering more money than we paid the entire group.
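
The double partition described above amounts to a cross product of cars and features. Here is a small Python sketch of that structure; the car names, features, and stand-in functions are made up for illustration, and only the HIT count and total cost come from our experiment.

```python
from itertools import product


def empty_chart(cars, features):
    """Two partition results crossed into an empty comparison chart."""
    return {(car, feature): None for car, feature in product(cars, features)}


def fill_chart(chart, collect_facts, summarize):
    """One map task (collect facts) and one reduce task (summarize) per cell."""
    return {cell: summarize(collect_facts(*cell)) for cell in chart}


cars = ["Car A", "Car B"]                              # hypothetical partition output
features = ["safety", "cargo space", "fuel economy"]   # hypothetical partition output
chart = empty_chart(cars, features)
filled = fill_chart(
    chart,
    lambda car, feature: [f"{car} {feature} fact 1", f"{car} {feature} fact 2"],
    lambda facts: "; ".join(facts),
)

# Average cost per HIT for the reported run: $3.70 across 54 HITs.
average_cost_per_hit = 3.70 / 54
```

With the reported totals, the average works out to roughly seven cents per HIT.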

Scientific journalism

For the third and final test, we chose to address the complex problem of crowdsourcing the science journalism process; i.e., turning a paper published in an academic venue (such as Science) into a newspaper article for the general public.

We worked with the journalists from New Scientist to identify a typical structure for a popular science article, including creating a news lead, describing what scientists did, what they found, getting a quote from a relevant expert and an author of the study, and describing implications and future work. For each of these sections we worked to develop subflows that would produce them. Some of the subflows required iteration, trying several different approaches and quality verification with intermediate voting steps. In total our article generation task involved 11 subflows comprising 31 subtasks, which represent 262 worker judgments and a total cost of approximately $100.

We then used this approach to generate a summary of this paper from Salganik et al. The results were surprisingly good, but not perfect. Below is a sample of the best:

“Blockbusters, bestsellers, hit songs – the main variable that makes these things more popular than their lesser-known counterparts is not quality differences, according to a recent study. The answer lies in social influence – the more people know a certain movie is the one to watch, the more likely it will become popular” (Best – 6.25/7)

and worst:

“The psychology of song preference. Song quality isn’t the best predictor of song success.” (Worst – 1.25/7)

samples of article introductions as voted on by other turkers.

Future directions

The first direction I’d like to take CrowdForge is to support mixed human-computer intelligence tasks, following through with the MapReduce programming paradigm. In this context, pools of machine intelligence could intermix with pools of human intelligence. The research here would involve an exploration into designing task flows that contain some sub-tasks well suited to machines, and others to humans.

The second direction for the future of CrowdForge is to harness communities that are more specialized in certain areas. I’d like to explore the idea of “targeted tasks”, which is similar to the targeted advertising commonly seen online. With the new custom work marketplace alluded to in the previous paragraph, one could imagine embedding HITs right in a web page, which, for some online communities (e.g., graphic artists), could solicit a very specialized kind of human intelligence.

CrowdForge’s dependence on Mechanical Turk is a roadblock for both of these futures. As a result, I’m very interested in looking at work marketplaces that support both human and machine tasks, and can be better integrated into the web at large.

For more details about CrowdForge, please read our UIST 2011 publication.

Boris Smus is an engineer on the Google Chrome developer relations team. He has a masters in HCI from Carnegie Mellon University where he focused on crowd and physical interfaces. For more information, see his blog at

Up Next: UIST 2011

So far this summer, we’ve had a great slate of papers from HCOMP, ICWSM, CHI, CSCW and ICML. As we move into the fall, though, it’s time to look forward to a new set of venues. First up is UIST 2011: the ACM Symposium on User Interface Software and Technology. In the weeks to come, authors of accepted UIST papers will be sharing early previews of their upcoming work on crowd and social computing.

Do you have other conferences or journals that you’d like to see us cover? Let us know.

4chan and /b/: Anonymity and Ephemerality in Online Communities

When designing for online communities like forums or crowd computing platforms, we face important design decisions — and two of the most impactful decisions are identity and archiving strategies. In terms of identity, pseudonyms online were the default for a long time. Recently, however, sites like Facebook have moved us toward a model of real identity: Mark Zuckerberg has gone so far as to say that pseudonyms and multiple accounts demonstrate a “lack of integrity”. In terms of archiving, search engines have moved us toward a more honest, permanent archive, where the web never forgets. Researchers have argued time and time again that strong identity and permanent archives are critical to the functioning of an online community.

My co-authors and I took note of a trend in the opposite direction — sites that purposefully delete old content and treat content as anonymous by default — in a recent paper that was lucky enough to win Best Paper at ICWSM. In particular, we focused on the community at 4chan and its “random” board called /b/. [Warning: /b/ is extremely offensive, and purposefully so. Tread carefully.] The 4chan community has been extremely successful by many metrics: 7 million monthly users, untold numbers of online memes, and impressively (sometimes scarily) powerful online protest movements.

So what’s going on here? Can we use 4chan and /b/ to reconsider our notions of what anonymity and ephemerality look like in online communities?

Visually, /b/ is a throwback to 90’s web design:

Our group characterized the ephemerality and anonymity that typify /b/. A few highlights of our findings:

  • The median thread spends just 5 seconds on the first page of /b/ across its entire lifetime before being whisked off by newer content. That median thread will get completely removed from the site within 5 minutes.
  • Over 90% of all posts are made completely anonymously: no pseudonyms, no Facebook Connect, no nothin’.
  • /b/ has developed alternate archiving and identity signaling mechanisms in the absence of accounts and site history: /b/ folders, an impressive amount of slang, memes as it’s-in-it’s-out fashion markers, and an often thin line between being serious, offensive, and just trolling.

In a world where our research is often driven by the data we have available, we urge the research community to consider engaging more with communities like 4chan. If you are interested in collecting your own dataset, please contact us and we’ll help you get set up.

Here’s the paper. It had some nice coverage on Slate, Gawker and KnowYourMeme. My coauthor Andrés had fun creating a (kind of disturbing) word cloud of our 5 million post dataset:

And of course, we made some peer review memes:

Michael Bernstein is a PhD student at MIT’s Computer Science and Artificial Intelligence Laboratory. He coauthored this paper with Andrés Monroy-Hernández and Drew Harry at the MIT Media Lab, Paul André at U. Southampton (now at Carnegie Mellon), and Katrina Panovich and Greg Vargas at MIT CSAIL. Thanks to the many, many researchers who gave us feedback on early drafts of the paper.

Rating Scales for Collective Intelligence in Innovation Communities

We all rate user-generated content on the web all the time: videos, blog posts, pictures. These ratings usually serve to identify content that we might like. In online innovation communities such as Dell’s IdeaStorm or My Starbucks Idea, those ratings serve a different purpose: identifying the “best” idea to be implemented by the host organization (there is a whole other argument that it is only about marketing and not about the “ideas” but let’s not go down that road). So the question arises: how can we best design collective intelligence mechanisms for idea selection in innovation communities?

In a paper published at the last ICIS conference, a group of researchers from TUM (Germany) compared three different rating mechanisms in a field experiment (n=313) against a baseline expert rating.

To test the collective intelligence assumption, they also tested for a moderating effect of users’ expertise. Here is the research model:

They find that a multi-attribute scale works significantly better regarding rating accuracy than the simpler scales, but, as expected, there are some drawbacks regarding user satisfaction. Surprisingly, the simplest scale scores worst on both rating accuracy and satisfaction. They also find that user expertise has no moderating effect. Under the hood of the paper is an interesting method contribution: the authors present a way to judge an individual user’s rating accuracy (they term it “fit-score”). This could serve as a basis for future investigations in the design of collective intelligence mechanisms.

Christoph Riedl is currently a post-doctoral fellow at Harvard University researching innovation competitions and online communities. His focus is on open innovation, crowd sourcing, and collective intelligence.

Christoph Riedl, Ivo Blohm, Jan Marco Leimeister, Helmut Krcmar (2010): Rating Scales for Collective Intelligence in Innovation Communities: Why Quick and Easy Decision Making Does Not Get it Right. Proceedings of the Thirty-First International Conference on Information Systems (ICIS’10), St. Louis, MO, USA. SSRN

Three cobblers combined make a genius mind? Structures for crowd creativity

by Lixiu Yu and Jeffrey V. Nickerson

There is an old Chinese proverb: Three cobblers combined make a genius mind. While creative individuals have been studied widely, few findings can be directly applied to crowd creativity. To study crowd creativity, we need to understand organization structure.

Organization structure can vary along a spectrum from tightly connected to loosely coupled. The structures can be visualized as networks. Despite the numerous varieties of networks, all of these structures can be seen as combinations of several basic topologies: chains, rings, hub-and-spokes, trees and complete graphs. For example, iterative communication, where ideas are passed as in the game of telephone, has a chain structure. Brainstorming, where everyone talks to everyone else, has a complete graph structure. These basic structures can be combined into complex structures through which cognitive or social creativity may emerge.
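
For readers who like to see the basic topologies spelled out, here is a small Python sketch that builds each one as a set of undirected edges between participants numbered 0 to n-1. The function names, and the choice of a binary tree for the tree case, are ours for illustration.

```python
def chain(n):
    """Each participant passes ideas to the next, as in the game of telephone."""
    return {(i, i + 1) for i in range(n - 1)}


def ring(n):
    """A chain whose last participant also connects back to the first (n >= 3)."""
    return chain(n) | {(0, n - 1)}


def hub_and_spokes(n):
    """Participant 0 is the hub; everyone else talks only to the hub."""
    return {(0, i) for i in range(1, n)}


def binary_tree(n):
    """Participant i reports to parent (i - 1) // 2."""
    return {((i - 1) // 2, i) for i in range(1, n)}


def complete_graph(n):
    """Everyone talks to everyone else, as in brainstorming."""
    return {(i, j) for i in range(n) for j in range(i + 1, n)}
```

The edge counts show how quickly communication grows with tighter coupling: for five participants, the chain, hub-and-spokes and tree each have 4 links, the ring has 5, and the complete graph has 10.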

Moreover, there are organizational networks within and without. At the individual level, creativity can be seen as an output of a neural network, our brains. At the societal level, creativity evolves out of social networks. These influence each other: our neural networks are conditioned by our interactions with each other, and our social structures are influenced by our cognitive capacities. Creativity at one level comes from the collective learning of neurons and at another level from the collective learning of individuals.

Thus, a path to increased social creativity might come through innovative wiring of networks. That is, by artificially constructing temporary networks of individuals we might find which structures produce better designs. Since we want to construct crowd coordination structures and test the collectives’ performance on creative tasks, we apply research on structures from several disciplines: artificial intelligence, psychology, management, and computer science.

In particular, we have started our investigations using a structure based on the architecture of genetic algorithms. Genetic algorithms mimic the process of natural evolution. In our scheme, parent ideas are selected and mated to produce children; features are passed on from generation to generation. It is essentially a combination and selection process. Combination allows advantageous features to propagate and new combinations of features to emerge. The selection process works as a filtering system: bad features drop out of the system. Such combination and selection is ubiquitous in our lives. For example, in everyday interactions we pick up thoughts that rub against each other: we even speak metaphorically of creative sparks. At the macro level, technologies are invented by combining existing technologies in novel ways, according to Paul Thagard. Culture can be seen as emerging from a process of imitation (old ideas propagate) and innovation (new ideas occur).

In our system, we design a structure that allows the distributed features of creative ideas from individuals participating as a crowd to be selected and combined into a collective creative output.
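
The selection and combination steps can be sketched in a few lines of Python. This is a toy model under loud assumptions, not our experimental system: ideas are reduced to sets of feature labels, `rate` stands in for the crowd's judgment, and the random sampling in `combine` stands in for a person merging two designs.

```python
import random


def select(population, rate, k):
    """Selection: keep the k ideas the (stand-in) crowd rates highest."""
    return sorted(population, key=rate, reverse=True)[:k]


def combine(parent_a, parent_b, rng):
    """Combination: a child inherits a sample of features from both parents."""
    pool = sorted(parent_a | parent_b)
    size = max(1, len(pool) // 2)
    return frozenset(rng.sample(pool, size))


def next_generation(population, rate, k, size, rng):
    """Produce `size` children from the k best ideas (requires k >= 2)."""
    parents = select(population, rate, k)
    children = []
    while len(children) < size:
        a, b = rng.sample(parents, 2)
        children.append(combine(a, b, rng))
    return children
```

Features carried only by low-rated ideas never reach the parent pool, so they drop out, while features of well-rated ideas get recombined in the next generation.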

We have performed experiments that provide initial illustrations of how collective creativity might be studied. We have used a system based on combination, one example among many possible structures for creativity. In the experiments, members of the crowd either generate ideas afresh or combine the ideas of others. Twelve hundred people designed a chair (see Figure 1), and 500 people designed an alarm clock. We have found that ideas do in fact evolve. Some features will propagate more than others, and the ideas, judged by another crowd, increase in creativity. The structure drives the crowd to select good features, a process of imitation, and come up with new combinations of features, a process of innovation.

Since these two processes are arguably the basic forces behind human progress, our system can function as a microcosm, a laboratory for culture. Not unlike the world, the lab is composed of a variety of individuals, some talented, some not, some hardworking, some gaming the system. We can vary the crowd’s structure to improve the results. Eventually, the crowd will improve its own structure, and the boundaries between lab and world will blur, as the crowd establishes itself as a creative force, a collective mind.

Figure 1. Chair designs generated by the crowd. For more information, see Cooks or Cobblers: Crowd Creativity through Combination

Human Computation Needs Artificial Intelligence

Amazon Mechanical Turk’s tagline ‘Artificial Artificial Intelligence’ emphasizes the ready availability of human intelligence, and hence suggests diminished importance for artificial intelligence methods. A recent position paper argues that a crowd is sufficient for all steps of task solving: from decomposing the task into small pieces all the way to controlling its execution. In this view, all computations can be crowd-sourced. We take the diametrically opposite viewpoint. As AI researchers, we foresee AI systems being essential, even central, to making crowd-sourcing (especially micro-crowd-sourcing) successful and enabling it to solve complex and challenging problems. In our vision, mixed-initiative systems, where human and artificial intelligence methods complement each other, are the only paradigm that will achieve disruptive growth in this space.

Machines love numbers; humans do not, necessarily. Humans have intuitions about real-world tasks; machines can trade off unintuitively related tasks, say, by expressing their value in a decision-theoretic framework. Moreover, machines are almost free, while human work costs money. These high-level ideas suggest natural points of collaboration between AI techniques and the crowd.

Initially, Mechanical Turk was used merely to generate labeled training data or build knowledge bases. Very early, researchers identified that human work, while cheap, is not necessarily high quality. This resulted in early work that used AI technology to assess the quality of each worker and the confidence of an answer. While useful, there was a key limitation in the tasks that were studied: they were simple, bite-sized jobs requiring a true-false or a fixed-sized response. That work could not be extended to tougher tasks.

Most recent work has shown the importance of decomposing a larger task into a workflow: CastingWords, TurKit, Soylent, etc. all use workflows to divide a large task into micro-tasks that can then be crowd-sourced. These are able to obtain high quality results on complex tasks such as audio transcription, handwriting recognition, and intelligent text editing. With these workflows come interesting challenges: how do we find the best from a space of workflows for a task, how do we optimize an individual workflow by finding the best parameters for it, how do we create the best interfaces for realizing a workflow, how do we dynamically control a workflow to obtain the best performance, and so on. We argue that AI techniques are essential in answering all these questions: they can search through a space of workflows, perform numeric optimization to get the best performance out of one workflow, A-B test a space of interfaces and, finally, dynamically control a given workflow.


It is said that nothing is as difficult as herding cats, but we are sure that artificial intelligence is up to the task! Or at least the task of herding the crowd… We name our proposed system CLOWDER, which literally means a group of cats.

CLOWDER is an AI agent that will assist crowd-sourcing requesters and enable them to perform complex tasks with ease. CLOWDER will have several components:

  1. A declarative language to specify workflows and optimization methods to choose among several
  2. Shared parameters for common workflow types to speed early prototyping
  3. Integrated modeling of workers for easy quality control
  4. Comprehensive decision-theoretic control to optimize the quality-cost-time tradeoffs

Case Study

As a first study, we focus on the problem of intelligent control of workflows, in particular, the iterative improvement workflow. There are several decision points in this workflow, for which the original inventors used a static policy. We design a component called TURKONTROL, which optimizes each instance of the workflow dynamically based on the worker responses so far and their quality parameters, which the system automatically estimates. TURKONTROL is able to make rather intelligent decisions in allocating money across different tasks (decisions that we as human controllers would not have anticipated) and obtains much higher quality output for the same amount of money. In fact, the original static policy requires almost 29% more money to achieve the same quality output.

Our paper is available online. You can also find details about our case study here and here. We look forward to hearing your thoughts here and at HCOMP’11.

Daniel S. Weld is a Thomas J. Cable / WRF Professor at the University of Washington. Mausam is a Research Assistant Professor at the University of Washington. Peng Dai is a recent PhD graduate at the University of Washington. All of us are deeply interested in artificial intelligence and its applications for crowd-sourcing.

Artificial Intelligence for Artificial Artificial Intelligence

The earliest successes of micro-crowdsourcing were in generating training data and obtaining little bits of information. Over time, the paradigm has changed drastically, with more and more complex tasks being solved on Mechanical Turk. Audio transcription (CastingWords), image annotation (TurKit), an intelligent text editor (Soylent) and many other complex tasks became possible due to the use of workflows, which decompose a large job into several micro-tasks. However, quality control (especially over these workflows) continues to be a significant challenge. Our work addresses this problem from a principled basis. We show that artificial intelligence, in particular decision-theoretic techniques, is extremely effective in making the best use of human resources and returns vastly superior quality output by intelligently controlling these workflows.

As a case study, we examine the iterative improvement workflow (used by Greg Little for the image annotation and handwriting recognition tasks), where an artifact is successively improved by different workers and the better artifact is promoted to the next iteration based on workers’ votes. This workflow is ideal for a first study since important questions remain unanswered: e.g., how many iterations? How many workers to evaluate in each iteration? Etc.

We present TURKONTROL, a decision-theoretic controller for this workflow. It first learns model parameters via small experiments with Turkers. These parameters encode an individual worker’s (as well as the average worker’s) qualities for all types of micro-tasks (in our case, improvement HIT and evaluation HIT). TURKONTROL uses these parameters to initialize a POMDP (partially observable Markov decision process)-based agent that dynamically optimizes and controls each instance of the live workflow on Mechanical Turk. Our use case for the experiments is the same as TurKit’s: writing English descriptions of images.
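
To give a flavor of what the controller reasons about, here is a deliberately simplified Python sketch. It is not TURKONTROL's actual model: we assume a single number, the belief that the new artifact is better than the old one, and a single `accuracy` per worker, and the thresholds in `needs_another_ballot` are made-up values. Each ballot updates the belief with Bayes' rule, and another ballot is requested only while the belief is still uncertain.

```python
def update_belief(prior, vote_for_new, accuracy):
    """Posterior probability that the new artifact is better, after one ballot.

    prior:        belief before the ballot that the new artifact is better.
    vote_for_new: True if the worker voted for the new artifact.
    accuracy:     assumed probability that this worker votes for the better one.
    """
    like_if_better = accuracy if vote_for_new else 1 - accuracy
    like_if_worse = 1 - accuracy if vote_for_new else accuracy
    evidence = like_if_better * prior + like_if_worse * (1 - prior)
    return like_if_better * prior / evidence


def needs_another_ballot(belief, low=0.2, high=0.8):
    """Hypothetical stopping rule: ask again only while the belief is uncertain."""
    return low < belief < high
```

A full POMDP controller weighs the cost of each extra ballot against the expected gain in quality rather than using fixed thresholds, but the belief update is the same kind of calculation.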

Our first finding is that TURKONTROL produces significantly superior descriptions compared to those generated via non-adaptive workflows, when spending the same amount of money.

To investigate how much our new system saves, we run the non-adaptive policy until it produces artifacts (descriptions) with an average quality equal to that produced by TURKONTROL. We find that the static workflow costs 28.7% more to reach the same quality!

We also observe that TURKONTROL’s main power is in making intelligent use of human evaluations (ballots). For the non-adaptive policy, after each improvement, it always asks 2 (or 3 if the two disagree) workers to vote for the best description, so it always uses about 2.5 ballots per iteration. In contrast, TURKONTROL does not bother with ballots after the first two improvements because it expects that most workers will improve the artifact. In later iterations, TURKONTROL increases its use of ballots, because the artifacts are harder to improve, and hence TURKONTROL needs more information for making the right decision. The eighth iteration is an interesting exception; at this point improvements have become so rare that if even the first voter rates the new artifact as a loser, then TURKONTROL often believes the verdict.

Average number of ballots for the nonadaptive and dynamic workflows. TURKONTROL makes intelligent use of ballots.
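
The "about 2.5 ballots" figure for the non-adaptive policy can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that the first two voters vote independently and each picks the new description with the same probability; the function name is ours.

```python
def expected_ballots(p_vote_new):
    """Expected ballots per iteration under the static policy.

    Two ballots are always requested; a third is requested only if they disagree.
    p_vote_new: assumed probability that any one voter picks the new description.
    """
    p_disagree = 2 * p_vote_new * (1 - p_vote_new)
    return 2 + p_disagree
```

Under this assumption the static policy spends between 2 ballots (voters always agree) and 2.5 ballots (each voter is a coin flip) per iteration, no matter how informative the extra ballot actually is.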

This clearly shows the strength of an AI system in making the best use of available resources to obtain the best quality output. Such systems, we believe, will be absolutely essential to enable crowd-sourcing to reach its potential in the business world.

Our paper is available online. You can also find details about our previous work, as well as future work. We are looking forward to meeting you at the poster session at HCOMP’11.

Peng Dai is a recent PhD graduate at the University of Washington. Mausam is a Research Assistant Professor at the University of Washington. Daniel S. Weld is a Thomas J. Cable / WRF Professor at the University of Washington. All of us are deeply interested in decision-theoretic planning and its applications for crowdsourcing.