Would you be a worker in your crowdsourcing system?

As a computer scientist, I am interested in two primary research questions about crowdsourcing:

  1. How might new systems broaden the range and increase the utility of crowdsourced work?
  2. What models, tools, and languages can help designers and developers create new applications that rely on crowdsourcing at their core?

I am investigating these questions together with my students at the Berkeley Institute of Design, in our Crowdsourcing Course, and through external collaborations (e.g., Soylent). At CHI, we will present works-in-progress on letting workers recursively divide and conquer complex tasks and on integrating feedback loops into work processes.

As a humanist, I believe it incumbent upon us to also think about the values our systems embody. I have a recurring uneasiness with the brave new world conjured by some of our projects, for two reasons. The first has been articulated before: many crowdsourcing research projects (including my own) rely at their core on a supply of cheap labor on microtask markets. Techniques we introduce to ensure quality and responsiveness (e.g., redundancy, busy-waiting) are fundamentally inefficient ways of organizing labor that are only feasible because we exploit orders-of-magnitude differences in global incomes [1].

My second reservation is that the language we use to describe how our systems decompose, monitor, and regulate the efforts of online workers recalls that of Taylor’s Scientific Management. By observing, measuring, and codifying skilled work, Taylorism moved knowledge from people into processes. This increased efficiency and made mass manufacturing possible, but it also created entire classes of repetitive, undesirable, deskilled jobs.

I believe Stu Card had it right when he wrote that “We should be careful to design a world we actually want to live in.” As a step in this direction, we might consider whether we ourselves would participate as workers in our own crowdsourcing systems. An exercise in my class, in which students had to earn at least $1 as workers on Mechanical Turk, suggests that the answer today is a resounding “No.”

This leads me to a third research question, one I am less prepared to answer but one that matters if we believe crowdsourcing will actually grow to play a significant role in our future economy:

  3. How might we increase the utility, satisfaction, and beneficence of crowdsourcing for workers?

I look forward to discussing these questions with you at the workshop.

1: Thanks to Volker Wulf for this thought.

Shepherding the Crowd: An Approach to More Creative Crowd Work

By Steven Dow and Scott Klemmer (Stanford HCI Group)

Why should we approach crowdsourcing any differently from other collaborative computing systems? Sure, crowdsourcing platforms make on-demand access to people easier than ever before, and this access provides new opportunities for distributed systems and social experiments. However, workers are not simply “artificial artificial intelligence,” but real people with different skills, motivations, and aspirations. At what point did we stop treating people like human beings?

Our work focuses on people. Can we help workers improve their abilities? Can we keep them motivated? Can workers effectively carry out more creative and complex projects? Our experiments show that simple changes in work processes can significantly affect the quality of results. Our goal is to understand the cognitive, social, and motivational factors that govern creative work.

Along with our Berkeley colleagues Björn Hartmann and Anand Kulkarni, we introduce the Shepherd system to manage and provide feedback to workers on content-creation tasks. We propose two key features to help modern micro-task platforms accomplish more complex and creative work. First, formal feedback can improve worker motivation and task performance. Second, real-time visualizations of completed tasks can give requesters a means to monitor and shepherd workers. We hypothesize that providing infrastructural support for timely, task-specific feedback and worker interaction will lead to better-educated, more motivated workers and better work results. Our next experiment will compare externally provided feedback with self-assessment: does the added cost of assessing workers’ output pay off relative to simpler mechanisms such as asking workers to evaluate their own work?

What’s the potential for creative crowd work?  Check out The Johnny Cash Project and Star Wars Uncut.

Steven Dow examines design thinking, prototyping practices, and crowdsourcing as a Stanford postdoc and Scott Klemmer advocates for high-speed rail in America and co-directs the Stanford HCI group.

Leveraging Online Virtual Agents to Crowdsource Human-Robot Interaction

Human-Robot Interaction (HRI) studies the social aspects of robotic behavior. Results in HRI have emphasized the need to apply not only human-computer interaction design principles but also principles from psychology. Non-verbal social cues such as gaze, attention, prosody, transparency about goal-oriented behavior, and intentions are just a few aspects of behavior that become important with actuated agents. Along with these HRI principles, the community must also focus on traditional topics in robotics and machine learning: dialog management, navigation, manipulation, and learning by demonstration, to name a few.

The A.I. community has had many positive results with knowledge-based agency. Whether in the form of policy learning, symbolic (so-called classic) A.I., or straightforward sense-think-act or sense-act architectures, these results have been very promising. Much of this work has focused on knowledge acquisition from large corpora, both from crowds and from standard benchmark sources like the WSJ corpus. The analog in the robotics community has been work on learning agent behaviors tabula rasa (from scratch) from direct user interaction, an approach that sometimes emphasizes raising robots as if they were “babies” (motor babbling, learning by demonstration, kinematic learning, etc.).

Our paper argues that the HRI community can also benefit from a data-driven approach in which the agent mimics observed non-verbal behaviors and learns from observed dialog, tasks performed online, and labeled objects it can perceive. In a preliminary study we collected more than 50,000 interactions in our online game, Mars Escape, and used them to train our real-world robot to mimic the role the human played in the game (that of the robot). While our game does not cover everything we ultimately hope to draw on, we are just beginning to establish what a virtual agent could be trained to do from data gathered on the internet or, more generally, in a virtual world. I hope to present these results and discuss them with a group of experts who have had success in harnessing the crowd, both to obtain more appropriate data and to help us recruit more participants.
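To give a rough, hypothetical flavor of what training from logged interactions can mean at its simplest, the sketch below just retrieves the action a human player took in the most similar logged game context; the feature names and the nearest-neighbor policy are illustrative assumptions, not the method we used with Mars Escape.

```python
import math

# Hypothetical sketch: pick the action a human player took in the most similar
# logged game situation. Feature names and the nearest-neighbor policy are
# illustrative assumptions, not the Mars Escape training pipeline.

logged_interactions = [
    # (context, action) pairs harvested from game logs
    ({"dist_to_human": 1.2, "carrying_object": 0, "human_is_speaking": 1}, "face_human"),
    ({"dist_to_human": 4.0, "carrying_object": 1, "human_is_speaking": 0}, "deliver_object"),
    ({"dist_to_human": 0.8, "carrying_object": 0, "human_is_speaking": 0}, "idle_gaze"),
]

def distance(a, b):
    """Euclidean distance over the shared numeric context features."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))

def mimic(current_context):
    """Return the action a human took in the closest logged context."""
    _, action = min(
        ((distance(current_context, ctx), act) for ctx, act in logged_interactions),
        key=lambda pair: pair[0],
    )
    return action

print(mimic({"dist_to_human": 1.0, "carrying_object": 0, "human_is_speaking": 1}))
# -> "face_human"
```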

Video for this preliminary work can be found here. [Warning: file is large]

Making Databases more Human

by Adam Marcus (MIT CSAIL), Ph.D. Student

As Eugene Wu and I wrote in our crowd research workshop submission, it’s time to involve the (computer science) systems community in supporting human computation. We’re certainly not the only ones thinking about the topic, but I’d like to talk to you about two systems we’re building at MIT: Qurk for declarative specification of human computation workflows, and Djurk for standardizing human computation platform development.

Qurk lets you write queries in a declarative language (like SQL) that merges crowd- and silicon-powered operations. A simple query in Qurk to select images of males from a table of pictures would be “SELECT image_url FROM images WHERE gender(image_url) = ‘Male’;” In this case, gender is a user-defined function that asks the crowd to identify the gender of the person in the image.
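To make the idea concrete, here is a minimal sketch of what could sit underneath such a crowd-backed UDF; the `post_hit` helper, the option list, and the majority-vote policy are illustrative assumptions rather than Qurk’s actual API.

```python
from collections import Counter

def post_hit(question, image_url, options, num_workers):
    """Stand-in for posting a task to a micro-task platform and collecting
    num_workers answers; here we just return simulated worker responses."""
    return ["Male", "Male", "Cannot tell"][:num_workers]

def gender(image_url, num_workers=3):
    """Crowd-backed UDF: ask several workers and return the majority label."""
    answers = post_hit(
        question="What is the gender of the person in this image?",
        image_url=image_url,
        options=["Male", "Female", "Cannot tell"],
        num_workers=num_workers,
    )
    label, _ = Counter(answers).most_common(1)[0]
    return label

# The query engine would evaluate, for each row of the images table:
#   gender(row["image_url"]) == "Male"
print(gender("http://example.com/photo1.jpg"))
```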

Human computation and databases research have traditionally been separate. Why cross the streams?

  • Databases eat, speak, and breathe adaptive optimization. The parameters (money, accuracy, latency) and models are different, but databases can integrate these new models into traditional workflows.
  • Common operators, such as filters, joins, and sorts, give us common optimization goals. Databases speak a limited number of common and useful operations. Once we cast popular human computation tasks into this common language, the community can iteratively improve operator implementations.
  • Best practices can be encoded into a package of user-defined functions. Want to batch or verify HITs? Someone will likely have written a package you can use for it in Qurk.

The challenges in integrating databases and human computation are fourfold. First, we need to identify the signals (e.g., worker agreement rate) through which Qurk should adapt query execution. Second, we must learn how common building blocks (e.g., item comparisons or ratings) of larger algorithms (e.g., joins or sorts) are best implemented with the crowd. Third, we have to identify how new challenges (e.g., extremely high operation latency) change how we implement traditional query execution engines. Finally, we should identify the ideal crowd workflow specification language. Will we build workflows through traditional languages like SQL, visual workflow builders, or something completely different?
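To illustrate the second challenge, a crowd-powered sort might be built on a pairwise-comparison operator that aggregates redundant worker votes. The sketch below simulates the workers and is only one possible implementation, not Qurk’s:

```python
import functools
import random

def crowd_compare(item_a, item_b, votes=3):
    """Pairwise comparison operator: ask `votes` workers which item ranks
    higher and return -1, 0, or 1. Worker answers are simulated here with
    80%-accurate noise around the true numeric order."""
    truth = -1 if item_a < item_b else 1
    answers = [truth if random.random() < 0.8 else -truth for _ in range(votes)]
    tally = sum(answers)
    return -1 if tally < 0 else (1 if tally > 0 else 0)

def crowd_sort(items, votes=3):
    """Comparison sort built on the crowd-backed operator."""
    cmp = functools.partial(crowd_compare, votes=votes)
    return sorted(items, key=functools.cmp_to_key(cmp))

print(crowd_sort([4, 1, 3, 2]))
```

Every comparison costs real money and latency, so even this toy version shows why operator-level optimization (batching comparisons, skipping redundant ones) matters.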

We also offer a call-to-arms to the open source platform-building community. It pains us to see so many human computation platforms being built from scratch, each with its own set of quirks and limitations. Developer time should not be wasted re-implementing common human computation platform kernels. Like Hadoop does for distributed computation and WordPress does for publishing, we would like to see a pluggable, white label platform for human computation. This platform, which we call Djurk, would let developers innovate on questions that matter, such as incentives and interfaces, rather than building yet another job submission framework.
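To gesture at what “pluggable” might mean, here is a speculative sketch of two plug-in points such a platform could expose; the class names are invented for illustration and are not part of any existing system.

```python
from abc import ABC, abstractmethod

class IncentiveScheme(ABC):
    """Plug-in point: how workers are rewarded for completed tasks."""
    @abstractmethod
    def reward(self, worker_id, task_result):
        ...

class TaskInterface(ABC):
    """Plug-in point: how a task is rendered to workers."""
    @abstractmethod
    def render(self, task):
        ...

class FlatRate(IncentiveScheme):
    """Example plug-in: pay the same amount for every completed task."""
    def __init__(self, cents):
        self.cents = cents
    def reward(self, worker_id, task_result):
        return self.cents

# The shared kernel (job submission, worker routing, payment bookkeeping)
# would live in the platform; developers would only swap in plug-ins like
# FlatRate or a custom TaskInterface.
```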

We’re excited to meet everyone at the workshop!

Adam Marcus and Eugene Wu are Ph.D. students at MIT collectively advised by Sam Madden, Rob Miller, and David Karger.

What is a Question? Crowdsourcing Tweet Classification

Micro-task markets have been commonly used to crowdsource tasks such as the categorization of text or the labeling of images. However, there remain various challenges to crowdsourcing such tasks, especially if the task is not easy to define or if it calls for specific skills or expertise. We crowdsourced a task which seemed simple on the surface but turned out to be challenging to design and execute.

We were interested in exploring the types and topics of questions people were asking their friends on Twitter. The first step in this exploration was to identify tweets that were questions. Since manually classifying random tweets was time consuming and hard to scale, we crowdsourced tweet categorization to Mechanical Turk workers. Specifically, we asked workers to identify whether a given tweet was a question or not. This was non-trivial, as questions on Twitter were frequently ill-formed, short (tweets are limited to 140 characters), and written in a unique language full of shorthand notation.

Defining this task for workers proved more difficult than we had expected. While most humans know intuitively what a question is, it was hard for us to phrase the task instructions so as not to impose our own definition of ‘question’ on workers. Further, the unique characteristics of tweets made it hard for workers to select question tweets: some questions were rhetorical, some contained little context, and some were too short to understand.

Another challenge was ensuring high-quality data, so we designed several controls. First, we required workers to have a valid Twitter handle in order to do our task, which ensured their familiarity with the language and norms of Twitter. Next, to eliminate spam responses, we included some ‘control tweets’ in the list of tweets presented to Turkers. Control tweets were obviously ‘question’ or ‘not question’, so workers who did the task sincerely would rate them correctly. We only accepted data from workers who rated all control tweets correctly.
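A minimal sketch of this kind of filter, assuming each submission records the worker’s label for every tweet shown (including the embedded controls); the field names are illustrative:

```python
def passes_controls(submission, control_answers):
    """Accept a worker's submission only if every control tweet was labeled
    correctly. control_answers maps tweet_id -> expected label."""
    return all(
        submission["labels"].get(tweet_id) == expected
        for tweet_id, expected in control_answers.items()
    )

control_answers = {"t_101": "question", "t_102": "not_question"}

submissions = [
    {"worker": "A", "labels": {"t_101": "question", "t_102": "not_question", "t_203": "question"}},
    {"worker": "B", "labels": {"t_101": "not_question", "t_102": "not_question", "t_203": "question"}},
]

valid = [s for s in submissions if passes_controls(s, control_answers)]
print([s["worker"] for s in valid])  # only worker A survives the filter
```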

We found that this method of selecting questions from Twitter was not very scalable: only 29% of workers who completed the task provided valid (non-spam) data, and only 32% of the tweets they rated turned out to be questions. A large number of workers would therefore have to be recruited to obtain a decent sample of question tweets to study, and recruiting workers who were also Twitter users was hard. The controls did, however, ensure that we received high-quality data, and they underscored the value of including verifiable questions in tasks.
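Those yield rates imply a simple back-of-the-envelope estimate of how much labeling a target sample requires; in the sketch below only the two rates come from our data, while the target sample size and batch size are arbitrary assumptions:

```python
valid_worker_rate = 0.29   # fraction of workers providing non-spam data (from our task)
question_rate = 0.32       # fraction of validly rated tweets that were questions (from our task)
target_questions = 1000    # hypothetical sample size we might want to study
tweets_per_worker = 20     # hypothetical number of tweets shown per worker

tweets_needed = target_questions / question_rate
workers_needed = (tweets_needed / tweets_per_worker) / valid_worker_rate
print(round(tweets_needed), round(workers_needed))  # ~3125 rated tweets, ~539 workers recruited
```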

We analyzed the questions selected by Turkers by type and topic and found that a large percentage (42%) of these questions were rhetorical. We also found that the most popular topics for questions were entertainment, personal and health, and technology.

Our experience shows that there are still challenges to address in crowdsourcing simple human intelligence tasks such as classification of text. We look forward to sharing our methodology with the workshop participants and discussing ideas for dealing with such challenges.

User, Crowd, AI: The Future of Design

by Michael Bernstein (MIT CSAIL), workshop organizer

For years, the task of user interface design has boiled down to one critical question: agency. How much of the interaction relies on user input, and how much on algorithms and computation? Do we ask the user to sort through a pile of documents himself, or do we rely on an imperfect search engine? Does the user enter her own location in Foursquare, or does the phone instead try to triangulate her location using cell towers and GPS?

This question of agency leads to a design axis. At the ends we have completely user-controlled systems and completely AI-driven systems. Lots of designs sit somewhere in between.

But the future of interaction may well be social, and it may well be crowd-driven. Crowds have begun creeping into our user interfaces: Google Autosuggest uses others’ queries to accelerate yours, crisis maps like Ushahidi rely on crowdsourced contributions, and new systems like Soylent and VizWiz push the crowd directly into the end-user’s hands.

The User-AI axis no longer works. We need to introduce a third element: Crowd. Adding Crowd to the design space gives us a three-axis picture, one I call the Design Simplex, but which you are welcome to call a triangle.

There are a huge number of unexplored areas in this design space. What happens when we use crowds to quickly vet AIs that aren’t yet good enough for primetime on their own? (Pushing points farther toward ‘AI’ while maintaining a healthy heaping of ‘Crowd’.) We could deliver technologies to users years ahead of their general availability, all the while using the crowd’s work to train better AIs. What kinds of Crowd-User hybrids can we build that are more complex or powerful than Autosuggest?
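One informal way to make the simplex concrete is to give each system barycentric coordinates over (User, Crowd, AI) that sum to one; the placements below are rough guesses of my own, purely for illustration:

```python
# Hypothetical placements of systems in the (user, crowd, ai) simplex;
# each coordinate triple sums to 1.0 and the numbers are rough guesses.
design_points = {
    "manual document sorting": (1.0, 0.0, 0.0),
    "search engine ranking":   (0.2, 0.0, 0.8),
    "Google Autosuggest":      (0.3, 0.4, 0.3),
    "Soylent / VizWiz":        (0.3, 0.6, 0.1),
}

for name, (user, crowd, ai) in design_points.items():
    assert abs(user + crowd + ai - 1.0) < 1e-9
    print(f"{name:25s} user={user:.1f} crowd={crowd:.1f} ai={ai:.1f}")
```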

I’m excited to explore this space with you. We are just scratching the surface of what’s possible.

Michael Bernstein is a PhD student in computer science at the Massachusetts Institute of Technology. His dissertation research is on Crowd-powered Interfaces: user interfaces that directly embed crowd contributions to grant powerful new abilities to the user. He is a co-organizer of the CHI crowdsourcing workshop.

Using Humans to Gain Insight from Data

by Lydia Chilton (University of Washington), workshop organizer

My first degree and research experiences were in empirical microeconomics. One strong impression I took away from my experience working with large numeric data sets like the US Census is that there is a limit to the degree of insight numbers alone can provide. In particular, numbers can often tell you what is true, but not why it is true.

A famous labor economics study by David Card and Alan Krueger is a case in point. Card and Krueger used a natural experiment on the New Jersey/Pennsylvania border to test the well-established theory that increasing the minimum wage causes an increase in unemployment. The authors collected their own data and ran meticulous statistical tests, but astoundingly found the exact opposite of what the theory predicted: the mandated increase in the minimum wage actually decreased unemployment. To my knowledge, this result has never been explained. The numbers can’t tell us “why?”

I am frustrated by this lack of explanation. The numbers can’t tell us why, but behind the numbers are people – people who made decisions that led to this unexpected result. I constantly wonder: can’t we just ask them why?

The problem with asking “why” is the complexity, diversity, and nuance of the answers we get.  I believe that in order to answer “why” questions well, we need to develop new ways for humans to process the responses. Currently, we rely on numeric data sets because computers can process numbers quickly, and because statistical methods tell us how to draw conclusions from them. But today, with crowdsourcing platforms, we have the potential to use people to process human-generated data in order to gain more insight. For example,

  • We could add questions about individual job market decisions to the US Census. The ability to switch jobs is important to a healthy economy. We could ask people who would like to switch jobs, but haven’t, why they haven’t, and gain real insight about inefficiencies in the job market.
  • We could use existing free-text data such as Facebook status updates to probe questions like “Why are Hudson University students from lower income backgrounds more likely to fail freshman classes?” by detecting trends in inferred mental state and other life conditions revealed by the students.

In order to answer “why” questions effectively, we need 1) human computation algorithms that can use humans in parallel to analyze data and draw conclusions, and 2) a method for expressing our confidence in the results – an analog to the powerful statistical tests that express our confidence in numerical results.
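As a sketch of what (2) might look like, one generic option is a resampling-based interval: have several workers code each free-text response, take the majority label, and bootstrap over responses to express uncertainty in an estimated proportion. Everything below, including the example data, is hypothetical:

```python
import random
from collections import Counter

def majority(labels):
    """Majority label assigned by the workers who coded one response."""
    return Counter(labels).most_common(1)[0][0]

def bootstrap_ci(coded_responses, category, n_boot=2000, alpha=0.05):
    """Bootstrap a confidence interval for the share of responses whose
    majority-coded label equals `category`."""
    labels = [majority(codes) for codes in coded_responses]
    n = len(labels)
    props = []
    for _ in range(n_boot):
        resample = [random.choice(labels) for _ in range(n)]
        props.append(resample.count(category) / n)
    props.sort()
    return props[int(alpha / 2 * n_boot)], props[int((1 - alpha / 2) * n_boot) - 1]

# Each inner list holds the labels three workers gave one free-text answer
# to "Why haven't you switched jobs?" (hypothetical data).
coded = [
    ["wages", "wages", "hours"],
    ["moving costs", "wages", "moving costs"],
    ["wages", "wages", "wages"],
    ["family", "family", "wages"],
]
print(bootstrap_ci(coded, "wages"))
```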

Human experience and behavior are rich and varied, and number crunching alone can’t make sense of them. But human computation can, and we should explore that opportunity.

Lydia Chilton is a 2nd-year PhD student in computer science at the University of Washington and currently an intern at Microsoft Research in Beijing, China. She is an organizer of the CHI Crowdsourcing workshop and a co-author of TurKit and two labor economics studies of Mechanical Turk workers.

Real-time, Real-world Crowd Computing

by Rob Miller (MIT CSAIL), workshop organizer
with Jeff Bigham (University of Rochester) and Michael Bernstein (MIT CSAIL)

Many applications of crowd computing to date have been batch computation. Consider the ESP Game [1], LabelMe [2], reCAPTCHA [3], and most tasks seen on Mechanical Turk. In these applications, people label and transcribe data for a corpus that will eventually be used for indexing or machine learning. This work has two interesting properties. First, it’s asynchronous: the crowd eventually gets around to processing the data, but you wouldn’t want to sit around waiting for the answer. Second, it’s computation, and purely functional computation at that. The crowd works entirely in an online world, taking input in one digital form (like images or audio) and producing output in another (like text or labels). All the work happens on a web site, in cyberspace.

What if we relax those limits — first that crowd work is asynchronous, and second that it’s only about computation?  What if we could gather a crowd right now to work synchronously with the system, with an end-user, with each other?  What if the work involved the physical world, not just the digital one?  What new applications would be possible?  What new challenges would arise?

Our group at MIT CSAIL has already been studying what might be possible if we can do crowd work on demand, on behalf of an end user. In collaboration with Jeff Bigham at the University of Rochester, Joel Brandt at Adobe, and Björn Hartmann at Berkeley, among others, we have built several systems that explore aspects of this idea. Soylent [4] puts a crowd inside Microsoft Word, so that selecting some text and pressing a button gathers people on demand to help you edit. VizWiz [5] puts a crowd inside a blind person’s smartphone, letting them take photos and ask for help from sighted people on the web.

VizWiz is a point in this new design space for crowd computing.  First, it’s synchronous; the blind person is waiting for the answer, so VizWiz has to get an answer fast.  Using a basket of techniques, it often manages to get an answer (from people hired on Mechanical Turk) in 30 seconds or less.  Second, the end-user is mobile, out in the physical world, effectively carrying and directing a sensor (the phone’s camera) for the sake of the crowd. The crowd’s effort is still purely computational, but the real world is directly involved.

What if the crowd were also situated in the real world? What if the crowd carried the sensors on their own mobile devices? Google’s GPS traffic sensing and CarTel’s Pothole Patrol [6] are good examples of crowd sensing, but they are still asynchronous, not on demand. What if the crowd did physical work as well? A brilliant example of this is the San Ramon Fire Department’s iPhone app: if you have the app and someone near you is in cardiac arrest, the 911 dispatcher can pop up a notice on your phone with the location of the nearest automated external defibrillator, asking you to bring it to the victim. A small amount of effort exerted at the right time can save a life. What are the more everyday applications for crowd work in the real world?

Finally, a major research challenge for real-time, real-world crowd computing is the nature of the crowd itself.  “Crowd” typically implies a group of people making small contributions that may not be correct, or even well-intentioned.  How can we get a high-quality answer from noisy contributions made in a short time?  Soylent tackles the quality problem using algorithms, at the cost of greater latency; real-time requirements make this approach still more challenging.  VizWiz uses redundancy, getting multiple answers from the crowd.  How does redundancy work in the real world of limited resources and side-effects?  Multiple defibrillators arriving at a heart-attack scene can’t hurt, but would I have to ask the crowd for three cups of coffee just to guarantee that I’ll get at least one?  If I ask a crowd to buy the last donut on the shelf, what will the latecomers do?
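The coffee question has a simple probabilistic reading: if each independently recruited helper completes the errand with probability p, then k redundant requests succeed at least once with probability 1 - (1 - p)^k. A small sketch of that arithmetic, with the success probabilities chosen arbitrarily:

```python
def helpers_needed(p_success, target_confidence=0.95):
    """Smallest k such that at least one of k independent helpers succeeds
    with probability >= target_confidence, i.e. 1 - (1 - p)^k >= target."""
    k, p_at_least_one = 0, 0.0
    while p_at_least_one < target_confidence:
        k += 1
        p_at_least_one = 1 - (1 - p_success) ** k
    return k, round(p_at_least_one, 3)

# If any one passer-by fetches the coffee with probability 0.6, three requests
# give ~94% odds of at least one coffee and four give ~97%.
for p in (0.4, 0.6, 0.8):
    print(p, helpers_needed(p))
```

Of course, in the physical world the extra coffees are not free, so redundancy trades money and wasted effort against latency and reliability.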

Let’s think about real-time, real-world crowd computing, because it’s coming.

Rob Miller is an associate professor of computer science at MIT.  His research interests focus on people and programming.

References

  1. Luis von Ahn and Laura Dabbish. Labeling images with a computer game. CHI 2004.
  2. Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. LabelMe: A Database and Web-Based Tool for Image Annotation. Int. J. Comput. Vision 77(1-3), May 2008, 157-173.
  3. Luis von Ahn, Ben Maurer, Colin McMillen, David Abraham and Manuel Blum. reCAPTCHA: Human-Based Character Recognition via Web Security Measures. Science, 321 (September 2008): 1465–1468.
  4. Michael S. Bernstein, Greg Little, Robert C. Miller, Björn Hartmann, Mark S. Ackerman, David R. Karger, David Crowell, and Katrina Panovich. Soylent: a word processor with a crowd inside. UIST 2010.
  5. Jeffrey P. Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C. Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, Samuel White, and Tom Yeh. VizWiz: nearly real-time answers to visual questions. UIST 2010.
  6. Jakob Eriksson, Lewis Girod, Bret Hull, Ryan Newton, Samuel Madden, and Hari Balakrishnan. The pothole patrol: using a mobile sensor network for road surface monitoring.  MobiSys 2008.

The CrowdResearch Blog begins

Two years ago, crowdsourcing was fairly obscure in the research literature. Now, the field has sped past its tipping point: suddenly, it’s hard to escape. In 2010, many of the major computer science conferences (SIGIR, CVPR, NAACL, NIPS, and Ubicomp, to name a few) held crowdsourcing workshops, and many companies are looking to hire “crowdsourcing experts” – if such a thing even exists.

Part of the power of crowdsourcing is that it is applicable to so many diverse domains. Although it is enabled by computers and the Internet, it isn’t limited to computer science or information science. It is a tool for linguists, artists, computer vision researchers, search engines, and many companies. And it is an object of study from computational, economic, and humanistic perspectives. The current wave of interest in crowdsourcing is perhaps only a revival and re-envisioning of efforts that long predate the age of computers — stay tuned for more on that topic. Maybe there really is no new thing under the sun.

Although history repeats itself, we are seeing a wave of new and interesting ideas in crowdsourcing right now. So many, in fact, that it’s hard to keep track of them all.

The purpose of this blog is to present crowdsourcing developments, thoughts, and challenges across all domains of research. Our motivating question is:

What are the most important research questions and real-world problems in crowdsourcing, and how do we solve them?

To kick things off, over the next several months this blog will post articles written by participants in the upcoming CHI 2011 Workshop on Crowdsourcing and Human Computation (May 8, 2011).

If you consider yourself a part of the crowdsourcing community, we welcome your views.