What is a Question? Crowdsourcing Tweet Classification

Micro-task markets have been commonly used to crowdsource tasks such as the categorization of text or the labeling of images. However, there remain various challenges to crowdsourcing such tasks, especially if the task is not easy to define or if it calls for specific skills or expertise. We crowdsourced a task which seemed simple on the surface but turned out to be challenging to design and execute.

We were interested in exploring the types and topics of questions people were asking their friends on Twitter. The first step in this exploration was to identify tweets that were questions. Since manually classifying random tweets was time-consuming and hard to scale, we crowdsourced tweet categorization to Mechanical Turk workers. Specifically, we asked workers to identify whether a given tweet was a question or not. This was non-trivial: questions on Twitter were frequently ill-formed, short (at most 140 characters), and written in a distinctive language full of shorthand notations.

Defining this task for workers proved more difficult than we had expected. While most humans intuitively know what a question is, it was hard for us to phrase the task instructions so as not to impose our own definition of ‘question’ on workers. Further, the unique characteristics of tweets made it hard for workers to select question tweets: some questions were rhetorical, some contained little context, and some were too short to understand.

Another challenge was ensuring high-quality data, for which we designed several controls. First, we required workers to have a valid Twitter handle in order to do our task, which ensured their familiarity with the language and norms of Twitter. Next, to eliminate spam responses, we included some ‘control tweets’ in the list of tweets presented to Turkers. Control tweets were obviously ‘question’ or ‘not question’, so workers who did the task sincerely would be able to rate them correctly. We only accepted data from workers who rated all control tweets correctly.
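This all-or-nothing acceptance rule can be sketched in a few lines. The data layout, tweet IDs, and labels below are illustrative assumptions, not our actual pipeline:

```python
# Gold labels for the 'control tweets'; a worker's batch is accepted
# only if every control tweet is labeled correctly.
GOLD = {"c1": "question", "c2": "not_question"}  # illustrative IDs and labels

def passes_controls(responses, gold=GOLD):
    """True if this worker labeled all control tweets correctly."""
    return all(responses.get(tweet_id) == label for tweet_id, label in gold.items())

def filter_valid_batches(batches, gold=GOLD):
    """Keep only the batches submitted by workers who passed every control."""
    return {worker: r for worker, r in batches.items() if passes_controls(r, gold)}
```

A worker who mislabels even one control tweet is dropped, matching the acceptance rule described above.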

We found that this method of selecting questions from Twitter was not very scalable. Only 29% of workers who completed the task provided valid (non-spam) data. Further, only 32% of the tweets they rated were found to be questions. Thus, a large number of workers would have to be recruited to obtain a decent sample of question tweets to study. Moreover, recruiting workers who were Twitter users was hard. However, the controls ensured that we received high-quality data and underscored the need to include verifiable questions in tasks.
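A back-of-the-envelope calculation makes the scalability problem concrete. The two rates are the ones reported above; the batch size per worker is a made-up parameter:

```python
import math

def ratings_needed(target_questions, question_rate=0.32):
    """Accepted tweet ratings required to surface a target number of questions."""
    return math.ceil(target_questions / question_rate)

def workers_needed(target_questions, tweets_per_worker=20,
                   valid_worker_rate=0.29, question_rate=0.32):
    """Workers to recruit, given that only a fraction submit valid data."""
    valid_workers = math.ceil(
        ratings_needed(target_questions, question_rate) / tweets_per_worker)
    return math.ceil(valid_workers / valid_worker_rate)
```

Collecting even a few hundred question tweets this way means recruiting on the order of a couple hundred Twitter-using workers, which is why the approach strained our recruiting.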

We analyzed the questions selected by Turkers by type and topic and found that a large percentage (42%) of these questions were rhetorical. We also found that the most popular topics for questions were entertainment, personal and health, and technology.

Our experience shows that there are still challenges to address in crowdsourcing simple human intelligence tasks such as classification of text. We look forward to sharing our methodology with the workshop participants and discussing ideas for dealing with such challenges.

User, Crowd, AI: The Future of Design

by Michael Bernstein (MIT CSAIL), workshop organizer

For years, the task of user interface design has boiled down to one critical question: agency. How much of the interaction relies on user input, and how much on algorithms and computation? Do we ask the user to sort through a pile of documents himself, or do we rely on an imperfect search engine? Does the user enter her own location in Foursquare, or does the phone instead try to triangulate her location using cell towers and GPS?

This question of agency leads to a design axis. At the ends we have completely user-controlled systems and completely AI-driven systems. Lots of designs sit somewhere in between.

But the future of interaction may well be social, and it may well be crowd-driven. Crowds have begun creeping into our user interfaces: Google Autosuggest uses others’ queries to accelerate yours, crisis maps like Ushahidi rely on crowdsourced contributions, and new systems like Soylent and VizWiz push the crowd directly into the end-user’s hands.

The User-AI axis no longer works. We need to introduce a third element: Crowd. Adding Crowd to the design space gives us a three-axis picture, one I call The Design Simplex, but which you are welcome to call a triangle.

There are a huge number of unexplored areas in this design space. What happens when we use crowds to quickly vet AIs that aren’t yet good enough for primetime on their own? (Pushing the points farther toward ‘AI’ while maintaining a healthy heaping of ‘Crowd’.) We could deliver technologies to users years ahead of their general availability, all the while using the crowd’s work to train better AIs. What kinds of Crowd-User hybrids can we build that are more complex or powerful than Autosuggest?

I’m excited to explore this space with you. We are just scratching the surface of what’s possible.

Michael Bernstein is a PhD student in computer science at the Massachusetts Institute of Technology. His dissertation research is on Crowd-powered Interfaces: user interfaces that directly embed crowd contributions to grant powerful new abilities to the user. He is a co-organizer of the CHI crowdsourcing workshop.

Using Humans to Gain Insight from Data

by Lydia Chilton (University of Washington), workshop organizer

My first degree and research experiences were in empirical microeconomics. One strong impression I took away from my experience working with large numeric data sets like the US Census is that there is a limit to the degree of insight numbers alone can provide. In particular, numbers can often tell you what is true, but not why it is true.

A famous labor economics study by David Card and Alan Krueger is a case in point. The Card and Krueger study uses a natural experiment on the New Jersey/Pennsylvania border to test the well-established theory that increasing minimum wage causes an increase in unemployment. The authors collected their own data and ran meticulous statistical tests, but astoundingly found the exact opposite of what the theory predicted. They found that the mandated increase in minimum wages actually decreased unemployment. To my knowledge, this result has never been explained. The numbers can’t tell us “why?”

I am frustrated by this lack of explanation. The numbers can’t tell us why, but behind the numbers are people – people who made decisions that led to this unexpected result. I constantly wonder: can’t we just ask them why?

The problem with asking “why” is the complexity, diversity, and nuance of the answers we get.  I believe that in order to answer “why” questions well, we need to develop new ways for humans to process the responses. Currently, we rely on numeric data sets because computers can process numbers quickly, and because statistical methods tell us how to draw conclusions from them. But today, with crowdsourcing platforms, we have the potential to use people to process human-generated data in order to gain more insight. For example,

  • We could add questions about individual job market decisions to the US Census. The ability to switch jobs is important to a healthy economy. We could ask people who would like to switch jobs, but haven’t, why they haven’t, and gain real insight about inefficiencies in the job market.
  • We could use existing free-text data such as Facebook status updates to probe questions like “Why are Hudson University students from lower income backgrounds more likely to fail freshman classes?” by detecting trends in inferred mental state and other life conditions revealed by the students.

In order to answer “why” questions effectively, we need 1) human computation algorithms that can use humans in parallel to analyze data and draw conclusions, and 2) a method for expressing our confidence in the results – an analog to the powerful statistical tests that express our confidence in numerical results.
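As one concrete, admittedly simplistic analog to a statistical test, we could attach a confidence interval to the fraction of respondents who give the majority answer to a “why” question. The Wilson score interval below is a standard binomial interval; applying it to crowd-coded free-text answers is an illustrative sketch, not an established method:

```python
import math

def agreement_interval(agree, total, z=1.96):
    """Wilson score interval for the proportion of respondents giving the
    majority answer (z=1.96 corresponds to ~95% confidence)."""
    if total == 0:
        return (0.0, 1.0)
    p = agree / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (centre - half, centre + half)
```

If 8 of 10 respondents give the same reason the interval is wide, while 80 of 100 tightens it considerably; that is exactly the kind of quantified confidence that numeric methods already enjoy and human-generated data lacks.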

Human experience and behavior is rich and varied, and number crunching alone can’t understand it.  But human computation can, and we should explore that opportunity.

Lydia Chilton is a 2nd-year PhD student in computer science at the University of Washington and currently an intern at Microsoft Research in Beijing, China. She is an organizer of the CHI Crowdsourcing workshop and a co-author of TurKit and two labor economics studies of Mechanical Turk workers.

Real-time, Real-world Crowd Computing

by Rob Miller (MIT CSAIL), workshop organizer
with Jeff Bigham (University of Rochester) and Michael Bernstein (MIT CSAIL)

Many applications of crowd computing to date have centered on batch computation. Consider the ESP Game [1], LabelMe [2], reCAPTCHA [3], and most tasks seen on Mechanical Turk.  In these applications, people are labeling and transcribing data for a corpus that will eventually be used for indexing or machine learning. This work has two interesting properties.  First, it’s asynchronous.  Eventually the crowd gets around to processing it, but you wouldn’t want to sit around waiting for the answer.  Second, it’s computation, and purely functional computation at that.  The crowd is working entirely in an online world, taking input data in one digital form (like images or audio) and producing output in another digital form (like text or labels).  All the work happens on a web site, in cyberspace.

What if we relax those limits — first that crowd work is asynchronous, and second that it’s only about computation?  What if we could gather a crowd right now to work synchronously with the system, with an end-user, with each other?  What if the work involved the physical world, not just the digital one?  What new applications would be possible?  What new challenges would arise?

Our group at MIT CSAIL has already been studying what might be possible if we can do crowd work on demand, on behalf of an end user.  In collaboration with Jeff Bigham at University of Rochester, Joel Brandt at Adobe, and Bjoern Hartmann at Berkeley, among others, we have built several systems that explore aspects of this idea.  Soylent [4] puts a crowd inside Microsoft Word, so that selecting some text and pressing a button gathers people on demand to help you edit.  VizWiz [5] puts a crowd inside a blind person’s smartphone, letting them take photos and ask for help from sighted people on the web.

VizWiz is a point in this new design space for crowd computing.  First, it’s synchronous; the blind person is waiting for the answer, so VizWiz has to get an answer fast.  Using a basket of techniques, it often manages to get an answer (from people hired on Mechanical Turk) in 30 seconds or less.  Second, the end-user is mobile, out in the physical world, effectively carrying and directing a sensor (the phone’s camera) for the sake of the crowd. The crowd’s effort is still purely computational, but the real world is directly involved.

What if the crowd were also situated in the real world?  What if the crowd carried the sensors on their own mobile devices?  Google’s GPS traffic sensing and CarTel’s Pothole Patrol [6] are good examples of crowd sensing, but still asynchronous, not on demand. What if the crowd did physical work as well?  A brilliant example of this is the San Ramon Fire Department’s iPhone app.  If you have this app and someone near you is in cardiac arrest, the 911 dispatcher can pop up a notice on your phone with the location of the nearest Automated External Defibrillator, asking you to bring it to the heart attack victim.  A small amount of effort exerted at the right time can save a life.  What are the more everyday applications for crowd work in the real world?

Finally, a major research challenge for real-time, real-world crowd computing is the nature of the crowd itself.  “Crowd” typically implies a group of people making small contributions that may not be correct, or even well-intentioned.  How can we get a high-quality answer from noisy contributions made in a short time?  Soylent tackles the quality problem using algorithms, at the cost of greater latency; real-time requirements make this approach still more challenging.  VizWiz uses redundancy, getting multiple answers from the crowd.  How does redundancy work in the real world of limited resources and side-effects?  Multiple defibrillators arriving at a heart-attack scene can’t hurt, but would I have to ask the crowd for three cups of coffee just to guarantee that I’ll get at least one?  If I ask a crowd to buy the last donut on the shelf, what will the latecomers do?
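One simple redundancy scheme in the spirit of VizWiz is to accept the first answer that k workers independently agree on, falling back to the plurality answer if agreement never arrives. The threshold k and the answer stream below are illustrative, not VizWiz’s actual algorithm:

```python
from collections import Counter

def first_agreed_answer(answers, k=2):
    """Process answers in arrival order; return the first one that reaches
    k matching votes, minimizing latency for the waiting end-user."""
    counts = Counter()
    for answer in answers:
        counts[answer] += 1
        if counts[answer] >= k:
            return answer
    # No k-way agreement: fall back to the most common answer, if any.
    return counts.most_common(1)[0][0] if counts else None
```

Agreement cuts off the stream early, so the user need not wait for every worker; but as the donut example suggests, this only works when answers, unlike physical goods, can be duplicated for free.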

Let’s think about real-time, real-world crowd computing, because it’s coming.

Rob Miller is an associate professor of computer science at MIT.  His research interests focus on people and programming.


  1. Luis von Ahn and Laura Dabbish. Labeling images with a computer game. CHI 2004.
  2. Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. LabelMe: A Database and Web-Based Tool for Image Annotation. Int. J. Comput. Vision 77, 1-3 (May 2008), 157-173.
  3. Luis von Ahn, Ben Maurer, Colin McMillen, David Abraham and Manuel Blum. reCAPTCHA: Human-Based Character Recognition via Web Security Measures. Science, 321 (September 2008): 1465–1468.
  4. Michael S. Bernstein, Greg Little, Robert C. Miller, Björn Hartmann, Mark S. Ackerman, David R. Karger, David Crowell, and Katrina Panovich. Soylent: a word processor with a crowd inside. UIST 2010.
  5. Jeffrey P. Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C. Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, Samual White, and Tom Yeh. VizWiz: nearly real-time answers to visual questions. UIST 2010.
  6. Jakob Eriksson, Lewis Girod, Bret Hull, Ryan Newton, Samuel Madden, and Hari Balakrishnan. The pothole patrol: using a mobile sensor network for road surface monitoring.  MobiSys 2008.

The CrowdResearch Blog begins

Two years ago, crowdsourcing was fairly obscure in the research literature.  Now, the field has sped past its tipping point: suddenly, it’s hard to escape.  In 2010, many of the major computer science conferences (SIGIR, CVPR, NAACL, NIPS, and Ubicomp, to name a few) held crowdsourcing workshops, and many companies are looking to hire “crowdsourcing experts” – if such a thing even exists.

Part of the power of crowdsourcing is that it is applicable to so many diverse domains.  Although it is enabled by computers and the Internet, it isn’t limited to computer science or information science.  It is a tool for linguists, artists, computer vision researchers, search engines, and many companies.  And it is an object of study from computational, economic, and humanistic perspectives.  The current wave of interest in crowdsourcing is perhaps only a revival and re-envisioning of past efforts exerted long before the age of computers — stay tuned for more on that topic.  Maybe there really is no new thing under the sun.

Although history repeats itself, right now we are seeing a wave of new and interesting ideas in crowdsourcing.  So many, in fact, that it’s hard to keep track of them all.

The purpose of this blog is to present crowdsourcing developments, thoughts, and challenges across all domains of research. Our motivating question is:

What are the most important research questions and real-world problems in crowdsourcing, and how do we solve them?

To kick things off, over the next several months this blog will post articles written by participants in the upcoming CHI 2011 Workshop on Crowdsourcing and Human Computation (May 8, 2011).

If you consider yourself a part of the crowdsourcing community, we welcome your views.