Micro-task markets have been commonly used to crowdsource tasks such as the categorization of text or the labeling of images. However, challenges remain, especially when a task is hard to define or calls for specific skills or expertise. We crowdsourced a task that seemed simple on the surface but proved challenging to design and execute.
We were interested in exploring the types and topics of questions people were asking their friends on Twitter. The first step in this exploration was to identify tweets that were questions. Since manually classifying random tweets was time-consuming and hard to scale, we crowdsourced tweet categorization to Mechanical Turk workers. Specifically, we asked workers to identify whether a given tweet was a question or not. This was non-trivial because questions on Twitter were frequently ill-formed, short (at most 140 characters), and written in a distinctive language full of shorthand notations.
Defining this task for workers proved more difficult than we had expected. While most humans intuitively know what a question is, it was hard to phrase the task instructions without imposing our own definition of ‘question’ on workers. Further, the unique characteristics of tweets made it hard for workers to identify question tweets: some questions were rhetorical, some contained little context, and some were too short to understand.
Another challenge was ensuring high-quality data, for which we designed several controls. First, we required workers to have a valid Twitter handle in order to do our task; this ensured their familiarity with the language and norms of Twitter. Next, to eliminate spam responses, we included some ‘control tweets’ in the list of tweets presented to Turkers. Control tweets were obviously ‘question’ or obviously ‘not question’, so workers who did the task sincerely would rate them correctly. We only accepted data from workers who rated all control tweets correctly.
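The control-tweet filter can be sketched in a few lines. This is an illustrative reconstruction, not our actual task code; the tweet identifiers, field names, and example ratings are all hypothetical.

```python
# Hypothetical sketch of the control-tweet spam filter described above.
# Gold labels for the control tweets: True = question, False = not a question.
CONTROL_LABELS = {
    "ctrl_1": True,   # obviously a question
    "ctrl_2": False,  # obviously not a question
}

def is_valid_submission(ratings):
    """Accept a worker's batch only if every control tweet is rated correctly.

    `ratings` maps tweet IDs to the worker's True/False judgments.
    """
    return all(ratings.get(tweet_id) == gold
               for tweet_id, gold in CONTROL_LABELS.items())

# Example: a sincere worker passes, a spammer who rates everything
# 'question' fails on the second control tweet.
sincere = {"ctrl_1": True, "ctrl_2": False, "tweet_42": True}
spammer = {"ctrl_1": True, "ctrl_2": True, "tweet_42": True}
```

The all-or-nothing acceptance rule mirrors the strict criterion described above: a single wrong control rating discards the whole submission.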
We found that this method of selecting questions from Twitter did not scale well. Only 29% of workers who completed the task provided valid (non-spam) data, and only 32% of the tweets those workers rated turned out to be questions. Thus, a large number of workers would have to be recruited to obtain a reasonable sample of question tweets, and recruiting workers who were Twitter users was itself hard. However, the controls ensured that the data we did receive was of high quality, and underscored the need to include verifiable control items in such tasks.
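The two yield rates above imply a rough recruiting estimate. The sketch below is a back-of-the-envelope calculation, not part of the study; in particular, the batch size per worker is an assumed value for illustration.

```python
import math

# Rates reported above; tweets_per_worker is a hypothetical batch size.
valid_worker_rate = 0.29   # workers whose data was valid (non-spam)
question_rate = 0.32       # rated tweets that were actually questions
tweets_per_worker = 20     # assumed tweets rated per recruited worker

def workers_needed(target_questions):
    """Expected number of workers to recruit for a target sample of questions."""
    questions_per_worker = valid_worker_rate * tweets_per_worker * question_rate
    return math.ceil(target_questions / questions_per_worker)
```

Under these assumptions each recruited worker yields fewer than two question tweets on average, which is why assembling even a modest study sample required many workers.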
We analyzed the questions selected by Turkers by type and topic and found that a large percentage (42%) of these questions were rhetorical. We also found that the most popular topics for questions were entertainment, personal and health, and technology.
Our experience shows that there are still challenges to address in crowdsourcing simple human intelligence tasks such as classification of text. We look forward to sharing our methodology with the workshop participants and discussing ideas for dealing with such challenges.