ReTool: Interactive Microtask and Workflow Design through Demonstration

Recently, there has been an increasing number of crowdsourcing microtasks that require freeform interactions directly on the content (e.g., drawing bounding boxes over specific objects in an image, or marking specific time points on a video clip). However, existing crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk) and CrowdFlower (CF), do not directly support designing interactive microtasks. To design interactive microtasks, especially ones with workflows, requesters have to use programming-based approaches such as TurKit and the AMT SDKs. The need for programming skills, however, sets a significant threshold for many requesters.


To lower the barrier to entry for designing and deploying interactive microtasks with workflows, we developed ReTool, a web-based tool that simplifies the process by applying the “Programming by Demonstration” (PbD) concept. In our context, PbD refers to the mechanism by which requesters design interactive microtasks with workflows by giving an example of how the tasks can be completed.

Working with ReTool, a requester designs and publishes microtasks in four main steps:

  • Project Creation: The requester creates a project and uploads a piece of sample content to be crowdsourced.
  • Microtask and Workflow Generation: Depending on the type (text or image) of the sample content, a content-specific workspace is generated. The requester then performs a sequence of interactions (e.g., tapping-and-dragging, clicking) on the content within the workspace. The interactions are recorded and analyzed to generate interactive microtasks with workflows.
  • Previewing Microtask Interface & Workflow: At this step, the requester can preview microtasks and workflows, edit instructions and other properties (e.g., the number of workers), and add verification tasks and advanced workflows (conditional and looping workflows).
  • Microtask Publication: The requester uploads all content to be crowdsourced and receives a URL for accessing the available microtasks. The link can be published to crowdsourcing marketplaces or social network platforms.
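The paper does not specify ReTool's internal representation, but the demonstration-to-microtask mapping in the second step can be sketched roughly as follows. The event names, the `Microtask` structure, and the event-to-task table are all hypothetical illustrations, not ReTool's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Microtask:
    """One generated microtask: an interaction type applied to content."""
    interaction: str   # e.g. "draw_bounding_box", "mark_time_point"
    instruction: str   # editable by the requester in the preview step
    workers: int = 3   # default redundancy, adjustable per task

def generate_workflow(recorded_events):
    """Map a requester's demonstrated interactions to a microtask sequence.

    `recorded_events` is a list of (event_type, target) pairs captured
    while the requester works on the sample content.
    """
    # Hypothetical mapping from low-level interaction events to task types.
    event_to_task = {
        "drag_rectangle": "draw_bounding_box",
        "click_timeline": "mark_time_point",
        "select_text": "highlight_text",
    }
    workflow = []
    for event_type, target in recorded_events:
        task_type = event_to_task.get(event_type)
        if task_type is not None:
            workflow.append(Microtask(
                interaction=task_type,
                instruction=f"Please {task_type.replace('_', ' ')} on {target}.",
            ))
    return workflow

demo = [("drag_rectangle", "the image"), ("click_timeline", "the video")]
tasks = generate_workflow(demo)
print([t.interaction for t in tasks])  # → ['draw_bounding_box', 'mark_time_point']
```

The key idea is that each demonstrated interaction becomes one microtask in an ordered workflow, which the requester can then edit in the preview step.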

We conducted a user study to find out how potential requesters with varying programming skills use ReTool. We compared ReTool with a lower-bound baseline, the MTurk online design tool, as an alternative approach. We recruited 14 participants from different university faculties, taught them what crowdsourcing is and how to design microtasks using both tools, and then asked them to complete three design tasks. The results show that ReTool helps not only programmers but also non-programmers and new crowdsourcers to design complex microtasks and workflows in a fairly short time.

For more details, please see our full paper ReTool: Interactive Microtask and Workflow Design through Demonstration published at CHI 2017.

Chen Chen, National University of Singapore

Xiaojun Meng, National University of Singapore

Shengdong Zhao, National University of Singapore

Morten Fjeld, Chalmers University of Technology

CrowdCamp Report: Situated Crowdsourced Access

Navigating through streets and within buildings might seem like a trivial activity; however, it is often a challenge for people with visual impairments. Over the last few years, innovations in sensors, devices, and smartphone apps have attempted to improve universal access and make navigation easier. However, the technology is not there yet.

Current approaches still raise safety concerns: unexpected hazards from construction or vehicle placement can hurt a user who could have benefited from real-time notifications or help. For example, a visually impaired person uses a white cane as a primary mobility tool, which helps track objects that obstruct the path being taken. But some situations, such as heavy hanging objects or objects protruding from a wall (e.g., artistic displays), cannot be detected by a white cane and can cause severe head injury if not addressed. Recognizing these hazards is beyond sensors' capabilities, but easy for humans.

In this project, we designed an approach that helps address some of these problems by adding humans in the loop, as sensors and actors, to assist with accessibility questions and problems. For example, in the figure below, if a user (green) has to go to a coffee shop (A or B), she can quickly query the route and use location-based services such as Twitter, where the crowd consists of people (orange) in the neighborhood who can notify her about a problem if one exists. This can inform both the user's decision making and the navigation system's path recommendation.

Sample route map

The approach can be implemented using the workflow architecture shown below:

Workflow architecture of the proposed set up

In this approach, an end user can make a request by setting his or her ability preferences with respect to time, cost, location, and more. The request is then broadcast to people or volunteers in the neighborhood, who can respond, providing the system with updated information about the situation at the click of a button. To create this system, we envisioned using either Twitter or a custom app.

Twitter Approach: Twitter, a very popular social medium, has attracted many users who could help others in their area without a request even being directed at a specific volunteer. To assess this idea, we first needed to determine whether the user base in a given area is large enough, i.e., whether there are sufficient tweets from a given region. Taking Pittsburgh as our point of interest, we calculated the average frequency of tweets. As shown below, there is on average at least one tweet every 12 seconds, a promising result.
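The frequency estimate above reduces to averaging the gaps between consecutive tweet timestamps from a region. A minimal sketch, where the sample timestamps are made-up illustrations rather than real Pittsburgh tweets:

```python
from datetime import datetime

def average_tweet_interval(timestamps):
    """Average gap in seconds between consecutive tweets.

    `timestamps` is a list of ISO-8601 strings for tweets collected
    from one region.
    """
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical sample: four tweets within 36 seconds.
sample = [
    "2014-11-08T10:00:00",
    "2014-11-08T10:00:10",
    "2014-11-08T10:00:25",
    "2014-11-08T10:00:36",
]
print(average_tweet_interval(sample))  # → 12.0
```

An average interval at or below the 12-second figure quoted in the post would suggest the region has enough active users to field real-time requests.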

Tweets on average

Custom Application: As part of our brainstorming and prototyping process, we also developed a homescreen app by extending the concept of Twitch Crowdsourcing [1]. It lets users provide an answer just by unlocking their phone. One particular use case of this app is shown below: if a visually impaired user (VU) has a question about the presence of a nearby curb ramp or the number of steps, the request becomes visible to people in the location of the VU's interest, who can make a binary decision simply by checking their mobile phones. This makes the entire experience seamless, with minimal cognitive load.

Screenshot of unlock screen application

We believe that by using situated crowdsourcing, we can overcome the limitations of current sensor technology and real-world deployment, and better empower people with visual disabilities to navigate through buildings or cities more independently.

[1] Rajan Vaish, Keith Wyngarden, Jingshu Chen, Brandon Cheung, and Michael S. Bernstein. 2014. Twitch crowdsourcing: crowd contributions in short bursts of time. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’14). ACM, New York, NY, USA, 3645-3654.

[2] Simo Hosio, Jorge Goncalves, Vili Lehdonvirta, Denzil Ferreira, and Vassilis Kostakos. 2014. Situated crowdsourcing using a market model. In Proceedings of the 27th annual ACM symposium on User interface software and technology (UIST ’14). ACM, New York, NY, USA, 55-64.

Rajan Vaish, University of California, Santa Cruz, USA
Walter Lasecki, University of Rochester, USA
Lydia Manikonda, Arizona State University, USA

The Human Flesh Search: Large-Scale Crowdsourcing for a Decade and Beyond

Human Flesh Search (HFS, 人肉搜索 in Chinese), a Web-enabled large-scale crowdsourcing phenomenon (mostly based on voluntary crowd power without cash rewards), originated in China a decade ago. It is a new form of search and problem-solving scheme that involves collaboration among a potentially large number of voluntary Web users. The term “human flesh,” an unfortunate translation of its Chinese name, refers to human empowerment (crowd-powered search is in fact a more appropriate English name). HFS has seen tremendous growth since its inception in 2001 (Figure 1).

Figure 1. (a) Types of HFS episodes, and (b) evolution of HFS episodes based on social desirability.

HFS has been a unique Web phenomenon for just over 10 years, and it presents a valuable test-bed for scientists to validate existing and new theories in social computing, sociology, behavioral sciences, and so forth. Based on a comprehensive dataset of HFS episodes collected from participants' discussions on the Internet, we performed a series of empirical studies focusing on the scope of HFS activities, the patterns of the HFS crowd collaboration process, and the unique characteristics and dynamics of HFS participant networks. More results on HFS participant networks can be found in two papers published in 2010 and 2012 (additional readings 1 and 2). In this paper, we surveyed HFS participants to gain an in-depth understanding of the HFS community and the various factors that motivate participants to contribute. The survey results shed light on HFS participants and, more generally, on people involved in crowdsourcing systems: most participants contribute voluntarily, without expecting monetary rewards (whether real-world or virtual).
The findings indicate great potential for researchers to explore how to design more effective and efficient crowdsourcing systems, and how to better harness the power of crowds for social good, for complex problem solving, and even for business purposes such as marketing and management. For more, see our full paper, The Chinese “Human Flesh” Web: the first decade and beyond (free download link; preprint also available upon request).

Qingpeng Zhang, City University of Hong Kong

Additional readings:

  1. Wang F-Y, Zeng D, Hendler J A, Zhang Q, et al (2010). A study of the human flesh search engine: Crowd-powered expansion of online knowledge. Computer, 43: 45-53. doi:10.1109/MC.2010.216
  2. Zhang Q, Wang F-Y, Zeng D, Wang T (2012). Understanding crowd-powered search groups: A social network perspective. PLoS ONE 7(6): e39749. doi:10.1371/journal.pone.0039749

Methodological Debate: How much to pay Indian and US citizens on MTurk?

This is a broadcast search request (hopefully of interest to many readers of the blog), not the presentation of research results.

When conducting research on Amazon Mechanical Turk (MTurk), you always face the question of how much to pay workers. You want to be fair, to incentivize diligent work, to expedite recruiting, to sample a somewhat representative cross-section of Turkers, etc. For the US, I generally aim at $7.50 per hour, slightly more than the minimum wage in the US (although that is non-binding) and presumably slightly higher than the average wage on MTurk. I am now planning a cross-cultural study comparing survey responses and experiment behavior of Turkers registered as residing in India with those of US workers. How much should I pay in the US, and how much in India? For the US it is easy: $7.50 × (expected duration of the HIT in minutes / 60). And India?

The two obvious alternatives are

  1. Pay the same for Indian workers as US workers: $7.50 per hour. MTurk is a global market place in which workers from many nations compete. It’s only fair to pay the same rate for the same work.
  2. Adjust the wage to the national price level: ~$2.50 per hour. A dollar buys more in India than in the US. Paying the same nominal rate leads to higher incentives for Indian workers and might bias sampling, effort, and results. According to The World Bank, the ratio of the purchasing power parity conversion factor to the market exchange rate for India compared to the US is 0.3, so $7.50 in the US would correspond to $2.25 in India. Based on The Economist's Big Mac index, one could argue for $2.49 in India (raw index) to $4.50 (adjusted index). According to Ashenfelter (2012), wages in McDonald's restaurants in India are 6% of the wage at a McDonald's restaurant in the US, which could translate to paying $0.45 per hour on MTurk. Given the wide range of estimates, $2.50 might be a reasonable value.
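The arithmetic behind these estimates is a single multiplication; a quick sketch using the figures quoted in the post (the dictionary labels are ours):

```python
def adjusted_wage(us_hourly, conversion_factor):
    """Scale a US hourly wage by a price-level conversion factor."""
    return round(us_hourly * conversion_factor, 2)

US_HOURLY = 7.50

# Conversion factors implied by the estimates discussed in the post.
estimates = {
    "World Bank PPP factor": 0.3,           # → $2.25
    "Big Mac index (raw)": 2.49 / 7.50,
    "Big Mac index (adjusted)": 4.50 / 7.50,
    "McDonald's wage ratio": 0.06,          # → $0.45
}

for label, factor in estimates.items():
    print(f"{label}: ${adjusted_wage(US_HOURLY, factor):.2f}")
```

The spread between the lowest ($0.45) and highest ($4.50) estimate is an order of magnitude, which is exactly why the choice of criterion matters.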

What should be the criteria to decide and which of these two is better?

I appreciate any comments and suggestions and hope that these will be valuable to me and to other readers of Follow the Crowd.

What do your food & drink habits tell about your culture?

Traditional ways to study cross-cultural differences depend on surveys, which are costly and do not scale. We present another way to obtain similar data that could revolutionize the study of global culture.

We propose the use of publicly available data from location-based social networks (LBSNs) to map individual preferences. This is interesting because an LBSN check-in expresses a user's preference for a certain type of place. LBSNs also have the advantage of being accessible (almost) everywhere by anyone, which solves the scalability problem and allows data from the entire world to be collected at a much lower cost than traditional surveys.

Users expressing their preferences in LBSNs.

Our goal is to propose a new methodology for identifying cultural boundaries and similarities across populations using data collected from LBSNs. Since we know that food and drink habits are able to describe strong differences among people, we use Foursquare check-ins in such locations to represent user preferences for specific types of food and drink. We studied how these preferences change according to time of day and geographical locations. We have found that:

  • The eating and drinking choices in different countries, cities, or neighborhoods of a city reveal fascinating insights into differing habits of human beings. For instance, preferences among people in cities located in the same country tend to be very similar;
  • The time instants at which check-ins are performed in food and drink places also provide valuable insights into the cultural aspects of a particular region. For example, whereas Americans and English people tend to have their main meal at dinner time, Brazilians have it at lunch time.

Given those observations, we consider spatio-temporal dimensions of food and drink check-ins as users’ cultural preferences. We then apply a simple clustering technique to show the “cultural distance” between countries, cities or even regions within a city. We found that:

  • Our results often strongly agree with common knowledge;

  • Comparing our results with the World Values Surveys (a very large study based on many years of survey data), the similarities are striking.
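The pipeline described above can be sketched roughly as follows: each region is represented by a vector of check-in fractions per (food/drink category, time slot) bucket, and regions with similar vectors end up in the same cluster. The feature values below are invented for illustration, and plain k-means stands in for the paper's "simple clustering technique":

```python
import math
import random

# Hypothetical feature vectors: fraction of check-ins per
# (category, time-of-day) bucket, e.g. (fast food @ lunch, ...).
regions = {
    "New York":  [0.10, 0.40, 0.30, 0.20],
    "London":    [0.12, 0.38, 0.32, 0.18],
    "Sao Paulo": [0.35, 0.15, 0.10, 0.40],
    "Rio":       [0.33, 0.17, 0.12, 0.38],
}

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: distance(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(dim) / len(members)
                              for dim in zip(*members)]
    return labels

names = list(regions)
labels = kmeans(list(regions.values()), k=2)
clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)
print(sorted(clusters.values()))
# → [['New York', 'London'], ['Sao Paulo', 'Rio']]
```

With clearly separated habit vectors, the anglophone cities and the Brazilian cities fall into distinct clusters, mirroring the "cultural distance" idea.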

Clustering cities.

Yet, unlike traditional survey-based empirical studies such as the aforementioned one, our methodology identifies cultural dynamics much faster, capturing current cultural expressions in near real time and at a much lower cost.

For more, see our full paper, You are What you Eat (and Drink): Identifying Cultural Boundaries by Analyzing Food & Drink Habits in Foursquare.

Thiago H Silva, Universidade Federal de Minas Gerais, Brazil
Pedro O S Vaz de Melo, Universidade Federal de Minas Gerais, Brazil
Jussara M Almeida, Universidade Federal de Minas Gerais, Brazil
Mirco Musolesi, University of Birmingham, UK
Antonio A F Loureiro, Universidade Federal de Minas Gerais, Brazil

For On-Demand Workers, It’s All About the Story

From mystery shopping to furniture assembly, apps such as TaskRabbit and Gigwalk leverage the power of distributed, mobile workers who complete physical world tasks instantly and beyond the constraints of traditional office workspaces. We refer to these workers as the “on-demand mobile workforce.” Mobile workforce services allow task requesters to “crowdsource” tasks in the physical world and aim to disrupt the very nature of employment and work (for good and bad; this may be a matter for another post).

Our paper describes an on-demand workforce service categorization based on two dimensions: (1) task location and (2) task complexity (see figure below). Based on marketplace reviews, user testimonies, and informal observations of the services, we placed four main workforce services into the quadrants to exemplify the categorization.

Categorization of on-demand workforce services.

Although there is a long line of research on incentives and motivations for crowdsourcing, especially on platforms like Amazon's Mechanical Turk, there hasn't been much work on physical crowdsourcing, despite the recent appearance of many such platforms. We conducted interviews of mobile workforce members (see the paper for the complete methods and findings) to learn more about the extrinsic and intrinsic factors that influence the selection and completion of physical world tasks.

To mention a couple of findings, we found certain task characteristics were highly important to workers as they select and accept tasks:

Knowing the person
Because physical world tasks introduce a different set of personal risks than virtual world tasks (e.g., physical harm, deception), workers creatively investigated requesters and scrutinized profile photos, email addresses, and task descriptions. Profile photos helped workers know whom to expect on-site, and email addresses were used to cross-reference information on social networking sites.

Knowing the “story”
Tasks that listed intended purposes or background stories of the tasks appealed to the mobile workforce. Tasks for an anniversary surprise or to verify the conditions of a grave plot through a photo affected workers’ opinions and influenced future task selections. Workers also appreciated non-financial incentives of unique experiences that occurred as byproducts of task completion (e.g., meeting new people). Tasks with questionable, unethical intentions (e.g., mailing in old phones, posting fake reviews online, writing student papers) were less likely to be fulfilled.

Generally, this study has broader implications for the design of effective, practical, novel, and well-reasoned social and technical crowdsourcing applications that organize help and support in the physical world. In particular, we hope our findings inform the future development of mobile workforce services whose incentives are not strictly monetary.

Want to learn more? Check out our full paper here at CSCW 2014.

Rannie Teodoro
Pinar Ozturk
Mor Naaman
Winter Mason
Janne Lindqvist

Remote Shopping Advice: Crowdsourcing In-Store Purchase Decisions

Recent Pew reports, as well as our own survey, have found that consumers shopping in brick-and-mortar stores increasingly use their mobile phones to contact others while they shop. The growing capabilities of smartphones, combined with the emergence of powerful social platforms like social networking sites and crowd labor marketplaces, offer new opportunities for turning solitary in-store shopping into a rich social experience. We conducted a study to explore the potential of friendsourcing and paid crowdsourcing to enhance in-store shopping. Participants selected and tried on three outfits at a Seattle-area Eddie Bauer store; we created a single, composite image showing the three potential purchases side by side. Participants then posted the image to Facebook, asking their friends for feedback on which outfit to purchase. We also posted the image to Amazon's Mechanical Turk service and asked up to 20 US-based Turkers to identify their favorite outfit, explain their choice, and provide basic demographic information (gender, age).

Study participants posted composite photos showing their three purchase possibilities; these photos were then posted to Facebook and Mechanical Turk to crowdsource the shopping decision.

Although none of our participants had used paid crowdsourcing before, and all were doubtful it would be useful when we described our plan at the start of the study session, the shopping feedback provided by paid crowd workers turned out to be surprisingly compelling to participants, more so than the friendsourced feedback from Facebook. This was in part because the crowd workers were more honest, explaining not only what looked good, but also what looked bad, and why. Participants also enjoyed seeing how opinions varied across demographic groups (e.g., did male raters prefer a different outfit than female raters?).

Although Mechanical Turk had a speed advantage over Facebook, both sources generally provided multiple responses within a few minutes – fast enough that a shopper could get real-time decision-support information from the crowd while still in the store.

Our CSCW 2014 paper on “Remote Shopping Advice” describes our study in more detail, as well as how our findings can be applied toward designing next-generation social shopping experiences.

For more, see our full paper, Remote Shopping Advice: Enhancing In-Store Shopping with Social Technologies.

Meredith Ringel Morris, Microsoft Research
Kori Inkpen, Microsoft Research
Gina Venolia, Microsoft Research

Voyant: Generating Structured Feedback on Visual Designs Using a Crowd of Non-Experts

Crowdsourcing offers an emerging opportunity for users to receive rapid feedback on their designs. A critical challenge for generating feedback via crowdsourcing is to identify what type of feedback is desirable to the user, yet can be generated by non-experts. We created Voyant, a system that leverages a non-expert crowd to generate perception-oriented feedback from a selected audience as part of the design workflow.

The system generates five types of feedback: (i) Elements are the individual elements that can be seen in a design. (ii) First Notice refers to the visual order in which elements are first noticed in the design. (iii) Impressions are the perceptions formed in one’s mind upon first viewing the design. (iv) Goals refer to how well the design is perceived to meet its communicative goals. (v) Guidelines refer to how well the design is perceived to meet known guidelines in the domain.

Voyant decomposes feedback generation into a description phase and an interpretation phase, inspired by how critique is taught in design education. In each phase, the tasks focus a worker's attention on specific aspects of a design rather than soliciting holistic evaluations, which improves outcomes. The system submits these tasks to an online labor market (Amazon Mechanical Turk). Each type of feedback typically takes a few hours to generate and costs a few US dollars.
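In outline, the two-phase decomposition might be coded like this. The task prompts and the `FeedbackJob` structure are our own illustration of the idea, not Voyant's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackJob:
    """A design image plus the crowd outputs of each phase."""
    image_url: str
    descriptions: list = field(default_factory=list)    # phase 1 output
    interpretations: list = field(default_factory=list)  # phase 2 output

def description_tasks(job):
    """Phase 1: focus workers on *what is there* (elements, notice order)."""
    return [
        {"image": job.image_url,
         "prompt": "List the individual elements you can see in this design."},
        {"image": job.image_url,
         "prompt": "In what order did you first notice the elements?"},
    ]

def interpretation_tasks(job):
    """Phase 2: focus workers on *what it means*, grounded in phase 1 output."""
    elements = ", ".join(job.descriptions) or "the design"
    return [
        {"image": job.image_url,
         "prompt": f"What impression do {elements} give you at first view?"},
        {"image": job.image_url,
         "prompt": "How well does the design meet its stated goal?"},
    ]

job = FeedbackJob("poster.png")
job.descriptions = ["title text", "background figure"]  # collected from phase 1
phase2 = interpretation_tasks(job)
print(len(description_tasks(job)), len(phase2))  # → 2 2
```

Splitting the work this way keeps each microtask narrow enough for a non-expert while still building toward structured, critique-style feedback.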

Our evaluation shows that users were able to leverage the feedback generated by Voyant to develop insight and discover previously unknown problems with their designs. For example, consider the Impressions feedback generated by Voyant on a user's poster (see the video above): the user intended it to be perceived as Shakespeare, but was surprised to learn of an unintended interpretation (see “dog” in the word cloud).

To use Voyant, the user imports a design image and configures the crowd demographics. Once generated, the feedback can be utilized to help iterate toward an effective solution.

Try it:


For more, see our full paper, Voyant: Generating Structured Feedback on Visual Designs Using a Crowd of Non-Experts.
Anbang Xu, University of Illinois at Urbana-Champaign
Shih-Wen Huang, University of Illinois at Urbana-Champaign
Brian P. Bailey, University of Illinois at Urbana-Champaign

CrowdCamp Report: HelloCrowd, The “Hello World!” of human computation

The first program a new computer programmer writes in any new programming language is the “Hello world!” program – a single line of code that prints “Hello world!” to the screen.

We ask, by analogy: what should be the first “program” a new user of crowdsourcing or human computation writes? “HelloCrowd!” is our answer.

Hello World task
The simplest possible “human computation program”

Crowdsourcing and human computation are becoming ever more popular tools for answering questions, collecting data, and providing human judgment.  At the same time, there is a disconnect between interest and ability, where potential new users of these powerful tools don’t know how to get started.  Not everyone wants to take a graduate course in crowdsourcing just to get their first results. To fix this, we set out to build an interactive tutorial that could teach the fundamentals of crowdsourcing.

After creating an account, HelloCrowd tutorial users get their feet wet by posting three simple tasks to the crowd platform of their choice. In addition to the “Hello, World” task above, we chose two common crowdsourcing tasks: image labeling and information retrieval from the web. In the first task, workers provide a label for an image of a fruit; in the second, workers must find the phone number of a restaurant. These tasks can be reused and posted to any crowd platform you like; we provide simple instructions for some common platforms. The interactive tutorial auto-generates the task URLs for each tutorial user and each platform.

Mmm, crowdsourcing is delicious

More than just another tutorial on “how to post tasks to MTurk,” our goal with HelloCrowd is to teach fundamental concepts. After posting tasks, new crowdsourcers will learn how to interpret their results (and get even better results next time). For example, what concepts might a new crowdsourcer learn from the results of the “hello world” task or the business phone number task? Phone numbers are simple, right? What about “867-5309” vs. “555.867.5309” vs. “+1 (555) 867 5309”? Our goal is to get new users of these tools up to speed on how to get good results: form validation (or not), redundancy, task instructions, etc.
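The phone-number question above is exactly where answer normalization earns its keep: the three formats are the same number, so either the task form must constrain input or the requester must normalize afterward. A minimal sketch (the normalization rules are our own, not part of HelloCrowd):

```python
import re

def normalize_us_phone(raw):
    """Reduce a US phone answer to bare digits, dropping a leading country code."""
    digits = re.sub(r"\D", "", raw)       # strip punctuation, spaces, letters
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]               # drop the +1 country code
    return digits

answers = ["867-5309", "555.867.5309", "+1 (555) 867 5309"]
print([normalize_us_phone(a) for a in answers])
# → ['8675309', '5558675309', '5558675309']
```

After normalization, the last two answers agree exactly, so simple redundancy (majority vote over normalized answers) works; without it, three honest workers can look like three disagreeing ones.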

In addition to teaching new crowdsourcers how to crowdsource, our tutorial system will be collecting a longitudinal, cross-platform dataset of crowd responses.  Each person who completes the tutorial will have “their” set of worker responses to the standard tasks, and these are all added together into a public dataset that will be available for future research on timing, speed, accuracy and cost.

We’re very proud of HelloCrowd, and hope you’ll consider giving our tutorial a try.

Christian M. Adriano, Donald Bren School, University of California, Irvine
Anand Kulkarni, MobileWorks
Andy Schriner, University of Cincinnati
Paul Zachary, Department of Political Science, University of California, San Diego

Can we achieve reliable inference using unreliable crowd workers?

Let us assume a set of N crowd workers is given the task of classifying a given dog image into one of M possible breeds. Since workers may not be canine experts, they may not be able to perform the classification directly, so we should ask simpler questions. Two basic properties of crowd workers degrade the performance of crowdsourcing systems:

  • Lack of domain expertise (which may necessitate asking binary questions rather than asking for fine classification), and
  • Unreliability (which may necessitate intelligently deployed redundancy)

Both problems can be handled using error-correcting codes. With code matrices, we can design binary questions for crowd workers that allow the task manager to reliably infer the correct classification even with unreliable workers.


The performance of a classification task depends heavily on the design of these simple binary questions. The question design problem is equivalent to designing an M x N binary code matrix A = {a_li}. The rows correspond to the different classes, while a column a_i corresponds to the question posed to the i-th worker. As an example, consider the task of classifying a dog image into one of four breeds: Pekingese, Mastiff, Maltese, or Saluki. The binary question of whether a dog has a snub nose or a long nose differentiates {Pekingese, Mastiff} from {Maltese, Saluki}, whereas the binary question of whether the dog is small or large differentiates {Pekingese, Maltese} from {Mastiff, Saluki}.


An illustrative example for the dog breed classification task is shown in the figure above. Let the columns corresponding to the i-th and j-th workers be a_i = [1 0 1 0]' and a_j = [1 1 0 0]', respectively. The i-th worker is asked, “Is the dog small or large?”, since she is to differentiate the first (Pekingese) and third (Maltese) breeds from the others. The j-th worker is asked, “Does the dog have a snub nose or a long nose?”, since she is to differentiate the first two breeds (Pekingese, Mastiff) from the others. These questions are derived from the code matrix using a taxonomy of dog breeds. The task manager makes the final classification decision by choosing the hypothesis corresponding to the code word (row) that is closest in Hamming distance to the received vector of worker decisions. A good code matrix can be designed using simulated annealing or optimization based on cyclic column replacement.
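The minimum-Hamming-distance decoding step can be sketched concretely. The 4x5 code matrix below is a made-up example for the four breeds, not one from the paper; the point is only that decoding tolerates a flipped answer from an unreliable worker:

```python
def hamming(u, v):
    """Number of positions where two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def decode(code_matrix, answers, classes):
    """Pick the class whose code word (row) is closest in Hamming
    distance to the workers' vector of binary answers."""
    best = min(range(len(code_matrix)),
               key=lambda r: hamming(code_matrix[r], answers))
    return classes[best]

classes = ["Pekingese", "Mastiff", "Maltese", "Saluki"]
# Rows = classes, columns = workers; each column defines one binary question.
A = [
    [1, 1, 0, 1, 0],  # Pekingese
    [0, 1, 1, 0, 1],  # Mastiff
    [1, 0, 1, 1, 1],  # Maltese
    [0, 0, 0, 0, 0],  # Saluki
]
# The workers answered [1, 1, 0, 0, 0]: the 4th worker flipped a bit
# relative to the Pekingese code word, yet decoding still recovers it.
print(decode(A, [1, 1, 0, 0, 0], classes))  # → Pekingese
```

This is the coding-theoretic payoff: as long as fewer bits are flipped than half the minimum distance between code words, the task manager still infers the correct class.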

To evaluate the performance of this scheme, each worker's reliability can be modeled as a random variable, using the spammer-hammer model or the Beta model. The average probability of misclassification can then be derived as a function of the mean (μ) of the workers' reliability, and the proposed scheme's performance compared with the traditional voting-based scheme. The results can be summarized as follows:

  • Crowd Ordering: Better crowds yield better performance in terms of average error probability
  • Coding is better than majority vote: Good codes perform better than majority vote as they diversify the binary questions and use human cognitive energy more efficiently
  • The gap in performance generally increases with system size

For more, see our ICASSP 2013 paper, Reliable Classification by Unreliable Crowds.
Aditya Vempaty, Syracuse University
Lav R. Varshney, IBM Thomas J. Watson Research Center
Pramod K. Varshney, Syracuse University