Communicating Context to the Crowd


Crowdsourcing has traditionally consisted of short, independent microtasks that require no background. The advantage of this strategy is that work can be decoupled and assigned to independent workers. But this strategy struggles to support increasingly complex tasks, such as writing or programming, that are not independent of their context.

For instance, imagine that you ask a crowd worker to write a biography for a speaker you’ve invited to your workshop. After the work is completed, you realize that the biography is written in an informal, personal tone. This is not technically wrong; it’s just not what you had in mind. You could have added a line to your task description asking for a formal, academic tone. However, there are countless nuances to a writing task that can’t all be predicted beforehand. This is what we are interested in: the context of a task, meaning the collection of conditions and tacit information surrounding it (e.g., the fact that the biography is needed for an academic workshop).



OUR APPROACH IS TO ITERATE: do some work, communicate with the requester, and edit to fix errors. How can we support communication between the requester and crowd workers to maximize benefits while minimizing costs? If achieved, this goal would create the conditions for crowd work that is more complex and integrated than currently possible.

The main takeaway is to support this communication through structured microtasks. We have designed five different mechanisms for structured communication.


We compare these methods in two studies, the first measuring the benefit of each mechanism, and the second measuring the costs to the requester (e.g. cognitive demand). We found that these mechanisms are most effective when writing is in the early phases. For text that is already high quality, the mechanisms become less effective and can even be counter-productive.


We also found that the mechanisms had varying benefits depending on the quality of the initial text. Early on, when content quality is poor, the requester needs to communicate major issues. Therefore identifying the “main problem” was most effective at improving writing. Later, for average quality content, the different mechanisms have relatively similar added value.

Finally, we found that the cost of a mechanism for the requester is not always correlated with the value that it adds. For instance, for average quality paragraphs, commenting/editing was very costly but did not provide more value than simply highlighting.


For more, see our full paper, Communicating Context to the Crowd for Complex Writing Tasks.
Niloufar Salehi, Stanford University
Jaime Teevan, Microsoft Research
Shamsi Iqbal, Microsoft Research
Ece Kamar, Microsoft Research

Who does make a topic trending on Twitter?

Users on social media sites like Twitter are increasingly relying on crowdsourced recommendations called Trending Topics to find important events and breaking news stories. Topics (mostly keywords, e.g., hashtags) are recommended as trending when they exhibit a sharp spike in their popularity, i.e., their usages by the crowds suddenly jump at a particular time.

While prior works have attempted to classify and predict Twitter trending topics, in this work, we ask a different question — who are the users who make different topics worthy of being recommended as trending?

Specifically, we analyse the demographics of the crowds promoting different trending topics on the Twitter social media. By promoters of a topic, we refer to the users who posted on the topic before it became trending, thereby contributing to the topic’s selection as a trend.

We gathered extensive data from Twitter from July to September, 2016, including millions of users posting on thousands of topics, both before and after the topics became trending. We inferred three demographic attributes for these Twitter users — their gender, race (Asian / Black / White), and age — from their profile photos.

Looking at the demographics of the promoters reveals interesting patterns. For instance, here are the gender and racial demographics of the promoters of some of the Twitter trends on 3rd May 2017:

  • #wikileaks: 24% women, and 76% men
  • #wednesdayWisdom: 52% women, and 48% men
  • #comey: 9% Asian, 12% Black, and 79% White
  • #BlackWomenAtWork: 15% Asian, 52% Black, and 33% White

It is evident that different trends are promoted by widely different demographic groups.
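One way to put a number on "widely different" is to measure how far each promoter group sits from a reference population. The sketch below uses the gender shares listed above and total variation distance. Both the distance measure and the 50/50 baseline are assumptions made for illustration; the post does not state Twitter's overall gender split or which measure the paper uses.

```python
# Promoter gender shares reported above for two trends (as fractions of 1).
promoters = {
    "#wikileaks": {"women": 0.24, "men": 0.76},
    "#wednesdayWisdom": {"women": 0.52, "men": 0.48},
}

# HYPOTHETICAL baseline: the post does not report Twitter's overall gender split.
baseline = {"women": 0.50, "men": 0.50}


def total_variation(p, q):
    """Half the sum of absolute differences between two distributions."""
    groups = set(p) | set(q)
    return 0.5 * sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in groups)


divergence = {topic: total_variation(dist, baseline) for topic, dist in promoters.items()}

# List topics from most to least divergent from the baseline.
for topic, value in sorted(divergence.items(), key=lambda kv: -kv[1]):
    print(f"{topic}: {value:.2f}")
```

Under this assumed baseline, #wikileaks comes out far more skewed than #wednesdayWisdom, which matches the qualitative reading of the percentages.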


Our analysis led to the following insights:

  • A large fraction of trending topics are promoted by crowds whose demographics are significantly different from Twitter’s overall user population.
  • We find clear evidence of under-representation of certain demographic groups among the promoters of trending topics, with middle-aged Black women being the most under-represented group.
  • Once a topic becomes trending, it is adopted (i.e., posted) by users whose demographics are less divergent from the overall Twitter population, compared to the users who were promoting the topic before it became trending.
  • Topics promoted predominantly by a single demographic group tend to be of niche interest to that particular group.
  • During events of wider interest (e.g., national elections, police shootings), the topics promoted by different demographic groups tend to reflect their diverse perspectives, which could help understand the different facets of public opinion.

Try out our Twitter app to check demographics of the crowds promoting various trends.

For details, see our full paper, Who Makes Trends? Understanding Demographic Biases in Crowdsourced Recommendations, at ICWSM 2017.

Abhijnan Chakraborty, IIT Kharagpur, India and MPI-SWS, Germany
Johnnatan Messias, Federal University of Minas Gerais, Brazil
Fabricio Benevenuto, Federal University of Minas Gerais, Brazil
Saptarshi Ghosh, IIT Kharagpur, India
Niloy Ganguly, IIT Kharagpur, India
Krishna P. Gummadi, MPI-SWS, Germany

Don’t Bother Me. I’m Socializing!

IT DOES BOTHER US when we see our friends checking their smartphones while having a conversation with us. Although people want to focus on a conversation, it is hard to ignore a series of notification alarms coming from their smartphones. It is reported that smartphone users receive an average of tens to hundreds of push notifications a day [1,2]. Despite its usefulness in immediate delivery of information, an untimely smartphone notification is considered a source of distraction and annoyance during social interactions.

(Left) Notifications interrupt an ongoing social interaction. (Right) Notifications are deferred to a breakpoint, in-between two activities, so that people are less interrupted by notifications.


TO ADDRESS THIS PROBLEM, we have proposed a novel notification management scheme in which the smartphone defers notifications until an opportune moment during social interactions: a breakpoint. Breakpoint [3] is a term originating in psychology that describes a unit of time between two adjacent actions. The intuition is that there exist breakpoints at which notifications interrupt a social interaction minimally, if at all.

A screenshot of the video survey. Participants are asked to respond whether this moment is appropriate to receive a notification.

TO DISCOVER SUCH BREAKPOINTS, we devised a video survey in which participants watch a typical social interaction scenario and indicate whether prompted moments in the video are appropriate for receiving smartphone notifications. Participants identified the following four types of breakpoints as appropriate in a social interaction: (1) a long silence, (2) a user leaving the table, (3) others using smartphones, and (4) a user left alone.

Types of social context detected by SCAN.

BASED ON THE INSIGHTS FROM THE VIDEO SURVEY, we designed and implemented a Social Context-Aware smartphone Notification system, SCAN, that defers smartphone notifications until a breakpoint. SCAN is a mobile application that detects social context using only built-in sensors. It also works collaboratively with the rest of the group members’ smartphones to sense collocated members, conversation, and others’ smartphone use. SCAN then classifies a breakpoint based on the social context and decides whether to deliver or defer notifications.
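The defer-then-deliver behaviour described above can be sketched as a small queue. This is an illustration only, not SCAN's implementation: the class name, the method names, and the context labels are hypothetical, and the sensing that would produce those labels is left out entirely.

```python
from collections import deque

# The four breakpoint types reported by the video survey (label names are hypothetical).
BREAKPOINTS = {"long_silence", "user_left_table", "others_using_phone", "user_alone"}


class DeferredNotifier:
    """Hold notifications during a social interaction; release them at a breakpoint."""

    def __init__(self):
        self.pending = deque()

    def on_notification(self, message, in_social_interaction):
        # Outside a social interaction there is nothing to protect: deliver now.
        if not in_social_interaction:
            return [message]
        self.pending.append(message)
        return []

    def on_context(self, context):
        # `context` stands in for the label a sensing layer would produce.
        if context not in BREAKPOINTS:
            return []
        delivered = list(self.pending)
        self.pending.clear()
        return delivered


notifier = DeferredNotifier()
notifier.on_notification("msg A", in_social_interaction=True)
notifier.on_notification("msg B", in_social_interaction=True)
held = notifier.on_context("talking")           # not a breakpoint: nothing delivered
released = notifier.on_context("long_silence")  # breakpoint: queue is flushed in order
```

Keeping the queue separate from the context classifier means the delivery policy can be tested without any sensor input.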

SCAN HAS BEEN EVALUATED with ten groups of friends in a controlled setting. SCAN detects the four target breakpoint types with high accuracy (precision = 92.0%, recall = 82.5%). Most participants appreciated the value of deferred notifications and found the selected breakpoints appropriate. Overall, we demonstrated that breakpoint-based smartphone notification management is a promising approach to reducing interruptions during social interactions.

WE ARE CURRENTLY EXTENDING SCAN to apply it to various types of social interactions. We also aim to add personalized notification management and to address technical challenges such as system robustness and energy efficiency. Our ultimate goal is to release SCAN as an Android application in Google Play Store and help users to be less distracted by smartphone notifications during social interactions.

You can check out our CSCW 2017 paper to read about this work in more detail.  

“Don’t Bother Me. I’m Socializing!: A Breakpoint-Based Smartphone Notification System”. Proceedings of CSCW 2017. Chunjong Park, Junsung Lim, Juho Kim, Sung-Ju Lee, and Dongman Lee (KAIST)

[1] “An In-situ Study of Mobile Phone Notifications”. Proceedings of MobileHCI 2014. Martin Pielot, Karen Church, and Rodrigo de Oliveira.
[2] “Hooked on Smartphones: An Exploratory Study on Smartphone Overuse Among College Students”. Proceedings of CHI 2014. Uichin Lee, Joonwon Lee, Minsam Ko, Changhun Lee, Yuhwan Kim, Subin Yang, Koji Yatani, Gahgene Gweon, Kyong-Mee Chung, and Junehwa Song.
[3] “The perceptual organization of ongoing behavior”. Journal of Experimental Social Psychology 12, 5 (1976), 436–450. Darren Newtson and Gretchen Engquist.

Subcontracting Microwork

Mainstream crowdwork platforms treat microtasks as indivisible units; however, in our upcoming CHI 2017 paper, we propose that there is value in re-examining this assumption. We argue that crowdwork platforms can improve their value proposition for all stakeholders by supporting subcontracting within microtasks.

We define three models for microtask subcontracting: real-time assistance, task management, and task improvement:

  • Real-time assistance encompasses a model of subcontracting in which the primary worker engages one or more secondary workers to provide real-time advice, assistance, or support during a task.
  • Task management subcontracting applies to situations in which a primary worker takes on a meta-work role for a complex task, delegating components to secondary workers and taking responsibility for integrating and/or approving the products of the secondary workers’ labor.
  • Task improvement subcontracting entails allowing a primary worker to edit task structure, including clarifying instructions, fixing user interface components, changing the task workflow, and adding, removing, or merging sub-tasks.

Subcontracting of microwork fundamentally alters many of the assumptions currently underlying crowd work platforms, such as economic incentive models and the efficacy of some prevailing workflows. However, subcontracting also legitimizes and codifies some existing informal practices that currently take place off-platform. In our paper, we identify five key issues crucial to creating a successful subcontracting structure, and reflect on design alternatives for each: incentive models, reputation models, transparency, quality control, and ethical considerations.

To learn more about worker motivations for engaging with subcontracting workflows, we conducted some experimental HITs on mTurk. In one, workers had the choice of whether to complete a complex, three-part task, or to choose to subcontract portions to other (hypothetical) workers (and give up some of the associated pay); we then asked these workers why they did or did not choose to subcontract each task component. Money, skills, and interests all factored into these decisions in complex ways.

Implementing and exploring the parameter space of the subcontracting concepts we propose is a key area for future research. Building platforms that support subcontracting workflows in an intentional manner will enable the crowdwork research community to evaluate the efficacy of these choices and further refine this concept. We particularly stress the importance of the ethical considerations component, as our intent in introducing and formalizing concepts related to subcontracting microwork is to facilitate more inclusive, satisfying, efficient, and high-quality work, rather than to facilitate extreme task decomposition strategies that may result in deskilling or untenable wages.

You can download our CHI 2017 paper to read about subcontracting in more detail.  (Fun fact — the idea for this paper began at the CrowdCamp Workshop at HCOMP 2015 in San Diego; Hooray for CrowdCamp!)

Subcontracting Microwork. Proceedings of CHI 2017. Meredith Ringel Morris (Microsoft Research), Jeffrey P. Bigham (Carnegie Mellon University), Robin Brewer (Northwestern University), Jonathan Bragg (University of Washington), Anand Kulkarni (UC Berkeley), Jessie Li (Carnegie Mellon University), and Saiph Savage (West Virginia University).

Spare5’s Tips for Sourcing Better Training Data

Mere minutes after our awesome advisor, Dan Weld, mentioned The 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), we were all-in. It’s rare to scroll through an event program and realize that each and every session is going to be so relevant and useful to your work, but that’s exactly how we were all feeling with this event’s agenda. And it did not disappoint!

We returned to Seattle from Austin newly excited and energized to enable folks to earn spare change in their spare time in a fun, engaging way, while providing practitioners with custom, quality, accurate machine learning and AI training data.

Our decision to sponsor HCOMP required very little human computation, and we were thrilled to give a keynote talk on our tips for sourcing better training data. We’ve created an online version of our presentation deck for your reference; hope it’s helpful.

As a brief review, we recommend:

  • great UI & UX for annotators
  • interactive workflow design on mobile & web
  • known, trained, qualified annotators
  • real-time QA & annotator management
  • algorithmic task distribution & quality scoring

Details in the deck.

If you’d like to learn more about these ideas or have something to add, please give us a shout. We’re also particularly interested in the topic of bias in training data, so if this is a concern of yours as well, get in touch and let’s study it together (we’ll bring the data!).

Finally, as we noted in our talk, we’re hiring! We’re growing our data science team and looking for computer vision experts specifically. Check out our openings if you’re looking for your next great opportunity.

A big thanks to everyone at HCOMP. We had a great time and look forward to continuing the many discussions we started there.

Until next year!

— Spare5

Report: GroupSight Workshop at HCOMP 2016 – Human Computation for Image and Video Analysis

The GroupSight workshop hit a surprisingly resonant chord with researchers at the intersection of human computation and computer vision in its first year at HCOMP 2016. My co-organizers, Danna Gurari (UT Austin) and Steve Branson (Caltech), and I sought to bring together people from widely different areas of computer vision and computational photography to explore how CV researchers are using the crowd. Attendees included researchers and students curious about this important, emerging field at the intersection of crowdsourced human computation and image/video analysis. About 30 attendees were treated to an unexpected diversity of approaches to crowd computation from some of the most exciting researchers in CV, including talks from:

– Kristen Grauman (UT Austin) : Active and Interactive Image and Video Segmentation

– Kavita Bala (Cornell/GrokStyle) : Crowdsourcing for Material Recognition in the Wild

– Ariel Shamir (Interdisciplinary Center Israel) : Passive Human Computation

– Kotaro Hara (U Maryland, College Park) : Using Crowdsourcing, Computer Vision, and Google Street View to Collect Sidewalk Accessibility Data

– Brendan McCord (Evolv Technology) : AI + IQ: Building Best of Breed Security Systems

We had short talks from 6 students on topics ranging from geolocation to medical imaging to clustering and summarization, as well as encore-track poster presentations from HCOMP and ECCV. (Their papers can be found here.)

Best paper winner Shay Sheinfeld. Best paper runner-up Mehrnoosh Sameki.

Best paper winner Shay Sheinfeld presented work with Yotam Gingold and Ariel Shamir demonstrating a truly inventive use of the crowd for Video Summarization using Crowdsourced Causality Graphs. Attendees marveled at state-of-the-art video summarization that made it seem like our future AI video editors have finally arrived. Best paper runner-up Mehrnoosh Sameki delighted us with surreal medical videos of cells splitting and joining, all the while maintaining a near-perfect segmentation contour achieved by an interactive pipeline.

Our industrial sponsor Evolv Technology hosted a cozy lunch where students and senior researchers were able to discuss how to advance novel research in this nascent area of scientific exploration.

The aim of this workshop was to promote greater interaction between the diversity of researchers and practitioners who examine how to mix human and computer efforts to convert visual data into discoveries and innovations that benefit society at large. We succeeded in fostering an in-depth discussion of technical and application issues for how to engage humans with computers to optimize cost/quality trade-offs. My big take-away was that if you are doing CV or Graphics research, there is undoubtedly a cool way to exploit human intelligence in your pipeline for unexpected and remarkable outcomes. We look forward to future iterations of GroupSight at HCOMP and possibly ICCV or a future CVPR. If you are interested in participating in a future GroupSight, please don’t hesitate to contact one of us!

CrowdCamp Report: Gathering Causality Labels

Correlation does not imply causation. This phrase gets thrown around by scientists, statisticians, and laypeople all the time. It means that you shouldn’t use data about two things to infer that one thing causes the other, at least not without making a lot of limiting assumptions. But it is difficult to imagine ignoring causal inference when it seems to be such a key ingredient of intelligent decision-making. Machine learning approaches exist for using data to estimate causal structure, but we think it’s interesting that humans seem to judge causality without even looking at data. So, the goal of our CrowdCamp project was to gather some such judgements from real people.

To start, we made a list of variable names for which we hypothesized that humans might have opinions about causal relationships without ever (or at least without recently) having looked at the related data. Some of these variables include:

Real Daily Wages, Oil Prices, Internet Traffic, Residential Gas Usage, Power Consumption, Precipitation, Water Usage, Traffic Fatalities, Passenger Miles Flown in Aircraft, Auto Registration, Bus Ridership, Copper Prices, Wheat Harvest, Private Housing Units Started, Power Plant Expenditures, Price of Chicken, Sales of Shampoo, Beer Shipments, Percent of Men with Full Beards, Pigs Slaughtered, Cases of Measles, Thickness of Ozone Layer, etc.

Using Amazon Mechanical Turk (AMT), we presented workers with sets of ten randomly chosen pairs of variables, and we asked them to choose the most fitting causal relationship between variable A and variable B from these four choices:

  • A causes B
  • B causes A
  • Other variable Z causes A and B
  • No causal relationship

Workers were advised “it’s possible that A and B may be related in several of the above ways. If you feel this is the case, choose the one that you believe is the strongest relationship.”

Example variable pair presented to crowd

We collected 10 judgements from each of 50 workers, for a total of 500 judgements on pairs drawn from 42 variables. When workers chose the option of a third variable causing both presented variables, we asked them to name the third variable (though we didn’t force them to). Of the 500 judgements, 74 were A->B, 85 were B->A, 34 were Z->A&B, and 307 were no causality. The most common one-directional causality judgements were:

1. Church Attendance -> Internet Traffic
2. Alcohol Demand -> Public Drunkenness
3. Federal Reserve Interest Rate -> Price of Chicken
4. Bus Ridership -> Oil Prices
5. Alcohol Demand -> Number of Forest Fires
6. Public Drunkenness -> Armed Robberies
7. Power Consumption -> Birth Rate
8. Church Attendance -> Armed Robberies
9. Bus Ridership -> Birth Rate
10. Price of Chicken -> Total Rainfall

Many of these are not surprising. Of course interest rates affect prices and alcohol consumption affects drunkenness. Others not so much… why would chicken prices affect rainfall? Also, we realize we only asked about the strength of the causal relationship, not the sign. So we have no way of knowing whether the workers believe going to church causes an increase or a decrease in armed robberies.
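To put the collected labels in proportion, the counts reported above can be tallied in a few lines. This is a quick sketch using only the four totals given in this post; the short label strings are our own shorthand.

```python
from collections import Counter

# Counts reported in this post for the 500 collected judgements.
counts = Counter({"A->B": 74, "B->A": 85, "Z->A&B": 34, "none": 307})

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Share of judgements that asserted any causal link at all.
causal_share = 1 - shares["none"]

for label, n in counts.most_common():
    print(f"{label:7s} {n:4d}  {shares[label]:.1%}")
print(f"any causal link: {causal_share:.1%}")
```

Roughly three in five judgements were "no causal relationship", which is expected when pairs are drawn at random from an unrelated set of variables.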

We also collected some interesting answers for the optional third variable Z causing both A and B. Most of the time it was some big general factor like population, economic conditions, geographical area, or fuel prices. There were some creative ones too:

A: Deaths from Homicides
B: Beer Shipments
Z: Thieves trying to intercept and steal beer shipments

So we collected all these judgements, now what do we do with them? As for machine learning applications, we see three options:

  1. Use as training/testing labels for causal inference techniques.
  2. See how well they serve for building informative priors to regularize regression problems.
  3. Use them to guide structure learning in probabilistic graphical models.
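As a minimal sketch of how the second and third options might start, the votes for one variable pair can be turned into a smoothed probability over the four labels, which could then serve as a prior. The add-one smoothing and the function name are our assumptions, and the votes shown are hypothetical, since per-pair votes are not listed in this post.

```python
from collections import Counter

LABELS = ("A->B", "B->A", "Z->A&B", "none")


def edge_prior(votes, pseudo_count=1.0):
    """Turn crowd votes for one variable pair into smoothed label probabilities."""
    tally = Counter(votes)
    # Additive smoothing keeps unseen labels at a small non-zero probability.
    denom = len(votes) + pseudo_count * len(LABELS)
    return {label: (tally[label] + pseudo_count) / denom for label in LABELS}


# HYPOTHETICAL votes for a single pair of variables.
votes = ["A->B", "A->B", "A->B", "none", "Z->A&B"]
prior = edge_prior(votes)

for label in LABELS:
    print(f"{label:7s} {prior[label]:.3f}")
```

The pseudo-count controls how strongly a handful of votes can move the prior. With only a few judgements per pair, a larger value keeps the prior closer to uniform.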

In conclusion, it was interesting to see how workers on AMT perceived causal relationships between economic, demographic, and miscellaneous variables by only looking at the names of the variables rather than actual data. We think it would be useful to take such qualitative “common-sense” preconceptions into account when designing automatic models of inference.

Alex Braylan, University of Texas at Austin
Kanika Kalra, Tata Research
Tyler McDonnell, University of Texas at Austin

CrowdCamp Report: Finding Word Similarity with a Human Touch

Semantic similarity and semantic relatedness are features of natural language that contribute to the challenge machines face when analyzing text. Although measuring semantic relatedness remains a complex challenge, only a few ground-truth data sets exist. We argue that the available corpora used to evaluate the performance of natural language tools do not capture all elements of the phenomenon. We present a set of simple interventions that illustrate that 1) framing effects influence similarity perception, 2) the distribution of similarity ratings across multiple users is important, and 3) semantic relatedness is asymmetric.

A number of metrics in the literature attempt to model and evaluate semantic similarity in natural languages. Semantic similarity has applications in areas such as semantic search, text mining, etc. The concept of semantic similarity has long been considered as a more specific concept than the concept of semantic relatedness. Semantic relatedness, as it includes the concepts of antonymy and meronymy, is more generic than semantic similarity.

Different approaches have been proposed to measure semantic relatedness and similarity. Some methods use structured taxonomies such as WordNet; alternative approaches define relatedness between words using search engines (e.g., based on Google counts) or Wikipedia. All of these methods are evaluated by their correlation with human ratings, yet only a few benchmark data sets exist, one of the most widely used being the WS-353 data set [1]. Because the corpus is very small and the sample size per pair is low, it is debatable whether all relevant phenomena are in fact present in the data set.

In this study, we aim to understand how human raters perceive word-based semantic relatedness. We argue that even simple questions about word-based semantic similarity raise issues that existing test sets do not cover. Our hypotheses in this paper are as follows:

(H1) The framing effect influences similarity rating by human assessors.
(H2) The distribution of similarity rating does not follow a normal distribution.
(H3) Semantic relatedness is not symmetric. The relatedness between words (e.g., tiger and cat) yields different similarity ratings in a different word order.

To verify our hypotheses, we randomly selected 102 word pairs from the WS-353 data set and collected similarity ratings on them through Amazon Mechanical Turk (MTurk). We collected 5 datasets for these 102 pairs. Each collection used a different task design and was separated into two batches of 51 word pairs each. Each batch received ratings from 50 unique contributors, so that each word pair received 50 ratings in each condition.

The figure below shows how the questions were posed to the crowd workers. Each question was framed in 4 different conditions. The first two are “How is X similar to Y?” (sim) and “How is Y similar to X?” (inverted-sim). We then repeated both, asking for the difference between the two words (dissim and inverted-dissim, respectively). Since the scale is reversed in dissim and inverted-dissim, the dissimilarity ratings were converted into similarity ratings for comparison.

The different ways of framing each question.

We compared the distributions of similarity ratings in the original WS-353 dataset and our dataset in order to confirm the framing effect. The mean values of 50 ratings were calculated for each pair in our dataset to compare with original similarity ratings in the WS-353 dataset. We filtered exactly the same 102 word pairs from the WS-353 to ensure the consistency between two settings. The distributions are found to be significantly different (p < 0.001, paired t-test).
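For readers unfamiliar with the test, the sketch below computes the paired t statistic from per-pair mean ratings. It illustrates the general technique and is not our analysis script: the ratings shown are hypothetical, and obtaining a p-value still requires the t distribution with the returned degrees of freedom (for instance from a statistics table or a library routine such as SciPy's `ttest_rel`).

```python
import math
from statistics import mean, stdev


def paired_t(xs, ys):
    """Return (t statistic, degrees of freedom) for two paired samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(xs, ys)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# HYPOTHETICAL mean ratings for four word pairs under two settings.
original = [7.0, 8.0, 6.0, 9.0]
ours = [6.0, 6.0, 5.0, 7.0]

t_stat, dof = paired_t(original, ours)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The test is paired because each word pair contributes one value to each setting, so the comparison is made on per-pair differences rather than on two independent samples.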

Our preliminary results show that similarity ratings for some word pairs in the WS-353 dataset do not follow a normal distribution. Some of the distributions reveal differing perceptions of similarity, which show up as multiple peaks. A possible explanation is that the lower peak can be attributed to individuals who are aware of the factual differences between a “sun” or “star” and an actual planet orbiting a “star”, while the others are not.

To verify the third hypothesis, we compared the similarity ratings of sim (dissim) with those of inverted-sim (inverted-dissim). Scatter plots of the ratings in the two word orders, for both the similarity question and the dissimilarity question, show that semantic relatedness does not take the same mean value in both orders, indicating that it is asymmetric. The asymmetry appears consistently in both types of questions (i.e., similarity and dissimilarity). The results show a remarkable difference between the similarity of “baby” to “mother” and the similarity of “mother” to “baby”. This indicates that the asymmetric relationship between mother and baby is reflected in the subjective similarity ratings.

To measure inter-rater reliability, we computed Krippendorff’s alpha for both the original dataset and the one we obtained in the current analysis. Krippendorff’s alpha is a statistical measure of the agreement achieved when coding a set of units of analysis in terms of the values of a variable.
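As a rough illustration of the measure, here is a small sketch of Krippendorff's alpha, computed as one minus the ratio of observed to expected disagreement. Treating ratings as interval data (squared difference as the disagreement metric) is our assumption for this example, and the function name is hypothetical.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval metric (a - b) ** 2.

    `units` is a list of lists: the ratings each item received.
    Items with fewer than two ratings cannot be paired and are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")
    # Observed disagreement: ordered pairs of ratings within the same item.
    observed = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: ordered pairs across all ratings (quadratic in n).
    expected = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("no variation in ratings; alpha is undefined")
    return 1.0 - observed / expected


perfect = krippendorff_alpha_interval([[1, 1], [2, 2]])
opposed = krippendorff_alpha_interval([[1, 2], [1, 2]])
print(perfect, opposed)
```

An alpha of 1 means raters agree perfectly, values near 0 mean agreement is no better than chance, and negative values indicate systematic disagreement.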



[1] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1), 2002.


For more, see our full paper, Possible Confounds in Word-based Semantic Similarity Test Data, accepted at CSCW 2017.

Malay Bhattacharyya
Department of Information Technology
Indian Institute of Engineering Science and Technology,

Yoshihiko Suhara
MIT Media Lab, Recruit Institute of Technology

Md Mustafizur Rahman
Information Retrieval & Crowdsourcing Lab
University of Texas at Austin

Markus Krause
ICSI, UC Berkeley

CrowdCamp Report: Protecting Humans – Worker-Owned Cooperative Models for Training AI Systems

Artificial intelligence is widely expected to reduce the need for human labor in a variety of sectors [2]. Workers on virtual labor marketplaces unknowingly accelerate this process by generating training data for artificial intelligence systems, putting themselves out of a job.

Models such as Universal Basic Income [4] have been proposed to deal with the potential fallout of job loss due to AI. We propose a new model where workers earn ownership of the AI systems they help to train, allowing them to draw a long-term royalty from a tool that replaces their labor [3]. We discuss four central questions:

  1. How should we design the ownership relationship between workers and the AI system?
  2. How can teams of workers find and market AI systems worth building?
  3. How can workers fairly divide earnings from a model trained by multiple people?
  4. Do workers want to invest in AI systems they train?

Crowd Workers gain ownership shares in the AI they help train and reap long-term monetary gains, while Requesters can avail of lower initial training costs.

AI Systems Co-owned by Workers and Requesters

  • Current model (requester-owned): Under the terms of platforms like Amazon Mechanical Turk [1], the data produced (and trained AI systems that result) are owned entirely by requesters in exchange for a fixed price paid to workers for producing that data.
  • Proposed model (worker-owned): In a cooperative model for training AI systems, workers can choose to accept a fraction of that price in exchange for shares of ownership in the resulting trained system (smaller fractions = increased ownership). We can imagine interested outside investors (or even workers themselves) participating in such co-ops as well, bankrolling particular projects that have a significant chance of success.
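One simple way to read "smaller fractions = increased ownership" is to allocate shares in proportion to the pay each worker gives up. The sketch below works under that assumption only. The proportional rule, the function name, and all numbers are hypothetical, and the proposal itself does not prescribe a formula.

```python
def ownership_shares(contracts):
    """Split ownership in proportion to the pay each worker gave up.

    `contracts` maps worker -> (full_price, accepted_fraction), where
    accepted_fraction is the part of the price taken as cash (0.0 to 1.0).
    """
    forgone = {}
    for worker, (price, fraction) in contracts.items():
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"fraction for {worker} must be between 0 and 1")
        forgone[worker] = price * (1.0 - fraction)
    total = sum(forgone.values())
    if total == 0:
        # Nobody gave up any pay, so nobody holds a share.
        return {worker: 0.0 for worker in contracts}
    return {worker: amount / total for worker, amount in forgone.items()}


# HYPOTHETICAL contracts: (full price in dollars, fraction accepted as cash).
contracts = {"w1": (10.0, 0.5), "w2": (10.0, 1.0), "w3": (20.0, 0.25)}
shares = ownership_shares(contracts)

# Dividing a hypothetical $100 of later earnings by share.
royalties = {worker: 100.0 * share for worker, share in shares.items()}
print(shares, royalties)
```

A worker who takes full payment (w2 here) ends up with no stake, which mirrors the current requester-owned model as a special case.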

Finding and Marketing AI Systems

  • Bounties vs. marketplaces: Platforms like Kaggle and Algorithmia allow interested parties to post a bounty (reward) for a trained AI system. Risks for workers under this model include (1) the poster may not accept their solution, (2) the poster may choose another submission over theirs, or (3) the open call may expire. Alternatively, Algorithmia also provides a marketplace enabling AI systems to earn money on a per-use basis. The risk here lies in identifying valuable problem domains with high earning potential.
  • Online vs. offline training models: In an online payment model, workers can provide answers initially and as the AI gains confidence in its predictions, work starts shifting from the crowd to the AI. In an offline payment model, the model can be marketed once it achieves sufficiently accurate predictions, or workers could market a dataset rather than a fully-trained AI system.
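The online model's gradual shift of work from the crowd to the AI can be sketched as confidence-based routing. This is only one possible reading: the threshold value, the function name, and the example predictions are hypothetical.

```python
def route(prediction, confidence, threshold=0.9):
    """Return ("ai", prediction) when confident enough, else ("crowd", None)."""
    if confidence >= threshold:
        return ("ai", prediction)
    # Low confidence: send the item to a crowd worker for an answer.
    return ("crowd", None)


# HYPOTHETICAL model outputs as (label, confidence) pairs.
batch = [("cat", 0.97), ("dog", 0.62), ("cat", 0.91), ("bird", 0.40)]
routed = [route(label, conf) for label, conf in batch]

# Fraction of the batch that still needs human work.
crowd_load = sum(1 for who, _ in routed if who == "crowd") / len(batch)
print(routed, crowd_load)
```

As the model improves, more items clear the threshold and the crowd's share of the work falls, which is the point at which ownership shares would start to matter to workers.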

Fairly Dividing Earnings from AI Systems

  • Assigning credit: How to optimally assign credit for individual training examples is an open theoretical question. We see the opportunity for both model-specific and black-box solutions.
  • Measuring improvement: Measuring improvement to worker-owned and worker-trained AI systems will require methods that incentivize workers to provide the most useful examples, not simply ones that they may have gathered for a test set.
  • Example selection: Training examples could be selected by the AI system (active learning) or by workers. What are fair payment schemes for various kinds of mixed-initiative systems?
  • Data maintenance: Data may become stale over time or change in usefulness. Should workers be responsible for maintaining data, and what are fair financial incentives?
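
To make the credit-assignment question concrete, here is a minimal sketch of one well-known black-box heuristic, leave-one-out scoring: an example's credit is the drop in a quality score when that example is removed. This is not the authors' proposal, only an illustration. The `evaluate` callable and the proportional payout rule are assumptions made for the sketch.

```python
def leave_one_out_credit(examples, evaluate):
    """Credit each example by the score lost when it is left out.

    evaluate(list_of_examples) -> float is a hypothetical black box,
    e.g. "train a model on these examples and return held-out accuracy".
    """
    full_score = evaluate(examples)
    return [
        full_score - evaluate(examples[:i] + examples[i + 1:])
        for i in range(len(examples))
    ]


def split_earnings(credits, owners, earnings):
    """Divide earnings in proportion to positive credit (assumed rule)."""
    positive = [max(credit, 0.0) for credit in credits]
    total = sum(positive)
    payout = {}
    for credit, owner in zip(positive, owners):
        amount = earnings * credit / total if total > 0 else 0.0
        payout[owner] = payout.get(owner, 0.0) + amount
    return payout


if __name__ == "__main__":
    # Toy score: the number of distinct examples in the training set.
    def toy_evaluate(data):
        return float(len(set(data)))

    examples = ["x", "x", "y"]
    owners = ["worker_a", "worker_b", "worker_c"]
    credits = leave_one_out_credit(examples, toy_evaluate)
    print(split_earnings(credits, owners, 100.0))
```

In this toy run the two duplicate examples receive no credit, because removing either one leaves the score unchanged. That shows why the choice of credit scheme matters for fairness: leave-one-out gives nothing to redundant but individually valid contributions.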


Do Workers Want to Invest in AI Systems?

We launched a survey on Mechanical Turk (MTurk) to gauge interest, and got feedback from 31 workers.

  • On average, workers were willing to give up 25% of their income if given the chance to double it over one year. Only 3 participants said they’d not be willing to give up any of their earnings, and age doesn’t seem to be a factor here.
  • When given a risk factor, over 48% chose to give up some current payment for a future reward.
  • In order to give up 100% of their current earnings, workers needed to be able to make back 3 times their invested amount.
  • 45% of workers reported not being worried at all about AI taking over their jobs.



[1] Amazon Mechanical Turk. 2014. Participation Agreement. Retrieved November 4, 2016 from

[2] Executive Office of the President National Science and Technology Council Committee on Technology. October 2016. Preparing for the Future of Artificial Intelligence.

[3] Anand Sriraman, Jonathan Bragg, Anand Kulkarni. 2016. Worker-Owned Cooperative Models for Training Artificial Intelligence. Under review.

[4] Wikipedia. Basic Income.

Anand Sriraman, TCS Research – TRDDC, Pune, India
Jonathan Bragg, University of Washington, USA
Anand Kulkarni, University of California, Berkeley, USA

CrowdCamp 2016: Understanding the Human in the Loop

Report on CrowdCamp 2016: The 7th Workshop on Rapidly Iterating Crowd Ideas, held in conjunction with AAAI HCOMP 2016. Held November 3, 2016 in Austin, TX.

Organizers: Markus Krause (UC Berkeley), Praveen Paritosh (Google), and Adam Tauman Kalai (Microsoft Research)

Human computation and crowdsourcing as a field investigates aspects of the human in the loop. Consequently, we should use metaphors of computer science to describe human phenomena. These phenomena have been studied by other fields such as sociology and psychology for a very long time. Ignoring these fields not only blocks our access to valuable information but also results in simplified models we try to satisfy with artificial intelligence.

We focused this CrowdCamp on methodologically recognizing the human in the loop by paying more attention to human factors in task design and by borrowing methodologies from scientific fields that rely on human instruments, such as survey design, psychology, and sociology.

We believe that this is necessary for and will foster: 1) raising the bar for AI research by facilitating more natural human datasets that capture human intelligence phenomena more richly, 2) raising the bar for human computation methodology for collecting data via human instruments, and 3) improving the quality of life and unleashing the potential of crowdworkers by taking into consideration human cognitive, behavioral, and social factors.

This year’s CrowdCamp featured some new concepts. Besides having a theme, we also held a pre-workshop social event. The idea of the event was to get together and discuss ideas in an informal and cheerful setting. We found this very helpful for breaking the ice, forming groups, and preparing ideas for the camp. It helped keep us focused on the tasks without sacrificing social interactions.

We think the pre-workshop social event really helped inspire participants to get to work right away the next day. We are aware of at least one work-in-progress paper submitted 24 hours after the workshop! We are sure there are even more great results in the individual group reports published on this blog.

We expect to publish all of the data sets we collected in the next week or so, so please check back in a few days to see more of the results of our workshop. A forthcoming issue of AAAI magazine will include an extended version of this report. If you have feedback on the theme of this year’s CrowdCamp, you might find some further points in there to ruminate on. Feel free to share feedback directly or by commenting on this blog post.

Thanks to the many awesome teams that participated in this year’s CrowdCamp. Stay tuned: blog posts from each team describing their projects will follow this workshop overview in the coming days.