## Machine Learning for Dummies

By Eric Holloway and Robert J. Marks II

Since the 1970s, we’ve known humans can find approximate solutions to NP-Complete (really hard) problems more efficiently than the best algorithms (Krolak 1971). The best algorithms scale quadratically, while humans scale linearly (Dry 2006). We also know many of the most widely used and successful machine learning algorithms are NP-Hard to train optimally (Dietterich 2000). This suggests a human/machine hybrid can produce better models than machine learning alone.

However, human interaction poses problems of its own. The hardest is visualization: it is hard to visualize data with more than three dimensions. So we perform dimension reduction by projecting the data onto two dimensions. We then collect many weak models (humans draw boxes) from multiple projections to build a strong model, a technique known in machine learning as boosting.

Combining crowdsourcing and boosting, we use Amazon’s Mechanical Turk to collect the models. The data is transformed by the models into a feature space. Then, we use linear regression to classify new data.
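The pipeline above can be sketched roughly as follows. This is a minimal illustration, not the code from the paper: the box encoding, the function names `box_features`, `fit_linear`, and `classify`, the toy data, and the 0.5 decision threshold are all assumptions made for the example.

```python
# Hypothetical sketch: boxes drawn on 2-D projections become binary
# features, and a least-squares linear model is fitted on those features.

def box_features(point, boxes):
    """Map a point to one 0/1 feature per box.

    Each box is (dim_x, dim_y, x_min, x_max, y_min, y_max): the two
    dimensions of the projection and the rectangle drawn on it.
    """
    feats = []
    for dx, dy, x0, x1, y0, y1 in boxes:
        inside = x0 <= point[dx] <= x1 and y0 <= point[dy] <= y1
        feats.append(1.0 if inside else 0.0)
    return feats


def fit_linear(rows, labels, epochs=2000, lr=0.1):
    """Least-squares linear regression fitted by batch gradient descent."""
    n_feats = len(rows[0])
    w = [0.0] * n_feats
    b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_feats
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = b + sum(wi * xi for wi, xi in zip(w, x)) - y
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return w, b


def classify(point, boxes, w, b):
    """Threshold the regression output at 0.5 (an assumed cut-off)."""
    x = box_features(point, boxes)
    score = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0.5 else 0


# Toy 4-dimensional data: class 1 lies inside the first box.
boxes = [(0, 1, 0.0, 1.0, 0.0, 1.0), (2, 3, 0.0, 1.0, 0.0, 1.0)]
train = [
    ((0.5, 0.5, 3.0, 3.0), 1),
    ((0.2, 0.8, 0.5, 0.5), 1),
    ((2.0, 2.0, 0.5, 0.5), 0),
    ((3.0, 2.0, 4.0, 4.0), 0),
]
rows = [box_features(p, boxes) for p, _ in train]
labels = [y for _, y in train]
w, b = fit_linear(rows, labels)
```

In this toy setting the second box carries no signal, so the fitted model learns to rely on the first box alone.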

We test the human/machine hybrid on one artificial dataset and four real world datasets, all with ten or more dimensions. This hybrid is competitive with machine-only linear regression on the untransformed data.

You can read more about our work in the HCOMP 2016 paper High Dimensional Human Guided Machine Learning.

A demo is available for a limited time.

References

Krolak, P., Felts, W., & Marble, G. (1971). A man-machine approach toward solving the traveling salesman problem. Communications of the ACM, 14(5), 327-334.

Dry, M., Lee, M. D., Vickers, D., & Hughes, P. (2006). Human performance on visually presented traveling salesperson problems with varying numbers of nodes. The Journal of Problem Solving, 1(1), 4.

Dietterich, T. G. (2000, June). Ensemble methods in machine learning. In International workshop on multiple classifier systems (pp. 1-15). Springer Berlin Heidelberg.

## Rethinking experiment design as algorithm design

As experimentation in the behavioral and social sciences moves from brick-and-mortar laboratories to the web, techniques from human computation and machine learning can be combined to improve the design of experiments. Experimenters can take advantage of the new medium by writing complex computationally mediated adaptive procedures for gathering data: algorithms.

In a paper to be presented at CrowdML’16, we consider this algorithmic approach to experiment design. We review several experiment designs drawn from the fields of medicine, cognitive psychology, cultural evolution, psychophysics, computational design, game theory, and economics, describing their interpretation as algorithms. We then discuss software platforms for efficient execution of these algorithms with people. Finally, we consider how machine learning can optimize crowdsourced experiments and form the foundation of next-generation experiment design.

Consider the transmission chain, an experimental technique that, much like the children’s game Telephone, passes information from one person to the next in succession. As the information changes hands, it is transformed by the perceptual, inductive, and reconstructive biases of the individuals. Eventually, the transformation leads to erasure of the information contained in the input, leaving behind a signature of the transformation process itself. For this reason, transmission chains have been particularly useful in the study of language evolution and the effects of culture on memory.

When applied to functional forms, for example, transmission chains typically revert to a positive linear function, revealing a bias in human learning. In each row of the following figure, reprinted from Kalish et al. (2007), the functional relationship in the leftmost column is passed down a transmission chain of 9 participants, in all four instances reverting to a positive linear function.

Transmission chains can be formally modeled as a Markov chain by assuming that perception, learning, and memory follow the principles of Bayesian inference. Under this analysis, Bayes’ rule is used to infer the process that generated the observed data. A hypothesis is then sampled from the posterior distribution and used to generate the data passed to the next person in the chain. When a transmission chain is set up in this way, iterated learning is equivalent to a form of Gibbs sampling, a widely-used Markov chain Monte Carlo algorithm. The convergence results for the Gibbs sampler thus apply, with the prior as the stationary distribution of the Markov chain on hypotheses. This equivalence raises the question of whether other MCMC-like algorithms can form the basis of new experiment designs.
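To make the convergence claim concrete, here is a small sketch under toy assumptions: two hypothetical hypotheses (coins with bias 0.2 and 0.8), an assumed prior, and three coin flips passed from each learner to the next. It builds the exact transition matrix of the chain on hypotheses and iterates it; none of these numbers come from the paper.

```python
from math import comb


def likelihood(d, h, m):
    """P(d heads in m flips | coin bias h)."""
    return comb(m, d) * h**d * (1 - h) ** (m - d)


def posterior(d, prior, m):
    """Bayes' rule: P(h | d) for every hypothesis h."""
    joint = {h: p * likelihood(d, h, m) for h, p in prior.items()}
    total = sum(joint.values())
    return {h: v / total for h, v in joint.items()}


def transition(prior, m):
    """T[h][h2]: probability that the next learner adopts h2 when the
    current learner holds h, generates m flips from it, and the next
    learner samples a hypothesis from the posterior."""
    T = {h: {h2: 0.0 for h2 in prior} for h in prior}
    for d in range(m + 1):
        post = posterior(d, prior, m)
        for h in prior:
            for h2 in prior:
                T[h][h2] += likelihood(d, h, m) * post[h2]
    return T


def iterate(start, T, steps):
    """Push a distribution over hypotheses through the chain."""
    dist = dict(start)
    for _ in range(steps):
        dist = {h2: sum(dist[h] * T[h][h2] for h in dist) for h2 in dist}
    return dist


prior = {0.2: 0.3, 0.8: 0.7}  # assumed prior over the two hypotheses
T = transition(prior, m=3)
# Start the chain with every learner certain of the less likely hypothesis.
final = iterate({0.2: 1.0, 0.8: 0.0}, T, steps=50)
```

Whatever the starting distribution, `final` ends up close to `prior`, which is the sense in which the chain forgets its input and reveals the learners' bias.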

For more, attend CrowdML at NIPS in Barcelona or see our full paper: Rethinking experiment design as algorithm design.

Jordan W. Suchow, University of California, Berkeley
Thomas L. Griffiths, University of California, Berkeley

## Probabilistic Modeling for Crowdsourcing Partially-Subjective Ratings

By An T. Nguyen, Matthew Halpern, Byron C. Wallace, Matthew Lease
University of Texas at Austin and Northeastern University

Imagine you are a graphic designer working on a logo for a major corporation. You have a preliminary design in mind, but now you need to hammer out the fine-grained details, such as the exact font, sizing, and coloring to use. In the past you might have solicited feedback from colleagues or run in-person user studies, but you recently heard about the power of the “crowd” and you are intrigued. You launch an Amazon Mechanical Turk task asking workers to rate different logo permutations on a 1-5 scale, where 1 and 5 stars are the least and most satisfactory ratings, respectively.

You now have a rich multi-variate dataset of the workers’ opinions, but did all workers undertake the task in good faith? For example, an uninterested worker could have just picked a score at random, and a more malicious worker could have intentionally picked an opposite score. You might like to detect any such cases and filter them out of your data. You might even want to extrapolate from the collected scores to other versions not scored. How could you do any of this?

While the need to accurately model data quality is key, this is a challenging partially-subjective rating scenario where each response is open but partially constrained. In the logo example, the worker is allowed to select any score in the range 1-5. Personal opinions may differ significantly on a score for a specific logo design. This is in contrast to conventional crowdsourcing tasks, such as image tagging and finding email addresses, where one expects a single, correct answer to each question being asked. While a great deal of prior work has proposed ways to model data for such objective tasks, far less work has considered modeling data quality and worker performance under more subjective task scenarios.

The key observation we make in our paper is that the worker data for these partially-subjective tasks, where worker labels are partially ordered (i.e. scores from one to five), are heteroscedastic in nature. Therefore we propose a probabilistic, heteroscedastic model where the means and variances of worker responses are modeled as functions of instance attributes. In other words, the variability of scores can itself vary across the different parameters. Consider the results as the font size of a logo is varied. We would expect that most workers would give the logos with the smallest and largest font sizes low scores. However, the range of scores for the middle range of fonts is going to be more varied.
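As a rough illustration of what “heteroscedastic” buys you, the sketch below scores ratings under a Gaussian whose mean and log-variance are both linear in the instance attributes. It is a simplified stand-in, not the model from the paper; the weights and the function names `predict`, `log_likelihood`, and `worker_score` are hypothetical.

```python
import math


def predict(x, mean_w, logvar_w):
    """Mean and variance of the rating for an instance with attributes x.

    Both are linear in x (with a leading bias term); the variance is
    exponentiated so that it stays positive.
    """
    feats = [1.0] + list(x)
    mean = sum(w * f for w, f in zip(mean_w, feats))
    var = math.exp(sum(w * f for w, f in zip(logvar_w, feats)))
    return mean, var


def log_likelihood(rating, x, mean_w, logvar_w):
    """Gaussian log-density of one rating under the predicted mean/variance."""
    mean, var = predict(x, mean_w, logvar_w)
    return -0.5 * (math.log(2 * math.pi * var) + (rating - mean) ** 2 / var)


def worker_score(responses, mean_w, logvar_w):
    """Average log-likelihood of one worker's (attributes, rating) pairs.

    A persistently low score is one possible signal of an unreliable worker.
    """
    total = sum(log_likelihood(r, x, mean_w, logvar_w) for x, r in responses)
    return total / len(responses)


# Hypothetical weights: the expected rating is 3 everywhere, but the
# variance grows with the single attribute x[0].
mean_w = [3.0, 0.0]
logvar_w = [-2.0, 2.0]

ll_consensus = log_likelihood(5, [0.0], mean_w, logvar_w)  # low-variance region
ll_contested = log_likelihood(5, [1.0], mean_w, logvar_w)  # high-variance region
```

The same two-point deviation from the mean is penalized far more where workers normally agree (`ll_consensus`) than where opinions legitimately differ (`ll_contested`), which is what lets such a model separate disagreement from unreliability.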

We demonstrate the effectiveness of our model on a large dataset of nearly 25,000 Mechanical Turk ratings of user experience when viewing videos on smartphones with varying hardware capabilities. Our results show that our method is effective both at predicting user ratings and at detecting unreliable respondents, which is particularly valuable and little studied in this domain of subjective tasks where there is no clear, objective answer.

The link to our full paper is below, along with links to shared data and code, and we welcome any comments or suggestions!

An Thanh Nguyen, Matthew Halpern, Byron C. Wallace, and Matthew Lease. Probabilistic Modeling for Crowdsourcing Partially-Subjective Ratings. In Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2016. 10 pages.

## Pairwise, Magnitude, or Stars: What’s the Best Way for Crowds to Rate?

IS THE UBIQUITOUS FIVE STAR RATING SYSTEM IMPROVABLE?
We compare three popular techniques of rating content: five star rating, pairwise comparison, and magnitude estimation.

We collected 39 000 ratings on a popular crowdsourcing platform, allowing us to release a dataset that will be useful for many related studies on user rating techniques.
The dataset is available here.

METHODOLOGY
We ask each worker to rate 10 popular paintings using 3 rating methods:

• Magnitude: Using any positive number (zero excluded).
• Star: Choosing from 1 to 5 stars.
• Pairwise: Pairwise comparisons between two images, with no ties allowed.

We run 6 different experiments (one for each ordering of the three rating methods) with 100 participants in each of them. We can thus analyze the bias introduced by the order of the rating methods, and obtain results free of order bias by using the aggregated data.

At the end of the rating activity in the task, we dynamically build the three painting rankings induced by the choices of the participant, and ask them which of the three rankings best reflects their preference (the ranking comparison is blind: there is no indication of how each ranking has been obtained, and their order is randomized).

WHAT’S THE PREFERRED TECHNIQUE?
Participants clearly prefer the ranking obtained from their pairwise comparisons. We notice a memory bias effect: the ranking from the last technique used is more likely to be chosen as the most accurate description of the user’s real preference. Despite this, the pairwise comparison technique obtained the maximum number of preferences in all cases.

EFFORT
While the pairwise comparison technique clearly requires more time than the other techniques, it would be comparable in terms of time with the other techniques using a dynamic test system (of order N log N).
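One way such a dynamic test system could work is to choose each next pair adaptively, as a comparison sort does. The sketch below uses binary insertion and assumes the participant's preferences are consistent (transitive); the simulated worker and the hidden utilities are made up for the example.

```python
def rank_by_pairwise(items, prefers):
    """Rank items best-first by binary insertion.

    `prefers(a, b)` asks one pairwise question and returns True when
    a is preferred to b.
    """
    ranking = []
    for item in items:
        lo, hi = 0, len(ranking)
        while lo < hi:
            mid = (lo + hi) // 2
            if prefers(item, ranking[mid]):
                hi = mid  # item belongs before ranking[mid]
            else:
                lo = mid + 1  # item belongs after ranking[mid]
        ranking.insert(lo, item)
    return ranking


# Simulated participant with a hidden utility for each of 10 paintings.
hidden_utility = {
    f"painting_{i}": u for i, u in enumerate([3, 9, 1, 7, 5, 0, 8, 2, 6, 4])
}
asked = []  # log of every pairwise question put to the participant


def simulated_worker(a, b):
    asked.append((a, b))
    return hidden_utility[a] > hidden_utility[b]


ranking = rank_by_pairwise(list(hidden_utility), simulated_worker)
```

For 10 items this procedure asks at most 25 questions, compared with 45 if every pair were shown.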

WHAT DID WE LEARN?

• Star rating is confirmed to be the most familiar way for users to rate content.
• Magnitude is unintuitive with no added benefit.
• Pairwise comparison, while requiring a higher number of low-effort user ratings, best reflects intrinsic user preferences and seems to be a promising alternative.

For more, see our full paper, Pairwise, Magnitude, or Stars: What’s the Best Way for Crowds to Rate?

## Remembering David Martin and his ethnographic studies of crowdworkers

http://www.humancomputation.com/2016/martin.html

This post is a late tribute to Dr. David B. Martin, who passed away in June 2016. Dave was an ethnographer, inherently interested in the tension between tools, progress and the reality of everyday life. Dave had spent the last few years doing ethnomethodological studies of crowdwork and crowdworkers, together with Jacki O’Neill, Ben Hanrahan, and myself at Xerox Research Centre Europe, bringing forth the invisible, hidden work, and various issues of the worker communities into public view. Through ethnomethodology we studied how people collaborated with each other, how they communicated, and the mundane things they did to make crowdwork work. His vision and efforts in the area of crowdsourcing will continue to live on through his work, constantly inspiring us.

The team’s first encounter with crowdsourcing was in 2011, when Dave and Jacki began studying outsourcing work to investigate if crowdsourcing could be used for work that is traditionally outsourced. Following the tradition of the work practice studies at Xerox, they conducted ethnographic studies of BPO work [1], specifically of low-skilled, piece-rate work of healthcare form digitization [2] to see if it could be crowdsourced. Dave studied at-home work in the US and Jacki studied in-office data entry work in India. These studies showed how the outsourcers’ current workflow was sensitively orchestrated to achieve high quality and rapid turnaround times at minimum cost. Even in a low-skilled, piece-rate environment, it was the subtleties of the employer-employee relationship that enabled the work to be done on time and to quality.

The Xerox team then went on to explore crowdsourcing further, especially the lived work of microtask crowdwork: the work practices that make crowdwork work as experienced by crowdworkers. At this point the research team expanded beyond Dave and Jacki to involve Shourya Roy and colleagues from the Xerox India lab. Later we were joined by Ben Hanrahan, our in-house crowdwork developer at the Xerox lab in Grenoble (and speaker at the HCOMP event ‘Remembering David B. Martin and his Ethnographic Studies of Crowd Workers’ [3] on the 31st of October 2016), and by myself, a PhD candidate from the University of Nottingham, UK.

Our interest in crowdsourcing was to enquire: Who are the people who do crowdwork? What are their work practices? What are their personal and work lives like? Why do they do this kind of work? What are their expectations from, and issues with, crowdwork? How do the various technologies they use fit with and support these practices?

Our enquiries began with Dave’s study [4] of the Turker Nation Forum, predominantly Turkers based in the US, and my PhD, studying the Turkers based in India. Dave’s study was ground-breaking both methodologically (being an EM ethnography of a forum) and in its findings. It painted the first detailed picture of the lived world of crowdworking. Turkers oriented to turking as ‘work’ and AMT as a ‘labour market’, which came as a surprise to much of the research community. The study showed the motivations and ethical codes the workers followed, and the need for fairness and relationship-building in crowdwork. Platforms like Amazon Mechanical Turk not only offer no support for the management of rapid, high quality workflows, but also deliberately design out the relationship between workers and the organisation, reduce accountability and replace complex social, organisational & financial motivators almost solely with monetary ones.

Before publication of ‘Being a Turker’, Dave sought and got approval from members of the Turker Nation forum for his paper, as he wanted to ensure he was truly representing the concerns of the Turker community. He truly prized the emails and forum comments about his paper from the members of the turker community, and although we do not have these communications with us, it shows us what really mattered to him.

Dave, Jacki and Ben were part of the analysis of the ethnographic data I collected in India in 2013. Dave took over my PhD supervision from Jacki in 2014, honed my analytical skills and mentored me on the ethnomethodological journey through my PhD. The rigorous discussions and debates with him on Zimmerman, Becker and so on helped frame the key discussions in my thesis. With Andy Crabtree and Tom Rodden from the University of Nottingham we wrote ‘Understanding Indian Crowdworkers’ as an introduction [5] to the Indian turkers’ story. The first major contribution from the PhD was the ‘Turk-life in India’ paper [6], which showed the various levels of digital and English-language literacy amongst turkers, the role infrastructure and technology played in turking, and the social nature of turking. ‘Turking in a global labour market’ was a comparative narrative [7] of workers based in the US and India, the understandings they had of the market and of each other in a transnational labour platform that brought them head-on into a very competitive work environment, and what that meant for designers, policy makers, researchers and activists. This also led us to design a conceptual tool [8], TurkBench, meant to provide dynamic scheduling and give crowdworkers personalized market visualization and session management.

Our collective research into crowdsourcing showed that turkers were collaborative agents, who cared about their work, the money they earned, their reputation and their relationship with the requesters. It also brought to the fore the discussion on design, since in crowdsourcing the labour market is embodied through the technology you create, which thus has far-reaching consequences in the lives of the crowdworkers, not just in their ability to successfully complete quality work, but also in their working conditions and standards of living. Dave wanted to ensure that technology designers were not designing in a bubble, and were aware of their responsibilities and of the consequences their designs had on ‘real’ people.

Dave had been working to support this cause since his first brush with crowdsourcing. One of his final projects was the book chapter “Understanding The Crowd: ethical and practical matters in the academic use of crowdsourcing”, which highlights the principles of ‘professional ethics’ that were put in place for the safety and well-being of the research subjects, in this case the crowdworkers, reminding researchers of ‘ethical conduct’ during crowd-based research and providing references and further guidance for using the crowd to support empirical work. He ignited a spark in the conversations at a Dagstuhl crowdsourcing seminar [9] in November 2015, “Evaluation in the Crowd: Crowdsourcing and Human-Centred Experiments”, where computer scientists from fields like visualization, psychology, graphics and multimedia assembled to discuss the role of crowdsourcing in empirical research work. Dave challenged the perspectives of the academicians present, reminding them never to overlook the people who provided experimental data for them. The book arising from that seminar is to be published by Springer in early 2017.

The last event he took part in was the Connected Life Conference [10] at the University of Oxford in June 2016, where he, along with colleagues, participated in an interdisciplinary debate on socio-digital practices of collective action, continuing his fight to design technologies and advance policy work for workers doing crowdwork, who are ignored, hidden behind algorithmic distribution and assembly of work.

Drawing to a close, I’d like to borrow Jacki’s words from her video at HCOMP 2016, words that resonate with everyone who knew him, whether they met him at conferences or at a pub.
“Dave will be remembered for his passion for people, philosophy, a good argument and a pint. He conducted ethnomethodology without indifference. His sense of fairness and integrity permeated both his social and his work life. And in the fight to design technologies which give agency to typically undervalued workers, we have lost a kindred spirit. The greatest tribute we can give to Dave, is to remember that the crowd has a human face.”

We miss you Dave.

Neha Gupta

References:

[1] Jacki O’Neill and David Martin. In IFIP Conference on Human-Computer Interaction, pp. 429-446, 2013.

[2] Form digitization in BPO: from outsourcing to crowdsourcing?
Jacki O’Neill, Shourya Roy, Antonietta Grasso, and David Martin. In Proceedings of the 31st ACM SIGCHI conference on Human factors in computing systems, pp. 197-206, 2013.
[3] HCOMP talk http://www.humancomputation.com/2016/martin.html
[4] Being a Turker
David Martin, Benjamin V. Hanrahan, Jacki O’Neill, and Neha Gupta. In Proceedings of the 17th ACM CSCW Conference on Computer Supported Cooperative Work & Social Computing, pp. 224-235, 2014.

[5] Understanding Indian Crowdworkers
Neha Gupta, Andy Crabtree, Tom Rodden, David Martin, and Jacki O’Neill. Back to the Future of Organizational Work: Crowdsourcing and Digital Work Marketplaces Workshop at CSCW 2014.

[6] Turk-Life in India
Neha Gupta, David Martin, Benjamin V. Hanrahan, and Jacki O’Neill. In Proceedings of the 18th ACM GROUP: International Conference on Supporting Group Work, pp. 1-11. 2014.

[7] Turking in a Global Labour Market
David Martin, Jacki O’Neill, Neha Gupta, and Benjamin V. Hanrahan. In Proceedings of the 19th ACM CSCW Conference on Computer Supported Cooperative Work & Social Computing, pp. 39-77, 2016.

[8] TurkBench: Rendering the Market for Turkers
Benjamin V. Hanrahan, Jutta K. Willamowski, Saiganesh Swaminathan, and David B. Martin. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI), pp. 1613-1616.

[9] Dagstuhl seminar http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=15481

[10] Oxford Conference http://connectedlife.oii.ox.ac.uk/2016conference/programme/

## How to Best Serve Micro-tasks to the Crowd when there Is Class Imbalance

DOES CLASS IMBALANCE IN RELEVANCE JUDGMENTS AFFECT PERFORMANCE?
We study how the dominance of one class in the data that crowd workers process affects their efficiency and effectiveness. We aim to understand whether seeing many negative examples biases workers when they identify positive labels.

We run comparative experiments in which we measure label quality and work efficiency over different class distribution settings, varying both label frequency (i.e., one dominant class) and ordering (e.g., positive cases preceding negative ones).

We used data from TREC8.  To measure effects of class imbalance, we used two different relevant/non-relevant ratios in a batch of judging tasks: 10%-90% and 50%-50%.

RESULTS
When the relevant documents are shown before the non-relevant ones we obtain the highest precision, while the worst precision is obtained when they are shown at the end of the batch.
Moreover, in batch 2 we observe a low number of true positives and a large number of false positive judgments by the workers, which shows how 90% of non-relevant documents shown at the beginning of the batch create a bias in the workers’ notion of relevance.

When classes are balanced, there is no statistically significant difference in performance between the different orders. Moreover, seeing a similar number of positive and negative documents leads to good performance, with more than 60% accuracy in all three order settings.
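For readers who want to reproduce this kind of comparison on their own judgments, the following sketch shows how a batch with a given class ratio and order might be laid out and how precision and accuracy are computed against ground truth. The helper names and the example judgments are hypothetical and are not taken from the paper's data.

```python
def make_batch(n_docs, relevant_share, relevant_first=True):
    """Ground-truth labels for one batch (True = relevant).

    `relevant_share` is the fraction of relevant documents, e.g. 0.10
    for the 10%-90% setting; `relevant_first` controls the ordering.
    """
    n_rel = round(n_docs * relevant_share)
    relevant = [True] * n_rel
    non_relevant = [False] * (n_docs - n_rel)
    return relevant + non_relevant if relevant_first else non_relevant + relevant


def precision_and_accuracy(judgments, truth):
    """Precision of the 'relevant' label and overall accuracy."""
    tp = sum(1 for j, t in zip(judgments, truth) if j and t)
    fp = sum(1 for j, t in zip(judgments, truth) if j and not t)
    correct = sum(1 for j, t in zip(judgments, truth) if j == t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    return precision, correct / len(truth)


# Imbalanced batch with the single relevant document shown first.
truth = make_batch(10, 0.10, relevant_first=True)
# Made-up worker judgments: one true positive and one false positive.
judgments = [True, True] + [False] * 8
precision, accuracy = precision_and_accuracy(judgments, truth)
```

The example also shows why accuracy alone is misleading under imbalance: this worker is 90% accurate while only half of the documents they marked relevant actually are.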

WHAT DID WE LEARN?
When most of the documents are non-relevant and the few relevant ones are presented first, workers perform better. This is a positive result which can be easily applied in practice as in real IR evaluation settings most of the documents to be judged are non-relevant.

Placing documents known to be relevant in the first positions will both prime workers on relevance and allow for training.

While in a real setting it is not possible to put relevant documents first, it would still be possible to order documents by attributes indicating their relevance (e.g., retrieval rank, number of IR systems retrieving the document, etc.) thus presenting first to the workers the documents with higher probability of being relevant.

For more, see our paper, THE EFFECT OF CLASS IMBALANCE AND ORDER ON CROWDSOURCED RELEVANCE JUDGMENTS

## Social Sampling and the Multiplicative Weights Update Method

Understanding when social interactions lead to the emergence of group-level abilities beyond those of an individual is central to understanding human collaboration and collective intelligence. Towards this, in this post we consider a social behavior that is at once conspicuous in daily life, oft discussed in the social science literature, and also recently empirically verified: the behavior essentially boils down to taking suggestions from other people, and we study whether this leads to good decision-making as a whole.

For instance, consider the problem of choosing what stocks to purchase, or which restaurant to go to. From day to day the attractiveness of each option available might change, but overall may tend to vary around some mean. In such cases, there can be a large number of options for an individual to choose from, and it is impossible to hear about all the past experiences that individuals have had with each option. One simple strategy in these situations is to first seek out a recommendation, whether from friends or via web searches, and then evaluate the current information available about the recommended option.  In doing so, no individual has to consider all the different choices themselves, so the process is cognitively simple.

This social behavior can be broken down as follows:

• people look to others for advice about what decisions to make,
• privately gather further information about the recommended options,
• and then make their final decisions.

We study a concrete model of this process called “social sampling” that was recently formulated and validated empirically using a large behavioral dataset. More precisely, as depicted in the figure below, in every decision-making round each individual first chooses an option to consider by looking at the current decision of a random other person; then the current (stochastic) quality of each option is observed; and finally the individual randomly either keeps this option or remains undecided, where the probability of keeping it is determined by the quality.
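A toy simulation of these three steps might look as follows. Several details are assumptions made only for the sketch: the keep probability equals a fixed per-option quality, an undecided neighbour yields a uniformly random option, and all updates within a round are synchronous. The formal model in the paper may differ on each of these points.

```python
import random
from collections import Counter


def social_sampling(qualities, n_people=500, rounds=50, seed=0):
    """Simulate social sampling and return how many people hold each option."""
    rng = random.Random(seed)
    k = len(qualities)
    decisions = [None] * n_people  # None means "undecided"
    for _ in range(rounds):
        new_decisions = []
        for i in range(n_people):
            # Step 1: look at the current decision of a random other person.
            j = rng.randrange(n_people - 1)
            other = j if j < i else j + 1
            candidate = decisions[other]
            if candidate is None:
                # Assumption: an undecided neighbour gives no advice, so
                # the individual considers a uniformly random option.
                candidate = rng.randrange(k)
            # Steps 2-3: observe the option's (stochastic) quality and
            # keep it with probability equal to that quality.
            keep = rng.random() < qualities[candidate]
            new_decisions.append(candidate if keep else None)
        decisions = new_decisions
    return Counter(d for d in decisions if d is not None)


counts = social_sampling([0.3, 0.5, 0.9])
```

With these hypothetical qualities, the option with quality 0.9 ends up held by a clear majority of the population even though no individual ever compares options directly.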

From the perspective of a single individual, social sampling is a simple heuristic, and it requires very limited cognitive overhead. As a whole, it is a priori unclear whether this process will result in society eventually converging to the best option, or in inferior options gaining popularity by being propagated from one person to the next.

Nevertheless, our results show that a group of individuals implementing social sampling in a diverse range of settings results in approximately optimal decisions, and hence the behavior is collectively rational. Our analysis shows that social sampling is a highly effective distributed algorithm for solving the problem of which decision is best to make, in the sense that social sampling achieves near-optimal regret in this sequential decision-making task. This behavioral mechanism that individuals use may therefore be highly effective in large groups.

Key to our results is the observation that social sampling can be viewed as a distributed implementation of the ubiquitous multiplicative weights update (MWU) method in which the popularity of each option implicitly represents the weight of that option. This relationship provides an algorithmic lens through which we can understand the emergent collective behavior of social sampling. Beyond these scientific implications, the relationship to MWU could also suggest novel distributed MWU algorithms. Social sampling requires little communication and memory, and hence may be appropriate as an MWU algorithm for low-power devices such as sensor networks or the internet-of-things.
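The connection can be seen in a simplified, large-population reading of the model. The notation here is ours, introduced only for illustration: let p_j^{(t)} be the share of decided individuals holding option j in round t, and q_j^{(t)} the probability that a sampled option j is kept. An individual ends the round holding j with probability p_j^{(t)} q_j^{(t)}, so in expectation the shares among decided individuals evolve as:

```latex
p_j^{(t+1)} \;=\; \frac{p_j^{(t)}\, q_j^{(t)}}{\sum_{k} p_k^{(t)}\, q_k^{(t)}}
```

Each option's popularity is multiplied by a factor that depends on its current quality and then renormalized, which has the same form as a multiplicative weights update with popularity playing the role of the weight.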

Social Sampling was one of the group projects pursued at the CMO-BIRS 2016 WORKSHOP ON MODELS AND ALGORITHMS FOR CROWDS AND NETWORKS.

## Integrated crowdsourcing helps volunteer-based communities get work done

We are working on crowdsourcing techniques to support volunteer communities (rather than to get work done with paid workers). In these communities, it can be infeasible or undesirable to recruit external paid workers. For example, nonprofits may lack the funds to pay external workers. In other communities, such as crowdsourced innovation platforms, issues of confidentiality or intellectual property rights may make it difficult to involve external workers. Further, volunteers in the community may possess desirable levels of motivation and expertise that may be valuable for the crowdsourcing tasks. In these scenarios, it may be desirable to leverage the volunteers themselves for crowdsourcing tasks, rather than external workers.

A key challenge to leveraging the volunteers themselves is that in any reasonably complex activity (like collaborative ideation) there are exciting tasks to be done and there is other work that is equally important, but less interesting. In a paper to be presented at UIST’16, we demonstrated the integrated crowdsourcing approach that seamlessly integrates the potentially tedious secondary task (e.g., analyzing semantic relationships among ideas) with the more intrinsically-motivated primary task (e.g., idea generation). When the secondary task was seamlessly integrated with the primary task, our participants did as good a job on it as crowds hired for money. They also reported the same levels of enjoyment as when working just on the fun task of idea generation.

Our IdeaHound system embodies the integrated crowdsourcing approach in support of online collaborative ideation communities.  This is how it works:  The main element of the IdeaHound interface (below) is a large white board. Participants make use of this affordance to spatially arrange their own ideas and the ideas of others because such arrangement of inspirational material is helpful to them in their own ideation process.

Their spatial arrangements also serve as an input to a machine learning algorithm (t-SNE) to construct a model of semantic relationships among ideas:

In several talks last fall, we referred to this approach as “organic” crowdsourcing, but the term proved confusing and contentious.

Several past projects (e.g., Crowdy and the User-Powered ASL Dictionary) embedded work tasks into casual learning activities. Our work shows how to generalize the concept to a domain where integration is more difficult.

You can learn more by attending Pao’s presentation at UIST next week in Tokyo or you can read the paper:

Pao Siangliulue, Joel Chan, Steven P. Dow, and Krzysztof Z. Gajos. IdeaHound: improving large-scale collaborative ideation with crowd-powered real-time semantic modeling. In Proceedings of UIST ’16, 2016.

## Collaboration Among Workers

Chien-Ju Ho (Cornell University), Christopher H. Lin (University of Washington), and Siddharth Suri (Microsoft Research)

The research question we set out to address in this pilot study is:

Does the fact that workers exist in networks help them solve problems?

To address this question, we gave 100 Mechanical Turk workers a list of 8 cities in the United States and asked them to find the shortest route that visits all cities and starts from Seattle. This is an instance of the Travelling Salesman Problem (TSP).  Here were the HIT parameters:

• The base pay was $0.10 and workers got a $2.00 bonus for getting the best (or tied for the best) route.
• In addition, for every 100 miles away from the best answer we deducted $0.10 from the maximum bonus.
• We set the duration of the HIT to be 1 hour.
• We explicitly said workers could collaborate on this.
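The task and payment rule above can be sketched as follows. The distances and the placeholder cities A, B, and C are made up, and three details are assumptions: the route is an open path that does not return to Seattle, the deduction applies per whole 100 miles, and the bonus never goes below zero.

```python
from itertools import permutations


def bonus_cents(route_miles, best_miles, max_bonus=200, penalty=10, step_miles=100):
    """Bonus in cents: $2.00 minus $0.10 per whole 100 miles over the best route."""
    excess_steps = (route_miles - best_miles) // step_miles
    return max(0, max_bonus - penalty * excess_steps)


def route_length(route, dist):
    """Total length of an open path visiting the cities in order."""
    return sum(dist[frozenset(pair)] for pair in zip(route, route[1:]))


def best_route(start, cities, dist):
    """Brute-force search over all orderings; feasible for 8 cities (7! = 5040)."""
    others = [c for c in cities if c != start]
    candidates = ((start,) + p for p in permutations(others))
    return min(candidates, key=lambda r: route_length(r, dist))


# Toy symmetric distances in miles (hypothetical, not real geography).
dist = {
    frozenset(("Seattle", "A")): 100,
    frozenset(("Seattle", "B")): 400,
    frozenset(("Seattle", "C")): 300,
    frozenset(("A", "B")): 200,
    frozenset(("A", "C")): 500,
    frozenset(("B", "C")): 150,
}
cities = ["Seattle", "A", "B", "C"]
best = best_route("Seattle", cities, dist)
best_miles = route_length(best, dist)
```

Under this rule a submitted route that is 250 miles longer than the best one would earn a bonus of `bonus_cents(best_miles + 250, best_miles)`, that is, 180 cents.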

One of our main results is that we found worker collaboration!  Next, we show how they collaborated.

• Workers started new threads to collaborate on the HIT.
• Workers tried to fill out a matrix of distances between cities.
• Workers shared their answers (routes).
• Workers did greedy minimization to improve on the answers.
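The post does not say exactly how workers carried out the greedy minimization. One plausible reading, sketched here purely as an assumption, is a local search that starts from a route someone shared and keeps applying the single swap of two cities that shortens it most. The distances and city names are toy values.

```python
from itertools import combinations


def route_length(route, dist):
    """Total length of an open path visiting the cities in order."""
    return sum(dist[frozenset(pair)] for pair in zip(route, route[1:]))


def greedy_improve(route, dist):
    """Repeatedly apply the best swap of two non-start cities until none helps."""
    route = list(route)
    while True:
        best, best_len = None, route_length(route, dist)
        for i, j in combinations(range(1, len(route)), 2):
            cand = route[:]
            cand[i], cand[j] = cand[j], cand[i]
            cand_len = route_length(cand, dist)
            if cand_len < best_len:
                best, best_len = cand, cand_len
        if best is None:
            return route  # local optimum: no single swap is shorter
        route = best


# Toy symmetric distances in miles (hypothetical, not real geography).
dist = {
    frozenset(("Seattle", "A")): 100,
    frozenset(("Seattle", "B")): 400,
    frozenset(("Seattle", "C")): 300,
    frozenset(("A", "B")): 200,
    frozenset(("A", "C")): 500,
    frozenset(("B", "C")): 150,
}
shared = ["Seattle", "C", "A", "B"]  # a route posted by another worker
improved = greedy_improve(shared, dist)
```

A search like this only guarantees a local optimum, which is one reason sharing several different starting routes in a thread could help the group.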

Conclusions

We found some preliminary evidence that workers can use their own networks to collaborate. Moreover, they can solve hard problems together if the requester explicitly allows collaboration and incentivizes them to do so.

We are currently working on a number of follow up questions:

• How do solutions from collaboration compare to solutions from independent workers?
• Could allowing workers to collaborate result in groupthink?
• For which problems does collaboration help and for which problems does it not help?

Collaboration Among Workers was one of the group projects pursued at the CMO-BIRS 2016 WORKSHOP ON MODELS AND ALGORITHMS FOR CROWDS AND NETWORKS.

by Yiling Chen, Jason Hartline, Yang Liu, Bo Waggoner & Dan Weld

The role of reputation systems in online markets, such as those for crowdwork, is to transform a single-shot game between individual requester and workers into a repeated game between the population of requesters and population of sellers. In the single-shot game cooperation can break down, e.g. workers may provide only low quality service, but in the repeated game cooperation can be sustained, and workers are incented to do high quality work (cf. Kandori, 1992).  Reputation systems will fail, however, if the marginal cost to a requester for providing an informative review of a worker is greater than the marginal benefit of more accurate future ratings.1
The main driver of this inequality is the free-rider problem: requesters benefit from the public good (i.e. worker reputation) created by the reputation system even if they do not contribute to it.

In this blog post we describe two main ideas for making reputation systems more effective. First, to increase the marginal benefit of providing an accurate rating, we suggest changing reputation from a public good to a private good by casting it as an individualized (to each requester) recommendation. For example, a requester who gives all workers a top rating is reporting no preference and can be recommended any worker, while a requester with more informative reports will be assigned workers that the system predicts this requester will prefer. Second, to decrease the marginal cost of providing informative reviews, we suggest linking decisions across a requester’s reviews of several workers (i.e., ranking the workers relative to each other instead of scoring them absolutely).

The rest of this blog post is organized as follows. In the first section we discuss some of the reasons that the marginal cost of informative reviews may outweigh their marginal benefit. In the second section we describe how to re-envision the reputation system as a recommender system. In the third section we describe how linking decisions can make it easier for requesters to give informative reports.

1. Costs and Benefits of Reputation systems.

We believe that ratings are inaccurate because the marginal cost to a requester for providing an informative review is greater than the marginal benefit of more accurate future ratings. There are several costs associated with submitting a review, especially an accurate assessment of a poor worker.

• It takes time to enter a review.
• A review incurs the risk of being embroiled in a dispute.
• A negative review may give the requester the reputation of being a ‘harsh grader’ and cause other skilled workers to avoid the requester for fear of damaging their reputation.
• There may be off-platform consequences, such as unfavorable posts on Turkopticon.

By lowering these costs, as we discuss below, an improved reputation system could increase the accuracy of reviews.  There are also several reasons why the marginal benefit of a review is low:

• Requester feedback is often aggregated into reputations, and a single assessment will likely not visibly affect the worker’s rating.
• Even if a single assessment did change the worker’s score, the change provides no new information to the requester, who already knows how this worker performed. (If the rating affected the scores of other workers, however, by personalizing these ratings, then there would be a marginal benefit. Such personalized ratings may be thought of as a private good.)

There are several ways to decrease the cost of providing truthful reviews:

• One can reduce the chance of disputes by making reviews anonymous and only showing a worker their average score. This approach is taken by Uber (and others), which shows drivers only their average rating for the most recent 500 trips.
• One way to shrink or eliminate the marginal cost of the time to submit a review is to provide a simple UI that is fast to use, while simultaneously making it cumbersome to avoid the review task. This approach, like Amazon’s repeated nagging requests for reviews, may alienate users, however.
• Alternatively, in lieu of an explicit review, it may be possible to observe other actions performed by the requester and use them to reveal his or her preferences. For example, one could detect whether the requester subsequently chooses to hire the worker again, which is presumably a very strong signal that previous interactions were positive.

One can also try to eliminate the free-rider problem by increasing the marginal benefit of reviews.  Again, several options are possible.

• One way of increasing the marginal value is to personalize recommendations. This method is used in Boomerang by Gaikwad et al. (2016) and we discuss it further in the next section.
• Another method for increasing marginal value of correct reviews is changing overall system behavior in a way that impacts the requester. For example, Boomerang gives temporal priority to well-rated workers on subsequent jobs posted by a requester. This creates a clear disincentive to inflate the grade of a poor worker, but may not provide enough of an incentive for requesters to post ratings at all.

One novel approach that deserves more attention is blurring. In this model, a requester’s ability to observe the reputation of a worker is proportional to the number and quality of ratings that the requester has provided. If the system supports strong identities, then a new requester could be allowed unfettered access to the reputation ratings, but if the requester engaged in transactions and then failed to report an accurate rating, the system would hide information in subsequent rounds (see Figure). Of course, such a mechanism requires a way to estimate the quality of a review. One method, which needs further thought and experimental evaluation, would be to measure the entropy of the requester’s reviews (this discourages giving uniformly positive five-star reviews) and their agreement with good reviewers (calculated using EM).
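As a rough illustration of the entropy idea only (the EM-based agreement measure is not sketched here), one could compute the Shannon entropy of the star ratings a requester has given. The function name `rating_entropy` and the choice of bits as the unit are our assumptions.

```python
import math
from collections import Counter

def rating_entropy(ratings):
    """Shannon entropy, in bits, of the empirical distribution of ratings."""
    counts = Counter(ratings)
    total = len(ratings)
    # Each distinct rating value contributes p * log2(1 / p).
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

A requester who always gives five stars scores 0 bits, while one who spreads ratings evenly over four values scores 2 bits. Entropy alone cannot tell informative ratings from random ones, which is presumably why it is paired with an agreement measure above.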

2. Reputation systems as recommender systems.

One approach to solving the free-rider problem is to turn the public good into private goods. If the reputation of workers is no longer something that every requester benefits from equally, and each requester instead gets personalized recommendations or matches of workers according to the preferences revealed by the feedback provided, then the marginal value of providing an accurate assessment increases. This alleviates the incentive to free ride. For example, if a requester always rates five stars, his revealed preference is that he’s happy with any worker, and hence he will be matched with workers that others think are less skilled. This approach not only provides incentives for requesters to report accurate assessments but also accommodates heterogeneity in requester preferences, which can help increase the expressiveness of reputation systems. In such a system a requester’s feedback is taken not only as an assessment but also as an indication of his preference. The following gives a sketch of how such a system would work.

1. Requesters provide reviews of workers in their own terms.
2. The requesters’ reviews are normalized and aggregated.
3. The aggregated reviews are reinterpreted into individual recommendations for each requester (according to the requester’s preference inferred from (1)).

There are two key properties of such a system. First, if a requester gives uninformative reviews, e.g., all five-star ratings, then the requester gets back uninformative recommendations, e.g., all five-star ratings. Second, uninformative reviews, such as all five-star ratings, will not dilute the informative reviews once they are normalized and aggregated (dilution would make the aggregate reviews less informative).
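The three steps and the two properties above can be illustrated with a very small sketch. It assumes z-score normalization per requester and a plain average as the aggregate; both are our choices for illustration, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def normalize(reviews):
    """Z-score one requester's reviews; constant reviews carry no signal."""
    scores = list(reviews.values())
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return {}  # e.g. all five stars: nothing to contribute
    return {worker: (s - mu) / sigma for worker, s in reviews.items()}

def aggregate(all_reviews):
    """Average the normalized scores of each worker across requesters."""
    pooled = {}
    for reviews in all_reviews.values():
        for worker, z in normalize(reviews).items():
            pooled.setdefault(worker, []).append(z)
    return {worker: mean(zs) for worker, zs in pooled.items()}

def recommend(requester, all_reviews):
    """Re-express the aggregate scores on the requester's own rating scale."""
    own = list(all_reviews[requester].values())
    mu, sigma = mean(own), pstdev(own)
    return {w: mu + sigma * z for w, z in aggregate(all_reviews).items()}
```

With this sketch, a requester who rates everyone five stars has zero spread, so every recommendation comes back as five stars, and that requester’s reviews are dropped from the aggregate rather than pulling all workers toward the same score.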

The Netflix recommendation system is a canonical example. Users enjoy recommendations based on data contributed by others, and their own reviews are used in collaborative filtering to generate personalized recommendations (cf. Ekstrand et al., 2011). Such an approach is based on the idea of decomposing a large but sparse rating matrix into three components: R (ratings) = U (users) × Σ (latent variables) × M (movies). This decomposition helps identify hidden similarity features that any two users (rating providers) may share. It is therefore in a user’s best interest to reveal his preferences truthfully, since doing so improves the accuracy of his recommendations.
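To see how the three factors combine, the toy example below rebuilds a rating matrix from hand-picked factors. All numbers are hypothetical, and unlike a true singular value decomposition the factors here are not orthonormal; the point is only to show that users with the same latent profile receive the same predicted ratings.

```python
def reconstruct(U, sigma, M):
    """Return R where R[i][j] = sum over k of U[i][k] * sigma[k] * M[k][j]."""
    return [[sum(U[i][k] * sigma[k] * M[k][j] for k in range(len(sigma)))
             for j in range(len(M[0]))]
            for i in range(len(U))]

# Hypothetical factors: two latent dimensions, three users, three movies.
U = [[1, 0], [1, 0], [0, 1]]   # users 0 and 1 share the same latent taste
sigma = [5, 4]                 # weight of each latent dimension
M = [[1.0, 0.8, 0.2],          # how strongly each movie loads on
     [0.2, 0.5, 1.0]]          # each latent dimension

R = reconstruct(U, sigma, M)   # one row of predicted ratings per user
```

Users 0 and 1 end up with identical rows in `R`, so a rating observed for one of them is informative about the other.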

Figure: Collaborative filtering based movie recommendation

Figure: Collaborative filtering based worker recommendation

Collaborative filtering is a powerful tool to identify the most representative latent dimensions that can best describe the rating matrix. This decomposition to a great extent helps to characterize useful attributes that can best “predict” a future rating (or reputation). While collaborative filtering methods like the one discussed above do not provide a natural labeling of the latent factors, it would perhaps be interesting to combine collaborative filtering with more sophisticated data mining approaches to obtain natural labeling for discovered latent dimensions.

Another benefit of using a collaborative filtering approach is that it can improve the expressiveness of a reputation system. So far most reputation systems rely on a single (and often naive) dimension for scoring agents, e.g., accomplishment rate for AMT. But there are potentially many other dimensions that help determine such a “reputation score”.

Naturally, the above procedure faces the same challenges as classical recommender systems. First, a recommendation-based reputation system needs to deal with the cold-start problem when a new worker arrives (cf. Sedhain et al., 2014). This may, to a certain degree, discourage new users. Second, the accuracy of the above approach depends on various modeling assumptions. When a model works and when it doesn’t are interesting questions for future studies.

2.1 Theoretical Approaches

We provide some initial thoughts on how one can formally approach reputation systems as a kind of recommender system.

A first question to solve is an offline learning problem: Given the reviews acquired so far (say, requesters reviewing workers), how can we predict which matches of requesters to workers would be valuable in the future? One could cast this as a matrix completion problem and apply collaborative filtering techniques to predict the “missing entries”, i.e. reviews that would be given for unknown matches (cf. Koren et al., 2009). Another approach would be to apply techniques from crowdsourcing for inferring underlying parameters given responses on tasks (cf. Moon, 1996; Karger et al., 2014). If we can infer parameters for reviewer and worker preferences and skills, perhaps we can use these for prediction.
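As a deliberately tiny stand-in for the matrix completion methods cited above, the sketch below fits a rank-one model by alternating least squares and uses it to fill in missing reviews. The rank-one restriction, the function name, and the convention of marking missing reviews with `None` are our assumptions; it also assumes every requester and every worker has at least one observed, nonzero review.

```python
def rank_one_complete(R, iterations=50):
    """Fill missing entries (None) of R with a rank-one fit u[i] * v[j]."""
    n_rows, n_cols = len(R), len(R[0])
    u, v = [1.0] * n_rows, [1.0] * n_cols
    for _ in range(iterations):
        # Least-squares update of each requester factor, holding v fixed.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if R[i][j] is not None]
            u[i] = (sum(R[i][j] * v[j] for j in obs)
                    / sum(v[j] ** 2 for j in obs))
        # Least-squares update of each worker factor, holding u fixed.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if R[i][j] is not None]
            v[j] = (sum(R[i][j] * u[i] for i in obs)
                    / sum(u[i] ** 2 for i in obs))
    return [[R[i][j] if R[i][j] is not None else u[i] * v[j]
             for j in range(n_cols)]
            for i in range(n_rows)]
```

For example, if requester 0 has scored the first two workers exactly as requester 2 did, the fit predicts that requester 0 would also score the third worker as requester 2 did. Real review data would need more latent dimensions and regularization, as in Koren et al. (2009).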

A next step is to make this problem dynamic. A system obtains new reviews over time and can make or influence future assignments of workers to requesters. There is an explore-exploit tradeoff because the system may benefit in the long run from initially making suboptimal assignments in order to learn about skills and preferences.  (In some systems an initial questionnaire could directly elicit preferences.)  Here, an interesting challenge is to combine algorithmic approaches such as bandit learning or active learning with the above inference algorithms (cf. Bresler et al., 2015).
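The explore-exploit tradeoff can be illustrated with an epsilon-greedy bandit, one of the simplest bandit strategies. This is not the approach of Bresler et al.; it is a minimal sketch that assumes a single requester, binary ratings, and a fixed hidden quality per worker, all of which are simplifications of ours.

```python
import random

def epsilon_greedy(true_quality, rounds=2000, epsilon=0.1, seed=0):
    """Assign one worker per round; return assignment counts and mean ratings."""
    rng = random.Random(seed)
    workers = list(true_quality)
    counts = {w: 0 for w in workers}
    means = {w: 0.0 for w in workers}
    for _ in range(rounds):
        if rng.random() < epsilon:
            worker = rng.choice(workers)                    # explore
        else:
            worker = max(workers, key=lambda w: means[w])   # exploit
        # Simulated binary rating drawn from the worker's hidden quality.
        reward = 1.0 if rng.random() < true_quality[worker] else 0.0
        counts[worker] += 1
        means[worker] += (reward - means[worker]) / counts[worker]
    return counts, means
```

The occasional random assignments are the "suboptimal" choices mentioned above: they cost a little in the short run but let the system discover a good worker it would otherwise never try.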

Finally, further study of the incentives of recommender systems is warranted. Such a study would need to explicitly model the utility of a requester in terms of the accuracy of the reviews provided and the recommendations received. See Jurca and Faltings (2003) and Dasgupta and Ghosh (2013) for initial work in this area.

3. Feedback Elicitation

Another approach to reversing the inequality (marginal cost greater than marginal value) is to increase the informativeness of elicited feedback while maintaining or lowering the marginal cost of providing it.

In many reputation systems, users are asked to provide a numerical rating or score for sellers or service providers that they have interacted with. If there is no option to opt out of providing a rating, arguably the strategy of always giving a five-star rating without regard to the actual experience is as costless as possible for the users. This strategy, however, leads to completely uninformative ratings that defeat the purpose of eliciting feedback.

An alternative approach would be to ask a requester to rank the past three workers that he has interacted with. The requester no longer has the option of saying that all workers are excellent and hence is more likely to provide a ranking that’s closer to his true experiences.

There are two reasons why solicitation of rankings may lead to better outcomes than solicitation of scores. The first reason is that humans find ranking easier than scoring (Miller, 1956, 1994). The second reason is that requesters may differ in their perception of the magnitudes of the qualities of workers or may prefer to exaggerate the quality of workers to avoid retribution or costly disputes. For example, Frankel (2014) studied a related delegation problem and showed that, under natural assumptions, soliciting ranking information is optimal when requesters would otherwise have incentives to misreport scores. More generally, the approach of ranking is related to the “linking decisions” idea in economics (cf. Jackson and Sonnenschein, 2007). When requesters are asked to score workers, they make individual decisions on each worker; but when requesters rank workers, their decisions are linked.
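To make the contrast with scoring concrete, the sketch below accepts only strict rankings, so "everyone is excellent" cannot be expressed, and combines them with a Borda count. The Borda count is our choice purely for illustration; nothing above prescribes how rankings should be aggregated.

```python
def borda_scores(rankings):
    """Sum Borda points: in a ranking of k workers, position p earns k - 1 - p."""
    scores = {}
    for ranking in rankings:
        # A ranking links the decisions: each worker appears once, no ties.
        if len(set(ranking)) != len(ranking):
            raise ValueError("a ranking must list each worker once (no ties)")
        k = len(ranking)
        for position, worker in enumerate(ranking):
            scores[worker] = scores.get(worker, 0) + (k - 1 - position)
    return scores
```

Each requester hands out the same fixed set of points, so placing one worker higher necessarily places another lower, which is the sense in which the decisions are linked.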

Changing the way we ask for feedback also brings up interesting interface questions. While conceivably it may be easier for a requester to compare two workers than to score them, how about three workers or five workers? Time adds further complications, as people may not remember their past experiences well. We think there is an interesting research agenda here in trying to understand the impact of interface design on the informativeness vs. cost tradeoff for eliciting feedback. The literature on ranking in peer grading may be a useful starting point (cf. Raman and Joachims, 2014).

1 For simplicity of exposition, we speak exclusively of a worker’s reputation, but our ideas also apply to the equally important problem of requester reputations.

Designing More Informative Reputation Systems was one of the group projects pursued at the CMO-BIRS 2016 WORKSHOP ON MODELS AND ALGORITHMS FOR CROWDS AND NETWORKS.

References:

1. Kandori, Michihiro. “Social norms and community enforcement.” The Review of Economic Studies 59.1 (1992): 63-80.
2. Horton, John J., and Joseph M. Golden. “Reputation Inflation: Evidence from an Online Labor Market.” 2015. http://econweb.tamu.edu/common/files/workshops/Theory%20and%20Experimental%20Economics/2015_3_5_John_Horton.pdf.
3. S.S. Gaikwad, D. Morina, A. Ginzberg, C. Mullings, S. Goyal, D. Gamage, C. Diemert, M. Burton, S. Zhou, M. Whiting, K. Ziulkoski, A. Ballav, A. Gilbee, S.S. Niranga, V. Sehgal, J. Lin, L. Kristianto, A. Richmond-Fuller, J. Regino, N. Chhibber, D. Majeti, S. Sharma, K. Mananova, D. Dhakal, W. Dai, V. Purynova, S. Sandeep, V. Chandrakanthan, T. Sarma, S. Matin, A. Nassar, R. Nistala, A. Stolzoff, K. Milland, V. Mathur, R. Vaish, and M.S. Bernstein (2016) Boomerang: Rebounding the Consequences of Reputation Feedback on Crowdsourcing Platforms. To appear in UIST-16.
4. Ekstrand, Michael D., John T. Riedl, and Joseph A. Konstan. “Collaborative filtering recommender systems.” Foundations and Trends in Human-Computer Interaction 4.2 (2011): 81-173.
5. Suvash Sedhain, Scott Sanner, Darius Braziunas, Lexing Xie, and Jordan Christensen. 2014. Social collaborative filtering for cold-start recommendations. In Proceedings of the 8th ACM Conference on Recommender systems (RecSys ’14). ACM, New York, NY, USA, 345-348. DOI=http://dx.doi.org/10.1145/2645710.2645772
6. Koren, Yehuda, Robert Bell, and Chris Volinsky. “Matrix factorization techniques for recommender systems.” Computer 42.8 (2009): 30-37.
7. Moon, Todd K. “The expectation-maximization algorithm.” IEEE Signal processing magazine 13.6 (1996): 47-60.
8. Karger, David R., Sewoong Oh, and Devavrat Shah. “Budget-optimal task allocation for reliable crowdsourcing systems.” Operations Research 62.1 (2014): 1-24.
9. Bresler et al. “Regret Guarantees for Item-Item Collaborative Filtering.” http://arxiv.org/abs/1507.05371 (applies a bandit algorithm to choose what rating to ask for).
10. Jurca, Radu, and Boi Faltings. “An incentive compatible reputation mechanism.” E-Commerce, 2003. CEC 2003. IEEE International Conference on. IEEE, 2003.
11. Dasgupta, Anirban, and Arpita Ghosh. “Crowdsourced judgement elicitation with endogenous proficiency.” Proceedings of the 22nd international conference on World Wide Web. ACM, 2013.
12. Jackson and Sonnenschein (2007), Overcoming Incentive Constraints by Linking Decisions. Econometrica.
13. Frankel (2014), Aligned Delegation. American Economic Review.
14. Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological review, 63(2), 81.
15. Miller, G. A. (1994). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological review, 101(2), 343.
16. Raman, K., & Joachims, T. (2014, August). Methods for ordinal peer grading. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1037-1046). ACM.