Machine Learning for Dummies

By Eric Holloway and Robert J. Marks II

Dummies Commentary
Humans are slow and sloppy, so why would we want human-guided machine learning?

Since the 1970s, we’ve known that humans can find approximate solutions to NP-complete (really hard) problems more efficiently than the best algorithms (Krolak et al. 1971). The best algorithms scale quadratically with problem size, while humans scale linearly (Dry et al. 2006). We also know that many of the most widely used and successful machine learning algorithms are NP-hard to train optimally (Dietterich 2000). This suggests that a human/machine hybrid can produce better models than machine learning alone.

Polynomial regression of human solution times against problem size (Dry et al. 2006).

However, human interaction brings its own problems. The hardest is visualization: it is difficult to visualize data with more than three dimensions. So, we perform dimensionality reduction by projecting the data onto two dimensions. We then collect many weak models (humans drawing boxes) from multiple projections and combine them into a strong model, a technique known in machine learning as boosting.
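To make the weak-model idea concrete, here is a minimal sketch of our reading of this step: a random 2-D projection plus one human-drawn, axis-aligned box treated as a classifier. The data, projection, and box below are placeholders, not the paper’s.

```python
# A minimal sketch (not the authors' code) of one weak model:
# project high-dimensional data to 2-D, then treat a human-drawn,
# axis-aligned box as a classifier over the projected points.
import numpy as np

rng = np.random.default_rng(0)

def project_2d(X, rng):
    """Project high-dimensional data onto a random orthonormal 2-D plane."""
    P, _ = np.linalg.qr(rng.normal(size=(X.shape[1], 2)))
    return X @ P, P

def box_model(Z, box):
    """Weak model: 1 for points inside the human-drawn box, else 0."""
    (xmin, ymin), (xmax, ymax) = box
    return ((Z[:, 0] >= xmin) & (Z[:, 0] <= xmax) &
            (Z[:, 1] >= ymin) & (Z[:, 1] <= ymax)).astype(float)

X = rng.normal(size=(200, 10))            # placeholder 10-dimensional data
Z, P = project_2d(X, rng)
weak = box_model(Z, ((-1, -1), (1, 1)))   # one hypothetical worker-drawn box
```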

User interface for the Amazon Mechanical Turk HIT.

Combining crowdsourcing and boosting, we use Amazon’s Mechanical Turk to collect the models. The models transform the data into a feature space, and we then use linear regression to classify new data.
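Continuing the sketch above, the collected boxes can be stacked into a 0/1 feature space and fit by ordinary least squares, thresholded to classify. The projections, boxes, and labels below are stand-ins for those collected from Mechanical Turk workers.

```python
# A sketch of the feature-space step: each (projection, box) pair from
# the crowd contributes one 0/1 feature column; a linear regression
# over those columns, thresholded at 0.5, classifies new points.
import numpy as np

rng = np.random.default_rng(1)

def to_features(X, models):
    """Each (projection, box) pair contributes one 0/1 feature column."""
    cols = []
    for P, ((xmin, ymin), (xmax, ymax)) in models:
        Z = X @ P
        cols.append(((Z[:, 0] >= xmin) & (Z[:, 0] <= xmax) &
                     (Z[:, 1] >= ymin) & (Z[:, 1] <= ymax)).astype(float))
    return np.column_stack(cols)

X = rng.normal(size=(200, 10))                 # placeholder data
y = (X[:, 0] + X[:, 1] > 0).astype(float)      # placeholder labels
models = []                                    # stand-ins for Turk boxes
for _ in range(20):
    P, _ = np.linalg.qr(rng.normal(size=(10, 2)))
    box = (tuple(rng.uniform(-2, 0, 2)), tuple(rng.uniform(0, 2, 2)))
    models.append((P, box))

F = np.column_stack([np.ones(len(X)), to_features(X, models)])
w, *_ = np.linalg.lstsq(F, y, rcond=None)      # linear regression fit
pred = (F @ w >= 0.5).astype(int)              # threshold to classify
```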

Linear regression classification of human-produced features.

We test the human/machine hybrid on one artificial dataset and four real-world datasets, all with ten or more dimensions. The hybrid is competitive with machine-only linear regression on the untransformed data.

Results of linear regression classification using the machine alone and using the human/machine hybrid.

You can read more about our work in the HCOMP 2016 paper High Dimensional Human Guided Machine Learning.

A demo is available for a limited time.


Krolak, P., Felts, W., & Marble, G. (1971). A man-machine approach toward solving the traveling salesman problem. Communications of the ACM, 14(5), 327-334.

Dry, M., Lee, M. D., Vickers, D., & Hughes, P. (2006). Human performance on visually presented traveling salesperson problems with varying numbers of nodes. The Journal of Problem Solving, 1(1), 4.

Dietterich, T. G. (2000, June). Ensemble methods in machine learning. In International workshop on multiple classifier systems (pp. 1-15). Springer Berlin Heidelberg.


Integrated crowdsourcing helps volunteer-based communities get work done

by Pao Siangliulue, Joel Chan, Steven P. Dow, and Krzysztof Z. Gajos

We are working on crowdsourcing techniques to support volunteer communities (rather than to get work done with paid workers). In these communities, it can be infeasible or undesirable to recruit external paid workers. For example, nonprofits may lack the funds to pay external workers. In other communities, such as crowdsourced innovation platforms, issues of confidentiality or intellectual property rights may make it difficult to involve external workers. Further, the volunteers themselves often bring levels of motivation and expertise that are valuable for the crowdsourcing tasks. In these scenarios, it may be better to leverage the volunteers themselves rather than external workers.

A key challenge in leveraging the volunteers themselves is that any reasonably complex activity (like collaborative ideation) includes exciting tasks alongside other work that is equally important but less interesting. In a paper to be presented at UIST’16, we demonstrate an integrated crowdsourcing approach that seamlessly folds a potentially tedious secondary task (e.g., analyzing semantic relationships among ideas) into the more intrinsically motivated primary task (e.g., idea generation). When the secondary task was seamlessly integrated with the primary task, our participants did as good a job on it as crowds hired for money. They also reported the same levels of enjoyment as when working just on the fun task of idea generation.

Our IdeaHound system embodies the integrated crowdsourcing approach in support of online collaborative ideation communities. This is how it works: the main element of the IdeaHound interface (below) is a large whiteboard. Participants spatially arrange their own ideas and the ideas of others on it, because arranging inspirational material helps their own ideation process.

The main interface of the IdeaHound system.

Their spatial arrangements also serve as input to a machine learning algorithm (t-SNE) that constructs a model of semantic relationships among ideas:

The computational approach behind IdeaHound
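As a rough illustration of this pipeline (not the IdeaHound implementation), one can treat the distance between two ideas on a participant’s whiteboard as a noisy semantic distance, average those distances across participants, and hand the result to t-SNE as a precomputed distance matrix. All data and names below are synthetic placeholders.

```python
# A minimal sketch of the idea: average per-participant layout
# distances into one matrix, then embed with t-SNE (precomputed metric).
import numpy as np
from sklearn.manifold import TSNE

def layout_distances(board, n_ideas):
    """board: {idea_id: (x, y)} for the ideas one participant arranged."""
    D = np.full((n_ideas, n_ideas), np.nan)
    for i, (xi, yi) in board.items():
        for j, (xj, yj) in board.items():
            D[i, j] = np.hypot(xi - xj, yi - yj)
    return D

rng = np.random.default_rng(0)
n = 30
# Placeholder: 5 participants, each arranging a random subset of ideas.
boards = [{int(i): tuple(rng.uniform(0, 100, size=2))
           for i in rng.choice(n, size=15, replace=False)}
          for _ in range(5)]

D = np.nanmean([layout_distances(b, n) for b in boards], axis=0)
D[np.isnan(D)] = np.nanmax(D[~np.isnan(D)])  # never co-arranged: assume far
np.fill_diagonal(D, 0.0)
D = (D + D.T) / 2                            # enforce symmetry

coords = TSNE(n_components=2, metric="precomputed",
              init="random", perplexity=10).fit_transform(D)
```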

In several talks last fall, we referred to this approach as “organic” crowdsourcing, but the term proved confusing and contentious.

Several past projects (e.g., Crowdy and the User-Powered ASL Dictionary) embedded work tasks into casual learning activities. Our work shows how to generalize the concept to a domain where integration is more difficult.

You can learn more by attending Pao’s presentation at UIST next week in Tokyo, or you can read the paper:

Pao Siangliulue, Joel Chan, Steven P. Dow, and Krzysztof Z. Gajos. IdeaHound: improving large-scale collaborative ideation with crowd-powered real-time semantic modeling. In Proceedings of UIST ’16, 2016.

Hey Twitter crowd … What else is there?

Journalists and news editors use Twitter to contextualize and enrich their articles by examining the public response, from comments and opinions to pointers to related news. This is possible because some users on Twitter devote a substantial amount of time and effort to news curation: carefully selecting and filtering news stories highly relevant to specific audiences.

We developed an automatic method that groups together all the users who tweet a particular news item, and later detects new content posted by them that is related to the original news item.

We call each such group a transient news crowd. The beauty of this approach, in addition to being fully automatic, is that there is no need to pre-define topics, and the crowd becomes available immediately, allowing journalists to cover news beats while following the shifting interests of their audiences.

Figure 1. Detection of follow-up stories related to a published article, using the crowd of users that tweeted the article.

Transient news crowds

We define the crowd of a news article as the set of users who tweeted the article within the first six hours after it was published. We then followed the users in each crowd for one week, recording every public tweet they posted during this period. We used Twitter data around news stories published by two prominent international news portals: BBC News and Al Jazeera English.
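In code, the crowd definition is straightforward. The sketch below assumes simple tweet records with user, time, and URL fields; the field names are ours, not from the papers.

```python
# A toy sketch of crowd construction under the definition above:
# the crowd is everyone who tweeted the article in its first 6 hours,
# and we then keep one week of their subsequent public tweets.
from datetime import timedelta

def build_crowd(article_url, published_at, tweets):
    """tweets: iterable of dicts like {'user': ..., 'time': ..., 'urls': [...]}."""
    cutoff = published_at + timedelta(hours=6)
    return {t["user"] for t in tweets
            if article_url in t["urls"] and published_at <= t["time"] <= cutoff}

def crowd_tweets(crowd, published_at, tweets):
    """All public tweets posted by crowd members in the following week."""
    end = published_at + timedelta(days=7)
    return [t for t in tweets
            if t["user"] in crowd and published_at <= t["time"] <= end]
```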

What did we find?

  • After they tweet a news article, people’s subsequent tweets are correlated with that article for a brief period of time.
  • The correlation is weak but significant, in that it reflects the similarity between the articles that originated the crowds.
  • While the majority of crowds simply disperse over time, parts of some news crowds come together again around new newsworthy events.

Crowd summarisation

We illustrate the outcome of our automatic method with the article Central African rebels advance on capital, posted on Al Jazeera on 28 December 2012.

Figure 2. Word clouds generated for the crowd of the article “Central African rebels advance on capital”, considering the terms appearing in stories filtered by our system (top) and in the most frequently posted stories (bottom).

Without our method (bottom of the figure), we obtain frequently posted articles that are weakly related or not related at all to the original news article. With our method (top of the figure), we observe several follow-up articles to the original one. Four days after the news article was published, several members of the crowd tweeted an article about the rebels considering a coalition offer. Seven days after publication, crowd members posted that the rebels had stopped advancing towards Bangui, the capital of the Central African Republic.
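The papers below describe the actual filtering model; as a generic stand-in, one could score each later story by TF-IDF cosine similarity to the original article and keep only high-scoring ones. The threshold and texts here are illustrative assumptions.

```python
# An illustrative relatedness filter (not the papers' model): rank
# candidate stories by TF-IDF cosine similarity to the original article.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def related_stories(article_text, candidate_texts, threshold=0.2):
    vec = TfidfVectorizer(stop_words="english")
    M = vec.fit_transform([article_text] + candidate_texts)
    sims = cosine_similarity(M[0], M[1:]).ravel()
    return [(s, t) for s, t in zip(sims, candidate_texts) if s >= threshold]

# Example: filter crowd members' later posts against the original article.
hits = related_stories(
    "Central African rebels advance on capital",
    ["Rebels considering a coalition offer in the Central African Republic",
     "Celebrity gossip entirely unrelated to the story"])
```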

You can find more details in our papers:

  • Janette Lehmann, Carlos Castillo, Mounia Lalmas and Ethan Zuckerman: Transient News Crowds in Social Media. Seventh International AAAI Conference on Weblogs and Social Media (ICWSM), Massachusetts, 2013.
  • Janette Lehmann, Carlos Castillo, Mounia Lalmas and Ethan Zuckerman: Finding News Curators in Twitter. WWW Workshop on Social News on the Web (SNOW), Rio de Janeiro, Brazil, 2013.

Janette Lehmann, Universitat Pompeu Fabra
Carlos Castillo, Qatar Computing Research Institute
Mounia Lalmas, Yahoo! Labs Barcelona
Ethan Zuckerman, MIT Center for Civic Media

Predicting How Online Creative Collaborations Form and Succeed

What leads to a successful creative collaboration? Be it music, movies, or multimedia… collaborative online communities are springing up around all sorts of shared artistic interests. Even more exciting, these communities offer new opportunities to study creativity and collaboration through their social dynamics.

A screenshot of the FAWM.ORG songwriting community website in 2013, listing more than 800 collaborations. More than 8% of all compositions are collaborations.

We examine collaboration in February Album Writing Month (FAWM), an online music community. The annual goal for each member is to write 14 new songs in the 28 days of February. In previous work we found that FAWM newcomers who collab in their first year are more likely to (1) write more songs, (2) reach the 14-song goal, (3) give feedback to others, and (4) donate money to support the site. Given the individual and group benefits, we sought to better understand what factors lead to successful collabs.

By combining traditional member surveys with a computational analysis over four years of archival log data, we were able to extend several existing social theories about collaboration in nuanced and interesting ways. A few of our main findings are:

  1. Collabs form out of shared interests but different backgrounds. Theory predicts that people work with others who share their interests. But we found that, for example, a heavy-metal songwriter is less likely to collab with another metalhead than with, say, a jazz pianist (who enjoys head-banging on occasion).
  2. Collabs are associated with small status differences. Existing theory also predicts that people tend to work with others of the same social status. In our study, members teamed up with folks of slightly different status more often than those of identical status. (There are several explanations, ranging from newcomer socialization to hero-worship.)
  3. A balanced effort is most enjoyable for both participants. The “social loafing” literature suggests that people are disappointed by collabs when their partner is a slacker. However, we found that the slackers themselves were disappointed, too.

To top it all off, the novel path-based regression model we use is significantly better than standard techniques at predicting which new collabs will form (see the graphs below). This has exciting implications for recommender systems and other socio-technological tools to help improve collaboration in online creative communities.

ROC and precision-recall curves comparing our path-based regression model to standard link prediction methods. Our model provides both accurate predictions and insights into social theory.
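The full path-based model is described in the paper; the toy sketch below only illustrates the general idea of link prediction from path-count features, using a synthetic graph. The features, labels, and setup are our assumptions, not the paper’s.

```python
# A toy illustration of path-based link prediction (not the paper's
# exact model): score a candidate pair by counts of short paths
# between them, then fit a logistic regression over those features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 60
A = np.triu((rng.random((n, n)) < 0.08).astype(int), 1)
A = A + A.T                                   # undirected, no self-loops

A2, A3 = A @ A, A @ A @ A                     # length-2 / length-3 path counts
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
X = np.array([[A2[i, j], A3[i, j]] for i, j in pairs])
y = np.array([A[i, j] for i, j in pairs])     # toy labels; a real study
                                              # would hold out future edges

clf = LogisticRegression().fit(X, y)
scores = clf.predict_proba(X)[:, 1]           # rank candidate collabs
```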

For more, please see our full paper Let’s Get Together: The Formation and Success of Online Creative Collaborations.

Burr Settles, CMU Machine Learning Department (now with Duolingo)
Steven Dow, CMU Human Computer Interaction Institute

Link Spam on Wikis: Attack Models and Mitigation

Andrew G. West et al. (see footer), University of Pennsylvania

Much research on abusive behaviors in wiki environments has focused on poorly motivated cases of vandalism. However, a recent line of research at the University of Pennsylvania has instead concentrated on link spam in such environments. Presuming link spammers are well incentivized (by profit), the authors expected to discover more technically sophisticated and evasive behaviors. Surprisingly, optimal spam behaviors were absent, motivating the authors to propose a novel (and controversial) attack model. After statistical estimation found the attack could prove economically viable, a machine-learning classifier was created to address the vulnerability. In testing on English Wikipedia, the feature-rich model autonomously identified 64% of status quo spam at a 0.5% false-positive rate. Moreover, the technique has a live implementation that intelligently routes humans to probable link spam.
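To give a feel for the reported operating point, here is a hedged sketch of how a decision threshold can be tuned to hold the false-positive rate near 0.5%. The scores below are synthetic stand-ins for classifier output, not the paper’s data.

```python
# A sketch of threshold selection for a target false-positive rate:
# choose the cutoff so that ~0.5% of known-good edits get flagged,
# then measure what fraction of known spam that cutoff catches.
import numpy as np

def threshold_for_fpr(scores_ham, target_fpr=0.005):
    """Choose the score cutoff so ~target_fpr of non-spam is flagged."""
    return float(np.quantile(scores_ham, 1.0 - target_fpr))

# Placeholder scores standing in for classifier output on labeled edits.
rng = np.random.default_rng(0)
scores_ham = rng.beta(2, 8, size=10_000)    # non-spam edits: low scores
scores_spam = rng.beta(6, 3, size=500)      # spam edits: higher scores

t = threshold_for_fpr(scores_ham)
recall = (scores_spam >= t).mean()          # fraction of spam caught
print(f"threshold={t:.3f}, spam caught={recall:.1%}")
```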
