Why Online Knowledge Markets Can Fail at Scale

Why did Yahoo Answers fail? Why is Stack Overflow declining? Will Quora survive in 2020?

When are community question answering (CQA) platforms like StackExchange sustainable? In our recent paper at The Web Conference, we investigate CQA sustainability from an economic perspective, providing insights into their successes and failures through an interpretable model. Our key idea is to interpret CQA platforms as markets where participants exchange knowledge. We use this “knowledge market” perspective to analyze the question answering sites on StackExchange, and show that content generation in StackExchanges can be captured by the Cobb-Douglas model, a production model from economics. Our model provides an intuitive explanation for the successes and failures of knowledge markets through the concept of economies and diseconomies of scale.

Figure 1: The Cobb-Douglas model can predict economies (or diseconomies) of scale, that is, whether the ratio of answers to questions will increase (or decrease) as the number of users grows. We present the economies and diseconomies of scale in three StackExchanges: SUPERUSER (strong diseconomies), PUZZLING (weak economies), and CSTHEORY (strong economies). Most StackExchanges exhibit diseconomies of scale.

Modelling Content Generation in Knowledge Markets

In a knowledge market, users generate different types of content, such as questions, answers, and comments. Content in these markets exhibits dependencies; for example, answers depend on questions, and comments depend on questions and answers. We combine user participation and content dependency to model content generation in knowledge markets. To this end, we use production functions, which are a natural way to model output in a market. These functions involve two components: a basis function (Figure 2) and an interaction type (Figure 3). We consider three possible basis functions: exponential, power, and sigmoid; and four possible interaction types: essential, interactive essential, antagonistic, and substitutable. Among these choices, we found that the combination of the power basis and the interactive essential interaction provides the best fit to our dataset across all StackExchanges. In economics, this combination is known as the Cobb-Douglas model.
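To make the model concrete, here is a minimal sketch (not the paper's actual pipeline) of fitting a Cobb-Douglas production function to knowledge-market data: we assume answers A are produced from questions Q and active answerers U as A = c · Q^α · U^β, take logs, and solve an ordinary least-squares problem. The variable choices and the toy data below are illustrative assumptions.

```python
# A minimal sketch (not the authors' exact pipeline) of fitting a
# Cobb-Douglas production function A = c * Q^alpha * U^beta, where
# A = answers, Q = questions, and U = active answerers per period.
# Taking logs turns the fit into ordinary least squares.
import numpy as np

def fit_cobb_douglas(questions, answerers, answers):
    """Return (c, alpha, beta) for answers ~ c * questions^alpha * answerers^beta."""
    X = np.column_stack([
        np.ones(len(answers)),   # intercept -> log(c)
        np.log(questions),       # exponent alpha
        np.log(answerers),       # exponent beta
    ])
    y = np.log(answers)
    (log_c, alpha, beta), *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.exp(log_c), alpha, beta

# Toy data: a community whose output grows sublinearly with its inputs.
rng = np.random.default_rng(0)
Q = rng.uniform(100, 5000, size=200)
U = rng.uniform(50, 2000, size=200)
A = 0.8 * Q**0.7 * U**0.2 * rng.lognormal(0.0, 0.1, size=200)

c, alpha, beta = fit_cobb_douglas(Q, U, A)
# In the standard Cobb-Douglas reading, alpha + beta > 1 corresponds to
# increasing returns to scale (economies), and alpha + beta < 1 to
# decreasing returns (diseconomies).
print(c, alpha, beta, "returns to scale:", alpha + beta)
```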

Three Basis Functions
Figure 2: A basis function captures the relationship between an input and an output, e.g., how the number of questions affects the number of answers.

Four Interaction Types
Figure 3: An interaction type captures how different inputs interact to produce an output, e.g., how the number of questions and the number of answerers interact to generate the number of answers.

Key Findings

Our model asserts that, for most StackExchanges, aggregate user behavior changes with the size of the user community. Figure 4 shows how answering behavior in the ANDROID StackExchange changes with community size. Specifically, as the community grows, users who join more recently tend to contribute fewer answers than those who joined at an early stage. This phenomenon holds for most StackExchanges, with varying strength; larger communities tend to behave similarly to ANDROID. Why do we observe such a size-dependent distribution of user behavior? It turns out that, in StackExchanges, there is a core group of users who contribute a substantial share of the answers over a long period. The size of this core does not grow in proportion to the community, which produces the size-dependent distribution and, in turn, diseconomies of scale: the ratio of answers to questions declines as the community grows.

Figure 4: Size-dependent distribution of answering behavior in the ANDROID StackExchange. Users who join more recently tend to contribute fewer answers than those who joined at an early stage.

For more, see our full paper.
Himel Dev, University of Illinois at Urbana–Champaign

Alleviating Labor Conditions for Content Moderators: Obfuscating Images in Crowdsourced Image Moderation

If one were to figure out how to resolve the Turing test for social media content moderation, a lot of money could be made. Thus far, however, algorithmic solutions through machine learning and artificial intelligence are not sufficient to automatically filter all content that users post on the internet. As a result, social media companies have resorted to hiring human reviewers or outsourcing these tasks to online labor markets such as crowdsourcing platforms or third-party companies.

While platforms worry about hate speech, graphic images, and other content that might not align with community standards, chiefly to avoid upsetting their users and to provide benign environments for advertising, there has recently been increased interest in the (unseen) labor that content moderators perform in order to uphold these content standards.

image obfuscation for content moderation

Recent conferences and workshops at UCLA, Santa Clara University, the University of Southern California, and the Alexander von Humboldt Institute for Internet and Society in Berlin have brought together researchers, industry practitioners, and actual moderators to discuss the ethical and professional ramifications of a job that is still in formation. Additionally, an increasing number of dissertations, news articles, books, and documentary films have begun to focus on this issue.

In a court case filed with the King County Superior Court in Seattle, Washington, the plaintiffs Soto and Blauert are suing a major corporation for the alleged development of post-traumatic stress disorder (PTSD) caused by exposure to materials, such as child pornography, that they encountered as part of their work as content moderators. Existing research often invokes the precarious labor conditions of content moderators but rarely provides empirical underpinnings. Ethnographic research is one attempt to get at the core of moderators' labor conditions; experimental research such as our study is another.

Workers in commercial content moderation settings include hired employees of social media companies, crowd workers from online labor markets such as Amazon Mechanical Turk and Figure Eight (formerly known as CrowdFlower), and contractors from specialized third-party companies. In our study, we investigate how the working conditions of these workers can be improved while maintaining the precision of human judgment.

The question that guides our research in this paper is: How can we reveal the minimum amount of information to a human reviewer such that an objectionable image can still be correctly identified?

We do this by developing a special moderation interface for crowdsourced image moderation on Amazon Mechanical Turk. We collect a set of images, both “safe for work” and “not safe for work” (i.e., graphic content), via Google Images and task crowd workers with moderating these images at varying degrees of blurriness. Blurring eliminates low-level pixel details while keeping images recognizable enough to moderate accurately. In case an image is too heavily obfuscated, we additionally provide tools for workers to partially reveal blurred regions, helping them complete their task while still shielding them from the majority of the image contents. Beyond merely reducing exposure, putting finer-grained tools in the hands of workers gives them a higher degree of control over their exposure: they determine how much they see, when they see it, and for how long. In conducting this study, we aim to (1) gauge at what degrees of obfuscation moderators can still sufficiently discern content, and (2) identify whether obfuscation can improve emotional well-being in content moderation processes.
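As a rough illustration of the obfuscation mechanism (not the study's actual interface code), the sketch below uses Pillow to produce several blurred versions of an image and to partially reveal a region on request; the blur radii, region coordinates, and file name are hypothetical.

```python
# A minimal sketch, assuming Pillow is available: blur an image at
# several radii and paste back an un-blurred region on demand,
# simulating the "partial reveal" tool described above.
from PIL import Image, ImageFilter

def blurred_versions(path, radii=(4, 8, 16, 32)):
    """Return {radius: blurred image} for increasing degrees of obfuscation."""
    original = Image.open(path).convert("RGB")
    return {r: original.filter(ImageFilter.GaussianBlur(radius=r)) for r in radii}

def reveal_region(original, blurred, box):
    """Paste the un-blurred pixels inside box=(left, top, right, bottom)
    onto the blurred image, leaving the rest obfuscated."""
    revealed = blurred.copy()
    revealed.paste(original.crop(box), box[:2])
    return revealed

# Example usage (hypothetical file name and coordinates):
# original = Image.open("moderation_task.jpg").convert("RGB")
# versions = blurred_versions("moderation_task.jpg")
# preview = reveal_region(original, versions[16], box=(100, 100, 200, 200))
```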

various blurring tools for content moderation

We have carefully considered this issue and recognize that our use of graphic images is a controversial decision. However, we cannot meaningfully assess effects of obfuscation in this particular domain without using pictures that may be considered unsafe for work. Unlike in actual content moderation settings, where moderators continuously sift through content, participants in this study are asked to moderate 10 images in a controlled setting. In most cases, these will be obfuscated, but in some cases people will still be able to see images depicting pornography or violence. To provide additional safeguards to protect participants in case of emotional trauma or disturbance, we provide a list of national mental health resources at the end of our experiment. Furthermore, the consent form carries descriptive and explicit warnings regarding the disturbing nature of some of the images and discourages users who might not want to be exposed to such content from partaking in the study. The design of this study was heavily influenced by valuable feedback from the Institutional Review Board and the Office of the Vice President for Legal Affairs at the University of Texas at Austin, for which we are grateful.

The criteria for human judgment that we provide to content moderators are based on leaked moderation rules used by Facebook on the crowdsourcing platform oDesk (now known as Upwork). First, participants will be asked to moderate a set of images, which we obfuscate at different degrees of blur (some participants will be exposed to one particular level of blur and others to higher or lower levels). Subsequently, participants will be asked to respond to a Qualtrics survey about demographics, positive and negative experiences and feelings, positive and negative affect, emotional exhaustion, and the perceived ease of use and usefulness of the moderation interfaces.

As this project is a work in progress, we cannot report data at this point, but we have recently obtained final approval from the Institutional Review Board at the University of Texas at Austin, which allows us to proceed with the experiment. With this paper, we hope to establish a case for more empirical work that puts the labor conditions of moderators first, and we invite other researchers to also explore methods for improving the working conditions of content moderators. However, we believe this is not a task that should be left to scientists at research institutions alone. First and foremost, this is a responsibility that social media platforms need to be held to, and we hope that our research also supplements future arguments to hold social media companies more accountable for the menial forms of labor they create.

For more details, please read our Work-In-Progress paper/extended abstract, But Who Protects the Moderators? The Case of Crowdsourced Image Moderation, which will be presented in several forms at the 6th AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2018) and the 6th ACM Collective Intelligence Conference (CI 2018).

Brandon Dang, University of Texas at Austin
Martin J. Riedl, University of Texas at Austin & Alexander von Humboldt Institute for Internet and Society
Matthew Lease, University of Texas at Austin

The first two authors contributed equally.

Two Tools are Better Than One: Tool Diversity as a Means of Improving Aggregate Crowd Performance

Human intelligence provides a new means of scaffolding the creation and training of AI systems. Recently, the rise of crowdsourcing marketplaces has opened opportunities to access human intelligence more scalably and flexibly than ever before. However, one of the biggest concerns when using crowdsourcing is that the contributed work can often be unreliable.

To increase reliability, prior work has frequently used task decomposition to generate smaller, simpler, less error-prone microtasks. Additionally, since it is easier to obtain agreement over microtasks, consensus-based aggregation can be used to produce a reasonable single answer from diverse individual worker responses. However, there are limits to how far a task can be decomposed into smaller pieces. Furthermore, there has been little research on how to deal with systematic error biases that can be induced by a tool's design. Systematic error biases are shared error patterns among workers using a single tool; they are problematic because they can persist even after decomposition or aggregation of a task.

In this paper, we propose to leverage tool diversity to overcome the limits of microtasking. We contribute the insight that: given a diverse set of tools, answer aggregation done across tools can help improve collective performance by offsetting systematic biases.

Our recent study shows the effectiveness of leveraging tool diversity, particularly in semantic image segmentation tasks, where the goal is to find a requested object in a scene and draw a tight boundary around it to demarcate it from the background. This is an important problem in computer vision that allows systems to be trained to better understand scenes.



For our experiments, we built four different image segmentation tools and evaluated their segmentation quality in three different aggregation conditions: 1) single-tool aggregation using majority voting, which serves as a baseline, 2) two-tool aggregation using majority voting, and 3) two-tool aggregation using expectation maximization (EM), to see whether this well-known optimization method can effectively integrate answers across different tools.

Two-tool aggregation improved F1 scores (the harmonic mean of recall and precision) compared to single-tool aggregation, especially when the mixed tool pairs had precision and recall trade-offs. EM-based aggregation significantly improved the performance of the tool pairs compared to uniform majority voting in most cases. F1 scores for the different tool pairings are summarized in the figure below. Our results suggest that not only an individual tool's design but also the aggregation method affects the performance of multi-tool aggregation.
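To make the aggregation conditions concrete, here is a minimal sketch, not the authors' implementation, of pixel-wise majority voting over binary segmentation masks and of the F1 computation; the mask variables in the usage comment are hypothetical.

```python
# A minimal sketch of aggregating binary segmentation masks across tools
# by pixel-wise majority vote, and scoring the result with F1.
import numpy as np

def majority_vote(masks):
    """masks: list of HxW boolean arrays from different workers/tools."""
    stacked = np.stack(masks).astype(float)
    # A pixel is foreground if at least half of the masks mark it.
    return stacked.mean(axis=0) >= 0.5

def f1_score(pred, truth):
    """Harmonic mean of pixel-wise precision and recall."""
    tp = np.logical_and(pred, truth).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(truth.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two-tool aggregation: pool masks produced with two different tools so
# that one tool's tendency to over-segment can offset the other's
# tendency to under-segment (hypothetical variables):
# combined = majority_vote(masks_tool_a + masks_tool_b)
# print(f1_score(combined, ground_truth))
```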



Our findings open up new opportunities and directions for gaining a deeper understanding of how tool designs influence aggregate performance on diverse crowdsourcing tasks, and introduce a new way of thinking about decomposing tasks: based on tools instead of subtasks. We suggest that system designers consider the following when trying to leverage tool diversity (using multiple tools for their applications):

  • The expected error (or bias) from human participants should be distributed differently across tools. This way, the diverse tool set can complement a broad range of error (or bias) types.
  • The task should have an objectively correct answer that is tractable enough for workers to provide. For example, live captioning or text annotation may be amenable to the tool diversity approach. On the other hand, tasks like creative writing are unlikely to benefit from our approach because the expected answer is subjective.
  • The task should tolerate imperfections in workers' answers. For example, the live captioning task used in Scribe tolerates imperfections because typos or a few missing words do not cause complete task failure.

Future work may investigate ways to generalize methodologies for leveraging tool diversity in other domains, such as video coding, annotation of fine-grained categories, and activity recognition. Furthermore, this approach may open a new way of optimizing the effort of both humans and computers, treating them as different resources with different systematic error biases, to leverage the best of both worlds.

For more details, please read our full paper, Two Tools are Better Than One: Tool Diversity as a Means of Improving Aggregate Crowd Performance, which received the Best Student Paper Award Honorable Mention at IUI 2018.

Jean Y. Song, University of Michigan, Ann Arbor
Juho Kim, KAIST, Republic of Korea
Walter S. Lasecki, University of Michigan, Ann Arbor

Analyzing naturally crowdsourced instructions at scale

“The proportion of ingredients is important, but the final result is also a matter of how you put them together,” said Alain Ducasse, one of only two chefs in the world with 21 Michelin stars. In fact, for cooking professionals like chefs, cooking journalists, and culinary students, it is important to understand not only the culinary characteristics of the ingredients but also the diverse cooking processes. For example, categorizing different approaches to cooking a dish or identifying usage patterns of particular ingredients and cooking methods are crucial tasks for them.

However, these analysis tasks require extensive browsing and comparison, which is very demanding. Why? Because there are thousands of recipes available online, even for something seemingly as simple as a chocolate chip cookie.

Figure 1: Search results for “chocolate chip cookie” on leading recipe websites.

In essence, these recipes are naturally crowdsourced instructions for a shared goal, like making a chocolate chip cookie. They exist in diverse contexts (no oven, no gluten, from scratch, etc.), at diverse levels of detail and length, for different levels of required skill and expertise, and in different writing styles.

We devised a computational pipeline that (a) constructs a graph representation capturing the semantic structure of each recipe from its natural language text, using machine-assisted human computation techniques, and (b) compares the structural and semantic similarities between every pair of recipes.
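As a simplified illustration of the second part of the pipeline (the actual system builds richer, human-verified semantic graphs), the sketch below encodes two hypothetical chocolate chip cookie recipes as small directed graphs and compares them with a graph edit distance from networkx; the node and edge scheme is an assumption for illustration.

```python
# A minimal sketch, not the RecipeScape pipeline itself: represent a
# recipe as a directed graph of ingredients flowing into intermediate
# products via cooking actions, then compare two recipes structurally.
import networkx as nx

def recipe_graph(steps):
    """steps: list of (action, inputs, output) tuples,
    e.g. ("mix", ["flour", "butter"], "dough")."""
    g = nx.DiGraph()
    for action, inputs, output in steps:
        if output not in g:
            g.add_node(output, kind="intermediate")
        for ingredient in inputs:
            if ingredient not in g:
                g.add_node(ingredient, kind="ingredient")
            g.add_edge(ingredient, output, action=action)
    return g

cookie_a = recipe_graph([
    ("cream", ["butter", "sugar"], "creamed mix"),
    ("mix", ["creamed mix", "flour", "chocolate chips"], "dough"),
    ("bake", ["dough"], "cookies"),
])
cookie_b = recipe_graph([
    ("melt", ["butter"], "melted butter"),
    ("mix", ["melted butter", "sugar", "flour", "chocolate chips"], "dough"),
    ("bake", ["dough"], "cookies"),
])

# Structural distance between the two recipes; smaller means more similar.
distance = nx.graph_edit_distance(
    cookie_a, cookie_b,
    node_match=lambda a, b: a.get("kind") == b.get("kind"))
print(distance)
```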

On top of this pipeline, we built an analytics dashboard that invites cooking professionals to analyze a large collection of recipes for a single dish.

Figure 2: RecipeScape is an interface for analyzing cooking processes at scale, with three main visualization components: (a) RecipeMap provides clusters of recipes with respect to their structural similarities, where each point on the map is a clickable recipe; (b) RecipeDeck provides an in-depth view and pairwise comparisons of recipes; (c) RecipeStat provides usage patterns of cooking actions and ingredients.

The interface provides analysis at three levels: (a) statistical information about individual ingredients and cooking actions, (b) structural comparison of recipes to identify representative and outlier recipes, and (c) clusters of recipes for examining the fundamental similarities and differences among the various approaches.

Our user study with 4 cooking professionals and 7 culinary students suggests that RecipeScape broadens browsing and analysis capabilities. For example, users reported that RecipeScape enables exploration of both common and exotic recipes, comparison of substitute ingredients and cooking actions, discovery of fundamentally different approaches to cooking a dish, and support for imagining and mentally simulating diverse cooking processes.

Our approach is not limited to cooking; it is applicable to other how-tos such as software tutorials, makeup instructions, furniture assembly, and more.

This work was presented at CHI 2018 in Montreal as “RecipeScape: An Interactive Tool for Analyzing Cooking Instructions at Scale”. For more detail, please visit our project website, https://recipescape.kixlab.org/

 

 

ConceptScape: Collaborative Concept Mapping for Video Learning

While video has become a widely adopted medium for online education, existing interface and interaction designs provide limited support for content navigation and learning. To support concept-driven navigation and comprehension of lecture videos, we present ConceptScape, a system that uses interactive concept maps to enable concept-oriented learning with lecture videos. Initial results from our evaluation of a prototype show that watching a lecture video with an interactive concept map can support comprehension during learning, prompt more reflection afterward, and provide a shortcut for referring back to specific sections.

But how do we generate interactive concept maps for numerous online lecture videos at scale? We designed a crowdsourcing workflow to capture multiple workers’ common understanding of a lecture video and represent workers’ understandings as an interactive concept map for future learners. The main challenge we are tackling here is to elicit workers’ individual reflections while guiding them to reach consensus on components of a concept map.

ConceptScape’s crowdsourcing workflow includes three stages with eight detailed steps. The first stage, Concept and Timestamp Generation, includes three steps: finding concepts, pruning concepts, and adjusting timestamps. The second stage, Concept Linking, includes three steps: linking concepts, supplementing links, and pruning links. The last stage, Link Labeling, includes two steps: nominating labels and voting.

Our crowdsourcing workflow consists of three main stages, each reflecting one of the three key cognitive activities in concept map construction: listing concepts, linking concepts, and explaining relationships. Stages are further divided into steps with different instructions, guiding workers to focus on specific activities in the concept mapping process. Overall, our key design choices are:

  • Each stage is designed to yield different types of output, and within a stage, multiple steps are added for quality control.
  • Each stage has a unique interface and instruction designed to collect specific components of the concept map.
  • In each step, workers contribute in parallel (for efficiency) while our aggregation algorithm maintains sequential step transitions (for quality control); a minimal aggregation sketch follows this list.
  • A worker is guided to work on a specific micro concept mapping activity in a step (e.g., pruning duplicate concepts), but may choose to work on other concept mapping  activities as they see fit (e.g., adding more concepts or changing the timestamp).
  • Allowing flexible work across multiple concept mapping activities lets us collect extra contributions from a wider range of worker perspectives; in later steps, a more restrictive aggregation method handles these extra contributions, since we intend the concept map to converge.
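The sketch below illustrates the flavor of this aggregation under assumed data formats: it keeps a nominated concept only when enough workers propose it, which is one simple way to converge parallel contributions between steps. The threshold and normalization are illustrative, not the paper's actual algorithm.

```python
# A minimal sketch, assuming each worker submits a list of concept
# strings: keep a concept only if a sufficient fraction of workers
# nominated it, before passing the pruned set to the next step.
from collections import Counter

def aggregate_concepts(nominations, min_support=0.5):
    """nominations: list of per-worker concept lists.
    Keep a concept if at least `min_support` of workers proposed it."""
    n_workers = len(nominations)
    counts = Counter(c.strip().lower() for worker in nominations for c in set(worker))
    return sorted(c for c, n in counts.items() if n / n_workers >= min_support)

workers = [
    ["Recursion", "base case", "call stack"],
    ["recursion", "Base Case", "stack frame"],
    ["recursion", "base case"],
]
print(aggregate_concepts(workers))   # ['base case', 'recursion']
```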

To evaluate our approach of crowdsourcing concept maps, we recruited participants from Amazon’s Mechanical Turk to generate concept maps for three lecture videos and compared our results to expert-generated concept maps and ones generated by individual novices. We evaluated:

  • The holistic quality of concept maps: third-party evaluators, blinded to experimental conditions, rated the overall quality of each concept map on a 1-10 scale.
  • The component quality of concept maps: evaluators scored the three components (concepts, links, and link phrases) separately, and we summed the three scores into a total score.

Our results show that ConceptScape generated concept maps of quality comparable to expert-generated concept maps, in terms of both holistic and component evaluation. ConceptScape also generated concept maps with higher component-level quality than individual novices did.

To see whether task flexibility adds value, we further looked into the amount of extra contribution from workers. We found that workers indeed contributed more than they were assigned to do.

Beyond crowdsourcing interactive concept maps for education, our crowdsourcing workflow design may also help inform those who aim to crowdsource open-ended work that requires higher-order thinking, such as work that demands cognitive analysis and creativity.

For more, see our full paper, ConceptScape or video.
Ching Liu, NTHU
Juho Kim, KAIST
Hao-chuan Wang, UC Davis


Understanding the Production and Consumption of Clickbaits in Twitter

With the growing shift toward news consumption primarily through social media sites like Twitter, most traditional as well as new-age media houses promote their news stories by tweeting about them. The competition for user attention on such platforms has led many media houses to use catchy, sensationalized tweets to attract more users, a process known as clickbaiting. Examples of clickbaits include “17 Reasons You Should Never Ever Use Makeup”, “These Dads Quite Frankly Just Don’t Care What You Think”, or “10 reasons why Christopher Hayden was the worst ‘Gilmore Girls’ character”.

On one hand, the success of such clickbaits in attracting visitors to news websites has helped several digital media companies flourish. On the other hand, there are concerns about the news value these articles offer, prompting demands from many quarters for a blanket ban. We believe that all associated angles, especially the clickbait readers, need to be considered before enforcing any drastic ban.

In this paper, we analyze the readership of clickbaits on Twitter. We collect around 12 million tweets over eight months, covering both clickbait and non-clickbait (or traditional) tweets, and then investigate the following research questions:

  • How are clickbait tweets different from non-clickbait tweets?
  • How do clickbait production and consumption differ from non-clickbaits?
  • Who are the consumers of clickbait and non-clickbait tweets?
  • How do the clickbait and non-clickbait consumers differ as a group?
The presence of different entities in both clickbait and non-clickbait tweets.

Our investigation reveals several interesting insights about the production of clickbaits. For example, clickbait tweets include more entities such as images, hashtags, and user mentions, which help capture the attention of consumers. Additionally, we find that a higher percentage of clickbait tweets convey positive sentiment compared to non-clickbait tweets. As a result, clickbait tweets tend to have a wider and deeper reach in their consumer base than non-clickbait tweets.

We also make several interesting observations about the consumers of clickbaits. For example, clickbait tweets are consumed more by women than by men, and by younger people than non-clickbait tweets are. Additionally, clickbait consumers have higher mutual engagement with each other. Non-clickbait consumers, on the other hand, are more reputed in the community and have relatively larger follower bases than clickbait consumers.

Overall, we make two major contributions in this paper: (i) to our knowledge, this is the first attempt to understand the consumers of clickbaits, and (ii) in doing so, we also make the first effort to contextualize the rise of clickbaits with the tabloidization of news. We believe this paper can foster further research going beyond only the negative aspects of clickbaits and help bring a more holistic view of the online news spectrum.

For more, see our full paper, Tabloids in the Era of Social Media? Understanding the Production and Consumption of Clickbaits in Twitter, at CSCW 2018.

Abhijnan Chakraborty, Indian Institute of Technology Kharagpur, India
Rajdeep Sarkar, Indian Institute of Technology Kharagpur, India
Ayushi Mrigen, Indian Institute of Technology Kharagpur, India
Niloy Ganguly, Indian Institute of Technology Kharagpur, India

Let’s Agree to Disagree: Fixing Agreement Measures for Crowdsourcing

In the context of micro-task crowdsourcing, each task is usually performed by several workers. This allows researchers to leverage measures of agreement among workers on the same task to estimate the reliability of the collected data and to better understand the answering behaviors of participants.

While many measures of agreement between annotators have been proposed, they are known to suffer from problems and abnormalities. In this work, we identify the main limits of existing agreement measures in the crowdsourcing context, both with toy examples and with real-world crowdsourcing data, and propose a novel agreement measure based on probabilistic parameter estimation that overcomes these limits. We validate our new agreement measure and show its flexibility compared to existing agreement measures.

CURRENT AGREEMENT MEASURES ARE INADEQUATE

Most agreement measures are borrowed from data reliability theory, where the reliability of a set of grouped measurements is assessed via a comparison between inter-group and intra-group variability, and where the judgments are typically made by a fixed set of assessors. In the crowdsourcing context, these measures suffer from several problems when they are used to estimate agreement instead of data reliability:

  1. The variability of judgments is typically higher when judgments concentrate around the center of the scale. This problem is intrinsic to finite-scale judgments and can lead to overestimating disagreement on items where the truth lies near the scale boundaries.
  2. The values around which judgments concentrate (if any) can differ from item to item. This can lead to overestimating the expected disagreement and thus to a higher chance of considering the data as random.
  3. For some items, a ground truth (e.g., ‘gold questions’ in crowdsourcing) might be present, that is, a value around which judgments are expected to concentrate. Classic agreement measures typically do not use this information.
  4. The global variability-based correction for chance leads to many idiosyncrasies in existing measures, making them hard to use in a crowdsourcing setting.

Our goal in this paper is to address the aforementioned issues, and to build a framework more suitable to estimate worker agreement over a group of tasks in a crowdsourcing context.

OUR MODEL

The intuition behind Φ is connected with the definition of agreement: we consider agreement to be the amount of concentration around a data value. Conversely, if the data does not concentrate around a value, we have disagreement (negative agreement in our measure), which can be more or less strong depending on how polarized the different opinions are. In more detail, our approach can be described as fitting a distribution to the histogram of the judgments and then measuring the dispersion of that distribution.

It is important to note that the fitted distribution has to be general enough to capture the main behaviors that might occur: flat (random judgments), bell-shaped (agreement), J-shaped (agreement around a value on the boundary of the scale), and U-shaped (disagreement), as shown in the following figure.

Agreement model examples

At the same time, the desired distribution has to have a minimal number of parameters, to avoid overfitting. For this reason, we use a Beta distribution to perform the fit: Φ is a transformed parameter of the Beta distribution fitted to the histogram of the collected answers. This parameter is related to the standard deviation of the fitted distribution, with the difference that here we account for the finiteness of the rating scale, and thus adjust for the tendency toward lower dispersion when the data concentrates around a value at the boundaries of the rating scale. For example, if we imagine a scenario where assessors add random Gaussian noise to the ground truth when making a judgment, the dispersion will be smallest when the ground truth is at the boundary of the scale, because noise that would push a judgment outside the boundary is clipped.
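The following sketch conveys the idea, though it is not the paper's exact Φ estimator: judgments on a finite scale are mapped to the unit interval, a Beta distribution is fitted, and agreement is read off how concentrated the fit is (here summarized simply by the Beta shape parameters). The rating scale, epsilon, and summary statistic are illustrative choices.

```python
# A minimal sketch, assuming scipy is available: fit a Beta distribution
# to judgments mapped onto (0, 1) and summarize agreement by the Beta
# "concentration" a + b. Large values mean judgments cluster (agreement);
# values below 2 indicate a U-shaped, polarized distribution (disagreement).
import numpy as np
from scipy import stats

def beta_agreement(judgments, scale=(1, 5)):
    lo, hi = scale
    # Map to the open interval (0, 1); the small epsilon keeps boundary
    # judgments inside the Beta support.
    eps = 1e-3
    x = np.clip((np.asarray(judgments, float) - lo) / (hi - lo), eps, 1 - eps)
    a, b, *_ = stats.beta.fit(x, floc=0, fscale=1)  # fit only the shape parameters
    return a + b, (a, b)

print(beta_agreement([4, 4, 5, 4, 4]))     # concentrated -> large a + b
print(beta_agreement([1, 1, 5, 5, 1, 5]))  # polarized -> a + b near or below 2
```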

The strength of our approach becomes apparent when it is applied to a group of items to be judged: in relevance judgment tasks, each item i is allowed to have a different average relevance value, while the agreement among workers is defined as the common Φ that best explains the judgment data.

This solves the problems that arise in other agreement measures when correcting for chance using the dispersion of the whole dataset as a normalizing factor.

EXAMPLE

The following figure shows a representation of the inference results for the judgments of 17 documents. We generated a small synthetic dataset in which the first document has an outlier on the right boundary and the other 16 documents show clear central agreement (documents 2-5 are replicated four times to obtain 16 documents with higher agreement). The model is forced to find the single agreement level (dispersion of the Beta distribution) that collectively explains all the data: while document 1 alone would have been fitted with a high-disagreement (U-shaped) Beta, the most probable Beta for explaining the whole dataset is the one in which the first document merely has an outlier. This reflects the way humans perceive agreement, especially with a small set of data samples, and allows a robust estimation of agreement for a group of documents.

Figure: inference results for the judgments of the 17 documents.

ONLINE TOOL

You can test our tool (as shown in the snapshot below) and access the source code at this link.

Online tool

For more information, see our full paper, Let’s Agree to Disagree: Fixing Agreement Measures for Crowdsourcing

Alessandro Checco, Information School, University of Sheffield


Report: Second GroupSight Workshop on Human Computation for Image and Video Analysis

What would be possible if we could accelerate the analysis of images and videos, especially at scale? This question is generating widespread interest across research communities as diverse as computer vision, human computer interaction, computer graphics, and multimedia.

The second Workshop on Human Computation for Image and Video Analysis (GroupSight) took place in Quebec City, Canada on October 24, 2017, as part of HCOMP 2017. The goal of the workshop was to promote greater interaction between this diversity of researchers and practitioners who examine how to mix human and computer efforts to convert visual data into discoveries and innovations that benefit society at large.

This was the second edition of the GroupSight workshop to be held at HCOMP. It was also the first time the workshop and conference were co-located with UIST. A website and blog post on the first edition of GroupSight are also available.

The workshop featured two keynote speakers in HCI doing research on crowdsourced image analysis. Meredith Ringel Morris (Microsoft Research) presented work on combining human and machine intelligence to describe images to people with visual impairments (slides). Walter Lasecki (University of Michigan) discussed projects using real-time crowdsourcing to rapidly and scalably generate training data for computer vision systems.

Participants also presented papers along three emergent themes:

Leveraging the visual capabilities of crowd workers:

  • Abdullah Alshaibani and colleagues at Purdue University presented InFocus, a system enabling untrusted workers to redact potentially sensitive content from imagery. (Best Paper Award)
  • Kyung Je Jo and colleagues at KAIST presented Exprgram (paper, video). This paper introduced a crowd workflow that supports language learning while annotating and searching videos. (Best Paper Runner-Up Award)
  • GroundTruth (paper, video), a system by Rachel Kohler and colleagues at Virginia Tech, combined expert investigators and novice crowds to identify the precise geographic location where images and videos were created.

Kurt Luther hands the best paper award to Alex Quinn.

Creating synergies between crowdsourced human visual analysis and computer vision:

  • Steven Gutstein and colleagues from the U.S. Army Research Laboratory presented a system that integrated a brain-computer interface with computer vision techniques to support rapid triage of images.
  • Divya Ramesh and colleagues from CloudSight presented an approach for real-time captioning of images by combining crowdsourcing and computer vision.

Improving methods for aggregating results from crowdsourced image analysis:

  • Jean Song and colleagues at the University of Michigan presented research showing that tool diversity can improve aggregate crowd performance on image segmentation tasks.
  • Anuparna Banerjee and colleagues at UT Austin presented an analysis of ways that crowd workers disagree in visual question answering tasks.

The workshop also had break-out groups where participants used a bottom-up approach to identify topical clusters of common research interests and open problems. These clusters included real-time crowdsourcing, worker abilities, applications (to computer vision and in general), and crowdsourcing ethics.

A group of researchers talking and seated around a poster board covered in sticky notes.

For more, including keynote slides and papers, check out the workshop website: https://groupsight.github.io/

Danna Gurari, UT Austin
Kurt Luther, Virginia Tech
Genevieve Patterson, Brown University and Microsoft Research New England
Steve Branson, Caltech
James Hays, Georgia Tech
Pietro Perona, Caltech
Serge Belongie, Cornell Tech


Crowdsourcing the Location of Photos and Videos

How can crowdsourcing help debunk fake news and prevent the spread of misinformation? In this paper, we explore how crowds can help expert investigators verify the claims around visual evidence they encounter during their work.

A key step in image verification is geolocation, the process of identifying the precise geographic location where a photo or video was created. Geotags or other metadata can be forged or missing, so expert investigators will often try to manually locate the image using visual clues, such as road signs, business names, logos, distinctive architecture or landmarks, vehicles, and terrain and vegetation.

However, sometimes there are not enough clues to make a definitive geolocation. In these cases, the expert will often draw an aerial diagram, such as the one shown below, and then try to find a match by analyzing miles of satellite imagery.

An aerial diagram of a ground-level photo, and the corresponding satellite imagery of that location.

Source: Bellingcat

This can be a very tedious and overwhelming task – essentially finding a needle in a haystack. We proposed that crowdsourcing might help, because crowds have good visual recognition skills and can scale up, and satellite image analysis can be highly parallelized. However, novice crowds would have trouble translating the ground-level photo or video into an aerial diagram, a process that experts told us requires lots of practice.

Our approach to solving this problem was right in front of us: what if crowds also use the expert’s aerial diagram? The expert was going to make the diagram anyway, so it’s no extra work for them, but it would allow novice crowds to bridge the gap between ground-level photo and satellite imagery.

To evaluate this approach, we conducted two experiments. The first experiment looked at how the level of detail in the aerial diagram affected the crowd’s geolocation performance. We found that in only ten minutes, crowds could consistently narrow down the search area by 40-60%, while missing the correct location only 2-8% of the time, on average.


In our second experiment, we looked at whether to show crowds the ground-level photo, the aerial diagram, or both. The results confirmed our intuition: the aerial diagram was best. When we gave crowds just the ground-level photo, they missed the correct location 22% of the time – not bad, but probably not good enough to be useful, either. On the other hand, when we gave crowds the aerial diagram, they missed the correct location only 2% of the time – a game-changer.

Bar chart showing the diagram condition performed significantly better than the ground photo condition.

For next steps, we are building a system called GroundTruth (video) that brings together experts and crowds to support image geolocation. We’re also interested in ways to synthesize our crowdsourcing results with recent advances in image geolocation from the computer vision research community.

For more, see our full paper, Supporting Image Geolocation with Diagramming and Crowdsourcing, which received the Notable Paper Award at HCOMP 2017.

Rachel Kohler, Virginia Tech
John Purviance, Virginia Tech
Kurt Luther, Virginia Tech

Call for Participation: GroupSight 2017

The Second Workshop on Human Computation for Image and Video Analysis (GroupSight) will be held on October 24, 2017 at AAAI HCOMP 2017 in Québec City, Canada. It promises an exciting mix of people and papers at the intersection of HCI, crowdsourcing, and computer vision.

The aim of this workshop is to promote greater interaction between the diversity of researchers and practitioners who examine how to mix human and computer efforts to convert visual data into discoveries and innovations that benefit society at large. It will foster in-depth discussion of technical and application issues for how to engage humans with computers to optimize cost/quality trade-offs. It will also serve as an introduction to researchers and students curious about this important, emerging field at the intersection of crowdsourced human computation and image/video analysis.

Topics of Interest

  • Crowdsourcing image and video annotations (e.g., labeling methods, quality control, etc.)
  • Humans in the loop for visual tasks (e.g., recognition, segmentation, tracking, counting, etc.)
  • Richer modalities of communication between humans and visual information (e.g., language, 3D pose, attributes, etc.)
  • Semi-automated computer vision algorithms
  • Active visual learning
  • Studies of crowdsourced image/video analysis in the wild

Submission Details

Submissions are requested in the following two categories: Original Work (not published elsewhere) and Demo (describing new systems, architectures, interaction techniques, etc.). Papers should be submitted as 4-page extended abstracts (including references) using the provided author kit. Demos should also include a URL to a video (max 6 min). Multiple submissions are not allowed. Reviewing will be double-blind.
Previously published work from a recent conference or journal can also be considered, but the authors should submit an unrevised copy of their published work; reviewing for this category will be single-blind. Email submissions to groupsight@outlook.com.

Important Dates

August 23, 2017 (extended from August 14): Deadline for paper submission (5:59 pm EDT)
August 25, 2017: Notification of decision
October 24, 2017: Workshop (full-day)

Link

https://groupsight.github.io