CIDR 2013: CrowdQ – Crowdsourced Query Understanding

Understanding complex questions is characteristic of human intelligence. Question-and-Answer platforms are good examples of how complex questions are best answered by humans. Unfortunately, Google and other search engines don’t understand your queries. In this work we combine crowdsourcing with algorithms to understand complex queries.

Our proposed system can answer complex queries such as “birthdate of the mayors of all the cities in Italy”. The answers to such complex queries are typically available on the Web (or even just in Wikipedia). However, current search engines are not able to provide answers directly because they do not understand the semantics behind user requests.

The proposed system generates query templates using a combination of:

  • query log mining
  • natural language processing (NLP)
    • part-of-speech tagging
    • entity extraction
  • crowdsourcing

These query templates can then be used to answer whole classes of questions, rather than just one specific question and its answer.
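To make the idea of a template covering a whole class of questions concrete, here is a minimal sketch in Python. It is not the system's actual template format: the regular expression, the slot names (`attribute`, `role`, `country`), and the function `match_template` are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical template with three slots:
#   "<attribute> of the <role>s of all the cities in <country>"
# The pattern and slot names are illustrative assumptions.
TEMPLATE = re.compile(
    r"^(?P<attribute>\w+) of the (?P<role>\w+?)s? "
    r"of all the cities in (?P<country>[\w ]+)$",
    re.IGNORECASE,
)


def match_template(query: str) -> Optional[dict]:
    """Return the slot values if the query fits the template, else None."""
    m = TEMPLATE.match(query.strip())
    return m.groupdict() if m else None
```

Any query with the same shape, whatever the attribute, role, or country, fills the same slots, which is what lets one template serve many different questions.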

Our proposed approach first transforms the user request into a structured query, then answers the query with machine-readable data publicly available on the Web (i.e., Linked Open Data).

Human input is used to detect the structure of a user request expressed in natural language:

  • which entities are mentioned
  • which relations exist among the entities
  • what is the type of the desired answer

The crowd is also involved to verify the correctness of automatic annotations in uncertain cases.
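One way to picture the structure described above, and the routing of uncertain annotations to the crowd, is the following Python sketch. The class names, the `confidence` field, and the 0.8 threshold are assumptions for illustration; the post does not say how uncertainty is measured.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    kind: str          # "entity", "relation", or "answer_type"
    value: str
    confidence: float  # assumed score from an automatic annotator, in [0, 1]


@dataclass
class QueryStructure:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


def needs_crowd_check(structure: QueryStructure,
                      threshold: float = 0.8) -> List[Annotation]:
    """Return the annotations whose confidence falls below the threshold.

    The threshold is a hypothetical cut-off for "uncertain cases".
    """
    return [a for a in structure.annotations if a.confidence < threshold]
```

Under these assumptions, only the low-confidence annotations would be sent to crowd workers for verification, while the rest are accepted automatically.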

The result of this process is an SQL-like query that can be answered automatically by standard database technologies.
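As a rough illustration of what such a structured query could look like for the example request, the sketch below uses Python's built-in `sqlite3` module. The two-table schema, the function names, and the rows are all made-up placeholders (not real mayors or dates), and a relational database stands in for the Linked Open Data sources the system actually targets.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with placeholder rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, birthdate TEXT);
        CREATE TABLE city (name TEXT, country TEXT,
                           mayor_id INTEGER REFERENCES person(id));
    """)
    conn.executemany("INSERT INTO person VALUES (?, ?, ?)", [
        (1, "Mayor A", "1970-01-01"),
        (2, "Mayor B", "1980-02-02"),
        (3, "Mayor C", "1990-03-03"),
    ])
    conn.executemany("INSERT INTO city VALUES (?, ?, ?)", [
        ("City A", "Italy", 1),
        ("City B", "Italy", 2),
        ("City C", "France", 3),
    ])
    return conn


def mayor_birthdates(conn: sqlite3.Connection, country: str) -> list:
    """Answer "birthdate of the mayors of all the cities in <country>"."""
    return conn.execute(
        """
        SELECT city.name, person.name, person.birthdate
        FROM city
        JOIN person ON person.id = city.mayor_id
        WHERE city.country = ?
        ORDER BY city.name
        """,
        (country,),
    ).fetchall()
```

Once the request has been reduced to this form, no further human input is needed: the database engine evaluates the join and filter on its own.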

For more, see our full paper, CrowdQ – Crowdsourced Query Understanding.

Gianluca Demartini, eXascale Infolab, University of Fribourg, Switzerland
Beth Trushkowsky, AMPLab, UC Berkeley, USA
Tim Kraska, Brown University, USA
Michael J. Franklin, AMPLab, UC Berkeley, USA
Daniel Bruckner, UC Berkeley
Daniel Haas, UC Berkeley
Jonathan Harper, UC Berkeley

About the author

Gianluca Demartini

Dr. Gianluca Demartini is a Lecturer in Data Science at the Information School of the University of Sheffield, UK. Previously, he was post-doctoral researcher at the eXascale Infolab at the University of Fribourg, visiting researcher at UC Berkeley, junior researcher at the L3S Research Center, and intern at Yahoo! Research.
His research interests include Web Information Retrieval, Semantic Web, and Human Computation. His Ph.D. work focused on Entity Retrieval. He has published more than 50 peer-reviewed scientific publications and given tutorials about Entity Retrieval and Crowdsourcing at research conferences.



  • This is interesting work! Thanks for sharing.

    One question I wonder about while reading is the interaction for search when we include humans in the loop. There is bound to be latency, unless the idea is to pre-compute such information for the most popular queries? I also wonder where you see this going — how do you see human help being used for even more abstract queries, when combined with machine intelligence and data?

    Btw, we had done some work focusing on mostly the human side of that which you may be interested in:


    • Thanks for the question.

      The idea is to involve the crowd exclusively off-line, by pre-processing a web search query log and creating query templates. The latency of the crowd makes it impractical to involve it while the user is waiting for an answer to her query. Instead, at query time the user query is automatically matched against the pre-generated query templates to get a fast answer.

      Your work is also very interesting and relevant. In the long term, I think human help will be used more and more in combination with complex data processing to answer complex search queries. In this context, the question becomes how to estimate the cost of answering an information need, and whether it is worth starting the entire answering process given that cost (e.g., crowdsourcing, cloud computation, etc.).

      Thanks again for the interest in our work,