Query-Feature Graphs: Bridging User Vocabulary and System Functionality

In this blog post, we describe our work developing query-feature graphs (QF graphs). A QF graph maps a natural-language search query to the specific features of a software package related to that query. For example, given the goal of using the GIMP raster graphics editor to make a color image look like it was taken with black and white film, our QF graph will map a search query like “GIMP black and white” to the GIMP commands “Desaturate,” “Grayscale,” and “Channel Mixer.” Each of these commands achieves the desired effect, but none contains any of the query terms.
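To make the mapping concrete, here is a minimal sketch of a QF graph as a weighted bipartite structure from queries to commands. The representation (a dict of weighted edges) and the weights are our illustrative assumptions, not the actual data structure from the paper:

```python
# Illustrative sketch: a QF graph maps natural-language queries to
# system commands, with edge weights standing in for association
# strength. Weights here are made up for the example.

from collections import defaultdict

class QFGraph:
    def __init__(self):
        # query -> {command: association weight}
        self.edges = defaultdict(dict)

    def add_association(self, query, command, weight=1.0):
        self.edges[query.lower()][command] = weight

    def features_for(self, query, top_k=3):
        """Return the commands most strongly associated with a query."""
        ranked = sorted(self.edges[query.lower()].items(),
                        key=lambda kv: kv[1], reverse=True)
        return [cmd for cmd, _ in ranked[:top_k]]

g = QFGraph()
g.add_association("GIMP black and white", "Desaturate", 0.9)
g.add_association("GIMP black and white", "Grayscale", 0.7)
g.add_association("GIMP black and white", "Channel Mixer", 0.5)
print(g.features_for("GIMP black and white"))
# ['Desaturate', 'Grayscale', 'Channel Mixer']
```

Note that none of the returned commands shares a term with the query, which is exactly the vocabulary gap the graph bridges.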

The associations expressed in QF graphs are drawn from popular searches performed on the Google search engine, combined with the wealth of user-generated technical documentation (e.g., tutorials, FAQs, and forums) available online. As such, QF graphs reflect the most common tasks performed by a system’s user base, and evolve with the user community as tasks, techniques, and workarounds fall in and out of favour.


Muse: Reviving Memories with Email Archives

Since the first email message was sent in 1971, the usage of email has evolved significantly. Email is used today not just for interpersonal communication but also, for example, to plan events and trips, maintain records, make online purchases, track to-do items, process business workflows, and even to indulge in email wars or tell people that you’ve been summarily fired. Our email archives directly or indirectly capture an astounding amount of our personal histories; in fact, many of us consciously deposit information into email “for the record”, knowing that we can look it up later. Imagine an archive that captures all the email messages that you’ve typed over a period of 50 years or more! Such an archive would be extremely valuable for reasons of personal or family history, and may also be of interest to related individuals and organizations, and digital archaeologists of the future. As the Library of Congress says, be sure to save your email archives!



Last summer, I was reading John Markoff’s book What the Dormouse Said at a bed-and-breakfast in Maine.  I came across a passage about Seymour Cray, and it occurred to me that we live in a uniquely interesting time for supercomputing.

The Cloud

Today’s supercomputers are quite different from Cray’s refrigerator-sized multiprocessors.  They generally consist of thousands of networked commodity machines, à la Google’s data centers.  A consequence of this is that more of the design of these supercomputers is being pushed to software.  In the extreme, grid computing projects like Folding@Home involve no hardware design at all.

This is a common pattern in computing history.  When general-purpose hardware becomes inexpensive and readily available, more can be done in software.  (For example, now that we have iPhones, we buy fewer voice recorders and alarm clocks.)  And since more people can write software than build hardware, any shift from hardware to software coincides with a burst of innovation.

In these times of shift, the most influential software innovations are the operating systems — MS-DOS in 1982, and the iPhone SDK and Facebook Connect today.  For the cloud, the operating systems are the distributed file systems, RPC layers, and programming models such as MapReduce, Dryad, Pig, Sawzall, and BigTable.  It’s this software that makes other software possible.

The Crowd

The force behind the rise of human computing is similar to the one behind the rise of cloud computing.  Two billion people now have access to the Internet — and many of them are bored.  So while computers made of people have existed since the mid-1700s, it is only in the last decade that the “hardware” — people — has become so readily available.

Consequently, we can expect to see a burst of software innovation for crowd computing.  And we have.  We’ve seen a range of creative applications, quality-control methods, monitoring systems, and cost-optimization algorithms.  And as in cloud computing, the most influential software for human supercomputers will be the operating systems.

Humanism and Human Computation

A danger in designing human supercomputers is the tendency to take the metaphor literally.  As computer scientists, we’re used to thinking about things like cost reduction, speed, availability, and fault tolerance.  But when designing human supercomputers, focusing only on performance can lead to systems where workers are seen as individual, anonymous, and interchangeable, without needs other than those that can be directly measured as costs to the system.

Of course, most people don’t see themselves this way, and this conflict probably affects the quality of work that they do within such systems.  But it also affects the quality of applications that can be built for these human supercomputers.  A model for people that doesn’t include rich personal identities precludes the ability to use those identities for routing tasks.  A model for people that doesn’t include social relationships precludes the ability to define control flows that make use of those relationships for teamwork.  Most importantly, a design that treats people like machines will cause them to act like machines.  This would be a sad waste of creativity, individual talent, empathy, and spontaneity.

A Hybrid Human Supercomputer

This all suggests something concrete to build: a supercomputer that consists of people and machines, where the machines are treated like machines and the people are treated like people.  The supercomputer would be designed primarily in software, whose core would consist of a virtual machine for low-level resource allocation and a high-level programming language that can easily specify complex workflows.

The design would focus on the human needs of the workers who comprise the system.  Not just their need for money (and at the beginning, we will not address that need at all), but also their need for independent thought, for self-determination, for learning, for community, for co-creation, for being a part of something larger than themselves.

As a natural byproduct, we hope that more powerful programming models for social computing will emerge, bottom-up models that harness a more inclusive range of things that people can do better than machines.  Jabberwocky, as it stands, is the very beginning of our attempt to do this.


The Jabberwocky software stack consists of three layers.  The first layer, Dormouse, is our virtual machine.  It consists primarily of functions for resource allocation and routing of tasks to people and machines.  Importantly, it allows developers and workers to create their own communities, and to add profile properties and social structure as they see fit.  Another important component of Dormouse is a template library that allows developers to reuse task templates created and optimized by others and approved by workers.
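To give a feel for the routing idea, here is a hypothetical sketch of Dormouse-style task routing: workers carry developer-defined profile properties, and a task goes to any worker (human or machine) whose profile satisfies the task’s requirements. The function name, profile keys, and task format are all our assumptions for illustration; the actual Dormouse API differs.

```python
# Hypothetical sketch of profile-based task routing. Workers are plain
# dicts of profile properties; a task declares the properties it
# requires. This is an illustration of the concept, not Dormouse code.

def route(task, workers):
    """Return the workers whose profiles satisfy every task requirement."""
    return [w for w in workers
            if all(w.get(key) == value
                   for key, value in task["requires"].items())]

workers = [
    {"name": "alice",      "kind": "person",  "speaks": "en"},
    {"name": "ocr-node-1", "kind": "machine", "speaks": None},
    {"name": "bob",        "kind": "person",  "speaks": "fr"},
]
task = {"label": "translate caption",
        "requires": {"kind": "person", "speaks": "fr"}}

print([w["name"] for w in route(task, workers)])  # ['bob']
```

The point of the sketch is that people and machines sit behind the same routing interface, while the profile properties that distinguish them remain under the community’s control.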

The second layer of the Jabberwocky stack is ManReduce, a programming framework inspired by MapReduce (but closer to Dryad) that interfaces with Dormouse.  ManReduce provides convenient programming abstractions for both human and machine computation.  It automatically parses input files, routes intermediate data, produces output files, and transparently handles parallelism and serialization.  After we built ManReduce, we found out that Nikki, Boris, Shusheel, and Bob at CMU had similar lines of thinking in their excellent and independent CrowdForge work, which will also be presented at UIST.  While the spirit of the two frameworks is the same, there are some key differences that we discuss in the paper.
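The map-shuffle-reduce flow that ManReduce builds on can be sketched in a few lines. In the real framework, a map or reduce step can be routed to human workers via Dormouse; in this toy version both steps are plain functions so the sketch runs, and the function names and word-count task are our invention:

```python
# Toy MapReduce-style pipeline, for intuition only. In ManReduce, a
# step like map_fn could be performed by people rather than code.

from itertools import groupby

def man_reduce(records, map_fn, reduce_fn):
    # Map phase: each record yields (key, value) pairs.
    pairs = [kv for record in records for kv in map_fn(record)]
    # Shuffle phase: group intermediate values by key.
    pairs.sort(key=lambda kv: kv[0])
    grouped = {k: [v for _, v in vs]
               for k, vs in groupby(pairs, key=lambda kv: kv[0])}
    # Reduce phase: combine each key's values into a final result.
    return {k: reduce_fn(k, vs) for k, vs in grouped.items()}

docs = ["the cloud", "the crowd", "the cloud and the crowd"]
counts = man_reduce(docs,
                    map_fn=lambda doc: [(w, 1) for w in doc.split()],
                    reduce_fn=lambda word, ones: sum(ones))
print(counts["the"])  # 4
```

Replace the word-count lambdas with, say, a human step that labels an image and a machine step that aggregates the labels, and you have the hybrid human-machine shape that ManReduce targets.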

And the highest level of the Jabberwocky stack is Dog, a high-level scripting language built on ManReduce and inspired by Pig and Sawzall.  A key consideration in the design of Dog was that it should be human-readable even by non-programmers.  Jabberwocky will be an open-source environment, like a web browser, where workers can see the source code of the Dog programs in which they will take part.  The hope is that being able to read the source code will both give workers a sense of context for the tasks that they are doing, and help them to make informed decisions about whether they want to participate.  Another key consideration is that Dog should be simple enough that the workers who comprise the system can write Dog scripts themselves.  In the case of human computation, we prefer to have computers that can program themselves.

Jabberwocky is joint work with Salman Ahmad, Alexis Battle, and Zahan Malkani.  Please check out the full paper here.  We plan to make the framework available starting in early 2012. If you are interested in using or contributing to Jabberwocky, please drop us a note.

PlateMate: Crowdsourcing Nutrition Analysis from Food Photographs

What if we could put an automated nutritionist in everyone’s pocket? Before each meal, you’d snap a quick picture of your plate and then dig in. From there, your automated nutritionist would identify the foods, measure the portions, and add up the calories. It might even go further, providing tips like “eat more vegetables at lunch” or “lay off those chocolate chip cookies you have every Tuesday to save 150 calories.” If everyone had access to that much data and advice, we could all eat healthier and worry less about obesity, heart disease, and other serious consequences of careless eating.
