The Web is a valuable source of information, but most of its data cannot be processed automatically because it is intended for human consumption.
Wrappers are specialized programs that extract the data from the source code of HTML pages and organize them in a more structured way, making them machine processable.
For example, suppose we want to collect data about movies (e.g., titles, directors, actors) by means of a set of wrappers that extract the data from the sites available on the Web. Besides the most famous web sites (e.g., IMDB), many others have to be considered. [Dalvi et al., VLDB 2012] have shown that in many domains, covering 90% of the entities present on the Web requires considering more than 10,000 sites.
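To make the movie scenario concrete, the following is a minimal sketch of what a wrapper for a single site could look like. The page markup, the rule syntax (the XPath subset supported by Python's standard library), and the names `PAGE`, `WRAPPER`, and `apply_wrapper` are all illustrative assumptions; real pages are rarely well-formed and would need a more lenient HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page source for one movie (an assumption for this sketch).
PAGE = """<html><body>
<h1 class="title">City of God</h1>
<span class="director">Fernando Meirelles</span>
</body></html>"""

# In this sketch a wrapper is a mapping from attribute name to an extraction rule.
WRAPPER = {
    "title": ".//h1[@class='title']",
    "director": ".//span[@class='director']",
}

def apply_wrapper(wrapper, page_source):
    """Apply every extraction rule to the page and return a structured record."""
    root = ET.fromstring(page_source)
    return {attr: root.findtext(rule) for attr, rule in wrapper.items()}

record = apply_wrapper(WRAPPER, PAGE)
print(record)
```

Every site uses its own markup, so a separate set of rules is needed per site, which is why wrapper generation has to scale to thousands of sites.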
Fully automated approaches for learning wrappers have already been proposed (e.g., RoadRunner [Crescenzi and Merialdo, AAI 2008]), but they exhibit limited accuracy. On the other hand, supervised wrapper generators have limited applicability at Web scale. The crowd could be the key to wrapping very large numbers of data-intensive Web sites with high accuracy.
We propose ALFRED [Crescenzi et al. WWW2013, DBCrowd2013], a wrapper inference system supervised by the crowd. To generate wrappers, the system poses sequences of simple questions that require a boolean answer (e.g. “Is ‘City of God’ the title of the movie in the page?” Y/N). The answers provided by the workers recruited on a crowdsourcing platform are exploited to generate the correct wrapper.
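The idea of driving wrapper inference with yes/no questions can be illustrated with a small sketch. This is not ALFRED's actual algorithm: it assumes a single page, a perfectly reliable worker, and a toy representation in which a page is a list of text values and each candidate rule is a function picking one of them. The names `learn_rule`, `candidates`, and `ask` are hypothetical.

```python
def learn_rule(candidates, page, ask):
    """Discard candidate rules using boolean answers.

    candidates: dict mapping a rule name to a function page -> extracted value.
    ask: callable(value) -> bool, standing in for a crowd worker's Y/N answer.
    Returns the names of the surviving rules and the number of questions asked.
    """
    alive = dict(candidates)
    queries = 0
    while True:
        # Group the surviving rules by the value they extract from the page.
        by_value = {}
        for name, rule in alive.items():
            by_value.setdefault(rule(page), []).append(name)
        if len(by_value) <= 1:
            break  # the remaining rules agree (or none is left)
        # Ask about the value supported by the largest group of rules.
        value = max(by_value, key=lambda v: len(by_value[v]))
        queries += 1
        if ask(value):
            alive = {n: alive[n] for n in by_value[value]}
        else:
            for n in by_value[value]:
                del alive[n]
    return sorted(alive), queries

# Toy page and candidate rules (assumptions for this sketch).
page = ["City of God", "2002", "Fernando Meirelles"]
candidates = {
    "first_text": lambda p: p[0],
    "h1_text": lambda p: p[0],
    "second_text": lambda p: p[1],
    "last_text": lambda p: p[-1],
}

# Simulated worker answering "Is '<value>' the title of the movie in the page?"
rules, n_queries = learn_rule(candidates, page, lambda v: v == "City of God")
print(rules, n_queries)
```

Rules that extract the same value on this page cannot be told apart here; separating them would require questions on further pages, and tolerating wrong answers would require asking more than one worker.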
Preliminary results are promising:
Just a few queries are needed to generate accurate wrappers. Even in the presence of inaccurate workers, ALFRED can generate a correct wrapper with fewer than 15 queries.
The accuracy of the output wrapper is highly predictable, with an average F-measure close to 100% and a standard deviation below 1%, i.e., almost perfect wrappers with small variability.
The estimation of workers' error rates is accurate, and spammers and unreliable workers are detected early.
Costs are contained and highly predictable thanks to a technique to dynamically engage, at runtime, a minimal number of workers, with 92% of the cases covered by just two workers.
Many challenges are still open:
To further reduce costs, we aim at adopting a hybrid approach that partially relies on automatic wrapper generation techniques, with light supervision by the crowd.
Gamification is a promising direction to engage workers and scale out wrapper generation: people can play games while teaching ALFRED how to wrap the Web.
For more, see our project website, ALFRED, and the full paper, A framework for learning web wrappers from the crowd.