ABSTRACT
Scaling the extraction of data from Web sources remains a challenging problem. Supervised approaches can achieve high accuracy, but the cost of training data, i.e., annotations over a set of sample pages, limits their scalability. Crowdsourcing platforms are making manual annotation more affordable. However, the tasks submitted to these platforms must be simple enough for non-experts to perform, and their number should be minimized to contain costs. We introduce a framework to support a supervised wrapper inference system with training data generated by the crowd. The training data are labeled values obtained by posing membership queries, the simplest form of query, to the crowd. We show that the cost of producing the training data is strongly affected by the expressiveness of the wrapper formalism and by the choice of the training set. Traditional supervised wrapper inference approaches use a statically defined formalism and assume that it is able to express the wrapper. Conversely, we present an inference algorithm that dynamically chooses the expressiveness of the wrapper formalism and actively selects the training set, while minimizing the number of membership queries posed to the crowd. We report the results of experiments on real Web sources that confirm the effectiveness and feasibility of the approach.
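The abstract does not spell out the inference algorithm, so the following is only a toy sketch of the general idea, not the authors' method. It assumes a page is a dictionary from a location key to text, a wrapper rule is a function that returns one value per page, and the crowd is an `oracle(page_index, value)` that answers a yes/no membership query. All names (`learn_wrapper`, `pick_query`, `min_confirmed`, the sample pages and rules) are hypothetical. The learner starts with the least expressive rule class, asks the query that best splits the surviving rules, and moves to a more expressive class only when no rule remains consistent with the answers.

```python
from typing import Callable, Dict, List, Optional, Tuple

Page = Dict[str, str]                    # toy page: location key -> text
Rule = Callable[[Page], Optional[str]]   # a rule extracts one value per page
Query = Tuple[int, str]                  # (page index, candidate value)


def by_key(key: str) -> Rule:
    """Build a toy rule that extracts whatever is stored under `key`."""
    return lambda page: page.get(key)


def consistent(rule: Rule, pages: List[Page], answers: Dict[Query, bool]) -> bool:
    """A rule is consistent if it extracts exactly the values labeled correct."""
    return all((rule(pages[i]) == v) == ok for (i, v), ok in answers.items())


def pick_query(rules: List[Rule], pages: List[Page]) -> Optional[Query]:
    """Choose the (page, value) pair that splits the rules most evenly."""
    best, best_gap = None, None
    for i, page in enumerate(pages):
        values = [rule(page) for rule in rules]
        for v in sorted({x for x in values if x is not None}):
            count = values.count(v)
            if count == len(rules):
                continue  # every rule agrees on this value: uninformative
            gap = abs(2 * count - len(rules))
            if best is None or gap < best_gap:
                best, best_gap = (i, v), gap
    return best  # None when the rules agree on every page


def learn_wrapper(pages, rule_classes, oracle, min_confirmed=2):
    """Return (name of the chosen rule or None, number of queries asked)."""
    answers: Dict[Query, bool] = {}
    for level in range(len(rule_classes)):
        # Candidate set grows with the expressiveness level.
        candidates = {n: r for cls in rule_classes[:level + 1] for n, r in cls.items()}
        while True:
            alive = {n: r for n, r in candidates.items()
                     if consistent(r, pages, answers)}
            if not alive:
                break  # current class cannot express the wrapper: expand it
            query = pick_query(list(alive.values()), pages)
            if query is None:
                # Survivors agree; confirm their output on a few pages
                # (an assumed stopping heuristic, not taken from the paper).
                name = sorted(alive)[0]
                rule = alive[name]
                unchecked = [(i, rule(p)) for i, p in enumerate(pages)
                             if rule(p) is not None and (i, rule(p)) not in answers]
                if sum(answers.values()) >= min_confirmed or not unchecked:
                    return name, len(answers)
                query = unchecked[0]
            answers[query] = oracle(*query)
    return None, len(answers)


# Hypothetical sample data: the target value is the one following a label.
PAGES = [
    {"first_bold": "10", "second_bold": "Acme", "after_label": "10"},
    {"first_bold": "Sale", "second_bold": "20", "after_label": "20"},
    {"first_bold": "30", "second_bold": "Zed", "after_label": "30"},
]
RULE_CLASSES = [
    {"first_bold": by_key("first_bold"), "second_bold": by_key("second_bold")},
    {"after_label": by_key("after_label")},
]
TRUTH = ["10", "20", "30"]


def oracle(page_index: int, value: str) -> bool:
    """Simulated crowd worker answering a membership query."""
    return TRUTH[page_index] == value
```

On this sample data, `learn_wrapper(PAGES, RULE_CLASSES, oracle)` discards both positional rules after two queries, expands to the second class, and returns `("after_label", 3)`. The confirmation step is a crude stand-in for whatever correctness estimate a real system would use.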