DOI: 10.1145/2488388.2488412
research-article

A framework for learning web wrappers from the crowd

Published: 13 May 2013

ABSTRACT

Scaling the extraction of data from Web sources remains a challenging problem. Supervised approaches can achieve high accuracy, but the cost of training data, i.e., annotations over a set of sample pages, limits their scalability. Crowdsourcing platforms are making manual annotation more affordable. However, the tasks submitted to these platforms must be extremely simple, so that non-experts can perform them, and their number must be kept small to contain costs. We introduce a framework that supports a supervised wrapper inference system with training data generated by the crowd. The training data are labeled values obtained by posing membership queries, the simplest form of query, to the crowd. We show that the cost of producing the training data is strongly affected by the expressiveness of the wrapper formalism and by the choice of the training set. Traditional supervised wrapper inference approaches use a statically defined formalism and assume that it can express the wrapper. In contrast, we present an inference algorithm that dynamically chooses the expressiveness of the wrapper formalism and actively selects the training set, while minimizing the number of membership queries posed to the crowd. We report experiments on real web sources that confirm the effectiveness and feasibility of the approach.
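
The abstract gives no algorithmic detail, so the following is only a toy sketch of the general idea, not the paper's algorithm. It assumes a page is a list of text tokens, a wrapper is a rule that extracts at most one value per page, and the crowd is an `oracle` function answering the membership query "is this value correct on this page?". The rule families (`slot`, `after_label`), the two expressiveness levels, and the sample pages are all hypothetical. The sketch starts from the least expressive level, asks a query only where surviving rules (plus challengers from the next level) disagree, and moves to a more expressive level only when no simpler rule remains consistent with the answers.

```python
from typing import Callable, Dict, List, Optional, Tuple

Page = List[str]                          # hypothetical page model: a list of text tokens
Rule = Callable[[Page], Optional[str]]    # a candidate wrapper extracts at most one value
Named = Tuple[str, Rule]
Query = Tuple[int, str]                   # (page index, candidate value)


def slot(k: int) -> Named:
    """Least expressive rule family (assumed): the token at a fixed position."""
    return f"slot[{k}]", lambda page: page[k] if k < len(page) else None


def after_label(label: str) -> Named:
    """More expressive rule family (assumed): the token following a label."""
    def rule(page: Page) -> Optional[str]:
        if label in page:
            i = page.index(label)
            return page[i + 1] if i + 1 < len(page) else None
        return None
    return f"after[{label}]", rule


def consistent(rule: Rule, pages: List[Page], answers: Dict[Query, bool]) -> bool:
    """A rule survives if it extracts every confirmed value and no rejected one."""
    return all((rule(pages[i]) == v) == ok for (i, v), ok in answers.items())


def pick_query(pool: List[Named], pages: List[Page],
               answers: Dict[Query, bool]) -> Optional[Query]:
    """Return an unasked (page, value) pair on which the surviving rules disagree."""
    for i, page in enumerate(pages):
        extracted = {rule(page) for _, rule in pool}
        if len(extracted) < 2:
            continue  # every surviving rule agrees on this page
        for value in sorted(v for v in extracted if v is not None):
            if (i, value) not in answers:
                return i, value
    return None


def infer_wrapper(pages: List[Page], levels: List[List[Named]],
                  oracle: Callable[[int, str], bool]) -> Tuple[Optional[str], int]:
    """Return (name of the chosen rule, number of membership queries posed)."""
    answers: Dict[Query, bool] = {}
    k = 0
    while k < len(levels):
        active = [nr for level in levels[:k + 1] for nr in level
                  if consistent(nr[1], pages, answers)]
        if not active:
            k += 1  # no simple rule fits the answers: allow more expressiveness
            continue
        challengers = ([nr for nr in levels[k + 1] if consistent(nr[1], pages, answers)]
                       if k + 1 < len(levels) else [])
        query = pick_query(active + challengers, pages, answers)
        if query is None:
            return active[0][0], len(answers)
        answers[query] = oracle(*query)  # one membership query to the (simulated) crowd
    return None, len(answers)


# Hypothetical sample pages; the target value is the author name.
pages = [
    ["Title: Dune", "Author:", "F. Herbert", "1965"],
    ["Title: Emma", "Editor:", "R. Roe", "Author:", "J. Austen"],
]
truth = ["F. Herbert", "J. Austen"]
levels = [[slot(k) for k in range(4)],
          [after_label("Author:"), after_label("Editor:")]]
result = infer_wrapper(pages, levels, lambda i, v: truth[i] == v)
print(result)
```

On these two pages a fixed position works for the first page but not the second, so the sketch ends up selecting the label-based rule. It asks nothing when all surviving rules agree, which means it can return an unverified rule if the candidate pool is too small; a real system would also have to cope with noisy crowd answers, which this sketch ignores.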

Published in

WWW '13: Proceedings of the 22nd international conference on World Wide Web
May 2013
1628 pages
ISBN: 9781450320351
DOI: 10.1145/2488388

Copyright © 2013 Copyright is held by the International World Wide Web Conference Committee (IW3C2).

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

WWW '13 paper acceptance rate: 125 of 831 submissions (15%). Overall acceptance rate: 1,899 of 8,196 submissions (23%).
