DOI: 10.1145/1460007.1460016
research-article

Experiences in crawling deep web in the context of local search

Published: 29 October 2008

Abstract

Local search engines allow geographically constrained searching of businesses and their products or services. Some local search engines use crawlers to index Web page contents. These crawlers mostly index Web pages that are accessible through hyperlinks and that include desirable location information. It is extremely important for local search engines to also crawl additional high-quality "local" content (e.g., user reviews) that is available in the Deep Web. Much of this content is hidden behind search forms and takes the form of structured data, the volume of which is increasing very rapidly. In this paper, we present our experiences in crawling and extracting a wide variety of local structured data from a large number of Deep Web resources. We discuss the challenges in crawling such sources and, based on our experience, offer some effective principles to address them. Our experimental results on several Deep Web sources with local content show that the techniques discussed are highly effective.

      Published In

      GIR '08: Proceedings of the 5th Workshop on Geographic Information Retrieval
      October 2008
      68 pages
      ISBN:9781605582535
      DOI:10.1145/1460007
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. deep web crawling
      2. local search
      3. structured data
      4. wrappers

      Qualifiers

      • Research-article

      Conference

      CIKM08: Conference on Information and Knowledge Management
      October 29 - 30, 2008
      Napa Valley, California, USA

      Acceptance Rates

      Overall Acceptance Rate 46 of 61 submissions, 75%

      Article Metrics

      • Downloads (Last 12 months): 2
      • Downloads (Last 6 weeks): 0
      Reflects downloads up to 14 Jan 2025

      Cited By

      • (2018) Harvesting Deep Web Data Through Produser Involvement. The Dark Web, pp. 175-198. DOI: 10.4018/978-1-5225-3163-0.ch009. Online publication date: 2018.
      • (2017) Information Retrieval in Web Crawling Using Population Based, and Local Search Based Meta-heuristics: A Review. Proceedings of Sixth International Conference on Soft Computing for Problem Solving, pp. 87-104. DOI: 10.1007/978-981-10-3325-4_10. Online publication date: 13-Apr-2017.
      • (2017) A survey of Web crawlers for information retrieval. WIREs Data Mining and Knowledge Discovery, 7:6. DOI: 10.1002/widm.1218. Online publication date: 7-Aug-2017.
      • (2015) Deep web performance enhance on search engine. 2015 International Conference on Soft Computing Techniques and Implementations (ICSCTI), pp. 137-140. DOI: 10.1109/ICSCTI.2015.7489619. Online publication date: Oct-2015.
      • (2013) Where the streets have no name. Proceedings of the 7th Workshop on Geographic Information Retrieval, pp. 47-48. DOI: 10.1145/2533888.2533937. Online publication date: 5-Nov-2013.
      • (2012) Multi-source Conflating Index Construction for Local Search in a Low-Coverage Country. Proceedings of the 2012 Eighth Latin American Web Congress, pp. 28-31. DOI: 10.1109/LA-WEB.2012.21. Online publication date: 25-Oct-2012.
      • (2010) Automatically Extracting Web Data Records. Active Media Technology, pp. 510-521. DOI: 10.1007/978-3-642-15470-6_51. Online publication date: 2010.
