DOI: 10.1145/1242572.1242622
Article

Robust web page segmentation for mobile terminal using content-distances and page layout information

Published: 08 May 2007

Abstract

Demand for browsing general Web pages on mobile phones is increasing. However, because most Web pages on the Internet are optimized for viewing on PCs, mobile phone users have difficulty obtaining sufficient information from the Web. A method for reconstructing PC-optimized Web pages for mobile phone users is therefore essential. One approach is to segment a Web page based on its structure and use the hierarchy of its content elements to regenerate a page suitable for mobile phone browsing. In our previous work, we examined a robust automatic Web page segmentation scheme that uses the distance between content elements based on the relative HTML tag hierarchy, i.e., the number and depth of HTML tags in the Web page. However, the content-distance in this scheme is derived from the order of HTML tags and does not always correspond to the intuitive distance between content elements in the rendered layout of the page. In this paper, we propose a hybrid segmentation method that segments Web pages using both the content-distance calculated by the previous scheme and a novel approach based on Web page layout information. Experiments evaluating segmentation accuracy show that the proposed method segments Web pages more accurately than conventional methods. Furthermore, an implementation and evaluation of our system on a mobile phone show that our method achieves better usability than commercial Web browsers.
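The abstract does not give the content-distance formula, so the following is only a minimal sketch of the general idea, not the authors' scheme. It assumes a toy distance equal to the number of HTML tag events (opening or closing) between two consecutive text elements, and a fixed cut-off `threshold`. The names `ContentDistanceParser` and `segment` are hypothetical, and the layout-based half of the hybrid method is not modeled at all.

```python
from html.parser import HTMLParser


class ContentDistanceParser(HTMLParser):
    """Collects text elements and a toy distance between consecutive ones.

    Assumption: distance = number of start/end tag events seen between two
    text elements. The paper's actual content-distance is not reproduced here.
    """

    def __init__(self):
        super().__init__()
        self.texts = []      # text content elements in document order
        self.distances = []  # distances[i] lies between texts[i] and texts[i + 1]
        self._tags_since_text = 0  # tag events seen since the last text element

    def handle_starttag(self, tag, attrs):
        self._tags_since_text += 1

    def handle_endtag(self, tag):
        self._tags_since_text += 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace-only runs between tags
        if self.texts:
            self.distances.append(self._tags_since_text)
        self.texts.append(text)
        self._tags_since_text = 0


def segment(html, threshold):
    """Group consecutive text elements; start a new block when the toy
    distance to the previous element exceeds `threshold` (hypothetical)."""
    parser = ContentDistanceParser()
    parser.feed(html)
    parser.close()
    if not parser.texts:
        return []
    blocks = [[parser.texts[0]]]
    for text, dist in zip(parser.texts[1:], parser.distances):
        if dist > threshold:
            blocks.append([text])
        else:
            blocks[-1].append(text)
    return blocks


if __name__ == "__main__":
    page = "<div><p>A</p><p>B</p></div><div><ul><li>C</li></ul></div>"
    print(segment(page, threshold=3))
```

On the sample page, "A" and "B" are separated by two tag events and stay together, while "C" sits five tag events after "B" and starts a new block. A tag-order distance like this is exactly the kind of measure the abstract says can disagree with the rendered layout, which is the motivation for combining it with layout information.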




    Published In

    WWW '07: Proceedings of the 16th international conference on World Wide Web
    May 2007
    1382 pages
    ISBN:9781595936547
    DOI:10.1145/1242572

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. content-distance
    2. mobile phone
    3. segmentation
    4. web page
    5. web page layout

    Qualifiers

    • Article

    Conference

    WWW'07: 16th International World Wide Web Conference
    May 8 - 12, 2007
    Banff, Alberta, Canada

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%


    Cited By

    • (2024) Research on the Impact of Mobile Terminal Information Layout on Visual Search-Taking Bookkeeping Application as an Example. Human-Centered Design, Operation and Evaluation of Mobile Communications, pp. 268-281. DOI: 10.1007/978-3-031-60458-4_18. Online publication date: 1-Jun-2024
    • (2023) Methods for Automatic Web Page Layout Testing and Analysis: A Review. IEEE Access, Vol. 11, pp. 13948-13964. DOI: 10.1109/ACCESS.2023.3242549. Online publication date: 2023
    • (2022) Web Image Context Extraction with Graph Neural Networks and Sentence Embeddings on the DOM Tree. Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pp. 258-267. DOI: 10.1007/978-3-030-93736-2_20. Online publication date: 17-Feb-2022
    • (2021) Content-Determined Web Page Segmentation and Navigation for Mobile Web Searching. Result Page Generation for Web Searching, pp. 88-108. DOI: 10.4018/978-1-7998-0961-6.ch007. Online publication date: 2021
    • (2021) Screen Content Quality Assessment: Overview, Benchmark, and Beyond. ACM Computing Surveys, Vol. 54, No. 9, pp. 1-36. DOI: 10.1145/3470970. Online publication date: 8-Oct-2021
    • (2021) Postal address extraction from the web: a comprehensive survey. Artificial Intelligence Review. DOI: 10.1007/s10462-021-09983-1. Online publication date: 14-Mar-2021
    • (2020) Web Page Segmentation Revisited. Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3047-3054. DOI: 10.1145/3340531.3412782. Online publication date: 19-Oct-2020
    • (2018) A Novel Approach for Extracting Pertinent Keywords for Web Image Annotation Using Semantic Distance and Euclidean Distance. Software Engineering, pp. 173-183. DOI: 10.1007/978-981-10-8848-3_17. Online publication date: 13-Jun-2018
    • (2017) XDBrowser 2.0. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 4574-4584. DOI: 10.1145/3025453.3025547. Online publication date: 2-May-2017
    • (2016) Identifying semantic blocks in Web pages using Gestalt laws of grouping. World Wide Web, Vol. 19, No. 5, pp. 957-978. DOI: 10.1007/s11280-015-0370-0. Online publication date: 1-Sep-2016
