DOI: 10.1145/2339530.2339621
research-article

Harnessing the wisdom of the crowds for accurate web page clipping

Published: 12 August 2012

Abstract

Clipping Web pages, that is, extracting the informative clips (areas) from Web pages, has many applications, such as Web printing and e-reading on small handheld devices. Although many existing methods address this task, most of them either work only on certain types of Web pages (e.g., news- and blog-like pages) or operate semi-automatically, requiring extra user effort to adjust the outputs. The problem of clipping any type of Web page accurately and fully automatically remains largely open. To this end, in this study we harness the wisdom of the crowds to provide accurate recommendations of informative clips on any given Web page. Specifically, we leverage knowledge of how previous users clipped similar Web pages; this knowledge repository can be represented as a transaction database in which each transaction contains the clips selected by a user on a certain Web page. We then formulate a new pattern mining problem on the transaction database, mining the top-1 qualified pattern, for this recommendation. Here, the recommendation considers not only pattern support but also pattern occupancy (proposed in this work). High support requires that a pattern appear frequently in the database, while high occupancy requires that a pattern occupy a large portion of the transactions it appears in. Together, the two measures lead to recommendations that are both precise and complete. Additionally, we explore the properties of occupancy to further prune the search space for highly efficient pattern mining. Finally, we show the effectiveness of the proposed algorithm on a human-labeled ground truth dataset consisting of 2000 Web pages from 100 major Web sites, and demonstrate its efficiency on large synthetic datasets.
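The interplay of support and occupancy described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the abstract does not give formulas, so the sketch assumes that occupancy is the mean fraction of each supporting transaction covered by the pattern, and that a "qualified" pattern passes two hypothetical thresholds (`min_support`, `min_occupancy`) and is ranked by a weighted sum with a hypothetical weight `lam`. The search is brute force and omits the occupancy-based pruning the abstract mentions; the clip names are invented.

```python
from itertools import combinations


def support(pattern, transactions):
    """Fraction of transactions that contain every item of `pattern`."""
    pattern = frozenset(pattern)
    hits = [t for t in transactions if pattern <= t]
    return len(hits) / len(transactions)


def occupancy(pattern, transactions):
    """Mean share of each supporting transaction covered by `pattern`.

    Assumed definition for illustration; the abstract only describes
    occupancy informally.
    """
    pattern = frozenset(pattern)
    hits = [t for t in transactions if pattern <= t]
    if not hits:
        return 0.0
    return sum(len(pattern) / len(t) for t in hits) / len(hits)


def top1_pattern(transactions, min_support=0.5, min_occupancy=0.5, lam=1.0):
    """Brute-force top-1 search (exponential in the number of items).

    Thresholds and the score `support + lam * occupancy` are assumptions.
    """
    items = sorted(set().union(*transactions))
    best, best_score = None, float("-inf")
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            o = occupancy(candidate, transactions)
            if s < min_support or o < min_occupancy:
                continue  # not a qualified pattern
            score = s + lam * o
            if score > best_score:
                best, best_score = frozenset(candidate), score
    return best, best_score


# Hypothetical clip selections made by four users on similar pages.
clips = [
    frozenset({"title", "body"}),
    frozenset({"title", "body", "image"}),
    frozenset({"title", "body"}),
    frozenset({"title", "sidebar", "footer", "ads"}),
]

best, score = top1_pattern(clips)
print(sorted(best), round(score, 3))
```

In this toy database, `{"title"}` has the highest support (it appears in every transaction) but covers only a small part of each one, so it fails the occupancy threshold. The pair `{"title", "body"}` is slightly less frequent but accounts for most of what users actually clipped, which is the "precise and complete" behavior the abstract attributes to combining the two measures.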

Supplementary Material

JPG File (310_m_talk_11.jpg)
MP4 File (310_m_talk_11.mp4)

    Published In

    KDD '12: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2012
    1616 pages
    ISBN: 9781450314626
    DOI: 10.1145/2339530
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. frequent and dominant pattern
    2. occupancy
    3. web page clipping
    4. wisdom of the crowds

    Qualifiers

    • Research-article

    Conference

    KDD '12
    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Cited By

    • (2021) High Occupancy Itemset Mining with Consideration of Transaction Occupancy. Arabian Journal for Science and Engineering. DOI: 10.1007/s13369-021-06075-8. Online publication date: 9-Sep-2021
    • (2019) An Overview of Crowdsourcing. Advanced Methodologies and Technologies in Network Architecture, Mobile Computing, and Data Analytics, pages 1763-1776. DOI: 10.4018/978-1-5225-7598-6.ch130. Online publication date: 2019
    • (2019) A Surrogate-Assisted Multiobjective Evolutionary Algorithm for Large-Scale Task-Oriented Pattern Mining. IEEE Transactions on Emerging Topics in Computational Intelligence, 3:2, pages 106-116. DOI: 10.1109/TETCI.2018.2872055. Online publication date: Apr-2019
    • (2018) An Overview of Crowdsourcing. Encyclopedia of Information Science and Technology, Fourth Edition, pages 8023-8035. DOI: 10.4018/978-1-5225-2255-3.ch698. Online publication date: 2018
    • (2017) Pattern Recommendation in Task-oriented Applications: A Multi-Objective Perspective [Application Notes]. IEEE Computational Intelligence Magazine, 12:3, pages 43-53. DOI: 10.1109/MCI.2017.2708578. Online publication date: 18-Jul-2017
    • (2015) Automatic Web Content Extraction by Combination of Learning and Grouping. Proceedings of the 24th International Conference on World Wide Web, pages 1264-1274. DOI: 10.1145/2736277.2741659. Online publication date: 18-May-2015
    • (2015) An Optimization to CHARM Algorithm for Mining Frequent Closed Itemsets. 2015 IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing, pages 226-235. DOI: 10.1109/CIT/IUCC/DASC/PICOM.2015.33. Online publication date: Oct-2015
    • (2015) Correlation range query for effective recommendations. World Wide Web, 18:3, pages 709-729. DOI: 10.1007/s11280-013-0265-x. Online publication date: 1-May-2015
