DOI: 10.1145/1526709.1526880
poster

Purely URL-based topic classification

Published: 20 April 2009

Abstract

Given only the URL of a web page, can we identify its topic? This is the question that we examine in this paper. Usually, web pages are classified using their content, but a URL-only classifier is preferable (i) when speed is crucial, (ii) to enable content filtering before an (objectionable) web page is downloaded, (iii) when a page's content is hidden in images, (iv) to annotate hyperlinks in a personalized web browser without fetching the target page, and (v) when a focused crawler wants to infer the topic of a target page before devoting bandwidth to downloading it. We apply a machine learning approach to the topic identification task and evaluate its performance in extensive experiments on categorized web pages from the Open Directory Project (ODP). When training separate binary classifiers for each topic, we achieve typical F-measure values between 80 and 85, and a typical precision of around 85. We also ran experiments on a small data set of university web pages. For the task of classifying these pages into faculty, student, course and project pages, our methods improve over previous approaches by 13.8 points of F-measure.
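To make the setup in the abstract concrete, the following is a minimal sketch of a per-topic binary classifier that sees only URL tokens, together with the precision, recall and F-measure computation used to score it. The abstract does not specify the features or the learning algorithm, so everything here is an assumption for illustration: the tokenization rule, the small stop list, the choice of multinomial naive Bayes with add-one smoothing, and the names `url_tokens`, `BinaryUrlClassifier` and `f_measure` are all hypothetical, and the example URLs are made up.

```python
import math
import re
from collections import Counter

# Assumption: scheme and "www" carry no topical signal, so they are dropped.
STOP_TOKENS = {"http", "https", "www"}


def url_tokens(url):
    """Lowercase the URL and split it into alphabetic tokens (an assumed scheme)."""
    parts = re.split(r"[^a-z]+", url.lower())
    return [t for t in parts if len(t) > 1 and t not in STOP_TOKENS]


class BinaryUrlClassifier:
    """One-vs-rest classifier for a single topic.

    Assumption: multinomial naive Bayes with add-one smoothing stands in for
    whatever learner the paper actually uses.
    """

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}

    def train(self, url, is_topic):
        self.counts[is_topic].update(url_tokens(url))
        self.docs[is_topic] += 1

    def _log_score(self, tokens, label):
        vocab = len(set(self.counts[True]) | set(self.counts[False]))
        total = sum(self.counts[label].values())
        prior = (self.docs[label] + 1) / (self.docs[True] + self.docs[False] + 2)
        score = math.log(prior)
        for t in tokens:
            # The extra +1 in the denominator reserves mass for unseen tokens.
            score += math.log((self.counts[label][t] + 1) / (total + vocab + 1))
        return score

    def predict(self, url):
        tokens = url_tokens(url)
        return self._log_score(tokens, True) > self._log_score(tokens, False)


def f_measure(predicted, actual):
    """Return (precision, recall, F-measure) for boolean predictions."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy training data (made-up URLs) for a hypothetical "sports" topic.
sports = BinaryUrlClassifier()
sports.train("http://www.espn.com/nba/scores", True)
sports.train("http://www.nba.com/games", True)
sports.train("http://www.nytimes.com/politics/election", False)
sports.train("http://www.bbc.co.uk/news/politics", False)
```

In this sketch one such classifier would be trained per topic, each answering only "is this URL about my topic or not", which mirrors the separate-binary-classifiers setting the abstract reports results for. Calling `sports.predict(url)` never fetches the page, which is the property the listed use cases rely on.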

    Published In

    WWW '09: Proceedings of the 18th international conference on World wide web
    April 2009
    1280 pages
    ISBN:9781605584874
    DOI:10.1145/1526709

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. ODP
    2. URL
    3. topic classification

    Qualifiers

    • Poster

    Conference

    WWW '09

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
