Blocking objectionable web content by leveraging multiple information sources

Authors:

Jianping Zhang

ACM SIGKDD Explorations Newsletter, Volume 8, Issue 1

Pages 17 - 26

https://doi.org/10.1145/1147234.1147238

Published: 01 June 2006

Abstract

The World Wide Web has become an enormous archive of varied content. The sheer volume of information on the web makes it challenging to deliver the right information to the right users. On one hand, this abundant information is freely accessible to all web users; on the other hand, much of it may be irrelevant or even harmful to some users. For example, control and filtering mechanisms are needed to prevent inappropriate or offensive material, such as pornographic websites, from reaching children. The ways in which websites are accessed are termed Access Scenarios. An Access Scenario can include using a search engine (e.g., image search, which has very little textual content), URL redirection to a website, or directly typing a (porn) website URL. In this paper we propose a framework that analyzes a website from several different aspects, or information sources, and generates a classification model that aims to classify such content accurately irrespective of the access scenario. Extensive experiments are performed to evaluate the resulting system, and the results illustrate the promise of the proposed approach.
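The abstract does not say which information sources or which learning algorithm the framework uses, so the following is only a minimal sketch of the general idea of combining several sources so that a decision can still be made when one of them is nearly empty (as with an image search result that has little text). Everything in it is an assumption made for illustration: the two sources (URL and page text), the placeholder vocabulary `FLAGGED_TERMS`, the averaging rule, and the threshold are hypothetical and are not taken from the paper.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical placeholder vocabulary; a real system would learn its features.
FLAGGED_TERMS = {"adult", "xxx"}


def term_ratio(text: str) -> Optional[float]:
    """Fraction of tokens that are flagged, or None if the source is empty."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]
    if not tokens:
        return None  # this source carries no evidence for the page
    return sum(t in FLAGGED_TERMS for t in tokens) / len(tokens)


# Each (assumed) information source maps a page record to a score or None.
SOURCES: Dict[str, Callable[[dict], Optional[float]]] = {
    "url": lambda page: term_ratio(page.get("url", "")),
    "text": lambda page: term_ratio(page.get("text", "")),
}


def is_objectionable(page: dict, threshold: float = 0.15) -> bool:
    """Average the scores of the sources that have evidence, then threshold."""
    scores = [s for s in (score(page) for score in SOURCES.values()) if s is not None]
    if not scores:
        return False  # no source has evidence; this sketch defaults to allowing
    return sum(scores) / len(scores) >= threshold
```

Because empty sources are skipped rather than scored as zero, a page reached with almost no text can still be judged from its URL alone, which is the property the access-scenario discussion above is concerned with.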

Published In

ACM SIGKDD Explorations Newsletter, Volume 8, Issue 1

June 2006

104 pages

ISSN: 1931-0145

EISSN: 1931-0153

DOI: 10.1145/1147234

Copyright © 2006 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States
