
Blocking objectionable web content by leveraging multiple information sources

Published: 01 June 2006

Abstract

The World Wide Web has become a vast archive of diverse content. The sheer volume of information on the web makes it challenging to deliver the right information to the right users. On one hand, this abundant information is freely accessible to all web users; on the other hand, much of it may be irrelevant or even harmful to some of them. For example, control and filtering mechanisms are needed to prevent inappropriate or offensive material, such as pornographic websites, from reaching children. The ways in which a website can be reached are termed Access Scenarios. An Access Scenario can include using a search engine (e.g., image search, whose results carry very little textual content), being redirected to a website by URL redirection, or typing the URL of a pornographic website directly. In this paper we propose a framework that analyzes a website from several different aspects, or information sources, and generates a classification model that aims to classify such content accurately irrespective of the access scenario. Extensive experiments are performed to evaluate the resulting system, and the results illustrate the promise of the proposed approach.
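The abstract does not say which information sources or which learning algorithm the framework uses, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes two hypothetical sources (`url` and `text`), trains a tiny naive Bayes model per source on made-up toy data, and sums the per-source log-odds. All names here (`SourceModel`, `classify`, the toy pages) are illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the input and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


class SourceModel:
    """Tiny multinomial naive Bayes over ONE information source (assumed)."""

    def __init__(self):
        # Token counts per class: True = objectionable, False = acceptable.
        self.counts = {True: Counter(), False: Counter()}
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_odds(self, text):
        """Log-likelihood ratio (objectionable vs. acceptable).

        Uses add-one smoothing; tokens never seen in training are skipped,
        so an uninformative input contributes 0.0.
        """
        v = len(self.vocab)
        totals = {c: sum(self.counts[c].values()) for c in (True, False)}
        score = 0.0
        for tok in tokenize(text):
            if tok not in self.vocab:
                continue
            p_bad = (self.counts[True][tok] + 1) / (totals[True] + v)
            p_ok = (self.counts[False][tok] + 1) / (totals[False] + v)
            score += math.log(p_bad / p_ok)
        return score


def classify(models, page, threshold=0.0):
    """Sum log-odds over whichever sources are present for this page.

    A missing or empty source is skipped, which mimics access scenarios
    where, for example, almost no page text is available.
    """
    total = sum(
        models[name].log_odds(value)
        for name, value in page.items()
        if name in models and value
    )
    return total > threshold


# Made-up toy training data, for illustration only.
train = [
    ({"url": "http://adult-videos.example/xxx", "text": "explicit adult videos xxx"}, True),
    ({"url": "http://explicit-pics.example", "text": "adult explicit pictures"}, True),
    ({"url": "http://news.example/world", "text": "world news and weather"}, False),
    ({"url": "http://school.example/homework", "text": "homework help for school"}, False),
]

# One model per information source.
models = {
    name: SourceModel().fit([page[name] for page, _ in train],
                            [label for _, label in train])
    for name in ("url", "text")
}

# Scenario with no usable page text: only the URL contributes evidence.
url_only = classify(models, {"url": "http://xxx-adult.example", "text": ""})
# Scenario with both sources available.
both = classify(models, {"url": "http://news.example/weather",
                         "text": "world weather news"})
```

Summing log-odds treats the sources as independent pieces of evidence, which is a simplifying assumption; a real system would more plausibly learn how much to trust each source.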



Published In

ACM SIGKDD Explorations Newsletter  Volume 8, Issue 1
June 2006
104 pages
ISSN:1931-0145
EISSN:1931-0153
DOI:10.1145/1147234

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Article


