Features selection from high-dimensional web data using clustering analysis

Authors:

Héctor Menéndez,

Gema Bello-Orgaz,

David Camacho

WIMS '12: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics

Article No.: 20, Pages 1 - 9

https://doi.org/10.1145/2254129.2254155

Published: 13 June 2012

Abstract

Feature selection has become an important part of data preprocessing. These methods reduce the number of attributes in a dataset to simplify its analysis. Classical techniques include wrapper approaches, heuristic functions, and filters. The main problem with these approaches is that they are usually black-box and computationally expensive algorithms. This work presents a new, straightforward strategy for reducing the number of attributes. The methodology takes the distribution of the variables into account and is oriented toward clustering analysis. It makes both the attribute selection strategy and the resulting clusters easier for humans to interpret. Finally, the approach has been tested experimentally on the FIFA World Cup web dataset, a well-known social-based statistical dataset with a large number of variables, to show how the feature selection strategy finds the most relevant variables.
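The abstract does not spell out the selection procedure, so the following is only a minimal sketch of what a distribution-aware filter could look like, not the authors' method. It assumes two illustrative rules: drop variables whose coefficient of variation is very small, and drop variables that are almost perfectly correlated with one already kept. The function names, the thresholds `min_cv` and `max_corr`, and the toy variable names are all hypothetical.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sx * sy)


def select_features(columns, min_cv=0.05, max_corr=0.95):
    """Return the names of the columns kept by a simple distribution-based filter.

    columns: dict mapping variable name -> list of numeric values.
    min_cv and max_corr are illustrative thresholds, not values from the paper.
    """
    kept = []
    for name, values in columns.items():
        mu, sigma = mean(values), pstdev(values)
        # Drop variables that barely vary relative to their mean.
        if sigma == 0 or (mu != 0 and sigma / abs(mu) < min_cv):
            continue
        # Drop variables that are nearly collinear with one already kept.
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue
        kept.append(name)
    return kept


# Toy data with hypothetical variable names (not taken from the FIFA dataset).
data = {
    "goals": [1, 3, 0, 2, 5],
    "shots": [2, 6, 0, 4, 10],   # exactly 2 * goals, so redundant
    "matches": [3, 3, 3, 3, 3],  # constant, so uninformative
    "fouls": [9, 2, 7, 4, 6],
}
print(select_features(data))
```

Because each rule is a plain threshold on a per-variable statistic, it is easy to see why a given variable was dropped, which is the kind of interpretability the abstract emphasizes. The surviving columns could then be passed to any clustering algorithm.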

Index Terms

Features selection from high-dimensional web data using clustering analysis
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
    2. Retrieval tasks and goals

Published In

WIMS '12: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics

June 2012

571 pages

ISBN:9781450309158

DOI:10.1145/2254129

Conference Chair:
Dumitru Dan Burdescu, University of Craiova, Romania
Program Chairs:
Rajendra Akerkar, Western Norway Research Institute, Norway
Costin Bădică, University of Craiova, Romania

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Sponsors

UCV: University of Craiova
WNRI: Western Norway Research Institute

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Ministerio de Educación, Cultura y Deporte

Conference

WIMS '12

WIMS '12: 2nd International Conference on Web Intelligence, Mining and Semantics

June 13 - 15, 2012

Craiova, Romania

Acceptance Rates

Overall Acceptance Rate 140 of 278 submissions, 50%

Cited By

Alazab M, Khurma R, Awajan A, Camacho D (2022). A new intrusion detection system based on Moth–Flame Optimizer algorithm. Expert Systems with Applications, 210 (118439). Online publication date: Dec-2022. https://doi.org/10.1016/j.eswa.2022.118439

Kohli S, Mehrotra S (2016). A Clustering Approach for Optimization of Search Result. Journal of Image and Graphics, 4:1 (63-66). Online publication date: 2016. https://doi.org/10.18178/joig.4.1.63-66

Mehrotra S, Kohli S (2016). Application of Clustering for Improving Search Result of a Website. Information Systems Design and Intelligent Applications (349-356). Online publication date: 3-Feb-2016. https://doi.org/10.1007/978-81-322-2752-6_34

Ganghishetti P, Vadlamani R (2014). Association Rule Mining via Evolutionary Multi-objective Optimization. Proceedings of the 8th International Workshop on Multi-disciplinary Trends in Artificial Intelligence - Volume 8875 (35-46). Online publication date: 8-Dec-2014. https://dl.acm.org/doi/10.1007/978-3-319-13365-2_4

Pham T, Phan T, Nguyen P, Ha Q (2013). Hidden Topic Models for Multi-label Review Classification. Proceedings of the 5th International Conference on Computational Collective Intelligence. Technologies and Applications - Volume 8083 (603-611). Online publication date: 11-Sep-2013. https://dl.acm.org/doi/10.1007/978-3-642-40495-5_60
