ABSTRACT
An important ingredient in solving machine learning problems is the availability of a suitable dataset. However, such a dataset may have to be extracted from large volumes of unstructured and semi-structured data such as programming code, scripts, and text. In this work, we propose a plug-in based, extensible feature extraction framework, which we have prototyped as a tool. The framework is demonstrated by extracting features from two different sources of semi-structured and unstructured data. The semi-structured data comprised web page and script-based data, whereas the unstructured data was taken from email data used for spam filtering. The usefulness of the tool was also assessed in terms of ease of programming.
Index Terms
- Development of a plugin based extensible feature extraction framework