DOI: 10.1145/3167132.3167328
research-article

Development of a plugin based extensible feature extraction framework

Published: 09 April 2018

ABSTRACT

An important ingredient in solving machine learning problems is the availability of a suitable dataset. However, such a dataset may have to be extracted from large volumes of unstructured and semi-structured data such as programming code, scripts, and text. In this work, we propose a plug-in based, extensible feature extraction framework, which we have prototyped as a tool. The framework is demonstrated by extracting features from two different sources of semi-structured and unstructured data. The semi-structured data comprised web pages and scripts, whereas the unstructured data was taken from emails for spam filtering. The usefulness of the tool was also assessed in terms of ease of programming.
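
To make the plug-in idea concrete, the following is a minimal sketch of how such a framework could be organized. It is not the authors' tool: the names `FeaturePlugin`, `register`, `extract_features`, and both example features are hypothetical, chosen only to mirror the two data sources mentioned above (web pages or scripts, and email text).

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Type


class FeaturePlugin(ABC):
    """Hypothetical plug-in contract: each plug-in computes one named feature."""

    name: str

    @abstractmethod
    def extract(self, document: str) -> float:
        """Return a numeric feature value for one raw document."""


# Registry mapping feature names to plug-in classes (an assumption of this sketch).
_REGISTRY: Dict[str, Type[FeaturePlugin]] = {}


def register(cls: Type[FeaturePlugin]) -> Type[FeaturePlugin]:
    """Class decorator that makes a plug-in discoverable by name."""
    _REGISTRY[cls.name] = cls
    return cls


@register
class ScriptTagCount(FeaturePlugin):
    """Illustrative web-page feature: number of opening <script tags."""

    name = "script_tag_count"

    def extract(self, document: str) -> float:
        return float(document.lower().count("<script"))


@register
class UppercaseRatio(FeaturePlugin):
    """Illustrative email feature: share of letters that are uppercase."""

    name = "uppercase_ratio"

    def extract(self, document: str) -> float:
        letters = [c for c in document if c.isalpha()]
        if not letters:
            return 0.0
        return sum(c.isupper() for c in letters) / len(letters)


def extract_features(
    document: str, plugin_names: Optional[List[str]] = None
) -> Dict[str, float]:
    """Run the selected plug-ins (all registered ones by default) on a document."""
    names = plugin_names if plugin_names is not None else sorted(_REGISTRY)
    return {n: _REGISTRY[n]().extract(document) for n in names}
```

In this arrangement a new feature is added by writing one more decorated class, and the driver function `extract_features` stays unchanged, which is the kind of extensibility a plug-in design aims for.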


          • Published in

            SAC '18: Proceedings of the 33rd Annual ACM Symposium on Applied Computing
            April 2018
            2327 pages
            ISBN: 9781450351911
            DOI: 10.1145/3167132

            Copyright © 2018 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Acceptance Rates

            Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
