DOI: 10.1145/2479787.2479789
research-article

Automatic classification of documents in cold-start scenarios

Published: 12 June 2013

ABSTRACT

Document classification is key to ensuring the quality of any digital library. However, classifying documents is a very time-consuming task, and few or none of the documents in a newly created repository are classified. The lack of classification not only prevents users from finding information but also hinders the system's ability to recommend relevant items. Moreover, the lack of classified documents prevents any kind of machine learning algorithm from automatically annotating these items. In this work, we propose a novel approach to automatic document classification that differs from previous work in that it exploits the wisdom of the crowds available on the Web. Our proposed strategy adapts an automatic tagging approach, combined with a straightforward matching algorithm, to classify documents within a given domain classification. To validate our findings, we compared our methods against existing ones and performed a user evaluation with 61 participants to estimate the quality of the classifications. Results show that, in 72% of the cases, the automatic classification is relevant and well accepted by participants. In conclusion, automatic classification can facilitate access to relevant documents.
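
The abstract does not spell out the tagging method or the matching algorithm, so the following is only a minimal illustrative sketch of the general idea of matching automatically generated tags against the labels of a domain classification. The function names (`tokenize`, `classify`), the weighted-tag input format, and the token-overlap scoring rule are all assumptions made for this example, not the authors' method.

```python
from collections import Counter


def tokenize(text):
    """Lowercase a tag or category label and split it into word tokens."""
    return [t for t in text.lower().replace("-", " ").split() if t]


def classify(tags, categories):
    """Rank category labels by weighted token overlap with a document's tags.

    tags       -- dict mapping tag -> weight (hypothetical output of an
                  automatic tagger; the format is an assumption)
    categories -- list of category labels from the domain classification
    Returns the matching categories, best match first.
    """
    # Accumulate the weight of every token that appears in any tag.
    tag_weights = Counter()
    for tag, weight in tags.items():
        for token in tokenize(tag):
            tag_weights[token] += weight

    # Score each category by the total weight of its tokens found in the tags.
    scores = {}
    for category in categories:
        score = sum(tag_weights[token] for token in set(tokenize(category)))
        if score > 0:
            scores[category] = score

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical example data, invented for illustration only.
tags = {"machine learning": 0.6, "text mining": 0.3, "libraries": 0.1}
categories = ["Machine Learning", "Digital Libraries", "Computer Graphics"]
print(classify(tags, categories))  # ['Machine Learning', 'Digital Libraries']
```

Because a scheme like this needs no previously classified documents, it illustrates why a tag-matching strategy can work in a cold-start repository where supervised classifiers have nothing to train on.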


      • Published in

        WIMS '13: Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics
        June 2013
        408 pages
        ISBN: 9781450318501
        DOI: 10.1145/2479787

        Copyright © 2013 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        WIMS '13 paper acceptance rate: 28 of 72 submissions, 39%
        Overall acceptance rate: 140 of 278 submissions, 50%
