ABSTRACT
Document classification is key to ensuring the quality of any digital library. However, classifying documents is a very time-consuming task. In addition, few or none of the documents in a newly created repository are classified. The lack of classification not only prevents users from finding information but also hinders the system's ability to recommend relevant items. Moreover, the lack of classified documents prevents any kind of machine learning algorithm from automatically annotating these items. In this work, we propose a novel approach to automatically classifying documents that differs from previous work in that it exploits the wisdom of the crowds available on the Web. Our proposed strategy adapts an automatic tagging approach and combines it with a straightforward matching algorithm to classify documents according to a given domain classification. To validate our findings, we compared our methods against existing approaches and performed a user evaluation with 61 participants to estimate the quality of the classifications. Results show that, in 72% of the cases, the automatic classification is relevant and well accepted by participants. In conclusion, automatic classification can facilitate access to relevant documents.
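The abstract does not spell out the tagging or matching procedure, so the following is only a minimal sketch of what matching automatically generated tags against a domain classification could look like. It assumes that tags for a document are already available from some auto-tagger and that categories are plain text labels. The function names `tokenize` and `classify`, the token-overlap score, and the `threshold` parameter are all hypothetical and are not taken from the paper.

```python
def tokenize(text):
    """Lowercase a label and split it into a set of word tokens."""
    return set(text.lower().replace("-", " ").split())


def classify(tags, categories, threshold=0.0):
    """Match a document's tags against category labels (illustrative only).

    Score = fraction of a category label's tokens that also occur
    among the document's tag tokens. This scoring rule is an assumption.
    Returns (category, score) pairs with score > threshold, best first.
    """
    tag_tokens = set()
    for tag in tags:
        tag_tokens |= tokenize(tag)

    scored = []
    for category in categories:
        label_tokens = tokenize(category)
        if not label_tokens:
            continue  # skip empty labels to avoid division by zero
        score = len(tag_tokens & label_tokens) / len(label_tokens)
        if score > threshold:
            scored.append((category, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical tags and category labels, invented for illustration.
tags = ["machine learning", "neural networks", "classification"]
categories = ["Machine Learning", "Public Health", "Information Retrieval"]
print(classify(tags, categories))
```

No labeled training documents are needed for this kind of matching, which is why it fits a cold-start setting; a real system would likely need stemming, synonym handling, and a tuned threshold.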
Index Terms
- Automatic classification of documents in cold-start scenarios