research-article

Automatic classification of documents in cold-start scenarios

Authors:
Ricardo Kawase, Leibniz University of Hanover, Hannover, Germany
Marco Fisichella, Leibniz University of Hanover, Hannover, Germany
Bernardo Pereira Nunes, Leibniz University of Hanover, Hannover, Germany
Kyung-Hun Ha, ESCP Europe, Berlin, Germany
Markus Bick, ESCP Europe, Berlin, Germany

WIMS '13: Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics, June 2013, Article No. 19, Pages 1–10. https://doi.org/10.1145/2479787.2479789

Published: 12 June 2013

ABSTRACT

Document classification is key to ensuring the quality of any digital library. However, classifying documents is a very time-consuming task. In addition, few or none of the documents in a newly created repository are classified. The lack of classification not only prevents users from finding information but also hinders the system's ability to recommend relevant items. Moreover, the lack of classified documents prevents any kind of machine learning algorithm from automatically annotating these items. In this work, we propose a novel approach to automatically classifying documents that differs from previous work in that it exploits the wisdom of the crowds available on the Web. Our proposed strategy adapts an automatic tagging approach combined with a straightforward matching algorithm to classify documents in a given domain classification. To validate our findings, we compared our methods against existing ones and performed a user evaluation with 61 participants to estimate the quality of the classifications. Results show that, in 72% of the cases, the automatic classification is relevant and well accepted by participants. In conclusion, automatic classification can facilitate access to relevant documents.
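
The abstract does not specify the matching algorithm, so the following is only a minimal sketch of how tags from an automatic tagger might be matched against the labels of a domain classification. The token-coverage score, the function names `tokenize` and `classify`, and the example tags and categories are all assumptions made for illustration, not the authors' method.

```python
import re
from typing import List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify(tags: List[str], categories: List[str], top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank category labels by the share of their tokens found among the tags.

    Hypothetical scoring rule: coverage = |category tokens in tags| / |category tokens|.
    """
    # Pool the tokens of every tag assigned to the document.
    tag_tokens: Set[str] = set()
    for tag in tags:
        tag_tokens |= tokenize(tag)

    scored = []
    for category in categories:
        cat_tokens = tokenize(category)
        if not cat_tokens:
            continue  # skip labels with no usable tokens
        score = len(cat_tokens & tag_tokens) / len(cat_tokens)
        if score > 0:
            scored.append((category, score))

    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Illustrative input only: tags as an auto-tagger might produce them.
example_tags = ["marketing", "brand management", "strategy"]
example_categories = ["Marketing", "Human Resource Management", "Finance", "Strategic Management"]
print(classify(example_tags, example_categories, top_k=2))
```

Because this kind of matching needs no previously classified documents, it can run on an empty repository, which is the cold-start setting the abstract describes. An exact-token rule like this one misses synonyms and inflected forms ("strategy" versus "strategic"), so a practical matcher would likely need stemming or a similarity measure.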

Index Terms

Automatic classification of documents in cold-start scenarios
1. Information systems
  1. Information retrieval
  2. Information storage systems
    1. Record storage systems

Published in
WIMS '13: Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics
June 2013
408 pages
ISBN: 9781450318501
DOI: 10.1145/2479787
Conference Chair: David Camacho, Autonomous University of Madrid, Spain
Program Chairs: Rajendra Akerkar, Western Norway Research Institute, Norway; Maria D. Rodriguez Moreno, University of Alcalá, Spain
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
automatic classification
cold-start
digital libraries
information retrieval
user evaluation
Acceptance Rates
WIMS '13 Paper Acceptance Rate: 28 of 72 submissions, 39%
Overall Acceptance Rate: 140 of 278 submissions, 50%