Abstract
This study develops, implements, and evaluates a multilabel text classification algorithm called multilabel categorical K-nearest neighbor (ML-CKNN). The algorithm is designed to automatically identify 25 types of risk factors, each with a specific meaning, reported in Section 1A of SEC Form 10-K. The core idea of ML-CKNN is to compute a categorical similarity score for each label from the K nearest neighbors within that label's category. ML-CKNN is tailored to the task of extracting risk factors from 10-K filings. The proposed algorithm perfectly classifies 74.94% of risk factors and correctly assigns 98.75% of labels. Moreover, ML-CKNN is empirically shown to outperform ML-KNN and other multilabel algorithms. The extracted risk factors could be valuable to empirical studies in accounting or finance.
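The per-label scoring idea described above can be illustrated with a minimal sketch. The abstract does not give the actual ML-CKNN formula, so everything specific here is an assumption made for illustration: bag-of-words vectors, cosine similarity, averaging the top K similarities within each category, a fixed decision threshold, and the toy documents and label names. The two evaluation helpers likewise assume that "perfectly classified risk factors" means an exact match of the full label set and that the label figure is per-label accuracy.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def categorical_scores(query, train, k):
    """For each label, average the k highest similarities between the query
    and the training documents that carry that label.
    ASSUMPTION: mean-of-top-k is a stand-in, not the paper's formula."""
    labels = set().union(*(labs for _, labs in train))
    scores = {}
    for label in labels:
        sims = sorted(
            (cosine(query, vec) for vec, labs in train if label in labs),
            reverse=True,
        )
        top = sims[:k]
        scores[label] = sum(top) / len(top)
    return scores


def predict(query, train, k, threshold):
    """Assign every label whose categorical score reaches the threshold.
    ASSUMPTION: a single global threshold, chosen arbitrarily here."""
    scores = categorical_scores(query, train, k)
    return {label for label, s in scores.items() if s >= threshold}


def exact_match_ratio(true_sets, pred_sets):
    """Share of items whose predicted label set equals the true set."""
    hits = sum(t == p for t, p in zip(true_sets, pred_sets))
    return hits / len(true_sets)


def label_accuracy(true_sets, pred_sets, all_labels):
    """Share of (item, label) decisions that are correct."""
    correct = sum(
        (lab in t) == (lab in p)
        for t, p in zip(true_sets, pred_sets)
        for lab in all_labels
    )
    return correct / (len(true_sets) * len(all_labels))


# Hypothetical toy data; the labels are not the paper's 25 risk types.
train = [
    (Counter("interest rate increase debt".split()), {"financial"}),
    (Counter("debt covenant default".split()), {"financial"}),
    (Counter("lawsuit patent litigation".split()), {"legal"}),
    (Counter("litigation debt settlement".split()), {"legal", "financial"}),
]
query = Counter("patent litigation lawsuit cost".split())

scores = categorical_scores(query, train, k=2)
predicted = predict(query, train, k=2, threshold=0.5)

true_sets = [{"legal"}, {"financial"}]
pred_sets = [{"legal"}, {"financial", "legal"}]
emr = exact_match_ratio(true_sets, pred_sets)
acc = label_accuracy(true_sets, pred_sets, {"legal", "financial"})
```

Scoring each label only against neighbors drawn from its own category means a rare label is not crowded out by documents from frequent categories, which is one plausible reading of why a per-category score suits a label set of this size. Under the assumed definitions, the exact-match ratio is the stricter of the two measures, which is consistent with the gap between the two figures reported above.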
Index Terms
- A multilabel text classification algorithm for labeling risk factors in SEC form 10-K