research-article

A multilabel text classification algorithm for labeling risk factors in SEC form 10-K

Published: October 2011

Abstract

This study develops, implements, and evaluates a multilabel text classification algorithm called the multilabel categorical K-nearest neighbor (ML-CKNN). The proposed algorithm is designed to automatically identify 25 types of risk factors with specific meanings reported in Section 1A of SEC form 10-K. The core idea of ML-CKNN is to compute a categorical similarity score for each label from the K nearest neighbors in that category. ML-CKNN is tailored to the goal of extracting risk factors from 10-K filings. The proposed algorithm can perfectly classify 74.94% of risk factors and 98.75% of labels. Moreover, ML-CKNN is empirically shown to outperform ML-KNN and other multilabel algorithms. The extracted risk factors could be valuable to empirical studies in accounting or finance.
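
As an illustration of the categorical scoring idea described in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes bag-of-words cosine similarity, a per-label score equal to the mean similarity of the K most similar training texts carrying that label, and a fixed decision threshold; the abstract specifies none of these choices. The function names, the toy training sentences, and the labels are hypothetical. The two metric helpers reflect one plausible reading of the reported figures: the share of risk factors whose full label set is predicted exactly, and the share of individual label decisions that are correct.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorical_scores(query, train, k):
    """Per-label score: mean similarity of the k nearest training texts
    that carry that label (an assumed form of the 'categorical' score)."""
    q = Counter(query.lower().split())
    sims_by_label = {}
    for text, labels in train:
        sim = cosine(q, Counter(text.lower().split()))
        for label in labels:
            sims_by_label.setdefault(label, []).append(sim)
    return {
        label: sum(sorted(sims, reverse=True)[:k]) / min(k, len(sims))
        for label, sims in sims_by_label.items()
    }


def predict(query, train, k=2, threshold=0.3):
    """Assign every label whose score reaches the (assumed) threshold."""
    scores = categorical_scores(query, train, k)
    return {label for label, score in scores.items() if score >= threshold}


def exact_match_ratio(true_sets, pred_sets):
    """Share of items whose predicted label set equals the true set."""
    return sum(t == p for t, p in zip(true_sets, pred_sets)) / len(true_sets)


def label_accuracy(true_sets, pred_sets, all_labels):
    """Share of individual (item, label) decisions that are correct."""
    correct = sum(
        (label in t) == (label in p)
        for t, p in zip(true_sets, pred_sets)
        for label in all_labels
    )
    return correct / (len(true_sets) * len(all_labels))


# Hypothetical toy data; real risk-factor text and the 25 labels are not shown here.
TRAIN = [
    ("we face intense competition from larger rivals", {"competition"}),
    ("competition in our markets may reduce our margins", {"competition"}),
    ("changes in interest rates could raise our borrowing costs", {"financing"}),
    ("we may be unable to obtain financing on acceptable terms", {"financing"}),
    ("new regulations and competition could raise our costs", {"competition", "regulation"}),
]

if __name__ == "__main__":
    print(predict("intense competition from rivals may reduce margins", TRAIN))
```

Scoring each label only against its own neighbors, rather than against one global neighborhood, keeps a rare label from being crowded out by frequent ones; whether ML-CKNN realizes this in the same way is not stated in the abstract.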



                  • Published in

                    ACM Transactions on Management Information Systems  Volume 2, Issue 3
                    October 2011
                    138 pages
                    ISSN:2158-656X
                    EISSN:2158-6578
                    DOI:10.1145/2019618

                    Copyright © 2011 ACM

                    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                    Publisher

                    Association for Computing Machinery

                    New York, NY, United States

                    Publication History

                    • Revised: 1 August 2011
                    • Accepted: 1 August 2011
                    • Received: 1 May 2011
                    • Published: October 2011


                    Qualifiers

                    • research-article
                    • Research
                    • Refereed
