A multilabel text classification algorithm for labeling risk factors in SEC form 10-K

Authors:
Ke-Wei Huang, National University of Singapore, Singapore
Zhuolun Li, National University of Singapore, Singapore

ACM Transactions on Management Information Systems, Volume 2, Issue 3, Article No.: 18, pp 1–19. https://doi.org/10.1145/2019618.2019624

Published: 18 October 2011

Abstract

This study develops, implements, and evaluates a multilabel text classification algorithm called the multilabel categorical K-nearest neighbor (ML-CKNN). The proposed algorithm is designed to automatically identify 25 types of risk factors with specific meanings reported in Section 1A of SEC form 10-K. The idea of ML-CKNN is to compute a categorical similarity score for each label from the K nearest neighbors in that category. ML-CKNN is tailored to the goal of extracting risk factors from 10-Ks. The proposed algorithm can perfectly classify 74.94% of risk factors and 98.75% of labels. Moreover, ML-CKNN is empirically shown to outperform ML-KNN and other multilabel algorithms. The extracted risk factors could be valuable to empirical studies in accounting or finance.
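
The per-label scoring idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the text representation, the similarity measure, how the K neighbor similarities are aggregated, or how a score becomes a label decision. The sketch assumes bag-of-words cosine similarity, a simple mean over the K nearest neighbors within each category, and a fixed threshold. The toy training sentences, the label names, and the function names are all hypothetical. The two metric helpers show one plausible reading of the two reported percentages: the share of risk factors whose whole label set is predicted exactly, and the share of individual label decisions that are correct.

```python
import math
from collections import Counter

# Hypothetical toy data: (risk factor text, set of labels).
TRAIN = [
    ("we face intense competition from larger rivals", {"competition"}),
    ("competition may reduce our prices and margins", {"competition"}),
    ("changes in government regulation could increase our costs", {"regulation"}),
    ("new regulation and competition could harm our business",
     {"regulation", "competition"}),
]


def bow(text):
    """Lowercased bag-of-words term counts (an assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def categorical_scores(query, train, k=2):
    """Per label: mean similarity of the k nearest training texts that
    carry that label (the mean is an assumed aggregation)."""
    q = bow(query)
    sims_by_label = {}
    for text, labels in train:
        s = cosine(q, bow(text))
        for label in labels:
            sims_by_label.setdefault(label, []).append(s)
    return {
        label: sum(sorted(sims, reverse=True)[:k]) / min(k, len(sims))
        for label, sims in sims_by_label.items()
    }


def predict(query, train, k=2, threshold=0.4):
    """Assign every label whose score reaches the (assumed) threshold."""
    scores = categorical_scores(query, train, k)
    return {label for label, score in scores.items() if score >= threshold}


def exact_match_ratio(true_sets, pred_sets):
    """Share of items whose predicted label set equals the true set."""
    return sum(t == p for t, p in zip(true_sets, pred_sets)) / len(true_sets)


def label_accuracy(true_sets, pred_sets, all_labels):
    """Share of individual (item, label) decisions that are correct."""
    correct = sum(
        (label in t) == (label in p)
        for t, p in zip(true_sets, pred_sets)
        for label in all_labels
    )
    return correct / (len(true_sets) * len(all_labels))


if __name__ == "__main__":
    print(predict("intense competition could reduce our margins", TRAIN))
```

Scoring each label only against neighbors drawn from its own category means a rare label is not crowded out by frequent ones, which is one reason a per-category neighborhood may suit a setting with 25 unevenly distributed risk types.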

Published in

ACM Transactions on Management Information Systems Volume 2, Issue 3
October 2011
138 pages
ISSN:2158-656X
EISSN:2158-6578
DOI:10.1145/2019618

Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Revised: 1 August 2011
- Accepted: 1 August 2011
- Received: 1 May 2011
- Published: 18 October 2011
Author Tags
Text classification
annual reports
multilabel classification
risk factors
text mining
Qualifiers
- research-article
- Research
- Refereed