DOI: 10.1145/3097983.3098159
research-article

Large Scale Sentiment Learning with Limited Labels

Published: 13 August 2017

Abstract

Sentiment analysis is an important task for gaining insights from the huge volume of opinions generated on social media every day. Although there is a lot of work on sentiment analysis, there are not many datasets available for developing and evaluating new methods. To the best of our knowledge, the largest dataset for sentiment analysis is TSentiment [8], a dataset of 1.6 million machine-annotated tweets covering a period of about 3 months in 2009. This dataset, however, covers too short a period and is therefore insufficient for studying heterogeneous, fast-evolving streams. Therefore, we annotated the Twitter dataset of 2015 (228 million tweets without retweets and 275 million with retweets) and we make it publicly available for research. For the annotation we leverage the power of unlabeled data together with labeled data, using semi-supervised learning and, in particular, Self-Learning and Co-Training. Our main contribution is the provision of the TSentiment15 dataset together with insights from the analysis, which includes both batch and stream processing of the data. In the former, all labeled and unlabeled data are available to the algorithms from the beginning, whereas in the latter, they are revealed gradually based on their arrival time in the stream.
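To make the Self-Learning idea concrete, here is a minimal batch-mode sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the `TinyNB` classifier, the `self_learning` function, the confidence `threshold` of 0.7, and the toy documents are all hypothetical, and the abstract does not specify the base classifier or any parameter values. The loop trains on the labeled set, labels the unlabeled instances it is confident about, adds them to the training set, and repeats until no further instance qualifies.

```python
import math
from collections import Counter


class TinyNB:
    """Small multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, y in zip(docs, labels):
            self.counts[y].update(doc.split())
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict_proba(self, doc):
        n = sum(self.prior.values())
        log_p = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            lp = math.log(self.prior[c] / n)
            for w in doc.split():
                if w in self.vocab:  # words never seen in training are skipped
                    lp += math.log((self.counts[c][w] + 1) / total)
            log_p[c] = lp
        # Normalise in log space to avoid underflow on longer documents.
        m = max(log_p.values())
        z = sum(math.exp(v - m) for v in log_p.values())
        return {c: math.exp(v - m) / z for c, v in log_p.items()}


def self_learning(labeled, unlabeled, threshold=0.7, max_rounds=10):
    """Grow the labeled set with confident predictions on unlabeled documents.

    `threshold` and `max_rounds` are assumed values chosen for this sketch.
    """
    docs = [d for d, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = TinyNB().fit(docs, labels)
        remaining, added = [], 0
        for doc in pool:
            proba = model.predict_proba(doc)
            label, conf = max(proba.items(), key=lambda kv: kv[1])
            if conf >= threshold:
                docs.append(doc)
                labels.append(label)
                added += 1
            else:
                remaining.append(doc)
        pool = remaining
        if added == 0:  # nothing confident enough: stop early
            break
    return TinyNB().fit(docs, labels), list(zip(docs, labels)), pool


if __name__ == "__main__":
    seed = [("good great happy", "pos"), ("bad awful sad", "neg")]
    stream = ["great happy day", "awful sad day", "day"]
    model, augmented, leftover = self_learning(seed, stream)
    print(augmented)
    print(leftover)
```

In a stream setting, the same loop would presumably be applied to instances as they arrive rather than to a fixed pool; Co-Training differs in that two classifiers trained on different views of the data label instances for each other.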

References

[1]
A. Aue and M. Gamon. Customizing sentiment classifiers to new domains: A case study. In Proceedings of recent advances in natural language processing (RANLP), volume 1, pages 2--1, 2005.
[2]
S. Baccianella, A. Esuli, and F. Sebastiani. Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10, pages 2200--2204, 2010.
[3]
A. Bifet and E. Frank. Sentiment knowledge discovery in twitter streaming data. In International Conference on Discovery Science, pages 1--15. Springer, 2010.
[4]
A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pages 92--100. ACM, 1998.
[5]
S. Dasgupta and V. Ng. Mine the easy, classify the hard: a semi-supervised approach to automatic sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 701--709. Association for Computational Linguistics, 2009.
[6]
S. Fralick. Learning to recognize patterns without a teacher. IEEE Transactions on Information Theory, 13(1):57--64, 1967.
[7]
L. Gatti, M. Guerini, and M. Turchi. Sentiwords: Deriving a high precision and high coverage lexicon for sentiment analysis. IEEE Transactions on Affective Computing, 7(4):409--421, 2016.
[8]
A. Go, R. Bhayani, and L. Huang. Twitter sentiment classification using distant supervision. Processing, pages 1--6, 2009.
[9]
A. Go, R. Bhayani, and L. Huang. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1:12, 2009.
[10]
Y. He and D. Zhou. Self-training from labeled features for sentiment analysis. Information Processing & Management, 47(4):606--616, 2011.
[11]
S. Li, Z. Wang, G. Zhou, and S. Y. M. Lee. Semi-supervised learning for imbalanced sentiment classification. In IJCAI proceedings-international joint conference on artificial intelligence, volume 22, page 1826, 2011.
[12]
Y. Liu, X. Yu, A. An, and X. Huang. Riding the tide of sentiment change: Sentiment analysis with evolving online reviews. World Wide Web, 16(4):477--496, jul 2013.
[13]
M. Lucas and D. Downey. Scaling semi-supervised naive bayes with feature marginals. In ACL (1), pages 343--351, 2013.
[14]
X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, et al. Mllib: Machine learning in apache spark. Journal of Machine Learning Research, 17(34):1--7, 2016.
[15]
S. M. Mohammad, S. Kiritchenko, and X. Zhu. NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. arXiv preprint arXiv:1308.6242, 2013.
[16]
K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using em. Machine learning, 39(2--3):103--134, 2000.
[17]
S. Sedhai and A. Sun. Hspam14: A collection of 14 million tweets for hashtag-oriented spam research. In SIGIR, pages 223--232. ACM, 2015.
[18]
N. F. F. D. Silva, L. F. Coletta, and E. R. Hruschka. A survey and comparative study of tweet sentiment analysis via semi-supervised learning. ACM Computing Surveys (CSUR), 49(1):15, 2016.
[19]
J. Su, J. S. Shirab, and S. Matwin. Large scale text classification using semi-supervised multinomial naive bayes. In ICML, pages 97--104, 2011.
[20]
P. A. Tapia and J. D. Velásquez. Twitter sentiment polarity analysis: A novel approach for improving the automated labeling in a text corpora. In International Conference on Active Media Technology, pages 274--285. Springer, 2014.
[21]
K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 173--180. Association for Computational Linguistics, 2003.
[22]
S. Wagner, M. Zimmermann, E. Ntoutsi, and M. Spiliopoulou. Ageing-based multinomial naive bayes classifiers over opinionated data streams. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2015, Porto, Portugal, September 7-11, 2015, Proceedings, Part I, pages 401--416, 2015.
[23]
S. Wang and C. D. Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In ACL, pages 90--94. Association for Computational Linguistics, 2012.
[24]
R. Xia, C. Wang, X.-Y. Dai, and T. Li. Co-training for semi-supervised sentiment classification based on dual-view bags-of-words representation. In ACL (1), pages 1054--1063, 2015.
[25]
L. Zhao, M. Huang, Z. Yao, R. Su, Y. Jiang, and X. Zhu. Semi-supervised multinomial naive bayes for text classification by leveraging word-level statistical constraint. In AAAI, 2016.
[26]
M. Zimmermann, E. Ntoutsi, and M. Spiliopoulou. A semi-supervised self-adaptive classifier over opinionated streams. In ICDM Workshop, pages 425--432, 2014.

Published In

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2017
2240 pages
ISBN:9781450348874
DOI:10.1145/3097983
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. co-training
  2. self-learning
  3. semi-supervised learning
  4. sentiment analysis

Qualifiers

  • Research-article

Funding Sources

  • ALEXANDRIA
  • German Research Foundation (DFG) project OSCAR

Conference

KDD '17

Acceptance Rates

KDD '17 Paper Acceptance Rate 64 of 748 submissions, 9%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Citations

Cited By

  • (2025) Advancing Sentiment Analysis of Social Media Data: Unveiling Public Perception of Environmental Challenges in Malaysia. International Conference on Innovation, Sustainability, and Applied Sciences (159-167). DOI: 10.1007/978-3-031-68952-9_21. Online publication date: 12-Feb-2025
  • (2024) LexiSNTAGMM: an unsupervised framework for sentiment classification in data from distinct domains, synergistically integrating dictionary-based and machine learning approaches. Social Network Analysis and Mining, 14:1. DOI: 10.1007/s13278-024-01268-z. Online publication date: 18-May-2024
  • (2023) BERT Self-Learning Approach with Limited Labels for Document Classification. Learning and Intelligent Optimization (278-291). DOI: 10.1007/978-3-031-24866-5_21. Online publication date: 5-Feb-2023
  • (2022) Overview of the development of AI dataset annotation. 2022 2nd International Symposium on Artificial Intelligence and its Application on Media (ISAIAM) (174-181). DOI: 10.1109/ISAIAM55748.2022.00041. Online publication date: Jun-2022
  • (2022) A heterogeneous online learning ensemble for non-stationary environments. Knowledge-Based Systems, 188:C. DOI: 10.1016/j.knosys.2019.104983. Online publication date: 21-Apr-2022
  • (2022) A survey on classification techniques for opinion mining and sentiment analysis. Artificial Intelligence Review, 52:3 (1495-1545). DOI: 10.1007/s10462-017-9599-6. Online publication date: 10-Mar-2022
  • (2022) Ensemble Semi-supervised Machine Learning Algorithm for Classifying Complaint Tweets. Machine Intelligence and Smart Systems (65-74). DOI: 10.1007/978-981-16-9650-3_5. Online publication date: 24-May-2022
  • (2021) AraSenCorpus: A Semi-Supervised Approach for Sentiment Annotation of a Large Arabic Text Corpus. Applied Sciences, 11:5 (2434). DOI: 10.3390/app11052434. Online publication date: 9-Mar-2021
  • (2020) Serialized Co-Training-Based Recognition of Medicine Names for Patent Mining and Retrieval. International Journal of Data Warehousing and Mining, 16:3 (87-107). DOI: 10.4018/IJDWM.2020070105. Online publication date: 1-Jul-2020
  • (2020) Unsupervised Genre-Based Multidomain Sentiment Lexicon Learning Using Corpus-Generated Polarity Seed Words. IEEE Access, 8 (118050-118071). DOI: 10.1109/ACCESS.2020.3005242. Online publication date: 2020
