research-article

Large Scale Sentiment Learning with Limited Labels

Authors: Vasileios Iosifidis, Eirini Ntoutsi

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 1823 - 1832

https://doi.org/10.1145/3097983.3098159

Published: 13 August 2017

Abstract

Sentiment analysis is an important task for gaining insights from the huge volume of opinions generated on social media every day. Although there is a lot of work on sentiment analysis, few datasets are available for developing and evaluating new methods. To the best of our knowledge, the largest dataset for sentiment analysis is TSentiment [8], a dataset of 1.6 million machine-annotated tweets covering a period of about 3 months in 2009. This period, however, is too short, and the dataset is therefore insufficient for studying heterogeneous, fast-evolving streams. We therefore annotated the Twitter dataset of 2015 (228 million tweets without retweets and 275 million with retweets) and make it publicly available for research. For the annotation we leverage the power of unlabeled data together with labeled data using semi-supervised learning, in particular Self-Learning and Co-Training. Our main contribution is the TSentiment15 dataset together with insights from the analysis, which covers both batch and stream processing of the data. In the former, all labeled and unlabeled data are available to the algorithms from the beginning, whereas in the latter, they are revealed gradually according to their arrival time in the stream.

References

[1]

A. Aue and M. Gamon. Customizing sentiment classifiers to new domains: A case study. In Proceedings of recent advances in natural language processing (RANLP), volume 1, pages 2--1, 2005.

[2]

S. Baccianella, A. Esuli, and F. Sebastiani. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10, pages 2200--2204, 2010.

[3]

A. Bifet and E. Frank. Sentiment knowledge discovery in twitter streaming data. In International Conference on Discovery Science, pages 1--15. Springer, 2010.

[4]

A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pages 92--100. ACM, 1998.

[5]

S. Dasgupta and V. Ng. Mine the easy, classify the hard: a semi-supervised approach to automatic sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 701--709. Association for Computational Linguistics, 2009.

[6]

S. Fralick. Learning to recognize patterns without a teacher. IEEE Transactions on Information Theory, 13(1):57--64, 1967.

[7]

L. Gatti, M. Guerini, and M. Turchi. Sentiwords: Deriving a high precision and high coverage lexicon for sentiment analysis. IEEE Transactions on Affective Computing, 7(4):409--421, 2016.

[8]

A. Go, R. Bhayani, and L. Huang. Twitter sentiment classification using distant supervision. Processing, pages 1--6, 2009.

[9]

A. Go, R. Bhayani, and L. Huang. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1:12, 2009.

[10]

Y. He and D. Zhou. Self-training from labeled features for sentiment analysis. Information Processing & Management, 47(4):606--616, 2011.

[11]

S. Li, Z. Wang, G. Zhou, and S. Y. M. Lee. Semi-supervised learning for imbalanced sentiment classification. In IJCAI proceedings-international joint conference on artificial intelligence, volume 22, page 1826, 2011.

[12]

Y. Liu, X. Yu, A. An, and X. Huang. Riding the tide of sentiment change: Sentiment analysis with evolving online reviews. World Wide Web, 16(4):477--496, jul 2013.

[13]

M. Lucas and D. Downey. Scaling semi-supervised naive bayes with feature marginals. In ACL (1), pages 343--351, 2013.

[14]

X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, et al. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research, 17(34):1--7, 2016.

[15]

S. M. Mohammad, S. Kiritchenko, and X. Zhu. NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. arXiv preprint arXiv:1308.6242, 2013.

[16]

K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2--3):103--134, 2000.

[17]

S. Sedhai and A. Sun. HSpam14: A collection of 14 million tweets for hashtag-oriented spam research. In SIGIR, pages 223--232. ACM, 2015.

[18]

N. F. F. D. Silva, L. F. Coletta, and E. R. Hruschka. A survey and comparative study of tweet sentiment analysis via semi-supervised learning. ACM Computing Surveys (CSUR), 49(1):15, 2016.

[19]

J. Su, J. S. Shirab, and S. Matwin. Large scale text classification using semi-supervised multinomial naive Bayes. In ICML, pages 97--104, 2011.

[20]

P. A. Tapia and J. D. Velásquez. Twitter sentiment polarity analysis: A novel approach for improving the automated labeling in a text corpora. In International Conference on Active Media Technology, pages 274--285. Springer, 2014.

[21]

K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 173--180. Association for Computational Linguistics, 2003.

[22]

S. Wagner, M. Zimmermann, E. Ntoutsi, and M. Spiliopoulou. Ageing-based multinomial naive bayes classifiers over opinionated data streams. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2015, Porto, Portugal, September 7-11, 2015, Proceedings, Part I, pages 401--416, 2015.

[23]

S. Wang and C. D. Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In ACL, pages 90--94. Association for Computational Linguistics, 2012.

[24]

R. Xia, C. Wang, X.-Y. Dai, and T. Li. Co-training for semi-supervised sentiment classification based on dual-view bags-of-words representation. In ACL (1), pages 1054--1063, 2015.

[25]

L. Zhao, M. Huang, Z. Yao, R. Su, Y. Jiang, and X. Zhu. Semi-supervised multinomial naive Bayes for text classification by leveraging word-level statistical constraint. In AAAI, 2016.

[26]

M. Zimmermann, E. Ntoutsi, and M. Spiliopoulou. A semi-supervised self-adaptive classifier over opinionated streams. In ICDM Workshop, pages 425--432, 2014.

Cited By

Zahra A, Ma L, Khong K (2025). Advancing Sentiment Analysis of Social Media Data: Unveiling Public Perception of Environmental Challenges in Malaysia. International Conference on Innovation, Sustainability, and Applied Sciences, pages 159--167. Online publication date: 12-Feb-2025.
https://doi.org/10.1007/978-3-031-68952-9_21
Bashiri H, Naderi H (2024). LexiSNTAGMM: an unsupervised framework for sentiment classification in data from distinct domains, synergistically integrating dictionary-based and machine learning approaches. Social Network Analysis and Mining, 14:1. Online publication date: 18-May-2024.
https://doi.org/10.1007/s13278-024-01268-z
Joaquim C, Faleiros T (2023). BERT Self-Learning Approach with Limited Labels for Document Classification. Learning and Intelligent Optimization, pages 278--291. Online publication date: 5-Feb-2023.
https://doi.org/10.1007/978-3-031-24866-5_21

Index Terms

Large Scale Sentiment Learning with Limited Labels
1. Computing methodologies
  1. Machine learning
    1. Learning settings
      1. Semi-supervised learning settings
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Sentiment analysis

Published In

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 2017

2240 pages

ISBN:9781450348874

DOI:10.1145/3097983

General Chairs: Stan Matwin (Dalhousie University), Shipeng Yu (LinkedIn), Faisal Farooq (IBM)

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

ALEXANDRIA
German Research Foundation (DFG) project OSCAR

Conference

KDD '17

KDD '17: The 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 13 - 17, 2017

Halifax, NS, Canada

Acceptance Rates

KDD '17 Paper Acceptance Rate 64 of 748 submissions, 9%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics

Total Citations: 23
Total Downloads: 789
Downloads (Last 12 months): 24
Downloads (Last 6 weeks): 0

Reflects downloads up to 19 Feb 2025

Citations

Cited By

Ao B, Fan B (2022). Overview of the development of AI dataset annotation. 2022 2nd International Symposium on Artificial Intelligence and its Application on Media (ISAIAM), pages 174--181. Online publication date: Jun-2022.
https://doi.org/10.1109/ISAIAM55748.2022.00041
Idrees M, Minku L, Stahl F, Badii A (2022). A heterogeneous online learning ensemble for non-stationary environments. Knowledge-Based Systems, 188:C. Online publication date: 21-Apr-2022.
https://dl.acm.org/doi/10.1016/j.knosys.2019.104983
Hemmatian F, Sohrabi M (2022). A survey on classification techniques for opinion mining and sentiment analysis. Artificial Intelligence Review, 52:3, pages 1495--1545. Online publication date: 10-Mar-2022.
https://dl.acm.org/doi/10.1007/s10462-017-9599-6
Yenkar P, Sawarkar S (2022). Ensemble Semi-supervised Machine Learning Algorithm for Classifying Complaint Tweets. Machine Intelligence and Smart Systems, pages 65--74. Online publication date: 24-May-2022.
https://doi.org/10.1007/978-981-16-9650-3_5
Al-Laith A, Shahbaz M, Alaskar H, Rehmat A (2021). AraSenCorpus: A Semi-Supervised Approach for Sentiment Annotation of a Large Arabic Text Corpus. Applied Sciences, 11:5 (2434). Online publication date: 9-Mar-2021.
https://doi.org/10.3390/app11052434
Deng N, Xiong C (2020). Serialized Co-Training-Based Recognition of Medicine Names for Patent Mining and Retrieval. International Journal of Data Warehousing and Mining, 16:3, pages 87--107. Online publication date: 1-Jul-2020.
https://dl.acm.org/doi/10.4018/IJDWM.2020070105
Sanagar S, Gupta D (2020). Unsupervised Genre-Based Multidomain Sentiment Lexicon Learning Using Corpus-Generated Polarity Seed Words. IEEE Access, 8, pages 118050--118071. Online publication date: 2020.
https://doi.org/10.1109/ACCESS.2020.3005242
