ABSTRACT
Detecting the presence and amount of private information shared in online media is the first step towards analyzing the information-revealing habits of users in social networks, and a useful method for researchers to study aggregate privacy behavior. In this work, we aim to determine whether text contains private content using our novel learning-based approach, Privacy Detective, which combines topic modeling, named entity recognition, a privacy ontology, sentiment analysis, and text normalization to represent privacy features. Privacy Detective investigates a broader range of privacy concerns than previous approaches, which focus on keyword searching or profile-related properties. We collected 500,000 tweets from 100,000 Twitter users along with other information such as tweet linkages and follower relationships. We reach 95.45% accuracy in a two-class task that separates Twitter users who do not reveal much private information from Twitter users who share sensitive information. We score timelines according to three privacy levels after having Amazon Mechanical Turk (AMT) workers annotate the collected tweets according to privacy categories. Supervised machine learning classification on these annotations reaches 69.63% accuracy on a three-class task. Inter-annotator agreement on timeline privacy scores among the AMT workers and agreement between the workers and our classifiers fall under the same positive agreement level. Additionally, we show that a user's privacy level is correlated with her friends' privacy scores and with the privacy scores of people mentioned in her text, but not with the number of her followers. As such, privacy in social networks appears to be socially constructed, which can have great implications for privacy-enhancing technologies and educational interventions.
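To make the agreement comparison concrete, the following is a minimal sketch of how agreement between two sets of three-level timeline privacy scores could be computed. It assumes Cohen's kappa as the agreement statistic, which the abstract does not state; the function name `cohen_kappa` and the label lists are hypothetical illustrations, not the authors' data or code.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels.

    Assumption: kappa is used here only as an illustrative agreement
    statistic; the abstract does not name the measure it relies on.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical privacy levels (0 = low, 1 = medium, 2 = high) for 8 timelines.
annotator = [0, 1, 2, 2, 1, 0, 1, 2]
classifier = [0, 1, 2, 1, 1, 0, 2, 2]
kappa = cohen_kappa(annotator, classifier)
print(round(kappa, 4))
```

For the hypothetical labels above, observed agreement is 6/8 and chance agreement is 22/64, so kappa is 13/21, or about 0.619. Running the same function on annotator-versus-annotator pairs and annotator-versus-classifier pairs would allow the two agreement values to be compared on one scale.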
Index Terms
- Privacy Detective: Detecting Private Information and Collective Privacy Behavior in a Large Social Network