ABSTRACT
In this paper we study the privacy preservation properties of a specific technique for query log anonymization: token-based hashing. In this approach, each query is tokenized, and then a secure hash function is applied to each token. We show that statistical techniques may be applied to partially compromise the anonymization. We then analyze the specific risks that arise from these partial compromises, focusing on the revelation of identity from unambiguous names, addresses, and so forth, and on the revelation of facts associated with an identity that are deemed to be highly sensitive. Our goal in this work is twofold: to show that token-based hashing is unsuitable for anonymization, and to present a concrete analysis of specific techniques that may be effective in breaching privacy, against which other anonymization schemes should be measured.
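As a rough illustration of the scheme described above, the following minimal sketch tokenizes each query on whitespace and hashes every token. The keyed HMAC-SHA256 construction, the lowercasing, and all function names are assumptions made for illustration and are not taken from the paper. The second half sketches one simple statistical technique, matching hashed tokens to plaintext tokens by frequency rank against a reference log; it stands in for the class of attacks the abstract mentions and is not necessarily the method the authors use.

```python
import hashlib
import hmac
from collections import Counter


def hash_token(token: str, key: bytes) -> str:
    """Hash one token with a keyed hash (HMAC-SHA256 is an assumed choice)."""
    return hmac.new(key, token.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_query(query: str, key: bytes) -> list[str]:
    """Tokenize on whitespace (an assumed tokenizer) and hash each token."""
    return [hash_token(token, key) for token in query.lower().split()]


def frequency_attack(hashed_log: list[list[str]], reference_log: list[str]) -> dict[str, str]:
    """Guess a hash -> token mapping by aligning frequency ranks.

    hashed_log is the released log (lists of hashed tokens); reference_log is
    a hypothetical unhashed log assumed to have a similar token distribution.
    """
    hashed_counts = Counter(h for query in hashed_log for h in query)
    plain_counts = Counter(t for query in reference_log for t in query.lower().split())
    hashed_ranked = [h for h, _ in hashed_counts.most_common()]
    plain_ranked = [t for t, _ in plain_counts.most_common()]
    # Pair the i-th most frequent hash with the i-th most frequent plain token.
    return dict(zip(hashed_ranked, plain_ranked))


if __name__ == "__main__":
    key = b"secret-key-unknown-to-the-attacker"
    queries = ["hotels", "hotels paris", "hotels paris cheap"]
    released = [anonymize_query(q, key) for q in queries]
    guesses = frequency_attack(released, queries)
    for hashed, token in guesses.items():
        print(hashed[:12], "->", token)
```

Because the hash is deterministic, equal tokens map to equal hashes, so token frequencies and co-occurrences survive anonymization. That is what makes rank matching possible even when the key is never recovered. On real logs the alignment would be noisy, which is consistent with the abstract's description of a partial compromise.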
On anonymizing query logs via token-based hashing