DOI: 10.1145/1242572.1242657

On anonymizing query logs via token-based hashing

Published: 08 May 2007

Abstract

In this paper we study the privacy preservation properties of a specific technique for query log anonymization: token-based hashing. In this approach, each query is tokenized, and then a secure hash function is applied to each token. We show that statistical techniques may be applied to partially compromise the anonymization. We then analyze the specific risks that arise from these partial compromises, focusing on the revelation of identity from unambiguous names, addresses, and so forth, and on the revelation of facts associated with an identity that are deemed to be highly sensitive. Our goal in this work is twofold: to show that token-based hashing is unsuitable for anonymization, and to present a concrete analysis of specific techniques that may be effective in breaching privacy, against which other anonymization schemes should be measured.
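
The scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not name a hash function or a tokenizer, so SHA-256, lowercasing, whitespace tokenization, the optional salt, and the toy log below are all assumptions made for illustration, not the authors' setup. The sketch shows one property that any deterministic per-token hash has: identical tokens always map to identical digests, so token frequencies carry over into the hashed log, which is the kind of signal a statistical analysis could use.

```python
import hashlib
from collections import Counter


def hash_token(token: str, salt: bytes = b"") -> str:
    """Hash a single token with SHA-256 (the hash choice is an assumption)."""
    return hashlib.sha256(salt + token.encode("utf-8")).hexdigest()


def anonymize_query(query: str, salt: bytes = b"") -> list[str]:
    """Lowercase, split on whitespace (an assumption), and hash each token."""
    return [hash_token(tok, salt) for tok in query.lower().split()]


# Hypothetical toy log, invented for illustration only.
log = ["cheap flights boston", "boston weather", "cheap hotels boston"]
hashed_log = [anonymize_query(q) for q in log]

# Hashing is deterministic, so every occurrence of a token maps to the same
# digest and the token frequency distribution survives anonymization.
plain_counts = Counter(tok for q in log for tok in q.split())
hashed_counts = Counter(h for q in hashed_log for h in q)
```

In this toy log, the digest of the most frequent plaintext token is also the most frequent digest, so anyone holding a comparable unhashed log could try to line up the two frequency rankings. Adding a secret salt changes the digests but not their counts.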

      Published In

      WWW '07: Proceedings of the 16th international conference on World Wide Web
      May 2007
      1382 pages
      ISBN:9781595936547
      DOI:10.1145/1242572
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. hash-based anonymization
      2. privacy
      3. query logs

      Qualifiers

      • Article

      Conference

      WWW'07: 16th International World Wide Web Conference
      May 8 - 12, 2007
      Banff, Alberta, Canada

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
