On anonymizing query logs via token-based hashing

Authors:

Andrew Tomkins

WWW '07: Proceedings of the 16th international conference on World Wide Web

Pages 629 - 638

https://doi.org/10.1145/1242572.1242657

Published: 08 May 2007

Abstract

In this paper we study the privacy preservation properties of a specific technique for query log anonymization: token-based hashing. In this approach, each query is tokenized, and then a secure hash function is applied to each token. We show that statistical techniques may be applied to partially compromise the anonymization. We then analyze the specific risks that arise from these partial compromises, focusing on the revelation of identity from unambiguous names, addresses, and so forth, and the revelation of facts associated with an identity that are deemed to be highly sensitive. Our goal in this work is twofold: to show that token-based hashing is unsuitable for anonymization, and to present a concrete analysis of specific techniques that may be effective in breaching privacy, against which other anonymization schemes should be measured.
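The scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not name a tokenizer or a hash function, so the whitespace tokenizer, the use of HMAC-SHA256, the key, and the sample queries below are all assumptions made for illustration, not the setup studied in the paper. The sketch shows only why a deterministic per-token hash leaves token frequencies intact, which is the kind of signal a statistical attack could plausibly use.

```python
import hashlib
import hmac
from collections import Counter
from typing import List

# Hypothetical secret key; the abstract does not say whether the hash is keyed.
SECRET_KEY = b"example-key-not-from-the-paper"


def hash_token(token: str) -> str:
    """Map one token to a hex digest using HMAC-SHA256 (an assumed choice)."""
    return hmac.new(SECRET_KEY, token.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_query(query: str) -> List[str]:
    """Tokenize on whitespace (an assumed tokenizer) and hash each token."""
    return [hash_token(tok) for tok in query.lower().split()]


# Made-up sample queries, used only to exercise the functions above.
queries = ["cheap flights", "cheap hotels", "flights to banff"]
hashed = [anonymize_query(q) for q in queries]

# The mapping is deterministic, so the same token always gets the same hash
# and the frequency profile of the log survives anonymization.
plain_counts = Counter(tok for q in queries for tok in q.lower().split())
hashed_counts = Counter(tok for q in hashed for tok in q)
```

In this toy log, the hashed form of "cheap" still appears twice and still co-occurs with the hashed forms of "flights" and "hotels"; only the surface strings are hidden.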

Index Terms

On anonymizing query logs via token-based hashing
1. Information systems
  1. Information retrieval
  2. Information storage systems

Published In

WWW '07: Proceedings of the 16th international conference on World Wide Web

May 2007

1382 pages

ISBN:9781595936547

DOI:10.1145/1242572

General Chairs: Carey Williamson (University of Calgary, Canada), Mary Ellen Zurko (IBM, USA)

Program Chairs: Peter Patel-Schneider (Bell Labs Research, USA), Prashant Shenoy (University of Massachusetts at Amherst, USA)

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

WWW'07

Sponsor:

ACM

WWW'07: 16th International World Wide Web Conference

May 8 - 12, 2007

Banff, Alberta, Canada

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
