DOI: 10.1145/3573128.3609342
short-paper

Privacy Now or Never: Large-Scale Extraction and Analysis of Dates in Privacy Policy Text

Published: 22 August 2023

ABSTRACT

The General Data Protection Regulation (GDPR) and other recent privacy laws require organizations to post their privacy policies and place specific expectations on organizations' privacy practices. Privacy policies take the form of documents written in natural language, and one of the expectations placed upon them is that they remain up to date. To investigate legal compliance with this recency requirement at a large scale, we create a novel pipeline that includes crawling, regex-based extraction, candidate date classification, and date object creation to extract updated and effective dates from privacy policies written in English. We then analyze patterns in policy dates using four web crawls and find that only about 40% of privacy policies online contain a date, which makes it difficult to assess their regulatory compliance. We also find that updates in privacy policies are temporally concentrated around the passage of laws regulating digital privacy (such as the GDPR), and that more popular domains are more likely both to have policy dates and to update their policies regularly.
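The extraction stages named in the abstract (regex-based extraction, candidate date classification, date object creation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the pattern `DATE_RE`, the cue lists in `CUES`, the function name `extract_policy_dates`, and the 40-character context window are all assumptions made for illustration, and the keyword lookup is only a simple stand-in for the candidate classification step, whose actual method is not described here.

```python
import re
from datetime import date

# Month names mapped to month numbers (English only, as in the abstract).
MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# Hypothetical pattern covering "Month D, YYYY" and "D Month YYYY".
DATE_RE = re.compile(
    r"\b(?:(?P<m1>[A-Z][a-z]+)\s+(?P<d1>\d{1,2}),?\s+(?P<y1>\d{4})"
    r"|(?P<d2>\d{1,2})\s+(?P<m2>[A-Z][a-z]+)\s+(?P<y2>\d{4}))\b"
)

# Assumed cue words; a stand-in for a real candidate date classifier.
CUES = {
    "updated": ("updated", "revised", "modified"),
    "effective": ("effective",),
}


def extract_policy_dates(text, window=40):
    """Return (label, date) pairs for dates preceded by a cue word."""
    results = []
    for m in DATE_RE.finditer(text):
        month = (m.group("m1") or m.group("m2")).lower()
        if month not in MONTHS:
            continue  # capitalized word that is not a month name
        day = int(m.group("d1") or m.group("d2"))
        year = int(m.group("y1") or m.group("y2"))
        try:
            parsed = date(year, MONTHS[month], day)  # date object creation
        except ValueError:
            continue  # impossible calendar date, e.g. February 30
        # Classify the candidate by the text immediately before it.
        context = text[max(0, m.start() - window):m.start()].lower()
        label = next((lab for lab, cues in CUES.items()
                      if any(cue in context for cue in cues)), None)
        if label is not None:
            results.append((label, parsed))
    return results
```

Under these assumptions, a date is kept only when a cue word such as "updated" or "effective" appears in the short window of text before it, so unrelated dates elsewhere in a policy (a founding date, for instance) are discarded.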


    • Published in

      DocEng '23: Proceedings of the ACM Symposium on Document Engineering 2023
      August 2023
      187 pages
      ISBN: 9798400700279
      DOI: 10.1145/3573128

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • short-paper
      • Research
      • Refereed limited

      Acceptance Rates

      • DocEng '23 Paper Acceptance Rate: 9 of 27 submissions, 33%
      • Overall Acceptance Rate: 178 of 537 submissions, 33%