Computer Science

A Tale of Two Regulatory Regimes: Creation and Analysis of a Bilingual Privacy Policy Corpus

Siddhant Arora
Henry Hosseini
Christine Utz
Vinayshekhar Bannihatti Kumar
Jasmine Mangat
Peter Story, Clark University
Rex Chen
Martin Degeling
Tom Norton
Thomas Hupperich

Document Type

Conference Paper

Abstract

Over the past decade, researchers have started to explore the use of NLP to develop tools aimed at helping the public, vendors, and regulators analyze disclosures made in privacy policies. With the introduction of new privacy regulations, the language of privacy policies is also evolving, and disclosures made by the same organization are not always the same in different languages, especially when used to communicate with users who fall under different jurisdictions. This work explores the use of language technologies to capture and analyze these differences at scale. We introduce an annotation scheme designed to capture the nuances of two new landmark privacy regulations, namely the EU's GDPR and California's CCPA/CPRA. We then introduce the first bilingual corpus of mobile app privacy policies, consisting of 64 privacy policies in English (292K words) and 91 privacy policies in German (478K words), with manual annotations for 8K and 19K fine-grained data practices, respectively. The annotations are used to develop computational methods that can automatically extract “disclosures” from privacy policies. Analysis of a subset of 59 “semi-parallel” policies reveals differences that can be attributed to different regulatory regimes, suggesting that systematic analysis of policies using automated language technologies is indeed a worthwhile endeavor. © European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0.

Publication Title

13th International Conference on Language Resources and Evaluation

Publication Date

6-2022

First Page

5460

Last Page

5472

ISBN

9791095546726

Keywords

bilingual, CCPA, CPRA, GDPR, privacy policy, privacy policy corpus, text corpus

Repository Citation

Arora, Siddhant; Hosseini, Henry; Utz, Christine; Kumar, Vinayshekhar Bannihatti; Mangat, Jasmine; Story, Peter; Chen, Rex; Degeling, Martin; Norton, Tom; and Hupperich, Thomas, "A Tale of Two Regulatory Regimes: Creation and Analysis of a Bilingual Privacy Policy Corpus" (2022). Computer Science. 1.
https://commons.clarku.edu/faculty_computer_sciences/1

APA Citation

Arora, S., Hosseini, H., Utz, C., Bannihatti, V. K., Dhellemmes, T., Ravichander, A., ... & Sadeh, N. (2022, June). A Tale of Two Regulatory Regimes: Creation and Analysis of a Bilingual Privacy Policy Corpus. In LREC proceedings.

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License

Included in

Computer Sciences Commons
