"Quantifying the effect of in-domain distributed word representations: " by Vinayshekhar Bannihatti Kumar, Abhilasha Ravichander et al.
 

Computer Science

Quantifying the effect of in-domain distributed word representations: A study of privacy policies

Vinayshekhar Bannihatti Kumar, School of Computer Science
Abhilasha Ravichander, School of Computer Science
Peter Story, School of Computer Science
Norman Sadeh, School of Computer Science

Abstract

Privacy policies are documents that describe what data a website or app collects and how that data is handled. They are often long and difficult to understand. Recently, researchers have turned to Natural Language Processing (NLP) to automatically extract statements from the text of these policies. This article reports on a study evaluating the benefits of word embeddings in this endeavor, and in particular of privacy-specific word embeddings. Specifically, we use 150,000 privacy policies to build word vectors in an unsupervised manner. Evaluation is conducted on the OPP-115 corpus of privacy policy annotations. By building privacy-specific embeddings, we hope to accelerate research at the intersection of privacy policies and language technologies.