Computer Science
Quantifying the effect of in-domain distributed word representations: A study of privacy policies
Abstract
Privacy policies are documents that describe what data is collected by a website or an app and how that data is handled. Privacy policies are often long and difficult to understand. Recently people have started to turn to Natural Language Processing (NLP) to automatically extract statements from the text of these policies. This article reports on a study to evaluate the benefits of using word embeddings in this endeavor. Specifically, we use 150,000 privacy policies to build word vectors in an unsupervised manner. This includes evaluating the benefits of privacy specific word embeddings. Evaluation is conducted on the OPP-115 corpus of privacy policy annotations. By building privacy-specific embeddings we hope to accelerate research at the intersection of privacy policies and language technologies.