I’ve Read Every Privacy Policy on the Internet; This is What I’ve Learned
We’ve been working on a project to analyze and classify every privacy policy on the Internet. Yeah, that means the roughly 5,000,000,000 web pages in the Common Crawl corpus hosted on AWS. We are classifying what information a given website collects about you and what it chooses to do with that information. Along the way, I’ve learned some new technical skills, ranging from Naive Bayes classification to running a streaming Elastic MapReduce job in Ruby. But I’ve also learned a lot about online privacy and how most websites treat your personal information.
“Non-personally identifying information” actually is
Privacy policies divide the information received into two buckets: personally identifying and non-personally identifying. Most sites then have a policy of protecting that first bucket, not selling or renting it out. As for the other bucket, they give themselves free rein to aggregate it, sell it or plain give it away.
What’s surprising is that personal information like IP addresses and unique tracking cookies is not placed in that first bucket. Every time you or your wife or kids log in to websites on the same machine, those sites all know who you are. What’s more, many people can be identified by the combination of features their browser reveals, starting with its User-Agent string, alone.
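To make that concrete, here is a minimal sketch in Python of how a handful of supposedly "non-personally identifying" values could be combined into a stable identifier. The attribute names and values are made up for illustration, and this is not a description of how any particular site or tracker works.

```python
import hashlib

def browser_fingerprint(attributes):
    """Hash a dict of browser attributes into a short, stable identifier."""
    # Sort the keys so the same attributes always yield the same string.
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Hypothetical values, for illustration only.
visitor = {
    "ip": "203.0.113.7",
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "screen": "1920x1080",
    "timezone": "UTC-5",
}
returning_visitor = dict(visitor)                  # same machine, later visit
other_visitor = dict(visitor, screen="1366x768")   # one attribute differs

# Same attributes give the same ID; a single changed attribute gives a different one.
print(browser_fingerprint(visitor) == browser_fingerprint(returning_visitor))
print(browser_fingerprint(visitor) == browser_fingerprint(other_visitor))
```

No name, email, or login is involved, yet the identifier follows the same machine from visit to visit.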
Putting a pretty face on handing out your personal information
The scariest phrase to find in a privacy policy is “trusted partners.” Your personal information is being shared with, or sold to, said “trusted partners.” Other good legalese techniques include creative uses of the word “unless” or “except,” covering every conceivable case. Take for example the privacy policy of Coupons.com:
We do not share personally identifiable information with our Affiliates or other third-parties for their marketing or promotional uses except as part of a specific program or feature that you have chosen to participate in. For example, we offer coupons on our Sites that require you to fill out an advertiser survey in order to receive the coupon.
Read: “We can share whatever you provide us whenever you touch or interact with our site.”
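If you would rather not hunt for these phrases by eye, a rough Python sketch like the one below can flag the sentences worth rereading. The watch list is an assumption drawn from the phrases called out in this post, not a complete or authoritative list, and it is not the classifier we use in our project.

```python
import re

# Hypothetical watch list, based on the phrases called out in this post.
LOOPHOLE_PATTERNS = [
    r"trusted partners?",
    r"\bunless\b",
    r"\bexcept\b",
    r"\baffiliates?\b",
]

def flag_loopholes(policy_text):
    """Return (phrase, sentence) pairs for sentences containing a watch-list phrase."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", policy_text.strip())
    hits = []
    for sentence in sentences:
        for pattern in LOOPHOLE_PATTERNS:
            match = re.search(pattern, sentence, flags=re.IGNORECASE)
            if match:
                hits.append((match.group(0).lower(), sentence))
    return hits

excerpt = (
    "We do not share personally identifiable information with our Affiliates "
    "or other third-parties for their marketing or promotional uses except as "
    "part of a specific program or feature that you have chosen to participate in. "
    "For example, we offer coupons on our Sites that require you to fill out an "
    "advertiser survey in order to receive the coupon."
)

for phrase, sentence in flag_loopholes(excerpt):
    print(f"{phrase!r} found in: {sentence[:60]}...")
```

A keyword match only tells you where to look; whether the exception swallows the rule still takes a human read.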
Network Effects
Most sites recognize that their users are opposed to their private information being shared with third parties without a good reason (and there are good reasons: we use Stripe for payments and Mixpanel for analytics). While websites often limit or even rule out “third-party sharing,” they opt for allowing the sharing of your information with “affiliates,” meaning companies under the same privacy policy and corporate umbrella. The scary thing about this is the wide reach of some corporate networks. For example, would you have guessed that gamespot.com and cnet.com can share your personal data with one another because they’re both part of CBS Interactive? It’s not uncommon for a website’s affiliate network to include thousands of external organizations in unrelated markets.
Everyone who’s anybody is using a (third-party) tracking service
Nearly every website we looked at is using a third-party tracking service, of which Google Analytics is easily the most popular. A rarely discussed loophole in Internet privacy is that everyone’s online behavior is being gathered across many websites and funneled into these large analytics networks. Does it ever creep you out that Google happily collects massive usage and consumer data from your website (including paid conversions) and then presents you a slice of it?
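You can check a page for yourself. The Python sketch below lists the hosts a page loads scripts from that are not the site’s own. It is a simplification: the sample page is made up, only `<script src>` tags are inspected, and real tools like Ghostery use curated tracker lists rather than a plain host comparison.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class ScriptHostCollector(HTMLParser):
    """Collect the hostnames that <script src="..."> tags point at."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                host = urlparse(src).hostname  # None for relative URLs
                if host:
                    self.hosts.add(host)

def third_party_script_hosts(page_html, first_party_host):
    """Return script hosts that are neither the site nor one of its subdomains."""
    collector = ScriptHostCollector()
    collector.feed(page_html)
    return sorted(
        host for host in collector.hosts
        if host != first_party_host and not host.endswith("." + first_party_host)
    )

# Made-up page, for illustration only.
sample_page = """
<html><head>
<script src="https://www.example.com/app.js"></script>
<script src="https://www.google-analytics.com/analytics.js"></script>
<script src="/local.js"></script>
</head></html>
"""

print(third_party_script_hosts(sample_page, "example.com"))
```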
Don’t want to read Privacy Policies, but do care about Privacy? We’re here to help
- Remove your personal information with SafeShepherd. We opt you out of the data brokers that end up with your information after these privacy-disrespecting sites leak it out.
- Check out Ghostery, which identifies and blocks those third-party trackers.
- If you want to go the extra mile, use Tor, AnonymousSpeech and the new project Privly to hide your IP, identity, and content respectively, while still using all the same services that your privacy-ambivalent brethren benefit from.
- More reading? Reference: knowprivacy.org, a project that categorized 50 policies and helped a lot with our current project and knowledge.
-Ben