The false promises of data anonymisation

26 Feb. 2015

Find out what happens to our data when it is collected and 'anonymised' in the age of big data analytics, where what counts as "personal" is no longer so straightforward.

Last Updated: 01 Oct 2015

Author: Morana Miljanovic

The not-so-new false promise: putting the puzzle together

Eight years ago, America Online (AOL) publicly released the 'anonymised' web search queries of 650,000 users of its search engine. The queries were anonymised by removing some "personally identifying information", such as AOL usernames and users' IP addresses, and assigning unique identifying numbers instead. AOL claimed user privacy would be preserved, since queries could not be linked to individuals. It was a debacle. The New York Times reporters Michael Barbaro and Tom Zeller re-identified User 4417749 as Thelma Arnold, a sixty-two-year-old widow from Lilburn, Georgia. Her queries included “numb fingers,” “60 single men,” and “dog that urinates on everything.”

Similarly, Netflix, the “world’s largest online movie rental service,” published 100 million user movie ratings and promised a million dollars to whoever could improve Netflix's movie recommendation algorithm by 10%. Like AOL, Netflix “anonymised” the user data. Paul Ohm, associate professor at Colorado Law School, explains that “[t]hus, researchers could tell that user 1337 had rated Gattaca a 4 on March 3, 2003, and Minority Report a 5 on November 10, 2003.” Shortly after the release of the data, researchers from the University of Texas, Arvind Narayanan and Vitaly Shmatikov, showed how easy it is for an algorithm to uniquely re-identify an individual Netflix user by name, with only a tiny bit of outside (auxiliary, background) information about her. This additional information does not need to be precise. In the Netflix example, it can be learned from a conversation about movies, a personal blog, Google searches or IMDB movie ratings.
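To make this concrete, here is a minimal Python sketch with invented toy data – not the actual Narayanan–Shmatikov algorithm – showing the core idea: a handful of imprecise auxiliary ratings is scored against every record in an "anonymised" release, and the best-scoring record is the likely match.

```python
# Toy sketch of auxiliary-information matching (hypothetical data, not the
# algorithm used in the study): given a few approximately-known ratings about
# a person, score every "anonymised" record by how well it matches.

from datetime import date

# "Anonymised" release: numeric user IDs mapped to {movie: (rating, date)}.
released = {
    1337: {"Gattaca": (4, date(2003, 3, 3)), "Minority Report": (5, date(2003, 11, 10))},
    2048: {"Gattaca": (2, date(2004, 1, 7)), "Amelie": (5, date(2004, 2, 1))},
}

# Auxiliary knowledge gleaned from, say, a blog post: imprecise ratings and dates.
auxiliary = {"Gattaca": (4, date(2003, 3, 1)), "Minority Report": (5, date(2003, 11, 14))}

def match_score(record, aux, date_tolerance_days=14):
    """Count the auxiliary items that a released record is consistent with."""
    score = 0
    for movie, (aux_rating, aux_date) in aux.items():
        if movie in record:
            rating, rated_on = record[movie]
            if abs(rating - aux_rating) <= 1 and abs((rated_on - aux_date).days) <= date_tolerance_days:
                score += 1
    return score

# The best-scoring "anonymous" user ID is the likely re-identification.
best_id = max(released, key=lambda uid: match_score(released[uid], auxiliary))
print(best_id)  # -> 1337
```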

In the year 2015, background information is easy to find. Whenever you expose yourself to online traffic analysis, geo-location and sensor technologies, whenever you volunteer details of your movements, moods, purchases, opinions, associations, health condition and wallet condition, whenever you fill in your demographic data for loyalty programmes and the like, you leave traces online and offline. These traces are connected: by combining databases – from freely accessible public records to lists of vulnerable individuals sold by companies – through technologies that match online and offline data, and with the help of tracking technologies.

Our location histories alone not only reveal our whereabouts and movements, but can also tell our gender, marital status, occupation, age, where we work and live, and whom we date. The latter two, for example, are indicated by where your cell phone rests at night. Fed to a machine learning algorithm, location histories can be used to predict our future locations and itineraries. An analysis of our locations over time tells what behaviour is typical or atypical for a given person. Our movements alone have a high degree of regularity (93% predictability), and our uniqueness as individuals is amplified by technology. Predictions about us are “improved” when the location data of friends and other people in our network are added. Individual mobility traces can reveal our visits to clinics, churches, lovers or gay bars. One person's location on four occasions can be enough to identify that person.
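As a toy illustration of how much a bare location trace gives away, the following sketch (hypothetical data, a deliberately naive method) guesses an "anonymous" phone owner's home simply by taking the place where the phone is most often seen at night.

```python
# Minimal sketch (hypothetical data): the place where a phone is most often
# seen at night is a strong guess for "home", no name attached to the trace.

from collections import Counter

# (hour of day, cell tower / area id) observations for one "anonymous" phone
pings = [(1, "tower_17"), (2, "tower_17"), (3, "tower_17"), (9, "tower_42"),
         (14, "tower_42"), (23, "tower_17"), (12, "tower_42"), (0, "tower_17")]

def likely_home(observations, night_hours=range(0, 6)):
    """Most frequent location during night hours - a naive 'home' inference."""
    night_locations = [loc for hour, loc in observations if hour in night_hours]
    return Counter(night_locations).most_common(1)[0][0]

print(likely_home(pings))  # -> 'tower_17'
```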

The more data gets collected, “anonymised” (e.g. the last three digits of a zip code or an IP address removed) and circulated in ad exchanges – marketplaces for profiles in the data broker industry – the easier it is for companies to link data from various sources and identify a person within seconds. These “anonymised” profiles are transparent masks: a few traces are all it takes to identify you. Any four data points are enough to uniquely identify 95% of mobile phone users; three points if they are zip code, gender and birth date; two points if one of them is strong, like your home address. It becomes easier and more accurate when one “outside” piece is plugged into the anonymised data. Researchers at Harvard were able to identify 84% to 97% of individuals who had volunteered their medical and genomic data for research purposes under the assumption of anonymity, and whose profiles appeared online in a “de-identified” state. Re-identification was done by linking names, contact information and demographics (such as gender, date of birth and postal code) to public records such as voter lists. Besides publicly available records, datasets bought from data brokers (e.g. a mailing list of people who attended a certain event) can be used.
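The linkage itself can be a one-line database join. The sketch below uses hypothetical toy data to show the principle: a "de-identified" dataset and a public voter list are merged on the quasi-identifiers they share (gender, date of birth, zip code), re-attaching names to sensitive records.

```python
# Illustrative sketch (hypothetical data) of the linkage attack described
# above: join a "de-identified" dataset to a public record such as a voter
# list on the quasi-identifiers they share.

import pandas as pd

deidentified = pd.DataFrame({
    "gender": ["F", "M"],
    "birth_date": ["1953-02-11", "1980-07-02"],
    "zip": ["30047", "10027"],
    "diagnosis": ["diabetes", "asthma"],      # the "anonymous" sensitive data
})

voter_list = pd.DataFrame({
    "name": ["Alice B.", "Carl D."],
    "gender": ["F", "M"],
    "birth_date": ["1953-02-11", "1980-07-02"],
    "zip": ["30047", "10027"],
})

# An exact match on the three quasi-identifiers re-attaches names to diagnoses.
reidentified = deidentified.merge(voter_list, on=["gender", "birth_date", "zip"])
print(reidentified[["name", "diagnosis"]])
```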

There are cases where combining (cross-referencing) data from different datasets is not even necessary. Research reported by John Bohannon showed that each person's shopping pattern is unique, and that a copy of your credit card transactions alone can identify you. Even when a bank strips away your name, credit card number, shop addresses and the exact times of the transactions, the remaining metadata (the amounts you spent and the types of shops you visited) is sufficient.
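A toy example (hypothetical data, far cruder than the actual study) shows the principle: even with names, card numbers and timestamps stripped, knowing a couple of someone's purchases can single out one card in the whole dataset.

```python
# Toy sketch (hypothetical data): even coarse transaction metadata - shop
# category and amount, no names or card numbers - can act as a signature
# that only one customer in the dataset matches.

# card_id -> list of (shop_category, amount), identifiers already stripped
transactions = {
    "card_A": [("bakery", 4), ("bookshop", 23), ("pharmacy", 11)],
    "card_B": [("bakery", 4), ("cinema", 12), ("bar", 30)],
    "card_C": [("bookshop", 23), ("bar", 30), ("pharmacy", 11)],
}

def matching_cards(observed, data):
    """Return the card IDs whose history contains all observed (category, amount) pairs."""
    return [card for card, history in data.items()
            if all(obs in history for obs in observed)]

# Knowing just two purchases about someone (e.g. from a receipt and a tweet)
# already narrows the "anonymous" dataset down to a single card.
print(matching_cards([("bakery", 4), ("bookshop", 23)], transactions))  # -> ['card_A']
```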

Furthermore, before data is “anonymised,” demographic data and lifestyle information can be glued to a profile, making it easier to find a matching piece later. Not only can information about us that we believe is anonymous (e.g. movie ratings) be linked to our real identities; any future information we share under new pseudonyms – for example, about one of the movies we rated – can be linked to our real identities as well.

Is this legal? Well, it is difficult to proclaim it illegal under both current and recently proposed laws. It is precisely the promise of data anonymisation that makes it easy to elude data protection legislation such as the EU Draft General Data Protection Regulation (GDPR). Data or information is only protected under the GDPR if it concerns “an identified or identifiable natural person,” meaning it has either already been linked to a specific person or might be so linked in the future. What is not protected is data that is believed to be unlinkable to an identified or identifiable natural (physical) person¹ (the GDPR calls this “anonymous data”).

 

The new false promise: getting consent

What is particularly new in the times of big data analytics is that what is “personal” is no longer so straightforward. Information algorithmically derived from anonymised data is not necessarily “personally identifiable”. Yet it can tell intimate stories about you. It can prevent you from entering a university or getting a job, housing, a phone contract or a social benefit; it can determine how much you pay for health insurance, a hotel room or food, or how long you wait for a delayed flight or medical treatment; it can put you under government surveillance. It becomes your reputation, a fixed story of who you are, a profile through which corporations and governments see you. It consists of patterns that “emerge” from the data during data mining. A special case concerns group profiles based on anonymised data. Neither the individuals whose data were used to create these group profiles, nor the individuals to whom the group profile is applied, have a right to access “their” profiles.

Ensuring that companies which process our data obtain our consent for, and before, that processing is a big promise of the new EU Regulation (GDPR). However, seeking consent is not only (1) not required when a person is not considered “identifiable”, but also (2) impossible for an organization that produces profiles through data mining, since the patterns themselves cannot be humanly anticipated. The organization would not know whom to ask for consent, or for what, specifically.

Even when information seems as innocent as a movie review, it is not we who decide to reveal our political or sexual orientation when it is guessed from the movies we watch or the places we visit. Even when data is not sensitive, knowledge of your past transactions, preferences and actions is used to infer who you are – which will determine who you will be in the future in the eyes of the data brokers and related companies. Whether for profit or for law enforcement, data collection and analytics technologies enable the building of profiles that are more than the sums of their parts, and we have no control over them.

 

Making it personal: targeted advertising

Before user data is “anonymised” by companies, demographic data and lifestyle information are added. Anonymisation enables targeted marketing, since the “target” does not need to be named while data about her is being sold, and the pieces of data that get removed can easily be retrieved later. Actions such as customizing your “privacy settings” on Facebook do nothing in that regard; they only determine which other Facebook users see your particular activity. The companies – so-called third parties – can see you regardless. Increasingly, the content you see online is personalized.

While personalization might not seem worrisome when we see an ad for a vacation in Greece, we are ever less exposed to the serendipitous and novel information that leads us to discover new areas of interest and learning. Worse, the news itself seems to be on the way to being tailored differently for each of us, closing us into filter bubbles and personalized cubicles, and shrinking and fragmenting the public spaces for common experience and dialogue that are vital in a democratic society.

Anonymity – and hence, freedom – of individual users is incompatible with the current model of advertisement-supported online environments, in which all information is there to be monetized. We can seek to increase anonymity individually. We might also imagine collectively creating a better model, one that does not infer who we are from what we do online. Then we could decide to give out some personal information in the name of research or better public services. Today, however, we do not have that choice, simply because all data that can be gathered or bought by marketers is gathered or bought by marketers, and all data that is gathered can be what de-anonymises us.

This blog on anonymisation stems from research Morana Miljanovic did on the data industry when she was a programme researcher at Tactical Tech in 2014. See other posts from the series here.

 

 

Footnote:

1. OECD Privacy Guidelines define personal data as “any information relating to an identified or identifiable individual.”

 

Further reading:

Ohm, P, Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, 57 UCLA Law Review 1701 (2010)

Sweeney, L, Achieving k-anonymity privacy protection using generalization and suppression, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), 2002; 571-588

Sweeney, Abu and Winn, Identifying Participants in the Personal Genome Project by Name

Narayanan, A. and Shmatikov, V, Robust De-anonymization of Large Sparse Datasets

Gulyás, G.G, and Imre, S., Hiding Information In Social Networks From De-anonymization Attacks, Magdeburg, September 26, 2013, Conference on Communications and Multimedia Security 2013

Privacy International, Big data: A tool for development or threat to privacy?, 21 January 2014

http://en.wikipedia.org/wiki/Mask

Bellovin, S, Hutchins, R.M, Jebara, T, Zimmeck, S, 2014, When Enough is Enough: Location Tracking, Mosaic Theory, and Machine Learning, NYU Journal of Law & Liberty, Vol. 8 U of Maryland Legal Studies Research Paper No. 2013-51

Brdar, S, Ćulibrk, D, Crnojević, V, 2012, Demographic Attributes Prediction on the Real - World Mobile Data, Mobile Data Challenge Workshop

Song, C, Blumm, N, Barabasi, A.L., 2010, Limits of Predictability in Human Mobility, Science Vol.327

De Montjoye, Y-A, Hidalgo, C, Verleysen, M, Blondel, V, 2013, Unique in the Crowd: The privacy bounds of human mobility, Nature

Blumberg, A.J, Eckersley, P, 2009, On Locational Privacy, and How to Avoid Losing it Forever, EFF

Rubinstein, I., Big Data: The End of Privacy or a New Beginning?, 3 INTERNATIONAL DATA PRIVACY LAW 74, 74 (2013)

Andrejevic, M, 2014, The Big Data Divide, International Journal of Communication 8 (2014), 1673–1689