Anonymous data is one of the staples of the big data movement, but there’s a dark side.
In theory, data from mobile phones lets us do things like map traffic patterns, while web-behavior data can be a boon to researchers and others trying to make sense of how people conduct their online lives. The thing is, it’s damn hard to keep that data anonymous. Perhaps all we can hope for is to keep potentially sensitive data out of the wrong hands.
The latest proof of how hard it is to anonymize data came earlier this week, when a group of MIT researchers published a paper based on their analysis of 15 months of cell phone traces from 1.5 million people inside a “small European country.” A press release highlighting the paper’s publication nicely sums up the findings, which are somewhat startling:
“Researchers … found that just four points of reference, with fairly low spatial and temporal resolution, was enough to uniquely identify 95 percent of them.
“In other words, to extract the complete location information for a single person from an ‘anonymized’ data set of more than a million people, all you would need to do is place him or her within a couple of hundred yards of a cellphone transmitter, sometime over the course of an hour, four times in one year. A few Twitter posts would probably provide all the information you needed, if they contained specific information about the person’s whereabouts.”
And assuming you’re concerned about protecting privacy, it gets worse:
“[T]he probability of identifying someone goes down if the resolution of the measurements decreases, but less than you might think. Reporting the time of each measurement as imprecisely as sometime within a 15-hour span, or location as imprecisely as somewhere amid 15 adjacent cell towers, would still enable the unique identification of half the people in the sample data set.”
All it takes to get started is a few pieces of data against which to compare the anonymized mobile data. “For re-identification purposes,” the authors write in the paper, titled Unique in the Crowd: The Privacy Bounds of Human Mobility, “outside observations could come from any publicly available information, such as an individual’s home address, workplace address, or geo-localized tweets or pictures.”
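The counting behind the paper’s headline number is easy to sketch. Below is a toy illustration (entirely synthetic data, not the researchers’ dataset): each user’s trace is a set of (cell tower, hour) observations, and a user is “unique” if four randomly drawn points from their trace match no one else’s. Real mobility traces are far more structured than this uniform noise, which is why four points suffice even at coarse resolution.

```python
import random

random.seed(0)

# Synthetic traces: each user is observed at OBS_PER_USER random
# (cell_tower, hour) points over a month. Purely illustrative.
N_USERS, N_TOWERS, N_HOURS, OBS_PER_USER = 1000, 50, 24 * 30, 40

traces = [
    {(random.randrange(N_TOWERS), random.randrange(N_HOURS))
     for _ in range(OBS_PER_USER)}
    for _ in range(N_USERS)
]

def is_unique(target_idx, n_points=4):
    """True if n_points randomly drawn observations of the target user
    are contained in no other user's trace -- i.e. they re-identify them."""
    points = random.sample(sorted(traces[target_idx]), n_points)
    matches = [i for i, trace in enumerate(traces)
               if all(p in trace for p in points)]
    return matches == [target_idx]

unique_share = sum(is_unique(i) for i in range(N_USERS)) / N_USERS
print(f"{unique_share:.0%} of synthetic users are unique given 4 points")
```

Coarsening the resolution, as the second quote describes, amounts to bucketing tower IDs and timestamps before running the same check; the equivalence classes grow, but on realistic data not nearly fast enough to restore anonymity.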
We’ve been down this road before
This news might ring a bell to anyone who follows the world of web data. After Netflix released anonymous user data as part of its Netflix Prize competition in 2007, researchers were able to de-anonymize it using publicly available movie reviews from IMDb. In 2006, AOL released a bounty of supposedly anonymous search data for research purposes, but it was quickly mirrored onto public web sites and people began picking individual searchers out of the sea of anonymous identification numbers.
There are plenty of non-digital examples, too. The Unique in the Crowd authors point to one case where a medical database was analyzed against a voter list to discover a governor’s health records. In a 2007 post for Wired, security expert Bruce Schneier cited a couple of analyses of census data, including one using 1990 census data that showed 87 percent of Americans could be identified using just their ZIP code, sex and date of birth.
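The arithmetic behind claims like the 87 percent figure is straightforward: group records by the quasi-identifier tuple and count how many land in a group of size one. Here is a toy sketch with a made-up population (the actual result, of course, came from real census data):

```python
import random
from collections import Counter

random.seed(1)

# Synthetic population: (ZIP code, sex, date of birth) per person.
# The ZIP codes and sizes here are arbitrary; only the counting matters.
population = [
    (random.choice(["02138", "02139", "02141"]),   # ZIP code
     random.choice("MF"),                          # sex
     (random.randrange(1940, 2000),                # date of birth
      random.randrange(1, 13),
      random.randrange(1, 29)))
    for _ in range(5000)
]

# Size of each equivalence class of identical quasi-identifiers.
class_sizes = Counter(population)

# A person is re-identifiable if their (ZIP, sex, DOB) combination is unique.
unique_share = sum(1 for p in population if class_sizes[p] == 1) / len(population)
print(f"{unique_share:.0%} of this toy population is unique on (ZIP, sex, DOB)")
```

Even with only three ZIP codes, the date-of-birth field alone creates so many possible combinations that most of the toy population ends up in a class by itself, which is the same effect the census analyses exploited.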
And then there are those fitness-tracking devices. At our Structure: Data conference last week, Central Intelligence Agency CTO Ira “Gus” Hunt gave the audience — the whole world, really — a scare by noting that it’s possible to identify someone based solely on his gait. That kind of information might not get people lining up for web-connected pedometers and other fitness devices.
The risk of de-anonymization is only exacerbated in an era of social media. The University of Texas researchers who decoded the Netflix data were able to speculate on individuals’ political positions, sexual orientation and other characteristics, but we now give that information away for free on sites like Facebook, Twitter, Foursquare, you name it. If you’re inclined to stalk someone, steal an identity or engage in any other malicious undertaking, access to names, photos, interests, location, checkins and other information makes for a hearty personal-data stew, and it just takes one piece to get the rest.
A choice between privacy and a better world?
However, if we can get past the inherent privacy concerns, these types of anonymous, aggregate data sets can be incredibly valuable. Companies such as Google, Apple and INRIX are using smartphones and in-vehicle devices to map traffic patterns and how people move throughout cities in efforts to improve both commute times and urban planning. Social scientists accessing data from companies such as Google and Facebook could learn a lot about the intricacies of online behavior. And predictive analytics platforms such as Kaggle present an opportunity to optimize everything from business processes to health care.
The holy grail of anonymous data lies in genomics and the hope that lots and lots of quality data will help researchers discover cures for diseases like cancer. Because of the relative uniqueness of each individual cancer case, researchers hope a massive pool of data on sequenced genomes will help them spot patterns and commonalities that no amount of traditional lab work will uncover.
Further complicating things is the fact that the companies delivering our favorite web services rely on our personal data to make money. Whether we like it or not, targeted advertising pays the bills for free services, and doing targeted advertising well requires a lot of personal data. One could argue that a major focus of the data science movement that has taken the world by storm is stitching together various pieces of anonymous data from across the web in order to create holistic images of consumers.
In fact, web companies have gotten so good at de-anonymizing data that the Federal Trade Commission has all but abandoned the term “personally identifiable information.” In a 2010 report on online privacy, the agency wrote that any guidelines it proposes will likely apply
“to those commercial entities that collect data that can be reasonably linked to a specific consumer, computer, or other device. This concept is supported by a wide cross-section of roundtable participants who stated that the traditional distinction between PII and non-PII continues to lose significance due to changes in technology and the ability to re-identify consumers from supposedly anonymous data.”
“Going forward,” the Unique in the Crowd authors conclude, “the importance of location data will only increase and knowing the bounds of individual’s privacy will be crucial in the design of both future policies and information technologies.” This rings equally true for every other type of personal data, especially given the relative ease with which they can be analyzed against each other to create a sum that is greater than the whole of its parts.
One has to wonder, though, what types of policies and technologies will come about to keep data anonymous while still maintaining its utility for the people who need it. Privacy is important, but is it worth the opportunity costs of not trying to solve the types of problems that large, anonymous data sets are ideal for solving? If true anonymization is really that difficult, perhaps the best bet is just to double down on security and try to ensure that valuable data — anonymous or not — doesn’t get into the wrong hands.