If I thought an NSA spy was sitting around in some foreboding government building, proactively trying to track my every move and figure out all my connections, I might not be sleeping so well. That’s because I know he’d be able to do it. It doesn’t take a wiretap to figure out what someone is up to when you have boatloads of metadata.
If I were involved in criminal activity, or concerned about individuals at ad platforms or web service providers snooping into my data, I’d really be losing my mind. Law enforcement agents tracking a specific individual can get all sorts of phone call, web account and transactional data without search warrants; in some cases, they can even put a GPS unit on my car for a little while.
Google, Facebook, direct-marketing firms and even grocery stores? Well, they know a whole lot about what we buy, where we go online and who we talk to. I’m confident most of this data is just used to train models and then lump me into a particular segment that computer systems can automatically target with ads or coupons, but a bad actor with access to my data could do some serious cyberstalking.
As you can see from just my few hours of tinkering with my personal data over the weekend, metadata can paint a pretty complete picture of a person’s habits and connections.
Calls and texts are low-hanging fruit
My first test was with my cellular phone calls and text messages, which I was able to download from my carrier. Phone numbers have been redacted to protect the innocent (or guilty …), but here’s how easy it is with even a simple tool like Datahero to visualize who I’m calling and at what times of the day. Of course, you could easily slice and dice by duration of calls, where the call was coming from/going to, or any other data points my carrier provides.
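For a rough idea of what that slicing and dicing looks like without a visualization tool, here is a minimal Python sketch. The records, phone numbers and field order are made up for illustration; a real carrier export will have its own column names and timestamp format.

```python
from collections import Counter
from datetime import datetime

# Hypothetical call records: (timestamp, other party, duration in seconds).
# Real carrier exports use different columns and formats.
CALLS = [
    ("2013-07-01 08:15", "555-0101", 120),
    ("2013-07-01 08:40", "555-0102", 1800),
    ("2013-07-01 17:05", "555-0101", 300),
    ("2013-07-02 08:05", "555-0102", 2400),
    ("2013-07-02 17:30", "555-0101", 240),
]


def calls_by_hour(calls):
    """Count calls per (contact, hour of day)."""
    counts = Counter()
    for stamp, number, _duration in calls:
        hour = datetime.strptime(stamp, "%Y-%m-%d %H:%M").hour
        counts[(number, hour)] += 1
    return counts


def total_seconds_by_contact(calls):
    """Add up total talk time per contact."""
    totals = Counter()
    for _stamp, number, duration in calls:
        totals[number] += duration
    return totals


if __name__ == "__main__":
    for (number, hour), n in sorted(calls_by_hour(CALLS).items()):
        print(f"{number} at {hour:02d}:00 -> {n} call(s)")
```

Even in this toy data set, one number clusters around 5 p.m. and another around 8 a.m., which is the kind of habit a chart makes obvious at a glance.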
And just imagine what you could do with sophisticated indexing and graph analysis tools like what the NSA has — analyzing not just who I call, but how they’re all related and who else they know.
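To make "who else they know" concrete, here is a small sketch of hop-by-hop expansion over a call graph using a breadth-first search. The contact names and the `within_hops` helper are hypothetical, and nothing here is meant to describe how any agency's tools actually work.

```python
from collections import deque

# Hypothetical call graph: each key has exchanged calls with the listed contacts.
CONTACTS = {
    "me": {"wife", "editor"},
    "wife": {"me", "dentist"},
    "editor": {"me", "source"},
    "dentist": {"wife"},
    "source": {"editor", "stranger"},
    "stranger": {"source"},
}


def within_hops(graph, start, max_hops):
    """Return {person: hop distance} for everyone reachable within max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if seen[person] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(person, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[person] + 1
                queue.append(neighbor)
    return seen


if __name__ == "__main__":
    print(within_hops(CONTACTS, "me", 2))
```

With a limit of two hops the search reaches my contacts and their contacts; raising the limit to three pulls in "stranger" as well, which shows how quickly the circle widens.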
Here’s a chart showing the times I’m sending or receiving text messages from particular cities (and Twitter). It seems I get retweeted a lot between 7 a.m. and 11 a.m., and that I don’t text my wife too often (she’s the “Sndg Sndg, CA” label), but that’s because I use an app for that. And a call analysis would show I call her a lot, especially at certain times of day.
Of course, the data that carriers give users might not even come close to the metadata they give to agencies like the NSA. A phone’s location data, for example, would be incredibly useful for tracking a person’s movements across time (see, for example, this animated timeline pieced together based on a German politician’s data).
Add in all the calls I make using Google Voice from my desktop and the messages I send using third-party apps, and you’d have a complete picture. Even if it just showed that I text my wife a lot and dial into a lot of conference lines.
Email says a lot, too
Email metadata, as with phone and text records, can tell a lot about who someone knows and how frequently they communicate even without knowing what they were actually talking about. The MIT Media Lab’s new Immersion tool shows just how much you can see with just name, email address and year, even without getting into the weeds of metadata around the days and times that emails were sent.
Here’s my personal email graph. The big cluster, as you could easily see if I had included names, is made up of emails between GigaOM colleagues and me from when I was freelancing and didn’t yet have a GigaOM account (bigger nodes mean more emails sent between us). The time series data included below this graph will show just how big a spike in activity this created, and how it dropped when my GigaOM email address was created (although I still receive a lot of junk mail, subscriptions and pitches to my personal email).
Here are the same visualizations of my GigaOM email activity. The big pink circle in the network graph is Stacey Higginbotham, for what it’s worth. My activity stats should show how I’m getting hit with a lot more email and actually responding to less of it as a result.
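A network graph like the ones described above can be approximated from header fields alone. The sketch below is not Immersion's actual method, just one simple way to do it: count how often two addresses appear on the same message, then size each node by its total. The addresses and records are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical header-only records: (sender, [recipients], year).
MESSAGES = [
    ("me@example.com", ["alice@example.com", "bob@example.com"], 2011),
    ("alice@example.com", ["me@example.com"], 2011),
    ("bob@example.com", ["me@example.com", "alice@example.com"], 2012),
    ("carol@example.com", ["me@example.com"], 2012),
]


def edge_weights(messages):
    """Count how often each pair of addresses appears on the same message."""
    edges = Counter()
    for sender, recipients, _year in messages:
        people = sorted({sender, *recipients})
        for a, b in combinations(people, 2):
            edges[(a, b)] += 1
    return edges


def node_sizes(edges):
    """Sum edge weights per address; bigger totals mean bigger nodes."""
    sizes = Counter()
    for (a, b), weight in edges.items():
        sizes[a] += weight
        sizes[b] += weight
    return sizes


if __name__ == "__main__":
    for address, size in node_sizes(edge_weights(MESSAGES)).most_common():
        print(address, size)
```

No subject lines or message bodies are involved, yet the output already ranks who is closest to whom.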
Facebook data: creepily personal, even through a third party
With Facebook, it turns out, you don’t even need a FISA order to do some serious analysis if you have a target in mind. All you have to do is friend them and sign up for Wolfram Alpha.
This is my Facebook social graph, which was really easy to access using the Wolfram Alpha tool. On the surface, it shows a big cluster of high school friends/acquaintances, as well as clusters for family, university friends and law school acquaintances. (Historical data would show that I don’t use Facebook very much and, save for a mini-spurt during law school, pretty much stopped adding new friends after my 10-year high school reunion got everyone online a few years back.)
But wait, there’s so much more! Want to know the most-common names among my friends, what time of day I use apps, where my friends are located geographically, their ages, the most-liked pictures of me and with whom I share the most connections? It’s all in the Wolfram report. Want to find out much of that same data about any of my friends? Just click on a name and Wolfram will generate it for you. If you pay for Wolfram Alpha Pro, you can download any of the data in a variety of formats.
Just imagine what a skilled analyst with mountains of metadata about you and all your connections — and a massive database to store and analyze it all — could find out.
Visualizing those 500+ LinkedIn connections
LinkedIn might not seem like too good a place to search for data about suspected terrorists, but it could be a good place to spy on suspected corporate criminals. Someone’s graph is a great place to see the types of networks they roll with and, perhaps, whether there’s a known crook in there who happens to be connected to a lot of people. I don’t know what types of metadata LinkedIn hands over to law enforcement other than someone’s connections, but a log of InMails and other private messages could prove very valuable even without knowing what was said.
Here’s my connection graph, which shows some distinct clusters of contacts: cloud computing (blue), big data (green), GigaOM (orange), a previous job (purple), law school (light orange, lower-left corner), undergrad, and high school and personal contacts (both gray and at the bottom). I’m the center node.
Tackling Twitter is valuable, but not always easy
Twitter is kind of a mystery when it comes to metadata because it’s increasingly useful for communicating but also represents a sort of needle-in-a-haystack situation. Knowing who’s connected to whom isn’t exactly easy when following/follower counts frequently tip into the thousands, and figuring out who’s tweeting among each other could be exceedingly difficult in a sea of billions of messages. I actually tried to download and visualize my Twitter network using NodeXL, but Twitter’s API rate limits made that nigh impossible in any reasonable timeframe.
Spies or law enforcement agents, however, might have an easier time getting the data they need in bulk form rather than in 15-request chunks every 15 minutes.
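Some back-of-the-envelope arithmetic shows why those chunks hurt. The sketch below assumes 15 requests per 15-minute window, as described above, and at least one request per account whose follower list you want; the function names and the 2,000-account figure are hypothetical, and accounts with very long follower lists would need more than one request each.

```python
import math


def windows_needed(num_requests, requests_per_window=15):
    """Number of rate-limit windows required to send num_requests."""
    return math.ceil(num_requests / requests_per_window)


def hours_to_crawl(num_accounts, requests_per_account=1,
                   requests_per_window=15, window_minutes=15):
    """Hours spent waiting on the rate limit to fetch every account's list.

    Assumes the first window's requests go out immediately and every
    further window costs one full wait.
    """
    total_requests = num_accounts * requests_per_account
    windows = windows_needed(total_requests, requests_per_window)
    return max(windows - 1, 0) * window_minutes / 60


if __name__ == "__main__":
    print(hours_to_crawl(2000))
```

Under those assumptions, mapping the networks of 2,000 accounts means more than 33 hours of waiting, and that is before fetching a single tweet.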
And if they’re interested in actual public tweets, it’s easy enough to scrape someone’s page or use a service like DataSift or Gnip to follow what’s happening. Here’s a quick summary of the last week’s worth of tweets (I think that’s the limit, at least for the free version) mentioning me, which was easy to generate using ScraperWiki.
Here’s a shot of what that looks like in table form. As you can see, every tweet comes with a lot of metadata, and if tweets are geotagged or have images, that data should show up, too. (You should see what DataSift lets you do: I just tracked 34 minutes’ worth of tweets including the words “barack” or “obama” and got more than 1,200 results, each loaded with metadata about users’ bios, URLs, locations, you name it.) If anything suspicious were to show up, it would be easy enough to put a digital track on that specific user and start digging into his or her connections, too.
OK, maybe I’m concerned after all …
The thing to remember about all this is that I was able to analyze and visualize all this metadata (and some actual communication data, in the case of Twitter) using publicly available services, what limited data my service providers actually provide consumers and my own limited data-analysis skills. It’s hard to say what’s inside the guts of an intelligence database like those operated by the NSA, CIA and FBI, but I’ll assume it’s a lot more data about me (and anyone else) if the agencies were so inclined as to collect it. Combined with powerful graph algorithms and rich indexes, they could analyze connections and track down individual people or communications with relative ease.
The million-dollar question then is what data agencies are collecting and how they’re choosing to use it. The NSA claims its efforts have foiled numerous terrorist plots, while the Boston bombers didn’t set off flags with the FBI even after a tip from the Kremlin. The CIA, well, as its CIO Ira “Gus” Hunt told us at Structure: Data in March, it wants to get at all your metadata — including your pedometer.