Can evil data scientists fool us all with the world’s best spam?

While most of the concern over web security has to do with criminal activity such as cyberterrorism, theft of state secrets and hacktivism, there’s a far more annoying threat lurking beneath the surface. It’s a new generation of spam that does away with brute-force email barrages in favor of fake online personas so real that people (and, more importantly, email and web-service spam filters) can’t tell they’re fake. Done right, these fake identities could influence everything from app downloads to e-commerce to elections.

It’s called influence manipulation. And, as data scientist Joseph Turian said during a presentation at the O’Reilly Strata conference on Wednesday, “It’s a pretty serious issue and it’s also pretty hard to catch.” (Turian will also be moderating a panel on next-generation databases at our Structure: Data conference in New York next month, but I’m sure he’ll gladly talk black-hat data science if you catch him in the hall.)

Joseph Turian at GigaOM RoadMap 2012. (c) 2012 Pinar Ozger

It’s hard to catch because influence manipulation, which Turian also calls black-hat data science, is really just white-hat (or good) data science techniques inverted and pointed toward a nefarious purpose. So, whereas white-hat data scientists try to uncover unnatural networks of links created to game Google’s PageRank algorithm, Turian explained, black hats will try to build artificial networks so good they look real. If someone wants to send lots and lots of undetectable spam, it’s just a matter of analyzing enough language to create messages that look less like a machine wrote them and more like a stupid human wrote them, because most spam filters try not to penalize users who just don’t write well.

During a one-on-one conversation later in the day, Turian told me he did a lot of work on language modeling as part of his Ph.D. work, and that the same techniques used for language evaluation — something like sentiment analysis, for example — can also be used for language generation. Marketing startups such as DataPop and BloomReach are already using some presumably similar techniques to create personalized online ads and web pages on the fly.
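To make the evaluation-versus-generation point concrete, here is a minimal Python sketch of a toy bigram language model. It is not Turian’s method, which he did not detail; the class name, the add-one smoothing and the three-sentence corpus are all assumptions chosen for illustration. The same table of word-pair counts is used both to score a sentence and to sample a new one.

```python
import math
import random
from collections import Counter, defaultdict


class BigramModel:
    """Toy bigram language model (hypothetical, for illustration only)."""

    def __init__(self, sentences):
        # counts[prev][word] = how often `word` followed `prev` in training
        self.counts = defaultdict(Counter)
        self.vocab = set()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
            self.vocab.update(tokens)
            for prev, word in zip(tokens, tokens[1:]):
                self.counts[prev][word] += 1

    def log_prob(self, sentence):
        """Evaluation: log-probability of a sentence, with add-one smoothing."""
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            seen = self.counts.get(prev, Counter())
            numerator = seen[word] + 1
            denominator = sum(seen.values()) + len(self.vocab)
            total += math.log(numerator / denominator)
        return total

    def generate(self, rng, max_words=20):
        """Generation: sample a sentence from the very same counts."""
        word, output = "<s>", []
        for _ in range(max_words):
            followers = self.counts[word]
            word = rng.choices(list(followers), weights=list(followers.values()))[0]
            if word == "</s>":
                break
            output.append(word)
        return " ".join(output)


# Tiny made-up corpus; a real model would need vastly more text.
corpus = ["the service is great", "the service is fast", "the food is great"]
model = BigramModel(corpus)

# Scoring: word order seen in training beats a scrambled version.
print(model.log_prob("the service is great"))
print(model.log_prob("great is service the"))

# Generating: reuse the counts to produce a new, plausible-looking sentence.
print(model.generate(random.Random(0)))
```

With only three training sentences the output is trivial, but the symmetry is the point: whatever statistics let a filter judge whether text looks human can, in principle, be run in the other direction to produce text that passes.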

Does evil lurk among our data scientists?

But are there actually so-called black-hat data scientists among us, using their mastery of statistics to influence our opinions or make us buy Cialis? Turian quoted data scientist Hilary Mason, who he said asks of all her work, “What’s the most evil thing that can be done with this?” We can assume she’s just trying to avoid a mini-Sarah Winchester situation, but others might not be so ethical. (Turian already classifies as “gray hat” certain well-known companies that play fast and loose with user data.)

After all, Turian noted in his presentation, Greylock’s D.J. Patil has called being a data scientist the sexiest job of the 21st century, comparing it with Wall Street quants in the 1980s. And where there’s opportunity, there will always be people trying to cash in on it by any means necessary. Real-life Gordon Gekkos came to make quants almost universally reviled, and a few bad apples could certainly find their way into the data science bunch.

Turian assured me he isn’t one of them. “[I]f I did [this] I’d be riding around in a Rolls Royce,” he joked during our hallway conversation.

Define “good enough”

Maybe, maybe not. If all you’re trying to do is improve search rankings, mediocre bots might work in the same way that “legit” content-generation services like Chirpsy and Servio work, he noted. Marketers don’t necessarily care how good a tweet or article is as long as it’s positive and says their company’s name a lot.

But in order to be successful in the world of online influence manipulation, fake personas and their messages have to be really good. Lutz Finger, co-founder of Fisheye Analytics, laid out some interesting statistics during another conference talk that highlight how difficult it is to really influence someone. According to the studies he cited, 7 percent of people’s Twitter followers are actually spambots; 30 percent of social media users are deceived by spambots and chatbots; and 20 percent of social media users accept friend requests from unknown people, 51 percent of which are not human.

Presently, though, the charlatans are not very good. Finger said that when it comes to “astroturfing” (the practice of creating fake grassroots movements to influence opinions), the hit ratio on email spam is about 12.5 million to 1. In order to create an astroturf movement on the scale of the anti-SOPA movement in 2011, every person on earth would have to receive the same spam message 8 times. The number might be even higher on an already-noisy platform like Twitter.
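For a rough sense of the scale behind those figures, here is a back-of-the-envelope calculation in Python. The hit ratio and the 8-messages-per-person figure come from Finger’s talk; the world population of about 7 billion is my assumption, and he did not say how many responses he was counting as “anti-SOPA scale,” so treat the result as what the quoted numbers imply rather than as his own math.

```python
HIT_RATIO = 12_500_000            # messages sent per response (from the talk)
MESSAGES_PER_PERSON = 8           # from the talk
WORLD_POPULATION = 7_000_000_000  # assumption: rough figure, not from the talk

# Total spam messages needed to reach everyone on earth 8 times.
total_messages = WORLD_POPULATION * MESSAGES_PER_PERSON

# Responses implied by a 12.5-million-to-1 hit ratio.
implied_responses = total_messages // HIT_RATIO

print(f"{total_messages:,} messages sent")
print(f"{implied_responses:,} implied responses")
```

Under that assumption, tens of billions of messages buy only a few thousand responses, which is why brute-force spam looks like such a poor tool for manufacturing a movement.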

That, he noted, makes the 10,000 pre-election tweets from the now-defunct spambot @peace_karen25 seem pretty inconsequential.

However, he explained, spammers are getting smarter and are working on some of the black-hat data science techniques that Turian warns about. Next-generation bots will be better at gaining trust (attractive females with familiar names are most likely to have their fake friend requests accepted), and they’ll act more real by mixing improved chatbot technologies and analytics to figure out how people speak and what to say in what circumstances. Once they have your trust, these bots can make introductions to more bots and people will be more likely to accept those requests, too.

Even if it’s difficult to change someone’s mind on issues like global warming or politics, Finger said well-timed messages could affect individual decisions. At the time someone is ready to buy something on, for example, he’s open to messages about that product, perhaps in the form of product reviews. Maybe someone waiting in line at the polling place and still sitting on the fence is open to suggestions, too.

And it’s possible the bar to convincing people, especially teens, to act really isn’t that high at all. In his talk, Turian highlighted teenage social media maven Acacia Brinley Clark and her single tweet that led to an app called Pheed becoming one of the most-downloaded apps in Apple’s App Store last week. After reading the rest of her Twitter feed, he said (only half-jokingly, I think), it took quite a bit of research to convince him she’s a real person.


Her 120,000-plus followers don’t seem to share the skepticism, but they certainly seem willing to follow her lead.