You might have heard recently about a study finding that liking “curly fries” on Facebook correlates strongly with high intelligence. Publications such as Wired have written about it. Quid Founder and CEO Sean Gourley cited it during a presentation at Structure: Data last week. A faction of the European Union parliament even pointed to the study as yet another reason to prohibit data mining by web companies.
However, if you’re like me, hearing anybody repeat that curly fries data point as fact likely sends a shiver down your spine. It’s not that the finding is untrue (it very well might be true), but that it’s nearly useless information without more background.
That’s right, the old correlation-versus-causation argument is front and center once again. In all of the big data world, it’s probably the biggest fallacy there is, no matter how you look at it. No, getting value from big data doesn’t always require giving greater credence to correlation than causation. And, no, relying on correlation isn’t inherently some sort of ethically or scientifically questionable practice.
Really, the choice between relying on correlation and striving to find causation probably depends on what you’re trying to do.
When there’s nothing at stake, correlate away
Let’s be honest: If all I’m concerned with doing is boosting clickthroughs, selling more products or predicting the movies you want to see, correlations probably will work just fine. I don’t really care why, for example, Mac users book more-expensive rooms on Orbitz — I just care that they do.
You visit my site, my system sees you’re using a Mac (or that you like curly fries, or any other attribute it can associate with you) and it shows you content that it thinks you’ll want to see. It’s not a perfect approach, but it’s probably a good deal better than the old method of just showing everybody the exact same content.
And when you’re collecting potentially petabytes of user data and trying to serve ads in near real time, strong correlations might be about the best thing you can hope to find. It’s a volume-and-velocity business, and deep examinations of why any two (or more) things are related to one another might not always provide a high return on investment.
A more extreme example of when correlations might suffice would be something like machine-to-machine systems that need to make decisions in real time in order to prevent disasters. The people charged with running these systems might not know why a certain series of events often precedes a particular outcome, but it’s better to be safe than sorry.
You can’t make a difference — or real decisions — with correlations
But if you’re trying to use big data to make a meaningful difference in the world or to make decisions that can have significant real-world consequences, mere correlations probably won’t cut it. This is what Evgeny Morozov warns about in relation to crime in a recent New York Times column. It’s what Gourley had in mind when talking about data science versus data intelligence. It’s why the current discussion around machine learning almost always includes a human aspect, as well.
Many of the reasons for not acting on correlations alone are based on privacy and a whole collection of civil, constitutional and human rights. You simply can’t, for example, profile and then arrest people based on what their Likes suggest they might be. You probably shouldn’t make decisions about people’s financial, health or general well-being based on mere correlations, either.
Heck, I wouldn’t even serve ads that delve into personal information such as health, sexual orientation or intelligence without a very strong reason to believe I was accurate (and express consent to serve those ads). And the Facebook-curly-fries study is full of correlations that could be potential landmines, a small portion of which are visible in the chart below.
But these are all situations where the fear of incorrectly profiling someone occasionally — and being sued as a result — might overpower the desire to do good most of the time. The data Darwinism that my colleague Om Malik wrote about recently extends beyond just peer reviews and social-media ratings, and one shouldn’t take the role of playing God (or catalyst for evolutionary change, to continue the Darwin metaphor) lightly.
Sometimes, though, correlations aren’t enough because you really want to solve a problem or perhaps build a great product. As Gourley explained at Structure: Data, using correlative data to predict insurgent attacks in a place like Iraq is relatively easy, but predicting the likelihood of events doesn’t stop them. Stopping them requires really understanding and addressing the root causes of the attacks.
The same goes for stopping disease outbreaks, figuring out why programmers make more mistakes during certain seasons, stopping gun violence, or just capitalizing on that knowledge about curly fries or hotel-room bookers in order to build products that touch upon the deeper rationales for liking those things. You can fight the symptoms, so to speak, or you can cure the disease.
So feel free to try selling a documentary about Dostoevsky to the next guy you see eating curly fries, but don’t expect him to care. It might be that there’s some strong connection between curly fries and intelligence; of course, it might also be that intelligent people, entirely coincidentally, tend to live within walking distance of an Arby’s. But no one has asked about that.
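To make that Arby’s hypothetical concrete, here is a minimal Python sketch. Every number in it is invented for illustration and none of it comes from the Facebook study; the names `near_arbys`, `likes_fries` and `high_score` are assumptions of mine. It builds a toy population in which living near an Arby’s makes both traits more common, while the two traits are unrelated to each other within each neighborhood.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical head counts: (near_arbys, likes_fries, high_score, people).
# Near an Arby's, 80% like curly fries and 80% score high; elsewhere 20% each.
# Within each neighborhood the two traits are independent by construction.
CELLS = [
    (1, 1, 1, 64), (1, 1, 0, 16), (1, 0, 1, 16), (1, 0, 0, 4),
    (0, 1, 1, 4), (0, 1, 0, 16), (0, 0, 1, 16), (0, 0, 0, 64),
]

# Expand the counts into one (near_arbys, likes_fries, high_score) row per person.
people = [
    (near, fries, score)
    for near, fries, score, count in CELLS
    for _ in range(count)
]


def fries_score_correlation(rows):
    """Correlation between liking curly fries and scoring high."""
    return pearson([r[1] for r in rows], [r[2] for r in rows])


overall = fries_score_correlation(people)
near_only = fries_score_correlation([p for p in people if p[0] == 1])
far_only = fries_score_correlation([p for p in people if p[0] == 0])

print(f"everyone:         {overall:.2f}")
print(f"near an Arby's:   {abs(near_only):.2f}")
print(f"far from Arby's:  {abs(far_only):.2f}")
```

Across everyone, the correlation comes out at roughly 0.36, yet it is essentially zero inside each neighborhood. In this made-up world the curly fries signal would still be good enough for targeting ads, but it says nothing about why the two traits travel together, which is the distinction the article is drawing.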
Feature image courtesy of Shutterstock user Tobias Arhelger.