How OkCupid Demystifies Dating With Big Data

The Boston Globe describes OkCupid as the “Google of online dating.” That’s a good description. With more than 3.5 million active members and more than 7 million unique logins per month, OkCupid, which was recently acquired by IAC, probably is what it claims to be: the fastest-growing free online dating site on the planet.

As much as I’m in favor of anything that improves the chances of finding true love (or any kind of love, for that matter), what fascinates me the most about OkCupid is OkTrends, the company’s official blog.

OkTrends Brings Big Data to the Masses
Written largely by Christian Rudder, the company’s co-founder and editorial director, OkTrends is a veritable treasure chest of sexy social insight generated through a highly creative mash-up of off-the-shelf and DIY analytic techniques.

A recent post, “Big Lies People Tell in Online Dating,” reveals some pretty amazing truths behind the tall tales that are told regularly in the quest for romance. With unflinching mathematical precision, accompanied by several convincing charts, the post shows how people routinely lie about their height, income and looks. This revelation isn’t exactly surprising, but it’s fascinating to see how Rudder and his merry crew of quants use the power of analytics to strip away the mythology and tell the real story.

One of the most interesting aspects of the blog is the size of the samples. As Rudder noted in the blog’s inaugural post back in June 2009:

… a word about statistical validity: the best questions on OkCupid have been answered over a million times. Therefore we have unique insights into the American mindset. A quick comparison:

[OkCupid chart]

Old media could only get 3,050 people to answer a poll about Obama. And it was enough to call the election with confidence. OkCupid, on the other hand, can ask the world’s most personal questions and get hundreds of thousands of answers.
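
The sample-size comparison in the quote can be made concrete with the standard margin-of-error formula for a proportion, z * sqrt(p(1 - p) / n). The sketch below is illustrative Python, not anything OkCupid published. It assumes a simple random sample, a 95% confidence level (z = 1.96) and the worst case p = 0.5; a self-selected pool of dating-site users may not meet the random-sample assumption, so the figures describe sampling error only.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of an approximate 95% confidence interval for a proportion.

    Assumes a simple random sample of size n; p=0.5 is the worst case.
    """
    return z * math.sqrt(p * (1 - p) / n)

poll = margin_of_error(3050)           # sample size cited for the election poll
okcupid = margin_of_error(1_000_000)   # "answered over a million times"

print(f"n = 3,050:     +/- {poll:.2%}")
print(f"n = 1,000,000: +/- {okcupid:.2%}")
```

Under these assumptions the poll's margin is a little under two percentage points, while a million answers shrinks it to roughly a tenth of a point.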

Open Source Is the Key to Deeper Insights
The analytic techniques used to crunch the numbers and surface the patterns vary. In the early days, when OkCupid had fewer members and the data sets were smaller, Excel sufficed. But as the site’s membership rapidly grew into the millions, it was not uncommon for surveys to generate responses from 500,000 members. It became clear to Rudder that Excel by itself could not handle data from 500,000 responses; a more robust solution was required. Recently, OkCupid added statistical packages written in R to its mix of analytic tools.

R is an open-source language designed specifically for statistical analysis (see disclosure below). Unlike some of the more widely used proprietary analytic tools, programs written in R can handle the larger and more complex data sets generated by OkCupid’s growing base of users. That makes R a good choice for data scientists who are interested in pushing the envelope of traditional statistical analysis.

“R helps us get a quick overview of the data, which can save us a tremendous amount of time,” says Rudder. “If we had to do everything in Excel, it would take forever.”

Rudder’s crew uses R to visualize big data quickly, something they couldn’t do with Excel. “R lets us get a ‘zoomed-out’ view of what’s going on with the data, which helps us decide quickly if the tack we’re taking with the data is yielding something interesting. Once we figure out what we’re looking for, and we start narrowing our focus, then we can move into Excel,” says Rudder.
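
As a rough illustration of the “zoomed-out” first pass Rudder describes, the sketch below summarizes 500,000 survey answers as a one-screen text histogram. It is written in Python rather than R, the data is synthetic (randomly generated, not real OkCupid responses), and the 1-to-5 answer scale and the `overview` helper are hypothetical names chosen for the example.

```python
import random
from collections import Counter

random.seed(0)  # synthetic data only; not real OkCupid responses

# Hypothetical survey: 500,000 answers on a 1-5 scale.
responses = random.choices([1, 2, 3, 4, 5], weights=[1, 2, 4, 2, 1], k=500_000)

def overview(values, width=40):
    """Return one text-histogram line per distinct answer."""
    counts = Counter(values)
    total = len(values)
    lines = []
    for answer in sorted(counts):
        share = counts[answer] / total
        bar = "#" * round(share * width)
        lines.append(f"{answer}: {bar:<{width}} {share:.1%}")
    return lines

print("\n".join(overview(responses)))
```

A coarse summary like this is enough to show whether a distribution is lopsided or flat before anyone spends time on detailed charts.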

The team has also benefited from the open-source community. “If I have some data and I have an idea for analyzing that data, the chances are good that someone in the R community has already written a program that does what I need,” says Max Shron, a data scientist at OkCupid. “That’s the nice thing about open source. People just go ahead and write programs. It’s a tremendous time saver.”

Shron and Rudder recently used R packages to analyze data for a study comparing gay and straight dating habits. “There was a tremendous amount of data,” says Rudder. “When we were looking at things like sex partners and messaging, R was very helpful. We could see the patterns very quickly and we knew we had something to write about.”

The low cost of open-source software is also a factor. “If we had to buy a package from a major vendor, even just one license for Max, we’d blow our budget for software. It’s much easier just doing some of this stuff in R,” says Rudder.

R is not a panacea for solving every data challenge, at least not yet, says Rudder. “I am a long-term Excel user; I’ve been using it since I was 12. So when it comes down to generating the final graphics, I usually like Excel. But I feel that with the right amount of work, R could probably do anything.”

OkTrends shows how big data and analytic science can hold a mirror to society, revealing its strengths and weaknesses. It’s gratifying to know that many of the advanced analytic processes running “under the hood” at OkTrends are hand-crafted from open-source components created by a worldwide community of people writing code for love, not money.


Mike Minelli is an executive VP at Revolution Analytics, a company founded in 2007 to foster R analytics by creating programs to make it easier for data scientists to analyze large amounts of data. Note: Neither Minelli nor Revolution Analytics has a business relationship with OkCupid.
