Meet Carter S. He used to be a lawyer, but now he writes predictive models for an insurance company. Admittedly green in certain new or advanced modeling methods, he prefers to use simple algorithms and throw as much computing power as possible at problems. He calls the technique “overkill analytics,” and it just won him his first contest on Kaggle, defeating more than 80 other competitors in the GigaOM WordPress Challenge: Splunk Innovation Prospect (s splk) (see disclosure).
Not only was this Carter’s first win, it was also his first contest. You can read the detailed explanation of his victory on his blog, but the gist is that he didn’t get too involved with complex social graphing to determine relationships or natural language processing to determine topics readers liked. He figured out that most of what people liked came from blogs they’ve already read, and that the vast majority of posts people liked fell within a three-node radius on a simple social graph.
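Carter's three-hop insight can be sketched as a breadth-first search that collects candidate blogs within a fixed radius of a reader's existing likes. The toy graph and function name below are illustrative assumptions, not his actual code:

```python
# Sketch of the neighborhood idea: starting from blogs a reader already
# likes, collect everything within three hops on a simple social graph
# as candidate posts. Graph and names are hypothetical.
from collections import deque

def within_radius(graph, start, max_hops=3):
    """Breadth-first search returning all nodes within max_hops of start."""
    seen = {start: 0}  # node -> hop distance
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # don't expand past the radius
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    seen.pop(start)
    return set(seen)

# Toy follower/like graph: blog_e sits four hops out, so it's excluded.
graph = {
    "reader": ["blog_a", "blog_b"],
    "blog_a": ["blog_c"],
    "blog_c": ["blog_d"],
    "blog_d": ["blog_e"],
}
candidates = within_radius(graph, "reader")
```

The point is how cheap this is: no social-graph analytics package, just a plain traversal bounded at three hops.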
Statistically speaking, he built a generalized linear model, followed by a random forest model, and averaged the results. “I’m not sure it’s a very unique technique,” he told me, “but it’s certainly a very powerful one.”
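The blend he describes can be sketched in a few lines of scikit-learn: fit a generalized linear model (logistic regression here) and a random forest on the same features, then average their predicted probabilities. The synthetic data stands in for the contest's features; this is an illustration of the technique, not Carter's actual pipeline:

```python
# Blend a GLM and a random forest by averaging predicted probabilities.
# Synthetic data; hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

glm = LogisticRegression(max_iter=1000).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Average the two models' probability estimates, then threshold.
blend = (glm.predict_proba(X_test)[:, 1] + forest.predict_proba(X_test)[:, 1]) / 2
preds = (blend >= 0.5).astype(int)
accuracy = (preds == y_test).mean()
```

The appeal is that the two models fail in different ways, so a plain average often beats either alone without any tuning.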
And therein lies the beauty of overkill analytics, a term that Carter might have coined, but that appears to be catching on — especially in the world of web companies and big data. Carter says he doesn’t want to spend a lot of time fine-tuning models, writing complex algorithms or pre-analyzing data to make it work for his purposes. Rather, he wants to use some simple models, reduce things to numbers and process the heck out of the data set on as much hardware as possible.
It’s not about big data so much as it is about big computing power, he said. There’s still work to be done on smaller data sets like the majority of the world deals with, but Hadoop clusters and other architectural advances let you do more with that data, faster, than was previously possible. Now, Carter said, as long as you account for the effects of overprocessing data, you can create a black-box-like system and run every combination of simple techniques on data until you get the most accurate answer.
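One way to read that black-box idea is a brute-force sweep over simple model configurations, scored by cross-validation (the guard against overprocessing the data), keeping whichever wins. The candidate models and settings below are assumptions for illustration:

```python
# Brute-force sweep over simple model configurations, scored by
# cross-validation. Candidates and settings are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=15, random_state=1)

candidates = []
for C in (0.1, 1.0, 10.0):
    candidates.append(LogisticRegression(C=C, max_iter=1000))
for n in (50, 100, 200):
    candidates.append(RandomForestClassifier(n_estimators=n, random_state=1))

# Each candidate scores independently, so this loop is trivially
# parallel: throw hardware at it, per the overkill approach.
scores = [cross_val_score(model, X, y, cv=5).mean() for model in candidates]
best = candidates[int(np.argmax(scores))]
```

Cross-validation is what keeps the exhaustive search honest: a configuration only wins if it holds up on data it never trained on.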
I wrote about the same general theory recently in explaining why Sparked.com’s Daniel Wiesenthal believes that big data (i.e., lots and lots of data combined with new storage and processing technologies) improves the practice of data science (i.e., the application of statistical techniques to data). The gist of his theory is that although complex models are great for small data sets, simple models can close the accuracy gap when applied to large data sets. Combine that with infrastructure that can process a lot of data relatively fast and support a wide variety of jobs, and you have a simpler, faster, equally effective method.
Still, Carter said he didn’t get involved in Kaggle just to prove the effectiveness of overkill analytics. He does hope to get exposed to new data science techniques that haven’t yet caught on in the insurance industry, and he also wants to make a name for himself. When you work for a company with little turnover, he said, your professional network doesn’t grow too much, but doing Kaggle competitions is a great way to meet other data scientists — and winning is a great way to earn respect.
Ali Ahmad (username Xali) won the separate Splunk Innovation portion of the contest. According to a statement from Splunk, he “used Splunk’s built in statistical and visualization features to map out the relationship between blogs containing YouTube videos with those that are most likely to be viral, as measured by likes and shares. As a bonus, he fed the data into an app to view the YouTube videos most commonly liked and shared via WordPress blogs!”
Disclosure: Automattic, maker of WordPress.com, is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, GigaOm. Om Malik, founder of GigaOm, is also a venture partner at True.
Feature image courtesy of Shutterstock user nasirkhan.