Kaggle now has 100K data scientists, but what’s a data scientist?

Data science competition platform Kaggle has reached the 100,000-member milestone just over three years after launching, the company announced on its blog Thursday morning. That could mean either a lot of people picked up the skills for the decade’s hottest job in a hurry, or a lot of people realized they might already have them. I’d say it’s a little of both.
You have to give credit where credit is due, though: Given the field we’re talking about, that’s a lot of people any way you slice it. Assuming there are a few million people across the world who can even loosely call themselves data scientists, getting 100,000 of them to congregate on a single platform is pretty impressive. Part of this has to do with Kaggle making it easier for people interested in data science to actually test their skills on real-world data.

Kaggle's trajectory. Source: Kaggle

You can see the effects of Kaggle’s popularity in the evolution of the company. It’s still best known for hosting public data mining and predictive analytics competitions on behalf of companies and other institutions, but the company has quietly grown into a full-fledged business. Top competitors are competing in invite-only private competitions that mean bigger prizes for everyone involved. I was surprised to bring up the Kaggle homepage and see its Connect service (an evolution of an earlier service called Prospect), for putting customers directly in touch with those top competitors, front and center with the competitions playing second fiddle.

Define “data scientist”

Of course, “data scientist” is a term that emerged from the morass of terminological confusion that is “big data,” something that becomes pretty clear when you ask someone what a data scientist is or does. I’ve heard the core data science competencies explained as SQL, statistics, predictive modeling and programming, probably in Python. Those sound reasonable, but many would be quick to add to the list things like Hadoop/MapReduce, machine learning, visualization and perhaps a good, old-fashioned Ph.D. in mathematics, physics, computer science or something equally quantitative.
IBMer Swami Chandrasekaran built a great subway-style map of the optimal data scientist skillset, which you can see on his blog.
And those are just the technological prerequisites. A lot of people trumpet the importance of domain expertise, business acumen, creativity and storytelling, too. A data scientist can’t just be good with numbers (those people are called statisticians or analysts). He or she also needs to understand the business and why certain data and results are or aren’t important to it, find new datasets and build new products around them, and then explain all this to the C-suite in plain English.
That’s a tall order. I’m pretty sure there are a handful of these people in the world, and I’ve met most of them.

Eric Huls of Allstate Insurance, Jeremy Howard of Kaggle, and Ryan Kim of GigaOM at Structure:Data 2012

Kaggle’s Jeremy Howard (left) at Structure: Data 2012 (c) 2012 Pinar Ozger.

The good news is that all those characteristics are probably overkill, necessary for the cream of the crop trying to do really advanced things, but not required to effect positive change in more pedestrian environments. They’re certainly not all necessary for success on a platform like Kaggle, where the business problem is often well understood and competitors are just trying to optimize it through predictive models. Many top competitors no doubt have some serious statistical and data analysis backgrounds — even if they’re not up on the latest techniques — but some competition winners have been college kids with a little coding experience and Coursera’s introductory Machine Learning course.
This isn’t because data science or predictive competitions are easy, but rather because we’re at the precipice of a big change. It’s easier than ever to learn what you need thanks to online courses and coding programs; easy enough to learn and access tools such as R and Hadoop, especially with the advent of cloud computing; and easy enough to hone your skills on platforms like Kaggle or Topcoder. This pedigree might not land anyone an engineering job at Google, but it’s probably enough to be dangerous (in a good way) in a lot of other places.