A programmer’s guide to big data: 12 tools to know

Over the past year, I’ve seen a lot of startups, projects and tools that aim to bring fairly advanced analytic capabilities to programmers. Sometimes they do this by enabling simple scripts that result in powerful dashboards or processes, while other times they just deliver the data in an easy-to-consume manner with little work at all on the developer’s part. I think this is a meaningful trend.

In a world of mobile apps and cloud resources, it’s easier than ever to start a business around a simple application. Even in large companies, developers fighting for resources might need to prove an application’s popularity or find a way to boost its monetization. Sometimes, that might even mean injecting some data-processing right into an application.

But whatever the case, if your job revolves around writing code rather than data flows, you might need a little help. Here are 12 tools (listed alphabetically) that aim to help. As usual with this type of list, it’s very possible I left out some good options, so please note any omissions in the comments.

1. BitDeli

BitDeli, a startup that launched in November, lets programmers measure pretty much whatever application metrics they want using Python scripts. Co-founder and CEO Ville Tuulos told me at the time that scripts can be as simple or complex as necessary — even going so far as to incorporate machine learning. Compared with the heavyweight Hadoop, BitDeli thinks of itself as the lightweight Ruby on Rails for analytics.


2. Continuuity

The brainchild of former Yahoo Chief Cloud Architect Todd Papaioannou and Facebook HBase engineer Jonathan Gray, Continuuity wants to help all companies operate like its founders’ former employers. The team created a big data fabric that abstracts the complexities of connecting to Hadoop and HBase clusters and includes a full suite of developer tools. The goal is to make it easy to write big data applications serving either internal or external audiences.


3. Flurry

Flurry is like a one-stop mobile-app shop, and it’s generating nearly $100 million a year in revenue because it’s good at what it does. Not only does the company help developers build mobile apps, but it also helps them analyze all the data those apps generate in order to make them even better. The data also underpins the company’s ad network, which helps developers monetize their apps by putting the right advertisers in front of the right users.


4. Google Prediction API

Of all the tools in Google’s developer toolbox, the Google Prediction API might be the coolest. If you have good data to train a model, the Prediction API can apply machine learning to it in order to discern any number of pattern types and feed the answers into your application. Among the examples Google gives are spam detection, recommendation engines and sentiment analysis, and it provides step-by-step instructions for building those models.

Figure: sample training data, and a prediction result (“Mucho bueno” is probably Spanish).
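
To make the "train a model, then ask it questions" idea concrete, here is a minimal sketch of language identification in plain Python. It does not use the Prediction API at all; it is a tiny naive Bayes classifier over character trigrams, and the training phrases, the `train` and `predict` function names and the smoothing choice are all my own assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def ngrams(text, n=3):
    """Split text into overlapping character n-grams (padded with spaces)."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(examples):
    """Count trigrams per label. `examples` is a list of (label, text) pairs."""
    model = defaultdict(Counter)
    for label, text in examples:
        model[label].update(ngrams(text))
    return model

def predict(model, text):
    """Return the label whose trigram counts best explain `text`."""
    vocab = {g for counts in model.values() for g in counts}
    best_label, best_score = None, float("-inf")
    for label, counts in model.items():
        total = sum(counts.values())
        # Add-one smoothing so unseen trigrams do not zero out a label.
        score = sum(
            math.log((counts[g] + 1) / (total + len(vocab) + 1))
            for g in ngrams(text)
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data, invented for this sketch.
TRAINING = [
    ("English", "this is very good"),
    ("English", "where is the train station"),
    ("Spanish", "muy bueno gracias"),
    ("Spanish", "donde esta la estacion"),
    ("French", "c'est tres bon merci"),
    ("French", "ou est la gare"),
]

model = train(TRAINING)
print(predict(model, "Mucho bueno"))
```

A hosted service would presumably handle far more data and more sophisticated models, but the shape of the workflow is the same: labeled examples go in, and predictions for new inputs come out.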

5. Infochimps

Although Infochimps is trying hard to make itself an enterprise IT company (hey, that’s where the money is), the company’s eponymous platform also provides real value for developers. Sitting atop its technologies for configuring and managing big data environments is Wukong, a framework for creating Hadoop jobs or streaming data flows using Ruby scripts. Infochimps also maintains a data marketplace full of API-accessible or downloadable datasets.


6. Keen IO

Keen IO won our Structure 2012 Launchpad competition with a message of delivering powerful analytics to mobile developers. With just a single line of code inserted that dictates what to track, the company claims developers can track pretty much whatever they want within their applications. At that point, it’s just a matter of creating a dashboard or query process in order to turn all that data into usable information.
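
As a rough sketch of that pattern (one tracking call per event, then a query on top), here is a self-contained Python example. It is not Keen IO's SDK: the `track` and `count_by` functions, the in-memory `EVENTS` list and the sample events are all hypothetical stand-ins for a hosted event store.

```python
from collections import Counter
from datetime import datetime, timezone

# In-memory stand-in for a hosted event store (an assumption of this sketch).
EVENTS = []

def track(collection, **properties):
    """Record one event; this is the single line an app would call."""
    EVENTS.append({
        "collection": collection,
        "timestamp": datetime.now(timezone.utc),
        **properties,
    })

def count_by(collection, key):
    """Count events in a collection, grouped by one property."""
    return Counter(
        event[key]
        for event in EVENTS
        if event["collection"] == collection and key in event
    )

# Inside the application: one call wherever something interesting happens.
track("purchases", item="sword", price=4.99, platform="ios")
track("purchases", item="shield", price=2.99, platform="android")
track("purchases", item="sword", price=4.99, platform="android")

# Later, for a dashboard or report: turn raw events into a summary.
print(count_by("purchases", "item"))
print(count_by("purchases", "platform"))
```

The point of the design is that the application only decides what to record; what to ask of the data can be decided afterward.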


7. Kontagent

Kontagent’s bread-and-butter business is its analytics platform for mobile, social and web applications, but it’s all built atop a Hadoop infrastructure designed to handle really big data. Earlier this year, the company turned that infrastructure loose with a new product that lets users mine their application data using the SQL-like Hive query language for Hadoop. Instead of tracking predetermined variables, they can dig in however they choose.
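
To illustrate what an ad hoc, SQL-style question over raw application events looks like, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for Hive. The `events` table, its columns and the sample rows are invented for this example and say nothing about Kontagent's actual schema.

```python
import sqlite3

# SQLite in memory stands in for a Hive table of raw application events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event TEXT, level INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("u1", "quit", 3),
        ("u2", "quit", 3),
        ("u3", "quit", 1),
        ("u1", "purchase", 2),
        ("u4", "quit", 3),
    ],
)

# A question nobody planned for in advance:
# on which game level do the most distinct users quit?
rows = conn.execute(
    """
    SELECT level, COUNT(DISTINCT user_id) AS quitters
    FROM events
    WHERE event = 'quit'
    GROUP BY level
    ORDER BY quitters DESC, level
    """
).fetchall()

for level, quitters in rows:
    print(f"level {level}: {quitters} users quit")
```

Because the raw events are kept, a different question next week only requires a different query, not new instrumentation in the app.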


8. Mortar Data

Mortar Data is Hadoop for developers, plain and simple. The company has offered its cloud service, which replaces MapReduce with a combination of Pig and Python, for almost a year. In November, it released the open source Mortar framework in order to build a community around sharing datasets and making it easier to write Hadoop pipelines. Mortar Data runs atop Amazon Web Services and currently supports Amazon S3 and MongoDB (hosted on Amazon EC2) as data sources.

Video: http://www.vimeo.com/51020237

9. Placed Analytics

Placed does away with scripts, APIs and any other developer legwork and just delivers the results. In the case of Placed, those results are detailed information about where and when consumers are actually using mobile apps and web sites — right down to the name of the business. This type of info can be useful for attracting advertisers as well as informing app design (e.g., implementing voice controls if people are using an app while driving).


10. Precog

Precog might look like any other proprietary business intelligence service, but underneath its covers there’s a twist. The company offers a service called Labcoat, which is an interactive development environment for writing analytics jobs based on the open source Quirrel query language. The IDE includes a tutorial for learning the language, as well as some complex functions, and Precog COO Jeff Carr told me even non-technical people can learn it in hours.

Video: http://www.youtube.com/watch?v=cLHU8JZztNs

11. Spring for Apache Hadoop

Hadoop is written in Java, but that doesn’t mean it’s easy for Java developers to learn or use. That’s why, in early 2012, SpringSource announced the Spring for Apache Hadoop project, which brings the ease of building Java applications with the Spring framework to Hadoop jobs. That means integration with other Spring apps, scripting using JVM-based languages and a generally easier way to develop applications that utilize Hadoop or related technologies such as Hive or HBase.

Video: http://www.youtube.com/watch?v=wlTnBzQ6KDU

12. StatsMix

In the same vein as BitDeli and Keen IO, StatsMix wants to let developers start collecting and analyzing application data using the languages they already know. The service automatically tracks certain metrics, but developers can add their own using the StatsMix API and predefined code libraries. The results are delivered via a collection of dashboards that users can customize, share and use to mash up multiple data sources into a single view.