Meet the Web Database Company Google Just Bought (Hint: Not Metaweb)

Google (GOOG) is in the process of acquiring ITA Software, an airfare information provider that brings the company into the realm of vertical travel search. The deal also creates potentially awkward competition with ITA customers like Kayak and Bing. But travel isn’t the only thing ITA does; a few years ago, a research division of the company started building a build-your-own-database tool for web data. Needlebase, as it’s known, is a nifty way to give structure to disorganized and constantly changing information on any topic.

Needlebase, which started giving out free beta trials in January, uses machine learning to assemble scraped data from web sites and other sources into a hosted database intended to power vertical search engines. It’s similar to Metaweb’s Freebase — the other semantic web/structured data acquisition Google made this month — but instead of a giant public database like Freebase, each user’s Needlebase account is private unless designated otherwise. Needlebase is designed for use by anyone who wants to organize and play with data — such as an avid soccer fan looking to parse and visualize game stats — regardless of technical knowledge. But it’s also built to be powerful and reliable enough for commercial vertical search engines to use it as part of their backend.

Needlebase could be a key indicator of Google’s vertical search strategy going forward. It was out of character for the search engine to buy something as domain-specific as ITA, but Needlebase is not domain-specific at all. As I wrote in a previous post about Google’s forays into vertical search:

“Google could potentially use ITA as a way to get into many more verticals without additional acquisitions or major new products. Perhaps Google was interested in vertical search, but it may be even more interested in an easy way to take massive amounts of unstructured data and give them structure. It would be the equivalent of spinning straw into gold.”

The 14-member Needlebase staff is led by Justin Boyan, ITA’s VP of web data integration, a former NASA Ames and online anonymity researcher who’s been with the company nearly 10 years. Boyan said in a recent phone interview that he’s “optimistic” Needlebase will fit into Google’s plans for Boston-based ITA, and Needlebase will continue to serve existing and new beta users.

Boyan described the impetus for Needlebase, which is a more generic version of some of the technology behind ITA’s main airfare product, QPX:

“It doesn’t require solving the AI problem to take out the columns of a table. And there’s no reason to have to labor over maintaining Perl scripts. It just seemed like a real nice match to the kinds of machine learning that we were already familiar with.”

Cloud-based Needlebase starts with a wizard tool for scraping data from websites, including JavaScript-heavy and form-driven ones, as well as from CSV, XML and Excel files. The secret sauce is that the next time it refreshes data from a source, it remembers what it learned from the user’s edits, cleanups and duplicate deletions, and applies that learning to the new data automatically. All the while, Needlebase normalizes, geocodes, fixes capitalization and makes other tweaks so that data can be merged and queried.
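To make the idea of remembering cleanups across refreshes concrete, here is a minimal Python sketch. It is not Needlebase’s actual method, which Boyan describes as machine learning; it simply replays explicit merge decisions a user made earlier. The names `clean_text`, `CleanupMemory`, `remember_merge` and `apply` are hypothetical, as are the sample records.

```python
from dataclasses import dataclass, field


def clean_text(value: str) -> str:
    """Collapse runs of whitespace and fix capitalization (simple title case)."""
    return " ".join(value.split()).title()


@dataclass
class CleanupMemory:
    """Hypothetical store of a user's past merge decisions."""
    # Maps a cleaned variant spelling to the cleaned canonical spelling.
    aliases: dict = field(default_factory=dict)

    def remember_merge(self, variant: str, canonical: str) -> None:
        """Record that the user merged `variant` into `canonical`."""
        self.aliases[clean_text(variant)] = clean_text(canonical)

    def apply(self, records: list, key: str = "name") -> list:
        """Normalize freshly scraped records and replay remembered merges."""
        merged = {}
        for raw in records:
            record = dict(raw)  # copy so the scraped input is left untouched
            name = clean_text(record[key])
            record[key] = self.aliases.get(name, name)
            # Keep only the first record seen for each canonical name.
            merged.setdefault(record[key], record)
        return list(merged.values())


# The user once merged "man utd" into "Manchester United".
memory = CleanupMemory()
memory.remember_merge("man utd", "Manchester United")

# A later refresh brings in messy data containing the same duplicate.
refreshed = [
    {"name": "  manchester   united ", "goals": 3},
    {"name": "MAN UTD", "goals": 3},
    {"name": "arsenal", "goals": 1},
]
for row in memory.apply(refreshed):
    print(row)
```

A learned model would generalize to spellings the user never touched; this lookup table only handles variants it has already seen, which is the main simplification here.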

Needlebase has been used so far to manage information about movies, jobs, hotels, events, weather and oil spills, said Boyan. Check out two sample projects for 2010 World Cup stats and heavy metal bands. Boyan said Needlebase is intended to be a commercial-grade tool. “We’re looking for aggregators whose business is aggregation: people building vertical search engines, doing data gathering and analysis, and business analysts.” He said he hopes to soon announce Needlebase’s first two paying customers. Pricing will be cloud-style, pay-as-you-go based on the amount of data each customer acquires, hosts and publishes.

Needlebase has had virtually no publicity to date, but with Google’s name now behind it, the stakes have changed. Then again, the acquisition could also have the effect of scaring off potential customers who are worried about Google competing with them in vertical search.
