Diffbot aims to convert the web into one big database, one page at a time

Diffbot, the itty-bitty startup with some big, bad backers, is about to release a new API that will let users convert product web pages into reusable data for pricing analysis, tracking and whatever other uses they can come up with.
The idea behind Diffbot’s visual robot is to scan different types of web pages, which are largely unstructured, and recognize them for what they are: rich data sources for other applications. “We take pages and analyze them and make a structure out of them using sophisticated techniques,” Diffbot founder and CEO Mike Tung told me recently.
Most web pages fall into a handful of broad categories: news articles, front pages, images, events and extracts. Diffbot recognizes each page’s type and turns it into what Tung calls a “true database representation.” The company has already released APIs for front pages and articles. What’s new here is the product API.
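To make the idea of a database-style representation concrete, here is a minimal Python sketch of turning a toy product page into a structured record. It is not Diffbot's API or its visual technique: the `ProductRecord` fields, the `ToyProductParser` class, the `title` and `price` class names and the sample markup are all hypothetical, and the extraction is a simple tag-based heuristic that a real page would quickly defeat.

```python
from dataclasses import dataclass, asdict
from html.parser import HTMLParser
from typing import Optional


@dataclass
class ProductRecord:
    """Hypothetical structured form of a product page (illustrative fields only)."""
    title: Optional[str] = None
    price: Optional[float] = None
    currency: Optional[str] = None


class ToyProductParser(HTMLParser):
    """Collects the first text found inside elements with class 'title' or 'price'."""

    def __init__(self):
        super().__init__()
        self._field = None   # field currently being captured, if any
        self.fields = {}     # raw text captured per field

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in ("title", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field and data.strip():
            self.fields[self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        # Stop capturing once the element closes, even if it held no text.
        self._field = None


def extract_product(html: str) -> ProductRecord:
    parser = ToyProductParser()
    parser.feed(html)
    parser.close()
    record = ProductRecord(title=parser.fields.get("title"))
    raw_price = parser.fields.get("price")
    # Assumption for this sketch: a leading "$" means US dollars.
    if raw_price and raw_price.startswith("$"):
        record.currency = "USD"
        record.price = float(raw_price[1:].replace(",", ""))
    return record


# Made-up sample markup standing in for a real product page.
SAMPLE_PAGE = (
    '<html><body><h1 class="title">Example Widget</h1>'
    '<span class="price">$1,299.50</span></body></html>'
)

record = extract_product(SAMPLE_PAGE)
print(asdict(record))
```

Once a page is reduced to a record like this, the pricing analysis and tracking uses mentioned above become ordinary queries over fields rather than scraping jobs.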
Customers include Instapaper, which can take that structured data and repurpose it for use on mobile devices, he said.
Academics and big vendors including Google, Microsoft and Yahoo are all working to better understand web pages. Google Research and Microsoft Research are no doubt doing similar work; the difference is that they are keeping it as a black box, Tung said. Diffbot is making its APIs and web-scanning SaaS available to the masses.
Diffbot is backed by such tech stars as Andy Bechtolsheim, Sky Dayton, Joi Ito, Brad Garlinghouse and Jonathan Heiliger.