With Caffeine, Google Reveals the Challenges of Real Time

When Google announced the launch of its new search index, code-named Caffeine, some observers may have wondered why the world’s leading search engine would even need to do such a thing. Doesn’t Google already control a gigantic proportion of the search business, as well as its Siamese twin, the phenomenally lucrative search-related keyword advertising business? So why the pressure to beef up the search index? The biggest clue is in the name of the new system: Caffeine. Just as coffee drinkers hope they can work faster with a jolt of the chemical, so Google needed to respond faster to a world that’s become increasingly real-time. Not just because it wanted to — because it had to.

Google first announced it was working on an update to the index last August. Coincidentally or not, that was right around the time Microsoft and Yahoo announced their search partnership, whereby the software giant’s Bing search engine would power the results at all of Yahoo’s properties. But while Microsoft’s search results have become more competitive with Google’s over the past several years, competing with Microsoft wasn’t the main impetus behind Google’s desire to re-engineer its index.

The biggest push came from the simple fact that the web is speeding up all around us, thanks largely to the skyrocketing popularity of social media sites like Twitter and Facebook, as well as other real-time publishing tools such as PubSubHubbub. As the central library through which many people gather online information, Google had to speed up in order to catch up. I looked at the background behind these changes and their implications in an article for GigaOM Pro, which you can find here (subscription required).

Without getting too technical about the changes, Google’s previous indexing system worked in large batches: websites and pages were “crawled” by the engine’s automated search bots every few weeks to detect changes, and the resulting updates accumulated in a pool. One consequence was that none of the pages in the update pool could be accessed by searchers until the entire batch had finished processing. That meant large quantities of results were older than they should have been, up to several weeks old, even though newer versions were already sitting in the update. So Google decided to make more frequent but smaller updates to the index, meaning that in aggregate there would be more fresh results. In fact, the company says that Caffeine’s results are 50 percent fresher than those of the previous system (Stacey has a closer look at the new search indexing technology in this post).
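For readers who think in code, the difference between batch and incremental updates described above can be sketched in a few lines of Python. This is only a toy illustration, not Google's implementation: the class names `BatchIndex` and `IncrementalIndex`, their methods, and the example URL are all hypothetical, and a real index is vastly more complex than a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class BatchIndex:
    """Toy model (hypothetical): updates become visible only when the whole batch is done."""
    live: dict = field(default_factory=dict)     # what searchers can see
    pending: dict = field(default_factory=dict)  # crawled, but not yet searchable

    def crawl(self, url, content):
        # New content waits in the update pool.
        self.pending[url] = content

    def finish_batch(self):
        # Only now does the entire pool go live at once.
        self.live.update(self.pending)
        self.pending.clear()

    def search(self, url):
        return self.live.get(url)


@dataclass
class IncrementalIndex:
    """Toy model (hypothetical): each small update is searchable right away."""
    live: dict = field(default_factory=dict)

    def crawl(self, url, content):
        self.live[url] = content

    def search(self, url):
        return self.live.get(url)


batch = BatchIndex()
incremental = IncrementalIndex()

# Both indexes crawl the same freshly updated page.
for index in (batch, incremental):
    index.crawl("example.com/news", "story v2")

print(batch.search("example.com/news"))        # None: still stuck in the pool
print(incremental.search("example.com/news"))  # story v2

batch.finish_batch()
print(batch.search("example.com/news"))        # story v2, but only after the batch ends
```

In the batch model the fresh page exists but is invisible until the whole pool is published, which is the staleness the article describes. The incremental model trades one big publish step for many small ones.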

The critical thing Google realized is that, while search results that are a few days old might have been fine even a year or two ago, the web has become far more real-time than ever before, thanks to the volume of status updates, photos and other information coming from social networks such as Facebook and Twitter. Facebook has more than 500 million users, many of whom post updates, links and photos multiple times a day, and Twitter recently said that it sees more than 65 million messages posted every day. That deluge of information places increasing pressure on a search engine like Google to become more real-time in its results. Please see the full report here.

Post and thumbnail photos courtesy of Flickr user khrawlings