Anyone who’s used Twitter’s search engine — which is based on Summize, a company Twitter acquired in 2008 — knows it isn’t a great user experience. Not only is it cumbersome to use, but the results that come back from a search aren’t very comprehensive, since the index is limited to only a few days’ worth of tweets. That’s created a market for search engines like Topsy, which says it now has the most comprehensive index of Twitter content available, having recently indexed its 5 billionth tweet and its 2.5 billionth link from the social network.
The company, which competes not only with Twitter’s own search but also with Google, Bing, and other real-time search engines like OneRiot (all of which index Twitter’s firehose of tweet data), says it has developed a new architecture designed to index over 100 billion status updates and related objects. Topsy also notes that its platform can do this for data “from any social network,” which suggests that the company is looking to go beyond Twitter and supply the same kind of search function for Facebook status updates and for other social networks.
Branching out would be a wise move, since Twitter has made no secret of the fact that it plans to continue “filling the holes” in its feature set: an approach that’s seen the company acquire Twitter clients like Tweetie, as well as launch its own competing features such as a link shortener, both of which have caused some consternation among third-party developers. Google, meanwhile, launched a Twitter archive search several months ago based on the entire Twitter index, and it has some interesting features, including a timeline view.
Topsy says that its new architecture consists of a cluster of more than 500 servers with about a petabyte (a thousand terabytes) of storage. Tweets received through the Twitter firehose, as well as other sources such as the company’s regular API, are queued using a service that Topsy calls the Swarm, and the company says this occurs tens of thousands of times per second. Each tweet results in at least 10 pieces of data related to the message, such as the user’s handle and any related links, and the search engine adds about 500 million pieces of data to its index every day.
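To make that flow concrete, here is a minimal Python sketch of the general idea: status updates wait in a queue, each one is split into several pieces of data, and each piece goes into a search index as soon as the update is dequeued. This is not Topsy's code, and nothing here reflects how the Swarm actually works. The function names, the regular expressions, and the sample updates are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from queue import Queue

# Hypothetical patterns for illustration; real tweet parsing is more involved.
LINK_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@(\w+)")


def extract_records(handle, text):
    """Split one status update into (field, value) pieces of data."""
    records = [("handle", handle.lower())]
    records += [("link", url) for url in LINK_RE.findall(text)]
    records += [("mention", name.lower()) for name in MENTION_RE.findall(text)]
    # Strip links and mentions before tokenizing the remaining words.
    words = MENTION_RE.sub(" ", LINK_RE.sub(" ", text))
    records += [("term", w.lower()) for w in re.findall(r"[A-Za-z0-9']+", words)]
    return records


def drain(queue, index):
    """Index every queued update as it is dequeued, with no batch crawl."""
    while not queue.empty():
        update_id, handle, text = queue.get()
        for record in extract_records(handle, text):
            index[record].add(update_id)


# Assumed sample data standing in for a stream of status updates.
incoming = Queue()
incoming.put((1, "Alice", "Topsy indexes links http://example.com/a via @bob"))
incoming.put((2, "Bob", "Real-time search of links"))

index = defaultdict(set)
drain(incoming, index)
```

In this sketch, looking up `index[("term", "links")]` returns the IDs of both sample updates, while `index[("link", "http://example.com/a")]` returns only the first. A production system would spread the queue and the index across many machines; the single in-memory dictionary here only shows the shape of the data.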
As the company notes, traditional search engines have operated using a “batch” process, in which robots crawl the web at pre-arranged intervals and index any information that has changed. This means that only part of an index is up-to-date at any one time, while other information is still being processed. Google recently re-engineered its indexing system to improve its real-time search capabilities and launched a new version called Caffeine, something I discussed in a recent GigaOM Pro research report (subscription required).