How Blippex handles the data behind its time-driven search engine

Search for something on Blippex, a recently launched search engine, and the results you’ll find might be quite different from those of a more popular search engine, such as Google. That’s because Blippex is designed to show users links to sites based on a specialized set of data, namely time spent on sites, instead of the number and quality of links, among other indicators.
Under the hood, too, Berlin-based Blippex’s way of handling incoming data — from extensions that users install on their browsers — departs in some ways from conventional web search architecture. For one thing, it doesn’t need huge quantities of highly optimized gear the way Google does.
Right now, Blippex is working with a small index; after all, the search engine caught attention just two weeks ago. It fits nicely on around 40 EC2 servers on Amazon Web Services (AWS), Blippex’s chief technology officer, Gerald Bäck, told me in a recent interview.
On those EC2 servers, everything relies heavily on two applications — at least for now. For full-text search, the open-source tool Elasticsearch is used. And data on what pages users are checking out lives inside MongoDB. That popular NoSQL database does the job — MySQL was too slow, Bäck said — but Blippex wants to deliver to users the fastest experience possible, ideally less than a second after a search is executed. So the company is experimenting with Redis, which can be many times faster than MongoDB under certain circumstances. That could shave milliseconds off the average amount of time users have to wait for results.
But as an in-memory key-value store — and one that Pinterest and others rely heavily on — Redis uses lots of memory and therefore can get pretty expensive and perhaps difficult to scale. So one configuration under consideration is using Redis to store the sites themselves and MongoDB to store the integer values that are used to calculate which websites users spend the most time on.
Generally speaking, though, the database Blippex uses keeps track of how much time users spend on a given website. The system has a way of making sure pages that sit idle — think of the tab that’s been open on your browser for three days — don’t get incorrectly interpreted as being the most valuable: Pages left open for more than five minutes just get reported as being visited for five minutes. When a user moves off a page, the browser extension tells the Amazon servers, typically keeping the database up to date with less than two minutes of lag.
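To make the five-minute cap concrete, here is a minimal sketch of how such a rule could work. It is not Blippex’s actual code: the function names, the (url, seconds) input format and the idea of ranking by summed dwell time are all assumptions made for illustration. Only the five-minute ceiling comes from the description above.

```python
from collections import defaultdict

# Five-minute ceiling described in the article; everything else is hypothetical.
MAX_DWELL_SECONDS = 5 * 60


def capped_dwell(seconds_open):
    """Clamp a reported visit duration to the range [0, MAX_DWELL_SECONDS]."""
    return max(0.0, min(float(seconds_open), float(MAX_DWELL_SECONDS)))


def total_dwell_by_url(visits):
    """Sum capped dwell time per URL from (url, seconds_open) pairs."""
    totals = defaultdict(float)
    for url, seconds_open in visits:
        totals[url] += capped_dwell(seconds_open)
    return dict(totals)


def rank_by_dwell(visits):
    """Return URLs ordered from most to least total capped dwell time."""
    totals = total_dwell_by_url(visits)
    return sorted(totals, key=totals.get, reverse=True)


if __name__ == "__main__":
    visits = [
        ("example.org/idle-tab", 3 * 24 * 3600),  # tab left open for three days
        ("example.org/article", 200),
        ("example.org/article", 250),
    ]
    print(total_dwell_by_url(visits))
    print(rank_by_dwell(visits))
```

In this sketch the tab left open for three days counts for only 300 seconds, so a page that was read twice for a few minutes each time still ranks above it.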
The thing is, web surfers might spend much more time poring over dense content, such as a paper in an academic journal, than on, say, a succinct news article about the same subject, even if the article is more successful at giving people just the information they’re looking for. In that case, time spent is not the best indicator of value. To counteract that anomaly, Blippex could add to the interface users see. “We think we can solve it by categorizing the content,” Bäck said.
As a European operation, Blippex could have opted for one of the Europe-based clouds, such as CloudSigma or ProfitBricks (we’ll be talking about those up-and-coming players and others at our Structure:Europe conference in London on Sept. 18-19), but instead it went with that leader of public-cloud computing, AWS, like many of today’s web startups. And in using and evaluating open-source tools such as Redis, MongoDB and Elasticsearch, Blippex has something in common with Yahoo, which at one point was working on Nutch as a tool for crawling the internet in order to build the index for its search engine. Nutch went on to provide the roots for Hadoop.
Now that Blippex is making a splash (it just did a data dump from its database) and collecting some recognition for its approach, there should be an interesting show to see as the company keeps experimenting with the algorithm and takes on more input. And that’s all the more so because the company will be able to focus on it exclusively, now that it has decided to shutter the Archify service for personal search.