The web is becoming more dynamic, context-aware and personalized by the day, and the amount of information consumed by each person is increasing exponentially. But while hardware performance is improving, except when it comes to the simplest of parallel programming tasks, software infrastructure is not keeping pace. We need to develop new data processing architectures — ones that go beyond technologies like memcached, MapReduce, NoSQL, etc.
Think of this as a search problem. Traditionally, there was an index of every document in which every word occurred. When a query was received the search engine could just look up the precomputed answer to which documents had which word. For a personalized search, an exponentially larger index is needed that includes not only factual data (words in a document, brand of cameras, etc.) but also taste and preference data (people who like this camera tend to live in cities, be under 40, love “Napoleon Dynamite,” etc.).
Unfortunately, personalizing along 100 taste dimensions leads to nearly as many permutations of recommendation rankings as there are atoms in the universe! Obviously there isn’t enough space to precompute what recommendations to show every possible type of person that queries a site. Additionally, precomputing the answer to queries is too slow. People expect real-time results, not hours- or days-old precomputed answers. If I tell Amazon I don’t like a book, I want to immediately see that reflected in my recommendations.
We’re at a turning point in how we need to build web sites to handle these sorts of personalization problems. While first-generation distributed systems split the application into three tiers — web servers, application servers and databases — second-generation systems build large non-real-time back-end clusters to analyze huge amounts of sales data, index billions of web documents etc.
A third generation of systems is now emerging, with the computation shifting from those back-end clusters into front-end real-time clusters. After all, you just can’t build a back end that precomputes personalized results for millions of Internet users. You have to compute it in real time.
Adding complexity, many personalization problems are more difficult to parallelize than a lot of traditional back-end applications. Indexing the words in web pages is actually a lot easier to parallelize than are the long sequence of matrix calculations required to optimize a user’s recommendations.
Matrix calculations tend to involve complicated data access patterns that mean it’s hard to partition calculations and their data across a cluster of computers. Instead there tends to be a lot of sharing among many different computers, each of which holds a piece of the problem and updates the others as data changes. This back-and-forth data sharing is both incredibly hard to keep track of for the programmer, and can significantly degrade application performance.
The systems we’ve built at Hunch to solve this started off using distributed caching with memcached but very quickly veered into something more akin to distributed shared memory (DSM) systems, complete with multiple levels of caching, coherency protocols with application-specific consistency guarantees and data replication for performance. With an abundance of processing cores at our disposal, the real challenges tended to revolve around getting the right data to the right core.
I think that in a few years we’ll look back at this time as an era in which a slew of new large-scale programming challenges and their solutions were born. Hopefully we’ll also see more open-source solutions along the lines of memcached and Hadoop, so that building personalized and real-time web applications is easy for everyone.
Tom Pinckney is the co-founder & VP of engineering of Hunch.com.
Related GigaOM Pro content: