Twitter open sourced a recommendation algorithm for massive datasets

Late last month, Twitter open sourced an algorithm that’s designed to ease the computational burden on systems trying to recommend content — contacts, articles, products, whatever — across seemingly endless sets of possibilities. Called DIMSUM, short for Dimension Independent Matrix Square using MapReduce (rolls off the tongue, no?), the algorithm trims the list of potential combinations to a reasonable number, so other recommendation algorithms can run in a reasonable amount of time.

Reza Zadeh, the former Twitter data scientist and current Stanford consulting professor who helped create the algorithm, describes it in terms of the famous handshake problem. Two people in a room? One handshake; no problem. Ten people in a room? Forty-five handshakes; still doable. However, he explained, “The number of handshakes goes up quadratically … That makes the problem very difficult when n is a million.”
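The quadratic growth Zadeh describes follows from the handshake formula, n(n − 1)/2. A quick illustrative calculation (not Twitter's code) shows why a million users is a problem:

```python
def handshakes(n: int) -> int:
    """Number of unique pairs among n people: n choose 2."""
    return n * (n - 1) // 2

print(handshakes(2))          # 1 handshake
print(handshakes(10))         # 45 handshakes
print(handshakes(1_000_000))  # 499,999,500,000 pairwise comparisons
```

At a million people, the count is roughly half a trillion comparisons, which is why pruning the candidate list first matters.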

Twitter claims 271 million active users.

DIMSUM works primarily in two different areas: (1) matching promoted ads with the right users, and (2) suggesting similar people to follow after users follow someone. Running through all the possible combinations would take days even on a large cluster of machines, Zadeh said, but sampling the user base using DIMSUM takes significantly less time and significantly fewer machines.
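The sampling idea can be sketched on a single machine. This is a simplified, illustrative version of the scheme described in the DIMSUM paper, with our own function names and a sparse-row representation; Twitter's production version runs as a distributed MapReduce job:

```python
import math
import random
from collections import defaultdict
from itertools import combinations

def dimsum_similarities(rows, gamma, seed=0):
    """Estimate pairwise cosine similarities between columns by sampling
    row contributions instead of summing every one of them.

    rows: list of sparse rows, each a dict {column_id: value}.
    gamma: oversampling parameter; higher means more accuracy, more work.
    """
    rng = random.Random(seed)
    # First pass: column norms.
    sq = defaultdict(float)
    for row in rows:
        for j, v in row.items():
            sq[j] += v * v
    norm = {j: math.sqrt(s) for j, s in sq.items()}

    # Second pass: sample each row's pairwise contributions. Pairs of
    # high-norm (popular) columns are sampled with lower probability,
    # which is what keeps the cost independent of the dimension.
    sums = defaultdict(float)
    for row in rows:
        for i, j in combinations(sorted(row), 2):
            p = min(1.0, gamma / (norm[i] * norm[j]))
            if rng.random() < p:
                sums[(i, j)] += row[i] * row[j]

    # Unbias the sampled sums to get cosine-similarity estimates.
    return {pair: s / min(gamma, norm[pair[0]] * norm[pair[1]])
            for pair, s in sums.items()}
```

With gamma set high enough, every pair is sampled with probability 1 and the estimates become exact cosine similarities; smaller gamma trades accuracy for far less work on the heavily followed accounts that would otherwise dominate the computation.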

A graph showing performance improvements for ad-matching since switching to DIMSUM. Source: Twitter

Essentially, the algorithm, which runs daily at Twitter, pre-processes the pools of users and ad inventory based on factors such as who follows whom or which words are used. The closer two users or two ads are in a vector space, the more similar they are. This way, when the recommendation algorithms come in to actually match ads with users, or users with other users, they only have to focus on the ones that are above a prescribed similarity threshold.
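The vector-space matching described above can be sketched as follows. The feature vectors and the threshold value here are made up for illustration; Twitter's actual features (follow graph, word usage) and cutoffs are not public:

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def candidate_pairs(vectors, threshold):
    """Keep only pairs whose similarity clears the threshold, so the
    downstream recommender never sees the dissimilar ones."""
    return [(i, j, s)
            for (i, u), (j, v) in combinations(vectors.items(), 2)
            if (s := cosine(u, v)) >= threshold]

# Hypothetical follow-graph features: 1.0 means "follows account k".
users = {
    "alice": [1.0, 1.0, 0.0],
    "bob":   [1.0, 1.0, 1.0],
    "carol": [0.0, 0.0, 1.0],
}
print(candidate_pairs(users, threshold=0.5))
```

Here alice and carol share no follows, so that pair is pruned before any recommendation algorithm ever has to consider it.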

DIMSUM is probably best suited for environments operating at large scale, like Twitter, where pre-processing is needed to speed up the recommendation process. Zadeh said Google does something similar for its news page, calculating vectors of words and their frequencies in order to determine what to lump under the same topic.

In smaller environments, and with enough machines, Zadeh said, it's usually possible to brute-force all those comparisons without breaking anything or spending untold hours waiting for the process to complete.

Twitter has released versions of DIMSUM for Scalding, its Scala-based Cascading API, as well as for Apache Spark.