3 lessons in database design from the team behind Twitter’s Manhattan

In April, Twitter announced that it had grown tired of bending existing database technologies to its unique needs and had built its own database, called Manhattan. Last week, I spoke with the trio of Twitter engineers who built the database (Chris Goffinet, Peter Schuller and Boaz Avital) to get a higher-level view of Manhattan beyond its technological underpinnings, and to get a better sense of how prevalent Manhattan will be inside Twitter and what its creation says about software development at web scale.

Here’s what they had to say.

Don’t throw out the baby with the bathwater

Manhattan was designed to be a generic key-value store that can handle a large number of Twitter-specific use cases, from tagging photos to real-time search. It was designed to be modular, Avital explained, so that Twitter can keep adding new features and capabilities, hopefully without ever having to rebuild it from scratch again. “More and more, any time you go to the site you’ll be hitting it extensively for anything you’re doing,” he said.

In fact, Schuller noted, “As a general-purpose distributed store, it is basically the default choice.”

But that doesn’t mean Twitter is giving up entirely on its legacy database systems. MySQL is still very important to Twitter (the company is part of the WebScaleSQL project, along with Facebook, Google and LinkedIn), particularly for storing user data, tweets and social graphs, Goffinet said. Some of Twitter’s previous database creations, such as MySQL framework Gizzard, are still running for certain tasks, too.

Like many engineering decisions, Avital said, picking the right database is about picking the right tool for the job. It just so happens that tools built for a company’s specific jobs tend to work better most of the time.

Source: The 451 Group

General-purpose databases are great … for general-purpose applications

“We don’t really need a really general-purpose storage system that tries to solve lots of problems,” Goffinet said, referencing past experiments with other key-value stores such as HBase and Cassandra.

Many systems might work well up to a certain point, but most large web companies have some specific workloads or traffic characteristics that are unique to their applications. That’s why so many NoSQL databases (Cassandra, Voldemort and Bigtable, to name a few) were created within companies such as Facebook, LinkedIn and Google. It’s also why those companies tend to move off of them pretty quickly as demands and applications change, even while the open source projects they released or inspired might continue to grow.

At our Structure conference next month, engineering executives from Facebook, Google, Amazon and other web companies will share the rationale behind some of their architectural decisions, and how that calculus changes as the web continues to evolve.

Schuller described the design process behind Manhattan as basically looking at what everything else does well, and then “picking the parts that are important for Twitter and making those better.” Goffinet pointed to the storage aspect of databases, something they focused on with Manhattan, as a prime example of this approach.

“One thing I’ve always noticed is [other projects] focus a lot of their energy on the distributed systems part … but they don’t seem to focus too much on the actual storage part,” he said.

The core components of Manhattan. Source: Twitter

And then there’s the issue of scalability, which can be a misleading term. Most database systems run across thousands of nodes only inside the companies that created them, Goffinet noted, so they often haven’t been battle-tested against the problems that arise in other webscale environments. “They really do all break at scale in some sort of way,” he said.

“If and when” Twitter decides to open source Manhattan, Schuller added, the aim would be to make it as useful as possible to a broad range of users whose applications might look nothing like those running inside Twitter.

User experience as a major design principle

However, Manhattan’s creators weren’t just concerned with performance-related issues at scale. They also thought about productivity as Manhattan grows to encompass more applications and more users. How the operations staff prefers to interact with a system probably isn’t anything like how other employees might want to.

With that in mind, Goffinet, Schuller and Avital set out to build Manhattan to function as much as possible like any other cloud storage service someone might want to use — like Amazon Web Services for Twitter’s internal users. “Think of every engineer as their own startup and they just want to get something done,” Goffinet explained. “… There are literally no roadblocks to get started for them.”

Part of delivering anything as a service is designing it to serve multiple users simultaneously, and Schuller said a small number of Manhattan clusters are currently serving a growing number of applications within Twitter. “Having a single giant cluster is kind of an aspirational goal,” he said, but he’s not foolish enough to bet on that actually ever happening.

So far, their decisions appear to be working. Avital said other employees have told him Manhattan is “a joy” to use (as much as any database can be a joy, presumably).

Goffinet suggested that companies serious about increasing developer productivity follow in Twitter’s footsteps by developing internal systems that mostly mirror those they would deem acceptable to release as products. “If I was a paying customer,” he said, “I’d expect the thing to just work for me.”

Update: This post was updated at 11:52 a.m. on May 13 with an updated map of the database landscape.