The clarity and mystery behind what makes Twitter run

If there’s one thing about Twitter everyone can agree on, it’s that it takes some clever code to handle the billions of messages flowing into and out of the micro-blogging service every day. Twitter’s engineering team has been nothing if not stellar in building those tools, everything from new data-processing techniques to tools for making the site run faster. However, as you drill down into the actual servers and other gear on which Twitter runs, things start to look a little muddier, as evidenced by the fail whale that Twitter just can’t seem to harpoon.

New problems require new solutions

Like its web predecessors Google, Facebook and Yahoo, Twitter has been forced to create its own solutions for its own problems in many circumstances. The names might be funny — Gizzard, Scalding, Zipkin, Cassovary, SpiderDuck — but they’re the products of seriously hard work. That Twitter can handle more than 12,000 tweets per second during events such as the Super Bowl without crashing or slowing down is evidence of that.

Just look, for example, at the history of Twitter’s search engine. In one year, it went from a legacy MySQL implementation to a Lucene implementation with a Java front end that, as of May 2011, was indexing an average of 2,200 tweets per second and serving 1.8 billion queries per day. But beyond the sheer volume and velocity of tweets, Twitter has to figure out a way to provide users with results that are relevant to them both personally and with regard to time.

Regardless of how functional Twitter search might be on a given day in terms of delivering the best results, it’s a complicated beast. In order to surface personally relevant tweets, it first analyzes queries against users’ social graphs — which requires talking to Twitter’s purpose-built Cassovary graph-processing library and FlockDB database — to figure out whose tweets might be most important. And although there are traditional methods for ranking content in fairly static datasets, those go out the window when a database is hit with a deluge of new data on the same topic — like last October, when Steve Jobs died and his name suddenly accounted for 15 percent of all searches.
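To make that idea concrete, here is a toy Python sketch of ranking that blends social-graph closeness with recency. It is not Twitter’s code and uses neither Cassovary nor FlockDB; the `Tweet` class, the `follows` dictionary, the weights and the one-hour half-life are all assumptions invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    author: str
    text: str
    age_seconds: float


def social_weight(viewer, author, follows):
    """Hypothetical weights: 1.0 for a direct follow, 0.5 for a
    friend-of-a-friend, 0.1 for everyone else."""
    direct = follows.get(viewer, set())
    if author in direct:
        return 1.0
    if any(author in follows.get(mid, set()) for mid in direct):
        return 0.5
    return 0.1


def score(tweet, viewer, follows, half_life=3600.0):
    """Social weight multiplied by an exponential recency decay.
    The one-hour half-life is an assumed value, not a known setting."""
    recency = 0.5 ** (tweet.age_seconds / half_life)
    return social_weight(viewer, tweet.author, follows) * recency


def rank(tweets, viewer, follows, query):
    """Keep tweets whose text contains the query, best score first."""
    matches = [t for t in tweets if query.lower() in t.text.lower()]
    return sorted(matches, key=lambda t: score(t, viewer, follows), reverse=True)


if __name__ == "__main__":
    follows = {"ann": {"bob"}, "bob": {"cat"}}
    tweets = [
        Tweet("dan", "Steve Jobs has died", 0),
        Tweet("cat", "steve jobs news", 1800),
        Tweet("bob", "Steve Jobs tribute", 3600),
        Tweet("bob", "lunch", 0),
    ]
    for t in rank(tweets, "ann", follows, "steve jobs"):
        print(t.author, round(score(t, "ann", follows), 3))
```

Even in this toy version, a stranger’s brand-new tweet can rank below an hour-old tweet from someone the searcher follows. Deciding how those two signals should trade off during a sudden spike is the part that fixed ranking rules handle poorly.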

An affinity for open source

To its credit, Twitter has been rather open about discussing the homegrown code that makes it run. In fact, Twitter has proven itself quite willing, maybe more so than even Facebook, to make a lot of that code open source. Since the beginning of 2011, the company has discussed its work on at least 27 projects or features on its Twitter Engineering blog, and has open sourced 16 technologies:

Clockwork Raven   Project Kiji   TwUI            Storm
Finagle           Scalding       Cassovary       MySQL fork
Ambrose           Zipkin         Elephant Twin   Iago
Twemcache         TwitterCLDR    Trident         HDFS-DU

That damned fail whale

Despite all its hard work, though, Twitter is still haunted too frequently by its now-infamous fail whale. What was once a cute way of saying “we’re still trying to figure this out” now tends to evoke more of a “how haven’t they figured this out yet?!” reaction. And as impressive as the software engineering that handles all that traffic is, we know a lot less about how Twitter’s engineers keep its service running at the system level.

Building and managing distributed systems is hard, and things you never expected can pop up out of the blue to wreak major havoc — just ask Amazon Web Services. In late July, for instance, Twitter went down for the second time in five weeks and blamed the outage on its primary and backup systems failing simultaneously. However, whereas many companies might have issued a lengthy explanation detailing the exact cause to users, Twitter was markedly more vague. It blamed the earlier outage in June on a cascading bug without offering any more detail.

Opaqueness, though, is Twitter’s modus operandi when it comes to its core infrastructure. Although no one comes out and tells the world everything, we know where Facebook, Google, Apple, Yahoo — pretty much everybody — are building data centers and where they continue to lease space. We know how Google is experimenting with software-defined WANs and how efficient Facebook is with its water usage. With a little digging, some folks have been able to make pretty good estimates of how many servers these companies are running.

But aside from a 2011 blog post detailing its migration from an old data center into a new, custom-designed space somewhere, Twitter hasn’t really talked a lot about its infrastructure. Scale, architecture, even locations — these things are all relatively mysterious. It’s like a highway on which there are a lot of nice cars engaged in a wonderfully orchestrated road race, but no one’s really sure if the highway can support their weight or when a bridge might collapse. Or like a state-of-the-art house built atop an Indian burial ground.

Twitter declined a request to talk about its infrastructure for this story.

Six years into its existence, Twitter has given an awful lot back to the world of webscale application design, but has largely been a bystander in the discussion of webscale infrastructure. Hopefully, for Twitter’s sake, that doesn’t come back to bite the company. While other web platforms take pride in showing off the gear and design principles that serve hundreds of millions of users, it might be hard to take a service as big as Twitter seriously when you can’t see the foundation it’s built on.

This is the second in a series of stories about Twitter’s transitional summer. On Monday, Mathew Ingram examined the evolution of Twitter’s corporate priorities as it grows into a mature business, and on Tuesday, Eliza Kern looked at the impact of its recent API policy shift on the partners who helped make Twitter a success.

Feature image courtesy of Shutterstock user Oleksly Mark; Twitter search architecture image courtesy of Twitter; Poltergeist image courtesy of Metro Goldwyn Mayer.