Gizzard Anyone? Twitter Open Sources Code to Access Distributed Data

Twitter last night offered up the code for Gizzard, an open-source framework for accessing distributed data quickly. Twitter built it to help the site deal with the millions of requests it gets from users who want to see their own tweets and those of their friends. It could become an important component of building out web-based businesses, much as Facebook’s Cassandra project has swept through the ranks of webscale startups and even big companies.

Gizzard is a middleware networking service that sits between the front-end web site client and the database. It attempts to divide and replicate stored data in intelligent ways that allow the site to access it quickly. Gizzard’s function is to take the requests coming in through the fire hose and allocate that stream across multiple databases without slowing things down. It’s also fault-tolerant, which means that if one section of data becomes unavailable, the service will try to route requests to other sections. From the Twitter blog post:

Twitter has built several custom distributed data-stores. Many of these solutions have a lot in common, prompting us to extract the commonalities so that they would be more easily maintainable and reusable. Thus, we have extracted Gizzard, a Scala framework that makes it easy to create custom fault-tolerant, distributed databases.

Gizzard is a framework in that it offers a basic template for solving a certain class of problem. This template is not perfect for everyone’s needs but is useful for a wide variety of data storage problems. At a high level, Gizzard is a middleware networking service that manages partitioning data across arbitrary back-end datastores (e.g., SQL databases, Lucene, etc.).
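To make the idea concrete, here is a minimal sketch in Python of what "partition and replicate" can look like in a routing layer. It is not Gizzard's code or API (Gizzard itself is written in Scala), and every name in it (Backend, ReplicatedShard, Router) is a hypothetical stand-in. The sketch assumes keys are integers split into ranges, that each range is copied to two back ends, and that a read simply falls through to the next copy when one is down.

```python
import bisect


class Backend:
    """Hypothetical stand-in for one back-end datastore (e.g., a SQL database)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.up = True  # flip to False to simulate an outage

    def put(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.rows[key] = value

    def get(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.rows[key]


class ReplicatedShard:
    """One partition of the data, copied to several back ends."""

    def __init__(self, replicas):
        self.replicas = replicas

    def put(self, key, value):
        # Write to every replica that is reachable right now.
        written = 0
        for replica in self.replicas:
            try:
                replica.put(key, value)
                written += 1
            except ConnectionError:
                continue
        if written == 0:
            raise ConnectionError("no replica accepted the write")

    def get(self, key):
        # Try each replica in turn and skip the ones that are down.
        for replica in self.replicas:
            try:
                return replica.get(key)
            except ConnectionError:
                continue
        raise ConnectionError("all replicas are down")


class Router:
    """Maps a key to the shard that owns its range (assumed range partitioning)."""

    def __init__(self, lower_bounds, shards):
        # lower_bounds[i] is the smallest key owned by shards[i], sorted ascending.
        self.lower_bounds = lower_bounds
        self.shards = shards

    def shard_for(self, key):
        index = bisect.bisect_right(self.lower_bounds, key) - 1
        if index < 0:
            raise KeyError(f"no shard owns key {key}")
        return self.shards[index]

    def put(self, key, value):
        self.shard_for(key).put(key, value)

    def get(self, key):
        return self.shard_for(key).get(key)


a1, a2 = Backend("a1"), Backend("a2")
b1, b2 = Backend("b1"), Backend("b2")
router = Router([0, 1000], [ReplicatedShard([a1, a2]), ReplicatedShard([b1, b2])])

router.put(42, "tweet from user 42")      # lands on a1 and a2
router.put(1500, "tweet from user 1500")  # lands on b1 and b2

a1.up = False          # simulate one database failing
print(router.get(42))  # still answered, this time by a2
```

The client only ever talks to the router, which is the point of putting this logic in middleware: the back ends can be split, moved or lost without the web site knowing. The sketch leaves out the hard parts, such as catching up a replica that missed writes while it was down.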

The goal is to deliver relevant information to users faster across the huge data sets that Twitter manages. Twitter said its FlockDB distributed social graph database can serve 10,000 queries per second, per commodity machine, using Gizzard. I heard Twitter’s Kevin Weil talk about the project a few weeks ago at SXSW, and at the time he said the company was building something to help manage distributed data sets using a Scala framework. This appears to be exactly that.

Whether Gizzard turns into another Cassandra or fizzles is open to debate, but the act of figuring out how to work with giant data sets and then sharing that information with others is an essential step in creating webscale businesses. Thus, Twitter’s decision to solve its own problem and then share its solution is beneficial for the startup community.

I’ve chatted with developers who feel that Google’s development of BigTable and the company’s decision to keep it to itself stalled the progress of building out webscale infrastructure for a few years until Facebook opened up Cassandra. This may be sour grapes (after all, a company does not have to open up code that gives it a strategic advantage), but it does highlight how difficult it is to build code that can handle and scale for millions of users. Sharing ways to do that lowers the barriers to entry for startups, much as compute clouds such as Amazon’s EC2 or Rackspace’s CloudServers can.

So for anyone who wants some Gizzard, Twitter is happy to share.

Image courtesy of Flickr user Sifu Renka