SXSW: When it Comes to Web Scale Go Cheap, Go Custom or Go Home

Facebook's future home for big data

Dealing with the terabytes of data generated by users online, and quickly serving up the relationships tied to that data, are forcing web-scale sites like Twitter, Reddit and Facebook to investigate a variety of home-built, open source and hardware solutions, and to reject as many closed-source software products (such as Oracle) and specialized hardware solutions as possible.
It may seem like a foregone conclusion, but the ideological and practical bias inherent in web-scale companies against closed-source software products and specialized physical hardware has big implications, not just for vendors like Microsoft and Oracle, if they get locked out of businesses built on the web, but for all businesses. That’s because broadband networks, cloud computing and the shift toward more rapid adoption and integration of web technology into our everyday lives change the business models and opportunities for all businesses.
The new business models will take into account the need to attract users individually, on a personal level, while also connecting them to the other products they use. These services will be designed to be accessed on phones, monitors and any other screen. They will be hyper-personal, even when the services are dictated by corporate IT departments. Because of the “use anywhere” nature of these services, and their myriad connections to other applications, they will have to manage a lot of user data, handle a lot of requests from outside a network, and scale out to meet demand.
Given this framework, the panel on scaling open source frameworks past MySQL was one of the more interesting ones at the South by Southwest Interactive conference this weekend. Scalable databases are part of the future of IT for many businesses. You can’t build the types of services discussed above without scalable databases. And those databases, and generally all of the tools used to achieve cheap and agile scale, are open source.
Citing a desire to support open source code, as well as the need to peek under the hood and be able to solve problems quickly, a panel of four guys responsible for building various architectures at Twitter, Facebook, Reddit and Imgur said specifically that they avoid Oracle in favor of rolling their own databases. Most even derided proprietary hardware and specialized networking gear, with the exception of Facebook’s Serkan Piantino, who said the company does use proprietary F5 gear behind software load balancers. Piantino also said that Facebook was testing super-fast solid-state drives from a company called Fusion-io as a means to speed up access to data.
But for the most part, building your own code and working with open source code ruled the day. Even if there wasn’t an open source solution that was readily available or mature, the consensus was that folks would wait until something was ready or, if the pain was too much, build it themselves. For example, an audience member asked the panel about good columnar database stores beyond Hadoop, and Kevin Weil from Twitter explained that there were some closed-source options out there, but the open source world’s products were still “a little early.” So Twitter does without for now.
Other tidbits of interest on scaling databases from the panel were:

  • Nginx got a big shout out as an alternative to full Apache as a web server.
  • HAProxy is also a popular way to either load balance or merely break requests up so that a cache or a database can serve them faster.
  • Both Twitter and Facebook are using P2P technology (Twitter calls it Murder) to provision services because it takes 37 seconds to bring one online instead of five to seven minutes.
  • Facebook plans to open source Haystack, its visual storage system, within a few months.
  • While Hadoop isn’t used much, or at all, on the front end at either Twitter or Facebook, engineers use it on the back end to deliver granular analytics about how people use the site that otherwise wouldn’t have been possible.
  • If you can’t speed up the process with better databases, caching or anything on the software and hardware side, try user interface tricks to make it seem faster, such as saying the video is done uploading even if it isn’t yet.
  • Facebook no longer thinks in terms of servers deployed; it now thinks in terms of deploying entire racks. The software the company is running is rack-aware, so it can take advantage of all of the bandwidth on a given switch in the rack. It looks like an intermediate step toward running your data center as a computer.
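
The user interface trick mentioned above, telling the user an upload is done while the slow work continues, can be sketched in a few lines. This is only an illustration under stated assumptions, not how any of the panelists' sites implement it: the class `OptimisticUploader`, the `slow_store` function and the message text are all hypothetical names invented for the example.

```python
import queue
import threading
import time


class OptimisticUploader:
    """Hypothetical sketch: acknowledge an upload at once, store it later."""

    def __init__(self, store):
        self._store = store            # assumed slow storage function
        self._pending = queue.Queue()  # uploads waiting to be stored
        self.completed = []            # names that were actually stored
        self.failed = []               # names whose storage raised an error
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def upload(self, name, data):
        # Queue the slow work and return a "done" message right away.
        self._pending.put((name, data))
        return f"{name}: upload complete"

    def _drain(self):
        # Background loop that performs the real, slow storage step.
        while True:
            name, data = self._pending.get()
            try:
                self._store(name, data)
                self.completed.append(name)
            except Exception:
                # A real service would have to tell the user about this.
                self.failed.append(name)
            finally:
                self._pending.task_done()

    def wait(self):
        # Block until every queued upload has been processed.
        self._pending.join()


def slow_store(name, data):
    # Stand-in for a slow write to a database or storage tier.
    time.sleep(0.05)


if __name__ == "__main__":
    uploader = OptimisticUploader(slow_store)
    print(uploader.upload("video.mp4", b"..."))  # returns before storage ends
    uploader.wait()
    print(uploader.completed)
```

The trade-off is that the acknowledgement can be wrong: if the background step fails, the user has already been told it succeeded, so the `failed` list here stands in for whatever follow-up notification a real site would need.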
