STRUCTURE 08: Harnessing Explosive Growth

OK, so this is the sexy panel. We have tech people from some of the biggest web sites out there telling us about how they scale their sites.

  • Jeremiah Robison, Slide
  • Sandy Jen, Meebo
  • Jonathan Heiliger, Facebook
  • Akash Garg, hi5 Networks
  • Raj Patel, Yahoo!
  • James Barrese, eBay

Alistair: How much of scalability is architecture, and how much is throwing servers at the problem?

Heiliger, Facebook: The cheapest way to scale is adding servers, but over time the product is really what drives the infrastructure, so if the product is bad that’s going to cause you to have problems that are hard to engineer your way out of. When we added chat on Facebook we actually built a new backend for that. See a post on the Facebook engineering blog about this.

Alistair: Will Facebook and Meebo integrate chat?

Jen, Meebo: We’re open to integrating with chat on Facebook.

Heiliger, Facebook: Great. We’re open to supporting messaging protocols.

Robison, Slide: We’re moving to a deferred model, with processing in the background on a different thread, which helps us out a lot.
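
A minimal sketch of the kind of background processing Robison describes, assuming a Python service with a single worker thread. The function names and the thumbnail job are hypothetical illustrations, not Slide's actual code.

```python
import queue
import threading

def start_worker(jobs, results):
    """Start a daemon thread that runs queued jobs until it sees None."""
    def run():
        while True:
            job = jobs.get()
            if job is None:          # sentinel: stop the worker
                jobs.task_done()
                break
            results.append(job())    # slow work happens off the request path
            jobs.task_done()
    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread

def handle_request(jobs, photo_id):
    """Hypothetical request handler: enqueue the slow work, return at once."""
    jobs.put(lambda: f"thumbnail-for-{photo_id}")
    return "accepted"

jobs = queue.Queue()
results = []
worker = start_worker(jobs, results)
status = handle_request(jobs, 42)

jobs.join()      # wait for the background work (only needed in this demo)
jobs.put(None)   # ask the worker to stop
worker.join()
```

The request handler returns before the work is done, so the caller never waits on the slow step. A real system would also need to handle failed jobs and a full queue.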

Alistair: What about failures?

Barrese, eBay: We had a few catastrophic issues in the ’99 to 2000 time frame. We knew we were broken because the site was down. In those days people were very forgiving, but what we had was a major outage. Now that doesn’t happen. A lot of things are automated now, and people know how to do retries.
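
One common way to do the retries Barrese mentions is to repeat a failing call with an increasing delay between attempts. The sketch below assumes that approach; the `retry` helper, its parameters, and the `flaky` operation are hypothetical and do not describe eBay's systems.

```python
import time

def retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation until it succeeds or the attempts run out.

    The delay doubles after each failure (exponential backoff).
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the error
            sleep(base_delay * (2 ** attempt))

# Hypothetical operation that fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

delays = []                                # record delays instead of sleeping
result = retry(flaky, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the example fast to run; production code would use the default `time.sleep` and usually add some random jitter.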

Alistair: What techniques?

Jen, Meebo: Since Meebo is just one page view, the limitation of JavaScript is big for us. The amount of computation we put in a browser is going to limit a user’s experience.

Patel, Yahoo: If you look hard enough there are signs of where applications or infrastructure are going to break. In geography we lost a significant amount of capacity because of architectural inadequacy. In a lot of our applications, we see issues with deploying more users, and you run out of space; you should fix it before the users find out.

Alistair: When you’re running a platform, isn’t it bad to have people writing code on your stuff?

Patel, Yahoo: I think you compartmentalize it, break it up into clusters that are not location contiguous.

Barrese, eBay: You’ve also got to firewall your stuff off. The worst thing you can do is open the floodgates without having a plan.

Heiliger, Facebook: If you let people use their own code, you open yourself to risk. If you provide developers a set of tools, you have more control.

Alistair: When an error on a Facebook app comes up, Facebook cites both outside developer and Facebook having a problem — doesn’t that look bad for you?

Heiliger, Facebook: We’re very concerned, and we’ve been thinking about it a lot.

Alistair: What’s the chain of command for outages?

Robison, Slide: We comb through Facebook stats, we have their ops folks on instant messenger, and early on in the platform there was even more dialog. Through time the platform got better, the apps got better, but still the trust is there.

Alistair: What’s the worst example of an application abusing the Facebook platform?

Heiliger, Facebook: We’ve had to turn off applications because their response time has caused other applications to suffer. We proactively provide information for our partners.
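
Heiliger does not say how slow applications are detected. The sketch below assumes a rolling average over recent response times; `AppGate`, its thresholds, and the app names are hypothetical.

```python
from collections import deque

class AppGate:
    """Tracks recent response times per app and disables apps that are too slow."""

    def __init__(self, max_avg_seconds=2.0, window=3):
        self.max_avg_seconds = max_avg_seconds
        self.window = window
        self.samples = {}        # app_id -> deque of recent response times
        self.disabled = set()

    def record(self, app_id, seconds):
        samples = self.samples.setdefault(app_id, deque(maxlen=self.window))
        samples.append(seconds)
        # Only judge an app once a full window of samples is available.
        if len(samples) == self.window:
            if sum(samples) / len(samples) > self.max_avg_seconds:
                self.disabled.add(app_id)

    def is_enabled(self, app_id):
        return app_id not in self.disabled

gate = AppGate(max_avg_seconds=1.0, window=3)
for seconds in [0.2, 0.3, 0.1]:
    gate.record("fast-app", seconds)
for seconds in [3.0, 4.0, 5.0]:
    gate.record("slow-app", seconds)
```

Here `slow-app` is disabled after three slow responses while `fast-app` stays enabled, so one misbehaving application cannot keep dragging down the others.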

Barrese, eBay: We’ve built a logging system. We can flag and identify problems very quickly. It’s not very sexy, but if you don’t have it, you’re shooting in the dark.

Alistair: Yahoo has nearly as many different products as Facebook has applications. Which Yahoo application broke Yahoo?

Patel, Yahoo: Advertising and video, but we have a healthy dialog about best practices, like Facebook and Slide have, only internally.

Heiliger, Facebook: What Yahoo is saying is very politically correct, but the reality is that everything (everywhere) is broken. It falls into two buckets: stuff that’s already broken, and stuff we don’t know about yet.

Alistair: Off the shelf or build it yourself?

Jen, Meebo: For us a lot of the stuff that we launched with is open source. At the same time no one knows your system as well as you do, and no one can scale your system as well as you can. If you want something small, light-weight and fast, you’re going to have to strip away a lot.

Robison, Slide: We built our own object-aware caching system, because our core value was delivering photos faster than anyone else. Anything that’s not your core value you can take off the shelf.

Patel, Yahoo: Understand what you’re good at and what you’re not.

Alistair: What else needs to scale as you grow? What did you have to scale that wasn’t necessarily technology, but business?

Barrese, eBay: It’s like going from a sailboat to a battleship. When you’re experiencing hypergrowth, you have to scale out basically every dimension of your business.

Garg, hi5: Rise of spam and security issues on social networking sites. The black hat tax. It’s important for these social networking services that we bring these anti-spam features right into the core product.

Heiliger, Facebook: When expanding internationally we applied technology to a problem that otherwise would have been a very expensive proposition. Facebook is now translated into several languages by harnessing the power of our community.

Jen, Meebo: We leveraged our community as well. We asked for Spanish to start out with — in the next 48 hours we had 20 languages completely translated. For spam, it’s similar, you can hire thousands and thousands of people, or you can leverage the community. If they feel they’re part of the product, and they own Meebo as much as you do, you can have a really great experience for everyone.

Robison, Slide: When we introduced user moderation for photos, every photo gets marked. We’ve had a really hard time asking our community what is porn and not porn, so we’ve had to screen our user community.

Alistair: If you were in Twitter’s shoes, what would you have done in the last month?

Barrese, eBay: You’ve gotta be transparent, you’ve got to tell people what’s going on, and set people’s expectations as to when it will get fixed.

Jen, Meebo: The more honest you are, the more forgiving your user base is going to be.

Alistair: What have you built on clouds?

Jen, Meebo: We’ve chosen not to put anything critical on EC2, but we’ve put added features like file transfer on it. It’s scary to scale something you can’t completely control.

Robison, Slide: In the broad sense our CDNs are springing up new services because of cloud computing. Akamai just launched video transcoding in the cloud. Panther is doing edge caching for the bandwidth crisis.

Heiliger, Facebook: EC2 and Joyent are tremendous partners for us.

Alistair: Fix the protocol, or keep building?

Jen, Meebo: It doesn’t really matter if the user’s happy.

Robison, Slide: Sometimes you have to jump through some complex hoops to get what you want, so I think it’s changing. It’s the backend systems that are causing the scalability problems. When you move stuff to the edge, it’s to keep the architecture on the server side. The real problem is that a single user is something you can scale out, but it gets harder when you have users sharing with each other.

Patel, Yahoo: The time will come, it requires the alignment of an entire ecosystem. Doesn’t mean you don’t experiment, but adoption requires people to line up.

Jen, Meebo: Users are used to a real-time, dynamic standard, so when you don’t give them that, they notice.

Heiliger, Facebook: The servers we use are PCs, and they have lots of things we don’t use. The challenge is dealing with how to get those technologies to be more optimal, even intaking changes from vendors and building them into our architectures. The truth is that the number of servers going into data centers is growing at the fastest rate ever, whereas enterprise servers and desktop PC systems are pretty flat.

Garg, hi5: Vendors need to focus more on heat control and power efficiency.

Audience: Effect of Firefox 3?

Robison, Slide: It broke all our applications.

Alistair: Do you spend as much time on scaling as Vogels said (70 percent of engineering resources)?

Robison, Slide (I think): No.

Patel, Yahoo: I think that was right on.

Heiliger, Facebook: I wish I had 70 percent of our company’s resources to spend on those challenges.

Note from Liz: This might be a good panel to watch on video, because the panelists did a lot of talking back and forth to each other that I had trouble capturing. We’ll post the archive footage from Mogulus when we have it.