Why webscale innovators should look beyond their bubble

Transcription details:
Date: 21-Jun-2013
Input sound file: 1004.Day 2 Batch 4

Transcription results:
Session Name: The Curse Of The Burst

Announcer
Derrick Harris
Gary Grider

Announcer 00:00
We talked about electricity as the example for the public cloud, and actually what we’re seeing is that the same way some public cloud is going private, the same thing is happening with electricity, right? We’ve got stuff moving out of the on-demand, pay-per-use utility, back into the data center. This next talk is really fascinating. It’s not about the cloud per se, but it’s a really interesting problem at the extremes of computing. I don’t know about you, but I get probably a few megabits per second download and/or upload at home. Google Fiber is bringing a gigabit per second upstream and downstream to Kansas City, so picture that one gigabit, assuming it serves 100,000 customers across Kansas City, adding up to 100 terabits per second. Picture all of Kansas City in a single trough, and that’s the 100 terabits per second that Gary Grider at Los Alamos has to deal with. He’s going to come out and talk about the challenges of dealing with that immense bandwidth of data in a single burst, moderated by the amazing Derrick Harris, as we start coming to a close. Welcome.
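For anyone who wants to check the announcer’s back-of-the-envelope comparison, here is a minimal sketch of the arithmetic, using his stated assumptions (one gigabit per customer, 100,000 customers, decimal units throughout):

```python
# Rough check of the comparison above, using the figures as stated on stage.
gbps_per_customer = 1              # Google Fiber: one gigabit per second per home
customers = 100_000                # the announcer's assumption for Kansas City

kansas_city_tbps = gbps_per_customer * customers / 1_000   # gigabits -> terabits
print(f"All of Kansas City at once:  {kansas_city_tbps:.0f} Tbit/s")   # ~100 Tbit/s

# Los Alamos burst: four petabytes dumped in five minutes (1 PB = 8,000 terabits)
burst_tbps = 4 * 8_000 / (5 * 60)
print(f"Los Alamos checkpoint burst: {burst_tbps:.0f} Tbit/s")         # ~107 Tbit/s
```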
Derrick Harris 01:15
Like Joe said, Gary and I are talking about dealing with 100 terabits, or I think we said four petabytes in five minutes. One of the things we were talking about before we came out, and that maybe really brings it back home, is that some folks from Facebook were on stage yesterday talking about their bandwidth needs. How do you compare with Facebook – what was it? Two billion IOPS, right?
Gary Grider 01:48
I think I remember them talking about needing two billion IOPS out of their network.
Derrick Harris 01:52
How does that compare with what you guys are dealing with?
Gary Grider 01:55
Our machines do that pretty regularly. Each node on our machines might do 500,000 IOPS in the network today and a million a couple of years from now, so it doesn’t take a very large machine in our world to generate a billion IOPS, that’s pretty common.
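To give a sense of scale for that claim, here is a minimal sketch of the multiplication; the node count is purely illustrative, not a figure from the talk:

```python
# Per-node message rates add up fast; the node count below is illustrative only.
iops_per_node_today = 500_000
iops_per_node_in_a_couple_of_years = 1_000_000
nodes = 2_000   # hypothetical, modest machine size for the sake of the arithmetic

print(f"Today: {nodes * iops_per_node_today / 1e9:.1f} billion IOPS")
print(f"Soon:  {nodes * iops_per_node_in_a_couple_of_years / 1e9:.1f} billion IOPS")
```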
Derrick Harris 02:16
The idea I like is that there might be some synergy, although it doesn’t seem like it, between what a company like Facebook is doing and what an organization like Los Alamos is doing, where maybe you don’t need to reinvent the wheel because there’s actually some areas we could work together in and do similar stuff.
Gary Grider 02:35
That’s true, and it’s kind of too bad we’re not working together more; I think there are many things that they’re probably doing that we could learn from. Our applications are tightly coupled, and all the processes rendezvous with all the other processes several times a second, so you have to have a network that can do essentially billions and billions – tens of billions, even hundreds of billions of IOPS – in very, very tight bursts, and that’s something that it sounds like they might need, so we’ve been doing that for a while. Our applications have a hard time with resilience because they’re so tightly coupled. If you’re running an application on two million cores or something, and you want to run it for six months, you’ve got to figure out how to keep that application running for longer than an hour before something fails. Our resilience techniques are very gross compared to industry, so we could learn a lot from industry.
Derrick Harris 03:28
You don’t just send a guy in there every week to clean out the dead servers?
Gary Grider 03:33
There’s a guy in there all the time cleaning out the dead servers, there’s a lot of dead servers.
Derrick Harris 03:41
It seems like those are very different approaches – you’re talking about how they’re scaling out, running this built-to-fail ecosystem, and you’re very much a ‘we’re going to scale up, but build it to be reliable.’ What happens in your world if something fails?
Gary Grider 04:00
Our approach to resilience is the diametric opposite of what the rest of the world has taken, because the applications communicate with one another so frequently. Essentially, an application pushes the stop button on all two million cores and they dump memory out to flash at these incredible rates of four petabytes in five minutes, and then we write it off to disk for an hour or something, which is still a terabyte a second. That’s a very, very crude way to do it, but that’s about the only way we know how to do it when the applications are so tightly coupled together. We’re coming from one end of the spectrum – these incredibly tightly coupled applications where the communication is frequent and amazingly fast, with very, very small messages and billions of them at a time. And industry’s come from the other side, where they are working with far more embarrassingly parallel problems that enabled them to do different kinds of resilience techniques, like doing three times the number of computations or making copies and things like that. We’re sort of coming at resilience from two different angles.
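The rates quoted here are easy to sanity-check; a minimal sketch, using decimal petabytes:

```python
# Sanity check of the checkpoint ("defensive dump") rates quoted above.
PB = 1e15                            # decimal petabyte, in bytes
dump_bytes = 4 * PB

flash_rate = dump_bytes / (5 * 60)   # four petabytes to flash in five minutes
disk_rate = dump_bytes / 3600        # drained to disk over roughly an hour

print(f"Burst to flash: {flash_rate / 1e12:.1f} TB/s")   # ~13 TB/s
print(f"Drain to disk:  {disk_rate / 1e12:.1f} TB/s")    # ~1.1 TB/s, 'still a terabyte a second'
```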
Derrick Harris 05:08
You mentioned some other areas I thought were interesting, where there are these similarities with, let’s say, the Google file system or the Hadoop Distributed File System compared with Lustre and other file systems. Can you talk about, bandwidth aside, these areas where these are things that have been going on in supercomputing for a long time, and all of a sudden web companies come along and say, ‘We need to do this,’ but it’s maybe a step down.
Gary Grider 05:35
It is, actually. It’s kind of interesting: if you take file systems like the ones that we funded – Lustre and Panasas and GPFS and others in the HPC community – and you look at the difference between those and the Google file system and HDFS and others, really if you just reduced the semantics of Lustre, or any of the POSIX file systems, and took out the POSIXisms, they would do roughly the same thing. So it’s interesting how they chose to rewrite everything versus just take something like Lustre or PVFS and dumb it down. It’s fascinating how we don’t tend to work together as much as we should, because if we worked together more, I think we’d probably both be further down the road than we are.
Derrick Harris 06:18
What types of constraints are in place that actually prevent government agencies or research organizations from working with Google or Facebook? Is it a cultural thing, is it a regulatory thing?
Gary Grider 06:32
A little of both. On my side, of course, it’s guns and guards and gates and classified computing, because we’re simulating nuclear weapons, but there are open-science HPC sites, like the DOE Office of Science sites and NSF sites, that are far more open than we are, and they’re able to do more things. I think there are ways to do it. Industry, of course, has its own way of doing secrets, and I don’t understand any of that, but I’m sure that they have their reasons for not wanting to share with us as well. It does seem, though, like there would be value if such a venue could be produced.
Derrick Harris 07:08
It seems like you’re both kind of working in parallel on certain things. I remember a few years ago we had Jonathan Heiliger from Facebook up here, and he was saying Intel and AMD just need to make the chips better – this is what we need, like a call to action. Whereas you guys do the same thing: the national labs are pumping money into Intel and companies all the time, right?
Gary Grider 07:29
Absolutely, we have our own version of venture capital inside of DOE. We’re funding them to add the instruction sets that we need and build the things into the chips that we need. And I suspect a lot of it’s similar: it’s on-chip photonics, and it’s stacked DRAM, and it’s microchannel cooling, and all those things I suspect industry would want eventually, if they got as dense as we are. Our computing is a bit denser per rack than your typical web-type company – our densities are approaching 200 kilowatts a rack, where typically I think web companies are more like 20 or 30 or something like that. Again, there are ways where I think those industries could leverage some of the things we’re doing, and vice versa.
Derrick Harris 08:17
I’m just curious, what does that do? 200 kilowatts a rack is insane, what does that do to the power consumption when you’re running at that type of–?
Gary Grider 08:29
We have similar power consumption to any huge site. But there’s a little bit of difference in what we do to the power company. The last speakers were out here talking about using natural gas, which I think is a cool idea, but they said that the power consumption is very steady. It’s on all the time, and in fact that’s good. The more watts you burn, the more ads you sell, or the more money you make, and the same thing is true for us: the more watts we use, the more weapons science we get. So really for us it’s weapons science per dollar. But we do have this one main difference, and that is one user might be using two million cores, where in their case it’s two million users might be using one core. That’s not exactly right, but that would be great for them if they could do that. So we have this problem that if an application running along on two million cores all of a sudden aborts, we shed like 10 megawatts back onto the power grid in less than one AC cycle, so the power company gives us a call and says, “What are you guys doing? Stop that,” and the machines are just going to get bigger. Our machines today are 10 megawatts and they’re going to 30 megawatts some day, and when we shed a load it’s going to be 25 megawatts or something, so we’re going to have to come up with solutions for figuring out how to deal with these huge power transients, because, A, it’s expensive, and B, it hurts the power companies.
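To see why the utility objects, here is a minimal sketch of the ramp rate implied by shedding 10 megawatts inside one AC cycle, assuming a 60 Hz grid as in the US:

```python
# Ramp rate implied by dropping 10 MW within a single AC cycle (60 Hz grid assumed).
shed_megawatts = 10
ac_cycle_seconds = 1 / 60

ramp_mw_per_second = shed_megawatts / ac_cycle_seconds
print(f"Equivalent ramp rate: {ramp_mw_per_second:.0f} MW/s")   # ~600 MW/s from one aborted job
```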
Derrick Harris 09:51
That’s just incredible to think of, a 30 megawatt machine running in parallel.
Gary Grider 10:01
For one user.
Derrick Harris 10:02
Yeah, for one user. When you talk about this lost opportunity for industry and research to work together, you made me think about – maybe it’s computer science education or something, or maybe it’s a case of we just need to teach – I don’t know if you teach these skills lower down, if kids come into college with these skills.
Gary Grider 10:39
It certainly makes sense for the national university system to begin to teach at-scale computing. I think that’s something that actually was occurring back in the ’90s, when clusters first came out and it was a big deal, and then it kind of went away, and everybody started figuring out how to program in Java and all these other useless languages, from my point of view.
Derrick Harris 10:59
You guys can tweet that: Java, useless.
Gary Grider 11:03
On a supercomputer it’s a pretty useless language, and now you’re starting to see stuff come back into the curricula at the universities. In fact, IEEE and NSF and others are working on trying to get parallel and distributed computing back as a regular part of the undergraduate curriculum, because it had disappeared. It went away for almost a decade, and that’s really a shame, so I think that’s something that both industry and government should promote.
Derrick Harris 11:29
Is there a way to abstract that knowledge? Because I’ve seen some startups come round that are working on parallelizing your Ruby code, for example – and it seems like abstraction is the name of the game today. You’re saying you really have to get the basics, you can’t rely on–
Gary Grider 11:54
If it’s embarrassingly parallel, you don’t have to know much, but if it actually has race conditions and things like that, then you really do need to understand something. If you can’t at least program in threads or something like that, you’re going to be pretty useless on one of these machines that has tens of thousands of processes. I think you do need to learn some rudimentary things in an undergraduate program to come out as a useful parallel programmer. Not a lot, but some.
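As an illustration of the kind of bug he means – not code from the talk – here is a minimal Python sketch of a race condition: two threads do an unlocked read-modify-write on a shared counter, so increments get lost.

```python
# Minimal race-condition illustration (not code from the talk): two threads each
# read, then write, a shared counter without a lock, so some updates are lost.
import threading
import time

counter = 0

def worker(increments: int) -> None:
    global counter
    for _ in range(increments):
        current = counter       # read the shared value
        time.sleep(0)           # yield, inviting the other thread to interleave
        counter = current + 1   # write back a possibly stale value

threads = [threading.Thread(target=worker, args=(1_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"Expected 2000, got {counter}")   # typically far fewer than 2000
```

Guarding the update with a threading.Lock removes the race; the point is simply that once a workload stops being embarrassingly parallel, correctness depends on reasoning about interleavings.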
Derrick Harris 12:22
Just going back to this idea of these opportunities for industry and research to work together. Maybe Google or Facebook doesn’t have the burst capacity that something like Los Alamos has right now, but in terms of sheer data volumes, these are things that are similar, correct?
Gary Grider 12:47
Absolutely. In fact, our archive is a good example of that. We archive a large number of these large dumps, a few a day, off to tape, so we have to spin like 200 tape drives in parallel to get the bandwidth to get the stuff written out, and that’s quite an expensive proposition, so we’re thinking about leveraging cloud technology to do that, and use disk with a similar mean time to data loss. I think there are even ways for us to learn in the archive business from these companies that are doing it.
Derrick Harris 13:22
Tape?
Gary Grider 13:24
Tape, that’s right.
Derrick Harris 13:25
There was a session a couple of hours ago about–
Gary Grider 13:30
I heard it, yeah.
Derrick Harris 13:31
What’s your take, is tape–?
Gary Grider 13:33
Tape has always got the surface area advantage. The way they price tape, of course, is they price it just slightly less than disk for capacity cost, because if they priced it any lower than that they’d be losing money. That’s how it’s priced, and so tape is always going to have the surface area advantage, so if you’ve got a pure capacity play and no other driver, tape is a good solution. If you’ve got other drivers, like I do – bandwidth, where I’m trying to move these huge multi-petabyte-sized images around – the tape drives themselves are pretty expensive. If you’ve got a pure capacity play it’s going to be hard to compete with tape. It’s just that very few problems today are served by a pure capacity play.
Derrick Harris 14:15
Would you ever consider moving– did you say you were considering moving stuff to the cloud?
Gary Grider 14:21
We’re considering utilizing cloud technology.
Derrick Harris 14:24
Cloud technologies, okay. I’m guessing that still means on-premises, but definitely like a scale out.
Gary Grider 14:30
Erasure coding and object storage and stuff like that.
Derrick Harris 14:35
Just curious, when you look at what’s available – I don’t know if you’ve been looking at what’s available in the public cloud today – could anything even come close to handling your performance needs?
Gary Grider 14:47
For archive, yeah, I think so. I’m not terribly different than a big cold-data site, like any of the ones that store your pictures – Flickr, those kinds of things. The total amount of bytes that they have may not be terribly different than mine – hundreds of petabytes to an exabyte, or something – it’s not an order of magnitude off. The total amount of parallelism they have probably isn’t terribly different than mine; it’s just that I have these one-petabyte-sized objects, and they have thousands of one-gigabyte-sized movies or 50-megabyte-sized pictures. The overall sizes and shapes – the sizes aren’t different, the shape is a bit different. I’m hoping to be able to leverage stuff like that in the future.
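Taking those figures loosely – a nominal exabyte of cold data, petabyte-scale checkpoint images on one side and gigabyte-scale movies on the other – here is a minimal sketch of the ‘same size, different shape’ point:

```python
# 'The sizes aren't different, the shape is a bit different': a nominal exabyte
# sliced into the object sizes mentioned in the conversation.
EXABYTE = 1e18      # bytes
checkpoint = 1e15   # a one-petabyte dump image
movie = 1e9         # a one-gigabyte movie (photos are smaller still)

print(f"Checkpoints per exabyte: {EXABYTE / checkpoint:,.0f}")   # ~1,000 enormous objects
print(f"Movies per exabyte:      {EXABYTE / movie:,.0f}")        # ~1,000,000,000 small ones
```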
Derrick Harris 15:32
You said maybe there are opportunities to work together with industry, and maybe there are some things where everybody is kind of reinventing the same wheel. You see that even among web companies sometimes – everyone has their message bus or their own database. Maybe it’s in your industry too, but do you see that as just being an ego thing, like, “We built it, so it’s better.”
Gary Grider 16:06
I suppose. We even have our problems within the weapons complex. There are three national labs that do weapons work, and there’s some competition between them in various things. Competition is good and bad. It’s good that we have it and it works in some ways, and it’s bad in others, in that we don’t see the places where competition is not necessary. In the hardware industry there are a lot of these pre-competitive consortiums – like the magnetic head consortium – that work through pre-competitive science and technology issues before people go off and make products out of them. I can’t help but wonder if there shouldn’t be something in the software realm like that, where people talk about stuff in a pre-competitive way, so that if there are common pieces of infrastructure they can be produced in a common way. But I don’t know, that’s probably not the American way.
Derrick Harris 17:08
I always assumed the national labs worked closely with one another, is that not–?
Gary Grider 17:10
They do and they don’t. Livermore’s in business to compete against Los Alamos. From a weapons design point of view, that’s why it was created – because there was no peer review, no competition – so Lawrence Livermore and Los Alamos did weapons competitions for one another, and the nuclear weapons in the complex were developed by those two labs through competitive processes. By definition we compete with one another at one level, and that’s the weapons simulation, weapons design, weapons test kind of thing, but then we try to share with one another on infrastructure and the like.
Derrick Harris 17:48
You mentioned one area where research could definitely learn from the web world, and that’s resiliency and building applications that can fail. What about in terms of – one of the things we’re seeing now is building your own hardware and customizing stuff, as opposed to spending – God, the numbers spent on a new supercomputing system are astronomical. Is there any desire within your world to say, how could we build something on our own, could we do that for less than it costs?
Gary Grider 18:23
We used to do that, a long, long, long time ago. Cray came out of our funding back in the ’70s, so it sort of was true then. We were sort of guided in the ’90s to head towards this path of America Competes and making American industry more competitive, so there’s more to it than just us building a supercomputer to do weapons work on. We’re also trying to make investments that make American industry healthier. In that instance it’s the government’s money and they can tell us to do that, and we do. What’s a bigger problem is trying to form an industry around next-generation technology, and not just go off and do a custom ASIC to do our job, because that helps the industry less and helps the US less, I think. We have this dual role of making industry better and getting our job done.
Derrick Harris 19:14
All right, we are out of time, just on that note. Thanks a lot.