Another big obstacle to exascale computing: resilience

Los Alamos National Laboratory is trying to build an exascale computer, which would be 1,000 times faster than Cray’s Jaguar supercomputer and could process one billion billion calculations per second. The man in charge of executing that vision, however, sees a big obstacle to building a computer with 1 million nodes running between 1 million and 1 billion cores. That problem is resilience.

Gary Grider of the HPC Division, Los Alamos National Laboratory; Garth Gibson of Panasas; and Rich Brueckner of inside-BigData at Structure:Data 2012

(c) 2012 Pinar Ozger

Speaking at GigaOM’s Structure:Data conference, Los Alamos HPC deputy division leader Gary Grider said that an exascale computer would have so many parts that some element would constantly be failing.
“It wouldn’t be worth building if it didn’t stay working for more than a minute,” Grider said. “Resilience is absolutely a must. The way you get answers to science is you run problems on these things for six months or more. If the machine is going to die every few minutes, that’s going to be tough sledding. We’ve got to figure out how to deal with resilience in a pretty fundamental way between now and then.”
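A back-of-the-envelope sketch shows why sheer part count makes failures so frequent. It assumes that nodes fail independently at a constant rate, so the whole machine's mean time between failures (MTBF) is the per-node MTBF divided by the node count. The 5-year per-node MTBF is a hypothetical figure chosen for illustration, not a number from Grider, and the function name is likewise illustrative; the 1 million node count comes from the article.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def system_mtbf_seconds(node_mtbf_years: float, node_count: int) -> float:
    """Estimate the MTBF of a whole machine, in seconds.

    Assumes independent nodes with constant failure rates, so failure
    rates add up and the system MTBF is node MTBF / node count.
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return node_mtbf_years * SECONDS_PER_YEAR / node_count


# Hypothetical input: each node fails once every 5 years on average.
mtbf = system_mtbf_seconds(node_mtbf_years=5.0, node_count=1_000_000)
print(f"Estimated time between failures: {mtbf / 60:.1f} minutes")
```

Under these assumptions, a machine with 1 million nodes would see some node fail roughly every 2.6 minutes, which is in line with Grider's worry about a machine that dies "every few minutes."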
Grider and Los Alamos’s technology partners have between 6 and 10 years to work on the problem, and the national lab won’t be alone. According to inside-BigData president Rich Brueckner, who moderated the “Faster Memory, Faster Compute” panel Grider spoke on, countries all over the world are in an exascale race. Brueckner said Russia, Japan, China, India or the European Union is just as likely as the U.S. to develop the exascale machine.