Hadoop’s new strategy: Pump data in to process, pull it out for privacy

Ari Zilka of Hortonworks, James Markarian of Informatica, Mark Cusack of RainStor, Justin Borgman of Hadapt, and Jo Maitland of GigaOM at Structure:Data 2012

(c) 2012 Pinar Ozger.

Companies that are concerned about privacy and bandwidth, but want to take advantage of Hadoop’s processing power, are actively pursuing a “pump to Hadoop and pull from Hadoop structure,” according to Hortonworks Chief Product Officer Ari Zilka, who spoke on a Future of Hadoop panel at Structure:Data on Thursday.
James Markarian of Informatica noted that most Hadoop applications tend to be data-intensive rather than resource-intensive. “The challenge is that the big elephant doesn’t move through the little pipes all that well,” he said. When the processing is colocated with the data in the cloud, that is no problem. But most companies store their data behind firewalls.
In describing the “pump in, pull out” approach, Zilka added that “Hadoop is forcing the unlocking of data.” Financial companies can’t put all their data in a public cloud, but they can strip out credit IDs, passwords and other identifiers, spill the remaining data out to the cloud for a massive processing job, then pull the results back in and remap them to the personal identifiers.
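To make the “pump in, pull out” idea concrete, here is a minimal Python sketch of that round trip. It is an illustration only, not anything the panelists showed: the field names (credit_id, password, amount), the function names and the per-customer total standing in for the “massive processing job” are all assumptions made for the example.

```python
import secrets

# Hypothetical field names; a real schema would differ.
SENSITIVE_FIELDS = ("credit_id", "password")


def pump_out(records):
    """Strip sensitive fields and tag each record with a random token.

    Returns the sanitized records (safe to send out) and a token map
    that stays behind the firewall.
    """
    token_for_id = {}  # credit_id -> token, so one customer keeps one token
    token_map = {}     # token -> credit_id, never leaves the premises
    sanitized = []
    for record in records:
        credit_id = record["credit_id"]
        if credit_id not in token_for_id:
            token = secrets.token_hex(8)
            token_for_id[credit_id] = token
            token_map[token] = credit_id
        public = {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}
        public["token"] = token_for_id[credit_id]
        sanitized.append(public)
    return sanitized, token_map


def cloud_job(sanitized):
    """Stand-in for the remote processing job: total amount per token."""
    totals = {}
    for row in sanitized:
        totals[row["token"]] = totals.get(row["token"], 0) + row["amount"]
    return totals


def pull_back(results, token_map):
    """Remap the tokenized results to the real identifiers, on premises."""
    return {token_map[token]: total for token, total in results.items()}


if __name__ == "__main__":
    records = [
        {"credit_id": "A1", "password": "x", "amount": 10},
        {"credit_id": "B2", "password": "y", "amount": 5},
        {"credit_id": "A1", "password": "x", "amount": 7},
    ]
    sanitized, token_map = pump_out(records)
    print(pull_back(cloud_job(sanitized), token_map))
```

Only the sanitized records cross the firewall in this sketch; the token map that links results back to real identifiers never leaves the company’s own systems.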
Hadoop is currently deployed across thousands of commodity boxes, but as the architecture evolves and the size of the data sets increases, the system will have to move toward a monolithic stack. Justin Borgman of Hadapt pointed out that this poses a challenge, because much of Hadoop’s appeal is that you can run it on commodity hardware.
But as Mark Cusack of RainStor said, “It doesn’t make environmental or economic sense to throw more boxes” at the problem. Markarian wondered aloud if there are really exabyte problems that will need to be solved, or if there’s a limit to the size of data that we’ll be working with. Several members of the panel discussed the need for better compression. Right now, compression is a one-size-fits-all solution, Cusack said, but there’s a need for a “much more targeted, tailored compression.” He added, “Compression is a key driver.”
Watch the livestream of Structure:Data here.
Update: This post has been updated to fix a typo.