Hadoop jobs should soon be able to run easily, and securely, inside Docker containers

Although it first caught on among web developers, the Docker system for deploying Linux containers could also be a boon for big data applications. Members of Altiscale, the Hadoop-as-a-service startup founded by former Yahoo CTO Raymie Stata, are working closely with the Docker community to integrate the technology with YARN, the resource management framework for Hadoop.

Stata said the work is important for his company, as well as for anybody else dealing with a multi-tenant Hadoop environment. Not only does Docker provide a fast, standard way to deploy applications onto YARN, but it also provides isolation between them. Isolation is important in terms of security (a user might have permissions within a container that don’t extend to the host cluster) and also in terms of performance.

“The relationship between Docker and Linux containers is one of fantastical simplification,” Stata said.

For a company like Cloudera, he explained, Docker integration might be a “nice to have” because so many of its users are running on-premise, single-user Hadoop clusters. However, as more customers start wanting to run multiple types of jobs — maybe Spark here, or Matlab there — on top of YARN, or if Cloudera starts hosting more user jobs itself, being able to launch and manage isolated environments would become a bigger deal.

“It’s not an accident that Altiscale has picked up the engineering here, because we are a service,” he said. He added, “We’re driving it because it’s a must-have for us … but they’re all rooting us on.” Stata noted, though, that Hortonworks is also committing a lot of resources to the Docker integration.

Imagine Docker containers running these applications and all their components in isolation. Source: Hortonworks


However, there is one big improvement that has to happen before Docker becomes ready for most enterprise users to deploy it on YARN, explained Dinesh Subhraveti, an Altiscale engineer who has been working in the broader realm of Linux containers for years. That’s support for user ID namespaces within Docker, which will ensure that an application with root-level permissions inside a container can’t compromise the host, and therefore can’t make it unsafe or hamstring performance for other containers.

Once that’s done — probably near the end of this year — Hadoop users should be able to start launching Docker containers on YARN and be fairly confident there won’t be any inherent security risks hanging around.
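
To make the user ID namespace idea concrete, here is a minimal Python sketch of the ID translation such a namespace performs. It assumes the (inside start, outside start, length) triple format that the Linux kernel uses in /proc/<pid>/uid_map. The names IdMapping, to_host_uid and EXAMPLE_MAP, along with the specific ID ranges, are hypothetical and chosen only for illustration. This is not Docker's implementation.

```python
from typing import List, NamedTuple, Optional


class IdMapping(NamedTuple):
    """One uid_map-style entry (hypothetical helper type)."""
    inside_start: int   # first user ID as seen inside the container
    outside_start: int  # the host user ID that inside_start maps to
    length: int         # how many consecutive IDs this entry covers


def to_host_uid(container_uid: int, mappings: List[IdMapping]) -> Optional[int]:
    """Translate a container user ID to the host user ID it really runs as.

    Returns None when the ID falls outside every mapped range, meaning
    it has no identity on the host at all.
    """
    for m in mappings:
        if m.inside_start <= container_uid < m.inside_start + m.length:
            return m.outside_start + (container_uid - m.inside_start)
    return None


# Assumed example range: container IDs 0..65535 map to host IDs 100000..165535.
EXAMPLE_MAP = [IdMapping(inside_start=0, outside_start=100000, length=65536)]

if __name__ == "__main__":
    # Root inside the container (UID 0) is an ordinary, unprivileged
    # user from the host's point of view.
    print(to_host_uid(0, EXAMPLE_MAP))
    print(to_host_uid(1000, EXAMPLE_MAP))
```

Under a mapping like this, a process that believes it is root holds only the host-side privileges of an unprivileged user, which is the property that matters for a multi-tenant cluster.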

Whether or not Hadoop users will flock to Docker remains to be seen, but Stata seems to think the desire to maximize the utility of YARN will drive some in that direction. “[Docker] won’t replace anything, it will sit side-by-side with the old way of doing things,” he said.

However, he added, as people start trying to distribute non-Java software (Hadoop and all its ecosystem are built on Java) as part of YARN applications, they might be begging for a standard approach to doing it. “These ad hoc ways,” Stata said, “they’re painful.”