Why eHarmony is rebuilding itself atop Hadoop and (probably) OpenStack

Dating site eHarmony continues to grow, processing more matches more quickly for more users, and now the company’s technological foundation is finally growing along with it. OpenStack, Hadoop, Spark, Docker: eHarmony CTO Thod Nguyen says the company is looking at all of them as it tries to evolve into a company that’s able to innovate on the IT front as well as the dimensions-of-compatibility front.

The overhaul began in 2013 and should be complete by the end of 2015, Nguyen told me in a recent interview. A big part of it is turning eHarmony’s existing virtualization-centric data center into a private cloud environment, most likely running the open source OpenStack cloud software. That will give the company more flexibility in scaling and provisioning the infrastructure, including virtual servers and storage, that powers its website and mobile app.

With the private cloud installed on top of Cisco UCS blade servers (servers have quietly become a multi-billion-dollar business for Cisco), the company expects to cut its web server count by about half from its current 1,000 machines, he said. The company also manages about 2,000 other devices.

The Cisco family of blade servers. Source: Cisco

eHarmony is also investigating the open source CloudStack technology previously championed by Citrix Systems, but Nguyen said OpenStack seems to scale better. That OpenStack has the backing of so many large IT companies, as well as a growing number of large users, doesn’t hurt his assessment, either.

“It gives you a lot more flexibility in terms of sharing storage through the OpenStack Swift component, as part of the software-defined storage solution,” Nguyen said, adding, “The ultimate goal is to really be able to scale the storage exponentially with minimal operational cost, especially related to storage.”

eHarmony’s newfound focus on operational efficiency doesn’t start and stop with OpenStack, though. Nguyen said the company is also considering the popular Docker container technology in order to simplify the deployment and management of distributed applications, and that it “may be able to explore the public cloud solution” in certain situations. eHarmony already uses Amazon Web Services for proofs of concept and disaster recovery, he added.

“With the Docker concept, we can easily have a DR solution running on the on-demand public cloud without investing in a DR data center, which is very, very expensive to us,” Nguyen said.

Thod Nguyen, in a YouTube video by IBM (https://www.youtube.com/watch?v=1AePeB7iCpI)

But eHarmony also collects and analyzes a lot of data (Nguyen predicts it could reach petabyte scale in the coming years), and its previous Hadoop environment, which ran on 512-node SeaMicro appliances, was becoming a barrier to scale and to innovation. Each workload required its own cluster, Nguyen explained, which meant a whole other appliance and another round of replicating the same data.

Moving that environment to a single cluster running the YARN resource-management framework should free the company up to do a lot more things. For starters, it can host multiple workloads and processing frameworks on the same set of servers, all sharing the same file system. It can also scale horizontally by adding capacity as needed rather than 512 nodes at a time.
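To make the capacity argument concrete, here is a rough back-of-the-envelope sketch in Python. The 512-node appliance size comes from Nguyen's description; the per-workload node demands are hypothetical numbers chosen for illustration, not eHarmony figures, and the function names are invented for this example. It also ignores the extra storage consumed by replicating data across separate clusters, which would widen the gap further.

```python
import math

APPLIANCE_NODES = 512  # appliance size cited in the article


def nodes_with_dedicated_appliances(workload_demands):
    """Each workload gets its own cluster, bought in whole 512-node appliances."""
    return sum(
        math.ceil(demand / APPLIANCE_NODES) * APPLIANCE_NODES
        for demand in workload_demands
    )


def nodes_with_shared_cluster(workload_demands):
    """One shared cluster sized to total demand and grown one node at a time."""
    return sum(workload_demands)


# Hypothetical node demands for three workloads (assumed, not eHarmony data).
demands = [300, 120, 600]

dedicated = nodes_with_dedicated_appliances(demands)
shared = nodes_with_shared_cluster(demands)
print(f"dedicated appliances: {dedicated} nodes, shared cluster: {shared} nodes")
```

Under these assumed demands, the appliance-per-workload model provisions roughly twice as many nodes as a shared cluster sized to actual need, which is the kind of overhead a single YARN cluster is meant to avoid.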

A shared Hadoop cluster has business implications as well, Nguyen explained. eHarmony can spin up new big data applications more easily and for less money, and YARN means eHarmony can start looking at technologies such as Spark for speeding up machine learning workloads or Storm for stream processing jobs.

Although the company, like most dating sites, is best known for its matching algorithms, Nguyen said better data infrastructure will also lead to better models on the business side, including for things such as price optimization and user experience. Still, the matching business is a bear data-wise, as each customer’s profile includes 250 attributes as well as those famous 29 dimensions of compatibility.

The Hortonworks view of YARN on Hadoop.

“The goal is enabling us to create a data product that actually can deliver the right functionality, the right feature set that’s very engaging to our customers, before our customers even know what they want,” he said. “Instead of waiting for our customer to tell us what they want, we want to deliver [it] to them.”

The timing of eHarmony’s tech makeover, especially on the data side, isn’t coincidental. It’s only in the past year or two that technologies such as Spark, Storm and Kafka began to reach critical mass, making it more feasible to analyze data interactively or in real time and to regularly iterate on machine learning models.

“I think big data has been a moving target,” Nguyen said. “A lot of people think they’re doing big data, but they’re just storing the data. They don’t really do anything with the data.”

Feature image courtesy of Shutterstock user Kaveryn Kiryl. Other images courtesy of Cisco, Hortonworks and IBM.