Most LinkedIn (s lnkd) users know “People You May Know” as one of that site’s flagship features — an onmipresent reminder of other LinkedIn users with whom you probably want to connect. Keeping it up to date and accurate requires some heady data science and impressive engineering to keep data constantly flowing between the various LinkedIn applications. When Jay Kreps started there five years ago, this wasn’t exactly the case.
“I was here essentially before we had any infrastructure,” Kreps, now principal staff engineer, told me during a recent visit to LinkedIn’s Mountain View, Calif., campus. He actually came LinkedIn to do data science, thinking the company would have some of the best data around, but it turned out the company had an infrastructure problem that needed his attention instead.
How big? The version of People You May Know in place then was running on a single Oracle (s orcl) database instance — a few scripts and heuristics provided intelligence — and it took six weeks to update (longer if the update job crashed and had to restart). And that’s only if it worked. At one point, Kreps said, the system wasn’t working for six months.
When the scale of data began to overload the server, the answer wasn’t to add more nodes but to cut out some of the matching heuristics that required too much compute power.
So, instead of writing algorithms to make People You Know Know more accurate, he worked on getting LinkedIn’s Hadoop infrastructure in place and built a distributed database called Voldemort.
Since then, he’s built Azkaban, an open source scheduler for batch processes such as Hadoop jobs, and Kafka, another open source tool that Kreps called “the big data equivalent of a message broker.” At a high level, Kafka is responsible for managing the company’s real-time data and getting those hundreds of feeds to the apps that subscribe to them with minimal latency.
But Kreps’s work is just a fraction of the new data infrastructure that LinkedIn has built since he came on board. It’s all part of a mission to create a data environment at LinkedIn that’s as innovative as that of any other web company around, and that means the company’s applications developers and data scientists can keep building whatever products they dream up.
Bhaskar Ghosh, LinkedIn’s senior director of data infrastructure engineering — who’ll be part of our guru panel at Structure: Data on March 20-21 — can’t help but find his way to the whiteboard when he gets to discussing what his team has built. It’s a three-phase data architecture comprised of online, offline and nearline systems, each designed for specific workloads. The online systems handle users’ real-time interactions; offline systems, primarily Hadoop and a Teradata warehouse, handle batch processing and analytic workloads; and nearline systems handle features such as People You May Know, search and the LinkedIn social graph, which update constantly but require slightly less than online latency.
One of the most-important things the company has built is a new database system called Espresso. Unlike Voldemort, which is an eventually consistent key-value store modeled after Amazon’s Dynamo database and used to serve certain data at high speeds, Espresso is a transactionally consistent document store that’s going to replace legacy Oracle databases across the company’s web operations. It was originally designed to provide a usability boost for LinkedIn’s InMail messaging service, and the company plans to open source Espresso later this year.
According to Director of Engineering Bob Schulman, Espresso came to be “because we had a problem that had to do with scaling and agility” in the mailbox feature. It needs to store lots of data and keep consistent with users’ activity. It also needs a functional search engine so users — even those with lots of messages — can find what they need in a hurry.
With the previous data layer in tact, he explained, the solution for developers to solve scalability and reliability issues was doing so in the application.
However, Principal Software Architect Shirshanka Das noted, “trying to scale [your] way out of a problem” with code isn’t necessarily a long-term strategy. “Those things tend to burn out teams and people very quickly,” he said, “and you’re never sure when you’re going to meet your next cliff.”
Schulman and Das have also worked together on technologies such as Helix — an open-source cluster management framework for distributed systems — and Databus. The latter, which has been around since 2007 and the company just open sourced, is a tool that pushes changes in what Das calls “source of truth” data environments like Espresso to downstream environments such as Hadoop so that everyone can ensure they’re working with the freshest data.
In an agile environment, Schulman said, it’s important to be able to change something without breaking something else. The alternative is to bring stuff down to make changes, he added, and “it’s never a good time to stop the world.”
Next up, Hadoop
Thus far, LinkedIn’s biggest push has been in improving its nearline and online systems (“Basically, we’ve hit the ball out of the park here,” Ghosh said), so its next big push is offline — Hadoop, in particular. The company already uses Hadoop for the usual gamut of workloads — ETL, model-building, exploratory analytics and pre-computing data for nearline applications — and Ghosh wants to take it even further.
He laid out a multipart vision, most of which centers around tight integration between the company’s Hadoop clusters and relational database systems. Among the goals: better ETL frameworks, ad-hoc queries, alternative storage formats and an integrated metadata framework — which Ghosh calls the holy grail — that will make it easier for various analytic systems to use each other’s data. He said LinkedIn has something half-built that should be finished this year.
“[SQL on Hadoop] is going to take two years to work,” he explained. “What do we do in the meanwhile? We cannot throw this out.”
Actually, the whole of LinkedIn’s data engineering efforts right now put a focus on building services that can work together easily, Das said. The Espresso API, for example, allows developers to connect a columnar storage engine and do some limited online analytics right from within the transactional database.
Good infrastructure makes for happy data scientists
Yael Garten, a senior data scientist at LinkedIn, said better infrastructure makes her job a lot easier. Like Kreps, she was drawn to LinkedIn (from her previous career doing bioinformatics research at Stanford) because the company has so much interesting data to work with, only she was fortunate enough to miss the early days of spotty infrastructure that couldn’t handle 10 million users, much less today’s more than 200 million users. To date, she said, she hasn’t come across a problem she couldn’t solve because the infrastructure couldn’t handle the scale.
The data science team embeds itself with the product team and they work together to either prove out product managers’ hunches or build products around data scientists’ findings. In 2013, Garten said, developers should expect infrastructure that lets them prototype applications and test ideas in near real time. And even business managers need to see analytics as close to real time as possible so they can monitor how new applications are performing.
And infrastructure isn’t just about making things faster, she noted: “Something things wouldn’t be possible.” She wouldn’t go into detail about what this magic piece of infrastructure is, but I’ll assume it’s the company’s top-secret distributed graph system. Ghosh was happy to go into detail about a lot things, but not that one.
A virtuous hamster wheel
Neither Ghosh nor Kreps sees LinkedIn — or any leading web company, for that matter — quitting the innovation game any time soon. Partially, this is a business decision. Ghosh, for example, cites the positive impact on company culture and talent recruitment, while Kreps points out the difficult total-cost-of-ownership math when comparing paying for software licenses or hiring open source committers versus just building something internally.
Kreps acknowledged that the constant cycle of building new systems is “kind of a hamster wheel,” but there’s always an opportunity to do new stuff and build products with their own unique needs. Initially, for example, he envisioned two targets use cases for Hadoop but now the company has about 300 individual workloads; it went from two real-time data feeds to 650.
“But companies are doing this for a reason,” he said. “There is some problem this solves.”
Ghosh, well, he shot down the idea of relying too heavily on commercial technologies or existing open source projects almost as soon as he suggests it’s a possibility. “We think very carefully about where we should do rocket science,” he told me, before quickly adding, “[but] you don’t want to become a systems integration shop.”
In fact, he said, there will be a lot more development and a lot more open source activity from LinkedIn this year: “[I’m already] thinking about the next two or three big hammers.”