LinkedIn upgrades its search engine and ditches an array of open source extensions

LinkedIn has overhauled its search engine infrastructure in favor of a new system dubbed Galene, a homegrown engine designed to improve search results and reduce maintenance problems, the company plans to announce Thursday.

With the new architecture, users get results that are heavily personalized: two users running the same query will see different results based on their own personal information. While this was possible to some extent in LinkedIn’s previous search engine, the new system is faster, explained LinkedIn principal staff engineer Sriram Sankar, who authored the blog post detailing Galene along with Asif Makhani, a LinkedIn director of engineering for search.

Search is the heart of LinkedIn, said Makhani. People use LinkedIn as a professional search engine that helps them find jobs and helps hiring managers scout people based on specialized skills.

The old system was hard to maintain, Sankar said, which made it difficult for the search engineering team to innovate and improve search quality.

LinkedIn’s prior search engine was built around the open source Lucene library and contained numerous plugins to tweak performance. The Lucene library provides basic search functions: storing information such as keywords in indexes, searching those indexes when a user looks for a certain word, and generating results based on relevance scores.
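To make the idea of a keyword index concrete, here is a minimal sketch in Python. It is not Lucene’s API or its scoring model; the function names are hypothetical, and the "relevance score" is assumed to be nothing more than how often a word appears in a document.

```python
from collections import defaultdict

def build_index(docs):
    """Map each keyword to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index

def search(index, word):
    """Return ids of documents containing the word, highest count first."""
    postings = index.get(word.lower(), {})
    # Toy relevance: more occurrences ranks higher; ties broken by doc id.
    return sorted(postings, key=lambda doc_id: (-postings[doc_id], doc_id))

# Illustrative documents, not real LinkedIn data.
docs = {
    1: "software engineer java search",
    2: "search engineer search relevance",
    3: "hiring manager",
}
index = build_index(docs)
print(search(index, "search"))
```

Document 2 mentions "search" twice and document 1 once, so document 2 ranks first. A production engine weighs many more signals than a raw count.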

As the company pushed to create what its CEO Jeff Weiner termed an economic graph (a map of the relationships between jobs, companies, talent and other professional descriptors), LinkedIn engineers added more plugins and extensions to the old search engine so it could handle more complex tasks, said Sankar.

LinkedIn engineers eventually decided that they could no longer keep the search engine up to their standards, as the multitude of extensions (including Bobo, Cleo and Norbert) bogged the team down with maintenance issues. On top of that, if a developer who set up one of the plugins were to leave, the knowledge of which plugin was responsible for which task would leave too.

“We had to go through unnatural steps to get the existing system to scale the extra mile,” said Sankar.

Figure: A diagram of how Galene is built

LinkedIn decided to scrap all of the extra extensions but continue using Lucene as the indexing layer that handles queries and retrieves results. Essentially, the Galene architecture does all the work of the previous plugins, and does it faster, without needing constant maintenance.

With the new system, a user can initiate a search query that gets passed from the web front-end interface to the back-end servers, where the Galene architecture does the heavy lifting and shoots the results back to the user.

According to the blog post, the search engine’s Federator and Broker services receive the user’s query and associated metadata and pass them to other services such as query rewriters, which generate more specific search queries that account for things the user may not have considered (plurals and spelling variations, for example). The Searcher then takes the rewritten query from the Federator and Broker and, as its name implies, retrieves the matching results from the index based on relevance score.
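The flow from query rewriting to retrieval can be sketched roughly as follows. This is an illustration only, not Galene’s code: the function names, the spelling table and the scoring rule (count of matched query terms) are all assumptions made for the example.

```python
# Hypothetical spelling variants; a real rewriter would draw on far richer data.
SPELLING_VARIANTS = {
    "programmer": ["programer"],
    "organization": ["organisation"],
}

def rewrite_query(query):
    """Expand each term with a naive plural/singular form and known variants."""
    expanded = []
    for term in query.lower().split():
        variants = [term]
        variants.append(term[:-1] if term.endswith("s") else term + "s")
        variants.extend(SPELLING_VARIANTS.get(term, []))
        expanded.append(variants)
    return expanded

def searcher(expanded, documents):
    """Score each document by how many query terms it matches in any variant."""
    scored = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        score = sum(1 for variants in expanded if words & set(variants))
        if score:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in ranked]

def broker(query, documents):
    """Stand-in for the Federator/Broker step: rewrite, then call the searcher."""
    return searcher(rewrite_query(query), documents)

# Illustrative documents, not real LinkedIn data.
documents = {
    "a": "java programmers wanted",
    "b": "senior java programer",
    "c": "sales manager",
}
print(broker("java programmer", documents))
```

Without the rewriting step, the query "java programmer" would match neither "programmers" nor the misspelled "programer"; with it, both documents are found.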

The index itself is built and updated on Hadoop, where the underlying data is refined in several stages.

From the blog post:

Indexing on Hadoop takes the form of multiple map-reduce operations that progressively refine the data into the data models and search index that ultimately serve live queries. HDFS contains raw data containing all the information we need to build the index. We first run map reduce jobs with relevance algorithms embedded that enrich the raw data – resulting in the derived data. Some examples of relevance algorithms that may be applied here are spell correction, standardization of concepts (for example, unifying “software engineer” and “computer programmer”), and graph analysis.
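The staged refinement described in the blog post (raw data, then derived data, then a search index) can be sketched in plain Python. This does not use Hadoop and is not LinkedIn’s pipeline; the record layout, the standardization table and the function names are assumptions chosen to mirror the "software engineer" example in the quote.

```python
from collections import defaultdict

# Hypothetical standardization table, mirroring the example in the quote.
CANONICAL_TITLES = {"computer programmer": "software engineer"}

def enrich(record):
    """First map step: raw record -> derived record with a standardized title."""
    title = record["title"].lower()
    return {"id": record["id"], "title": CANONICAL_TITLES.get(title, title)}

def emit_terms(derived):
    """Second map step: derived record -> (term, member id) pairs."""
    return [(term, derived["id"]) for term in derived["title"].split()]

def reduce_to_index(pairs):
    """Reduce step: group member ids under each term to form the index."""
    index = defaultdict(list)
    for term, member_id in pairs:
        index[term].append(member_id)
    return dict(index)

# Illustrative raw data, not real LinkedIn records.
raw = [
    {"id": 1, "title": "Computer Programmer"},
    {"id": 2, "title": "Software Engineer"},
]
derived = [enrich(record) for record in raw]
pairs = [pair for record in derived for pair in emit_terms(record)]
title_index = reduce_to_index(pairs)
print(title_index)
```

Because the titles are unified before indexing, a search for "software engineer" finds both members, including the one who wrote "Computer Programmer".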

Galene also allows developers in other LinkedIn groups, such as the advertising department, to create custom searches using APIs without having to consult the search engineering team, said Makhani.

Having a search engine that can map out relationships, as opposed to performing simpler searches, is important for LinkedIn, and the architecture needs to accommodate constant modification without causing bottlenecks. With the old system having reached the limits of its scalability, both Sankar and Makhani are confident that Galene can get the job done.

Post and thumbnail images courtesy of Shutterstock user Gil C.