LinkedIn open sources Cubert, a big data computation engine that saves CPU resources

LinkedIn said on Tuesday that it has open sourced a framework called Cubert, which uses specialized algorithms to organize data so that queries are easier to run without overburdening the system and wasting CPU resources.

Cubert, whose name is derived from the Rubik’s Cube, is supposedly as easy for engineers to work with as a Java application. It provides a “script-like user interface” through which engineers can run algorithms like MeshJoin and Cube on top of the organized data to save system resources when running queries.

From the LinkedIn blog post:
“Extant engines such as Apache Pig, Hive and Shark (orange blocks) provide a logical declarative language that is translated into a physical plan. This plan is executed on the distributed engine (Map-Reduce, Tez or Spark), where the physical operators are executed against the data partitions. Finally, the data partitions are managed via the file system abstractions provided by HDFS.”

Cubert framework

With Cubert running on top of Hadoop, the new framework can abstract all of that storage into blocks of data, which makes it easier to run its resource-saving algorithms as well as operators that help engineers better manage that data. For example, the COMBINE operator can combine multiple blocks of data together and the PIVOT operator can create subsets of data blocks.
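
To make the idea of block-level operators a little more concrete, here is a toy Python sketch of what combining blocks and creating subsets of a block could look like. This is not Cubert code and does not reflect Cubert's actual API or semantics: the `combine` and `pivot` functions, the row layout and the `country` key are all hypothetical, chosen only to illustrate the description above.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Assumption: a "block" is modeled as a plain list of rows, and a row as a dict.
Row = Dict[str, Any]
Block = List[Row]


def combine(blocks: List[Block]) -> Block:
    """Hypothetical COMBINE: merge several blocks into one, preserving row order."""
    merged: Block = []
    for block in blocks:
        merged.extend(block)
    return merged


def pivot(block: Block, key: str) -> Dict[Any, Block]:
    """Hypothetical PIVOT: split one block into sub-blocks, one per value of `key`."""
    sub_blocks: Dict[Any, Block] = defaultdict(list)
    for row in block:
        sub_blocks[row[key]].append(row)
    return dict(sub_blocks)


# Made-up sample data, for illustration only.
block_a = [{"member": 1, "country": "US"}, {"member": 2, "country": "IN"}]
block_b = [{"member": 3, "country": "US"}]

merged = combine([block_a, block_b])
by_country = pivot(merged, "country")
```

In this sketch, `merged` holds all three rows in a single block, and `by_country` maps each country to the subset of rows that share it. The real operators work on blocks stored in Hadoop rather than on in-memory lists.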

LinkedIn also created a new language called Cubert Script, which is meant to make it easy for developers to play with Cubert without having to do any custom coding.

LinkedIn now uses Cubert as a key component of how it processes data. The Kafka real-time messaging system pulls in information from LinkedIn’s many applications and sends it over to Hadoop (the LinkedIn team that built Kafka just left to create a company based on the tech, by the way). Cubert then processes that data without draining system resources and helps engineers solve a “variety of statistical, analytical and graph computation problems.”

After being processed, the data then flows out to LinkedIn’s Pinot real-time data analytics system, which the company uses to power its many data-tracking features, like discovering who recently looked at a user’s profile.

LinkedIn Data Pipeline

Now that Cubert is hooked into LinkedIn’s infrastructure, the company no longer has to worry about Hadoop scripts that end up “hogging too many resources on the cluster” or take hours to do what they’re supposed to do.

Diagrams courtesy of LinkedIn