LinkedIn shed more light Tuesday on a big-data framework dubbed Gobblin that helps the social network take in tons of data from a variety of sources so that it can be analyzed in its Hadoop-based data warehouses.
Although the task of funneling data into Hadoop may seem relatively simple, since the basic idea is just moving data from many locations into one large, centralized system, the reality is a lot more complex.
[company]LinkedIn[/company] houses a variety of internal data (information pertaining to member profiles, user actions like comments and clicks, etc.) that sits in databases like Espresso and event-logging systems like Kafka. Additionally, the social network takes in data from outside sources like [company]Salesforce[/company] and [company]Twitter[/company], although at a smaller scale than its internal data.
All of these different types of data present a dilemma for LinkedIn considering there are so many variables the company has to take into account before that information can be gathered in Hadoop. These variables include things like the type of data source being sucked in (event streams, log files, etc.), data-flow types (batch or streaming) and even data-transport protocols (REST, Kafka, company-specific APIs, etc.).
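To make those combinatorics concrete, each pipeline can be thought of as one point in a three-dimensional space of source type, flow type and transport protocol. The sketch below is purely illustrative (the enum names and counts are assumptions for the example, not LinkedIn's actual taxonomy), but it shows how quickly the variants multiply:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the three dimensions an ingestion pipeline
// has to account for; names are hypothetical, not LinkedIn's code.
public class PipelineVariants {
    enum SourceType { EVENT_STREAM, LOG_FILE, DATABASE_SNAPSHOT }
    enum FlowType   { BATCH, STREAMING }
    enum Transport  { REST, KAFKA, CUSTOM_API }

    // Every supported combination is, in effect, its own pipeline to maintain.
    static List<String> allVariants() {
        List<String> variants = new ArrayList<>();
        for (SourceType s : SourceType.values())
            for (FlowType f : FlowType.values())
                for (Transport t : Transport.values())
                    variants.add(s + "/" + f + "/" + t);
        return variants;
    }

    public static void main(String[] args) {
        // 3 source types x 2 flow types x 3 transports = 18 variants
        System.out.println(allVariants().size());
    }
}
```

Even this toy model yields 18 combinations from just three small dimensions, which helps explain how LinkedIn ended up with more than 15 distinct ingestion pipelines.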
Simply put, you can’t just gobble up so many different types of data that each require their own specific protocols without a lot of heavy-duty tweaking. According to the LinkedIn blog post, the social network found itself “running more than 15 types of data ingestion pipelines” and was “struggling to keep them all functioning at the same level of data quality, features and operability.”
To simplify the data-ingestion process, LinkedIn developed Gobblin, which acts as a gateway to Hadoop that preps all the data flowing in and makes sure the data goes to the right file directory.
Gobblin contains “out-of-the-box adaptors for all our commonly accessed data sources such as Salesforce, MySQL, Google, Kafka and Databus, etc.,” LinkedIn explained in a presentation. It also plays nicely with the YARN resource manager, which allows for “scheduled batch ingest or continuous ingestion.”
Gobblin also allows for developers working on other data pipelines to hook up their own adaptors to the framework, so that the system can be used for all LinkedIn engineers.
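One way to picture that pluggable design: the framework owns a small extraction contract, and each team supplies an implementation for its own source while the ingestion loop is written only once. The interface and class names below are hypothetical sketches, not Gobblin's published API:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of a pluggable-adaptor design; the interface
// and class names are assumptions, not Gobblin's actual API.
public class AdaptorSketch {
    // The framework owns this contract...
    interface SourceAdaptor<T> {
        Iterator<T> extract();    // pull records from the source
        String targetDirectory(); // where prepped records land in Hadoop
    }

    // ...and any engineering team can plug in its own implementation.
    static class InMemoryAdaptor implements SourceAdaptor<String> {
        private final List<String> records;
        InMemoryAdaptor(List<String> records) { this.records = records; }
        public Iterator<String> extract() { return records.iterator(); }
        public String targetDirectory() { return "/data/ingest/in_memory"; }
    }

    // The ingestion loop is written once, against the interface,
    // so new sources require no changes to the framework itself.
    static <T> int ingest(SourceAdaptor<T> adaptor) {
        int count = 0;
        Iterator<T> it = adaptor.extract();
        while (it.hasNext()) { it.next(); count++; }
        return count;
    }

    public static void main(String[] args) {
        SourceAdaptor<String> adaptor =
            new InMemoryAdaptor(Arrays.asList("profile_view", "comment", "click"));
        System.out.println(ingest(adaptor) + " records -> " + adaptor.targetDirectory());
    }
}
```

The payoff of this shape is that the shared loop (scheduling, quality checks, writing to the right directory) is maintained in one place, which matches LinkedIn's stated goal of consolidating 15-plus bespoke pipelines.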
The following slides, taken from a LinkedIn presentation on Gobblin, describe in more detail how the framework functions:
LinkedIn said it plans to open-source the framework in the “coming weeks.” As of now, Gobblin is helping LinkedIn process tens of terabytes of data a day.