eBay’s new Pulsar framework will analyze your data in real time

eBay has released a new open-source, real-time analytics and stream-processing framework called Pulsar, which the company says is already running in production and is available for others to download, according to an eBay blog post published Monday. The online auction site is now using Pulsar to gather and process data about user interactions and behaviors, and said the framework “scales to a million events per second with high availability.”

While eBay uses Hadoop for its batch processing and analytics needs, the company said it now needs a way to process and analyze data in real time for uses such as better personalization, fraud and bot detection, and dashboard creation.

To achieve what eBay is calling for, a system needs to process millions of events per second, deliver low latency with “sub-second event processing and delivery,” and span multiple data centers with “no cluster downtime during software upgrade,” according to the blog post.

eBay decided the best way to go about this was to build its own complex event processing (CEP) framework, which also includes a Java-based framework on top of which developers can build other applications.

eBay Pulsar pipeline

Developers skilled with SQL should feel at home with Pulsar because the framework can be operated with a “SQL-like event processing language.”

The real-time analytics data pipeline built into Pulsar is essentially a set of components that are linked together (but can also function independently) to form the data-processing conveyor belt through which all that user data flows. These components include a data collector, an event distributor and a metrics calculator.

It’s within Pulsar that eBay can enrich the data with additional information — like geo-location — remove unnecessary data attributes, and combine batches of events to “add up metrics along a set of dimensions over a time window.”
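That enrich-filter-aggregate step is concrete enough to sketch. Below is a minimal, self-contained Java illustration — not eBay’s code — of adding up a metric along a set of dimensions over a tumbling time window; the window length and the dimension fields (country, device type) are assumptions for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of "adding up metrics along a set of dimensions over a time window".
public class WindowedAggregator {
    private static final long WINDOW_MS = 60_000;  // assumed one-minute tumbling window
    private final Map<String, Long> counts = new HashMap<>();
    private long windowStart = 0;

    // The dimensions (country, deviceType) are invented fields for illustration.
    public void onEvent(String country, String deviceType, long eventTimeMs) {
        long window = (eventTimeMs / WINDOW_MS) * WINDOW_MS;  // align to window boundary
        if (window != windowStart) {
            flush();               // emit the finished window's metrics
            windowStart = window;
        }
        counts.merge(country + "|" + deviceType, 1L, Long::sum);  // add up this dimension combo
    }

    private void flush() {
        counts.forEach((dims, n) -> System.out.println(windowStart + " " + dims + " = " + n));
        counts.clear();
    }
}
```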

The whole idea is to have all that real-time data available in Pulsar to be treated “like a database table” in which developers can run the necessary SQL queries for analytic purposes, the post stated.

From the eBay blog:
[blockquote person="eBay" attribution="eBay"]Pulsar CEP processing logic is deployed on many nodes (CEP cells) across data centers. Each CEP cell is configured with an inbound channel, outbound channel, and processing logic. Events are typically partitioned based on a key such as user id. All events with the same partitioned key are routed to the same CEP cell. In each stage, events can be partitioned based on a different key, enabling aggregation across multiple dimensions. To scale to more events, we just need to add more CEP cells into the pipeline.[/blockquote]
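The routing rule in that quote — same partition key, same CEP cell — reduces to a deterministic hash. Here’s a bare-bones Java sketch of the idea (the cell list is hypothetical, and a real pipeline would likely use consistent hashing so that adding cells doesn’t reshuffle every key):

```java
import java.util.List;

// Sketch: route every event with the same key (e.g. user id) to the same CEP cell.
public class CepRouter {
    private final List<String> cells; // addresses of the CEP cells in this pipeline stage

    public CepRouter(List<String> cells) {
        this.cells = cells;
    }

    public String cellFor(String partitionKey) {
        // The same key always hashes to the same cell; floorMod avoids negative indexes.
        return cells.get(Math.floorMod(partitionKey.hashCode(), cells.size()));
    }
}
```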

Here’s what the Pulsar deployment architecture looks like:

Pulsar deployment

Plans call for Pulsar to include its own dashboard and a real-time reporting API, and to integrate with similar services, like the Druid open-source database for real-time analysis. Druid, created by the analytics startup Metamarkets (see disclosure), just moved over to the Apache 2 software license to attract more users.

Pulsar is open sourced under the Apache 2.0 License and the GNU General Public License version 2.0.

Disclosure: Metamarkets is a portfolio company of True Ventures, which is also an investor in Gigaom.

Pinterest is experimenting with MemSQL for real-time data analytics

Pinterest shed more light Wednesday on how the social scrapbook and visual-discovery service analyzes data in real time, revealing in a blog post that it’s exploring a combination of MemSQL and Spark Streaming to improve the process.

Currently, Pinterest uses a custom-built log-collecting agent dubbed Singer that the company attaches to all of its application servers. Singer collects the application log files and, with the help of the real-time messaging framework Apache Kafka, transfers that data to Storm, Spark and other “custom built log readers” that “process these events in real-time.”
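Pinterest’s log readers are custom built, but the consuming side of such a pipeline has a familiar shape. Here’s a minimal sketch using the stock Kafka Java client — the broker address, topic and group id are invented for illustration, and the println stands in for the handoff to Storm or Spark:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LogEventReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "log-readers");               // hypothetical consumer group
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("app-logs")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    processEvent(record.value());           // hand off to real-time processing
                }
            }
        }
    }

    private static void processEvent(String logLine) {
        System.out.println(logLine);                        // stand-in for a Storm/Spark handoff
    }
}
```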

Pinterest also uses its own log-persistence service called Secor to read that log data moving through Kafka and then write it to Amazon S3, after which Pinterest’s “self-serve big data platform loads the data from S3 into many different Hadoop clusters for batch processing,” the blog post stated.

Although the current system seems to be working decently for Pinterest, the company is also exploring how MemSQL could help when people need to query the data in real time. So far, the Pinterest team has developed a prototype of a real-time data pipeline that uses Spark Streaming to pass data into MemSQL.

Here’s what this prototype looks like:

Pinterest real-time analytics

In this prototype, Pinterest uses Spark Streaming to pass the data related to each pin — along with geolocation information and the category the pin belongs to — to MemSQL, where the data is then available to be queried.
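Because MemSQL speaks the MySQL wire protocol, the last hop of such a pipeline can be plain JDBC. The following Java sketch is a guess at the shape of the prototype, not Pinterest’s code; the socket source, table name and schema are all invented, and it assumes the MySQL JDBC driver is on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class PinsToMemSQL {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("pins-to-memsql").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Hypothetical source: lines like "pinId,category,lat,lon" arriving on a socket.
        JavaDStream<String> pins = jssc.socketTextStream("localhost", 9999);

        pins.foreachRDD(rdd -> rdd.foreachPartition(partition -> {
            // One JDBC connection per partition; MemSQL is MySQL-wire-compatible.
            try (Connection db = DriverManager.getConnection(
                    "jdbc:mysql://memsql-host:3306/analytics", "user", "pass")) {
                PreparedStatement insert = db.prepareStatement(
                    "INSERT INTO pin_events (pin_id, category, lat, lon) VALUES (?, ?, ?, ?)");
                while (partition.hasNext()) {
                    String[] f = partition.next().split(",");
                    insert.setLong(1, Long.parseLong(f[0]));
                    insert.setString(2, f[1]);
                    insert.setDouble(3, Double.parseDouble(f[2]));
                    insert.setDouble(4, Double.parseDouble(f[3]));
                    insert.executeUpdate();
                }
            }
        }));

        jssc.start();
        jssc.awaitTermination();
    }
}
```

Once rows land in the (hypothetical) pin_events table, analysts can query them with ordinary SQL, which is the point of the exercise.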

For analysts who understand SQL, the prototype could be a useful way to analyze data in real time using a mainstream language.

This startup’s software can manage your storage, all inside the server

Although we’re still a ways away from seeing the data deluge that’s sure to come as the internet of things becomes more mainstream, that doesn’t mean storage startups aren’t busy creating technology designed to handle the flood of data.

Launching Wednesday, a new Sunnyvale-based storage startup called Springpath is flaunting its solution to the oncoming data deluge: a software-defined storage system that promises to let customers store data more efficiently in their existing server infrastructure without having to buy more hard drives.

Springpath, like other software-defined-storage startups such as Primary Data and Qumulo, believes that modern-day data centers at typical businesses (not the Googles or Facebooks of the world) are “highly fragmented” and littered with too many hardware appliances from multiple vendors, explained Springpath CEO and CTO Mallik Mahalingam. Compounding the problem, each appliance has its own management plane that conflicts with the other devices, he said.

What Springpath wants to do is unify all of the storage in a company’s data centers through the use of software, but what makes it different is that the underlying Springpath software “runs on top of commodity servers” as opposed to the storage hard drives themselves, said Mahalingam.

“We are a full storage system,” said Mahalingam. “We don’t depend on any external storage [device].”

The Springpath team said that modern-day commodity servers contain enough storage built inside them that they can be used as suppliers of both compute and storage. The startup’s founders, who came from [company]VMware[/company] (Mahalingam said he is the inventor of the VXLAN virtual networking technology), think that their proprietary software coupled with commodity servers will be more efficient than storage arrays whose technology might be decades old.

At the heart of Springpath’s technology is its distributed file system called HALO, which the company spent the past three years developing from the ground up.

HALO

With the HALO architecture, users should be able to link up all of their servers and access the types of data management services you’d expect from other software-defined storage startups. These services include data caching, intelligent distribution of data across devices for better performance, and data de-duplication, which removes the duplicate copies of data that hog the system.
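De-duplication of this kind is commonly built on content fingerprinting: hash each block of data, and store a block only if its hash hasn’t been seen before. A tiny Java sketch of the general technique (not Springpath’s HALO implementation):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.Map;

// Sketch: a content-addressed block store that writes each unique block only once.
public class DedupStore {
    private final Map<String, byte[]> blocks = new HashMap<>();

    public String write(byte[] block) throws NoSuchAlgorithmException {
        // Fingerprint the block; identical content always yields the same key.
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(block);
        String fingerprint = HexFormat.of().formatHex(digest);
        blocks.putIfAbsent(fingerprint, block);  // duplicate blocks cost no extra space
        return fingerprint;                      // callers keep the fingerprint as a reference
    }

    public byte[] read(String fingerprint) {
        return blocks.get(fingerprint);
    }
}
```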

Putting a company’s data in a single device that handles both compute and storage, Mahalingam explained, could save the company significant money on excess storage drives.

Of course, having both storage and compute bundled in one appliance could pose problems if the device shuts down for some reason, taking both the computing and the data down with it. However, Springpath’s software apparently takes failover into account, so if one device goes down, another should take over.

If you want to use Springpath’s software in your data center to connect to the public cloud, the startup for now supports only vCloud Air and “anything that’s coupled with the VMware environment,” though it plans to eventually support other public cloud environments, Mahalingam said.

Springpath also doesn’t support Hadoop, so big data aficionados who love the framework may be out of luck for the time being. Mahalingam said Hadoop integration may be on the horizon.

The Sunnyvale startup currently has $34 million in total funding from investors including Sequoia Capital, NEA and Redpoint.

Mesosphere’s new data center mother brain will blow your mind

Mesosphere has been making a name for itself in the world of data centers and cloud computing since 2013 with its distributed-systems smarts and a string of open-source technologies, each designed to tackle the challenges of running tons of workloads across multiple machines. On Monday, the startup plans to announce that its much-anticipated data center operating system — the culmination of its many technologies — has been released as a private beta and will be available to the public in early 2015.

As part of the new operating system’s launch, [company]Mesosphere[/company] also plans to announce that it has raised a $36 million Series B investment round, which brings its total funding to $50 million. Khosla Ventures, a new investor, led the financing along with Andreessen Horowitz, Fuel Capital, SV Angel and other unnamed entities.

Mesosphere’s new data center operating system, dubbed DCOS, tackles the complexity of treating all of the machines inside a data center as one giant computer. Similar to how an operating system on a personal computer distributes the necessary resources to all the installed applications, DCOS can supposedly do the same thing across the data center.

The idea comes from the fact that today’s powerful data-crunching applications and services — like Kafka, Spark and Cassandra — span multiple servers, unlike more old-school applications like [company]Microsoft[/company] Excel. Configuring and maintaining each individual machine to accommodate the new distributed applications is a lot to ask of developers and operations staff, as Apache Mesos co-creator and new Mesosphere hire Benjamin Hindman explained in an essay earlier this week.

Mesosphere CEO Florian Leibert – Source: Mesosphere

Because of this complexity, the machines are nowhere near running at full steam, said Mesosphere’s senior vice president of marketing and business development Matt Trifiro.

“85 percent of a data center’s capacity is typically wasted,” said Trifiro. Although developers and operations staff have come a long way to tether pieces of the underlying system together, there hasn’t yet been a nucleus of sorts that successfully links and controls everything.

“We’ve always been talking about it — this vision,” said Mesosphere CEO Florian Leibert. “Slowly but surely the pieces came together; now is the first time we are showing the total picture.”

Building an OS

The new DCOS is essentially a bundle of all of the components Mesosphere has been rolling out — including the Mesos resource management system, the Marathon framework and Chronos job scheduler — as well as third-party applications like the Hadoop file system and YARN.

The DCOS also includes common OS features one would find in Linux or Windows, like a graphical user interface, a command-line interface and a software-development kit.

These types of interfaces and extras are important for DCOS to be a true operating system, explained Leibert. While Mesos can automate the allocation of data center resources to many applications, the additional features give coders and operations staff a centralized hub from which they can monitor — and even program — their data center as a whole.

“We took the core [Mesos] kernel and built the consumable systems around it,” said Trifiro. “[We] added Marathon, added Chronos and added the easy install of the entire package.”

To get DCOS up and running in a data center, Mesosphere installs a small agent on all Linux OS-based machines, which in turn allows them to be treated as one “uber operating system,” explained Leibert. With all of the machines’ operating systems linked up, it’s supposedly easier for distributed applications, like Google’s Kubernetes, to function and receive what they need.

The new graphical interface and command-line interface allow an organization to see a visual representation of all of its data center machines, all the installed distributed applications and how system resources like CPU and memory are being shared.

If a developer wants to install an application in the data center, he or she simply enters install commands in the command-line interface and DCOS should automatically load it up. A visual representation of the app should then appear, indicating which machine nodes are allocating the right resources.

DCOS interface

The same process goes for installing a distributed database like Cassandra; you can now “have it running in a minute or so,” said Leibert.

Installing Cassandra on DCOS

DCOS includes a built-in scheduler that takes into account variables a developer might want to specify in order to decide which machine should deliver resources to which application; this is helpful because the developer sets up the configuration once, and DCOS automatically follows through with the orders.
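Conceptually, that scheduling step is a matching problem: compare each machine’s free resources, plus any developer-supplied constraints, against what the application needs. A bare-bones Java sketch of the idea — the fields and the rack constraint are illustrative assumptions, not DCOS internals:

```java
import java.util.List;
import java.util.Optional;

// Sketch: pick the first machine whose free resources satisfy an app's requirements.
public class Scheduler {
    record Machine(String host, double freeCpus, double freeMemGb, String rack) {}
    record Requirement(double cpus, double memGb, String requiredRack) {}

    static Optional<Machine> place(List<Machine> machines, Requirement req) {
        return machines.stream()
            .filter(m -> m.freeCpus() >= req.cpus())
            .filter(m -> m.freeMemGb() >= req.memGb())
            // A developer-supplied constraint, e.g. "must run in rack A" (illustrative).
            .filter(m -> req.requiredRack() == null || req.requiredRack().equals(m.rack()))
            .findFirst();
    }
}
```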

“We basically turn the software developer into a data center programmer,” said Leibert.

And because DCOS is easier for a coder to program against, new distributed applications could arrive faster than before, since a developer can now write software for a fleet of machines rather than just one.

As of today, DCOS can run in on-premises environments like bare metal and OpenStack, and on major cloud providers — like [company]Amazon[/company], [company]Google[/company] and [company]Microsoft[/company] — and it supports Linux variants like CoreOS and Red Hat.

Changing the notion of a data center

Leibert wouldn’t name which organizations are currently trying out DCOS in beta, but it’s hard to imagine that companies like Twitter, Netflix or Airbnb — all users of Mesos — haven’t considered giving it a test drive. Leibert was a former engineer at Twitter and Airbnb, after all.

Beyond the top webscale companies, Mesosphere wants to court legacy enterprises, like those in the financial-services industry, that have existing data centers nowhere near as efficient as Google’s.

Banks, for example, typically use “tens of thousands of machines” in their data centers to perform risk analysis, Leibert said. With DCOS, Leibert claims that banks can run the type of complex workloads they require in a more streamlined manner if they were to link up all those machines.

And for these companies that are under tight regulation, Leibert said that Mesosphere has taken security into account.

“We built a security product into this operating system that is above and beyond any open-source system, even as a commercial plugin,” said Leibert.

As for what lies ahead for DCOS, Leibert said that his team is working on new features like distributed checkpointing, which is basically the ability to take a snapshot of a running application so that you can pause your work; the next time you start it up, the data center remembers where it left off and can deliver the right resources as if there wasn’t a break. This method is apparently good for developers working on activities like genome sequencing, he said.
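In miniature, checkpoint-and-resume looks like the single-process Java sketch below; the hard part DCOS is chasing is coordinating such snapshots across many machines at once, which this sketch does not attempt:

```java
import java.io.*;

// Sketch: checkpoint an application's state to disk and resume from it later.
public class Checkpoint {
    static void save(Serializable state, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(state);   // snapshot: everything needed to resume the run
        }
    }

    static Object restore(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();   // pick up where the run left off
        }
    }
}
```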

Support for containers is also something Mesosphere will continue to tout, as the startup has been a believer in the technology “even before the hype of [company]Docker[/company],” said Leibert. Containers, with their ability to isolate workloads even on the same machine, are fundamental to DCOS, he said.

Mesosphere believes new container technology will keep emerging — not just the recently announced CoreOS Rocket, explained Trifiro — but as of now, Docker and native Linux cgroups containers are what customers are calling for. If Rocket gains momentum in the marketplace, Trifiro said, Mesosphere will “absolutely implement it.”

If DCOS ultimately lives up to what it promises, managing data centers could become a far less difficult task. With a giant pool of resources at your disposal and an easier way to write new applications for a tethered-together cluster of computers, next-generation applications could be developed and managed far more easily than they used to be.

Correction: This post was updated at 8:30 a.m. to correctly state Leibert’s previous employers. He worked at Airbnb, not Netflix.

LinkedIn explains its complex Gobblin big data framework

LinkedIn shed more light Tuesday on a big-data framework dubbed Gobblin that helps the social network take in tons of data from a variety of sources so that it can be analyzed in its Hadoop-based data warehouses.