Red Hat’s new operating system will power up your containers

Open-source software giant Red Hat said on Thursday that its new operating system, custom-made to power Linux containers, is now available to download. Red Hat has been a big proponent of Docker and its container packing technology since at least last summer, touting its support of the startup and making sure its Enterprise Linux 7 product was compatible with Docker’s technology.

Container technology has generated a lot of buzz over the past year by promising a type of virtualization that’s lighter weight than your typical virtual machine. In order for a container to actually run, it needs to be connected to a host Linux OS that can distribute the necessary system resources to that container.

While you could use a typical Linux-based OS to power up your containers, as CoreOS CEO Alex Polvi (whose own startup offers a competing container-focussed OS) told me last summer, these kinds of operating systems merely get the job done and don’t take full advantage of what containers have to offer.

Red Hat’s new OS supposedly comes packed with features designed to make running containerized applications less of a chore to manage. These features include an easier way to update the operating system (OS updates can often be a pain for IT admins) and an integration with Google’s Kubernetes container-orchestration service for spinning up and managing multiple containers.

The new OS is also promising better security for those Docker containers, an issue that Docker’s team has been addressing in various updates, with a supposedly stronger way of isolating containers from each other when they are dispersed in a distributed environment.

Of course, [company]Red Hat[/company] has some competition when it comes to becoming the preferred OS of container-based applications. CoreOS has its own container-centric OS and Ubuntu has its Snappy Ubuntu Core system for powering Docker containers. Additionally, a couple of Citrix veterans who left the company in September have founded a startup called Rancher Labs, which just released RancherOS, described by the startup as a “minimalist Linux distribution that was perfect for running Docker containers.”

It will be worth keeping an eye on which OS gains traction in the container marketplace and whether we will see some of these new operating systems starting to offer support for CoreOS’s new Rocket-container technology as opposed to just the Docker platform.

A Red Hat spokesperson wrote to me in an email that “Red Hat Enterprise Linux-based containers are not supported on CoreOS and rocket is not supported with Atomic Host. We are, as always, continuing to evaluate new additions in the world of containers, including Rocket, with respect to our customer needs.”

Watch Hilary Mason discredit the cult of the algorithm

Want to see Hilary Mason, the CEO and founder at Fast Forward Labs, get fired up? Tell her about your new connected product and its machine learning algorithm that will help it anticipate your needs over time and behave accordingly. “That’s just a bunch of marketing bullshit,” said Mason when I asked her about these claims.

Mason actually builds algorithms and is well-versed in what they can and cannot do. She’s quick to dismantle the cult that has been built up around algorithms and machine learning as companies try to make sense of all the data they have coming in, and as they try to market products built on learning algorithms in the wake of Nest’s $3.2 billion sale to Google (I call those efforts faithware). She’ll do more of this during our opening session with Data Collective co-managing partner Matt Ocko at Structure Data on March 18 in New York. You won’t want to miss it.

Lately, algorithms have been touted as the new saviors, capable of helping humans parse terabytes of data to find the hypothetical needle in the haystack. Or they are portrayed as mirrors of our biases coolly replicating our own racist or classist institutions in code.

Mason thinks of them differently. An algorithm is a method, or recipe, or set of instructions for a computer to follow, she said. “It’s just a recipe you type in to get a consistent result. In some ways chocolate chip cookie recipes are my favorite algorithms. You put a bunch of bad-for-you stuff in a bowl and get a delicious result.”

As for the phrase “machine learning,” which has begun replacing “algorithm” in many of the marketing and Kickstarter pitches I see for connected devices that learn your habits, Mason said that’s no more magical. “It’s a false distinction,” she said. Machine learning algorithms may tend to use statistical methods and techniques, but they are still just algorithms.

Essentially, you’re combining what you know about the properties of a given data set with the recipe you built. For an email spam filter, you might build an algorithm that detects spam by looking for words that commonly appear in spam and then combining that with a statistical distribution of the countries that spam often comes from. Voila, the magic has become mundane — or at least mathematical.
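To make the recipe idea concrete, here is a minimal sketch of that kind of spam filter. It is not Mason’s code: it assumes a naive Bayes style of combining evidence, and every probability, word and country code in it is made up for illustration.

```python
import math

# Hypothetical, hand-picked probabilities; a real filter would estimate these from labeled mail.
SPAM_WORD_PROB = {"winner": 0.30, "free": 0.25, "meeting": 0.01}
HAM_WORD_PROB = {"winner": 0.01, "free": 0.05, "meeting": 0.20}
SPAM_COUNTRY_PROB = {"XX": 0.60, "YY": 0.40}
HAM_COUNTRY_PROB = {"XX": 0.10, "YY": 0.90}
UNSEEN = 0.05  # fallback probability for words or countries missing from the tables


def spam_score(words, country, prior_spam=0.5):
    """Return the probability that a message is spam, naive Bayes style."""
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for word in words:
        log_spam += math.log(SPAM_WORD_PROB.get(word, UNSEEN))
        log_ham += math.log(HAM_WORD_PROB.get(word, UNSEEN))
    # Fold in the statistical distribution of sending countries.
    log_spam += math.log(SPAM_COUNTRY_PROB.get(country, UNSEEN))
    log_ham += math.log(HAM_COUNTRY_PROB.get(country, UNSEEN))
    # Convert the two log scores back into a normalized probability.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))
```

Calling spam_score(["winner", "free"], "XX") lands close to 1, while spam_score(["meeting"], "YY") lands close to 0. Nothing in there is magic: it is a lookup table and some multiplication.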

At the end of the day, it’s still just math. Really awesome math.

Updated: This story was updated to clarify some of the points Mason was making.

Google’s new service will ease real-time communications for applications

Google has a new real-time messaging system available in beta for its cloud service called Google Cloud Pub/Sub, the company said on Wednesday in a blog post. The system in theory will enable applications and services to communicate with each other in real time, regardless of whether they are built atop the Google Cloud or run on-premises.

In today’s world of distributed systems, it’s important for messages to flow between applications and services as fast as possible so that applications can present the freshest information to users as well as to the IT admins responsible for managing the infrastructure. This is why Apache Kafka is so popular with companies like [company]Hortonworks[/company], which added support for the real-time messaging framework last summer.

The new messaging system targets developers looking to build complex, distributed applications on the [company]Google[/company] Cloud, and it follows Google Container Engine, which was announced in November. Google Container Engine is basically the managed service version of the Kubernetes container-management system used for spinning up and managing tons of containers for complex, multi-component applications.

At this time, both Google Cloud Pub/Sub and the Google Container Engine are only available for the Google Cloud Platform, so you can see that the search giant is hoping to lure more enterprise clients to its cloud who don’t want to deal with the heavy lifting that’s often associated with using open-source technology.

Google said the new messaging system powers its recently launched Google Cloud Monitoring service as well as Snapchat’s new Discover feature, which, as my colleague Carmel DeAmicis reported, is basically Snapchat’s portal to media companies like Vice and CNN.

Google Cloud Pub/Sub is free to use while in beta, but once it hits general availability, you’ll have to pay based on usage, which starts “at 40¢ per million for the first 100 million API operations each month,” according to the blog post.
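For a rough sense of what that first tier means in dollars, here is a small back-of-the-envelope sketch. The function name is my own, and because the post quoted here only gives the rate for the first 100 million operations, the sketch refuses to guess at anything beyond that.

```python
RATE_PER_MILLION_USD = 0.40      # first-tier rate quoted in the blog post
FIRST_TIER_LIMIT = 100_000_000   # operations covered by the quoted rate


def first_tier_cost(operations):
    """Monthly cost in dollars for usage that stays inside the first tier."""
    if operations < 0:
        raise ValueError("operations must be non-negative")
    if operations > FIRST_TIER_LIMIT:
        raise ValueError("rates beyond the first 100 million operations are not quoted here")
    return operations / 1_000_000 * RATE_PER_MILLION_USD
```

At that rate, an application making 25 million API calls a month would pay about $10, and one that uses the whole first tier would pay about $40.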

Coolan lets companies pool and analyze hardware data

A common dilemma facing many companies that have a ton of gear in their data centers is having to figure out which hardware appliance is causing bottlenecks that may cause downtime and customer outrage. Coolan, a startup formed by former Facebook and Google engineers, aims to solve this problem and is exiting stealth with a new product that gathers together infrastructure data from multiple companies, which it then analyzes to unearth how all their gear is performing.

That strength-by-numbers approach separates Coolan from other IT monitoring services out there that companies plug into their data centers to discover how efficient (or not) their infrastructure really is so they can spot problems before they turn into something bigger.

While those IT monitoring services each study the infrastructure of a single company, Coolan’s software platform allows multiple organizations to share their infrastructure data with each other in the hopes that, with more data available, they can put an end to unnecessary server failures and the like.

“[Organizations] are all curious about solving this problem, but they have a limited data set,” said Coolan co-founder and CEO Amir Michael. “By bringing the industry together you get a larger data set.”

Screenshot of failure rate

Michael was a hardware engineer at [company]Google[/company] and then a Facebook engineer and manager of the company’s hardware design. At [company]Facebook[/company], Michael’s contributions led to the creation of the Open Compute Project, where he is still an active participant as vice-chair of the project’s Incubation Committee, responsible for reviewing new specifications.

The Open Compute Project did a good job of getting people to talk about servers and design concepts from a hardware level, Michael said. However, when it comes to operations and getting the most out of hardware, there’s not a lot of information available to the general public on the actual performance metrics of individual pieces of hardware.

“We all want more transparency around our hardware,” Michael said.

The idea is for companies big and small, with 100 servers or 1,000 servers, to all benefit from the insights gleaned from the same big data set. Companies will have to install software (three lines of code, apparently) onto their fleet of servers, which will allow for infrastructure data to flow over to Coolan’s own servers, stored in Amazon S3.

Screenshot of notification report

Michael seemed aware of the irony that his startup, which specializes in hardware-performance metrics, operates in the cloud, but he said that “we will eat our own dog food and be running our own servers” once the company reaches a certain size.

Coolan will not be syphoning the type of software-related data that New Relic or AppDynamics need for their analytics purposes, but rather hardware data, like the name of a device manufacturer, the temperature of the hardware when running, the model number of an appliance, when the device started generating errors, and so on.

From all this data, Coolan’s team can run machine-learning algorithms to learn how the hardware stacks up and which devices have a higher chance of failure. If a bunch of companies that contribute to Coolan all find that a fan in a particular manufacturer’s device fails at the two-year mark, then users who recently purchased that device will now have some warning that their devices might not function properly down the road.
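The strength-in-numbers argument is easy to see with plain arithmetic, even before any machine learning gets involved. The sketch below is not Coolan’s method; it simply pools made-up deployment and failure counts from several hypothetical companies into one failure rate per device model.

```python
from collections import defaultdict


def pooled_failure_rates(reports):
    """Combine per-company (model, deployed, failed) counts into one failure rate per model."""
    deployed = defaultdict(int)
    failed = defaultdict(int)
    for model, deployed_count, failed_count in reports:
        deployed[model] += deployed_count
        failed[model] += failed_count
    return {model: failed[model] / deployed[model] for model in deployed if deployed[model] > 0}


# Hypothetical reports from three companies about two fan models.
reports = [
    ("fan-a", 100, 2),
    ("fan-a", 400, 28),
    ("fan-b", 500, 5),
]
rates = pooled_failure_rates(reports)
```

On its own, the first company would see a 2 percent failure rate for fan-a. Pooled with the second company’s fleet, the rate is 6 percent, which is the kind of early warning a small shop could not get from its own data.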

Coolan CEO Amir Michael

Coolan’s not ready to disclose who its pilot customers are, but Michael did say that a number of the company’s clients are organizations that started out in the cloud and are now moving off of it to build their own data centers.

The startup could one day have a tool that also monitors a company’s cloud infrastructure, but Michael said that’s “not the primary focus right now.” Coolan is also still figuring out its pricing model, but its main goal as of now is to simply get more companies on board to “get more data.”

“I think part of it is in my DNA,” said Michael in reference to how his days at Facebook could have made him more open to the idea of sharing and collaborative projects. Facebook recently launched a collaborative threat-detection framework that seems similar to Coolan, except that instead of hardware data, companies are dumping security data into a central hub.

The six-person team at Coolan is not disclosing how much funding it has raised so far, but it closed a seed round in February led by Social + Capital, North Bridge Venture Partners and Keshif Ventures.

eBay’s new Pulsar framework will analyze your data in real time

eBay has a new open-source, real-time analytics and stream-processing framework called Pulsar that the company claims is in production and is available for others to download, according to an eBay blog post on Monday. The online auction site is now using Pulsar to gather and process all the data pertaining to user interactions and their behaviors and said that the framework “scales to a million events per second with high availability.”

While eBay uses Hadoop for its batch processing and analytics needs, the company said it now needs a way to process and analyze data in real time for better personalization, fraud and bot detection and dashboard creation, among others.

For a system to achieve what eBay is calling for, it needs to process millions of events per second, have low latency with “sub-second event processing and delivery” and be spread out across multiple data centers with “no cluster downtime during software upgrade,” according to the blog post.

eBay decided the best way to go about this was to build its own complex event processing (CEP) framework, which also includes a Java-based framework on top of which developers can build other applications.

eBay pulsar pipeline

Developers skilled with SQL should feel at home with Pulsar because the framework can be operated with a “SQL-like event processing language.”

The real-time analytics data pipeline built into Pulsar is essentially a combination of components that are linked together (but can function independently) and form the data-processing conveyor belt through which all that user data flows. Some of these components include a data collector, an event distributor and a metrics calculator.

It’s within Pulsar that eBay can add additional information to enrich the data — like geo-location information — remove unnecessary data attributes and compile together a bunch of events and “add up metrics along a set of dimensions over a time window.”
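Adding up metrics along dimensions over a time window sounds abstract, so here is a minimal sketch of the idea. It is not Pulsar code and does not use its SQL-like language; it assumes fixed, non-overlapping 60-second windows, and the event fields are hypothetical.

```python
from collections import Counter


def windowed_counts(events, window_seconds, dimensions):
    """Count events per (window start, dimension values) using fixed, non-overlapping windows."""
    counts = Counter()
    for event in events:
        # Round the timestamp down to the start of its window.
        window_start = event["ts"] - event["ts"] % window_seconds
        key = (window_start,) + tuple(event[d] for d in dimensions)
        counts[key] += 1
    return counts


events = [
    {"ts": 3, "country": "US", "page": "home"},
    {"ts": 42, "country": "US", "page": "item"},
    {"ts": 61, "country": "DE", "page": "home"},
    {"ts": 75, "country": "US", "page": "home"},
]
per_country = windowed_counts(events, 60, ["country"])
```

Here per_country holds two US events in the window starting at 0, and one US and one DE event in the window starting at 60. Swapping ["country"] for ["country", "page"] slices the same stream along a second dimension.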

The whole idea is to have all that real-time data available in Pulsar to be treated “like a database table” in which developers can run the necessary SQL queries for analytic purposes, the post stated.

From the eBay blog:
[blockquote person=”eBay” attribution=”eBay”]Pulsar CEP processing logic is deployed on many nodes (CEP cells) across data centers. Each CEP cell is configured with an inbound channel, outbound channel, and processing logic. Events are typically partitioned based on a key such as user id. All events with the same partitioned key are routed to the same CEP cell. In each stage, events can be partitioned based on a different key, enabling aggregation across multiple dimensions. To scale to more events, we just need to add more CEP cells into the pipeline. [/blockquote]
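The routing rule in that quote, where every event with the same key lands on the same cell, can be sketched in a few lines. This is an illustration rather than eBay’s implementation: the CRC32-plus-modulo hashing scheme and the function names are my assumptions.

```python
import zlib


def cell_for(key, cell_count):
    """Map a partition key to a cell index; the same key always maps to the same cell."""
    return zlib.crc32(key.encode("utf-8")) % cell_count


def route(events, key_field, cell_count):
    """Group events by the cell that owns their partition key."""
    cells = {index: [] for index in range(cell_count)}
    for event in events:
        cells[cell_for(event[key_field], cell_count)].append(event)
    return cells
```

One caveat with this simple scheme: changing cell_count moves most keys to a different cell, so a system that adds cells on the fly would likely want something more careful, such as consistent hashing.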

Here’s what the Pulsar deployment architecture looks like:

Pulsar deployment

Plans are under way for Pulsar to include its own dashboard and real-time reporting API and to integrate with other similar services, like the Druid open-source database for real-time analysis. The Druid database, created by the analytics startup Metamarkets (see disclosure), just moved over to the Apache 2 software license to attract more users.

Pulsar is open sourced under the Apache 2.0 License and the GNU General Public License version 2.0.

Disclosure: Metamarkets is a portfolio company of True Ventures, which is also an investor in Gigaom.

MIT researchers claim they have a way to make faster chips

A team of MIT researchers has discovered a possible way to make multicore chips a whole lot faster than they currently are, according to a recently published research paper.

The researchers’ work involves the creation of a scheduling technique called CDCS, which refers to computation and data co-scheduling. This technique can distribute both data and computations throughout a chip in such a way that the researchers claim that in a 64-core chip, computational speeds saw a 46 percent increase while power consumption decreased by 36 percent. This boost in speed is important because multicore chips are becoming more prevalent in data centers and supercomputers as a way to increase performance.

The basic premise behind the new scheduling technique is that data has to be near the computation that uses it, and the best way to do so is with a combination of hardware and software that distributes both the data and computations throughout the chip more easily than before.

Although current techniques like nonuniform cache access (NUCA), which basically involves storing cached data near the computations, have worked so far, these techniques don’t take into account the placement of the computations themselves.

The new research touts the use of an algorithm that optimally places the data and the compute together as opposed to only the data itself. This algorithm allows the researchers to anticipate where the data needs to be located.
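To see why placing the computations matters, consider a toy model. The sketch below is not the CDCS algorithm from the paper; it is a brute-force search on a hypothetical 2x2 grid of cores, with made-up access counts and with distance between tiles standing in for communication cost.

```python
from itertools import permutations

TILES = [(0, 0), (0, 1), (1, 0), (1, 1)]  # a hypothetical 2x2 mesh of cores


def distance(a, b):
    """Manhattan distance between two tiles, a stand-in for on-chip hop count."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def traffic_cost(thread_tiles, data_tiles, accesses):
    """Sum of access counts weighted by the distance each access travels."""
    return sum(
        count * distance(thread_tiles[thread], data_tiles[block])
        for (thread, block), count in accesses.items()
    )


def best_data_placement(thread_tiles, blocks, accesses):
    """Data-only search: threads stay where they are, only data blocks move."""
    candidates = (dict(zip(blocks, tiles)) for tiles in permutations(TILES, len(blocks)))
    return min(candidates, key=lambda data_tiles: traffic_cost(thread_tiles, data_tiles, accesses))


def best_joint_placement(threads, blocks, accesses):
    """Joint search: try every thread placement together with its best data placement."""
    best = None
    for tiles in permutations(TILES, len(threads)):
        thread_tiles = dict(zip(threads, tiles))
        data_tiles = best_data_placement(thread_tiles, blocks, accesses)
        cost = traffic_cost(thread_tiles, data_tiles, accesses)
        if best is None or cost < best[0]:
            best = (cost, thread_tiles, data_tiles)
    return best


# Two threads that both read the same block; the threads start on opposite corners.
accesses = {("t0", "s"): 10, ("t1", "s"): 10}
fixed_threads = {"t0": (0, 0), "t1": (1, 1)}
data_only = best_data_placement(fixed_threads, ["s"], accesses)
data_only_cost = traffic_cost(fixed_threads, data_only, accesses)
joint_cost = best_joint_placement(["t0", "t1"], ["s"], accesses)[0]
```

With the threads pinned to opposite corners, the best the data-only search can do in this toy is a cost of 20. Letting the search move the threads next to each other as well cuts that to 10. Brute force like this would not scale to 64 cores, which is presumably where the researchers’ hardware and software come in.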

“Now that the way to improve performance is to add more cores and move to larger-scale parallel systems, we’ve really seen that the key bottleneck is communication and memory accesses,” said MIT professor and author of the paper Daniel Sanchez in a statement. “A large part of what we did in the previous project was to place data close to computation. But what we’ve seen is that how you place that computation has a significant effect on how well you can place data nearby.”

While the CDCS-related hardware loaded on the chip accounts for 1 percent of the chip’s available space, the researchers believe that it’s worth it when it comes to the performance increase.

Pinterest is experimenting with MemSQL for real-time data analytics

Pinterest shed more light on how it analyzes data in real time in a blog post on Wednesday, and the social scrapbook and visual discovery service also revealed details about how it’s exploring a combination of MemSQL and Spark Streaming to improve the process.

Currently, Pinterest uses a custom-built log-collecting agent dubbed Singer that the company attaches to all of its application servers. Singer then collects all those application log files and with the help of the real-time messaging framework Apache Kafka it can transfer that data to Storm or Spark and other “custom built log readers” that “process these events in real-time.”

Pinterest also uses its own log-persistence service called Secor to read that log data moving through Kafka and then write it to Amazon S3, after which Pinterest’s “self-serve big data platform loads the data from S3 into many different Hadoop clusters for batch processing,” the blog post stated.

Although this current system seems to be working decently for Pinterest, the company is also exploring how it can use MemSQL to help when people need to query the data in real time. So far, the Pinterest team has developed a prototype of a real-time data pipeline that uses Spark Streaming to pass data into MemSQL.

Here’s what this prototype looks like:

Pinterest real-time analytics

In this prototype, Pinterest can use Spark Streaming to pass the data related to each pin (along with geolocation information and the category the pin belongs to) to MemSQL, where the data is then available to be queried.

For analysts that understand SQL, the prototype could be useful as a way to analyze data in real time using a mainstream language.
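As a rough illustration of what that might look like for an analyst, here is a sketch that uses Python’s built-in sqlite3 module as a stand-in for MemSQL. The table name, columns and rows are all hypothetical, and in the real prototype the rows would arrive via Spark Streaming rather than a hard-coded list.

```python
import sqlite3

# sqlite3 stands in for the SQL store here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pin_events (pin_id INTEGER, category TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO pin_events VALUES (?, ?, ?)",
    [(1, "food", "US"), (2, "travel", "US"), (3, "food", "BR"), (4, "food", "US")],
)

# The kind of question an analyst might ask of freshly loaded events.
rows = conn.execute(
    "SELECT category, COUNT(*) AS pins FROM pin_events "
    "WHERE country = 'US' GROUP BY category ORDER BY pins DESC, category"
).fetchall()
```

The query returns the most-pinned categories among US events, with food ahead of travel in this made-up data.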

How Twitter processes tons of mobile application data each day

It’s only been seven months since Twitter released its Answers tool, which was designed to provide users with mobile application analytics. In that time, the tool has grown to roughly five billion daily sessions in which “hundreds of millions of devices send millions of events every second to the Answers endpoint,” the company explained in a blog post on Tuesday. Clearly, that’s a lot of data that needs to get processed and in the blog post, Twitter detailed how it configured its architecture to handle the task.

The backbone of Answers was created to handle how the mobile application data is received, archived, processed in real time and processed in chunks (otherwise known as batch processing).

Each time an organization uses the Answers tool to learn more about how its mobile app is functioning, Twitter logs and compresses all that data (which gets sent in batches) in order to conserve the device’s battery power while also not putting too much unnecessary strain on the network that routes the data from the device to Twitter’s servers.
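The batching-and-compressing step is simple to picture in code. The sketch below is only an illustration: Twitter’s post, as summarized here, does not say which serialization or compression format Answers uses, so JSON and gzip are my assumptions, as are the event fields.

```python
import gzip
import json


def pack_batch(events):
    """Serialize a batch of events to JSON and gzip it into one payload."""
    return gzip.compress(json.dumps(events).encode("utf-8"))


def unpack_batch(payload):
    """Reverse pack_batch on the receiving side."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))


# Hypothetical events; repetitive data like this compresses well.
events = [{"type": "session_start", "app": "demo", "seq": n} for n in range(200)]
payload = pack_batch(events)
```

Sending one compressed payload instead of 200 separate requests means fewer radio wake-ups on the device and fewer bytes on the network.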

The information flows into a Kafka queue, which Twitter said can be used as a temporary place to store data. The data then gets passed into Amazon Simple Storage Service (Amazon S3) where Twitter retains the data in a more permanent location as opposed to Kafka. Twitter uses Storm to process the data that flows into Kafka and also uses it to write the information stored in Kafka to [company]Amazon[/company] S3.

Data pipeline

With the data stored in Amazon S3, Twitter then uses Amazon Elastic MapReduce for batch processing.

From the blog post:
[blockquote person=”Twitter” attribution=”Twitter”]We write our MapReduce in Cascading and run them via Amazon EMR. Amazon EMR reads the data that we’ve archived in Amazon S3 as input and writes the results back out to Amazon S3 once processing is complete. We detect the jobs’ completion via a scheduler topology running in Storm and pump the output from Amazon S3 into a Cassandra cluster in order to make it available for sub-second API querying.[/blockquote]

At the same time as this batch processing is going on, Twitter is also processing data in real time because “some computations run hourly, while others require a full day’s of data as input,” it said. To handle the computations that need to be performed more quickly and require less data than the bigger batch processing jobs, Twitter uses an instance of Storm that processes the data that’s sitting in Kafka, the results of which get funneled into an independent Cassandra cluster for real-time querying.

From the blog post:
[blockquote person=”Twitter” attribution=”Twitter”]To compensate for the fact that we have less time, and potentially fewer resources, in the speed layer than the batch, we use probabilistic algorithms like Bloom Filters and HyperLogLog (as well as a few home grown ones). These algorithms enable us to make order-of-magnitude gains in space and time complexity over their brute force alternatives, at the price of a negligible loss of accuracy.[/blockquote]
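For readers who haven’t met a Bloom filter, here is a minimal sketch of one. It is not Twitter’s implementation; the bit-array size, the number of hash functions and the use of salted SHA-256 are arbitrary choices made for illustration.

```python
import hashlib


class BloomFilter:
    """Set membership with no false negatives and a tunable false-positive rate."""

    def __init__(self, size_bits=1024, hash_count=3):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function with an index.
        for index in range(self.hash_count):
            digest = hashlib.sha256(f"{index}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits[position // 8] |= 1 << (position % 8)

    def __contains__(self, item):
        return all(
            self.bits[position // 8] & (1 << (position % 8))
            for position in self._positions(item)
        )
```

The filter answers “have I seen this device before?” in a fixed amount of memory no matter how many devices are added. The price is that it will occasionally say yes for a device it has never seen, which is the small loss of accuracy the quote refers to.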

The complete data-processing system looks like this, and it’s tethered together with Twitter’s APIs:

Twitter Answers architecture

Because of the way the system is architected and the fact that the data that needs to be analyzed in real time is separated from the historical data, Twitter said that no data will be lost if something goes wrong during the real-time processing. All that data is stored where Twitter does its batch processing.

If there are problems affecting batch processing, Twitter said its APIs “will seamlessly query for more data from the speed layer” and can essentially configure the system to take in “two or three days of data” instead of just one day; this should give Twitter engineers enough time to take a look at what went wrong while still providing users with the type of analytics derived from batch processing.
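That fallback behavior can be sketched as a simple lookup rule. This is an illustration of the idea, not Twitter’s API code; the function name, the day numbering and the session counts are all hypothetical.

```python
def daily_sessions(day, batch_view, speed_view):
    """Prefer the batch result for a day; fall back to the speed layer if batch has not produced it."""
    if day in batch_view:
        return batch_view[day]
    return speed_view.get(day)


# Hypothetical views keyed by day number; batch processing is stuck after day 1.
batch_view = {1: 1000}
speed_view = {1: 990, 2: 1210, 3: 1185}
report = {day: daily_sessions(day, batch_view, speed_view) for day in (1, 2, 3)}
```

Day 1 comes from the batch layer, while days 2 and 3 are served from the speed layer’s approximate numbers until the batch jobs catch up.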

Security incubator with ties to Israeli military forms with $18M

A new Israel-based cyber-security incubator called Team8 plans to announce its launch on Tuesday and is banking that its ties to the Israeli military will give its startups a competitive edge in the crowded security startup market. As part of the launch, the incubator also landed an $18 million investment round from Bessemer Venture Partners (BVP), Alcatel-Lucent, Cisco Investments, Marker LLC, and Innovation Endeavors.

Team8’s founders — Nadav Zafrir, Israel Grimberg and Liran Grinberg — are all veterans of the Israel Defense Forces Unit 8200, which Zafrir described as being the National Security Agency of Israel. This particular unit, which Zafrir said he commanded during the last half of his military service, is responsible for intelligence gathering and national security, with former members of the unit having gone on to build some of Israel’s largest tech companies, like the Tel Aviv-based Check Point Software Technologies. Unit 8200 has also generated some innovative security companies over the years like Hexadite, which formally launched last July.

Zafrir described Team8 as a “startup of startups” that operates like a think tank in that its core team and staff spend a considerable amount of time doing research, albeit not for policy reports to influence governments. After researching specific areas in cyber-security that the team wants to tackle, Team8 then tries to find the right security experts who are best suited for potentially creating a startup that can solve the issue; these experts typically come from Unit 8200, but they don’t necessarily need to be, Zafrir said.

After an entrepreneur or security expert signs on, Team8 in return gives them the typical incubator perks including helping with the logistics of starting a new business.

The entrepreneurs that Team8 decides to work with will be provided with funding, technical guidance, go-to-business planning and anything else it takes for a successful startup to get off the ground.

Team8 Team

Team8 will be different than a typical incubator in that it will be “taking in people and developing the concepts and technology in-house,” said BVP partner David Cowan, and the plan is for these companies to remain independent and not bound to the larger companies that are financially backing the project.

The first area of cyber security that Team8 wants to tackle relates to the idea of preventing the kind of massive data breaches like those seen at [company]Target[/company] and [company]Sony[/company] through a thorough understanding of the hackers behind the attacks, whether they be criminal syndicates or nation-states. Zafrir wouldn’t elaborate on how exactly the first company in its portfolio will be addressing this, citing that the company (led by former Check Point Software veteran Ofer Israeli) is still in stealth (and still figuring out a name).

“Our thesis for this specific domain is that at the end of the day, it is not about the malware,” said Zafrir. “You have to think about the people behind the malware.”

Team8 is currently backing two companies with one in beta and the other starting its alpha program the next quarter, and the ultimate goal of the incubator is to build four to six companies in the next few years.

For BVP’s Cowan, a successful 2015 means that Team8 will have spun out at least one company that has workable technology, solid leadership and a couple of customers. Cowan said he has “a good sense of what project it is” that he bets will be the incubator’s first successful company, but he wouldn’t elaborate beyond saying, “We will have our first company up and running by the end of the year.”

SmartThings hires ex-Googler to manage dev platform

SmartThings, Samsung’s hope for a unified smart home platform, has hired Dora Hsu, a former Google executive, as chief platform officer to lead its developer platform. Hsu, who was formerly the senior director of Google Cloud Solutions for the Google Cloud Platform business, will be responsible for getting developers to buy into SmartThings’ and Samsung’s idea of an open ecosystem for the smart home by convincing them to use the SmartThings developer environment and to integrate devices into the SmartThings ecosystem.

SmartThings has long positioned itself as an open platform for the smart home, and after it was purchased last summer by Samsung, that hasn’t changed. In fact, Samsung Electronics president BK Yoon gave a somewhat overwrought keynote at International CES pleading for openness in the internet of things. But that openness may be hard to come by.

While Samsung’s purchase of SmartThings helped many companies finally feel comfortable with SmartThings as a mature platform, I’ve also heard from others — notably large appliance and TV vendors — that they will not integrate with a potential rival. So Hsu may have her work cut out for her in enticing developers to build for what is likely to be a large but never fully complete platform.

When it comes to the tools to court developers, Hsu may have better luck. SmartThings last month updated its Integrated Developer Environment so it runs smoother and faster, and I expect more updates to come. Hsu’s experience at Google heading up the technical teams that managed some of the cloud business will certainly help here.

SmartThings CEO Alex Hawkinson said that developers can expect more investment from SmartThings in the platform, beyond just making it faster. That will include adding analytics, certification and more. “We will also be making investments in certification, marketing, and monetization support to help developers and device makers to reach significant numbers of new customers through SmartThings,” he told me in an email. “The goal is to not just help developers to rapidly innovate, but to also help them to improve the lives of many consumers while building a great business in the process.”

Hsu will essentially build the business and infrastructure to create SmartThings’ app store model. So far SmartThings supports over 150 devices, with more in the works. It doesn’t disclose the number of developers working on the platform.

Hsu reports directly to Hawkinson, and is based in the SmartThings HQ in Palo Alto, California, with her new role effective immediately.