Intel, Alcatel-Lucent unveil their cloud mobile network

A year after forming their wireless partnership, Intel and Franco-American network builder Alcatel-Lucent say they’re ready to start moving the mobile network from the cell tower into the data center. At Mobile World Congress on Monday, the two took the wraps off a new networking architecture called vRAN, which looks unlike any mobile system deployed to date.

vRAN moves the baseband processing that drives the mobile network into the cloud. At its center are servers built on [company]Intel[/company] Xeon processors, on which [company]Alcatel-Lucent[/company] runs many of the network’s functions as software. The concept is known as Cloud-RAN, and if adopted by the mobile industry, it could fundamentally change how networks are built.

The mobile industry certainly wouldn’t be the first to embrace virtualization, but the move is a particularly fraught one for carriers because of the highly distributed way mobile networks are designed. All of the processing might of a mobile network – and the lion’s share of its expense – sits at its fringes, right under the radios that transmit signals to our phones. Today carriers have to maximize the capacity of those base stations so they can handle the enormous demand for mobile data and voice at peak times.

Cloud-RAN (the RAN standing for radio access network) would move all that baseband processing into a centralized data center, and carriers could allot capacity to cell towers as it’s needed. It’s a more efficient way to build a network, and it could result in more reliable and faster mobile service for you and me. Instead of cell sites maxing out their capacity and dropping our LTE connections, Cloud-RAN could amp up capacity at congested cell sites – you can think of it as a kind of processing SWAT team deployed wherever it’s needed in the network at any given time.
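
To make that pooling idea concrete, here is a deliberately simplified sketch of how a shared pool of baseband capacity could be reapportioned to cells based on demand. It is an illustration only, not Alcatel-Lucent’s actual scheduler; the function and cell names are hypothetical.

```python
# Toy illustration of Cloud-RAN capacity pooling (not ALU's real scheduler):
# a shared budget of baseband processing units is split across cells in
# proportion to their current load.

def allocate_baseband(pool_units, cell_load):
    """Divide a central pool of processing units across cells by demand."""
    total_load = sum(cell_load.values()) or 1
    return {
        cell: round(pool_units * load / total_load)
        for cell, load in cell_load.items()
    }

# A congested stadium cell temporarily pulls capacity that quieter
# suburban cells aren't using.
print(allocate_baseband(100, {"stadium": 800, "suburb_a": 50, "suburb_b": 150}))
# -> {'stadium': 80, 'suburb_a': 5, 'suburb_b': 15}
```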

There are some limitations to just how “cloud” Cloud-RAN can go. You’re not going to see a mobile network built on Amazon Web Services, or a single massive data center serving all of the U.S. Latency is a critical consideration in the mobile network, so data centers will have to be reasonably close to the towers they serve, but Alcatel-Lucent wireless CTO Michael Peeters told me that ALU and Intel have managed to push that distance out to more than 100 km (62 miles), which is enough to build a virtualized network of thousands of cells.
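
That 100 km figure is about physics as much as engineering. A rough back-of-the-envelope calculation (assuming a signal speed in fiber of roughly 200,000 km per second, about two-thirds the speed of light; these numbers are illustrative, not Alcatel-Lucent’s) shows why the data center can’t sit arbitrarily far from the radios:

```python
# Back-of-the-envelope fronthaul propagation delay over fiber.
# Assumes ~200,000 km/s signal speed in fiber; illustrative only.

SPEED_IN_FIBER_KM_PER_S = 200_000

def one_way_delay_ms(distance_km):
    return distance_km / SPEED_IN_FIBER_KM_PER_S * 1000

for km in (20, 62, 100):
    print(f"{km:>3} km: {one_way_delay_ms(km):.2f} ms one way, "
          f"{2 * one_way_delay_ms(km):.2f} ms round trip")
# 100 km adds roughly 0.5 ms each way, a meaningful slice of LTE's
# millisecond-scale timing budgets.
```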

“You could take a 1 million-population city and host the entire everything in a single central location,” Peeters said.

Even before their collaboration began, Intel and Alcatel-Lucent had been plugging away at Cloud-RAN independently for half a decade or more, as have other mobile networking companies. The difference now, said Sandra Rivera, GM of Intel’s Network Platforms Group, is that the two companies have a commercially viable product in vRAN. “The products have been developed and we’ll be doing trials this year,” she said.

Two of those trial partners, China Mobile and Telefónica, will be doing live demos of vRAN at their booths. If all goes as planned, Intel and Alcatel-Lucent hope to start installing their first data centers in commercial networks in 2016.

But Intel and Alcatel-Lucent face plenty of competition. [company]Nokia[/company] has announced its own competing network virtualization technology, called Radio Cloud, and [company]ARM[/company] is working with network semiconductor maker [company]Cavium[/company] to put its processors at the heart of a cloud mobile system. And not every vendor believes that Intel’s vision of a network running on off-the-shelf chips is feasible for something as complex as a mobile network.

In an interview at MWC, [company]Ericsson[/company] CTO Ulf Ewaldsson told me that while moving the mobile network into a data center is most definitely possible, replacing its specialized digital signal processing workhorses with generic processors isn’t. He likened the baseband processor to the graphics accelerator, which is still separate from the CPU of any computer or high-end mobile device today. Just as GPUs can render pixels much more efficiently than a general-purpose processor can, baseband processors can crunch signal data much more efficiently than any off-the-shelf chip, Ewaldsson said.

Facebook’s latest homemade hardware is a 128-port modular switch

Facebook has been building its own servers and storage gear for years, and last June announced its first-ever networking gear in the form of a top-of-rack switch called “Wedge.” On Wednesday, the company furthered its networking story with a new switch platform called “6-pack,” which is essentially a bunch of Wedge switches crammed together inside a single box.

The purpose of 6-pack was to build a modular platform that can handle the increase in network traffic that Facebook’s recently deployed “Fabric” data center architecture enables. The Facebook blog post announcing 6-pack goes into many more details of the design, but here is the gist:

“It is a full mesh non-blocking two-stage switch that includes 12 independent switching elements. Each independent element can switch 1.28Tbps. We have two configurations: One configuration exposes 16x40GE ports to the front and 640G (16x40GE) to the back, and the other is used for aggregation and exposes all 1.28T to the back. Each element runs its own operating system on the local server and is completely independent, from the switching aspects to the low-level board control and cooling system. This means we can modify any part of the system with no system-level impact, software or hardware. We created a unique dual backplane solution that enabled us to create a non-blocking topology.”
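
Doing the arithmetic on the figures quoted above is a useful sanity check (this is just the quoted numbers multiplied out, not additional detail from Facebook):

```python
# Sanity-check arithmetic on the figures Facebook quotes; nothing here
# goes beyond the numbers in the blog post.

ports_per_element = 16        # 16x40GE exposed to the front
port_speed_gbps = 40
elements = 12                 # independent switching elements per 6-pack
element_capacity_tbps = 1.28

print(ports_per_element * port_speed_gbps)   # 640 Gbps front-facing, as quoted
print(elements * element_capacity_tbps)      # 15.36 Tbps of switching silicon
                                             # inside a single 6-pack chassis
```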

In an interview about 6-pack, lead engineer Yuval Bachar described its place in the network fabric as the level above the top-of-rack Wedge switches. Facebook might have hundreds of 6-pack appliances within a given data center managing traffic coming from its untold thousands of server racks.

A 6-pack line card.

“We just add those Lego blocks, as many as we need, to build this,” he said.

Matt Corddry, Facebook’s director of engineering and hardware team lead, said all the focus on building networking gear is because Facebook’s user base keeps growing as more of the world comes online, and the stuff those users are sharing is becoming so much richer, in the form of videos, high-resolution photos and the like.

That might be the broader goal, but Facebook also has a business-level goal behind its decision to build its own gear in the first place, and to launch the open source Open Compute Project. Essentially, Facebook wants to push hardware vendors to deliver the types of technology it needs. If it can’t get them to build custom gear, it and dozens of other large-scale Open Compute partners, with their immense buying power, can at least push the Dells and HPs and Ciscos of the world in the right direction.

Corddry said there’s nothing to report yet about Wedge or 6-pack being used anywhere outside Facebook but, he noted, “Our plan is to release the full package of Wedge in the near future to Open Compute.”

If you’re interested in hearing more about Facebook’s data center fabric, check out our recent Structure Show podcast interview with Facebook’s director of network engineering, Najam Ahmad.

[soundcloud url=”https://api.soundcloud.com/tracks/178628647″ params=”color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false” width=”100%” height=”166″ iframe=”true” /]

Google had its biggest quarter ever for data center spending. Again

Google just finished off another record-setting quarter and year for infrastructure spending, according to the company’s earnings report released last week. The web giant spent more than $3.5 billion on “real estate purchases, production equipment, and data center construction” during the fourth quarter of 2014 and nearly $11 billion for the year.

[dataset id=”858606″]

As we have explained many times before, spending on data centers and the gear to fill them is a big part of building a successful web company. When you’re operating on the scale of companies such as [company]Google[/company], [company]Microsoft[/company], [company]Amazon[/company] and even [company]Facebook[/company], better infrastructure (in terms of hardware and software) means a better user experience. When you’re getting into the cloud computing business as Google is — joining Amazon and Microsoft before it — more servers also mean more capacity to handle users’ workloads.

Google Vice President of Infrastructure — and father of the CAP theorem — Eric Brewer will be speaking at our Structure Data conference in March and will share some of the secrets to building the software systems that run across all these servers.

But even among its peers, Google’s capital expenditures are off the chart. Amazon spent just more than $1.1 billion in the fourth quarter and just under $4.9 billion for the year. Microsoft spent nearly $1.5 billion on infrastructure in its second fiscal quarter, which ended Dec. 31, and just under $5.3 billion over its past four quarters. Facebook spent only just over $1.8 billion in all of 2014 (although that was a 34 percent jump from 2013’s total).

[dataset id=”912454″]

Spotify reportedly scraps Russian launch plans

The music streamer Spotify was all set to plow into the Russian market, having poached a former Google exec, Alexander Kubaneishvili, to lead the offensive. However, that plan has gone out the window for now.

According to Russian broadcaster RBC, Kubaneishvili announced the pause on Monday, citing Russia’s political and economic situation, as well as pending Russian legislation about regulating the internet. Spotify will not launch in the country “for the foreseeable future,” he said, adding that he does not work for the company anymore.

According to TASS, the firm is also shutting down its Russian office while it is still in its infancy. RBC reported that Spotify’s Russian launch had already been delayed because it had failed to agree on partnerships with local mobile operators, though TASS indicated some progress had been made with Vimpelcom. I asked Spotify for comment on all this, but the company declined to provide any.

Russia’s ruble is having a very rough time, largely due to the falling oil price and sanctions related to the country’s invasion of neighboring Ukraine and annexation of the Crimean peninsula.

Meanwhile, the country has also been pumping out various new laws designed to clamp down on internet freedom. The most relevant is probably Russia’s local data storage mandate, through which it intends to force web service providers servicing Russians to store their personal data in local data centers. This rule is set to come into force in 2016.

Untangling the data center from complexity and human oversight

Our investment thesis at Khosla Ventures is that simplicity through abstraction and automation through autonomic behavior will rule in the enterprise’s “New Stack,” a concept that embraces several industry changes:

  • The move to distributed, open-source-centric, web-era stacks and architectures for new applications. New Stack examples include Apache web stacks, NoSQL engines, Hadoop/Spark, etc., deployed on open source infrastructure such as Docker, Linux/KVM/OpenStack and the like.
  • The emergence of DevOps (a role that didn’t even exist 10 years ago) and general “developer velocity” as a priority: e.g., giving developers better control of infrastructure and the ability to rapidly build, deploy and manage services.
  • Cloud-style hardware infrastructure that provides the cost and flexibility advantages of commodity compute pools in both private data centers and public cloud services, giving enterprises the same benefits that Google and Facebook have gained through in-house efforts.

The most profound New Stack efficiency will come from radically streamlining developer and operator interactions with the entire application/infrastructure stack, and embracing new abstractions and automation concepts to hide complexity. The point isn’t to remove the humans from IT — it’s to remove humans from overseeing areas that are beyond human reasoning, and to simplify human interactions with complex systems.

The operation of today’s enterprise data centers is inefficient and unnecessarily complex because we have standardized on manual oversight. For example, in spite of vendors’ promises of automation, most applications and services today are manually placed on specific machines, as human operators reason across the entire infrastructure and address dynamic constraints like failure events, upgrades, traffic surges, resource contention and service levels.

The best practice in data center optimization for the last 10 years has been to take physical machines and carve them into virtual machines. This made sense when servers were big and applications were small and static. Virtual machines let us squeeze a lot of applications onto larger machines. But today’s applications have outgrown servers and now run across multitudes of nodes, on-premise or in the cloud. That’s more machines and more partitions for humans to reason with as they manage their growing pool of services. And the automation that enterprises try to script over this environment amounts to linear acceleration of existing manual processes and adds fragility on top of abstractions that are misfits for these new applications and the underlying cloud hardware model. Similarly, typical “cloud orchestration” vendor products increase complexity by layering on more management components that themselves need to be managed, instead of simplifying management.

Embracing the New Stack developers

Server-side developers are no longer writing apps that run on single machines. They are often building apps that span dozens to thousands of machines and run across the entire data center. More and more mobile or internet applications built today are decomposed into a suite of “micro-services” connected by APIs. As these applications grow to handle more load and changing functionality, it becomes necessary to constantly re-deploy and scale back-end service instances. Developers are stalled by having these changes go through human operators, who themselves are hampered by a static partitioning model where each service is run on an isolated group of machines.

Even in mature DevOps organizations, developers face unnecessary complexity by being forced to think about individual servers and partitions, and by creating bespoke operational support tooling (such as service discovery and coordination) for each app they develop. The upshot is the pain of lost developer time spent on tooling, the labor of provisioning, and the hard cost of underutilization that results from brute-force “service per machine” resource allocation.

We believe the simpler path for the New Stack is to give power to developers to write modern data center–scale applications against an aggregation of all of the resources in the data center, to build operational support into apps as they are created, and to avoid management of individual machines and other low-level infrastructure.

Delivering such an abstraction lays the foundation for an increasingly autonomic model where logical applications (composed of all of their physical instances and dependent services) are the first-class citizens deployed and automatically optimized against the underlying cloud-style infrastructure. Contrast this with the typical enterprise focus on the operation of servers as first-class citizens — a backward-looking approach that represents pre-Cloud, pre-DevOps thinking.

Distributed computing isn’t just for Google and Twitter

Turing Award winner Barbara Liskov famously quipped that all advances in programming have relied on new abstractions. That truth is even more pronounced today.

Most enterprises adopting New Stack today will have a mixed fleet of applications and services with different characteristics: long running interactive applications, API/integration services, real-time data pipelines, scheduled batch jobs, etc. Each distributed application is dependent on other services and made up of many individual instances (sometimes thousands) that run across large numbers of servers. This mixed topology of distributed applications running across many servers is geometrically more complex than an application running on a single server. Each service that comprises the application needs to simultaneously operate independently and coordinate with all of the interlocking parts to act as a whole.

In the above model, it’s inefficient to use human reasoning to think about individual tasks on individual servers. You need to create abstractions and automations that aggregate all of the individual servers into what behaves like one pool of resources, where applications can call upon the computation they need to run (the CPU/memory/I/O/storage/networking) without having to think about servers. To achieve optimal cost structure and utilization, the resource pool is aggregated from low-cost, relatively homogenous equipment/capacity from multiple vendors and cloud providers, and deployed under a uniform resource allocation and scheduling framework. This strategy avoids costly, specialized equipment and proprietary feature silos that lead to lock-in and ultimately less flexibility and manageability at scale.
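
As a deliberately simplified illustration of that aggregation idea, the sketch below places tasks against a pool of nodes using naive first-fit, so the caller asks for CPU and memory rather than naming a machine. It is a toy under stated assumptions, not how Borg, Omega or Mesos actually work, and all names in it are hypothetical.

```python
# Minimal sketch of "the data center as one pool": tasks request resources,
# not machines. Naive first-fit placement; real schedulers such as Borg,
# Omega or Mesos are far more sophisticated. All names are hypothetical.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cpus: float
    free_mem_gb: float

def place(task_name, cpus, mem_gb, nodes):
    """Put a task on any node with room; the caller never names a machine."""
    for node in nodes:
        if node.free_cpus >= cpus and node.free_mem_gb >= mem_gb:
            node.free_cpus -= cpus
            node.free_mem_gb -= mem_gb
            return f"{task_name} -> {node.name}"
    return f"{task_name} -> pending (pool exhausted)"

cluster = [Node("n1", 16, 64), Node("n2", 16, 64)]
for i in range(5):
    print(place(f"api-service-{i}", 6, 24, cluster))
# The fifth instance waits until capacity frees up somewhere in the pool.
```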

Google was the first to overcome the limits of human oversight of data center resources with this resource aggregation approach by building its own resource management framework (which was initially called Borg, then evolved into Omega). Twitter rebuilt its entire infrastructure on top of the Apache Mesos distributed systems kernel to kill the “fail whale.” You could argue that Google and Twitter — in the absence of innovation from the big systems players — created their own operating systems for managing applications and resources across their data centers. That simple idea of a data center operating system — although complex to create and execute in the first place — is what drove our most recent investment in Mesosphere.

We believe the adoption of this type of “data center OS” outside of the largest web-scale businesses is an inevitability. Even small mobile applications have outgrown single machines and evolve much more rapidly. Managing change as a process instead of a discrete event has become table stakes for CIOs, and daily changes in business models make data center resource requirements highly unpredictable. Elastic infrastructures are no longer a “nice-to-have.” And human beings and manual oversight have reached their limits.

Vinod Khosla 

Putting outages into the proper perspective

Which is worse: experiencing a cloud outage or waiting to experience a cloud outage?

Last month, Azure storage services went down and caused customers in the U.S., Europe, and parts of Asia to suffer. As Gigaom’s Barb Darrow reported, several Azure users were not happy. Of course, a status page should let users determine what exactly is going on, but Azure’s reported that everything was hunky-dory. Clearly that was not the case.

Notice that the other public-cloud providers, namely AWS and Google, did not toss stones at Microsoft. Why? They could be next. Indeed, a power outage at an AWS data center in California brought down services for customers on Memorial Day. And don’t forget about Christmas Eve two years ago, when Netflix customers tried to watch their favorite holiday movie only to have their hopes crushed when the streaming service was brought down by an AWS employee error that affected the company’s U.S.-East region. I was a victim of that outage, unable to get my “Santa’s Buddies” fix on Christmas Eve.

More recently, according to Gigaom’s Barb Darrow:

Amazon Web Services’ Content Delivery Network (CDN) experienced some glitches on Thanksgiving Eve, according to various reports all citing the AWS status page. According to that page, users experienced “elevated error rates when making DNS queries for CloudFront distributions: between 4:12 p.m. and 6:02 p.m. PST” on Wednesday, November 26.

So, what does this all mean? Pretty much nothing at this point.

In looking at the one-year status page at CloudHarmony/CloudSquare, things seem pretty good. While there are some bad outage records, most providers report more than 99.90 percent uptime (for computing clouds). The larger cloud providers, including AWS, Google, and Microsoft, have experienced some outages in some regions but stay right up near 100 percent uptime, for the most part. There are just a few exceptions: the outages experienced last month, a few bad apples listed near the end of the report, and providers that do not get much traction as public clouds.
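
For context, it helps to translate those percentages into hours. This is plain arithmetic, not CloudHarmony data:

```python
# What an uptime percentage means in hours of downtime per year;
# simple arithmetic, not figures from CloudHarmony.

HOURS_PER_YEAR = 24 * 365

for uptime_pct in (99.90, 99.95, 99.99):
    downtime_hours = HOURS_PER_YEAR * (1 - uptime_pct / 100)
    print(f"{uptime_pct:.2f}% uptime -> ~{downtime_hours:.1f} hours of downtime per year")
# 99.90 percent works out to roughly 8.8 hours of downtime a year;
# 99.99 percent to under an hour.
```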

There will always be outages. Cloud providers can improve their best practices, but they can’t change the laws of physics. Networks will fail, backup power units won’t kick on, and hundreds of other things can go wrong in a cloud data center. However, in the larger picture that does not seem to translate into much of an impact on service to the enterprise.

When it comes to the reliability of cloud services, the real metrics to consider compare internal enterprise systems to cloud services. If internal enterprise systems normally experience a certain number of outages, most cloud providers will find they can do better, even though the core reason that many enterprises push back on public cloud-based systems is a fear of outages and downtime.

Statistics help put the “cloud versus traditional systems” comparison into perspective. The average enterprise experiences three “business-interruption events” per year. The cost is around $110,000 per hour while the outage is occurring, with an average of five hours per interruption. That’s about $1.65 million per year in enterprise outage costs, and that’s in IT shops that are well managed.

Now, consider what you would have experienced in the cloud over the past year. According to CloudHarmony and my own experience, the likelihood is that you won’t experience a public cloud outage at all. If you do, it’s likely to be very short in duration, and hopefully your provider won’t do what Microsoft did, and will instead make sure you get a heads-up.

Internal enterprise systems are another story. The uptime records of many enterprises are much worse than those of the average public-cloud provider, and they end up costing the enterprise $1 million-plus a year. Of course, the trend is not to flog your enterprise IT staff for losing a network for a few hours, but it’s certainly okay to complain if an outside service, such as a public-cloud provider, takes an occasional dirt nap. That’s just human nature. If you’re looking at the impact to the business, you need to consider the true behavior of both traditional systems and those that exist in public clouds.

Based upon the data, it would be reasonable to assume that most public-cloud providers will have at least one outage a year, and that the outage will last for an average of two hours. Using our metrics for impact to the business, including the $110,000-per-hour cost, we’d suffer $220,000 per year in outage costs (see Figure 1). Again, that’s assuming there is an outage at all. Thus, the public cloud becomes a much better deal, beyond the operational cost savings and the ability to avoid capital expenses.
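
The back-of-the-envelope math behind that comparison, using the averages cited above (the single two-hour cloud outage is the assumed scenario from the previous paragraph, not measured data):

```python
# Back-of-the-envelope outage cost comparison using the averages cited above:
# ~3 interruptions a year at ~5 hours each on premises, versus an assumed
# single ~2-hour outage a year on a public cloud.

COST_PER_HOUR = 110_000  # dollars per hour of business interruption

def annual_outage_cost(events_per_year, hours_per_event):
    return events_per_year * hours_per_event * COST_PER_HOUR

on_premises = annual_outage_cost(3, 5)   # $1,650,000
public_cloud = annual_outage_cost(1, 2)  # $220,000
print(f"On premises: ${on_premises:,}  Public cloud: ${public_cloud:,}  "
      f"Difference: ${on_premises - public_cloud:,}")
```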

Figure 1: Even with many outages, public clouds are still a better bet when it comes to uptime, and avoiding the cost of outages.

Public-cloud services seem to be getting better at managing their cloud services, which includes avoiding outages. At the same time, enterprises continue to struggle with aging equipment, smaller budgets, and more challenging requests from the business. It’s no wonder there are outages. Indeed, I would expect there to be more.

The cloud is not a magic bullet for system uptime. The last thing you want to do is leverage public clouds without a good reason. The value of public clouds for your enterprise should be understood in the context of the business, including capital-cost savings, operational costs, and the cost of outages. You can’t make these kinds of decisions around only one variable, such as outages. The models for whether to move specific applications, databases, or even entire enterprises to the cloud are always complex, with dozens and dozens of variables to consider. Moreover, they vary a lot from enterprise to enterprise; there are some common patterns, but no one-size-fits-all answer.

So do outages matter? Of course they do, and you should consider outage data when looking at public-cloud providers. However, for the most part public clouds provide you with much better uptime than internal systems, and that’s a fact that most will admit to these days.

As we move into 2015 and 2016, I suspect we’ll see about the same number of outages as public-cloud providers, including AWS, Microsoft, and Google, continue to expand. Considering the growing capacity, however, the metrics for these services will actually improve if outages stay at about the same number of occurrences or fewer.

There will be providers that struggle to maintain an uptime record within a failing business. I suspect those will account for the majority of outages next year and in 2016, as the public-cloud market continues to normalize. The best path is to avoid them, which admittedly makes their situation even worse. You don’t want to go down with those ships, trust me.

Outages make interesting news articles. I’ve written them myself. However, for the most part, they don’t mean much in terms of any significant downside to leveraging public clouds.

How I learned to stop worrying and love commoditization

Technology entrepreneurs and investors alike have long regarded commoditization as a dark and dangerous force, a destroyer of high-margin businesses, to be avoided at all costs. An exciting new class of opportunity aims to eviscerate that dogma: the “commoditization accelerant.”