Why the cloud is ready for enterprise — but the paths connecting it are not

Session Name: Cloud Benchmarking: Toward Fair And Useful Cloud Econometrics

Sebastian Stadil
Ariel Tseitlin
Jeremy Koerber

Announcer 00:01
So, without further ado, here’s a topic near and dear to my heart, which is all the things around cloud economics. We’ve got a pretty cool panel coming up right now that’s going to talk about how we effectively define and leverage and understand all this stuff. Without further ado, Sebastian Stadil, CEO of Scalr, and a slight change from the agenda – somebody got lost on a plane, showing the importance of transport technology in today’s cloud – so let’s come out and clear up the haze and fog surrounding the cloud.
Sebastian Stadil 00:44
Thank you very much, Joe. Hi and welcome everyone. My name is Sebastian, I’m the founder of an open source project that allows you to federate and manage all your code and protect it from one place. [inaudible]. I’m here because a couple of months ago I wrote a post that benchmarked performance on Amazon versus Google Compute Engine. That got a lot of traction, and raised the question of comparing different clouds, comparing performance, comparing functionality and things like that. So on this panel here, we’re going to talk about what it takes for an organization to decide what cloud they want to use in terms of performance or functionality, what it takes to eventually move from one to another, and some of the tooling that they need. I’d like to introduce Ariel Tseitlin, director of cloud solutions at Netflix, and Jeremy Koerber, who is director of technical operations at BranchOut. Why don’t you guys introduce yourselves and some of the work you’ve been doing?
Ariel Tseitlin 01:54
Sure, so I manage a team at Netflix called the cloud solutions team, and we build, manage and operate the cloud that we use at Netflix, which is all primarily using Amazon Web Services. So we build a lot of the tooling and the operations as well as key parts of the platform for abstracting away the infrastructure from the rest of our developers at Netflix so they can focus on building applications and building all the product features that make Netflix unique. Tools like Asgard, Chaos Monkey and some of those other types of tools that you might have heard about.
Jeremy Koerber 02:30
I’m the director of Tech-Ops at BranchOut, a professional networking company. We were originally focused on jobs and recruiting but have recently expanded our service to be more focused on communications within the enterprise. I head up the design, deployment, maintenance and support of our cloud, which is 100% in AWS.
Sebastian Stadil 02:59
Ariel, you mentioned that you built tooling for Netflix. Proper tooling is pretty much the most important thing for productivity in the cloud. When you build that tooling, where do you stand on building things for performance? Do you measure performance, do you try to control costs, what does that tooling do?
Ariel Tseitlin 03:20
Sure. So for Netflix, the infrastructure cost as compared to the overall expenses that we have as a business is fairly low. Most of our expenses go towards content, acquiring content and producing our own originals. The next biggest chunk goes towards marketing and the acquisition of users and customers. Then a tiny little sliver at the bottom is what we pay for the actual infrastructure. So cost isn’t a huge motivator for us. Obviously it’s something that as a business we care about, but we prefer to optimize for innovation and for being able to have very high engineering velocity, and enabling the product organization to maximize the rate at which we can bring innovation to market. Cost is important, but it rarely drives the decisions that we make. A lot of our tooling is geared not so much around cost, but rather around how we abstract all of the pains and complications of running an infrastructure and running on top of Amazon away from the rest of our developers. Things like Asgard, for example, simplify the management and deployment of applications in the cloud. A tool that we just released earlier this week called Ice gives the part of the organization that actually is interested in cost, and in how costs are trending, visibility into how much we’re spending in the cloud and what usage looks like, with some interesting analytics around that. So most of our tooling is really around efficiency and giving us the capability to abstract away the platform from the rest of our developers.
Sebastian Stadil 05:06
Which allows you then to launch new markets, like the Netherlands just last week.
Ariel Tseitlin 05:09
That’s right.
Sebastian Stadil 05:10
Jeremy, can you tell us a bit about your tooling and the sort of things that you use for cost control and performance management?
Jeremy Koerber 05:17
Absolutely. Unlike Netflix, we don’t have our own in-house tools that we’ve built. I’m really interested in this Ice tool that we’ve just learned about, because we use a hodgepodge of third-party tools like Cloudyn and Cloudability. I’ve worked most with Cloudyn, being an advisor on beta features and things like that. We’ve found them very useful. For us, cost is a big factor; we’re a startup working towards profitability, and our Amazon bill is our largest cost, so we’re always working to minimize that impact on the business. Any tool that can help us find where things are being wasted helps, like reserve [inaudible] management within Amazon, which I think a lot of people would agree is pretty tricky. Prior to tools like Cloudyn coming out, with their unused reserved instances detector helping you find where those are being wasted and reallocate them, a lot of man-hours were spent on just that. Those are primarily the tools around cost and performance management, finding out where things are underutilized and where we can cut costs.
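To make the idea of an unused reserved instances detector concrete, here is a minimal sketch. It assumes reservations are matched to running instances by instance type and availability zone; the function name and the sample data are hypothetical, and this is not how Cloudyn or any other product mentioned on the panel actually works.

```python
from collections import Counter


def unused_reservations(reserved, running):
    """Return how many reservations have no matching running instance.

    reserved: dict mapping (instance_type, zone) -> number of reserved instances
    running:  iterable of (instance_type, zone) tuples, one per running instance
    """
    in_use = Counter(running)
    unused = {}
    for key, count in reserved.items():
        idle = count - in_use.get(key, 0)
        if idle > 0:
            unused[key] = idle
    return unused


# Hypothetical inventory: four reservations of one type, two of another.
reserved = {("m1.large", "us-east-1a"): 4, ("m1.xlarge", "us-east-1b"): 2}
running = [("m1.large", "us-east-1a")] * 3 + [("m1.xlarge", "us-east-1b")] * 2

print(unused_reservations(reserved, running))
```

With this sample data, one m1.large reservation in us-east-1a has no running instance behind it, which is the kind of waste Jeremy describes hunting for by hand.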
Ariel Tseitlin 06:46
So it would be fair to assume then that because in your organization the technical operations is a larger share of the overall expenditures, that cost is more important in your case because it’s mostly spent on licensing and it’s a lower amount.
Sebastian Stadil 07:01
Benchmarks and performance, those are one side of the equation. Functionality is the other. Have we gotten to the point where pretty much all clouds have the functionality that you need and you care about performance and that side? Or do you still expect more from the different clouds, Jeremy?
Jeremy Koerber 07:23
Again, I think you’re going to find pretty different answers between us [laughter]. For us, about 95% of the functionality that we need within the cloud is within EC2. We don’t really tie into too many of the AWS proprietary services, like RDS or DynamoDB; we pretty much just use pure EC2 and install open source software on it, so we primarily consume compute, storage and network resources. We need those to be as fast and as cheap as possible.
Sebastian Stadil 08:06
Ariel, do you agree with that commodification view?
Ariel Tseitlin
Like Jeremy said, we’re going to have a little bit of a differing view on this. From my perspective, and I think from Netflix’s perspective, we’re far from being in a commoditized cloud market; it really isn’t a utility yet, like we see it one day becoming. There’s a couple of different perspectives that lead to that conclusion. On the one hand, for Netflix, we use almost every Amazon service that gives us value, so we don’t use just raw EC2, for example, but only ASGs, so everything that runs is inside of an autoscaling group, because that gives it nice cluster management that we don’t have to worry about, and gives us great availability. We use Amazon’s availability zones, again for availability purposes. We use their load balancers, SQS for queuing, and SNS for messaging. Almost every service that you can think about, we have found a way to leverage. But that wasn’t enough – if you look at how much we’ve open sourced, that gives you a little bit of a glimpse of how much additional infrastructure and platform we’ve had to build. There was a huge amount of extra glue and extra services and extra tooling that we’ve had to invest to build up, and that gives you some indication of something that could be offered by a cloud sometime in the future. We’re trying to stimulate that ecosystem and that community by open sourcing everything as part of our Netflix services platform. We’re hoping the things we’ve invested in and the things we’ve built in order to run and manage and deploy a large-scale internet application are going to be useful for others as well, and can be used as a starting point for building a larger cloud platform. But it’s nowhere near where it needs to be. It’s in a great place, and it’s much better than running our own data centers and building all of that ourselves, but at the same time there is a lot more left in the evolution.
Even for Amazon to evolve, there’s still a long way to go. But then if you look at comparing Amazon’s feature set with the other cloud providers, it’s night and day in terms of the type of functionality that they offer. Not commenting on performance, but just specifically around feature breadth and feature set. There’s no other cloud provider that gives you as much functionality as Amazon does. It’s far from being commoditized.
Sebastian Stadil 10:36
Jeremy, you mentioned that in choosing you rely a lot on external tooling. How come you don’t rely on things like RDS or DynamoDB for some of your workloads?
Jeremy Koerber 10:38
Well, I think that the simplest answer to that would be the paranoia, or fear – justified paranoia – about vendor lock-in. We want to be able to potentially transport our infrastructure out of EC2 to another cloud, potentially GCE, if that makes more sense. We found that services like RDS can be prohibitive in that manner, both to get the data in and to get the data out. If you’re building your infrastructure from scratch, and starting with AWS, it might make a lot of sense to utilize RDS. But we were already running a bunch of MySQL on EC2, and we did a pretty heavy analysis on moving into RDS and just found that logistically it would be tough to get in and equally tough to get out. We just wanted to be able to be as nimble as possible.
Sebastian Stadil 11:44
So you’re architecting for reducing lock-ins so you can then play on the commoditization of different clouds?
Jeremy Koerber 11:50
Exactly, and we do like the traditional, very granular access to the box. There’s absolutely a benefit to some of the services that you offload to them if you have a small ops team, but we have a very good database administrator who really wants that root access to those boxes, and you do surrender some of that. I think they do a good job on it; RDS is a good product, but it’s mainly to keep us nimble.
Sebastian Stadil 12:23
And Ariel, I believe it’s the opposite for you. You guys use Cassandra, not DynamoDB. Can you explain a bit about the decision to use a certain Amazon functionality versus using something that is open source?
Ariel Tseitlin 12:37
Sure. Our preference is always to use something that is available if it meets our needs. Whenever we look at services that Amazon or anybody else offers – whether it’s Amazon or any other third-party commercial software – we always prefer to use something that’s already there that we don’t have to invest in and build ourselves. We have a lot of really smart engineers who are capable of doing great things, and we’d much rather they build features that are unique to Netflix rather than just infrastructure and tooling. When we looked at Cassandra and when we looked at what we wanted to do for cloud persistence, none of the services that were out there could do what we needed them to do, or could give us the capabilities we wanted out of our persistence engine. Dynamo wasn’t around at the time, and we decided to go to Cassandra when we were evaluating SimpleDB and SQL stores. Cassandra was the one that worked best for us. One of the reasons is that it’s written in Java, and we have a lot of Java developers and end up committing a lot of code back to Cassandra. The other really important requirement for us was the ability to have multi-regionally synchronized stores, which is something that Cassandra offers. Going back to the point that Jeremy was making around vendor lock-in and wanting to avoid it, and thinking about what happens if you want to move to another cloud provider: the alternative, if you don’t use the value-added services that a cloud provider or any other vendor gives you, is that either you end up having to build to the lowest common denominator or you end up having to invest in building that type of functionality yourself. Neither of those is particularly appealing, because you either end up with an inferior product or an investment that you can’t defer or avoid.
So if for whatever reason, you decide to use some kind of functionality that locks you into a vendor, you’re just delaying the work that you’ll need to do in order to implement it later if you decide to move to another vendor. We see it as taking advantage of the best of breed, and leveraging that functionality. And if for whatever reason we need to pivot down the road, then whatever investment we had to do today, we just postpone until later.
Jeremy Koerber 14:55
I have a little bit to add. With my example of RDS, they’ll manage backups and they’ll manage replication and everything like that. We don’t use AWS for that, but we do use Scalr for that, which is a tool that can lay on top of any cloud, and so we feel like we are still getting some of that, and fortunately we didn’t have to build it ourselves. We were doing that when we were previously hosted with RightScale. A tool like Scalr can come along and work with any cloud to provide you with automated backup, replication and many of those features that RDS does for you, and you get that value.
Sebastian Stadil 15:45
Ariel, what would it take for a cloud provider to seduce Netflix into using their platform instead of Amazon?
Ariel Tseitlin 15:52
That’s a tough one. The first thing we look at is just the functionality that we get. When we used Amazon we had to pay a pretty large pioneer tax, because a lot of the features weren’t built up yet, we were a pretty large scale service that ran on top of their infrastructure, so there was a lot of learning on both sides in order to leverage the cloud and in order to make sure the services are mature and scalable and can support that load and that use case. We don’t have a particularly strong desire to pay that pioneer tax again, with another cloud provider. So what we want to see is other cloud providers maturing and being able to reach the scale that Amazon has, as well as the feature breadth. Then it would be an interesting evaluation for us to look at if there were additional cloud providers. At this point we’re pretty happy with Amazon because they give us the features that we need, they’re a great company to work with, we get a lot of value from them, and I think they get a lot of value from us, usually because we’re a really large scale use case that showcases how you can use the cloud effectively.
Sebastian Stadil 17:04
Jeremy Koerber 17:06
As far as evaluating other clouds and the possibility of moving to another one, as I described, we’re not too tied in to Amazon, and I don’t think it would be too hard for us to get out. I don’t know that it would necessarily behove us. We haven’t done a really in-depth analysis; I think we’re waiting on products like GCE to mature a little bit, like Ariel was describing, and maybe prove themselves in the market a little bit more. Again, it mainly being a matter of needing compute, storage and network resources, we’re pretty well satisfied where we are. It would be no small undertaking by any means to move, and there’s really nothing that’s driving us away just yet.
Sebastian Stadil 18:00
What is some of the functionality that Amazon has that other cloud providers don’t have, that you really, really need for you to even consider moving?
Ariel Tseitlin 18:17
The autoscaling functionality and the cluster concept is something that we just wouldn’t be able to run without. We don’t run anything on bare instances, because we have Chaos Monkey that goes around and kills instances. We need to have the functionality of the cluster and to be able to abstract away the actual physical instance or the virtual instance into this notion of a service, and the service is run on a cluster that automatically scales based on load. That’s a really critical component for us. The load balancing piece is also really important – ELB is something that we use very heavily, and something we don’t have to worry about as well. Some of the other services are much more specific point solutions, so SQS and SNS and SES, almost each one of these three-letter acronyms – SWF too – are things that we play with and use in various smaller contexts. But as far as the broader system goes, it’s really ASG, ELB and Route 53, because we’re leveraging that for multi-regional active failover, which is something that we are investing in pretty heavily this year.
Sebastian Stadil 19:31
What about yourself, Jeremy? When comparing to, say, a provider like HP Cloud or Rackspace or Google Compute Engine, what are some of the sets of functionality that are currently not there that you wish they did have?
Jeremy Koerber 19:47
Well, not to sound like a broken record, I’ve looked at them, and like I said we haven’t done any heavy analysis of any of the other clouds, but there just really hasn’t been any business pain to drive us. We’re a small team and we don’t really have the resources to do in-depth testing and spin up pilot tests on other clouds to really see if they provide us with anything else. Functionality isn’t specifically what we’re looking for. We have a cloud management layer over the top that really provides a lot of that, so there’s just nothing that is really grabbing our attention outside of EC2.

Sebastian Stadil 00:00
What are some of the next big pieces of functionality that you really want Amazon to develop for you, that you really want the cloud to provide? What’s on your wish list for the next three years?
Ariel Tseitlin 00:14
Interesting question. Like I was saying earlier, one of the things that we’re investing in pretty heavily this year is the ability to run our service in multiple regions in an active-active mode. We’ve architected our service to be deployed in multiple availability zones with the notion that an individual availability zone can fail and the region will still be fully functional. We deploy across three different availability zones in every single region that we’re in – across the US and EU – but we still see every now and then – not often, but still too often for our liking – region-wide events that affect our availability, where availability zone redundancy isn’t enough. So we’re investing in building the infrastructure ourselves now, and some of the components, in order to be able to run active-active across regions – for example in the US, across US West and US East. And there are specific features out of Amazon that would make that much easier and much less of an investment for us, for example the ability, at the DNS level, to understand when ELBs fail within an individual region and to fail over fast to another region that’s a backup – that’s a really important one. Anycast is something that internally we’ve been talking about a lot, and we’ve been talking to Amazon, because that really gives you the ability to take DNS itself, as an extra choke-point, out of your flow, and it also takes out the DNS propagation delays that DNS failover introduces. Those are two really big things for us as we invest more in this active-active model, to give us resilience against regional failure, and against our own deployments failing in any individual region. Those are some specific things that would make it easier for us.
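The failover decision Ariel describes can be sketched roughly as follows. This is an illustration only, under the assumption that each region reports periodic health check results and that traffic should go to the first region in a preference order that has not failed several checks in a row. The class, function and threshold are hypothetical and do not correspond to any Amazon or Netflix API.

```python
class RegionHealth:
    """Track consecutive health check failures per region (hypothetical helper)."""

    def __init__(self, threshold=3):
        # Number of consecutive failed checks before a region counts as down.
        self.threshold = threshold
        self.failures = {}

    def record(self, region, ok):
        # A successful check resets the counter; a failed one increments it.
        self.failures[region] = 0 if ok else self.failures.get(region, 0) + 1

    def healthy(self, region):
        return self.failures.get(region, 0) < self.threshold


def pick_region(health, preference):
    """Return the first healthy region in preference order, or None."""
    for region in preference:
        if health.healthy(region):
            return region
    return None


health = RegionHealth(threshold=2)
preference = ["us-east-1", "us-west-2"]

health.record("us-east-1", ok=False)
health.record("us-east-1", ok=False)
print(pick_region(health, preference))  # traffic shifts to the backup region
```

A lower threshold fails over faster but reacts to transient blips; the DNS propagation delay Ariel mentions would sit on top of whatever detection time this logic adds.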
Sebastian Stadil 02:11
Jeremy, have you tried to do some region-to-region, active-active sort of thing?
Jeremy Koerber 02:17
No, we haven’t done anything between regions. We obviously and definitely utilize availability zones and keep things spread out across availability zones, primarily in US East right now. Our next step will be to move some stuff into another region. We’ve obviously been hit with some big outages and we’ve made changes on our side, and hopefully Amazon is making changes on their side. I think as far as the wish list question goes, one of the biggest things would be transparency into what’s going on. I think AWS took some criticism around that – really keeping customers informed as to what’s going on during an event, and potentially what we can be doing to get ourselves back up quicker. That’s one piece, and then just less network variability; we struggle with that a little bit even within a single availability zone, let alone inter-region transport. Really fortifying the network to give you a very stable experience would be on our wish list.
Sebastian Stadil 03:36
Circling back to the topic of benchmarks that we started off on, and since you mentioned the variability of performance over time, what are some of the things that you would care most about if there were public benchmarks that evaluated the performance of all the different providers, Ariel?
Ariel Tseitlin 03:55
One of the things that we really liked when Amazon announced them was their SSD-backed instances, because this was something that really made Cassandra in the cloud much better and more functional and less operationally painful. Having the ability to measure IOPS of instances in the cloud, and be able to compare them and see – How would a Cassandra workload work on that type of environment? – I think would be really valuable. Other than that, most of our scaling and performance really relies on horizontal scaling, and so we don’t worry too much about vertically scaling any individual instance or service, because the architecture that we chose is one where you can just add instances and services, and our discovery and the load balancing algorithm will just bring traffic in across those. Individual performance becomes less important, because it translates mostly into cost, which isn’t as big of a factor for us. I think IOPS is probably the main thing.
Sebastian Stadil 05:07
Jeremy Koerber 05:08
Yeah, I’d have to agree. The SSD-backed instances were huge for us. We were really relying on EBS – and this was before Provisioned IOPS was even released – and we did struggle a lot with the variability in disk I/O. When SSD-backed instances came about we were able to do a huge consolidation, which drove our costs down and gave us just exponential performance gains that we were really happy with. So it was a very successful project for us to utilize those SSD-backed instances, and they’re coming out with more there, and they’re continuing to raise the bar constantly. They didn’t just release IOPS; we’re getting announcements all the time about higher and higher IOPS levels. Disk I/O and network I/O, in terms of their speed and also reducing that variability, giving you a very stable and predictable experience over time, would be critical for us to evaluate in looking at another cloud.
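One way a benchmark could capture the stability both panelists ask for is to report the spread of repeated I/O measurements alongside the average. The sketch below is a minimal illustration that assumes a list of IOPS samples collected at intervals on a single instance; the function name and sample values are hypothetical and are not measurements from any provider.

```python
import statistics


def io_variability(samples):
    """Summarize repeated IOPS measurements taken over time.

    Returns the mean, the population standard deviation, and the
    coefficient of variation (stdev / mean); a lower CV means a more
    predictable disk or network.
    """
    mean = statistics.mean(samples)
    stdev = statistics.pstdev(samples)
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


# Hypothetical samples: same average throughput, very different stability.
steady = [1000, 1010, 990, 1000]
noisy = [400, 1600, 500, 1500]

print(io_variability(steady)["cv"])
print(io_variability(noisy)["cv"])
```

Both hypothetical series average 1000 IOPS, so a benchmark that published only the mean would rate them equally, while the coefficient of variation separates them clearly.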
Sebastian Stadil 06:12
Your workloads, are they mostly sequential I/O or random I/O? Can you talk a little bit about what type of I/O you want from the cloud?
Ariel Tseitlin 06:25
For Cassandra I think it’s a combination of both. For example, in writing SSTables it will tend to read entire SSTables and write SSTables out sequentially, but there’s also a very strong component of random I/O as well, from the way that the SSTable cache works, so it’s really a combination of both.
Sebastian Stadil 06:46
Jeremy Koerber 06:49
Yeah, it’s essentially the same. We’re pretty heavy on random I/O.
Sebastian Stadil 06:56
Excellent, so I guess if there are any people here working at cloud providers, what they should hear is that I/O is pretty much the most important thing, performance-wise. All right, thank you very much.