Want a better/greener/more agile data center? Use the data.

[youtube http://www.youtube.com/watch?v=pkWuOY2Y8Ws&w=560&h=315]
Session Name: Seeing Everything To Get Ready For Anything: Capacity Planning at Scale.
Speakers: Announcer, Dave Ohara, Tamara Budec, Heather Marquez, Amaya Souarez, Audience Member 1, Audience Member 2
Thank you David. Next we’re going to be talking about Seeing Everything to Get Ready for Anything: Capacity Planning at Scale. It’s going to be a conversation moderated by Dave Ohara, he’s a founder of GreenM3 and Analyst with GigaOM Research and he’s going to be talking with Tamara Budec, VP Critical Systems and Engineering at Goldman Sachs, Heather Marquez, Manager Asset Strategy and Optimization at Facebook and Amaya Souarez, Director Datacenter Strategy and Automation at Microsoft. Please welcome our next panel to the stage. [applause]
I’m Dave Ohara and I focus a lot on green datacenters, and one of the things I’ve figured out is that to really “green” the datacenter you have to know what’s going on. That got me introduced to my panel here, people who really understand what’s going on in a datacenter. What I’m really lucky about is that you don’t have to be told what their companies are. Let’s take a little bit of time for introductions first. Tamara, on my right, give a little introduction of what you do at Goldman Sachs.
Sure. I’m Tamara Budec with Critical Systems and Engineering, Goldman Sachs. My division supports the IT operations of the firm which does include the business side of our firm as well as the rest of the support services. So it’s basically we run critical systems environments.
I’m Heather Marquez. I run the Asset Strategy and Optimization team at Facebook in the infrastructure organization. So really we’re responsible for the data accuracy for all of the hardware that reside in the Facebook data centers worldwide as well as end of life management and optimization of the hardware fleet.
Hello I’m Amaya Souarez, I’m a leader on the Datacenter Services team at Microsoft. Microsoft has global data centers – we have more than 10, less than 100, hundreds of thousands of servers – it’s basically a massive footprint and my focus is capacity optimization, everything from assets to datacenter capacity on the floor, power, et cetera.
One of the first things we want to start with – and it’s appropriate we’re here in New York – is actually the impact of the Sandy Hurricane and what happened. Let’s first start with Goldman Sachs and tell the story of what some of that was going on from a data situation of getting ready for the Sandy situation.
Just to give a little bit of a context our firm has a very significant footprint here in New York City and New Jersey area. So when we saw the storm coming – which is nothing new, we all know we have bad weather East and West Coast, regardless – but there was something unique about this specific event. There were a couple of things that were known and a lot of things that were unknown. Considering that we have a huge concentration in this area we took everything very seriously. I’m just going to point out a couple of things that may be of interest for this audience as it relates to data and information. What was unique about this event or what represented for us one of the interesting challenges is how do you manage through real-time data that you receive from various telemetry devices in your facilities – and they range from temperature sensors, controllers that run various types of equipment – they come back with time-series data, log type data. Then you have information coming from the news channels, weather channels, information coming from your own internal teams hunting the information on the web.
At some point there is a tremendous amount of data with different quality and characteristics. You have structured information that is, as I said, almost like a time-series database, and you have the qualitative type of information – hey, this tide may be somewhere between 11 and 15 feet high. How do you prepare yourself for something like that? When they give you a range, do you look at your worst case? Is this indeed the worst case – can it get any worse than that? Again, the key here was to be able to run your operations by absorbing all the information, and then when you’re in it and you see how much of that information is actually coming at you, you start looking at how your database is structured, what your resiliency is, and where you can fail over, or where you have to fail over, your operations. We also faced curfews in some circumstances – police would dictate mandatory evacuations from certain areas – so you have to be prepared to run your operations remotely, assuming it’s, say, a datacenter or a similar type of facility. How do you vacate the building, go somewhere where you are safe, outside the evacuation zone, and run your facility remotely?
So these are the challenges that are very, very interesting to us. Again, we do a lot of dry runs of our disaster recovery strategies, so we put them in place as needed. Every one of these unique events has something that you learn from. One of the things for us was to look really carefully at where our databases are, how resilient they are and how the information is structured, in terms of being able to retrieve information as quickly as possible. Being a 140 year old corporation we have a lot of legacy systems, and when you’re facing a real-time event how do you drill down through platforms that are not horizontally scalable, where information is not so easily retrievable, and get that information so you know what is at risk? So that was interesting for us.
Adding on top of that and luckily you had no outages, you had nothing go down.
We were on generators, every single building was on generators.
Heather, during that same time-frame, what was the experience from a Facebook perspective of what was happening?
Facebook didn’t have a huge concentration of facilities in this area, so luckily we had that, but we do have East Coast facilities that were on alert, monitoring every hour – there was water that needed to be pumped out. We’re all about connecting users, so one thing that Facebook really wanted to make sure is that we are connecting users, we’re also working with the emergency services preparedness teams and connecting folks in that way, to make sure people are getting the help in the way that they need. We have a data scientist team that is obviously looking at that data and taking what they can and learn from it, such that we can be better prepared in advance for other natural disasters that may take place.
It’s interesting to see the outside events and how they can affect the company both from a datacenter perspective and from a user perspective – how do you prepare for that, how do you make sure that all the information you have is correct and that you have the right plan in place without putting too much or too little in one area? Our capacity load for our datacenters, we need to be able to move load real-time, such that if we ever did have a natural disaster hit one of our locations that we can move all the facilities or all of the applications to a different facility if needed. So that is something that’s very key for us, something we prepare for all the time. We did notice also anytime we have a spike in user growth we’ve got to make sure we have that capacity on hand to be able to handle that extra volume.
Amaya you can add on top of that…
The way I think about it is – when you’re building a cloud-scale infrastructure, you really need to start with acceptance. That basically means acceptance that hardware’s going to fail, servers will fail, circuits will go down, your generators may not come up when they’re supposed to, natural disasters are going to happen, there may be bugs in software, there’s human error. All these things are going to happen and you need to accept it and then build a cloud service that is resilient within the software itself. The nice thing about that is when you get to that point then you can really focus on optimizing your datacenter infrastructure and reducing the number of back-up generators you have. For example at our new Boydton site – the latest expansion – we did not put in a diesel generator back-up, because we have tons of data that we’ve been collecting since about 2006: real-time power-monitoring data, temperature, humidity and so on. All these things have allowed us to optimize the infrastructure – but it has to start with acceptance.
The things that are really interesting me about something like the Sandy storm – a huge, large event like that – is how could we in the cloud infrastructure business partner better with the utilities? Is it possible to do some scenarios like feeding power back into the grid? When we have the smart grid in the future. Maybe our failover happens within the grid, or we provide power to the grid. These are the things that I’m really intrigued by at this point.
So this kind of a disaster situation gave you a good summary of some of the issues of capacity planning. But let’s go more to the present, day-to-day activities. Michael Abbott – who used to be at Twitter and did operations there – explains the issues of how critical it is to understand the capacity of the infrastructure. So if somebody wants to jump in with talking about the day-to-day – I think you have a challenging job of people expecting you guys to know everything, because it’s like: I want to add more capacity.
With a scale-out type of distributed compute model, which is becoming more and more how we operate – we are heading in that direction, abstracting the application layer away from the physical infrastructure – we are faced with even more complex capacity planning challenges and asset tracking. When you virtualize your applications and you’re going through your systems you end up with an increase in the number of servers – your physical assets may be decreasing but the number of servers is increasing. So you have to have your databases that track assets refresh very frequently, and, as Amaya and Heather mentioned, you need quality and accuracy of the data.
Now on top of this, as I said, you have this capacity-on-demand type of model where you are constantly trying to manage it as tightly as you can so you don’t end up over-provisioning and having too much headroom, while at the same time assuring yourself of having enough capacity. In other words you have a core capacity – this is something you need 24/7 – you can’t give up that piece. Then you have peak capacity, which is usually based on some kind of historical trend or what certain applications run at. Then you have the so-called BCP or disaster recovery capacity, which is your failover capacity – you should be able to fail over from one location to another location in the same physical infrastructure and allow for that and continue to run the applications that are running in that facility to begin with. Then you have so-called burst capacity – this is the Lehman Brothers symptoms – this is the storm situations, like just a year ago the Dow Jones went down for like a thousand points and back up within three minutes just because there was a large block trade for S&P minis.
Nevertheless, there is this data information available in terms of, let’s say, what’s being traded real-time by various trading platforms where just one large trade can trigger a massive sell-off and buy-back, which is all done through analytics of information and triggering algorithms. All of that you have to add into your capacity planning. You’ve got an increased number of assets, you’ve got complex capacity on demand, and if you’re doing infrastructure as a service it’s capacity as you go, pay as you go, so you have to look at your charge-backs and see what is efficient and what is not. All of this is coming along with this distributed compute model, which is becoming the way we are heading into operating the platforms.
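The four tiers described above (core, peak, failover and burst capacity) can be pictured as a simple additive budget. The following is only a minimal sketch under that assumption, not any panelist's actual model; the class name, field names and kilowatt figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CapacityPlan:
    """Capacity tiers in kilowatts. All names and figures are illustrative."""
    core_kw: float      # needed 24/7, cannot be given up
    peak_kw: float      # extra load at historical peaks, on top of core
    failover_kw: float  # BCP/DR load absorbed from another location
    burst_kw: float     # allowance for rare market or storm events

    def required_kw(self) -> float:
        # Assumption: the tiers simply stack. A real plan may let
        # failover and burst share headroom instead of adding them.
        return self.core_kw + self.peak_kw + self.failover_kw + self.burst_kw

    def headroom_kw(self, provisioned_kw: float) -> float:
        # Positive means spare capacity; negative means a shortfall.
        return provisioned_kw - self.required_kw()


plan = CapacityPlan(core_kw=800, peak_kw=150, failover_kw=400, burst_kw=100)
print(plan.required_kw())      # 1450
print(plan.headroom_kw(1600))  # 150
```

Too much positive headroom is the over-provisioning case; a negative value means the site cannot cover every tier at once.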
One interesting point you just brought up is the issue of charge-back models, because that’s also a part of the cloud and [inaudible] made between the two of you, maybe to add to the issue of getting the business units aware of the costs. A lot of times this is the way that you can change the behavior – it’s when people just don’t realize what you made this request, oh I’m going to charge you back for it and people are like wow I didn’t know it was going to cost that much. So maybe that’s a type of thing, do you want to take off with that one?
Luckily Facebook is not doing charge-backs within the company, at the end of the day it’s all one budget, so–
Lucky you.
Yeah, lucky.
We were a culture built on data, so really data is the truth in our company and it’s great because it allows us to really produce value out of the data, rather than dealing with the silos of trying to prove why this team is right or this data set is right or wrong. So that’s one of the benefits, is big data is all about insight and making impact and if you’re not making impact with that data then all you have is just a bunch of data that you’re sitting on.
Unfortunately we do charge-back to all the different business groups at Microsoft and part of that is because of the way that Microsoft is structured, where there are separate P&Ls within the company. Yet we centrally manage all of the datacenter capacity and all of the datacenters within one organization. Back in about 2006 we had a realization as we started collecting data that we really needed to start allocating these charges in terms of kilowatts of power. This actually launched me on about a year and a half adventure to change the mindset – that we shouldn’t be thinking about rack units and racks, we should be thinking about kilowatts and power.
It’s been hugely beneficial because it’s made the entire company conscious of our actual energy resources. At this point in time basically all of the datacenter costs, as well as the operations center and some other parts of the organization are charged back in terms of kilowatts. We track all of our performance on a per-watt, per-dollar basis and this has allowed us to look at all this data and come up with more efficient models. We also now look at carbon emissions and we report that back as well and charge back to all the businesses carbon credits. It’s based upon the footprint, so let’s say it’s Hotmail has a different footprint from Windows Azure, we track all that on a daily basis.
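A charge-back expressed in kilowatts, as described above, amounts to splitting a shared cost in proportion to each group's power draw. The sketch below shows only that arithmetic; the function name, group names and figures are hypothetical and do not reflect Microsoft's actual rates or systems.

```python
from __future__ import annotations


def charge_back_by_power(total_cost: float, kw_by_group: dict[str, float]) -> dict[str, float]:
    """Split a shared cost across groups in proportion to their power draw."""
    total_kw = sum(kw_by_group.values())
    if total_kw <= 0:
        raise ValueError("total power draw must be positive")
    return {group: total_cost * kw / total_kw for group, kw in kw_by_group.items()}


# Hypothetical monthly figures, for illustration only.
usage_kw = {"mail": 500.0, "search": 1500.0, "cloud": 2000.0}
charges = charge_back_by_power(100_000.0, usage_kw)

for group, amount in charges.items():
    print(f"{group}: {amount:,.2f}")
```

The same proportional split could be applied to a carbon total instead of a dollar total.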
I was just going to add to that because I think that’s a really good point in the sense that you have an analytics team that broke that information down to handle that type of environment. Most companies could say OK I’m just going to charge you for the equipment you’re using, but having the data, the insight and the analytics to be able to come up with a business model that really works for that organization is definitely key.
It’s very true and then there’s also a lot of legacy applications. We do have massive cloud services – we have Bing, we have Azure, we have Office 365 – but there are still some older systems and then also when we purchase a company we want to transition them to the most cost-effective overall platform. So having that data really helps drive those discussions.
So maybe Heather, because you’re the different case versus a financial charge-back, I think a lot of what you’re trying to do is change the behavior of people. With a charge-back you can say: well, I’m going to change the behavior of people because you’re going to pay for it. But now you have the challenge of: how do I change people’s behavior through data? Maybe you could give an example of doing that or how it works, because again it’s this different way, which is actually more what this audience wants – it’s about big data and how to use it. So maybe you can use an example of…
Really it comes down to various projects in the organization. If the senior leadership says that we want to scale to x billion users, it really is something that – granted it’s a top-down initiative – but it’s also a bottoms-up enablement, all based on data in order to do that. We have a huge data scientist team, we have analytics teams – in almost every part of the organization – and they’re responsible for helping to achieve that goal. The transparency of that goal – and we’re a smaller company, we’re definitely not as big and haven’t been around as long as someone like Goldman Sachs – but that’s really where you get the data behind the initiative and when you’re all marching along that line it’s again a data driven culture.
I haven’t seen too many instances where we have those silos, but we do see a lot of instances where we have different teams with different data sets. Then it’s coming from one angle versus another to say okay, well tell me why your data set is so much better than this, especially if they’re telling two different stories. That’s when, if you have an analytics team that really understands that data from the point of entry to the point of exit and you understand the steps that it takes all along the way, you know where to look for gaps. Your data accuracy is so key because if you’re making business decisions on this data, you must have accurate data, and you must know what’s acceptable. A lot of companies say well if I’m 85%, 95%, is that enough? We get to the root of the problem as soon as we see a data gap or an issue.
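When two teams hold different data sets about the same hardware, one way to find the gaps is to reconcile the records key by key and measure how much of the combined inventory agrees. This is a minimal sketch, assuming each data set maps an asset tag to a location; the function name, tags and rack labels are hypothetical.

```python
from __future__ import annotations


def reconcile(inventory_a: dict[str, str], inventory_b: dict[str, str]) -> dict:
    """Compare two asset inventories (asset tag -> location) and report gaps."""
    keys_a, keys_b = set(inventory_a), set(inventory_b)
    shared = keys_a & keys_b
    conflicts = sorted(k for k in shared if inventory_a[k] != inventory_b[k])
    total = len(keys_a | keys_b)
    matched = len(shared) - len(conflicts)
    return {
        "only_in_a": sorted(keys_a - keys_b),   # missing from the second data set
        "only_in_b": sorted(keys_b - keys_a),   # missing from the first data set
        "conflicts": conflicts,                 # present in both, but they disagree
        "agreement": matched / total if total else 1.0,
    }


# Hypothetical records, for illustration only.
team_a = {"srv-001": "rack-12", "srv-002": "rack-12", "srv-003": "rack-14"}
team_b = {"srv-001": "rack-12", "srv-002": "rack-13", "srv-004": "rack-15"}
report = reconcile(team_a, team_b)
print(report)
```

The agreement ratio only says how large the gap is; each entry in the other three lists is a lead for tracing the problem back to its point of entry.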
I want to actually reference data that was recently released by Gartner Research Group. They polled about a hundred-something top CIOs, Fortune 500 and they published the top ten priorities for 2013 in business and technology. What ended up on the top of the list is the analytics and business intelligence. These days you can’t run a company any longer on the past quarter data – it can’t help. You have to know what’s happening today, or as close as possible. So the historical type of information – look-back information – is not as high quality as something that is happening on a day-to-day basis. When you look at it from that perspective – and it’s coming to understanding the CIOs and CEOs – they’re starting to understand that you have to pay sometimes to get that information. So charge-backs are not always bad, as long as you understand the data sets and you can take actions on that information.
I think we’ve all learned that [inaudible] some of the biggest problems with charge-back systems come when it’s not clear: oh, you’re just allocating the money and you’re spreading it like peanut butter, so I have to allocate – versus: no, I’m actually going to charge you back the true cost associated with the service you requested. When you have that accuracy connection, that’s when you have the closer link to the behavior. But the past ways of doing things – for example where I’m going to charge you x amount per network port, no matter what switch I connect to, what I have – that’s the old way of doing things, which was just a financial approach: I just have to allocate this money out at the end of the day.
It’s just one of the metrics to run the business, the cost per unit. Then there are productivity metrics that can to some extent justify some of this expenditure.
I have a comment which is that the data will help you in your discussions, but it’s not everything. It really does take a lot of personal interaction and commitment to that relationship. So let’s not take out of it the fact that even though you may have the data, that doesn’t mean that you’re always going to win some kind of argument because you’re all data driven – no. There’s many different ways to look at the same data also. It does take a lot of personal influence as well.
It’s about collaborating, sharing and learning from the data so that, I think, is key. Because it isn’t just about the data, you’ve got to develop the relationships, you’ve got to be able to see eye-to-eye and be able to look at things from a different point of view. That’s where understanding a different team’s – how did they come up with that metric, what is that based on? That is collaboration that allows you to see more and more into not just the production of the data but the analytics that go behind it, the algorithms that go behind it and pique the interest of those that may not be that savvy to the data scientist side, to be a little more curious about that.
What’s great is we’ve talked about the past, where we are in the present doing things. But from your perspectives if each one of you can share the future of where you think some of the issues are, either people aren’t thinking about now that they should. That would be really great from each one of you to take a moment and – you’re kind of nodding your head–
Yeah, I have a couple of things. One is that I actually have a team of data scientists, but recently I also hired a data quality – data steward is what we call that person. I just want to point out that I think this is a really important profession going forward as we have more and more data. It’s not something I could personally do, I’m not a detail person in that way, but having the right type of people who are going to maintain the data quality and really dig into it is going to be critical. The second thing is when it comes to cloud infrastructure and optimizing the infrastructure itself. I think I said earlier that I’m really interested in the smart grid and integrating that more with the facilities infrastructure. Those are the two things that are top of mind for me today.
Maybe Heather you want to go?
One thing I will say is we have all these amazing tools. Just hearing the talks yesterday, some of the new technologies and forward thinking that’s coming out from great folks such as these is really wonderful to see, it just shows the advancements of where we’re going with big data. One thing I want to emphasize is that – like Amaya said – it’s really about the data quality, how accurate is that? If your source data is not correct it doesn’t matter what you put on top of it – granted it may highlight some of those issues – but at Facebook when we find a gap in something we’re not just correcting the data, we’re actually tracing it back to the root. Obviously it broke down somewhere where that was point of entry, someone didn’t update something, dealing with those process issues – or system issues in some cases – it’s important that we take the time to do that. I think with the data stewards of the world and having people that really understand not only the data itself but the processes and systems that create that data is absolutely key to enabling such great tools out there.
There are two things on my horizon. One is – in general for us also – is integration, that’s the biggest challenge. We find it difficult to – and again it’s complex – to integrate the systems that ran fine for years, there’s nothing wrong with them, with this new technology with the different types of databases and then we end up creating types of hybrid solutions, run things in parallel. So the biggest challenge is really how you integrate your legacy systems with new data sets and the amount of information that is available. Most of you know high-frequency trading is definitely the way we trade these days – there are close to or more than 70% of all US equity trades that are high-frequency trading through machines. So there aren’t that many interactions with the market on a human scale, it’s all becoming driven by inputs of various types of algorithms and machines make decisions.
So post-Sandy as we were heading into the opening for business, everyone was scratching their heads saying is the stock exchange going to be open? If it isn’t down on Wall Street, are they going to run it from the back office operation? If they do that, it’s all electronic and I don’t think every firm is as ready as they want to be to trade with the stock exchange being all electronic. That’s a back-up system and how do you go into an opening of markets at 9.30 in the morning and – everything is basically electronic – how are your algorithms going to perform because there’s no human component?
Now the second thing that is of great interest to us is this aspect of operations and automating our datacenter operations. How do you take advantage of datacenter infrastructure types of platforms, the software-defined datacenter approach, where you basically let an application – application which basically means operating system – run your facility? It’s coming to be more of a trend. It’s hard to drop what you’re doing now and move into this mode of operation; engineers are not ready to let go of manually running the equipment. So it’s a big challenge, but there are definitely great advantages to going along that route. The software-defined datacenter brings with it a tremendous amount of efficiency and cost-benefit. But again, how do you jump into that model, or how do you transition and integrate with your current operations? That’s the challenge.
It seems like one of the issues that maybe somebody wants to comment on is the data that you now have to try to do this has now become mission critical. Whereas before I think in the early days it was like: oh here is all this data, I need it. But now as you’ve got in more, businesses are now critical and now it actually goes: wow, I need this data – more important than some of the others – to actually make the decisions. Now that has changed in dynamic, where it’s now versus an afterthought – which I think you’ve all seen this – migrate. That capacity planning was thought of after but it’s turned into a forward thinking thing. This is: I need this information to tell me forecasting, planning, where to go. Heather, you can comment how this has transitioned because you’ve worked at other places besides Facebook and you’ve seen this transition where again capacity planning was thought of like: you do this after you’ve bought it, is it going to work, but now it’s turned into…
Yeah, I’ve seen both sides of the coin where it’s reactive, we’re just in time, which is great also from a cost standpoint, having just-in-time capacity or just figuring out from your data that oh, wow, we need to get this up there now, like how do you be prepared for anything. And I’m now at the opposite side where at Facebook we’re planning 18 months in advance based on our data, and we can forecast based on our product schedule. Granted, we do have to be surprised – and we get surprised – by last-minute projects, last-minute product launches. We have to be prepared for that. So, how do you leverage your data to be prepared for everything without having that constant reactive approach? It’s a difficult transition to initially get ahead of that, but it has so much value. People are concerned with waste in doing that but it actually is better off in the long run.
I’d like to follow up on that actually, because we spent a lot of time over the last year taking all the data that we have – which, I’ve got to say, there’s never enough – to build predictive models in terms of what inventory levels that we need and what build sizes for new datacenters that we need. The demand that comes in from different businesses is forecasting, it’s going to be inaccurate and again you just need to accept that and then use the data to build the models that tell you when you need the next bit of capacity. I think that’s what you’re talking about because datacenter build times are typically 18 months – if you want to include site selection then we’re looking at longer term scenarios. In the way that we can move to more of a supply chain approach and have signals just for the land versus signals for built COLO versus pre-building out the cables and so on. So it’s a big investment of time and energy into modeling the data to come up with that.
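The idea of using data to tell you when you need the next bit of capacity, given a build time of around 18 months, can be illustrated with a straight-line trend. This is a deliberately simple sketch, not the predictive models the panelists describe; the function name, the demand history and the capacity figure are hypothetical, and real demand is rarely linear.

```python
from __future__ import annotations

from typing import Optional

BUILD_LEAD_TIME_MONTHS = 18  # typical datacenter build time cited in the discussion


def months_until_exhausted(history_kw: list[float], capacity_kw: float) -> Optional[float]:
    """Fit a least-squares line to monthly demand and return the number of
    months after the last sample at which the trend reaches capacity.
    Returns None when demand is flat or falling."""
    n = len(history_kw)
    if n < 2:
        raise ValueError("need at least two monthly samples")
    mean_x = (n - 1) / 2
    mean_y = sum(history_kw) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history_kw))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_month = (capacity_kw - intercept) / slope
    return max(0.0, crossing_month - (n - 1))


# Hypothetical demand history in kW, one sample per month.
history = [1000.0, 1050.0, 1100.0, 1150.0, 1200.0, 1250.0]
months_left = months_until_exhausted(history, capacity_kw=2000.0)
must_start_build = months_left is not None and months_left <= BUILD_LEAD_TIME_MONTHS
print(months_left, must_start_build)
```

In this example the trend reaches capacity sooner than a build could finish, so the signal to start has already passed. That is the reactive position the forecasting is meant to avoid.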
We’ve got a few minutes left – five minutes left – so we want to go ahead and open up the mics, so if anyone has any questions to go ahead and come on down. We have a couple of people coming up, so go ahead here on the left.
Tamara you mentioned software defined datacenters and I was wondering if you had a ROI analysis, or how do you see that affecting the business and how do you justify it?
I knew some question like that may come back! It is very, very difficult to justify ROI for something that’s going to increase your resiliency and your availability and optimize your operations, because it’s so hard to quantify. We can talk about up-time of five nines, which is 99.999%, which translates into roughly five minutes of outage per year. Does this mean five one-minute outages, or is it that every two years you are out for 10 minutes? So when you have that type of a challenge it’s very, very hard to run the numbers that will justify the expense. You really have to have a true belief that you will learn as you go, in terms of how much this is impacting your bottom line. But definitely efficiencies in terms of less overtime, scaling down engineers and manpower, and running buildings more efficiently from the energy point of view are definitely going to produce some return on investment. How much, and whether you are looking at a one year, three year or 10 year payback, is hard to tell.
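The five-nines figure above follows from simple arithmetic on the minutes in a year. A minimal sketch, assuming a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per year")
```

The budget for 99.999% comes out at about 5.26 minutes per year. The calculation says nothing about how that time is distributed, which is exactly the ambiguity raised in the answer: one five-minute outage and five one-minute outages consume the same budget.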
I have a question.
My question is for Tamara, but I think it applies also equally to the other panelists. Earlier you mentioned software defined networking but then also in the same discussion about how you currently have engineers who perform work manually – that there’s no automation there. So in a sense when it comes to sensors there’s rich data sets, when it comes to the actual work that’s being performed in operations in conjunction with applications provisioning resources, building out systems, migrating systems I get the sense that that’s where there’s less data. That you don’t get an ongoing, continuous stream of data, that you get data after the fact and it’s under-reported – and maybe those are some assertions – but are those the kind of things that you were speaking to?
Yeah, actually it’s a great point. Building automation is not new. There are BMS, building automation systems, that operate equipment automatically – that’s not an issue. I think you are hitting on a very important point and that is: with the software platforms currently available, we have less transparency at this point in time into what is running where. Which application has moved, or is running at that specific point in time, and what is at risk in terms of the impact to business? So the transparency from the VM to the utility switchgear hasn’t gotten there yet. We all know we need that information but we don’t have it yet, at least I’m not aware of any products or platforms currently. It’s definitely a grey, blind spot out there that needs to be addressed and given more options.
Anything you want to add? No? Any other questions? Mic’s open. One of the things I think that comes up, we can go back to this topic, is with the DR strategy. It’s interesting to say before the disaster happens how do you shift the loads? I think that’s again where the old world was it was just thought of: oh, I completely cut off this datacenter over here, whereas the reality of the world much more is like: no I’m constantly trying to figure out, I need to be able to shift loads, for example to reduce the risk.
I think the moving of the loads is less of a challenge – what is a challenge is connectivity. You are down if you lose connectivity, nothing is going to rescue you. So the key here is to really understand how you build resiliency around that aspect and I think Amaya you mentioned the other day that there was one cable run from Europe to US was–
Yeah one of the trans-Atlantic fibers was cut, or maybe two of them were cut I can never remember–
During the outage, right?
I don’t remember exactly, during the hurricane.
So if that’s the fiber that’s coming into your facility – either primary or a back-up – it becomes a problem.
Well, I think that actually there were many customers riding across that fiber that maybe just purchased specific pairs. Again, it gets back to this: you need to build resiliency all the way from the software through to the network and the datacenter, and hopefully we’ll get to the point, as you mentioned, where it’s integrated into the utility grid at some point in the future – I’m sure it’s going to happen, it’s just a matter of when and how we make it happen.
Well I want to thank each one of you for joining in this panel, this has been a great discussion and thank you our audience for listening to the challenges of capacity planning…