Even the CIA is struggling to deal with the volume of real-time social data

[youtube http://www.youtube.com/watch?v=isH8j0MPu-Y&w=560&h=315]
Session Name: The CIA’s Grand Challenges With Big Data.
Speakers: S1: Announcer S2: Ira “Gus” Hunt
If you don’t give a big round of applause for our next speaker, he’s going to find out and it’s going to go on your permanent record. He is Mr. Gus Hunt, he is the CTO of the Central Intelligence Agency and he’s going to be talking about the CIA’s grand challenges with Big Data. Please welcome to the stage Mr. Gus Hunt.
As the only person standing between you and lunch, I’m not sure this is the right place I actually want to be, but we’ll see if we can keep this interesting for you. I’m Gus Hunt, I’m the CTO at the CIA, I’m going to talk to you about stuff you’ve probably already heard as you’ve listened through the day, a lot of the same subjects. But I’m going to try and give it to you from our perspective of what’s actually going on in the world, why it matters to us and then what we think needs to change in order to actually enable us and I think the private sector itself to take advantage of Big Data.
If you think about the world that we’ve been into, Cloud is passé. It was so three years ago. Today we’re at the point where Big Data is so last year – all those breathless articles and all the front page covers – I was expecting Big Data to be Time’s Man of the Year. This year, what we’re really talking about is how do we get value out of the stuff, and I think that’s a lot of the conversations I’ve been hearing around, [inaudible].
In case you didn’t know what we did for a living, the CIA has three business lines. We collect information about the plans and intentions of our adversaries. We do this thing called All-Source Analysis where we bring the information we collect together with any information we can get our hands on so that we can tell the President and the Secretary of Defense and policymakers, everybody else, what it all means. And the third thing we do – and we’re the only agency authorized to do this by law, at the discretion of the President of the United States – is this thing called covert action. So these are the three things that we do.
So about four years ago when I took over as CTO of the organization, we sat down and said, “What is it that we have to be able to make sure we do well into the future?” And so we set what I call our four big bets. Big bet number one from four years ago was it’s all about Big Data. It’s all about our ability to take advantage of these massive information streams that have emerged in the planet so we can figure out what’s going on in them and protect national security. That’s what we do.
Number two – and this preceded all this talk about sequestration and things like that – the fact is we have a fiduciary responsibility to you, the taxpayers, to make sure that we execute every dollar we spend as well as possible. But when we think about [inaudible], this is not a lowest cost proposition. This is a best value proposition, and for us ‘value’ is defined as outcomes divided by cost and time. More outcomes in less time is a much better value thing to get done.
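The “value” definition above can be written down as a small calculation. This is only a sketch: the talk says “outcomes divided by cost and time” without saying how cost and time combine, so the product used here is an assumption, and the function name and units are hypothetical.

```python
def value(outcomes: float, cost: float, time: float) -> float:
    """Value as outcomes divided by cost and time.

    Assumption: cost and time are combined by multiplication; the talk
    does not specify this. Units are whatever the caller chooses.
    """
    if cost <= 0 or time <= 0:
        raise ValueError("cost and time must be positive")
    return outcomes / (cost * time)


# Same cost, same outcomes, half the time: the value doubles.
slow = value(outcomes=10, cost=2.0, time=4.0)
fast = value(outcomes=10, cost=2.0, time=2.0)
```

Under this reading, the cheapest option is not automatically the best one: a more expensive effort that delivers more outcomes sooner can still score higher.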
Three – and something that we very intently focus on – is that we have to act better as a community, despite what you read about the fact that there is massive dysfunction, that we don’t share information, all these things like that. It’s actually not true. We actually do a really good job. It’s just that like every organization within anything, just like in the private sector, we come at problems with different aspects and different angles, and that creates sometimes a little discussion about what’s the right way to solve some of the problems we face.
Number four, it’s all about people. If we don’t have the right talent we can’t execute the things that we do.
Then we said, to accomplish these things we’re going to have to have an enduring framework in which we’re going to invest. So we put up these things – these are things we call our six key technology enablers – these are the things that we intend to invest in for the long haul in order to make sure we are a viable, competitive organization heading into the future. They’re really simple things, and these are things that you know very well, but secure mobility for us is a huge deal. Mobile is not secure. Repeat after me: mobile is not secure. It really isn’t. So how are we going to make this secure in our environment so we can take advantage of it? This is a big thing.
The second thing up there is what we call advanced analytics. This is actually analytics as a service. It’s everything we want to be able to do with Big Data to do the jobs we have to do to be able to support the national security of our nation.
The third one up there is what we call widgets and services. We got into this thing through a thing we call the Ozone Framework. The Ozone Framework is a framework that the intelligence community developed based on the Google framework. Fundamentally, it’s all the same reasons you like your Smartphone and your iPad and things like that, that you can personalize it and put onto it the things that you need for your personal life or for your business life that matter to you. We need to build an environment where our analysts and our operators and other people like that can basically put on the necessary functionality that matters to them and personalize our world. We call this our WebTop, or device-top, or any other thing you want to call it.
Four – which by the way is three on the chart, and I’m not going to explain the odd binary numbering system, it’s a long story – security as a service. We don’t want you to have to build security from top to bottom every time you deliver or build us a system. What we want to be able to do is to have a set of security services, and the best practices out of what was the old service-oriented architecture world – anybody remember that world? I’m dating myself I’m sure. These are security services into which, what happens is, the widgets and the analytics above it have to talk to the security services in the middle in order to get to data and computational infrastructures and things like that below it. So the security services all have to be in common, and these are things we want to make sure are very consistently enforced across anybody touching any piece of data through any analytic. It has to be enforced through one of these security services.
Five, it’s all about data. I’m going to talk more about “It’s the data, stupid” in a second. We have this concept of data as a service and then a thing we call the ‘data harbor’. The data harbor is not a place but it really is all about us bringing to bear these massive computational engines that so many of you out in the exhibition hall have and are showing off. What we have discovered, or believe is true, is the fact that all the analytics up above often want to consume common sets of this large high-performance computational infrastructure underneath the covers.
What we want is an environment in which all our data, and these common massive computational infrastructures, are already in place so that it’s very easy for us to plug in a new idea or new capability on top because I can leverage what’s already in place underneath. In order to do all these things, it’s all about massive computational capacity and this funny little thing called the Cloud.
Do you ever ask yourself how big ‘big’ was? Because we do this all the time. I’m going to give you a quick run-through of how big ‘big’ actually is. You guys know Google. Google is a very big provider of things. Google stopped reporting how big it was, at least as far as we can find, about four years ago in their 2009 or 2010 SEC filing. At that time, they said they were more than 100 petabytes in size, more than a trillion indexed URLs. Pretty big stuff.
Facebook. Facebook as you know exceeded a billion users in August of last year, so they’re well over a billion at this stage of the game. What’s more interesting I’ve found, is that the latest numbers that are coming out about Facebook is that roughly 35% of all the world’s digital photography gets put onto Facebook.
YouTube. We believe that YouTube is the only Exabyte scale or bigger depository that we’ve been able to come across on the planet, at least in the public sector across the board. What happened was that YouTube in the last filing that we saw was about 768 petabytes. If you do the math on how much data is added to YouTube, what you find out is that from about three to four years ago, YouTube is clearly bigger than an Exabyte.
The world population back in about April ticked past the seven billion mark. Everybody talks about Twitter and how big Twitter is. Twitter is about 124 billion tweets a year, 4500 a second. Twitter is a piker relative to global text messaging, which is about 193,000 texts a second, of which 190,000 of them are generated by my daughter alone [laughter]. I have the bills to prove it.
But even that’s small relative to US cell calls. The US alone is roughly 2.2 trillion minutes a year – 19 minutes per person a day – which I find awfully small, again using my daughter as my measure of average. That’s about two orders of magnitude too small. But if you think about it, uncompressed that’s only a YouTube a year.
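The figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers from the talk, plus one labeled assumption: that “uncompressed” means ordinary 64 kbit/s telephone audio (8,000 bytes per second), which the speaker does not state. The variable names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

# Twitter: 124 billion tweets a year, expressed as an average rate.
tweets_per_second = 124e9 / SECONDS_PER_YEAR

# Global text messaging: 193,000 texts a second, expressed per year.
texts_per_year = 193_000 * SECONDS_PER_YEAR

# US cell calls: 2.2 trillion minutes a year at 19 minutes per person
# per day implies this many people.
implied_population = 2.2e12 / (365 * 19)

# ASSUMPTION (not from the talk): uncompressed voice at 64 kbit/s,
# i.e. 8,000 bytes per second of call time.
PCM_BYTES_PER_SECOND = 8_000
call_exabytes_per_year = 2.2e12 * 60 * PCM_BYTES_PER_SECOND / 1e18

print(f"tweets per second (average): {tweets_per_second:,.0f}")
print(f"texts per year: {texts_per_year:.2e}")
print(f"implied population: {implied_population:,.0f}")
print(f"call audio per year: {call_exabytes_per_year:.2f} EB")
```

Run this way, 124 billion tweets a year averages out a little under 4,000 a second, somewhat below the 4,500 quoted, which may reflect rounding or a different source. The call minutes imply a population of roughly 317 million, and under the 64 kbit/s assumption the audio comes to about one exabyte a year, consistent with “a YouTube a year.”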
So what’s making all this happen? I think you all know this pretty well. There are three fundamental driving forces which have been around now for the past several years and this is one little thing called ‘Social Mobile Cloud.’ It was this which drove so much of Big Data. In fact, Big Data was made real because of the combination of these three things. In the social world, things go viral in a hurry, and so they need to have a computational space that will scale elastically within [inaudible] brought the Cloud into existence. Everybody wants to be social and exchange information. All this together has conspired to deliver what we just talked about, some of the Big Data stuff that’s there.
This has been a dramatic increase in the velocity of innovation. Any of you who are start-ups today, do you actually ever go to your investment companies, except in very special cases, and tell them that you’re going to buy a bunch of hardware and hire a bunch of admins and you’re going to get started to do work – does anybody do that? It’s pretty rare. What do you do? You go, you swipe your credit card at Amazon or Rackspace or something like that, you get your capacity, you get started, you do some work. It allows you to go very fast, very cheaply, and you focus on what you want to build or deliver and not focus on trying to run this underlying infrastructure.
For our world, what’s happened is the Social Mobile Cloud has dramatically accelerated social change in ways that were totally unanticipated and I don’t think could have existed prior to these technologies coming into play. A classic example of this is the Arab Spring. It was the ability of the groups in the Arab Spring to continue to communicate, despite the fact that their totalitarian governments were trying to shut them off, that enabled the Arab Spring process and protest to come to whatever the fruition is that we’re going to find out here in a while. We’re still trying to figure out what all this means.
Fundamentally, in our world what’s really important is that this Social Mobile Cloud thing has completely altered the flow of information on the entire planet. When I started as an analyst years ago inside the CIA, the world was pretty simple. It was the world of the few-to-the-many in terms of information flows. You had NBC and CNN, and you had the Tass and the Times, and you had the Washington Post. What you had was the classic model of the few generators of information all telling the rest of us how and what to think, and that was how things were distributed. The Social Mobile Cloud world has completely inverted that model, and has gone to this complex many-to-many model, and I’ve got to tell you, we really liked the few-to-the-many model [laughter]. It was really easy to take advantage of this model. When everybody is talking and everybody is sharing information, what’s really interesting is that while there’s a whole lot of noise out there, there is signal we have to be able to find. That is, I think, one of the predominant problems of Big Data in the world: how do you find a signal in ever-increasing seas of noise.
If you think that’s complex and you think you know this – in fact the health guy from Aetna and others were just talking about this a little bit before – there’s these three emerging forces of nano, bio and sensors. You’re already a walking sensor platform. You guys know this I hope. Your mobile device – your Smartphone, your iPad, whatever it’s going to be – has got any number of these things. In fact, I think this is a limited list of what’s inside these devices and what’s going to be emerging inside these spaces. As you walk around – and remember, I told you mobile is not secure – you are aware of the fact that somebody can know where you are at all times because you carry a mobile device. Even if that mobile device is turned off. You know this I hope. Yes? No? Well you should [laughter]. ’Cause it’s really important.
What’s happened is that if you’re a Star Trek fan, like I was when I was a kid, what’s current now is that this mobile platform, your Smartphones, have turned into your Communicator, they’re becoming your Tricorder, and actually they’re becoming your Transporter. How do you get on an airplane these days? Do you walk up with a piece of paper like I do, because we don’t do mobile very well in my place? You walk up with your little boarding pass symbol and you wave it in front of the magic thing and you get transported to wherever it is you want to be able to go.
It’s also becoming your mobile health platform. Right now you can buy plug-ins for a pacemaker, they do blood-sugar monitoring, insulin control, all these health-monitoring things. The health industry itself is looking very hard at how they can begin to do remote health monitoring to you, so they can continue to pay attention to what’s happening to you and your body, and then to be able to do things like a remote tune-up. Gus talks very fast, and so I’m just very worried about the fact that somebody’s going to hack my remote tune-up and crank up my little pacemaker thing, and then I’ll talk a whole lot faster to you guys. But this is something we have to worry about because if you think about cyber-threats as they emerge, it’s not just against your business. Ultimately it’s going to be against you and your health. These are things that will be at risk if you’re not careful.
In fact, if you think about your mobile sensor platform, there’s a really cool little app – Activity Tracker. It’s a little Android app – have you guys seen this anywhere? What they’ve discovered is fundamentally they take your 3-axis accelerometer on your phone – I actually carry a Fitbit. You guys know the Fitbit, right? It’s just a simple 3-axis accelerometer. We like these things because they don’t have any – well, I won’t go into that [laughter]. What happens is, they discovered that just simply by looking at the data what they can find out is with pretty good accuracy what your gender is, whether you’re tall or you’re short, whether you’re heavy or light, but what’s really most intriguing is that you can be 100% guaranteed to be identified by simply your gait – how you walk.
Now this could be a really good thing. Think about this as a security app. If you’re walking along and you want to access your bank code, maybe it could become simplified because they can with absolute assurance know it’s you by your gait trying to do something with your bank. On the other hand, if you don’t want to be found or you want to protect yourself, maybe you don’t want to have somebody know what your gait looks like so they can figure out where you are at all times.
What’s curious is as you start to put these things together, the inanimate becomes sentient. We’re already seeing this happen. IBM is already talking about their Smarter Planet. Google has their self-driving car. You’ve got machines that know your needs – at the last CES show, did you read the article about the refrigerator that reads your items as you put them in and take them out, and sends you an email on your Smartphone to tell you “Get your milk”? My dystopian view of the future is the following: on Friday evening, I’m really tired, I’ve had to work late, I get into my self-driving car, I say “Take me home,” and it’s going to take me where? Safeway, to get the damn milk [laughter]. Why? Because it knows better, because you would totally go get the milk [laughter]. So some good things here, but also some maybe not so good things.
But when you put them together, this really works well because, if you think about this, the potential for this is enormously good. And you know this. Radical efficiencies in driving – the ability to dynamically route you in bad traffic so you can optimize your time and minimize your fuel consumption or something else like that – is a really great thing. We’ve already talked about social engagement, helping us get green, we’ve talked about how these are all really wonderful, great things.
To stop and prevent crime. Anybody see the recent article where they did a study – London’s the most camera’d city on the planet – and the argument for London putting all the cameras was it would help them stop crime. Do you know how many crimes they stopped that they can definitively tie to the camera? Anybody know the answer to this? One. So it’s calling into question some of these things.
The issue we face is – remember I talked about this big world of data from Social Mobile Cloud where you put in the sensor world and of course this becomes a real interesting problem space, particularly for us, because sensors are unbounded. They’re just little pieces of silicon that we want to be able to put on any place, they can go anywhere, they’re really simple to do. Sensors are promiscuous: they never met a signal they didn’t like. And they’re indiscriminate: they process any signal that they get. Then we’ve got this internet of things that was talked about earlier, everything becomes connected, because everything is [inaudible], so everything talks to each other, and the volume of this just explodes. So what humans are able to do pales in comparison to what’s going to emerge in the sensor-connected world. And that’s the really big challenge to our future.
You ask yourself, “Why do we care about these things?” We care because of the fact that there are signals in all this information that matter to us to be able to protect national security. We care because we have to understand what’s going to be going on in the world so that we can inform our policymakers ahead of these trends, problems and issues as they emerge. We care because we do want to stop the next underwear bomber before he gets on the airplane and tries to light his pants on fire. We care because – I have to be careful how I say this here – it can be a good thing for you and your friends to know where you are all the time. In my business, that could be not such a good thing. So we care about how this world evolves.
And we care because information is vastly different in this world than in the previous world of human-curated intelligence. This is a great chart down the bottom there. The purplish blob and the greenish blob. The green one is the world according to the universal decimal classification system, which when I was in school was called the Dewey decimal system. I’m dating myself. The other one is the world of information according to Wikipedia. Which one do I believe? Which one do you believe? I know which one I believe. I believe in the Wikipedia one.
What’s the impact of Big Data for us? The impact is, it really helps us understand what’s going on in the world to know what we know, so we know where our gaps are, so that we can do our job much more effectively. It takes us a long time with some very expensive assets in order to figure out how to fill in the gaps, and we don’t want to be collecting information that is unnecessary, that we can already find out through other mechanisms, such as what’s happening out in social media and things like that. This has some pretty profound implications, so what I’m going to talk to you about today in my last six minutes is what I call the four rules of Big Data.
Number one, it’s the data, stupid. Remember James Carville: “It’s the economy, stupid.” Two, it’s going to be power to the people. Three, we’re going to talk about latency breeds contempt. And four, everything is in context and everything is in your context in this future world.
Number one, the data, stupid. A little history lesson in our world – sounds really mundane to you guys, but this is a hard-fought and hard-learned lesson in our place: sophisticated tools, no matter how slick your tool is, if it doesn’t work on my data it’s fundamentally useless. Our users are going to opt every time to use a mediocre tool where the data exists, than to take the most sophisticated thing you can deliver me and tell me how beautiful and shiny this object is across the board. This is because our job is to figure what’s going on in the world of information. We have to put it together. We have to figure what’s the plans of our adversary. We have to connect the dots.
The problem of Big Data is the following: the database of useless information is 500 million gigabytes, the database of useful information is 5K. Our problem is, which 5K? Because we have learned through our long history that information has time value, much like money has time value, and the value of any information is only known when you can connect it with something else which arrives at a future point in time. If you throw away, in our world, information because you didn’t think it had any value, or you chose not to bring in or collect any information because it didn’t match what you thought your needs were at that moment in time, you won’t have information to connect together as new information and new events emerge in the world. So our problem is, since you can’t connect dots you don’t have, it drives us into a mode of fundamentally trying to collect everything and hang on to it “forever,” forever being in quotes of course.
Some interesting characteristics of Big Data which have emerged are really simple, like ‘more is always better.’ The signal-to-noise only gets worse in this world, but the reason why more is better is that it allows you to do enumeration, not modeling, to know what’s going on in your data. Anybody know George E. P. Box’s famous saying about modeling? “All models are wrong but some are useful.” The problem is modeling forces you to make assumptions up front which are all biased by your current view of what’s going on. We want out of bias and into actual understanding of what’s happening in the world.
The other thing is, users are not data scientists or data engineers. They don’t get the stuff. So what we want to make sure can happen in the data world is that we need to be able to imbue our information, the data sets themselves, with sufficient intelligence that the user doesn’t need to do anything more than ask a question in order to get value out of the data sets themselves. If they have to go into thousands of data sets and figure out which ones have information that might be relevant to the question that I’m asking, this is a losing proposition across the board.
Next, power to the people. I will tell you today that analytics and tools are hard to use, and that specialists are needed to drive value and we call these specialists data scientists, and we are actually establishing a new high priesthood of data sciences, because the information and the skills and the knowledge needed to do these things are very dense and can take a long time to acquire. The problem is, it takes a lot of hand creation and a lot of these things that are happening are not built for our business space.
This world of this new priesthood is driving these fields we talk about a lot – data scientists, information engineers, things like that. The data scientist, according to Wikipedia, has to have fundamentally all these skills. How many people in the planet have these skills? Not many. Granted, every university on the planet has started up a new sciences program, which is good news, but it sort of [inaudible].
Our belief: Big Data democracy wins. The goal we have is I have to be able to get the power of Big Data and the analytics into the hands of the average user. The only way that the real value is going to be realized by us, or even in the commercial sector and by individual companies, is when everybody has access to a tool and the data in order to get their jobs done and they don’t have to worry about it. Tomorrow what we want are really elegant, easy-to-use tools, the machines to do the heavy-lifting, and we want to get out of simple things like ‘search’: search is so broken in this peta-scale world that we’re talking about.
We have this thing we call the seven universal constructs for what we want analytics to do. We care about people, we care about places, we care about organizations, we care about time, events, things and concepts. What we want is for the analytics to be as simple to use as Excel functions. You go to Excel, you put in your little equations – equals, standard deviation, open parentheses, select a list of numbers, close parentheses – you get an answer back. You know that’s correct. We want a tool, say for people, where I want to understand the relationships between a bunch of folks – I want to say, equals, relationship between, open parentheses, list of names, close parentheses. And what do I get back? I get a nice network graph that explains to me how all these people are related in any number of different ways, based upon what I want to be able to do.
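The “as simple as an Excel function” idea can be sketched in a few lines. Everything here is hypothetical and for illustration only: the `relationship_between` function, the toy `CONTACTS` records and the shape of the returned graph are assumptions, not anything the talk or the agency describes.

```python
from statistics import stdev

# The Excel analogy: one call in, one answer out.
spread = stdev([2, 4, 4, 4, 5, 5, 7, 9])

# Hypothetical toy records: (person, person, kind of contact).
CONTACTS = [
    ("Alice", "Bob", "phone call"),
    ("Bob", "Carol", "email"),
    ("Alice", "Dave", "meeting"),
    ("Carol", "Eve", "email"),
]


def relationship_between(names, records=CONTACTS):
    """Return a small network graph limited to the people named.

    The graph is an adjacency mapping: person -> {other person: [contact kinds]}.
    """
    wanted = set(names)
    graph = {name: {} for name in names}
    for a, b, kind in records:
        if a in wanted and b in wanted:
            graph[a].setdefault(b, []).append(kind)
            graph[b].setdefault(a, []).append(kind)
    return graph


graph = relationship_between(["Alice", "Bob", "Carol"])
```

The point of the sketch is the calling convention rather than the data: the user supplies a list of names and gets back a structure that a display layer could draw as a network graph, without knowing which underlying data sets were consulted.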
I believe it’s got to be that simple for folks to be able to use. And we want people to be able to put these things together in ways that you can’t possibly anticipate, and we want them to be able to change so that they themselves can build much more complex outcomes based on fundamentally simple building blocks. This is a case where I want to be able to tell all the people involved in the Arab Spring, I want to understand how the sentiment analysis changed over time and put it on a map as a heat-flow. That’s all I want my user to have to do: simply draw that as some sort of visio-mechanism, and be able to see what comes out on the other end. We’ve got to keep it simple just for them.
Latency breeds contempt. It’s all about speed. Speed is the only thing that matters in our world and I think it’s going to be the only thing that matters out in the commercial side because, simply, we don’t want to wait. What drives my user nuts more than anything is when they think it takes too long for something to occur. So I think we’re moving into a world where this is already happening. We’ve got these equivalent real-time MapReduce jobs, getting out of MapReduce which is flexible, powerful and slow, and into a MapReduce which is flexible, powerful and very fast. We actually want to push into what we call peta-scale memory architectures to do distributed analytics and things like that. This is what’s driving all these technology shifts that you read about all the time. What we think this is doing is driving new competing architectures that will radically shift how things happen in the world.
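For readers who have not met the MapReduce pattern mentioned above, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) as a word count. It runs entirely in memory on toy input; it is not a distributed job and not any particular framework’s API, and all function names are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big signal"])
```

In a real cluster the map and reduce steps run on many machines and the shuffle moves data between them, which is where much of the waiting comes from. Keeping the working set in memory, as the talk suggests, is one way that latency gets attacked.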
Finally, it’s everything in context. In your context – and this matters, because this is the world that we believe we have to be able to build. It’s got to be in your frame of reference because anything else is somebody else’s frame of reference. So the purpose of the widgets is to enable you to build your WebTop, or whatever you want to call it, using the tools and capabilities you need to get the job done. What’s the purpose of all of the stuff that’s emerged in the Big Data world about schema on read? It’s data out in the context in which you need to take advantage of it. I want, as I said before, user-assembled analytics in the context of the problem and question you want to ask, and then all this takes computing in context to meet the demand of the job that you want to be able to run. That’s what elastic computing in our world is about.
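“Schema on read” can be illustrated with a short sketch: the raw records are stored untouched, and each question applies its own schema at the moment of reading. The sample records, field names and the `read_with_schema` helper below are all hypothetical.

```python
import json

# Raw events kept exactly as they arrived; no schema was imposed on write.
RAW_EVENTS = [
    '{"user": "a1", "lat": 40.7, "lon": -74.0, "text": "hello"}',
    '{"user": "b2", "text": "no location here"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> type) while reading, not while storing.

    Fields missing from a record come back as None instead of failing.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# Two different questions, two different views of the same stored data.
location_view = list(read_with_schema(RAW_EVENTS, {"user": str, "lat": float, "lon": float}))
text_view = list(read_with_schema(RAW_EVENTS, {"user": str, "text": str}))
```

Because nothing was discarded or reshaped when the data was stored, a question nobody anticipated can still be asked later by supplying a new schema.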
I think we’re at high noon in the information age. I say this because of the following. It is really very nearly in our grasp to be able to compute on all human-generated information. You know what’s nice about humans compared to sensors? You can only do so much stuff in 24 hours. The fact that you’re sitting here, taking notes, taking pictures, or just listening, you’re only doing that. You can’t do so many other things. You can only generate so much data. We’re at this point – and if you don’t believe me, let’s go back to my Facebook example where already one seventh of the world population and 35% of the world’s digital photography is already in one place – if you want to think about that and the things that they can do.
The inanimate is becoming sentient. When it becomes sentient, I told you my dystopian thing. We’ve got this third wave of computing which has emerged which are cognitive machines. Watson is the critical example of this that we can think about. The interesting thing about Watson is that Watson is to the cognitive machine as the original IBM PC 8088 is to what we can do today on existing machines. This is a world which is going to explode upon us, and cognitive machines are going to do everything from medicine, to financial trading, to helping us with our intelligence analysis across the board.
What’s happened is, technology in this world is moving faster than government or law can keep up. It’s moving faster I would argue than you can keep up. You should be asking the question of what are your rights and who owns your data. This is a question that I argue you’ve got to put on the table. As I mentioned before, it’s driving the pace of social change in ways we can’t anticipate and it’s creating an interesting world. I’m not going to talk about the cyber threat thing because I think we’re out of time. Thank you very much.
Thank you so much. That was awesome. I think we’re getting ready for lunch, but people can find you I’m sure, floating around. Thank you once again to Gus Hunt, CTO CIA. I don’t know about you but I’m going to go throw my phone in the river during lunch break. Lunch is right now, it is sponsored by Chartio, thank you Chartio. Visit the exhibit area out there, there are workshops. You saw in the first morning break that some of them had standing room only, so please be sure you get there soon. There’s an IBM-sponsored workshop in the Oceanic room, Oracle-sponsored workshop in Aquatania West, Mu Sigma-sponsored workshop in Aquatania East. Stop by the GigaOM research table, pick up your sector road map report. And pick up some lunch, it’s delicious. We’ll see everybody back at 1:50.