From Amazon’s top data geek: data has got to be big — and reproducible

Session Name: The Missing Manual For Data Science: Remix. Reuse. Reproduce.
S1: Announcer S2: Matt Wood
…Mr. Matt Wood. He’s a principal data scientist with Amazon Web Services. Please welcome Mr. Matt Wood to the stage.
Good afternoon everybody. Can you hear me at the back? Give me a wave. Thanks guys. So hello everybody. My name is Matt Wood. I’m the principal data scientist at Amazon Web Services. It basically means that I get to talk to smart people such as yourselves about how to best take advantage of the data that you have available. So you and your own internal users and customers, much like us at Amazon Web Services, spend quite a bit of time thinking about data. This is kind of the data timeline that I have in mind when I think about the full spectrum of using data: the generation of the data, the collection and the storage of that data, the analytics and computation that you want to do with that data, and ultimately the collaboration and sharing of those results and of those analytic tools.
Up until about five or ten years ago data science was limited by the generation challenge; the cost of generating data to analyze was still very, very high. You can look at how this has changed by looking at how people are working with some of the new generation of data generators. This is a great paper you can read in PLOS about using anonymized cell tower data to look at the movement of the cholera outbreak in Haiti. You can see ground zero there in the black and the movement of people away, and this is useful if you want to direct humanitarian aid. Likewise you can look at social networks, another fantastic data generator – an amazing data generator – and you can correlate what people are tweeting or posting about quite closely. You can see the R-squared value of .958 against the CDC data about an influenza outbreak. So you can start to use the data that’s freely available through these social networks to start to add a lot of value and track closely other metrics – again, for humanitarian aid or whatever it happens to be.
We’re looking at web apps as being amazing data generators – using web app logs. This is work that a customer of ours did – Razorfish – with a traditional big box vendor. They asked the question, who buys video games? And it turns out if you look at those logs, the terabytes and terabytes of logs that are being generated every day by these applications, it’s people who buy really, really big televisions. Not much insight there, but you can go and segment that further and say it’s actually people that buy big televisions and watch sport that then go on to buy sport-related video games. If you take that approach you can do more targeted advertising, and these guys saw a 500% return on their advertising expense.
All the way through to the mother of all data generators: this is an Illumina GA1 DNA sequence analyzer. You put your DNA sequence in through the little grey door there – that’s a username and password written on the Post-it note on the right there. Don’t do that! [laughter] If you lift that little grey thing up you can see a glass slide. You put your DNA on there. I did this a couple of years ago and if you take a look at my own data, you will see I have fast twitch muscles – which makes me a sprinter, as you can probably tell. I have a typical risk of male pattern baldness. I have a natural genetic immunity to the norovirus, which is very handy. I exhibit the photic sneeze reflex, so when I come out of a very dark room into a light room I will sneeze. It’s very common; if you see people coming out of the cinema sneezing they probably have this same polymorphism in their genome. I have a decreased risk of type 2 diabetes – again, very useful – and fascinatingly, my genetic make-up basically means that I will drink a quarter of a cup of coffee a day more than the average caffeine consumption. So very useful information, all driven by data.
When I talk about the decreasing cost of data generation I thought this was a really good way of visualizing it. This is the cost of sequencing a million bases of DNA over the past 10 or 12 years. You can see the dramatic decrease in the cost associated with generating this data. What happens is with this and social networks, and all the other web applications that are generating data, is that the economics are now favorable such that we’re going to generate more of that data. We’ve moved away from data generation being a bottleneck to doing data science, towards pushing that bottleneck further downstream into the collection and storage, and the analytics and computation – primarily, the infrastructure required to store and compute against these increasing data sets. These become the constraint of working with the data; they become the framing of how we start to think and ask questions of our data.
At Amazon Web Services we provide this utility computing platform that can start to remove some of the constraints of working with that data. And we see our customers starting to think not just in terms of, ‘Given the resources I have, what questions can I ask?’ but, ‘What question will advance the state of my business?’ And we see customers doing that on a very, very large scale. This is a partner of ours, Cycle Computing. They’re running a 50,000-core cluster up on AWS and this is the provisioning graph. It took about three hours to provision. The grey line shows the provisioned cores and the blue box shows the utilization of those cores, so you can see it steadily tracks up. They ran this for eight hours and then shut it down. We charged them about $10,000 to run it. And here you can see additional metrics that they’re tracking against this. This just shows the scale: if you really need that level of utility to start answering your questions, then it’s there at the end of an API. So, we’re starting to remove the constraints associated with the infrastructure. We’re starting to move those data center walls so we can ask more interesting questions of our data, and answer the questions that our businesses actually want to answer.
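The figures quoted for that cluster imply a rough price per core-hour. The following is a minimal sketch of that arithmetic, not something shown in the talk. It assumes all 50,000 cores are counted for the full eight hours (the talk does not say how the three-hour ramp-up was billed), so the result is only an order-of-magnitude estimate; the function and variable names are illustrative.

```python
# Figures as quoted in the talk; everything derived from them is approximate.
CORES = 50_000            # size of the cluster
RUN_HOURS = 8             # how long the cluster ran
TOTAL_COST_USD = 10_000   # "about $10,000"


def cost_per_core_hour(cores: int, hours: float, total_cost: float) -> float:
    """Return the average price paid per core per hour."""
    core_hours = cores * hours
    return total_cost / core_hours


# Assumption: every core is counted for the full run, ignoring ramp-up.
core_hours = CORES * RUN_HOURS
rate = cost_per_core_hour(CORES, RUN_HOURS, TOTAL_COST_USD)

print(f"{core_hours:,} core-hours at roughly ${rate:.3f} per core-hour")
```

Under those assumptions the run works out to 400,000 core-hours at roughly two and a half cents each.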
There’s still a lot of work to do here, but we’ve pushed that bottleneck further downstream. We’ve moved away from the analytics or the resources becoming the constraint associated with it. So what’s at the bottom here? What’s at the bottom is the collaboration and sharing. This adds a tremendous amount of value to the work that data scientists or computer scientists do – whatever you want to call them. It turns out that, unfortunately, a lot of statistical modelling, a lot of predictive analytics, is beautiful but unique. I think of this as being impossible to recreate. It’s kind of like snowflake data science; it’s very beautiful when you look at it under a microscope, but if you wanted to recreate those results it can be quite challenging. So reproducibility becomes a key arrow in the quiver of today’s modern data scientist; being able to take the tools that they’re already building but architect them in a way that allows them to scale out, not just in terms of infrastructure scale, but to deliver it to a broader set of collaborators and a broader set of users. We want people to be able to reproduce, reuse and re-mix the data science, the algorithms and the data that they’re using within those algorithms.
This can provide a tremendous amount of value. In fact, if you’re not taking this approach you’re actually leaving quite a lot of value back on the table. It’s relatively easy to do this in utility environments, but it’s not the current state of affairs. In fact, as I was preparing for this talk a friend of mine posted some new analytic tools onto GitHub here – and here you can see the readme. I’ve anonymized this to protect the guilty party – primarily because he actually does some fantastic work – but that’s a backtick in the readme. That’s the entire readme: one line, one backtick. So the question is: how do we get from publishing and producing analytics tools to moving into an area where we can reproduce the results that are generated by those tools – where we can start to remix both the analytic tools and the data that are provided by different people?
In the next 10 or 12 minutes I’m going to walk through five principles of reproducibility that we’ve seen from some of our customers at Amazon. The first principle is that data has gravity. We can take advantage of that. We’re all dealing with increasingly large, increasingly complex data collections that are very challenging to obtain and they’re very challenging to manage, so they’ve become very expensive. It becomes very, very expensive to experiment with that data in a traditionally provisioned infrastructure environment. Because of that complexity and that cost and that high barrier of entry, you get a large barrier to reproducing the results from even internal collaborators, let alone wanting to share that in a more open way across the world. The core here is that we used to be able to move data to our users. With the original human genome we actually put our data onto an mp3 player and just FedEx’d it around the world. That was our data coordination mechanism. You can’t do that in today’s world where we have these amazing data generators producing this data at such a high throughput. You can’t do that anymore. You have to start moving the tools to the data, and you have to place your data where it can be consumed by those tools. Simple storage is not enough; you have to have storage alongside compute capabilities and you need to be able to have programmatic access and high levels of availability to those resources. Conversely, if you’re building tools, you want to be able to place them where they can access data which other people may be providing. Again, that may be in an open context with things like the CommonCrawl – which is an open web archive that anyone can run through, available on S3 – or in more tightly controlled areas. You want to be able to place the tools where they can access the data.
We’re moving from a world where we used to basically take this approach: where the circles here are the data – relatively small, easy to manage, you can throw it on an FTP site, put it on an mp3 player and send it around the world – and the triangles are the people actually consuming it. We’re moving more towards this sort of model where customers have a canonical source of their data in an environment where they can elastically provision the resources that they need. So these little triangles can just pull down the data that they need. We see this quite commonly with data in S3, where each of these triangles can actually represent individual clusters which are doing individual tasks – they call them task clusters. They are all working, all the time, against the same data in S3, and we can do that primarily because the Hadoop that we provide with Elastic MapReduce will treat S3 as if it was HDFS.
We end up in this area where we have more data, we have more users, we have more uses of that data and we have more locations where that data wants to be used. All of these act as a force multiplier for people who want to work with the work that your data scientists are doing – not least because they want to be able to do that at as low a cost as possible. All of these end up being a force multiplier for complexity and a force multiplier for cost in a traditionally provisioned infrastructure. You should be in no doubt that cost and complexity will kill the reproducibility that we should be aiming for; it will kill the ability of somebody to take your work internally and reproduce it either in one year or five years down the line, or to mix in their own set of skills and their own set of data and their own insight and their own experience into your work. So that’s the first thing. It follows on nicely that the second principle is really that the ease of use of your tools becomes a prerequisite for re-mixing and reproducing and adding value to the work you’re already doing.
Kathy Sierra, who is a fantastic writer and used to run the “Creating Passionate Users” blog – which if you haven’t read, I highly recommend – published this graph: it shows how the ability of a system’s user changes over time. You can see here that if you’re good then you’re getting your users over what she called the “suck threshold.” This is where they weren’t experienced enough to get the full value of your service – the full value of your system. You want to get them over that “suck threshold” where they start to feel confident and they start to deliver value for their own ends very quickly. The quicker you can do that the better – all the way up to the point where you turn them over the “passion threshold,” where they become passionate evangelists, passionate advocates for your service. Then you’re going to get more users across more locations, and you’re going to add even more value to the work that you’ve already done. You have to help your users overcome this “suck threshold”; you have to make your service easy to embrace and extend, and choose the right abstraction level for the users that you’re addressing.
At Amazon one abstraction is ec2-run-instances. This is the API call to start provisioning compute resources. It’s a perfectly reasonable place to start. Other customers want a slightly higher level. There’s an open source tool from MIT called StarCluster. Here you just type starcluster start and it will go off and provision your fully MPI-enabled cluster on EC2. It will install Hadoop and any other tools which you want, along with a bunch of analytics tools like NumPy and all the rest of it – that’s very useful if customers are used to just dealing with a queue; it will present them with that queue and they can get up and running – all the way up to a platform level like Globus Online where they just point and click to shuttle their data around. This gives you the opportunity to start to package and automate the tools that you’re building, and basically present yourself as an expert as a service, building all of your expertise into a programmatically available resource.
We’ve got some really good examples of how customers are doing this today. In this canonical model you can imagine building out and packaging a tool which customers can place in front of their data. This may be as simple as a script or a shim to load that data in, again reducing the [SAP] if you have complex data that you want to work with. A good example of that is actually the Thousand Genomes Project. This is probably the world’s largest collection of genomic variation data. It’s about 200 terabytes in size, it’s 1,700 genomes, and it’s very, very challenging for customers to work with data at that level when they’re used to working with single, individual genomes that are just a couple of gigabytes. There’s a CloudBioLinux distribution available, and all you have to do is spin that up on EC2 and it will mount the Thousand Genomes data as if it were disks. Then you can take your same scripts and you can start working with that data as if it was on a disk. It’s actually hosted back on S3 in our public datasets program, where we’ll host large collections of data which are useful to the community for free. Or you can go a step beyond that. We have a customer called Illumina who built that sequencer that you saw earlier. They allow you to not just take that Thousand Genomes data; they put a platform in front of that to expose the tools at the right level of abstraction – again, to get their customers productive with that data as soon as possible, and they can mix in their own genomic data alongside that.
Netflix have a really nice way of approaching this as well. This is the very high level view of Netflix’s architecture, where they have their Cassandra cluster recording all of the data that’s being generated by the production platform, going through a transformation process that they call Aegisthus. It sits all of that in S3 and then they have Hadoop, Hive and Pig scripts that sit alongside that. They expose this in various different ways, depending on the customers internally that they want to work with. They expose that through traditional business intelligence tools like MicroStrategy, they make everything available through [AR], and they also have an internal tool called Sting, which is like an in-memory cache. That allows their customers to point and click when necessary and refine and play with the queries in real time – this is a screenshot of Sting. Providing that right level of abstraction is really important. Just as important is building and architecting your tools so that customers can not just play with them, but reuse and re-mix them as they go forwards.
Professor Carole Goble, who is Professor of Computer Science at the University of Manchester, gave a legendary talk many years ago that basically changed my life. She outlined the seven sins of bioinformatics and it turns out these are as true today as they were seven or eight years ago when she first gave the talk. If you really start to classify the various different sins – generating [inaudible] new unique identifiers every time you create a new database and all these sorts of things – the top six boil down to just one thing: that is that data scientists, computer scientists, bio-mathematicians and analysts are basically hackers. They like to have the right level of access and they like to be able to hack on the data and hack on the tools that you make available to them. So they have their own way of working for the most part; they’ll have their own tool set and they’ll have their own workflow, and building an understanding of that is something you can use to your advantage.
You should try to be wary of the big red button. This is something that with reproducibility is very tempting to set up. It’s basically the ‘fire and forget’ approach where you set up everything in a very simple way and you just have the user click a big red button and it will go off and provision and reproduce the results for them. That’s a very good first step, but it limits the longer term value of your data science. Monolithic one-stop shops become limiting down the line; they work well for a single, intended purpose, but they’re challenging to install and they’re typically dependency heavy – they’ll have lots of moving parts internally. That makes them very difficult to grok; it makes them difficult to hack against; it makes them difficult to understand in their entirety and very difficult to embrace and extend. But if you work backwards from the fact that your customers who are going to be consuming these tools are hackers, and they’re going to want to understand the tool in its entirety, you can start to embrace that to add additional value. So you can move more towards the Unix philosophy of having small things that are loosely coupled. This makes it much easier for you to publish those tools and make those tools available and it’s much easier for your customers to grok, understand and hack against those. So it’s much easier to reuse and integrate loosely coupled tools than it is big monolithic tools. That creates a much lower barrier of entry for people who want to work with your tools.
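One way to picture "small things that are loosely coupled" is a handful of single-purpose functions that each take and return a plain stream of records, so a collaborator can swap any one of them out. The sketch below is illustrative only and was not shown in the talk: the `customer,product` line format, the sample data and the function names are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def parse(lines: Iterable[str]) -> Iterator[Pair]:
    """Split each 'customer,product' line into a pair, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            customer, product = line.split(",", 1)
            yield customer, product


def only_product(pairs: Iterable[Pair], wanted: str) -> Iterator[Pair]:
    """Keep only the pairs that mention one product."""
    return (pair for pair in pairs if pair[1] == wanted)


def count_customers(pairs: Iterable[Pair]) -> Counter:
    """Count how many times each customer appears."""
    return Counter(customer for customer, _ in pairs)


# Hypothetical sample log lines, standing in for real application logs.
sample = ["alice,tv", "bob,game", "", "alice,game", "bob,game"]

# Each stage only depends on the shape of the records, not on the others.
game_buyers = count_customers(only_product(parse(sample), "game"))
print(dict(game_buyers))
```

Because each stage only agrees on the record shape, someone else can replace the filter or the aggregation with their own without having to understand the rest.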
The fourth principle is building for collaboration. If you work backwards from the customer you start to see that the workflows that you’re producing – the analytics tools mixed with the data, mixed with the infrastructure – are basically memes in that they benefit from social interaction and they benefit from being able to share. The more they are shared the more value is added to those workflows. So reproduction is just the first step to allowing that. Where things start to get really interesting is where you start to drill down beneath the reproducibility of those results into the different bill of materials – the code, the data, where that data sits, the various configuration options for your infrastructure, your analytics piece – and creating a bill of materials that a customer can take on board and programmatically recreate. You want to provide a full definition for reproduction, and utility computing provides a playground for data science because you can expose that full bill of materials. You can expose the data, the code and all of the configuration options as a collection of hackable elements – a collection of building blocks that they can piece together in any way that they want. The code, the machine images, custom data sets – all of these things start to be integrated in various different ways and your customers can roll in their own data or roll in their own code and remix those in any way that makes sense for them.
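A bill of materials like that can be as simple as a small, machine-readable manifest that travels with the results. What follows is a minimal sketch under stated assumptions, not a format described in the talk: the `BillOfMaterials` class, its field names, and the repository URL, image name and bucket path are all hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class BillOfMaterials:
    """Everything a collaborator would need to recreate one analysis run."""
    code_repo: str       # where the analysis code lives
    code_version: str    # the exact tag or commit that was run
    machine_image: str   # the machine image the code ran on
    data_location: str   # where the input data sits
    config: dict = field(default_factory=dict)  # infrastructure and tool options

    def to_json(self) -> str:
        """Serialize the manifest so it can be published next to the results."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "BillOfMaterials":
        """Rebuild a manifest that someone else published."""
        return cls(**json.loads(text))


# All values below are made-up placeholders.
bom = BillOfMaterials(
    code_repo="https://example.com/analysis.git",
    code_version="v1.2.0",
    machine_image="image-placeholder",
    data_location="s3://example-bucket/dataset/",
    config={"instance_count": 4, "seed": 42},
)

# A collaborator can load the published manifest and change only what they need.
restored = BillOfMaterials.from_json(bom.to_json())
remixed = BillOfMaterials(**{**asdict(restored), "data_location": "s3://example-bucket/my-data/"})
print(remixed.to_json())
```

Keeping each element as a separate field is what makes the remix step a one-line change: the collaborator swaps in their own data location and leaves the code, image and configuration untouched.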
The fifth and final thing that I will touch on quickly is the importance of provenance in all of this. Understanding and maintaining provenance – that is, the background to your tools and the data and understanding the metadata around that and how it’s being used – becomes a critical element of enabling these sorts of approaches. Versioning becomes really important, especially in an active community where you have a lot of people doing a lot of things, and doubly so when you have loosely coupled tools, because you’re going to have a lot of versions of those integrating in various different configurations over various different versions. Provenance metadata becomes a first-class entity. You also start to build up a distributed provenance tree, building up not just the code relationships but the data relationships between each of your different areas.
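To make provenance metadata concrete, one possible approach is to record, for every data set, a content hash, the tool and version that produced it, and the hashes of the data sets it was derived from. The sketch below illustrates that idea only; the talk does not describe a specific scheme, and the record layout, file names and tool names here are hypothetical.

```python
import hashlib
import json
from typing import List


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 content hash that identifies this exact data."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(name: str, data: bytes, tool: str,
                      tool_version: str, parents: List[dict]) -> dict:
    """Describe one data set: what it is, what made it, and what it came from."""
    return {
        "name": name,
        "sha256": fingerprint(data),
        "tool": tool,
        "tool_version": tool_version,
        # Links to parent records by hash form the provenance tree.
        "parents": [parent["sha256"] for parent in parents],
    }


# Made-up example: a raw file and a cleaned file derived from it.
raw = provenance_record("raw.csv", b"a,b\n1,2\n\n",
                        tool="collector", tool_version="0.3.1", parents=[])
clean = provenance_record("clean.csv", b"a,b\n1,2\n",
                          tool="cleaner", tool_version="1.0.0", parents=[raw])

print(json.dumps(clean, indent=2))
```

Because each record names the tool version as well as the parent hashes, two results built from the same data by different versions of a loosely coupled tool remain distinguishable.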
Those are my five things. I hope they were useful. We talked about data having gravity; the ease of use being a prerequisite; we talked about reuse being as important as reproduction; we talked about building for collaboration and enabling that, and understanding the importance of provenance as a first-class object. Hopefully that will help you to enable your own data scientists and your own tools to not be beautiful and unique but to be reproducible and add a lot of value to the work you’re already doing. Thank you.