For Amazon Web Services (s amzn) Chief Data Scientist Matt Wood, the day isn’t spent performing data alchemy on behalf of his employer; it’s spent with its customers. Wood helps AWS users build big data architectures that use the company’s cloud computing resources, then takes what he learns about those users’ needs and turns it into products — such as the Data Pipeline Service and Redshift data warehouse AWS announced this week.
He and I sat down this week at AWS’s inaugural re:Invent conference and talked about many things, including what he’s seen in the field and where cloud-based big data efforts are headed. Here are the highlights.
The end of constraint-based thinking
Not so long ago, computer scientists understood many of the concepts that we now call data science, but limited resources meant they were hamstrung in the types of analysis they could attempt. “That can be very limiting, very constraining when you’re working with data,” Wood said.
Now, however, data storage and processing resources are relatively inexpensive and abundant — so much so that they’ve actually made the concept of big data possible. Cloud computing has only made those resources cheaper and more abundant. The result, Wood said, is that people working with data are undergoing a shift from that mindset of limiting their data analysis to the resources they have available to one where they think about business needs first.
If they’re able to get past traditional notions of sampling and days-long processing times, he added, individuals can focus their attention on what they can do because they have so many resources available. He noted how Yelp (s yelp) gave developers relatively free rein early on with Elastic MapReduce, saving them from having to formally request resources just “to see if the crazy idea [someone] had over coffee is going to play out.” Yelp was able to spot a shift in mobile traffic volume years ago and get a head start on its mobile efforts because of that, Wood added.
Data problems aren’t just about scale
Generally speaking, Wood said, solving customers’ data problems isn’t just about figuring out how to store ever greater volumes at ever cheaper prices. “You don’t have to be at a petabyte scale in order to get some insight on who’s using your social game,” he said.
In fact, access to limitless storage and processing is a solution to one problem that actually creates another. Companies want to keep all the data they generate, and that creates complexity, Wood explained. As that data piles up in various repositories — perhaps in Amazon’s S3 and DynamoDB services, as well as on some physical machines within a company’s data center — moving it from place to place in order to reuse it becomes a difficult process.
Wood said AWS built its new Data Pipeline Service in order to address this problem. Pipelines can be “arbitrarily complex,” he explained — from running a simple piece of business logic against data to running whole batches through Elastic MapReduce — but the idea is to automate the movement and processing so users don’t have to build these flows themselves and then manually run them.
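The idea Wood describes can be sketched in plain Python — a conceptual illustration only, not the Data Pipeline API: each stage consumes the previous stage’s output, and a small runner executes the whole flow end-to-end so nobody has to kick off the stages by hand. The stage names and sample data here are hypothetical.

```python
# Conceptual sketch of what a data pipeline automates (not the AWS API):
# each step is applied to the output of the previous step, so the whole
# flow runs end-to-end without anyone manually launching each stage.

from typing import Callable, List


def run_pipeline(data, steps: List[Callable]):
    """Apply each step to the output of the previous one."""
    for step in steps:
        data = step(data)
    return data


# Hypothetical stages: pull records, filter them, aggregate a metric.
extract = lambda src: [{"user": u, "clicks": c} for u, c in src]
transform = lambda rows: [r for r in rows if r["clicks"] > 0]
load = lambda rows: sum(r["clicks"] for r in rows)

total = run_pipeline([("a", 3), ("b", 0), ("c", 5)], [extract, transform, load])
print(total)  # 8
```

A real pipeline would swap these lambdas for jobs against S3, DynamoDB or Elastic MapReduce, but the automation pattern — declared stages, scheduled execution — is the same.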
The cloud isn’t just for storing tweets
People sometimes question the relevance of cloud computing for big data workloads, if only because any data generated on in-house systems has to make its way to the cloud over inherently slow connections. The bigger the dataset, the longer the upload time.
Wood said AWS is trying hard to alleviate these problems. For example, partners such as Aspera and even some open source projects enable customers to move large files at fast speeds over the internet (Wood said he’s seen consistent speeds of 700 megabits per second). This is also why AWS has eliminated data-transfer fees for inbound data, has turned on parallel uploads for large files and created its Direct Connect program with data center operators that provide dedicated connections to AWS facilities.
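To put those upload times in perspective, here is a back-of-the-envelope estimate (illustrative arithmetic only) of how long a large dataset takes to move over a sustained link at the 700 megabits per second Wood cited:

```python
# Back-of-the-envelope transfer-time estimate (illustrative only):
# how long does moving a large dataset take over a sustained link?

def transfer_days(dataset_bytes: float, link_bits_per_sec: float) -> float:
    """Days needed to move dataset_bytes over a sustained link."""
    seconds = dataset_bytes * 8 / link_bits_per_sec
    return seconds / 86_400  # seconds per day

# 200 TB (the size of the 1000 Genomes dataset) at a sustained 700 Mbit/s:
days = transfer_days(200e12, 700e6)
print(f"{days:.1f} days")  # roughly 26 days
```

Even at speeds most customers never see, a 200TB dataset takes the better part of a month to upload — which is why shipping physical disks remains a practical option.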
And if datasets are too large for all those methods, customers can just send AWS their physical disks. “We definitely receive hard drives,” Wood said.
Collaboration is the future
Once data makes its way to the cloud, it opens up entirely new methods of collaboration where researchers or even entire industries can access and work together on shared datasets too big to move around. “This sort of data space is something that’s becoming common in fields where there are very large datasets,” Wood said, citing as an example the 1000 Genomes project dataset that AWS houses.
As we’ve covered recently, the genetics space is drooling over the promise of cloud computing. The 1000 Genomes database is only 200TB, Wood explained, but very few project leads could get the budget to store that much data and make it accessible to their peers, much less to pay for the computation power required to process it. And even in fields such as pharmaceuticals, Amazon CTO Werner Vogels told me during an earlier interview, companies are using the cloud to collaborate on certain datasets so they don’t have to spend time and money reinventing the wheel.
No more supercomputers?
Wood seemed very impressed with the work that AWS’s high-performance computing customers have been doing on the platform — work that previously would have been done on supercomputers or other physical systems. Thanks to AWS partner Cycle Computing, he noted, the Morgridge Institute at the University of Wisconsin was able to perform 116 years’ worth of computing in just one week. In the past, access to that kind of power would have required waiting in line until resources opened up on a supercomputer somewhere.
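A rough sanity check on that figure — a back-of-the-envelope estimate, not numbers from Cycle Computing — shows the scale of parallelism the cloud makes available on demand:

```python
# Rough arithmetic behind "116 years of computing in one week":
# how many machines must run in parallel to compress that much work?

years_of_compute = 116
days_per_year = 365.25
wall_clock_days = 7

parallel_machines = years_of_compute * days_per_year / wall_clock_days
print(round(parallel_machines))  # about 6053
```

In other words, the run implies on the order of six thousand cores working at once — a cluster few institutes could justify owning, but one anyone can rent by the hour.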
The collaborative efforts Wood discussed certainly facilitate this type of extreme computation, as do AWS’s continuous efforts to beef up its instances with more and more power. Whatever users might need, from the new 250GB RAM on-demand instances to GPU-powered Cluster Compute Instances, Wood said AWS will try to provide it. Because cost sometimes matters, AWS has opened Cluster Compute Instances and Elastic MapReduce to its spot market for buying capacity on the cheap.
But whatever data-intensive workloads organizations want to run, many will now look first to the cloud. Because cloud computing and big data — Hadoop, especially — have come of age roughly in parallel with each other, Wood hypothesized, they often go hand in hand in people’s minds.
Feature image courtesy of Shutterstock user winui.