Obama’s big data plans: Lots of cash and lots of open data

The White House on Thursday morning released the details of its new big data strategy, and they involve access to funding and data for researchers. It’s a big financial commitment in a time of tight budgets — well over $200 million a year — but the administration is banking on big data techniques having revolutionary effects on par with the Internet, which federal dollars financed decades ago.
Here’s where its efforts are headed, in a nutshell:
Grants: About $73 million has been specifically laid out for research grants, with the National Science Foundation chipping in about $13 million across three projects, and the Department of Defense ponying up $60 million. The U.S. Geological Survey will also be announcing a list of grantees working on big data projects, although no specific monetary amounts are listed.
Spending: If there’s one thing the DoD knows how to do, it’s spend, and it will be doing a lot of it on big data: $250 million a year. DARPA alone will be investing $25 million annually for four years in XDATA, a program that aims “to develop computational techniques and software tools for analyzing large volumes of data, both semi-structured (e.g., tabular, relational, categorical, meta-data) and unstructured (e.g., text documents, message traffic).” The Department of Energy is getting in on the big data frenzy, too, investing $25 million to establish the Scalable Data Management, Analysis and Visualization Institute, which aims to develop techniques for visualizing the incredible amounts of data generated by the department’s supercomputers.
Open data: The White House has also teamed with Amazon Web Services to make the 1000 Genomes Project data freely available to genetic researchers. The data set weighs in at a whopping 200TB, and is a valuable resource for researching the gene-level causes of, and potential cures for, certain diseases. Hosting it in the cloud is critical because, without access to a super-high-speed network, moving 200TB across today’s broadband connections is impractical. While the data itself is free, though, researchers will have to pay for the computing resources needed to analyze it.
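To put the 200TB figure in perspective, here is a rough back-of-the-envelope sketch of how long a full download would take. The link speeds are illustrative assumptions, not figures from the announcement, and the helper name `transfer_days` is made up for this example. It also assumes decimal terabytes and a link that runs at full rated speed the whole time, which real networks rarely manage.

```python
def transfer_days(size_tb: float, link_mbps: float) -> float:
    """Estimate days needed to move size_tb terabytes over a link_mbps link.

    Assumes decimal units (1 TB = 1e12 bytes, 1 Mbps = 1e6 bits/s) and
    100 percent sustained utilization, so real transfers would be slower.
    """
    total_bits = size_tb * 1e12 * 8          # terabytes -> bits
    seconds = total_bits / (link_mbps * 1e6)  # bits / (bits per second)
    return seconds / 86_400                   # seconds -> days


if __name__ == "__main__":
    DATASET_TB = 200  # size of the data set cited in the article

    # Hypothetical link speeds, chosen only for illustration.
    for label, mbps in [("100 Mbps", 100), ("1 Gbps", 1_000), ("10 Gbps", 10_000)]:
        print(f"{label:>8}: about {transfer_days(DATASET_TB, mbps):.1f} days")
```

Under these assumptions, even a dedicated 1 Gbps connection would need more than two weeks to pull down the whole data set, which is why running the analysis next to the data is attractive.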
Here’s how the White House summed up the rationale behind its efforts in the press release announcing the new programs:

“In the same way that past Federal investments in information-technology R&D led to dramatic advances in supercomputing and the creation of the Internet, the initiative we are launching today promises to transform our ability to use Big Data for scientific discovery, environmental and biomedical research, education, and national security,” said Dr. John P. Holdren, Assistant to the President and Director of the White House Office of Science and Technology Policy.

It’s worth noting, however, that the White House’s approaches to capitalizing on the big data opportunity aren’t entirely novel. Open source projects such as Hadoop, the linchpin of many big data efforts, and the projects surrounding it have already revolutionized the way we think about storing and analyzing unstructured data. And researchers have already had cloud-based access to genetic databases and tools: DNAnexus is hosting the 400TB Short/Sequence Read Archive as part of its cloud-based genome-analysis service, and Microsoft is hosting the NCBI BLAST DNA-analysis tool on its Windows Azure cloud.