How federal money will spur a new breed of big data

If you think Hadoop and the current ecosystem of big data tools are great, “you ain’t seen nothing yet,” to quote Bachman Turner Overdrive. By pumping hundreds of millions of dollars a year into big data research and development, the Obama administration thinks it can push the current state of the art well beyond what’s possible today, and into entirely new research areas.
It’s a noble goal, but also a necessary one. Big data does have the potential to change our lives, but to get there it’s going to take more than startups created to feed us better advertisements.

Consumer data is easy to get, and profitable

It’s not fair to call the current state of big data problematic, but it is largely focused on profit-centric technologies and techniques. That’s because as companies — especially those in the web world — realized the value they could derive from advanced data analytics, they began investing huge amounts of money in developing cutting-edge techniques for doing so. For the first time in a long time, industry is now leading the academic and scientific research communities when it comes to technological advances.
As Brenda Dietrich, IBM Fellow and vice president for business analytics for IBM Software (and former VP of IBM’s mathematical sciences division), explained to me, universities are still doing good research, but students are leaving to work at companies like Google and Facebook as soon as their graduate or Ph.D. studies are complete, and often beforehand. Research begun in universities is continued in commercial settings, generally with commercial interests guiding its direction.
And this commercial focus isn’t ideal for everyone. For example, Sultan Meghji, vice president of product strategy at Appistry, told me that many of his company’s government- and intelligence-sector customers aren’t getting what they expected out of Hadoop, and they’re looking for alternative platforms. Hadoop might well be the platform of choice for large web and commercial applications — indeed, it’s where most of those companies’ big data investments are going — but it has its limitations.

Enter federal dollars for big data

However, as John Holdren, assistant to the president and director of the White House Office of Science and Technology Policy, noted during a White House press conference on Thursday afternoon, the Obama administration realized several months ago that it was seriously under-investing in big data as a strategic differentiator for the United States. He was followed by leaders from six government agencies explaining how they intend to invest their considerable resources to remedy this under-investment. That means everything from the Department of Defense, DARPA and the Department of Energy developing new techniques for storage and management, to the U.S. Geological Survey and the National Science Foundation using big data to change the way we research everything from climate science to educational techniques.
How’s it going to do all this, apart from agencies simply ramping up their own efforts? Doling out money to researchers. As Zach Lemnios, Assistant Secretary of Defense for Research & Engineering for the Department of Defense, put it, “We need your ideas.”
IBM’s Dietrich thinks increased availability of government grants can play a major role in keeping researchers in academic and scientific settings rather than bolting for big companies and big paychecks. Grants can help steer research away from targeted advertising and toward areas that will “be good … for mankind at large,” she said.

The 1,000 Genomes Project data is now freely available to researchers on Amazon's cloud.

Additionally, she said, academic researchers have been somewhat limited in what they can do because they haven’t always had easy access to meaningful data sets. With the government now pushing to open its own data sets, as well as for collaborative research among different scientific disciplines, she thinks there’s a real opportunity for researchers to conduct better experiments.
During the press conference, Department of Energy Office of Science Director William Brinkman expressed his agency’s need for better personnel to program its fleet of supercomputers. “Our challenge is not high-performance computing,” he said, “it’s high-performance people.” As my colleague Stacey Higginbotham has noted in the past, the ranks of Silicon Valley companies are deep with people who might be able to bring their parallel-programming prowess to supercomputing centers if the right incentives were in place.

Self-learning systems, a storage revolution and a cure for cancer?

As anyone who follows the history of technology knows, government agencies have been responsible for a large percentage of innovation over the past half century, taking credit for no less than the Internet itself. “You can track every interesting technology in the last 25 years to government spending over the past 50 years,” Appistry’s Meghji said.
Now, the government wants to turn its brainpower and money to big data. As part of its new, roughly $100-million XDATA program, DARPA Deputy Director Kaigham “Ken” Gabriel said his agency “seek[s] the equivalent of radar and overhead imagery for big data” so it can locate a single byte among an ocean of data. The DOE’s Brinkman talked about the importance of being able to store and visualize the staggering amounts of data generated daily by supercomputers, or by the second from CERN’s Large Hadron Collider.
IBM’s Dietrich also has an idea for how DARPA and the DOE might spend their big data allocations. “When one is doing certain types of analytics,” she explained, “you’re not looking at single threads of data, you tend to be pulling in multiple threads.” This makes previous storage technologies designed to make the most-accessed data the easiest to access somewhat obsolete. Instead, she said, researchers should be looking into how to store data in a manner that takes into account the other data sets typically accessed and analyzed along with any given set. “To my knowledge,” she said, “no one is looking seriously at that.”
Not surprisingly, given his company’s large focus on genetic analysis, Appistry’s Meghji is particularly excited about the government promising more money and resources in that field. For one, he said, the Chinese government’s Beijing Genomics Institute probably accounts for anywhere between 25 and 50 percent of the genetics innovation right now, and “to see the U.S. compete directly with the Chinese government is very gratifying.”
But he’s also excited about the possibility of seeing big data turned to areas in genetics other than cancer research — which is presently a very popular pastime — and generally toward advances in real-time data processing. He said the DoD and intelligence agencies are typically two to four years ahead of the rest of the world in terms of big data, and increased spending across government and science will help everyone else catch up. “It’s all about not just reacting to things you see,” he said, “but being proactive.”
Indeed, the DoD has some seriously ambitious plans in place. Assistant Secretary Lemnios explained during the press conference how previous defense research has led to technologies such as IBM’s Watson system and Apple’s Siri that are becoming part of our everyday lives. Its latest quest: utilize big data techniques to create autonomous systems that can adapt to and act on new data inputs in real time, but that know enough to know when they need to invite human input on decision-making. Scary, but cool.