Something’s gotta give when big data meets broadband

Somewhere in the mountains of Chile, scientists want to build a telescope capable of taking roughly 1,400 photos of the night sky a night, each consisting of 6 gigabytes of information. The Large Synoptic Survey Telescope (LSST) would result in several hundred petabytes of processed data each year. This month the National Science Board will decide whether to fund the next phase of LSST: building that data-generating telescope.

Scientists aren’t worried about storing or processing all that data, according to an article written by Mari Silbey for SmartPlanet. Instead, they’re worried about shipping that data from Chile to everywhere else it will be wanted. Basically, it’s not a big data issue; it’s a broadband issue.
The proposed telescope would take two photos of the night sky every 60 seconds and then process all of those 6-gigabyte images during the 12 hours of daylight when it isn’t snapping pics. The project plans to release annual data analysis reports that include the highest-resolution photos for the broader scientific community, but that data will run into multiple petabytes of information each year. From the post:

The new synoptic telescope is being built on Cerro Pachon, a mountain in Chile, but the images it collects will be transported nightly to North America and up to the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign. That will require a huge amount of bandwidth, and bandwidth is increasingly a scarce resource in our data-hungry world.
Moore’s law observes that computing power doubles every one and a half to two years. However, bandwidth isn’t on course to increase at nearly the same pace. And that, says Borne, is a problem: “The number one big data challenge in my mind is the bandwidth problem; it’s just having a pipe big enough to move data fast enough to whatever end-user application there is.”
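To put rough numbers on that pipe, here is a back-of-envelope sketch in Python. The photo rate and image size come from the figures above; the 12-hour observing night, the 12-hour transfer window, and decimal gigabytes are assumptions made for illustration, not LSST specifications.

```python
# Back-of-envelope estimate of the link speed needed to ship one night of raw images.
# Figures taken from the article:
PHOTOS_PER_MINUTE = 2        # "two photos of the night sky every 60 seconds"
GIGABYTES_PER_PHOTO = 6      # "6 gigabyte images"
# Assumptions made for this sketch (not stated as project specifications):
OBSERVING_HOURS = 12         # roughly 12 hours of darkness per night
TRANSFER_WINDOW_HOURS = 12   # ship everything during the 12 daylight hours


def nightly_photos(photos_per_minute=PHOTOS_PER_MINUTE,
                   observing_hours=OBSERVING_HOURS):
    """Number of photos taken in one observing night."""
    return photos_per_minute * 60 * observing_hours


def nightly_volume_gb(photos, gb_per_photo=GIGABYTES_PER_PHOTO):
    """Raw image volume for one night, in decimal gigabytes."""
    return photos * gb_per_photo


def required_gbps(volume_gb, window_hours=TRANSFER_WINDOW_HOURS):
    """Sustained rate in gigabits per second to move volume_gb in the window."""
    gigabits = volume_gb * 8
    return gigabits / (window_hours * 3600)


if __name__ == "__main__":
    photos = nightly_photos()
    volume = nightly_volume_gb(photos)
    print(f"photos per night: {photos}")
    print(f"raw volume per night: {volume / 1000:.2f} TB")
    print(f"sustained rate needed: {required_gbps(volume):.2f} Gbps")
```

Under those assumptions the numbers work out to 1,440 photos (in line with the "roughly 1,400" figure), about 8.64 terabytes a night, and a sustained 1.6 gigabits per second with no retransmissions or protocol overhead. Halve the transfer window and the required rate doubles, which is why the size of the pipe matters so much.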

The solution is both fatter and faster pipes over which to transmit such data. The post lists solutions like Globus Online and the U.S. government’s Project Ignite that will make building new networks easier, but there are also some technological solutions, such as the special file transfer protocol used by Aspera, that could help. Beyond fatter or faster pipes, though, there is a solution that is perhaps more promising for both science and business use: better algorithms to reduce the data sent. From the post:

The way algorithms help minimize the bandwidth issue of distributed data is by reducing the amount of information that needs to be transported for study. For instance, instead of transferring an entire set of raw data, scientists can employ relatively simple algorithms to reduce data to a more manageable size. Algorithms can separate signal from noise, eliminate duplicate data, index information, and catalog where change occurs. Any of these data subsets are inherently smaller and therefore easier to transport than the raw data from which they emerge.

So maybe it’s a method of deduplicating data before it travels over congested networks, or maybe it’s a form of on-site pre-processing that strips out unnecessary data, or at least groups data into cohorts so scientists can pick only the data sets that are relevant to their research.
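As a minimal sketch of the deduplication idea, the Python below hashes each chunk of data and keeps only one copy of any chunk it has already seen, along with a manifest that lets the receiving end rebuild the original sequence. The function names and the sample "sky tile" chunks are hypothetical and chosen for illustration; this is not how LSST or any tool named in the post actually works.

```python
import hashlib


def deduplicate_chunks(chunks):
    """Split a sequence of byte chunks into unique payloads plus a manifest.

    Returns (unique, manifest): `unique` maps a SHA-256 digest to the chunk
    bytes, and `manifest` lists the digest of every chunk in original order.
    Only `unique` and `manifest` would need to cross the network.
    """
    unique = {}
    manifest = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in unique:
            unique[digest] = chunk  # first time this content is seen
        manifest.append(digest)
    return unique, manifest


def reconstruct(unique, manifest):
    """Rebuild the original chunk sequence on the receiving side."""
    return [unique[digest] for digest in manifest]


def bytes_saved(chunks, unique):
    """Payload bytes avoided by sending each distinct chunk only once."""
    total = sum(len(chunk) for chunk in chunks)
    sent = sum(len(chunk) for chunk in unique.values())
    return total - sent


if __name__ == "__main__":
    # Hypothetical data: three of the four chunks are identical.
    chunks = [b"sky-tile-A", b"sky-tile-B", b"sky-tile-A", b"sky-tile-A"]
    unique, manifest = deduplicate_chunks(chunks)
    assert reconstruct(unique, manifest) == chunks
    print(f"unique chunks sent: {len(unique)} of {len(chunks)}")
    print(f"payload bytes saved: {bytes_saved(chunks, unique)}")
```

The savings depend entirely on how repetitive the data is, and the manifest itself adds some overhead, so exact-match hashing like this only pays off when large chunks really do repeat.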
This astronomy project isn’t the only one dealing with an influx of data. I’ve covered the proposed Square Kilometre Array radio antenna project, which aims to look all the way back to the creation of the universe, as another example of where we’re going to have to rethink how we handle data. The SKA folks, though, are worried about the processing and storage of what they estimate will be about 300 to 1,500 petabytes of data each year.
But as researchers encounter these roadblocks (or bandwidth bottlenecks), the solutions they and the computer science industry come up with will also help enterprises filter through their own big data. For example, algorithms that are designed to detect outliers as a way to clean up irrelevant data might be tweaked to help detect fraud at financial institutions, according to the post. Really, anything that helps people parse and analyze their huge stores of data will be welcome, especially given how much hope everyone from government watchdogs to marketers has in the promise of big data.
Images courtesy of the LSST project.