Oftentimes, the best way to get a sense of your data is to look at it. A bunch of numbers or words might not mean anything sitting in a table, but they start to make a lot more sense when they’re turned into a chart. In fields like mass cytometry, though, where doctors might want to analyze dozens of biological markers for each of tens of thousands of cells in a tissue sample, creating an easy-to-understand chart is easier said than done.
That’s why a group of researchers from Columbia University and Stanford University developed an algorithm that can do just that, turning those cells into something that resembles your social graph. This lets researchers see how the various cells are related to each other so they know, for example, where to focus cancer treatment and what to track as that treatment progresses.
The idea of representing large or complex data as a graph is nothing new, but it has taken on more prominence thanks to the rise of social media and those ubiquitous social graphs that map out who’s connected to whom. As we highlighted recently, however, graph analysis is becoming more popular outside the realm of social networks, and is being applied to problems that are more complex than just figuring out simple relationships within a network. In cases such as medical research, especially, graphs can provide a very effective way of seeing how potentially hundreds of thousands of data points spanning perhaps hundreds of variables are similar to each other.
That’s exactly what the team at Columbia and Stanford has done with a new algorithm that they’ve demonstrated within the realm of mass cytometry. According to a press release announcing the research (which is available via paid download at Nature Biotechnology):
“The method, called viSNE (visual interactive Stochastic Neighbor Embedding), is based on a sophisticated algorithm that translates high-dimensional data (e.g., a dataset that includes many different simultaneous measurements from single cells) into visual representations similar to two-dimensional ‘scatter plots’ ….
“The viSNE software can analyze measurements of dozens of molecular markers. In the two-dimensional maps that result, the distance between points represents the degree of similarity between single cells. The maps can reveal clearly defined groups of cells with distinct behaviors (e.g., drug resistance) even if they are only a tiny fraction of the total population. This should enable the design of ways to physically isolate and study these cell subpopulations in the laboratory.”
I assume they say “similar to” scatter plots because the algorithm is analyzing data across more than two dimensions, although the resulting chart reads essentially the same way (i.e., data points with similar characteristics will form clusters).
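To make the "distance represents similarity" idea a bit more concrete, here is a minimal Python sketch of the first step in stochastic-neighbor-style methods: turning distances between high-dimensional points into neighbor probabilities. This is not the viSNE code. The marker values are invented for illustration, and the single fixed `sigma` bandwidth is a simplifying assumption (published methods in this family typically tune it per point).

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def neighbor_probabilities(points, sigma=1.0):
    """For each point i, return the probability of picking each other
    point j as its neighbor, using a Gaussian kernel on distance.
    A fixed sigma is an assumption made to keep the sketch short."""
    n = len(points)
    probs = []
    for i in range(n):
        weights = [
            0.0 if j == i
            else math.exp(-squared_distance(points[i], points[j]) / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        probs.append([w / total for w in weights])
    return probs

# Hypothetical cells, each described by four marker measurements.
cells = [
    [1.0, 0.9, 0.1, 0.0],  # cell 0
    [0.9, 1.0, 0.0, 0.1],  # cell 1, close to cell 0
    [0.1, 0.0, 1.0, 0.9],  # cell 2, very different from both
]
P = neighbor_probabilities(cells)
```

In this toy data, cell 0 assigns most of its neighbor probability to cell 1 and very little to cell 2. An embedding method would then look for two-dimensional positions whose distances reproduce those probabilities as closely as possible, which is why similar cells end up in the same cluster on the map.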
Whether or not the underlying methods are technically alike, this research seems similar to what Ayasdi is doing with its new data-analysis software based on a technique called topological data analysis. In both cases, the algorithms aren’t necessarily concerned with how data points interact with one another (as in network graphs), but rather with what characteristics the points share. Ayasdi’s software has been used in cancer research, too, including on datasets spanning hundreds of patients and tens of thousands of variables.
In theory — although not likely in practice considering the complexity of the datasets medical researchers are dealing with — these approaches are similar to clustering approaches that are also popular among data scientists working with web companies. In areas such as e-commerce or email management, for example, where there isn’t a strong social element, companies can broadly break customers into distinct groups based on their behavior or interests.
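For a sense of what that kind of customer grouping looks like in code, here is a minimal sketch using k-means, one common clustering method (the article's sources don't name a specific technique, so this choice is an assumption). The customer figures are invented, and the centroids are seeded with the first `k` points simply to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def k_means(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its members."""
    centroids = [list(p) for p in points[:k]]  # deterministic seeding (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return labels, centroids

# Hypothetical customers: [visits per month, average order value]
customers = [
    [2, 15], [3, 18], [2, 20],        # occasional, low-spend shoppers
    [20, 150], [22, 160], [19, 140],  # frequent, high-spend shoppers
]
labels, centroids = k_means(customers, k=2)
```

With these made-up numbers the first three customers land in one group and the last three in the other. Real datasets would need feature scaling and a more careful choice of `k` and starting centroids.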
Of course, curing cancer is a slightly more compelling — and difficult — goal than targeted advertising. The algorithms have to be precise so as not to miss similarities hidden within the mass of data. In the case of viSNE, the researchers say they’ve been able to spot small groups of cells (like 20 out of tens of thousands) that might be able to survive chemotherapy and increase the likelihood of a recurring tumor.
But we probably shouldn’t be too quick to discount the work that web companies do as somehow less valuable than that of cancer researchers. The big data era arguably started with the web, and web companies have generated some of the most important data-analysis techniques and technologies around today (see, for example, Google’s Jeff Dean, with whom I’ll be speaking at our Structure conference next month). As medical researchers start generating more and more data via cytometry, genome sequencing and even electronic medical records, it will be critical for individuals in all fields to keep track of what data scientists in other fields are doing and figure out how that might apply to their own work.