Using the crowd to find new needles in the information haystack

A Stanford professor, Russ Altman, working with Microsoft, has created a new and faster way to discover drug side effects by analyzing the queries people enter into web search engines. This is a great example of using the crowd to find new needles in the information haystack.

John Markoff, Unreported Side Effects of Drugs Are Found Using Internet Search Data, Study Finds

Using data drawn from queries entered into Google, Microsoft and Yahoo search engines, scientists at Microsoft, Stanford and Columbia University have for the first time been able to detect evidence of unreported prescription drug side effects before they were found by the Food and Drug Administration’s warning system.

Using automated software tools to examine queries by six million Internet users taken from Web search logs in 2010, the researchers looked for searches relating to an antidepressant, paroxetine, and a cholesterol lowering drug, pravastatin. They were able to find evidence that the combination of the two drugs caused high blood sugar.

The study, which was reported in the Journal of the American Medical Informatics Association on Wednesday, is based on data-mining techniques similar to those employed by services like Google Flu Trends, which has been used to give early warning of the prevalence of the sickness to the public.


The new approach is a refinement of work done by the laboratory of Russ B. Altman, the chairman of the Stanford bioengineering department. The group had explored whether it was possible to automate the process of discovering “drug-drug” interactions by using software to hunt through the data found in F.D.A. reports.

The group reported in May 2011 that it was able to detect the interaction between paroxetine and pravastatin in this way. Its research determined that the patient’s risk of developing hyperglycemia was increased compared with taking either drug individually.

Altman wondered whether other techniques might be used to discover drug side effects, so he approached Microsoft. Scientists there analyzed anonymized data from users who had opted in to allowing their search histories to be used. They pored through 82 million searches and found that searches for the drugs in question coincided with searches for ‘hyperglycemia’ or any of its common symptoms, like ‘blurry vision’ or ‘high blood sugar’.
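To make the idea concrete, here is a minimal sketch of that kind of co-occurrence check. It is not the researchers' actual method: the function name `symptom_rates`, the `(user_id, query)` log format, and the simple substring matching are all assumptions made for illustration. It groups users by which of the two drugs they searched for, then reports the share of each group that also searched for a symptom term.

```python
from collections import defaultdict

# The drug names and symptom phrases come from the post; treating them as
# plain substrings to match is an illustrative assumption.
DRUG_A = "paroxetine"
DRUG_B = "pravastatin"
SYMPTOM_TERMS = ("hyperglycemia", "high blood sugar", "blurry vision")


def symptom_rates(query_log):
    """Take (user_id, query) pairs; return the symptom-search rate per drug group."""
    # Per-user flags: searched for drug A, drug B, or any symptom term.
    flags = defaultdict(lambda: {"a": False, "b": False, "symptom": False})
    for user_id, query in query_log:
        text = query.lower()
        user = flags[user_id]
        user["a"] = user["a"] or DRUG_A in text
        user["b"] = user["b"] or DRUG_B in text
        user["symptom"] = user["symptom"] or any(t in text for t in SYMPTOM_TERMS)

    # For each group, track [number of users, users who searched a symptom].
    counts = {"both": [0, 0], "only_a": [0, 0], "only_b": [0, 0]}
    for user in flags.values():
        if user["a"] and user["b"]:
            group = "both"
        elif user["a"]:
            group = "only_a"
        elif user["b"]:
            group = "only_b"
        else:
            continue  # users who searched for neither drug are ignored
        counts[group][0] += 1
        counts[group][1] += int(user["symptom"])

    return {g: (hits / total if total else 0.0) for g, (total, hits) in counts.items()}
```

A noticeably higher rate in the "both" group than in either single-drug group would hint at an interaction worth investigating. A real analysis would also need to handle misspellings, brand names, the timing of searches, and statistical significance.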

This is obviously going to be a key medical research tool, but the general approach has enormous potential beyond medicine, which is why we hear so much about big data these days. Similar techniques are possible for delving into Twitter and Facebook feeds, and are likely to be exploited in commercial and non-commercial ways.

Consider an automobile manufacturer interested in the future urban transportation market. It could search through Twitter logs for people discussing its cars, its competitors, and alternatives like Zipcar, bicycles, and mass transit. That sort of analysis will yield better and more immediate results than old-school surveys.

Perhaps more importantly, in the urban transport case, unlike the drug side effects case, the car company can identify social psychographics based on what people are discussing: What groups are considering giving up their cars? Are women riding bicycles to work? Are electric urban vehicles, like Lit Motors’ C-1 and Toyota’s i-ROAD concept, being talked about as serious alternatives? The company can also find the influencers who are spreading these ideas, and follow them. This is going to be one of the biggest opportunities: filtering through sparse big data sets to find socially-scaled influence networks, and so bringing the dark matter of influence to light.
