Twitter open sources Storm-Hadoop hybrid called Summingbird

Twitter has open sourced a “streaming MapReduce” system called Summingbird that makes Hadoop and Storm play nicer together so applications that require both batch and stream processing can do their jobs with as little complexity as possible.

Real-time curation in the corporation?

We are experiencing a rapid adoption of stream-based communication tools in the enterprise, and these are introducing a new era of information explosion, one that shares similarities with what has gone on in the business setting before, but on a completely unprecedented scale.
The social revolution in the workplace has many ramifications (changes in management approaches, reliance on new technologies, and so on), but from a social network analysis viewpoint we can point at one factor that acts as a proxy for a great deal of the other impacts of social tools: social density. As people adopt and become more habituated to social tools, the number of social relationships increases. And this means more conversations, more connection, and more information streaming through these networks.
All very interesting, but how are we to make sense of dramatically increased information flows?
Bruce Sterling, the science fiction writer, suggested that we’d be relying on algorithmic machinery:

Ultimately no human brain, no planet full of human brains, can possibly catalog the dark, expanding ocean of data we spew. In a future of information auto-organized by folksonomy, we may not even have words for the kinds of sorting that will be going on; like mathematical proofs with 30,000 steps, they may be beyond comprehension. But they’ll enable searches that are vast and eerily powerful. We won’t be surfing with search engines any more. We’ll be trawling with engines of meaning.

And that turns out to be how Google crunches the web. But for real-time sense making, Twitter has found it needs to get people in the loop. After all, when something new begins to become a hot trend, algorithms don’t necessarily have enough context to figure out what is going on, and search won’t do the right thing:

Edwin Chen and Alpa Jain, “Improving Twitter search with real-time human computation,” via Twitter Engineering Blog:
From a search and advertising perspective, however, these sudden events pose several challenges:

  1. The queries people perform have probably never before been seen, so it’s impossible to know without very specific context what they mean. How would you know that #bindersfullofwomen refers to politics, and not office accessories, or that people searching for “horses and bayonets” are interested in the Presidential debates?
  2. Since these spikes in search queries are so short-lived, there’s only a small window of opportunity to learn what they mean.

So an event happens, people instantly come to Twitter to search for the event, and we need to teach our systems what these queries mean as quickly as we can — because in just a few hours, the search spike will be gone.
How do we do this? We’ve built a real-time human computation engine to help us identify search queries as soon as they’re trending, send these queries to real humans to be judged, and then incorporate the human annotations into our back-end models.
Before we delve into the details, here’s an overview of how the system works.

  1. First, we monitor for which search queries are currently popular.
    Behind the scenes: we run a Storm topology that tracks statistics on search queries. [Storm is a distributed system for real-time computation.]
    For example, the query “Big Bird” may suddenly see a spike in searches from the US.
  2. As soon as we discover a new popular search query, we send it to our human evaluators, who are asked a variety of questions about the query.
    Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon’s Mechanical Turk service, and then polls Mechanical Turk for a response.
    For example: as soon as we notice “Big Bird” spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant Tweets and ads.
  3. Finally, after a response from an evaluator is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our evaluators tell us that [Big Bird] is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.
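
The three steps above can be sketched as a small toy pipeline. This is only an illustration under stated assumptions, not Twitter's code: a plain Python class stands in for the Storm topology, the spike rule (a count at least `spike_ratio` times the recent average) is invented for the example, and `ask_humans` is a hypothetical callback standing in for the Thrift API call that dispatches a query to Mechanical Turk.

```python
from collections import Counter, deque


class TrendDetector:
    """Toy stand-in for the query-statistics step (not Storm itself)."""

    def __init__(self, window=3, spike_ratio=3.0, min_count=5):
        # All three thresholds are assumptions made for this sketch.
        self.spike_ratio = spike_ratio
        self.min_count = min_count
        self.history = deque(maxlen=window)  # counts for recent intervals
        self.current = Counter()             # counts for the open interval

    def record(self, query):
        """Step 1: count one search for `query` in the current interval."""
        self.current[query] += 1

    def close_interval(self):
        """Return queries that spiked relative to their recent average."""
        spiking = []
        for query, count in self.current.items():
            past = [counts[query] for counts in self.history]
            baseline = sum(past) / len(past) if past else 0.0
            if (count >= self.min_count
                    and count > self.spike_ratio * max(baseline, 1.0)):
                spiking.append(query)
        self.history.append(self.current)
        self.current = Counter()
        return sorted(spiking)


def run_pipeline(intervals, ask_humans, detector=None):
    """Feed batches of queries through detection and human annotation.

    `ask_humans` is a hypothetical callback; in the system described
    above, this is where a query would be sent to human evaluators.
    """
    detector = detector or TrendDetector()
    annotations = {}  # step 3: stand-in for the backend store
    for queries in intervals:
        for query in queries:
            detector.record(query)
        for query in detector.close_interval():
            if query not in annotations:  # step 2: ask once per query
                annotations[query] = ask_humans(query)
    return annotations
```

Calling `run_pipeline` with a few quiet intervals followed by a burst of “big bird” searches would send only that query to `ask_humans`, and the returned annotation is what a later search could consult when choosing Tweets and ads.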

So they quickly hand the sense-making for interest in Big Bird or Clint Eastwood to humans via Mechanical Turk, to compensate for recent shifts in the reason for interest in a trending topic.
I think this sort of human curation will be increasingly important in the business context. For example, if a competitor’s product XJ11 is being mentioned all over the social network within your company, understanding why may be critical. Perhaps they have just announced a big sale, or are selling the product line off. But the new spike in activity is likely to be related to very recent events, and not the long tail of older pieces of information about that product. And the most likely candidates to help make sense of the new trend? The people originating the cascading comments in the social communications tools.
Work media and social marketing tools vendors will have to support social curation and getting people into the loop. Twitter’s approach is one way to go, although instead of using Mechanical Turk, a system could ask the originators of these trends for clarification: it might just message the earliest trend setters to ask them what’s up. Alternatively, an approach like Tumblr’s editorial teams might be employed, where individuals are selected to act as editors or curators, and to actively pull out information that is novel and important, and post it onto topic pages.
Whichever approach is taken, I have no doubt that we will see social curation emerging as a key component of our business social networking tools in the near future. And we will still be relying on human beings as the key element of sense making, not machinery.

Here’s how it looks when big data goes mobile-first

Zoomdata has a plan for business intelligence that involves tackling the difficult problem of streaming data, and doing so with a mobile-device-first mindset. The result is pretty and compelling in theory, but it’s technologically challenging and will face tough competition from new and old vendors alike.

Lessons learned: How to get your big data startup to a Series B

Mobile-application development specialist Appcelerator bought big data startup Nodeable on Wednesday, although the deal wasn’t exactly what Nodeable was planning for when it launched in 2011. Founder and CEO Dave Rosenberg shares some of the lessons he learned trying to break into the big data space.

Nodeable gives Hadoop a real-time boost with StreamReduce

Nodeable is now offering a cloud service for processing and analyzing streams of data in real time. Its new flagship service, called StreamReduce, is built atop Twitter’s open source Storm framework and acts as Hadoop’s faster, nimbler front-end partner that delivers users insights as they happen.

Twitter to open source Hadoop-like tool

Attention webscale aficionados, Twitter plans to open source its Hadoop-like real-time data processing tool known as Storm. The social service nabbed the code through its acquisition last month of BackType, and says it’s a better tool for processing streams of data.

Why RIM is cutting 2,000 jobs

Just over 10 percent of RIM’s workforce will be laid off as the company continues losing market share in a segment it once led. How could this happen? RIM has been slow to transition, a process that’s still under way, with no end in sight.