FTC’s Brill sees consumer consent as key for health, finance apps

When normal people use a new app, they don’t wade through hidden service terms. Many just click “OK” and hope for the best. This might be fine for a game of Candy Crush, but it can be risky in the case of apps that monitor things like your bank account or heartbeat.

On March 18, you can find out why from FTC Commissioner Julie Brill, a leading authority on privacy in the age of apps, and one of our guests at Structure Data in New York City.

Brill told me in Washington last week that her agency is concerned about gaps in existing privacy law, especially in how data is stored and sold.

“When it comes to hospitals, insurers and doctors, we have a law that’s well known and well used [i.e. HIPAA],” she said. “Outside of that, when it comes to health tech and wearables, there’s a lot of deeply sensitive information that can be analyzed.”

Brill pointed to an FTC study of 12 health and fitness apps released last spring. It showed how personal health data, normally kept within the closed loops of the medical community, can trickle out through the apps to analytics and advertising companies. Here’s an FTC slide that illustrates the point:

FTC slide showing how data from health and fitness apps flows to third-party companies.

A similar information sprawl can occur with financial apps, which many consumers use to track spending or obtain rewards.

The result, Brill said, is that data gathered for one purpose, such as counting steps or tracking spending, can get used for another without the consumer’s knowledge. In a worst-case scenario, the data could become a means for insurance companies or employers to discriminate against those who have experienced health or financial trouble.

One way to prevent this, she said, involves improving the consent and transparency process for apps that deal in sensitive data, such as those that collect health or financial information, or precise geo-locations. In these cases, Brill sees a potential solution in encouraging app makers to obtain affirmative consent if they want to use a consumer’s data out of context.

“So if the consumer downloads an app to monitor some of her vital statistics, and the health information is being used to provide that information to the consumer herself – to monitor her weight or blood pressure  – that is part of the context that the consumer understands when she downloads the app, and affirmative consent is not needed,” Brill stated in a follow-up email. “However, if the company is going to share this health information with third parties, like advertising networks, data brokers or data analysts, then that is a collection and use that is outside the context of the relationship that the consumer understood, and affirmative consent should be required.”
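Brill’s rule of thumb lends itself to a simple decision rule. The sketch below is purely illustrative (the use categories and context sets are hypothetical, not FTC guidance): affirmative consent is required only when a proposed use falls outside the context the consumer understood at download.

```python
# Hypothetical sketch of Brill's "affirmative consent" rule of thumb.
# The category names and context set are illustrative, not FTC policy.

def consent_required(data_use: str, understood_context: set) -> bool:
    """Affirmative consent is needed when a use falls outside the
    context the consumer understood when installing the app."""
    return data_use not in understood_context

# A health app downloaded so the user can monitor her own vitals:
context = {"show_user_vitals", "track_weight", "track_blood_pressure"}

# In-context use: no affirmative consent needed.
assert consent_required("show_user_vitals", context) is False

# Sharing with an ad network is outside that context: consent required.
assert consent_required("share_with_ad_network", context) is True
```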

The challenge, of course, is for the Federal Trade Commission to find a way to improve privacy protection without subjecting vibrant parts of the economy to pointless or burdensome regulations. Brill said she’s aware of this and, in any event, formal rules or laws (including a Privacy Act like that proposed last week by President Obama) may be a long time coming.

“I believe industry can do lots before any legislation happens,” she said. “Legislation will take a long time, and this industry is taking off — so if industry can do best practices, it will allow appropriate business practices to flourish.”

To hear more about how (and if) the FTC can find a practical way to protect consumers, come join us on March 18-19 at Structure Data, where you’ll meet other leaders of the data economy, including executives from Google, Twitter, BuzzFeed and Amazon.

An earlier version of this story misspelled HIPAA as HIPPA. It has since been corrected.

Watch Hilary Mason discredit the cult of the algorithm

Want to see Hilary Mason, the CEO and founder at Fast Forward Labs, get fired up? Tell her about your new connected product and its machine learning algorithm that will help it anticipate your needs over time and behave accordingly. “That’s just a bunch of marketing bullshit,” said Mason when I asked her about these claims.

Mason actually builds algorithms and is well-versed in what they can and cannot do. She’s quick to dismantle the cult that has been built up around algorithms and machine learning as companies try to make sense of all the data they have coming in, and as they try to market products built on learning algorithms in the wake of Nest’s $3.2 billion sale to Google (I call those efforts faithware). She’ll do more of this during our opening session with Data Collective co-managing partner Matt Ocko at Structure Data on March 18 in New York. You won’t want to miss it.

Lately, algorithms have been touted as the new saviors, capable of helping humans parse terabytes of data to find the hypothetical needle in the haystack. Or they are portrayed as mirrors of our biases, coolly replicating our own racist or classist institutions in code.

Mason thinks of them differently. An algorithm is a method, or recipe, or set of instructions for a computer to follow, she said. “It’s just a recipe you type in to get a consistent result. In some ways chocolate chip cookie recipes are my favorite algorithms. You put a bunch of bad-for-you stuff in a bowl and get a delicious result.”

As for the phrase “machine learning,” which has begun replacing “algorithm” in many of the marketing and Kickstarter pitches I see for connected devices that learn your habits, Mason said that’s no more magical. “It’s a false distinction,” she said. Machine learning algorithms may tend to use statistical methods and techniques, but they are still just algorithms.

Essentially, you’re combining what you know about the properties of a given data set with the recipe you built. For an email spam filter, you might build an algorithm that detects spam by looking for words that commonly appear in spam and then combining that with a statistical distribution of the countries that spam often comes from. Voila, the magic has become mundane — or at least mathematical.
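That recipe can be sketched in a few lines of code. This toy version is purely illustrative (the word list, country rates and 0.5 threshold are made up), but it shows how word-level evidence and a country-of-origin prior combine into a single score, just as described above.

```python
# Toy spam filter in the spirit of Mason's description: combine
# word-level evidence with a prior over originating countries.
# The word list, country rates and threshold are invented for this example.

SPAMMY_WORDS = {"viagra", "winner", "prize", "free"}
COUNTRY_SPAM_RATE = {"US": 0.2, "XX": 0.9}  # fraction of past mail that was spam

def spam_score(text: str, country: str) -> float:
    words = text.lower().split()
    # Fraction of the message made up of known spammy words.
    word_signal = sum(w in SPAMMY_WORDS for w in words) / max(len(words), 1)
    # Prior based on where spam has historically come from.
    prior = COUNTRY_SPAM_RATE.get(country, 0.5)
    # Simple average of the two signals; a real filter would use
    # naive Bayes or logistic regression over many more features.
    return (word_signal + prior) / 2

assert spam_score("claim your free prize winner", "XX") > 0.5   # flagged
assert spam_score("meeting notes attached", "US") < 0.5          # passes
```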

At the end of the day, it’s still just math. Really awesome math.

Updated: This story was updated to clarify some of the points Mason was making.

Cloudera CEO declares victory over big data competition

Cloudera CEO Tom Reilly doesn’t often mince words when it comes to describing his competition in the Hadoop space, or Cloudera’s position among those other companies. In October 2013, Reilly told me he didn’t consider Hortonworks or MapR to be Cloudera’s real competition, but rather larger data-management companies such as IBM and EMC-VMware spinoff Pivotal. And now, Reilly says, “We declare victory over at least one of our competitors.”

He was referring to Pivotal, and the Open Data Platform, or ODP, alliance it helped launch a couple of weeks ago along with [company]Hortonworks[/company], [company]IBM[/company], [company]Teradata[/company] and several other big data vendors. In an interview last week, Reilly called that alliance “a ruse and, frankly, a graceful exit for Pivotal,” which laid off a number of employees working on its Hadoop distribution and is now outsourcing most of its core Hadoop development and support to Hortonworks.

You can read more from Reilly below, including his takes on Hortonworks, Hadoop revenues and Spark, as well as some expanded thoughts on the ODP. For more information about the Open Data Platform from the perspectives of the members, you can read our coverage of its launch in mid-February as well as my subsequent interview with Hortonworks CEO Rob Bearden, who explains in some detail how that alliance will work.

If you want to hear about the fast-changing, highly competitive and multi-billion-dollar business of big data straight from horses’ mouths, make sure to attend our Structure Data conference March 18 and 19 in New York. Speakers include Cloudera’s Reilly and Hortonworks’ Bearden, as well as MapR CEO John Schroeder, Databricks CEO (and Spark co-creator) Ion Stoica, and other big data executives and users, including those from large firms such as [company]Lockheed Martin[/company] and [company]Goldman Sachs[/company].


You down with ODP? No, not me

While Hortonworks explains the Open Data Platform essentially as a way for member companies to build on top of Hadoop without, I guess, formally paying Hortonworks for support or embracing its entire Hadoop distribution, Reilly describes it as little more than a marketing ploy. Aside from calling it a graceful exit for Pivotal (and, arguably, IBM), he takes issue with even calling it “open.” If the ODP were truly open, he said, companies wouldn’t have to pay for membership, Cloudera would have been invited and, when it asked about the alliance, it wouldn’t have been required to sign a non-disclosure agreement.

What’s more, Reilly isn’t certain why the ODP is really necessary technologically. It’s presently composed of four of the most mature Hadoop components, he explained, and a lot of companies are actually trying to move off of MapReduce (to Spark or other processing engines) and, in some cases, even the Hadoop Distributed File System. Hortonworks, which supplied the ODP core and presumably will handle much of the future engineering work, will be stuck doing the other members’ bidding as they decide which of several viable SQL engines and other components to include, he added.

“I don’t think we could have scripted [the Open Data Platform news] any better,” Reilly said. He added, “[T]he formation of the ODP … is a big shift in the landscape. We think it’s a shift to our advantage.”

(If you want a possibly more nuanced take on the ODP, check out this blog post by Altiscale CEO Raymie Stata. Altiscale is an ODP member, but Stata has been involved with the Apache Software Foundation and Hadoop since his days as Yahoo CTO and is a generally trustworthy source on the space.)

Hortonworks CEO Rob Bearden at Structure Data 2014.

Really, Hortonworks isn’t a competitor?

Asked about the competitive landscape among Hadoop vendors, Reilly doubled down on his assessment from last October, calling Cloudera’s business model “a much more aggressive play [and] a much bolder vision” than what Hortonworks and MapR are doing. They’re often “submissive” to partners and treat Hadoop like an “add-on” rather than a focal point. If anything, Hortonworks has burdened itself by going public and by signing on to help prop up the legacy technologies that IBM and Pivotal are trying to sell, Reilly said.

Still, he added, Cloudera’s “enterprise data hub” strategy is more akin to the IBM and Pivotal business models of trying to become the centerpiece of customers’ data architectures by selling databases, analytics software and other components besides just Hadoop.

If you don’t buy that logic, Reilly has another argument that boils down to money. Cloudera earned more than $100 million last year (that’s GAAP revenue, he confirmed), while Hortonworks earned $46 million and, he suggested, MapR likely earned a similar number. Combine that with Cloudera’s huge investment from Intel in 2014 — it’s now “the largest privately funded enterprise software company in history,” Reilly said — and Cloudera owns the Hadoop space.

“We intend to take advantage” of this war chest to acquire companies and invest in new products, Reilly said. And although he wouldn’t get into specifics, he noted, “There’s no shortage of areas to look in.”

Diane Bryant, senior vice president and general manager of Intel’s Data Center Group, at Structure 2014.

The future is in applications

Reilly said that more than 60 percent of Cloudera sales are now “enterprise data hub” deployments, which is his way of saying its customers are becoming more cognizant of Hadoop as an application platform rather than just a tool. Yes, it can still store lots of data and transform it into something SQL databases can read, but customers are now building new applications for things like customer churn and network optimization with Hadoop as the core. Between 15 and 20 financial services companies are using Cloudera to power money-laundering detection, he said, and Cloudera has trained its salesforce on a handful of the most popular use cases.

One of the technologies helping make Hadoop look a lot better for new application types is Spark, which simplifies the programming of data-processing jobs and runs them a lot faster than MapReduce does. Thanks to the YARN cluster-management framework, users can store data in Hadoop and process it using Spark, MapReduce and other processing engines. Reilly reiterated Cloudera’s big investment and big bet on Spark, saying that he expects a lot of workloads will eventually run on it.
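The difference Spark makes is easiest to see in code. The snippet below is a pure-Python stand-in, not actual Spark code (the real PySpark API chains RDD methods such as flatMap, map and reduceByKey): it contrasts MapReduce’s explicit map and reduce phases with the terser style Spark encourages for the same word count.

```python
# Pure-Python illustration only; not runnable Spark code.
from collections import Counter
from functools import reduce

lines = ["big data is big", "data is data"]

# MapReduce style: an explicit map phase emitting (word, 1) pairs,
# then a reduce phase merging counts by key.
mapped = [(word, 1) for line in lines for word in line.split()]
counts_mr = reduce(
    lambda acc, kv: acc | {kv[0]: acc.get(kv[0], 0) + kv[1]},
    mapped,
    {},
)

# Spark-style one-liner (stand-in for a chain like
# sc.textFile(...).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)):
counts_spark = Counter(word for line in lines for word in line.split())

assert counts_mr == dict(counts_spark) == {"big": 2, "data": 3, "is": 2}
```

The output is identical; what changes is how much plumbing the programmer has to spell out, and Spark also keeps intermediate data in memory rather than writing it to disk between phases.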

Databricks CEO (and Spark co-creator) Ion Stoica.

A year into the Intel deal and …

“It is a tremendous partnership,” Reilly said.

[company]Intel[/company] has been integral in helping Cloudera form partnerships with companies such as Microsoft and EMC, as well as with customers such as MasterCard, he said. The latter deal is particularly interesting because Cloudera and Intel’s joint engineering on hardware-based encryption helped Cloudera deploy a PCI-compliant Hadoop cluster and MasterCard is now out pushing that system to its own clients via its MasterCard Advisors professional services arm.

Reilly added that Cloudera and Intel are also working together on new chips designed specifically for analytic workloads, which will take advantage of non-RAM memory types.

Asked whether Cloudera’s push to deploy more workloads in cloud environments is at odds with Intel’s goal to sell more chips, Reilly pointed to Intel’s recent strategy of designing chips especially for cloud computing environments. The company is operating under the assumption that data has gravity and that certain data that originates in the cloud, such as internet-of-things or sensor data, will stay there, while large enterprises will continue to store a large portion of their data locally.

Wherever they run, Reilly said, “[Intel] just wants more workloads.”

A look at Zeroth, Qualcomm’s effort to put AI in your smartphone

What if your smartphone camera were smart enough to identify that the plate of clams and black beans appearing in its lens was actually food? What if it then automatically could make the necessary adjustments to take a decent picture of said dish in the low light conditions of a restaurant? And what if it then, without prompting, uploaded that photo to Foodspotting along with your location, because your camera phone knows from past experience that you like to keep an endless record of your culinary conquests for the world to see?

These are just a few of the questions that [company]Qualcomm[/company] is asking of its new cognitive computing technology Zeroth, which aims to bring artificial intelligence out of the cloud and move it – or at least a limited version of it – into your phone. At Mobile World Congress in Barcelona, I sat down with Qualcomm SVP of product management Raj Talluri, who explained what Zeroth was all about.

Zeroth phones aren’t going to beat chess grandmasters or create their own unique culinary recipes, but they will perform basic intuitive tasks and anticipate your actions, eliminating many of the rudimentary steps required to operate an increasingly complex smartphone, Talluri explained.

“We wanted to see if we could build deep-learning neural networks on devices you carry with you instead of in the cloud,” Talluri said. Using that approach, Qualcomm could solve certain problems surrounding the everyday use of a device.

One such problem Talluri called the camera problem. The typical smartphone can pick up a lot of images throughout the day, from selfies to landscape shots to receipts for your expense reports. You could load every image into the cloud and sort them there, or figure out what to do with each photo as you snap it, but cognitive computing capabilities in your phone could do much of that work, and could do it without you telling it what to do, Talluri said.

Zeroth can train the camera not just to tell a landscape shot from a close-up. It can distinguish between whole classes of objects, from fruit to mountains to buildings, and can tell children from adults and cats from dogs, Talluri said. What the camera does with that information depends on the user’s preferences and the application.

The most basic use case would be taking better photos, since the camera can optimize the shot for the types of objects in it. It could also populate photos with tons of useful metadata. Then you could build on that foundation with other applications. Your smartphone might recognize, for instance, that you’re taking a bunch of landscape and architecture shots in a foreign locale and automatically upload them to a vacation album on Flickr. A selfie might automatically produce a Facebook post prompt.

Zeroth devices would be pre-trained to recognize certain classes of objects – right now Qualcomm has used machine learning to create about 30 categories – but the devices could continue to learn after they’re shipped, Talluri said.

With permission, it could access your contact list and scan your social media accounts, and start recognizing the faces of your friends and family in your contact list, Talluri said. Then if you were taking a picture with a bunch of people in the frame, Zeroth would recognize your friends and focus in on their faces. Zeroth already has the ability to recognize handwriting, but you could train it to recognize the particular characteristics of your script, learning for instance that in my chicken scratch, lower case “A”s often look like “O”s.

Other examples of Zeroth applications include devices that could automatically adjust their power performance to the habits of their owners, or scan their sensors and surroundings to determine what a user’s most likely next smartphone action might be.

Zeroth itself isn’t a separate chip or component. It’s a software architecture designed to run across the different elements of Qualcomm’s Snapdragon processors, so as future Snapdragon products get more powerful, Zeroth becomes more intelligent, Talluri said. We’ll discuss the Zeroth capabilities and designing software that’s smarter and based on cognitive computing with a Qualcomm executive at our Structure Data event in New York later this month.

Qualcomm plans to debut the technology in next year’s premium smartphones and tablets that use the forthcoming Snapdragon 820, which uses a new 64-bit CPU architecture called Kryo and was announced at MWC. But Qualcomm was already showing off basic computer vision features like handwriting and object recognition on devices using the Snapdragon 810. Many of those devices were launched at MWC and should appear in markets in the coming months.


Why you can’t program intelligent robots, but you can train them

If it feels like we’re in the midst of a robot renaissance right now, perhaps it’s because we are. There is a new crop of robots under development that we’ll soon be able to buy and install in our factories or interact with in our homes. And while they might look like robots past on the outside, their brains are actually much different.

Today’s robots aren’t rigid automatons built by a manufacturer solely to perform a single task faster and cheaper than humans and, ideally, without much human input. Rather, today’s robots can be remarkably adaptable machines that not only learn from their experiences, but can even be designed to work hand in hand with human colleagues. Commercially available (or soon to be) technologies such as Jibo, Baxter and Amazon Echo are three well-known examples of what’s now possible, but they’re also just the beginning.

Different technological advances have spurred the development of smarter robots depending on where you look, although they all boil down to training. “It’s not that difficult to build the body of the robot,” said Eugene Izhikevich, founder and CEO of robotics startup Brain Corporation, “but the reason we don’t have that many robots in our homes taking care of us is it’s very difficult to program the robots.”

Essentially, we want robots that can perform more than one function, or perform one function very well. But it’s difficult to program a robot to do multiple things, or at least the things users might want, and it’s especially difficult to program it to do those things in different settings. My house is different from your house; my factory is different from your factory.

A collection of RoboBrain concepts.

“The ability to handle variations is what enables these robots to go out into the world and actually be useful,” said Ashutosh Saxena, a Stanford University visiting professor and head of the RoboBrain project. (Saxena will be presenting on this topic at Gigaom’s Structure Data conference March 18 and 19 in New York, along with Julie Shah of MIT’s Interactive Robotics Group. Our Structure Intelligence conference, which focuses on the cutting edge in artificial intelligence, takes place in September in San Francisco.)

That’s where training comes into play. In some cases, particularly projects residing within universities and research centers, the internet has arguably been a driving force behind advances in creating robots that learn. That’s the case with RoboBrain, a collaboration among Stanford, Cornell and a few other universities that crawls the web with the goal of building a web-accessible knowledge graph for robots. RoboBrain’s researchers aren’t building robots, but rather a database of sorts (technically, more of a representation of concepts — what an egg looks like, how to make coffee or how to speak to humans, for example) that contains information robots might need in order to function within a home, factory or elsewhere.

RoboBrain encompasses a handful of different projects addressing different contexts and different types of knowledge, and the web provides an endless store of pictures, YouTube videos and other content that can teach RoboBrain what’s what and what’s possible. The “brain” is trained with examples of things it should recognize and tasks it should understand, as well as with reinforcement in the form of thumbs up and down when it posits a fact it has learned.

For example, one of its flagship projects, which Saxena started at Cornell, is called Tell Me Dave. In that project, researchers and crowdsourced helpers across the web train a robot to perform certain tasks by walking it through the necessary steps for tasks such as cooking ramen noodles.  In order for it to complete a task, the robot needs to know quite a bit: what each object it sees in the kitchen is, what functions it performs, how it operates and at which step it’s used in any given process. In the real world, it would need to be able to surface this knowledge upon, presumably, a user request spoken in natural language — “Make me ramen noodles.”

The Tell Me Dave workflow.

Multiply that by any number of tasks someone might actually want a robot to perform, and it’s easy to see why RoboBrain exists. Tell Me Dave can only learn so much, but because it’s accessing that collective knowledge base or “brain,” it should theoretically know things it hasn’t specifically trained on. Maybe how to paint a wall, for example, or that it should give human beings in the same room at least 18 inches of clearance.

There are now plenty of other examples of robots learning by example, often in lab environments or, in the case of some recent DARPA research using the aforementioned Baxter robot, watching YouTube videos about cooking (pictured above).

Advances in deep learning — the artificial intelligence technique du jour for machine-perception tasks such as computer vision, speech recognition and language understanding — also stand to expedite the training of robots. Deep learning algorithms trained on publicly available images, video and other media content can help robots recognize the objects they’re seeing or the words they’re hearing; Saxena said RoboBrain uses deep learning to train robots on proper techniques for moving and grasping objects.

The Brain Corporation platform.

However, there’s a different school of thought that says robots needn’t necessarily be as smart as RoboBrain wants to make them, so long as they can at least be trained to know right from wrong. That’s what Izhikevich and his aforementioned startup, Brain Corporation, are out to prove. It has built a specialized hardware and software platform, based on the idea of spiking neurons, that Izhikevich says can go inside any robot and “you can train your robot on different behaviors like you can train an animal.”

That is to say, for example, that a vacuum robot powered by the company’s operating system (called BrainOS) won’t be able to recognize that a cat is a cat, but it will be able to learn from its training that that object — whatever it is — is something to avoid while vacuuming. Conceivably, as long as they’re trained well enough on what’s normal in a given situation or what’s off limits, BrainOS-powered robots could be trained to follow certain objects or detect new objects or do lots of other things.

If there’s one big challenge to the notion of training robots versus just programming them, it’s that consumers or companies that use the robots will probably have to do a little work themselves. Izhikevich noted that the easiest model might be for BrainOS robots to be trained in the lab, and then have that knowledge turned into code that’s preinstalled in commercial versions. But if users want to personalize robots for their specific environments or uses, they’re probably going to have to train it.

Part of the training process with Canary. The next step is telling the camera what it’s seeing.

As the internet of things and smart devices in general catch on, consumers are already getting used to the idea — sometimes begrudgingly. Even when it’s something as simple as pressing a few buttons in an app, like training a Nest thermostat or a Canary security camera, training our devices can get tiresome. Even those of us who understand how the algorithms work can get annoyed.

“For most applications, I don’t think consumers want to do anything,” Izhikevich said. “You want to press the ‘on’ button and the robot does everything autonomously.”

But maybe three years from now, by which time Izhikevich predicts robots powered by Brain Corporation’s platform will be commercially available, consumers will have accepted one inherent tradeoff in this new era of artificial intelligence — that smart machines are, to use Izhikevich’s comparison, kind of like animals. Specifically, dogs: They can all bark and lick, but turning them into seeing eye dogs or K-9 cops, much less Lassie, is going to take a little work.

Hortonworks did $12.7M in Q4, on its path to a billion, CEO says

Hadoop vendor Hortonworks announced its first quarterly earnings as a publicly held company Tuesday, claiming $12.7 million in fourth-quarter revenue and $46 million in revenue during fiscal year 2014. The numbers represent 55 percent quarter-over-quarter and 91 percent year-over-year increases, respectively. The company had a net loss of $90.6 million in the fourth quarter and $177.3 million for the year.
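As a quick sanity check on those growth rates, the implied prior-period figures can be backed out from the reported numbers (a back-of-the-envelope sketch; the prior-period revenues below are computed from the stated percentages, not reported by the company):

```python
# Implied prior-period revenue from Hortonworks' reported growth rates.
q4_2014 = 12.7   # $M, reported Q4 2014 revenue
fy_2014 = 46.0   # $M, reported FY2014 revenue

q3_2014 = q4_2014 / 1.55   # 55% quarter-over-quarter growth
fy_2013 = fy_2014 / 1.91   # 91% year-over-year growth

assert round(q3_2014, 1) == 8.2    # implied Q3 2014 revenue, $M
assert round(fy_2013, 1) == 24.1   # implied FY2013 revenue, $M
```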

However, [company]Hortonworks[/company] contends that revenue is not the most important number in assessing its business. Rather, as CEO Rob Bearden explained around the time the company filed its S-1 pre-IPO statement in November, Hortonworks thinks its total billings are a more accurate representation of its health. That’s because the company relies fairly heavily on professional services, meaning it often doesn’t get paid until a job is done.

The company’s billings in the fourth quarter totaled $31.9 million, a 148 percent year-over-year increase. Its fiscal year billings were $87.1 million, a 134 percent increase over 2013.

If you buy Bearden’s take on the importance of billings over revenue, then Hortonworks looks a lot more comparable in size to its largest rival, Cloudera. Last week, Cloudera announced more than $100 million in revenue in 2014, as well as an 85 percent increase in subscription software customers up to 525 in total.

Hortonworks, for its part, added 99 customers paying for enterprise support of its Hadoop platform in the fourth quarter alone, bringing its total to 332. Among those customers are Expedia, Macy’s, Blackberry and Spotify, all four of which moved directly to Hortonworks from Cloudera, a Hortonworks spokesperson said.

There are, however, some key differences between the Hortonworks and Cloudera business models, as well as that of fellow vendor MapR, that affect how comparable any of these metrics really are. While Hortonworks is focused on free open source software and relies on support contracts for revenue, Cloudera and MapR offer both free Hadoop distributions and more feature-rich paid versions. In late 2013, Cloudera CEO Tom Reilly told me his company was interested in securing big deployments rather than chasing cheap support contracts.

Rob Bearden at Structure Data 2014.

I had a broad discussion with Bearden last week about the Hadoop market and some of Hortonworks’ recent moves in that space, including the somewhat-controversial Open Data Platform alliance it helped to create along with Pivotal, [company]IBM[/company], [company]GE[/company] and others. Here are the highlights from that interview. (If you want to hear more from Bearden and perhaps ask him some of your own questions, make sure to attend our Structure Data conference March 18 and 19 in New York. Other notable Hadoop-market speakers include Cloudera CEO Tom Reilly, MapR CEO John Schroeder and Databricks CEO (and Spark co-creator) Ion Stoica.)

Explain the rationale behind the Open Data Platform

Bearden wouldn’t comment specifically on criticisms — made most loudly by Cloudera’s Mike Olson and Doug Cutting, as well as some industry analysts — that the Open Data Platform, or ODP, is somehow antithetical to open source or the Apache Software Foundation. “What I would say,” he noted, “is the people who are committed to true open source and an open platform for the community are pretty excited about the thing.”

He also chalked up a lot of the criticism of the ODP to misunderstanding about how it really will work in practice. “One of the things I don’t think is very clear on the Open Data Platform alliance is that we’re actually going to provide what we’ll refer to as the core for that alliance, that is based on core Hadoop — so HDFS and YARN and Ambari,” Bearden explained. “We’re providing that, which is obviously directly from Apache, and it’s the exact same bit line that [the Hortonworks Data Platform] is based on.”

Paul Maritz, CEO of Hortonworks partner and ODP member Pivotal, at Structure Data 2014.

So, the core Hadoop distribution that platform members will use is based on Apache code, and anything that ODP members want to add on top of it will also have to go through Apache. These could be existing Apache projects, or they could be new projects the members decide to start on their own, Bearden said.

“We’re actually strengthening the position of the Apache Software Foundation,” he said. He added later in the interview, on the same point, that people shouldn’t view the ODP as much different than they view Hortonworks (or, in many respects, Cloudera or MapR). “[The Apache Software Foundation] is the engineering arm,” he said, “and this entity will become the productization and packaging arm for [Apache].”

So, it’s Cloudera vs. MapR vs. Hortonworks et al?

I asked Bearden whether the formation of the ODP officially makes the Hadoop market a case of Cloudera and MapR versus the Hortonworks ecosystem. That seems like the case to me, considering that the ODP is essentially providing the core for a handful of potentially big players in the Hadoop space. And even if they’re not ODP members, companies such as [company]Microsoft[/company] and [company]Rackspace[/company] have built their Hadoop products largely on top of the Hortonworks platform and with its help.

Bearden wouldn’t bite. At least not yet.

“I wouldn’t say it’s the other guys versus all of us,” he said. “I would say what’s happened is the community has realized this is what they want and it fits in our model that we’re driving very cleanly. . . . And we’re not doing anything up the stack to try and disintermediate them, and we de-risk it because we’re all open.”

The “this” he’s referring to is the ability of Hortonworks’ partners to stop spending resources keeping up with the core Hadoop technology and instead focus on how they can monetize their own intellectual property. “To do that, the more data they put under management, the faster and the more-stable and enterprise-viable [the platform on which they have that data], the faster they monetize and the bigger they monetize the rest of their platform,” Bearden said.

Microsoft CEO Satya Nadella speaks at a Microsoft cloud event about that company’s newfound embrace of open source. Photo by Jonathan Vanian/Gigaom

Are you standing by your prediction of a billion-dollar company?

“I am not backing off that at all,” Bearden said, in reference to his prediction at Structure Data last year that Hadoop will soon become a multi-billion-dollar market and Hortonworks will be a billion-dollar company in terms of revenue. He said it’s fair to look at revenue alone in assessing the businesses in this space, but it’s not the be-all, end-all.

“It’s less about [pure money] and more about what is the ecosystem doing to really start adopting this,” he said. “Are they trying to fight it and reject it, or are they really starting to embrace it and pull it through? Same with the big customers. . . .
“When those things are happening, the money shows up. It just does.”

Hadoop is actually just a part — albeit a big one — of a major evolution in the data-infrastructure space, he explained. And as companies start replacing the pieces of their data environments, they’ll do so with the open source options that now dominate new technologies. These include Hadoop, NoSQL databases, Storm, Kafka, Spark and the like.

In fact, Bearden said, “Open source companies can be very successful in terms of revenue growth and in terms of profitability faster than the old proprietary platforms got there.”

Time will tell.

Update: This post was updated at 8:39 p.m. PT to correct the amount of Hortonworks’ fourth quarter revenue and losses. Revenue was $12.7 million, not $12.5 million as originally reported, and losses were $90.6 million for the quarter and $177.3 million for the year. The originally reported numbers were for gross loss.

Remember when machine learning was hard? That’s about to change

A few years ago, there was a shift in the world of machine learning.

Companies such as Skytree and Context Relevant began popping up, promising to make it easier for companies outside of big banks and web giants to run machine learning algorithms, and to do it at a scale congruent with the big data promise they were being pitched. Soon, there were many startups promising bigger, faster, easier machine learning. Machine learning became the new black as it was baked into untold software packages and services — machine learning for marketing, machine learning for security, machine learning for operations, and on and on and on.

Eventually, deep learning emerged from the shadows and became a newer, shinier version of machine learning. It, too, was very difficult and required serious expertise to do. Until it didn’t. Now, deep learning is the focus of numerous startups, all promising to make it easy for companies and developers of all stripes to deploy.

Joseph Sirosh

But it’s not just startups leading the charge in this democratization of data science — large IT companies are also getting in on the act. In fact, Microsoft now has a corporate vice president of machine learning. His name is Joseph Sirosh, and we spoke with him on this week’s Structure Show podcast. Here are some highlights from that interview, but it’s worth listening to the whole thing for his take on Microsoft’s latest news (including support for R and Python in its Azure ML cloud service) and competition in the cloud computing space.

You can also catch Sirosh — and lots of other machine learning and big data experts and executives — at our Structure Data conference next month in New York. We’ll be highlighting the newest techniques in taking advantage of data, and talking to the people building businesses around them and applying them to solve real-world problems.

[soundcloud url=”https://api.soundcloud.com/tracks/191875439″ params=”color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false” width=”100%” height=”166″ iframe=”true” /]

Download This Episode

Subscribe in iTunes

The Structure Show RSS Feed

Why the rise of machine learning and why now

“I think the cloud has transformed [machine learning], the big data revolution has transformed it,” Sirosh said. “But at the end of the day, I think the opportunity that is available now because of the vast amount of data that is being collected from everywhere . . . is what is making machine learning even more attractive. . . . As most of behavior, in many ways, comes online on the internet, the opportunity to use the data generated on interactions on websites and software to tailor customer experiences, to provide better experiences for customers, to also generate new revenue opportunities and save money — all of those become viable and attractive.”

Asked whether all of this would be possible without the cloud, Sirosh said it would be, but — like most things — it would be a lot more difficult.

“The cloud makes it easy to integrate data, it makes it easy to, in place, do machine learning on top of it, and then you can publish applications on the same cloud,” he said. “And all of this process happens in one place and much faster, and that changes the game quite a bit.”

Deep learning made easy and easier

Sirosh said he began his career in neural networks and actually earned his Ph.D. studying them, so he’s happy to see deep learning emerge as a legitimately useful technology for mainstream users.

“My take on deep learning is actually this,” he explained. “It is a continuing evolution in that field, we just have now gotten to the level where we have identified great algorithmic tricks that allow you to take performance and accuracy to the next level.”

Deep learning is also an area where Microsoft sees a big opportunity to bring its expertise in building easily consumable applications to bear. Azure ML already makes it relatively easy to train deep neural networks using the same types of methods as its researchers do, Sirosh noted, but users can expect even more in the months to come.

“We will also provide fully trained neural networks,” he said. “We have a tremendous amount of data in images and text data and so on inside of Bing. We will use our massive compute power to learn predictive models from this data and offer some of those pre-trained, canned neural networks in the future in the product so that people will find it very easy to use.”

The results of a Microsoft computer vision algorithm it says can outperform humans at some tasks: a set of images the system classified correctly.

How easy can all of this really be?

As long as there are applications that can hide its complexity, Sirosh has a vision for machine learning that’s much broader than even the world of enterprise IT sales.

“Well, we are actually going after a broad audience with something like machine learning,” he said. “We want to make it as simple as possible, even for students in a high school or in college. In my way of thinking about it, if you’re doing statistics in high school, you should be able to use [a] machine learning tool, run R code and statistical analysis on it. And you can teach machine learning and statistical analysis using this tool if you so choose to.”

Is Microsoft evolving from an operating system company to a data company?

Not entirely, but Sirosh did suggest that Microsoft sees a shift happening in the IT world and is moving fast to ride the wave.

“I think you should even first ask, ‘How big is the world of data to computing itself?'” he said. “I would say that in the future, a huge part of the value being generated in the field of computing . . . is going to come from data, as opposed to storage and operating systems and basic infrastructure. It’s the data that is most valuable. And if that is where in the computing industry most of the value is going to be generated, well that is one place where Microsoft will generate a lot of its value, as well.”

For now, Spark looks like the future of big data

Titles can be misleading. For example, the O’Reilly Strata + Hadoop World conference took place in San Jose, California, this week, but Hadoop wasn’t the star of the show. Based on the news I saw coming out of the event, it’s another Apache project — Spark — that has people excited.

There was, of course, some big Hadoop news this week. Pivotal announced it’s open sourcing its big data technology and essentially building its Hadoop business on top of the [company]Hortonworks[/company] platform. Cloudera announced it earned $100 million in 2014. Lost in the grandstanding was MapR, which announced something potentially compelling in the form of cross-data-center replication for its MapR-DB technology.

But pretty much everywhere else you looked, it was technology companies lining up to support Spark: Databricks (naturally), Intel, Altiscale, MemSQL, Qubole and ZoomData among them.

Spark isn’t inherently competitive with Hadoop — in fact, it was designed to work with Hadoop’s file system and is a major focus of every Hadoop vendor at this point — but it kind of is. Spark is known primarily as an in-memory data-processing framework that’s faster and easier than MapReduce, but it’s actually a lot more. Among the other projects included under the Spark banner are file system, machine learning, stream processing, NoSQL and interactive SQL technologies.
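To see why an in-memory, lazily evaluated model beats MapReduce for repeated passes over the same data, here is a minimal pure-Python sketch of the idea. This is not the actual Spark or PySpark API — the class and method names are illustrative only — just the concept: transformations build up a pipeline lazily, and a dataset materialized in memory can be reused across actions instead of being recomputed (or, in MapReduce’s case, re-read from disk between jobs).

```python
# Conceptual sketch of Spark-style lazy transformations and caching.
# Plain Python, not PySpark; all names here are illustrative.

class LazyDataset:
    def __init__(self, source_fn):
        self._source_fn = source_fn   # recomputes the data on each pass
        self._cached = None           # populated only after .cache()

    def map(self, fn):
        # Lazy: nothing runs until an action (like count) is called.
        return LazyDataset(lambda: [fn(x) for x in self._iterate()])

    def filter(self, pred):
        return LazyDataset(lambda: [x for x in self._iterate() if pred(x)])

    def cache(self):
        # Materialize once and keep in memory, akin to RDD caching.
        self._cached = self._source_fn()
        return self

    def _iterate(self):
        return self._cached if self._cached is not None else self._source_fn()

    def count(self):
        return len(self._iterate())

# Build a pipeline, cache the intermediate result, then run an action.
# A second action on `nums` would reuse the in-memory copy rather than
# rerunning the map, which is the core of Spark's speed advantage for
# iterative workloads like machine learning.
nums = LazyDataset(lambda: list(range(100))).map(lambda x: x * 2).cache()
over_50 = nums.filter(lambda x: x > 50).count()  # → 74
```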

The Spark platform, minus the Tachyon file system and some younger related projects.

In the near term, it will probably be Hadoop that pulls Spark into the mainstream, because Hadoop is still, at the least, a cheap, trusted big data storage platform. And with Spark still relatively immature, it’s hard to see too many companies ditching Hadoop MapReduce, Hive or Impala for their big data workloads quite yet. Wait a few years, though, and we might start seeing more tension between the two platforms, or at least an evolution in how they relate to each other.

This will be especially true if there’s a big breakthrough in RAM technology or prices drop to a level that’s more comparable to disk. Or if Databricks can convince companies they want to run their workloads in its nascent all-Spark cloud environment.

Attendees at our Structure Data conference next month in New York can ask Spark co-creator and Databricks CEO Ion Stoica all about it — what Spark is, why Spark is and where it’s headed. Coincidentally, Spark Summit East is taking place the exact same days in New York, where folks can dive into the nitty gritty of working with the platform.

There were also a few other interesting announcements this week that had nothing to do with Spark, but are worth noting here:

  • [company]Microsoft[/company] added Linux support for its HDInsight Hadoop cloud service, and Python and R programming language support for its Azure ML cloud service. The latter also now lets users deploy deep neural networks with a few clicks. For more on that, check out the podcast interview with Microsoft Corporate Vice President of Machine Learning (and Structure Data speaker) Joseph Sirosh embedded below.
  • [company]HP[/company] likes R, too. It announced a product called HP Haven Predictive Analytics that’s powered by a distributed version of R developed by HP Labs. I’ve rarely heard HP and data science in the same sentence before, but at least it’s trying.
  • [company]Oracle[/company] announced a new analytic tool for Hadoop called Big Data Discovery. It looks like a cross between Platfora and Tableau, and I imagine will be used primarily by companies that already purchase Hadoop in appliance form from Oracle. The rest will probably keep using Platfora and Tableau.
  • [company]Salesforce.com[/company] furthered its newfound business intelligence platform with a handful of features designed to make the product easier to use on mobile devices. I’m generally skeptical of Salesforce’s prospects in terms of stealing any non-Salesforce-related analytics from Tableau, Microsoft, Qlik or anyone else, but the mobile angle is compelling. The company claims more than half of user engagement with the platform is via mobile device, which its Director of Product Marketing Anna Rosenman explained to me as “a really positive testament that we have been able to replicate a consumer interaction model.”

If I missed anything else that happened this week, or if I’m way off base in my take on Hadoop and Spark, please share in the comments.

[soundcloud url=”https://api.soundcloud.com/tracks/191875439″ params=”color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false” width=”100%” height=”166″ iframe=”true” /]

A massive database now translates news in 65 languages in real time

I have written quite a bit about GDELT (the Global Database of Events, Languages and Tone) over the past year, because I think it’s a great example of the type of ambitious project only made possible by the advent of cloud computing and big data systems. In a nutshell, it’s a database of more than 250 million socioeconomic and geopolitical events and their metadata dating back to 1979, all stored (now) in Google’s cloud and available to analyze for free via Google BigQuery or custom-built applications.

On Thursday, version 2.0 of GDELT was unveiled, complete with a slew of new features — faster updates, sentiment analysis, images, a more-expansive knowledge graph and, most importantly, real-time translation across 65 different languages. That’s 98.4 percent of the non-English content GDELT monitors. Because you can’t really have a global database, or expect to get a full picture of what’s happening around the world, if you’re limited to English language sources or exceedingly long turnaround times for translated content.

For a quick recap of GDELT, you can read the story linked to above, as well as our coverage of project creator Kalev Leetaru’s analyses of the Arab Spring and Ukrainian crisis, and the Ebola outbreak. For a deeper understanding of the project and its creator — who also helped measure the “Twitter heartbeat” and uploaded millions of images from the Internet Archive’s digital book collection to Flickr — check out our Structure Show podcast interview with Leetaru from August (embedded below). He’ll also be presenting on GDELT and his future plans at our Structure Data conference next month.


A time-series analysis of the Arab Spring compared with similar periods since 1979.

Leetaru explains GDELT 2.0’s translation system in some detail in a blog post, but even at a high level the methods it uses to achieve near real-time speed are interesting. It works sort of like buffering does on Netflix:

“GDELT’s translation system must be able to provide at least basic translation of 100% of monitored material every 15 minutes, coping with sudden massive surges in volume without ever requiring more time than the 15 minute window. This ‘streaming’ translation is very similar to streaming compression, in which the system must dynamically modulate the quality of its output to meet time constraints: during periods with relatively little content, maximal translation accuracy can be achieved, with accuracy linearly degraded as needed to cope with increases in volume in order to ensure that translation always finishes within the 15 minute window. In this way GDELT operates more similarly to an interpreter than a translator. This has not been a focal point of current machine translation research and required a highly iterative processing pipeline that breaks the translation process into quality stages and prioritizes the highest quality material, accepting that lower-quality material may have a lower-quality translation to stay within the available time window.”
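The scheduling idea in that passage can be sketched in a few lines. This is not GDELT’s actual code — the quality tiers, per-document costs and function name are all invented for illustration — but it shows the mechanism: given an estimated document volume and a fixed 15-minute budget, pick the highest translation quality whose total cost still fits the window, falling back to cheaper tiers as volume surges.

```python
# Illustrative sketch of deadline-driven quality modulation, in the
# spirit of GDELT's "streaming" translation. All numbers are invented.

WINDOW_SECONDS = 15 * 60  # the fixed 15-minute processing window

# Hypothetical quality tiers: (name, seconds of compute per document),
# ordered from most to least accurate.
QUALITY_TIERS = [
    ("full", 3.0),     # maximal-accuracy translation
    ("reduced", 1.0),  # pruned search / smaller models
    ("gist", 0.2),     # basic gist-level translation
]

def pick_quality(num_documents, window=WINDOW_SECONDS):
    """Choose the best quality tier that translates everything in time.

    Falls back to the cheapest tier if even that would overrun, so the
    window is never exceeded by choice of tier alone.
    """
    for name, cost_per_doc in QUALITY_TIERS:
        if num_documents * cost_per_doc <= window:
            return name
    return QUALITY_TIERS[-1][0]

# Quiet period: full accuracy. Surge: quality degrades to stay on time.
assert pick_quality(200) == "full"     # 200 * 3.0s = 600s fits in 900s
assert pick_quality(700) == "reduced"  # 700 * 3.0s overruns; 700s fits
assert pick_quality(4000) == "gist"    # only the cheapest tier fits
```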

In addition, Leetaru wrote:

“Machine translation systems . . . do not ordinarily have knowledge of the user or use case their translation is intended for and thus can only produce a single ‘best’ translation that is a reasonable approximation of the source material for general use. . . . Using the equivalent of a dynamic language model, GDELT essentially iterates over all possible translations of a given sentence, weighting them both by traditional linguistic fidelity scores and by a secondary set of scores that evaluate how well each possible translation aligns with the specific language needed by GDELT’s Event and GKG systems.”
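The reranking Leetaru describes amounts to scoring each candidate translation by a weighted combination of a general linguistic-fidelity score and a task-alignment score, then taking the best. Here is a hedged sketch of that idea — the candidates, scores and weights below are all invented, and GDELT’s real weighting is surely more sophisticated:

```python
# Sketch of reranking candidate translations by combining a general
# fidelity score with a task-alignment score, per the quote above.
# Candidates, scores and weights here are all invented examples.

def rerank(candidates, fidelity_weight=0.6, task_weight=0.4):
    """Return candidates sorted best-first by combined score.

    Each candidate is (text, fidelity_score, task_alignment_score),
    with both scores in [0, 1].
    """
    def combined(candidate):
        _, fidelity, alignment = candidate
        return fidelity_weight * fidelity + task_weight * alignment
    return sorted(candidates, key=combined, reverse=True)

# A generically "best" translation can lose to one whose wording better
# matches the vocabulary a downstream event-coding system expects.
candidates = [
    ("protesters gathered in the square", 0.90, 0.40),
    ("demonstrators rallied in the plaza", 0.85, 0.95),
    ("people were in the square",          0.95, 0.10),
]
best = rerank(candidates)[0][0]  # → "demonstrators rallied in the plaza"
```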

It will be interesting to see how and if usage of GDELT picks up with the broader, and richer, scope of content it now covers. With an increasingly complex international situation that runs the gamut from climate change to terrorism, it seems like world leaders, policy experts and even business leaders could use all the information they can get about what’s connected to what, who’s connected to whom and how this all might play out.

[soundcloud url=”https://api.soundcloud.com/tracks/165051736?secret_token=s-YTgYs” params=”color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false” width=”100%” height=”166″ iframe=”true” /]

The 4 things (at least) you’ll learn about at Structure Data

Gigaom’s Structure Data conference is less than a month away, kicking off March 18 in New York. There are a lot of reasons to attend — great location, great networking, free drinks — but, of course, the biggest reason is great content.

With that in mind, here are four big themes of the event and the speakers who’ll be talking about them. Some are household names in the world of big data and information technology, some are researchers on the forefront of hot new fields, and others are up-and-coming entrepreneurs with big ideas about how data can change business and the world. Structure Data is your chance to hear in person what they have to say and ask them those questions you’ve been dying to ask.

The business of big data

Everyone has heard about Hadoop, but the business of big data infrastructure is about so much more: Spark, Kafka, the internet of things, the industrial internet, visualization, social media analysis, webscale systems, machine learning. The tools are finally in place to do some really cool things, if you know where to look for them.

Structure Data speakers leading the charge in the world of data software and services include: Ted Bailey, Dataminr; Rob Bearden, Hortonworks; Eric Brewer, Google; Ann Johnson, Interana; Jock Mackinlay, Tableau Software; Hilary Mason, Fast Forward Labs; Seth McGuire, Twitter; Neha Narkhede, Confluent; Matt Ocko, Data Collective; Andy Palmer, Tamr; Tom Reilly, Cloudera; William Ruh, GE; John Schroeder, MapR; Joseph Sirosh, Microsoft; Ion Stoica, Databricks; Matt Wood, Amazon Web Services.

Eric Brewer, vice president of infrastructure, Google

A new era of artificial intelligence

Unless you live under a rock inside a cave with spotty internet, you’ve probably heard that folks including Stephen Hawking and Elon Musk think we should be leery of artificial intelligence. Perhaps they’re right, perhaps they’re wrong. But AI is hot right now because techniques such as deep learning offer effective ways of training systems that can make sense of mountains of text, audio and visual data, and because we’re closer than ever to robots that can navigate the world around them.

Structure Data speakers on the forefront of AI and machine learning research include: Ron Brachman, Yahoo; Eugenio Culurciello, TeraDeep; Rob Fergus, Facebook; Ahna Girshick, Enlitic; Jeff Hawkins, Numenta; Anthony Lewis, Qualcomm; Gary Marcus, Geometric Intelligence; Naveen Rao, Nervana Systems; Ashutosh Saxena, Stanford University; Julie Shah, MIT; Sven Strohband, MetaMind; Davide Venturelli, NASA; Brian Whitman, Spotify.

Julie Shah, Interactive Robotics Group, MIT

Users — big users — everywhere

One of the most amazing things to watch over the past few years is how big data tools and data science techniques dispersed from the ivory towers of places like Google and Facebook out across every type of industry. From farming to medicine, and from media to food production, data is driving some incredible investments and innovations.

Structure Data speakers discussing how data has transformed their businesses include: Krish Dasgupta, ESPN; Don Duet, Goldman Sachs; Ky Harlin, BuzzFeed; Nancy Hersh, Opower; Steven Horng, Beth Israel Deaconess Medical Center; Ravi Hubbly, Lockheed Martin; Lee Redden, Blue River Technology; Bill Squadron, STATS; Dan Zigmond, Hampton Creek.

Ky Harlin, director of data science, BuzzFeed

Data by the people, for the people and about the people

While big data has been a boon for IT vendors and some large corporations, the benefits haven’t always been so obvious when it comes to society. Better marketing and more-addictive apps only help the businesses behind them, while the privacy risks for consumers have never been higher as companies collect more and more data from the sites we visit and devices we use. Things are starting to come around, however, as smart people are using data to tackle everything from crime to geopolitics, and officials are increasingly cognizant of regulating new industries like the internet of things in ways that maximize the consumer experience while still keeping consumers safe.

Structure Data speakers addressing societal impacts of data analysis include: Julie Brill, Federal Trade Commission; Paul Duan, Bayes Impact; Kalev Leetaru, GDELT; Jens Ludwig, University of Chicago Crime Lab.

Julie Brill, commissioner, Federal Trade Commission