Microsoft’s machine learning guru on why data matters sooooo much

[soundcloud url="https://api.soundcloud.com/tracks/191875439" params="color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false" width="100%" height="166" iframe="true" /]

Not surprisingly, Joseph Sirosh has big ambitions for his product portfolio at Microsoft, which includes Azure ML, HDInsight and other tools. Chief among them is making it easy for mere mortals to consume these data services from the applications they’re familiar with. Take Excel, for example.

If a financial analyst can, with a few clicks, send data to a forecast service in the cloud, then get the numbers back, visualized on the same spreadsheet, that’s a pretty powerful story, said Sirosh, who is corporate VP of machine learning for Microsoft.
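To make that round trip concrete: Azure ML can expose trained models as web services that a client such as a spreadsheet add-in calls over HTTP. Here is a minimal, hypothetical Python sketch of that flow; the endpoint URL, API key and JSON payload shape are illustrative placeholders, not the actual Azure ML API.

```python
# Hypothetical sketch of calling a cloud forecast service; the URL, key and
# JSON shape below are illustrative placeholders, not the real Azure ML API.
import requests

ENDPOINT = "https://example.azureml.example/workspaces/123/services/forecast"  # placeholder
API_KEY = "YOUR-API-KEY"  # placeholder

history = [112.0, 118.5, 121.3, 130.2, 128.9, 140.7]  # e.g., six months of sales

response = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"values": history, "horizon": 3},  # ask for three periods ahead
    timeout=30,
)
response.raise_for_status()
print(response.json())  # the forecast numbers a spreadsheet add-in would chart
```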

But as valuable as those applications and services are, more and more of the value derived from computing over time will come from the data itself, not all those tech underpinnings. “In the future a huge part of the value generated from computing will come from the data as opposed to storage and operating systems and basic infrastructure,” he noted on this week’s podcast. Which is why one topic under discussion at next month’s Structure Data show will be who owns all the data flowing betwixt and between various systems, the internet of things and so on.

When it comes to getting corporations running these new systems, [company]Microsoft[/company] may have an ace in the hole because so many of them already use key Microsoft tools — Active Directory, SQL Server, Excel. That gives them a pretty good on-ramp to Microsoft Azure and its resident services. Sirosh makes a compelling case, and we’ll talk to him more on stage at Structure Data next month in New York City.

In the first half of the show, Derrick Harris and I talk about how the Hadoop world has returned to its feisty and oh-so-interesting roots. When Pivotal announced its plan to offload support of Hadoop to [company]Hortonworks[/company] and to work with that company, along with [company]IBM[/company] and [company]GE[/company], on the Open Data Platform, Cloudera’s Mike Olson responded with a blog post giving his take.

Also on the docket: @WalmartLabs’ massive OpenStack production private cloud implementation.

Joseph Sirosh

 

SHOW NOTES

Hosts: Barb Darrow and Derrick Harris.

Download This Episode

Subscribe in iTunes

The Structure Show RSS Feed

PREVIOUS EPISODES:

No, you don’t need a ton of data to do deep learning 

VMware wants all those cloud workloads “marooned” in AWS

Don’t like your cloud vendor? Wait a second.

Hilary Mason on taking big data from theory to reality

On the importance of building privacy into apps and Reddit AMAs

Cloudera claims more than $100M in revenue in 2014

Hadoop vendor Cloudera announced on Tuesday that the company’s “[p]reliminary unaudited total revenue surpassed $100 million” in 2014. That the company, which is still privately held, would choose to disclose even that much information about its finances speaks to the fast maturation, growing competition and big egos in the Hadoop space.

While $100 million is a nice, round benchmark number, the number by itself doesn’t mean much of anything. We still don’t know how much profit Cloudera made last year or, more likely, how big a loss it sustained. What we do know, however, is that it took in more revenue than bitter rivals [company]Hortonworks[/company] (which claimed $33.4 million through the first nine months of 2014 and will release its first official earnings report next week) and, probably, MapR (I’ve reached out to MapR about this and will update if I’m wrong). That said, Cloudera claims 525 customers are paying for its software (an 85 percent improvement over 2013), while MapR in December claimed more than 700 paying customers.

Cloudera also did about as much business as EMC-VMware spinoff Pivotal claims its big data business did in 2014. On Tuesday, Pivotal open sourced much of its Hadoop and database technology, and teamed up with Hortonworks and a bunch of software vendors large and small to form a new Hadoop alliance called the Open Data Platform. Cloudera’s Mike Olson, the company’s chief strategy officer and founding CEO, called the move, essentially, disingenuous and more an attempt to save Pivotal’s business than a real attempt to advance open source Hadoop software.

Hortonworks CEO Rob Bearden at Structure Data 2014.

All of this grandstanding and positioning is part of a quest to secure business in a Hadoop market that analysts predict will be worth billions in the years to come, and also an attempt by each company to prove to potential investors that its business model is the best. Hortonworks surprised a lot of people by going public in December, and the stock has remained stable since then (although its share price dropped more than two percent on Tuesday despite the news with Pivotal). Many people suspect Cloudera and MapR will go public this year, and Pivotal at some point as well.

This much action should make for an entertaining and informative Structure Data conference, which is now less than a month away. We’ll have the CEOs of Cloudera, Hortonworks and MapR all on stage talking about the business of big data, as well as the CEO of Apache Spark startup Databricks, which might prove to be a great partner for Hadoop vendors as well as a thorn in their sides. Big users, including Goldman Sachs, ESPN and Lockheed Martin, will also be talking about the technologies and objectives driving their big data efforts.

Pivotal open sources its Hadoop and Greenplum tech, and then some

Pivotal, the cloud computing and big data company that spun out from EMC and VMware in 2013, is open sourcing its entire portfolio of big data technologies and is teaming up with Hortonworks, IBM, GE, and several other companies on a Hadoop effort called the Open Data Platform.

Rumors about the fate of the company’s data business have been circulating since a round of layoffs began in November, but, according to Pivotal, the situation isn’t as dire as some initial reports suggested.

There is a lot of information coming out of the company about this, but here are the key parts:

  • Pivotal is still selling licenses and support for its Greenplum, HAWQ and GemFire database products, but it is also releasing the core code bases for those technologies as open source.
  • Pivotal is still offering its own Hadoop distribution, Pivotal HD, but has slowed development on core components of MapReduce, YARN, Ambari and the Hadoop Distributed File System. Those four pieces are the starting point for a new association called the Open Data Platform, which includes Pivotal, [company]GE[/company], [company]Hortonworks[/company], [company]IBM[/company], Infosys, SAS, Altiscale, [company]EMC[/company], [company]Verizon[/company] Enterprise Solutions, [company]VMware[/company], [company]Teradata[/company] and “a large international telecommunications firm,” and which promises to build its Hadoop technologies using a standard core of code.
  • Pivotal is working with Hortonworks to make Pivotal’s big data technologies run on the Hortonworks Data Platform, and eventually on the Open Data Platform core. Pivotal will continue offering enterprise support for Pivotal HD, although it will outsource to Hortonworks support requests involving the guts of Hadoop (e.g., MapReduce and HDFS).

Sunny Madra, vice president of the data and mobile product group at Pivotal, said the company has a relatively successful big data business already — $100 million overall, $40 million of which came from the Big Data Suite license bundle it announced last year — but suggested that it sees the writing on the wall. Open source software is a huge industry trend, and he thinks pushing against it is as fruitless as pushing against cloud computing several years ago.

“We’re starting to see open source pop up as an RFP within enterprises,” he said. “. . . If you’re picking software [today] . . . you’d look to open source.”

The Pivotal Big Data Suite.

Madra pointed to Pivotal’s revenue numbers as proof the company didn’t open source its software because no one wanted to pay for it. “We wouldn’t have a $100 million business . . . if we couldn’t sell this,” he said. Maybe, but maybe not: Hortonworks isn’t doing $100 million a year, but word was that Cloudera was doing it years ago (on Tuesday, Cloudera did claim more than $100 million in revenue in 2014). Depending how one defines “big data,” companies like Microsoft and Oracle are probably making much more money.

However, there were some layoffs late last year, which Madra attributed to consolidation of people, offices and efforts rather than a failing business. Pivotal wanted to close some global offices, bring the data and Cloud Foundry teams under the same leadership, and focus its development resources on its own intellectual property around Hadoop. “Do we really need a team going and testing our own distribution,” he asked, “troubleshooting it, certifying it against technologies and all that goes along with that?”

EMC first launched the Pivotal HD Hadoop distribution, as well as the HAWQ SQL-on-Hadoop engine, with much ado just over two years ago.

The deal with Hortonworks helps alleviate that engineering burden in the short term, and the Open Data Platform is supposed to help solve it over a longer period. Madra explained the goal of the organization as Linux-like, meaning that customers should be able to switch from one Hadoop distribution to the next and know the kernel will be the same, just like they do with the various flavors of the Linux operating system.

Mike Olson, Cloudera’s chief strategy officer and founding CEO, offered a harsh rebuttal to the Open Data Platform in a blog post on Tuesday, questioning the utility and politics of vendor-led consortia like this. He simultaneously praised Hortonworks for its commitment to open source Hadoop and bashed Pivotal on the same issue, but wrote, among other things, of the Open Data Platform: “The Pivotal and Hortonworks alliance, notwithstanding the marketing, is antithetical to the open source model and the Apache way.”

The Pivotal HD and HAWQ architecture; much of this has been open sourced or replaced.

As part of Pivotal’s Tuesday news, the company also announced additions to its Big Data Suite package, including the Redis key-value store, RabbitMQ messaging queue and Spring XD data pipeline framework, as well as the ability to run the various components on the company’s Cloud Foundry platform. Madra actually attributes a lot of Pivotal’s decision to open source its data technologies, as well as its execution, to the relative success the company has had with Cloud Foundry, which has always involved an open source foundation as well as a commercial offering.

“Had we not had the learnings that we had in Cloud Foundry, then I think it would have been a lot more challenging,” he said.

Whether or not one believes Pivotal’s spin on the situation, though, the company is right in realizing that it’s open source or bust in the big data space right now. They have different philosophies and strategies around it, but major Hadoop vendors Cloudera, Hortonworks and MapR are all largely focused on open-source technology. The most popular Hadoop-ecosystem technologies, including Spark, Storm and Kafka, are open source, as well. (CEOs, founders and creators from many of these companies and projects will be speaking at our Structure Data conference next month in New York.)

Pivotal might eventually sell billions of dollars worth of software licenses for its suite of big data products — there’s certainly a good story there if it can align the big data and Cloud Foundry businesses into a cohesive platform — but it probably has reached its plateau without having an open source story to tell.

Update: This post was updated at 12:22 p.m. PT to add information about Cloudera’s revenue.

No, you don’t need a ton of data to do deep learning

[soundcloud url="https://api.soundcloud.com/tracks/190680894" params="secret_token=s-lutIw&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false" width="100%" height="166" iframe="true" /]

There are a couple of seemingly contradictory memes rolling around the deep learning field. One is that you need a truly epic amount of data to do interesting work. The other is that in many subject areas there is a ton of data, but it’s not just lying around for data scientists to snarf up.

On this week’s Structure Show podcast, Enlitic Founder and CEO Jeremy Howard and Senior Data Scientist Ahna Girshick address those topics and more.

Girshick, who is our first guest to have worked with Philip Glass and Björk on music visualizations, said there are scads of MRIs, CAT scans and x-rays created, but once they’ve served their primary purpose (diagnosing your bum knee, say) they’re squirreled away in some PACS system, never to see the light of day again.

All of that data is useful for machine learning algorithms, or would be, if it were accessible, she said.

Ahna Girshick, Enlitic’s senior data scientist.

Girshick and Howard agreed that while deep learning — the process of a computer teaching itself how to solve a problem — gets better as the data set grows, there’s no reason to hold off and wait for more data to become available before working with it.

“While more data can be better, I think this is stopping people from trying to use big data,” Howard said. He cited a recent Kaggle competition on facial keypoint recognition that uses just 7,000 images, in which “the top algorithms are nearly perfectly accurate.”

The reason companies like Baidu and Google say you need mountains of data is that they have mountains of data available, he said. “I don’t think people should be put off trying to use deep learning just because they don’t have a lot of data.”
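For a sense of scale, the Kaggle facial keypoints dataset Howard alludes to is roughly 7,000 96x96 grayscale face images, each labeled with 30 keypoint coordinates. A small convolutional network, along the lines of this illustrative PyTorch sketch (not Enlitic’s code or the winning entries), is enough to start learning that mapping:

```python
# Minimal sketch of small-data deep learning: a compact CNN that regresses
# facial keypoints, assuming 96x96 grayscale inputs and 30 target coordinates
# (the format the Kaggle facial keypoints competition used).
import torch
import torch.nn as nn

class KeypointNet(nn.Module):
    def __init__(self, n_outputs: int = 30):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 96 -> 48
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 48 -> 24
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 24 -> 12
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 12 * 12, 256), nn.ReLU(),
            nn.Linear(256, n_outputs),  # (x, y) for each keypoint
        )

    def forward(self, x):
        return self.regressor(self.features(x))

# Toy training loop on random tensors standing in for the real dataset.
model = KeypointNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
images = torch.rand(64, 1, 96, 96)   # batch of grayscale faces
targets = torch.rand(64, 30)         # normalized keypoint coordinates
for _ in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(images), targets)
    loss.backward()
    optimizer.step()
```

The point isn’t this particular architecture; it’s that a few thousand labeled examples are already enough to train something useful.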

Enlitic is using deep learning to speed up medical diagnoses and provide better medical outcomes for millions of underserved people.

It’s a fascinating discussion so please check it out — Girshick will speak more on what Enlitic is doing at Structure Data next month.

And, if you want to hear what’s going on with Pivotal’s big data portfolio, Derrick Harris has the latest. Oh, and Microsoft is making a bold play for startups by ponying up $500K in Azure cloud credits, starting with the Y Combinator Winter 2015 class. That ups the ante pretty significantly compared to what [company]Amazon[/company] Web Services, [company]Google[/company] and [company]IBM[/company] offer. Your move, boys.

 

SHOW NOTES

Hosts: Barb Darrow and Derrick Harris.

Download This Episode

Subscribe in iTunes

The Structure Show RSS Feed

PREVIOUS EPISODES:

VMware wants all those cloud workloads “marooned” in AWS

Don’t like your cloud vendor? Wait a second.

Hilary Mason on taking big data from theory to reality

On the importance of building privacy into apps and Reddit AMAs

Cheap cloud + open source = a great time for startups 

 

Cloud Foundry Foundation names Ramji CEO

The Cloud Foundry Foundation, put in place last year to promote the open-source platform-as-a-service framework, now has new leadership. Sam Ramji, former VP of strategy at Apigee, is now CEO.

In the statement announcing the news, Ramji was painted as a neutral outsider to the Cloud Foundry effort:

Ramji’s absence of ties to any of the Foundation’s member companies underscores the community’s embrace of coopetition between major vendors to drive Cloud Foundry’s success.

When it comes to open source efforts like Linux, Eclipse and OpenStack, it’s important to demonstrate that no one vendor can big-foot the process.

The foundation was formed about a year ago by Cloud Foundry backer Pivotal with its federation partners [company]EMC[/company] and [company]VMware[/company], as well as [company]Rackspace[/company], [company]IBM[/company], [company]HP[/company], [company]CenturyLink[/company], ActiveState and [company]SAP[/company]. Notably absent is [company]Red Hat[/company], the Linux and OpenStack player that is pushing its own distinctly non-Cloud Foundryish OpenShift PaaS.

The offloading of governance to a foundation was meant to provide an “open governance” model and ensure that no one backer controlled the process. Then, in December, the group tapped the Linux Foundation to provide bread-and-butter PR, logistics and planning services. The release for this news, for example, was sent out by the Linux Foundation staff.

The foundation also named nine new members:

  • EMC – John Roese
  • HP – Bill Hilf
  • VMware – Ajay Patel
  • IBM – Christopher Ferris
  • SAP – Sanjay Patil
  • Intel – Nicholas Weaver
  • Pivotal – Rob Mee
  • Swisscom – Marco Hochstrasser
  • ActiveState – Bart Copeland

Exclusive: Pivotal CEO says open source Hadoop tech is coming

Pivotal, the cloud computing spinoff from EMC and VMware that launched in 2013, is preparing to blow up its big data business by open sourcing a whole lot of it.

Rumors of changes began circulating in November, after CRN reported that Pivotal was in the process of laying off about 60 people, many of whom worked on the big data products. The flames were stoked again on Friday by a report in VentureBeat claiming the company might cease development of its Hadoop distribution and/or open source various pieces of its database technology, such as Greenplum and HAWQ.

Gigaom has confirmed at least part of this is true, via an emailed statement credited to Pivotal CEO Paul Maritz:

“We are anticipating an interesting set of announcements on Feb 17th. However rumors to the effect that Pivotal is getting out of the Hadoop space are categorically false. The announcements, which involve multiple parties, will greatly add to the momentum in the Hadoop and Open Source space, and will have several dimensions that we believe customers will find very compelling.”

Those announcements will take place via webcast.

Paul Maritz at Structure Data 2014. (© Photo by Jakub Mosur).

Multiple external sources have told Gigaom that Pivotal does indeed plan to open source its Hadoop technology, and that it will work with former rival (but, more recently, partner) Hortonworks to maintain and develop it. IBM was also mentioned as a partner.

Members of the Hadoop team were let go around November when active development stopped, the sources said, and some senior big data personnel — including Senior Vice President of R&D Hugh Williams and Chief Scientist Milind Bhandarkar — departed the company in December, according to their LinkedIn profiles. Both of them claim to be working on new startup projects.

When EMC first introduced its Hadoop distribution, called Pivotal HD, in February 2013 (it was one of the technologies that Pivotal the company inherited), executive Scott Yara touted the size of EMC’s Hadoop engineering team and the quality of its technology over that of its smaller rivals Cloudera, MapR and Hortonworks. However, Pivotal has been getting noticeably more in touch with its open source side recently, including with the Hortonworks partnership referenced above (around the Apache Ambari project) and a big commitment to the open source Tachyon in-memory file system project.

The current Pivotal HD Enterprise architecture.

Pivotal has been a big proponent of the “data lake” strategy, whereby companies store all their data in a big Hadoop cluster and use various higher-level programs to access and analyze it. Last April, the company took a somewhat brave step toward ensuring its customers could do that by relaxing its product licensing and making Pivotal HD storage free.

Whatever happens with Pivotal’s technology, it’s not shocking that the company would decide to take the open source path. Its flagship technology is the open source Cloud Foundry cloud computing platform, and Cloudera, Hortonworks and MapR have cornered the market on Hadoop sales, by all accounts. If Pivotal has some good code in its base, it’s probably best to get it into the open source world and ride the momentum rather than try to fight against it.

For more on the fast-moving Hadoop space, be sure to attend our Structure Data conference March 18-19 in New York. We’ll have the CEOs of Cloudera, Hortonworks and MapR on stage to talk about the business, as well as Databricks CEO Ion Stoica discussing the Apache Spark project that is presently kicking parts of Hadoop into hyperdrive.

Cloud Foundry Foundation brings in reinforcements

The Cloud Foundry Foundation, established in February,  will get a new marquee name Tuesday with [company]Intel[/company] coming aboard as a Platinum member.

Other companies, including [company]Akamai[/company], [company]Fujitsu[/company], [company]Hortonworks[/company], [company]Mendix[/company] and [company]SAS[/company], also joined the effort to push Cloud Foundry as an open framework for cross-cloud Platform as a Service (PaaS).

If it achieves its goals, a business customer could “write an app once and run it in many places with lots of services,” said Leo Spiegel, SVP of strategy and corporate development at Pivotal, the company behind the commercial PivotalCF version of the PaaS. Those “many places” would include PivotalCF, [company]IBM[/company] Bluemix, [company]HP[/company] Helion or whatever iteration of Cloud Foundry the customer deems most appropriate. That would alleviate corporate concerns about platform or vendor lock-in, but it is also a tall order.

Another (unspoken) part of the foundation’s aim is to show that Cloud Foundry is not under the control of one vendor — Pivotal or its parent company [company]VMware[/company] — so bringing in these big names and their code contributions is a plus. But then again, all those vendors want to wring competitive advantage out of Cloud Foundry, which is at odds with providing customers an easy offramp to someone else’s version of the PaaS.

To expedite logistics, the Cloud Foundry leadership signed the Linux Foundation to provide staffing, legal assistance and help with events.

“Why recreate things that exist?” Spiegel asked. “We raised plus or minus $6 million a year — we want to use that cost-effectively and not hire separate legal and event teams that are already in place.”