What happens when too few databases become too many databases?

So here’s some irony for you: For years, Andy Palmer and his longtime startup partner Michael Stonebraker have pointed out that database software is not a one-size-fits-all proposition. Companies, they said, would be better off with a specialized database for certain tasks rather than using a general-purpose database for every job under the sun.

And what happened? Lots of specialized databases popped up, such as Vertica (which Stonebraker and Palmer built for data warehouse query applications, and which is now part of HP). There are read-oriented databases and write-oriented databases and relational databases and non-relational databases … blah, blah, blah.

The unintended consequence of that was the proliferation of new data silos, in addition to those already created by older databases and enterprise applications. And the existence of those new silos poses a next-generation data integration problem for people who want to create massive pools of data they can mine for those big data insights we keep hearing about.

Meet new-style ETL

In an interview, Palmer acknowledged that we’ve gone from one extreme — not enough database engines — to too many options, with customers getting increasingly confused. In that complexity, though, there is opportunity. Palmer’s latest startup with Stonebraker — called Tamr — as well as other young companies like ClearStory, Paxata and Trifacta are attacking the task of cleaning up data in a process traditionally called Extract, Transform, Load, or ETL.

Tamr combines machine learning smarts with human subject matter experts to create what Palmer calls a sort of self-teaching system. The startup is one of seven winners of our Structure Data Awards and will be on hand at the Structure Data event next month to discuss the new era of ETL, along with other trends in data.

The data sharing economy

As more companies share select information with supply chain and other trusted partners, ensuring that key data is clean will become more important. According to a new Accenture survey of 2,000 IT professionals, 35 percent of those surveyed said they’re already using partner APIs to integrate data and work with those partners, while another 38 percent said they plan to do so.

Per the survey:

One example is Home Depot, which is working with manufacturers to ensure that all of the connected home products it sells are compatible with the Wink connected home system – thereby creating its own connected home ecosystem and developing potential new services and unique experiences for Wink customers.

And 74 percent of those respondents said they are using or experimenting with new technologies that integrate data with digital business partners. Also from the Accenture report:

 “Rapid advances in cloud and mobility are not only eliminating the cost and technology barriers associated with such platforms, but opening up this new playing field to enterprises across industries and geographies.”

As the velocity and types of data flowing to and from applications increase, “old style careful ETL curation doesn’t work anymore but [the data] still needs to be cleansed and prepped,” said Gigaom Research Director Andrew Brust.

In other words, big data is big, no doubt. But in some cases, the old adage “Garbage In, Garbage Out” holds true even in the era of big data. If you really want the best insights out of the information you have, getting that data cleaned and spiffed up can be a very big deal.

 

Microsoft throws down the gauntlet in business intelligence

[company]Microsoft[/company] is not content to let Excel define the company’s reputation among the world’s data analysts. That’s the message the company sent on Tuesday when it announced that its PowerBI product is now free. According to a company executive, the move could expand Microsoft’s reach in the business intelligence space by 10 times.

If you’re familiar with PowerBI, you might understand why Microsoft is pitching this as such a big deal. It’s a self-service data analysis tool that’s based on natural language queries and advanced visualization options. It already offers live connections to a handful of popular cloud services, such as [company]Salesforce.com[/company], [company]Marketo[/company] and GitHub. It’s delivered as a cloud service, although there’s a downloadable tool that lets users work with data on their laptops and publish the reports to a cloud dashboard.

James Phillips, Microsoft’s general manager for business intelligence, said the company has already had tens of thousands of organizations sign up for PowerBI since it became available in February 2014, and that CEO Satya Nadella opens up a PowerBI dashboard every morning to track certain metrics.

A screenshot from a sample PowerBI dashboard.

And Microsoft is giving it away — well, most of it. The preview version of the cloud service now available is free and those features will remain free when it hits general availability status. At that point, however, there will also be a “pro” tier that costs $9.99 per user per month and features more storage, as well as more support for streaming data and collaboration.

But on the whole, Phillips said, “We are eliminating any piece of friction that we can possibly find [between PowerBI and potential users].”

This isn’t free software for the sake of free software, though. Nadella might be making a lot of celebrated, if not surprising, choices around open source software, but he’s not in the business of altruism. No, the rationale behind making PowerBI free almost certainly has something to do with stealing business away from Microsoft’s neighbor on the other side of Lake Washington, Seattle-based [company]Tableau Software[/company].

Phillips said the business intelligence market is presently in its third wave. The first wave was technical and database-centric. The second wave was about self service, defined first by Excel and, over the past few years, by Tableau’s eponymous software. The third wave, he said, takes self service a step further in terms of ease of use and all but eliminates the need for individual employees to track down IT before they can get something done.

The natural language interface, using funding data from Crunchbase.

IBM’s Watson Analytics service, Phillips said, is about the only other “third wave” product available. I recently spent some time experimenting with the Watson Analytics preview, and was fairly impressed. Based on a quick test run of a preview version of PowerBI, I would say each product has its advantages over the other.

But IBM — a relative non-entity in the world of self-service software — is not Microsoft’s target. Nor, presumably, is analytics newcomer Salesforce.com. All of these companies, as well as a handful of other vendors that exist to sell business intelligence software, want a piece of the self-service analytics market that Tableau currently owns. Tableau’s revenues have been skyrocketing for the past couple of years, and it’s on pace to hit a billion-dollar run rate in just over a year.

“I have never ever met a Tableau user who was not also a Microsoft Excel user,” Phillips said.

That might be true, but it also means Microsoft has been leaving money on the table by not offering anything akin to Tableau’s graphic interface and focus on visualizations. Presumably, it’s those Tableau users, and lots of other folks for whom Tableau (even its free Tableau Public version) is too complex, that Microsoft hopes it can reach with PowerBI. Tableau is trying to reach them, too.

“We think this really does 10x or more the size of the addressable business intelligence market,” Phillips said.

A former Microsoft executive told me that the company initially viewed Tableau as a partner and was careful not to cannibalize its business. Microsoft stuck to selling SharePoint and enterprise-wide SQL Server deals, while Tableau dealt in individual and departmental visualization deals. However, he noted, the new positioning of PowerBI does seem like a change in that strategy.

Analyzing data with more controls.

Ultimately, Microsoft’s vision is to use PowerBI as a gateway to other products within Microsoft’s data business, which Phillips characterized as the company’s fastest-growing segment. PowerBI can already connect to data sources such as Hadoop and SQL Server (and, in the case of the latter, can analyze data without transporting it), and eventually Microsoft wants to incorporate capabilities from its newly launched Azure Machine Learning service and the R statistical computing expertise it’s about to acquire, he said.

“I came to Microsoft largely because Satya convinced me that the company was all in behind data,” Phillips said. For every byte that customers store in a Microsoft product, he added, “we’ll help you wring … every drop of value out of that data.”

Joseph Sirosh, Microsoft’s corporate vice president for machine learning, will be speaking about this broader vision and the promise of easier-to-use machine learning at our Structure Data conference in March.

Microsoft CEO Satya Nadella.

Given all of its assets, it’s not too difficult to see how the new, Nadella-led Microsoft could become a leader in an emerging data market that spans such a wide range of infrastructure and application software. Reports surfaced earlier this week, in fact, that Microsoft is readying its internal big data system, Cosmos, to be offered as a cloud service. And selling more data products could help Microsoft compete with another Seattle-based rival — [company]Amazon[/company] Web Services — in a cloud computing business where the company has much more at stake than it does selling business intelligence software.

If it were just selling virtual servers and storage on its Azure platform, Microsoft would likely never sniff market leader AWS in terms of users or revenue. But having good data products in place will boost subscription revenues, which count toward the cloud bottom line, and could give users an excuse to rent infrastructure from Microsoft, too.

Update: This post was updated at 10:15 a.m. to include additional information from a former Microsoft employee.

Basho, creator of NoSQL Riak database, raises $25M

Basho, the company behind the Riak key-value database and Riak CS cloud-storage system, has raised a $25 million series G round of venture capital led by Georgetown Partners. The company has now raised nearly $60 million in a combination of equity and debt financing since it was founded in 2008.

Basho is among a handful of companies, including MongoDB, DataStax and Couchbase, that seem to have garnered some real traction in the NoSQL space over the past few years. Riak, its flagship open-source database, competes most directly against Cassandra, around which DataStax was built. Basho released its Riak CS storage system in 2012 to help users build distributed object stores a la Amazon Web Services’ S3 or OpenStack Swift.

Although it has raised much less capital than its NoSQL peers (MongoDB, for example, just announced an $80 million round on top of the $150 million it closed in October 2013) and had a major executive shakeup in 2014 — the company replaced both its CEO and CTO — Basho claims it’s doing just fine. In an interview on Monday, new CEO Adam Wray cited an 89 percent annual increase in bookings, tens of millions in annual revenue and accounts at some of the world’s largest companies.

Big data, the internet of things and hybrid cloud computing environments are driving many of Basho’s deployments, he added.

Assuming the market for non-relational databases keeps growing like many expect (“One day, we’ll be a $50 billion market space,” Wray said), there’s no reason it can’t support a handful of successful companies. Riak might never have the user base of MongoDB or the webscale reputation of Cassandra, but if the company can get its act together operationally and the technology remains solid, there should be plenty of business to go around.

And if a large software vendor starts going shopping for NoSQL software, Basho will likely have a much more palatable price tag than the other big-name options.

MongoDB CEO: Company was ‘opportunistic’ in raising $80M

According to MongoDB CEO Dev Ittycheria, it was unsolicited demand from investors that drove much of the company’s recent $80 million investment round, news of which broke on Friday afternoon.

In an interview with Gigaom, Ittycheria said that the company initially planned to raise a smaller sum of money to finance its acquisition of WiredTiger in December, but demand from adoring investors who caught wind of the raise was too much to resist. The company ended up raising about three times what it had planned to, and on very favorable terms, he said.

“We were very opportunistic,” Ittycheria said. MongoDB’s previous fundraising round had it valued at $1.2 billion, and it’s now worth even more.

The company will likely go public at some point but doesn’t want to — or have to — rush into it, he said. It still has plenty of mindshare and capital as a private company to choose its own timing. And there’s still some work to do proving its business model can succeed and scaling its global presence (MongoDB only has a handful of reps in both Europe and Asia, for example), Ittycheria noted.

That being said, MongoDB can probably, and safely, be optimistic about its prospects whenever it decides to go public.

Although the Hadoop and NoSQL markets are quite different, Ittycheria said he watched the recent Hortonworks IPO very closely because Hortonworks was the first of the next-generation, open-source data infrastructure vendors to test the public markets. While the company’s financials initially made him a little nervous, he said the fact that it was well-received speaks to demand from public investors to invest in “loosely, ‘big data’ companies.”

MemSQL open sources tool that helps move data into your database

Database startup MemSQL said today that it open sourced a new data transfer tool called MemSQL Loader that helps users haul vast quantities of data from sources like Amazon S3 and the Hadoop Distributed File System (HDFS) into either a MemSQL or MySQL database.

While moving data from one source to another may seem relatively straightforward, there are a lot of nuts and bolts in the process; if one thing goes awry, the whole endeavor can fail. For example, if you’re trying to move over thousands of files and one fails to transfer for some reason, you may have to start the process over again and hope all goes well, according to the MemSQL announcement.

MemSQL Loader is essentially an automation tool that lets users set up multiple transfers and queues that can restart “at a specific file in case of any import issues,” the release stated.

From the MemSQL blog post explaining the tool:
[blockquote person=”MemSQL” attribution=”MemSQL”]MemSQL Loader lets you load files from Amazon S3, the Hadoop Distributed File System (HDFS), and the local filesystem. You can specify all of the files you want to load with one command, and MemSQL Loader will take care of deduplicating files, parallelizing the workload, retrying files if they fail to load, and more.[/blockquote]
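For illustration only, here is a minimal sketch of the general pattern the tool automates — de-duplicating a file list, loading files in parallel and retrying failures — written in Python against a MySQL-wire-compatible target (MemSQL speaks the MySQL protocol). This is not MemSQL Loader’s actual code or command syntax; the table name, file paths and connection details are hypothetical.

```python
# Illustrative sketch only -- not the MemSQL Loader implementation or CLI.
# Assumes a MySQL-wire-compatible target (MemSQL speaks the MySQL protocol),
# the mysql-connector-python package, and a hypothetical `events` table.
import concurrent.futures
import glob

import mysql.connector

FILES = sorted(set(glob.glob("/data/exports/*.csv")))  # de-duplicate the file list
MAX_RETRIES = 3

def load_one(path):
    """Load a single CSV file into the target table."""
    conn = mysql.connector.connect(
        host="127.0.0.1", port=3306, user="root", password="",
        database="analytics", allow_local_infile=True,
    )
    try:
        cur = conn.cursor()
        # LOAD DATA cannot be parameterized, so the path is interpolated directly;
        # real code should validate or escape it first.
        cur.execute(
            "LOAD DATA LOCAL INFILE '%s' INTO TABLE events "
            "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'" % path
        )
        conn.commit()
    finally:
        conn.close()

def load_with_retries(path):
    """Retry a failed file a few times instead of restarting the whole job."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            load_one(path)
            return path, "ok"
        except mysql.connector.Error as exc:
            if attempt == MAX_RETRIES:
                return path, "failed after %d attempts: %s" % (attempt, exc)

# Parallelize across files so one slow or broken file doesn't stall the rest.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for path, status in pool.map(load_with_retries, FILES):
        print(path, status)
```

The point of the sketch is simply that queuing, parallelism and per-file retries are the tedious plumbing MemSQL Loader packages up for you.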

MemSQL in action

The new tool is available in open source through the MIT License and can be downloaded at GitHub.

MemSQL has been on a roll launching new tools and features since its 2012 inception. In September, Gigaom’s Derrick Harris reported that MemSQL now supports cross-data-center replication, which is good for disaster recovery in case a database takes a hit; cross-data-center replication also helps distribute the load across two data centers, which could cut down on latency and boost performance.

Amazon expands its NoSQL story with JSON support in DynamoDB

Amazon Web Services’ popular DynamoDB service now supports JSON documents, a capability that makes it more competitive against alternatives from Microsoft, Google and MongoDB. AWS also increased storage and throughput limits on the DynamoDB free tier, making the service that much more appealing.
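For a rough sense of what document support looks like from application code, here is a minimal sketch using boto3, the AWS SDK for Python, to store and read back a nested JSON-style item. The table name, key schema and attribute names are hypothetical; only the nested map and list values depend on the document data types the announcement describes.

```python
# Illustrative sketch only -- table name, key schema and attributes are hypothetical.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Users")  # assumes a table keyed on 'user_id'

# Nested dicts and lists are stored natively as DynamoDB Map and List types.
table.put_item(Item={
    "user_id": "u-123",
    "profile": {
        "name": "Ada",
        "interests": ["databases", "analytics"],
    },
})

# Read the document back; boto3 returns the nested structure as plain Python objects.
item = table.get_item(Key={"user_id": "u-123"})["Item"]
print(item["profile"]["interests"])
```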

Couchbase replaces its storage engine with homegrown ForestDB

Couchbase has built its own data store called ForestDB in order to boost the performance and efficiency of its family of NoSQL database offerings. ForestDB is open source and was designed with mobile devices and solid-state drives in mind.