From Storage to Data Virtualization

Do you remember Primary Data? Well, I loved the idea and the team, but it didn’t go very well for them. There are likely several reasons why, but in my opinion it boiled down to the fact that very few people like storage virtualization. I expressed my fondness for Primary Data’s technology several times in the past, but when it comes to changing the way complex, siloed storage environments are operated, you come across huge resistance at every level!
The good news is that Primary Data’s core team is back, with what looks like a smarter version of the original idea that can easily overcome the skepticism surrounding storage virtualization. In fact, they’ve moved beyond it, presenting a multi-cloud controller with data virtualization features. Ok, they call it “Data as a Service,” but I prefer Data Virtualization… and coming back to market with a product is a bold move.
Data Virtualization (What and Why)
I began this story by mentioning Primary Data because David Flynn (CEO of HammerSpace and former CTO of Primary Data) did not start this new HammerSpace venture from scratch. He bought the code that belonged to Primary Data and used it as the foundation of his new product. That allowed him and his team to get to market quickly, shipping the first version of HammerSpace in a matter of months instead of years.
HammerSpace is brilliant for one reason: it solves or, better, hides the problem of data gravity. Its Data-as-a-Service platform virtualizes data sets and presents views of them across a multi-cloud environment through standard protocols like NFS and S3.
Yes, at first glance it sounds like hot air and a bunch of buzzwords mixed together, but this is far from being the case here… watch the demo in the following video if you don’t trust me.
The solution is highly scalable and aimed at Big Data analytics and other performance-sensitive workloads that need data close to the compute resources quickly, without you having to think too much about how to move it, sync it, and keep it updated as business needs change.
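Because the views are exposed over standard protocols, applications don’t need a proprietary client to consume them. Here is a minimal sketch of what that could look like in practice; the endpoint, bucket name, and mount path below are made up for illustration and are not actual HammerSpace names. The same virtualized data set is listed over S3 and read over NFS:

```python
import boto3  # pip install boto3

# Hypothetical S3-compatible endpoint and bucket exposed by the data-virtualization layer.
s3 = boto3.client(
    "s3",
    endpoint_url="https://data-views.example.internal",
    aws_access_key_id="…",
    aws_secret_access_key="…",
)

# List the virtualized view over S3...
for obj in s3.list_objects_v2(Bucket="analytics-view").get("Contents", []):
    print(obj["Key"], obj["Size"])

# ...and read the same data over NFS, assuming the view is mounted locally.
with open("/mnt/views/analytics/events/part-0001.parquet", "rb") as f:
    print(f.read(4))
```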
HammerSpace solutions have several benefits but the top two on my list are:

  • The minimization of egress costs: This is a common problem for those working in multi-cloud environments today. With HammerSpace, only necessary data is moved where it is really needed.
  • Reduced latency: It’s crazy to have an application running on a cloud that is far from where you have your data. To give an example, the other day I was writing about Oracle Cloud and how good it is at creating high-speed bare-metal instances at a reasonable cost. This benefit can easily be lost if your data is created and stored in another cloud.

The Magic of Data Virtualization
I won’t go through the architectural and technical details, since there are videos and documentation on HammerSpace’s website that address them (here and here). Instead, I want to mention one of the features I like the most: the ability to query the metadata of your data volumes. These volumes can be anywhere, including on premises, and you get the result in the form of a new volume that is then kept in sync with the original data. Everything you do to data and metadata is quickly reflected on the child volumes. Isn’t it magic?
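To make the idea concrete without pretending to know HammerSpace’s internals, here is a toy version of the concept in plain Python: build a “child view” of a volume by selecting files whose metadata matches a query, and reference the originals instead of copying them. The paths and thresholds are invented, and a refresh here is simply a re-run of the query, whereas the real product keeps the view in sync automatically.

```python
import time
from pathlib import Path

# Toy illustration of a metadata-driven view; nothing to do with HammerSpace's
# actual implementation. The paths are invented.
SOURCE = Path("/mnt/source-volume")
VIEW = Path("/mnt/views/recent-large-files")

def refresh_view(min_size=1 << 20, max_age_days=30):
    VIEW.mkdir(parents=True, exist_ok=True)
    cutoff = time.time() - max_age_days * 86400
    for path in SOURCE.rglob("*"):
        if not path.is_file():
            continue
        st = path.stat()
        if st.st_size >= min_size and st.st_mtime >= cutoff:
            link = VIEW / path.name
            if not link.exists():
                link.symlink_to(path)  # reference the data, don't move it

refresh_view()
```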
What I liked the least, even though I understand the technical difficulties in implementing it, is that this process is one-way when a local NAS is involved… meaning that it is only a source of data and can’t be synced back from the cloud. There is a workaround, however, and it might be solved in future releases of the product.
Closing the Circle
HammerSpace exited stealth mode only a few days ago. I’m sure that by digging deeper into the product, flaws and limitations will be found. It is also true that the more advanced features are still only sketched out on paper. But I can easily get excited by innovative technologies like this one, and I’m confident that these issues will be fixed over time. I’ve been keeping an eye on multi-cloud storage solutions for a while, and now I’ve added HammerSpace to my list.
Multi-cloud data controllers and data virtualization are the focus of an upcoming report I’m writing for GigaOm Research. If you are interested in finding out more about how data storage is evolving in the cloud era, subscribe to GigaOm Research for Future-Forward Advice on Data-Driven Technologies, Operations, and Business Strategies.

Pinterest is experimenting with MemSQL for real-time data analytics

In a blog post on Wednesday, Pinterest shed more light on how the social scrapbook and visual discovery service analyzes data in real time, revealing details about how it’s exploring a combination of MemSQL and Spark Streaming to improve the process.

Currently, Pinterest uses a custom-built log-collecting agent dubbed Singer that the company attaches to all of its application servers. Singer collects the application log files and, with the help of the real-time messaging framework Apache Kafka, transfers that data to Storm, Spark, and other “custom built log readers” that “process these events in real-time.”

Pinterest also uses its own log-persistence service called Secor to read that log data moving through Kafka and then write it to Amazon S3, after which Pinterest’s “self-serve big data platform loads the data from S3 into many different Hadoop clusters for batch processing,” the blog post stated.
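To make the shape of that log-persistence step concrete, here is a rough sketch of a consume-and-batch-to-S3 loop using kafka-python and boto3. It is not Secor’s code; the topic, bucket, and batch size are invented for illustration.

```python
import boto3
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "app_logs",                      # hypothetical topic name
    bootstrap_servers=["kafka:9092"],
    group_id="log-persistence",
    auto_offset_reset="earliest",
)
s3 = boto3.client("s3")

batch, batch_no = [], 0
for message in consumer:
    batch.append(message.value)      # raw log line as bytes
    if len(batch) >= 10_000:
        key = f"raw-logs/batch-{batch_no:08d}.log"
        s3.put_object(Bucket="log-archive", Key=key, Body=b"\n".join(batch))
        batch, batch_no = [], batch_no + 1
```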

Although this current system seems to be working decently for Pinterest, the company is also exploring how it can use MemSQL to help when people need to query the data in real time. So far, the Pinterest team has developed a prototype of a real-time data pipeline that uses Spark Streaming to pass data into MemSQL.

Here’s what this prototype looks like:

Pinterest real-time analytics


In this prototype, Pinterest can use Spark Streaming to pass the data related to each pin (along with geolocation information and the category the pin belongs to) to MemSQL, where the data is then available to be queried.
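Pinterest hasn’t published the prototype’s code, so the following is only a sketch of what a Spark Streaming job writing pin events into MemSQL might look like. The topic, table, and column names are assumptions; MemSQL speaks the MySQL wire protocol, which is why a standard MySQL driver (pymysql here) is enough on the write side.

```python
import json
import pymysql  # MemSQL is MySQL wire-compatible
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Spark 1.x/2.x streaming API

sc = SparkContext(appName="pin-events-to-memsql")
ssc = StreamingContext(sc, batchDuration=5)

# Hypothetical topic and broker list; each message is assumed to be a JSON pin event.
stream = KafkaUtils.createDirectStream(
    ssc, ["pin_events"], {"metadata.broker.list": "kafka:9092"}
)
events = stream.map(lambda kv: json.loads(kv[1]))

def write_partition(rows):
    conn = pymysql.connect(host="memsql", user="app", password="…", db="analytics")
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO pin_events (pin_id, latitude, longitude, category, created_at) "
            "VALUES (%s, %s, %s, %s, NOW())",
            [(r["pin_id"], r["lat"], r["lon"], r["category"]) for r in rows],
        )
    conn.commit()
    conn.close()

events.foreachRDD(lambda rdd: rdd.foreachPartition(write_partition))
ssc.start()
ssc.awaitTermination()
```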

For analysts who understand SQL, the prototype could be useful as a way to analyze data in real time using a mainstream language.
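Assuming a table like the hypothetical pin_events sketched above, the real-time query itself is plain SQL, for example a five-minute leaderboard of the most-pinned categories:

```python
import pymysql

conn = pymysql.connect(host="memsql", user="analyst", password="…", db="analytics")
with conn.cursor() as cur:
    cur.execute(
        "SELECT category, COUNT(*) AS pins "
        "FROM pin_events "
        "WHERE created_at > NOW() - INTERVAL 5 MINUTE "
        "GROUP BY category ORDER BY pins DESC LIMIT 10"
    )
    for category, pins in cur.fetchall():
        print(category, pins)
conn.close()
```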

The rise of self-service analytics, in 3 charts

I’m trying really hard to write less about business intelligence and analytics software. We get it: Data is important to businesses, and the easier you can make it for people to analyze it, the more they’ll use your software to do it. What more is there to say?

But every time I see Tableau Software’s earnings reports, I’m struck by how big a shift the business intelligence market is undergoing right now. In the fourth quarter, Tableau grew its revenue 75 percent year over year. People and departments are lining up to buy what’s often called self-service analytics software — that is, applications so easy that even lay business users can work with them without much training — and they’re doing it at the expense of incumbent software vendors.

Some analysts and market insiders will say the new breed of BI vendors is more about easy “data discovery” and that their products lack the governance and administrative controls of incumbent products. That’s like saying Taylor Swift is very cute and very good at making music people like, but she’s not as serious as Alanis Morissette or as artistic as Björk. Those things can come in time; meanwhile, I’d rather be T-Swift, raking in millions and looking to do it for some time to come.

[dataset id=”914729″]

Above is a quick comparison of annual revenue for three companies, the only three “leaders” in Gartner’s 2014 Magic Quadrant for Business Intelligence and Analytics Platforms (available in the above hyperlink) that are both publicly traded and focused solely on BI. Guess which two fall into the next-generation, self-service camp and are also Gartner’s two highest-ranked. Guess which one is often credited with reimagining the data-analysis experience and making a product people legitimately like using.

[dataset id=”914747″]

Narrowing it to just last year, Tableau’s revenue grew 92 percent between the first and fourth quarters, while Qlik’s grew 65 percent. MicroStrategy stayed relatively flat and is trending downward; its fourth quarter was actually down year over year.

[dataset id=”914758″]

And what does Wall Street think about what’s happening? [company]Tableau[/company] has the least revenue for now, but probably not for much longer, and has a market cap greater than that of [company]Qlik[/company] and [company]MicroStrategy[/company] combined.

Here are a few more data points that show how impressive Tableau’s ongoing coup really is. Tibco Software, another Gartner leader and formerly a public company, recently sold to private equity firm Vista for $4.2 billion after disappointing shareholders with weak sales. Hitachi Data Systems is buying Pentaho, a BI vendor hanging just outside the border of Gartner’s “leader” category, for just more than $500 million, I’m told.

A screenshot from a sample PowerBI dashboard.


It’s worth noting, though, that Tableau isn’t guaranteed anything. As we speak, startups such as Platfora, ClearStory and SiSense are trying to match or outdo Tableau on simplicity while adding their own new features elsewhere. The multi-billion-dollar players are also stepping up their games in this space. [company]Microsoft[/company] and [company]IBM[/company] recently launched the natural-language-based PowerBI and Watson Analytics services, which Microsoft says represent the third wave of BI software (Tableau is in the second wave, by its assessment), and [company]Salesforce.com[/company] has invested a lot of resources to make its BI foray.

Whatever you want to call it — data discovery, self-service analytics, business intelligence — we’ll be talking more about it at our Structure Data conference next month. Speakers include Tableau Vice President of Analytics (and R&D leader) Jock Mackinlay, as well as Microsoft Corporate Vice President of Machine Learning Joseph Sirosh, who’ll be discussing self-service machine learning.

Can the legal system be Moneyballed?

A Florida-based AI firm has released some very interesting revelations about lawyers and their rates of prevailing in lawsuits. The firm, Premonition, has determined that lawyers — even those on ‘top lawyer lists’, or considered by peers to be among the best — generally have average results. The company also discovered that lawyers and law firms don’t actually keep track of win/loss data: they mostly track fees and billing proportions. One indicative statistic kind of jumps out as a condemnation of the industry:

In a study of the United Kingdom Court of Appeal, the firm found a slight negative correlation of -0.1 between win rates and re-hiring rates; i.e., a barrister 20% better than their peers was actually 2% less likely to be re-hired!

The industry is so self-blind that it actually penalizes lawyers who perform better.

Toby Unwin, the co-founder of Premonition, says that the only thing that correlates with win rate in court cases is a history of winning. ‘The only item that affects the likely outcome of a case is the attorney’s prior win rate, preferably for that case type before that judge,’ he stated in the company’s press release.

The Premonition system is a web crawler that applies AI to scrape court results so that win/loss data for lawyers can be analyzed as a function of many factors, such as the relationship between the lawyer and the judge. This is like the example in Moneyball — the story of Billy Beane’s Oakland A’s, where the team analyzed how batters fared against specific pitchers, or how frequently they got on base in general — except Premonition is analyzing how lawyers do in front of specific judges and in specific kinds of litigation.
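Premonition’s crawler and models are proprietary, but the Moneyball-style slicing described above essentially amounts to computing win rates per attorney, judge, and case type. A toy version over a hypothetical table of scraped outcomes (column names invented) might look like this:

```python
import pandas as pd

# Expected columns: attorney, judge, case_type, won (1 = prevailed, 0 = lost).
cases = pd.read_csv("case_outcomes.csv")

win_rates = (
    cases.groupby(["attorney", "judge", "case_type"])["won"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "win_rate", "count": "n_cases"})
    .reset_index()
)

# Only trust rates backed by a reasonable number of cases.
reliable = win_rates[win_rates["n_cases"] >= 10]
print(reliable.sort_values("win_rate", ascending=False).head(20))
```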

Most interesting is that Premonition states that they have found no correlation between win rate and billing rate. The best results appear to come from small firms and soloists that they call ‘strip mall superstars’.

I make no claim about this technology, but what I find interesting is the lack of transparency that makes it basically impossible for the average person or business to make an informed judgment about which litigator to hire for a court case. It seems obvious that AI/analytics tools like Premonition should be of immense value, and that the marketplace for such tools should lead to a Moneyball-like disruption in the field of law. This is also a condemnation of the larger law firms, which apparently aren’t tracking their lawyers in a sensible way relative to what matters to their clients.

 

Microsoft throws down the gauntlet in business intelligence

[company]Microsoft[/company] is not content to let Excel define the company’s reputation among the world’s data analysts. That’s the message the company sent on Tuesday when it announced that its PowerBI product is now free. According to a company executive, the move could expand Microsoft’s reach in the business intelligence space by 10 times.

If you’re familiar with PowerBI, you might understand why Microsoft is pitching this as such a big deal. It’s a self-service data analysis tool that’s based on natural language queries and advanced visualization options. It already offers live connections to a handful of popular cloud services, such as [company]Salesforce.com[/company], [company]Marketo[/company] and GitHub. It’s delivered as a cloud service, although there’s a downloadable tool that lets users work with data on their laptops and publish the reports to a cloud dashboard.

James Phillips, Microsoft’s general manager for business intelligence, said the company has already had tens of thousands of organizations sign up for PowerBI since it became available in February 2014, and that CEO Satya Nadella opens up a PowerBI dashboard every morning to track certain metrics.

A screenshot from a sample PowerBI dashboard.


And Microsoft is giving it away — well, most of it. The preview version of the cloud service now available is free and those features will remain free when it hits general availability status. At that point, however, there will also be a “pro” tier that costs $9.99 per user per month and features more storage, as well as more support for streaming data and collaboration.

But on the whole, Phillips said, “We are eliminating any piece of friction that we can possibly find [between PowerBI and potential users].”

This isn’t free software for the sake of free software, though. Nadella might be making a lot of celebrated, if not surprising, choices around open source software, but he’s not in the business of altruism. No, the rationale behind making PowerBI free almost certainly has something to do with stealing business away from Microsoft’s neighbor on the other side of Lake Washington, Seattle-based [company]Tableau Software[/company].

Phillips said the business intelligence market is presently in its third wave. The first wave was technical and database-centric. The second wave was about self service, defined first by Excel and, over the past few years, by Tableau’s eponymous software. The third wave, he said, takes self service a step further in terms of ease of use and all but eliminates the need for individual employees to track down IT before they can get something done.

The natural language interface, using funding data from Crunchbase.


IBM’s Watson Analytics service, Phillips said, is about the only other “third wave” product available. I recently spent some time experimenting with the Watson Analytics preview and was fairly impressed. Based on a quick test run of a preview version of PowerBI, I would say each product has its advantages over the other.

But IBM — a relative non-entity in the world of self-service software — is not Microsoft’s target. Nor, presumably, is analytics newcomer Salesforce.com. All of these companies, as well as a handful of other vendors that exist to sell business intelligence software, want a piece of the self-service analytics market that Tableau currently owns. Tableau’s revenues have been skyrocketing for the past couple years, and it’s on pace to hit a billion-dollar run rate in just over a year.

“I have never ever met a Tableau user who was not also a Microsoft Excel user,” Phillips said.

That might be true, but it also means Microsoft has been leaving money on the table by not offering anything akin to Tableau’s graphic interface and focus on visualizations. Presumably, it’s those Tableau users, and lots of other folks for whom Tableau (even its free Tableau Public version) is too complex, that Microsoft hopes it can reach with PowerBI. Tableau is trying to reach them, too.

“We think this really does 10x or more the size of the addressable business intelligence market,” Phillips said.

A former Microsoft executive told me that the company initially viewed Tableau as a partner and was careful not to cannibalize its business. Microsoft stuck to selling SharePoint and enterprise-wide SQL Server deals, while Tableau dealt in individual and departmental visualization deals. However, he noted, the new positioning of PowerBI does seem like a change in that strategy.

Analyzing data with more controls.


Ultimately, Microsoft’s vision is to use PowerBI as a gateway to other products within Microsoft’s data business, which Phillips characterized as the company’s fastest-growing segment. PowerBI can already connect to data sources such as Hadoop and SQL Server (and, in the case of the latter, can analyze data without transporting it), and eventually Microsoft wants to incorporate capabilities from its newly launched Azure Machine Learning service and the R statistical computing expertise it’s about to acquire, he said.

“I came to Microsoft largely because Satya convinced me that the company was all in behind data,” Phillips said. For every byte that customers store in a Microsoft product, he added, “we’ll help you wring … every drop of value out of that data.”

Joseph Sirosh, Microsoft’s corporate vice president for machine learning, will be speaking about this broader vision and the promise of easier-to-use machine learning at our Structure Data conference in March.

Microsoft CEO Satya Nadella.


Given all of its assets, it’s not too difficult to see how the new, Nadella-led Microsoft could become a leader in an emerging data market that spans such a wide range of infrastructure and application software. Reports surfaced earlier this week, in fact, that Microsoft is readying its internal big data system, Cosmos, to be offered as a cloud service. And selling more data products could help Microsoft compete with another Seattle-based rival — [company]Amazon[/company] Web Services — in a cloud computing business where the company has much more at stake than it does selling business intelligence software.

If it were just selling virtual servers and storage on its Azure platform, Microsoft would likely never sniff market leader AWS in terms of users or revenue. But having good data products in place will boost subscription revenues, which count toward the cloud bottom line, and could give users an excuse to rent infrastructure from Microsoft, too.

Update: This post was updated at 10:15 a.m. to include additional information from a former Microsoft employee.

Hands on with Watson Analytics: Pretty useful when it’s working

Last month, [company]IBM[/company] made available the beta version of its Watson Analytics data analysis service, an offering first announced in September. It’s one of IBM’s few recent forays into anything resembling consumer software, and it’s supposed to make it easy for anyone to analyze data, relying on natural language processing (thus the Watson branding) to drive the query experience.

When the servers running Watson Analytics are working, it actually delivers on that goal.

Analytic power to the people

Because I was impressed that IBM decided to launch a cloud service using the freemium business model — and carrying the Watson branding, no less — I wanted to see firsthand how well Watson Analytics works. So I uploaded a CSV file containing Crunchbase data on all companies categorized as “big data,” and I got to work.

Seems like a good starting point.

watson14

Choose one and get results. The little icon in the bottom left corner makes it easy to change chart type. Notice the various insights included in the bar at the top. Some are more useful than others.

watson15

But which companies have raised the most money? Cloudera, by a long shot.

watson18

I know Cloudera had a huge investment round in 2014. I wonder how that skews the results for 2014, so I filter it out.

watsonlast

And, voila! For what it’s worth, Cloudera also skews funding totals however you sort them — by year founded, city, month of funding, you name it.

watsonlast2

Watson Analytics also includes tools for building dashboards and for predictive analysis. The latter could be particularly useful, although that might depend on the dataset. I analyzed the Crunchbase data to try to determine which factors are most predictive of a company’s operating status (whether it has shut down, has been acquired or is still running), and the results were pretty obvious (if you can’t read the image, it lists “last funding” as a big predictor).

watsonpredict3
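For the curious, this kind of analysis can be roughly reproduced with scikit-learn on the same sort of Crunchbase export. The column names below are assumptions about what such a CSV contains, not a description of what Watson Analytics actually does under the hood:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("crunchbase_big_data.csv")  # hypothetical export

# Assumed feature columns; "status" is the operating/acquired/closed label.
features = pd.get_dummies(
    df[["funding_total_usd", "funding_rounds", "founded_year", "last_funding_year", "city"]],
    columns=["city"],
).fillna(0)
target = df["status"].fillna("unknown")

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(features, target)

# Which columns carry the most predictive weight?
importance = pd.Series(model.feature_importances_, index=features.columns)
print(importance.sort_values(ascending=False).head(10))
```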

If I have one big complaint about Watson Analytics, it’s that it’s still a bit buggy — the tool to download charts as images doesn’t seem to work, for example, and I had to reload multiple pages because of server errors. I’d be pretty upset if I were using the paid version, which allows for more storage and larger files, and experienced the same issues. Adding variables to a view without starting over could be easier, too.

Regarding the cloud connection, I rather like what [company]Tableau[/company] did with its public version by pairing a locally hosted application with cloud-based storage. If you’re not going to ensure a consistent backend, it seems better to guarantee some level of performance by relying on the user’s machine.

All in all, though, Watson Analytics seems like a good start to a mass-market analytics service. The natural language aspect makes it at least as intuitive as other services I’ve used (a list that includes DataHero, Tableau Public and Google Fusion Tables, among others), and it’s easy enough to run and visualize simple analyses. But Watson Analytics plays in a crowded space that includes the aforementioned products, as well as Microsoft Excel and PowerBI, and Salesforce Wave.

If IBM can work out some of the kinks and add some more business-friendly features — such as the upcoming abilities to refine datasets and connect to data sources — it could be onto something. Depending on how demand for mass-market analytics tools shapes up, there could be plenty of business to go around for everyone, or a couple companies that master the user experience could own the space.

Can’t find your market research data? This startup can help

That Diet Coke in your hand didn’t invent itself. It’s the result of years of focus groups, surveys and data crunching—more commonly known as market research. Companies that produce consumer packaged goods (CPGs) spend millions each year on market research, and it’s both a blessing and a curse. It’s a boon to planning new products and tweaking existing ones, but when it comes to finding individual pieces of information, that’s where the cursing comes in. A startup out of Chicago called KnowledgeHound thinks it has the answer.

The majority of product failures can be attributed to a brand not knowing or understanding its audience. When you’re in the business of selling to consumers, getting inside their heads — and then using that information effectively — can make or break you. At most companies, that information lies undiscovered in spreadsheets, survey responses, and charts that were never parsed. As KnowledgeHound CEO Kristi Zuhlke put it, companies spend all that cash on studies, then “stash the results on a hard drive, and promptly experience ‘corporate amnesia’.”

Zuhlke experienced this firsthand in her time at Procter & Gamble working in consumer insights. “As soon as we’d get the info in, a month later we’d forget we had it. There wasn’t a central, easily navigated place to access.” Building on her experiences in the consumer packaged goods industry, Zuhlke founded KnowledgeHound in 2012 and went from paper to product in four months. Her first customer? Procter & Gamble.

KnowledgeHound calls itself “the Google of market research,” and its technology enables Fortune 500 companies to use, re-use, and recycle the consumer and market knowledge on which they spend an average of $60,000 per study. Instead of living on individual hard drives, which makes it unsearchable to the company at large, the data and research from each study is imported into the KnowledgeHound database, then augmented with a custom search engine and visualization tools. Results are modeled on a right-brain/left-brain approach when presented to the user, with research summaries representing the right and data points the left.

 

kh shot1

 

“Where we’re really different is the left side of the brain, the data points,” Zuhlke said. The search engine looks through questions as they were asked in the study, takes the raw data, and transforms it into graphs and charts in milliseconds.

 

kh shot2
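KnowledgeHound hasn’t published how its engine works, but the flow Zuhlke describes (match the question text as asked in the study, then chart the raw answers) is easy to picture. Here is a bare-bones illustration with invented file and column names:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed layout: one row per answer, with the original question text attached.
responses = pd.read_csv("survey_responses.csv")  # study_id, question_text, respondent_id, answer

def search_and_chart(keyword):
    # Find every survey question mentioning the keyword, then chart its raw answers.
    hits = responses[responses["question_text"].str.contains(keyword, case=False, na=False)]
    for question, group in hits.groupby("question_text"):
        group["answer"].value_counts().plot(kind="bar", title=question)
        plt.tight_layout()
        plt.show()

search_and_chart("packaging")
```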

KnowledgeHound’s product puts it in competition with vendors across a variety of markets—data visualization tools, business intelligence software, and enterprise search. Entering just one of these would be a tall order. The bulk of competitors provide general document searching that can be used for content, but few focus exclusively on the area of market research and its emphasis on data. One company that does jump out as a strong competitor is InfoTools out of New Zealand, which focuses so singularly on market research tools that it created awards around them (DIVAs: Data Insight Visualization Awards).

Otherwise, the choice of tools to parse a company’s market research is all over the map. There is a thriving DIY community around visualization that focuses more on the research agencies that produce the studies. This puts the onus on the producer of the data to present findings in the way it thinks the client will need them, a short-sighted solution that likely comes up short once it makes its way to the client. And occasionally you’ll find a CPG company with a corporate librarian to help employees find the data they’re seeking. But those are rare.

Angel-funded but also generating revenue, KnowledgeHound aims to hire 10-12 more employees in 2015 to help it add to its customer base of Fortune 500 and privately held billion-dollar companies, in addition to moving into the medical research space and launching new technologies in the next 6-9 months.

Market research around CPGs is not going to be the next big thing in Silicon Valley. It’s wonky, niche-y, and involves a lot of dry data. But its target market is one of the largest industries in North America, valued at approximately $2 trillion. KnowledgeHound seems poised to take the lead, if it can attract the talent and resources needed to keep billion-dollar brand giants happy.

Tableau CEO says the company’s biggest challenge now is talent

In an interview with Gigaom on Thursday, Tableau Software CEO Christian Chabot spoke about the company’s $100 million third quarter and the challenges it faces as it continues to grow. He doesn’t expect Salesforce.com’s new analytics service to be one of them.

AppDynamics launches new real-time analytics service

Application-performance monitoring startup AppDynamics now has a real-time analytics service called Application Analytics, it said Monday. The service lets organizations analyze all of their application and transactional data so that IT staff can spot errors and management can make better business decisions. In July, the startup landed $120 million in funding, highlighting the hot application-monitoring market in which companies like New Relic and AppNeta compete. New Relic recently unveiled its own real-time analytics service, called Insights.