In this week’s podcast we tackle some big issues, such as whether we want to put in the effort to train the anticipatory home and if the internet of things needs an OS.
There was plenty of news about Spark’s ascension in the big data ranks this week, as well as some speculation. According to Cloudera’s Mike Olson, his company is embracing Spark broadly — including as an engine for running Hive — but not as a replacement for Impala.
Big data startup Databricks keeps humming along, announcing on Monday a large round of venture capital and a new cloud service that aims to seed adoption of Spark — a framework it says is faster, easier and more versatile than other options.
Apache Spark may push MapReduce to the back burner faster than some people would like, but it will also boost the overall Hadoop ecosystem. The project’s co-creator, Matei Zaharia, explains why Spark is so popular now and where it fits into the big data landscape.
Matei Zaharia is a big Spark booster. He helped build the project into the force it’s become in big data analytics and is CTO of Databricks.
Analytics is all the rage, with Hadoop and big data leading the hype. But the new technologies are not yet mature, a market full of startups has yet to shake out, and most enterprises have a mishmash of solutions, from legacy warehouses to marketing-department SaaS subscriptions and early-stage development projects.
I talked this week with Andrew Brust, research director for big data and analytics at Gigaom Research, about the state of the market and his recommendations for CIOs grappling with the optimal deployment of the technology. Among the management-oriented suggestions Andrew offered are the following key points:
- Don’t overregulate all use of analytics. Andrew likens today’s acquisition of SaaS data sources throughout the enterprise to the adoption of PCs in the 1980s. There was pent-up demand for ready access to computing power that traditional IT wasn’t providing. Eventually PCs became so pervasive and so central to the business that centralized management was required and became the norm. During those Wild West early years of PC use, however, companies gained immediate value, and sometimes an early-adopter advantage; through trial and error, many of those renegade PC users also discovered and honed valuable new applications that their organizations later adopted more broadly. Andrew sees the same dynamic at work today, and he expects departmental experimentation will likewise lead to valuable new applications of analytics and big, streaming data flows.
- Leave room for ad hoc data use. Just as in the early days of end users exploiting VisiCalc, Lotus 1-2-3, and Microsoft Excel, many applications of new analytics are one-off or experimental in nature. They are uniquely suited to the needs of an individual employee or a small workgroup, and are best developed by individual workers who aren’t encumbered by heavy, unnecessary data-governance requirements. This casual use of analytics is fundamental to a healthy organization. A number of new self-service tools are already making this level of casual analysis viable, and as the technology matures, more nontechnical users will gain access to an entirely new level of data and analysis. That will be a good thing.
- Recognize the threshold for when more data governance is required. Undoubtedly, there are data governance requirements for sensitive data and data that must be integrated for broader use within the organization. And, as one recent study points out, a CEO-led interest in the innovative use of analytics is correlated with the greater use of the capability throughout an enterprise. Andrew says there is a recognizable gradation and threshold as to when informal data use needs to be regulated: “you’ll know it when you see it”. IT organizations must be proactive in identifying and handling such situations, although Andrew is skeptical of such heavy-handed techniques as naming a Chief Data Officer to oversee all data use.
- The best metrics often bubble up democratically rather than being imposed from above. A corollary to allowing casual data use throughout the organization is that individuals and small departments often know best how to do their jobs. In some ways, they are therefore also the ones who best know how to measure their efforts. Although companies that measure performance get better performance, as a recent Zendesk analysis of customer service confirmed, imposing too many metrics from above risks stifling and limiting individual contributions. Andrew points to another historical trend, the rise and fall of the ‘balanced scorecard’, as an example of the impracticality of imposing top-down metrics on too many aspects of a company’s operations. This is therefore an area where allowing bottom-up data experimentation can lead to better organizational practices. The best individual data findings are often adopted at the workgroup or departmental level, and some of the very best of those may percolate up for corporate-wide use. Although line-of-business managers may be best at identifying these more broadly applicable uses of data, Andrew points out that IT departments can sometimes spot them too, based on patterns of data use that can be tracked within an analytics system.
- Know your organization’s appetite for experimental technology. Andrew notes that the Hadoop environment is maturing rapidly, but we are still in the technology’s early days. Against open source’s promise as a defense against vendor lock-in stands the usual trade-off: proprietary, or at least vendor-dependent, enhancements that provide greater functionality than the purely open standard. The immediate payoff of those vendor enhancements may justify the risk that a given solution does not survive in the longer term. Still, Andrew offers a couple of suggestions for IT departments wary of going down that path. Apache Hive provides a widely used SQL-like language that runs on Hadoop; it offers only traditional batch queries, but may match the existing skill set in some organizations. Apache Spark is another open-source enhancement to Hadoop that provides in-memory analysis, appropriate for some applications (e.g., market analytics) but not all. Spark is being widely adopted by leading Hadoop vendors (e.g., Cloudera, Hortonworks), and so offers a degree of safety in a fragmented market. Finally, enterprises with large data-warehouse operations that hesitate to inch too far out on the early-adopter limb can probably turn to their data-warehouse vendors for Hadoop tools, rather than opting for more advanced capabilities from less-established startup suppliers.
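To make the batch-versus-in-memory distinction above concrete, here is a minimal sketch in plain Python. It is an illustration only, not the Hive or Spark APIs: the `load_records` function and department names are hypothetical stand-ins for an expensive HDFS scan, and the cached list plays the role Spark’s in-memory datasets play across repeated queries.

```python
# Illustrative sketch only: contrasts the batch model (re-scan the source
# for every query, as in classic Hive-on-MapReduce) with an in-memory model
# (load once, reuse across many queries, as Spark allows).
# Plain-Python stand-ins; not the actual Hive or Spark APIs.

def load_records():
    # Hypothetical stand-in for an expensive scan of files on HDFS.
    return [("ads", 120), ("search", 80), ("ads", 40), ("mail", 60)]

# Batch style: each query re-reads the source data from scratch.
def batch_total(department):
    return sum(v for k, v in load_records() if k == department)

# In-memory style: load once, then run many queries against the cached copy.
cached = load_records()  # analogous to caching a dataset in memory in Spark

def cached_total(department):
    return sum(v for k, v in cached if k == department)

print(batch_total("ads"))   # 160
print(cached_total("ads"))  # 160, without re-scanning the source
```

The answers are identical; the difference is that the in-memory variant pays the load cost once, which is why Spark shines for iterative and interactive workloads while batch query remains fine for occasional reports.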
Like many of its chipmaking competitors, Texas Instruments is really stoked about the promise of connected devices. It all boils down to more chips sold. So TI has built a partner program for the internet of things to help manufacturers link together devices and services from different companies. Participants in TI’s ecosystem include 2lemetry, ARM, Arrayent, Exosite, IBM, LogMeIn (Xively), Spark, and Thingsquare. Basically, if a company buys TI chips, they’ll work with software, hardware, or cloud offerings from those vendors.
MapR is the latest Hadoop vendor to embrace Apache Spark, adding the entire Spark stack to its distribution. It’s a smart move by MapR, and further validation that Spark might be the data-processing framework of the future.
Databricks, the company behind the commercialization of the Apache Spark data-processing framework, is certifying third-party software to run on the platform. Spark is gaining popularity as a faster, easier alternative to MapReduce in Hadoop environments.
Apache Spark, an in-memory data-processing framework, is now a top-level Apache project. That’s an important step for Spark’s stability as it increasingly replaces MapReduce in next-generation big data applications.
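To see what Spark is replacing, here is the classic word count expressed as the explicit map, shuffle, and reduce phases that the MapReduce model imposes, sketched in plain Python (no Hadoop or Spark APIs; the sample lines are made up for illustration).

```python
# Illustrative sketch: word count written as MapReduce's three phases,
# using plain Python rather than the Hadoop or Spark APIs.
from collections import defaultdict

lines = ["spark replaces mapreduce", "spark is fast", "mapreduce is batch"]

# Map phase: emit a (word, 1) pair for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine each key's values into a final count.
counts = {word: sum(vals) for word, vals in grouped.items()}

print(counts["spark"])  # 2
```

In Hadoop MapReduce, each phase reads its input from disk and writes its output back to disk. Spark expresses the same pipeline as a short chain of transformations and keeps the intermediate results in memory, which is a large part of why it is faster and easier to work with.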