Report: Extending Hadoop Towards the Data Lake

Our library of 1700 research reports is available only to our subscribers. We occasionally release ones for our larger audience to benefit from. This is one such report. If you would like access to our entire library, please subscribe here. Subscribers will have access to our 2017 editorial calendar, archived reports and video coverage from our 2016 and 2017 events.
Extending Hadoop Towards the Data Lake by Paul Miller:
The data lake has increasingly become an aspect of Hadoop’s appeal. Referred to in some contexts as an “enterprise data hub,” it now garners interest not only from Hadoop’s existing adopters but also from a far broader set of potential beneficiaries. It is the vision of a single, comprehensive pool of data, managed by Hadoop and accessed as required by diverse applications such as Spark, Storm, and Hive, that offers opportunities to reduce duplication of data, increase efficiency, and create an environment in which data from very different sources can meaningfully be analyzed together.
Fully embracing the opportunity promised by a comprehensive data lake requires a shift in attitude and careful integration with the existing systems and workflows that Hadoop often augments rather than replaces. Existing enterprise concerns about governance and security will certainly not disappear, so suitable workflows must be developed to safeguard data while making it available for newly feasible forms of analysis.
Early adopters in a range of industries are already finding ways to exploit the potential of their data lakes, operationalizing internal analytic processes and integrating rich real-time analyses with more established batch processing tasks. They are integrating Hadoop into existing organizational workflows and addressing challenges around the completeness, cleanliness, validity, and protection of their data.
In this report, we explore a number of the key issues frequently identified as significant in these successful implementations of a data lake.
To read the full report, click here.

Report: Hybrid application design: balancing cloud-based and edge-based mobile data

Hybrid application design: balancing cloud-based and edge-based mobile data by Rich Morrow:
We’re now seeing an explosion in the number and types of devices, the number of mobile users, and the number of mobile applications, but the most impactful long-term changes in the mobile space will occur in mobile data as users increasingly interact with larger volumes and varieties of data on their devices. More powerful devices, better data-sync capabilities, and peer-to-peer device communications are dramatically impacting what users expect from their apps and which technologies developers will need to utilize to meet those expectations.
As this report will demonstrate, the rules are changing quickly, but the good news is that, because of more cross-platform tools like Xamarin and database-sync capabilities, the game is getting easier to play.
To read the full report, click here.

Report: Market landscape: in-memory database technologies

Market landscape: in-memory database technologies by Lynn Langit:
The landscape of data solutions has been significantly disrupted in the last several years, on multiple fronts. Another such disruption is taking place now, with the mainstreaming of in-memory database (IMDB) products. New products are emerging, as are entirely new categories of products.
This report will help CIOs, enterprise architects, and database administrators understand what categories and products they should examine more closely. This includes a review of the rapidly growing array of IMDB products and the various categories that define the market landscape. The report will detail important considerations for purchase and implementation of in-memory data systems and review the market leaders in each category.
To read the full report, click here.

Microsoft queues up DocumentDB for broad availability

Microsoft continues to fill in the check boxes for its Azure cloud. Example: Azure DocumentDB, Microsoft’s take on NoSQL databases a la Couch or MongoDB, will be generally available April 8, the company said Thursday.

The beauty of these document databases is that they can ingest JavaScript Object Notation (JSON) formatted information as is, with no need for the mapping process that had to occur to pump it into relational SQL databases. Amazon Web Services added JSON support to its DynamoDB database last year.
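To make the contrast concrete, here is a minimal sketch using only Python's standard library. It is not DocumentDB's or DynamoDB's API: a plain dictionary stands in for a document store, SQLite stands in for a relational database, and the order payload and table names are hypothetical.

```python
import json
import sqlite3

# Hypothetical JSON payload, as an application might receive it.
raw = (
    '{"id": "order-1", "customer": {"name": "Ada"}, '
    '"items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}'
)

# Document-style storage (a dict as a stand-in): keep the parsed document whole.
document_store = {}
doc = json.loads(raw)
document_store[doc["id"]] = doc

# Relational storage: the same payload must first be mapped onto flat tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, customer_name TEXT)")
conn.execute("CREATE TABLE order_items (order_id TEXT, sku TEXT, qty INTEGER)")
conn.execute(
    "INSERT INTO orders VALUES (?, ?)", (doc["id"], doc["customer"]["name"])
)
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?)",
    [(doc["id"], item["sku"], item["qty"]) for item in doc["items"]],
)

# Reading back: one lookup for the document, a query across tables for SQL.
doc_qty = sum(item["qty"] for item in document_store["order-1"]["items"])
sql_qty = conn.execute(
    "SELECT SUM(qty) FROM order_items WHERE order_id = ?", ("order-1",)
).fetchone()[0]
```

The nested customer and items fields survive untouched in the document version, while the relational version needs a schema decision for each of them up front.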

Microsoft announced DocumentDB in August.

Microsoft, which is trying to kit out Azure as a comfy home for a wide variety of workloads, supports a range of homegrown and third-party databases, including MongoDB, Microsoft SQL Azure and Oracle.

Microsoft also said Azure Search, which works across more than 50 languages, is now available. This “search-as-a-service” targets developers who want to add full-text search into their applications.

The company also unveiled a new premium encoder for Azure Media Services.


NSA-linked Sqrrl eyes cyber security and lands $7M in funding

Sqrrl, the big data startup whose founders used to work for the NSA, plans to announce Thursday that it is shifting its focus to cyber security with a new release of its enterprise service. The startup is also taking in a $7 million Series B investment round, bringing its total funding to $14.2 million, said Ely Kahn, a Sqrrl co-founder and vice president of business development.

The heart of Sqrrl’s technology is the NSA-developed and open-sourced Apache Accumulo NoSQL database, which the company, like other open-source-reliant companies such as Docker or Hortonworks, sells premium services around.

While the Accumulo technology, based on Hadoop, provided a way for companies to store and analyze all their data similar to how they could with other big data vendors like Splunk, Kahn said his team found that their biggest customers were using the technology for cybersecurity purposes. Just a hunch, but I bet the whole “ties to the NSA” thing probably leads to people wanting to give it a go for their security challenges.

Sqrrl’s technology spools together many different types of data sets, from intrusion detection logs to human resources information, and puts that in a single platform that can be used for discovering bad actors that may be loitering in a company’s infrastructure.

Because the Accumulo NoSQL database can function as a graph database (graph databases are a class of NoSQL databases, said Kahn) the Sqrrl team can dump all that data into the system and then receive a picture of the network that contains all the users, devices and servers and how they are connected together.
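As a rough illustration of that linked picture, the sketch below builds a small graph from records and walks it. It does not use Accumulo or Sqrrl's software; the entity names and records are hypothetical, and a plain in-memory adjacency map stands in for the graph store.

```python
from collections import defaultdict, deque

# Hypothetical records pulled from different sources (login logs, asset lists).
events = [
    ("user:alice", "device:laptop-1"),
    ("device:laptop-1", "server:hr-db"),
    ("user:bob", "device:phone-2"),
    ("device:phone-2", "server:mail"),
    ("user:mallory", "device:laptop-1"),
]

# Build an undirected graph: every record becomes an edge between two entities.
graph = defaultdict(set)
for a, b in events:
    graph[a].add(b)
    graph[b].add(a)


def connected_to(start):
    """Return every entity reachable from start, using breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return seen


reachable = connected_to("user:mallory")
```

Here a suspicious account is tied to a laptop, and through it to another user and a server, which is the kind of connection that is hard to spot when each data set sits in its own silo.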

Sqrrl dashboard

“We are able to take all these disparate data sets and defuse them into this linked-data model,” said Kahn.

Graph databases seem to be getting a lot of action these days (DataStax just bought out a graph-database company called Aurelius), and people often use the technology as a way to map out their infrastructure and learn about vulnerabilities.

Given the traction graph databases are gaining for security purposes, it makes sense that Sqrrl would want to ride this wave. Its Sqrrl Enterprise 2.0 product line now contains security-specific features, including visualization tools like bar charts and pie charts and a dashboard that lets users create reports from the data.

“It’s a big data analytics platform with a focus on cybersecurity,” said Kahn. “It has a database foundation, but it now has advanced visualization capabilities that supports the incident-detection lifecycle.”

This might sound similar to Argyle Data, which built fraud-detection software on top of the Accumulo database, but Kahn said that startup is more focused on using its technology to prevent telephone scams and the like, and that solving problems related to fraud requires different types of data sets than the ones Sqrrl analyzes to detect anomalies.

Rally Ventures drove the latest funding round along with previous investors Atlas Venture and Matrix Partners.

For more on how innovative companies are using big data to solve complex problems, be sure to check out Structure Data 2015 on March 18-19 in New York City.

DataStax’s first acquisition is a graph-database company

DataStax, the rising NoSQL database vendor that hawks a commercial version of the open-source Apache Cassandra distributed database, plans to announce on Tuesday that it has acquired graph-database specialist Aurelius, which maintains the open-source graph database Titan.

All of Aurelius’s eight-person engineering staff will be joining DataStax, said Martin Van Ryswyk, DataStax’s executive vice president of engineering. This is DataStax’s first acquisition since it was founded in 2010. The company did not disclose the purchase price, but Van Ryswyk said that a “big chunk” of DataStax’s recent $106 million funding round was used to help finance the purchase.

Although DataStax has been making a name for itself in the NoSQL market, where it competes with companies like MongoDB and Couchbase, it’s apparent that the company is branching out a little by purchasing a graph-database shop.

Cassandra is a powerful and scalable database used for online or transactional purposes (Netflix and Spotify are users), but it lacks some of the features that make graph databases attractive for some organizations, explained DataStax co-founder and chief customer officer Matt Pfeil. These features include the ability to map out relationships between data points, which is helpful for social networks like Pinterest or Facebook that use graph architecture to learn about user interests and activities.

Financial institutions are also interested in graph databases as a way to detect fraud and malicious behavior in their infrastructure, Pfeil said.

As DataStax “started to move up the stack,” the company noticed that its customers were using graph database technology, and DataStax felt it could come up with a product that could give customers what they wanted, said Pfeil.

DataStax Enterprise

Customers don’t just want one database technology, they want a “multi-dimensional approach” that includes Cassandra, search capabilities, analytics and graph technology, and they are willing to plunk down cash for commercial support, explained Van Ryswyk.

Because some open-source developers were already figuring out ways for Cassandra and the Titan database to be used together, it made sense for DataStax and the Aurelius team to work together on making the enterprise versions of the technology compatible with each other, Van Ryswyk said.

Together, DataStax and the newly acquired Aurelius team will develop a commercial graph product called DataStax Enterprise (DSE) Graph, and will try to “get it to the level of scalability that people expect of Cassandra,” said Van Ryswyk. There is no release date yet for the technology, but Pfeil said work on the new product is already under way.

If you’re interested in learning more about what’s going on with big data in the enterprise and what other innovative companies are doing, you’ll want to check out this year’s Structure Data conference from March 18-19 in New York City.

With $20M, Neo Technology makes a case for the graph database

Neo Technology, the creator of the Neo4j graph database, has brought in $20 million in series C money, signaling that we are indeed seeing the rise of graph analysis in big data. The startup now has $44.1 million in total funding.

Unlike document databases like MongoDB, graph databases deal with the relationships between data points and are used by big social networking companies like Facebook and Pinterest to map out the connections of their many users. For example, Pinterest’s graph architecture lets the startup know which users are following other users as well as how their interests overlap with each other.
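A toy version of that idea fits in a few lines. This is only a sketch with made-up users and interests, not Pinterest's architecture or the Neo4j API: follow relationships are stored as edges, and overlap is computed as a set intersection along each edge.

```python
# Hypothetical data: who follows whom, and what each user is interested in.
follows = {
    "ana": {"ben", "cy"},
    "ben": {"cy"},
    "cy": set(),
}
interests = {
    "ana": {"cooking", "travel", "design"},
    "ben": {"travel", "design", "cars"},
    "cy": {"gardening"},
}


def shared_interests(user):
    """For each account the user follows, list the interests they share."""
    return {
        followed: sorted(interests[user] & interests[followed])
        for followed in sorted(follows[user])
    }


overlap = shared_interests("ana")
```

In this sample, `overlap` maps `ben` to the two interests he shares with `ana` and `cy` to an empty list, since the follow edge exists but no interests overlap.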

Similar to the NoSQL world where several companies and related projects like MongoDB, Couchbase and DataStax vie for the crown, there are many different graph database projects with no clear leader(s) yet. Some examples include GraphLab Inc. and the open-source GraphLab database, the Facebook-developed Giraph open-source database and the Cassovary big graph-processing library brought to you by Twitter.

Neo Technology CEO and founder Emil Eifrem said his startup stands out from other projects because of its large development community; the company claims it has 20,000 Neo4j Meetup members in 25 countries and has received 500,000 downloads since Neo4j 2.0 was released last year. Eifrem said that Neo4j now supports many different languages, frameworks and tooling through the help of community support and developers.

Neo4j was created from the ground up by Neo Technology’s founders back in 2007 and is not a rejiggered version of MySQL with some sort of relational layer built on top of it, he said.

Neo4j browser screenshot

As far as building a viable business, Neo Technology is following in the footsteps of other open-source-centric startups like Docker in that it sells commercial software that functions as “operations and management tools,” but which is built atop open-source technology. While developers can download Neo4j for free, enterprises that want more-traditional IT features like monitoring and management and clustering will have to cough up some cash.

“I can’t see a big, serious company putting [the free version] in production,” Eifrem said.

The 80-person startup claims Walmart, eBay, CenturyLink, Cisco and Medium as users of the Neo4j database.

Creandum and Dawn Capital drove the funding round along with Fidelity Growth Partners Europe, Sunstone Capital and Conor Venture Partners. Johan Brenner, a Creandum general partner, will join Neo Technology’s board.

Basho, creator of NoSQL Riak database, raises $25M

Basho, the company behind the Riak key-value database and Riak CS cloud-storage system, has raised a $25 million series G round of venture capital led by Georgetown Partners. The company has now raised nearly $60 million in a combination of equity and debt financing since it was founded in 2008.

Basho is among a handful of companies, including MongoDB, DataStax and Couchbase, that seem to have garnered some real traction in the NoSQL space over the past few years. Riak, its flagship open source database, competes most directly against Cassandra, around which DataStax was built. Basho released its Riak CS storage system in 2012 to help users build distributed object stores a la Amazon Web Services’ S3 or OpenStack Swift.

Although it has raised much less capital than its NoSQL peers (MongoDB, for example, just announced an $80 million round on top of the $150 million it closed in October 2013) and had a major executive shakeup in 2014, when the company replaced both its CEO and CTO, Basho claims it’s doing just fine. In an interview on Monday, new CEO Adam Wray cited an 89 percent annual increase in bookings, tens of millions in annual revenue and accounts at some of the world’s largest companies.

Big data, the internet of things and hybrid cloud computing environments are driving many of Basho’s deployments, he added.

Assuming the market for non-relational databases keeps growing as many expect (“One day, we’ll be a $50 billion market space,” Wray said), there’s no reason it can’t support a handful of successful companies. Riak might never have the user base of MongoDB or the webscale reputation of Cassandra, but if the company can get its act together operationally and the technology remains solid, there should be plenty of business to go around.

And if a large software vendor goes shopping for NoSQL software, Basho will likely have a much more palatable price tag than the other big-name options.

MongoDB CEO: Company was ‘opportunistic’ in raising $80M

According to MongoDB CEO Dev Ittycheria, it was unsolicited demand from investors that drove much of the company’s recent $80 million investment round, news of which broke on Friday afternoon.

In an interview with Gigaom, Ittycheria said that the company initially planned to raise a smaller sum of money to finance its acquisition of WiredTiger in December, but demand from adoring investors that caught wind of the raise was too much to resist. The company ended up raising about three times what it had planned to, and on very favorable terms, he said.

“We were very opportunistic,” Ittycheria said. MongoDB’s previous fundraising round had it valued at $1.2 billion, and it’s now worth even more.

The company will likely go public at some point but doesn’t want to — or have to — rush into it, he said. It still has plenty of mindshare and capital as a private company to choose its own timing. And there’s still some work to do proving its business model can succeed and scaling its global presence (MongoDB only has a handful of reps in both Europe and Asia, for example), Ittycheria noted.

That being said, MongoDB can probably, and safely, be optimistic about its prospects whenever it decides to go public.

Although the Hadoop and NoSQL markets are quite different, Ittycheria said he watched the recent Hortonworks IPO very closely because Hortonworks was the first of the next-generation, open-source data infrastructure vendors to test the public markets. While the company’s financials initially made him a little nervous, he said the fact that it was well-received speaks to demand from public investors to invest in “loosely, ‘big data’ companies.”

MongoDB confirms an $80M funding round

NoSQL startup MongoDB is aiming to raise $100 million and has already taken in $79.9 million, according to an SEC document that the company filed this week and has confirmed to Gigaom.

The new cash influx comes after a $150 million funding round the startup landed in October 2013, when the company was valued at $1.2 billion.

MongoDB is a hot commodity in the NoSQL database space, where it competes with Couchbase and DataStax, among others. In their last investment rounds, Couchbase and DataStax have raised $60 million and $106 million, respectively.

MongoDB has also been figuring out how to make money as a company that’s built around open source software. In October, MongoDB unveiled its MongoDB Management Service, designed to help users scale and manage their databases; the startup is banking that the new service will generate a lot of revenue. It also added paid support (or what it calls “production support”) for users of the free version in August, and brought in a new CEO with IPO experience the same month.

The startup recently bought out WiredTiger, whose storage engine technology should be available as an option for a forthcoming MongoDB release. Financial terms of the acquisition were not disclosed.

With Hadoop vendor Hortonworks recently going public with a market cap of a little over a billion dollars, it’s clear the big data space is on fire and investors aren’t scared off by open source software. MongoDB has indicated that it eyes an IPO in its future, but this new funding round will give it leeway to find an optimal timeframe.

In October, MongoDB’s vice chairman and former CEO Max Schireson came by the Structure Show to chat about databases as well as managing a family while trying to lead a fast-rising startup.
