Review: DB Networks Enhances Database Security with Machine Learning

San Diego-based DB Networks may very well have the answers to many of those security shortcomings in the form of its IDS-6300, a security appliance that detects intrusions into databases and gives administrators the intelligence to do something about it.

Airbnb open sources SQL tool built on Facebook’s Presto database

Apartment-sharing startup Airbnb has open sourced a tool called Airpal that the company built to give more of its employees access to the data they need for their jobs. Airpal is built atop the Presto SQL engine that Facebook created in order to speed access to data stored in Hadoop.

Airbnb built Airpal about a year ago so that employees across divisions and roles could get fast access to data rather than having to wait for a data analyst or data scientist to run a query for them. According to product manager James Mayfield, it’s designed to make it easier for novices to write SQL queries by giving them access to a visual interface, previews of the data they’re accessing, and the ability to share and reuse queries.

It sounds a little like the types of tools we often hear about inside data-driven companies like Facebook, as well as the new SQL platform from a startup called Mode.

At this point, Mayfield said, “Over a third of all the people working at Airbnb have issued a query through Airpal.” He added, “The learning curve for SQL doesn’t have to be that high.”

He shared the example of folks at Airbnb tasked with determining the effectiveness of the automated emails the company sends out when someone books a room, resets a password or takes any of a number of other actions. Data scientists used to have to dive into Hive (the SQL-like data warehouse framework for Hadoop that Facebook open sourced in 2008) to answer that type of question, which meant slow turnaround times because of human and technological factors. Now, lots of employees can access that same data via Airpal in just minutes, he said.

The Airpal user interface.

As cool as Airpal might be for Airbnb users, though, it really owes its existence to Presto. Back when everyone was using Hive for data analysis inside Hadoop — it was and continues to be widely used within web companies — only 10 to 15 people within Airbnb understood the data and could write queries using its somewhat complicated version of SQL. Because Hive is based on MapReduce, the batch-processing engine most commonly associated with Hadoop, Hive is also slow (although new improvements have increased its speed drastically).

Airbnb also used Amazon’s Redshift cloud data warehouse for a while, said software engineer Andy Kramolisch, and while it was fast, it wasn’t as user-friendly as the company would have liked. It also required replicating data from Hive, meaning more work for Airbnb and more data for the company to manage. (If you want to hear more about all this Hadoop and big data stuff from leaders at Google, Cloudera and elsewhere, come to our Structure Data conference March 18-19 in New York.)

A couple of years ago, Facebook created and then open sourced Presto as a means to solve Hive’s speed problems. It still accesses data from Hive, but is designed to deliver results at interactive speeds rather than in minutes or, depending on the query, much longer. It also uses standard ANSI SQL, which Kramolisch said is easier to learn than the Hive Query Language, with its “lots of hidden gotchas.”

Still, Mayfield noted, it’s not as if everyone inside Airbnb, or any company, is going to be running SQL queries using Airpal — no matter how easy the tooling gets. In those cases, he said, the company tries to provide dashboards, visualizations and other tools to help employees make sense of the data they need to understand.

“I think it would be rad if the CEO was writing SQL queries,” he said, “but …”

The Druid real-time database moves to an Apache license

Druid, an open source database designed for real-time analysis, is moving to the Apache 2 software license in order to hopefully spur more use of and innovation around the project. It was open sourced in late 2012 under the GPL license, which is generally considered more restrictive than the Apache license in terms of how software can be reused.

Druid was created by advertising analytics startup Metamarkets (see disclosure) and is used by numerous large web companies, including eBay, Netflix, PayPal, Time Warner Cable and Yahoo. Because of the nature of Metamarkets’ business, Druid requires data to include a timestamp and is probably best described as a time-series database. It’s designed to ingest terabytes of data per hour and is often used for things such as analyzing user or network activity over time.
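To make the time-series idea concrete, here is a minimal sketch of what "analyzing activity over time" can look like: every event carries a timestamp, and events are rolled up into hourly buckets. This is an illustration only and does not use Druid's API; the function name `hourly_counts` and the `(timestamp, payload)` event shape are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, Tuple


def hourly_counts(events: Iterable[Tuple[datetime, Any]]) -> Dict[datetime, int]:
    """Count events per hour (hypothetical helper, not part of Druid)."""
    counts: Counter = Counter()
    for timestamp, _payload in events:
        # Truncate each timestamp to the start of its hour to get the bucket key.
        bucket = timestamp.replace(minute=0, second=0, microsecond=0)
        counts[bucket] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [
        (datetime(2015, 2, 17, 10, 5, tzinfo=timezone.utc), "page_view"),
        (datetime(2015, 2, 17, 10, 59, tzinfo=timezone.utc), "click"),
        (datetime(2015, 2, 17, 11, 0, tzinfo=timezone.utc), "page_view"),
    ]
    for bucket, count in sorted(hourly_counts(sample).items()):
        print(bucket.isoformat(), count)
```

A real time-series store performs this kind of rollup at ingestion time and across many machines; the sketch only shows why a timestamp on every record is the organizing key.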

Mike Driscoll, Metamarkets’ co-founder and CEO, is confident now is the time for open source tools to really catch on — even more so than they already have in the form of Hadoop and various NoSQL data stores — because of the ubiquity of software as a service and the emergence of new resource managers such as Apache Mesos. In the former case, open source technologies underpin multiuser applications that require a high degree of scale and flexibility on the infrastructure level, while in the latter case databases like Druid are just delivered as a service internally from a company’s pool of resources.

However it happens, Driscoll said, “I don’t think proprietary databases have long for this world.”

Disclosure: Metamarkets is a portfolio company of True Ventures, which is also an investor in Gigaom.

Pivotal open sources its Hadoop and Greenplum tech, and then some

Pivotal, the cloud computing and big data company that spun out from EMC and VMware in 2013, is open sourcing its entire portfolio of big data technologies and is teaming up with Hortonworks, IBM, GE, and several other companies on a Hadoop effort called the Open Data Platform.

Rumors about the fate of the company’s data business have been circulating since a round of layoffs began in November, but, according to Pivotal, the situation isn’t as dire as some initial reports suggested.

There is a lot of information coming out of the company about this, but here are the key parts:

  • Pivotal is still selling licenses and support for its Greenplum, HAWQ and GemFire database products, but it is also releasing the core code bases for those technologies as open source.
  • Pivotal is still offering its own Hadoop distribution, Pivotal HD, but has slowed development on core components of MapReduce, YARN, Ambari and the Hadoop Distributed File System. Those four pieces are the starting point for a new association called the Open Data Platform, which includes Pivotal, GE, Hortonworks, IBM, Infosys, SAS, Altiscale, EMC, Verizon Enterprise Solutions, VMware, Teradata and “a large international telecommunications firm,” and which promises to build its Hadoop technologies using a standard core of code.
  • Pivotal is working with Hortonworks to make Pivotal’s big data technologies run on the Hortonworks Data Platform, and eventually on the Open Data Platform core. Pivotal will continue offering enterprise support for Pivotal HD, although it will outsource to Hortonworks support requests involving the guts of Hadoop (e.g., MapReduce and HDFS).

Sunny Madra, vice president of the data and mobile product group at Pivotal, said the company has a relatively successful big data business already — $100 million overall, $40 million of which came from the Big Data Suite license bundle it announced last year — but suggested that it sees the writing on the wall. Open source software is a huge industry trend, and he thinks pushing against it is as fruitless as pushing against cloud computing several years ago.

“We’re starting to see open source pop up as an RFP within enterprises,” he said. “. . . If you’re picking software [today] . . . you’d look to open source.”

The Pivotal Big Data Suite.

Madra pointed to Pivotal’s revenue numbers as proof the company didn’t open source its software because no one wanted to pay for it. “We wouldn’t have a $100 million business . . . if we couldn’t sell this,” he said. Maybe, but maybe not: Hortonworks isn’t doing $100 million a year, but word was that Cloudera was doing it years ago (on Tuesday, Cloudera did claim more than $100 million in revenue in 2014). Depending on how one defines “big data,” companies like Microsoft and Oracle are probably making much more money.

However, there were some layoffs late last year, which Madra attributed to consolidation of people, offices and efforts rather than a failing business. Pivotal wanted to close some global offices, bring the data and Cloud Foundry teams under the same leadership, and focus its development resources on its own intellectual property around Hadoop. “Do we really need a team going and testing our own distribution?” he asked, pointing also to the work of troubleshooting it, certifying it against other technologies and all that goes along with that.

EMC first launched the Pivotal HD Hadoop distribution, as well as the HAWQ SQL-on-Hadoop engine, with much ado just over two years ago.

The deal with Hortonworks helps alleviate that engineering burden in the short term, and the Open Data Platform is supposed to help solve it over a longer period. Madra explained the goal of the organization as Linux-like, meaning that customers should be able to switch from one Hadoop distribution to the next and know the kernel will be the same, just like they do with the various flavors of the Linux operating system.

Mike Olson, Cloudera’s chief strategy officer and founding CEO, offered a harsh rebuttal to the Open Data Platform in a blog post on Tuesday, questioning the utility and politics of vendor-led consortia like this. He simultaneously praised Hortonworks for its commitment to open source Hadoop and bashed Pivotal on the same issue, but wrote, among other things, of the Open Data Platform: “The Pivotal and Hortonworks alliance, notwithstanding the marketing, is antithetical to the open source model and the Apache way.”

The Pivotal HD and HAWQ architecture.

Much of this has been open sourced or replaced.

As part of Pivotal’s Tuesday news, the company also announced additions to its Big Data Suite package, including the Redis key-value store, RabbitMQ messaging queue and Spring XD data pipeline framework, as well as the ability to run the various components on the company’s Cloud Foundry platform. Madra actually attributes a lot of Pivotal’s decision to open source its data technologies, as well as its execution, to the relative success the company has had with Cloud Foundry, which has always involved an open source foundation as well as a commercial offering.

“Had we not had the learnings that we had in Cloud Foundry, then I think it would have been a lot more challenging,” he said.

Whether or not one believes Pivotal’s spin on the situation, though, the company is right in realizing that it’s open source or bust in the big data space right now. They have different philosophies and strategies around it, but major Hadoop vendors Cloudera, Hortonworks and MapR are all largely focused on open-source technology. The most popular Hadoop-ecosystem technologies, including Spark, Storm and Kafka, are open source, as well. (CEOs, founders and creators from many of these companies and projects will be speaking at our Structure Data conference next month in New York.)

Pivotal might eventually sell billions of dollars worth of software licenses for its suite of big data products — there’s certainly a good story there if it can align the big data and Cloud Foundry businesses into a cohesive platform — but it probably has reached its plateau without having an open source story to tell.

Update: This post was updated at 12:22 p.m. PT to add information about Cloudera’s revenue.

Not a shocker: SAP puts HANA at center of new biz apps push

When you hear from SAP these days, the software giant always leads with HANA, its in-memory database. HANA is to SAP what Watson is to IBM — proof that just because a company is getting along in years doesn’t mean it can’t do great stuff.

So it’s not a huge surprise that SAP’s “next generation” business software suite, S/4, will draw heavily on HANA and sport a single unified interface across the applications. The first of these to be delivered, Simple Finance, was introduced Tuesday with more modules to follow.

At a rollout event in New York on Tuesday, SAP CEO Bill McDermott characterized this as “the biggest product launch in the last 23 years and perhaps the company’s history.”

No pressure there.

The rewritten S/4HANA applications are now available on-premises across industries and regions. Simple Finance is the first application to be offered via SaaS — it’s also available on premises now, according to a spokeswoman.

Update: An SAP spokesman said these new applications were built from the ground up to run on HANA and will not work with third-party databases, which is sort of shocking. (So much for earlier reports that the applications would continue to work with third-party databases if needed but would work better, faster, prettier with HANA.)

Oracle has a similar “better together” story around its database, middleware, analytics, Linux and servers (er, make that engineered systems). All of these vendors talk about being open, but also say they’re more powerful when running with the company’s full array of technologies.

Complicating this particular storyline is that SAP and Oracle used to be more friends than enemies, with the majority of SAP’s business applications running on Oracle databases. Then Oracle decided to dive full on into enterprise applications with its acquisitions of PeopleSoft, Siebel Systems and everything else that wasn’t nailed down, while SAP doubled down in databases, buying Sybase and creating HANA. SAP and Oracle also bulked up their respective SaaS rosters, with Oracle buying RightNow and Taleo and SAP snapping up SuccessFactors.

And of course you know, this means war.

This story was updated at 11:16 a.m. PST to add SAP’s statement that the new applications will not run with third-party databases. 

DataStax’s first acquisition is a graph-database company

DataStax, the rising NoSQL database vendor that hawks a commercial version of the open-source Apache Cassandra distributed database, plans to announce on Tuesday that it has acquired graph-database specialist Aurelius, which maintains the open-source graph database Titan.

All of Aurelius’s eight-person engineering staff will be joining DataStax, said Martin Van Ryswyk, DataStax’s executive vice president of engineering. This is DataStax’s first acquisition since the company was founded in 2010. The company did not disclose the purchase price, but Van Ryswyk said that a “big chunk” of DataStax’s recent $106 million funding round was used to help finance the purchase.

Although DataStax has been making a name for itself in the NoSQL market, where it competes with companies like MongoDB and Couchbase, it’s apparent that the company is branching out a little bit by purchasing a graph-database shop.

Cassandra is a powerful and scalable database used for online or transactional purposes (Netflix and Spotify are users), but it lacks some of the features that make graph databases attractive for some organizations, explained DataStax co-founder and chief customer officer Matt Pfeil. These features include the ability to map out relationships between data points, which is helpful for social networks like Pinterest or Facebook, which use graph architecture to learn about user interests and activities.

Financial institutions are also interested in graph databases as a way to detect fraud and malicious behavior in their infrastructure, Pfeil said.

As DataStax “started to move up the stack,” the company noticed that its customers were using graph database technology, and DataStax felt it could come up with a product that could give customers what they wanted, said Pfeil.

DataStax Enterprise

Customers don’t just want one database technology, they want a “multi-dimensional approach” that includes Cassandra, search capabilities, analytics and graph technology, and they are willing to plunk down cash for commercial support, explained Van Ryswyk.

Because some open-source developers were already figuring out ways for Cassandra and the Titan database to be used together, it made sense for DataStax and the Aurelius team to work together on making the enterprise versions of the technology compatible with each other, Van Ryswyk said.

Together, DataStax and the newly acquired Aurelius team will develop a commercial graph product called DataStax Enterprise (DSE) Graph, and they will try to “get it to the level of scalability that people expect of Cassandra,” said Van Ryswyk. As of now, there is no release date for the technology, but Pfeil said work on the new product is already taking place.

If you’re interested in learning more about what’s going on with big data in the enterprise and what other innovative companies are doing, you’ll want to check out this year’s Structure Data conference from March 18-19 in New York City.

MongoDB confirms an $80M funding round

NoSQL startup MongoDB is aiming to raise $100 million and has already taken in $79.9 million, according to an SEC document that the company filed this week and has confirmed to Gigaom.

The new cash influx comes after a $150 million funding round the startup landed in October 2013, when the company was valued at $1.2 billion.

MongoDB is a hot commodity in the NoSQL database space, where it competes with Couchbase and DataStax, among others. In their last investment rounds, Couchbase and DataStax have raised $60 million and $106 million, respectively.

MongoDB has also been figuring out how to make money as a company that’s built around open source software. In October, MongoDB unveiled its MongoDB Management Service, designed to help users scale and manage their databases; the startup is banking that the new service will generate a lot of revenue. It also added paid support (or what it calls “production support”) for users of the free version in August, and brought in a new CEO with IPO experience the same month.

The startup recently bought out WiredTiger, whose storage engine technology should be available as an option for a forthcoming MongoDB release. Financial terms of the acquisition were not disclosed.

With Hadoop vendor Hortonworks recently going public with a market cap of a little over a billion dollars, it’s clear the big data space is on fire and investors aren’t scared off by open source software. MongoDB has indicated that it eyes an IPO in its future, but this new funding round will give it leeway to find an optimal timeframe.

In October, MongoDB’s vice chairman and former CEO Max Schireson came on the Structure Show to chat about databases as well as managing a family while trying to lead a fast-rising startup.

Citus Data open sources tool for scalable, transactional Postgres

Database startup Citus Data has open sourced a tool, called pg_shard, that lets users scale their PostgreSQL deployments across many machines while maintaining performance for operational workloads. As the name suggests, pg_shard is a Postgres extension that evenly distributes, or shards, the database as new machines are added to the cluster.
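For readers unfamiliar with sharding, the general idea can be sketched in a few lines. The example below is not pg_shard's implementation or API; it is a generic illustration of hash-based sharding, in which each row's key is hashed to one of a fixed number of logical shards and those shards are spread evenly across worker machines. The function names, the worker names and the shard count of 16 are all assumptions made for the example.

```python
import hashlib
from typing import Dict, Sequence


def shard_for_key(key: str, shard_count: int) -> int:
    """Map a row key to a logical shard with a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def place_shards(shard_count: int, workers: Sequence[str]) -> Dict[int, str]:
    """Assign logical shards to workers round-robin so each holds an even share."""
    return {shard: workers[shard % len(workers)] for shard in range(shard_count)}


def route(key: str, shard_count: int, workers: Sequence[str]) -> str:
    """Return the worker that would store the row with this key."""
    placement = place_shards(shard_count, workers)
    return placement[shard_for_key(key, shard_count)]


if __name__ == "__main__":
    workers = ["worker-1", "worker-2", "worker-3", "worker-4"]  # hypothetical hosts
    for customer in ["customer-1", "customer-2", "customer-3"]:
        print(customer, "->", route(customer, 16, workers))
```

Using more logical shards than machines is a common design choice: when a machine is added, whole shards can be moved to it without rehashing every row.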

Earlier this year, Citus developed and open sourced an extension called Cstore that lets users add a columnar data store to their Postgres databases, making them more suitable for interactive analytic queries.

It’s all part of a move to transition Citus Data from being just another analytic database company into a company that’s helping drive advanced uses of Postgres, co-founder and CEO Umur Cubukcu said. Citus launched in early 2013 promising to let Postgres users use the same SQL to query Hadoop, MongoDB and other NoSQL data stores, but has come to realize that its customers aren’t as excited about those capabilities as they are enamored with Postgres.

As Postgres undergoes something of a renaissance among web startups (it’s also the database foundation of PaaS pioneer Heroku and its managed database service), Cubukcu thinks there’s a big opportunity to provide tooling that lets developers take advantage of everything they love about Postgres and not have to worry about whether they’ll outgrow it or bring on another database to handle their analytic workloads.

The NoSQL connectivity is still there, but Cubukcu acknowledges that running analytics on those workloads might be a job best left for the technologies (e.g., Spark) focused on that world of data.

And whether or not pg_shard or Citus Data is the ultimate answer for scale-out Postgres, Cubukcu is definitely onto something when he talks about how the narrative around SQL and scalability has changed over the past few years. His company’s work, along with that of startups such as MemSQL and Tokutek, and open-source projects such as WebScaleSQL and Postgres-XL, has shown that SQL can scale. The tradeoff for developers is no longer relational capabilities for the scale of NoSQL.

Rather, Cubukcu thinks the new tradeoff is between open-source ecosystems and proprietary software as companies try to scale out their relational databases. At least when it comes to Postgres, he said, “Our take is, ‘You don’t have to do this.’”

Twitter now indexes every tweet ever

Twitter has built a new search index that allows users to surface all public tweets since the service launched in 2006. At nearly half a trillion documents and a scale of 100 times Twitter’s standard real-time index, it’s an impressive feat of engineering.

IBM builds up its cloud with Netezza as a service and NoSQL as software

IBM announced a promising new collection of cloud data services on Monday, adding to an already impressive collection of services on its Bluemix platform. At this point, though, IBM’s biggest challenge isn’t selling enterprise users on the cloud, but convincing them it’s still the best choice.