Big Data SQL makes Hadoop servant, Oracle master

Ever since Cloudera’s October 2012 announcement of its Impala SQL-on-Hadoop engine, it seems the database industry has been obsessed with fusing the SQL query language with Hadoop. These pairings have broken down into two broad groups: standalone SQL-on-Hadoop engines from Hadoop distribution vendors, and SQL-to-Hadoop bridges from various relational database and data warehouse vendors, including Teradata, HP Vertica, IBM and Microsoft.
SQL on Hadoop, redux
Oracle has had an offering out there too, in the form of its Big Data Connectors and, most interestingly, its Oracle SQL Connector for HDFS (OSCH). That connector allowed data in Hive tables and HDFS files to be imported into the Oracle Database catalog as external tables which could then be queried with Oracle SQL and even joined to physical tables in the Oracle Database.
My own observation was that Oracle didn’t push OSCH very hard, and I wasn’t certain why that was. But today the reason became apparent: Oracle had a much more sophisticated Hadoop integration technology under development. Tuesday, Oracle announced that technology, called Oracle Big Data SQL (BDS), to be made generally available on its Big Data Appliance this calendar quarter (that is, sometime before October).
Rather than just consume Oracle’s press release, I requested a briefing, and was able to speak with Dan McClary, Oracle’s principal product manager for Big Data and Hadoop. I was lucky to have this briefing; McClary was able to explain how BDS works in great detail.
What’s changed
So how different are OSCH and BDS? Vastly so. McClary explained to me that OSCH was never really intended to be used for interactive query, but rather for ETL-like scenarios. While that seemed just a tad revisionist to me, I was nonetheless impressed with BDS’ much greater capabilities. Essentially, what Oracle did with BDS was to build a translator of sorts that works natively with both Oracle and Hadoop.
On the Hadoop side, BDS uses its Smart Scan for Hadoop technology, which interfaces directly with Hadoop’s YARN cluster management layer to parallelize properly across the cluster. BDS can query Hadoop data in virtually any format, including custom formats, as long as a “SerDe” (serializer-deserializer) is available. BDS will also determine schema as it reads the data, a key concept in the Hadoop world. On the Oracle side, BDS returns the data in Oracle Block Stream format so the Exadata Database Machine can work with it natively. BDS and Smart Scan for Hadoop are, in McClary’s words, “Oracle on the top, Hadoop on the bottom.”
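To make the SerDe and schema-on-read ideas concrete, here is a minimal Python sketch. It is a generic illustration only, not Oracle's Smart Scan code or Hive's actual SerDe interface; the registry, the function names and the sample records are all hypothetical. The point is that raw lines are stored untouched, a pluggable deserializer gives them structure at query time, and the schema is discovered while reading.

```python
import csv
import io
import json


def deserialize_json(line):
    """Hypothetical 'SerDe' for JSON-lines data: one raw line -> one dict."""
    return json.loads(line)


def deserialize_csv(line, columns=("id", "city", "amount")):
    """Hypothetical 'SerDe' for CSV data; column names are assumed, not stored."""
    values = next(csv.reader(io.StringIO(line)))
    return dict(zip(columns, values))


# Supporting a new or custom format only requires registering a deserializer.
SERDES = {"json": deserialize_json, "csv": deserialize_csv}


def scan(raw_lines, fmt):
    """Apply structure at read time; the stored lines are never rewritten."""
    deserialize = SERDES[fmt]
    for line in raw_lines:
        yield deserialize(line)


def infer_schema(rows):
    """Record each column's name and the type of the first value seen for it."""
    schema = {}
    for row in rows:
        for name, value in row.items():
            schema.setdefault(name, type(value).__name__)
    return schema


json_lines = [
    '{"id": 1, "city": "Austin", "amount": 12.5}',
    '{"id": 2, "city": "Boston", "amount": 7.0, "coupon": "SPRING"}',
]
csv_lines = ["3,Chicago,4.25"]

rows = list(scan(json_lines, "json")) + list(scan(csv_lines, "csv"))
schema = infer_schema(rows)
```

Note that the second JSON record carries a `coupon` field the first one lacks; nothing had to be declared up front for it to show up in the inferred schema.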
NoSQL and security
BDS also allows standard JSON functions to be used in Oracle SQL and will then query natively against JSON data in HDFS. That means semi-structured data can be queried with SQL, and this capability will be usable against the Oracle NoSQL Database when BDS becomes generally available. McClary told me there was a strong possibility that this NoSQL interface could eventually be extended to work with HBase, Cassandra and even MongoDB.
Finally, BDS also makes use of Apache Sentry such that Oracle’s own role-based security scheme can be projected over Hadoop data. Other, more advanced security capabilities, including data redaction, can be enforced over Hadoop data, as long as that data is queried through Oracle. This makes possible a model whereby production Hadoop clusters are locked down, and users get to the cluster data exclusively via Oracle, through which very granular and specific role-based security is imposed and enforced.
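The lock-down model can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Oracle's redaction feature or the Apache Sentry API: the role names, the policy table and the `query_through_gateway` function are all hypothetical. It shows only the general idea that when every query passes through a single gateway, per-role redaction can be enforced there regardless of how the underlying data is stored.

```python
# Hypothetical policy: which columns each role sees only in redacted form.
REDACTED_COLUMNS = {
    "analyst": {"ssn"},
    "auditor": set(),
}


def mask(value, keep=4):
    """Replace all but the last `keep` characters with asterisks."""
    text = str(value)
    if len(text) <= keep:
        return "*" * len(text)
    return "*" * (len(text) - keep) + text[-keep:]


def query_through_gateway(rows, role):
    """Return rows with the role's redaction policy applied.

    Roles missing from the policy table are refused outright, mirroring a
    cluster that is reachable only through the gateway.
    """
    if role not in REDACTED_COLUMNS:
        raise PermissionError(f"role {role!r} has no access")
    hidden = REDACTED_COLUMNS[role]
    return [
        {name: mask(value) if name in hidden else value
         for name, value in row.items()}
        for row in rows
    ]


# Made-up sample data standing in for records stored on the cluster.
cluster_rows = [{"name": "Ada", "ssn": "123-45-6789"}]

analyst_view = query_through_gateway(cluster_rows, "analyst")
auditor_view = query_through_gateway(cluster_rows, "auditor")
```

The stored rows are never modified; each role simply receives a different view of them.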
Hadoop as workhorse
While the industry has no shortage of SQL-on-Hadoop solutions, SQL-to-Hadoop bridges are a bit different from standalone solutions like Hive and Impala. The bridges don’t merely allow SQL-knowledgeable professionals to work with Hadoop. Instead, they bring Hadoop data to specific database platforms, effectively utilizing Hadoop as a specialized, embedded engine, rather than exposing it as a new database platform in its own right.
I covered Actian’s Hadoop SQL Edition in my Weekly Update two weeks ago. Actian’s product, which integrates its Vector database with Hadoop, was coincidentally made generally available Tuesday, the same day BDS was announced. Big Data SQL and Hadoop SQL Edition take different architectural approaches, but they both embed Hadoop. Microsoft does something similar with its Analytics Platform System, released in April.
Like other infrastructure, Hadoop is most powerful when it recedes from view and works behind the scenes. So expect to see more Hadoop-embedded solutions emerge. In fact, Oracle’s McClary said BDS technology may even make its way from Exadata and the Big Data Appliance to the mainstream Oracle database.

Hadoop analytics startup Karmasphere sells itself to FICO

Hadoop startup Karmasphere, which launched in 2010, has sold its intellectual property to credit-scoring specialist FICO. Karmasphere appears to have been struggling for adoption and funding, so selling its assets was not an unforeseen turn of events.

Spark is now part of MapR’s Hadoop distro, too

MapR is the latest Hadoop vendor to embrace Apache Spark, adding the entire Spark stack of technologies to its distribution. It’s a smart move by MapR, but just more validation that Spark might be the data-processing framework of the future.

Citus Data builds a column store for Postgres

Citus Data, a startup focused on turning PostgreSQL into a scale-out analytic engine, has developed a columnar data store for the popular open source database. The company is open sourcing its extension for single-node environments, although it’s offering a distributed version as part of its CitusDB software. Citus already supported interactive SQL queries over Postgres (on which its technology is based), Hadoop and MongoDB, but columnar stores are faster for certain types of queries. Also, the compression features of the ORC file format that CitusDB uses can cut disk space by more than half.
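Why columnar layouts help with analytic queries and compression can be shown with a small Python sketch. It is a generic illustration with made-up data, not Citus Data's extension or the actual ORC encoding: an aggregate over one column only has to touch that column's values, and a column of repeated values collapses well under a simple scheme such as run-length encoding.

```python
from itertools import groupby

# Row layout: each record is stored together (made-up sample data).
rows = [
    ("2014-07-01", "US", 10),
    ("2014-07-01", "US", 20),
    ("2014-07-01", "KR", 5),
    ("2014-07-02", "KR", 15),
]

# Column layout: all values of one attribute are stored together.
dates, countries, amounts = (list(column) for column in zip(*rows))

# An aggregate such as SUM(amount) reads a single column and skips the rest.
total_amount = sum(amounts)


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Sorted, low-cardinality columns compress to a handful of pairs.
encoded_dates = run_length_encode(dates)
encoded_countries = run_length_encode(countries)
```

The same data in row layout would interleave dates, countries and amounts, so neither the single-column scan nor the run-length trick would apply as directly.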

Apache Tajo SQL-on-Hadoop engine now a top-level project

Apache Tajo, a relational data warehouse system for Hadoop, has graduated to top-level status within the Apache Software Foundation. It might be easy to overlook Tajo because its creators, committers and users are largely based in Korea, and because there are many similar technologies, including one developed at Facebook, but the project could be a dark horse in the race for mass adoption. Among Tajo’s lead contributors are an engineer from LinkedIn and members of the Hortonworks technical team, which suggests those companies see some value in it even among the myriad other options.

MapR now supports YARN, puts HP Vertica on top of Hadoop

MapR is continuing along its path to Hadoop glory with new support for the YARN resource manager and a direct integration with the HP Vertica analytic database. In such a competitive space, every little edge matters.

SQL-on-Hadoop startup Splice Machine closes $15M in venture capital

Splice Machine, a startup promising a SQL-on-Hadoop database that can handle both transactional and analytic workloads, has closed a $15 million series B round of venture capital from InterWest Partners, along with Mohr Davidow Ventures. Supporting transactional workloads would put Splice Machine in a good position among the glut of companies and projects letting users perform SQL operations on Hadoop, because most are strictly for analytics. The big question for Splice Machine, though, might be whether companies actually want to run transactions on that data or whether they’re willing to stick to a tried-and-true database for that.