Hadoop vendor Cloudera is singing the praises of its own SQL query engine, releasing on Monday the results of a benchmark that shows how Cloudera Impala compares to Apache Hive and a mystery proprietary database. As one might expect, Impala easily bested its competitors in the benchmarks (no vendor has ever, to my knowledge, released results highlighting its product’s inferiority), but Hive and SQL databases probably aren’t Impala’s real rivals.
Its more-direct competition comes from other Hadoop vendors doing their own things to try to make Hadoop queries faster and more interactive. Because the choice right now isn’t to Hadoop or not to Hadoop, it’s which flavor of Hadoop to do. Companies that are using Hive are already using Hadoop, so that decision has been made. And even Cloudera, unless its stance has shifted drastically, acknowledges that Impala isn’t yet a replacement for a purpose-built data warehouse or relational database system.
(Although, a future where Hadoop vendors do actively try to upset the database market would be interesting. Maybe we’ll get a sense of how realistic that is during sessions with the CEOs of Cloudera, Hortonworks and Pivotal at Structure Data in March.)
If having some degree of interactive SQL queries is important to users, they’ll likely be comparing one Hadoop distribution to another on this front. So while Cloudera is smart to position the choice as being between Impala, Hive and DBMS-Y (“one of the top 5 commercial MPP query engines on the market,” a Cloudera spokesperson confirmed), the more relevant comparison is probably between Impala and the Hortonworks-backed Apache Stinger/Tez, Pivotal HD Hawq, Presto (on Qubole), the MapR-backed Apache Drill, Hadapt, IBM BigSQL, Shark … you get the point.
For what it’s worth, everyone is faster than Hive: that’s the whole point of all of these SQL-on-Hadoop technologies. How they compare with each other is harder to gauge, and that determination is probably best left to individual companies to make by testing on their own workloads as they weigh their own buying decisions. Still, here is a collection of additional benchmark tests showing the performance of various Hadoop query engines against Hive, relational databases and, sometimes, themselves.