If the future of BI is Hadoop, SQL and the cloud are the glue

[youtube http://www.youtube.com/watch?v=neo6TE41I8I&w=560&h=315]
Session Name: The future of BI and Hadoop
Ravi Murthy
Justin Borgman
Tomer Shiran
Ashish Thusoo
Ben Werther
Audience Member 1
Audience Member 2
All right, up next we have the future of business intelligence and Hadoop. And that's going to be moderated by Ravi Murthy, engineering manager at Facebook, and he's going to be talking with Justin Borgman, the CEO of Hadapt; Tomer Shiran, director of product management from MapR Technologies; Ashish Thusoo, Co-founder and CEO of Qubole; and Ben Werther, founder and CEO of Platfora. Please welcome the next panel to the stage.
Four people, this is a tight couch. So, hi, I'm Ravi, I manage the analytics infrastructure at Facebook. The topic of this panel is the future of BI and Hadoop. As Niels Bohr said, it's tough to make predictions, especially about the future. But then, a good way to predict the future is to just invent it. So I'm here with an awesome panel of folks who are basically looking to invent the future of BI, and we'll talk about what the challenges and the opportunities are, and at the end we'll have some time for questions. So to kick things off, why don't we do a quick round of introductions? Talk briefly about what you're working on, and then we'll get into the questions.
Great. I'm Justin Borgman, Co-Founder and CEO of a company called Hadapt. Hadapt is effectively an analytical database for Hadoop; we allow you to do large-scale interactive analysis on data inside your Hadoop cluster, which means you have access to both structured and unstructured data. But we provide that access through the familiarity of SQL and SQL-based tools, so you can connect your existing BI tools and legacy SQL infrastructure to data now inside of Hadoop.
I'm Tomer Shiran, the director of product management at MapR, and for those who don't know, MapR is the Hadoop technology leader. We're really focused on making Hadoop easy, dependable and fast. We provide a complete distribution that really makes Hadoop look like a lot of the other infrastructure that you may have in your data center, whether it's databases or enterprise-class storage, so we provide things like snapshots, high availability, NFS access, and disaster recovery. MapR has, I think, several thousand deployments across many different verticals.
Hi, my name is Ashish Thusoo, I'm the Co-Founder and CEO of Qubole. Qubole is an analytics service based in the cloud. It's available in the Amazon cloud and it's targeted towards analysts, who can bring in data sets from various sources into the service, explore them, and create pipelines or transformations on the service. It's based on Hadoop and Hive, so it provides those interfaces. If you are looking at any of those technologies, it's very, very easy to get started with Qubole and really get the power of big data at your fingertips, in a very turn-key manner, through the service.
I'm Ben Werther, founder and CEO of Platfora. We build a solution that really makes Hadoop ready for business users. So you land your raw data in any Hadoop distribution, build a data reservoir, and then Platfora turns that into subsecond, interactive, visual, exploratory BI analysis. It doesn't require IT to get going. It allows regular business users to derive value immediately: visualizing, collaborating, and discovering things in data.
Alright, great. So, to kick things off, let me ask a simple question, and this is probably something a lot of folks in the audience have seen: almost every other day, it seems like there is a new SQL-on-Hadoop engine. At last count I think there were probably 95 of them. And the obvious question is: what's going on here, why are there so many systems being built, and what's the place and role for these sorts of systems in the BI landscape? So, to jump in and talk about this: one way to look at it is, if Hadoop has kind of [inaudible] on this promise that we've been hearing about, where is the need for all these new systems to come about?
[inaudible] I actually want–
So, I think SQL as a language is [inaudible] by a lot of people, and a lot of tools also generate SQL under the covers. So from that angle, putting SQL on top of Hadoop makes a lot of sense. And of course there's Hive, which was something that we had started; I used to work at Facebook before doing Qubole, and I was one of the creators of Hive back then, in 2007. But what has also happened is that as a system, Hadoop is not really a real-time, low-latency system where you can quickly go in and analyze your data sets, and so on and so forth. And frankly speaking, Hive has also been caught up in that legacy. And that's why there is a need that people feel for a faster SQL-based system on Hadoop: there is a lot of data there, and they want to be able to query that data. That's why all these systems are starting up. I don't think there is space for 95 of these, maybe half a dozen or less. But it's a good thing to see: a lot of activity in the ecosystem points to a certain problem which is being solved, and a lot of people are taking different approaches to solve it. Eventually there will be some winners out of this whole shakeout, and that's how I see it. But there's certainly a need for that.
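The idea Ashish describes, putting SQL on top of Hadoop, comes down to compiling a query such as `SELECT country, COUNT(*) ... GROUP BY country` into map and reduce phases. A minimal sketch of that translation in plain Python (the rows and column names are hypothetical, and a real engine like Hive adds a planner, a metastore, and distributed execution):

```python
from collections import defaultdict

# Hypothetical table rows: (user_id, country) pairs.
rows = [("u1", "US"), ("u2", "IN"), ("u3", "US"), ("u4", "BR"), ("u5", "IN")]

def map_phase(rows):
    # Emit (key, 1) per row, mirroring the GROUP BY key of the query.
    for _, country in rows:
        yield (country, 1)

def shuffle(pairs):
    # Group values by key, as the MapReduce framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each key, mirroring COUNT(*).
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)  # {'US': 2, 'IN': 2, 'BR': 1}
```

Because every query pays the cost of scheduling these batch phases, latency is seconds to minutes, which is exactly the gap the newer interactive engines on the panel are trying to close.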
As one of the players in that space, can you talk about, from your standpoint, what it is you're trying to achieve with the Apache Drill project, and how do you see that fitting into solving this problem?
Yeah. It's interesting, there's been a lot of talk about SQL on Hadoop recently. I think it's important first of all to put that in context, because companies that use Hadoop have so many different use cases running on Hadoop. So we have credit card companies doing business recommendations and fraud detection, banks detecting rogue trading, companies offloading ETL from traditional data warehouses onto the Hadoop environment. There are so many different use cases: predictive analytics, ETL offload, log file analysis. So many different things that people do with Hadoop other than straight SQL interactive queries. But I think that for SQL on Hadoop there is one use case for these big data environments, which end up having a lot of data: the ability to interactively query and explore that data. And we're happy to partner with a variety of solutions that do SQL on Hadoop. Hadapt is a close partner of MapR; we've worked together with many customers. And we also invested in and started the Apache Drill project, which is an Apache open-source project providing interactive SQL [inaudible].
So, one part of SQL on Hadoop is to do ETL into another external system, and the Platfora solution is all based on this notion that you use Hadoop to transform the data into smaller data sets, or build up the cubes, and then load it into another system. So is there even a need, or is the expectation that Hadoop would ever be a good place to do your interactive analysis? Are there just too many solutions searching for a problem that isn't actually there?
I think there is. You see that Google internally uses a system called Dremel, and they use it very extensively throughout the organization. So I think there's definitely a movement to be able to provide interactive queries on Hadoop, right? That's one of the reasons we started the Apache Drill project for interactive queries. It's really to enable that kind of use case, because it isn't really addressed by MapReduce, which is more focused on batch processing.
Yeah, Ravi, it's a good question. I think about the way this evolves: if you think about what you're trying to achieve by adding a SQL layer on top, I mean, we're very supportive of better access modes for getting at the data, and things like Drill and Impala are all good things that evolve the ecosystem. But I think at the same time it's worth asking what the end point is that we're targeting. A lot of people are assuming that the right answer is, let's rebuild the old data warehouse architecture inside Hadoop, and that'll be a good thing, and it'll be almost as good as the old way. And I think there's an opportunity to do something that's much, much better, much, much more agile. I mean, we're landing raw data without having to make all of these decisions about modeling up front. How do we then eliminate all of that IT work of having to build aggregations and indexing, organizing and curating the metadata, and make it really democratized and accessible? So we want to take advantage of all these enhancements, and I think they'll be a good part of the stack, but we go up a layer, where we think about using a sort of two-level engine that automatically drives MapReduce jobs, and potentially leverages these kinds of new enhancements, to dynamically and automatically build out a sort of scale-out, in-memory view of the data based on what people are interested in, not based on some IT decisions upfront.
And then allow users to have this sort of consistently fast, subsecond performance against the in-memory aggregates. So they're never having to deal with writing SQL and figuring out whether they're going to write a query that's too slow; they're just visually interacting and exploring, but knowing that downstream the engine will build an aggregate intelligently, using these new technologies where–
It makes sense.
So, there seem to be two schools of thought here: one is Hadoop as an ETL engine loading into external systems, and then I see this other architecture of having a database on Hadoop, or sitting next to Hadoop, co-located with Hadoop. Can you elaborate on what you see as the pros and cons of the two approaches?
Yeah. I mean, in our opinion, we think that part of the appeal of Hadoop in the first place is this notion that it scales so cost-effectively and it becomes this landing area, or, we even had one customer that calls it a landfill, where they can just dump everything. And if you can bring analytics to that landfill, if you will, and actually analyze that entire broad data set, which I don't think you can really do in memory, just because it's too big, I think that opens up new opportunities for you to understand that data. And in order to do that, there are two things you have to do. First of all, you need the feature richness that SQL provides, and that connectivity to existing BI tools and other legacy applications. But I think the other piece, which some of these new SQL-on-Hadoop solutions miss out on, is being able to control the storage layer and the optimizations that come with that, for example, indexing. HDFS is actually a poor storage engine from the standpoint of doing fast retrieval and analytics. There are 30 years of relational database research that we kind of tossed out when we first got into this space, and we tossed it out not because it was bad, but because it was hard to do.
And the first step was building a fault-tolerant file system. I think the next evolution, and this is really what Hadapt is all about, is leveraging that storage layer and making it optimized for fast retrieval: doing things like locality-based optimizations to minimize IO, and indexing to minimize those broad sequential scans. All of this leads to better performance and better interactivity across broader, larger data sets.
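The scan-versus-index trade-off Justin describes can be made concrete with a toy example. This is a sketch only: the rows and schema are hypothetical, and a real engine's indexes are on-disk structures (B-trees, zone maps), not a Python dict:

```python
# Hypothetical rows of (id, value); a stand-in for a file on HDFS-style storage.
rows = [(i, i * 10) for i in range(1_000_000)]

def sequential_scan(rows, target_id):
    # Without an index, a point lookup must touch rows until it finds a match,
    # which is what a plain file scan forces on every query.
    scanned = 0
    for row_id, value in rows:
        scanned += 1
        if row_id == target_id:
            return value, scanned
    return None, scanned

# Controlling the storage layer lets you build an index once and reuse it.
index = {row_id: value for row_id, value in rows}

value, rows_touched = sequential_scan(rows, 999_999)
print(rows_touched)    # 1000000 rows touched by the scan
print(index[999_999])  # 9999990, found with a single probe
```

The index costs extra build time and storage, but it turns the broad sequential scan into a direct lookup, which is the source of the interactivity being discussed.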
So, this discussion is interesting from a technical standpoint, but I think a lot of the questions, and a lot of the answers, are about what it means to me as a business planner. I'm a businessperson, and there was a great talk two sessions back about the classic case of "I need a big data strategy, give it to me." Unfortunately, the answer is a lot more complex than that. To a person like that, who is approaching this from a business standpoint, how would you describe the value of Hadoop in my set of BI systems? With traditional BI there's a bunch of things that I know how to do and have gotten comfortable doing. Tell me, what is it that I can do now that couldn't be done before?
Yeah. I mean, I think, first and foremost, you're getting access to a broader set of data, and I think that gives you more opportunities. But I think the other aspect is this notion of unstructured data, and being able to do things with that as well, which you couldn't do in a traditional data warehousing or database system. So maybe you want to do sentiment analysis, or you want to do text search, or you want to mine your Twitter or social media traffic. These are new opportunities that these technologies are now allowing. In Hadapt's case, one of the things that we have is called the Hadapt Development Kit, which is a way of extending SQL. So the person in your organization that knows MapReduce as a programmer can build additional SQL functions that actually exercise more of these advanced capabilities of the Hadoop platform. They might execute a MapReduce job, or a Mahout machine-learning activity, that is enriching that data in ways that you couldn't otherwise do in a traditional data warehouse.
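The extension mechanism Justin describes, registering custom functions that SQL can then call per row, can be sketched as follows. This is not the Hadapt Development Kit's actual API; the registry, the `sentiment` function, and its word lists are all made up to illustrate the idea:

```python
# Toy registry standing in for a SQL engine's UDF mechanism.
UDFS = {}

def register_udf(name):
    def wrap(fn):
        UDFS[name] = fn
        return fn
    return wrap

@register_udf("sentiment")
def sentiment(text):
    # Deliberately naive scoring: +1 per positive word, -1 per negative word.
    positive = {"great", "love", "good"}
    negative = {"bad", "hate", "awful"}
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

# The engine would evaluate SELECT sentiment(tweet) FROM tweets row by row:
tweets = ["I love this great product", "awful experience, hate it"]
scores = [UDFS["sentiment"](t) for t in tweets]
print(scores)  # [2, -2]
```

In a real deployment the registered function would fan out as a MapReduce job over the cluster rather than run in a local loop; the SQL-facing contract stays the same.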
Can I put the same question to Ben, in terms of the use cases that you're seeing from your customers' standpoint: things that fundamentally look different from old BI?
Yeah, yeah. So I mean, there's a perfect example that kind of captures what it is. One of our customers, and hopefully we'll be talking about these guys publicly pretty soon, had about 50 analysts working against sort of traditional SQL stores, kind of siloed, and a lot of data that wasn't able to get in there, because it would have been another year or more of modeling and design to get it going. They moved to the Hadoop stack, because now they can start to pool all this data together and build sort of a data reservoir, which is this emerging idea that we're seeing pop up everywhere, that people want to build. But the result was that maybe five of those 50 were actually able to be productive with that data, and even then it was slow batch jobs and frustration, because it wasn't really effective. And so we were able to go in and essentially allow those 50 users, within a week or so, to all be productive again: interacting, exploring, discovering–
So to summarize, it seems like you're giving a lot more power to the end-users, to be able to–
Instantly [inaudible]
Find a population here. Let's compare the behavior of those people, who in other silos before looked different. We've merged the data together: look, weird behavior. Let's see how those behave in terms of advertising behavior, transactions–
One might argue, though: this whole notion of democratizing data access is great, and I think there is a lot of value to be had from that. The flip side of it is the argument about what keeps this from descending into chaos. If you let everyone in the organization just sort of [inaudible] work on large amounts of data, how do you prevent–
If you make it consistently fast, and if you offload and accelerate it, then you're encouraging that usage in a sustainable way, whereas if everybody's generating Hive jobs, you have a problem. And so I think part of it is you give them the tools and the infrastructure that let people democratize in a way that doesn't bring down your infrastructure as you have success doing that.
So, I think chaos is there, but out of chaos you get a lot of good insights. By and large, more and more organizations are trying to become more data-driven, so the tools and the infrastructure that support that have to catch up. It's not the old way of doing things. By the way, all this moving of data into one place is the classic data warehouse strategy, which has been there for a long, long time. But where it fails is that all that strategy predicates that there is some central being who is managing that data, who's controlling whatever is coming into this warehouse and what is going out. And primarily that is because of the scalability limitations, or the cost limitations, or whatever, of that technology. With Hadoop, for the first time you have a cost-effective way where you can pump in data, almost anybody can do it, and out of that come a lot of insights and use cases that were never even thought about. And the other thing with Hadoop is that for the first time you can instrument the heck out of everything.
So, it's a really low-cost–
It's a low cost…
–place to store all your data without thinking up front–
So, each–
–about what the value of what you’re storing.
Right. So the dynamics change, right? In the past, all the warehousing vendors priced by the terabyte, and that was a fine business model, because each byte that was stored was fairly useful. In the Hadoop ecosystem it's very different: each byte that is stored in Hadoop is, by itself, not that useful; it's the aggregation that becomes very useful. So the dynamics totally change. You need a system that is able to store data cost-effectively and in a scalable manner, and then you need a system that allows a lot of people to access it in a way that is user-friendly, so that they can generate insight.
So, you bring up costs. I think that's definitely a big factor in many of these decisions, and the hard part is to quantify the ROI on that investment. When you talk to your customers and you present this sort of low-cost solution, and they're comparing it to a traditional enterprise data warehouse that has a certain price point, how do you express the ROI on a Hadoop system?
So, I'd be happy to give some concrete examples. I think we're at a point where many of our customers have realized significant revenue or cost-saving advantages from Hadoop. Just to give a few examples: one customer in the telco space is offloading all of the ETL work from a Teradata deployment and saving tens of millions of dollars every year. It's a massive deployment, and what they noticed is that about 70% of the work that was being done in Teradata was just ETL; they were cleaning up the data, filtering it, things like that. By doing that inside Hadoop, at 2% of the cost per terabyte, they save a lot of money, right? So that's an easy kind of entryway into Hadoop. And a lot of the other examples are existing applications where people are really restricting the amount of data they can look at. For one insurance company, they're calculating these risk triangles that they use to determine premiums and things like that. Now, instead of doing that coarse-grained at the neighborhood level, they can calculate it with very fine-grained parameters for every house, right? And they can get more accurate models. Another example is identifying rogue trading at one of the banks; in the past that could only be done using end-of-day market data, and you could miss a lot of things by limiting yourself to end-of-day market data, rather than–
Do you work on a much larger data set here?
Right, and much more fine-grained as well. So you get much more accurate models, and that translates directly into value: if I catch rogue trading, I save a billion dollars, or I'm able to determine better pricing, things like that, just by having better models.
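The offload economics Tomer sketches can be checked with back-of-the-envelope arithmetic. Only the 70% (share of warehouse work that is ETL) and 2% (Hadoop's relative cost per terabyte) figures come from the discussion; the dollar amounts and deployment size below are hypothetical:

```python
# Back-of-the-envelope version of the ETL-offload math.
warehouse_cost_per_tb = 40_000.0                    # hypothetical annual $/TB
hadoop_cost_per_tb = warehouse_cost_per_tb * 0.02   # "2% of the cost per terabyte"
etl_fraction = 0.70                                 # "70% of the work ... was just ETL"
total_tb = 500                                      # hypothetical deployment size

before = total_tb * warehouse_cost_per_tb
after = (total_tb * etl_fraction * hadoop_cost_per_tb           # ETL moved to Hadoop
         + total_tb * (1 - etl_fraction) * warehouse_cost_per_tb)  # the rest stays

savings = round(before - after)
print(savings)  # 13720000, i.e. ~$13.7M/year under these assumptions
```

Even with made-up unit costs, the structure of the calculation shows why the savings land in the "tens of millions" range for a large deployment: the bulk of the workload moves to storage that is fifty times cheaper per terabyte.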
That's great. Switching gears a little bit: one of the other trends in the space of BI is the move to the cloud, and this is a question right up your alley. As you look at this move from on-premise into the cloud, what sort of challenges do you see customers facing in terms of the transition, and the way they actually operate and deal with their data?
So, I think, first of all, the cloud has immense benefits, in the sense that you as a business can just focus on your data and transformations and insights, and don't really have to care about what the most optimal stack and technology is that you have to use; it's very turn-key, you can get started very easily, and so on and so forth. And the biggest challenge that we see is the perception around a public cloud. By the way, Qubole works on the Amazon cloud right now. There's a perception question: is the cloud safer than my own private data center? I think it's more of a perception issue, because a lot of the data leakages that have been reported recently have mostly come from private data centers. You hardly ever hear, "okay, my data got stolen because of [inaudible]." And I give this analogy: you have your cash; do you keep your cash at home, or do you put it in a bank? The same thing works with the public cloud.
So, do you see the business kind of changing their perception–
I think it’s a gradual change.
–of Cloud.
It's not a step-function change, but already, if you roll back, say, three years, there were far fewer people who were comfortable with this argument. Today there are a lot more. There are industries that are very comfortable with it, and there are industries that are still very, very jittery about this decision. And some of that has also got to do with legislation: some of these places have a lot of legislation, and they need to know exactly which jurisdiction their data resides in. The cloud doesn't give that, so there is some gap between the capabilities the cloud and classic [inaudible] provide.
So, beyond just getting customers comfortable with the cloud, are there any unique challenges to doing BI in the cloud, beyond just doing your operations in the cloud?
So, in terms of challenges– benefits or challenges?
So, the benefits are obvious: you don't need to run these operations, it's completely elastic, it can completely adapt to your workloads, and you don't need to run your data centers at peak capacity. There are immense benefits in moving to the cloud. One of the challenges I mentioned was security. The other challenge, of course, is data gravity: if there's a lot of data inside the data center, you always move the smaller data sets closer to the bigger data sets. That said, a lot of industries are now actually producing a lot of data in the cloud itself, so progressively data gravity can work for the cloud, as opposed to against it, as it does today. So security and data gravity are some of the biggest challenges, but I think those trends are moving in a direction that is more conducive to a cloud-based environment, because the benefits are immense. Elasticity is a huge benefit: you don't need to own a hundred servers; you can get a hundred servers when you want, and be done with it.
That’s great.
From our perspective, we see many different applications in the cloud. I think what you've said about data gravity is absolutely true: most of the use cases are people that are generating data in the cloud. And there are a variety of companies that have very large on-premise data center deployments, but for some new applications, maybe it's telemetry data where they're collecting information from different sensors around the world, there's no reason they can't do that application in the cloud. And so we have a partnership with both Amazon and Google; they both selected MapR as their Hadoop distribution in the cloud. So we've spent a lot of time working with customers who are doing Hadoop specifically in the cloud, and I definitely think that's an area that's growing. I think 10 years from now we'll all be running in the cloud. There's this–
I was about to ask–
— and one is benefits.
If you're looking into the crystal ball and projecting out a two-year, five-year horizon, a couple of questions come to mind. One is the future of Hadoop itself: each one of you can take a minute to describe what you expect Hadoop to transform into, and how that is going to change BI. What happens to the traditional data warehouses? Do they morph into something that looks like big data, or is Hadoop going to replace the traditional warehouses?
Yeah. I think this idea of the data reservoir, and there are different terms for it, but the idea that you land the data in an HDFS-like system, and then there's a range of different capabilities on top of that that allow you to drive value, but you're no longer locked into one proprietary architecture, and you can bring all these data sets together without having to figure out how you're going to use them up front: I think that is the big, big shift that's going to happen. It's happening now, and five years from now, my bet is we look back and the idea of the enterprise data warehouse, in a relational, sort of locked-down model, is more like the mainframe. It hasn't gone away, it's still there, but those are the applications you just don't touch; you put them in the corner. All the usage is about landing raw data from across your business in these reservoirs, and then: how do you drive business value from that, what is the stack of technologies, and how does that get exposed to users? And my big additional implication is that I think the existing BI vendors, the existing stack, are really not poised to take advantage of that. They're based on a world of traditional SQL and regimented structures, and something built natively for this new environment is just a qualitative, quantum leap in what you can really do with data, compared to the old ways.
We're running out of time, but we can stretch it a little if you keep it short.
I’ll open up for questions.
So, I think compared to the old-school data warehouses, real-time is where Hadoop lags. There are a lot of projects out there trying to solve that, and it is going to get solved. Once that gets solved, then the data gravity thing comes into the picture: if most of your data is in Hadoop, all these data warehouses tend to become the data marts of tomorrow. And I think that evolution is going to happen.
So, I think Hadoop has evolved into the platform where you're going to store all your big data, and now it's about how we enable access to that data in a variety of different ways. It started with batch processing; on that platform we enabled NFS access, so you could use file-based applications. We now have a solution for table-based access, and you'll see search coming, you'll see real-time processing with things like Storm, and you'll see SQL interactive queries. So it's about bringing more and more kinds of access to that data.
And just to reiterate what Ashish was saying: I think ultimately what that means is that Hadoop does become the data warehouse of the future. It's a very different data warehouse, and I think that's the motivation. It's not just cost, but rather what new things you can do with it, and it's first and foremost access to all of that data, all types of data, so you don't have those silos: different types of data, structured and unstructured. And increasingly we're going to see vendors like us develop tools to make it easier to tap into all of that data. So I think that's certainly the trend.
At this time I want to open it up for questions. If there are any from the audience, please walk up to the mics on the two sides. Yep.
You were talking about latency being one of the key problems. I'm more concerned about ACID properties, right? How can I be sure I'm getting all the data, not missing something, and not getting garbage? How are you solving that? Because without solving that, all the BI discussion is useless.
Yeah, I'll just quickly answer it. That's where actually using RDBMS technology underneath is helpful, because the database is ACID-compliant. So that's something we'll get to more quickly, I think, than any of the other SQL-on-Hadoop vendors: being able to do updates and deletes and keep that ACID property.
Any other?
Just a follow-up: if you use a database, can you still look at infinite amounts of data, though?
Sure we can. We're loading structured data into effectively an RDBMS built inside of Hadoop. It's an integrated architecture, and I know we're short on time, so I'll be happy to talk to you more about it afterwards. But effectively it's all in the Hadoop cluster, and we're loading it into that optimized storage.
But I think the flip side of it, just to add more color to that question: in this world of Hadoop, is the value per record, or value per byte, low enough that it's actually okay to not have ACID? That it's okay to lose some amount of data?
And I think it's also different. In a SQL world you would define the required accuracies and think about these problems a lot. But this is largely a [inaudible] world of data being landed and processed incrementally. It's not that hard to keep track of all the files that have landed, make sure they were processed, and make sure you have all the data. We deal with cases where, for compliance reasons, you have to have the right values. It's not as hard as it seems to do it at that level.
And the data that is noisy tends to be the large [inaudible] data; that is the bigger part of the data. Transaction data is very, very small.
I think it depends on the application.
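The bookkeeping Ben describes, tracking which landed files have been processed so nothing is silently missed, can be sketched as a manifest diff. The file names below are hypothetical; a real pipeline would read the landed set from the file system listing and the processed set from a job log or metadata store:

```python
# Hypothetical manifest of landed files vs. files a pipeline has processed.
landed = {"logs/2013-06-01.gz", "logs/2013-06-02.gz", "logs/2013-06-03.gz"}
processed = {"logs/2013-06-01.gz", "logs/2013-06-03.gz"}

def completeness_report(landed, processed):
    # Landed-but-unprocessed files are gaps; processed-but-never-landed
    # files point at a bookkeeping bug. Empty sets on both sides mean
    # the data set is complete.
    return {
        "missing": sorted(landed - processed),
        "unexpected": sorted(processed - landed),
        "complete": landed == processed,
    }

report = completeness_report(landed, processed)
print(report["missing"])   # ['logs/2013-06-02.gz']
print(report["complete"])  # False
```

This is the sense in which completeness, as opposed to transactional ACID, is "not as hard as it seems" in a land-then-process world: the unit of accounting is the file, and set arithmetic over manifests catches the gaps.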
Good, that’s great. Is there any other questions?
We’re out of time. Thank you all.
Pretty good. Thank you.
Now it's time for lunch. I want to remind you of just a few things: you have an hour for lunch, and please network with everybody.