Hadoop applications abound, but Hadoop still needs improvement

[youtube http://www.youtube.com/watch?v=z7BhGEQX9BQ&w=560&h=315]
Session Name: The Hadoop Of Plenty: What New Tech Innovations Is Hadoop Enabling?
Bradford Stephens, Derrick Harris, Jonathan Gray, Muddu Sudhakar, Omer Trajman
Next, we’re going to be talking about the Hadoop of plenty: what new tech innovations is Hadoop enabling? And I’m going to see how many more times I can say the word Hadoop in the next five minutes. It’s going to be a talk moderated by Derrick Harris, a senior writer for GigaOM and a conference chair of this event, and he’s going to be speaking with Jonathan Gray, founder and CTO of Continuity, Muddu Sudhakar, VP and GM at Pivotal, and Omer Trajman, founder and VP of operations at WibiData. Please welcome the next Hadoop panel!
So, I know Chris made a joke and I saw someone say on Twitter yesterday, the drinking game is “take a shot every time someone says Hadoop at Structure:Data”, so I think at this point you should be on the floor and you’re done for after the next hour. This afternoon there’s going to be a bunch of corpses on the floor. Anyhow, we’re going to get going talking about Hadoop and not batch processing and not a lot of MapReduce, I hope. So, with that, I just want to let the panelists briefly introduce themselves, so you get a sense of where they’re coming from and why they’re talking about what they’re talking about.
My name is Jonathan Gray, I’m the co-founder and CTO of Continuity. We’re building a big data application development environment, or platform, and what we’re really trying to do is enable the oncoming wave of what we see as apps being developed in this space. So as Derrick was saying, as we move from MapReduce and batch stuff into actual real-time, user-interface types of applications, we’re really trying to enable that at Continuity.
My name is Muddu Sudhakar, I’m at Pivotal, and before that I was at VMware. I think that at Pivotal, as Paul probably discussed yesterday, what we’re trying to do is solve the problems of big data analytics and big data, all in Hadoop. For us Hadoop is a very big part of our journey and vision going forward, and Hadoop includes both batch processing and streaming. These two are the areas that we are going to make big. Also, on top of that, for how we build applications, I think CloudFoundry will be part of Pivotal, to enable people to build apps and use analytics as a part of application development.
I’m Omer, with WibiData, we build big data applications, and that means helping organizations use data to really create a better application experience for users, more than just having applications that are driven by static rules or by limited resources, in terms of how they view apps on the web or interact with them on mobile devices. We look at how the big data technologies that are being used in the back office in batch can be applied in real time. How do you really apply for example predictive models and score in real time as data changes to guide users towards a better application outcome?
So to start I want to talk a bit about the state of now, and I think Jonathan and Omer actually have interesting backgrounds here. I mean, Omer used to work for Cloudera and Jonathan worked at Facebook. So Jonathan, could you just talk a bit about how big data applications are built within Facebook? I’m guessing that it’s different to what a lot of Cloudera’s customers were doing.
The thing is it takes a team of dedicated engineers. It’s top-tier talent focusing on a problem and developing that application and seeing it through from prototype into production, and that’s really hard. It takes a huge team and a lot of resources and a ton of time, and it’s not something that you do from a user perspective. Facebook had to make changes to HBase and changes to Hadoop and do all different kinds of specialized things in order to really make it work for them. The thing is, it worked, and they’re getting tremendous success out of it, whether it’s Facebook Messages and very online OLTP applications that are based on HBase, or their entire Hive data warehouse, all those different types of things. That’s the challenge that we have today, which is the Facebooks and the Yahoos and the Googles of the world can get tremendous value and run their businesses on the back of Hadoop. But there’s a huge gap there in terms of what the capabilities are of the rest of the world out there and where the technology’s at.
Alright. And Omer, I think we’ve spoken before about how one reason that you wanted to move to WibiData was this ability to actually build applications for customers, right, instead of…
Yeah, and we saw very similar needs, which is that most organizations, now even more so because of what companies such as Cloudera have done, are able to actually stand up and run the platform. They can get beyond the initial 10, 15, 20 people that Facebook needed just to get things going at their scale. But then the development tools on top of that were not quite there. They’d have to piece a lot of things together, and certainly with a lot of the field team at Cloudera we helped with that, but it was very bespoke. Where we wanted to get to was actually an application. How do I actually tell the system, “Here’s what I want, my data scientists have figured out a model, just make it happen and affect the experience that my customers have”? And so that’s really the end game: how do we up-level the platform, as Continuity is doing, and as we are trying to do with our open-source Kiji project. Then the end goal is actually building those applications so people don’t have to develop from scratch.
And I think one of the interesting things you guys are working on at Wibi with Kiji, and we’re going in a similar direction, is unifying those real-time and batch workloads and the APIs and programming interfaces you do it with. The way it works today, if I’m a data scientist and I’m playing with R, I’m doing Hive queries or whatever, then when I find the needle in the haystack, the model I’m trying to build, I have to do a whole new implementation in real time. Totally different system, totally different set of requirements. I think over time what we’ll see is some more flexibility: writing to a single set of APIs, being able to move that, running it in real time, running it in batch, doing streaming. That will just make this whole life cycle between exploration and actually productionizing stuff way shorter.
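The idea of writing a model once against a single interface and running it in both batch and real time can be sketched in a few lines of Python. This is only an illustration of the pattern, not the Kiji or Continuity API: the `score` model, its feature names and weights, and the `run_batch` and `run_stream` helpers are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, float]


def score(event: Event) -> float:
    """Hypothetical model: a weighted sum of two made-up features."""
    return 0.7 * event["clicks"] + 0.3 * event["purchases"]


def run_batch(model: Callable[[Event], float], events: List[Event]) -> List[float]:
    """Batch mode: score a stored dataset in one pass."""
    return [model(e) for e in events]


def run_stream(model: Callable[[Event], float], events: Iterable[Event]) -> Iterator[float]:
    """Real-time mode: score each event lazily, as it arrives."""
    for e in events:
        yield model(e)


# The same model object is handed to both runners; nothing is reimplemented.
history = [
    {"clicks": 10.0, "purchases": 0.0},
    {"clicks": 0.0, "purchases": 10.0},
]
batch_scores = run_batch(score, history)
stream_scores = list(run_stream(score, iter(history)))
```

The point of the design is that `score` knows nothing about where its events come from, so moving from exploration to production changes the runner and leaves the model alone.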
For anyone in the audience who’s not familiar, Kiji is an open-source framework for developing HBase applications, and I think that’s a good segue into HBase, which is the NoSQL database that’s built on the Hadoop file system. At the moment that seems to be one of the big things moving the technology forward into something that can handle an operational application, into something that can handle stuff at nearish real time with some consistency.
At its raw level, if Hadoop is very batch oriented (open a ton of files, read a ton of files), HBase is very precise: give me just the data for a specific key. But where HBase leaves off is effectively a bit bucket, literally. Imagine writing to a database where you could kind of define the columns, but you actually had to dictate byte-by-byte how you serialize the data for something like a date or a character string. So that’s kind of where HBase leaves off, and it’s good at that layer and it can do many things. For the applications we’re focusing on we needed to create something that was a little more like an SQL DDL, a data definition, but not rows and columns, because that’s not how applications work; applications have very complex structures to them. I think what Jonathan was talking about was how do we merge the application view of the world, which is that you have a person and they have demographic data but also a time series of clicks, with the analytic view of the world, which does look a little more relational, looks like log files and log records. It’s the same data set in two very different representations.
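To make the “bit bucket” point concrete, here is a minimal Python sketch of what dictating the bytes yourself looks like. It does not call HBase at all: a plain dict stands in for one row, and the column names, the sample values, and the four-byte date layout are assumptions chosen for illustration.

```python
import struct
from datetime import date


def encode_date(d: date) -> bytes:
    # Our own layout: big-endian year (2 bytes), month (1 byte), day (1 byte).
    return struct.pack(">HBB", d.year, d.month, d.day)


def decode_date(raw: bytes) -> date:
    year, month, day = struct.unpack(">HBB", raw)
    return date(year, month, day)


def encode_text(s: str) -> bytes:
    return s.encode("utf-8")


def decode_text(raw: bytes) -> str:
    return raw.decode("utf-8")


# A dict stands in for one row of a byte-oriented store:
# both column names and values are raw bytes, and the store
# has no idea which value is a date and which is a string.
row = {
    b"info:signup": encode_date(date(2013, 3, 20)),
    b"info:name": encode_text("Ada"),
}

signup = decode_date(row[b"info:signup"])
name = decode_text(row[b"info:name"])
```

Every reader of the table has to agree on this layout out of band, which is the gap a schema layer on top is meant to close. One side effect of the big-endian choice here is that the encoded dates sort in chronological order when compared as bytes.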
So we’ve been talking about what people are building in terms of applications, so Muddu so [inaudible] is a part of VMware now, is a part of Pivotal now. What are elements of cloud platform frameworks, I mean what types of things are your customers using the platform for that maybe they wouldn’t have used Hadoop for before?
That’s a good question, but just to add to the previous comment, I see the world differently. I think people are going to continuously develop apps, enterprise apps on Java, [inaudible], JBoss, the pattern is not going to change in the next five years. People are going to develop apps in the cloud world, they want to develop in CloudFoundry, Heroku, that’s not going to change. The problem is this group of app developers don’t have access to big data paradigms. What is big data? Big data is not just Hadoop or MapReduce, sometimes I need memcache, I need in-memory database, I need a streaming engine, I may need a [inaudible]. Being able to provide developers with these in some way is important. What I am saying right now is people develop an app, they [inaudible] the data into some place, and then someone will do analytics. I think you have to do it almost inside out. It’s almost like you need a revolutionary way of thinking about it. Don’t build apps without putting data in the center. Your app question is about data as well. Someone needs to provide to developers the libraries and make it transparent for the developers how to do analytics. That’s the key game I think, and that’s where Pivotal is. We’re taking this CloudFoundry, we’re taking analytics, we’re taking HBase to kind of fuse together, to provide a new way for people to develop applications.
Alright, that makes sense, and I think that’s part of the thing. Those have been IT initiatives in the past, and now they’re trying to make them line-of-business initiatives and developer initiatives, and make something where you can have access to these resources without some huge undertaking.
And again, just to add one more thing on HBase, people make the mistake of thinking that HBase can solve all their problems. HBase is good, it has random access, it has append properties, but if you are dealing with time-series data, telemetry data, then HBase may not be the right data structure. Or maybe you need something on top of it; you may need indexes on top of HBase. So I think what customers are saying is that you can’t have one size fits all. There’s no one data structure. There are things that HBase is good for, but as we push from more human-generated to machine-generated data, I think you need new HBase kinds of things. There’s already an open source project called TSDB, time series database, which sits on top of HBase too. So I think you need to work towards specialized structures if you’re developing an application.
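One common way to layer a time-series structure on a sorted key-value store is to bucket timestamps into the row key and keep the offset within the bucket as the column. The Python sketch below illustrates that general idea only; it is not the schema of any particular project, and the hourly bucket size, the key layout, and the in-memory `table` dict are all assumptions.

```python
import struct

BUCKET_SECONDS = 3600  # assumption: one row per metric per hour

# Dict of dicts standing in for a sorted key-value store:
# row key -> {column qualifier -> value}
table = {}


def row_key(metric: str, ts: int) -> bytes:
    # Metric name, a separator byte, then the big-endian start of the hour.
    base = ts - (ts % BUCKET_SECONDS)
    return metric.encode("utf-8") + b"\x00" + struct.pack(">I", base)


def column_qualifier(ts: int) -> bytes:
    # Seconds since the start of the hour (0..3599) fit in two bytes.
    return struct.pack(">H", ts % BUCKET_SECONDS)


def put(metric: str, ts: int, value: float) -> None:
    table.setdefault(row_key(metric, ts), {})[column_qualifier(ts)] = value


def scan(metric: str, start: int, end: int):
    """Return (timestamp, value) pairs with start <= timestamp < end."""
    prefix = metric.encode("utf-8") + b"\x00"
    out = []
    for key in sorted(table):
        if key[:-4] != prefix:
            continue
        base = struct.unpack(">I", key[-4:])[0]
        for qual, value in sorted(table[key].items()):
            ts = base + struct.unpack(">H", qual)[0]
            if start <= ts < end:
                out.append((ts, value))
    return out


t0 = 1363780800  # an arbitrary epoch second that falls on an hour boundary
put("cpu", t0 + 5, 0.5)
put("cpu", t0 + 65, 0.7)
put("cpu", t0 + 3601, 0.9)
put("mem", t0 + 5, 1.0)
first_hour = scan("cpu", t0, t0 + 3600)
```

Because all points for one metric and one hour share a row, and rows for the same metric sort by time, a range query touches a small, contiguous set of keys. A real store would seek to the first matching key instead of walking every row as this toy `scan` does.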
Alright, so I was going to say, from an application perspective, how people are actually trying to use these systems, there’s a few different examples, and maybe I’ll talk about the categories broadly. We’re seeing a few different things in the applications people are using big data to build versus the classic applications. These are in the areas of specialized personalization, catering to specific users, really understanding them from a micro-segmentation perspective: not a bucket of a hundred or a thousand types of people, but how do you dynamically identify maybe a million different categories of types of users, and how do people change between them, even throughout the day as their usage evolves? Say they’re browsing an e-commerce site: whether they’re browsing for themselves or maybe buying gifts, those are logically two different behaviors and two different people. So if you’re personalizing you want to be able to detect those and switch between those. There’s also generic content recommendation, like people who like red sweaters might also like black boots, so that factors in, and that also changes dynamically. The styles and the content change a lot faster in, say, digital publishing, where there’s a lot of different news and different media, versus retail, where there’s more seasonal cycles to it.
And you could extrapolate that out to fraud, or anywhere where you’re trying to make personal types of connections between people.
Absolutely, and we actually talked to someone who’s playing around with Hadoop and trying to figure out from an IT perspective what are the different business use cases, and then bringing different business teams in just to play around with it, bringing in their data and looking at it. They happened to have the fraud detection team and the marketing team in on the same day, and they noticed there’s kind of a similarity there. On the one hand, you’re basically trying to find the outliers who are doing something really bad and shut them down, and on the other hand you’re trying to find the outliers who are your best customers and specialize and cater to them. So the same kinds of anomaly detection and predictive models apply to many different applications as long as you’re directing it towards knowing what goal you’re trying to drive in your application.
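The observation that fraud detection and marketing are looking at opposite tails of the same distribution can be shown with a very small Python sketch. It uses a plain z-score as a stand-in for whatever model a real team would use, and the account names, the “net revenue” metric, and the threshold are made up for illustration.

```python
from statistics import mean, pstdev


def outliers(values, threshold=2.0):
    """Split accounts into (low, high) outliers by z-score."""
    mu = mean(values.values())
    sigma = pstdev(values.values())
    low, high = [], []
    if sigma == 0:
        return low, high  # every account is identical, nothing stands out
    for account, v in values.items():
        z = (v - mu) / sigma
        if z <= -threshold:
            low.append(account)
        elif z >= threshold:
            high.append(account)
    return low, high


# Hypothetical net revenue per account over some period.
net_revenue = {
    "a": 100, "b": 100, "c": 100, "d": 100,
    "e": 100, "f": 100, "g": 100, "h": 100,
    "x": -900,   # large losses: a candidate for the fraud team
    "y": 1100,   # large spend: a candidate for the marketing team
}

suspects, best_customers = outliers(net_revenue)
```

The detector is the same in both cases. What differs is which tail each team reads and what action the application takes on it.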
And Jonathan, Continuity is slightly newer than WibiData, we’re talking two years combined with [inaudible]. What are your early customers trying to do?
We’re seeing some personalization type of stuff, we’re seeing a lot also in gaming and advertising, those guys have tons of data, also the mobile guys and telco guys as well, but advertising and gaming, those companies run off the back of analytics. What you see in that market is a lot of different companies that are delivering kind of SaaS applications for the publishers, advertisers and marketers. Those guys become aggregators of a bunch of different advertisers and a bunch of different publishers, so they’ve got tons of data. Obviously ad attribution and retargeting and all those different kinds of tricky problems in advertising get much trickier for those platforms, because they have billions and billions of events, and so being able to give people that platform to actually focus on. They have all this streaming data coming in from all these different sources. One thing Muddu has said that really resonated with me is that it’s not a one-size-fits-all thing. For me, that’s what the NoSQL revolution was all about: there isn’t one answer to every data problem out there. Even with Hadoop, I always say “the Hadoop ecosystem”, because it’s not just Hadoop and HDFS and MapReduce and HBase, but it’s other things like Storm, which isn’t in the Hadoop ecosystem but still contributes to it.
I agree with Jonathan, I think the point that he is making is that Hadoop is good for certain properties, but people are going to add other new engines, for streaming in general, in-memory databases, or in memory queries continuously. So I think this whole angle of streaming, how you marry that with Hadoop is not well defined either in open source. Everybody is doing one offs, but there is a great opportunity right there. How do you take the streaming data and try to connect with Hadoop itself.
I’ll do a quick plug: if you Google Hadoop ecosystem, you should see a map that [inaudible] finished a couple of weeks ago. It’s fantastic, it lays out all the companies in the space, but that’s an aside. But Muddu, you brought up a point before and I think you referenced NoSQL too. Right now there’s HBase, then we see a few things building on top of that, some time series databases, some graph databases. How expansive can layering on top of or around Hadoop be, to say now it’s a graph database, now it’s a time series database?
That’s what’s interesting about it. These data structures have already existed, and people have built them bespoke, or they’ve tuned one database very differently: they took Oracle and changed it completely in the way it was used for OLAP versus OLTP. So there’ve always been these modifications, but now the systems are coming more out of the box to do it. The big difference is that we’re not talking about isolated datasets anymore. Hadoop as an ecosystem is trying to create an environment where you can take many of these data sources from many different applications, many touch points that you have with customers and suppliers and parts of the business, and create one view. It may have different representations, but one view. So if you’re trying to get data that tells you that because you did a marketing campaign someone landed on your page and you’re trying to optimize the conversion, you don’t have three different systems that you have to query dynamically to create a web service that recommends one thing over another. You’ve actually precollected that data through a variety of these systems into one thing where you can make that decision on the fly.
I’ll give you my perspective. I do agree with Omer on that, but I think what’s going to happen in the bigger picture is that people are going to build these new products and new applications and we may not even see what kinds of things are going to come in the future. There’s always going to be new data sets. Today it could be graph-based, tomorrow bipartite graph, there could be specialized stuff. The key is the underlying data structures that we know today are good for some applications. They may not be good for new applications coming online, and applications may need more than one. I’ve seen applications that need HDFS, and they still need graphs. A single application needs multiple data structures, and they still need NoSQL or something else of that kind. Some of these may be in Hadoop, some may not be in Hadoop, and at the end of the day I think you have to go back and ask what problem you are solving. A subtle statement that I hear from customers is that platforms don’t solve problems. People have problems, you need to find a solution, solutions become platforms. That’s what history has taught us. Today we are playing the game of building a platform and hoping that it will become the solution. My view of the world is, let’s go and solve a hundred problems. Each problem will have a solution. From the solution emerges the platform, which is an ecosystem of all these things.
And I think there is something very unique in how everyone in the environment is solving those problems today. It used to be that if you found a particular problem you wanted to solve, the technology was not necessarily broadly available. Everything was proprietary. You didn’t know what other people were doing, and sometimes you built your own. How many content management systems were created? Today a lot of this is open source, it’s off the shelf, it’s widely used and can be multipurposed. So from an IT perspective you end up with, relatively speaking, fewer systems that can be repurposed to do many different things, and people can focus on solving the applications instead of building the technologies.
You were saying about solutions versus platforms, but in this case with Hadoop you have a platform that happens to be pretty utilitarian as it turns out, you can do a lot with that. So isn’t there a place where we just start saying, “We need to solve this problem, we have all our data here, these various different parts of this ecosystem, so this solution will be built on top of this platform, this technology will leverage what we’ve already invested in”?
It’s a good point. I think you can actually go with applications. I’ll go back to my earlier point. [inaudible] you can particularize, let’s say if you wanted to get into the telco space or ad publishing, or e-commerce. You can make a very good big data application. But is that a platform? The question comes back to an app versus a platform. To me, first we ought to solve a specific problem with an app, and underneath that we ought to figure out whether this is a platform that can support multiple apps. Is customization possible? Is there flexibility available? I think Hadoop is a good platform to start with and solve some problems, it has a framework with MapReduce et cetera, but as we talked about, it does not solve everything. You still have to marry and create a whole ecosystem around Hadoop. But I think the point is, go and solve a specific problem and create a solution that includes many of the platforms underneath. Out of that comes, I would say, a once-in-a-lifetime [inaudible] event to mainframe clients. Out of this cloud division will create one big data platform. That’s still up for grabs, no one has won that. It’s going to happen in the next ten years.
OK. Jon, at Facebook, was this a case where you say we need to build some new application? Was it a case where you would write a new data platform if need be?
Yeah, you would. At Facebook you definitely would. It was a very highly technical place, and so when Facebook Messages was kicked off, they did a bake-off. They took MySQL stuff, Cassandra, HBase, and they had some other ones, but those were the three. They hammered the hell out of them and they made the decision that way. It’s not the panacea, because they’ve managed to really wrangle HBase and Hadoop and get tons of stuff on there, and tons of petabytes are running online, but it’s not doing everything. They’re not moving every application they can find on to it, because every application has very different patterns, it’s very specialized. One of the funny things is if you look at Facebook Messages, which is HBase, and you look at another application there called Puma, which is their streaming analytics, their clusters are both HBase, but their configurations are so different, and the way that they are used is so different. Facebook Messages is terabytes and terabytes; Puma is very high-frequency counting, it’s all in memory. That’s the power of all these technologies, they’re very flexible like that, but it takes a big team to be able to bring it there.
Is that why you do this repeatably? Are you trying to build repeatable applications?
Absolutely. I think so, I think that’s exactly what we’re talking about.
That’s the point of building the app. There are repeatable patterns, so you can go hang out with the Facebook team and say you’re building something that looks like Puma, and ask them to tell you about how they felt through it, and write a little recipe. But you can also say that you need to do streaming analytics, and you’re building a streaming analytics application, and then all the configuration just happens for you. Or you’re building a messages application, and the configuration just happens for you, because that’s what the app does. It sets up the technologies to do custom work for the problem you’re trying to solve, it’s a solution for the platform underneath that.
On the question of repeatable applications, you can always do one-off deals like a Facebook deal with multiple apps, but the question is can you go and deploy this as a product and go give it to your marketing and sales teams to deploy to a thousand customers? We’re not there yet. And that’s the great opportunity.
Like, can I buy the big data version of CRM?
I think you’ll be able to.
We’re very early in the market. Hadoop was created what, 6 years ago maybe? The market was only a couple of years old. Those patterns will come. The sales force automations, supply chain management, and CRM, and those big billion dollar vertical apps that kind of went horizontal.
We see sort of the slices of what they will become. If you think of a retail big data application, that is really trying to track the end-to-end path for customer success, not just how do I recommend efficiently, not just how do I target efficiently, but really end-to-end what is a loyal customer and a repeat customer throughout the entire process. All the pieces are there, you can go to people who will give you search relevancy and that’s it. Or they will give you the best retargeting solution out there, and that’s it. They’re not tied together. The promise of big data as a differentiator for applications is that it actually ties all those pieces together, the same way it does in the back office, you can now do that in the front office.
I said at the start to both of these gentlemen, I think the key point is what I call three-tier architecture. Underneath you need a cloud infrastructure. I think Hadoop is still stuck in physical infrastructure deployment, but I think that’s going to change. It could be deployed in a virtualized environment, it could be cloud-based, it could be VMware, I think that’s going to happen. Second, people are going to create on Hadoop what I call a big data platform. Then you have an analytics platform, which includes [inaudible]. I think these three have to be a cloud infrastructure, a Hadoop platform, and an analytics platform. Those three have to happen in a product manner in order to do repeatable deployments and win at scale.
I’m going to go bigger picture, we still have a few minutes left. Looking at this broadly, today we can do some things, like stream processing, you can talk about connecting data or doing in-memory stuff or Hadoop or whatever. Looking forward, where is the real value in expanding Hadoop? Is it in doing something like a graph database? Doug Cutting, the creator of Hadoop, told me recently that something like Google Spanner, which is a globally distributed transactional database system, will come to Hadoop. Within the reasonably foreseeable future, how far can we take Hadoop as a platform, beyond building for search and doing batch analytics?
There’s a lot of places it can go. Like Muddu was saying, we’re still down here in this infrastructure layer and I think we’ve got to start raising up the levels of abstraction, we’ve got to start raising up where we’re focusing our efforts. That’s where we’ve seen this wave of new startups come out, like Continuity and Wibi and these other guys who [inaudible] these guys who are really looking to the next frontier, and how do we drag this technology there, and how do we beat it down so it’s actually consumable and usable? What I’m really excited about right now is YARN. Really cool, so at Continuity our entire deployment model is built on top of YARN, and we’ve done tight integration with Linux containers. It’s kind of a VM model for how we can manage a distributed cluster of resources. I’m very excited about the future of that, and what that means for deploying software, and moving away from the “Chef” model of very fixed partitioning of clusters, very fixed roles around what your nodes are doing, to a much more flexible way of running MapReduce jobs, running real-time streaming jobs, HBase, across this given set of resources.
Muddu, you were at VMware, there’s a lot of interesting aspects [inaudible]
I think Jonathan [inaudible]. My view of the world beyond Hadoop and what Hadoop can do in the short term is, if I look at it, obviously Splunk has done a great job of solving the problem. I think that in the future customers will be saying we have all these petabytes of data, we’re going to deploy on Hadoop and I’m going to put the data in Splunk. I think from an IT operations perspective that is not solved yet. The Hadoop community has to come down and say we can’t even do [inaudible] on top of Hadoop, properly. A single, short-term problem is Splunk-like functionality on top of Hadoop. Second, I think that MapReduce is good for what people call high throughput, high latency. But people are asking me for high throughput, low latency. Nobody has solved it. I think there could be new paradigms like MapReduce created, I call it MapReduce low. I think Jonathan talked about this whole, I call it dynamic resource management. Right now the whole Hadoop resource management is like the 1970s. Someone needs to solve that problem. It has to be done. Amazon and VMware already solved virtualization. The Hadoop community is still behind in the 1970s. Someone needs to solve dynamic allocation of resources. So I think there are three or four immediate problems someone can solve beyond Hadoop, that are what this market needs.
I agree with all those. I think one of the specific things that Wibi is really excited about, as we build applications, is the emergence of this concept of a big data application server. It used to be that you had a file server and you had a database system and you wrote Perl scripts, and that was called CGI. It was impossible to maintain and debug and the schemas were a mess. All of a sudden application servers came on the market and you could actually use repeatable higher-level constructs that used the underlying infrastructure’s capabilities, but they give you a framework within which to build repeatable applications that work well together. Big data application servers are taking that beyond what you can do with well-structured databases into the more fluid, flexible environment that big data promises.
Alright. And final question, we have a minute left. Where should we be looking? If someone’s looking at Hadoop right now and they’re saying they want to get a kind of crystal ball view of where things are going, should we look at Google still, going back to MapReduce and the Google file system? Should we look at Facebook, or Twitter, or the startup community? Where do you guys think we should look?
Doug Cutting said to me a few weeks ago that it’s amazing to live in a world where Google sends you messages from the future. Look at what they have needed to build to create the kind of world that Google has created, where the ultimate feature is “I’m Feeling Lucky”. Just tell me. You know all about me already, just tell me what’s next. You just kind of look at how they evolve and how they got there, not necessarily exactly how Google got there, but the paradigms kind of flow from that success.
I think the key point is that the streaming aspect is the most important for Hadoop. Streaming and cloud infrastructure. Hadoop needs to be virtualized. People are not going to want physical servers, so how do you virtualize my Hadoop installation? I might lose a little bit of the performance of a single node, but it gives me all the things that virtualization has been good for. I think Hadoop will be virtualized, it’s surely going to happen, but the question is how do I do it, how do I manage that, how do I do it for streaming data. People are going to take this and want to deploy as many nodes as they need based on their workload. I don’t want to pay for 4,000 nodes. Maybe Facebook can afford it, but for most customers the power cost of 4,000 servers is more than the cost of the software. They are saying they don’t want to deploy hardware. They are living in the cloud world, but they are going backwards. So I think that whole thing needs to be solved.