How streaming can fit into the big data toolbox

Session Name: The total cost of real-time performance in a massively connected world

Jo Maitland
Damian Black

Jo Maitland 00:04
That was the AWS perspective. We’re going to have a brief presentation and then the counterpoint to that perspective from John Engates of Rackspace/OpenStack. So let’s do a mini presentation now with Damian Black who’s the CEO of SQLstream on trade-offs and dealing with big data. Damian, welcome.
Damian Black 00:33
Thanks Jo. Thanks everybody. It’s a great pleasure to be here today. Damian Black, I’m the CEO of SQLstream. We’re the leading relational streaming technology, and I’m going to talk about the total cost of performance for real-time big data analysis and analytics in today’s massively connected world. Let’s look first of all at the evolution of data management technologies, as they’ve evolved over the decades to be able to address the complexity and scale and volumes of data. What we see is two sets of technologies. First we have the stored data technologies, traditional databases where you store the data and then you run your analysis, and then streaming technologies, where you basically process the data as it streams off the wire. What we’ve seen, to deal with the complexity and to come up with cost-effective solutions, is a progression of data models from the sequential flat file model through to hierarchical databases, relational databases, and then in big data we’re seeing a reversion back to the relational model, as indicated by Google BigQuery, Cloudera’s Impala, and Hadoop’s Hive. On the streaming side, we started off again with the same kind of progression: the sequential data model, starting off with point-to-point network pipes and sockets, through to the hierarchical model of messaging middleware, where you have hierarchies of message topics to be able to deal with the complexity of the information space, and then finally the latest technology progression, again relational streaming, where you have continuous queries running over streaming data and then streaming out real-time results continuously, incrementally, based on the newly arriving data.
Damian Black 02:30
In both the streaming model and the stored data model, you get the benefits of having automatic optimization and being able to get any view of information that you need. Let’s look at the cost profile for processing real-time high velocity data. The data velocity is on the x-axis, on the y-axis we have the total cost, and the z-axis going into the picture is latency, and of course we have the three important big data technologies: stream processing, relational databases, and Hadoop. Stream processing latency is measured in seconds to milliseconds. Classically, it can go to much bigger time periods. For relational databases, it’s seconds to minutes, and for Hadoop, it’s hours to tens of minutes. Each has got its own characteristics, each is dealing with different kinds of application requirements cost-effectively, and as we look through the various swim lanes, we are going from social media through to e-commerce, security, telematics and telecom, and as you see, as we go along, you’re getting an order of magnitude increase in the volumes of data.
Damian Black 03:41
Let’s look at the actual cost profiles and you’ll see something a little bit surprising or interesting. Stream processing is much lower latency, but it also uses a lot less hardware and a lot less infrastructure, and you’re able to process very high volumes of data, in contrast to the stored data technologies of relational databases and Hadoop. Relational databases come at a much higher cost point, scale well, and then suddenly, when you get beyond a certain limit in terms of records per second, there’s a sharp knee and performance drops off. Hadoop comes in at a lower cost than relational databases and it scales nicely, but again, up to a certain point, you’re suddenly going to get this knee. The question is, “Why are the costs higher and why does this knee come in and what is going on here?” There is a simple explanation really, and that is that they’re not solving the same problems. Relational databases and Hadoop platforms are storing all of the information before it is processed, so when you want to respond and rewind your analysis, you basically are re-computing everything against all of the data and you’re storing all of the information. You need a lot of hardware to store the information and you need to reprocess those queries over all of the stored data. They’re all parallel technologies, but the stream processing approach basically just stores enough information to be able to answer those incremental queries in real time, and it does them incrementally. Rather than, for example, recalculating perhaps the sum of all the data, the average of all the data, it will just work out what the difference is for the records that have arrived in the last, say, few milliseconds and the records that have now aged out of a particular time window. They’ve all got a role, it’s all important, each solves different problems, but if you bring them together, you’ll see there are quite different cost profiles for the various technologies.
Stream processing, I think you’re going to see, is going to be increasingly important as an extra tool, if you like, in your kit as you tackle big data analytics and integration problems. That’s all I have for my five-minute presentation. Hope that made sense and thank you very much for your time.
Jo Maitland 05:52
Right on schedule. Thank you very much, appreciate it.
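
Editor’s illustration: the incremental approach Damian Black describes at 03:41 (adjusting a sum or average for newly arrived records and for records that have aged out of a time window, rather than recalculating over all stored data) can be sketched roughly as follows. This is a minimal Python sketch, not SQLstream’s implementation; the class name `SlidingWindowAverage`, the `add` method, and the `window_seconds` parameter are hypothetical names chosen for this example.

```python
from collections import deque


class SlidingWindowAverage:
    """Running sum and average over a time window, updated incrementally.

    Hypothetical illustration only; not taken from any product API.
    """

    def __init__(self, window_seconds):
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self.window_seconds = window_seconds
        self.records = deque()  # (timestamp, value) pairs still in the window
        self.total = 0.0

    def add(self, timestamp, value):
        """Take one new record and return the current window average."""
        # Account for the newly arrived record.
        self.records.append((timestamp, value))
        self.total += value

        # Subtract records that have aged out of the window,
        # instead of re-summing everything that is still stored.
        cutoff = timestamp - self.window_seconds
        while self.records[0][0] <= cutoff:
            _, old_value = self.records.popleft()
            self.total -= old_value

        return self.total / len(self.records)


window = SlidingWindowAverage(window_seconds=10)
for ts, value in [(0, 10), (5, 20), (12, 30), (15, 40)]:
    print(ts, window.add(ts, value))
```

Each call touches only the new record and the records leaving the window, so the work per record does not grow with the total volume of data seen. The sketch assumes records arrive in timestamp order; only the records inside the current window are kept in memory.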