Serverless-enabled storage? It’s a big deal

The success of services like AWS Lambda, Azure Functions or Google Cloud Functions is indisputable. The technology doesn’t suit every use case, of course, but it is intriguing and easy to implement, and developers (and sysadmins!) can leverage it to offload tasks to the infrastructure, automating many operations that would otherwise have to be done at the application level, with lower efficiency.
The code (a Function) is triggered by events, and object storage is a perfect fit for this.
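To give an idea of the mechanism, here is a minimal sketch of such a Function in Python, modeled on the event shape AWS Lambda receives from S3 notifications (the field names follow that shape; treat the handler itself as illustrative, not any vendor’s actual API):

```python
# Minimal sketch of an event-triggered Function: given an S3-style
# event, work out which objects triggered the invocation.
def handle_event(event):
    """Return (bucket, key) pairs for the objects that fired the event."""
    results = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        results.append((s3["bucket"]["name"], s3["object"]["key"]))
    return results
```

In a real deployment the storage system invokes this for you on each PUT or DELETE; the Function never polls, it just reacts.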

Why object storage

Object storage is usually implemented with a shared-nothing scale-out cluster design. Each node of the cluster has its own capacity, CPU, RAM and network connections. At the same time, modern CPUs are very powerful and usually underutilized when the only scope of the storage node is to access objects. By allowing the storage system to use its spare CPU cycles to run Functions, we obtain a sort of very efficient hyperconverged infrastructure (micro-converged?).
Usually we tend to bring data close to the CPU, but in this case we do the exact opposite: we take advantage of CPU power that is already close to the data, obtaining even better results. CPU-data vicinity, coupled with event-triggered micro-services, is a very powerful concept that can radically change data and storage management.
Scalability is not an issue: CPU power grows alongside the number of nodes, and the code is instantiated asynchronously and in parallel, triggered by events. This also means that response time, and hence performance, is not always predictable or consistent, but for the kind of operations and services that come to mind, it’s good enough.
Object metadata is another important key element. In fact, the Function can easily access data and metadata of the object that triggered it. Adding and modifying information is child’s play… helping to build additional information about content for example.
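As a sketch of what that metadata enrichment could look like, this hypothetical helper computes values a Function might attach back to the triggering object as custom metadata (the `x-amz-meta-` prefix is how the S3 API exposes user-defined metadata; the actual write-back call is storage-specific and omitted here):

```python
import hashlib

def derive_metadata(payload: bytes) -> dict:
    # Hypothetical enrichment step: compute values a Function might
    # attach to the triggering object as user-defined metadata.
    return {
        "x-amz-meta-sha256": hashlib.sha256(payload).hexdigest(),
        "x-amz-meta-bytes": str(len(payload)),
    }
```

The same pattern extends naturally to richer derived information, such as image classification tags.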
These are only a few examples, but the list of characteristics that make scale-out storage suitable for this kind of advanced data service is quite long. In general, it’s important to note that, thanks to the architecture design of this type of system, this functionality can boost efficiency of the infrastructure at an unprecedented level while improving application agility. It’s no coincidence that most of the triggering events implemented by cloud providers are related to their object storage service.

Possible applications

Ok, Serverless-enabled storage is cool but what can I do with it?
Even though this kind of system is not specifically designed to provide low-latency responses, a lot of applications, even real-time ones, can make use of this feature. Here are some examples:
Image recognition: for each new image that lands in the storage system, a process can verify relevant information (identify a person, check a plate number, analyze the quality of the image, classify the image by its characteristics, make comparisons and so on). All this new data can be added as metadata or in the object itself.
Security: for each new, or modified, file in the system, a process can verify if it contains a virus, sensitive information, specific patterns (i.e. credit card numbers) and take proper action.
Analytics: each action performed on an object can trigger a simple piece of code to populate a DB with relevant information.
Data normalization: every new piece of information added to the system can be easily verified and converted to other formats. This could be useful in complex IoT environments for example, where different types of data sources contribute to a single large database.
Big Data: AWS has already published a reference architecture for MapReduce jobs running on S3 and Lambda! (link here)
And, as mentioned earlier, these are only the first examples that come to my mind. The only limit here is one’s imagination.
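To make the security example concrete, here is a minimal, hypothetical scanner a Function could run on each new object: a naive regex check for credit-card-looking numbers. A real deployment would go much further (a Luhn checksum, virus signatures, and so on), so take this purely as an illustration of the pattern:

```python
import re

# Naive pattern for 16-digit card-like numbers, optionally
# separated into groups of four by spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")

def contains_card_number(text: str) -> bool:
    """Flag content that looks like it holds a credit card number."""
    return bool(CARD_RE.search(text))
```

Wired to an object-created event, a check like this can quarantine or tag a sensitive object the moment it lands in the store.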

Back-end is the key

There are only a few serverless-enabled storage products at the moment, with others under development and coming in 2017. But I found two key factors that make this kind of solution viable in real production environments.
The first is multiple language support: the product should be capable of running different types of code so as not to limit its possibilities. The second is the internal process/Function scheduler. We are talking about a complex system that shares resources between storage and compute (in a hyperconverged fashion), and resource management is essential to grant the right level of performance and response time for both storage and applications.
One of the most interesting Serverless-enabled products I’m aware of is OpenIO. The feature is called Grid For Apps while another component called Conscience technology is in charge of internal load balancing, data placement and overall resource management. The implementation is pretty slick and efficient. The product is open source, and there is a free download from their website. I strongly suggest taking a look at it to understand the potential of this technology. I installed it in a few minutes, and if I can do it… anyone can.

No standards… yet

Contrary to object storage, where the de facto standard is S3 API, Serverless is quite new and with no winner yet. Consequently, there are neither official nor de facto standards to look at.
I think it will take a while before one of these services prevails over the others but, at that point, API compatibility won’t be hard to achieve. Most of these services have the same goal and similar functionality.
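On the storage side, the S3 precedent shows what that compatibility looks like in practice: with an S3-style client (boto3-like), targeting a compatible store usually means changing nothing but the endpoint. A small sketch of that idea, with a hypothetical local gateway address (the helper just builds the client arguments; it doesn’t open a connection):

```python
def s3_client_kwargs(endpoint_url=None):
    """Build the arguments you'd pass to an S3-style client constructor.

    With no endpoint_url the client targets AWS itself; passing one
    (e.g. a local S3-compatible gateway, address hypothetical) is often
    the only change needed to switch back-ends.
    """
    kwargs = {"service_name": "s3"}
    if endpoint_url is not None:
        kwargs["endpoint_url"] = endpoint_url
    return kwargs
```

A de facto Serverless standard would presumably converge the same way: one client shape, many interchangeable back-ends.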

Closing the circle

Data storage as we know it is a thing of the past. More and more end users are looking at object storage, even when the capacity requirement is under 100TB. Many begin with one application (usually as a replacement of traditional file services) but after grasping its full potential it gets adopted for more use cases ranging from backup to back-end for IoT applications through APIs.
Serverless-enabled storage is a step forward and introduces a new class of advanced data services which will help to simplify storage and data management. It has a huge potential, and I’m keeping my eye on it… I suggest you do the same.

Originally posted on

S3, to rule them all! (storage tiers, that is)

Last week I was at NetApp Insight, and it was confirmed (not that it was necessary) that the S3 protocol is the key to connecting and integrating different storage tiers. You know, I’ve been talking about the need for a two-tier architecture for modern storage infrastructures for a long time now (here a recent paper on the topic), and I also have strong opinions about object storage and its advantages.

The missing link

The main reasons for having two storage tiers are cost and efficiency. $/GB and $/IOPS (or better, $/latency today) are the most important metrics while, on the other hand, efficiency in terms of local and distributed performance (explained here in detail) is another fundamental factor. All the rest is taken for granted.
The challenges come when you have to move data around. In fact, data mobility is a key factor in achieving high levels of efficiency and storage utilization, contributing, again, to a lower $/GB. But primary and secondary storage use different protocols for data access, and this makes it quite difficult to have a common and consistent mechanism to move data seamlessly at the front-end.
Some solutions, like Cohesity for example, are quite good at managing data consolidation and its re-utilization by leveraging data protection mechanisms… but that means adding hardware and software to your infrastructure, which is not always possible, whether because of cost or complexity.

S3 to the rescue

It seems several vendors are finally discovering the power of object storage and the simplicity of RESTful-based APIs. In fact, the list of primary storage systems adopting S3 to move (cold) data to on-premises object stores or to the public cloud is quickly growing.
Tintri and Tegile have recently joined Solidfire and DDN in coming up with some interesting solutions in this space, and NetApp previewed its Fabric Pools at Insight. I’m sure I’ve left someone out, but it should give you an idea of what is happening.
The protocol of choice is always the same (S3) for multiple and obvious reasons, while the object store in the back-end can be on-premises or on the public cloud.
Thanks to this approach the infrastructure remains simple, with your primary storage serving front-end applications while internal schedulers and specific APIs are devoted to supporting automated “to-the-cloud” tiering mechanisms. It’s as easy as it sounds!
Depending on the specific implementation, S3 makes it possible to off-load old snapshots and clones from primary storage, copy data volumes to the cloud for backup or DR, automate tiering, and so on. We are just at the beginning, and the number of possible applications is very high.
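The tiering schedulers these systems embed can be reduced to a simple policy at their core. Here is a minimal sketch (purely illustrative; real products use richer access statistics and heat maps, not just timestamps) that selects objects older than a cutoff for off-load to an S3 tier:

```python
from datetime import datetime, timedelta, timezone

def select_for_tiering(objects, days=30, now=None):
    """Pick objects to off-load to a cold S3 tier.

    objects: iterable of (key, last_accessed) pairs, where
    last_accessed is a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    # Anything not touched since the cutoff is a candidate for the cloud.
    return [key for key, last_accessed in objects if last_accessed < cutoff]
```

An internal scheduler would run something like this periodically, then issue the actual S3 PUTs to the back-end store.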

Closing the circle

It is pretty clear that we are going to see more and more object-based back-ends for all sorts of storage infrastructures, with the object store serving all secondary needs, no matter where the data comes from. And in this case we are not talking about hyper-scale customers!
In the SME segment it will primarily be cloud storage, even though many object storage vendors are refocusing their efforts to offer options to this kind of customer: the list is long, but I can mention Scality with its S3 Server, OpenIO with its incredible ease of use, NooBaa with its clever “shadow storage” approach, Ceph and many others. All of them can start small, with decent performance, and grow quickly when needed. Freemium license tiers (or open source versions of the products) are available, easing installation on cheap (maybe old and used) x86 servers and minimizing adoption risks.
In large enterprises object storage is now part of most private cloud initiatives, but it is also seen as a stand-alone convenient storage system for many different applications (backup, sync&share, big data archives, remote NAS consolidation, etc).

Originally posted on

Hortonworks is pitching an object store for Hadoop

Some members of the Hadoop community are proposing a new object storage environment for Hadoop, which would let the big data platform store data in a manner similar to popular cloud data stores such as Amazon S3, Microsoft Azure Storage and OpenStack Swift.

Cloudian, a hybrid cloud storage provider, raises $24 million

Hybrid cloud storage startup Cloudian has landed $24 million in funding which, the company said in a release, it will use to expand operations and ensure that its products are deployed more rapidly. Cloudian’s object storage software specializes in setting up both public and private clouds compatible with the Amazon S3 storage API. The Innovation Network Corporation of Japan (INCJ) and Fidelity Growth Partners joined this financing round as new investors, along with current Cloudian shareholders, including Intel Capital.