S3mper Fi! Netflix open sources library to make Amazon S3 even more awesome

Cloud architect Adrian Cockcroft may have flown the coop but Netflix continues to do its thing by enhancing or bolstering existing Amazon cloud services and then making its work available for others to use.

This time out, it’s open-sourcing the s3mper library it’s been building and testing to ensure better consistency of data stored in Amazon Web Services’ gigantic S3 storage service. S3mper is now available under the Apache open source license version 2.

As usual, the story on the Netflix Tech Blog is that while the Amazon service in question (S3) is amazing, it needs to be just a scooch more amazing to meet Netflix’s needs. In his blog post, Netflix senior software engineer Daniel Weeks cited S3’s “99.999999999% durability, 99.99% availability, effectively infinite storage, versioning (data recovery), and ubiquitous access” as huge benefits.

But because Netflix views S3 as the “source of truth” for all data warehousing, data consistency is important. It wants to make sure it’s using the most current data available, but because S3 stores so much data (2 trillion objects as of last April), much of which is changing, consistency can be an issue. This is true for Netflix, but not necessarily so for the many S3 users for whom strict consistency is not a huge deal.

Those who do need better consistency can employ “a consistent, secondary index to catalog file metadata while backing the raw data on S3,” Weeks wrote. At a smallish scale, he said, that delivers the required consistency. It gets trickier as scale grows but generally, as long as the secondary index can handle all the requests, it’ll do. Still, there is a growing risk of data loss and performance hits when you rely on two separate systems, he said.

Here’s where s3mper — which uses AWS DynamoDB as the secondary index — comes in. Weeks wrote:

“S3mper is an experimental approach to tracking file metadata through use of a secondary index that provides consistent reads and writes. The intent is to identify when an S3 list operation returns inconsistent results and provide options to respond. We implemented s3mper using aspects to advise methods on the Hadoop FileSystem interface and track file metadata with DynamoDB as the secondary index. The reason we chose DynamoDB is that it provides capabilities similar to S3 (e.g. high availability, durability through replication), but also adds consistent operations and high performance.”
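For readers who want to see the idea in miniature, here is a rough Python sketch of the check Weeks describes: compare what a listing returns against what a consistent index says should be there, and respond if they differ. It is not s3mper's actual code (s3mper hooks into the Hadoop FileSystem interface and uses DynamoDB). The function names, the plain lists standing in for the index and the S3 listing, and the retry-then-report policy are all assumptions made for illustration.

```python
import time


def find_missing_keys(indexed_keys, listed_keys):
    """Return keys the index expects but the listing did not return, sorted."""
    return sorted(set(indexed_keys) - set(listed_keys))


def list_until_consistent(list_fn, indexed_keys, retries=3, delay_seconds=0.0):
    """Call list_fn() until it returns every indexed key or retries run out.

    list_fn is a hypothetical stand-in for an S3 list call; indexed_keys is a
    stand-in for what the consistent secondary index says should exist.
    Returns (listed_keys, missing_keys); missing_keys is empty on success.
    """
    listed = list_fn()
    missing = find_missing_keys(indexed_keys, listed)
    attempts = 0
    while missing and attempts < retries:
        time.sleep(delay_seconds)  # wait before asking the store again
        listed = list_fn()
        missing = find_missing_keys(indexed_keys, listed)
        attempts += 1
    return listed, missing
```

A caller that gets back a non-empty list of missing keys could fail the job, log a warning, or carry on anyway. Which response is right depends on the workload, which is presumably why the post talks about providing "options to respond."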

This is techie stuff, so read the whole post to get all the nuance, but a couple of AWS shops I contacted were thrilled to hear about S3mper. It would especially help companies that use S3 for big batch processing jobs and those dealing with financial or medical data where consistency is key, said one engineer.

David Mytton, founder and CEO of Server Density, was also intrigued. “We originally used S3 for our deploy system to host build artifacts before they were pulled down by our application servers. We saw inconsistency enough that we switched to the SoftLayer Object Store so we could ensure our servers all always got the correct code,” Mytton said via email.

In theory, s3mper would help ensure that you get what you expect without having to resort to “dumb” workarounds like just waiting a few minutes, he said. The topic of data handling and how data consistency can impact the applications of tomorrow will surely crop up at our Structure Data show slated for New York in March. Check it out.