While Xen hypervisor updates affecting multiple cloud providers forced some websites offline, Netflix once again stayed up the entire time. The biggest worry last weekend was the nearly one-tenth of its Cassandra nodes that had to be rebooted.
Building on its tools that simulate errors and failures across its enormous Amazon infrastructure, Netflix is looking for engineers whose job will be to discover new flaws.
Ariel Tseitlin, former director of cloud services for Netflix, is now a partner at Scale Venture Partners.
Cloud developers and engineers have probably heard about Netflix’s Chaos Monkey before, and now the company has turned the tool to its production Cassandra database clusters, as this post explains. Chaos Monkey isn’t just about spotting weaknesses in cloud architectures; the real goal is figuring out fixes. Netflix has improved its Cassandra clusters with real-time monitoring and automatic replacement of failed nodes.
Netflix open sourced the code to Janitor Monkey, a tool it uses to automate the deletion of unused Amazon Web Services resources. It’s easy to spin up cloud compute instances, but not so easy to shut them down as they fall into disuse.
With millions of viewers expected to watch history Sunday night, NASA couldn’t afford to let the live stream of its Mars rover Curiosity landing go untested. Here’s how NASA put its Amazon Web Services-based infrastructure through its paces to ensure it keeps up with demand.
Netflix has open sourced Chaos Monkey, a service designed to terminate cloud computing instances in a controlled manner so companies can ensure their applications keep running when a virtual server dies unexpectedly. In the past year, Chaos Monkey has terminated more than 65,000 of Netflix’s instances.
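For readers who want a feel for what "terminating instances in a controlled manner" might look like, here is a minimal, purely in-memory sketch. It is not Netflix's code and it calls no cloud API. The `Instance` class, the `pick_victims` and `terminate` functions, the per-group probability, and the rule of always leaving one survivor per group are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a cloud instance; not a real AWS object."""
    instance_id: str
    group: str
    running: bool = True


def pick_victims(instances, probability, rng, max_per_group=1):
    """Pick running instances to terminate, group by group.

    Assumed safety rule (ours, for illustration): a group is never
    emptied, so at least one running instance always survives.
    """
    by_group = {}
    for inst in instances:
        if inst.running:
            by_group.setdefault(inst.group, []).append(inst)

    victims = []
    for members in by_group.values():
        # Cap the kill count so one member of the group stays up.
        limit = min(max_per_group, len(members) - 1)
        if limit <= 0:
            continue
        # Each group is hit independently with the given probability.
        if rng.random() < probability:
            victims.extend(rng.sample(members, limit))
    return victims


def terminate(victims):
    """Simulate termination by flipping the in-memory flag."""
    for inst in victims:
        inst.running = False


fleet = [
    Instance("i-api-1", "api"),
    Instance("i-api-2", "api"),
    Instance("i-api-3", "api"),
    Instance("i-db-1", "db"),
]
chosen = pick_victims(fleet, probability=1.0, rng=random.Random(0))
terminate(chosen)
```

With `probability=1.0`, exactly one of the three `api` instances is marked as stopped, while the lone `db` instance is spared by the survivor rule. A real tool would replace `terminate` with a call to the cloud provider and would then check that the application kept serving traffic.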