EC2 spot instances experiments

To optimize Mozilla's AWS bills I recently started playing with EC2 spot instances. They can be much cheaper than regular on-demand or reserved instances, but they can be killed "by price" at any time if somebody bids higher than you. The idea of using cheaper instances has been around for a while, but the fact that a job can be interrupted was one of the psychological blockers for us.

We decided to try spot instances as unit test slaves, to keep the changes required to plug them into our infra simple. The spot request API is a bit different from the on-demand API: there is no way to reuse existing disk volumes, only snapshots (read: AMI snapshots). We use Puppet to set up everything on the system, which works as long as proper DNS entries and SSL certificates are in place. With minimal changes I managed to get AMIs to bootstrap themselves within a couple of minutes after boot. Under the hood, we pre-create EC2 network interfaces and DNS entries so that Puppet and buildbot work just like they do on the rest of our infra.
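To make the "pre-created network interface" idea a bit more concrete, here is a minimal sketch of what a spot request for one slave might look like. It is not the code used in the experiment: the helper name build_spot_request and the example IDs are made up, and the keys follow the parameter names of boto3's request_spot_instances call as I understand them. The function only builds the arguments, so it runs without AWS credentials.

```python
def build_spot_request(ami_id, bid_usd, eni_id, instance_type="m1.medium"):
    """Build keyword arguments for a one-off spot request (hypothetical helper).

    ami_id  -- AMI that bootstraps itself via Puppet after boot
    bid_usd -- maximum hourly price we are willing to pay, in US dollars
    eni_id  -- pre-created network interface that already has a DNS entry
    """
    if bid_usd <= 0:
        raise ValueError("bid must be positive")
    return {
        # The bid is passed as a string, e.g. "0.025" for 2.5 cents.
        "SpotPrice": f"{bid_usd:.3f}",
        # A specific network interface can be attached to one instance only.
        "InstanceCount": 1,
        "Type": "one-time",
        "LaunchSpecification": {
            "ImageId": ami_id,
            "InstanceType": instance_type,
            "NetworkInterfaces": [
                {"NetworkInterfaceId": eni_id, "DeviceIndex": 0},
            ],
        },
    }


# Example with placeholder IDs (assumptions, not real resources).
request = build_spot_request("ami-12345678", 0.025, "eni-abcdef01")
print(request["SpotPrice"])
```

With boto3 installed and credentials configured, the resulting dictionary could presumably be passed along as boto3.client("ec2").request_spot_instances(**request). Because the interface and its DNS name exist before the instance does, Puppet and buildbot see a host name they already know.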

Another challenge was to avoid losing interrupted jobs. It turned out that bhearsum had recently deployed a buildbot change to retry interrupted jobs. During the experiment all interrupted jobs were retried. Avoiding spot instances for the second run of an interrupted job is still to be done in bug 925285.
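The "no spot instances for the second run" rule could be as simple as the sketch below. This is purely illustrative: it is not how buildbot or the work in bug 925285 implements it, and the function name choose_pool and the pool labels are invented for the example.

```python
def choose_pool(previous_interruptions):
    """Pick the kind of instance a job should run on (hypothetical policy).

    A job that has never been interrupted may run on a cheap spot instance.
    A job that has already been killed "by price" goes to an on-demand
    instance, so that the retry cannot be interrupted the same way.
    """
    if previous_interruptions < 0:
        raise ValueError("interruption count cannot be negative")
    return "spot" if previous_interruptions == 0 else "on-demand"


# First attempt vs. a retry after one interruption.
print(choose_pool(0))
print(choose_pool(1))
```

The point of such a rule is to cap the damage: a job can lose at most one run to a price spike before it lands on an instance that will not be taken away.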

Some facts.

  • 50 m1.medium instances have been used in the experiment.

  • Bid prices varied from 2.5¢ to 8¢.

  • 22 instances were killed "by price" within the first 20 minutes.

  • 20 instances survived 48 hours (until I killed them). Not only the most expensive ones, but even the cheapest ones were among the survivors.

  • ~2000 test jobs have been run.

  • An unknown number of thoughts have been exchanged with catlee and tglek. :)

To be continued... Stay tuned!