Using DNS to query AWS

DNS is hard. Twisted is hard. AWS is easy. :)

At Mozilla Releng we use EC2 a lot. DNS has always been one of the pain points -- one always wants to ssh/vnc to a specific VM to debug problems. Our Puppet infrastructure also requires proper forward and reverse DNS entries to generate an SSL certificate.

Before we switched to PuppetAgain we didn't bother adding VMs to DNS; instead we used a script to generate an /etc/hosts-style file to simplify name resolution.
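For illustration, a script of that kind could look roughly like the sketch below. This is not the original script: the function name `format_hosts`, the sample addresses and the hostnames are all made up for the example.

```python
def format_hosts(instances):
    """Render (ip, fqdn) pairs as /etc/hosts-style lines, sorted by name."""
    lines = []
    for ip, fqdn in sorted(instances, key=lambda pair: pair[1]):
        short = fqdn.split(".", 1)[0]  # short alias, e.g. "a" for "a.example.com"
        lines.append("%s\t%s %s" % (ip, fqdn, short))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical data; a real script would get this from the EC2 API.
    sample = [("10.0.0.2", "b.example.com"), ("10.0.0.1", "a.example.com")]
    print(format_hosts(sample), end="")
```

The output can be appended to a local hosts file, which is enough for ssh/vnc but does nothing for reverse lookups.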

After adding spot instances into the equation we had to switch to a trickier model in which we pre-create EC2 network interfaces, add the corresponding IP addresses to DNS, and tag the interfaces so our AMIs can use that information to set up their hostnames, etc.

This DNS requirement makes some things very inflexible. One has to wait 10-20 minutes for DNS propagation. Even though we can use the API to add new entries, cleaning up old ones has always been tricky.

During one of my 1x1s with catlee, we were brainstorming how to get rid of DNS management and still be able to reach the VMs easily. We came to a simple idea: invent our own DNS server. Yay!

I wrote a simple DNS server using Twisted. It uses boto to query AWS and generate responses. The initial version is pretty simple, has a lot of hard-coded values (like the port, log file, etc.), and has some issues with running boto asynchronously (yay defer.execute()), but it does address some of the issues above.

Some useful examples:

$ dig -p 1253 @localhost

# use wildcards
$ dig -p 1253 @localhost *-linux64-ec2-010.*
;; ANSWER SECTION:
600 IN A
600 IN A
600 IN A
600 IN A

# use instance ID
$ dig -p 1253 @localhost i-b462f595

# use tags
$ dig -p 1253 @localhost tag:moz-loaned-to=j*,moz-type=tst*

# do something useful, ping all loaned slaves
$ fping `dig -p 1253 @localhost tag:moz-loaned-to=* +short`
is alive
is alive
is alive
is alive
is unreachable
is unreachable
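To give an idea of how query names like the ones above might be turned into an AWS lookup, here is a minimal sketch. It is not the code of the actual server: the function name `query_to_filters` is made up, and mapping plain names to the `Name` tag is an assumption. The `instance-id` and `tag:<key>` filter names and the `*` wildcard are regular EC2 filter features, and a dict of this shape is what boto accepts as its `filters` argument.

```python
import re

# EC2 instance IDs look like "i-" followed by hex digits.
INSTANCE_ID = re.compile(r"^i-[0-9a-f]+$")


def query_to_filters(name):
    """Translate a DNS query name into an EC2-style filter dict."""
    name = name.rstrip(".")  # resolvers may append the root dot
    if name.startswith("tag:"):
        # "tag:k1=v1,k2=v2" -> {"tag:k1": "v1", "tag:k2": "v2"}
        filters = {}
        for pair in name[len("tag:"):].split(","):
            key, _, value = pair.partition("=")
            filters["tag:" + key] = value
        return filters
    if INSTANCE_ID.match(name):
        return {"instance-id": name}
    # Assumption: anything else is matched against the Name tag.
    return {"tag:Name": name}


if __name__ == "__main__":
    for q in ("i-b462f595", "tag:moz-loaned-to=j*,moz-type=tst*", "*-linux64-ec2-010.*"):
        print(q, "->", query_to_filters(q))
```

The resulting dict could then be handed to a boto call such as `get_all_instances(filters=...)`, and the instances' IP addresses returned as A records.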

EC2 spot instances experiments

To optimize Mozilla's AWS bills I recently started playing with EC2 spot instances. They can be much cheaper than regular on-demand or reserved instances, but they can be killed "by price" at any time if somebody bids higher than you. The idea of using cheaper instances has been around for a while, but the fact that a job can be interrupted was one of the psychological blockers for us.

We decided to try spot instances as unit test slaves, since that simplifies the changes required to plug them into our infra. The spot request API is a bit different from the on-demand API: there is no way to re-use existing disk volumes, only snapshots (read: AMI snapshots). We use Puppet to set up everything on the system, as long as proper DNS and SSL certificates are in place. With minimal changes I managed to get AMIs to bootstrap themselves within a couple of minutes after boot. Under the hood, we pre-create EC2 network interfaces and DNS entries so that Puppet and buildbot work just like on the rest of our infra.

Another challenge was to avoid losing interrupted jobs. It turned out that bhearsum had recently deployed a buildbot change to retry interrupted jobs. During the experiment all interrupted jobs were retried. Avoiding spot instances for the second run of an interrupted job is still to be done in bug 925285.

Some facts.

  • 50 m1.medium instances have been used in the experiment.
  • Bid prices varied from 2.5¢ to 8¢.
  • 22 instances were killed "by price" within the first 20 minutes.
  • 20 instances survived 48 hours (until I killed them). Not only the most expensive ones, but even the cheapest ones were in the list of surviving instances.
  • ~2000 test jobs have been run.
  • An unknown number of thoughts have been exchanged with catlee and tglek. :)
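To put the numbers above into proportion, here is a quick back-of-the-envelope calculation. The helper name `spot_summary` is made up, and the "killed later" figure is only derived by subtraction, assuming every instance that neither died in the first 20 minutes nor survived was killed somewhere in between.

```python
def spot_summary(total, killed_early, survived):
    """Summarize how a batch of spot instances fared."""
    return {
        "killed_early_pct": 100.0 * killed_early / total,
        "survived_pct": 100.0 * survived / total,
        # Derived, not measured: whatever is left was killed later on.
        "killed_later": total - killed_early - survived,
    }


summary = spot_summary(total=50, killed_early=22, survived=20)
print(summary)
```

So roughly 44% of the instances were lost almost immediately, 40% lasted the whole 48 hours, and the remaining 8 instances went away at some point in between.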

To be continued... Stay tuned!

Firefox Unit Tests on Ubuntu

It's been a while since we in RelEng started thinking about offloading builds and tests to AWS VMs. Last year we started building Firefox (Linux and Android), Thunderbird and Firefox OS on CentOS 6.2 based AWS VMs. Since then our wait times have always been above 95%, usually around 99%.

However, the story of the tests' wait times is different. Since RelEng started building faster and added new products (especially Firefox OS) and more branches, those wait times went below 50%.

It took more than a month to get the new platform up and running, properly puppetized and documented. I really liked using mind maps to organize chaotic thoughts, git-buildpackage to keep the package building process under control, and Upstart for its ability to chain services on system boot.

Chris posted a great overview of what we have now.

I would also like to say THANKS to Armen and Joel for their help with getting tests running on the new platforms, and to Callek and Dustin for their patience reviewing HUGE patches to get the platform puppetized.

Switching to Nikola

I've been using Wordpress as a blog engine for a while now, but I wasn't happy with it for a few reasons:

  • Security. I've never been hit by this issue, but some of my friends and colleagues had a bad experience recovering their blogs after successful attacks.
  • It's not easy to back it up sanely. Since Wordpress is a database-driven blog engine you have to dump the database, copy the files, etc. Using version control is almost impossible in this scenario.
  • No way to use it offline. That's the time when you want to write something! Of course you can write down things using your favourite text editor, then transfer the text and fix its appearance, but you can't see and use the whole blog.

Since I've been running Wordpress on my own server, I won't complain about other concerns people usually have: PHP, PHP versions, PHP modules, database, file permissions, running not under www-data user, etc.

I've been looking for something that eliminates most of the problems I listed above (the security issue can never be eliminated completely!). Since Python is the most used language at Mozilla RelEng, I've decided to pick one of the static blog engines/generators listed at Python blog software. Nikola was one of the engines that I had been seeing in the news recently. Also, having a blog engine named after your grandfather isn't a bad idea. :)

So far I have managed to import my old posts from Wordpress. However, I'm going to reimport the old posts manually, just to be sure that all my posts are in the same format. BTW, Nikola supports reStructuredText and Markdown, which is really great.

I still need to figure out the easiest way to put the blog under version control, and to teach VIM and myself to properly use Nikola.

P.S. I can highlight :)


import sys

def hello(name='world'):
    print("hello", name)

if __name__ == "__main__":
    # Assumed completion: the original body was cut off here.
    hello(*sys.argv[1:2])

How to use Visual Studio 2010 to build Firefox using Try

It took a while to start working on Bug 563317 (Install Visual C++ 2010 on build slaves) and get it working properly.

The first challenge was the OPSI installation procedure of Visual Studio 2010 which requires 3 reboots (!) to get installed properly. The final OPSI installation instructions don't seem too horrible.

The second challenge was awaiting me after I deployed the package on the try build slaves. Our start-buildbot.bat batch file was setting Visual Studio 2005 environment variables and it was not easy to reset them. After a bunch of try pushes the solution was pushed!

So, if you want to compile Firefox with Visual Studio 2010 using the try server, add the following line to the end of your mozconfig:

. $topsrcdir/browser/config/mozconfigs/win32/vs2010-mozconfig

P.S. To have talos tests for debug builds running properly we still need to fix Bug 701700 and deploy VC++ 2010 debug CRT on talos slaves.

Harvesting releases

This month was a very interesting one.

I had a chance to be involved in 6 (!) release processes: 3.7a1, 4.0b1 (2 builds), 3.6.6, 3.6.7 and 3.5.11. All of these builds were unique (at least for me).


3.7a1

Last alpha with a different name (MozillaDeveloperPreview). We introduced the linux64 and macosx64 platforms in this release. Lucky me, the build environment for these platforms was carefully prepared and tested by Armen and Bear beforehand. During the preparation for this release, RelEng resolved some annoying bugs, which reduced manual intervention in the release process.


4.0b1

Not released yet. First branded version of Firefox 4, built for 5 platforms. Because of some bugs discovered along the way we had to wait a day or two and produce build2.


3.6.6

Stable release with some fixes. Nothing unusual except the previous product version, 3.6.4 (not 3.6.5), and some fun with forcing L10N repacks. Despite the fact that the time when we started the build wasn't ideal (Friday night, my Saturday morning), we released it in less than 24 hours.

It is the fastest release in RelEng history. It's a pleasure being a part of history. :)


3.6.7

Not released yet. Available for the beta users. We had to run this release in parallel with 3.5.11. It needed some sed magic for snippets (thanks to Nick Thomas) to reduce server load and use the mirrors for the beta channel updates. A lot of fun with producing Major Updates (MU) for Firefox 3.0.19 manually.


3.5.11

Not released yet. Old stable version. Available for the beta users. The build was done in parallel with 3.6.7. As a part of this build we also produced MUs for 3.5.x -> 3.6.7. MUs were done by release automation.

As a result, I now have a much clearer understanding of the release process, the release workflow and the release infrastructure.

Special thanks go to Ben Hearsum, Chris AtLee and Nick Thomas for being great supervisors!