Funsize is ready for testing!

Funsize is very close to being enabled in production! It has undergone a Rapid Risk Assessment procedure, which found a couple of potential issues. Most of them are either resolved or waiting for deployment.

To make sure everything works as expected and to catch some last-minute bugs, I added new update channels for Firefox Nightly and Developer Edition. If you are brave enough, you can follow the instructions below and switch your update channel to either nightly-funsize (for Firefox Nightly) or aurora-funsize (for Firefox Developer Edition).

The TL;DR instructions look like this:

  • Enable update logging. Set app.update.log to true in about:config. This step is optional, but it may help with debugging possible issues.
  • Shut down all running instances of Firefox.
  • Edit defaults/pref/channel-prefs.js in your Firefox installation directory and replace the line containing app.update.channel with:
pref("app.update.channel", "aurora-funsize"); // Firefox Developer Edition

or

pref("app.update.channel", "nightly-funsize"); // Firefox Nightly
  • Continue using your Firefox.

You can check your channel in the About dialog; it should contain the channel name:

About dialog

Reporting Issues

If you see any issues, please report them to Bugzilla.

Taskcluster: First Impression

Good news. We decided to redesign Funsize a little and now it uses Taskcluster!

The nature of Funsize is that we may start hundreds of jobs at the same time, then stop sending new jobs and wait for hours. In other words, the service is very bursty. Elastic Beanstalk is not ideal for this use case. Scaling up and down very fast is hard to configure using EB-only tools. Also, running zero instances is not easy.

I tried using Terraform, CloudFormation and Auto Scaling, but they were also not well suited. There were too many constraints (e.g. Terraform doesn't support all the AWS features we need) and they required considerable bespoke setup/maintenance to auto-scale properly.

The next option was Taskcluster, and I was pleased that its design fitted our requirements very well! I was impressed by the simplicity and flexibility offered.

I have implemented a service which consumes Pulse messages for particular buildbot jobs. For nightly builds, it schedules a task graph with three tasks:

  • generate a partial MAR
  • sign it (at the moment a dummy task)
  • publish to Balrog

All tasks are run inside Docker containers which are published on the docker.com registry (other registries can also be used). The task definition essentially consists of the docker image name and a list of commands it should run (usually a single script inside the docker image). In the same task definition you can specify which artifacts should be published by Taskcluster. The artifacts can be public or private.
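
To give a feel for it, here is a rough sketch of such a task definition as a Python dict, ready to be serialized to JSON. The worker type, image name, script and dates are made up, and the exact schema may differ:

# hypothetical Funsize-style task definition (all values are examples only)
task = {
    "provisionerId": "aws-provisioner",
    "workerType": "funsize",                         # hypothetical worker type
    "created": "2015-02-01T00:00:00Z",
    "deadline": "2015-02-01T02:00:00Z",
    "payload": {
        "image": "example/funsize-update-generator",  # image from the docker.com registry
        "command": ["/runme.sh"],                      # usually a single script inside the image
        "maxRunTime": 3600,
        "artifacts": {
            "public/env/target.partial.mar": {         # artifacts can be public or private
                "type": "file",
                "path": "/home/worker/artifacts/target.partial.mar",
                "expires": "2015-03-01T00:00:00Z",
            },
        },
    },
    "metadata": {
        "name": "Funsize partial MAR generation",
        "description": "Generate a partial update and publish it to Balrog",
        "owner": "example@mozilla.com",
        "source": "https://example.com/funsize",
    },
}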

Things that I really liked

  • Predefined task IDs. This is a great idea! There is no need to talk to the Taskcluster APIs to get the ID (or multiple IDs for task graphs), nor to parse the response. Fire and forget! The task IDs can be used in different places, like artifact URLs, dependent tasks, etc.
  • Task graphs. This is basically a collection of tasks that can run in parallel and depend on each other. It is a nice way to declare your jobs and know them in advance. If needed, a task graph can be extended dynamically by its own tasks (decision tasks).
  • Simplicity. All you need is to generate a valid JSON document and submit it to Taskcluster using the HTTP API (see the sketch after this list).
  • User defined docker images. One of the downsides of Buildbot is that you have a predefined list of slaves with a predefined environment (OS, installed software, etc). Taskcluster leverages Docker by default to let you use your own images.
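
A minimal sketch of the "predefined task ID" idea, reusing the task dict from the sketch above (the artifact URL layout and the exact client calls may differ):

import taskcluster  # the Taskcluster Python client

# Generate the IDs locally -- no API round-trip, fire and forget.
generate_id = taskcluster.slugId()
signing_id = taskcluster.slugId()

# A dependent task can embed the artifact URL of a task that hasn't been submitted yet.
partial_url = ("https://queue.taskcluster.net/v1/task/%s/artifacts/"
               "public/env/target.partial.mar" % generate_id)

queue = taskcluster.Queue()
queue.createTask(generate_id, task)   # `task` is the definition sketched earlier
# a signing task whose payload references partial_url would be submitted
# the same way, using signing_id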

Things that could be improved

  • Encrypted variables. I spent 2-3 days fighting with the encrypted variables. My scheduler was written in Python, so I tried a half dozen different Python PGP libraries, but for some reason all of them generated an incompatible OpenPGP format that Taskcluster could not understand. This forced me to rewrite the scheduling part in Node.js using openpgpjs. There is a bug to address this problem globally. Also, using ISO timestamps would have saved me hours. :)
  • It would be great to have a generic scheduler that doesn't require third-party Taskcluster consumers to write their own daemons that watch for changes (AMQP, VCS, etc.) and generate tasks. This would lower the entry barrier for beginners.

Conclusion

There are many other things that can be improved (and I believe they will be!) - Taskcluster is still a new project. Regardless, it is very flexible, easy to use and easy to develop for. I would recommend using it!

Many thanks to garndt, jonasfj and lightsofapollo for their support!

Funsize hacking

Prometheus

The idea of using a service that can generate partial updates for Firefox has been around for years. We actually used to have a server called Prometheus that was responsible for generating updates for nightly builds; the generation was done as a separate process from the actual builds.

Scaling that solution wasn't easy, so we switched to build-time update generation. Generating updates as a part of builds helped with load distribution, but lacked flexibility: there is no easy way to generate updates after the build, because the update generation process is directly tied to the build or repack process.

Funsize will solve both problems listed above: it distributes the load and stays flexible.

Last year Anhad started and Mihai continued working on this project. They have done a great job and created a solution that can easily be scaled.

Funsize is split into several pieces:

  • REST API frontend powered by Flask. It's responsible for accepting partial-generation requests, forwarding them to the queue and returning the generated partials (a rough sketch of this hand-off follows the list).
  • Celery-based workers to generate partial updates and upload them to S3.
  • SQS or RabbitMQ to coordinate Celery workers.
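
A rough sketch of how the first two pieces hand work to each other (the endpoint, broker URL and function names are illustrative, not the actual Funsize code):

# hypothetical API/worker hand-off
from celery import Celery
from flask import Flask, jsonify, request

app = Flask(__name__)
celery = Celery("funsize", broker="amqp://localhost//")  # RabbitMQ; SQS works too


@celery.task
def generate_partial(from_mar_url, to_mar_url):
    """Download both complete MARs, generate the partial and upload it to S3."""


@app.route("/partial", methods=["POST"])
def request_partial():
    data = request.get_json()
    result = generate_partial.delay(data["from_mar"], data["to_mar"])
    # the caller polls later and fetches the generated partial
    return jsonify({"task_id": result.id}), 202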

One of the biggest gains of Funsize is that it uses a global cache to speed up partial generation. For example, after we build an en-US Windows build, we ask Funsize to generate a partial. Then a swarm of L10N repacks (almost a hundred of them per platform) tries to do a similar job, and every single one asks for a partial update. All L10N builds have a lot in common, and xul.dll is one of the biggest files. Since the files are identical, there is no reason not to reuse the previously generated binary patch for that file. Repeat 100 times for multiple files. PROFIT!
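
The caching idea boils down to keying the cache on the content of the two files being diffed, so identical pairs always map to the same patch. A minimal sketch (the helper names and the S3-like cache interface are illustrative):

import hashlib

def _sha512(path):
    """Hash a file in chunks so large binaries don't need to fit in memory."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def get_or_create_patch(cache, from_path, to_path, make_patch):
    """Reuse a previously generated binary patch if the same file pair was seen before."""
    key = "diffs/%s-%s" % (_sha512(from_path), _sha512(to_path))
    patch = cache.get(key)                       # e.g. an S3 lookup
    if patch is None:
        patch = make_patch(from_path, to_path)   # the expensive binary diff
        cache.put(key, patch)
    return patch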

The first prototype of Funsize lives at github. If you are interested in hacking, read the docs on how to set up your developer environment. If you don't have an AWS account, it will use a local cache.

Note: this prototype may be redesigned to use Taskcluster. Taskcluster is going to simplify the initial design and reduce the dependency on always-online infrastructure.

Deploying your code from github to AWS Elastic Beanstalk using Travis

I have been playing with Funsize a lot recently. One of the goals was iterating faster.

I have hit some challenges with both Travis and Elastic Beanstalk.

The first challenge was to run the integration (actually end-to-end) tests in the same environment. Funsize uses Docker for both hacking and production environments. Unfortunately, it's not possible to create Docker images as part of a Travis job (there is an option to run jobs inside Docker, but that is a different beast).

A simple bash script works around this problem: it starts all the services we need in the background and runs the end-to-end tests. The end-to-end test asks Funsize to generate several partial MAR files, downloads the corresponding files from Mozilla's FTP server and compares their content, skipping the cryptographic signature (Funsize does not sign MAR files).
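
The comparison step is conceptually similar to the sketch below (the helper names are mine, not the actual test code); since the signature lives in the MAR header, extracting both archives first and comparing the extracted trees side-steps it:

import filecmp
import os

def _tree_files(root):
    """Relative paths of all files under root."""
    return {os.path.relpath(os.path.join(dirpath, name), root)
            for dirpath, _, names in os.walk(root) for name in names}

def trees_match(generated_dir, reference_dir):
    """Byte-for-byte comparison of two extracted MAR trees."""
    files = _tree_files(generated_dir)
    if files != _tree_files(reference_dir):
        return False
    return all(filecmp.cmp(os.path.join(generated_dir, name),
                           os.path.join(reference_dir, name), shallow=False)
               for name in files)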

The next challenge was deploying the code. We use Elastic Beanstalk as a convenient way to run simple services. There is a plan to use something else for Funsize, but at the moment it's Elastic Beanstalk.

Travis has support for Elastic Beanstalk, but it's still experimental and at the moment of writing this post there was no documentation on the official website. The .travis.yml file looks straightforward and worked fine. The only minor issue I hit was a too-long commit message.

# .travis.yml snippet
deploy:
    - provider: elasticbeanstalk
      app: funsize # Elastic Beanstalk app name
      env: funsize-dev-rail # Elastic Beanstalk env name
      bucket_name: elasticbeanstalk-us-east-1-314336048151 # S3 bucket used by Elastic Beanstalk
      region: us-east-1
      access_key_id:
        secure: "encrypted key id"
      secret_access_key:
        secure: "encrypted key"
      on:
          repo: rail/build-funsize # Deploy only using my user repo for now
          all_branches: true
          # deploy only if this particular job in the build matrix passes, not just any
          condition: $FUNSIZE_S3_UPLOAD_BUCKET = mozilla-releng-funsize-travis

Having the credentials in a public version control system, even encrypted, makes me very nervous. To minimize possible harm in case something goes wrong, I created a separate user in AWS IAM. I couldn't find any decent docs on what permissions a user needs to deploy something to Elastic Beanstalk, and it took a while to figure out this minimal set of permissions. Even with access limited to EB, S3, EC2, Auto Scaling and CloudFormation, the user still looks very powerful.

Conclusion: using Travis for Elastic Beanstalk deployments is quite stable and easy to use (after the initial setup) unless you are paranoid about some encrypted credentials being available on github.

Firefox builds are way cheaper now!

Releng has been successfully reducing the Amazon bill recently. We managed to drop the bill from $115K to $75K per month in February.

To make this happen we switched to a cheaper instance type, started using spot instances for regular builds and started bidding for spot instances smarter. Introducing the jacuzzi approach reduced the load by reducing build times.

More details below.

s/m3.xlarge/c3.xlarge/

When we first tried to switch from m3.xlarge ($0.45 per hour) to c3.xlarge ($0.30 per hour) we hit an interesting issue where Ruby wouldn't execute anything -- segmentation fault. It turned out that using paravirtual kernels on the c3 instance type is not a good idea, since this instance type "prefers" HVM virtualization, unlike the m3 instance types.

Massimo did a great job and ported our AMIs from PV to HVM.

The switch from PV to HVM went pretty smoothly, except that we had to add a swap file: linking libxul requires a lot of memory and the existing 7.5G wasn't enough.

This transition saved us more or less 1/3 of the build pool bill.

Smarter spot bidding

We used to bid blindly:

- I want this many spot instances in this availability zone and this is my maximum price. Bye!

- But the current price is soooo high! ... Hello? Are you there?.. Sigh...

- Hmm, where are my spot instances?! I want twice as many spot instances in this zone and this is my maximum price. Bye!

Since we switched to a much smarter way to bid for spot instances we improved our responsiveness to the load so much that we had to slow down our instance start ramp up. :)

As a part of this transition we reduced the number of on-demand builders from 400 to 100!

Additionally, now we can use different instance types for builders and sometimes get good prices for the c3.2xlarge instance type.

Less EBS

As a part of the s/m3.xlarge/c3.xlarge/ transition we also introduced a couple other improvements:

  • Reduced EBS storage use.
  • Started using SSD instance storage for try builds. All your try builds are on SSDs now! Using instance storage is not an easy thing, so we had to re-invent our own wheel to format/mount the storage on boot.

Using DNS to query AWS

DNS is hard. Twisted is hard. AWS is easy. :)

At Mozilla Releng we use EC2 a lot. DNS has always been one of the issues -- one always wants to ssh/vnc to a specific VM to debug issues. Our Puppet infrastructure requires proper forward and reverse DNS entries to generate an SSL certificate.

Before we switched to PuppetAgain we didn't bother adding VMs to DNS and used a script to generate an /etc/hosts-style file to simplify name resolution.

After adding spot instances into the equation we had to switch to a tricky model where we pre-create EC2 network interfaces, add the corresponding IP addresses to DNS and tag the interfaces so our AMIs can use that information to set up their hostnames, etc.
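
A rough sketch of the interface pre-creation with boto (the subnet ID and tag name are made up; the IP and hostname match the dig examples below):

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")
eni = conn.create_network_interface(
    subnet_id="subnet-12345678",
    private_ip_address="10.134.53.24",
    description="bld-linux64-ec2-010")
# the corresponding A record is added to DNS separately, and the hostname is
# attached as a tag so the instance can pick it up on boot
conn.create_tags([eni.id],
                 {"FQDN": "bld-linux64-ec2-010.build.releng.use1.mozilla.com"})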

This DNS requirement makes some things very inflexible: one has to wait 10-20 minutes for DNS propagation, and even though we can use the API to add new entries, cleaning up old ones has always been tricky.

During one of my 1x1s with catlee, while brainstorming how to get rid of DNS management and still be able to reach the VMs easily, we came up with a simple idea: invent our own DNS server. Yay!

I wrote a simple DNS server using Twisted. It uses boto to query AWS and generate responses. The initial version is pretty simple, has a lot of hard-coded values (like the port, log file, etc.), and has some issues with running boto asynchronously (yay defer.execute()), but it does address some of the issues above.
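
A heavily simplified sketch of the idea (the tag name, region and class names are illustrative, and the real implementation also handles the instance-ID and tag query forms shown below):

import boto.ec2
from twisted.internet import defer, reactor, threads
from twisted.names import common, dns, error, server


class EC2Resolver(common.ResolverBase):
    """Answer A queries by asking EC2 for an instance tagged with the requested name."""

    def _lookup(self, name, cls, type_, timeout):
        if type_ != dns.A:
            return defer.fail(error.DomainError(name))
        # boto calls are blocking, so run them in a thread pool
        d = threads.deferToThread(self._find_ip, name)
        d.addCallback(self._to_answer, name)
        return d

    def _find_ip(self, name):
        conn = boto.ec2.connect_to_region("us-east-1")
        for reservation in conn.get_all_instances(filters={"tag:FQDN": name}):
            for instance in reservation.instances:
                if instance.private_ip_address:
                    return instance.private_ip_address
        raise error.DomainError(name)

    def _to_answer(self, ip, name):
        record = dns.RRHeader(name=name, type=dns.A, ttl=600,
                              payload=dns.Record_A(address=ip, ttl=600))
        return [record], [], []


factory = server.DNSServerFactory(clients=[EC2Resolver()])
reactor.listenUDP(1253, dns.DNSDatagramProtocol(controller=factory))
reactor.run()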

Some useful examples:

$ dig -p 1253 @localhost bld-linux64-ec2-010.build.releng.use1.mozilla.com
...
;; ANSWER SECTION:
bld-linux64-ec2-010.build.releng.use1.mozilla.com. 600 IN A 10.134.53.24
...

# use wildcards
$ dig -p 1253 @localhost *-linux64-ec2-010.*.releng.use1.mozilla.com
...
;; ANSWER SECTION:
tst-linux64-ec2-010.test.releng.use1.mozilla.com. 600 IN A 10.134.57.212
try-linux64-ec2-010.try.releng.use1.mozilla.com. 600 IN A 10.134.64.70
dev-linux64-ec2-010.dev.releng.use1.mozilla.com. 600 IN A 10.134.52.95
bld-linux64-ec2-010.build.releng.use1.mozilla.com. 600 IN A 10.134.53.24
...

# use instance ID
$ dig -p 1253 @localhost i-b462f595
...
;; ANSWER SECTION:
bld-linux64-ec2-158.build.releng.use1.mozilla.com. 600 IN A 10.134.52.9
...

# use tags
$ dig -p 1253 @localhost tag:moz-loaned-to=j*,moz-type=tst*
...
;; ANSWER SECTION:
tst-linux64-ec2-jrmuizel.test.releng.usw2.mozilla.com. 600 IN A 10.132.59.211
...

# do something useful, ping all loaned slaves
$ fping `dig -p 1253 @localhost tag:moz-loaned-to=* +short`
10.134.58.103 is alive
10.134.58.233 is alive
10.132.59.211 is alive
10.134.57.55 is alive
10.134.58.244 is unreachable
10.134.58.8 is unreachable

EC2 spot instances experiments

To optimize Mozilla's AWS bills I recently started playing with EC2 spot instances. They can be much cheaper than regular on-demand or reserved instances, but they can be killed "by price" anytime if somebody bids higher than you. The idea to use cheaper instances has been around for a while, but the fact that a job can be interrupted was one of the psychological stoppers for us.

We decided to try out how spot instances behave as unit test slaves, in order to simplify the changes required to plug them into our infra. The spot request API is a bit different from the on-demand API - there is no way to re-use existing disk volumes, only snapshots (read: AMI snapshots). We use Puppet to set up everything on the system, as long as proper DNS and SSL certificates are in place. With minimal changes I managed to get AMIs bootstrapping themselves within a couple of minutes after boot. Under the hood, we pre-create EC2 network interfaces and entries in DNS so that puppet and buildbot work just like the rest of our infra.
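
The difference boils down to something like this with boto (region, AMI ID and prices are made up):

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")
# On-demand: conn.run_instances(...) starts instances right away.
# Spot: you place a bid against an AMI snapshot and wait for fulfillment.
requests = conn.request_spot_instances(
    price="0.08",                 # maximum bid, in dollars per hour
    image_id="ami-12345678",      # must boot from an AMI, not an existing volume
    count=50,
    instance_type="m1.medium",
    placement="us-east-1a")
# the request is asynchronous: poll until it is fulfilled or killed "by price"
statuses = conn.get_all_spot_instance_requests([r.id for r in requests])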

Another challenge was to avoid losing interrupted jobs. It turned out that bhearsum had recently deployed a buildbot change to retry interrupted jobs, and during the experiment all interrupted jobs were retried. Avoiding spot instances for the second run if a job has been interrupted is still to be done in bug 925285.

Some facts:

  • 50 m1.medium instances have been used in the experiment.
  • Bid prices varied from 2.5¢ to 8¢.
  • 22 instances were killed "by price" within the first 20 minutes.
  • 20 instances survived 48 hours (until I killed them). Not only the most expensive ones, but even the cheapest ones were among the survivors.
  • ~2000 test jobs have been run.
  • Unknown amount of thoughts have been exchanged with catlee and tglek. :)

To be continued... Stay tuned!

Firefox Unit Tests on Ubuntu

It's been a while since we in RelEng started thinking about offloading builds and tests to AWS VMs. Last year we started building Firefox (Linux and Android), Thunderbird and Firefox OS on CentOS 6.2 based AWS VMs. Since then our wait times have always been above 95%, usually around 99%.

However, the story of test wait times is different: since RelEng started building faster and added new products (especially Firefox OS) and more branches, the wait times went below 50%.

It took more than a month to get the new platform up and running, properly puppetized and documented. I really liked using mind maps to organize chaotic thoughts, git-buildpackage to keep the package building process under control, and Upstart for its ability to chain services on system boot.

Chris posted a great overview of what we have now.

I would also like to say THANKS to Armen and Joel for their help with getting tests running on the new platforms, Callek and Dustin for their patience reviewing HUGE patches to get the platform puppetized.

Switching to Nikola

I've been using Wordpress as a blog engine for a while now, but I wasn't happy with it for several reasons:

  • Security. I've never been hit by this issue myself, but some of my friends and colleagues have had a bad experience recovering their blogs after successful attacks.
  • It's not easy to back it up sanely. Since Wordpress is a database-driven blog engine, you have to dump the database, copy the files, etc. Using version control is almost impossible in this scenario.
  • No way to use it offline. That's the time when you want to write something! Of course you can write down things using your favourite text editor, then transfer the text and fix its appearance, but you can't see and use the whole blog.

Since I've been running Wordpress on my own server, I won't complain about other concerns people usually have: PHP, PHP versions, PHP modules, the database, file permissions, not running under the www-data user, etc.

I've been looking for something that eliminates most of the problems listed above (the security issue can never be fully eliminated!). Since Python is the most used language at Mozilla RelEng, I decided to pick one of the static blog engines/generators listed at Python blog software. Nikola was one of the engines that I had been seeing in the news recently. Also, having a blog engine named after your grandfather isn't a bad idea. :)

So far I've managed to import my old posts from Wordpress. However, I'm going to re-import them manually, just to be sure that I have all my posts in the same format. BTW, Nikola supports reStructuredText and Markdown, which is really great.

I still need to figure out the easiest way to put the blog under version control, and teach Vim and myself to use Nikola properly.

P.S. I can highlight :)

hello.py

#!/usr/bin/python

import sys


def hello(name='world'):
    print "hello", name

if __name__ == "__main__":
    hello(*sys.argv[1:])

How to use Visual Studio 2010 to build Firefox using Try

It took a while to start working on Bug 563317 (Install Visual C++ 2010 on build slaves) and to get everything running properly.

The first challenge was the OPSI installation procedure for Visual Studio 2010, which requires 3 reboots (!) to install properly. The final OPSI installation instructions don't seem too horrible.

The second challenge was awaiting me after I deployed the package on the try build slaves. Our start-buildbot.bat batch file was setting Visual Studio 2005 environment variables, and it was not easy to reset them. After a bunch of try pushes, the solution landed!

So, if you want to compile Firefox with Visual Studio 2010 using try server, add the following line to the end of your mozconfig:

. $topsrcdir/browser/config/mozconfigs/win32/vs2010-mozconfig

P.S. To have talos tests running properly for debug builds, we still need to fix Bug 701700 and deploy the VC++ 2010 debug CRT on the talos slaves.