Thoughts about partial updates on demand

Firefox has its own built-in update system. The update system supports two types of updates: complete and incremental. Completes can be applied to any older version, unless there are incompatible changes in the MAR format. Incremental updates can only be applied to the release they were generated for.

Usually, for the beta and release channels, we generate incremental updates against 3-4 previous versions. This way we try to minimize bandwidth consumption for our end users and increase the number of users on the latest version. For Nightly and Developer Edition builds we generate 5 incremental updates using Funsize.

Both methods assume that we know ahead of time what versions should be used for incremental updates. For releases and betas we use ADI stats to be as precise as possible. However, these methods are static and don't use real-time data.

The idea to generate incremental updates on demand has been around for ages. Some of the challenges are:

  • Acquiring real-time (or close to real-time) data for making decisions on incremental update versions
  • Size of the incremental updates. If the size is very close to the size of the corresponding complete, there is no reason to serve the incremental update. One of the reasons is that the updater tries to use the incremental update first and falls back to the complete if something goes wrong; in that case the updater downloads both the incremental and the complete.

Ben and I talked about this today and to recap some of the ideas we had, I'll put them here.

  • We still want to "pre-seed" the most likely incremental updates before we publish any updates
  • Whenever Balrog serves a complete-only update, it should generate a structured log entry and/or an event to be consumed by some service, containing all the information required to generate an incremental update (a hypothetical sketch of such an event follows this list).
  • The "new system" should be able to decide if we want to discard incremental update generation, based on the size. These decisions should be stored, so we don't try to generate incremental update again next time. This information may be stored in Balrog to prevent further events/logs.
  • Before publishing an incremental update, we should test whether it can be applied without issues, similar to the update verify tests we run for releases, but without hitting Balrog. After it passes this test, we can publish it to Balrog and check whether Balrog returns the expected XML with the partial info in it.
  • Minimize the number of served completes if we plan to generate incremental updates. One of the ideas was to modify the client to support responses like "Come in 5 minutes, I may have something for you!"
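
To make the event idea a bit more concrete, here is a minimal sketch (in Python, with entirely made-up field names; Balrog has no such schema today) of the structured entry Balrog could emit when it serves a complete-only update:

import json
import logging
import time

log = logging.getLogger("balrog.partials")

def log_complete_only(product, channel, platform, locale, from_version, to_version):
    # Everything a downstream service would need to queue partial generation
    log.info(json.dumps({
        "event": "complete_only_served",
        "timestamp": time.time(),
        "product": product,            # e.g. "Firefox"
        "channel": channel,            # e.g. "release"
        "platform": platform,          # e.g. "WINNT_x86_64-msvc"
        "locale": locale,              # e.g. "de"
        "from_version": from_version,  # version the client is updating from
        "to_version": to_version,      # version being served
    }))

A consumer of these events could aggregate them to get the close-to-real-time data mentioned above and decide which incremental updates are worth generating.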

The only remaining thing is to implement all these changes. :)

Firefox 46.0 and SHA512SUMS

In my previous post I introduced the new release process we have been adopting in the 46.0 release cycle.

Release build promotion has been in production since Firefox 46.0 Beta 1. We have discovered some minor issues; some of them are already fixed, some are still waiting to be fixed.

One of the visible bugs is Bug 1260892. We generate a big SHA512SUMS file, which should contain all important checksums. With the numerous changes to the process, the file doesn't represent all required files anymore: some files are missing, some have different names.

We are working on fixing the bug, but in the meantime you can use the following workaround to verify the files.

For example, if you want to verify one of the installers, you need to use the following two files: firefox-46.0.checksums and its GPG signature, firefox-46.0.checksums.asc.

Example commands:

# download all required files
$ wget -q
$ wget -q
$ wget -q
$ wget -q
# Import Mozilla Releng key into a temporary GPG directory
$ mkdir .tmp-gpg-home && chmod 700 .tmp-gpg-home
$ gpg --homedir .tmp-gpg-home --import KEY
# verify the signature of the checksums file
$ gpg --homedir .tmp-gpg-home --verify firefox-46.0.checksums.asc && echo "OK" || echo "Not OK"
# calculate the SHA512 checksum of the file
$ sha512sum "Firefox Setup 46.0.exe"
c2ed64298ac2140d8dbdaed28cabc90b38dd9444e9c0d6dd335a2a32cf043a35314945536a5c75124a88bf418a4e2ba77256be223425380e7fcc45a97da8f479  Firefox Setup 46.0.exe
# lookup for the checksum in the checksums file
$ grep c2ed64298ac2140d8dbdaed28cabc90b38dd9444e9c0d6dd335a2a32cf043a35314945536a5c75124a88bf418a4e2ba77256be223425380e7fcc45a97da8f479 firefox-46.0.checksums
c2ed64298ac2140d8dbdaed28cabc90b38dd9444e9c0d6dd335a2a32cf043a35314945536a5c75124a88bf418a4e2ba77256be223425380e7fcc45a97da8f479 sha512 46275456 install/sea/firefox-46.0.ach.win64.installer.exe

This is just a temporary workaround; the bug will be fixed ASAP.

Release Build Promotion Overview

Hello from Release Engineering! Once a month we highlight one of our projects to help the Mozilla community discover a useful tool or an interesting contribution opportunity. This month's project is Release Build Promotion.

What is Release Build Promotion?

Release build promotion (or "build promotion" or "release promotion" for short) is the latest release pipeline for Firefox being developed by Release Engineering at Mozilla.

Release build promotion starts with the builds produced and tested by CI (e.g. on mozilla-beta or mozilla-release). We take these builds, and use them as the basis to generate all our l10n repacks, partial updates, etc. that are required to release Firefox. We "promote" the CI builds to the release channel.

How is this different?

The previous release pipeline also started with builds produced and tested by CI. However, when it came time to do a release, we would create an entirely new set of builds with slightly different build configuration. These builds would not get the regular CI testing.

Release build promotion improves the process by removing the second set of builds. This drastically improves the total time to do a release, and also increases our confidence in our products, since we are now shipping exactly what's been tested. We also improve the visibility of the release process: all the tasks that make up the release are now reported to Treeherder along with the corresponding CI builds.

Current status

Release build promotion is in use for Firefox desktop starting with the 46 beta cycle. ESR and release branches have not yet been switched over.

Firefox for Android is also not yet handled. We plan to have this ready for Firefox 47.

Some figures

One of the major motivations for this project was to improve our release end-to-end times. I pulled some data to compare:

  • One of the Firefox 45 betas took almost 12 hours
  • One of the Firefox 46 betas took less than 3 hours

What's next?

  • Support Firefox for Android
  • Support release and ESR branches
  • Extend this process back to the aurora and nightly channels

Can I contribute?

Yes! We still have a lot of things to do and welcome everyone to contribute.

  • Bug 1253369 - Notifications on release promotion events.
  • (No bug yet) Redesign and modernize Ship-it to reflect the new release workflow. This will include a new UI, multiple sign-offs, a new release-runner, etc.
  • Tracking bug

More information

For more information, please refer to these other resources about build promotion:

There will be multiple blog posts regarding this project. You have probably seen Jordan's blog post on how to be productive when distributed teams get together. It covers some of the experience we had during the project sprint week in Vancouver.

Rebooting productivity

Every new year gives you an opportunity to sit back, relax, have some scotch and rethink the past year. The holidays give you enough free time; even if you decide not to take a vacation around them, it's usually calm and peaceful.

This time, I found myself thinking mostly about productivity, being effective, feeling busy, overwhelmed with work and other related topics.

When I started at Mozilla (almost 6 years ago!), I tried to apply all my GTD and time management knowledge and techniques. Working remotely and in a different time zone was an advantage - I had close to zero interruptions. It worked perfectly.

Last year I realized that my productivity skills had somehow faded away. 40h+ workweeks, working on weekends, and delivering goals in the last week of the quarter don't sound like good signs. Instead of being productive, I felt busy.

"Every crisis is an opportunity". Time to make a step back and reboot myself. Burning out at work is not a good idea. :)

Here are some ideas/tips that I wrote down for myself; you may find them useful.


  • Task #1: make a daily plan. No plan - no work.
  • Don't start your day by reading emails. Get one (little) thing done first - THEN check your email.
  • Try to define outcomes, not tasks. "Ship XYZ" instead of "Work on XYZ".
  • Meetings are time consuming, so "Set a goal for each meeting". Consider skipping a meeting if you don't have any goal set, unless it's a beer-and-tell meeting! :)
  • Constantly ask yourself if what you're working on is important.
  • 3-4 times a day ask yourself whether you are doing something towards your goal or just finding something else to keep you busy. If you want to look busy, take your phone and walk around the office with some papers in your hand. Everybody will think that you are a busy person! This way you can take a break and look busy at the same time!
  • Take breaks! The Pomodoro technique has this option built in. Taking breaks helps not only to avoid RSI, but also keeps your brain sane and gives you time to ask yourself the questions mentioned above. I use Workrave on my laptop, but you can use a real kitchen timer instead.
  • Wear headphones, especially at the office. Noise-cancelling ones are even better. White noise, nature sounds, or instrumental music are your friends.

(Home) Office

  • Make sure you enjoy your work environment. Why on earth would you spend your valuable time working without joy?!
  • De-clutter and organize your desk. Fewer things around means fewer distractions.
  • Desk, chair, monitor, keyboard, mouse, etc. - don't cheap out on them. Your health is more important and more expensive. Thanks to mhoye for this advice!


  • Don't check email every 30 seconds. If there is an emergency, they will call you! :)
  • Reward yourself at a certain time. "I'm going to have a chocolate at 11am", or "MFBT at 4pm sharp!" are good examples. Don't forget, you are Pavlov's dog too!
  • Don't try to read everything NOW. Save it for later and read it in a batch.
  • Capture all creative ideas. You can delete them later. ;)
  • Prepare for the next task before a break. Make sure you know what's next, so you can think about it during the break.

This is my list of things that I try to use every day. Looking forward to seeing improvements!

I would appreciate your thoughts on this topic. Feel free to comment or send a private email.

Happy Productive New Year!

Funsize enabled for Nightly builds

Keep calm and update Firefox

Note: this post has been sitting in the drafts queue for some reason. Better to publish it. :)

As of Tuesday, Aug 4, 2015, Funsize has been enabled on mozilla-central. From now on all Nightly builds will get updates for builds up to 4 days in the past (yesterday, the day before yesterday, etc.). This should make people who don't run their nightlies every day happier.

Firefox Developer Edition partial updates will be enabled after 42.0 hits mozilla-aurora, but early adopters can use the aurora-funsize channel.

Generating partial updates as a part of builds and L10N repacks on nightly will be disabled as soon as Bug 1173459 is resolved.

As a bonus, you can take a look at the presentation I gave in Whistler during the work week.

Reporting Issues

If you see any issues, please report them to Bugzilla.

Funsize is ready for testing!

Funsize is very close to being enabled in production! It has undergone a Rapid Risk Assessment procedure, which found a couple of potential issues. Most of them are either resolved or waiting for deployment.

To make sure everything works as expected and to catch some last-minute bugs, I added new update channels for Firefox Nightly and Developer Edition. If you are brave enough, you can use the following instructions and change your update channel to either nightly-funsize (for Firefox Nightly) or aurora-funsize (for Firefox Developer Edition).

The TL;DR instructions look like this:

  • Enable update logging. Set app.update.log to true in about:config. This step is optional, but it may help with debugging possible issues.
  • Shut down all running instances of Firefox.
  • Edit defaults/pref/channel-prefs.js in your Firefox installation directory and change the line containing app.update.channel to:

pref("app.update.channel", "aurora-funsize"); // Firefox Developer Edition

or

pref("app.update.channel", "nightly-funsize"); // Firefox Nightly
  • Continue using your Firefox.

You can check your channel in the About dialog and it should contain the channel name:

About dialog

Reporting Issues

If you see any issues, please report them to Bugzilla.

Taskcluster: First Impression

Good news. We decided to redesign Funsize a little and now it uses Taskcluster!

The nature of Funsize is that we may start hundreds of jobs at the same time, then stop sending new jobs and wait for hours. In other words, the service is very bursty. Elastic Beanstalk is not ideal for this use case: scaling up and down very fast is hard to configure using EB-only tools, and running zero instances is not easy.

I tried using Terraform, CloudFormation and Auto Scaling, but they were also not well suited. There were too many constraints (e.g. Terraform doesn't support all the AWS features we need) and they required considerable bespoke setup/maintenance to auto-scale properly.

The next option was Taskcluster, and I was pleased that its design fitted our requirements very well! I was impressed by the simplicity and flexibility offered.

I have implemented a service which consumes Pulse messages for particular buildbot jobs. For nightly builds, it schedules a task graph with three tasks:

  • generate a partial MAR
  • sign it (at the moment a dummy task)
  • publish to Balrog

All tasks are run inside Docker containers, which are published on the registry (other registries can also be used). The task definition essentially comprises the docker image name and a list of commands to run (usually this is a single script inside the docker image). In the same task definition you can specify which artifacts should be published by Taskcluster. The artifacts can be public or private.
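
For illustration, here is an approximate sketch of such a task graph with the three Funsize tasks. This is from memory, so the field names may not match the real Taskcluster schemas exactly, and the image and script names are made up:

import json
import uuid

def new_task_id():
    # Stand-in for Taskcluster slug IDs, normally generated with the slugid library
    return uuid.uuid4().hex[:22]

# Predefined task IDs: generated up front, so dependent tasks and artifact URLs
# can reference them before anything is submitted ("fire and forget")
generate_id, sign_id, publish_id = (new_task_id() for _ in range(3))

def task(task_id, command, requires=()):
    return {
        "taskId": task_id,
        "requires": list(requires),  # graph edges: run only after these tasks
        "task": {
            "payload": {
                "image": "example/funsize-worker",  # user-defined docker image
                "command": command,                 # usually a single script
                "artifacts": {},                    # artifacts to publish (public or private)
            },
        },
    }

graph = {"tasks": [
    task(generate_id, ["/generate-partial.sh"]),
    task(sign_id, ["/sign.sh"], requires=[generate_id]),
    task(publish_id, ["/publish-to-balrog.sh"], requires=[sign_id]),
]}
print(json.dumps(graph, indent=2))  # submit this JSON via the Taskcluster HTTP API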

Things that I really liked

  • Predefined task IDs. This is a great idea! There is no need to talk to the Taskcluster API to get an ID (or multiple IDs for task graphs), nor to parse the response. Fire and forget! The task IDs can be used in different places, like artifact URLs, dependent tasks, etc.
  • Task graphs. A task graph is basically a collection of tasks that can run in parallel and depend on each other. This is a nice way to declare your jobs and know them in advance. If needed, a task graph can be extended dynamically by its own tasks (decision tasks).
  • Simplicity. All you need to do is generate a valid JSON document and submit it to Taskcluster using the HTTP API.
  • User-defined docker images. One of the downsides of Buildbot is that you have a predefined list of slaves with a predefined environment (OS, installed software, etc). Taskcluster leverages Docker by default to let you use your own images.

Things that could be improved

  • Encrypted variables. I spent 2-3 days fighting with the encrypted variables. My scheduler was written in Python, so I tried a half dozen different Python PGP libraries, but for some reason all of them generated an OpenPGP format that Taskcluster could not understand. This forced me to rewrite the scheduling part in Node.js using openpgpjs. There is a bug to address this problem globally. Also, using ISO time stamps would have saved me hours of time. :)
  • It would be great to have a generic scheduler that doesn't require third-party Taskcluster consumers to write their own daemons watching for changes (AMQP, VCS, etc.) to generate tasks. This would lower the entry barrier for beginners.


There are many other things that can be improved (and I believe they will be!) - Taskcluster is still a new project. Regardless, it is very flexible and easy to use and develop with. I would recommend using it!

Many thanks to garndt, jonasfj and lightsofapollo for their support!

Funsize hacking


The idea of using a service which can generate partial updates for Firefox has been around for years. We actually used to have a server called Prometheus that was responsible for generating updates for nightly builds and the generation was done as a separate process from actual builds.

Scaling that solution wasn't easy, so we switched to build-time update generation. Generating updates as a part of builds helped with load distribution, but lacked flexibility: there is no easy way to generate updates after the build, because the update generation process is directly tied to the build or repack process.

Funsize will solve the problems listed above: it distributes load and is flexible.

Last year Anhad started and Mihai continued working on this project. They have done a great job and created a solution that can easily be scaled.

Funsize is split into several pieces:

  • REST API front end powered by Flask. It's responsible for accepting partial generation requests, forwarding them to the queue, and returning generated partials.
  • Celery-based workers to generate partial updates and upload them to S3.
  • SQS or RabbitMQ to coordinate Celery workers.
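
As a rough sketch of how these pieces fit together (a simplified pseudo-service, not the actual Funsize code; the endpoint and function names are invented):

from celery import Celery
from flask import Flask, jsonify, request

app = Flask(__name__)
celery = Celery("funsize", broker="amqp://localhost//")  # RabbitMQ (or SQS) broker

@celery.task
def build_partial(from_mar_url, to_mar_url):
    # The real worker downloads both complete MARs, runs Mozilla's
    # partial-generation tooling, and uploads the result to S3.
    ...

@app.route("/partial", methods=["POST"])
def request_partial():
    payload = request.get_json()
    # Forward the request to the queue; workers pick it up asynchronously
    result = build_partial.delay(payload["from_mar"], payload["to_mar"])
    return jsonify({"task_id": result.id}), 202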

One of the biggest gains of Funsize is that it uses a global cache to speed up partial generation. For example, after we build an en-US Windows build, we ask Funsize to generate a partial. Then a swarm of L10N repacks (almost a hundred of them per platform) tries to do a similar job: every single one asks for a partial update. All L10N builds have something in common, and xul.dll is one of the biggest files. Since the files are identical, there is no reason not to reuse the previously generated binary patch for that file. Repeat 100 times for multiple files. PROFIT!
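
The caching idea itself is simple; here is an illustrative sketch (the names are invented, and in Funsize the cache is shared rather than in-process):

import hashlib

patch_cache = {}  # Funsize uses a global/shared cache; this dict just illustrates the idea

def get_patch(src_bytes, dst_bytes, make_patch):
    # Key on the digests of the source and destination files: identical file pairs
    # across L10N repacks map to the same key and reuse the same binary patch.
    key = (hashlib.sha512(src_bytes).hexdigest(),
           hashlib.sha512(dst_bytes).hexdigest())
    if key not in patch_cache:
        patch_cache[key] = make_patch(src_bytes, dst_bytes)  # expensive, e.g. mbsdiff
    return patch_cache[key]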

The first prototype of Funsize lives at github. If you are interested in hacking, read the docs on how to set up your developer environment. If you don't have an AWS account, it will use a local cache.

Note: this prototype may be redesigned to use Taskcluster, which is going to simplify the initial design and reduce the dependency on always-online infrastructure.

Deploying your code from github to AWS Elastic Beanstalk using Travis

I have been playing with Funsize a lot recently. One of the goals was iterating faster.

I have hit some challenges with both Travis and Elastic Beanstalk.

The first challenge was to run the integration (actually end-to-end) tests in the same environment. Funsize uses Docker for both hacking and production environments. Unfortunately it's not possible to create Docker images as a part of a Travis job (there is an option to run jobs inside Docker, but that is a different beast).

A simple bash script works around this problem: it starts all the services we need in the background and runs the end-to-end tests. The end-to-end test asks Funsize to generate several partial MAR files, downloads the identical files from Mozilla's FTP server, and compares their content, skipping the cryptographic signature (Funsize does not sign MAR files).
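
The comparison step could look roughly like this (a sketch, assuming the mar tool is on PATH; the signatures live in the MAR header, so comparing the extracted contents skips them):

import hashlib
import pathlib
import subprocess
import tempfile

def mar_digests(mar_path):
    # Extract the MAR into a scratch directory and hash every contained file
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(["mar", "-x", str(pathlib.Path(mar_path).resolve())],
                       cwd=workdir, check=True)
        return {
            str(p.relative_to(workdir)): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in pathlib.Path(workdir).rglob("*") if p.is_file()
        }

def same_contents(generated_mar, reference_mar):
    return mar_digests(generated_mar) == mar_digests(reference_mar)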

The next challenge was deploying the code. We use Elastic Beanstalk as a convenient way to run simple services. There is a plan to use something else for Funsize, but at the moment it's Elastic Beanstalk.

Travis has support for Elastic Beanstalk, but it's still experimental and at the time of writing this post there was no documentation on the official website. The .travis.yml file looks straightforward and worked fine. The only minor issue I hit was a long commit message.

# .travis.yml snippet
deploy:
    - provider: elasticbeanstalk
      app: funsize # Elastic Beanstalk app name
      env: funsize-dev-rail # Elastic Beanstalk env name
      bucket_name: elasticbeanstalk-us-east-1-314336048151 # S3 bucket used by Elastic Beanstalk
      region: us-east-1
      access_key_id:
        secure: "encrypted key id"
      secret_access_key:
        secure: "encrypted key"
      on:
        repo: rail/build-funsize # Deploy only using my user repo for now
        all_branches: true
        # deploy only if a particular job in the job matrix passes, not any
        condition: $FUNSIZE_S3_UPLOAD_BUCKET = mozilla-releng-funsize-travis

Having the credentials in a public version control system, even if they are encrypted, makes me very nervous. To minimize possible harm in case something goes wrong, I created a separate user in AWS IAM. I couldn't find any decent docs on what permissions a user should have to be able to deploy something to Elastic Beanstalk, and it took a while to figure out this minimal set of permissions. Even with these permissions the user looks very powerful, with (limited) access to EB, S3, EC2, Auto Scaling and CloudFormation.

Conclusion: using Travis for Elastic Beanstalk deployments is quite stable and easy to use (after the initial setup), unless you are paranoid about encrypted credentials being available on github.

Firefox builds are way cheaper now!

Releng has been successfully reducing the Amazon bill recently. We managed to drop the bill from $115K to $75K per month in February.

To make this happen we switched to a cheaper instance type, started using spot instances for regular builds and started bidding for spot instances smarter. Introducing the jacuzzi approach reduced the load by reducing build times.

More details below.


When we first tried to switch from m3.xlarge ($0.45 per hour) to c3.xlarge ($0.30 per hour), we hit an interesting issue where ruby wouldn't execute anything -- segmentation fault. It turned out that using paravirtual kernels on the c3 instance type is not a good idea, since this instance type "prefers" HVM virtualization, unlike the m3 instance types.

Massimo did a great job and ported our AMIs from PV to HVM.

This switch from PV to HVM went pretty smoothly, except that we had to add a swap file, because linking libxul requires a lot of memory and the existing 7.5G wasn't enough.

This transition saved us more or less 1/3 of the build pool bill.

Smarter spot bidding

We used to bid blindly:

- I want this many spot instances in this availability zone and this is my maximum price. Bye!

- But the current price is soooo high! ... Hello? Are you there?.. Sigh...

- Hmm, where are my spot instances?! I want twice as many spot instances in this zone and this is my maximum price. Bye!

Since we switched to a much smarter way to bid for spot instances we improved our responsiveness to the load so much that we had to slow down our instance start ramp up. :)

As a part of this transition we reduced the number of on-demand builders from 400 to 100!

Additionally, now we can use different instance types for builders and sometimes get good prices for the c3.2xlarge instance type.
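
A toy version of such a bidding decision (this is not our actual algorithm, just an illustration using boto3) could look at current prices across availability zones and instance types and pick the cheapest option under a price cap:

import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2", region_name="us-east-1")
CANDIDATES = ["c3.xlarge", "c3.2xlarge"]  # builder-capable instance types
MAX_PRICE = 0.30                          # illustrative USD/hour cap

def best_spot_choice():
    resp = ec2.describe_spot_price_history(
        InstanceTypes=CANDIDATES,
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    )
    # Keep only the most recent price per (availability zone, instance type)
    latest = {}
    for item in resp["SpotPriceHistory"]:
        key = (item["AvailabilityZone"], item["InstanceType"])
        if key not in latest or item["Timestamp"] > latest[key]["Timestamp"]:
            latest[key] = item
    # Filter out overpriced offers and return the cheapest remaining one
    affordable = [(float(i["SpotPrice"]), k) for k, i in latest.items()
                  if float(i["SpotPrice"]) < MAX_PRICE]
    return min(affordable) if affordable else None  # (price, (az, type)) or None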

Less EBS

As a part of the s/m3.xlarge/c3.xlarge/ transition we also introduced a couple of other improvements:

  • Reduced EBS storage use
  • Started using SSD instance storage for try builds. All your try builds are on SSDs now! Using instance storage is not an easy thing, so we had to re-invent our own wheel to format/mount the storage on boot, as sketched below.
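
The boot-time formatting/mounting dance is conceptually simple; here is a minimal sketch (the device name and mount point vary by AMI and use case, and this must run as root):

import subprocess

DEVICE = "/dev/xvdb"     # a typical first instance-store device on EC2
MOUNT_POINT = "/builds"  # hypothetical mount point for build directories

def ensure_instance_storage():
    # Format the ephemeral device only if it doesn't already carry a filesystem
    probe = subprocess.run(["blkid", DEVICE], capture_output=True)
    if probe.returncode != 0:
        subprocess.run(["mkfs.ext4", "-q", DEVICE], check=True)
    subprocess.run(["mkdir", "-p", MOUNT_POINT], check=True)
    subprocess.run(["mount", DEVICE, MOUNT_POINT], check=True)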