Flake bisect

Flaky tests are reported in a separate step on the bots (example build).

Each test log provides a pre-filled command line for triggering an automated flake bisect, like:

Trigger flake bisect on command line:
echo '{"bisect_buildername": "V8 Linux64 - verify csa", "bisect_mastername": "client.v8", "build_config": "Release", "extra_args": [], "isolated_name": "bot_default", "swarming_dimensions": ["cpu:x86-64", "gpu:none", "os:Ubuntu-14.04", "pool:Chrome"], "test_name": "inspector/runtime/command-line-api-without-side-effects", "timeout_sec": 60, "to_revision": "7f51fdac5bc8bf28b30904e1601819b356187b43", "total_timeout_sec": 120, "variant": "nooptimization"}' | buildbucket.py put -b luci.v8.try -n v8_flako -p -

Before triggering flake bisects for the first time, users must log in with a google.com account:

depot-tools-auth login https://cr-buildbucket.appspot.com

Then execute the provided command, which returns a build URL running flake bisect (example).

If you’re in luck, bisection points you to a suspect. If not, you might want to read further…

Detailed description

For technical details, see also the implementation tracker bug. The flake bisect approach has the same intentions as findit, but uses a different implementation.

How does it work?

A bisect job has three phases: calibration, backwards bisection and inwards bisection. During calibration, testing is repeated, doubling the total timeout (or the number of repetitions) each time, until enough flakes are detected in one run. Then, backwards bisection doubles the git range until a revision without flakes is found. Finally, we bisect into the range between the good revision and the oldest bad one. Note that bisection doesn't produce new build products; it is purely based on builds previously created on V8's continuous infrastructure.
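The three phases can be illustrated with a small simulation. This is only a sketch of the idea, not the actual bisect implementation: the Runner callback, the revision offsets (0 is to_revision, larger numbers are older commits), the starting repetition count and the safety caps are all assumptions made for the example. The threshold of four flakes is the calibration figure mentioned in the tips on this page.

```python
from typing import Callable, Tuple

# Hypothetical stand-in for a test run: given a revision offset (commits
# before to_revision) and a repetition count, return the number of flakes.
Runner = Callable[[int, int], int]

MIN_FLAKES = 4  # calibration threshold (assumed constant for this sketch)


def calibrate(run: Runner, repetitions: int = 10, max_repetitions: int = 10000) -> int:
    """Double the repetitions until one run at to_revision shows enough flakes."""
    while repetitions <= max_repetitions:
        if run(0, repetitions) >= MIN_FLAKES:
            return repetitions
        repetitions *= 2
    raise RuntimeError("calibration failed: flake could not be reproduced")


def bisect_backwards(run: Runner, repetitions: int, max_offset: int = 4096) -> Tuple[int, int]:
    """Double the range until a flake-free revision is found.

    Returns (oldest known bad offset, good offset).
    """
    bad, offset = 0, 1
    while offset <= max_offset:
        if run(offset, repetitions) == 0:
            return bad, offset
        bad, offset = offset, offset * 2
    raise RuntimeError("no good revision found in range")


def bisect_inwards(run: Runner, repetitions: int, bad: int, good: int) -> int:
    """Narrow the (bad, good) range until the two revisions are adjacent."""
    while good - bad > 1:
        mid = (bad + good) // 2
        if run(mid, repetitions) > 0:  # any flake marks the revision as bad
            bad = mid
        else:
            good = mid
    return bad  # oldest bad revision, i.e. the suspect


def find_suspect(run: Runner) -> int:
    repetitions = calibrate(run)
    bad, good = bisect_backwards(run, repetitions)
    return bisect_inwards(run, repetitions, bad, good)


def fake_runner(offset: int, repetitions: int) -> int:
    """Deterministic toy runner: revisions up to offset 37 flake in 10% of runs."""
    return repetitions // 10 if offset <= 37 else 0


if __name__ == "__main__":
    print(find_suspect(fake_runner))  # prints 37
```

With the toy runner, calibration settles on 40 repetitions, backwards bisection brackets the flake between offsets 32 and 64, and inwards bisection narrows that down to offset 37.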

Bisection fails when…

Properties for customizing flake bisect

Properties you won’t need to change

Tips and tricks

Bisecting a hanging test (e.g. deadlock)

If a failing run times out while a passing run finishes very quickly, it is useful to lower the timeout_sec parameter, so that bisection is not delayed waiting for the hanging runs to time out. E.g. if a pass is usually reached in <1 second, set the timeout to something small, e.g. 5 seconds.
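One way to apply this is to patch the JSON properties from the pre-filled command before piping them to buildbucket.py. The following is a minimal Python sketch; the helper override_properties is hypothetical and not part of any V8 tooling, and the properties are abbreviated from the example command at the top of this page.

```python
import json


def override_properties(properties_json: str, **overrides) -> str:
    """Return the bisect properties as JSON with the given keys replaced.

    Hypothetical helper for illustration only.
    """
    properties = json.loads(properties_json)
    properties.update(overrides)
    return json.dumps(properties, sort_keys=True)


# Abbreviated properties copied from a test log's pre-filled command.
prefilled = (
    '{"test_name": "inspector/runtime/command-line-api-without-side-effects", '
    '"timeout_sec": 60, "total_timeout_sec": 120}'
)

# Passing runs finish in under a second, so cap each run at 5 seconds.
patched = override_properties(prefilled, timeout_sec=5)
print(patched)
```

The printed JSON can then take the place of the quoted string in the echo command.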

Getting more confidence on a suspect

In some runs, confidence is very low. E.g. calibration is satisfied if four flakes are seen in one run, and during bisection every run with one or more flakes is counted as bad. In such cases it might be useful to restart the bisect job with to_revision set to the suspected culprit and a higher number of repetitions or a higher total timeout than in the original job, and to confirm that the same conclusion is reached again.
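To get a feeling for why more repetitions help, the following back-of-the-envelope sketch estimates how often a flaky revision would be misjudged as good. It assumes that runs fail independently with a fixed flake rate, which is a simplification and not something the bisect tooling guarantees; the function names and the 1% flake rate are illustrative.

```python
import math


def false_good_probability(flake_rate: float, repetitions: int) -> float:
    """Probability that `repetitions` independent runs show no flake at all."""
    return (1.0 - flake_rate) ** repetitions


def repetitions_for_confidence(flake_rate: float, max_false_good: float) -> int:
    """Smallest repetition count whose false-good probability is at most max_false_good."""
    return math.ceil(math.log(max_false_good) / math.log(1.0 - flake_rate))


if __name__ == "__main__":
    # Example: a test that flakes in roughly 1% of runs.
    print(false_good_probability(0.01, 100))
    print(repetitions_for_confidence(0.01, 0.01))  # prints 459
```

Under these assumptions, a revision with a 1% flake rate needs several hundred repetitions before a flake-free run becomes convincing evidence that the revision is good.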

Working around known timeout issues on Windows

Sometimes the overall timeout option doesn’t work on Windows. In this case it’s best to estimate a fitting number of repetitions and set total_timeout_sec to 0.
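A rough way to pick the repetition count is to divide the time budget you would otherwise have given to total_timeout_sec by the typical duration of one run. The sketch below assumes that the property holding the repetition count is named repetitions; check the properties of your bisect job before relying on that name. The durations are made-up example values.

```python
import math


def estimate_repetitions(time_budget_sec: float, avg_run_sec: float) -> int:
    """Number of runs that fit into the time budget, at least one."""
    if avg_run_sec <= 0:
        raise ValueError("avg_run_sec must be positive")
    return max(1, math.floor(time_budget_sec / avg_run_sec))


# Example values only: a 120 second budget and runs taking about 0.5 seconds.
overrides = {
    "total_timeout_sec": 0,  # disable the overall timeout, as suggested above
    "repetitions": estimate_repetitions(120, 0.5),  # assumed property name
}
print(overrides)
```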

Test behavior depending on random seed

Rarely, a code path is only triggered with a particular random seed. In this case it might be beneficial to pin the seed using extra_args, e.g. "extra_args": ["--random-seed=123"]. Otherwise, the stress runner uses different random seeds throughout. Note though that a particular random seed might reproduce a problem in one revision, but not in another.
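When editing the properties by script, it helps to make sure that only one seed flag ends up in extra_args. A minimal sketch, assuming the properties have already been parsed into a Python dict; pin_random_seed is a hypothetical helper name:

```python
import json


def pin_random_seed(properties: dict, seed: int) -> dict:
    """Return a copy of the properties with exactly one --random-seed flag."""
    extra_args = [
        arg
        for arg in properties.get("extra_args", [])
        if not arg.startswith("--random-seed=")  # drop any previously set seed
    ]
    extra_args.append(f"--random-seed={seed}")
    return {**properties, "extra_args": extra_args}


# Abbreviated properties from the example command at the top of this page.
properties = {
    "test_name": "inspector/runtime/command-line-api-without-side-effects",
    "extra_args": [],
}
pinned = pin_random_seed(properties, 123)
print(json.dumps(pinned, sort_keys=True))
```

The original dict is left untouched, so the same base properties can be reused to try several seeds.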