Using V8’s sample-based profiler

V8 has built-in sample-based profiling. Profiling is turned off by default, but can be enabled via the --prof command-line option. The sampler records stacks of both JavaScript and C/C++ code.

Build #

Build the d8 shell following the instructions at Building with GN.

Command line #

To start profiling, use the --prof option. When profiling, V8 generates a v8.log file which contains profiling data.

Windows:

build\Release\d8 --prof script.js

Other platforms (replace ia32 with x64 if you want to profile the x64 build):

out/ia32.release/d8 --prof script.js
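
If you don’t already have a workload to profile, any CPU-bound script works. Here is a minimal, hypothetical script.js (not part of V8) that gives the sampler something to record:

// script.js — a hypothetical CPU-bound workload for the profiler to sample.
function busyWork(n) {
  let sum = 0;
  for (let i = 1; i <= n; i++) {
    sum += Math.sqrt(i); // keep the CPU busy so the sampler collects ticks
  }
  return sum;
}
console.log(busyWork(1e8));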

Process the generated output #

Log file processing is done using JS scripts run by the d8 shell. For this to work, a d8 binary (or symlink, or d8.exe on Windows) must be in the root of your V8 checkout, or in the path specified by the environment variable D8_PATH. Note: this binary is only used to process the log, not for the actual profiling, so which version it is doesn’t matter.

Make sure the d8 binary used for analysis was not built with is_component_build!
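
For example, assuming the analysis binary lives in out/x64.release (adjust the path to match your build), either of the following, run from the root of the checkout, makes it visible to the processing scripts:

# Option 1: symlink d8 into the root of the V8 checkout.
ln -s out/x64.release/d8 d8

# Option 2: point D8_PATH at the directory containing the d8 binary.
export D8_PATH=$PWD/out/x64.release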

Windows:

tools\windows-tick-processor.bat v8.log

Linux:

tools/linux-tick-processor v8.log

macOS:

tools/mac-tick-processor v8.log

Web UI for --prof #

Preprocess the log with --preprocess (to resolve C++ symbols, etc.).

$V8_PATH/tools/linux-tick-processor --preprocess > v8.json

Open tools/profview/index.html in your browser and select the v8.json file there.

Example output #

Statistical profiling result from benchmarks\v8.log, (4192 ticks, 0 unaccounted, 0 excluded).

 [Shared libraries]:
   ticks  total  nonlib   name
      9    0.2%    0.0%  C:\WINDOWS\system32\ntdll.dll
      2    0.0%    0.0%  C:\WINDOWS\system32\kernel32.dll

 [JavaScript]:
   ticks  total  nonlib   name
    741   17.7%   17.7%  LazyCompile: am3 crypto.js:108
    113    2.7%    2.7%  LazyCompile: Scheduler.schedule richards.js:188
    103    2.5%    2.5%  LazyCompile: rewrite_nboyer earley-boyer.js:3604
    103    2.5%    2.5%  LazyCompile: TaskControlBlock.run richards.js:324
     96    2.3%    2.3%  Builtin: JSConstructCall
    ...

 [C++]:
   ticks  total  nonlib   name
     94    2.2%    2.2%  v8::internal::ScavengeVisitor::VisitPointers
     33    0.8%    0.8%  v8::internal::SweepSpace
     32    0.8%    0.8%  v8::internal::Heap::MigrateObject
     30    0.7%    0.7%  v8::internal::Heap::AllocateArgumentsObject
    ...


 [GC]:
   ticks  total  nonlib   name
    458   10.9%

 [Bottom up (heavy) profile]:
  Note: percentage shows a share of a particular caller in the total
  amount of its parent calls.
  Callers occupying less than 2.0% are not shown.

   ticks parent  name
    741   17.7%  LazyCompile: am3 crypto.js:108
    449   60.6%    LazyCompile: montReduce crypto.js:583
    393   87.5%      LazyCompile: montSqrTo crypto.js:603
    212   53.9%        LazyCompile: bnpExp crypto.js:621
    212  100.0%          LazyCompile: bnModPowInt crypto.js:634
    212  100.0%            LazyCompile: RSADoPublic crypto.js:1521
    181   46.1%        LazyCompile: bnModPow crypto.js:1098
    181  100.0%          LazyCompile: RSADoPrivate crypto.js:1628
    ...

Profiling web applications #

Today’s highly optimized virtual machines can run web apps at blazing speed, but you shouldn’t rely on them alone to achieve great performance: a carefully optimized algorithm or a less expensive function can often yield many-fold speed improvements on all browsers. Chrome DevTools’ CPU Profiler helps you analyze your code’s bottlenecks. But sometimes you need to go deeper and more granular: this is where V8’s internal profiler comes in handy.

Let’s use that profiler to examine the Mandelbrot explorer demo that Microsoft released together with IE10. After the demo release, V8 fixed a bug that slowed down the computation unnecessarily (hence the poor performance of Chrome in the demo’s blog post) and further optimized the engine, implementing a faster exp() approximation than what the standard system libraries provide. Following these changes, the demo ran 8× faster than previously measured in Chrome.

But what if you want the code to run faster on all browsers? You should first understand what keeps your CPU busy. Run Chrome (Windows and Linux Canary) with the following command-line switches, which cause it to output profiler tick information (in the v8.log file) for the URL you specify, which in our case was a local version of the Mandelbrot demo without web workers:

./chrome --js-flags='--prof' --no-sandbox 'http://localhost:8080/'

When preparing the test case, make sure it begins its work immediately upon load, and close Chrome when the computation is done (hit Alt+F4), so that you only have the ticks you care about in the log file. Also note that web workers aren’t yet profiled correctly with this technique.
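
If the page doesn’t already start on its own, one hypothetical way to kick off the work on load (startComputation is an illustrative name, not from the demo’s actual code):

// Start the computation as soon as the page has loaded, so the log
// contains only the ticks you care about.
window.addEventListener('load', () => {
  startComputation();
});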

Then, process the v8.log file with the tick-processor script that ships with V8 (or the web UI described above):

v8/tools/linux-tick-processor v8.log

Here’s an interesting snippet of the processed output that should catch your attention:

Statistical profiling result from null, (14306 ticks, 0 unaccounted, 0 excluded).
 [Shared libraries]:
   ticks  total  nonlib   name
   6326   44.2%    0.0%  /lib/x86_64-linux-gnu/libm-2.15.so
   3258   22.8%    0.0%  /.../chrome/src/out/Release/lib/libv8.so
   1411    9.9%    0.0%  /lib/x86_64-linux-gnu/libpthread-2.15.so
     27    0.2%    0.0%  /.../chrome/src/out/Release/lib/libwebkit.so

The top section shows that V8 is spending more time inside an OS-specific system library than in its own code. Let’s look at what’s responsible for it by examining the “bottom up” output section, where you can read indented lines as “was called by” (and lines starting with a * mean that the function has been optimized by TurboFan):

[Bottom up (heavy) profile]:
  Note: percentage shows a share of a particular caller in the total
  amount of its parent calls.
  Callers occupying less than 2.0% are not shown.

   ticks parent  name
   6326   44.2%  /lib/x86_64-linux-gnu/libm-2.15.so
   6325  100.0%    LazyCompile: *exp native math.js:91
   6314   99.8%      LazyCompile: *calculateMandelbrot http://localhost:8080/Demo.js:215

More than 44% of the total time is spent executing the exp() function inside a system library! Add the overhead of calling into that library from the compiled JavaScript, and about two thirds of the overall time is spent evaluating Math.exp().

If you look at the JavaScript code, you’ll see that exp() is used solely to produce a smooth grayscale palette. There are countless ways to produce a smooth grayscale palette, but let’s suppose you really, really like exponential gradients. This is where algorithmic optimization comes into play.

You’ll notice that exp() is called with an argument in the range -4 < x < 0, so we can safely replace it with its Taylor approximation for that range, which delivers the same smooth gradient with only a multiplication and a couple of divisions:

exp(x) ≈ 1 / ( 1 - x + x * x / 2) for -4 < x < 0
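
A sketch of that replacement in JavaScript (fastExp is our own illustrative name, not something from the demo), together with a quick check of the approximation against Math.exp() over the relevant range:

// exp(x) ≈ 1 / (1 - x + x * x / 2) for -4 < x < 0
function fastExp(x) {
  return 1 / (1 - x + x * x / 2);
}

// Compare the approximation against Math.exp() across that range.
for (let x = -4; x < 0; x += 0.5) {
  console.log(x.toFixed(1), Math.exp(x).toFixed(5), fastExp(x).toFixed(5));
}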

Tweaking the algorithm this way boosts performance by an extra 30% compared to the latest Canary, and by 5× compared to the system-library-based Math.exp() on Chrome Canary.

This example shows how V8’s internal profiler can help you go deeper into understanding your code bottlenecks, and that a smarter algorithm can push performance even further.

To find out more about benchmarks that represent today’s complex and demanding web applications, read How V8 measures real-world performance.