In Chrome M117 we introduced a new optimizing compiler: Maglev. Maglev sits between our existing Sparkplug and TurboFan compilers, and fills the role of a fast optimizing compiler that generates good enough code, fast enough.
All the benchmark scores in this post were measured with Chrome 117.0.5897.3 on a 13” M2 Macbook Air.
Since the difference in execution speed and compile time between Ignition and TurboFan is so large, in 2021 we introduced a new baseline JIT called Sparkplug. It’s designed to compile bytecode to equivalent machine code almost instantaneously.
The simplicity of Sparkplug imposes a relatively low upper limit on the speedup it can provide though. This is clearly demonstrated by the large gap between Ignition + Sparkplug and Ignition + TurboFan.
This is where Maglev comes in, our new optimizing JIT that generates code that’s much faster than Sparkplug code, but is generated much faster than TurboFan can.
Maglev: A Simple SSA-Based JIT compiler #
When we started this project we saw two paths forward to cover the gap between Sparkplug and TurboFan: either try to generate better code using the single-pass approach taken by Sparkplug, or build a JIT with an intermediate representation (IR). Since we felt that not having an IR at all during compilation would likely severely restrict the compiler, we decided to go with a somewhat traditional static single-assignment (SSA) based approach, using a CFG (control flow graph) rather than TurboFan's more flexible but cache unfriendly sea-of-nodes representation.
First Maglev does a prepass over the bytecode to find branch targets, including loops, and assignments to variables in loop. This pass also collects liveness information, encoding which values in which variables are still needed across which expressions. This information can reduce the amount of state that needs to be tracked by the compiler later.
Maglev does an abstract interpretation of the frame state, creating SSA nodes representing the results of expression evaluation. Variable assignments are emulated by storing those SSA nodes in the respective abstract interpreter register. In the case of branches and switches, all paths are evaluated.
When multiple paths merge, values in abstract interpreter registers are merged by inserting so-called Phi nodes: value nodes that know which value to pick depending on which path was taken at runtime.
Loops can merge variable values “back in time”, with the data flowing backwards from the loop end to the loop header, in the case when variables are assigned in the loop body. That’s where the data from the prepass comes in handy: since we already know which variables are assigned inside loops, we can pre-create loop phis before we even start processing the loop body. At the end of the loop we can populate the phi input with the correct SSA node. This allows the SSA graph generation to be a single forward pass, without needing to "fix up" loop variables, while also minimizing the amount of Phi nodes that need to be allocated.
Known Node Information #
During graph building Maglev will look at runtime feedback metadata collected during unoptimized execution, and generate specialized SSA nodes for the types observed. If Maglev sees
o.x and knows from the runtime feedback that
o always has one specific shape, it will generate an SSA node to check at runtime that
o still has the expected shape, followed by a cheap
LoadField node which does a simple access by offset.
Additionally, Maglev will make a side node that it now knows the shape of
o, making it unnecessary to check the shape again later. If Maglev later encounters an operation on
o that doesn't have feedback for some reason, this kind of information learned during compilation can be used as a second source of feedback.
Runtime information can come in various forms. Some information needs to be checked at runtime, like the shape check previously described. Other information can be used without runtime checks by registering dependencies to the runtime. Globals that are de-facto constant (not changed between initialization and when their value is seen by Maglev) fall into this category: Maglev does not need to generate code to dynamically load and check their identity. Maglev can load the value at compile time and embed it directly into the machine code; if the runtime ever mutates that global, it'll also take care to invalidate and deoptimize that machine code.
Some forms of information are “unstable”. Such information can only be used to the extent that the compiler knows for sure that it can’t change. For example, if we just allocated an object, we know it’s a new object and we can skip expensive write barriers entirely. Once there has been another potential allocation, the garbage collector could have moved the object, and we now need to emit such checks. Others are "stable": if we have never seen any object transition away from having a certain shape, then we can register a dependency on this event (any object transitioning away from that particular shape) and don’t need to recheck the shape of the object, even after a call to an unknown function with unknown side effects.
Given that Maglev can use speculative information that it checks at runtime, Maglev code needs to be able to deoptimize. To make this work, Maglev attaches abstract interpreter frame state to nodes that can deoptimize. This state maps interpreter registers to SSA values. This state turns into metadata during code generation, providing a mapping from optimized state to unoptimized state. The deoptimizer interprets this data, reading values from the interpreter frame and machine registers and putting them into the required places for interpretation. This builds on the same deoptimization mechanism as used by TurboFan, allowing us to share most of the logic and take advantage of the testing of the existing system.
Representation Selection #
Maglev learns about the representation of SSA nodes mainly by looking at runtime feedback of e.g., binary operations, and propagating that information forwards through the Known Node Info mechanism. When SSA values with specific representations flow into Phis, a correct representation that supports all the inputs needs to be chosen. Loop phis are again tricky, since inputs from within the loop are seen after a representation should be chosen for the phi — the same "back in time" problem as for graph building. This is why Maglev has a separate phase after graph building to do representation selection on loop phis.
Register Allocation #
After graph building and representation selection, Maglev mostly knows what kind of code it wants to generate, and is "done" from a classical optimization point of view. To be able to generate code though, we need to choose where SSA values actually live when executing machine code; when they're in machine registers, and when they're saved on the stack. This is done through register allocation.
Each Maglev node has input and output requirements, including requirements on temporaries needed. The register allocator does a single forward walk over the graph, maintaining an abstract machine register state not too dissimilar from the abstract interpretation state maintained during graph building, and will satisfy those requirements, replacing the requirements on the node with actual locations. Those locations can then be used by code generation.
First, a prepass runs over the graph to find linear live ranges of nodes, so that we can free up registers once an SSA node isn’t needed anymore. This prepass also keeps track of the chain of uses. Knowing how far in the future a value is needed can be useful to decide which values to prioritize, and which to drop, when we run out of registers.
After the prepass, the register allocation runs. Register assignment follows some simple, local rules: If a value is already in a register, that register is used if possible. Nodes keep track of what registers they are stored into during the graph walk. If the node doesn’t yet have a register, but a register is free, it’s picked. The node gets updated to indicate it’s in the register, and the abstract register state is updated to know it contains the node. If there’s no free register, but a register is required, another value is pushed out of the register. Ideally, we have a node that’s already in a different register, and can drop this "for free"; otherwise we pick a value that won’t be needed for a long time, and spill it onto the stack.
On branch merges, the abstract register states from the incoming branches are merged. We try to keep as many values in registers as possible. This can mean we need to introduce register-to-register moves, or may need to unspill values from the stack, using moves called “gap moves”. If a branch merge has a phi node, register allocation will assign output registers to the phis. Maglev prefers to output phis to the same registers as its inputs, to minimize moves.
If more SSA values are live than we have registers, we’ll need to spill some values on the stack, and unspill them later. In the spirit of Maglev, we keep it simple: if a value needs to be spilled, it is retroactively told to immediately spill on definition (right after the value is created), and code generation will handle emitting the spill code. The definition is guaranteed to ‘dominate’ all uses of the value (to reach the use we must have passed through the definition and therefore the spill code). This also means that a spilled value will have exactly one spill slot for the entire duration of the code; values with overlapping lifetimes will thus have non-overlapping assigned spill slots.
Due to representation selection, some values in the Maglev frame will be tagged pointers, pointers that V8’s GC understands and needs to consider; and some will be untagged, values that the GC should not look at. TurboFan handles this by precisely keeping track of which stack slots contain tagged values, and which contain untagged values, which changes during execution as slots are reused for different values. For Maglev we decided to keep things simpler, to reduce the memory required for tracking this: we split the stack frame into a tagged and an untagged region, and only store this split point.
Code Generation #
Once we know what expressions we want to generate code for, and where we want to put their outputs and inputs, Maglev is ready to generate code.
Maglev nodes directly know how to generate assembly code using a “macro assembler”. For example, a
CheckMap node knows how to emit assembler instructions that compare the shape (internally called the “map”) of an input object with a known value, and to deoptimize the code if the object had a wrong shape.
One slightly tricky bit of code handles gap moves: The requested moves created by the register allocator know that a value lives somewhere and needs to go elsewhere. If there’s a sequence of such moves though, a preceding move could clobber the input needed by a subsequent move. The Parallel Move Resolver computes how to safely perform the moves so that all values end up in the right place.
So the compiler we just presented is both clearly much more complex than Sparkplug, and much simpler than TurboFan. How does it fare?
In terms of compilation speed we’ve managed to build a JIT that’s roughly 10x slower than Sparkplug, and 10x faster than TurboFan.
This allows us to deploy Maglev much earlier than we’d want to deploy TurboFan. If the feedback it relied upon ended up not being very stable yet, there’s no huge cost to deoptimizing and recompiling later. It also allows us to use TurboFan a little later: we’re running much faster than we’d run with Sparkplug.
Slotting in Maglev between Sparkplug and TurboFan results in noticeable benchmark improvements:
We have also validated Maglev on real-world data, and see good improvements on Core Web Vitals.
Since Maglev compiles much faster, and since we can now afford to wait longer before we compile functions with TurboFan, this results in a secondary benefit that’s not as visible on the surface. The benchmarks focus on main-thread latency, but Maglev also significantly reduces V8’s overall resource consumption by using less off-thread CPU time. The energy consumption of a process can be measured easily on an M1- or M2-based Macbook using
Maglev isn’t complete by any means. We've still got plenty more work to do, more ideas to try out, and more low-hanging fruit to pick — as Maglev gets more complete, we’ll expect to see higher scores, and more reduction in energy consumption.
Maglev is now available for desktop Chrome now, and will be rolled out to mobile devices soon.