Undef, an Introduction

undef is a curiosity and an enigma. It's a strange complexity that most people would rather not think about, including many who are unwilling to give it up. It's a fascinating glimpse into the mind of a compiler.

This post begins an evolving series of posts on the topic of undef. It's a work in progress right now; feedback is welcome!

Source material

The two compiler communities I'm aware of having grappled with the concept of undef are LLVM, in the Undefined Values section of LLVM's LangRef, and Firm, which calls it "Unknown", in the Unknown section of the Firm Node Types documentation, and in more detail in this article dedicated to Unknown and Undefined Behavior in Firm. If other compilers have thoughts on this topic, I'd appreciate pointers to them.

LLVM's Code Generator has an operator called IMPLICIT_DEF; it is equivalent to undef, despite the semantics that the name might suggest.

Firm's Unknown differs from LLVM's undef in that Firm says that "An unknown value always results in the same value", and at several points they make it clear that one can assign a variable with the value Unknown and expect to get the same value out of it each time it is read. Firm guarantees that "x-x" is always 0, even if x is Unknown. In contrast, LLVM says that each use of an undef value could see a different actual value. (Of course, LLVM can still choose to fold "x-x" to zero, because in performing that fold it picks values for both uses simultaneously).

Firm's version of this concept has two main disadvantages. First, since all uses of an Unknown value must see the same value, the Unknown value must occupy storage in the machine. It may take up a register, and it may require stack spills and reloads, which means it may incur non-trivial runtime overhead. Second, Firm's Unknown is semantically odd: It's an operator with no operands and no dependencies on external state, which would ordinarily mean that it has no problem being moved to any point in the program where it dominates its uses. However, sinking an Unknown value from outside a loop into a loop would (e.g. to reduce register pressure) change the program from seeing the same value each iteration to potentially seeing a different value every iteration. In terms of code motion, this makes Unknown unlike any other construct in the IL.

LLVM's undef introduces the idea of a value which has no "live range". It is explicitly allowed to appear to have a different value each time it is read. This property is sometimes described as the value "fluctuating". This property allows it to avoid both the runtime overhead that Firm's Unknown implies, and avoids the awkward code motion constraint. It allows undef to act very much like a constant with respect to code motion. In LLVM's C++ API, the UndefValue class is even a subset of the Constant class for this reason.

While LLVM's LangRef specifically covers how this implies that xor with an undef operand returns undef, as of this post LLVM still has a long-standing hack which folds xor(undef, undef) to zero, to work around what was claimed at the time to be a common misuse.

LLVM's undef also introduces the idea of an undef value existing on a per-bit basis and even on an arbitrary value constraint basis. or(undef, 1) returns a value with 1 in the least significant bit, and undef in all the remaining bits. select(undef, 42, 58) is an undef constrained to fluctuate among the values 42 and 58. Extrapolating from this idea, (((undef%3)+7)*4)/3 must be a value which fluctuates among the values 9, 10, and 12. It's easy to extend this idea to create an undef of any type constrained to any arbitrary set of values within that type.

This fluctuating with arbitrarily constrained undef values also follows from the desire to allow a register allocator to rematerialize expressions in order to reduce register pressure. When an expression with an undef input is duplicated, it will start from a new undef value and recompute the expression on top of it.

Undef isn't limited to values either. An undef value can be used as the condition of a conditional branch instruction. And since it's desirable for semantic sanity to say that a conditional branch with a phi is semantically equivalent to a select (provided that the code otherwise fits the pattern), and since it's undesirable to say that branches may have side effects in any case, this must imply that the entire virtual machine then begins to fluctuate between taking both sides of the branch. And since this is a Turing-class machine, it's possible that some paths through the program are non-terminating, while others may still be terminating, so a branch on an undef with a phi that merges the two sides back again can even be a value which fluctuates between terminating and non-terminating.

The interpretation that undef can induce non-deterministic execution was also discovered by VellVM, an impressive formalization of a subset of LLVM IR. They ended up devoting several pages of their paper non-deterministic evaluation semantics in order to to accommodate undef.

The origins of LLVM's undef

The earliest writings about LLVM's undef are in an old LLVM note from Chris Lattner's LLVM notes pages. Note that this page uses the phrase "undefined behavior" in a manner inconsistent with was added to LLVM; this appears to be a trivial error. That aside, the content of this post remains consistent with the undef concept which is in LLVM today.

LLVM's undef is inspired by uninitialized variables in C, but it has some important differences. Because LLVM IR is a compiler IR which is used as both the input and output of the optimizer, it has to strike some balances that aren't relevant at the level of C source code. In gross terms, adding more rules to a compiler IR make analysis easier, because they reduce the amount of variety that an IR consumer has to handle, but it makes transformation harder, because any new output also has to conform to all the rules. The rules governing C's uninitialized variables are too strong for LLVM's transformation passes. They require that undefined behavior happen immediately upon use of an uninitialized value. With rules like that, LLVM would have to be very conservative whenever it wants to move code around, in order to avoid exposing undefined behavior when it's not actually supposed to happen in the program.

LLVM's undef is also inspired by the idea of "just use whatever value happens to be in some register" idiom which is possible in many assembly languages. But, the basic tradeoff in the previous paragraph also applies here, from the opposite direction. Like Firm's Unknown, an assembly language register will have a fixed bit pattern once it is declared "live", which is fine for assembly language, but burdensome for an optimizer and optimizing code generator, as it takes up a register that could be used for other things, or it may require overhead for spilling and reloading, and it has odd code motion semantics.

More to come

In upcoming posts I'll describe what these rules mean for writing optimization passes, the relationship between undef, poison, unspecified values, and data races, and a variety of alternatives to the undef and related concepts for people who think they want something different.