Undefined vs Implementation-defined

Created by: wks

NOTE: a higher-level discussion is in https://github.com/microvm/microvm-meta/issues/17

During the meeting on 23 September 2014, we discussed the difference between "undefined behaviour" and "implementation-defined behaviour".

Background

Some operations in C, as well as in LLVM, have undefined behaviour:

Division by zero.
- Example: 42 / 0
Overflow in signed division.
- Example: int a = INT_MIN; int b = a / -1; (with a 32-bit int, INT_MIN is -0x80000000)
Shifting an integer by a negative amount, or by an amount greater than or equal to the bit width of the left-hand side.
- Example: int a = 42; int b = a << 32; int c = a << -1; (assuming int is 32-bit)
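
A minimal sketch of what avoiding these cases looks like in portable C. The helper names checked_div32 and checked_shl32 are hypothetical and not taken from any library. Each helper tests for the undefined cases first and reports failure instead of executing the operation. The shift works on an unsigned operand because overflowing a signed left shift is itself undefined in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: divide only when the result is defined in C. */
bool checked_div32(int32_t a, int32_t b, int32_t *out) {
    assert(out != NULL);
    if (b == 0) return false;                    /* division by zero */
    if (a == INT32_MIN && b == -1) return false; /* quotient not representable */
    *out = a / b;
    return true;
}

/* Hypothetical helper: shift left only when the count is in [0, 31]. */
bool checked_shl32(uint32_t a, int count, uint32_t *out) {
    assert(out != NULL);
    if (count < 0 || count >= 32) return false;  /* negative or too large */
    *out = a << count;
    return true;
}
```

Every call site pays for these checks, even on hardware where the unchecked instruction would have produced a usable result.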

However, the corresponding machine instructions have defined behaviour on each architecture, although that behaviour differs from one architecture to another.

Division by zero
- x86: IDIV, DIV: Divide-by-zero raises "divide error".
- ARMv7: SDIV, UDIV:
  - ARMv7-A: Divide-by-zero always returns 0.
  - ARMv7-R: Controlled by SCTLR.DZ; an "Undefined Instruction" exception may or may not be raised.
- A64: Divide-by-zero always returns 0.
Division overflow
- x86: IDIV, DIV: If the result is not representable in the corresponding type (signed or unsigned), i.e. it is too large or too small, a "divide error" is raised.
- ARMv7-A: SDIV and UDIV are optional. Where they are absent, division is implemented in software.
- ARMv7, A64: SDIV, UDIV: The result is truncated to the width of the corresponding type. No error is raised, so -0x80000000 / -1 == -0x80000000 for 32-bit operands.
Shifting
- x86: SAL,SAR,SHL,SHR: the count operand is masked to 5 bits (32-bit integer) or 6 bits (64-bit integer)
- ARMv7:
  - LSL, LSR, ASR (immediate): It can only encode 5 bits of shift amount.
  - LSL, LSR, ASR (register): The shift amount is taken from the low 8 bits of the register. After shifting, the low 32 bits are the result.
- A64:
  - ASR, LSL, LSR (immediate): It can only encode 6 bits of shift amount.
  - ASRV, LSLV, LSRV (register controlled): The shift amount is the second register modulo the register size (i.e. masked).
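
The hardware behaviours listed above can be modelled in portable C. This is only a sketch for 32-bit operands: the function names are hypothetical, each function emulates the described result rather than executing the instruction, and trapping cases (the x86 divide error, the ARMv7-R exception) are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* x86 SHL with a 32-bit operand: the count is masked to its low 5 bits. */
uint32_t shl32_x86(uint32_t a, uint32_t count) {
    return a << (count & 31u);
}

/* ARMv7 LSL (register): the low 8 bits of the count are used, and a
   count of 32 or more shifts every bit out of the 32-bit result. */
uint32_t shl32_armv7(uint32_t a, uint32_t count) {
    uint32_t n = count & 0xFFu;
    return n >= 32u ? 0u : a << n;
}

/* A64 LSLV on 32-bit registers: the count is taken modulo 32. */
uint32_t shl32_a64(uint32_t a, uint32_t count) {
    return a << (count % 32u);
}

/* ARMv7-A / A64 SDIV: division by zero returns 0, and INT32_MIN / -1
   is truncated back to INT32_MIN. */
int32_t sdiv32_arm(int32_t a, int32_t b) {
    if (b == 0) return 0;
    if (a == INT32_MIN && b == -1) return INT32_MIN;
    return a / b;
}

/* Example: the same shift, 42 << 32, gives different results per target. */
void show_shift_difference(void) {
    assert(shl32_x86(42u, 32u) == 42u);
    assert(shl32_armv7(42u, 32u) == 0u);
    assert(shl32_a64(42u, 32u) == 42u);
}
```

Each model is fully defined, but they disagree with one another, which is the situation a portable specification has to deal with.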

Undefined behaviour vs implementation-defined behaviour

In C11:

undefined behavior: behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements
implementation-defined behavior: unspecified behavior where each implementation documents how the choice is made

Implementation-defined behaviour carries the additional requirement that the implementation must document the behaviour. As long as it is documented, the behaviour is still "defined", only at a different layer.

If a behaviour is left undefined, the higher level (e.g. the Client) cannot depend on it, even if the lower level (e.g. the CPU) defines it precisely. In contrast, if the behaviour is implementation-defined, the higher level still has ways to use the low-level detail. Assuming the higher level is the Client, the middle level is the µVM and the lower level is the CPU, these ways include:

The Client programmer reads the µVM manual for the particular CPU.
The µVM provides flags that can be checked at compile time, and the Client is conditionally compiled (e.g. configure --with-uvm=x86-64).
The µVM provides functions that can be called at run time, and the Client generates code conditionally (e.g. if (uvm.wordSize() == 64) { emitStore64(reg,mem); }).
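
The second and third ways can be sketched in C as follows. Everything µVM-related here is hypothetical: the macro UVM_TARGET_WORD_SIZE stands in for a flag that a configure step might define, and uvm_word_size() stands in for a run-time query such as uvm.wordSize(). Neither is an actual µVM API. The emit functions only report which store width was chosen.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical compile-time flag; a real build would set it via configure. */
#ifndef UVM_TARGET_WORD_SIZE
#define UVM_TARGET_WORD_SIZE 64
#endif

/* Hypothetical run-time query; this stand-in reports the host pointer width. */
int uvm_word_size(void) {
    return (int)(sizeof(void *) * CHAR_BIT);
}

/* Stand-ins for the Client's code generator: return the store width emitted. */
static int emit_store32(void) { return 32; }
static int emit_store64(void) { return 64; }

/* Compile-time selection: the Client is conditionally compiled. */
int emit_store_compile_time(void) {
#if UVM_TARGET_WORD_SIZE == 64
    return emit_store64();
#else
    return emit_store32();
#endif
}

/* Run-time selection: the Client queries the µVM and generates code to match. */
int emit_store_run_time(void) {
    int width = uvm_word_size();
    assert(width == 32 || width == 64); /* this sketch handles only these two */
    return width == 64 ? emit_store64() : emit_store32();
}
```

In both variants the Client depends only on what the µVM reports, not on the CPU manual.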

How should an abstraction layer be made over differences?

Apart from outright errors, undefined behaviour usually arises in cases where different platforms behave differently. Division and shifting are two examples.

When creating an abstraction layer over such differences, there are three basic choices.

Define the differing part as undefined behaviour; the higher-level user must avoid it.
- C and LLVM take this approach.
- The advantage is that the specification stays very simple.
- The disadvantage is that the higher level cannot use these cases efficiently, since it must guard against every undefined behaviour.
Define the differing part in one particular way and implement that behaviour on every platform.
- Java takes this approach.
- The advantage is maximum portability of high-level code, since all programs behave the same everywhere.
- The disadvantage is that implementation and optimisation become very difficult, because there are too many invariants to maintain.
Define the differing part as platform-specific and require the higher level to handle the difference.
- This will be the approach taken by the µVM.
- The advantage is that the Client can make full use of the capabilities provided by the platform, resulting in efficient code. The µVM spec is also kept simple by pushing the differences to the implementation and the Client.
- The disadvantage is the extra burden on the Client, which must be aware of the differences between µVM implementations on different platforms. The good news is that the µVM still abstracts over the hardware, so the knowledge required of the Client is limited to the µVM layer, not the hardware.
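
For comparison, the second choice fixes one behaviour everywhere. Below is a minimal sketch in C of Java-style semantics for 32-bit integers: shift counts are masked to five bits, INT32_MIN / -1 yields INT32_MIN, and division by zero is reported as an error (Java throws ArithmeticException). The function names are hypothetical, and the shift operates on the unsigned bit pattern to stay within defined C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Java-style a << n on a 32-bit pattern: only the low 5 bits of n are used. */
uint32_t java_shl32(uint32_t a, int32_t n) {
    return a << ((uint32_t)n & 31u);
}

/* Java-style a / b: returns false where Java would throw
   ArithmeticException, and wraps INT32_MIN / -1 to INT32_MIN. */
bool java_div32(int32_t a, int32_t b, int32_t *out) {
    assert(out != NULL);
    if (b == 0) return false;
    *out = (a == INT32_MIN && b == -1) ? INT32_MIN : a / b;
    return true;
}
```

The explicit checks are the price of uniformity: an implementation must emit them on any platform whose native instruction behaves differently, such as the divide error on x86 or the zero result on ARM.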