Portability vs exploiting architecture potentials.

Created by: wks

For should the enemy strengthen his van, he will weaken his rear; should he strengthen his rear, he will weaken his van; should he strengthen his left, he will weaken his right; should he strengthen his right, he will weaken his left. If he sends reinforcements everywhere, he will everywhere be weak. -- Art of War, by Sun Zi

The µVM is designed to abstract over the hardware, but only acts as a thin layer of abstraction.

There are many differences between underlying hardware platforms. Scroll down to the Appendix or read https://github.com/microvm/microvm-meta/issues/16

Solutions to differences

As mentioned in https://github.com/microvm/microvm-meta/issues/16, there are three solutions, from the weakest to the strongest.

Define the differences as undefined behaviour. The Client must avoid triggering such behaviour at all costs.
Define the differences as implementation-defined behaviour. The µVM implementation defines the behaviour and provides documentation and compile-time or run-time mechanisms for checking it.
Define the behaviour in the µVM. The µVM implementation bridges the differences.

The first approach is the weakest. It makes it impossible to use any platform-specific features and forces the Client to insert excessive checks to avoid undefined behaviour. So the µVM should lie somewhere between approaches 2 and 3.

The µVM design goal

The µVM has several conflicting design goals.

µVM is low-level. It is a thin layer over the hardware.
µVM is minimal. Anything that can be done efficiently by the Client should be done by the Client.
µVM is portable.
µVM should support high-performance VMs.
µVM potentially runs on resource-constrained devices.

Depending on how "portable" is interpreted, there are two different implications:

The µVM provides compile-time or run-time flags so that the Client can generate platform-dependent µVM IR code.
The µVM IR code is cross-platform. The same µVM IR code should run on all platforms.

The second interpretation is more portable than the first. Overall, the more portable the µVM is, the thicker the abstraction layer is.

Goal 4 (high performance) can be interpreted in different ways, listed here from the most to the least demanding:

Highest theoretical possible performance.
As high as C for computation-intensive problems.
As high as Java for computation-intensive problems.
Much higher than Python/PHP/R/<insert your favourite scripting language here>
As high as Python/PHP/R/...
As long as the program eventually terminates.

Depending on the concrete implementation, an average desktop/server µVM should reach somewhere between levels 2 and 3, that is, between C and Java.

Goal 5 (resource-constrained devices) requires both the µVM and the Client to be simple. In this case, neither the Client nor the µVM can afford much reasoning about the program, which results in sub-optimal performance of the emitted code.

Choices in the µVM

Behaviour of UDIV/SDIV

UDIV/SDIV are implementation-defined, or
They always behave like Java (div by 0 is an exception, -0x80000000/-1 = -0x80000000)
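To make the second option concrete, here is a minimal C sketch of 32-bit signed division with the Java-like behaviour described above. The function name `sdiv32_java_like` and the boolean return value (standing in for an exception) are assumptions made for illustration; they are not part of the µVM. Note that the overflow case has to be handled explicitly, because `INT32_MIN / -1` is undefined behaviour in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: 32-bit signed division with Java-like semantics.
 * Returns false on division by zero (standing in for raising an exception);
 * otherwise stores the quotient in *out and returns true. */
bool sdiv32_java_like(int32_t lhs, int32_t rhs, int32_t *out)
{
    if (rhs == 0) {
        return false;        /* Java would throw an ArithmeticException here */
    }
    if (lhs == INT32_MIN && rhs == -1) {
        *out = INT32_MIN;    /* overflow wraps: -0x80000000 / -1 = -0x80000000 */
        return true;
    }
    *out = lhs / rhs;        /* C99 and later truncate toward zero, as Java does */
    return true;
}
```

If the behaviour is fixed by the µVM, checks of this kind are needed on every platform whose divide instruction does something else; if it is implementation-defined, the Client decides whether it needs them.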

Supported vector sizes

Conflicts:

The current µVM IR is very expressive. Given a type T and a size n, it can express any vector type: vector<T n>
Not all of these vector types are supported by a given architecture.

So the µVM should

Make it implementation-defined, or
Only support some selected vector sizes, or
Support some selected vector sizes and platform-specific vector sizes, or
Support all vector<T n>, either using scalar operations or vector instructions.

Rationale:

The Client can probe the supported vector sizes and can generate code accordingly. So it is the Client's responsibility to choose vectors. This is also a kind of specialisation which is usually done by the high-level optimiser.
Choosing the appropriate vector size is a kind of instruction selection and should be done by the µVM in a platform-specific fashion.

For rationale 1, [DeVito et al](http://terralang.org/pldi071-devito.pdf) used an auto-tuning technique to determine the optimal vector size for a vectorised matrix multiplication problem.

For rationale 2, [Jibaja et al](https://01.org/node/1495) proposed adding SIMD to JavaScript, but exposing only 128-bit registers to the programmer. However, the following code is difficult for the µVM to convert to 256-bit vectors:

__m128 elems[N];
for (int i=0; i<N; i++) {
    elems[i] = vector_add_floatX4(elems[i], constant_vector(1,2,3,4));
}

The reason is:

It requires extensive control-flow analysis to merge two adjacent 128-bit additions into one 256-bit addition.
256-bit vectors have a different alignment requirement, so the array __m128[] cannot simply be treated as an array __m256[].
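The alignment problem can be sketched in plain C. This is an illustration only: it assumes 128-bit elements are 16 bytes each and that a 256-bit vector access needs 32-byte alignment, and the helper names `is_aligned` and `pair_is_256bit_aligned` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if addr is a multiple of alignment (alignment must be a power of two). */
bool is_aligned(uintptr_t addr, size_t alignment)
{
    return (addr & (uintptr_t)(alignment - 1)) == 0;
}

/* True if elements [first] and [first + 1] of an array of 16-byte elements
 * starting at base could be accessed as a single 32-byte-aligned vector. */
bool pair_is_256bit_aligned(uintptr_t base, size_t first)
{
    return is_aligned(base + first * 16u, 32);
}
```

An array that is only guaranteed to start on a 16-byte boundary (for example at address 0x1010) satisfies the 128-bit requirement, but its element pairs (0,1), (2,3), ... do not start on 32-byte boundaries, so the µVM cannot merge them without knowing the actual base address.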

The following code, however, is easier to convert:

__align_to(256) __m128 elems[N];
for (int i=0; i<N; i+=2) {
    elems[i] = vector_add_floatX4(elems[i], constant_vector(1,2,3,4));
    elems[i+1] = vector_add_floatX4(elems[i+1], constant_vector(1,2,3,4));
}

However, the client, when generating such code, is already aware of the presence of 256-bit vectors.
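A Client that probes the supported vector sizes, as rationale 1 suggests, might choose a width along the following lines. This is a sketch under assumptions: the list of supported widths stands in for whatever probing mechanism the µVM would provide, and `choose_vector_bits` is a hypothetical Client-side helper, not a µVM API.

```c
#include <stddef.h>

/* Pick the widest supported vector width (in bits) that is a whole number of
 * elem_bits-sized lanes and holds no more than `count` elements.
 * Returns 0 if no width fits, meaning the Client should emit scalar code. */
unsigned choose_vector_bits(const unsigned *supported, size_t n_supported,
                            unsigned elem_bits, size_t count)
{
    unsigned best = 0;
    for (size_t i = 0; i < n_supported; i++) {
        unsigned width = supported[i];
        if (elem_bits == 0 || width % elem_bits != 0) {
            continue;                       /* not a whole number of lanes */
        }
        if ((size_t)(width / elem_bits) > count) {
            continue;                       /* more lanes than data */
        }
        if (width > best) {
            best = width;
        }
    }
    return best;
}
```

With the widths {128, 256} and 32-bit floats, a loop over 1024 elements would be generated with 256-bit vectors, while a 4-element array would fall back to 128 bits.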

Array indexing

Conflict:

Array indexing seems to be platform-independent. It is "begin+index" (the index can be negative).
However, it is implemented using address calculation and memory access, where the "address" is word-sized.

Solutions

Allow any integer type as the index and have the µVM sign-extend or truncate it to word size (as LLVM does; see http://llvm.org/docs/LangRef.html#getelementptr-instruction), or
Require the index to be word-sized, so that the Client must extend or truncate the index appropriately (the current µVM design)

Rationale:

It only involves a sign extension or a truncation, which the µVM can do cheaply. However,
Since the Client can probe the word size and can generate TRUNC/SEXT instructions itself, the conversion should be done by the Client.
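The conversion in question is small. The following C sketch shows what a Client-emitted SEXT followed by word-sized indexing amounts to; `intptr_t` stands in for the word-sized integer type, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* What a Client-emitted SEXT does: widen a 32-bit signed index to word size.
 * The conversion preserves the value, including negative indices. */
intptr_t sext_index_to_word(int32_t index)
{
    return (intptr_t)index;
}

/* "begin+index" expressed with a word-sized index. */
int32_t *element_address(int32_t *begin, intptr_t word_index)
{
    return begin + word_index;
}
```

For a pointer `mid` into the middle of an array, `element_address(mid, sext_index_to_word(-2))` addresses the element two places before it. Zero-extending the index instead would give a large positive offset on a 64-bit machine, which is why the signedness of the extension matters.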

Portable µVM IR as a subset of the µVM

If we can define a subset of the µVM with defined behaviours and reasonable performance (perhaps calling it "µVM Mobile Edition"), then some simplistic µVM Clients (I assume there will soon be many such Clients) can generate portable µVM IR code.

For implementations that seek ultimate performance, the µVM implementation can provide many machine-dependent instructions as a superset of such a "portable" IR (call it "µVM Enterprise Edition"?).

Alternatively, we can define that "subset" as "The µVM IR ®" and treat the implementation-dependent extensions as "extended µVM".

Appendix: Examples of differences

There are differences between architectures. For example:

The word length is different. Some operations (including array indexing) depend on the word length.

The integer division instruction behaves differently among architectures, especially in "division by zero" and "signed integer overflow".

Not all operations on all types perform equally well. 64-bit integer operations perform significantly worse (if possible at all) than 32-bit counterparts on 32-bit machines.

The supported vector length of vector instructions is different. This may vary from 64 bits (ARM) up to 512 bits (x86 with AVX-512), with 128 bits widely supported.

Some instructions are "optional" and the functionality must be software-implemented on those architectures. For example, UDIV and SDIV in ARMv7.

Although most architectures provide all the operations mentioned in LLVM (binary arithmetic and logical, comparison and conversion), concrete instructions do not work on all data sizes. For example, conversion from floating point to integer may only be available for certain data types (float to 32-bit int or 64-bit int, but not other int types). Not all vector operations work for all vector types.

The address size (and indexes) of memory (and array) operations is different among architectures.

The data types that can be atomically loaded/stored are different. This affects how the Client is going to implement mutex locks.

The alignment requirements for vector load/store operations are different.

In the native interface/C interface/foreign function interface (FFI):

The pointer size depends on the architecture. This also affects function pointers for external C functions.
The available system calls depend on the operating system, the ABI and the processor. The Client must handle the differences between operating systems.

Other issues mentioned in https://github.com/microvm/microvm-meta/issues/16
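As an illustration of the "optional instruction" item above, this is roughly what a software fallback for a missing unsigned divide instruction looks like. It is a generic restoring (shift-and-subtract) division written in C, not code taken from any µVM implementation, and the name `soft_udiv32` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Restoring (shift-and-subtract) unsigned 32-bit division.
 * Returns the quotient and stores the remainder in *remainder.
 * The divisor must be non-zero. */
uint32_t soft_udiv32(uint32_t dividend, uint32_t divisor, uint32_t *remainder)
{
    assert(divisor != 0);
    uint32_t quotient = 0;
    uint64_t rem = 0;   /* wider than 32 bits so the shift below cannot overflow */
    for (int bit = 31; bit >= 0; bit--) {
        /* Bring down the next dividend bit, most significant first. */
        rem = (rem << 1) | ((dividend >> bit) & 1u);
        if (rem >= divisor) {
            rem -= divisor;
            quotient |= (uint32_t)1 << bit;
        }
    }
    *remainder = (uint32_t)rem;
    return quotient;
}
```

The loop runs 32 iterations per division, which shows why the cost of a "portable" UDIV can differ greatly between a platform with a hardware divider and one without.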

Current extension mechanism

An intrinsic function (or IFUNC for short) has a form similar to a function call, but is treated by the µVM as a regular instruction. The simplest form takes only the name (or ID) of the IFUNC:

%result = ICALL @uvm.thread_exit

The most complete form takes type arguments, value arguments and an exceptional destination:

%result = ICALL @uvm.do_something_weird <@T1 @T2 @T3> (%val1 %val2 %val3) EXC %normal %exceptional

Theoretically all binary operations can be defined as IFUNCs. For example:

%result = ICALL @uvm.math.sdiv <int<64>> (%lhs %rhs) EXC %normal %div_by_zero_handler