IR construction API for the convenience of verification
Created by: wks
The only way for the current API to transfer an IR bundle from a client to a Mu micro VM is via the
load_bundle API (for example,
microvm.load_bundle(""".typedef @i32 = int<32>\n"""), or use a binary format), which passes a serialised IR format (text or binary). JVM takes this approach. Its serialisation format is "Java bytecode".
An alternative way to deliver a bundle is to let the client construct each node (type, function, basic block, instruction, ...) in the bundle by calling an API function (for example,
handle = microvm.make_instruction("ADD", i32, op1, op2)) which returns a handle to the node, then passing this handle to the Mu micro VM. LLVM uses a similar approach: it provides constructors to each node class, and provides a "builder" to conveniently build a CFG.
Since the construction of the IR should be off-loaded from the micro VM, there should be a client-side library (we call it libmu) which constructs the IR for the client and provides an API to the client.
libmu itself talks with the micro VM through implementation-dependent "private" APIs.
It is not clear whether serialisation or construction by calling is "better" because "better" has many definitions, but the calling-based API must exist because it is reported that it is very difficult to verify a parser.
Comparing the two approaches
- language independent
- The serialisation format can be standardised and (almost) all languages can access byte arrays.
- faster than construction by calling when the client and the micro VM have different runtime environments
- (e.g. Haskell to C, Java to C, ...)
- FFI can introduce considerable cost. The Go lang impl performs a SWAP-STACK operation every time it calls C.
- simple API
- one API function
load_bundlehandles the entire IR format.
- one API function
- language independent
- difficult to verify (there are reports that it is very difficult to verify a parser)
- require a parser inside the micro VM:
- there may be a mixed approach to offload the parser outside the micro VM.
- not suitable for the end user, i.e. those who writes the high-level language compiler.
- Either the high-level compiler writer or some third party need to build a library for serialisation.
construction by calling
- easy to verify
- minimal for the micro VM
- No parser in the micro VM is needed.
- faster than serialisation when the client and the micro VM have similar runtime environments
- (e.g. C to C, Java to Java, C to C++, Haskell to Haskell, ...)
- The libmu library can be implemented in the same language as the client.
- bloated API
- There will be hundreds of API functions just to construct each IR node. All of them need to be verified.
- The internal states of libmu needs to be verified, but hopefully this can be easier than verifying the parser.
- paradigm impedance
- There is no single API that satisfies all languages. For example, if the API is defined in C, it may be unsuitable for a client written in Haskell. So ideally there should be a libmu for Haskell, whether that is part of the formally verified Mu/libmu pair or not.
- bloated API
Cost of foreign function calls
Depending on the two languages calling each other, the cost can be trivial (such as C and C++) or huge. Java must go through JNI to call native functions. Go performs a swap-stack operation every time it calls C in order to work around blocking system calls in its M*N threading model. Other language implementations, such as Python, relies on
libffi, which builds a native call by dynamically prepare its arguments, and the high-level language such as Python needs to convert C types to Python types and vice versa.
An experiment shows calling a simplest C function from Haskell introduces 30x overhead comparing to calling the same C function from another C function.
But from verification's point of view, the cost does not matter as long as it is spent in the client.
Cost of serialisation
Serialisation is not free. In a simple experiment, serialising a CFG-like data structure to an intermediate format and then parse it in another module (both written in C) introduces a 10% overhead comparing to directly constructing the target CFG structure in the receiver while assuming the sender only holds opaque references. The major cost is memory allocation (where malloc is the bottleneck) and resolving the cross-references between nodes (where the hash map is the bottleneck).
Serialisation is not free, but is reasonably cheap. When foreign function call is expensive, serialisation can be used as an alternative.
Not calling across languages
Since calling across languages is expensive, it is desirable to implement part of the libmu in the same (or similar) language the client is written in, and let libmu construct the data structure that is native to the micro VM.
For example, if the micro VM is written in C++, the libmu should construct a tree of C++ class instances as LLVM does.
Note that LLVM is designed and implemented in C++ and serves C/C++. It is not a problem if the only official API it provides is C++. There is a C binding, too, but it is trivial. However, Mu has a specification which allows multiple implementations. In this case, the micro VM core would not always be in C/C++ or any particular language.
However, if the micro VM is written in a managed language, such as RJava or Java, then it will be interesting:
If both the client, libmu and the micro VM happen to share the same (or similar runtime), then the API calls can be cheap. The ideal case is being "metacircular", i.e. both the client and libmu run on the same micro VM as the micro VM itself. The cost is minimal.
If libmu is written in C and provides the C API, then there is a semantic mismatching between the micro VM and libmu: somewhere in libmu must cross the line between two different runtimes, which introduces a FFI-like overhead, the cost of which depends on the concrete languages. Holding opaque references of objects in Mu requires such object to be pinned, or being held by some containers (like the
MuCtxstructure in the current API, which is not light-weight).
If the client and libmu uses a different runtime as the micro VM ("Haskell on libmu on micro VM" is this case), it will pay for two levels of cross-runtime calls.
Other concerns in design
Does libmu need to be minimal?
Maybe. At least the verified libmu needs to be minimal.
There may be even higher-level libraries outside above in the client's world outside libmu. Those libraries are not minimal.
How many languages should be supported?
Ideally the libmu should have both intimate knowledge of a particular Mu implementation, and intimate knowledge of the language the client is written in. In theory, there can be one (or more) libmu for each (micro VM impl, client language) pair.
Since C is so popular, we will define a C API for libmu. In theory,
- there can be more than one C APIs
- there can be APIs for other languages and they may or may not look like the C API. (preferably not, to avoid paradigm impedance)
Mu CFG and the client CFG (or AST)
How much should the client use the Mu IR CFG? i.e. should the client construct Mu IR nodes and do transformations on it as LLVM does? Probably not.
LLVM is designed to be maximum: it attempts to be maximum and its CFG contains much information for optimisation, such as the "nsw" and "nuw" flag on the "add" and the "sub" instructions.
But the Mu IR is designed to be minimal and is only designed for the micro VM to consume. It does not contain much information that benefits the client.
It is possible that a client-side library performs IR transformations, but it is doubtful whether that IR is the same Mu IR. Many optimisations, such as whether
x+1 > x is always true, depends on extra information (such as the "nsw" flag in LLVM) which the Mu IR do not provide.
I @wks believe the Mu IR is only generated as the last step of the client-side transformation, i.e. the next step is to deliver it into the micro VM.
Towards the new API
The existing controlling API does not need to be changed.
The bundle loading API can be removed. i.e. "how to load bundles into the micro VM" is implementation-dependent, which actually mean libmu-dependent.
This API needs to be carefully designed because it is part of the formal verification.
There should be a model of the internal states of libmu which includes:
- The set of handles which map to Mu IR nodes.
- The state of each Mu IR node.
- Isolation between threads.
During construction, each node can hold incomplete information at a given time. The Mu IR nodes may circularly refer to each other (currently they refer to each other via IDs), so it is desirable to allocate several nodes and then link them with each other.
If multiple threads can use the same libmu, the transition in its internal state must be properly handled.
Cares must be taken to select the minimum set of API functions, because the current Mu IR has 18 types and 37 instructions. The number of API functions may easily bloat to 100+ if too many CRUD commands are added.
Micro VM-to-libmu interaction
This interface does not need to be public, but the proper handling of data structures in the micro VM is important. This interface is part of the verification.
In all cases, the choice of languages does matter. Properly chosen languages for the client and the micro VM will result in high performance and verifiability.
This section contains my @wks personal opinions. These affects my opinions on this API, too.
In the future, popular system programming languages will generally be higher-level than current languages (such as C). We already showed that a high-level language, such as RJava and Java, can produce high-quality language runtimes, such as JikesRVM.
Instead of relying on C, high-level system programming languages will gain direct control over low-level operations, such as raw memory access (pointers). These low-level stuff can be done even though the language itself still has very high-level features, such as garbage collection and object-oriented programming. It is possible (in my opinion at least) that eventually such high-level language can replace libc and directly interface with the kernel, thus eliminates the necessity of C except for some very rare cases.
This trend is already visible in several languages. C# already has unsafe pointers and unsafe native interface (P/Invoke). The Java API is mostly implemented in Java, unlike the old days when the standard Java library is mostly implemented in C++. There is also a JEP to add unsafe native interface in Java, which still has not become mainstream yet. However, OpenJDK already exposes some low-level stuff though the
sun.misc.Unsafe class. RJava obviously used magic to gain low-level support. In Ruby, ruby-ffi is recommended over directly writing C modules.
With C being redundant, the high-level language (or runtime) may be optimised for internal interoperation (e.g. Java-to-Java calls will be faster and faster) at the expense of the interoperability with C (e.g. even unsafe native call may be costly. Object pinning or opaque handles are required for native programs, which would only run briefly.).
Since raw memory access will be faster while foreign function call will be slower, serialisation may have an advantage over calling (my experiment already shows this for Java and C). But this does not rule out the calling-based API because it may not be foreign calls, in which case it is still faster than serialisation. Ideally, in the meta-circular setting, if both Mu, libmu and the client are in Mu, then Mu-to-Mu function calls are virtually free.