Low-bit ref tagging

CakeML, like many ML implementations I suspect, uses sophisticated low-bit tagging in its pointers. While taggedref64 in Mu is a start, and is a kind of clever "hack", it does not really cover the CakeML case and also prevents one from using any / all parts of the 64-bit address space. So after some reflection I would like to start discussion of a possible general technique for supporting low-bit-tagged references. (One can imagine -high bit tagging as well using BiBoP - big bag of pages, where each "page" has a given type and thus the high bits tell the types apart, but I don't think that necessarily meets the same need.)

So I start with the idea of lowtaggedref<t, n> where n gives the number of tag bits desired and t is the type of object the ref refers to. Note that choosing a particular value for n requires that objects referred to by the lowtaggedref must be align on a 2^n boundary. This may be larger than the minimum granule size that the allocator would naturally use, and may lead to fragmentation, but the user is buying into that when making the choice. Actual allocators probably use minimum granule sizes of 4, 8, or 16 bytes anyway, which allows for 2, 3, or 4 tag bits without extra fragmentation.

This information is not enough, however. The reason is that some tag values may indicate that the remainder of the ref is a pointer while other values indicate that it is a non-pointer value. Again, on reflection, and taking into account the emerging spec I am working on with MIchael and Tony for an Immix collector that might be used with CakeML and with Mu, I propose to cover that case with a different type. Therefore, lowtaggedref<t, n> gives n low bits of tag value that can be fetched, etc., separately from a ref value in the remaining bits, and it is implemented by forcing 2^n alignment. It support a null pointer as well (which also has the same n bits of tag (so an == 0 comparison is not quite enough to detect a null pointer - you have to mask off the tag).

Now, to support dynamic mixing of pointer and non-pointer data, a bit more like taggedref64 and what happens in a number of dynamic languages (and CakeML, which uses a "dynamic" representation in the heap because of parameterized types), I add dyntaggedref<t, n, pattern>. Here t is the type of thing that this refers to if it is a ref, n is the number of low bits used for tagging, and pattern is an n-bit number that tells the system which tag values indicate that this value is a ref. In particular, if bit i is a i, then tag value i indicates that the w-n ref bits are a pointer (w is the word size). So, a 2-bit tag requires a 4-bit pattern, a 3-bit tag an 8-bit pattern, etc. Since these values are statically part of the type, GC can still pick things apart. It is also possible, using classical Karnaugh map reduction techniques, etc., to derive, automatically, fast masking checks, though the general case would just shift the pattern right and check the low bit (or equivalent).