TAILCALL is not always possible

Currently the mu-spec requires the TAILCALL instruction to not create a new 'frame' (and it is supported whenever the callee and caller have the same return type). Unfortunately, on aarch64 at least, I can't guarantee this when the callee's signature requires more stack space for arguments than the caller's: the callee's arguments can't simply be placed in the caller's argument space, because that space is too small. The best idea I have come up with so far for implementing this is roughly as follows:

caller:
    SUB SP, SP, #E // Reserve extra space

    // The usual prologue:
        // Create a new frame record
        STP FP, LR, [SP, #-16]! // Push (FP, LR)
        MOV FP, SP
        // Save callee saved registers and allocate stack space (as normal)...
   ...

// The code emitted for a call to callee
   // Compute arguments as usual, and place them in registers
   // or the argument stack space starting at FP+16 (i.e. the result of the SUB instruction emitted above)
   // Restore all callee saved registers

   // Restore the frame record
   MOV SP, FP
   LDP FP, LR, [SP], #16 // Pop (FP, LR)

   // We have to save the LR somewhere: we can't use the stack, as the callee may modify it,
   // and we have just restored all callee saved registers, so we can't use them either.
   // TODO: append LR (somehow?) to a thread-local linked list on the heap...

   BL callee
callsite: // Record the exceptional destination as 'exception'
    ADD SP, SP, #E
    // TODO: Restore LR from the heap
    RET LR
    ...

epilogue: // Before caller returns
   // Do the normal epilogue (restore callee saved registers, pop the frame record)...
   ...
   ADD SP, SP, #E
  
exception[exc]: // I could probably do this inside muentry_throw_exception itself...
    ADD SP, SP, #E
    // TODO: Restore LR from the heap

    MOV X0, exc // Rethrow 'exc'
    B muentry_throw_exception

Where 'E' is the maximum of (B - A) over every signature that appears in a TAILCALL in caller, B being the stack argument size of that signature and A the stack argument size of caller.
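To make the definition of E concrete, here is a minimal sketch in C of how a backend might compute it. The names `extra_stack_space` and `align16` are hypothetical and not part of any existing code base. Two details are assumptions layered on top of the definition above: the result is clamped at zero (when no callee needs more stack argument space than caller has), and it is rounded up to a multiple of 16 so that SP stays 16-byte aligned, as AArch64 requires.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of 16 (assumed here so SP stays aligned). */
static size_t align16(size_t n)
{
    return (n + 15u) & ~(size_t)15u;
}

/*
 * Hypothetical helper: computes E for one caller.
 *   caller_stack_args     A, the caller's stack argument size in bytes
 *   callee_stack_args[i]  B, the stack argument size of the signature
 *                         used by the i-th TAILCALL in the caller
 */
static size_t extra_stack_space(size_t caller_stack_args,
                                const size_t *callee_stack_args,
                                size_t n_tailcalls)
{
    assert(callee_stack_args != NULL || n_tailcalls == 0);

    size_t extra = 0; /* assumption: clamp at zero if every B <= A */
    for (size_t i = 0; i < n_tailcalls; i++) {
        if (callee_stack_args[i] > caller_stack_args) {
            size_t diff = callee_stack_args[i] - caller_stack_args;
            if (diff > extra)
                extra = diff;
        }
    }
    return align16(extra);
}
```

For example, with A = 16 and TAILCALL signatures needing 16, 48 and 32 bytes, this gives E = 32; if every B is at most A, E is 0 and the leading SUB and the matching ADDs can be omitted.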

Unfortunately, this doesn't actually perform a tail call (it uses a call instruction, BL, instead of a branch, B), and it requires the horrible storing of the link register somewhere on the heap (I haven't actually worked out how to do this). This weird link register marshalling may itself break the no new 'frame' rule. I could simply do a normal call, which may be more efficient (as it won't require storing the LR on the heap), but that would definitely break the no new frame rule.
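One possible shape for the thread-local linked list mentioned in the TODOs is sketched below in C. This is only an illustration under assumptions: the names `lr_node`, `lr_push` and `lr_pop` are hypothetical, nodes are allocated with `malloc`, and the generated code (or a runtime helper) is assumed to call `lr_push` just before the BL and `lr_pop` at callsite and in the exception path.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One saved link register value; nodes form a per-thread LIFO list. */
struct lr_node {
    uintptr_t saved_lr;
    struct lr_node *next;
};

/* Head of the current thread's list (C11 thread-local storage). */
static _Thread_local struct lr_node *lr_stack = NULL;

/* Save lr on the current thread's list.
 * Returns 0 on success, -1 if the allocation fails. */
static int lr_push(uintptr_t lr)
{
    struct lr_node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->saved_lr = lr;
    n->next = lr_stack;
    lr_stack = n;
    return 0;
}

/* Remove and return the most recently saved lr.
 * The list must not be empty. */
static uintptr_t lr_pop(void)
{
    struct lr_node *n = lr_stack;
    assert(n != NULL);
    uintptr_t lr = n->saved_lr;
    lr_stack = n->next;
    free(n);
    return lr;
}
```

A LIFO list is enough because nested calls of this kind return in the reverse order in which they were made. Note that each pending call still leaves a node behind, which is exactly the concern about effectively creating a new 'frame', only on the heap instead of the stack.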

I checked with clang/LLVM (as @U1817699 suggested). It has a musttail flag for its call instruction, which guarantees a tail call but can only be used if the callee and caller have the same signature/prototype. It also has a tail flag, which is merely a hint and does not actually produce a tail call or affect the generated code in the case mentioned above.
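For reference, the same-signature case that musttail does cover looks like this at the C level. This is a sketch that goes slightly beyond what I checked above: it relies on Clang's `__attribute__((musttail))` statement attribute, which is an addition not mentioned in this post, and the `MUSTTAIL` macro and `sum_to` function are made up for illustration. On compilers without the attribute the macro expands to nothing, so the code still compiles but the tail call is no longer guaranteed.

```c
#include <assert.h>

/* Use the musttail attribute where the compiler provides it. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL /* attribute unavailable: no tail call guarantee */
#endif

/* Caller and callee are the same function, so their signatures match
 * and no extra stack argument space is ever needed. */
static long sum_to(long n, long acc)
{
    if (n == 0)
        return acc;
    MUSTTAIL return sum_to(n - 1, acc + n);
}
```

Here `sum_to(100, 0)` evaluates to 5050. The problematic case described above is the one this mechanism refuses: a callee whose prototype differs from the caller's.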

Is my implementation conformant with the spec? If not, should we change the spec, or does anyone have a better idea of how to implement it?