Flags for int<128>
I have worked out which instructions should be emitted to compute flags for binary operations on AArch64; a similar method of implementation should hopefully work on x86-64. I have included my notes here.
Notes on notation:
The notation <exp> indicates exp is a 128-bit value
[exp] indicates exp is a 64-bit value
<[exp]> indicates exp is a 192-bit value
<<exp>> indicates exp is a 256-bit value
exp.h indicates the higher 64 bits of the expression and exp.l is the lower 64 bits of the expression
(each should occupy its own register)
Xi = 2^(64*i) (i.e. it is i*64-bits worth of zeros with a one at the front)
Ti is a temporary register (64-bits)
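A small Python model of the notation may help when checking the later steps. Python is used only for illustration here, and the helper names X, split128 and join128 are hypothetical, not part of the original notes.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def X(i):
    """Xi = 2^(64*i): a one followed by i*64 zero bits."""
    return 1 << (64 * i)


def split128(value):
    """Split a 128-bit value <exp> into (exp.h, exp.l), each 64 bits wide."""
    value &= MASK128
    return value >> 64, value & MASK64


def join128(h, l):
    """Rebuild <exp> from its halves: exp.h*X1 + exp.l."""
    return (h & MASK64) * X(1) + (l & MASK64)
```

For example, split128(join128(3, 5)) gives back the pair (3, 5).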
Note:
Some optimisations may be possible if an argument to the instruction is an immediate
Zero and Negative Flags:
D, Z, N = BINOP S1, S2
ORR Z <- D.h, D.l // Z = D.h | D.l
CMP Z, #0 // Z <=> 0
CSET Z, EQ // Z = (Z == 0) ? 1 : 0
LSR N, D.h, 63 // N = (D.h >> 63) (so that N[0] = D.h[63])
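The sequence above can be modelled in Python as a sanity check. This is only a sketch: the function name zero_negative_flags is hypothetical, and registers are modelled as plain integers masked to 64 bits.

```python
MASK64 = (1 << 64) - 1


def zero_negative_flags(d_h, d_l):
    """Model of the Z/N sequence for a 128-bit result held in (d_h, d_l)."""
    d_h &= MASK64
    d_l &= MASK64
    z = d_h | d_l              # ORR Z <- D.h, D.l
    z = 1 if z == 0 else 0     # CMP Z, #0 ; CSET Z, EQ
    n = d_h >> 63              # LSR N, D.h, 63
    return z, n
```

Zero gives (1, 0), a value with only the top bit of D.h set gives (0, 1), and a small positive value gives (0, 0).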
Overflow and Carry for Add/Sub:
D, C, V = ADD/SUB S1, S2
// Compute the addition/subtraction normally (except ensure the add with carry/subtract with carry sets the carry flag, e.g. ADCS/SBCS)
CSET C, CS // Set to 1 if the carry flag is set
// V[63] = 1 IFF D and S1 have different signs
EOR V <- D.h, S1.h // V = D.h ^ S1.h
For ADD:
// T1[63] = 1 IFF S1 and S2 have different signs
EOR T1 <- S1.h, S2.h // T1 = S1.h ^ S2.h
For Sub:
// T1[63] = 1 IFF S1 and ~S2 have different signs (i.e. S1 and S2 have the same sign)
EON T1 <- S1.h, S2.h // T1 = S1.h ^ (~S2.h)
// V[63] = 1 iff D and S1 have different signs
// and S1 and S2 (or ~S2) have the same sign
BIC V <- V, T1 // V = V & ~T1
// Check tmp_status[n-1]
TST V, 1 << 63 // flags <- V & (1 << 63)
CSET V, NE // V = (V[63] != 0) ? 1 : 0
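The carry and overflow steps above can be checked with a Python model. This is a sketch under stated assumptions: the function name add_sub_flags is hypothetical, operands are unsigned 128-bit bit patterns, and subtraction is modelled as S1 + ~S2 + 1, which is how AArch64 defines it (so for SUB, C = 1 means "no borrow").

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def add_sub_flags(s1, s2, subtract=False):
    """Return (d, c, v) for the 128-bit operation s1 + s2 or s1 - s2."""
    s1 &= MASK128
    s2 &= MASK128
    if subtract:
        wide = s1 + (~s2 & MASK128) + 1   # S1 + ~S2 + 1
    else:
        wide = s1 + s2
    d = wide & MASK128
    c = wide >> 128                       # CSET C, CS
    d_h, s1_h, s2_h = d >> 64, s1 >> 64, s2 >> 64
    v = d_h ^ s1_h                        # EOR V <- D.h, S1.h
    if subtract:
        t1 = s1_h ^ (~s2_h & MASK64)      # EON T1 <- S1.h, S2.h
    else:
        t1 = s1_h ^ s2_h                  # EOR T1 <- S1.h, S2.h
    v &= ~t1 & MASK64                     # BIC V <- V, T1
    v = 1 if v & (1 << 63) else 0         # TST V, 1 << 63 ; CSET V, NE
    return d, c, v
```

For example, adding 1 to the largest positive value (2^127 - 1) sets V but not C, while adding 1 to the all-ones pattern (-1) sets C but not V.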
Overflow for Subtraction: (Note: this is essentially the same method I used for arithmetic narrower than 32 bits)
D, V = SUB S1, S2
// Compute the subtraction normally
// V[63] = 1 IFF D and S1 have different signs
EOR V <- D.h, S1.h // V = D.h ^ S1.h
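// T1[63] = 1 IFF S1 and ~S2 have different signs (i.e. S1 and S2 have the same sign)
EON T1 <- S1.h, S2.h // T1 = S1.h ^ (~S2.h)
// V[63] = 1 iff D and S1 have different signs
// and S1 and ~S2 have the same sign
BIC V <- V, T1 // V = V & ~T1
// Check tmp_status[n-1]
TST V, 1 << 63 // flags <- V & (1 << 63)
CSET V, NE // V = (V[63] != 0) ? 1 : 0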
------------
Overflow and carry for Multiply:
D, C, V = MUL S1, S2
---------------------- (this is just my working) ----------
<S1.h*X1+S1.l> * <S2.h*X1+S2.l> =
<<S1.h*S2.h*X2>> + <[S1.l*S2.h*X1]> + <[S1.h*X1*S2.l]> + <S1.l*S2.l>
Discard everything that occupies the lower 128 bits:
<<S1.h*S2.h*X2>> + <[S1.l*S2.h*X1]> + <[S1.h*X1*S2.l]>
-----------------------------
<S1.h*S2.h>*X2 +
<S1.l*S2.h>*X1 +
<S1.h*S2.l>*X1
--------------------------------
<[S1.h*S2.h].h*X1 + [S1.h*S2.h].l>*X2 +
<[S1.l*S2.h].h*X1 + [S1.l*S2.h].l>*X1 +
<[S1.h*S2.l].h*X1 + [S1.h*S2.l].l>*X1
--------------------------------------------------
[S1.h*S2.h].h*X3 + [S1.h*S2.h].l*X2 +
[S1.l*S2.h].h*X2 + [S1.l*S2.h].l*X1 +
[S1.h*S2.l].h*X2 + [S1.h*S2.l].l*X1
----------------------------------------------------
Discard all factors of X1 (as they land in the lower 128 bits of the result; note that this ignores any carry out of the lower 128 bits into X2)
[S1.h*S2.h].h*X3 +
[[S1.h*S2.h].l + [S1.l*S2.h].h + [S1.h*S2.l].h]*X2
Only whether any of these four terms is non-zero matters, so they can be grouped in any way for the test.
So to get the overflow flag let:
D.h = [[S1.h*S2.h].h + [S1.h*S2.h].l]
D.l = [[S1.l*S2.h].h + [S1.h*S2.l].h]
Then set it to '1' iff (D.h != 0) || (D.l != 0)
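The expansion can be checked numerically. In the Python sketch below (the name expand_product is hypothetical), each 64x64-bit partial product is split into its .h and .l halves and weighted by the matching Xi; written out in full, [S1.h*S2.h].l carries weight X2 and [S1.h*S2.h].h carries weight X3. The sum should equal the exact 256-bit product.

```python
MASK64 = (1 << 64) - 1
X1, X2, X3 = 1 << 64, 1 << 128, 1 << 192


def expand_product(s1, s2):
    """Rebuild the 256-bit product of two 128-bit values from 64-bit pieces."""
    s1_h, s1_l = s1 >> 64, s1 & MASK64
    s2_h, s2_l = s2 >> 64, s2 & MASK64
    hh = s1_h * s2_h   # <S1.h*S2.h>
    lh = s1_l * s2_h   # <S1.l*S2.h>
    hl = s1_h * s2_l   # <S1.h*S2.l>
    ll = s1_l * s2_l   # <S1.l*S2.l>
    return ((hh >> 64) * X3 + (hh & MASK64) * X2
            + (lh >> 64) * X2 + (lh & MASK64) * X1
            + (hl >> 64) * X2 + (hl & MASK64) * X1
            + ll)
```

Comparing expand_product(a, b) with a * b for a few 128-bit values confirms the weight on each term.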
------------------------------------------
SO EMIT THE FOLLOWING CODE:
UMULH D.l <- S1.l, S2.h // D.l = [S1.l*S2.h].h
UMULH D.h <- S1.h, S2.l // D.h = [S1.h*S2.l].h
ADD D.l <- D.h, D.l // D.l += D.h
UMULH D.h <- S1.h, S2.h // D.h = [S1.h*S2.h].h
MADD D.h <- S1.h, S2.h, D.h // D.h += [S1.h*S2.h].l
CMP D.l, #0 // D.l <=> 0
CSET C <- NE // C = (D.l != 0) ? 1 : 0
CMP D.h, #0 // D.h <=> 0
CSINC C <- C, XZR, EQ // C = (D.h == 0) ? C : (0+1)
MOV V <- C // V = C (they should be the same)
// Now get the lower 128-bits of the product (and store it in D.h, D.l)
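The emitted sequence can be modelled instruction by instruction in Python and compared against an exact check. This is only a sketch: the names umulh, mul_carry_sketch and overflows_exact are hypothetical, inputs are treated as unsigned 128-bit values, and every register write is masked to 64 bits.

```python
MASK64 = (1 << 64) - 1


def umulh(a, b):
    """Upper 64 bits of the 128-bit product of two 64-bit values."""
    return ((a & MASK64) * (b & MASK64)) >> 64


def mul_carry_sketch(s1, s2):
    """Model of the flag sequence above; returns C (V is a copy of C)."""
    s1_h, s1_l = s1 >> 64, s1 & MASK64
    s2_h, s2_l = s2 >> 64, s2 & MASK64
    d_l = umulh(s1_l, s2_h)                  # UMULH D.l <- S1.l, S2.h
    d_h = umulh(s1_h, s2_l)                  # UMULH D.h <- S1.h, S2.l
    d_l = (d_h + d_l) & MASK64               # ADD D.l <- D.h, D.l
    d_h = umulh(s1_h, s2_h)                  # UMULH D.h <- S1.h, S2.h
    d_h = (s1_h * s2_h + d_h) & MASK64       # MADD D.h <- S1.h, S2.h, D.h
    c = 1 if d_l != 0 else 0                 # CMP D.l, #0 ; CSET C <- NE
    c = c if d_h == 0 else 1                 # CMP D.h, #0 ; CSINC C <- C, XZR, EQ
    return c


def overflows_exact(s1, s2):
    """Reference: 1 iff the full unsigned product needs more than 128 bits."""
    return 1 if (s1 * s2) >> 128 else 0
```

Running both functions over the same inputs is a quick way to find cases that need extra handling. For instance, with S1 = 2^64 + 2 and S2 = 2^64 - 1 the product exceeds 128 bits only through a carry out of the lower 128 bits, so overflows_exact returns 1 while mul_carry_sketch returns 0.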