Flags for int<128>

I have worked out which instructions should be emitted to compute the flags for binary operations on AArch64; a similar method of implementation should hopefully work on x86-64. As such, I have included my notes here.
Notes on notation:
	The notation	<exp>	indicates exp is a 128-bit value
			[exp]	indicates exp is a 64-bit value
			<[exp]>	indicates exp is a 192-bit value
			<<exp>>	indicates exp is a 256-bit value
	exp.h indicates the higher 64 bits of the expression and exp.l the lower 64 bits of the expression
		(each should occupy its own register)

	Xi = 2^(64*i)	(i.e. it is i*64-bits worth of zeros with a one at the front)
	Ti is a temporary register (64-bits)
Note:
	Some optimisations may be possible if an argument to the instruction is an immediate
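
To make the notation above concrete, here is a minimal Python sketch of the .h/.l split and of Xi. The helper names (X, split128, join128) are made up for illustration and are not part of any real code base; Python integers stand in for the registers.

```python
MASK64 = (1 << 64) - 1

def X(i):
    """Xi = 2^(64*i): a one followed by i*64 bits of zeros."""
    return 1 << (64 * i)

def split128(value):
    """Return (h, l): the higher and lower 64 bits of a 128-bit value."""
    assert 0 <= value < X(2)
    return value >> 64, value & MASK64

def join128(h, l):
    """Rebuild <exp> from its halves: exp.h*X1 + exp.l."""
    return h * X(1) + l
```

With these, <exp> is always join128(exp.h, exp.l), which is the identity the multiply working further down relies on.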

Zero and Negative Flags:
	D, Z, N = BINOP S1, S2

	ORR Z <- D.h, D.l	// Z = D.h | D.l
	CMP Z, #0		// Z <=> 0
	CSET Z, EQ		// Z = (Z == 0) ? 1 : 0

	LSR N, D.h, 63	    // N = (D.h >> 63)  (so that N[0] = D.h[63])
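
The two sequences above can be modelled in a few lines of Python. This is only a sketch to check the logic, assuming D.h and D.l are unsigned 64-bit values; the function name zero_and_negative is hypothetical.

```python
MASK64 = (1 << 64) - 1

def zero_and_negative(d_h, d_l):
    """Model the ORR/CMP/CSET and LSR sequences on the two result halves."""
    z = 1 if (d_h | d_l) == 0 else 0   # ORR, then CMP #0, then CSET EQ
    n = (d_h & MASK64) >> 63           # LSR by 63 leaves bit 63 of D.h in bit 0
    return z, n
```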

Overflow and Carry for Add/Sub:
	D, C, V = ADD/SUB S1, S2
	// Compute the addition/subtraction normally (but ensure the add with carry/subtract with carry sets the carry flag, i.e. use ADCS/SBCS for the high halves)
	CSET C, CS	// Set to 1 if the carry flag is set

        // V[63] = 1 IFF D and S1 have different signs
	EOR V <- D.h, S1.h	// V = D.h ^ S1.h

	For ADD:
		// T1[63] = 1 IFF S1 and S2 have different signs
		EOR T1 <- S1.h, S2.h	// T1 = S1.h ^ S2.h
	For SUB:
		// T1[63] = 1 IFF S1 and -S2 have different signs
		EON T1 <- S1.h, S2.h	// T1 = S1.h ^ (~S2.h)

	// V[63] = 1 iff D and S1 have different signs
	//      and S1 and S2 (or -S2) have the same sign
	BIC V <- V, T1		// V = V & ~T1

	// Check tmp_status[n-1]
	TST V, 1 << 63		// (V & (1 << 63)) <=> 0
	CSET V, NE		// V = (V[63] != 0) ? 1 : 0
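
As a sanity check of the EOR/EON/BIC scheme above, here is a small Python model. It is a sketch under two assumptions: operands are passed as unsigned 128-bit bit patterns, and subtraction follows the AArch64 convention of computing S1 + ~S2 + 1, so C = 1 means "no borrow". The names add_sub_flags and to_signed are hypothetical.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def add_sub_flags(s1, s2, subtract=False):
    """Return (d, c, v) for a 128-bit s1 + s2 or s1 - s2."""
    operand = (~s2 & MASK128) if subtract else s2
    total = s1 + operand + (1 if subtract else 0)
    d = total & MASK128
    c = total >> 128                      # CSET C, CS

    d_h, s1_h, s2_h = d >> 64, s1 >> 64, s2 >> 64
    v = d_h ^ s1_h                        # EOR V <- D.h, S1.h
    if subtract:
        t1 = s1_h ^ (~s2_h & MASK64)      # EON T1 <- S1.h, S2.h
    else:
        t1 = s1_h ^ s2_h                  # EOR T1 <- S1.h, S2.h
    v &= ~t1 & MASK64                     # BIC V <- V, T1
    v = 1 if v & (1 << 63) else 0         # TST V, 1 << 63 then CSET V, NE
    return d, c, v

def to_signed(x):
    """Interpret a 128-bit pattern as a two's complement signed value."""
    return x - (1 << 128) if x >> 127 else x
```

Comparing v against whether the exact signed result fits in 128 bits, for a handful of boundary values, is a cheap way to confirm the sign-bit reasoning before emitting real code.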

Overflow for Subtraction: (Note: this is essentially the same method I used for arithmetic of less than 32 bits)
	D, V = SUB S1, S2
	// Compute the subtraction normally

	// V[63] = 1 IFF D and S1 have different signs
	EOR V <- D.h, S1.h	// V = D.h ^ S1.h

	// T1[63] = 1 IFF S1 and -S2 have different signs
	EON T1 <- S1.h, S2.h	// T1 = S1.h ^ (~S2.h)

	// V[63] = 1 iff D and S1 have different signs
	//      and S1 and -S2 have the same sign
	BIC V <- V, T1		// V = V & ~T1

	// Check tmp_status[n-1]
	TST V, 1 << 63		// (V & (1 << 63)) <=> 0
	CSET V, NE		// V = (V[63] != 0) ? 1 : 0
------------
Overflow and carry for Multiply:
	D, C, V = MUL S1, S2
---------------------- (this is just my working) ----------
<S1.h*X1+S1.l> * <S2.h*X1+S2.l> =
	<<S1.h*S2.h*X2>> + <[S1.l*S2.h*X1]> + <[S1.h*X1*S2.l]> + <S1.l*S2.l>
Discard everything that occupies the lower 128 bits:
	<<S1.h*S2.h*X2>> + <[S1.l*S2.h*X1]> + <[S1.h*X1*S2.l]>
-----------------------------
	<S1.h*S2.h>*X2 +
	<S1.l*S2.h>*X1 +
	<S1.h*S2.l>*X1
--------------------------------
	<[S1.h*S2.h].h*X1 + [S1.h*S2.h].l>*X2 +
	<[S1.l*S2.h].h*X1 + [S1.l*S2.h].l>*X1 +
	<[S1.h*S2.l].h*X1 + [S1.h*S2.l].l>*X1
--------------------------------------------------
	[S1.h*S2.h].h*X3 + [S1.h*S2.h].l*X2 +
	[S1.l*S2.h].h*X2 + [S1.l*S2.h].l*X1 +
	[S1.h*S2.l].h*X2 + [S1.h*S2.l].l*X1
----------------------------------------------------
Discard all factors of X1 (they only contribute to the lower 128 bits of the result, apart from a possible carry out of them, which this working ignores)

	[S1.h*S2.h].h*X3 +
	[[S1.h*S2.h].l + [S1.l*S2.h].h + [S1.h*S2.l].h]*X2

Only whether these terms are zero matters for the flag, so they can be grouped in whichever way is cheapest to compute.

So to get the overflow flag let:
	D.h = [[S1.h*S2.h].h+ [S1.h*S2.h].l]
	D.l = [[S1.l*S2.h].h + [S1.h*S2.l].h]
Then set it to '1' iff (D.h != 0) || (D.l != 0)
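
The test above can be modelled in Python and compared with an exact check. This is a sketch only: S1 and S2 are assumed to be unsigned 128-bit values, the 64-bit additions wrap as they would in a register, and the names umulh, mul_lo, partial_product_overflow and exact_overflow are hypothetical.

```python
MASK64 = (1 << 64) - 1

def umulh(a, b):
    """Higher 64 bits of the 128-bit product of two 64-bit values."""
    return (a * b) >> 64

def mul_lo(a, b):
    """Lower 64 bits of the same product."""
    return (a * b) & MASK64

def partial_product_overflow(s1, s2):
    """Model the D.h / D.l test from the working above."""
    s1_h, s1_l = s1 >> 64, s1 & MASK64
    s2_h, s2_l = s2 >> 64, s2 & MASK64
    d_l = (umulh(s1_l, s2_h) + umulh(s1_h, s2_l)) & MASK64
    d_h = (umulh(s1_h, s2_h) + mul_lo(s1_h, s2_h)) & MASK64
    return 1 if (d_h != 0 or d_l != 0) else 0

def exact_overflow(s1, s2):
    """Reference: does the unsigned product need more than 128 bits?"""
    return 1 if (s1 * s2) >> 128 else 0
```

On the inputs tried, the partial-product test never flags a product that fits in 128 bits. It can, however, miss an overflow that comes only from a carry out of the lower 128 bits: (2^64 - 1) * (2^64 + 2) is one such case, where the exact check reports 1 and the partial-product test reports 0. Whether that matters depends on how exact the flag needs to be.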
------------------------------------------
SO EMIT THE FOLLOWING CODE:
	UMULH D.l <- S1.l, S2.h		// D.l = [S1.l*S2.h].h
	UMULH D.h <- S1.h, S2.l		// D.h = [S1.h*S2.l].h
	ADD D.l <- D.h, D.l		// D.l += D.h

	UMULH D.h <- S1.h, S2.h 	// D.h = [S1.h*S2.h].h
	MADD D.h <- S1.h, S2.h, D.h	// D.h += [S1.h*S2.h].l

	CMP D.l, #0		// D.l <=> 0
	CSET C <- NE		// C = (D.l != 0) ? 1 : 0
	CMP D.h, #0		// D.h <=> 0
	CSINC C <- C, XZR, EQ	// C = (D.h == 0) ? C : (0+1)

	MOV V <- C		// V = C (they should be the same)

	// Now get the lower 128-bits of the product (and store it in D.h, D.l)