I've now figured out what was going “wrong”, thanks to the valuable comments from user555045 and fuz!
Yes, this behavior is expected: on Haswell, IMUL is issued only on Port 1, which aligns with the observed results and also matches what uiCA shows.
The root cause of the “strange” interference in the loop containing the ADD instruction wasn’t the ADD itself; it was the JNZ. On Haswell, a taken branch can only execute on Port 6, so at most one taken branch executes per cycle, and the JNZ instructions of two loops sharing a physical core cannot both execute in the same cycle. The JNZ (macro-fused with DEC) is issued on Port 6, and once the Port 6 counter is enabled in Intel PCM, we can see where the “missing” µOps are actually landing.
Here are the two loops, running simultaneously on two Hyper-Threaded logical cores that share one physical core:
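; Core 0
.loop:
    add r10, r10    ; loop-carried dependency, 1-cycle latency
    dec r8
    jnz .loop       ; macro-fuses with DEC into one µOp on Port 6
; Core 1
.loop:
    imul r11, r11   ; loop-carried dependency, Port 1 only
    dec r8
    jnz .loop       ; macro-fuses with DEC into one µOp on Port 6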
And here is the result, now including Port 6:
Time elapsed: 998 ms
Core | IPC  | Instructions | Cycles | RefCycles | PORT_0 | PORT_1 | PORT_5 | PORT_6
0    | 1.98 | 7115 M       | 3590 M | 3493 M    | 1148 M | 1944 K | 1222 M | 2371 M
1    | 1.00 | 3582 M       | 3589 M | 3492 M    | 816 K  | 1193 M | 593 K  | 1194 M
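As a rough sanity check on these numbers, the Port 6 counts can be divided by the cycle counts to see how close Port 6 comes to one µOp per cycle. This is only a minimal Python sketch: the values are copied from the table above (in millions), and the variable names are illustrative.

```python
# Counter values copied from the table above, in millions of events.
cycles = {"core0": 3590, "core1": 3589}
port6 = {"core0": 2371, "core1": 1194}

# Port 6 µOps issued per cycle by each logical core.
per_cycle = {core: port6[core] / cycles[core] for core in port6}

# Both logical cores share the same physical Port 6, so add the two rates.
combined = sum(per_cycle.values())

for core, rate in per_cycle.items():
    print(f"{core}: {rate:.2f} Port 6 uops per cycle")
print(f"combined: {combined:.2f} Port 6 uops per cycle")
```

The combined rate comes out just below one per cycle, which is consistent with Port 6 being saturated. The split also looks plausible: the IMUL chain has a 3-cycle latency, so that loop needs a Port 6 slot only about every third cycle, leaving roughly two out of three for the ADD loop.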
If I terminate the IMUL loop on Core 1 and leave only Core 0 running with ADD, then:
Core | IPC  | Instructions | Cycles | RefCycles | PORT_0 | PORT_1 | PORT_5 | PORT_6
0    | 2.85 | 10 G         | 3643 M | 3546 M    | 1132 M | 1157 M | 1175 M | 3470 M
1    | 0.81 | 55 M         | 68 M   | 67 M      | 9157 K | 8462 K | 9094 K | 6586 K
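The solo run can be cross-checked in the same spirit. The sketch below assumes that each loop iteration retires three instructions and issues exactly one fused DEC/JNZ µOp on Port 6, so the Port 6 count stands in for the iteration count; that is an assumption made for this check, not something PCM reports directly. Values are Core 0's row from the table above, in millions.

```python
# Core 0 counters from the solo run above, in millions of events.
cycles = 3643
port0, port1, port5, port6 = 1132, 1157, 1175, 3470

# Assumption: one fused DEC/JNZ µOp per iteration, all on Port 6.
iterations = port6

# Three instructions (ADD, DEC, JNZ) retire per iteration.
estimated_ipc = 3 * iterations / cycles

# If Port 6 carries only the branch µOps, the ADDs must be on the other ports.
add_uops = port0 + port1 + port5

print(f"estimated IPC: {estimated_ipc:.2f}")
print(f"ADD uops on ports 0/1/5: {add_uops} M vs iterations: {iterations} M")
```

The estimate lands close to the reported IPC of 2.85, and the µOps on Ports 0, 1 and 5 add up to almost exactly one per iteration, which fits the picture of the ADD loop running at nearly one taken branch per cycle once it has Port 6 to itself.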
This explains everything (at least for me).