
I’ve now figured out what was going “wrong”. Thanks to user555045 and fuz for the valuable comments!

Yes, this behavior is expected: on Haswell, IMUL executes only on Port 1, which matches the observed results and what uiCA shows.

The root cause of the “strange” interference in the loop containing the ADD instruction wasn’t the ADD itself; it was the JNZ. On Haswell, only one taken branch can execute per cycle, so the JNZ instructions of the two loops cannot both execute in the same cycle when the loops share one physical core. The JNZ (macro-fused with DEC) is dispatched to Port 6, and once the Port 6 counter is enabled in Intel PCM, we can see where the “missing” µOps actually land.

Here are the two loops, running simultaneously on the two logical cores (hyper-threads) of one physical core:

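; Core 0
.loop:
    add r10, r10        ; loop-carried dependency on r10
    dec r8
    jnz .loop           ; macro-fused with DEC, dispatched to Port 6

; Core 1
.loop:
    imul r11, r11       ; loop-carried dependency on r11, Port 1 only
    dec r8
    jnz .loop           ; macro-fused with DEC, dispatched to Port 6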
And here is the result, now including Port 6:

Time elapsed: 998 ms
Core | IPC | Instructions  |  Cycles  | RefCycles | PORT_0  | PORT_1  | PORT_5  | PORT_6
   0   1.98        7115 M     3590 M      3493 M    1148 M    1944 K    1222 M    2371 M
   1   1.00        3582 M     3589 M      3492 M     816 K    1193 M     593 K    1194 M
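
A quick sanity check on these numbers can be sketched in Python. The values are copied from the table above (in millions); the dictionary layout and the helper names `iterations` and `port6_utilization` are made up for illustration, and taking the larger of the two cycle counts as the physical core's cycle count is an assumption. Each loop has three instructions per iteration, so Instructions / 3 should roughly equal the Port 6 count, and the two Port 6 counts together should come close to one µOp per cycle:

```python
# Counter values copied from the PCM table above, in millions of events.
INSTRUCTIONS_PER_ITERATION = 3  # add or imul, plus dec, plus jnz

threads = {
    "add":  {"instructions": 7115, "cycles": 3590, "port6": 2371},
    "imul": {"instructions": 3582, "cycles": 3589, "port6": 1194},
}


def iterations(thread):
    """Loop iterations (in millions) implied by the retired instruction count."""
    return thread["instructions"] / INSTRUCTIONS_PER_ITERATION


def port6_utilization(all_threads):
    """Port 6 uops per cycle, summed over both hyper-threads.

    Assumption: both logical cores share one physical core, so the larger
    cycle count is used as the core's cycle count.
    """
    total_port6 = sum(t["port6"] for t in all_threads.values())
    core_cycles = max(t["cycles"] for t in all_threads.values())
    return total_port6 / core_cycles


for name, thread in threads.items():
    print(f"{name}: ~{iterations(thread):.0f} M iterations, "
          f"{thread['port6']} M Port 6 uops")

print(f"Port 6 utilization: {port6_utilization(threads):.3f} uops/cycle")
```

Under that assumption the combined Port 6 utilization comes out at about 0.99 µOps per cycle, which is consistent with Port 6 being the shared bottleneck.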

If I terminate the IMUL loop on Core 1 and leave only the ADD loop running on Core 0, then:

Core | IPC | Instructions  |  Cycles  | RefCycles | PORT_0  | PORT_1  | PORT_5  | PORT_6
   0   2.85          10 G     3643 M      3546 M    1132 M    1157 M    1175 M    3470 M
   1   0.81          55 M       68 M        67 M    9157 K    8462 K    9094 K    6586 K
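
The two IPC values for the ADD loop can also be approximated with a very simple model, sketched below in Python. It is only a back-of-the-envelope estimate, not a simulation: it assumes a 3-cycle latency for the 64-bit IMUL (so the IMUL loop needs Port 6 once every three cycles), assumes the ADD loop gets every remaining Port 6 slot, and ignores all other front-end and back-end limits. The function name `predicted_add_ipc` is hypothetical.

```python
IMUL_LATENCY_CYCLES = 3         # assumption: 64-bit IMUL latency on Haswell
INSTRUCTIONS_PER_ITERATION = 3  # add + dec + jnz
TAKEN_BRANCHES_PER_CYCLE = 1.0  # Port 6 limit, shared by both hyper-threads


def predicted_add_ipc(sibling_branch_rate):
    """Estimate the ADD loop's IPC.

    sibling_branch_rate is the number of taken branches per cycle that the
    other hyper-thread executes on Port 6. The ADD loop is assumed to use
    whatever Port 6 capacity is left, one iteration per taken branch.
    """
    add_branch_rate = TAKEN_BRANCHES_PER_CYCLE - sibling_branch_rate
    return INSTRUCTIONS_PER_ITERATION * add_branch_rate


# IMUL loop running on the sibling: one taken branch every 3 cycles.
shared = predicted_add_ipc(1 / IMUL_LATENCY_CYCLES)

# ADD loop alone on the physical core (idle sibling).
alone = predicted_add_ipc(0.0)

print(f"predicted IPC with IMUL sibling: {shared:.2f} (measured 1.98)")
print(f"predicted IPC when alone:        {alone:.2f} (measured 2.85)")
```

The model gives an upper bound of 2 instructions per cycle next to the IMUL loop and 3 when running alone, which is in line with the measured 1.98 and 2.85.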

This explains everything (at least for me).
