I got the code flow wrong earlier. The main problem turned out to be STORE-LOAD reordering: a LOAD that follows a STORE in program order can complete before that STORE is visible to the other side. Inserting a full memory barrier between the STORE and the LOAD should fix it. I’ve updated the detailed analysis at the link below (it is written in Chinese).
https://cai-fuqiang.github.io/posts/virtio-notify/
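
For readers who cannot follow the write-up, here is a rough illustration of the general pattern. It is a minimal C11 sketch and not the actual virtio or kernel code: the variables `avail_idx` and `notify_enabled` and both function names are hypothetical, chosen only to resemble a driver/device notification handshake. Each side does a STORE followed by a LOAD of the other side's variable. Without the fence, both LOADs may return stale values, so the driver skips the kick while the device goes idle. With a sequentially consistent fence on both sides, at least one of them is guaranteed to observe the other's STORE.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state, loosely modeled on a virtqueue. */
static atomic_uint avail_idx;       /* written by driver, read by device */
static atomic_bool notify_enabled;  /* written by device, read by driver */

/* Driver side: publish one new buffer, then decide whether to kick. */
bool driver_publish_needs_kick(void)
{
    /* Single writer, so a plain load + store is enough to bump the index. */
    unsigned next =
        atomic_load_explicit(&avail_idx, memory_order_relaxed) + 1;

    /* STORE: make the new index visible. */
    atomic_store_explicit(&avail_idx, next, memory_order_release);

    /* Full barrier: the LOAD below must not complete before the STORE. */
    atomic_thread_fence(memory_order_seq_cst);

    /* LOAD: does the device want a notification? */
    return atomic_load_explicit(&notify_enabled, memory_order_relaxed);
}

/* Device side: re-enable notifications, then re-check for pending work. */
bool device_enable_sees_work(unsigned last_seen)
{
    /* STORE: ask the driver to notify us from now on. */
    atomic_store_explicit(&notify_enabled, true, memory_order_relaxed);

    /* Full barrier: same STORE-LOAD ordering requirement as above. */
    atomic_thread_fence(memory_order_seq_cst);

    /* LOAD: did a buffer arrive while notifications were off? */
    return atomic_load_explicit(&avail_idx, memory_order_relaxed) != last_seen;
}
```

In Linux kernel code the equivalent of the fence would be a full barrier such as `smp_mb()`; a write-only or read-only barrier does not order a STORE against a later LOAD. STORE-LOAD is also the one reordering that x86 permits for ordinary memory, which is why this kind of bug can appear even on that architecture.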