The implementation of the __atomic_* builtins is architecture-specific.
On x86/x86_64, read-modify-write operations such as __atomic_compare_exchange compile to the same lock-prefixed instruction for every memory order. On ARM/ARM64, the generated code differs with the memory order. In my experience the strongest ordering, __ATOMIC_SEQ_CST, is the best choice. (I do not need atomic instructions that can fail!) I have a concurrency-safe lock-free list implementation that only works correctly when __atomic_compare_exchange uses __ATOMIC_SEQ_CST. With gcc you can also try -mtune=cortex-XX -mno-outline-atomics to get the best performance from the atomic code.
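To illustrate the kind of compare-exchange loop meant here, the following is a minimal sketch, not the list implementation mentioned above. The node type and the function names list_push and list_take_all are made up for this example. It uses the strong form of __atomic_compare_exchange_n (weak = false), which does not fail spuriously, with __ATOMIC_SEQ_CST for both the success and the failure ordering.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node type, for illustration only. */
struct node {
    struct node *next;
    int value;
};

/* Push n onto the front of a lock-free LIFO list. */
static void list_push(struct node **head, struct node *n)
{
    struct node *old = __atomic_load_n(head, __ATOMIC_SEQ_CST);
    do {
        /* Link the new node in front of the head we last observed. */
        n->next = old;
        /* weak = false: the exchange only fails if *head != old.
           On failure, old is updated to the current head and we retry. */
    } while (!__atomic_compare_exchange_n(head, &old, n, false,
                                          __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
}

/* Atomically detach the whole list and return its former head. */
static struct node *list_take_all(struct node **head)
{
    return __atomic_exchange_n(head, NULL, __ATOMIC_SEQ_CST);
}
```

The sketch deliberately has no single-node pop, because a naive pop on such a list runs into the ABA problem and needs extra care. On ARM64 it is worth comparing the assembly gcc emits for this loop with and without -mno-outline-atomics.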