Answering my own question (with help from the comments).
With the /O2 flag, MSVC was able to generate SSE instructions for the addition. Furthermore, the MSVC compiler also unrolled the loop. By combining these two optimisations, it was able to outperform my code by a bit (I was using AVX).
I want to give credit to the people who helped me in the comments section, @PeterCordes and @Homer512. Thank you both.
I will be reading this book for further study: "Modern X86 Assembly Language Programming: Covers x86 64-bit, AVX, AVX2, and AVX-512"