Answered by Peter Cordes
They're probably counting instruction fetch as memory access: 3 instruction fetches plus one data load and one data store. Registers aren't memory, so accessing them doesn't count. Some microarchitectures fetch machine code in wider blocks (and even decode multiple instructions per cycle in parallel), but those uarches would have an I-cache (or a unified L1 cache on older ARMs). So on a high-performance CPU there'd be 2 data-cache accesses (load + store) and one or two I-cache accesses.
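
To make the two ways of counting concrete, here's a rough sketch in Python. Everything specific in it is an assumption for illustration, not something taken from the question: the load / add / store sequence, the fixed 4-byte instruction size, the 16-byte fetch block, and the helper names `simple_model_accesses` and `icache_blocks_touched`.

```python
# Hypothetical three-instruction sequence: (text, number of data memory accesses).
# The real sequence from the question isn't shown here; this is an assumed stand-in.
INSTRUCTIONS = [
    ("load  r1, [x]", 1),    # one data load
    ("add   r1, r1, 1", 0),  # register-only: no data memory access
    ("store r1, [x]", 1),    # one data store
]


def simple_model_accesses(instructions):
    """Textbook-style count: one fetch per instruction plus each data access."""
    fetches = len(instructions)
    data = sum(n for _, n in instructions)
    return fetches + data


def icache_blocks_touched(start_addr, n_instructions, insn_bytes=4, fetch_bytes=16):
    """Number of aligned fetch blocks covered by the code.

    Assumes fixed-width instructions and an aligned fetch block; both sizes
    are illustrative defaults, not properties of any particular CPU.
    """
    end_addr = start_addr + n_instructions * insn_bytes  # one past the last byte
    first_block = start_addr // fetch_bytes
    last_block = (end_addr - 1) // fetch_bytes
    return last_block - first_block + 1


if __name__ == "__main__":
    print(simple_model_accesses(INSTRUCTIONS))  # 3 fetches + 2 data accesses = 5
    print(icache_blocks_touched(0, 3))          # 12 bytes inside one 16-byte block: 1
    print(icache_blocks_touched(8, 3))          # bytes 8..19 cross a block boundary: 2
```

Under these assumptions the simple model gives 5 accesses, while the wide-fetch model gives 1 or 2 I-cache accesses depending on where the code starts relative to a block boundary, plus the same 2 data accesses either way.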