Reports

The comment by @Retired Ninja is incorrect, as well as the relevant part in @Basile Starynkevitch's answer. Prefetching in one loop and hoping data would be cached in the following loop (further down in your source code) should be reasonable since you controlled the loop size.

We can verify that prefetching like this indeed has effect, by measuring the performance without -O2, and check the difference with or without prefetch (~900ms vs ~1000ms).

So why isn't your code faster after prefetching? The reason is overlapped execution. Prefetching reduces the latency for retrieving neigh_vals[id], but does not reduce the overall execution time, as these memory accesses are overlapped in time (the compiler knows this because you are simply summing them in res).

We can verify this statement by adding dependencies, like the following example:

#include<bits/stdc++.h>
using namespace std;
const int N=100000005, M = 100000000;
int neighbors[M], neigh_vals[N];
int main(){
    for (int i=0;i<M;++i)neighbors[i]=rand()*rand()%N;
    for (int i=0;i<N;++i)neigh_vals[i]=i*i;

    int res=0;
    int t1=clock();
    for (size_t i = 0; i < M; i++) {
        unsigned int id = neighbors[(i+res)%N];
        __builtin_prefetch(neigh_vals+neighbors[(i+1+(res^(id*id)))%N]);
        res ^= neigh_vals[id];
    }
    
    int diff = clock()-t1;
    printf("time=%d\n", diff);
    printf("res=%d\n", res);
    
    return 0;
}

Now with or without prefetch has different execution time (~22000ms vs ~28000ms).

79344491