It looks like an answer to this is at least partially explained here, gcc optimization flag -O3 makes code slower than -O2.
To summarize the results, using -O3 actually uses a different comparison/branch technique than -O2, so in this task where the comparison is occurring many times the choice of how it is done is important. Because of the predictability in the loop, it adds execution time by using this optimization.
This still does not solve the reason for going from -O1 to -O2 also, but this source is sufficient to describe the one case. Perhaps another knows the answer to this?