While push_back
is as efficient as possible (read up on "amortized complexity") it is still very slow. You could reserve
your vector to some upper bound, which speeds up the push back considerably.
This does not entirely explain why the parallel version would be slower than the sequential. Could it be that the parallel one overflows your memory and you're swapping to disc?
But really, I would try to reformulate the algorithm. Do you actually need the vector or is there a way to do the resulting computation on it without having it stored?