I believe the bottleneck you're facing is the job queue itself; I've seen many solutions online with the same problem.
First, you're using hardware_concurrency to determine the number of threads. Note that this call returns the number of logical processors (see SMT / Hyper-Threading), and it may even return 0 when the value can't be determined. If you're doing heavy computation, you should try something closer to the physical core count, or you won't see much speedup.
Also, you're using a mutex and a condition variable, which is correct but prone to frequent context switches that can hurt the scaling of your solution.
I'd see whether batching can be implemented, or try some active-waiting method (i.e. spinlocks instead of blocking locks). Also, as others suggested, reserving memory in advance can help, but std::vector already does a good job there, and memory caches are quite efficient, so the bottleneck probably isn't in the allocations.
There are also plenty of lock-free job queues. See for example LPRQ, a multi-producer/multi-consumer queue. The paper also has an artifact section from which you can get the actual implementation.
If you find that implementation too complicated, you can instead give the producer a separate lock-free buffer to each consumer (see here). That is much simpler to implement and probably scales much better than a single buffer shared between all threads (assuming the thread count is known in advance).