You call push_back and emplace_back a lot, but I don't see any calls to reserve. All threads draw memory from the same heap, so every reallocation that a growing vector triggers can send them through that shared bottleneck.
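As a rough illustration of the reserve point, here is a minimal sketch. The function name build_squares and the values it stores are made up for the example and are not taken from your code; the only thing it assumes is that you know (or can estimate) the final element count before the loop starts.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the reviewed code): builds a vector of n squares.
std::vector<int> build_squares(std::size_t n) {
    std::vector<int> out;
    out.reserve(n);  // request capacity for n elements once, up front
    for (std::size_t i = 0; i < n; ++i) {
        // size() never exceeds the reserved capacity here,
        // so these push_back calls do not reallocate
        out.push_back(static_cast<int>(i * i));
    }
    return out;
}
```

Calling build_squares(1000) performs a single allocation for the element storage instead of the repeated grow-and-copy cycles you would get without the reserve call.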
Your best bet for multithreading is to pre-allocate the buffers up front, sized as if a single thread were going to fill them. Once that's done, have each worker thread write (by reference) to its own, non-overlapping portion of the buffer. Threads are for computing, so try to eliminate every place where they need to allocate memory.
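Something along these lines is what I have in mind. It is only a sketch: fill_parallel, the i * 2 computation, and the even chunking are placeholders I invented, so substitute your own per-element work. The point is that the buffer is sized once before any thread starts, and each worker only touches its own index range, so no locking is needed and no worker allocates.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical example (not from the reviewed code): fills buffer[i] = i * 2
// using num_threads workers, each writing to a disjoint slice.
std::vector<int> fill_parallel(std::size_t n, unsigned num_threads) {
    std::vector<int> buffer(n);  // sized once, before any worker exists
    if (num_threads == 0) num_threads = 1;
    const std::size_t chunk = (n + num_threads - 1) / num_threads;  // ceiling division

    std::vector<std::thread> workers;
    workers.reserve(num_threads);  // avoid reallocating the thread list as well
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        // The buffer is captured by reference; [begin, end) is unique to this worker.
        workers.emplace_back([&buffer, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                buffer[i] = static_cast<int>(i) * 2;
            }
        });
    }
    for (auto& w : workers) w.join();  // the buffer is not resized while workers run
    return buffer;
}
```

For example, fill_parallel(10, 3) gives the three workers the index ranges [0, 4), [4, 8) and [8, 10). Writing to different elements of the same std::vector from different threads is fine as long as nobody resizes it in the meantime (std::vector<bool> is the exception).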