My previous solution (memory access) was bad. I checked: the mystery of fast JIT sorting is solved.
It's not about memory access — C++ handles that very well.
The key is function inlining. C++ does inline functions, but not recursive ones. JIT inlines recursive functions for specific cases — e.g., for 5 million elements.
As an example: Java code where the warm-up calls sort(), but for a much smaller number than 5 million, and the 5-million-element sort happens only once.
The result: worse 20%-25% than C++. Exactly the same as when I copied the class DualPivotQuicksort.java into my project for testing and wondered why it was no longer that fast.
To prove it: test Java code: call many times sort for small array length (for example 100) and once call sort for 5 millions.