The specific Intel compiler version should support "static stealing". To enable it, use schedule(runtime) with the parallel do directive, like so:
!$omp parallel do schedule(runtime)
Then set OMP_SCHEDULE=static_steal as an environment variable before you start the application, e.g., for bash-like shells:
export OMP_SCHEDULE=static_steal
or set it for a single invocation only, without changing the shell's environment:
OMP_SCHEDULE=static_steal ./my-application
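Because schedule(runtime) defers the choice to the environment, you can compare schedules without recompiling. Here is a minimal sketch for bash-like shells that uses the per-invocation form above. The helper name run_with_schedules is hypothetical, and the list of schedule kinds is only an assumption about what you might want to try:

```shell
# Hypothetical helper: runs the given command once per schedule kind,
# setting OMP_SCHEDULE only for that single invocation.
run_with_schedules() {
    for sched in static dynamic guided static_steal; do
        echo "== OMP_SCHEDULE=$sched =="
        # The assignment prefix applies to this one command only.
        OMP_SCHEDULE="$sched" "$@"
    done
}
```

You would call it as run_with_schedules ./my-application and compare the run times of the four runs.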
With static_steal, the loop iterations are first partitioned statically among the threads, but a thread that runs out of work can steal remaining iterations from other threads. Does that solve your problem?