I have run into the same problem: I'm seeing training logs very similar to yours when using a multi-discrete action space, but the evaluation results are poor. Did you ever find a solution?