I just solved the problem. I had mistakenly set critic_loss to be

    critic_loss: Tensor = torch.mean(
        F.mse_loss(
            self.critic(cur_observations),
            advantages.detach(),  # notice this line
        )
    )
but it should be

    critic_loss: Tensor = torch.mean(
        F.mse_loss(
            self.critic(cur_observations),
            td_target.detach(),  # notice this line
        )
    )
After correcting the loss expression, the agent converged to the safer path after 2000 episodes.
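A minimal self-contained sketch of the corrected loss (names like `critic`, `next_observations`, `gamma`, and the tensor shapes are assumptions for illustration, not the original code): the critic regresses V(s) toward the TD target r + gamma * V(s'), while the advantage td_target - V(s) is only used for the actor's policy-gradient term.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
critic = torch.nn.Linear(4, 1)        # stand-in value network V(s) (assumed)
cur_observations = torch.randn(8, 4)  # batch of states s_t (assumed shape)
next_observations = torch.randn(8, 4) # batch of states s_{t+1}
rewards = torch.randn(8, 1)
dones = torch.zeros(8, 1)             # episode-termination flags
gamma = 0.99

with torch.no_grad():
    # TD target: r + gamma * V(s') for non-terminal transitions
    td_target = rewards + gamma * critic(next_observations) * (1 - dones)
    # Advantage estimate: for the actor only, NOT the critic's target
    advantages = td_target - critic(cur_observations)

# Corrected critic loss: value estimate vs. the detached TD target
critic_loss = F.mse_loss(critic(cur_observations), td_target.detach())
```

Note that `F.mse_loss` already averages over the batch by default, so the outer `torch.mean` in the original snippet is redundant (though harmless).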
==== strategy ====
> > v
^ > v
^ x ^