79394257

Date: 2025-01-28 15:04:20
Score: 1
Natty:
Report link

I just solved the problem: I had mistakenly set the critic loss to

critic_loss: Tensor = torch.mean(
    F.mse_loss(
        self.critic(cur_observations),  # V(s) predicted by the critic
        advantages.detach(),  # notice this line: the wrong regression target
    )
)

but the critic should regress toward the TD target instead:

critic_loss: Tensor = torch.mean(
    F.mse_loss(
        self.critic(cur_observations),  # V(s) predicted by the critic
        td_target.detach(),  # notice this line: the correct regression target
    )
)
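
For reference, here is a minimal, self-contained sketch of how these tensors typically fit together in a one-step (TD(0)) actor-critic update. The network shape, gamma, and the rewards / next_observations / dones names are illustrative assumptions, not taken from my original code:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor

# Illustrative shapes and hyperparameters (assumptions).
obs_dim, batch_size, gamma = 4, 32, 0.99
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

# Dummy batch standing in for real transitions (assumption).
cur_observations = torch.randn(batch_size, obs_dim)
next_observations = torch.randn(batch_size, obs_dim)
rewards = torch.randn(batch_size, 1)
dones = torch.zeros(batch_size, 1)  # 1.0 where the episode terminated

with torch.no_grad():
    # TD(0) target: r + gamma * V(s'), cut off at terminal states.
    td_target: Tensor = rewards + gamma * critic(next_observations) * (1.0 - dones)

values: Tensor = critic(cur_observations)

# The advantage (here just the TD error) measures how much better the
# transition turned out than the critic expected. It feeds the actor's
# policy-gradient loss; it is not what V(s) should converge to.
advantages: Tensor = td_target - values.detach()

# The critic regresses V(s) toward the TD target. (F.mse_loss already
# averages over the batch by default, so the outer torch.mean in the
# snippets above is a harmless no-op.)
critic_loss: Tensor = F.mse_loss(values, td_target)

Intuitively, the buggy version asks the critic to satisfy V(s) ≈ td_target − V(s), whose only self-consistent solution is V(s) ≈ td_target / 2, so the value estimates, and therefore the advantages fed to the actor, stay systematically biased.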

After correcting the loss expression, the agent converged to the safer path after 2000 episodes.

[image: reward and loss curves]

==== strategy ====
>  >  v  
^  >  v  
^  x  ^  
Reasons:
  • Probably link only (1):
  • Long answer (-0.5):
  • Has code block (-0.5):
  • Self-answer (0.5):
  • Low reputation (0.5):
Posted by: Eric Monlye