I just solved the problem. I had mistakenly set critic_loss to be

    critic_loss: Tensor = torch.mean(
        F.mse_loss(
            self.critic(cur_observations),
            advantages.detach(),  # notice this line
        )
    )
but it should be

    critic_loss: Tensor = torch.mean(
        F.mse_loss(
            self.critic(cur_observations),
            td_target.detach(),  # notice this line
        )
    )
After correcting the loss expression, the agent converged to the safer path after 2000 episodes.
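A minimal self-contained sketch of the corrected loss (names like `critic`, `next_observations`, `gamma`, and the tensor shapes are assumptions for illustration, not the original code): the critic regresses V(s) toward the TD target r + gamma * V(s'), while the advantage td_target - V(s) is only used for the actor's policy-gradient term.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
critic = torch.nn.Linear(4, 1)        # stand-in value network V(s) (assumed)
cur_observations = torch.randn(8, 4)  # batch of states s_t (assumed shape)
next_observations = torch.randn(8, 4) # batch of states s_{t+1}
rewards = torch.randn(8, 1)
dones = torch.zeros(8, 1)             # episode-termination flags
gamma = 0.99

with torch.no_grad():
    # TD target: r + gamma * V(s') for non-terminal transitions
    td_target = rewards + gamma * critic(next_observations) * (1 - dones)
    # Advantage estimate: for the actor only, NOT the critic's target
    advantages = td_target - critic(cur_observations)

# Corrected critic loss: value estimate vs. the detached TD target
critic_loss = F.mse_loss(critic(cur_observations), td_target.detach())
```

Note that `F.mse_loss` already averages over the batch by default, so the outer `torch.mean` in the original snippet is redundant (though harmless).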
==== strategy ====
> > v
^ > v
^ x ^