Pretty silly of me, but I think the problem is that the net was learning every step, instead of accumulating it's decisions and then learn them. I overlooked that.