It learns via bootstrapping. It looks at one frame (one timestep; ignore for a moment that we pass 4 frames per timestep), and if that frame has a high value, say because the ball was just struck, it concludes that whatever state led to it at timestep k-1 should also have a higher value (discounted by gamma). Applying the same reasoning recursively, it then assigns a higher value to the earlier timestep that led to k-1, namely k-2.
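As a rough sketch of that one-step bootstrap (the toy tabular setup and names here are my own, not from any particular DQN implementation, which would use a network instead of a table):

```python
import numpy as np

GAMMA = 0.99   # discount factor
ALPHA = 0.1    # learning rate

# Toy tabular Q-table: Q[state, action]; a DQN replaces this with a neural net.
Q = np.zeros((10, 4))

def td_update(Q, s, a, r, s_next, done):
    """Pull Q(s, a) toward the bootstrap target r + gamma * max_a' Q(s_next, a')."""
    target = r if done else r + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (target - Q[s, a])
```

Each update nudges the value of the preceding state-action pair toward the (discounted) value of what came next, which is exactly the "k-1 inherits value from k" step described above.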
In this sense, any situation in which the ball looks likely to "get stuck", because it is heading for the narrow hole on the left, will have a higher Q-value. Extend that logic and, over many, many iterations of such bootstrapping, the state-action pairs that lead to such outcomes acquire higher values, since they lead to the situations that lead to the situations ... that ended up with high rewards.
In short: frames/situations just before scoring a point increase in value first, and that increase slowly "creeps back" towards the earlier states/decisions that led there over many, many iterations. Situations that lead to many good outcomes, such as "almost about to get stuck", will therefore also have a higher value, and consequently situations that lead to "about to pass through the hole and get stuck" will increase in value as well.
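To make the "creeping back" concrete, here is a minimal toy of my own construction (state values on a deterministic chain rather than Q-values, to keep it short): only the last transition is rewarded, yet repeated sweeps of the same bootstrap update raise the values of earlier and earlier states.

```python
import numpy as np

GAMMA, ALPHA = 0.99, 0.1
N = 6                      # states 0..5 form a chain; only the final step gives reward
V = np.zeros(N)            # state values (Q-values collapse to this with a single action)

for sweep in range(200):   # replay the same trajectory many times
    for i in range(N - 1):
        reward = 1.0 if i == N - 2 else 0.0
        # bootstrap: pull V[i] toward reward + discounted value of the next state
        V[i] += ALPHA * (reward + GAMMA * V[i + 1] - V[i])

print(np.round(V, 3))      # early states end up with high (discounted) values too
```

In the first few sweeps only the state right before the reward gains value; after many sweeps the early states in the chain have high values as well, which is the same propagation that eventually rewards "heading for the hole" states in the game.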