Inside the `get_best_action` method there are the following lines of code:

```python
qvalues = [self.get_qvalue(state, action) for action in possible_actions]
max_qvalue = max(qvalues)
best_actions = [action for action, qvalue in enumerate(qvalues) if qvalue == max_qvalue]
return random.choice(best_actions)
```
`possible_actions` is a list of numbers, each one being an action that is currently available; it does not necessarily contain every action. Later you `enumerate` the `qvalues` and assume that the index of the current `qvalue` is the action itself, and this is the cause of the bug.
Say, for example, that `possible_actions` is only `[0, 3]`. We would then calculate their Q-values, resulting in a list with two elements, say `[q_value_for_action_0, q_value_for_action_3]` (I am just giving names to the Q-values for simplicity). Finally, upon `enumerate`-ing the `qvalues` we get two pairs in the format `(action, qvalue)`, but since we use `enumerate`, the `action`s will always be `0`, `1`, `2`, etc., while the `qvalue` will in this case be `q_value_for_action_0` and then `q_value_for_action_3`. The problem is that `q_value_for_action_3` gets matched against index `1` (as per the `enumerate` operation), but it actually belongs to action `3`!
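Running the pairing in isolation makes this easy to see (the Q-values below are just made-up numbers):

```python
possible_actions = [0, 3]
qvalues = [0.5, 0.9]   # pretend these are q_value_for_action_0 and q_value_for_action_3

# enumerate pairs each Q-value with its list index, not with the real action:
print(list(enumerate(qvalues)))   # [(0, 0.5), (1, 0.9)] -- action 3 is reported as "1"
```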
To fix this, just replace `enumerate(qvalues)` with `zip(possible_actions, qvalues)` inside the method.
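For reference, here is a minimal sketch of how the corrected method could look. I am keeping your `get_qvalue` interface and assuming `possible_actions` is obtained the same way as in your current code; `get_legal_actions` below is just a placeholder for whatever helper the agent actually uses:

```python
import random

def get_best_action(self, state):
    # Placeholder name: however the agent currently obtains the available actions.
    possible_actions = self.get_legal_actions(state)
    if not possible_actions:
        return None

    qvalues = [self.get_qvalue(state, action) for action in possible_actions]
    max_qvalue = max(qvalues)
    # zip pairs every Q-value with the actual action it was computed for,
    # instead of with its index in the qvalues list.
    best_actions = [action
                    for action, qvalue in zip(possible_actions, qvalues)
                    if qvalue == max_qvalue]
    return random.choice(best_actions)
```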
When updating the agent (i.e. inside the `update` method) you calculate the next action with:

```python
next_action = self.get_action(next_state)
```
As far as I know, in Q-learning we need here the maximum reward estimate that can be obtained from the next state. If this is true, we must use `get_best_action` here instead of `get_action`, since the latter will (at some point) return a random action based on the `epsilon` parameter, and we don't want that, because the randomly chosen action may not be the best one. Depending on `epsilon`, sometimes we do get the best action, but other times we get a random action that does not coincide with the best one, and this corrupts the calculation that follows.
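To make the point concrete, here is a small sketch of how I would compute the bootstrap target for the next state; `gamma` stands for whatever your discount factor is called, and the helper itself is only illustrative, not part of your code:

```python
def next_state_target(agent, reward, next_state, gamma):
    # TD target for Q-learning: r + gamma * max_a' Q(s', a').
    # The greedy (best) action is used for the bootstrap value,
    # never an epsilon-greedy sample from get_action.
    best_next_action = agent.get_best_action(next_state)
    if best_next_action is None:  # e.g. terminal state with no legal actions
        return reward
    return reward + gamma * agent.get_qvalue(next_state, best_next_action)
```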
To be honest, I observed an improvement from fixing the hidden bug stated initially, but no noticeable improvement from replacing `self.get_action(next_state)` with `self.get_best_action(next_state)` inside the agent's `update` method. Still, I think this is indeed a bug, at least according to my understanding of the topic. Also, it doesn't really seem to me that you are properly implementing the Q-learning updates, but I will go through that in a later section.
I increased the number of episodes, because 2k was never enough for the player to reach the exit. After fixing the above bugs and setting the episodes to 20k, I started seeing the player explore greater portions of the map, with a better distribution of visited locations, and actually make it to the exit in about 240 (~1%) of the episodes. Of course, at the same time I had also configured `epsilon_decay` to `0.9998` in order to reach the minimum `epsilon` within roughly the first 15k episodes, since with the original `0.995` it was decaying too fast and I thought this would harm exploration. I also used half the learning rate (i.e. `alpha = 0.05`), because `0.1` seems a bit too big, taking into account that a greater portion of the episodes now has a larger `epsilon` (i.e. more random moves initially).
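For intuition on those decay numbers, here is a quick back-of-the-envelope check, assuming a multiplicative decay applied once per episode and a starting `epsilon` of 1.0 (which may differ from your actual initial value):

```python
# epsilon after N episodes of multiplicative decay, starting from 1.0
print(1.0 * 0.995 ** 2_000)    # ~4.4e-05: with the original decay, exploration is
                               # essentially gone long before the run ends
print(1.0 * 0.9998 ** 15_000)  # ~0.05: the slower decay keeps some exploration
                               # alive for roughly the first 15k episodes
```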
I understand, though, that we have to fix bugs first and only then move on to hyperparameter tuning, so I am not blaming you for anything here; I am just describing how I started to observe the desired results.
Where are you implementing the Q-learning agent's updates from? Please give me a reference so that I know what's going on with the agent's update (i.e. whether it needs fixes, or it's just my lack of knowledge). There is something called `weights` in the agent which seems to be the Q-learning matrix or something similar, but I don't know what it is, so I just assumed it works fine and moved on with the previous bugs. I also don't know why you are combining your `weights` with features via a dot product.
I am saying this because, as far as I know, traditional Q-learning does not have a concept of weights like neural networks do, nor does it encode features this way. Instead, features are implemented in the state of the agent [1]. For example, your features are already the location of the player and the location of the monster. If you needed more features, you would encode them in the state as well, at least as far as my understanding and knowledge go, so please tell me if I am wrong (citing references would be even more exciting, so that I can proofread your agent's updates).
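Just to make explicit what I think I am looking at versus what I had in mind (this is purely my guess about your code, not your actual implementation):

```python
# My guess at what your get_qvalue does: a linear combination (dot product)
# of hand-crafted features and learned weights.
def qvalue_linear(weights, features):
    return sum(w * f for w, f in zip(weights, features))

# What I mean by "traditional" Q-learning: the features (e.g. player and
# monster locations) are simply part of the state itself, and the Q-value
# is a plain lookup keyed by the state-action pair.
def qvalue_tabular(qtable, state, action):
    return qtable.get((state, action), 0.0)
```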
I insist on all of these things in this section because, as far as I know, Q-learning requires storing a Q-value for each state-action pair (for example in a matrix) and then updating them with the Bellman equation (as seen in Wikipedia here). I have implemented Q-learning this way in the past and got acceptable results.
In fact, I also tried this on your environment and got the player to reach the exit in more than 10k (>50%) of the episodes at times. Specifically, I commented out `get_features`, updated the `get_qvalue` method to read from the Q matrix, and updated the `update` method to apply the Bellman equation and store the new Q-value back in the matrix. I will post the full code for this at some point later, summarizing the whole answer.
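Until then, here is a rough sketch of the shape my changes took; the `qtable` dictionary, the `discount` value and the `possible_next_actions` argument are placeholders for whatever the real agent and environment provide, not your original API:

```python
from collections import defaultdict

class TabularQAgent:
    def __init__(self, alpha=0.05, discount=0.99):
        self.alpha = alpha                  # learning rate
        self.discount = discount            # gamma (placeholder value)
        self.qtable = defaultdict(float)    # Q-value per (state, action), default 0.0

    def get_qvalue(self, state, action):
        # Plain table lookup instead of a weights/features dot product.
        return self.qtable[(state, action)]

    def update(self, state, action, reward, next_state, possible_next_actions):
        # Q-learning / Bellman update:
        # Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))
        best_next = max((self.get_qvalue(next_state, a) for a in possible_next_actions),
                        default=0.0)
        old = self.get_qvalue(state, action)
        self.qtable[(state, action)] = (
            (1 - self.alpha) * old + self.alpha * (reward + self.discount * best_next)
        )
```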