RL with dropout uncertainty demo

Deep reinforcement learning demo with two behavioural policies: epsilon-greedy (green) and Thompson sampling using dropout uncertainty (blue). The agents (blue and green discs) are rewarded for eating red things and for walking straight, and penalised for eating yellow things and for walking into walls. Both agents move at random for the first 3000 moves (the red-shaded region in the graph). The $X$ axis of the plot shows the number of batches divided by 500, on a log scale, and the $Y$ axis shows the average reward. (The demo seems to run fastest in Chrome.)
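The demo itself runs in the browser in JavaScript; the following is only a minimal NumPy sketch of how the two behavioural policies select actions from a Q-network. The network sizes, epsilon value and dropout rate below are illustrative assumptions, not the demo's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy Q-network: one hidden layer, with dropout applied to the hidden units.
STATE_DIM, HIDDEN, N_ACTIONS = 8, 32, 5
W1 = rng.normal(scale=0.1, size=(STATE_DIM, HIDDEN))
W2 = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))
DROPOUT_P = 0.2   # probability of dropping a hidden unit (assumed value)
EPSILON = 0.1     # exploration rate for the epsilon-greedy agent (assumed value)


def q_values(state, dropout_mask=None):
    """Forward pass; if a dropout mask is given it is applied to the hidden layer."""
    h = np.maximum(0.0, state @ W1)       # ReLU hidden layer
    if dropout_mask is None:
        h = h * (1.0 - DROPOUT_P)         # deterministic pass with test-time scaling
    else:
        h = h * dropout_mask              # one stochastic forward pass
    return h @ W2


def act_epsilon_greedy(state):
    """Green agent: random action with probability epsilon, otherwise greedy."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q_values(state)))


def act_thompson_dropout(state):
    """Blue agent: sample one dropout mask, i.e. one network from the
    approximate posterior, and act greedily with respect to it."""
    mask = (rng.random(HIDDEN) >= DROPOUT_P).astype(float)
    return int(np.argmax(q_values(state, dropout_mask=mask)))


state = rng.normal(size=STATE_DIM)
print(act_epsilon_greedy(state), act_thompson_dropout(state))
```

The key difference is that the epsilon-greedy agent explores by occasionally acting uniformly at random, whereas the Thompson-sampling agent explores by drawing a fresh dropout mask per decision and acting greedily under that sampled network.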

These are the settings used with the networks:

Maze Game

In this demo, we use another, much simpler maze game: a 2D agent with perfect control over its movement.

- There are seven states, including A, B, and C.
- There are two actions: left and right.
- The reward at each state is stochastic, so its expected value is uncertain to the agent: $E[r(B)] = 0.5 \times 5 + 0.5 \times 0 = 2.5$ and $E[r(C)] = 0.5 \times 2 + 0.5 \times 0 = 1$.
- The game resets whenever the agent reaches a leaf state.

A quick numeric check of these expected rewards is sketched after this list.
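The sketch below simply verifies the reward arithmetic quoted above by sampling; the state names and probabilities come from the list, while everything else about the maze layout is left unspecified here.

```python
import numpy as np

rng = np.random.default_rng(0)


def reward(state):
    """Reward at B is 5 with probability 0.5 and 0 otherwise;
    reward at C is 2 with probability 0.5 and 0 otherwise."""
    if state == "B":
        return 5.0 if rng.random() < 0.5 else 0.0
    if state == "C":
        return 2.0 if rng.random() < 0.5 else 0.0
    return 0.0


samples = 100_000
print(np.mean([reward("B") for _ in range(samples)]))  # approx. 2.5
print(np.mean([reward("C") for _ in range(samples)]))  # approx. 1.0
```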

These are the settings for the maze game used with the networks: