Deterministic Policy Gradient Algorithms
by Sungwon Lyu
WHY?
Policy gradient usually requires an integral over all possible actions, which becomes expensive or intractable in continuous action spaces.
WHAT?
The purpose of reinforcement learning is to learn a policy that maximizes the objective function. Policy gradient methods directly train the policy network to maximize this objective.
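In the paper's notation, the objective is the expected reward under the discounted state distribution $\rho^\pi$ induced by the policy:

$$J(\pi_\theta)=\mathbb{E}_{s\sim\rho^\pi,\,a\sim\pi_\theta}\left[r(s,a)\right]$$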

Stochastic Policy Gradient
Since the policy gradient theorem assumes a stochastic policy, this approach is called the stochastic policy gradient. If a sampled return is used to estimate the action-value function, the algorithm is called REINFORCE.
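The stochastic policy gradient theorem states:

$$\nabla_\theta J(\pi_\theta)=\mathbb{E}_{s\sim\rho^\pi,\,a\sim\pi_\theta}\left[\nabla_\theta\log\pi_\theta(a|s)\,Q^\pi(s,a)\right]$$

REINFORCE estimates $Q^\pi(s,a)$ with a sampled return.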
Stochastic Actor-Critic
We can train another network, the critic, to estimate the action-value function directly by TD learning.
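The idea can be sketched as follows: a tabular critic updated by TD(0) and a softmax actor updated with the critic's estimate. The toy 2-state, 2-action MDP, learning rates, and step counts below are illustrative assumptions, not from the paper.

```python
import numpy as np

# Minimal stochastic actor-critic sketch on a toy MDP (assumed for
# illustration): action 0 in state 0 yields reward 1, everything else 0.
rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
theta = np.zeros((n_states, n_actions))   # actor: softmax preferences
Q = np.zeros((n_states, n_actions))       # critic: action-value table
alpha_actor, alpha_critic, gamma = 0.05, 0.1, 0.9

def policy(s):
    prefs = theta[s] - theta[s].max()     # stabilized softmax
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    r = 1.0 if (s == 0 and a == 0) else 0.0
    return rng.integers(n_states), r      # next state is random

s = 0
for _ in range(5000):
    a = rng.choice(n_actions, p=policy(s))
    s2, r = step(s, a)
    a2 = rng.choice(n_actions, p=policy(s2))
    # Critic: TD(0) update toward the bootstrapped target.
    td_error = r + gamma * Q[s2, a2] - Q[s, a]
    Q[s, a] += alpha_critic * td_error
    # Actor: grad of log softmax, scaled by the critic's estimate.
    grad_log = -policy(s)
    grad_log[a] += 1.0
    theta[s] += alpha_actor * grad_log * Q[s, a]
    s = s2

print(policy(0))  # action 0 should dominate in state 0
```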
Off-policy Actor-Critic (OffPAC)
On-policy learning limits exploration. Off-policy learning uses different policies for behaving and for evaluating. This Off-Policy Actor-Critic (OffPAC) requires importance sampling to correct for the mismatch between the two policies.
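With a behaviour policy $\beta(a|s)$, the off-policy gradient weights each update by the ratio between the target and behaviour policies:

$$\nabla_\theta J_\beta(\pi_\theta)\approx\mathbb{E}_{s\sim\rho^\beta,\,a\sim\beta}\left[\frac{\pi_\theta(a|s)}{\beta(a|s)}\,\nabla_\theta\log\pi_\theta(a|s)\,Q^\pi(s,a)\right]$$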
Deterministic Policy Gradient
In a continuous action space, integrating over the whole action space is intractable. The deterministic policy gradient instead uses a deterministic policy $\mu_\theta(s)$ in place of a stochastic policy $\pi_\theta(a|s)$, and moves the policy parameters in the direction of the gradient of Q. The paper shows that the deterministic policy gradient is a limiting special case of the stochastic policy gradient.
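The deterministic policy gradient theorem:

$$\nabla_\theta J(\mu_\theta)=\mathbb{E}_{s\sim\rho^\mu}\left[\nabla_\theta\mu_\theta(s)\,\nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)}\right]$$

Note that the expectation is taken only over states, not over actions.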

Off-Policy Deterministic Actor-Critic (OPDAC)
As with the stochastic policy gradient, off-policy learning is required to ensure adequate exploration. We can use Q-learning to train the critic. The deterministic policy removes the need for an integral over actions, and Q-learning removes the need for importance sampling.
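These updates can be sketched on a toy one-step problem (with $\gamma=0$, the Q-learning target reduces to the immediate reward). The environment, the quadratic-in-action features, and the learning rates below are illustrative assumptions, not from the paper.

```python
import numpy as np

# Off-policy deterministic actor-critic sketch: a noisy behaviour policy
# explores, while the deterministic actor mu(x) = theta * x is improved
# by following grad_a Q at a = mu(x). Reward is -(a - 2x)^2 (assumed),
# so the optimal gain is theta* = 2.
rng = np.random.default_rng(1)

theta = 0.0         # deterministic actor: mu(x) = theta * x
w = np.zeros(4)     # linear critic: Q(x, a) = w . phi(x, a)
lr_actor, lr_critic = 0.01, 0.01

def mu(x):
    return theta * x

def phi(x, a):
    # Features quadratic in the action, so grad_a Q is well defined.
    return np.array([a * x, a * a, x * x, 1.0])

for _ in range(50000):
    x = rng.uniform(-1.0, 1.0)
    # Behaviour policy: target action plus exploration noise (off-policy).
    a = mu(x) + rng.normal(0.0, 0.5)
    r = -(a - 2.0 * x) ** 2
    # Critic: Q-learning step toward the (gamma = 0) target r.
    delta = r - w @ phi(x, a)
    w += lr_critic * delta * phi(x, a)
    # Actor: chain rule, grad_theta mu(x) * grad_a Q(x, a) at a = mu(x).
    grad_a_Q = w[0] * x + 2.0 * w[1] * mu(x)
    theta += lr_actor * x * grad_a_Q

print(theta)  # should approach the optimal gain of 2
```

No importance weights appear anywhere, and the actor update needs no integral over actions, which is the point of this section.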
Compatible Off-Policy Deterministic Actor-Critic (COPDAC)
Since a function approximator may not follow the true gradient, the paper suggests two conditions for a compatible action-value function:

1. $\nabla_a Q^w(s,a)\big|_{a=\mu_\theta(s)}=\nabla_\theta\mu_\theta(s)^\top w$
2. $w$ minimizes the mean-squared error of $$\epsilon(s;\theta,w)=\nabla_a Q^w(s,a)\big|_{a=\mu_\theta(s)}-\nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}$$
The resulting algorithm is called Compatible Off-Policy Deterministic Actor-Critic (COPDAC). A baseline function can be used to reduce the variance of the gradient estimator. If gradient Q-learning is used for the critic, the algorithm is called COPDAC-GQ.
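One convenient compatible form given in the paper combines a state-dependent baseline $V^v(s)$ with an advantage term that is linear in $a-\mu_\theta(s)$:

$$Q^w(s,a)=\big(a-\mu_\theta(s)\big)^\top\nabla_\theta\mu_\theta(s)^\top w+V^v(s)$$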
Critic
A great review of policy gradient algorithms.
Silver, David, et al. “Deterministic policy gradient algorithms.” ICML. 2014.