Deterministic Policy Gradient Algorithms


Policy gradient usually requires integral over all the possible actions.


The purpose of reinforcement learning is to learn the policy to maximize the objective function. J(\pi_{\theta}) = \int_{\mathcal{S}}\rho^{\pi}(s)\int_{\mathcal{A}}\pi_{\theta}(s,a)r(s,a)da ds \\ = \mathbb{E}_{s\sim\rho^{\pi}, a\sim \pi_{\theta}}[r(s,a)] Policy gradient directly train the policy network to minimize the objective function.

  • Stochastic Policy Gradient
    \nabla J(\pi_{\theta}) = \int_{\mathcal{S}}\rho^{\pi}(s)\int_{\mathcal{A}} \nabla_{theta}\pi_{\theta}(a|s)Q^{\pi}(s,a)da ds \\ = \mathbb{E}_{s\sim\rho^{\pi}, a\sim \pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(a|s)Q^{\pi}(s,a)] Since this assumes stochastic policy, this is called Stochastic Policy Gradient. If a sample return is used to estimate the action-value function, it is called REINFORCE algorithm.

  • Stochastic Actor-Critic
    We can train another network to directly learn the value of action-value function by td learning. \nabla J(\pi_{\theta}) = = \mathbb{E}_{s\sim\rho^{\pi}, a\sim \pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(a|s)Q^{w}(s,a)]\\ \epsilon^2(w) = \mathbb{E}_{s\sim\rho^{\pi}, a\sim \pi_{\theta}}[(Q^w(s,a) - Q^{\pi}(s,a))^2]

  • Off-policy Actor-Critic (OffPAC)
    On-policy learning has limitation in exploration. Off-policy learning use different policies to behave and to evaluate. J_{\beta}(\pi_{\theta}) = \int_{\mathcal{S}}\int_{\mathcal{A}}\rho^{\beta}(s)\pi_{\theta}(a|s)Q^{\pi}(s,a)da ds\\ \nabla_{\theta} J_{\beta}(\pi_{\theta}) = \int_{\mathcal{S}}\int_{\mathcal{A}}\rho^{\beta}(s)\nabla_{\theta}\pi_{\theta}(a|s)Q^{\pi}(s,a)da ds\\ = \mathbb{E}_{s\sim\rho^{\beta}, a\sim \beta}[\frac{\pi_{\theta}(a|s)}{\beta_{\theta}(a|s)}\nabla_{\theta} \log \pi_{\theta}(a|s)Q^{\pi}(s,a)] This Off-Policy Actor-Critic(OffPAC)require importance sampling.

  • Deterministic Policy Gradient
    In continuous action space, integral over all the action space is intractable. Deterministic policy gradient uses the deterministic policy \mu_{\theta}(s) instead of \pi_{\theta}(a|s). And then, move the policy in the direction of the gradient of Q. This deterministic policy gradient is a special form of stochastic policy gradient.
    \theta^{k+1} = \theta^k + \alpha \mathcal{E}_{s\sim\rho^{\mu^k}}[\nabla_{\theta}\mu_{\theta}(s)\nabla_{\a}Q^{\mu^k}(s,a)|_{a=\mu_{\theta}(s)}]\\ J(\mu_{\theta}) = \int_{\mathcal{S}}\rho^{\mu}(s)r(s, \mu_{\theta}(s))d s \\ \nabla J(\mu_{\theta}) = = \mathbb{E}_{s\sim\rho^{\mu}}[\nabla_{theta} \mu_{\theta}(s) \nabla_{a}Q^{\mu}(s,a)|_{a=\mu_{\theta}(s)}]

  • Off-Policy Deterministic Actor-Critic (OPDAC)
    As in the case of stochastic policy gradient, off-policy is required to ensure adequate exploration. We can use Q-learning to train the critic. J_{\beta}(\mu_{\theta}) = \int_{\mathcal{S}}\rho^{\beta}(s)Q^{\mu}(s,mu_{\theta}(s))d s \\ \nabla_{\theta} J_{\beta}(\mu_{\theta}) = \mathbb{E}_{s\sim\rho^{\beta}}[\nabla_{theta} \mu_{\theta}(s)\nabla_{a}Q^{\mu}(s,a)|_{a=\mu_{\theta}(s)}]\\ \delta_t = r_t + \gamma Q^w(s_{t+1}, \mu_{\theta}(s_{t+1})) - Q^w(s_t, a_t)\\ w_{t+1} = w_t + \alpha_w\delta_t\nabla_w Q^w(s_t, a_t)\\ \theta_{t+1} = \theta_t + \alpha_{\theta}\nabla_{\theta}\mu_{\theta}(s_t)\nabla_a Q^w(s_t, a_t)|{a=\mu_{\theta}(s)} Deterministic policy removes the need for integral of actions and Q-learning removes the need for importance sampling.

  • Compatible Off-Policy Deterministic Actor-Critic (COPDAC)
    Since function approximator Q^w(s,a) may not follow the true gradient, this paper suggest two restriction for the compatible action-value function.

    1. \nabla_a Q^w(s,a)|_{a=\mu_{\theta}(s)}=\nabla_{\theta}\mu_{\theta}(s)^T w
    2. w minimize MSE of $$\epsilon(s;\theta,w)=\nabla_a Q^w(s,a){a=\mu{\theta}(s)} - \nabla_a Q^{\mu}(s,a){a=\mu{\theta}(s)}$$\

The resulting algorithm is called compatible off-policy deterministic actor-critic(COPDAC). We can use baseline function to reduce the variance of gradient estimator. If we use gradient Q-learning for critic, the algorithm is called COPDAC-GQ.


Great reviews of policy gradient algorithms.

Silver, David, et al. “Deterministic policy gradient algorithms.” ICML. 2014.

© 2017. by isme2n

Powered by aiden