# Deterministic Policy Gradient Algorithms

## WHY?

The stochastic policy gradient usually requires integrating over all possible actions, which is expensive in continuous action spaces.

## WHAT?

The goal of reinforcement learning is to learn a policy that maximizes the objective function, where `\rho^{\pi}(s)` is the discounted state distribution under the policy.

`J(\pi_{\theta}) = \int_{\mathcal{S}}\rho^{\pi}(s)\int_{\mathcal{A}}\pi_{\theta}(a|s)r(s,a)da ds = \mathbb{E}_{s\sim\rho^{\pi}, a\sim \pi_{\theta}}[r(s,a)]`

Policy gradient methods directly train the policy network to maximize the objective function.

Stochastic Policy Gradient

`\nabla_{\theta} J(\pi_{\theta}) = \int_{\mathcal{S}}\rho^{\pi}(s)\int_{\mathcal{A}} \nabla_{\theta}\pi_{\theta}(a|s)Q^{\pi}(s,a)da ds = \mathbb{E}_{s\sim\rho^{\pi}, a\sim \pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(a|s)Q^{\pi}(s,a)]`
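As a rough illustration of the expectation form of this gradient, the sketch below estimates it by Monte Carlo for a one-dimensional Gaussian policy. The toy reward `-(a - 2s)^2`, the linear policy mean `theta * s`, the single fixed state, and all function names are assumptions made for this example and do not come from the paper. Because the toy problem has a single step, the sampled reward stands in for `Q^{\pi}(s,a)`.

```python
import random


def reward(s, a):
    # Hypothetical one-step reward, maximized at a = 2 * s.
    return -(a - 2.0 * s) ** 2


def score(theta, sigma, s, a):
    # d/dtheta log N(a; theta * s, sigma^2) = (a - theta * s) * s / sigma^2
    return (a - theta * s) * s / sigma ** 2


def spg_estimate(theta, sigma, s, n_samples, seed=0):
    """Monte Carlo estimate of E_a[grad_theta log pi(a|s) * r(s, a)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        a = rng.gauss(theta * s, sigma)  # sample an action from the policy
        total += score(theta, sigma, s, a) * reward(s, a)
    return total / n_samples


# For this toy problem the gradient is known in closed form:
# J(theta) = -((theta - 2)^2 * s^2 + sigma^2), so dJ/dtheta = -2 (theta - 2) s^2.
estimate = spg_estimate(theta=0.0, sigma=0.5, s=1.0, n_samples=100_000)
exact = -2.0 * (0.0 - 2.0) * 1.0 ** 2
```

The estimate is noisy because every sample needs an action drawn from the policy, so many samples are required before it settles near the closed-form value.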

Because this gradient assumes a stochastic policy, it is called the stochastic policy gradient. If a sample return is used to estimate the action-value function, the result is the REINFORCE algorithm.

Stochastic Actor-Critic

We can train another network, the critic `Q^{w}(s,a)`, to approximate the action-value function by TD learning. The critic parameters `w` are chosen to minimize the mean squared error `\epsilon^2(w)`.

`\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{s\sim\rho^{\pi}, a\sim \pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(a|s)Q^{w}(s,a)]`

`\epsilon^2(w) = \mathbb{E}_{s\sim\rho^{\pi}, a\sim \pi_{\theta}}[(Q^w(s,a) - Q^{\pi}(s,a))^2]`

Off-policy Actor-Critic (OffPAC)

On-policy learning limits exploration. Off-policy learning uses different policies to behave and to evaluate: trajectories are sampled from a behavior policy `\beta(a|s)` while the target policy `\pi_{\theta}` is improved.

`J_{\beta}(\pi_{\theta}) = \int_{\mathcal{S}}\int_{\mathcal{A}}\rho^{\beta}(s)\pi_{\theta}(a|s)Q^{\pi}(s,a)da ds`

`\nabla_{\theta} J_{\beta}(\pi_{\theta}) \approx \int_{\mathcal{S}}\int_{\mathcal{A}}\rho^{\beta}(s)\nabla_{\theta}\pi_{\theta}(a|s)Q^{\pi}(s,a)da ds = \mathbb{E}_{s\sim\rho^{\beta}, a\sim \beta}[\frac{\pi_{\theta}(a|s)}{\beta(a|s)}\nabla_{\theta} \log \pi_{\theta}(a|s)Q^{\pi}(s,a)]`

The approximation drops a term that depends on `\nabla_{\theta}Q^{\pi}(s,a)`.
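To make the role of the ratio `\pi_{\theta}(a|s)/\beta(a|s)` concrete, here is a minimal sketch for a single state with three discrete actions. The softmax policy, the behavior probabilities, and the action values are hypothetical numbers chosen for illustration. The expectation under `\beta` is computed exactly by enumerating actions rather than by sampling, so the two gradients can be compared directly.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_pi(pi, a):
    # For a softmax policy, d log pi(a) / d theta_j = 1[a == j] - pi[j].
    return [(1.0 if a == j else 0.0) - pi[j] for j in range(len(pi))]


def direct_gradient(pi, q):
    # sum_a grad pi(a) * Q(a), using grad pi(a) = pi(a) * grad log pi(a).
    n = len(pi)
    grad = [0.0] * n
    for a in range(n):
        g = grad_log_pi(pi, a)
        for j in range(n):
            grad[j] += pi[a] * g[j] * q[a]
    return grad


def offpolicy_gradient(pi, beta, q):
    # E_{a ~ beta}[pi(a) / beta(a) * grad log pi(a) * Q(a)], by enumeration.
    n = len(pi)
    grad = [0.0] * n
    for a in range(n):
        ratio = pi[a] / beta[a]  # importance weight
        g = grad_log_pi(pi, a)
        for j in range(n):
            grad[j] += beta[a] * ratio * g[j] * q[a]
    return grad


pi = softmax([0.2, -0.1, 0.5])  # target policy (hypothetical logits)
beta = [0.5, 0.3, 0.2]          # behavior policy, nonzero wherever pi is
q = [1.0, 2.0, 0.5]             # hypothetical action values
on_policy = direct_gradient(pi, q)
off_policy = offpolicy_gradient(pi, beta, q)
```

In practice the expectation would be replaced by an average over actions sampled from `beta`, and the ratio can have high variance when `beta` rarely picks actions that `pi` prefers.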

Off-Policy Actor-Critic (OffPAC) therefore requires importance sampling to correct for drawing actions from `\beta` instead of `\pi_{\theta}`.

Deterministic Policy Gradient

In a continuous action space, integrating over the whole action space is generally intractable. The deterministic policy gradient uses a deterministic policy `\mu_{\theta}(s)` instead of `\pi_{\theta}(a|s)` and moves the policy parameters in the direction of the gradient of Q. The deterministic policy gradient is a limiting case of the stochastic policy gradient as the policy variance tends to zero.

`\theta^{k+1} = \theta^k + \alpha \mathbb{E}_{s\sim\rho^{\mu^k}}[\nabla_{\theta}\mu_{\theta}(s)\nabla_{a}Q^{\mu^k}(s,a)|_{a=\mu_{\theta}(s)}]`

`J(\mu_{\theta}) = \int_{\mathcal{S}}\rho^{\mu}(s)r(s, \mu_{\theta}(s))d s`

`\nabla_{\theta} J(\mu_{\theta}) = \mathbb{E}_{s\sim\rho^{\mu}}[\nabla_{\theta} \mu_{\theta}(s) \nabla_{a}Q^{\mu}(s,a)|_{a=\mu_{\theta}(s)}]`
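The chain-rule form of the deterministic policy gradient can be sketched on a toy problem. Everything specific here is an assumption for illustration: a scalar linear policy `mu(theta, s) = theta * s`, a known one-step action value `Q(s,a) = -(a - 2s)^2`, and states drawn uniformly from `[-1, 1]` independently of the policy. With these choices the optimal parameter is `theta = 2`.

```python
import random


def mu(theta, s):
    # Deterministic policy: one action per state, no sampling over actions.
    return theta * s


def grad_a_q(s, a):
    # Gradient of the assumed Q(s, a) = -(a - 2s)^2 with respect to a.
    return -2.0 * (a - 2.0 * s)


def dpg_estimate(theta, states):
    """Average of grad_theta mu(s) * grad_a Q(s, a) at a = mu(s)."""
    total = 0.0
    for s in states:
        a = mu(theta, s)
        total += s * grad_a_q(s, a)  # grad_theta mu_theta(s) = s
    return total / len(states)


def train(theta=0.0, alpha=0.1, steps=200, batch=32, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        states = [rng.uniform(-1.0, 1.0) for _ in range(batch)]
        theta += alpha * dpg_estimate(theta, states)  # gradient ascent
    return theta


theta_final = train()
```

Only states are sampled here. The action is a function of the state, so the estimator needs no integral or sampling over the action space.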

Off-Policy Deterministic Actor-Critic (OPDAC)

As with the stochastic policy gradient, off-policy learning is needed to ensure adequate exploration, since a deterministic policy does not explore by itself. We can use Q-learning to train the critic.

`J_{\beta}(\mu_{\theta}) = \int_{\mathcal{S}}\rho^{\beta}(s)Q^{\mu}(s,\mu_{\theta}(s))d s`

`\nabla_{\theta} J_{\beta}(\mu_{\theta}) \approx \mathbb{E}_{s\sim\rho^{\beta}}[\nabla_{\theta} \mu_{\theta}(s)\nabla_{a}Q^{\mu}(s,a)|_{a=\mu_{\theta}(s)}]`

`\delta_t = r_t + \gamma Q^w(s_{t+1}, \mu_{\theta}(s_{t+1})) - Q^w(s_t, a_t)`

`w_{t+1} = w_t + \alpha_w\delta_t\nabla_w Q^w(s_t, a_t)`

`\theta_{t+1} = \theta_t + \alpha_{\theta}\nabla_{\theta}\mu_{\theta}(s_t)\nabla_a Q^w(s_t, a_t)|_{a=\mu_{\theta}(s)}`
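The three update rules can be written as a single step function. This is a minimal sketch under stated assumptions: a scalar state and action, a linear policy `mu(theta, s) = theta * s`, and a hypothetical quadratic feature vector for the critic. None of these choices come from the paper. The actor uses the critic weights from before the update, which is one reasonable reading of the equations.

```python
def features(s, a):
    # Hypothetical critic features: Q^w(s, a) = w . [s*a, a*a, s*s]
    return [s * a, a * a, s * s]


def q_value(w, s, a):
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))


def grad_a_q(w, s, a):
    # d/da of w[0]*s*a + w[1]*a^2 + w[2]*s^2
    return w[0] * s + 2.0 * w[1] * a


def mu(theta, s):
    return theta * s


def opdac_step(theta, w, s, a, r, s_next, gamma, alpha_w, alpha_theta):
    """One critic and actor update from an off-policy transition."""
    # The TD target bootstraps with the target policy's action at s_next,
    # not with the behavior action, so no importance weight appears.
    delta = r + gamma * q_value(w, s_next, mu(theta, s_next)) - q_value(w, s, a)
    # For a linear critic, grad_w Q^w(s, a) is the feature vector.
    new_w = [wi + alpha_w * delta * fi for wi, fi in zip(w, features(s, a))]
    # Actor: grad_theta mu(s) = s, with grad_a Q^w evaluated at a = mu(s).
    new_theta = theta + alpha_theta * s * grad_a_q(w, s, mu(theta, s))
    return new_theta, new_w, delta


theta, w, delta = opdac_step(
    theta=0.5, w=[1.0, -0.5, 0.0],
    s=1.0, a=2.0, r=1.0, s_next=0.5,
    gamma=0.9, alpha_w=0.1, alpha_theta=0.1,
)
```

The behavior action `a` only enters through `Q^w(s_t, a_t)` in the TD error, so it could come from any exploratory policy, for example the current `mu` plus noise.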

A deterministic policy removes the need to integrate over actions, and Q-learning removes the need for importance sampling.

Compatible Off-Policy Deterministic Actor-Critic (COPDAC)

Since the function approximator `Q^w(s,a)` may not follow the true gradient, the paper gives two conditions for a compatible action-value function. First, its action gradient at the policy's action is linear in `w`:

`\nabla_a Q^w(s,a)|_{a=\mu_{\theta}(s)}=\nabla_{\theta}\mu_{\theta}(s)^T w`

Second, `w` minimizes the mean squared error `MSE(\theta, w) = \mathbb{E}[\epsilon(s;\theta,w)^T\epsilon(s;\theta,w)]`, where

`\epsilon(s;\theta,w)=\nabla_a Q^w(s,a)|_{a=\mu_{\theta}(s)} - \nabla_a Q^{\mu}(s,a)|_{a=\mu_{\theta}(s)}`

The resulting algorithm is called compatible off-policy deterministic actor-critic (COPDAC). A baseline function can be used to reduce the variance of the gradient estimator. If gradient Q-learning is used for the critic, the algorithm is called COPDAC-GQ.
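One approximator that satisfies the first compatibility condition is the form used in the paper, `Q^w(s,a) = (a - \mu_{\theta}(s))^T \nabla_{\theta}\mu_{\theta}(s)^T w + V^v(s)`, where `V^v(s)` is a state-dependent baseline. The sketch below checks the condition numerically for a scalar action. The linear policy over features `phi(s)`, the linear baseline, and all numeric values are assumptions made for this example.

```python
def phi(s):
    # Hypothetical state features; the policy is mu_theta(s) = theta . phi(s),
    # so grad_theta mu_theta(s) = phi(s).
    return [1.0, s, s * s]


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def mu(theta, s):
    return dot(theta, phi(s))


def compatible_q(w, v, theta, s, a):
    # Q^w(s, a) = (a - mu_theta(s)) * (grad_theta mu_theta(s) . w) + V^v(s)
    advantage = (a - mu(theta, s)) * dot(phi(s), w)
    baseline = dot(phi(s), v)  # assumed linear baseline V^v(s) = v . phi(s)
    return advantage + baseline


def grad_a_fd(w, v, theta, s, a, eps=1e-4):
    # Central finite difference of Q^w with respect to the action.
    upper = compatible_q(w, v, theta, s, a + eps)
    lower = compatible_q(w, v, theta, s, a - eps)
    return (upper - lower) / (2.0 * eps)


theta = [0.1, -0.3, 0.2]
w = [0.5, 1.0, -2.0]
v = [0.3, 0.0, 0.7]
s = 0.8
a0 = mu(theta, s)                    # evaluate at the policy's own action
lhs = grad_a_fd(w, v, theta, s, a0)  # grad_a Q^w(s, a) at a = mu_theta(s)
rhs = dot(phi(s), w)                 # grad_theta mu_theta(s)^T w
```

The baseline term does not depend on the action, so it leaves the action gradient unchanged and `lhs` matches `rhs` up to floating-point error.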

## Critic

The paper also serves as a great review of policy gradient algorithms.

Silver, David, et al. “Deterministic policy gradient algorithms.” ICML. 2014.