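| Initialize critic networks $Q_{\theta_1}$, $Q_{\theta_2}$ and actor network $\pi_\phi$ with random parameters $\theta_1$, $\theta_2$, $\phi$ |
| Initialize target networks $\theta'_1 \leftarrow \theta_1$, $\theta'_2 \leftarrow \theta_2$, $\phi' \leftarrow \phi$ |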
| Initialize replay buffer B |
| for t = 1 to T do |
| Select action with exploration noise $a \sim \pi_\phi(s) + \epsilon$, | (1) |
| $\epsilon \sim \mathcal{N}(0, \sigma)$, and observe reward $r$ and new state $s'$ | (2) |
| Store transition tuple (s, a, r, s′) in B | (3) |
| Sample mini-batch of N transitions (s, a, r, s′) from B | (4) |
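| $\tilde{a} \leftarrow \pi_{\phi'}(s') + \epsilon$, $\epsilon \sim \operatorname{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c)$ | (5) |
| $y \leftarrow r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', \tilde{a})$ | (6) |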
| Update critics $\theta_i \leftarrow \operatorname{argmin}_{\theta_i} N^{-1} \sum (y - Q_{\theta_i}(s, a))^2$ | (7) |
| if t mod d = 0 then |
| Update $\phi$ by the deterministic policy gradient: | (8) |
| $\nabla_\phi J(\phi) = N^{-1} \sum \nabla_a Q_{\theta_1}(s, a)\big\vert_{a=\pi_\phi(s)} \nabla_\phi \pi_\phi(s)$ | (9) |
| Update target networks: |
| $\theta'_i \leftarrow \tau \theta_i + (1 - \tau) \theta'_i$ | (10) |
| $\phi' \leftarrow \tau \phi + (1 - \tau) \phi'$ |
| end if |
| end for |
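
The three steps of the listing that differ most from a plain actor-critic update, namely the smoothed target action (5), the clipped double-Q target (6), and the delayed soft target update (10), can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: the function names (`smoothed_target_action`, `td3_target`, `polyak_update`, `is_delayed_update_step`) are hypothetical, the actor and critics are assumed to be ordinary callables over lists of floats rather than neural networks, and the "every d-th iteration" reading of the `t mod d` test is an assumption. Like the listing, the target omits any terminal-state mask.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
Actor = Callable[[Vector], Vector]            # pi(s) -> a
Critic = Callable[[Vector, Vector], float]    # Q(s, a) -> scalar


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def smoothed_target_action(target_actor: Actor, next_state: Vector,
                           sigma_tilde: float, c: float,
                           rng: random.Random) -> Vector:
    """Step (5): target-policy action plus clipped Gaussian noise."""
    action = target_actor(next_state)
    return [a + clip(rng.gauss(0.0, sigma_tilde), -c, c) for a in action]


def td3_target(reward: float, next_state: Vector, target_actor: Actor,
               target_critics: Sequence[Critic], gamma: float,
               sigma_tilde: float, c: float, rng: random.Random) -> float:
    """Step (6): y = r + gamma * min_i Q_target_i(s', a_tilde)."""
    a_tilde = smoothed_target_action(target_actor, next_state,
                                     sigma_tilde, c, rng)
    q_min = min(q(next_state, a_tilde) for q in target_critics)
    return reward + gamma * q_min


def polyak_update(target_params: Vector, params: Vector,
                  tau: float) -> Vector:
    """Step (10): theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * p + (1.0 - tau) * tp
            for p, tp in zip(params, target_params)]


def is_delayed_update_step(t: int, d: int) -> bool:
    """Assumed reading of 't mod d': update on every d-th iteration."""
    return t % d == 0
```

Taking the minimum over both target critics is what makes the target pessimistic, and clipping the noise to [−c, c] keeps the smoothed action close to the target policy's own output. In a real implementation the same arithmetic would be applied batch-wise to network parameters and tensors.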