Initialize critic networks Qθ1, Qθ2 and actor network πϕ with random parameters θ1, θ2, ϕ
Initialize target networks θ′1 ← θ1, θ′2 ← θ2, ϕ′ ← ϕ
Initialize replay buffer B
for t = 1 to T do
    Select action with exploration noise a ~ πϕ(s) + ε, ε ~ N(0, σ), and observe reward r and new state s′
    Store transition tuple (s, a, r, s′) in B
    Sample mini-batch of N transitions (s, a, r, s′) from B
    ã ← πϕ′(s′) + ε,  ε ~ clip(N(0, σ̃), −c, c)
    y ← r + γ min_{i=1,2} Qθ′i(s′, ã)
    Update critics θi ← argmin_{θi} N⁻¹ Σ (y − Qθi(s, a))²
    if t mod d = 0 then
        Update ϕ by the deterministic policy gradient:
        ∇ϕ J(ϕ) = N⁻¹ Σ ∇a Qθ1(s, a)|a=πϕ(s) ∇ϕ πϕ(s)
        Update target networks:
        θ′i ← τ θi + (1 − τ) θ′i
        ϕ′ ← τ ϕ + (1 − τ) ϕ′
    end if
end for
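The two TD3-specific steps above — the smoothed, clipped double-Q target (ã and y) and the Polyak target-network update — can be sketched as plain NumPy functions. This is a minimal sketch, not the full training loop: `pi_target`, `q1_target`, and `q2_target` are toy stand-ins for the target actor and critic networks, and the names `td3_target` and `polyak_update` are illustrative, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def pi_target(s):
    # Toy stand-in for the target actor pi_phi' (a real implementation
    # would be a neural network); maps a batch of states to actions.
    return np.tanh(s.sum(axis=1, keepdims=True))

def q1_target(s, a):
    # Toy stand-in for target critic Q_theta1'.
    return s.sum(axis=1, keepdims=True) + a

def q2_target(s, a):
    # Toy stand-in for target critic Q_theta2'.
    return s.sum(axis=1, keepdims=True) - 0.1 * a

def td3_target(r, s_next, gamma=0.99, sigma=0.2, c=0.5):
    """Target policy smoothing + clipped double-Q target."""
    # a~ <- pi_phi'(s') + eps,  eps ~ clip(N(0, sigma~), -c, c)
    eps = np.clip(rng.normal(0.0, sigma, size=(len(s_next), 1)), -c, c)
    a_tilde = pi_target(s_next) + eps
    # y <- r + gamma * min_{i=1,2} Q_theta_i'(s', a~)
    q_min = np.minimum(q1_target(s_next, a_tilde),
                       q2_target(s_next, a_tilde))
    return r + gamma * q_min

def polyak_update(online, target, tau=0.005):
    # theta' <- tau * theta + (1 - tau) * theta'
    return tau * online + (1 - tau) * target
```

Taking the minimum over the two target critics is what suppresses the overestimation bias of a single critic, and the clipped noise on ã smooths the value estimate over nearby actions; the small τ in `polyak_update` keeps the targets slowly moving.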