Table 1. TD3 algorithm

Initialize critic networks Qθ1, Qθ2 and actor network πϕ with random parameters θ1, θ2, ϕ
Initialize target networks θ′1 ← θ1, θ′2 ← θ2, ϕ′ ← ϕ
Initialize replay buffer B
for t = 1 to T do
 Select action with exploration noise a ~ πϕ(s) + ε, ε ~ N(0, σ), and observe reward r and new state s′
 Store transition tuple (s, a, r, s′) in B
 Sample mini-batch of N transitions (s, a, r, s′) from B
 ã ← πϕ′(s′) + ε, ε ~ clip(N(0, σ̃), −c, c)
 y ← r + γ min_{i=1,2} Qθ′i(s′, ã)
 Update critics θi ← argmin_θi N⁻¹ Σ (y − Qθi(s, a))²
if t mod d then
  Update ϕ by the deterministic policy gradient:
  ∇ϕ J(ϕ) = N⁻¹ Σ ∇a Qθ1(s, a)|a=πϕ(s) ∇ϕ πϕ(s)
  Update target networks:
  θ′i ← τθi + (1 − τ)θ′i
  ϕ′ ← τϕ + (1 − τ)ϕ′
end if
end for
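
As a concrete illustration of Table 1, below is a minimal PyTorch sketch of one TD3 update step. It is a sketch under assumed settings, not the implementation used in this work: the state/action dimensions, 256-unit MLPs, tanh-squashed actor, Adam optimizers with learning rate 3e-4, and the hyperparameter values (σ̃, c, γ, τ, d) are placeholders chosen for illustration. It mirrors the pseudocode: clipped-noise target actions ã, the clipped double-Q target y, an update of both critics toward y, and delayed actor and target-network updates every d steps.

```python
import copy
import torch
import torch.nn as nn

# Toy dimensions and hyperparameters (illustrative values, not from the paper)
state_dim, action_dim, max_action = 3, 1, 1.0
gamma, tau, sigma_tilde, c, d = 0.99, 0.005, 0.2, 0.5, 2

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))

# Critic networks Q_theta1, Q_theta2, actor pi_phi, and their target copies
q1, q2 = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)
actor = nn.Sequential(mlp(state_dim, action_dim), nn.Tanh())
q1_t, q2_t, actor_t = copy.deepcopy(q1), copy.deepcopy(q2), copy.deepcopy(actor)
critic_opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=3e-4)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def td3_update(batch, t):
    s, a, r, s2 = batch  # mini-batch of N transitions (s, a, r, s')
    with torch.no_grad():
        # Target policy smoothing: a_tilde = pi_phi'(s') + clipped noise
        noise = (torch.randn_like(a) * sigma_tilde).clamp(-c, c)
        a_tilde = (actor_t(s2) * max_action + noise).clamp(-max_action, max_action)
        # Clipped double-Q target: y = r + gamma * min_i Q_theta'_i(s', a_tilde)
        sa_tilde = torch.cat([s2, a_tilde], dim=1)
        y = r + gamma * torch.min(q1_t(sa_tilde), q2_t(sa_tilde))
    # Update both critics toward the shared target y
    sa = torch.cat([s, a], dim=1)
    critic_loss = ((q1(sa) - y) ** 2).mean() + ((q2(sa) - y) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    # Delayed policy and target-network updates every d steps
    if t % d == 0:
        # Deterministic policy gradient through Q_theta1 only
        actor_loss = -q1(torch.cat([s, actor(s) * max_action], dim=1)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        # Soft target updates: theta' <- tau*theta + (1 - tau)*theta'
        for net, net_t in ((q1, q1_t), (q2, q2_t), (actor, actor_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)

# Usage example with a random mini-batch of N = 4 transitions
batch = (torch.randn(4, state_dim), torch.rand(4, action_dim) * 2 - 1,
         torch.randn(4, 1), torch.randn(4, state_dim))
td3_update(batch, t=2)
```

Note that the actor is updated through Qθ1 only and less frequently than the critics; this is the delayed-update mechanism described by the "if t mod d" block in Table 1.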