Do Differentiable Simulators Give Better Policy Gradients?

"Do Differentiable Simulators Give Better Policy Gradients?" by H.J. Terry Suh, Max Simchowitz, Kaiqing Zhang, Russ Tedrake

Abstract:

Differentiable simulators promise faster computation time for reinforcement learning by replacing zeroth-order gradient estimates of a stochastic objective with an estimate based on first-order gradients. However, it is yet unclear what factors decide the performance of the two estimators on complex landscapes that involve long-horizon planning and control on physical systems, despite the crucial relevance of this question for the utility of differentiable simulators. We show that characteristics of certain physical systems, such as stiffness or discontinuities, may compromise the efficacy of the first-order estimator, and analyze this phenomenon through the lens of bias and variance. We additionally propose an α-order gradient estimator, with α ∈ [0, 1], which correctly utilizes exact gradients to combine the efficiency of first-order estimates with the robustness of zero-order methods. We demonstrate the pitfalls of traditional estimators and the advantages of the α-order estimator on some numerical examples.
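To make the interpolation in the abstract concrete, here is a minimal Python sketch (not the authors' implementation) of an α-order estimator for a Gaussian-smoothed objective F(x) = E_{w ~ N(0, σ²I)}[f(x + w)]. It assumes access to both function values `f` and exact gradients `grad_f` (the role a differentiable simulator would play); the function names, default parameters, and baseline trick are illustrative assumptions, not from the paper.

```python
import numpy as np

def alpha_order_gradient(f, grad_f, x, alpha=0.5, sigma=0.1,
                         n_samples=100, rng=None):
    """Interpolated Monte-Carlo estimate of the gradient of the
    Gaussian-smoothed objective F(x) = E_{w ~ N(0, sigma^2 I)}[f(x + w)].

    alpha = 1 recovers the first-order (pathwise) estimator, which averages
    exact gradients at perturbed points; alpha = 0 recovers the zeroth-order
    (score-function) estimator, which uses only function values; values in
    between blend the two. (Illustrative sketch, not the paper's code.)
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    fx = f(x)  # baseline: standard variance reduction for the zeroth-order term
    first, zeroth = [], []
    for _ in range(n_samples):
        w = sigma * rng.standard_normal(x.shape)
        # First-order term: exact gradient at the perturbed point.
        first.append(grad_f(x + w))
        # Zeroth-order term: function-value difference reweighted by the
        # Gaussian score; unbiased for grad F(x) using no gradients of f.
        zeroth.append((f(x + w) - fx) * w / sigma**2)
    return alpha * np.mean(first, axis=0) + (1.0 - alpha) * np.mean(zeroth, axis=0)

# Toy check on a smooth quadratic, where both endpoints agree in expectation:
f = lambda x: 0.5 * float(np.sum(x**2))
grad_f = lambda x: x
print(alpha_order_gradient(f, grad_f, np.array([1.0, -2.0]), alpha=0.7))
```

Both terms estimate the same smoothed gradient when f is well behaved; the paper's point is that on stiff or discontinuous physical systems the α = 1 end can become unreliable, while mixing in the zeroth-order term recovers robustness at some cost in variance.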