Improving Policy Gradient via Parameterised Reward

Abstract

While policy optimisation methods have achieved remarkable success on challenging reinforcement learning problems, the need to engineer a structured reward function to guide optimisation and poor sample efficiency continue to make policy optimisation difficult. We propose a novel gradient-based method, Taming Reward (TARE), that uses implicit differentiation to learn a parameterised reward which facilitates policy optimisation on the true objective. TARE extends previous reward learning approaches with a more flexible scheme that computes the gradient of the extrinsic rewards achieved by the updated policy with respect to the reward parameters without differentiating through the policy optimisation process, and it can be combined with arbitrary policy gradient methods. Building on this reward learning framework, we propose a novel reward parameterisation architecture based on an LSTM and multiplicative interactions, which fully exploits the flexible gradient estimation and enables efficient reward learning and policy optimisation. We apply our approach to recent state-of-the-art on-policy and off-policy policy optimisation methods and evaluate it on standard continuous control environments. Our approach consistently improves sample efficiency during training and yields higher asymptotic performance across a variety of challenging reinforcement learning tasks.
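To make the reward parameterisation described above concrete, the sketch below shows one plausible way to combine an LSTM over the trajectory with a multiplicative interaction between history and action features. This is not the authors' released implementation; the module names, hidden sizes, and exact fusion scheme are assumptions for illustration only.

```python
# Hypothetical sketch of a parameterised reward network (not TARE's official code):
# an LSTM summarises the (state, action) history, and its features are fused with
# the current action via an element-wise (multiplicative) interaction.
import torch
import torch.nn as nn

class ParameterisedReward(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        # LSTM over concatenated (state, action) pairs along the trajectory.
        self.lstm = nn.LSTM(state_dim + action_dim, hidden_dim, batch_first=True)
        # Projections whose element-wise product forms the multiplicative interaction.
        self.hist_proj = nn.Linear(hidden_dim, hidden_dim)
        self.act_proj = nn.Linear(action_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, states, actions):
        # states: (batch, T, state_dim); actions: (batch, T, action_dim)
        x = torch.cat([states, actions], dim=-1)
        h, _ = self.lstm(x)                                  # (batch, T, hidden_dim)
        interaction = self.hist_proj(h) * self.act_proj(actions)
        return self.head(torch.tanh(interaction)).squeeze(-1)  # per-step learned reward
```

In such a design, the learned per-step reward would be added to (or shaped around) the environment reward during policy updates, while its parameters are adjusted via the implicit gradient of the extrinsic return achieved by the updated policy.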
