Policy gradient methods are widely used in Reinforcement Learning. In this post we build policy gradient from the ground up, starting with the easier static scenario, where we maximize a reward function $r$ that depends solely on our control variable $x$. In subsequent posts, we will turn to the contextual bandit setting, where the reward also depends on a "state" that evolves. Finally, we will turn to the "full-blown" Reinforcement Learning scenario, where the state evolves endogenously, as a function of the control variable.
Discussion