Policy gradient for black-box optimization

Policy gradient method are widely used in the Reinforcement Learning settings. In this post we build policy gradient from the ground up, starting from the easier static scenario first, where we maximize a reward function {r} depending solely on our control variable {x}. In subsequent posts, we will turn our attention to the contextual bandit setting, where the reward also depends on a β€œstate” that evolves. Finally, we will turn to the β€œfull-blown” Reinforcement Learning scenario, where state evolves endogenously, as a function of the control variable.

Blogpost Link