Introduction to Reinforcement Learning

Terminology & Notation
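
Standard notation used throughout these notes:

  • $s_t$ – state at time $t$; $o_t$ – observation; $a_t$ – action
  • $\pi_\theta(a_t|s_t)$ – policy with parameters $\theta$ (fully observed case); $\pi_\theta(a_t|o_t)$ – policy under partial observability
  • $r(s_t, a_t)$ – reward; $p(s_{t+1}|s_t, a_t)$ – transition (dynamics) probabilities
  • $\tau = (s_1, a_1, \ldots, s_T, a_T)$ – a trajectory, with distribution $p_\theta(\tau)$ under policy $\pi_\theta$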

Definitions

MDP
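
Briefly, the standard definition: a Markov decision process is a tuple $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, r\}$ consisting of a state space $\mathcal{S}$, an action space $\mathcal{A}$, transition probabilities $\mathcal{T}$ given by $p(s_{t+1}|s_t, a_t)$, and a reward function $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$.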

POMDP
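
A partially observed MDP adds an observation space and an emission model: $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{E}, r\}$, where the agent receives observations $o_t \sim p(o_t|s_t)$ instead of the state $s_t$, so the policy must condition on observations (or on the history of observations).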

The Goal of Reinforcement Learning
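
The objective below is an expectation under the trajectory distribution induced by the policy and the dynamics:

$$ p_\theta(\tau) = p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^T \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t) $$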

The policy and the dynamics together define a Markov chain on the augmented state $(s,a)$:

$$ p((s_{t+1}, a_{t+1})| (s_t, a_t)) = p(s_{t+1}|s_t, a_t) \pi_\theta(a_{t+1}|s_{t+1}) $$
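
Written out for one step, this chain propagates the state-action marginal as

$$ p_\theta(s_{t+1}, a_{t+1}) = \sum_{s_t, a_t} \pi_\theta(a_{t+1}|s_{t+1})\, p(s_{t+1}|s_t, a_t)\, p_\theta(s_t, a_t) $$

(with the sum replaced by an integral for continuous state and action spaces).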

Finite Horizon Case: State-action Marginal

$p_\theta(s_t,a_t)$ is the state-action marginal at time $t$.

$$ \theta^* = \arg \max_{\theta} E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right] = \arg \max_{\theta} \sum_{t=1}^T E_{(s_t,a_t) \sim p_{\theta}(s_t,a_t)} \left[ r(s_t, a_t) \right] $$

Infinite Horizon Case: Stationary Distribution
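
As $T \rightarrow \infty$, the relevant question is whether the state-action marginal converges to a stationary distribution $\mu = p_\theta(s,a)$, i.e. a distribution left unchanged by one step of the $(s,a)$ Markov chain with transition operator $\mathcal{T}$:

$$ \mu = \mathcal{T} \mu $$

Equivalently, $\mu$ is an eigenvector of $\mathcal{T}$ with eigenvalue 1, which exists under suitable ergodicity conditions.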

Expectations and Stochastic Systems

For the infinite-horizon case:

$$ \theta^* = \arg \max_\theta E_{(s,a) \sim p_\theta(s,a)} \left[ r(s,a) \right] $$

For the finite-horizon case:

$$ \theta^* = \arg \max_\theta \sum_{t=1}^T E_{(s_t,a_t) \sim p_\theta(s_t,a_t)} \left[ r(s_t,a_t) \right] $$

In RL, we almost always care about expectations. Even when the reward itself is not a smooth function of the outcome, its expectation under a stochastic policy is smooth in the policy parameters:

  • $r(x)$ -> not smooth (e.g., a binary fall / no-fall outcome)
  • $\pi_\theta(a=\text{fall}) = \theta$ -> the policy falls with probability $\theta$
  • $E_{\pi_\theta}\left[ r(x) \right]$ -> smooth in $\theta$!
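
A minimal numerical check of this point, assuming a made-up setup where the agent "falls" with probability $\theta$ and receives reward $-1$, and otherwise receives $+1$:

```python
import numpy as np

# Illustrative (made-up) setup: reward is a step function of the outcome
# (+1 if the agent stays up, -1 if it falls), but the *expected* reward under
# the stochastic policy pi_theta(a = fall) = theta is smooth (here linear):
# E[r] = (1 - theta) * 1 + theta * (-1) = 1 - 2 * theta.

def expected_reward(theta, n_samples=100_000, seed=0):
    rng = np.random.default_rng(seed)
    fell = rng.random(n_samples) < theta        # sample a ~ pi_theta
    return np.where(fell, -1.0, 1.0).mean()     # Monte Carlo estimate of E[r]

for theta in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"theta={theta:.2f}  E[r] ~ {expected_reward(theta):+.3f}  (exact {1 - 2 * theta:+.2f})")
```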

RL Algorithms

Structure of RL Algorithms

  • Sample generation
  • Fitting a model/estimating return
  • Policy improvement
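
As a concrete (if deliberately tiny) instance of this loop, here is a self-contained sketch on a two-armed bandit; the bandit, the epsilon-greedy policy, and all constants are illustrative assumptions, not something from the lecture:

```python
import numpy as np

# Toy instance of the three-step loop on a 2-armed bandit (illustrative only).
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])   # unknown to the agent
q_estimate = np.zeros(2)            # estimated return of each action
counts = np.zeros(2)
epsilon = 0.1                       # epsilon-greedy exploration rate

for iteration in range(500):
    # 1. Sample generation: act with the current (epsilon-greedy) policy.
    a = rng.integers(2) if rng.random() < epsilon else int(np.argmax(q_estimate))
    r = rng.normal(true_means[a], 1.0)

    # 2. Estimating the return: running average of observed rewards per action.
    counts[a] += 1
    q_estimate[a] += (r - q_estimate[a]) / counts[a]

    # 3. Policy improvement: the greedy part of the policy automatically uses
    #    the updated estimates on the next iteration.

print("estimated returns:", q_estimate)   # should end up roughly [0.2, 0.8]
```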

The Anatomy of an RL Algorithm

Q-Function

The Q-function is the total expected reward from taking $a_t$ in $s_t$ and then following $\pi$:

$$ Q^{\pi}(s_t,a_t) = \sum_{t'=t}^T E_{\pi_\theta} \left[ r(s_{t'}, a_{t'}) | s_t, a_t \right] $$

Value-Function

The value function is the total expected reward from $s_t$ onward:

$$ V^\pi(s_t) = \sum_{t'=t}^T E_{\pi_\theta} \left[ r(s_{t'}, a_{t'}) | s_t \right] $$

$$ V^\pi(s_t) = E_{a_t \sim \pi(a_t|s_t)} \left [ Q^\pi(s_t, a_t) \right ] $$

$E_{s_1 \sim p(s_1)} \left[ V^\pi(s_1) \right]$ is the RL objective!
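
To make these definitions concrete, here is a small self-contained sketch that computes $Q^\pi$ and $V^\pi$ by backward recursion on a made-up two-state, two-action, finite-horizon MDP, and checks $V^\pi$ against Monte Carlo rollouts; all of the numbers are illustrative assumptions:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP with horizon T (all numbers are made up).
nS, nA, T = 2, 2, 10
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = p(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                 # R[s, a] = r(s, a)
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],                # pi[s, a] = pi(a | s)
               [0.4, 0.6]])

# Backward recursion:
#   Q_t(s, a) = r(s, a) + sum_{s'} p(s'|s, a) V_{t+1}(s')
#   V_t(s)    = sum_a pi(a|s) Q_t(s, a)     (i.e. V = E_{a ~ pi}[Q])
V_next = np.zeros(nS)
for t in reversed(range(T)):
    Q = R + P @ V_next           # shape (nS, nA)
    V = (pi * Q).sum(axis=1)     # expectation of Q under the policy
    V_next = V

print("Q^pi at t=0:\n", Q)
print("V^pi at t=0:", V)

# Rough Monte Carlo check of V^pi(s=0) by rolling out the policy for T steps.
rng = np.random.default_rng(0)
returns = []
for _ in range(5000):
    s, G = 0, 0.0
    for t in range(T):
        a = rng.choice(nA, p=pi[s])
        G += R[s, a]
        s = rng.choice(nS, p=P[s, a])
    returns.append(G)
print("Monte Carlo estimate of V^pi(s=0):", np.mean(returns))
```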

Using Q-Functions and Value Functions

Types of RL Algorithms

The goal of RL:

$$ \theta^* = \arg \max_{\theta} E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right] $$

  • Policy gradients: directly differentiate the above objective (a minimal sketch follows this list)
  • Value-based: estimate value function or Q-function of the optimal policy (no explicit policy)
  • Actor-critic: estimate value function or Q-function of the current policy, use it to improve policy
  • Model-based RL: estimate the transition model, then improve the policy
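
For the policy gradient entry above, here is a minimal sketch of directly differentiating the objective with the score-function (REINFORCE-style) estimator on a two-armed bandit with a softmax policy; the bandit, the parameterization, and the step size are illustrative assumptions:

```python
import numpy as np

# REINFORCE-style gradient ascent on E[r] for a softmax policy over 2 arms.
rng = np.random.default_rng(0)
true_means = np.array([0.0, 1.0])   # arm 1 is better (unknown to the agent)
theta = np.zeros(2)                 # softmax logits = policy parameters
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)              # sample a ~ pi_theta
    r = rng.normal(true_means[a], 1.0)      # observe reward

    # grad of log pi_theta(a) for a softmax policy: one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0

    # Score-function estimator of grad_theta E[r]; take an ascent step.
    theta += lr * r * grad_log_pi

print("final action probabilities:", softmax(theta))   # should favor arm 1
```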

Model-based RL

Estimate the transition model, then improve the policy by:

  • Just use the model to plan (no policy) – see the sketch after this list
    • Trajectory optimization/optimal control (primarily in continuous spaces) – essentially back-propagation to optimize over actions
    • Discrete planning in discrete action spaces – e.g., Monte Carlo tree search
  • Back-propagate gradients into the policy
    • Requires some tricks to make it work
  • Use the model to learn a value function
    • Dynamic programming
    • Generate simulated experience for model-free learner
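
As a sketch of the first option above ("just use the model to plan"), here is a random-shooting planner on a made-up one-dimensional model. Random shooting is the simplest stand-in for trajectory optimization, not the back-propagation-based approach mentioned in that bullet, and everything here (model, reward, horizon) is an illustrative assumption:

```python
import numpy as np

# "Just use the model to plan": random-shooting planner on a toy 1-D model.
rng = np.random.default_rng(0)

def model(s, a):
    return s + a                             # assumed (learned) dynamics: s' = s + a

def reward(s, a):
    return -(s - 5.0) ** 2 - 0.1 * a ** 2    # reach s = 5 with small actions

def plan(s0, horizon=10, n_candidates=1000):
    best_return, best_actions = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=horizon)   # random candidate action sequence
        s, total = s0, 0.0
        for a in actions:                                # roll it out through the model
            total += reward(s, a)
            s = model(s, a)
        if total > best_return:                          # keep the best sequence found
            best_return, best_actions = total, actions
    return best_actions

print("first planned action:", plan(s0=0.0)[0])   # typically positive: move toward s = 5
```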

Value-Function based RL

Direct Policy Gradients

Actor-Critic: Value Functions + Policy Gradients

Why So Many RL Algorithms

  • Different tradeoffs
    • Sample efficiency
    • Stability & ease of use
  • Different assumptions
    • Stochastic or deterministic?
    • Continuous or discrete?
    • Episodic or infinite horizon?
  • Different things are easy or hard in different settings
    • Easier to represent the policy?
    • Easier to represent the model?

Comparison: Sample Efficiency

Sample efficiency: how many samples do we need to get a good policy?

Most important question: is the algorithm off-policy?

  • Off-policy: able to improve the policy without generating new samples from that policy
  • On-policy: each time the policy is changed, even a little bit, we need to generate new samples
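
A self-contained illustration of the off-policy case: tabular Q-learning on a made-up two-state MDP keeps improving its Q-function from a fixed replay buffer of previously collected transitions, without generating any new samples. The MDP, the discount factor $\gamma$ (not used elsewhere in these notes), and all constants are illustrative assumptions:

```python
import numpy as np

# Off-policy illustration: Q-learning from a *fixed* buffer of old transitions.
rng = np.random.default_rng(0)
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # P[s, a, s'] (deterministic toy MDP)
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],                 # R[s, a]
              [0.0, 2.0]])
gamma, alpha = 0.9, 0.1

# Collect transitions once, with a uniformly random behavior policy.
buffer, s = [], 0
for _ in range(2000):
    a = rng.integers(2)
    s_next = rng.choice(2, p=P[s, a])
    buffer.append((s, a, R[s, a], s_next))
    s = s_next

# Many Q-function updates from the fixed buffer: no fresh samples are needed,
# even though the implied greedy policy keeps changing as Q improves.
Q = np.zeros((2, 2))
for _ in range(20000):
    s, a, r, s_next = buffer[rng.integers(len(buffer))]
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

print("greedy policy from Q:", Q.argmax(axis=1))   # should be [1, 1] for this MDP
```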

Comparison: Stability and Ease of Use

Questions:

  • Does it converge?
  • And if it converges, to what?
  • And does it converge every time?

  • Supervised learning: almost always gradient descent

  • Reinforcement learning: often not gradient descent

    • Q-learning: fixed point iteration
      • At best, minimizes error of fit (“Bellman error”; written out after this list)
        • Not the same as expected reward
      • At worst, doesn’t optimize anything
        • Many popular deep RL value fitting algorithms are not guaranteed to converge to anything in the nonlinear case
    • Model-based RL: model is not optimized for expected reward
      • Model minimizes error of fit
        • This will converge
        • No guarantee that better model = better policy
    • Policy gradient
      • The only one that actually performs gradient descent (ascent) on the true objective, but also often the least efficient!
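
For reference, the “Bellman error” mentioned above is the squared error between the Q-function and its bootstrapped target, typically written (with a discount factor $\gamma$, not introduced elsewhere in these notes, and often with a lagged target network) as

$$ \mathcal{E}(\phi) = E_{(s,a,r,s')} \left[ \left( Q_\phi(s,a) - \left( r + \gamma \max_{a'} Q_\phi(s', a') \right) \right)^2 \right] $$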

Comparison: Assumptions

  • Common assumption #1: full observability
    • Generally assumed by value function fitting methods
    • Can be mitigated by adding recurrence
  • Common assumption #2: episodic learning
    • Often assumed by pure policy gradient methods
    • Assumed by some model-based RL methods
  • Common assumption #3: continuity or smoothness
    • Assumed by some continuous value function learning methods
    • Often assumed by some model-based RL methods

Examples of Specific Algorithms

  • Value function fitting methods
    • Q-learning, DQN
    • Temporal difference learning
    • Fitted value iteration
  • Policy gradient methods
    • REINFORCE
    • Natural policy gradient
    • Trust region policy optimization
  • Actor-critic algorithms
    • Asynchronous advantage actor-critic (A3C)
    • Soft actor-critic (SAC)
  • Model-based RL algorithms
    • Dyna
    • Guided policy search
