Multi-Task, Goal Conditioned Reinforcement Learning

Terminology & Notation

The reward function $r(s,a)$ tells us which states and actions are better or worse.

$s$, $a$, $r(s,a)$, and $p(s'|s,a)$ define a Markov decision process (MDP).
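As a concrete toy illustration (not from the notes, all sizes and names are arbitrary assumptions), the MDP pieces can be written down directly as arrays:

```python
import numpy as np

# A toy tabular MDP: 4 states, 2 actions (sizes chosen arbitrarily).
n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

# r[s, a]: reward for taking action a in state s.
r = rng.normal(size=(n_states, n_actions))

# p[s, a, s']: probability of landing in s' after taking a in s.
p = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

s, a = 0, 1
s_next = rng.choice(n_states, p=p[s, a])  # sample s' ~ p(. | s, a)
print(r[s, a], s_next)
```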

The Goal of Reinforcement Learning

What is a Reinforcement Learning Task

The Goal of Multi-Task Reinforcement Learning

Multi-task RL

The same as before, except a task identifier is part of the state: $s = (\bar{s}, z_i)$, where $z_i$ can be a one-hot task ID, a language description, or a desired goal state $z_i = s_g$ (the latter is goal-conditioned RL).

The reward is the same as before, or for goal-conditioned RL:

$$ r(s) = r(\bar{s}, s_g) = -d(\bar{s}, s_g) $$

Distance function $d$ examples (sketched in code after this list):

  • Euclidean $L_2$
  • Sparse 0/1
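A minimal sketch of this reward, assuming the state and goal live in the same vector space; the function name and the sparse-reward threshold `eps` are my own choices:

```python
import numpy as np

def goal_conditioned_reward(s_bar, s_goal, kind="euclidean", eps=1e-3):
    """r(s) = -d(s_bar, s_goal) for the two distance choices above."""
    dist = np.linalg.norm(np.asarray(s_bar) - np.asarray(s_goal))
    if kind == "euclidean":   # dense: negative L2 distance
        return -dist
    if kind == "sparse":      # sparse 0/1: 0 at the goal, -1 otherwise
        return 0.0 if dist < eps else -1.0
    raise ValueError(kind)
```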

The Anatomy of an RL Algorithm

Evaluating the Objective

$$ \theta^* = \arg\max_{\theta} J(\theta) $$

Sum over samples from $\pi_\theta$.

$$ J(\theta) = E_{\tau \sim p_{\theta}(\tau)} \left [ \sum_{t} r(s_t, a_t) \right ] \approx \frac{1}{N} \sum_{i}\sum_{t}r(s_{i,t}, a_{i,t}) $$
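A minimal Monte Carlo sketch of this estimate; `rollout_fn` is a hypothetical callable that runs $\pi_\theta$ in the environment and returns one trajectory's rewards:

```python
import numpy as np

def estimate_objective(rollout_fn, n_samples=100):
    """Estimate J(theta) by averaging summed rewards over N sampled trajectories."""
    returns = []
    for _ in range(n_samples):
        rewards = rollout_fn()        # rewards [r_1, ..., r_T] from one rollout of pi_theta
        returns.append(np.sum(rewards))
    return np.mean(returns)           # (1/N) * sum_i sum_t r(s_{i,t}, a_{i,t})
```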

Direct Policy Differentiation

Evaluating the Policy Gradient

Comparison to Maximum Likelihood

Which means:

  • Good stuff is made more likely
  • Bad stuff is made less likely
  • Simply formalizes the notion of “trial and error” (see the sketch below)
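A sketch of the REINFORCE estimator behind this intuition, assuming a tabular softmax policy with logits `theta[s, a]` (my simplification): trajectories with high return push the log-probabilities of their actions up, low-return ones push them down.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_gradient(theta, trajectories):
    """grad J(theta) ~ (1/N) sum_i sum_t grad log pi_theta(a_t | s_t) * R_i."""
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        R = np.sum(rewards)                      # trajectory return weights its log-probs
        for s, a in zip(states, actions):
            probs = softmax(theta[s])
            one_hot = np.zeros_like(probs)
            one_hot[a] = 1.0
            grad[s] += (one_hot - probs) * R     # d log softmax(a) / d logits = 1{a} - pi
    return grad / len(trajectories)

# Gradient ascent step: theta += learning_rate * reinforce_gradient(theta, trajectories)
```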

Summary for Policy Gradients

Pros:

  • Simple
  • Easy to combine with existing multi-task & meta-learning algorithms.

Cons:

  • Produces a high-variance gradient
    • Can be mitigated with baselines (used by all algorithms in practice) and trust regions
  • Requires on-policy data
    • Cannot reuse existing experience to estimate the gradient
    • Importance weights can help, but are also high variance (see the sketch below)
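Two small sketches of the mitigations mentioned above, both simplified per-trajectory versions of my own: a mean-return baseline for variance reduction, and an importance weight for reusing off-policy trajectories.

```python
import numpy as np

def baselined_returns(returns):
    """Subtract the average return as a baseline; this lowers variance
    without biasing the policy gradient."""
    returns = np.asarray(returns, dtype=float)
    return returns - returns.mean()

def importance_weight(logp_new, logp_old):
    """Per-trajectory weight pi_theta(tau) / pi_old(tau) from summed
    per-step log-probs; the ratio itself can have very high variance."""
    return np.exp(np.sum(logp_new) - np.sum(logp_old))
```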

Q-Learning

Value-Based RL: Definitions

The Q-function of the optimal policy $\pi^*$ satisfies the Bellman equation:

$$ Q^*(s, a) = E_{s' \sim p(\cdot|s,a)} \left[ r(s,a) + \gamma \max_{a'} Q^*(s', a') \right] $$

Fitted Q-iteration Algorithm
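A sketch of the fitted Q-iteration loop on a fixed batch of transitions; `fit_fn` and `predict_fn` stand in for an arbitrary regressor and are hypothetical interfaces, not a specific library API:

```python
import numpy as np

def fitted_q_iteration(transitions, n_actions, fit_fn, predict_fn,
                       gamma=0.99, n_iters=50):
    """Repeatedly regress Q(s, a) onto Bellman targets y = r + gamma * max_a' Q(s', a')."""
    s, a, r, s_next = transitions             # arrays of equal length
    q_params = None
    for _ in range(n_iters):
        if q_params is None:
            targets = r                        # first iteration: no bootstrap term yet
        else:
            q_next = np.stack(
                [predict_fn(q_params, s_next, np.full(len(r), b)) for b in range(n_actions)],
                axis=1)
            targets = r + gamma * q_next.max(axis=1)
        q_params = fit_fn(s, a, targets)       # supervised regression step
    return q_params
```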

Multi-Task Q-Learning

Policy:

$$ \pi_\theta(a| \bar{s}) \rightarrow \pi_\theta(a|\bar{s}, z_i) $$

Q-function:

$$ Q_{\phi} (\bar{s}, a) \rightarrow Q_\phi(\bar{s},a,z_i) $$

Analogous to multi-task supervised learning: stratified sampling, soft/hard weight sharing, etc.
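In practice the change is small: the task identifier $z_i$ is simply appended to the network inputs (whether $z_i$ is a one-hot ID, a language embedding, or a goal state depends on the task family). A minimal sketch:

```python
import numpy as np

def policy_input(s_bar, z):
    """Input for pi_theta(a | s_bar, z): state concatenated with the task identifier."""
    return np.concatenate([np.ravel(s_bar), np.ravel(z)])

def q_input(s_bar, a, z):
    """Input for Q_phi(s_bar, a, z): state, action, and task identifier."""
    return np.concatenate([np.ravel(s_bar), np.ravel(a), np.ravel(z)])
```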

What is different about RL:

  • The data distribution is controlled by the agent
  • You may know what aspect(s) of the MDP are changing across tasks

Goal-conditioned RL with Hindsight Relabeling

Relabeling Strategies: use any state from the trajectory as the relabeled goal.

The result is that exploration challenges are alleviated; for more, see Hindsight Experience Replay.
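A minimal hindsight-relabeling sketch (here relabeling with the final state, though any state from the trajectory works); `reward_fn` is a placeholder for the goal-conditioned reward above, and storing the goal concatenated to the state is my assumption:

```python
import numpy as np

def relabel_with_final_state(states, actions, reward_fn):
    """Recompute rewards as if the trajectory's final state had been the goal."""
    new_goal = states[-1]                          # pick a reached state as the goal
    relabeled = []
    for s, a, s_next in zip(states[:-1], actions, states[1:]):
        r = reward_fn(s_next, new_goal)            # e.g. sparse 0/1 or -d(s_next, s_g)
        relabeled.append((np.concatenate([s, new_goal]), a, r,
                          np.concatenate([s_next, new_goal])))
    return relabeled                               # add these to the replay buffer
```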

Multi-task RL with Relabeling

A randomly collected, unlabeled interaction is optimal under the 0/1 reward for reaching its final state.

Unsupervised Visuo-motor Control through Distributional Planning Networks
