Waypoint Transformer: Reinforcement Learning via Supervised Learning with Intermediate Targets

Summary

  • We propose a novel RvS method, Waypoint Transformer, using waypoint generation networks and establish new state-of-the-art performance, surpassing all existing methods, in challenging tasks such as AntMaze Large and Kitchen Partial/Mixed (Fu et al., 2020). On tasks from Gym-MuJoCo, our method rivals the performance of TD learning-based methods such as Implicit Q-Learning and Conservative Q-Learning (Kostrikov et al., 2021; Kumar et al., 2020), with significant improvements over existing RvS methods.
  • We motivate the benefit of conditioning RvS on intermediate targets using a chain-MDP example and an empirical analysis of maze navigation tasks. By providing such additional guidance on suboptimal datasets, we show that a policy optimized with a behavioral cloning objective chooses more optimal actions compared to conditioning on fixed targets (as in Chen et al., 2021; Emmons et al., 2021), facilitating improved stitching capability.
  • Our work also provides practical insights for improving RvS, such as significantly reducing training time, addressing the hyperparameter tuning challenge in RvS posed by Emmons et al. (2021), and notably improving stability in performance across runs.

Waypoint Generation
We can construct waypoints for goal-conditioned and reward-conditioned tasks using simple behavioral cloning objectives. Using these waypoints, we demonstrate significant improvement in goal-reaching and reward-achieving capabilities through the addition of intermediate targets alone.
  • For goal-conditioned tasks, Kumar et al. (2022) demonstrated that RvS methods tend to be unable to stitch as effectively as value-based methods. We demonstrate that by generating intermediate targets and conditioning the policy on them, we are able to stitch together appropriate subsequences and achieve state-of-the-art performance.
    Figure 1: The ant's location across 100 rollouts of (a) a WT policy and (b) a global goal-conditioned transformer policy.

  • For reward-conditioned tasks, existing return-conditioning variables suffer from either large bias or large variance (Emmons et al., 2021; Chen et al., 2021). We adapt the baseline network, often used for reducing variance in policy gradient methods, to offline RL by conditioning it on the desired return.

    Figure 2: Comparison of different reward-conditioning methods on hopper-medium-replay-v2.

    Left: Performance profiles for transformers using average reward-to-go (ARTG), cumulative reward-to-go (CRTG), and WT across 5 random seeds. Right: Standard deviation of the CRTG input to the model, comparing manual updates with attained rewards against predictions from the reward waypoint network, with average return approximately held constant.
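To make the goal-waypoint idea concrete, here is a minimal PyTorch sketch (not the paper's exact implementation): a small MLP takes the current state and the global goal and is trained by supervised regression to predict the state observed K steps later in the same trajectory. The class name, network sizes, and the choice of K are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GoalWaypointNetwork(nn.Module):
    """Illustrative sketch: predicts an intermediate target state
    ("waypoint") K steps ahead, given the current state and the goal."""

    def __init__(self, state_dim, goal_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))


def waypoint_loss(net, states, goal, K):
    """Supervised target for step t is the state at step t + K in the
    same trajectory (clipped at the trajectory's end)."""
    T = states.shape[0]
    idx = torch.clamp(torch.arange(T) + K, max=T - 1)
    targets = states[idx]
    pred = net(states, goal.expand(T, -1))
    return ((pred - targets) ** 2).mean()
```

At evaluation time, the policy would be conditioned on the network's predicted waypoint rather than the distant global goal, which is what enables stitching of subsequences.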
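A reward waypoint network can be sketched analogously. The version below is an illustrative assumption, not the paper's exact implementation: a return-conditioned baseline-style MLP regresses the reward-to-go observed at each state, so at evaluation time the model's prediction replaces the manually updated CRTG/ARTG input.

```python
import torch
import torch.nn as nn


class RewardWaypointNetwork(nn.Module):
    """Illustrative sketch: a baseline network conditioned on the desired
    return omega that predicts the reward-to-go from the current state."""

    def __init__(self, state_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state, target_return):
        return self.net(torch.cat([state, target_return], dim=-1))


def reward_waypoint_loss(net, states, rewards, total_return):
    """Supervised targets: reward-to-go computed from the logged rewards,
    conditioned on the trajectory's total return as omega."""
    rtg = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
    omega = total_return.expand(states.shape[0], 1)
    pred = net(states, omega).squeeze(-1)
    return ((pred - rtg) ** 2).mean()
```

Because the network's predictions vary smoothly with the state, the conditioning signal avoids the high variance of manually updated CRTG shown in Figure 2 (right).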


Waypoint Transformer
Incorporating intermediate targets and leveraging the representational power of the transformer architecture, we introduce the waypoint transformer (WT). Based on the decision transformer, we make several architectural simplifications that yield a 6x training speedup and allow conditioning on the generated reward/goal waypoints, leading to significantly improved performance (WT: 74.1 ± 2.8; DT: 43.8 ± 7.3).

Figure 3: Waypoint Transformer architecture, where W(s, ω) represents the output of the goal or reward waypoint network.
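The architecture in Figure 3 can be sketched roughly as follows, under stated assumptions: a pre-trained waypoint network is frozen, its output W(s_t, ω) is concatenated with each state token, and a causal transformer encoder maps the resulting sequence to actions. Class names, layer counts, and the concatenation scheme are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn


class WaypointTransformer(nn.Module):
    """Illustrative sketch of WT: each state token is paired with the
    waypoint W(s_t, omega), embedded, and passed through a causal
    transformer that outputs an action per timestep."""

    def __init__(self, state_dim, goal_dim, act_dim, waypoint_net,
                 waypoint_dim, embed_dim=128, n_layers=3, n_heads=4):
        super().__init__()
        # Assumed interface: waypoint_net maps cat([s_t, omega]) -> waypoint.
        self.waypoint_net = waypoint_net
        for p in self.waypoint_net.parameters():
            p.requires_grad = False  # trained separately, then frozen
        self.embed = nn.Linear(state_dim + waypoint_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(
            embed_dim, n_heads, dim_feedforward=4 * embed_dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(embed_dim, act_dim)

    def forward(self, states, omega):
        # states: (B, T, state_dim); omega: (B, goal_dim), the global target
        B, T, _ = states.shape
        omega_t = omega.unsqueeze(1).expand(-1, T, -1)
        w = self.waypoint_net(torch.cat([states, omega_t], dim=-1))
        x = self.embed(torch.cat([states, w], dim=-1))
        # Causal mask so each token attends only to the past.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(x, mask=mask)
        return self.head(h)  # (B, T, act_dim)
```

Conditioning on W(s, ω) instead of the fixed global target ω is the only change relative to a standard goal- or return-conditioned transformer policy; training still uses a plain behavioral cloning objective on actions.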