Learning World Graphs to Accelerate Hierarchical Reinforcement Learning
Wendy Shang Alex Trott* Stephan Zheng* Caiming Xiong Richard Socher
An autonomous agent often encounters a variety of tasks within a single complex environment. Our two-stage framework first builds a simple directed weighted graph abstraction over the world in an unsupervised, task-agnostic manner, and then uses it to accelerate hierarchical reinforcement learning on a diversity of downstream tasks. For details, please refer to the paper and appendix.
Overview of Proposed Two-Stage Framework
Stage 1: World Graph Discovery
To discover the nodes, i.e. pivotal states, a recurrent variational autoencoder takes state-action pairs as inputs, infers binary latent variables, each of which indicates whether its corresponding input state is activated, and reconstructs the original action sequence from the activated states only.
The binary latent variables follow a Beta prior with a Hard Kumaraswamy approximate posterior. The prior signifies, on average, how critical a state is for action reconstruction, i.e. how likely it is to be a pivotal state.
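The Hard Kumaraswamy posterior admits reparameterized samples that can be exactly 0 or 1: draw a Kumaraswamy sample by inverse CDF, stretch it slightly beyond [0, 1], then clamp. The sketch below illustrates this construction; the stretch limits (-0.1, 1.1) are an assumption borrowed from the standard Hard Kuma recipe, not values confirmed by this paper.

```python
import random

def hard_kumaraswamy_sample(a, b, l=-0.1, r=1.1, eps=1e-6):
    """One reparameterized sample from a stretched, rectified
    Kumaraswamy(a, b) -- a sketch of the Hard Kuma construction.

    a, b : Kumaraswamy shape parameters (from the inference network).
    l, r : assumed stretch limits slightly beyond [0, 1].
    """
    u = min(max(random.random(), eps), 1.0 - eps)
    # Inverse-CDF Kumaraswamy sample in (0, 1)
    k = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
    s = l + (r - l) * k          # stretch to (l, r)
    return min(max(s, 0.0), 1.0)  # hard rectify -> exact 0s and 1s possible
```

Because mass outside [0, 1] collapses onto the endpoints, the sample is binary with nonzero probability while remaining differentiable with respect to a and b almost everywhere.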
The overall loss consists of the evidence lower bound, a sparsity penalty that regularizes the ratio of pivotal states toward a desired value, and a transition penalty that distributes the pivotal states evenly.
The training trajectories are collected via random walks and a curiosity-driven goal-conditioned policy that is optimized in alternation with the latent model.
To forge the edges once the nodes are finalized, we use random walks to determine the adjacency matrix, i.e. the neighbors of each pivotal state, and then use both random-walk trajectories and the goal-conditioned policy to approximate edge weights as well as to form the transitions between nodes.
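One simple way to estimate such a graph from trajectories is to record, for every consecutive pair of pivotal states visited on a walk, how many environment steps the transition took. The sketch below assumes trajectories are plain lists of hashable states; the exact statistics used in the paper may differ.

```python
from collections import defaultdict

def build_world_graph(trajectories, pivotal_states):
    """Estimate a directed weighted graph over pivotal states (a sketch).

    An edge u -> v is recorded whenever a trajectory reaches pivotal
    state v from pivotal state u without visiting another pivotal state
    in between; its weight approximates the expected number of steps.
    """
    pivots = set(pivotal_states)
    counts = defaultdict(int)
    total_steps = defaultdict(int)
    for traj in trajectories:
        last_pivot, last_idx = None, None
        for i, s in enumerate(traj):
            if s in pivots:
                if last_pivot is not None and s != last_pivot:
                    counts[(last_pivot, s)] += 1
                    total_steps[(last_pivot, s)] += i - last_idx
                last_pivot, last_idx = s, i
    # Average step count per observed edge
    return {e: total_steps[e] / counts[e] for e in counts}
```

The resulting dictionary maps each observed edge to its average traversal cost, which can serve directly as the weight matrix for downstream planning.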
Stage 2: Hierarchical Reinforcement Learning
Baselines in our work include (but are not limited to) non-hierarchical A2C and its hierarchical extension, Feudal Network (FN).
Wide-then-Narrow (WN) sequential Manager instruction: pre-define a set of states, e.g. the set of pivotal states, from which the Manager proposes a Wide Goal; within the local neighborhood of the Wide Goal, the Manager then proposes a Narrow Goal.
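The two-step selection above can be sketched as a coarse-to-fine argmax. The scoring functions and the `neighborhood` helper below are hypothetical placeholders for the Manager's learned policy heads, used only to illustrate the control flow.

```python
def wide_then_narrow(score_wide, score_narrow, pivotal_states, neighborhood, state):
    """Wide-then-Narrow goal proposal (illustrative sketch).

    score_wide(state, g)        : hypothetical Manager score for a Wide Goal g.
    score_narrow(state, w, g)   : hypothetical score for a Narrow Goal g given Wide Goal w.
    neighborhood(w)             : hypothetical helper returning states near w.
    """
    # Step 1: pick a Wide Goal from the pre-defined (e.g. pivotal) state set.
    wide = max(pivotal_states, key=lambda g: score_wide(state, g))
    # Step 2: refine to a Narrow Goal within the Wide Goal's local neighborhood.
    narrow = max(neighborhood(wide), key=lambda g: score_narrow(state, wide, g))
    return wide, narrow
```

Restricting the second argmax to a small neighborhood keeps the Manager's effective action space tractable even when the full state space is large.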
World Graph traversal: send the agent from its nearby pivotal state to the potentially distant Wide Goal, which improves long-horizon planning and exploration. The traversal path is planned via dynamic programming and executed using stored edge paths or the goal-conditioned policy.
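Given edge weights that approximate traversal cost, the plan is a standard shortest-path computation over the world graph. The sketch below uses Dijkstra's algorithm as one concrete instance of such dynamic programming; the edge-dictionary format is an assumption.

```python
import heapq

def plan_traversal(graph, start, goal):
    """Shortest path over the world graph via Dijkstra (a sketch).

    graph : dict mapping (u, v) edges to non-negative weights,
            e.g. average step counts estimated from random walks.
    Returns the node sequence from start to goal.
    """
    adj = {}
    for (u, v), w in graph.items():
        adj.setdefault(u, []).append((v, w))
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float('inf')):
            continue  # stale queue entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    # Reconstruct the node sequence backwards from the goal.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]
```

Each consecutive node pair in the returned path can then be executed with the stored edge trajectory or by conditioning the low-level policy on the next node.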
Initialization with the goal-conditioned policy weights: achieves implicit skill transfer and more stable optimization.
Control Experiment Results
Tasks: MultiGoal (mean reward with std), with Dense Reward, with Sparse Reward, with Stochasticity, and Door-Key (mean success rate in % with std).
3 different Wide Goal sets: all valid states, pivotal states, randomly selected states
With or without goal-conditioned policy.
With or without World Graph traversal.
Compare All Variations and Baseline on Small MultiGoal
Compare baseline A2C and FN with initialization (without it, training fails):
without graph traversal, with initialization (without which training fails), compare using different Wide Goal sets:
All Feasible States
with graph traversal, without initialization, compare pivotal states and randomly selected states as Wide Goals:
with graph traversal, with initialization, compare using pivotal states and randomly selected states as Wide Goals:
Compare the proposed framework with WN, pivotal-state graph traversal, and initialization against the A2C and FN baselines:
Compare Different Wide Goal Sets on Medium Door-Key
with initialization and graph traversal; when all feasible states are considered, traversal becomes trivial, hence the same as without traversal:
All Feasible States
Compare Initialization with Goal-Conditioned Policy on Large MultiGoal-Sparse
with graph traversal, compare the impact of goal-conditioned policy initialization when using pivotal states:
with Goal-Conditioned Policy Initialization
without Goal-Conditioned Policy Initialization
with graph traversal, compare the impact of goal-conditioned policy initialization when using randomly selected states: