Learning World Graphs to Accelerate Hierarchical Reinforcement Learning

Wendy Shang    Alex Trott*    Stephan Zheng*    Caiming Xiong    Richard Socher

An autonomous agent often encounters a variety of tasks within a single complex environment. Our two-stage framework first builds a simple directed weighted graph abstraction of the world in an unsupervised, task-agnostic manner, and then uses this abstraction to accelerate hierarchical reinforcement learning on a diversity of downstream tasks. For details, please refer to the paper and its appendix.
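To make the graph abstraction concrete, below is a minimal, illustrative sketch of the kind of directed weighted graph the framework builds over pivotal states, together with shortest-path planning over it. The state names and edge costs are hypothetical placeholders, not the paper's actual environments or API; edge weights stand in for the cost (e.g., number of low-level actions) of traveling between pivotal states.

```python
import heapq

# Hypothetical world graph: adjacency map from each pivotal state to
# (neighbor, edge_cost) pairs. States and costs are illustrative only.
world_graph = {
    "A": [("B", 2), ("C", 5)],
    "B": [("C", 1), ("D", 4)],
    "C": [("D", 1)],
    "D": [],
}

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm over the directed weighted world graph.

    Returns (total_cost, path); (inf, []) if the goal is unreachable.
    """
    frontier = [(0, start, [start])]  # (accumulated cost, node, path so far)
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nbr, w in graph.get(node, []):
            if nbr not in visited:
                heapq.heappush(frontier, (cost + w, nbr, path + [nbr]))
    return float("inf"), []

cost, path = shortest_path(world_graph, "A", "D")
# → (4, ['A', 'B', 'C', 'D'])
```

A high-level policy can use such paths as a sequence of subgoals, which is the intuition behind planning over the world graph rather than over raw states.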

Overview of Proposed Two-Stage Framework

Stage 1: World Graph Discovery

Stage 2: Hierarchical Reinforcement Learning
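As a rough illustration of how the second stage can exploit the graph, the sketch below shows a high-level "manager" that plans over the world graph and emits the next pivotal state as a subgoal, which a low-level goal-conditioned "worker" policy would then pursue. All names (`world_graph`, `next_subgoal`, the state labels) are hypothetical and for illustration only; they are not the paper's actual implementation.

```python
from collections import deque

# Hypothetical world graph over pivotal states of a Door-Key-style maze.
world_graph = {
    "start": ["hall"],
    "hall": ["door", "key"],
    "key": ["door"],
    "door": ["goal"],
    "goal": [],
}

def next_subgoal(graph, current, target):
    """BFS from `current`; return the first pivotal state along a shortest
    path toward `target`, or None if `target` is unreachable."""
    queue = deque([(current, [])])
    seen = {current}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path[0] if path else None
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, path + [nbr]))
    return None

# The manager repeatedly emits subgoals until the target is reached; in the
# full system, a trained goal-conditioned worker would execute each leg.
state = "start"
trajectory = [state]
while state != "goal":
    state = next_subgoal(world_graph, state, "goal")  # worker reaches subgoal
    trajectory.append(state)
# trajectory → ['start', 'hall', 'door', 'goal']
```

Replanning one subgoal at a time keeps the manager's decisions short-horizon, which is what makes the downstream hierarchical learning easier to accelerate.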

Control Experiment Results

Compare All Variations and Baseline on Small MultiGoal

[Figure: learning curves on Small MultiGoal comparing the A2C and FN baselines against the proposed method, with goal sets drawn from all feasible states, random states, and pivotal states.]

Compare Different Wide Goal Sets on Medium Door-Key

[Figure: learning curves on Medium Door-Key for goal sets drawn from all feasible states, pivotal states, and random states.]

Compare Initialization with Goal-Conditioned Policy on Large MultiGoal-Sparse

[Figure: learning curves on Large MultiGoal-Sparse with and without goal-conditioned policy initialization.]