### Google trends for 'Autonomous cars'

## Potential benefits of self-driving cars

• For the elderly and disabled
• More energy efficiency
• Fewer parking lots

## Navigation needs to be fast for safer cars

• Reaction time in driving is a matter of life and death
• Perception-reaction time (PRT) is positively correlated with crashes
• "PRT has a cause-effect relationship in deceleration rate and crash severity."
• "PRT has a direct effect on (crash) event and severity."

## Outline

• Mapping
• Localization
• Goal conditioned RL

## Mapping

• Mapping
• Localization
• Goal conditioned RL

## Occupancy grid mapping

Estimate map: $\mathbf{x}^* = \arg \max_{\mathbf{x} \in \Omega} p(\mathbf{x} | \mathbf{z}_{1:T}, \mathbf{y}_{1:T})$
$$O(|\Omega|) = O(2^{\dim(\mathbf{x})})$$

## Factor graphs represent factorizations

• Example: $P(x_1, x_2, x_3, x_4, x_5) = f_A(x_1)f_B(x_2)f_C(x_1, x_2, x_3)f_D(x_3, x_4)f_E(x_3, x_5)$
• Factorization: $P(\mathbf{x}) = \prod_{f \in F} P_f(\mathbf{x}^f)$
• Factor graph $$G = (V, F, E)$$
• $$V =$$ variable nodes. Example $$x_1, x_2, \dots, x_5$$
• $$F =$$ factor nodes. Example $$f_A, f_B, \dots, f_E$$
• $$E = \{(i, f) : i \in V, f \in F, \text{ if }P_f(.) \text{ depends upon }x_i\}$$
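The bipartite structure above is easy to hold in plain data structures. A minimal sketch of the running example as a factor graph, with hypothetical names (`factors`, `joint`) chosen here for illustration:

```python
from itertools import product

# The example factor graph G = (V, F, E): each factor node lists its
# variable-node scope; edges follow from "P_f depends on x_i".
factors = {
    "f_A": ("x1",),
    "f_B": ("x2",),
    "f_C": ("x1", "x2", "x3"),
    "f_D": ("x3", "x4"),
    "f_E": ("x3", "x5"),
}
# E = {(i, f) : P_f(.) depends upon x_i}
edges = {(v, f) for f, scope in factors.items() for v in scope}

def joint(tables, assignment):
    """Evaluate the factorization P(x) = prod_f P_f(x^f) at one assignment.

    tables[f] maps a tuple of values over f's scope to the factor value."""
    p = 1.0
    for f, scope in factors.items():
        p *= tables[f][tuple(assignment[v] for v in scope)]
    return p
```

With all-ones factor tables the joint is 1 for any assignment, which is a quick sanity check of the indexing.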

## Mapping using Inverse sensor models

• Independent cell assumption: $p(\mathbf{x} | \mathbf{z}, \mathbf{y}) = \prod_{i = 1}^{N} p(x_i | \mathbf{z}, \mathbf{y})$
• Mapping: $\mathbf{x}^* = \arg \max_{\mathbf{x}} \eta \prod_{i=1}^N \prod_{f=1}^{t} p( x_i | z_f, y_f)$
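Under the independent-cell assumption, each cell's posterior can be accumulated measurement by measurement; the standard trick is to work in log-odds so the per-measurement products become sums. A minimal sketch, assuming an inverse sensor model that simply reports a probability for the cell:

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def update_cell(log_odds, inv_model_prob, prior=0.5):
    """One measurement update for one cell: add the inverse-sensor-model
    evidence p(x_i | z_f, y_f) in log-odds form, removing the prior."""
    return log_odds + logit(inv_model_prob) - logit(prior)

def cell_probability(log_odds):
    """Recover p(x_i = occupied) from accumulated log-odds."""
    return 1.0 - 1.0 / (1.0 + math.exp(log_odds))
```

For example, a cell reported occupied with probability 0.7 by three independent measurements ends up well above 0.9, which is exactly the product over f in the mapping equation above.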

## Mapping using Forward sensor models

• Mapping: $\mathbf{x}^* = \arg \max_{\mathbf{x}} \eta\, p(\mathbf{x}) \prod_{f=1}^t p( z_f | \mathbf{x}_f, y_f)$ where $$p( z_f | \mathbf{x}_f, y_f)$$ is the forward sensor model.

## Inverse sensor models

• $$p(\mapsym_i|\meas_f, y_f)$$ is the inverse sensor model

#### Drawbacks

• Cell independence assumption is limiting
• Inverse sensor models are often inaccurate

#### Benefits

• Fast

## Forward sensor models

• $$p(\meas_f | \map, y_f)$$ is the forward sensor model

#### Benefits

• No independent cell assumption
• Forward sensor models are easy to compute

#### Drawbacks

• Slow

## Modern inference methods

• Metropolis-Hastings for mapping
• Belief propagation
• Dual decomposition

## Metropolis-Hastings for mapping

• Current sample $$\map^r$$
• Potential sample $$\map'$$
• Acceptance ratio \begin{align} a = \frac{P(\map')Q(\map^r| \map')} {P(\map^r)Q(\map'| \map^r)} \end{align}
• \begin{align} Q(\map'| \map^r) = \begin{cases} \frac{1}{N} & \text{if $\|\map' - \map^r\|_1 = 1$}\\ 0 & \text{ otherwise} \end{cases} \end{align}
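Since the proposal flips exactly one of the N cells uniformly at random, Q is symmetric and the acceptance ratio reduces to P(map')/P(map^r). A minimal sketch, where `log_p` stands in for an assumed unnormalized log-posterior log P(map | z, y):

```python
import math
import random

def mh_mapping(log_p, n_cells, n_steps, seed=0):
    """Metropolis-Hastings over binary occupancy maps with a
    single-cell-flip proposal (so ||map' - map^r||_1 = 1)."""
    rng = random.Random(seed)
    current = [0] * n_cells
    for _ in range(n_steps):
        proposal = list(current)
        proposal[rng.randrange(n_cells)] ^= 1   # flip one uniformly chosen cell
        # symmetric proposal: a = P(map') / P(map^r), capped at 1
        a = math.exp(min(0.0, log_p(proposal) - log_p(current)))
        if rng.random() < a:                    # accept with prob min(1, a)
            current = proposal
    return current
```

With a log-posterior that strongly rewards occupied cells, the chain quickly concentrates on the all-occupied map.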

## Belief propagation intuition

• Example: $P(x_1, x_2, x_3, x_4, x_5) = f_A(x_1)f_B(x_2)f_C(x_1, x_2, x_3)f_D(x_3, x_4)f_E(x_3, x_5)$
• Want: $P(x_3) = \sum_{\mathbf{x} \setminus x_3} P(x_1, x_2, x_3, x_4, x_5)$
• $\newcommand{\bpop}[1]{\sum_{#1}} P(x_3) = \bpop{x_1, x_2, x_4, x_5} f_A(x_1)f_B(x_2)f_C(x_1, x_2, x_3)f_D(x_3, x_4)f_E(x_3, x_5)$
• $P(x_3) = \underbrace{ \bpop{x_1,x_2} \left( f_A(x_1) f_B(x_2) f_C(x_1, x_2, x_3) \right) }_{\bpmsg{f_C}{x_3}{x_3}{}} \underbrace{ \bpop{x_4}f_D(x_3, x_4) }_{\bpmsg{f_D}{x_3}{x_3}{}} \underbrace{ \bpop{x_5}f_E(x_3, x_5) }_{\bpmsg{f_E}{x_3}{x_3}{}}$
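This factorization of the sum can be checked numerically: for random positive factor tables over binary variables, the brute-force marginal of x3 equals the product of the three incoming factor-to-variable messages. A small sketch (the random tables are illustrative, not from the deck):

```python
from itertools import product
import random

rng = random.Random(0)
# Random positive tables for the five example factors
fA = {x1: rng.random() + 0.1 for x1 in (0, 1)}
fB = {x2: rng.random() + 0.1 for x2 in (0, 1)}
fC = {k: rng.random() + 0.1 for k in product((0, 1), repeat=3)}
fD = {k: rng.random() + 0.1 for k in product((0, 1), repeat=2)}
fE = {k: rng.random() + 0.1 for k in product((0, 1), repeat=2)}

def brute_marginal(x3):
    """Unnormalized P(x3) by summing the full joint over x1, x2, x4, x5."""
    return sum(fA[x1] * fB[x2] * fC[(x1, x2, x3)] * fD[(x3, x4)] * fE[(x3, x5)]
               for x1, x2, x4, x5 in product((0, 1), repeat=4))

def message_marginal(x3):
    """Unnormalized P(x3) as the product of incoming messages."""
    m_C = sum(fA[x1] * fB[x2] * fC[(x1, x2, x3)]   # mu_{f_C -> x3}
              for x1, x2 in product((0, 1), repeat=2))
    m_D = sum(fD[(x3, x4)] for x4 in (0, 1))       # mu_{f_D -> x3}
    m_E = sum(fE[(x3, x5)] for x5 in (0, 1))       # mu_{f_E -> x3}
    return m_C * m_D * m_E
```

The two agree exactly because the sum distributes over the three branches of the tree.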

## Belief propagation

• \begin{align} \bpmsg{f}{i}{x_i = l_i}{r+1} &= \sum_{\map_f \in \Omega_f: x_i = l_i}P_f(\map_f)\prod_{j \in n(f) \setminus i}\bpmsg{j}{f}{x_j}{r} \\ \bpmsg{i}{f}{x_i = l_i}{r+1} &= \prod_{h \in n(i) \setminus f}\bpmsg{h}{i}{l_i}{r} \end{align}

## Dual decomposition

• Example:
$P(x_1, x_2, x_3, x_4, x_5) = f_A(x_1)f_B(x_2)f_C(x_1, x_2, x_3)f_D(x_3, x_4)f_E(x_3, x_5)$
• $x_3^{C*}, \dots = \arg~\max f_C(\dots) \qquad x_3^{D*}, \dots = \arg~\max f_D(\dots) \qquad \dots$
• Solve for all $$f \in F$$:
$$\map^f \leftarrow \arg~\min\limits_{\map^f} \left( \theta_f(\map^f) \right)$$ where $$\theta_f(\map^f) = -\log P_f(\map^f)$$

## Dual decomposition contd.

• Solve for all $$f \in F$$:
$$\map^f \leftarrow \arg~\min\limits_{\map^f} \left( \theta_f(\map^f) \right)$$ where $$\theta_f(\map^f) = -\log P_f(\map^f)$$
• For disagreeing nodes $$i \in V$$: $\msg{i}{f}{x^f_i} \leftarrow \msg{i}{f}{x^f_i} + \frac{\alpha}{r}$ where $$r$$ is the iteration count
and $$\alpha$$ is the step size.
• $\map^f \leftarrow \arg~\min\limits_{\map^f} \left( \theta_f(\map^f) + \sum\limits_{i \in n(f)}\msg{i}{f}{x^f_i} \right)$ where $$\theta_f(\map^f) = -\log P_f(\map^f)$$

## Modern inference methods

• Metropolis-Hastings for mapping
• Belief propagation
\begin{align} \bpmsg{f}{i}{l_i}{r+1} &= \sum_{\map_f \in \Omega_f: x_i = l_i}P_f(\map_f)\prod_{j \in n(f) \setminus i}\bpmsg{j}{f}{x_j}{r} \\ \bpmsg{i}{f}{l_i}{r+1} &= \prod_{h \in n(i) \setminus f}\bpmsg{h}{i}{l_i}{r} \end{align}
• Dual decomposition
\begin{align} \map^f &\leftarrow \arg \min\limits_{\map^f} \left( \theta_f(\map^f) + \sum\limits_{i \in n(f)}\msg{i}{f}{x^f_i} \right) \\ \msg{i}{f}{x^f_i} &\leftarrow \msg{i}{f}{x^f_i} + \frac{\alpha}{r} & \text{for disagreeing nodes $i$} \end{align}

## Forward sensor models are still expensive

• Belief propagation \begin{align} \bpmsg{f}{i}{x_i = l_i}{r+1} &= \sum_{{\color{red}\map_f \in \Omega_f}: x_i = l_i}P_f(\map_f)\prod_{j \in n(f) \setminus i}\bpmsg{j}{f}{x_j}{r} \end{align}
• Complexity = $$O(|\Omega_f|) = O(2^{|n(f)|})$$

## Piecewise constant sensor model

• \begin{align} P(z_f|\map_f, y_f) &= \frac{1}{Z}\begin{cases} 1 & \text{ if } \map_f = [0, \dots 0, 1, \dots, *]^\top\\ \exp(-900) & \text{ if } \map_f = [0, \dots 0, 0, \dots, *]^\top\\ \exp(-1000) & \text{ otherwise} \end{cases} \end{align}
• Generalization:
\begin{align} P^{PCSM}_f(\map_f) = \begin{cases} \psi_{1} & \text{ if $\map_f$ matches pattern $\vect{R}_1$} \\ \vdots& \\ \psi_{M} & \text{ if $\map_f$ matches pattern $\vect{R}_M$} \\ \psi_{\text{min}} & \text{ otherwise} \end{cases} \end{align}

## Efficient belief propagation for PCSM

• \begin{align} P^{PCSM}_f(\map_f) = \begin{cases} \psi_{1} & \text{ if $\map_f$ matches pattern $\vect{R}_1$} \\ \vdots& \\ \psi_{M} & \text{ if $\map_f$ matches pattern $\vect{R}_M$} \\ \psi_{\text{min}} & \text{ otherwise} \end{cases} \end{align}
• \begin{align} p(\vect{x_f} \sim \vect{R}_m| x_i = l_i) &= \begin{cases} \prod\limits_{j \in n(f) \setminus i, \vect{R}_m[j] \ne *}\bpmsg{j}{f}{x_j = \vect{R}_m[j]}{r} & \text{ if $x_i = l_i \in \vect{R}_m$ } \\ 0 & \text{ otherwise } \end{cases} \end{align}
• \begin{align} \bpmsg{f}{i}{x_i = l_i}{r+1} = \sum_{m = 1}^M \psi_m p(\vect{x_f} \sim \vect{R}_m | x_i = l_i) + \psi_{\text{min}} p_{\text{otherwise}} \end{align} where $p_{\text{otherwise}} = 1 - \sum_{m = 1}^M p(\vect{x_f} \sim \vect{R}_m | x_i = l_i)$.

Example maps: Cave, Hospital Section, Hospital, Albert.B.

## Conclusion

• Mapping
• Localization
• Goal conditioned RL

## Communicating cars with mutual observation

## Mutual localization

• Mapping
• Localization
• Goal conditioned RL

## Mutual localization

Find $$R, t$$ such that
\begin{align} q_i &= R p_i + t \\ \text{where } q_i &\in \mathbb{R}^3,\; p_i \in \mathbb{R}^3,\; R \in \text{SO(3)},\; t \in \mathbb{R}^3 \end{align}
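For known point correspondences, fitting R in SO(3) and t is the orthogonal Procrustes / Kabsch problem, solvable in closed form with an SVD. A minimal sketch assuming noise-free matched points (the deck's mutual-localization formulation may differ in detail):

```python
import numpy as np

def solve_rt(p, q):
    """Recover (R, t) with q_i = R p_i + t from matched (N, 3) point sets."""
    p_mean, q_mean = p.mean(axis=0), q.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets
    H = (p - p_mean).T @ (q - q_mean)
    U, _, Vt = np.linalg.svd(H)
    # reflection guard: keep R in SO(3), not just O(3)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t
```

Applied to points transformed by a known rotation and translation, the solver recovers them to numerical precision.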

## Mutual localization conclusion

• Mapping
• Localization
• Goal conditioned RL

## Goal conditioned RL

• Mapping
• Localization
• Goal conditioned RL

## Goal conditioned reinforcement learning

• Goal is "specified"
• State is fully specified (no partial observation)

## Q-Learning

• Define $\Q^*(s_t, a_t) = \max_{\policy} \E\left[ \sum_{k=t}^\infty \discount^{k-t} R(s_k, a_k) \right]$
• Bellman equation $\Q^*(s_t, a_t) = R(s_t, a_t) + \gamma \max_{a} Q^*(s_{t+1}, a)$
• Temporal Difference Update for tabular Q-functions $\Q^*(s_t, a_t) \leftarrow \alpha \Q^*(s_t, a_t) + (1-\alpha)( R(s_t, a_t) + \gamma \max_{a} Q^*(s_{t+1}, a) )$
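The TD update above can be sketched on a toy problem of my choosing: a 5-state chain where action 1 moves right toward the terminal state, action 0 moves left, and every step costs -1. The code uses the common convention Q ← Q + α(target − Q); the slide's α plays the complementary "retention" role.

```python
import random

def q_learning(n_states=5, gamma=0.9, alpha=0.5, episodes=500, seed=0):
    """Tabular Q-learning on a left/right chain MDP with reward -1 per step."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:                        # rightmost state is terminal
            if rng.random() < 0.2:                      # epsilon-greedy exploration
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda act: Q[s][act])
            s_next = s + 1 if a == 1 else max(0, s - 1)
            r = -1.0
            # Bellman target: no bootstrap past a terminal state
            target = r if s_next == n_states - 1 else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])       # temporal-difference update
            s = s_next
    return Q
```

After training, moving right dominates in every non-terminal state and Q(s, right) matches the discounted step costs to the goal.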

## Deep Q Learning

• \begin{align} R(s_t, a_t) + \max_a \discount \Q(s_{t+1}, a) = \Q(s_t, a_t) \end{align}
• Naive Loss:
$\Loss = \left( R(s_t, a_t) + \max_a \discount \Q(s_{t+1}, a) - \Q(s_t, a_t) \right)^2$

#### Solution Strategy:

• Target network $$\Qtgt(\dots) = \Q(\dots; \Theta_{\text{tgt}})$$
• Main network $$\Q_m(\dots) = \Q(\dots; \Theta_{m})$$
• Update target weights: $$\Theta_{\text{tgt}} \leftarrow \alpha \Theta_{\text{tgt}} + (1-\alpha) \Theta_{m}$$ where $$\alpha \approx 0.9$$
• $$\nabla_{\Theta_m} \Qtgt = 0$$

#### Solution:

• $\Loss_{\text{DQN}} = \begin{cases} \left( R(s_t, a_t) - {\color{red}\Q_{m}}(s_t, a_t) \right)^2 & \text{ if $$s_{t+1}$$ is terminal } \\ \left( R(s_t, a_t) + \max_a \discount {\color{red}\Qtgt}(s_{t+1}, a) - {\color{red}\Q_m}(s_t, a_t) \right)^2 & \text{ otherwise } \end{cases}$
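The two ingredients above, a terminal-aware target from a frozen target network and the Polyak update Θtgt ← αΘtgt + (1−α)Θm, can be sketched with plain Q-tables standing in for the networks (an illustrative sketch, not the deck's implementation):

```python
import numpy as np

def dqn_loss(q_main, q_tgt, s, a, r, s_next, terminal, gamma=0.99):
    """L_DQN for one transition; no gradient flows into q_tgt."""
    target = r if terminal else r + gamma * np.max(q_tgt[s_next])
    return (target - q_main[s, a]) ** 2

def polyak(theta_tgt, theta_m, alpha=0.9):
    """Target-weight update: Theta_tgt <- alpha*Theta_tgt + (1-alpha)*Theta_m."""
    return alpha * theta_tgt + (1 - alpha) * theta_m
```

With zero-initialized tables a reward of -1 yields a loss of 1 in both branches, and the Polyak update moves the target weights a (1−α) fraction toward the main weights.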

## Floyd-Warshall

#### Floyd-Warshall

• $\Q^P(s_t, g^*) := \max_{\policy} \E\left[ \sum_{k = t}^{\color{blue}g_{k+1} = g^*} R(s_k, s_{k+1}) \right]$

#### Floyd-Warshall RL

• $\newcommand{\redat}{{\color{red}a_t}} \newcommand{\reda}{{\color{red}a}} \newcommand{\redmaxa}{{\color{red}\max_a}} \Q^P(s_t, \redat, g^*) := \max_{\policy} \E\left[ \sum_{k = t}^{g_{k+1} = g^*} R(s_k, {\color{red}a_k}) \right]$
• $\Q^P(s_t, g_w) + \Q^P(s_w, g^*) \le \Q^P(s_t, g^*)$
• $\Q^P(s_t, g_{t+1}) = R(s_t, {\color{red}s_{t+1}})$
• $\Q^P(s_t, \redat, g_w) + \redmaxa \Q^P(s_w, \reda, g^*) \le \Q^P(s_t, \redat, g^*)$
• $\Q^P(s_t, \redat, g_{t+1}) = R(s_t, \redat)$
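The compositionality bound Q^P(s_t, g_w) + Q^P(s_w, g*) ≤ Q^P(s_t, g*) mirrors the classic Floyd-Warshall relaxation over intermediate nodes. A sketch of that classical algorithm written in reward (max-plus) form, where Q[s][g] plays the role of the best return from s to g:

```python
def floyd_warshall(reward):
    """reward[s][g]: one-step return from s to g (-inf if no edge).

    Relaxing over every intermediate state w enforces
    Q[s][g] >= Q[s][w] + Q[w][g], making the bound tight."""
    n = len(reward)
    Q = [row[:] for row in reward]
    for w in range(n):                 # allow w as an intermediate state
        for s in range(n):
            for g in range(n):
                Q[s][g] = max(Q[s][g], Q[s][w] + Q[w][g])
    return Q
```

On a 4-state chain with -1 per rightward step, the relaxation composes the single-step returns into multi-step ones.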

## Deep Floyd-Warshall RL

• \begin{align} \Q^P(s_t, a_t, g_w) + \max_a \Q^P(s_w, a, g^*) \le \Q^P(s_t, a_t, g^*) \end{align}
• \begin{align} R(s_t, a_t) + \max_a \discount \Q(s_{t+1}, a) = \Q(s_t, a_t) \end{align} $\Loss_{\text{DQN}} = \left( R(s_t, a_t) + \max_a \discount {\color{red}\Qtgt}(s_{t+1}, a) - {\color{red}\Q_m}(s_t, a_t) \right)^2$
• $\LossLo = \left( \Qtgt^P(s_t, a_t, g_w) + \max_a \Qtgt^P(s_w, a, g^*) - {\color{red}\Q^P_{m}}(s_t, a_t, g^*) \right)^2_+$
• $\LossUp = \left( {\color{red}\Q^P_{m}}(s_t, a_t, g_w) + \max_a \Qtgt^P(s_w, a, g^*) - \Qtgt^P(s_t, a_t, g^*) \right)^2_+$
• $\Q^P(s_t, \redat, g_{t+1}) = R(s_t, \redat)$
• $\LossStep = \left( R(s_t, a_t) - {\color{red}\Q^P_{m}}(s_t, a_t, g_{t+1}) \right)^2$

## What does it mean to re-imagine a virtual goal?

• The desired goal was $$g^*$$, but the achieved goal was $$g_f$$
• The loss function used is
$\newcommand{\cred}[1]{{\color{red}{#1}}} \Loss = \left( R(s_t, a_t, \cred{g_f}) + \max_a \discount \Qtgt(s_{t+1}, a, \cred{g_f}) - \Q_m(s_t, a_t, \cred{g_f}) \right)^2$
$\Loss = \left( R(s_t, a_t, \cred{g^*}) + \max_a \discount \Qtgt(s_{t+1}, a, \cred{g^*}) - \Q_m(s_t, a_t, \cred{g^*}) \right)^2$
• This needs re-computing the reward function $$R(s_t, a_t, g^*)$$

## Ablation with FWRL inspired $$\LossUp$$ and $$\LossLo$$

• \begin{align} \LossLo &= \left( \Qtgt(\dots) + \Qtgt(\dots) - \Q_m(\dots) \right)_+^2 \end{align}
• \begin{align} \LossUp &= \left( \Q_m(\dots) + \Qtgt(\dots) - \Qtgt(\dots) \right)_+^2 \end{align}

## Ablation with Step Loss

• \begin{align} \LossStep &= \left( \fwargs{\state_{t}}{\act_{t}}{\goal_{t+1}}{*}{P} - \Rew(\state_{t}, \act_{t}) \right)^2 \end{align}
• \begin{align} \Loss_{\text{HER}} = \Loss_{\text{bellman}} &= \left( R(s_t, a_t, g_f) + \max_a \discount \Qtgt(s_{t+1}, a, g_f) - \Q_m(s_t, a_t, g_f) \right)^2 \end{align}

## HER modeling has duplicate information

Hindsight Experience Replay

Can this work? Claim: Goal conditioned reinforcement learning without goal rewards is possible

## Why does duplicate information hurt?

• HER-like methods require reimagining a new goal specification.
• If rewards depend upon goal specification, the rewards need to be resampled.
• Reward re-computing can be costly.
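The cost shows up directly in code: because the sparse reward is a function of the goal, relabeling a stored transition with a virtual goal forces a fresh reward evaluation. A hedged sketch with a HER-style distance-threshold reward (the threshold and transition layout are illustrative assumptions):

```python
def her_reward(achieved_goal, goal, eps=0.05):
    """HER-style sparse reward: 0 within eps of the goal, -1 otherwise."""
    dist = sum((a - g) ** 2 for a, g in zip(achieved_goal, goal)) ** 0.5
    return 0.0 if dist <= eps else -1.0

def relabel(transition, virtual_goal):
    """Replay a stored (s, a, r, achieved_goal) under a new virtual goal."""
    s, a, _, g_achieved = transition
    # the stored reward is discarded: R must be recomputed for the new goal
    return (s, a, her_reward(g_achieved, virtual_goal), g_achieved)
```

Relabeling with the goal that was actually achieved turns the stored -1 into a 0, which is exactly the recomputation HER pays for on every replayed transition.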

## Cost of algorithms in Reward samples

• Ours : HER - Goal rewards $$+ \LossStep$$
• FWRL : HER $$+ \LossLo + \LossUp$$

## Fetch Push

## Fetch Pick And Place

## Effect of distance threshold

• HER Rewards: $R(s_t, a_t, g^*) = \begin{cases} 0 &\text{ if } \|g_t - g^*\| \le \epsilon \\ -1 &\text{ otherwise} \end{cases}$
• Our Rewards: $R(s_t, a_t) = -1$

## Conclusion

• Mapping
• Localization
• Goal conditioned RL


# Future work

## Acknowledgements

Thanks to my colleagues and collaborators