Gamma-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction

Summary: We train predictive models of environment dynamics with infinite probabilistic horizons using a generative adaptation of temporal difference learning. The resulting gamma-model is a continuous, generative analogue of the successor representation and a hybrid between model-free and model-based mechanisms. Like a value function, it contains information about the long-term future; like a standard predictive model, it is independent of reward.
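To make the temporal difference idea concrete, here is a minimal tabular sketch. It is not the neural generative training procedure; it only illustrates the bootstrapped target on a hypothetical two-state Markov chain with a fixed policy. The transition matrix `P`, the discount `gamma`, and the convention that the gamma-model is the discounted occupancy mu = (1 - gamma) * sum_{k>=1} gamma^(k-1) P^k are assumptions made for illustration.

```python
import numpy as np

gamma = 0.9

# Hypothetical 2-state Markov chain (policy already folded into P).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Start from an arbitrary row-stochastic guess for the gamma-model.
mu = np.full((2, 2), 0.5)

# Bootstrapped update: with probability (1 - gamma) the target is the
# single-step model, with probability gamma it is the current gamma-model
# evaluated at the next state.
for _ in range(500):
    mu = (1 - gamma) * P + gamma * P @ mu

# Closed form of the same discounted occupancy, for comparison.
mu_exact = (1 - gamma) * P @ np.linalg.inv(np.eye(2) - gamma * P)
```

The update is a contraction with factor `gamma`, so the iterate approaches `mu_exact` regardless of the initial guess; each row of `mu` remains a probability distribution over future states.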

Gamma-model rollouts have an infinite, probabilistic horizon.

Gamma-model rollouts: Replacing single-step models with gamma-models generalizes the procedures that form the foundation of model-based control. In a generalized rollout, the environment time reached at each model step follows a negative binomial distribution. At the first step this reduces to a geometric distribution, by the special case NegBinom(1, p) = Geom(1 - p).
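The distribution over elapsed time can be checked with a small simulation. The sketch below assumes each gamma-model step jumps ahead by a number of environment steps drawn from a geometric distribution on {1, 2, ...} with success probability 1 - gamma; the function name `rollout_times` and the chosen `gamma` are hypothetical. Note that libraries parameterize the geometric and negative binomial distributions differently, so the arguments here follow NumPy's convention rather than the notation in the text.

```python
import numpy as np

def rollout_times(n_steps, gamma, n_rollouts, seed=0):
    """Sample the environment time reached after each gamma-model step."""
    rng = np.random.default_rng(seed)
    # One jump per model step: P(jump = k) = (1 - gamma) * gamma**(k - 1).
    jumps = rng.geometric(1.0 - gamma, size=(n_rollouts, n_steps))
    # Cumulative sums of i.i.d. geometric jumps are negative binomial
    # (shifted so that the support starts at the model-step index).
    return jumps.cumsum(axis=1)

gamma = 0.9
times = rollout_times(n_steps=3, gamma=gamma, n_rollouts=200_000)

first_step_mean = times[:, 0].mean()   # theory: 1 / (1 - gamma)
third_step_mean = times[:, 2].mean()   # theory: 3 / (1 - gamma)
```

Under these assumptions, the expected time after n model steps is n / (1 - gamma), which is why a handful of gamma-model steps can cover a horizon that would take many single-step predictions.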

Value estimation: Single-step models estimate values using long model-based rollouts, often tens to hundreds of steps long. In contrast, a gamma-model expresses a value as an expectation over the output of a single feedforward pass.
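The contrast can be illustrated in a tabular setting, where the gamma-model is available in closed form. The sketch below assumes a hypothetical three-state Markov chain `P`, a state-dependent reward `r`, and the convention that the value sums discounted rewards starting from the next state, so that V(s) = E_{s_e ~ mu(.|s)}[r(s_e)] / (1 - gamma). A learned gamma-model would replace the exact expectation with an average over samples.

```python
import numpy as np

gamma = 0.9

# Hypothetical 3-state Markov chain and per-state rewards.
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.3, 0.7]])
r = np.array([0.0, 1.0, 5.0])

# Exact tabular gamma-model: discounted occupancy over future states.
mu = (1 - gamma) * P @ np.linalg.inv(np.eye(3) - gamma * P)

# Value from one application of the gamma-model: a single expectation.
v_gamma_model = mu @ r / (1 - gamma)

# Reference value from a long single-step rollout (truncated at 500 steps).
v_rollout = np.zeros(3)
P_k = np.eye(3)
for k in range(1, 501):
    P_k = P_k @ P                      # k-step transition probabilities
    v_rollout += gamma ** (k - 1) * (P_k @ r)
```

Both estimates agree up to truncation error, but the rollout needs hundreds of sequential single-step predictions while the gamma-model needs one expectation.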

Generative Temporal Difference Learning for Infinite-Horizon Prediction

Michael Janner, Igor Mordatch, and Sergey Levine

NeurIPS 2020