Generative Temporal Difference Learning
for Infinite-Horizon Prediction
NeurIPS 2020   Paper   Code   Blog   Bibtex


Summary
We train predictive models of environment dynamics with infinite probabilistic horizons using a generative adaptation of temporal difference learning. The resulting gamma-model is a continuous, generative analogue of the successor representation and a hybrid between model-free and model-based mechanisms. Like a value function, it contains information about the long-term future; like a standard predictive model, it is independent of reward.
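
Concretely, a hedged sketch in the paper's notation (mu_theta is the gamma-model, p the single-step dynamics, pi the policy, and s_e a predicted future state): the gamma-model targets the discounted state-occupancy distribution, and training bootstraps a TD-style target that mixes the true next-state distribution with the model's own prediction at the next state-action pair.

    \mu(s_e \mid s, a) = (1 - \gamma) \sum_{\Delta t = 1}^{\infty} \gamma^{\Delta t - 1}\, p(s_{t + \Delta t} = s_e \mid s_t = s, a_t = a)

    p_{\text{target}}(s_e \mid s, a) = (1 - \gamma)\, p(s_e \mid s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a),\ a' \sim \pi(\cdot \mid s')} \left[ \mu_\theta(s_e \mid s', a') \right]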

Gamma-model rollouts have an infinite, probabilistic horizon.

Gamma-model rollouts
Replacing single-step models with gamma-models leads to generalizations of the procedures that form the foundation of model-based control. Generalized rollouts have a negative binomial distribution over time per model step; the first step has a geometric distribution over time, via the special case NegBinom(1, p) = Geom(1 - p).
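
As a quick Monte Carlo check of this claim (a sketch assuming NumPy's conventions, where geometric counts trials until the first success and negative_binomial counts failures before the n-th success): each gamma-model step jumps a geometrically distributed number of environment steps, so the total time covered by an n-step rollout is negative-binomially distributed.

    import numpy as np

    gamma, n_steps, n_samples = 0.95, 5, 100_000
    rng = np.random.default_rng(0)

    # Each model step advances Geom(1 - gamma) environment steps (support >= 1).
    jumps = rng.geometric(p=1 - gamma, size=(n_samples, n_steps))
    total_time = jumps.sum(axis=1)

    # A sum of n geometrics is negative binomial, shifted by n under NumPy's
    # failures-before-n-successes convention.
    reference = rng.negative_binomial(n=n_steps, p=1 - gamma, size=n_samples) + n_steps

    print(total_time.mean(), reference.mean())  # both approximately n_steps / (1 - gamma)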

Value estimation
Single-step models estimate values using long model-based rollouts, often tens to hundreds of steps long. In contrast, values are expectations of reward over a single feedforward pass of a gamma-model.
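
For example, a minimal sketch of this value estimate (the gamma_model.sample and reward_fn names are hypothetical placeholders, not the released API): under the occupancy normalization above, the value is a scaled expectation of reward under the gamma-model's predicted state distribution, so a single model step suffices.

    def estimate_value(gamma_model, reward_fn, state, action, gamma=0.99, n_samples=256):
        # Sample future states s_e ~ mu_theta(. | s, a) in one model step,
        # then average their rewards; no multi-step rollout is required.
        future_states = gamma_model.sample(state, action, n_samples)
        return reward_fn(future_states).mean() / (1.0 - gamma)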
