Before we come to understand how predictions are made, we first must have an understanding of how we
could describe
an agent interacting with its environment. The interaction of the agent with the world is formalized
as a series
of observations and actions. An agent may take an action $a_t$, and be presented with a new
observation
\(o_{t+1}\). If the problem is, say, a board game, $a_t$ may be the move taken by the agent, and
\(o_{t+1}\)
might be an encoding of the change in the board resulting from the move. Because the board is
*fully observable*, $o_t$ perfectly describes the state of the world.
Oftentimes, real-world problems are not fully observable. For instance, imagine a robot interacting
with the
world where observations $o$ describe the sensor readings available to the robot on a
moment-to-moment basis.
The world is *partially observable* to the robot. There are many aspects of the world which are
not
perceived by the sensors of the robot---e.g., objects outside of its visual field---so $o_t$ in this
case
does not describe the state of the world, but rather the *agent state*: the state of the world
from
the agent's perspective.

General Value Functions (GVFs) estimate the future accumulation of a value $c$. In the simplest
case, this
might be the accumulation of some element of an agent's observation, $c \in o$. The discounted sum of
$c$ is called the *return*, and is defined over discrete time-steps $t = 1, 2, 3, \ldots$ as
$G_t = \mathbb{E}_\pi\left[ \sum^{\infty}_{k=0}\left(\prod^{k}_{j=1}\gamma_{t+j}\right)C_{t+k+1}\right]$---the expectation of how
a signal will accumulate over time. What the GVF's prediction is about is determined by its
*question parameters*, including the signal of interest $C$ (often called the *cumulant*),
a
discounting function $0 \leq \gamma(o_t, a_t, o_{t+1}) \leq 1$, and a policy $\pi$ which describes
the
behaviour over which the predictions are made. In the simplest case, the discounting function is a
constant value that describes the horizon over which a prediction is made. For example, if
$\gamma = 0.9$, then the corresponding GVF estimates the accumulation of $c$ over roughly
$\frac{1}{1-\gamma} = 10$ time-steps.
By making predictions like this, we can anticipate how a signal changes over a period of time.
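
To make the horizon intuition concrete, here is a minimal sketch in Python (the `discounted_return` helper is ours, purely for illustration) of how a constant cumulant accumulates under a constant discount:

```python
# A minimal sketch, assuming a fixed discount gamma = 0.9 and a finite
# trajectory of cumulant values.

def discounted_return(cumulants, gamma=0.9):
    """Compute G_t = sum_k gamma^k * c_{t+k+1} over a finite trajectory."""
    g = 0.0
    for c in reversed(cumulants):
        g = c + gamma * g
    return g

# With a constant cumulant of 1, the return approaches 1 / (1 - 0.9) = 10,
# matching the ~10-step horizon intuition for gamma = 0.9.
print(round(discounted_return([1.0] * 1000), 4))  # 10.0
```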

The discounting function and cumulant can also be used to express more complex predictions.
For instance, we can specify a GVF that asks the question "How long until we see *x*" by using
the
following cumulant and discount:
\[
c =
\begin{cases}
1 &\quad \text{if}\quad o_i = x\\
0 &\quad \text{otherwise}
\end{cases}
\label{echo_gvf_c}
\]\[
\gamma =
\begin{cases}
0 &\quad \text{if}\quad c = 1\\
0.9 &\quad \text{otherwise}
\end{cases}
\label{echo_gvf_y}
\]
This has the effect of counting the time-steps until $o_i$ takes on the value of $x$.
There are many other ways that cumulants and discounts can be specified; we chose these two
examples to give a flavour of what can be expressed.
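
As a sketch, the counting question above can be written as a pair of simple functions (the names `cumulant` and `discount` are ours, not from any particular library):

```python
# A sketch of the "how long until we see x" question parameters,
# assuming o is the scalar observation component being watched.

def cumulant(o, x):
    # c = 1 when the observation takes on the value x, 0 otherwise
    return 1.0 if o == x else 0.0

def discount(c, gamma=0.9):
    # gamma drops to 0 once the event occurs, terminating the prediction
    return 0.0 if c == 1.0 else gamma
```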

There is one final question parameter that we haven't discussed yet: the policy $\pi$. We want our
predictions to not only be a function of what we observe, but the actions we take. We want our
predictions
to be able to capture how the environment changes as a response to our own behaviours. To do so, we
condition the expectation on the policy, where $\pi(a \mid o) = \mathbb{P}(A_t = a \mid O_t = o)$. That is,
the policy describes the probability of taking an action $a$ given observation $o$. For instance, if an
agent had three actions---turn left, move forwards, or turn right---we could specify a policy as follows:
\[\pi = [0, 1, 0],\]
which would mean that each prediction would be conditioned on the agent moving forwards.
If this were the policy of our counting GVF, the question would then become "how long until we see *x*
if we continue moving forwards?" Possibly the most powerful aspect of General Value Functions is their
ability to condition predictions on an agent's actions.
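
As a sketch, assuming the three actions are indexed in the order given above, this "always move forwards" policy is simply a degenerate probability distribution:

```python
import numpy as np

# A sketch of the deterministic "move forwards" policy, assuming the
# actions are indexed [turn-left, move-forwards, turn-right].
actions = ["turn-left", "move-forwards", "turn-right"]
pi = np.array([0.0, 1.0, 0.0])  # pi(a | o); here independent of o

def sample_action(pi, rng):
    """Draw an action index according to the policy's probabilities."""
    return actions[rng.choice(len(actions), p=pi)]

rng = np.random.default_rng(0)
print(sample_action(pi, rng))  # always "move-forwards" under this policy
```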

Having determined how we want to express predictions, we need a way to learn these predictions: to estimate their values using the observations and actions available to the agent. General Value Functions can be estimated using typical value-function approximation methods from computational reinforcement learning. In this context, we consider temporal-difference learning.

In temporal-difference learning we estimate a value function $v$ such that
$v(\phi(o_t)) \approx \mathbb{E}_\pi [G_t \mid o_t]$: we learn a function that estimates the return at a
given time-step given the agent's observations. On each time-step the agent receives a vector of
observations $o \in \mathbb{R}^m$, where each component $o_1, \ldots, o_m$ is a different real-valued
input. A function approximator $\phi : \mathbb{R}^m \rightarrow \mathbb{R}^n$---such as a neural net,
Kanerva coder, or tile coder---may be used to encode the observations into a *feature vector*. The
estimate for each time-step $v(\phi(o_t))$ is a linear combination of the learned weights
$w \in \mathbb{R}^n$ and the current feature vector---$v(o_t) = w^\top\phi(o_t)$.
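
A minimal sketch of this linear form, with a hypothetical one-hot coder standing in for $\phi$ (a real agent would use a tile coder or similar):

```python
import numpy as np

def phi(o, n=8):
    """A toy feature map: encode an integer observation as a one-hot vector."""
    f = np.zeros(n)
    f[o % n] = 1.0
    return f

w = np.ones(8) * 0.5  # learned weights (values here are arbitrary)
v = w @ phi(3)        # v(o_t) = w^T phi(o_t)
print(v)  # 0.5
```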

How do we learn the weights? We need an error metric with which we can adapt the weights over time:
a measure to determine how accurate our guess $v(o_t)$ was. In traditional supervised learning,
we compare the estimated value to the true value. In this case, we do not yet know the true value
of the expected return $G_t$. To compute the true value of the return, we would need to collect
$c_{t+1}, \ldots, c_n$, where $n$ is possibly infinite. To resolve this, we estimate the return by
*bootstrapping*: we estimate the value of the return using our current approximate
value function $v$. That is, $G_t \approx c_{t+1} + \gamma(o_t, a_t, o_{t+1}) v(o_{t+1})$. We can then
form the temporal-difference error by taking $\delta_t = c_{t+1} + \gamma(o_t, a_t, o_{t+1}) v(o_{t+1}) -
v(o_t)$
(line 3, Algorithm 1). The more accurate our estimate $v(o_{t+1})$ is, the more accurate our error
$\delta_t$
is: we build the error through which we learn our estimates out of our existing estimates. The value
function's weights are learned iteratively on each time-step by updating to reduce the
temporal-difference
error: $w_{t+1} = w_t + \alpha\delta_t\phi(o_t)$, where the step-size $\alpha > 0$ scales the magnitude
of each update.
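
Putting the pieces together, here is a sketch of the TD(0) update under a constant discount, using our own toy setup of one-hot features over integer observations (not the algorithm listing referenced above):

```python
import numpy as np

def td_learn(transitions, n_features, gamma=0.9, alpha=0.1):
    """Learn weights w from (o_t, c_{t+1}, o_{t+1}) transitions via TD(0)."""
    w = np.zeros(n_features)
    phi = lambda o: np.eye(n_features)[o]  # toy one-hot feature map
    for o, c, o_next in transitions:
        # delta_t = c_{t+1} + gamma * v(o_{t+1}) - v(o_t)
        delta = c + gamma * (w @ phi(o_next)) - w @ phi(o)
        # w_{t+1} = w_t + alpha * delta_t * phi(o_t)
        w = w + alpha * delta * phi(o)
    return w

# A single state with constant cumulant 1: the estimate converges toward
# the return 1 / (1 - 0.9) = 10.
w = td_learn([(0, 1.0, 0)] * 2000, n_features=1)
print(round(w[0], 2))  # 10.0
```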

We call the parameters of the learning methods *answer parameters*. Answer parameters change
how an
agent answers a question. Answer parameters include the step-size\footnote{Also known as the
\emph{learning rate}.} $\alpha$ which scales updates to the weights, and the linear or non-linear
function-approximator $\phi$ used to construct state.