Reinforcement Learning: Summary and Review

Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto

You can download the book for free from the author’s website

Review

Reinforcement Learning profoundly changed my understanding of the science of happiness, biological evolution, human intelligence, and also gave me unique tactics for rapid skill acquisition in my personal life.

I wrote a full-length review of this book here: Book Review: Reinforcement Learning by Sutton and Barto

Summary and Notes

Below follows my detailed notes and summaries for each chapter.

Chapter 1: Introduction

Reinforcement learning as a third paradigm of machine learning, different from supervised and unsupervised learning
RL explicitly optimizes for your final goal instead of trying to manually specify subproblems. Components of RL:
- Policy: How to act, given an environment (or a model of the environment)
- Reward: A signal given to the agent as a result of its action
- Value Function: How the agent evaluates rewards in the long-run, e.g. the sum of all rewards
- Model: Takes as input the raw environment and outputs some structured representation of it (e.g. predict how the environment will change based on its current state, e.g. planning for the future)
RL vs. Evolutionary methods: RL uses value functions to explicitly evaluate its policies, but evolutionary methods

Exercise 1.1 Self-Play

I think the agent would learn a different policy because self-play would cause the agent to encounter fewer low-value states and more high-value states, as its opponent is biased towards selecting the high-value states. By contrast, a random opponent would select states randomly so the value/state distribution is more uniform.

Exercise 1.2 Symmetries

Since tic-tac-toe strategy should be invariant under symmetry, we could amend the learning process to remember previously encountered states and map any newly encountered states that are symmetrically identical to the old states. This would make the agent more consistent, computationally simpler, but possibly give it fewer opportunities to perform exploratory moves because we’ve reduced the number of potential branching-off points. If the opponent did not take advantage of symmetries, it may not be beneficial for us to do so because we could exploit the opponent’s lack of symmetry by forcing it into a mirrored position where it would play more poorly than usual.

Exercise 1.3 Greedy Play

The main problem with greedy play is that it will get stuck at local maximums. Greedy players will likely play worse, because few real-life problems can be solved optimally with a greedy strategy, otherwise the problem would be trivial. It requires no talent to do just the most obvious thing at every step. For tic-tac-toe in particular, it’s possible that greedy play is optimal only because the game itself is not that complicated.

Exercise 1.4 Learning from Exploration

If we also learn from exploratory moves, then I would expect this would result in more wins where our opponent has equal skill, because learning only the highest probability moves will cause the opponent to respond in kind with their own highest probability move, washing out any gains in an arms race. However, continuing to make more exploratory moves can result in greater “surprise” for the opponent where the opponent cannot play any moves that it has calculated as high probability.

Exercise 1.5 Other Improvements

Domain knowledge of the game lets us manually tune rewards to help the agent find better strategies, for example giving small rewards for two Xs in a row. It seems there is a trade-off between how much manual tuning you do and the agent’s performance.

Trade-offs between exploration and exploitation

What happens as the delay increases between penalty and reward?

\[-1,-1,-1, \ldots \textrm{repeated } n \textrm{ times} \ldots,-1,+1000000\]

Or the converse: In option theory selling insurance

\[+1,+1,+1, \ldots ,+1,-1000000\]

As $n$ increases, does it increase exponentially?

It could be argued that humans are also not good at non-greedy payoffs for large $n$. Consider the classic example of dieting for weight loss. We could achieve a large reward if we could only avoid eating cake every day, but alas, this reward is distant in the future so that the small near-term reward of eating cake derails our goal of long-term weight loss.

Chapter 2: Multi-armed Bandits

Action-Value Methods

Estimate arm value by keeping a running average of the reward achieved for each action
Select the greedy action with $1-\epsilon$ probability and randomly select an action with $\epsilon$ probability
The slope of $\epsilon=0.1$ and $\epsilon=0.01$ in Figure 2.2 suggests that we can think of $\epsilon$ as analogous to the “step size” for gradient descent?
“It is also possible to reduce $\epsilon$ over time to try to get the best of both high and low values.” This is like a learning rate schedule in pytorch optimizer functions.

Exercise 2.3

In the long run I would expect the $\epsilon=0.01$ method to perform the best, because while it will take longer to find the optimal arm, once it finds the optimal arm it will deviate away from it much less than the other strategies. When the long run is very long, then how long you can exploit matters much more than how much time you ‘waste’ on exploring.

Incremental Implementation

Seems similar to technical analysis stock trading, where you keep track of a moving average and then “buy” (pull the arm) whenever your estimated reward exceeds the moving average, or “sell” (switch arms) otherwise. See SMA on Investopedia

Tracking a Nonstationary Problem

Same as using a simple moving average, except it’s exponentially weighted, again just like in technical analysis stock trading, for the same reason (recent information is likely to be more important than old information). See EMA on Investopedia

Optimistic Initial Values

Exercise 2.6

I think the spikes happen because when we give optimistic initial values in the early stages, it disproportionately encourages the algorithm to choose arms that are bad, but just happen to get good payoffs in the early stages due to luck. This doesn’t happen as much in the non-optimistic algorithm because luck isn’t concealed by the optimistic estimate at first.

Exercise 2.7

$\bar{o}_n$ is recursively defined in terms of $\alpha(1-\bar{o}_{n-1})$, so the higher order terms (less recent steps) will get multiplied more often (exponentially), making it recency-weighted without initial bias.

Upper-Confidence-Bound (UCB) Action Selection

I think of this like the information gain ratio in decision trees. epsilon-greedy algorithms explore indiscriminantly, but what we would optimally like to do is favor exploring the arms that would maximize entropy (i.e. explore arms with the lowest information to gain the most information). The information theory perspective intuitively explains why we need natural log in the equation, because entropy is the negative log of the probability mass function.

Is this related to the optimal stopping problem, or the secretary problem? We can think of the secretary problem as a specific case of the $k$-armed bandit problem, except each arm can only be pulled once, you are only allowed to keep the reward from the last arm you pulled, and $k=$ the number of candidates applying for the job.

I am surprised the book doesn’t talk more about the Gittins index, I read about it in the book Algorithms to Live By by Brian Christian and Tom Griffiths, and although there are lots of strong assumptions that need to be true for it to work (like the geometric discounting, assuming zero cost of switching, rewards being a constant fraction of the previous one), it has the advantage that you can just pre-compute a huge table for your problem and then your agent just needs to do a simple table lookup when taking its actions, potentially saving valuable time in an realtime interactive environment when the agent is deployed. It seems more suitable for a memory/speed tradeoff situation where you can afford lots of memory but not clock cycles. I’d like to know if there’s more research into whether the Gittins index approach will work better for this kind of situation.

Chapter 3: Finite Markov Decision Processes

Basically, instead of receiving just the reward from the environment, the agent also receives some state information from the environment, which it can then use in its value/policy functions.
Markov Property: This is related to the Markov Blanket that we learned in the Machine Learning course on Bayesian Networks.
In a finite MDP states, actions, and rewards are finite sets. The random variables $R_t$ and $S_t$ have well-defined discrete probability distributions dependent only on the preceding state and action. We can see each state transition as a conditional probability distribution.
Actions can be high-level or low-level; usually motors and sensors are considered part of the environment, since the “agent” is the abstract decision-maker, not the physical agent.
The “discount rate” is analogous to a net present value calculation, where you want to figure out the price you would be willing pay today to receive a sequence of cash flows in the future.
Episodic vs. Continuous tasks: does the task eventually terminate?
Dynamics function $\pi\left(s', r \vert s, a\right)$. Probability of ending up in state $s'$ and getting reward $r$, given state $s$ and action $a$.
Value function: How good is it to be in a state, under some policy
Action-value function: How good is it to be in a state, take some action, then forever afterwards follow some policy $q_\pi(s, a)$
Policy function: Mapping from states to probabilities of selecting each possible action.
The absolute value of a reward is not important, only the relative difference between rewards.
There is only one optimal value function but there may be more than one optimal policy function.
It’s more difficult to find the optimal value policy than the optimal policy function, because once you have the optimal value policy then picking the best policy is trivial, just greedily maximize over the optimal value policy.
The optimal solution requires we have the necessary computational resources, which realistically we almost never do because the problem space increases exponentially, so we can only ever approximate.
Can the discount rate be learned, or it’s a hyperparameter that must be specified in advance?
“RL is better at making good decisions for frequently encountered states, at the expense of less effort for infrequently encountered states.” Does this lead to bias? Because you have more data to advantage behaviors from the majority group, and you disadvantage behaviors from the minority group. This is somewhat mitigated by importance sampling and the off-policy behavior policy, described in a later chapter.

Exercise 3.1

Task 1: Write a bestselling novel

Actions: What key to press next on the keyboard
States: The keys that have already been pressed
Rewards: Sales ($) of books (or views, downloads, etc.)

Task 2: Gardening/Farming

Actions: Location to plant seeds, extract weeds, harvest fruits. Amount of and location to apply water, pesticides, fertilizer.
States: Size/shape/age of the plants.
Rewards: How pretty the garden looks (either a human judge can give a rating, or a GAN can give a rating based on photos of other gardens), or mass of edible matter harvested from a farm.

Task 3: Predict winning lottery numbers

Actions: The numbers to pick on a lottery ticket
States: Previous winning numbers
Rewards: Winning the lottery

Exercise 3.2

I think goal-directed learning works only if there is a strong feedback loop between actions, states, and rewards. Otherwise if the environment is chaotic, behaves highly non-deterministically, or subject to non-rational behavior, then I don’t think MDP will work even if the problem can be fully specified in terms of actions, states, and rewards. Some examples are picking winning lottery numbers, telling a funny joke (e.g. measuring volume of laughter), day trading on the stock market, chasing a tornado.

Exercise 3.3

There is a saying in philosophy, “The map is not the territory, but you can’t fold up the territory and put it in your pocket”, and “all maps are wrong, but some maps are useful”. Thus, I would argue that there is a “correct” place to draw the line between the agent and the environment, and the basis of this preference is the usefulness of the resulting model. Defining the driving task where actions represent tire torque is like a map that is too small, your actions are extremely precise but your policy needs to be extremely complicated to compensate for the fact that you’ll have a vastly enlarged state space to consider (is a torque of 100 Nm too much or too little? Depends on if you are speeding up or slowing down, if there’s ice out, how heavy is the car, the gear ratio of your transmission, etc.) and it’s impactical to obtain that much sensor data, even if you had the computational resources. Defining the driving task where actions represent muscle twitches is like a map that is too big, only a tiny subset of all possible muscle twitches will actually move a car forward, your actions are extremely imprecise (most actions result in flailing around) even though you have available to you vastly more possible ways of achieving your goal (you could use the same muscle twitches to fly a plane, attaining the goal faster and achieving a higher reward), but the vast amount of computing power you’d have to burn to do this makes it impactical. Thus, defining the driving task as “accelerator, steering wheel, and brake” is the most natural way to define the task because this is the most practical, and therefore most useful way to model the problem.

Exercise 3.7

The robot needs to also get a negative reward for each unit of time it spends in the maze without escaping, to incentivize it to find the exit faster. Otherwise it can take as long as it wants to find the exit, which may be a very long time. Alternatively, you can treat it as a continuing task and discount the final reward to incentivize speedy completion.

Chapter 4: Dynamic Programming

DP assumes a perfect model of the environment and is of limited utility but is important theoretically.
Exact solutions are possible only in special cases, instead, we will solve DP problems by approximating.
Updating $v_{k+1}$: ($k =$ number of steps, $v_k =$ value for $k$ steps) work backwards from your terminal state and propagate rewards achieved backwards through the state graph.
Value Iteration vs. Policy Iteration: Value Iteration updates the value graph one timestep at a time (to look ahead one step at a time), before updating each policy one timestep at a time. Policy Iteration updates the entire value/policy graph based on the non-deterministic policy/actions. It sweeps back each node and propagates the computation on the fly instead of doing it in the outer loop and waiting only one step.
Value Iteration is easier to parallelize than Policy Iteration. Since the value of each state can be computed independent of all other states, but in Policy Iteration previous actions have to wait for all future actions to finish evaluating before we can evaluate them.
GPI reminds me a lot about coevolution in biology. There’s an OpenAI blog post talking about how genetic algorithms that have 2 agents competing against each other can drive richer and more complex behavior because a changing environment or an adversarial agent applies stronger selection pressure. This is another way of viewing GPI in terms of two constraints, one from the value function and the other from the policy function.
Iterative Policy Evaluation, for estimating $V$ from $v_\pi$. In the state update for $v_{k+1}$ from $v_k$: Is this algorithm is analogous to how we use “relaxation” in the Bellman-Ford algorithm?
In policy iteration, why is each policy guaranteed to be a strict improvement over the previous one? Is it because it accumulates over all possible future actions, so for non-greedy strategies of length $k$ you would need $k$ iterations. That’s why it works even though it’s a strictly greedy algorithm?
Gradient Descent vs. Bootstrapping: This reminds me a lot of “momentum” in gradient descent. Is there any connection between value/policy iteration and gradient descent?

Chapter 5: Monte Carlo Methods

You can think of Monte Carlo (MC) methods as a combination of non-stationary bandits with DP
Notation: The book uses $G$ to denote “sample return” instead of $R$
First-Visit vs Every-Visit: To compute the value of state $S_i$, use only the value estimate you got from the first time you visited $S_i$ in First-Visit, and lock this same value in for all subsequent visits; otherwise update the value estimate and take the average for every time you visit $S_i$ again in Every-Visit. Why they converge even though they behave differently: Markov property: The goodness of $S_i$ doesn’t depend on the history of what you visited before it.
MC does not require a model compared to DP
MC estimates for each state are independent, they do not “bootstrap” as in DP.
Bias-Variance trade-off: MC has low bias but high variance (due to sampling); bootstrap has low variance but high bias (because you are building estimates on previous estimates of value, which could be biased)
MC is way more computationally feasible than DP even when you have perfect knowledge because it can work with sample episodes alone. Generating samples is easy, but updating all probabilities before applying DP requires iterating over the entire value/policy space. This is also great when you only want information about the current state, not all states.
Soap Bubble: This works for the same reason gradient descent works: all functions are locally linear, because we are doing a first-order Taylor approximation in its local neighborhood
MC backup diagram just looks like a very, very long chain, but DP looks like a (shallower, but wide) tree.
Exploring starts: Another disadvantage of this approach is if your starting state-action space is huge to begin with. E.g. There are $10^{100}!$ states or something. Or, every possible chess position. Or, every possible 12 megapixel RGB image.
MC converges when (1) We assume episodes have exploring starts (2) There are an infinite number of episodes (3) We show that $\pi_{k+1}$ is strictly better than or as good as $\pi_k$
On-policy vs. Off-policy: Like epsilon-greedy vs. non-greedy exploit/explore.
Off-policy methods are slower and more computationally expensive, but more general.
Importance-Sampling ratio: $\frac{\pi\left(A \vert G\right)}{b\left(A \vert S\right)}$ product over all time steps
Pros/Cons of Asynchronous DP vs. Monte Carlo? They seem to be computationally similar because both can be parallelized, and value estimates in each can be computed independently of other states.

Chapter 6: Temporal-Difference Learning

“Temporal Difference”: The change in the error of the prediction over time of the prediction, as you gain new experience. Compared to Monte Carlo, where you have to wait until the end of episode to know the true return. TD can be online, and fully incremental, but MC must wait until you get the true return.
TD is a combination of MC and DP ideas: Like MC, it can learn from direct experience without a model, but like DP they can bootstrap without waiting for a final outcome.
$TD(0)$ updates its estimate $V$ of $v_\pi$ immediately after each time step, instead of at the end of the episode—it uses estimates of rewards to do this (like in MC) since it does not yet know the final reward.
Monte Carlo finds estimates that minimize the mean squared error, but batch $TD(0)$ finds estimates for the maximum-likelihood estimate of the Markov process.
Sarsa is for the action-value function $Q$, and TD is just for the value function $V$.
TD converges only when you don’t violate the Markov Assumption.
Sarsa uses five data points for its update: $S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}$
Sarsa is on-policy and Q-learning is off-policy
Expected Sarsa: Use expected value instead of maximum value to estimate $Q$.
Maximization Bias: Basically, if there is some variance in your rewards, then even if the average reward is negative, you might be likely to choose the actions that lead to a negative reward because the variance causes that state to sometimes have large positive value. This can be mitigated by using double learning, where you use a second estimator to get an unbiased estimate of the first one.
Afterstate value function: Evaluate the state-value function after an action is selected, instead of before. This can collapse various identical states into the same “afterpositions” even if their previous state/action was not. Because of the Markov assumption, we don’t care how we got to the “afterposition”, even if there are many paths to get there, only that we are currently in the same afterstate.

Exercise 6.2

TD updates are likely to be much better initially when you move to a new building because most of the state-value and action-value functions are the same, so the TD estimates based on those are already highly accurate, and it only needs to update its “last mile” estimations for the new building.

Exercise 6.4

I think larger fixed values for alpha will lead to worse performance because you are giving more weight to each specific reward encountered, which increases the variance of the whole thing and making it less likely you’ll converge in the long-run, even if you have better improvement in the beginning. This could be mitigated by having larger values of alpha at first and then gradually tapering off and reducing alpha over time.

Exercise 6.11

I think it’s considered off-policy because it tries to estimate the optial action-value function independent of the policy being followed.

Exercise 6.12

Yes, I think this would make Q-learning the same algorithm as Sarsa because they both will pick the maximum estimate for the action-value function. Given they will be picking the same actions under the same action selection criteria, it follows that the updates will also be the same.

What happens for sparse rewards, is TD too “spikey”? Because the error estimates may be wildly swayed by events that happen during the episode, if the rewards have high variance?
It seems like it’s up to the programmer to specify how TD does exploration/control, is the “best” exploration/control policy application-dependent? Or are there some general rules of thumb for what exploration strategies work best?

Chapter 7: n-step Bootstrapping

n-step TD can be thought of as a generalization of $TD(0)$ on one extreme and Monte Carlo on the other extreme
n-step methods can propagate rewards further “back” up the back-up diagram to update more states than the $TD(0)$ case. The way that n-step handles bootstraping is to update values based on the value of that we think a state has $V()$, this is the mechanism that allows each state to get an n-step look into the expected value of the future states.
In the basic (non-importance sampling weighted) n-step TD prediction we omit the actions from the return calculation, so we just deterministically follow a fixed policy forever, which is why it’s on-policy.
Bias-Variance trade-off: n-steps is lower variance than both $TD(0)$ and MC because we do not propagate rewards all the way through the entire tree (nor are we hyper-local with the updates), but the trade-off is the steps it takes are slower so it will potentially take longer to converge.
The key property bootstrapping: You don’t have to wait to earn a reward in a possibly low-probability event in the same trajectory to do a value update on a state, you can update based on past trajectories that visited the same state even if the current trajectory did not earn the same reward. This assumes the Markov assumption is true, that your rewards are agnostic of how you got to a state.
n-step TD can be turned into an off-policy algorithm by using importance sampling to weight between $\pi$ and $b$.
Control variate: If $X$ and $Y$ are correlated, then if you see $Y > E[Y]$, it’s likely that $X > E[X]$. Used in the update to $G$ as a “fallback” when the importance sampling ratio is $0$, so we can still use $V(S)$ as our return. It lets us measure how much better/worse the current sample is from the average, as a correction.
n-step Tree Backup: Use the estimates of actions that were not selected to bootstrap the value of each state. This allows you to avoid using importance sampling, because it allows us to consider unselected actions more generously and don’t have to correct as much for their not being selected. The actions are weighted by their probability of being selected under policy $\pi$.
How to determine the best value for $n$? There is clearly a trade-off between ease of computation and accuracy, as larger values for $n$ require more computation but are potentially more accurate. In general what are some heuristics or metrics we should use to determine the optimal $n$? Do we just scan over the hyperparameters and pick the best performing one as in Figure 7.2? But this may not work in an online learning environment where you only have one shot to look at the data.

Chapter 8: Planning and Learning with Tabular Methods

“Good Old Fashioned AI” (GOLF AI), “Classical Planning”, non-statistical based methods of creating policies, where you try to explore the state space and come up with a policy.
Model-Based a.k.a. planning: DP, Heuristic Search
Model-Free a.k.a. learning: MC, TD
Both methods are similar in they need to compute value functions, look ahead to future events, and compute a back-up value.
Model: Anything an agent can use to predict the environment.
Distribution model: all possible outcomes and their probabilities
Sample model: Just one possible outcome, sampled according to its probability.
Models can simulate experience by generating transitions given state and action
Indirect methods remind me of importance weighting ratio, where you take into consideration your limited experience by assigning some credence to experiences that never happened, but weight them by their probability of actually happening.
Dyna-Q: A unified architecture incorporating planning, acting, and model-learning.
Planning: Updated based on $V(S)$ e.g. $TD(n)$
Learning: Updated based on actual reward only e.g. MC
Connection to evolutionary biology: Simulated experience, how humans evolved intelligence and what separates us from primates is the ability to imagine events that don’t exist and learn from projections of the future, instead of just direct experience (e.g. visualizing a lion that doesn’t exist, instead of waiting to encounter a lion that does exist and then it’s too late.)
Real vs simulated experience, simulated is also real, and we shouldn’t undervalue it. The human ability to delay gratification is not so much that we put aside the “reward”, but more than we substitute the actual reward for an imagined one. In the “Marshmallow Test”, the most successful children aren’t the ones who just had iron willpower, but instead those who distracted themselves by singing songs, playing with their hands, etc. E.g. toiling away at the fields for the reward of an imagined afterlife. Dan Gilbert in Happiness can be synthesized: “We smirk, because we believe that synthetic happiness is not of the same quality as what we might call “natural happiness.” “I want to suggest to you that synthetic happiness is every bit as real and enduring as the kind of happiness you stumble upon when you get exactly what you were aiming for”
Stories as simulated dopamine: Animals require Pavlovian training and classic rewards/punishments to influence behavior, but for humans we can get away with simulated rewards/punishments (e.g. heaven/hell), because the simulated rewards/punishments are just as real to us as the real rewards/punishments. Stories of eternal rewards and punishment
Tabular Dyna-Q updates $Q(S,A)$ based on a random selection of previously observed states and random actions previously taken, this is the “simulated learning” component.
If the environment changes for the worse, most RL agents will be able to to adapt as long as their learning rate is tuned correctly because their rewards decrease immediately and they have an incentive to do better. However if the environment changes for the better, it may go completely undetected, because once a model is learned it is somewhat “locked in” especially if the better path requires lots of exploration that isn’t “worth it” at a late stage of learning.
Connection to startup culture and “disruptive innovation”: The benefit from exploration as a function for how often the environment changes. Asymmetric behavior when the environment changes for the better, even recalcitrant agents will change when the environment becomes worse, but it takes an open-minded agent to take advantage of an environment that becomes better. Paradigm shifts don’t happen when things become worse, because everyone is forced to adapt. They happen when something changes but no one notices the better way because the current way is “good enough”, and that’s where opportunities lie.
Exploration vs. Exploitation: Effectiveness vs. Efficiency, “Doing the right thing” vs. “Doing things right”.
Prioritized Sweeping: Like “Importance Weighting” for simulated transitions. Selecting actions/states more intelligently than uniformly at random to improve performance. Instead, we maintain a queue and prioritize by the magnitude of the change. Intuitively, this updates the “most wrong” estimations first.
What happens to prioritized sweeping in a stochastic environment with high variance? We should use an expected update instead, or else each update will whipsaw?
Trajectory sampling: distribute updates according to the on-policy distribution (i.e. the current policy); simulate individual trajectories and update states/state-actions along the way.
Not all states are reachable from the start state following an optimal strategy, these are valid states, but we call them “irrelevant” because they are unreachable from any start state under any optimal policy.
RTDP: On-policy trajectory-sampling version of value-iteration in DP. Essentially, this is an optimized version of DP.
Background Planning: use simulated experience to improve a policy/value function based on a model.
Decision-Time Planning: use simulated experience to select an action for the current state; plan after encountering a new state, focused on future actions.
Heuristic Search: Consider a large tree of continuations for each state encountered.
Heuristic Search is focused on states and actions that immediately follow the current state, as these are the highest priority for update and why it is effective even when it requires a lot of computation, because the computation is so focused (compared to value/policy iteration, where you are updating lots of states that have nothing to do with your current state.)
Rollout Algorithms: Estimate action values for a given policy as the average of returns of a Monte Carlo simulation of trajectories that start with some action then follow some policy.
Rollout algorithms are memoryless; they make immediate use of action-value estimates, then discard them, as it is a decision-time planning algorithm for only the current state and a given policy, not for the complete action-value function. It’s not technically a “learning” algorithm because it does not maintain long-term memory of values or policies.

Monte Carlo Tree Search (MCTS)

A rollout algorithm but remembers and can accumulate value estimates obtained by its simulation.
MCTS runs recursively, each execution is a rollout, repeating until it runs out of time.

Four Steps:

Selection: Starting a root, select a tree policy
Expansion: On some iterations, run more MCTS on the child nodes
Simulation: From the selected node, run simulate a complete episode
Backup: Backup update from the simulated episode.

MCTS is an online, incremental sample-based algorithm for value estimation and policy improvement. It iteratively grows a lookup table for a partial action-value function, with estimated values of state-action pairs visited along high-yielding sample trajectories. If you can pick a better initial policy, and a better rollout policy, you can improve MCTS, because the rollout policy needs to be cheap to compute to simulate lots of trajectories, and it informs what action MCTS should take.

The chapter presents a model as a separate entity which influences planning for a policy. But can a model be entirely subsumed inside a policy? Since you could consider any model as just extra parameters on a policy itself. They seem to be separated artificially only because it seems easier to understand and program that way, but is a “model” and a “policy” really just the same high-level concept?
Is Heuristic Search just another name for Minimax decision-making in game theory?
Can it be said that Heuristic Search is more suited to situations you are fairly confident will never occur again? Because if your current trajectory seems likely to be repeated, then it pays to remember this information somehow, as in the tabular methods and value/policy iteration we discussed earlier.

Chapter 9: On-policy Prediction with Approximation

Tabular methods only work for discrete state spaces; for continuous problems, we’ll try function approximation
$\hat{v}$: the approximate version of the value function, given a weight vector $w$. We can think of the weight vector as either linear coefficients, or the weights in an ANN, or a decision tree, $\theta$ or Parameters, etc.
Function approximation can’t augment the state representation with memories of past observations
“Updating” the value function: Each update is an “example” of the input-output behavior of the value function: The estimated value for state $s$ should be more like the update target $u$.
In Machine Learning, this is basically “Supervised Learning”. Neural Nets as universal function approximators.
However in the RL environment, learning is inherently online, and we can’t assume a static training set from which we can make multiple passes. Also, we must be able to approximate a nonstationary target function.
Prediction Quality: There are more states than weights, so we have to say which states we care most about. $\mu(s)$ is this state distribution of how much we care about each state. The error is the mean squared error between the approximate value $\hat{v}$ and the true value function $v_\pi$. So the objective function is this difference weighted by mu, defined as $\overline{VE}$ (the objective function MSE: Mean Squared Error)
On-policy distribution: usually mu(s) is the fraction of time spent in $s$. So states that are more commonly visited are more important.
Note that the actual performance objective is to find a better policy, not necessarily to minimize $\overline{VE}$, but you need a good $\overline{VE}$ to generate good policies.
Stochastic Gradient Descent (SGD): Descent method to minimize $\overline{VE}$. $\hat{v}$ must be a differentiable function. To ensure convergence, we must decrease the step size over time. You can think of each “data point” used as a single rollout of a Monte Carlo episode.
Monte Carlo state-value prediction is guaranteed to find a locally optimal soltion, under the stochastic approximation conditions, as $U_t = G_t$ is unbiased.
Bootstrapping targets are not a true gradient method because they are biased, they depend on the current value of the weight vector. Gradient methods are still advantageous for being online.
State aggregation: group together states into one estimated value (one component in $w$). This essentially converts a continuous problem into one that can be solved by tabular methods.
Linear Methods: $\hat{v}$ as a dot product between the weight vector and a feature vector. A “feature” is a real-valued representation of some state $s$.
How to choose appropriate features? Linear methods cannot take into account interactions between features.
- Example: Use polynomials to represent combinations of states.
- Example: Use fourier bases to represent linear combinations of cosine features, over some high order $n$. Fourier features don’t perform as well with discontinuities. You’re also assuming a shape of the global generalization.
- Example: RBF Basis. Gaussian Kernel that is inherently local, you can think of it as being “soft tabular”, like using a softmax instead of an argmax. Assumes that generalization is inherently local, that local state spaces are inherently related to each other.
- Example: Hand designed features. E.g. “Distance to nearest food pellet” in Pac-Man, instead of one state for each pixel.
TD Fixed Point: When computing the error between $v_\pi$ and $\hat{v}$, using MC to estimate $U_t$ is unbiased and doesn’t depend on $w$, so we can converge to near the global optimum, but if we use bootstrapping then each iteration of computing $U_t$ depends on $w$ and so the target changes from under us. This instead converges to the TD fixed point (a local optimum). The bound of the distance between the TD fixed point and the global optimum is related to $\frac{1}{1-\gamma}$, so it is desirous to have small values for $\gamma$ (more heavily discounting). In practice, we want larger $\gamma$, but it still works out because bootstrapping gives us a good advantage.
Non-Parametric methods: They have parameters, but they aren’t fixed, or the parameters grow according to the size of the data.
Memory-based Function Approximation: “Lazy Learning”: Save training examples as they arrive, and retrieve a set of examples only when a value estimate is required for the query state.
Kernel-based Function Approximation: Lazy Learning require some measure of “distance” or “similarity” that may not necessarily be Euclidean distance. Kernel functions provide some custom measure of similarity between states. Kernel functions numerically express how relevant knowledge about any state is to any other state, or $k(s, s')$ measures how strongly you can generalize from $s'$ to $s$.
Interest: How interested we are in valuing the given state
Emphasis: How much we want to emphasize the learning update done at time t (could be across multiple states)
How to prevent overfitting with a Fourier basis?
Can we still estimate $\overline{VE}$ if $\hat{v}$ is not a differentiable function? We can just substitute SGD for any number of other non-descent methods in numerical optimization?

Chapter 10: On-policy Control with Approximation

Average Value RL / Average Reward Control: To replace discounting in the continuing case.
Episodic semi-gradient control: Control involves taking the Action into account, whereas before we selected random training examples from $S_t \to U_t$.
Action-Value prediction for continuous actions are still an area of ongoing research, but we can try the existing techniques. Another technique we can do is to make the action part of the function approximation by making action itself a dimension. You can’t do an argmax anymore if the action is continuous, but you can do something like a line search.
Average Reward Setting: A third setting alongside episodic and discounted settings. Applies to continuing problems where the interaction goes on without termination or start states.
The quality of a policy is just the moving average of the reward.
Ergodicity: Assumption about long-term behaviors/convergence. Any early decision made by an agent can only have a temporary effect, in the long run the expectation of being in a state depends only on the policy and the transition probability.
Discounting does not work: the average of the discounted returns is proportional to the average reward—ordering doesn’t matter. Function approximation voids the guarantee of the policy improvement theorem. In function approximation, we cannot guarantee improvement in any setting.
In order to effectively learn a value function in the continuous setting, is it REQUIRED that we have some sort of a “similarity” metric between different “states”? While no state may be visited more than once, it seems that we can’t learn a value function unless we are able to choose some some way of comparing states that are said to be “close together”.

Chapter 11: Off-policy Methods with Approximation

Function approximation for off-policy is significantly harder than for on-policy, because the distribution of off-policy updates does not match the on-policy distribution.
Baird’s counterexample: A complete MDP system that is unstable; trying to approximate the state-value function will never converge.
The Deadly Triad: The danger of instability and divergence arises whenever the following three properties are combined:
1. Function approximation
2. Bootstrapping (Update targets use existing estimates, instead of only actual rewards (as in MC))
3. Off-policy training
Instability does not result from control or generalized policy iteration, nor from uncertainties about the environment; the Deadly Triad itself is sufficient.
Since we cannot give up Function Approximation, it follows then that we trade-off between Bootstrapping and Off-policy training. If we must have off-policy training, then we must give up Bootstrapping, which is possible but significantly reduces data efficiency and increases computational costs.
“Learnability”: A hypothesis is “learnable” only if it is efficiently learnable, i.e. it can be learned in polynomial rather than exponential time/examples.

Chapter 12: Eligibility Traces

Eligibility trace: Unification and generalization of TD and Monte Carlo
Main mechanism is a short-term memory vector, the eligibility trace $z$, alongside the long-term weight vector $w$. Learning occurs in the component of $w$ if a nonzero TD error occurs before the trace in $z$ decays to zero.
Lambda $\lambda$: Trace delay parameter. How quickly does the trace in $z$ decay? Similar to “half life” of a chemical element. But $\lambda$ itself is not an actual “half life” measurement, instead that’s denoted $\tau_\lambda$.
Computational advantage over n-step is only a single vector $z$ is needed, instead of the last $n$ feature vectors. Learning can occur immediately instead of being delayed for $n$ steps.
Forward view: updates are based on rewards and states in the future. Backward view: updates use the current TD error, but look backward to recently visited states using an eligibility trace.
Compound update: Weighted average of n-step returns for different $n$.
$\lambda$-return: Average all the n-step updates, each weighted proportionally to $\lambda^{n-1}$
$TD(\lambda)$:
1. Update the weight vector at every step, instead of waiting until the end of an episode (as in offline $\lambda$-return)
2. Computations are uniformly distributed in time, instead of batched at the end of an episode
3. Can be applied to continuing problems instead of just episodic problems
The eligibility trace is incremented each time step by the value gradient $\nabla \hat{v}(S_t, w_t)$, and fades away according to $\gamma\lambda$.
You can think of the trace as indicating the “eligibility” of each weight vector component for undergoing a learning update.
$TD(\lambda)$ updates the weight vector per $\alpha\delta_t z$, where $\delta_t$ is the TD error and $\alpha$ is the step size.
Truncated TD (TTD): use a parameter h to stop backup of $\lambda$ before the final state. It’s like having another T that’s shorter than the full episode. Avoids having to calculate tiny updates (as $\lambda$ exponentially decays) for far away states.
Online $\lambda$-return: Redo updates when you receive new data, to get an updated weight vector in each sequence
True Online $TD(\lambda)$: Take only the weight vectors on the diagonal from the ones produced by the online $\lambda$-return algorithm. It’s called “True” because it’s “truer” to the offline $\lambda$-return. “What would my weights have been if I did the offline version?”
Watkin’s $Q(\lambda)$: Decay the eligibility trace the usual way as long as a greedy action is taking, but cut the trace to zero as soon as a non-greedy action is taken.
Saving on computation: On non-parallel computers, keep track of only the few traces that are significantly greater than zero.

Chapter 13: Policy Gradient Methods

Policy Search (instead of Value Iteration): Learn $\pi$ directly without the value function $v$. $\pi$ can be simpler than $v$, and this naturally lends itself to continuous actions.
Policy gradient method: Gradient ascent to maximize some performance metric $J(\theta)$.
Policy gradient method is just one way of policy search. Others:
- Hill Climbing: Perturb your value randomly and see if it is better, instead of computing a gradient. Does not require your function to be differentiable.
- Genetic search: “crossover” to reproduce policies. Works well for problems that are compositional, where you can do reasonably well with partial solutions, and combining partial solutions are able to get a good global solution. E.g. each partial solution encodes optimal local behavior.
- CMA-ES: Covariance matrix adaptation evolutionary strategy. Successively iterate a gaussian distribution of policies to sample from, and use this information to skew the distribution towards the samples that perform best.
Policy parameterization can’t be deterministic in practice. This is to avoid being stuck at some local optima and never exploring more. Also, sometimes the optimal play in situations with imperfect information is inherently stochastic, like rock-paper-scissors.
Softmax parameterization: Select actions probabilistically based on softmax distribution. This can approach a deterministic policy, whereas epsilon-greedy always has some epsilon probability of selecting a random action. Allows selecting actions with arbitrary probabilities.
Gaussian paramterization: Can also select actions based on a gaussian distribution.
The choice of your policy paramterization strategy is sometimes a good way of injecting prior knowledge about the desired form of the policy into the RL algorithm.
REINFORCE is often high variance, so we modify REINFORCE with Baseline: Subtracting the baseline from the reward is analogous to the TD error. This is done to reduce the variance of the expected value of the update. We can choose the baseline function as $b() = \hat{v}()$.
“actor”: policy, “critic”: (state)-value function

Chapter 14: Psychology

Algorithms for prediction correspond to classical/Pavlovian conditioning, and algorithms for control correspond to instrumental/operant conditioning.

Classical Conditioning

Blocking: Failing to learn a response when a potential conditioning signal is presented with another one that was previously used to condition the animal. Like taking a “shortcut” by associating only the first signal and not the second.

Second-order conditioning: Presenting a black square accompanying a metronome, but not following food. Pavlov’s dog salivated in response to the black square even though it never followed food (as the metronome had). This is similar to the TD model of classic conditioning via the bootstrapping idea.

Operant Conditioning

Law of Effect: Cats placed in a puzzle box will try actions randomly until they accidentally push a bar that allows them to escape. The next time they’re put in the box they will immediately push teh bar. Reinforcement learning also combines serach (selection, trying alternatives) and memory (assocative, the chosen alternatives are associated with states to form the policy) to solve a task.

Is animal exploration “absolutely random, blind groping” or is there guidance from prior learning, reasoning, or otherwise deliberate exploration? This is subject to controversy.

RL can trade-off between random and deliberate exploration, for example in $\epsilon$-greedy or UCB algorithms.

Delayed Reinforcement

Credit-assignment problem for learning systems: There can be a huge number of decisions that went into producing a reward, many of them spurious. Which decisions were actually important? RL resolves this by using eligibility traces, and also TD methods to learn value functions.

Stimuli leave traces in an animal’s nervous system that persists (but weakens over time), which is the analogy driving the eligibility trace.

Cognitive Maps

Model-based RL are like “cognitive maps” in psychology, where we explicitly try to model the environment, e.g. using a state-transition function or a state-value function.

Habitual vs. Goal Directed Behaviour

This is like RL’s distinction between model-free and model-based algorithms. Model-free algorithms make decisions by accessing information that has been stored in a policy or an action-value function, whereas model-based methods select actions as the result of planning ahead using a model of the agent’s environment.

Chapter 15: Neuroscience

The nervous systems of humans and animals implement algorithms that correspond strikingly to RL algorithms. The main point of contact is the chemical dopamine, which functions as biology’s “reward” signal.
Synaptic plasticity: the ability for synapses to modulate in response to the environment, such as arousal, attention, memory, mood, emotion, sleep, and body temperature. You can think of the parameters (weights) of an RL algorithm as corresponding to synaptic efficacies. The brain modulates synpatic plasticity using dopamine, and this is the analogy for how the brain implements reinforcement learning algorithms.
RL’s “Reward” is not a single chemical or master reward signal in an animal’s brain, instead, $R_t$ can be thought of as an abstraction summarizing the overall effect of multiple neural signals generated by many parts of the brain and nervous system that assess how rewarding or punishing a sensation is.
Reinforcement signals are used to direct changes, and are not reward signals. For example, in TD learning the reinforcement signal is the TD error, not the reward.
Reward Prediction Error (RPE) measures the discrepancy between the agent’s predicted state-value ($V$) or action-value ($Q$) estimate of the expected reward and the received reward signal, of which the TD error is a special case.
Dopamine is the TD-Error, not the reward. We don’t receive any dopamine when the reward we get is exactly the same as our expectations, we only get dopamine when the reward exceeds expectations.
The implication of this is “chasing the high”, why you need more stimulus to generate the same dopamine reward over time, because your expectations get adapted to the rewards you already received.
An animal’s dopamine response profile corresponds to RPE (or TD error) in RL. Dopamine is upregulated only when the actual reward exceeds the animal’s expectation of reward. Dopamine is downregulated when the actual reward is less than the expected reward. Therefore dopamine is a reinforcement signal, not a reward signal.
This is not a perfectly analogous because an earlier than expected reward produces a positive dopamine spike, but then later at the expected time dopamine does not drop below baseline the way that TD becomes negative.
Religion: Like promising an infinite future reward that gets propagated back temporally through the value function to influence different behaviours today that you otherwise would not have taken, to try to reach that infinite future reward.
Humans can tell stories about expected future rewards even though nobody has yet experienced those rewards. Examples being funding technological innovation with debt/equity, climate change, we experience large future penalties, and therefore act to try to mitigate them.
Actor-critic algorithms parallel the type of architecture found in the brain, with the dorsal striatum playing the role of “actor” and the ventral striatum playing the role of “critic”, using dopamine as the reinforcement signal.
Hedonistic Neurons: Hypothesis that each neuron can be thought of as an individual unit trying to maximize its own individual reward. See also: Neurons Gone Wild
Addiction: drug rewards cannot be “predicted away”, as cocaine-induced dopamine surges increase the TD error $\delta$ but cannot be cancelled out by changes in the value function. The result is that the values of these states increase without bound, making actions leading to these states preferred above all others.
Note the book does not discuss other neurochemicals like serotonin, oxytocin, endorphins, etc.

Chapter 16: Applications and Case Studies

TD-Gammon

Uses $TD(\lambda)$ with a combination of an ANN for function approximation.
Did not require extensive backgammon knowledge, but did require extensive knowledge of how ANNs work to model the inputs, and hand-craft the feature vector.
Initial moves are very poor, often lasting hundreds of thousands of moves before one side wins by accident. But self-play performance increased rapidly after a few dozen games.
TD-Gammon 3.0 beat human grandmasters, and influenced how humans play the game, for example discovering different opening moves than convention.

Samuel’s Checkers Player

One of the first applications of core RL concepts, like a value function “scoring polynomial”, heuristic search, minimax, backed-up score, “alpha-beta” cutoffs, temporal-difference learning.
First method of learning: Cache previously encountered positions for valuation, to amplify the effective depth of search.
Second method of learning: do a back-up update of hypothetical events back to the actual event

Watson’s Daily-Double Wagering

Represent bets as action values, with the estimated probability of a win from the current game state. State-Value function for a probability of a win from any game state, and a second estimate of the confidence of winning the “in-category” clue.
Training performed over millions of simulated games over hand-crafted models of human players
Self-play was not used because Watson is not a typical human player, so self-play would lead to exploring state spaces that are not typical against human opponents. “Self-play would have been something like playing poker with someone who is holding the same cards that you hold.”
Because of imperfect information, most of the effort in development was dedicated to creating good models of human opponents, extracted from statistics of “Average Contestants”, “Champions” (top 100 players), and “Grand Champions” (top 10 players).

Optimizing Memory Control

Use Sarsa to learn an action-value function for scheduling DRAM memory traffic.
The potential number of features was very large, and the authors had to use simulations to pare these down to select a handful.

Human-level Video Game Play

Deep multi-player ANNs to automate the feature design process. DQN: Deep convolutional ANN + Q-learning
Training took 50 million simulator frames (equivalent to 38 days of live experience)
The same learning system achieved superhuman play on 29 games, without relying on any game-specific modifications (the policies learned for each game were of course different).
The raw input was 84x84 array of luminance values, x4 most recent frames, reduced from the standard 210x160 pixel image purely for computational efficiency, and was the same input for all games.

Mastering the Game of Go

AlphaGo can be thought of as a descendant of TD-Gammon, itself a descendant of Samuel’s checkers player, in performing reinforcement learning over simulated self-play.
Position evaluation function is the crucial part of good Go program design.
AlphaGo’s innovation was by improving MCTS by guiding it using both policy function approximation and value function approximation.
AlphaGo Zero uses MCTS to select moves throughout self-play learning, but AlphaGo uses MCTS only for live play after.

Personalized Web Services

A/B testing can be thought of as a two-armed bandit problem
Understanding each user’s personalization as the “state” for context can take long-term effects of interactions into account. E.g. reward is LTV instead of CTR.
Deploying new policies entails risk: Novel policies must be evaluated using data collected from previous policies, so off-policy evaluation becomes critical.

Thermal Soaring

Fine tuning the reward signal was key to solving the problem. Thermal soaring required dense rewards, a reward signal at each time step, instead of a single reward at the end of the episode.

Temporal Abstraction via Options

I like to think of this as having a “super-policy” where instead of selecting among a number of actions to take, the “super-policy” selects among many sub-policies to execute. Then each sub-policy executes its actions until termination.

Chapter 17: Frontiers

General Value Function: Predict future values of signals themselves, instead of the sum of future rewards.
GVF does not necessarily connect with rewards, so it could be viewed as a “forecast” function instead of a “value” function.
GVF allows for powerful modelling of the environment, e.g. enabling the agent to get rewards more efficiently.
Stretching MDP to cover multiple levels of time resolution. E.g. muscle twitches to planning a flight route. Hiearchical policy: selecting from options instead of actions, where actions execute until termination.
Partially observed environments: The environment emits Observations, instead of States. We then model the state as a function of the History $S_t=f(H_t)$, History being the trajectory of Actions and Observations up until the present. $f$ can be thought of as an “environment translating function” that translates the environment from its history (and hence, observations) into States. This is called the “state-update function”.
How to design reward signals? Rewards that are sparse makes progress difficult or impossible to detect—the agent may wander aimlessly for long periods of time (the “plateau problem”).
In practice, designing rewards is done via informal trial-and-error search, until a reward is found that produces acceptable results.
It does NOT work to reward the agent for achieving subgoals, because the agent may not achieve the overall goal at all, instead opting to behave in such a way to optimize for its subgoals. A better way is to augment the value function approximation with an initial guess, instead of giving rewards for subgoals.
Inverse Reinforcement Learning: Watch a human expert solve the task (imitation learning), and then try to infer what the reward signals should have been to cause the expert to behave that way. It’s fast, but can’t generalize well.
Sparse rewards can be thought of as problems in the agent’s policy, not the environment, because the current policy prevents the agent from frequently encountering reward states. “Reward Shaping” involves changing the reward signal as learning proceeds, starting dense and getting sparser, or giving the agent increasingly difficult problems. Shaping is an essential technique in training animals.
Rewards can be thought of as a hyperparameter in the learning algorithm. In this view, we can automate the trial-and-error search for good reward signal by define a space of feasible candidates and then apply an optimization algorithm. This is analogous to using evolution to shape the objective function itself, to make sure the objective is adapted to the environment.
Multi-Agent RL: You can model other decision-makers as just part of the environment, (e.g. playing soccer) but you can also model them as actual agents/decision-makers, to improve performance in both competitive and cooperative settings (overlaps with game theory)