This exercise set covers the mathematical foundations required for RL, an introduction to the agent-environment interface, and the basics of solving MDPs using Dynamic Programming.
0. Prerequisites
0.1 Multi-armed Bandits - Introduction Lab
Download the notebook RL_WC1_bandit.ipynb from Canvas and follow the instructions.
Solution:
For a diagonal matrix $A$, the inverse is $(A^{-1})_{ii} = 1/A_{ii}$:
$$A^{-1} = \begin{pmatrix} 1/a_{11} & 0 \\ 0 & 1/a_{22} \end{pmatrix}$$
For a $2 \times 2$ matrix $M$, $M^{-1} = \frac{1}{\det M}\,\operatorname{adj}(M)$:
$$B^{-1} = \frac{1}{b_{11}b_{22} - b_{12}b_{21}} \begin{pmatrix} b_{22} & -b_{12} \\ -b_{21} & b_{11} \end{pmatrix}$$
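As a sanity check, both closed-form inverses can be verified numerically (a minimal sketch with arbitrary example entries; NumPy's `np.linalg.inv` serves as the reference):

```python
import numpy as np

# Diagonal matrix: the inverse simply inverts each diagonal entry.
A = np.diag([2.0, 4.0])
A_inv = np.diag([1.0 / 2.0, 1.0 / 4.0])

# General 2x2 matrix: M^{-1} = adj(M) / det(M).
B = np.array([[3.0, 1.0],
              [2.0, 4.0]])
det_B = B[0, 0] * B[1, 1] - B[0, 1] * B[1, 0]
adj_B = np.array([[ B[1, 1], -B[0, 1]],
                  [-B[1, 0],  B[0, 0]]])
B_inv = adj_B / det_B
```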
3. Compute $\frac{\partial c}{\partial x}$ and $\frac{\partial c}{\partial e}$.
Solution:
We use the numerator layout (Jacobian formulation): if $v$ is an $n$-vector and $w$ is an $m$-vector, $\frac{\partial v}{\partial w}$ is an $n \times m$ matrix whose entry $(i,j)$ is $\frac{\partial v_i}{\partial w_j}$.
4. Consider the function $f(x) = \sum_{i=1}^{N} i\,x_i$, which maps an $N$-dimensional vector $x$ to a real number. Find an expression for $\frac{\partial f}{\partial x}$ in terms of the integers $1$ to $N$.
Solution: $\frac{\partial f}{\partial x} = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_N}\right]$
For a single term $\frac{\partial f}{\partial x_j}$:
$$\frac{\partial f}{\partial x_j} = \frac{\partial}{\partial x_j} \sum_{i=1}^{N} i\,x_i = \sum_{i=1}^{N} i\,\frac{\partial x_i}{\partial x_j} = \sum_{i=1}^{N} i\,\delta_{ij} = j$$
where $\delta_{ij}$ is the Kronecker delta. Thus:
$$\frac{\partial f}{\partial x} = (1, 2, \ldots, N)$$
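The result $\frac{\partial f}{\partial x} = (1, 2, \ldots, N)$ can be confirmed with a finite-difference check (a sketch; the evaluation point and $N = 5$ are arbitrary):

```python
import numpy as np

def f(x):
    # f(x) = sum_i i * x_i, with 1-based index i
    idx = np.arange(1, len(x) + 1)
    return float(idx @ x)

def numerical_grad(fn, x, h=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(x)
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = h
        g[j] = (fn(x + e) - fn(x - e)) / (2 * h)
    return g

x = np.random.default_rng(0).normal(size=5)
grad = numerical_grad(f, x)  # should approximate (1, 2, 3, 4, 5)
```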
Assume $X$ and $Y$ are two independent random variables with means $\mu, \nu$ and variances $\sigma^2, \tau^2$.
1. What is the expected value of $X + \alpha Y$, where $\alpha$ is some constant? Solution: By linearity of expectation: $E[X + \alpha Y] = E[X] + \alpha E[Y] = \mu + \alpha\nu$.
2. What is the variance of $X + \alpha Y$? Solution: For independent variables: $\operatorname{Var}(aX + bY) = a^2\operatorname{Var}(X) + b^2\operatorname{Var}(Y)$.
Thus, $\operatorname{Var}(X + \alpha Y) = \operatorname{Var}(X) + \alpha^2\operatorname{Var}(Y) = \sigma^2 + \alpha^2\tau^2$.
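Both identities are easy to confirm by Monte Carlo simulation (a sketch; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, nu = 1.0, -2.0
sigma, tau = 3.0, 0.5
alpha = 2.0

# Sample X and Y independently, then form Z = X + alpha * Y.
X = rng.normal(mu, sigma, size=1_000_000)
Y = rng.normal(nu, tau, size=1_000_000)
Z = X + alpha * Y

# Theory: E[Z] = mu + alpha*nu = -3,  Var(Z) = sigma^2 + alpha^2*tau^2 = 10
```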
3. Explain the terms in the bias-variance decomposition of squared error: $E[(y - \hat{f}(x))^2] = \operatorname{Bias}[\hat{f}(x)]^2 + \operatorname{Var}[\hat{f}(x)] + \sigma^2$
Solution:
Bias: Error from simplifying assumptions. Large when the model is too simple (underfitting).
Variance: Sensitivity of the model to the specific training set. High when the model is too complex and fits noise (overfitting).
Irreducible Error ($\sigma^2$): The noise in the data itself. Cannot be reduced by improving the model.
4. Explain why this is a “trade-off”. Solution: Reducing bias usually requires increasing model complexity, which increases variance (as the model starts fitting spurious correlations/noise in the training data). Conversely, reducing variance (e.g., via regularization or simpler models) often increases bias by making stronger assumptions.
0.2.3 OLS, linear projection, and gradient descent
Concepts tested: [[Ordinary Least Squares|OLS]], [[Gradient Descent]], [[Linear Algebra]].
Given a training set $X$ ($n \times m$) and labels $y$ ($n \times 1$), fit $f_\beta(X) = X\beta$ by minimizing $\|y - X\beta\|_2^2$.
1. What is the dimensionality of $\beta$? Solution: $\beta \in \mathbb{R}^m$ (one parameter for each feature).
2. Show by differentiation that the OLS estimator is $\hat{\beta} = (X^T X)^{-1} X^T y$. Solution:
Define $L(\beta) = (y - X\beta)^T (y - X\beta)$. Set $\frac{\partial L}{\partial \beta} = 0$:
$$\frac{\partial}{\partial \beta}\left(y^T y - 2y^T X\beta + \beta^T X^T X\beta\right) = -2X^T y + 2X^T X\beta = 0 \implies X^T X\beta = X^T y \implies \hat{\beta} = (X^T X)^{-1} X^T y$$
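A quick numerical check of the normal equations (a sketch with synthetic data; in code it is preferable to solve $X^T X \beta = X^T y$ rather than form the explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 3
X = rng.normal(size=(n, m))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(0, 0.1, size=n)  # labels with small noise

# Solve X^T X beta = X^T y instead of inverting X^T X explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```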
3-5. Geometric Interpretation (Figure 1)
Figure 1: Geometric representation of OLS. [Diagram: the target vector $y$ points out of the subspace "col $X$" spanned by the regressors $X_1$ and $X_2$; the residual vector $\epsilon$ drops perpendicularly from $y$ to the projection point in the plane.]
Description: $y$ lies outside the plane spanned by the $X_i$. The OLS prediction $X\hat{\beta}$ is the orthogonal projection of $y$ onto the plane. The residual $\epsilon = y - X\hat{\beta}$ realizes the shortest distance from $y$ to the plane, hence it must be orthogonal to the plane.
Solution (3-5):
Minimizing the $L_2$ norm is equivalent to finding the shortest distance from $y$ to the subspace. This is achieved by the orthogonal projection $P(y)$.
Orthogonality means the residual $\epsilon_\beta$ is perpendicular to every column of $X$, hence $X^T \epsilon_\beta = 0$.
Derivation: $X^T(y - X\beta) = 0 \implies X^T y - X^T X\beta = 0 \implies \hat{\beta} = (X^T X)^{-1} X^T y$.
6. What is the loss function for OLS? Solution: $L_\beta(y, X) = \|y - f_\beta(X)\|_2^2$ (squared $L_2$ norm).
7. Write the gradient descent update rule for $\beta$. Solution: $\beta_{t+1} = \beta_t - \alpha \frac{\partial L}{\partial \beta_t} = \beta_t + 2\alpha X^T (y - X\beta_t)$.
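Iterating this update rule should converge to the closed-form OLS solution, which the sketch below checks on synthetic data (the step size and iteration count are arbitrary, chosen small/large enough for convergence):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(0, 0.05, size=50)

beta = np.zeros(2)
alpha = 0.005  # step size; must satisfy alpha < 1/lambda_max(X^T X) for stability
for _ in range(2000):
    # beta_{t+1} = beta_t + 2*alpha * X^T (y - X beta_t)
    beta = beta + 2 * alpha * X.T @ (y - X @ beta)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)  # closed-form solution for comparison
```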
1. Introduction & MDPs
1.1 Introduction
Concepts tested: [[Curse of Dimensionality]], [[State Space]].
1. Explain the “curse of dimensionality”. Solution: Computational requirements (and the amount of data needed) grow exponentially with the number of state variables.
2. Predator-Prey on a $5 \times 5$ toroidal grid.
(a) Naive state space: $(x_p, y_p, x_q, y_q) \implies 5^4 = 625$ states.
(d) Advantage: Alleviates the curse of dimensionality, making the problem easier to solve.
(e) Tic-Tac-Toe: Exploiting symmetries (rotational, reflectional) to reduce the value function representation.
3. Greedy vs. non-greedy agent. Solution: A non-greedy (exploratory) agent usually performs better in the long term: it discovers better strategies that a greedy agent would miss by settling on a sub-optimal “local” maximum too early.
4. Annealing exploration ($\epsilon$).
(a) Method: Start with a high $\epsilon$ (e.g., 1.0) and decrease it over time (e.g., $\epsilon_t = 1/t$ or linear decay) as the agent learns.
(b) Non-stationary environments: If the opponent changes strategies, time-based annealing fails; the agent will be “locked in” to an old strategy. Heuristic suggestion: increase $\epsilon$ when the TD-error becomes large again, indicating the environment model is no longer accurate.
1. What is the probability of selecting the greedy action under $\epsilon$-greedy? Solution: $P(a^*) = (1 - \epsilon) + \frac{\epsilon}{n}$, where $n$ is the number of actions. (Probability from exploitation plus the probability of picking it at random during exploration.)
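A Monte Carlo sketch of the same probability (the values $\epsilon = 0.1$ and $n = 3$ are arbitrary; note the exploratory action is drawn uniformly over all $n$ arms, including the greedy one):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, n_actions = 0.1, 3
greedy_action = 2
trials = 1_000_000

explore = rng.random(trials) < eps                        # with prob eps: explore
random_actions = rng.integers(0, n_actions, size=trials)  # uniform over all arms
actions = np.where(explore, random_actions, greedy_action)

p_greedy = np.mean(actions == greedy_action)
# Theory: (1 - eps) + eps/n = 0.9 + 0.1/3
```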
2. 3-armed bandit sequence (start $Q = [0, 0, 0]$, sample-average estimates).
$A_0 = 1, R_1 = -1 \implies Q = [-1, 0, 0]$
$A_1 = 2, R_2 = 1 \implies Q = [-1, 1, 0]$
$A_2 = 2, R_3 = -2 \implies Q = [-1, \frac{1-2}{2}, 0] = [-1, -0.5, 0]$
$A_3 = 2, R_4 = 2 \implies Q = [-1, \frac{1-2+2}{3}, 0] = [-1, 0.333, 0]$
$A_4 = 3, R_5 = 1 \implies Q = [-1, 0.333, 1]$
Result: $A_3$ and $A_4$ were non-greedy (exploratory), because the $Q$-value of the selected action was not the maximum at the moment of selection ($A_3$ chose action 2 while action 3 had the highest estimate; $A_4$ chose action 3 while action 2 did).
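The sequence above can be replayed with the incremental sample-average update $Q \leftarrow Q + \frac{1}{N(a)}(R - Q)$ (a sketch; arms are 0-indexed here):

```python
# Replay of the 3-armed bandit episode with sample-average updates.
Q = [0.0, 0.0, 0.0]
N = [0, 0, 0]
history = [(0, -1.0), (1, 1.0), (1, -2.0), (1, 2.0), (2, 1.0)]  # (arm, reward)

for a, r in history:
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]  # incremental mean of rewards for arm a

# Final estimates: Q = [-1, 1/3, 1], matching the hand calculation.
```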
Return: The pessimistic initialization yielded the higher return (it stayed with the first good arm it found).
Estimation: The optimistic initialization leads to better Q-value estimates because it forces exploration of all arms.
Exploration: Optimistic initialization is a “trick” for exploration; the high initial values make all unexplored arms look better than explored ones, forcing the agent to try everything.
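A minimal sketch of this trick (the deterministic arm payoffs and the initial value $Q_0 = 5$ are made up): because every true reward is below $Q_0$, each pull drags that arm's estimate beneath the untried arms, so even a purely greedy agent samples every arm before settling:

```python
import numpy as np

true_rewards = [0.1, 0.2, 0.9]  # deterministic payoffs, all below Q0
Q = np.full(3, 5.0)             # optimistic initial estimates
N = np.zeros(3)
pulled = []

for _ in range(10):
    a = int(np.argmax(Q))       # purely greedy: no epsilon at all
    pulled.append(a)
    N[a] += 1
    Q[a] += (true_rewards[a] - Q[a]) / N[a]  # sample-average update
```

After one pass over the arms the agent settles on the best arm (index 2), having tried all three despite never exploring explicitly.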
(a) False: Value Iteration and Policy Iteration both converge to optimal policies.
(b) True: Value Iteration effectively does one step of policy evaluation followed by policy improvement in each sweep.
2. Why does the Bellman Optimality Equation hold at stabilization? Solution:
When Policy Iteration stabilizes:
1. Policy Improvement yields no change: $\pi(s) = \arg\max_a \sum_{s',r} p(s', r \mid a, s)\,[r + \gamma v_\pi(s')]$.
2. Policy Evaluation is consistent: $v_\pi(s) = \sum_{s',r} p(s', r \mid \pi(s), s)\,[r + \gamma v_\pi(s')]$.
Substituting (1) into (2):
$$v_\pi(s) = \max_a \sum_{s',r} p(s', r \mid a, s)\,[r + \gamma v_\pi(s')]$$
This is the Bellman Optimality Equation, meaning $v_\pi$ is $v_*$.
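The fixed-point property can be checked on a toy MDP: run value iteration to (numerical) convergence and verify that the resulting $v$ satisfies the Bellman optimality equation. The two-state, two-action transition model below is made up for illustration:

```python
import numpy as np

gamma = 0.9
# P[a][s, s'] = transition probability, R[a][s] = expected immediate reward.
P = {0: np.array([[0.8, 0.2], [0.1, 0.9]]),
     1: np.array([[0.5, 0.5], [0.6, 0.4]])}
R = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 2.0])}

def bellman_backup(v):
    # Max over actions of expected reward plus discounted next-state value.
    return np.max([R[a] + gamma * P[a] @ v for a in (0, 1)], axis=0)

v = np.zeros(2)
for _ in range(500):       # gamma-contraction: converges geometrically
    v = bellman_backup(v)

# At convergence, v is a fixed point: v(s) = max_a E[r + gamma * v(s')].
```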