Coursera was launched in 2012 by Daphne Koller and Andrew Ng with the goal of giving life-changing learning experiences to students all around the world. Today, Coursera is a worldwide online learning platform that gives anyone, anywhere access to online courses and degrees from top institutions and companies.

### Week 01 Quiz Answers

Q1. Which of the following statements is true about function approximation in reinforcement learning? (**Select all that apply**)

- We only use function approximation because we have to for large or continuous state spaces. We would use tabular methods if we could, and learn an independent value per state.
- **It can be more memory efficient.**
- **It allows faster training by generalizing between states.**
- **It can help the agent achieve good generalization with good discrimination, so that it learns faster and represents the values quite accurately.**

Q2. We learned how value function estimation can be framed as supervised learning. But not all supervised learning methods are suitable. What are some characteristics of reinforcement learning that can make it harder to apply standard supervised learning methods?

- **Data is temporally correlated in reinforcement learning.**
- Data is available as a fixed batch.
- **When using bootstrapping methods like TD, the target labels change.**

Q3. Value Prediction (or Policy Evaluation) with Function Approximation can be viewed as supervised learning mainly because _________. [Choose the most appropriate completion of the preceding statement.]

- **Each state and its target (used in the Monte Carlo update, TD(0) update, and DP update) forms an input-output training example which we can use to train our approximation to the value function.**
- We use stochastic gradient descent to learn the value function.
- We can learn the value function by training with batches of data obtained from the agent’s interaction with the world.

Q4. Which of the following is true about using Mean Squared Value Error ($\overline{VE} = \sum_s \mu(s) [v_\pi(s) - \hat{v}(s,\mathbf{w})]^2$) as the prediction objective?

$\mu(s)$ represents the weighted distribution of visited states.

(Select all that apply)

- **Gradient Monte Carlo with linear function approximation converges to the global optimum of this objective, if the step size is reduced over time.**
- **The agent can get zero MSVE when using a tabular representation that can represent the true values.**
- **This objective makes it explicit how we should trade off accuracy of the value estimates across states, using the weighting $\mu$.**
- Even if the agent uses a linear representation that **cannot represent** the true values, the agent can still get zero MSVE.

Q5. Which of the following is true about $\mu(S)$ in Mean Squared Value Error? (**Select all that apply**)

- **It has higher values for states that are visited more often.**
- **It serves as a weighting to minimize the error more in states that we care about.**
- **If the policy is uniformly random, $\mu(S)$ would have the same value for all states.**
- It is a probability distribution.

Q6. The stochastic gradient descent update for the MSVE would be as follows.

Fill in the blanks (A), (B), (C) and (D) with correct terms. (**Select all correct answers**)

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t \; (A) \; \frac{1}{2}\alpha \nabla [\, (C) - (D) \,]^2$$

$$= \mathbf{w}_t \; (B) \; \alpha [\, (C) - (D) \,] \nabla \hat{v}(S_t, \mathbf{w}_t), \qquad (\alpha > 0)$$

- **(A) $-$, (B) $+$, (C) $v_\pi(S_t)$, (D) $\hat{v}(S_t, \mathbf{w}_t)$**
- **(A) $-$, (B) $-$, (C) $\hat{v}(S_t, \mathbf{w}_t)$, (D) $v_\pi(S_t)$**
- (A) $+$, (B) $+$, (C) $\hat{v}(S_t, \mathbf{w}_t)$, (D) $v_\pi(S_t)$
- (A) $+$, (B) $-$, (C) $v_\pi(S_t)$, (D) $\hat{v}(S_t, \mathbf{w}_t)$

Q7. In a Monte Carlo Update with function approximation, we do stochastic gradient descent using the following gradient:

$$
\begin{aligned}
\nabla [G_t - \hat{v}(s, \mathbf{w})]^2 &= 2 [G_t - \hat{v}(s, \mathbf{w})] \nabla (-\hat{v}(S_t, \mathbf{w}_t)) \\
&= (-1) \cdot 2 [G_t - \hat{v}(s, \mathbf{w})] \nabla \hat{v}(S_t, \mathbf{w}_t)
\end{aligned}
$$

But the actual Monte Carlo Update rule is the following:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha [G_t - \hat{v}(S_t, \mathbf{w}_t)] \nabla \hat{v}(S_t, \mathbf{w}_t), \qquad (\alpha > 0)$$

Where did the constants -1 and 2 go when $\alpha$ is positive? (**Choose all that apply**)

- We are performing gradient ascent, so we subtract the gradient from the weights, negating -1.
- **We assume that the 2 is included in the step-size.**
- **We are performing gradient descent, so we subtract the gradient from the weights, negating -1.**
- We assume that the 2 is included in $\nabla \hat{v}(S_t, \mathbf{w}_t)$.
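The bookkeeping of the -1 and the 2 can be checked with a small sketch. It assumes a linear approximator $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$; the function names and all numbers are illustrative and do not come from the course.

```python
def v_hat(w, x):
    """Linear value estimate: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def gradient_mc_update(w, x, G, alpha):
    """One gradient Monte Carlo step for a linear v_hat (illustrative sketch).

    The gradient of [G - v_hat]^2 with respect to w is -2 * (G - v_hat) * x.
    Descending that gradient with step size alpha / 2 gives
    w + alpha * (G - v_hat) * x: the subtraction cancels the -1 and the
    step size absorbs the 2.
    """
    error = G - v_hat(w, x)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]


# Hypothetical values, chosen only for illustration.
w = [0.0, 0.0]
x = [1.0, 0.5]   # feature vector of the visited state
G = 2.0          # sampled return
w_new = gradient_mc_update(w, x, G, alpha=0.1)
print(w_new)
```

With these numbers the estimate for the visited state moves from 0 toward the return of 2, which is what a descent step on the squared error should do.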

Q8. When using stochastic gradient descent for learning the value function, why do we only make a small update towards minimizing the error instead of fully minimizing the error at each encountered state?

- Because the target value may not be accurate initially for both TD(0) and Monte Carlo method.
- **Because small updates guarantee we can slowly reduce approximation error to zero for all states.**
- Because we want to minimize approximation error for all states, proportionally to $\mu$.

Q9. The general stochastic gradient descent update rule for state-value prediction is as follows:

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha [U_t - \hat{v}(S_t, \mathbf{w}_t)] \nabla \hat{v}(S_t, \mathbf{w}_t)$$

For what values of $U_t$ would this be a semi-gradient method?

- $v_\pi(S_t)$
- **$R_{t+1} + \hat{v}(S_{t+1}, \mathbf{w}_t)$**
- $G_t$
- $R_{t+1} + R_{t+2} + \dots + R_T$
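To make the "semi-gradient" idea concrete, here is a minimal sketch of the update with the bootstrapped target $R_{t+1} + \hat{v}(S_{t+1}, \mathbf{w}_t)$, assuming a linear approximator. The function name, the optional `gamma` argument (defaulting to 1 to match the undiscounted target in the question), and the numbers are assumptions made for illustration.

```python
def v_hat(w, x):
    """Linear value estimate: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def semi_gradient_td0_update(w, x, reward, x_next, alpha, gamma=1.0):
    """One semi-gradient TD(0) step for a linear v_hat (illustrative sketch).

    The target depends on w through v_hat(x_next), but it is treated as a
    constant: only the gradient of v_hat(S_t, w), which is x, is used.
    """
    target = reward + gamma * v_hat(w, x_next)
    td_error = target - v_hat(w, x)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x)]


# Hypothetical one-hot features for S_t and S_{t+1}.
w = [0.0, 2.0]
x, x_next = [1.0, 0.0], [0.0, 1.0]
w_new = semi_gradient_td0_update(w, x, reward=1.0, x_next=x_next, alpha=0.5)
print(w_new)
```

Only the weight tied to the current state's feature changes. A true gradient of the squared TD error would also adjust the weight behind the target, which is exactly the term the semi-gradient method drops.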

Q10. Which of the following statements is true about state-value prediction using stochastic gradient descent?

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha [U_t - \hat{v}(S_t, \mathbf{w}_t)] \nabla \hat{v}(S_t, \mathbf{w}_t)$$

(**Select all that apply**)

- Stochastic gradient descent updates with Monte Carlo targets always reduce the Mean Squared Value Error at each step.
- **When using $U_t = R_{t+1} + \hat{v}(S_{t+1}, \mathbf{w}_t)$, the weight update is not using the true gradient of the TD error.**
- **Semi-gradient TD(0) methods typically learn faster than gradient Monte Carlo methods.**
- **Using the Monte Carlo return as target, and under appropriate stochastic approximation conditions, the value function will converge to a local optimum of the Mean Squared Value Error.**
- **Using the Monte Carlo return or true value function as target results in an unbiased update.**

Q11. Which of the following is true about the TD fixed point?

(**Select all correct answers**)

- The weight vector corresponding to the TD fixed point is the global minimum of the Mean Squared Value Error.
- The weight vector corresponding to the TD fixed point is a local minimum of the Mean Squared Value Error.
- **At the TD fixed point, the mean squared value error is not larger than $\frac{1}{1-\gamma}$ times the minimal mean squared value error, assuming the same linear function approximation.**
- **Semi-gradient TD(0) with linear function approximation converges to the TD fixed point.**

Q12. Which of the following is true about Linear Function Approximation, for estimating state-values? (Select all that apply)

- **The gradient of the approximate value function $\hat{v}(s, \mathbf{w})$ with respect to $\mathbf{w}$ is just the feature vector.**
- State aggregation is one way to generate features for linear function approximation.
- The size of the feature vector is not necessarily equal to the size of the weight vector.
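The claim that the gradient of a linear $\hat{v}$ is the feature vector can be checked numerically. The sketch below builds one-hot state-aggregation features and compares them with a finite-difference gradient; the helper names, the equal-width grouping, and the weights are assumptions made for illustration.

```python
def state_aggregation_features(state, num_states, num_groups):
    """One-hot features: states are binned into equal-width groups."""
    group_size = num_states // num_groups
    features = [0.0] * num_groups
    features[min(state // group_size, num_groups - 1)] = 1.0
    return features


def v_hat(w, x):
    """Linear value estimate: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def numerical_gradient(w, x, eps=1e-6):
    """Central finite-difference gradient of v_hat with respect to w."""
    grad = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grad.append((v_hat(w_plus, x) - v_hat(w_minus, x)) / (2 * eps))
    return grad


# Hypothetical setup: 10 states aggregated into 5 groups.
x = state_aggregation_features(state=7, num_states=10, num_groups=5)
w = [0.3, -1.2, 0.5, 2.0, 0.1]
grad = numerical_gradient(w, x)
print(x, [round(g, 6) for g in grad])
```

The numerical gradient matches the feature vector entry by entry, so an update at state 7 only touches the weight of the group that contains it.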

### Week 02 Quiz Answers

Q1. Which of the following is TRUE about coarse coding? (Select all that apply)

- **In coarse coding, generalization occurs between states that have features with overlapping receptive fields.**
- **In coarse coding, generalization between states depends on the size and shape of the receptive fields.**
- When using features with large receptive fields, the function approximator cannot make discriminations that are finer than the width of the receptive fields.
- When training at one state, the learned value function will be updated over all states within the intersection of the receptive fields.

Q2. Consider a continuous two-dimensional state space. Assuming linear function approximation with the coarse-codings in either A, B or C, which of the following is TRUE? (Select all that apply)

<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/QKm8JIFrEem2AQ6u9bTEKg_01acf3f539fcbdae50246a8d64f13735_coarse-coding.png?expiry=1658793600000&hmac=-3qBNV7KkNNNvYXIrGRYrMOjjyhVUwSa89k_6NH6Dr4>

- Generalization is broader in case A as compared to case B.
- **In case B, when updating the state marked by an ‘x’, the value function will be affected for a larger number of states as compared to case A.**
- **In case C, each update results in more generalization along the vertical dimension, as compared to horizontal dimension.**
- In case C, each update results in more generalization along the horizontal dimension, as compared to vertical dimension.

Q3. Which of the following affects generalization in tile coding? (Select all that apply)

- **The number of tilings.**
- **The shape of the tiles.**
- **How the tilings are offset from each other.**
- **The number of tiles.**

Q4. When tile coding is used for feature construction, the number of active or non-zero features

- is the number of tiles.
- **is the number of tilings.**
- is the number of tilings multiplied by the number of tiles.
- depends on the state.
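A toy one-dimensional tile coder shows why the number of active features is tied to the number of tilings. This is a sketch under simple assumptions (uniformly offset tilings, one extra tile per tiling to cover the shifted edge); the function names and parameters are hypothetical and not taken from the course's tile-coding software.

```python
def active_tiles(x, low, high, num_tilings, tiles_per_tiling):
    """Return the index of the single active tile in each tiling."""
    tile_width = (high - low) / tiles_per_tiling
    indices = []
    for tiling in range(num_tilings):
        # Assumed offset scheme: each tiling is shifted by an equal
        # fraction of one tile width.
        offset = tiling * tile_width / num_tilings
        tile = int((x - low + offset) / tile_width)
        # Each tiling owns tiles_per_tiling + 1 slots in the feature vector.
        indices.append(tiling * (tiles_per_tiling + 1) + tile)
    return indices


def feature_vector(x, low, high, num_tilings, tiles_per_tiling):
    """Binary feature vector with one active feature per tiling."""
    features = [0] * (num_tilings * (tiles_per_tiling + 1))
    for index in active_tiles(x, low, high, num_tilings, tiles_per_tiling):
        features[index] = 1
    return features


features = feature_vector(0.37, low=0.0, high=1.0,
                          num_tilings=4, tiles_per_tiling=5)
print(sum(features))  # 4
```

Whatever the input, exactly one tile fires in each tiling, so the count of non-zero features stays equal to the number of tilings.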

Q5. Which of the following is TRUE about neural networks (NNs) ? (Select all that apply)

- **A NN is feedforward if there are no paths within the network by which a unit’s output can influence its input.**
- **Hidden layers are layers that are neither input nor output layers.**
- The output of the units in NNs are typically a linear function of their input signals.
- **NNs are parameterized functions that enable the agent to learn a nonlinear value function of state.**
- **The nonlinear functions applied to the weighted sum of the input signals are called the activation function.**

Q6. Which of the following is the rectified linear activation function?

- $f(x) = \frac{1}{1 + e^{-x}}$
- $f(x) = 1$ if $x > 0$ and $0$ otherwise
- **$f(x) = \max(0, x)$**
- $f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

Q7. Which of the following is TRUE about neural networks (NNs)?

- A NN with a single hidden layer can represent a smaller class of functions compared to a NN with two hidden layers.
- **The universal approximation property of one-hidden-layer NNs is not true when linear activation functions are used for the hidden layer.**
- Given the universal approximation property of one-hidden-layer NNs, there is no benefit to including more layers in the network.

Q8. Which of the following is TRUE about backpropagation? (Select all that apply)

- **Backpropagation corresponds to updating the parameters of a neural network using gradient descent.**
- **Backpropagation involves computing the partial derivatives of an objective function with respect to the weights of the network.**
- The forward pass in backpropagation updates the weights of the network using the partial derivatives computed by the backward passes.
- **Backpropagation computes partial derivatives starting from the last layer in the network, to save computation.**

Q9. Training neural networks (NNs) with backpropagation can be challenging because (Select all that apply)

- **the loss surface might have flat regions, or poor local minima, meaning gradient descent gets stuck at poor solutions.**
- **the initialization can have a big impact on how much progress the gradient updates can make and on the quality of the final solution.**
- neural networks cannot accurately represent most functions, so the loss stays large.
- **learning can be slow due to the vanishing gradient problem, where if the partial derivatives for later nodes in the network are zero or near zero then this causes earlier nodes in the network to have small or near zero gradient updates.**

Q10.

Consider the following network:

<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/Vq6Qcs5_EemkWQq0Oc-iqg_cc7af01c168da936fc2adad59d642d71_nn_quiz.png?expiry=1658793600000&hmac=sgApZl4TDIxLHHrT2RbdD9AySVGaxO_y4iTCRG3fJbA>

where, for a given input $s$, the value of $s$ is computed by:

$$\psi = sW^{[0]} + b^{[0]}$$

$$x = \max(0, \psi)$$

$$v = xW^{[1]} + b^{[1]}$$

What is the partial derivative of $v(s)$ with respect to $W^{[0]}_{ij}$?

- $s_i$
- **$W^{[1]}_j s_i$ if $x_j > 0$ and $0$ otherwise**
- $x_j$
- $W^{[1]}_j x_j$ if $x_j > 0$ and $0$ otherwise
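The expression $W^{[1]}_j s_i$ (when unit $j$ is active) can be sanity-checked against a finite-difference estimate. The sketch below implements the three equations of the network for a small hypothetical example; the sizes, weights, and helper names are assumptions made for illustration.

```python
def forward(s, W0, b0, W1, b1):
    """Compute v for the one-hidden-layer ReLU network, plus hidden outputs."""
    psi = [sum(s[i] * W0[i][j] for i in range(len(s))) + b0[j]
           for j in range(len(b0))]
    x = [max(0.0, p) for p in psi]
    v = sum(x[j] * W1[j] for j in range(len(x))) + b1
    return v, x


def dv_dW0(s, W0, b0, W1, b1, i, j):
    """Analytic derivative: W1[j] * s[i] if hidden unit j is active, else 0."""
    _, x = forward(s, W0, b0, W1, b1)
    return W1[j] * s[i] if x[j] > 0 else 0.0


def finite_difference(s, W0, b0, W1, b1, i, j, eps=1e-6):
    """Central finite-difference estimate of the same derivative."""
    W_plus = [row[:] for row in W0]
    W_minus = [row[:] for row in W0]
    W_plus[i][j] += eps
    W_minus[i][j] -= eps
    v_plus, _ = forward(s, W_plus, b0, W1, b1)
    v_minus, _ = forward(s, W_minus, b0, W1, b1)
    return (v_plus - v_minus) / (2 * eps)


# Hypothetical network: 2 inputs, 2 hidden units (unit 0 active, unit 1 not).
s = [1.0, 2.0]
W0 = [[0.5, -1.0], [0.25, 0.3]]
b0 = [0.1, -0.2]
W1 = [2.0, -3.0]
b1 = 0.5

for i, j in [(1, 0), (0, 1)]:
    print(dv_dW0(s, W0, b0, W1, b1, i, j),
          round(finite_difference(s, W0, b0, W1, b1, i, j), 4))
```

For the active unit the derivative is $W^{[1]}_0 s_1 = 4$, and for the inactive unit it is 0, matching the numerical estimate in both cases.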

Q11. Which of the following is TRUE? (Select all that apply)

- When using stochastic gradient descent, we often completely eliminate the error for each example.
- **The difference between stochastic gradient descent methods and batch gradient descent methods is that in the former the weights get updated using one random example whereas in the latter they get updated based on batches of data.**
- **Adagrad, Adam, and AMSGrad are stochastic gradient descent algorithms with adaptive step-sizes.**
- **Setting the step-size parameter for stochastic gradient descent can be challenging because a small step-size makes learning slow and a large step-size can result in divergence.**

Q12. Which of the following is TRUE about artificial neural networks (ANNs)? (Select all that apply)

- It is best to initialize the weights of a NN to large numbers so that the input signal does not get too small as it passes through the network.
- It is best to initialize the weights of a NN to small numbers so that the input signal does not grow rapidly as it passes through the network.
- **If possible, it would be best to initialize the weights of an NN near the global optimum.**
- **A reasonable way to initialize the NN is with random weights, with each weight sampled from a normal distribution with the variance scaled by the number of inputs to the layer for that weight.**

### Week 03 Quiz Answers

Q1. Which of the following are true? (Select all that apply)

- **When using state aggregation or coarse coding, there is generalization across states.**
- When using state aggregation or coarse coding, updating the value of one state does not affect the values of other states.
- **In the tabular case, updating the value of one state does not affect the values of other states.**
- In the tabular case, there is generalization across states, i.e., updates to the value function of one state influence the value function of other states.

Q2. To turn the update of Expected Sarsa algorithm to the update of Q-learning, one must:

- **Use the maximum over all the actions instead of the expectation in the update function.**
- Use a neural network to approximate the action-value function.
- Behave greedily with respect to the action-value function.
- Expected Sarsa cannot be adapted to represent Q-learning.

Q3. Which of the following are true:

- When using function approximation and with discrete actions, there is no straightforward way to turn Sarsa into Expected Sarsa.
- Exploration is not a problem when using function approximation because learning generalizes across states and actions.
- **When learning value functions in the mountain car domain and using tile coding, where all the rewards are -1 except when the terminal state is reached, initializing the weights of a linear function approximation to zero is an example of optimistic initialization.**
- When using function approximation, Q-learning will always result in better performance than Sarsa and Expected Sarsa.

Q4. Which of the following statements about discounted return and differential return algorithms are true:

- There is a set of Bellman equations for the value function corresponding to the discounted return. However, there is no set of Bellman equations for differential value functions.
- **The performance of discounted return algorithms can suffer because the optimal value of gamma is problem dependent.**
- **Algorithms that maximize the discounted return can suffer from an exponentially large variance when using a large discount factor.**

Q5. Imagine we have 7 state features and 6 actions and we want to use linear function approximation to compute the action-value function. If we use feature stacking (as explained in Week 3), how many features would we need in total to approximate the action-value function of all the different actions?

- 67
- 76
- 13
- **42**
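The feature count for stacking can be worked out with a short sketch: each action gets its own copy of the state-feature block, so the stacked vector has (number of state features) × (number of actions) entries. The helper name and the feature values below are hypothetical.

```python
def stacked_features(state_features, action, num_actions):
    """Place the state features in the block belonging to `action`.

    All other blocks stay zero, so each action has its own set of weights.
    """
    n = len(state_features)
    stacked = [0.0] * (n * num_actions)
    stacked[action * n:(action + 1) * n] = state_features
    return stacked


# Hypothetical values for 7 state features; 6 actions as in the question.
state_features = [1.0, 0.0, 0.5, 0.0, 2.0, 0.0, 1.0]
x = stacked_features(state_features, action=2, num_actions=6)
print(len(x))  # 42
```

With 7 state features and 6 actions this gives 7 × 6 = 42 features in total.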

Q6. Imagine we are approximating the action-value function of three different actions (red, green, and blue) using three state features, resulting in the following weights and features:

<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/fUIvgqdyS2qCL4KncmtqSg_db067eb21e7843ac87b9bcbc2081468f_Screen-Shot-2020-12-04-at-3.39.39-PM.png?expiry=1658793600000&hmac=WShkJzhhjwgekjWNBqOUXPgfa9MDzHyy4xajgA6v5LU>

Which of these statements is true about our function approximation scheme?

- The action-value of the green action will be exactly the same as the action-value of the red action for any state.
- **The green action will never be selected.**
- There will not be any generalization across states.
- The action-value of the red action will be exactly the same as the action-value of the blue action.

Q7. Consider the following feature and weight vectors corresponding to three state features and three actions (red, green, and blue):

<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/q_hBhc0dEemHIhKpCkwcgA_8fa03d0ecc773f78f911359a96c3eb94_image3.png?expiry=1658793600000&hmac=3k-HLxBIJi1_HSnqpFfX-AZ8MujGpDB2EJ3sLjOXq7Q>

What is the approximate action-value for the blue action?

- **16**
- 42
- 53
- 12

Q8. Which of the following statements about epsilon-greedy policies and optimistic initialization are true:

- Implementing optimistic initialization is always straightforward and simple to implement regardless of the function approximation technique.
- Epsilon-greedy can be easily combined with any function approximation technique because it only needs to be able to query for the approximate action-values, without needing to know how they are initialized or computed.
- Optimistic initialization is always preferred over using an epsilon-greedy policy.
- Optimistic initialization results in a more systematic exploration in the tabular case because the agent takes actions it has not taken as often in a state.

Q9. Consider the mountain car environment where all the rewards are -1 after every action until reaching the flag on top of the right hill:

<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/OkTCDMMIEemOSxK2qhCPHg_54ea04ec05a84d04d2d93de3591ef6a7_image5.png?expiry=1658793600000&hmac=f_OZNrtIaHoCXybcR-wvVhMrcHKjSBEYqMvlr9BGL3E>

Which of the following are examples of optimistic initialization when using linear function approximation?

- Initialize all the weights so that the action-value function is zero everywhere.
- **Initialize all the weights at random using a uniform distribution between -5 and 5.**
- Initialize all the weights so that the action-value function is -1 everywhere.
**Initialize all the weights so that the action-value function is 10 everywhere.**

Q10. Consider the following MDP where the red action (L) at state S results in an immediate reward of +1 and 0 afterwards, and the blue action (R) results in 0 immediate reward, but a final reward of +2 when returning to state S.

<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/4qvlpc0eEemkWQq0Oc-iqg_0c0c106fc348bb17b24227fa4a7f4794_image2.png?expiry=1658793600000&hmac=lzpks_fen6jbdFaFd3hsYheWKNrUdBmaHJ08Pg4gJqA>

In this MDP, what is the optimal policy at state S when using discounted return with a value of gamma of 0.5? When using discounted return with a value of gamma of 0.9? When using average reward? (Hint: You can modify the equations used for the task from lesson 3 of this module to get the solution)

- The optimal policy is: gamma = 0.5, blue action; gamma = 0.9, red action; average reward, red action.
- The optimal policy is: gamma = 0.5, blue action; gamma = 0.9, red action; average reward, blue action.
- **The optimal policy is: gamma = 0.5, red action; gamma = 0.9, blue action; average reward, blue action.**
- The optimal policy is: gamma = 0.5, blue action; gamma = 0.9, blue action; average reward, red action.

Q11. Consider the following MDP with two states. State 1 has two actions: staying in the same state (blue) and switching to the other state (red). State 2 has a single (red) action leading to State 1. The rewards are listed next to each arrow/transition.

<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/wYeuLcMIEemqPQ6AZDLU4g_1240e419b3e4306f6a0eac1596c04c8a_image4.png?expiry=1658793600000&hmac=6za_59jwXVCBC8PR0GcPatunimF7n5rZ9Ynd16RDRBk>

If the agent is following a policy that takes the blue and red actions with equal probability, what is the average reward? (Hint: the state visitation probability is ⅔ for state 1 and ⅓ for state 2, the target policy probability $\pi$ is the random policy, and all the transitions $p$ are deterministic.) Recall the formula is:

<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/Mt_Roc0fEemkWQq0Oc-iqg_14b69d78a7052ab94b428d75515fb940_image5.png?expiry=1658793600000&hmac=k6PNUdUnFJ5_a-Rp4LwsMJhzwBXDHgDFpBfGi2AJ_oc>

- Average reward = 1
- **Average reward = 3/4**
- Average reward = 2
- Average reward = 4/3

Q12. For which of the following tasks is the average reward setting preferable to the discounted return setting if we are interested in maximizing the total amount of reward?

- An Atari 2600 game where the agent can keep playing until it loses all its lives, which are a finite amount.
- **The nearsighted MDP from Week 3. The agent starts at an initial state S that splits into two paths, each leading to a ring that leads back to S after some time. If the agent chooses the left ring, it observes an immediate reward of +1 followed by 0 rewards, whereas choosing the right ring results in an immediate reward of 0, but a final reward of +2 when returning to state S.**
- The episodic mountain car environment.
- **The job scheduling task from Section 10.3 of Sutton and Barto’s RL textbook. The agent manages how jobs are scheduled into three different servers according to the priority of each job. When the agent accepts a job, the job runs in one of the servers and the agent gets a positive reward equal to the priority of the job, whereas when the agent rejects a job, the job returns to the queue and the agent receives a negative reward proportional to the priority of the job. The servers become available as they finish their jobs. Jobs are continually added to the queue and the agent accepts or rejects them.**

### Week 04 Quiz Answers

Q1. Which of the following is true about policy gradient methods? (**Select all that apply**)

- The policy gradient theorem provides a form for the policy gradient that does not contain the gradient of the state distribution $\mu$, which is hard to estimate.
- Policy gradient methods do gradient ascent on the policy objective.
- If we have access to the true value function $v_\pi$, we can perform unbiased stochastic gradient updates using the result from the Policy Gradient Theorem.
- Policy gradient methods use generalized policy iteration to learn policies directly.

Q2. Which of the following statements about parameterized policies are true? (**Select all that apply**)

- The probability of selecting any action must be greater than or equal to zero.
- **The function used for representing the policy must be a softmax function.**
- For each state, the sum of all the action probabilities must equal one.
- **The policy must be approximated using linear function approximation.**

Q3. Assume you’re given the following preferences $h_1 = 44$, $h_2 = 42$, and $h_3 = 38$, corresponding to three different actions ($a_1, a_2, a_3$), respectively. Under a softmax policy, what is the probability of choosing $a_2$, rounded to three decimal places?

- 0.42
- 0.879
- **0.119**
- 0.002
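The softmax probabilities for these preferences can be computed directly. The sketch below subtracts the largest preference before exponentiating, a standard trick for numerical stability that does not change the result; the function name is illustrative.

```python
import math


def softmax(preferences):
    """Softmax over action preferences, shifted by the max for stability."""
    largest = max(preferences)
    exps = [math.exp(h - largest) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([44.0, 42.0, 38.0])
print([round(p, 3) for p in probs])  # [0.879, 0.119, 0.002]
```

The action $a_2$, with preference 42, gets probability $e^{-2} / (1 + e^{-2} + e^{-6}) \approx 0.119$. The value 0.879 is the probability of $a_1$, the action with the largest preference.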

Q4. Which of the following is true about the softmax policy? (Select all that apply)

- **It can be parameterized by any function approximator as long as it can output scalar values for each available action, to form a softmax policy.**
- It is used to represent a policy in discrete action spaces.
- Similar to epsilon-greedy policy, softmax policy cannot approach a deterministic policy.
- **It cannot represent an optimal policy that is stochastic, because it reaches a deterministic policy as one action preference dominates others.**

Q5. What are the differences between using softmax policy over action-values and using softmax policy over action-preferences? (**Select all that apply**)

- **When using softmax policy over action-values, even if the optimal policy is deterministic, the policy may never approach a deterministic policy.**
- When using softmax policy over action-values, assuming a tabular representation, the policy will converge to the optimal policy regardless of whether the optimal policy is stochastic or deterministic.
- **When using softmax policy over action-preferences, assuming a tabular representation, the policy will converge to the optimal policy regardless of whether the optimal policy is stochastic or deterministic.**

Q6. What is the following objective, and in which task formulation?

$$r(\pi) = \sum_s \mu(s) \sum_a \pi(a|s, \boldsymbol{\theta}) \sum_{s', r} p(s', r | s, a)\, r$$

- Average reward objective, continuing task
- **Undiscounted return objective, episodic task**
- Discounted return objective, continuing task
- Average reward objective, episodic task

Q7. The following equation is the outcome of the policy gradient theorem. Which of the following is true about the policy gradient theorem? (Select all that apply)

$$\nabla r(\pi) = \sum_s \mu(s) \sum_a \nabla \pi(a|s, \boldsymbol{\theta})\, q_{\pi}(s,a)$$

- **The true action value $q_\pi$ can be approximated in many ways, for example using TD algorithms.**
- **We do not need to compute the gradient of the state distribution $\mu$.**
- This expression can be converted into $\mathbb{E}_\pi[\sum_a \nabla\pi(a|S, \boldsymbol{\theta})\, q_\pi(S,a)]$. **In discrete action space, by approximating $q_\pi$ we could also use this gradient to update the policy.**
- **This expression can be converted into the following expectation over $\pi$: $\mathbb{E}_\pi [\nabla \ln \pi (A|S, \boldsymbol{\theta})\, q_{\pi} (S,A)]$**
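The equivalence between the sum over actions and the expectation with $\nabla \ln \pi$ rests on the identity $\nabla \pi = \pi \nabla \ln \pi$. The sketch below checks it numerically for a single state, assuming a tabular softmax policy whose parameters are the action preferences; the parameter and action-value numbers are made up for illustration.

```python
import math


def softmax(theta):
    """Softmax policy over action preferences theta (single state)."""
    largest = max(theta)
    exps = [math.exp(t - largest) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function f(theta)."""
    g = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        g.append((f(up) - f(down)) / (2 * eps))
    return g


theta = [0.2, -0.4, 1.0]   # hypothetical preferences
q = [1.0, 3.0, -2.0]       # hypothetical action values q_pi(s, a)
pi = softmax(theta)

# Form 1: sum_a grad pi(a) * q(a)
form1 = [0.0] * len(theta)
for a in range(len(theta)):
    g = grad(lambda th, a=a: softmax(th)[a], theta)
    form1 = [f + gi * q[a] for f, gi in zip(form1, g)]

# Form 2: E_{A ~ pi}[grad ln pi(A) * q(A)], computed exactly by weighting
# each action's term with its probability pi(a).
form2 = [0.0] * len(theta)
for a in range(len(theta)):
    g = grad(lambda th, a=a: math.log(softmax(th)[a]), theta)
    form2 = [f + pi[a] * gi * q[a] for f, gi in zip(form2, g)]

print([round(v, 6) for v in form1])
print([round(v, 6) for v in form2])
```

Both forms agree up to finite-difference error. The second form is the one that can be estimated from a single sampled action, which is what makes stochastic policy gradient updates practical.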

Q8. Which of the following statements is true? (**Select all that apply**)

- **Subtracting a baseline in the policy gradient update tends to reduce the variance of the update, which results in faster learning.**
- The Actor-Critic algorithm consists of two parts: a parameterized policy (the actor) and a value function (the critic).
- **To update the actor in Actor-Critic, we can use TD error in place of $q_\pi$ in the Policy Gradient Theorem.**
- TD methods do not have a role when estimating the policy directly.

Q9. To train the critic, we must use the average reward version of semi-gradient TD(0).

- False
**True**

Q10. Consider the following state features and parameters $\theta$ for three different actions (red, green, and blue):

<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/CtCX9MKcEemS6xJ43HxpzA_8c187081e0cfde740580cd8e4b8910b3_c3m4_quiz_img1.png?expiry=1658793600000&hmac=JaQHOYqeYkvfaYZq7fGiBgdEpPz50jO1BOx6AlLN5jQ>

Compute the action preferences for each of the three different actions using linear function approximation and stacked features for the action preferences.

What is the action preference of $a_2$ (blue)?

- **35**
- 39
- 42
- 37

Q11. Which of the following statements are true about the Actor-Critic algorithm with softmax policies? (**Choose all that apply**)

- The actor and the critic share the same set of parameters.
- **Since the policy is written as a function of the current state, it is like having a different softmax distribution for each state.**
- The preferences must be approximated using linear function approximation.
**The learning rate parameter of the actor and the critic can be different.**

Q12. Which one is a reasonable parameterization for a Gaussian policy?

- **$\mu$: a linear function of parameters, $\sigma$: a linear function of parameters.**
- $\mu$: a linear function of parameters, $\sigma$: the exponential of a linear function of parameters.
- **$\mu$: the exponential of a linear function of parameters, $\sigma$: a linear function of parameters.**

**Review:**

Based on our experience, we recommend enrolling in this course to learn new skills from specialists. We believe it will be worthwhile.