Use best practices for machine learning development to ensure that your models generalize to data and tasks in the real world. Create and use decision trees and tree ensemble methods, such as random forests and boosted trees. Build and train a neural network using TensorFlow to perform multi-class classification.

**Advanced Learning Algorithms Coursera Quiz Answers**

**Week 1 Quiz Answers**

**Quiz 1: Neural networks intuition Quiz Answers**

Q1. Which of these are terms used to refer to components of an artificial neural network?

- layers
- activation function
- neurons
- axon

Q2. True/False? Neural networks take inspiration from, but do not very accurately mimic, how neurons in a biological brain learn.

False

Feedback. Artificial neural networks do not actually mimic the complexity of real biological neurons.

True

**Quiz 2: Neural network model Quiz Answers**

**Q1**. For a neural network, here is the formula for calculating the activation of the third neuron in layer 2, given the activation vector from layer 1: $a^{[2]}_3 = g(\vec{w}^{[2]}_3 \cdot \vec{a}^{[1]} + b^{[2]}_3)$. Which of the following are correct statements?

The activation of layer 2 is determined using the activations from the previous layer.

The activation of unit 3 (neuron 3) of layer 2 is calculated using a parameter vector $\vec{w}$ and $b$ that are specific to unit 3 (neuron 3).

Unit 3 (neuron 3) outputs a single number (a scalar).

If you are calculating the activation for layer 1, then the previous layer’s activations would be denoted by $\vec{a}^{-1}$
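The formula in this question can be tried out numerically. The following is a minimal sketch in plain Python, assuming a sigmoid for $g$; the weights, bias, and previous-layer activations are made-up values, not numbers from the course.

```python
import math

def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_activation(w, a_prev, b, g=sigmoid):
    """Return the scalar activation g(w . a_prev + b) of one unit."""
    z = sum(w_i * a_i for w_i, a_i in zip(w, a_prev)) + b
    return g(z)

# Hypothetical values: layer 1 has two units, so a1 has two entries.
a1 = [0.5, 0.8]       # activations of the previous layer
w2_3 = [1.0, -2.0]    # weights specific to unit 3 of layer 2
b2_3 = 0.1            # bias specific to unit 3 of layer 2
a2_3 = neuron_activation(w2_3, a1, b2_3)
```

Here `a2_3` is a single float: the unit combines the whole previous-layer vector with its own `w` and `b` and outputs one number.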

**Q2**. For the binary classification for handwriting recognition, discussed in the lecture, which of the following statements are correct?

There is a single unit (neuron) in the output layer.

The output of the model can be interpreted as the probability that the handwritten image is of the number one “1”. After choosing a threshold, you can convert neural network’s output into a category of 0 or 1.

The neural network cannot be designed to predict if a handwritten image is 8 or 9.


**Q3**. For a neural network, what is the expression for calculating the activation of the third neuron in layer 2? Note, this is different from the question that you saw in the lecture video.

- $a^{[2]}_3 = g(\vec{w}^{[3]}_2 \cdot \vec{a}^{[1]} + b^{[3]}_2)$
- $a^{[2]}_3 = g(\vec{w}^{[2]}_3 \cdot \vec{a}^{[2]} + b^{[2]}_3)$
- $a^{[2]}_3 = g(\vec{w}^{[3]}_2 \cdot \vec{a}^{[2]} + b^{[3]}_2)$
- $a^{[2]}_3 = g(\vec{w}^{[2]}_3 \cdot \vec{a}^{[1]} + b^{[2]}_3)$

**Q4**. For the handwriting recognition task discussed in lecture, what is the output $a^{[3]}_1$?

- The estimated probability that the input image is of a number 1, a number that ranges from 0 to 1.
- A vector of several numbers that take values between 0 and 1
- A vector of several numbers, each of which is either exactly 0 or 1
- A number that is either exactly 0 or 1, comprising the network’s prediction

**Quiz 3: TensorFlow implementation Quiz Answers**

Q1. For the following code:

    model = Sequential([
        Dense(units=25, activation="sigmoid"),
        Dense(units=15, activation="sigmoid"),
        Dense(units=10, activation="sigmoid"),
        Dense(units=1, activation="sigmoid")])

This code will define a neural network with how many layers?

3

4

25

5

Q2. How do you define the second layer of a neural network that has 4 neurons and a sigmoid activation?

Dense(units=4, activation='sigmoid')

Dense(layer=2, units=4, activation = 'sigmoid')

Dense(units=[4], activation=['sigmoid'])

Dense(units=4)

Q3. If the input features are temperature (in Celsius) and duration (in minutes), how do you write the code for the first feature vector x shown above?

x = np.array([[200.0],[17.0]])

x = np.array([[200.0 + 17.0]])

x = np.array([[200.0, 17.0]])

x = np.array([['200.0', '17.0']])
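The options above differ mainly in the shape of the resulting array. The short NumPy sketch below prints the shapes of the two numeric layouts; the variable names `x_row` and `x_col` are introduced here for illustration only. Keras layers treat the first axis as the batch dimension, so one example with two features is a single row.

```python
import numpy as np

# One training example with two features:
# temperature (Celsius) and duration (minutes).
x_row = np.array([[200.0, 17.0]])     # 1 row, 2 columns
x_col = np.array([[200.0], [17.0]])   # 2 rows, 1 column

print(x_row.shape)  # (1, 2)
print(x_col.shape)  # (2, 1)
```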

**Quiz 4: Neural network implementation in Python Quiz Answers**

**Q1**. According to the lecture, how do you calculate the activation of the third neuron in the first layer using NumPy?

z1_3 = w1_3 * x + b

a1_3 = sigmoid(z1_3)

layer_1 = Dense(units=3, activation='sigmoid')

a_1 = layer_1(x)

z1_3 = np.dot(w1_3, x) + b

a1_3 = sigmoid(z1_3)

**Q2**. According to the lecture, when coding up the NumPy array W, where would you place the w parameters for each neuron?

- In the rows of W.
- In the columns of W.

**Q3**. For the code above in the “dense” function that defines a single layer of neurons, how many times does the code go through the “for loop”? Note that W has 2 rows and 3 columns.

- 3 times
- 2 times
- 6 times
- 5 times

Feedback. For each neuron in the layer, there is one column in the NumPy array W. The number of rows of W equals the number of input features fed into that layer. The for loop calculates the activation value for each neuron, so it runs once per column.
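The "dense" function this question refers to is not reproduced on this page. The sketch below is one plausible reconstruction, assuming a sigmoid activation and a W with 2 rows (input features) and 3 columns (neurons); all parameter values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(a_in, W, b):
    """Compute one layer's activations, one neuron (column of W) at a time."""
    units = W.shape[1]          # number of neurons = number of columns
    a_out = np.zeros(units)
    for j in range(units):      # runs once per neuron
        w = W[:, j]             # weights of neuron j
        z = np.dot(w, a_in) + b[j]
        a_out[j] = sigmoid(z)
    return a_out

# Hypothetical parameters: 2 input features (rows), 3 neurons (columns).
W = np.array([[1.0, -3.0, 5.0],
              [2.0, 4.0, -6.0]])
b = np.array([-1.0, 1.0, 2.0])
a_in = np.array([-2.0, 4.0])
a_out = dense(a_in, W, b)
```

With this W, `range(units)` is `range(3)`, and `a_out` holds one activation per neuron.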

**Week 2 Quiz Answers**

**Quiz 1: Neural Network Training Quiz Answers**

Q1. Here is some code that you saw in the lecture:

```
model.compile(loss=BinaryCrossentropy())
```

For which type of task would you use the binary cross entropy loss function?

- BinaryCrossentropy() should not be used for any task.
- A classification task that has 3 or more classes (categories)
- regression tasks (tasks that predict a number)
- binary classification (classification with exactly 2 classes)
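For reference, the quantity that a binary cross entropy loss measures for a single example can be written in a few lines. This is a plain-Python sketch of the standard formula $-y\log(a) - (1-y)\log(1-a)$, not TensorFlow's implementation; the function name and the sample predictions are illustrative.

```python
import math

def binary_cross_entropy(y, a):
    """Loss for one example.

    y is the true label (0 or 1); a is the predicted probability that
    y = 1 and must lie strictly between 0 and 1.
    """
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

# A confident correct prediction costs little; a confident wrong one costs a lot.
loss_good = binary_cross_entropy(1, 0.9)
loss_bad = binary_cross_entropy(1, 0.1)
```

The formula only has terms for the labels 0 and 1.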

**Q2**. Here is code that you saw in the lecture:

```
model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='sigmoid')
])
model.compile(loss=BinaryCrossentropy())
model.fit(X,y,epochs=100)
```

Which line of code updates the network parameters in order to reduce the cost?

- model.compile(loss=BinaryCrossentropy())
- None of the above — this code does not update the network parameters.
- model = Sequential([…])
- model.fit(X,y,epochs=100)

**Quiz 2: Activation Functions Quiz Answers**

**Q1**. Which of the following activation functions is the most common choice for the hidden layers of a neural network?

- Most hidden layers do not use any activation function
- Linear
- Sigmoid
- ReLU (rectified linear unit)

Q2. For the task of predicting housing prices, which activation functions could you choose for the output layer? Choose the 2 options that apply.

- ReLU
- linear
- Sigmoid

**Q3**. True/False? A neural network with many layers but no activation function (in the hidden layers) is not effective; that’s why we should instead use the linear activation function in every hidden layer.

- True
- False
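A small numeric check helps with Q3. The sketch below, using made-up one-unit layers, shows that two stacked layers with a linear activation compute exactly the same function as a single linear unit.

```python
def linear_layer(x, w, b):
    """A single unit with a linear activation: g(z) = z."""
    return w * x + b

# Hypothetical parameters for two stacked one-unit layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_layers(x):
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

# The same function written as one linear unit:
# w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
w_eq = w2 * w1
b_eq = w2 * b1 + b2

def one_layer(x):
    return linear_layer(x, w_eq, b_eq)
```

For any input, `two_layers(x)` and `one_layer(x)` agree, so stacking more linear layers does not let the network fit anything a single linear unit could not.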

**Quiz 3: Multiclass Classification**

**Q1**. For a multiclass classification task that has 4 possible outputs, the sum of all the activations adds up to 1. For a multiclass classification task that has 3 possible outputs, the sum of all the activations should add up to ….

- Less than 1
- 1
- It will vary, depending on the input x.
- More than 1

Feedback. The sum of all the softmax activations should add up to 1 whether the number of possible classes is 3, 4, 5 or any other number of classes. One way to see this is that if $e^{z_1}=10$, $e^{z_2}=20$, $e^{z_3}=30$, then the sum $a_1 + a_2 + a_3$ is equal to $\frac{e^{z_1} + e^{z_2} + e^{z_3}}{e^{z_1} + e^{z_2} + e^{z_3}}$, which is 1.
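The numbers used in the feedback can be checked directly. This is a minimal plain-Python softmax, written for illustration rather than taken from the course code; the logits are chosen so that $e^{z_1}=10$, $e^{z_2}=20$, $e^{z_3}=30$.

```python
import math

def softmax(z):
    """Map a list of logits to probabilities that sum to 1."""
    exps = [math.exp(z_j) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]

# Logits chosen so that e^{z_1}=10, e^{z_2}=20, e^{z_3}=30.
z = [math.log(10), math.log(20), math.log(30)]
a = softmax(z)
```

The activations come out as roughly 1/6, 1/3 and 1/2, which sum to 1. The same holds for any number of classes, because every activation shares the same denominator.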

**Q2**. For multiclass classification, the cross entropy loss is used for training the model. If there are 4 possible classes for the output, and for a particular training example, the true class of the example is class 3 (y=3), then what does the cross entropy loss simplify to? [Hint: This loss should get smaller when $a_3$ gets larger.]

- $z_3/(z_1+z_2+z_3+z_4)$
- $z_3$
- $-\log(a_3)$
- $\frac{-\log(a_1) + -\log(a_2) + -\log(a_3) + -\log(a_4)}{4}$
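The hint in Q2 can be verified with a short sketch. It assumes the usual definition of the loss for one example, where only the true class's activation enters; the activation values are made up.

```python
import math

def cross_entropy_loss(a, y):
    """Loss for one example.

    a is the list of softmax outputs; y is the true class, numbered from 1.
    """
    return -math.log(a[y - 1])

# Hypothetical softmax outputs for 4 classes; the true class is y = 3.
loss_low_a3 = cross_entropy_loss([0.4, 0.3, 0.2, 0.1], y=3)
loss_high_a3 = cross_entropy_loss([0.1, 0.1, 0.7, 0.1], y=3)
```

`loss_high_a3` is smaller than `loss_low_a3`, matching the hint that the loss shrinks as $a_3$ grows.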

**Q3**. For multiclass classification, the recommended way to implement softmax regression is to set from_logits=True in the loss function, and also to define the model’s output layer with…

- a ‘softmax’ activation
- a ‘linear’ activation
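The motivation for computing the loss from logits is numerical. The sketch below is a plain-Python illustration of that idea, not TensorFlow's internal code: the naive route exponentiates first and can overflow, while the rearranged form $\log\sum_j e^{z_j} - z_y$ stays finite. Both function names are hypothetical.

```python
import math

def loss_from_probabilities(z, y):
    """Naive route: softmax first, then -log of the true class (y is 0-based)."""
    exps = [math.exp(z_j) for z_j in z]
    return -math.log(exps[y] / sum(exps))

def loss_from_logits(z, y):
    """Same loss as log(sum_j e^{z_j}) - z_y, subtracting max(z) for stability."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(z_j - m) for z_j in z))
    return log_sum_exp - z[y]

small = [2.0, 1.0, 0.1]
large = [1000.0, 10.0, -5.0]   # math.exp(1000.0) overflows a Python float
```

On `small` the two functions agree. On `large` the naive version raises `OverflowError`, while `loss_from_logits` still returns a finite value.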

**Quiz 4: Additional Neural Network Concepts Quiz Answers**

**Q1**. The Adam optimizer is the recommended optimizer for finding the optimal parameters of the model. How do you use the Adam optimizer in TensorFlow?

- The call to model.compile() will automatically pick the best optimizer, whether it is gradient descent, Adam or something else. So there’s no need to pick an optimizer manually.
- The call to model.compile() uses the Adam optimizer by default
- The Adam optimizer works only with Softmax outputs. So if a neural network has a Softmax output layer, TensorFlow will automatically pick the Adam optimizer.
- When calling model.compile, set optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3).

**Q2**. The lecture covered a different layer type where each single neuron of the layer does not look at all the values of the input vector that is fed into that layer. What is the name of the layer type discussed in lecture?

- 1D layer or 2D layer (depending on the input dimension)
- A fully connected layer
- Image layer
- convolutional layer

**Week 3 Quiz Answers**


**Quiz 1: Advice for applying machine learning**

Q1. In the context of machine learning, what is a diagnostic?

- A process by which we quickly try as many different ways to improve an algorithm as possible, so as to see what works.
- This refers to the process of measuring how well a learning algorithm does on a test set (data that the algorithm was not trained on).
- An application of machine learning to medical applications, with the goal of diagnosing patients’ conditions.
- A test that you run to gain insight into what is/isn’t working with a learning algorithm.

Q2. True/False? It is always true that the better an algorithm does on the training set, the better it will do on generalizing to new data.

- False
- True

Q3. For a classification task, suppose you train three different models using three different neural network architectures. Which data do you use to evaluate the three models in order to choose the best one?

- The cross validation set
- All the data: training, cross validation and test sets put together.
- The test set
- The training set

**Quiz 2: Bias and variance Quiz Answers**

Q1. If the model’s cross validation error $J_{cv}$ is much higher than the training error $J_{train}$, this is an indication that the model has…

- Low bias
- high variance
- Low variance
- high bias

Q2. Which of these is the best way to determine whether your model has high bias (has underfit the training data)?

- Compare the training error to the cross validation error.
- See if the cross validation error is high compared to the baseline level of performance
- Compare the training error to the baseline level of performance
- See if the training error is high (above 15% or so)

Q3. You find that your algorithm has high bias. Which of these seem like good options for improving the algorithm’s performance? Hint: two of these are correct.

- Collect more training examples
- Collect additional features or add polynomial features
- Remove examples from the training set
- Decrease the regularization parameter $\lambda$ (lambda)

Q4. You find that your algorithm has a training error of 2%, and a cross validation error of 20% (much higher than the training error). Based on the conclusion you would draw about whether the algorithm has a high bias or high variance problem, which of these seem like good options for improving the algorithm’s performance? Hint: two of these are correct.

- Collect more training data
- Decrease the regularization parameter $\lambda$
- Reduce the training set size
- Increase the regularization parameter $\lambda$
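The reasoning behind the bias and variance questions can be written as a small rule of thumb. The helper below is a hypothetical sketch: `diagnose`, the `baseline` default, and the `gap` tolerance for "much higher" are all assumptions made for illustration, not values from the course.

```python
def diagnose(j_train, j_cv, baseline=0.0, gap=0.05):
    """Return a set of likely problems from error rates given as fractions.

    gap is a hypothetical tolerance for what counts as "much higher".
    """
    problems = set()
    if j_train - baseline > gap:    # training error far above the baseline
        problems.add("high bias")
    if j_cv - j_train > gap:        # cross validation error far above training error
        problems.add("high variance")
    return problems

# The numbers from Q4: 2% training error, 20% cross validation error.
result = diagnose(j_train=0.02, j_cv=0.20)
```

Under these assumptions `result` is `{"high variance"}`, because the training error is low while the cross validation error is much higher.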

**Quiz 3: Machine learning development process Quiz Answers**

Q1. Which of these is a way to do error analysis?

- Calculating the test error $J_{test}$
- Collecting additional training data in order to help the algorithm do better.
- Manually examine a sample of the training examples that the model misclassified in order to identify common traits and trends.
- Calculating the training error $J_{train}$

Q2. We sometimes take an existing training example and modify it (for example, by rotating an image slightly) to create a new example with the same label. What is this process called?

- Machine learning diagnostic
- Bias/variance analysis
- Error analysis
- Data augmentation
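As a toy illustration of the process described in Q2, the sketch below creates a modified copy of each example while keeping its label. A one-pixel shift stands in for the slight rotation mentioned in the question, and the tiny 3x3 "image" and the helper names are made up.

```python
def shift_right(image):
    """Return a copy of a 2D image (list of rows) shifted one pixel to the right."""
    return [[0] + row[:-1] for row in image]

def augment(examples):
    """Add a shifted copy of every (image, label) pair; labels are unchanged."""
    augmented = list(examples)
    for image, label in examples:
        augmented.append((shift_right(image), label))
    return augmented

# Hypothetical 3x3 "image" labeled as the digit 1.
original = [([[0, 1, 0],
              [0, 1, 0],
              [0, 1, 0]], 1)]
training_set = augment(original)
```

`training_set` now holds two examples with the same label: the original and its shifted copy.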

Q3. What are two possible ways to perform transfer learning? Hint: two of the four choices are correct.

- Download a pre-trained model and use it for prediction without modifying or re-training it.
- Given a dataset, pre-train and then further fine tune a neural network on the same dataset.
- You can choose to train just the output layers’ parameters and leave the other parameters of the model fixed.
- You can choose to train all parameters of the model, including the output layers, as well as the earlier layers.

**Week 4 Quiz Answers**

**Quiz 1: Decision trees Quiz Answers**

Q1. Based on the decision tree shown in the lecture, if an animal has floppy ears, a round face shape and has whiskers, does the model predict that it’s a cat or not a cat?

- cat
- Not a cat

Q2. Take a decision tree learning to classify between spam and non-spam email. There are 20 training examples at the root node, comprising 10 spam and 10 non-spam emails. If the algorithm can choose from among four features, resulting in four corresponding splits, which would it choose (i.e., which has highest purity)?

- Left split: 2 of 2 emails are spam. Right split: 8 of 18 emails are spam.
- Left split: 7 of 8 emails are spam. Right split: 3 of 12 emails are spam.
- Left split: 10 of 10 emails are spam. Right split: 0 of 10 emails are spam.
- Left split: 5 of 10 emails are spam. Right split: 5 of 10 emails are spam.

**Quiz 2: Decision tree learning**

Q1. Recall that entropy was defined in lecture as $H(p_1) = -p_1 \log_2(p_1) - p_0 \log_2(p_0)$, where $p_1$ is the fraction of positive examples and $p_0$ the fraction of negative examples.

At a given node of a decision tree, 6 of 10 examples are cats and 4 of 10 are not cats. Which expression calculates the entropy $H(p_1)$ of this group of 10 animals?

- $(0.6)\log_2(0.6) + (0.4)\log_2(0.4)$
- $-(0.6)\log_2(0.6) - (1 - 0.4)\log_2(1 - 0.4)$
- $-(0.6)\log_2(0.6) - (0.4)\log_2(0.4)$
- $(0.6)\log_2(0.6) + (1 - 0.4)\log_2(1 - 0.4)$
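The entropy definition quoted in this question translates directly into code. The following is a minimal sketch, assuming the usual convention that $0 \log_2(0)$ is treated as 0.

```python
import math

def entropy(p1):
    """H(p1) = -p1*log2(p1) - p0*log2(p0), with p0 = 1 - p1.

    By convention 0*log2(0) is taken as 0, so a pure node has entropy 0.
    """
    if p1 in (0.0, 1.0):
        return 0.0
    p0 = 1.0 - p1
    return -p1 * math.log2(p1) - p0 * math.log2(p0)

h = entropy(6 / 10)   # 6 cats out of 10 animals
```

For this node `h` is about 0.971, slightly below the maximum of 1 reached at an even 50/50 split.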

Q2. Recall that information gain was defined as follows:

$H(p_1^{root}) - \left( w^{left} H(p_1^{left}) + w^{right} H(p_1^{right}) \right)$

Before a split, the entropy of a group of 5 cats and 5 non-cats is $H(5/10)$. After splitting on a particular feature, a group of 7 animals (4 of which are cats) has an entropy of $H(4/7)$. The other group of 3 animals (1 of which is a cat) has an entropy of $H(1/3)$. What is the expression for information gain?

- $H(0.5) - \left( \frac{4}{7} H(4/7) + \frac{4}{7} H(1/3) \right)$
- $H(0.5) - \left( 7 \cdot H(4/7) + 3 \cdot H(1/3) \right)$
- $H(0.5) - \left( \frac{7}{10} H(4/7) + \frac{3}{10} H(1/3) \right)$
- $H(0.5) - \left( H(4/7) + H(1/3) \right)$
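To see what the information gain works out to for this split, the sketch below plugs the question's counts into the definition. Each branch is weighted by the fraction of the root's examples it receives; the function signature (pairs of positives and totals) is an illustrative choice.

```python
import math

def entropy(p1):
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def information_gain(root, left, right):
    """Each argument is a (positives, total) pair for that node."""
    n = root[1]
    w_left, w_right = left[1] / n, right[1] / n   # fraction of examples per branch
    children = (w_left * entropy(left[0] / left[1])
                + w_right * entropy(right[0] / right[1]))
    return entropy(root[0] / root[1]) - children

# 5 of 10 cats at the root; 4 of 7 cats on the left; 1 of 3 cats on the right.
gain = information_gain(root=(5, 10), left=(4, 7), right=(1, 3))
```

The weights here are 7/10 and 3/10, and the resulting gain is small (about 0.035) because both branches remain fairly mixed.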

Q3. To represent 3 possible values for the ear shape, you can define 3 features for ear shape: pointy ears, floppy ears, oval ears. For an animal whose ears are not pointy, not floppy, but are oval, how can you represent this information as a feature vector?

- [0, 0, 1]
- [1, 1, 0]
- [0, 1, 0]
- [1, 0, 0]
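The encoding described in Q3 can be sketched as a small helper. The category order (pointy, floppy, oval) follows the question; the function name `one_hot` is hypothetical.

```python
EAR_SHAPES = ["pointy", "floppy", "oval"]   # order matches the question

def one_hot(value, categories=EAR_SHAPES):
    """Return a list with a 1 at the position of value and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]

oval = one_hot("oval")
```

Exactly one entry of the result is 1, so each animal is assigned to exactly one ear shape.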

Q4. For a continuous valued feature (such as weight of the animal), there are 10 animals in the dataset. According to the lecture, what is the recommended way to find the best split for that feature?

- Choose the 9 mid-points between the 10 examples as possible splits, and find the split that gives the highest information gain.
- Use gradient descent to find the value of the split threshold that gives the highest information gain.
- Try every value spaced at regular intervals (e.g., 8, 8.5, 9, 9.5, 10, etc.) and find the split that gives the highest information gain.
- Use a one-hot encoding to turn the feature into a discrete feature vector of 0’s and 1’s, then apply the algorithm we had discussed for discrete features.
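Generating mid-point candidates for a continuous feature takes only a few lines. In this sketch the weights are made-up values, and scoring each candidate by information gain is left out.

```python
def candidate_thresholds(values):
    """Return the mid-points between consecutive sorted values."""
    ordered = sorted(values)
    return [(lo + hi) / 2 for lo, hi in zip(ordered, ordered[1:])]

# Hypothetical weights for 10 animals.
weights = [7.2, 8.4, 9.2, 10.2, 11.0, 15.0, 17.0, 18.0, 20.0, 8.8]
thresholds = candidate_thresholds(weights)
```

Ten distinct values give nine candidate thresholds. Each one would then be evaluated as a split of the form "weight <= threshold".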

Q5. Which of these are commonly used criteria to decide to stop splitting? (Choose two.)

- When the information gain from additional splits is too large
- When the number of examples in a node is below a threshold
- When a node is 50% one class and 50% another class (highest possible value of entropy)
- When the tree has reached a maximum depth

**Quiz 3: Tree ensembles Quiz Answers**

Q1. For the random forest, how do you build each individual tree so that they are not all identical to each other?

- If you are training B trees, train each one on 1/B of the training set, so each tree is trained on a distinct set of examples.
- Sample the training data with replacement
- Sample the training data without replacement
- Train the algorithm multiple times on the same training set. This will naturally result in different trees.

Q2. You are choosing between a decision tree and a neural network for a classification task where the input $x$ is a 100×100 resolution image. Which would you choose?

- A neural network, because the input is unstructured data and neural networks typically work better with unstructured data.
- A decision tree, because the input is unstructured and decision trees typically work better with unstructured data.
- A decision tree, because the input is structured data and decision trees typically work better with structured data.
- A neural network, because the input is structured data and neural networks typically work better with structured data.

Q3. What does sampling with replacement refer to?

- Drawing a sequence of examples where, when picking the next example, first remove all previously drawn examples from the set we are picking from.
- It refers to a process of making an identical copy of the training set.
- It refers to using a new sample of data that we use to permanently overwrite (that is, to replace) the original data.
- Drawing a sequence of examples where, when picking the next example, first returning all previously drawn examples to the set we are picking from.
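Sampling with replacement is easy to demonstrate with Python's standard `random` module. This is a minimal sketch with made-up example IDs and a fixed seed; the helper name is hypothetical.

```python
import random

def sample_with_replacement(examples, rng):
    """Draw len(examples) items; each draw picks from the full original list."""
    return [rng.choice(examples) for _ in examples]

training_set = list(range(10))   # hypothetical example IDs 0..9
rng = random.Random(0)           # fixed seed so the sketch is repeatable
bootstrap = sample_with_replacement(training_set, rng)
```

Because every draw is made from the complete list, the same example can appear more than once in `bootstrap` and some examples may not appear at all. The original `training_set` is left unchanged.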

**Review:**

Based on our knowledge, we urge you to enroll in this course so you can pick up new skills from specialists. It will be worthwhile, we trust.