Coursera was launched in 2012 by Daphne Koller and Andrew Ng with the goal of giving life-changing learning experiences to students all around the world. In the modern day, Coursera is a worldwide online learning platform that provides anybody, anywhere with access to online courses and degrees from top institutions and corporations.
Week 1: Machine Learning: Classification Quiz Answers
Quiz 1: Linear Classifiers & Logistic Regression
Question 1: (True/False) A linear classifier assigns the predicted class based on the sign of Score(x) = w^T h(x).
- True
- False
Question 2: (True/False) For a conditional probability distribution over y | x, where y takes on two values (+1, -1, i.e. good review, bad review), P(y = +1 | x) + P(y = -1 | x) = 1.
- True
- False
Question 3: Which function does logistic regression use to “squeeze” the real line to [0, 1]?
- Logistic function
- Absolute value function
- Zero function
Question 4: If Score(x) = w^T h(x) > 0, which of the following is true about P(y = +1 | x)?
- P(y = +1 | x) <= 0.5
- P(y = +1 | x) > 0.5
- Can’t say anything about P(y = +1 | x)
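The link between the sign of the score and the predicted probability can be checked numerically. The following is a minimal sketch, assuming the standard logistic function 1 / (1 + e^(-score)); the example scores are hypothetical and chosen only to show the behaviour around zero.

```python
import math

def sigmoid(score):
    """Logistic function: maps any real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical scores, chosen only to illustrate the behaviour around zero.
for score in (-2.0, 0.0, 2.0):
    probability = sigmoid(score)  # interpreted as P(y = +1 | x)
    print(score, round(probability, 3), "above 0.5" if probability > 0.5 else "not above 0.5")
```

A score of exactly 0 maps to 0.5, so any positive score lands above 0.5 and any negative score lands below it.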
Question 5: Consider training a 1 vs. all multiclass classifier for the problem of digit recognition using logistic regression. There are 10 digits, thus there are 10 classes. How many logistic regression classifiers will we have to train?
Answer: 10
Quiz 2: Predicting sentiment from product reviews
Question 1: How many weights are greater than or equal to 0?
Answer: 68419
Question 2: Of the three data points in sample_test_data, which one has the lowest probability of being classified as a positive review?
- First
- Second
- Third
Question 3: Which of the following products are represented in the 20 most positive reviews?
- Snuza Portable Baby Movement Monitor
- Mamadou Kids Foldable Play Yard Mattress Topper, Blue
- Britax Decathlon Convertible Car Seat, Tiffany
- Safety 1st Exchangeable Tip 3 in 1 Thermometer
Question 4: Which of the following products are represented in the 20 most negative reviews?
- The First Years True Choice P400 Premium Digital Monitor, 2 Parent Unit
- JP Lizzy Chocolate Ice Classic Tote Set
- Peg-Perego Tatamia High Chair, White Latte
- Safety 1st High-Def Digital Monitor
Question 5: What is the accuracy of the sentiment_model on the test_data? Round your answer to 2 decimal places (e.g. 0.76).
Answer: 0.91
Question 6: Does a higher accuracy value on the training_data always imply that the classifier is better?
- Yes, higher accuracy on training data always implies that the classifier is better.
- No, higher accuracy on training data does not necessarily imply that the classifier is better.
Question 7: Consider the coefficients of simple_model. There should be 21 of them, an intercept term + one for each word in significant_words.
How many of the 20 coefficients (corresponding to the 20 significant_words and excluding the intercept term) are positive for the simple_model?
Answer: 10
Question 8: Are the positive words in the simple_model also positive words in the sentiment_model?
- Yes
- No
Question 9: Which model (sentiment_model or simple_model) has higher accuracy on the TRAINING set?
- sentiment_model
- simple_model
Question 10: Which model (sentiment_model or simple_model) has higher accuracy on the TEST set?
- sentiment_model
- simple_model
Question 11: Enter the accuracy of the majority class classifier model on the test_data. Round your answer to two decimal places (e.g. 0.76).
Answer: 0.84
Question 12: Is the sentiment_model definitely better than the majority class classifier (the baseline)?
- Yes
- No
Week 2: Machine Learning: Classification Quiz Answers
Quiz 1: Learning Linear Classifiers
Question 1: (True/False) A linear classifier can only learn positive coefficients.
- True
- False
Question 2: (True/False) In order to train a logistic regression model, we find the weights that maximize the likelihood of the model.
- True
- False
Question 3: (True/False) The data likelihood is the product of the probability of the inputs x given the weights w and response y.
- True
- False
Question 4: Questions 4 and 5 refer to the following scenario.
Consider the setting where our inputs are 1-dimensional. We have data
x | y |
2.5 | +1 |
0.3 | -1 |
2.8 | +1 |
0.5 | +1 |
and the current estimates of the weights are w_0 = 0 and w_1 = 1. (w_0: the intercept, w_1: the weight for x).
Calculate the likelihood of this data. Round your answer to 2 decimal places.
Answer: 0.23
Question 5: Refer to the scenario given in Question 4 to answer the following:
Calculate the derivative of the log likelihood with respect to w_1. Round your answer to 2 decimal places.
Answer: 0.37
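The two calculations in Questions 4 and 5 can be reproduced in a few lines. This is a sketch under the assumption that P(y = +1 | x, w) is the logistic function of w_0 + w_1 x and that the derivative of the log likelihood with respect to w_1 is the sum over data points of x_i (1[y_i = +1] - P(y = +1 | x_i, w)).

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

# Data and current weights from the scenario in Questions 4 and 5.
xs = [2.5, 0.3, 2.8, 0.5]
ys = [+1, -1, +1, +1]
w0, w1 = 0.0, 1.0

likelihood = 1.0
d_w1 = 0.0
for x, y in zip(xs, ys):
    p_pos = sigmoid(w0 + w1 * x)                      # P(y = +1 | x, w)
    likelihood *= p_pos if y == +1 else 1.0 - p_pos   # probability of the observed label
    indicator = 1.0 if y == +1 else 0.0
    d_w1 += x * (indicator - p_pos)                   # contribution to d(log likelihood)/d(w_1)

print(round(likelihood, 2))  # 0.23
print(round(d_w1, 2))        # 0.37
```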
Question 6: Which of the following is true about gradient ascent? Select all that apply.
- It is an iterative algorithm
- It only updates a few of the parameters, not all of them
- It finds the maximum by “hill climbing”
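To make the iterative "hill climbing" behaviour concrete, here is a minimal gradient ascent sketch for logistic regression on the four data points from Question 4. The step size and the number of iterations are assumptions chosen for illustration, not values from the course.

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def log_likelihood(xs, ys, w0, w1):
    total = 0.0
    for x, y in zip(xs, ys):
        p_pos = sigmoid(w0 + w1 * x)
        total += math.log(p_pos if y == +1 else 1.0 - p_pos)
    return total

xs = [2.5, 0.3, 2.8, 0.5]
ys = [+1, -1, +1, +1]
w0, w1 = 0.0, 1.0
step_size = 0.1  # assumed value, small enough for steady ascent on this data

history = [log_likelihood(xs, ys, w0, w1)]
for _ in range(20):
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        error = (1.0 if y == +1 else 0.0) - sigmoid(w0 + w1 * x)
        g0 += error        # derivative with respect to the intercept
        g1 += x * error    # derivative with respect to w_1
    # Every parameter is updated in each iteration.
    w0 += step_size * g0
    w1 += step_size * g1
    history.append(log_likelihood(xs, ys, w0, w1))

print([round(value, 3) for value in history[:5]])
```

With a sufficiently small step size, each pass moves all weights uphill, so the recorded log likelihood rises from one iteration to the next.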
Quiz 2: Implementing logistic regression from scratch
Question 1: How many reviews in amazon_baby_subset.gl contain the word perfect?
Answer: 2955
Question 2: Consider the feature_matrix that was obtained by converting our data to NumPy format.
How many features are there in the feature_matrix?
Answer: 194
Question 3: Assuming that the intercept is present, how does the number of features in feature_matrix relate to the number of features in the logistic regression model? Let x = [number of features in feature_matrix] and y = [number of features in logistic regression model].
- y = x - 1
- y = x
- y = x + 1
- None of the above
Question 4: Run your logistic regression solver with provided parameters.
As each iteration of gradient ascent passes, does the log-likelihood increase or decrease?
- It increases.
- It decreases.
- None of the above
Question 5: We make predictions using the weights just learned.
How many reviews were predicted to have positive sentiment?
Answer: 25126
Question 6: What is the accuracy of the model on predictions made above? (round to 2 digits of accuracy)
Answer: 0.75
Question 7: We look at “most positive” words, the words that correspond most strongly with positive reviews.
Which of the following words is not present in the top 10 “most positive” words?
- love
- easy
- great
- perfect
- cheap
Question 8: Similarly, we look at “most negative” words, the words that correspond most strongly with negative reviews.
Which of the following words is not present in the top 10 “most negative” words?
- need
- work
- disappointed
- even
- return
Quiz 3: Overfitting & Regularization in Logistic Regression
Question 1: Consider four classifiers, whose classification performance is given by the following table:
Classifier | Classification error on training set | Classification error on validation set |
Classifier 1 | 0.2 | 0.6 |
Classifier 2 | 0.8 | 0.6 |
Classifier 3 | 0.2 | 0.2 |
Classifier 4 | 0.5 | 0.4 |
Which of the four classifiers is most likely overfit?
- Classifier 1
- Classifier 2
- Classifier 3
- Classifier 4
Question 2: Suppose a classifier classifies 23100 examples correctly and 1900 examples incorrectly. Compute accuracy by hand. Round your answer to 3 decimal places.
Answer: 0.924
Question 3: (True/False) Accuracy and error measured on the same dataset always sum to 1.
- True
- False
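The relationship between accuracy and error can be checked with the counts given in Question 2. A short sketch:

```python
# Counts from Question 2.
correct, incorrect = 23100, 1900
total = correct + incorrect

accuracy = correct / total   # fraction of examples classified correctly
error = incorrect / total    # fraction of examples classified incorrectly

print(round(accuracy, 3), round(error, 3))  # 0.924 0.076
```

Because every example is either classified correctly or incorrectly, the two fractions are computed over the same total and add up to 1.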
Question 4: Which of the following is NOT a correct description of complex models?
- Complex models accommodate many features.
- Complex models tend to produce lower training error than simple models.
- Complex models tend to generalize better than simple models.
- Complex models tend to exhibit high variance in response to perturbation in the training data.
- Complex models tend to exhibit low bias, capturing many patterns in the training data that simple models may have missed.
Question 5: Which of the following is a symptom of overfitting in the context of logistic regression? Select all that apply.
- Large estimated coefficients
- Good generalization to previously unseen data
- Simple decision boundary
- Complex decision boundary
- Overconfident predictions of class probabilities
Question 6: Suppose we perform L2 regularized logistic regression to fit a sentiment classifier. Which of the following plots does NOT describe a possible coefficient path? Choose all that apply.
Note. Assume that the algorithm runs for a wide range of L2 penalty values and each coefficient plot is zoomed out enough to capture all long-term trends.
Answer:
<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/2f4k6eWUEeWIdgqHxZs34w_1ee2b9e5512622d0ad3a587b44c49922_JcoDWuBbEeWufRJaRfO1AQ_ff849715b543c56709a46f7be7a14c5d_Capture.png?expiry=1659830400000&hmac=ZuYXWcE5lW2WAYtxHhREAqVrJliBt_t4_kpqQAFBDq4>
<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/5YNc7eWUEeWOVQ68c1xy2w_768b146d738d6e1991ac9d5f678dbef6_9wGVMuBbEeW–hK2gi_BIw_27e6398723f122d80d4caae557763566_Capture.png?expiry=1659830400000&hmac=IpojhM7UWiGlNAQC5Mi–aUCZ-TN2miBuIiaZJCjxeI>
Question 7: Suppose we perform L1 regularized logistic regression to fit a sentiment classifier. Which of the following plots does NOT describe a possible coefficient path? Choose all that apply.
Note. Assume that the algorithm runs for a wide range of L1 penalty values and each coefficient plot is zoomed out enough to capture all long-term trends.
Answer:
<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/A16k-eWVEeW–hK2gi_BIw_0d7afdaf1bd9730fd7b5c51c424f1b95_aJDg6eD2EeWuUgrcWIxPhQ_81584c7620804c5d7093ee0063171269_Capture.png?expiry=1659830400000&hmac=UwK2G8b5dHqM58vbRbsjDCIcDCnDMK7oiRI-J1hVflQ>
<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/Mf6t5uWVEeWuUgrcWIxPhQ_2a643ffaf830d9218db5a6314936d52f_V4_Ld-BdEeWIdgqHxZs34w_dfc3556048dfc2bc157ce8bf58d6d74a_Capture.png?expiry=1659830400000&hmac=XhYkEvi8hMQ1V5UiX4yZNzX0lP05srcEJH8SViZqlSI>
Question 8: In the context of L2 regularized logistic regression, which of the following occurs as we increase the L2 penalty λ? Choose all that apply.
- The L2 norm of the set of coefficients gets smaller
- Region of uncertainty becomes narrower, i.e., the classifier makes predictions with higher confidence.
- Decision boundary becomes less complex
- Training error decreases
- The classifier has lower variance
- Some features are excluded from the classifier
Quiz 4: Logistic Regression with L2 regularization
Question 1: In the function feature_derivative_with_L2, was the intercept term regularized?
- Yes
- No
Question 2: Does the term with L2 regularization increase or decrease the log likelihood ℓℓ(w)?
- Increases
- Decreases
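For readers who want to see how an L2 term enters the gradient, here is a hedged sketch. Only the function name feature_derivative_with_L2 comes from Question 1; the signature, the body, and the assumed objective ℓℓ(w) - λ‖w‖² (with the intercept excluded from the penalty) are illustrative assumptions, and the input numbers are hypothetical.

```python
def feature_derivative_with_L2(errors, feature, coefficient, l2_penalty, feature_is_constant):
    """Derivative of an L2-penalized log likelihood with respect to one coefficient.

    Assumed objective: ll(w) - l2_penalty * (sum of squared non-intercept weights).
    """
    # Unpenalized part: sum over data points of error_i * feature_i.
    derivative = sum(e * f for e, f in zip(errors, feature))
    # In this sketch the intercept (constant feature) is left unregularized.
    if not feature_is_constant:
        derivative -= 2.0 * l2_penalty * coefficient
    return derivative

# Hypothetical numbers, for illustration only.
errors = [0.5, -0.25]
feature = [1.0, 2.0]
plain = feature_derivative_with_L2(errors, feature, 3.0, 0.1, feature_is_constant=True)
penalized = feature_derivative_with_L2(errors, feature, 3.0, 0.1, feature_is_constant=False)
print(plain, penalized)
```

Under this assumed objective the penalty is subtracted from the log likelihood, so it pulls a positive coefficient's derivative downward and shrinks the weight toward zero.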
Question 3: Which of the following words is not listed in either positive_words or negative_words?
- love
- disappointed
- great
- money
- quality
Question 4: Questions 4 and 5 use the coefficient plot of the words in positive_words and negative_words.
(True/False) All coefficients consistently decrease in magnitude as the L2 penalty is increased.
- True
- False
Question 5: Questions 4 and 5 use the coefficient plot of the words in positive_words and negative_words.
(True/False) The relative order of coefficients is preserved as the L2 penalty is increased. (For example, if the coefficient for ‘cat’ was more positive than that for ‘dog’, this remains true as the L2 penalty increases.)
- True
- False
Question 6: Questions 6, 7, and 8 ask you about the 6 models trained with different L2 penalties.
Which of the following models has the highest accuracy on the training data?
- Model trained with L2 penalty = 0
- Model trained with L2 penalty = 4
- Model trained with L2 penalty = 10
- Model trained with L2 penalty = 100
- Model trained with L2 penalty = 1e3
- Model trained with L2 penalty = 1e5
Question 7: Questions 6, 7, and 8 ask you about the 6 models trained with different L2 penalties.
Which of the following models has the highest accuracy on the validation data?
- Model trained with L2 penalty = 0
- Model trained with L2 penalty = 4
- Model trained with L2 penalty = 10
- Model trained with L2 penalty = 100
- Model trained with L2 penalty = 1e3
- Model trained with L2 penalty = 1e5
Question 8: Questions 6, 7, and 8 ask you about the 6 models trained with different L2 penalties.
Does the highest accuracy on the training data imply that the model is the best one?
- Yes
- No
Week 3: Machine Learning: Classification Quiz Answers
Quiz 1: Decision Trees
Question 1: Questions 1 to 6 refer to the following common scenario:
Consider the following dataset:
x1 | x2 | x3 | y |
1 | 1 | 1 | +1 |
0 | 1 | 0 | -1 |
1 | 0 | 1 | -1 |
0 | 0 | 1 | +1 |
Let us train a decision tree with this data. Let’s call this tree T1. What feature will we split on at the root?
- x1
- x2
- x3
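The choice of root feature can be checked by computing the classification error of each candidate split on the four rows above. This is a minimal sketch that assumes each branch predicts its majority class.

```python
# Rows of the dataset from Question 1: (x1, x2, x3, y).
rows = [
    (1, 1, 1, +1),
    (0, 1, 0, -1),
    (1, 0, 1, -1),
    (0, 0, 1, +1),
]

def split_error(rows, feature_index):
    """Classification error after splitting on one binary feature."""
    mistakes = 0
    for value in (0, 1):
        labels = [row[-1] for row in rows if row[feature_index] == value]
        positives = labels.count(+1)
        # A majority-class leaf misclassifies the minority labels.
        mistakes += min(positives, len(labels) - positives)
    return mistakes / len(rows)

errors = {name: split_error(rows, i) for i, name in enumerate(["x1", "x2", "x3"])}
best_feature = min(errors, key=errors.get)
print(errors, best_feature)
```

The feature with the lowest error after the split is the one chosen at the root.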
Question 2: Refer to the dataset presented in Question 1 to answer the following.
Fully train T1 (until each leaf has data points of the same output label). What is the depth of T1?
Answer: 3
Question 3: Refer to the dataset presented in Question 1 to answer the following.
What is the training error of T1?
Answer: 0
Question 4: Refer to the dataset presented in Question 1 to answer the following.
Now consider a tree T2, which splits on x1 at the root, and splits on x2 in the 1st level, and has leaves at the 2nd level. Note: this is the XOR function on features 1 and 2. What is the depth of T2?
Answer: 2
Question 5: Refer to the dataset presented in Question 1 to answer the following.
What is the training error of T2?
Answer: 0
Question 6: Refer to the dataset presented in Question 1 to answer the following.
Which has smaller depth, T1 or T2?
- T1
- T2
Question 7: (True/False) When deciding to split a node, we find the best feature to split on that minimizes classification error.
- True
- False
Question 8: If you are learning a decision tree, and you are at a node in which all of its data has the same y value, you should
- find the best feature to split on
- create a leaf that predicts the y value of all the data
- terminate recursions on all branches and return the current tree
- go back to the PARENT node and select a DIFFERENT feature to split on so that the y values are not all the same at THIS node
Question 9: Consider two datasets D1 and D2, where D2 has the same data points as D1, but has an extra feature for each data point. Let T1 be the decision tree trained with D1, and T2 be the tree trained with D2. Which of the following is true?
- T2 has better training error than T1
- T2 has better test error than T1
- Too little information to guarantee anything
Question 10: (True/False) Logistic regression with polynomial degree 1 features will always have equal or lower training error than decision stumps (depth 1 decision trees).
- True
- False
Question 11: (True/False) Decision stumps (depth 1 decision trees) are always linear classifiers.
- True
- False
Quiz 2: Identifying safe loans with decision trees
Question 1: What percentage of the predictions on sample_validation_data did decision_tree_model get correct?
- 25%
- 50%
- 75%
- 100%
Question 2: Which loan has the highest probability of being classified as a safe loan?
- First
- Second
- Third
- Fourth
Question 3: Notice that the probability predictions are exactly the same for the 2nd and 3rd loans. Why would this happen?
- During tree traversal both examples fall into the same leaf node.
- This can only happen with sheer coincidence.
Question 4: What is the accuracy of decision_tree_model on the validation set, rounded to the nearest .01 (e.g. 0.76)?
Answer: 0.64
Question 5: How does the performance of big_model on the validation set compare to decision_tree_model on the validation set? Is this a sign of overfitting?
- big_model has higher accuracy on the validation set than decision_tree_model. This is overfitting.
- big_model has higher accuracy on the validation set than decision_tree_model. This is not overfitting.
- big_model has lower accuracy on the validation set than decision_tree_model. This is overfitting.
- big_model has lower accuracy on the validation set than decision_tree_model. This is not overfitting.
Question 6: Let us assume that each mistake costs money:
- Assume a cost of $10,000 per false negative.
- Assume a cost of $20,000 per false positive.
What is the total cost of mistakes made by decision_tree_model on validation_data? Please enter your answer as a plain integer, without the dollar sign or the comma separator, e.g. 3002000.
Answer: 50280000
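The cost in Question 6 is a weighted sum of the two kinds of mistakes. The helper below is a sketch of that formula; the function name is hypothetical, and the example counts are made up for illustration and are not the counts from validation_data.

```python
def total_mistake_cost(false_negatives, false_positives,
                       cost_per_false_negative=10000,
                       cost_per_false_positive=20000):
    """Total cost in dollars, using the per-mistake costs stated in Question 6."""
    return (false_negatives * cost_per_false_negative
            + false_positives * cost_per_false_positive)

# Hypothetical counts for illustration; not the counts from validation_data.
example_cost = total_mistake_cost(false_negatives=3, false_positives=2)
print(example_cost)  # 70000
```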
Quiz 3: Implementing binary decision trees
Question 1: What was the feature that my_decision_tree first split on while making the prediction for test_data[0]?
- emp_length.4 years
- grade.A
- term. 36 months
- home_ownership.MORTGAGE
Question 2: What was the first feature that led to a right split of test_data[0]?
- emp_length.< 1 year
- emp_length.10+ years
- grade.B
- grade.D
Question 3: What was the last feature split on before reaching a leaf node for test_data[0]?
- grade.D
- grade.B
- term. 36 months
- grade.A
Question 4: Rounded to 2nd decimal point (e.g. 0.76), what is the classification error of my_decision_tree on the test_data?
Answer: 0.36
Question 5: What is the feature that is used for the split at the root node?
- grade.A
- term. 36 months
- term. 60 months
- home_ownership.OWN
Question 6: What is the path of the first 3 feature splits considered along the left-most branch of my_decision_tree?
- term. 36 months, grade.A, grade.B
- term. 36 months, grade.A, emp_length.4 years
- term. 36 months, grade.A, no third feature because second split resulted in leaf
Question 7: What is the path of the first 3 feature splits considered along the right-most branch of my_decision_tree?
- term. 36 months, grade.D, grade.B
- term. 36 months, grade.D, home_ownership.OWN
- term. 36 months, grade.D, no third feature because second split resulted in leaf
Week 4: Machine Learning: Classification Quiz Answers
Quiz 1: Preventing Overfitting in Decision Trees
Question 1: (True/False) When learning decision trees, smaller depth USUALLY translates to lower training error.
- True
- False
Question 2: (True/False) If no two data points have the same input values, we can always learn a decision tree that achieves 0 training error.
- True
- False
Question 3: (True/False) If decision tree T1 has lower training error than decision tree T2, then T1 will always have better test error than T2.
- True
- False
Question 4: Which of the following is true for decision trees?
- Model complexity increases with size of the data.
- Model complexity increases with depth.
- None of the above
Question 5: Pruning and early stopping in decision trees is used to
- combat overfitting
- improve training error
- None of the above
Question 6: Which of the following is NOT an early stopping method?
- Stop when the tree hits a certain depth
- Stop when node has too few data points (minimum node “size”)
- Stop when every possible split results in the same amount of error reduction
- Stop when best split results in too small of an error reduction
Question 7: Consider decision tree T1 learned with minimum node size parameter = 1000. Now consider decision tree T2 trained on the same dataset and parameters, except that the minimum node size parameter is now 100. Which of the following is always true?
- The depth of T2 >= the depth of T1
- The number of nodes in T2 >= the number of nodes in T1
- The test error of T2 <= the test error of T1
- The training error of T2 <= the training error of T1
Question 8: Questions 8 to 11 refer to the following common scenario:
Imagine we are training a decision tree, and we are at a node. Each data point is (x1, x2, y), where x1,x2 are features, and y is the label. The data at this node is:
x1 | x2 | y |
0 | 1 | +1 |
1 | 0 | +1 |
0 | 1 | +1 |
1 | 1 | -1 |
What is the classification error at this node (assuming a majority class classifier)?
Answer: 0.25
Question 9: Refer to the scenario presented in Question 8.
If we split on x1, what is the classification error?
Answer: 0.25
Question 10: Refer to the scenario presented in Question 8.
If we split on x2, what is the classification error?
Answer: 0.25
Question 11: Refer to the scenario presented in Question 8.
If our parameter for minimum gain in error reduction is 0.1, do we split or stop early?
- Split
- Stop early
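The numbers in Questions 8 to 11 can be worked through with a short script. This sketch assumes majority-class prediction at every node and treats a split as not worthwhile when the error reduction falls below the minimum gain parameter.

```python
# Data at the node from Question 8: (x1, x2, y).
rows = [(0, 1, +1), (1, 0, +1), (0, 1, +1), (1, 1, -1)]

def mistakes(labels):
    """Number of points a majority-class prediction gets wrong."""
    positives = labels.count(+1)
    return min(positives, len(labels) - positives)

def node_error(rows):
    return mistakes([row[-1] for row in rows]) / len(rows)

def split_error(rows, feature_index):
    total = 0
    for value in (0, 1):
        total += mistakes([row[-1] for row in rows if row[feature_index] == value])
    return total / len(rows)

min_error_reduction = 0.1  # parameter value from Question 11

error_before = node_error(rows)
error_x1 = split_error(rows, 0)
error_x2 = split_error(rows, 1)
best_gain = error_before - min(error_x1, error_x2)
stop_early = best_gain < min_error_reduction

print(error_before, error_x1, error_x2, stop_early)
```

Neither split lowers the error at this node, so the gain is zero and falls short of the 0.1 threshold.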
Quiz 2: Decision Trees in Practice
Question 1: Given an intermediate node with 6 safe loans and 3 risky loans, if the min_node_size parameter is 10, what should the tree learning algorithm do next?
- Create a leaf and return it
- Continue building the tree by finding the best splitting feature
Question 2: Assume an intermediate node has 6 safe loans and 3 risky loans. For each of 4 possible features to split on, the error reduction is 0.0, 0.05, 0.1, and 0.14, respectively. If the minimum gain in error reduction parameter is set to 0.2, what should the tree learning algorithm do next?
- Create a leaf and return it
- Continue building the tree by using the splitting feature that gives 0.14 error reduction
Question 3: Consider the prediction path validation_set[0] with my_decision_tree_old and my_decision_tree_new. For my_decision_tree_new trained with
max_depth = 6, min_node_size = 100, min_error_reduction=0.0
is the prediction path shorter, longer, or the same as the prediction path using my_decision_tree_old that ignored the early stopping conditions 2 and 3?
- Shorter
- Longer
- The same
Question 4: Consider the prediction path for ANY new data point. For my_decision_tree_new trained with
max_depth = 6, min_node_size = 100, min_error_reduction=0.0
is the prediction path for a data point always shorter, always longer, always the same, shorter or the same, or longer or the same as for my_decision_tree_old that ignored the early stopping conditions 2 and 3?
- Always shorter
- Always longer
- Always the same
- Shorter or the same
- Longer or the same
Question 5: For a tree trained on any dataset using parameters
max_depth = 6, min_node_size = 100, min_error_reduction=0.0
what is the maximum possible number of splits encountered while making a single prediction?
Answer: 6
Question 6: Is the validation error of the new decision tree (using early stopping conditions 2 and 3) lower than, higher than, or the same as that of the old decision tree from the previous assignment?
- Higher than
- Lower than
- The same
Question 7: Which tree has the smallest error on the validation data?
- model_1
- model_2
- model_3
Question 8: Does the tree with the smallest error in the training data also have the smallest error in the validation data?
- Yes
- No
Question 9: Is it always true that the tree with the lowest classification error on the training set will result in the lowest classification error in the validation set?
- Yes, this is ALWAYS true.
- No, this is NOT ALWAYS true.
Question 10: Which tree has the largest complexity?
- model_1
- model_2
- model_3
Question 11: Is it always true that the most complex tree will result in the lowest classification error in the validation_set?
- Yes, this is always true.
- No, this is not always true.
Question 12: Using the complexity definition, which model (model_4, model_5, or model_6) has the largest complexity?
- model_4
- model_5
- model_6
Question 13: model_4 and model_5 have similar classification error on the validation set but model_5 has lower complexity. Should you pick model_5 over model_4?
- Pick model_5 over model_4
- Pick model_4 over model_5
Question 14: Using the results obtained in this section, which model (model_7, model_8, or model_9) would you choose to use?
- model_7
- model_8
- model_9
Quiz 3: Handling Missing Data
Question 1: (True/False) Skipping data points (i.e., skipping rows of the data) that have missing features only works when the learning algorithm we are using is decision tree learning.
- True
- False
Question 2: What are potential downsides of skipping features with missing values (i.e., skipping columns of the data) to handle missing data?
- So many features are skipped that accuracy can degrade
- The learning algorithm will have to be modified
- You will have fewer data points (i.e., rows) in the dataset
- If an input at prediction time has a feature missing that was always present during training, this approach is not applicable.
Question 3: (True/False) It’s always better to remove missing data points (i.e., rows) as opposed to removing missing features (i.e., columns).
- True
- False
Question 4: Consider a dataset with N training points. After imputing missing values, the number of data points in the data set is
- 2 * N
- N
- 5 * N
Question 5: Consider a dataset with D features. After imputing missing values, the number of features in the data set is
- 2 * D
- D
- 0.5 * D
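Imputation fills in missing entries rather than dropping rows or columns, which is why the shape of the data is unchanged. The sketch below uses a tiny hypothetical dataset with None marking a missing value and fills each gap with the most common observed value in its column; mode imputation is just one possible choice.

```python
from collections import Counter

# Hypothetical dataset: 3 data points, 3 binary features, None marks a missing value.
rows = [
    [1, 0, None],
    [0, None, 1],
    [1, 0, 1],
]

def impute_with_mode(rows):
    """Replace each missing value with the most common observed value in its column."""
    n_features = len(rows[0])
    fill_values = []
    for j in range(n_features):
        observed = [row[j] for row in rows if row[j] is not None]
        fill_values.append(Counter(observed).most_common(1)[0][0])
    return [[fill_values[j] if value is None else value
             for j, value in enumerate(row)]
            for row in rows]

imputed = impute_with_mode(rows)
print(len(rows), len(imputed))        # number of data points before and after
print(len(rows[0]), len(imputed[0]))  # number of features before and after
```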
Question 6: Which of the following are always true when imputing missing data? Select all that apply.
- Imputed values can be used in any classification algorithm
- Imputed values can be used when there is missing data at prediction time
- Using imputed values results in higher accuracies than skipping data points or skipping features
Question 7: Consider data that has binary features (i.e. the feature values are 0 or 1) with some feature values of some data points missing. When learning the best feature split at a node, how would we best modify the decision tree learning algorithm to handle data points with missing values for a feature?
- We choose to assign missing values to the branch of the tree (either the one with feature value equal to 0 or with feature value equal to 1) that minimizes classification error.
- We assume missing data always has value 0.
- We ignore all data points with missing values.
Week 5: Machine Learning: Classification Quiz Answers
Quiz 1: Exploring Ensemble Methods
Question 1: What percentage of the predictions on sample_validation_data did model_5 get correct?
- 25%
- 50%
- 75%
- 100%
Question 2: According to model_5, which loan is the least likely to be a safe loan?
- First
- Second
- Third
- Fourth
Question 3: What is the number of false positives on the validation data?
Answer: 1618
Question 4: Using the same costs of the false positives and false negatives, what is the cost of the mistakes made by the boosted tree model (model_5) as evaluated on the validation_set?
Answer: 46990000
Question 5: What grades are the top 5 loans?
- A
- B
- C
- D
- E
Question 6: Which model has the best accuracy on the validation_data?
- model_10
- model_50
- model_100
- model_200
- model_500
Question 7: Is it always true that the model with the most trees will perform best on the test/validation set?
- Yes, a model with more trees will ALWAYS perform better on the test/validation set.
- No, a model with more trees does not always perform better on the test/validation set.
Question 8: Does the training error reduce as the number of trees increases?
- Yes
- No
Question 9: Is it always true that the test/validation error will reduce as the number of trees increases?
- Yes, it is ALWAYS true that the test/validation error will reduce as the number of trees increases.
- No, the test/validation error will not necessarily always reduce as the number of trees increases.
Quiz 2: Boosting
Question 1: Which of the following is NOT an ensemble method?
- Gradient boosted trees
- AdaBoost
- Random forests
- Single decision trees
Question 2: Each binary classifier in an ensemble makes predictions on an input x as listed in the table below. Based on the ensemble coefficients also listed in the table, what is the final ensemble model’s prediction for x?
Classifier | Coefficient w_t | Prediction for x |
Classifier 1 | 0.61 | +1 |
Classifier 2 | 0.53 | -1 |
Classifier 3 | 0.88 | -1 |
Classifier 4 | 0.34 | +1 |
- +1
- -1
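The ensemble prediction is the sign of the coefficient-weighted sum of the individual predictions. A short sketch using the values from the table in Question 2:

```python
# Coefficients w_t and predictions f_t(x) from the table in Question 2.
coefficients = [0.61, 0.53, 0.88, 0.34]
predictions = [+1, -1, -1, +1]

# Weighted vote: sum over classifiers of w_t * f_t(x).
score = sum(w * f for w, f in zip(coefficients, predictions))
ensemble_prediction = +1 if score > 0 else -1

print(round(score, 2), ensemble_prediction)  # -0.46 -1
```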
Question 3: (True/False) Boosted trees tend to be more robust to overfitting than decision trees.
- True
- False
Question 4: (True/False) AdaBoost focuses on data points it incorrectly predicted by increasing those weights in the data set.
- True
- False
Question 5: Let w_t be the coefficient for a weak learner f_t. Which of the following conditions must be true so that w_t > 0?
- weighted_error(f_t) < 0.25
- weighted_error(f_t) < 0.5
- weighted_error(f_t) > 0.75
- weighted_error(f_t) > 0.5
Question 6: If you were using AdaBoost and in an iteration of the algorithm were faced with the following classifiers, which one would you be more inclined to include in the ensemble? A classifier with:
- weighted error = 0.1
- weighted error = 0.3
- weighted error = 0.5
- weighted error = 0.7
- weighted error = 0.99
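Questions 5 and 6 both follow from how AdaBoost turns a weighted error into a coefficient. The sketch below assumes the standard AdaBoost formula w_t = 0.5 ln((1 - weighted_error) / weighted_error) and evaluates it for the error values listed in Question 6.

```python
import math

def adaboost_coefficient(weighted_error):
    """Standard AdaBoost coefficient for a weak learner with the given weighted error."""
    return 0.5 * math.log((1.0 - weighted_error) / weighted_error)

errors = [0.1, 0.3, 0.5, 0.7, 0.99]
coefficients = {e: adaboost_coefficient(e) for e in errors}

# The magnitude of the coefficient reflects how informative the classifier is.
most_informative = max(errors, key=lambda e: abs(coefficients[e]))

for e in errors:
    print(e, round(coefficients[e], 3))
print(most_informative)
```

Under this formula the coefficient is positive exactly when the weighted error is below 0.5. A classifier with an error close to 1 gets a large negative coefficient, so its flipped predictions are highly informative.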
Question 7: Imagine we are training a decision stump in an iteration of AdaBoost, and we are at a node. Each data point is (x1, x2, y), where x1,x2 are features, and y is the label. Also included are the weights of the data. The data at this node is:
Weight | x1 | x2 | y |
0.3 | 0 | 1 | +1 |
0.35 | 1 | 0 | -1 |
0.1 | 0 | 1 | +1 |
0.25 | 1 | 1 | +1 |
Suppose we split on feature x2. Calculate the weighted error of this split. Round your answer to 2 decimal places.
Answer: 0.00
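Weighted errors for the table in Question 7 can be computed directly. This sketch works from the table exactly as printed and assumes each branch predicts the class with the greater total weight. It reports the weighted error of the unsplit node and of splits on x1 and x2.

```python
# Rows from the table in Question 7: (weight, x1, x2, y).
rows = [
    (0.3, 0, 1, +1),
    (0.35, 1, 0, -1),
    (0.1, 0, 1, +1),
    (0.25, 1, 1, +1),
]

def weighted_mistakes(group):
    """Weight misclassified when the group predicts its heavier class."""
    positive_weight = sum(row[0] for row in group if row[-1] == +1)
    negative_weight = sum(row[0] for row in group if row[-1] == -1)
    return min(positive_weight, negative_weight)

def weighted_split_error(rows, feature_index):
    total_weight = sum(row[0] for row in rows)
    mistakes = sum(
        weighted_mistakes([row for row in rows if row[feature_index] == value])
        for value in (0, 1)
    )
    return mistakes / total_weight

node_error = weighted_mistakes(rows) / sum(row[0] for row in rows)
error_x1 = weighted_split_error(rows, 1)  # index 1 holds x1
error_x2 = weighted_split_error(rows, 2)  # index 2 holds x2

print(round(node_error, 2), round(error_x1, 2), round(error_x2, 2))
```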
Question 8: After each iteration of AdaBoost, the weights on the data points are typically normalized to sum to 1. This is used because
- of issues with numerical instability (underflow/overflow)
- the weak learners can only learn with normalized weights
- none of the above
Question 9: Consider the following 2D dataset with binary labels.
<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/MbBDTOBiEeWBDQ73-3lhaw_4436e98252ccb4f3b833fbadbd5a7836_Capture.PNG?expiry=1660089600000&hmac=p6RABWavaZR760Eb1cRwCEi47DbFSUdaCRljLKUVuRY>
We train a series of weak binary classifiers using AdaBoost. In one iteration, the weak binary classifier produces the decision boundary as follows:
<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/HswHyeBjEeWufRJaRfO1AQ_b5746ae36ef317b86c126c93db145c34_Capture2.PNG?expiry=1660089600000&hmac=Y5S30aPSQyZ9AtvfapYN66ioJu34mzNL9DsJ_pk7btI>
Which of the five points (indicated in the second figure) will receive higher weight in the following iteration? Choose all that apply.
- (1)
- (2)
- (3)
- (4)
- (5)
Question 10: Suppose we are running AdaBoost using decision tree stumps. At a particular iteration, the data points have weights according to the figure. (Large points indicate heavy weights.)
Which of the following decision tree stumps is most likely to fit in the next iteration?
Answer:
<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/3-kqZeD0EeW–hK2gi_BIw_7d5bf2ab0433c80a45f1593efdefc45f_Capture2.PNG?expiry=1660089600000&hmac=Hpx4wOJgpWf376D_JSWW8VcmiQB0K1D7YvIpHnOeRCo>
Question 11: (True/False) AdaBoost can boost any kind of classifier, not just a decision tree stump.
- True
- False
Quiz 3: Boosting a decision stump
Question 1: Recall that the classification error for unweighted data is defined as follows:
classification error = (# mistakes) / (# all data points)
Meanwhile, the weight of mistakes for weighted data is given by
WM(α, ŷ) = Σ_{i=1}^{n} α_i × 1[y_i ≠ ŷ_i].
If we set the weights α=1 for all data points, how is the weight of mistakes WM(α,ŷ) related to the classification error?
- WM(α,ŷ) = [classification error]
- WM(α,ŷ) = [classification error] * [weight of correctly classified data points]
- WM(α,ŷ) = N * [classification error]
- WM(α,ŷ) = 1 - [classification error]
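The relationship asked about in Question 1 can be checked on a small example. The labels and predictions below are hypothetical; the weight of mistakes follows the definition given in the question, with every α_i set to 1.

```python
# Hypothetical true labels and predictions, for illustration only.
y_true = [+1, -1, +1, +1, -1]
y_pred = [+1, +1, +1, -1, -1]
alpha = [1.0] * len(y_true)  # all data point weights set to 1

# Weight of mistakes: sum of alpha_i over the misclassified points.
weight_of_mistakes = sum(a for a, y, p in zip(alpha, y_true, y_pred) if y != p)

# Unweighted classification error: number of mistakes / number of data points.
num_mistakes = sum(1 for y, p in zip(y_true, y_pred) if y != p)
classification_error = num_mistakes / len(y_true)

print(weight_of_mistakes, classification_error, len(y_true) * classification_error)
```

With unit weights the weight of mistakes simply counts the mistakes, which is the classification error multiplied by the number of data points.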
Question 2: Refer to section Example: Training a weighted decision tree.
Will you get the same model as small_data_decision_tree_subset_20 if you trained a decision tree with only 20 data points from the set of points in subset_20?
- Yes
- No
Question 3: Refer to the 10-component ensemble of tree stumps trained with Adaboost.
As each component is trained sequentially, are the component weights monotonically decreasing, monotonically increasing, or neither?
- Monotonically decreasing
- Monotonically increasing
- Neither
Question 4: Which of the following best describes a general trend in accuracy as we add more and more components? Answer based on the 30 components learned so far.
- Training error goes down monotonically, i.e. the training error reduces with each iteration but never increases.
- Training error goes down in general, with some ups and downs in the middle.
- Training error goes up in general, with some ups and downs in the middle.
- Training error goes down in the beginning, achieves the best error, and then goes up sharply.
- None of the above
Question 5: From this plot (with 30 trees), is there massive overfitting as the # of iterations increases?
- Yes
- No
Week 6: Machine Learning: Classification Quiz Answers
Quiz 1: Precision-Recall
Question 1: Questions 1 to 5 refer to the following scenario:
Suppose a binary classifier produced the following confusion matrix.
| | Predicted Positive | Predicted Negative |
| Actual Positive | 5600 | 40 |
| Actual Negative | 1900 | 2460 |
What is the recall of this classifier? Round your answer to 2 decimal places.
Answer: 0.99
Question 2: Refer to the scenario presented in Question 1 to answer the following:
(True/False) This classifier is better than random guessing.
- True
- False
Question 3: Refer to the scenario presented in Question 1 to answer the following:
(True/False) This classifier is better than the majority class classifier.
- True
- False
Question 4: Refer to the scenario presented in Question 1 to answer the following:
Which of the following points in the precision-recall space corresponds to this classifier?
<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/TDEYl-BmEeWuUgrcWIxPhQ_3c704bfa2b87d7dc0429d29ac709c690_Capture.PNG?expiry=1660176000000&hmac=5-WPTUGHOu6Emyig492RXT9E2IGaMy9Hs2BzrPolHAs>
- (1)
- (2)
- (3)
- (4)
- (5)
Question 5: Refer to the scenario presented in Question 1 to answer the following:
Which of the following best describes this classifier?
- It is optimistic
- It is pessimistic
- None of the above
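The metrics used in Questions 1 to 5 can be derived from the four counts in the confusion matrix. The sketch below is only an illustration; the variable names are assumptions, and the counts are copied from the scenario.

```python
# Counts copied from the confusion matrix in the scenario.
tp, fn = 5600, 40      # actual positives: predicted positive / predicted negative
fp, tn = 1900, 2460    # actual negatives: predicted positive / predicted negative

total = tp + fn + fp + tn

accuracy = (tp + tn) / total     # fraction of all predictions that are correct
precision = tp / (tp + fp)       # fraction of positive predictions that are correct
recall = tp / (tp + fn)          # fraction of actual positives that are found

# A majority class classifier always predicts the more common actual class.
majority_accuracy = max(tp + fn, fp + tn) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} majority={majority_accuracy:.3f}")
```

Comparing `accuracy` with `majority_accuracy` is one way to approach Question 3, and the (precision, recall) pair locates the classifier in the precision-recall space for Question 4.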
Question 6: Suppose we are fitting a logistic regression model on a dataset where the vast majority of the data points are labeled as positive. To compensate for overfitting to the dominant class, we should
- Require higher confidence level for positive predictions
- Require lower confidence level for positive predictions
Question 7: It is often the case that false positives and false negatives incur different costs. In situations where false negatives cost much more than false positives, we should
- Require higher confidence level for positive predictions
- Require lower confidence level for positive predictions
Question 8: We are interested in reducing the number of false negatives. Which of the following metrics should we primarily look at?
- Accuracy
- Precision
- Recall
Question 9: Suppose we set the threshold for positive predictions at 0.9. What is the lowest score that is classified as positive? Round your answer to 2 decimal places.
Answer: 2.20
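This value follows from inverting the logistic function: P(y = +1 | x) = 1 / (1 + exp(-score)) reaches a probability p when score = ln(p / (1 - p)). A minimal sketch of that arithmetic, with hypothetical helper names:

```python
import math


def sigmoid(score):
    """Logistic function: maps a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-score))


def score_for_probability(p):
    """Inverse of the logistic function (the log-odds of p)."""
    return math.log(p / (1.0 - p))


threshold = 0.9
min_score = score_for_probability(threshold)

# Any score at or above min_score maps to a probability of at least 0.9.
print(round(min_score, 2))
```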
Quiz 2: Exploring precision and recall
Question 1: Consider the logistic regression model trained on amazon_baby.sframe using Turi Create.
Using accuracy as the evaluation metric, was our logistic regression model better than the majority class classifier?
- Yes
- No
Question 2: How many predicted values in the test set are false positives?
Answer: 1443
Question 3: Consider the scenario where each false positive costs $100 and each false negative $1.
Given the stipulation, what is the cost associated with the logistic regression classifier’s performance on the test set?
- Between $0 and $100,000
- Between $100,000 and $200,000
- Between $200,000 and $300,000
- Above $300,000
Question 4: Out of all reviews in the test set that are predicted to be positive, what fraction of them are false positives? (Round to the second decimal place e.g. 0.25)
Answer: 0.04
Question 5: Based on what we learned in lecture, if we wanted to reduce this fraction of false positives to be below 3.5%, we would:
- Discard a sufficient number of positive predictions
- Discard a sufficient number of negative predictions
- Increase threshold for predicting the positive class (ŷ = +1)
- Decrease threshold for predicting the positive class (ŷ = +1)
Question 6: What fraction of the positive reviews in the test_set were correctly predicted as positive by the classifier? Round your answer to 2 decimal places.
Answer: 0.95
Question 7: What is the recall value for a classifier that predicts +1 for all data points in the test_data?
Answer: 1
Question 8: What happens to the number of positive predicted reviews as the threshold increases from 0.5 to 0.9?
- More reviews are predicted to be positive.
- Fewer reviews are predicted to be positive.
Question 9: Consider the metrics obtained from setting the threshold to 0.5 and to 0.9.
Does the precision increase with a higher threshold?
- Yes
- No
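The effect of the threshold in Questions 8 and 9 can be illustrated on a toy example. This is a sketch under stated assumptions: the probabilities and labels below are invented, the helper names are hypothetical, and the behavior on the real test set should be read from the assignment output rather than from this example.

```python
def predict_with_threshold(probabilities, threshold):
    """Predict +1 when the probability meets the threshold, otherwise -1."""
    return [+1 if p >= threshold else -1 for p in probabilities]


def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == +1 and p == +1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == -1 and p == +1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == +1 and p == -1)
    # Convention chosen here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Invented predicted probabilities and true labels, for illustration only.
probabilities = [0.99, 0.95, 0.92, 0.85, 0.80, 0.70, 0.60, 0.55, 0.30, 0.10]
y_true = [+1, +1, +1, +1, -1, +1, -1, +1, -1, -1]

results = {}
for threshold in (0.5, 0.9):
    y_pred = predict_with_threshold(probabilities, threshold)
    num_positive = y_pred.count(+1)
    results[threshold] = (num_positive, *precision_recall(y_true, y_pred))
    print(threshold, results[threshold])
```

In this toy data, raising the threshold shrinks the set of positive predictions, which raises precision and lowers recall.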
Question 10: Among all the threshold values tried, what is the smallest threshold value that achieves a precision of 96.5% or better? Round your answer to 3 decimal places.
Answer: 0.838
Question 11: Using threshold = 0.98, how many false negatives do we get on the test_data? (Hint: You may use the turicreate.evaluation.confusion_matrix function implemented in Turi Create.)
Answer: 5826
Question 12: Questions 12 and 13 are concerned with the reviews that contain the word baby.
Among all the threshold values tried, what is the smallest threshold value that achieves a precision of 96.5% or better for the reviews in baby_reviews? Round your answer to 3 decimal places.
Answer: 0.864
Question 13: Questions 12 and 13 are concerned with the reviews that contain the word baby.
Is this threshold value smaller or larger than the threshold used for the entire dataset to achieve the same specified precision of 96.5%?
- Larger
- Smaller
Week 7: Machine Learning: Classification Quiz Answers
Quiz 1: Scaling to Huge Datasets & Online Learning
Question 1: (True/False) Stochastic gradient ascent often requires fewer passes over the dataset than batch gradient ascent to achieve a similar log likelihood.
- True
- False
Question 2: (True/False) Choosing a large batch size results in less noisy gradients.
- True
- False
Question 3: (True/False) The set of coefficients obtained at the last iteration represents the best coefficients found so far.
- True
- False
Question 4: Suppose you obtained the plot of log likelihood below after running stochastic gradient ascent.
<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/VyRs8uBtEeWBDQ73-3lhaw_70d1c456f3deba4ffa876e53e041cf7a_Capture.PNG?expiry=1660176000000&hmac=5D3PC3fLKqnzCxTC9Pc0eLM_RHsPwVLfIqiwIAt-UF0>
Which of the following actions would help the most to improve the rate of convergence?
- Increase step size
- Decrease step size
- Decrease batch size
Question 5: Suppose you obtained the plot of log likelihood below after running stochastic gradient ascent.
<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/Ym5JpOBtEeWBDQ73-3lhaw_8f2bb6e530dc2b481646181bf6e72a80_Capture2.PNG?expiry=1660176000000&hmac=c564dNreuqYrBqafsUcTD8pRFkrcLeGYcJdVODJkcHY>
Which of the following actions would help to improve the rate of convergence?
- Increase batch size
- Increase step size
- Decrease step size
Question 6: Suppose it takes about 1 millisecond to compute a gradient for a single example. You run an online advertising company and would like to do online learning via mini-batch stochastic gradient ascent. If you aim to update the coefficients once every 5 minutes, how many examples can you cover in each update? Overhead and other operations take up 2 minutes, so you only have 3 minutes for the coefficient update.
Answer: 180000
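The arithmetic behind this answer is a unit conversion from minutes to milliseconds. A short sketch, with variable names chosen here for illustration:

```python
ms_per_gradient = 1        # time to compute the gradient for one example
update_interval_min = 5    # coefficients are updated once every 5 minutes
overhead_min = 2           # minutes lost to overhead and other operations

# Time actually available for gradient computation, in milliseconds.
compute_ms = (update_interval_min - overhead_min) * 60 * 1000

examples_per_update = compute_ms // ms_per_gradient
print(examples_per_update)
```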
Question 7: In search of an optimal step size, you experiment with multiple step sizes and obtain the following convergence plot.
<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/bS2ZaOBtEeWufRJaRfO1AQ_8d67c76c9a6715f27ebc78fb8df13c0a_Capture3.PNG?expiry=1660176000000&hmac=c9mJN6-YxSUtnYeENdj0vohMOaVa12ueofNmLj0mHUo>
Which lines correspond to step sizes that are larger than the best? Select all that apply.
- (1)
- (2)
- (3)
- (4)
- (5)
Question 8: Suppose you run stochastic gradient ascent with two different batch sizes. Which of the two lines below corresponds to the smaller batch size (assuming both use the same step size)?
<image: https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/oj-RfOBtEeWOVQ68c1xy2w_6f018b2d10620f6857e5580f40c7eeb5_Capture4.PNG?expiry=1660176000000&hmac=67p5JZoMj1_RYsdvd1syHchdNuJdf5eUX5nC6ySjoRI>
- (1)
- (2)
Question 9: Which of the following is NOT a benefit of stochastic gradient ascent over batch gradient ascent? Choose all that apply.
- Each coefficient step is very fast.
- Log likelihood of data improves monotonically.
- Stochastic gradient ascent can be used for online learning.
- Stochastic gradient ascent can achieve higher likelihood than batch gradient ascent for the same amount of running time.
- Stochastic gradient ascent is highly robust with respect to parameter choices.
Question 10: Suppose we run the stochastic gradient ascent algorithm described in the lecture with batch size of 100. To make 10 passes over a dataset consisting of 15400 examples, how many iterations does it need to run?
Answer: 1540
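The count comes from multiplying the number of passes by the number of batches per pass. The sketch below assumes each iteration consumes one batch and that a partial final batch still counts as one iteration; `iterations_needed` is a hypothetical helper name.

```python
import math


def iterations_needed(num_passes, num_examples, batch_size):
    """Number of mini-batch updates needed for the given number of passes."""
    batches_per_pass = math.ceil(num_examples / batch_size)
    return num_passes * batches_per_pass


# 15400 examples with batches of 100 gives 154 iterations per pass.
iterations = iterations_needed(num_passes=10, num_examples=15400, batch_size=100)
print(iterations)
```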
Quiz 2: Training Logistic Regression via Stochastic Gradient Ascent
Question 1: In the Module 3 assignment, there were 194 features (an intercept + one feature for each of the 193 important words). In this assignment, we will use stochastic gradient ascent to train the classifier using logistic regression. How does changing the solver to stochastic gradient ascent affect the number of features?
- Increases
- Decreases
- Stays the same
Question 2: Recall from the lecture and the earlier assignment, the log likelihood (without the averaging term) is given by
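$$\ell\ell(\mathbf{w}) = \sum_{i=1}^{N} \Big( (1[y_i = +1] - 1)\,\mathbf{w}^T h(\mathbf{x}_i) - \ln\big(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\big) \Big)$$
whereas the average log likelihood is given by
$$\ell\ell_A(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \Big( (1[y_i = +1] - 1)\,\mathbf{w}^T h(\mathbf{x}_i) - \ln\big(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\big) \Big)$$
How are the functions $\ell\ell(\mathbf{w})$ and $\ell\ell_A(\mathbf{w})$ related?
- $\ell\ell_A(\mathbf{w}) = \ell\ell(\mathbf{w})$
- $\ell\ell_A(\mathbf{w}) = (1/N) \cdot \ell\ell(\mathbf{w})$
- $\ell\ell_A(\mathbf{w}) = N \cdot \ell\ell(\mathbf{w})$
- $\ell\ell_A(\mathbf{w}) = \ell\ell(\mathbf{w}) - \|\mathbf{w}\|$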
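The two formulas can be compared directly in code. This is a minimal pure-Python sketch rather than the assignment's implementation: the feature rows, labels, and coefficients are invented, and the function names are hypothetical.

```python
import math


def log_likelihood(features, labels, weights):
    """Log likelihood summed over all data points (no averaging term)."""
    total = 0.0
    for h, y in zip(features, labels):
        score = sum(w * x for w, x in zip(weights, h))   # w^T h(x_i)
        indicator = 1.0 if y == +1 else 0.0              # 1[y_i = +1]
        total += (indicator - 1.0) * score - math.log(1.0 + math.exp(-score))
    return total


def average_log_likelihood(features, labels, weights):
    """The same sum, divided by the number of data points N."""
    return log_likelihood(features, labels, weights) / len(labels)


# Invented data: each row is h(x_i) with a leading 1 for the intercept.
features = [[1.0, 2.0], [1.0, -1.5], [1.0, 0.3], [1.0, -0.7]]
labels = [+1, -1, +1, -1]
weights = [0.1, -0.4]

ll = log_likelihood(features, labels, weights)
avg_ll = average_log_likelihood(features, labels, weights)
print(ll, avg_ll, avg_ll * len(labels))
```

Averaging makes the value comparable across batch sizes, since the sum alone grows with the number of data points.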
Question 3: Refer to the sub-section Computing the gradient for a single data point.
The code block above computed
$$\frac{\partial \ell_i(\mathbf{w})}{\partial w_j}$$
for j = 1 and i = 10. Is this quantity a scalar or a 194-dimensional vector?
- A scalar
- A 194-dimensional vector
Question 4: Refer to the sub-section Modifying the derivative for using a batch of data points.
The code block computed
$$\sum_{s=i}^{i+B} \frac{\partial \ell_s(\mathbf{w})}{\partial w_j}$$
for j = 10, i = 10, and B = 10. Is this a scalar or a 194-dimensional vector?
- A scalar
- A 194-dimensional vector
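For logistic regression, the derivative of one data point's log likelihood with respect to a single coefficient w_j is h_j(x_s) times (1[y_s = +1] - P(y = +1 | x_s, w)), and a batch derivative sums that quantity over the batch. The sketch below illustrates the shape of the result for a fixed j. It is not the assignment's code: the data is invented, the names are hypothetical, and the slice covering B points starting at index i is an assumption about how the batch is indexed.

```python
import math


def predict_probability(h, weights):
    """P(y = +1 | x, w) under the logistic model."""
    score = sum(w * x for w, x in zip(weights, h))
    return 1.0 / (1.0 + math.exp(-score))


def batch_derivative(features, labels, weights, j, start, batch_size):
    """Sum over one batch of the derivative with respect to coefficient j."""
    total = 0.0
    batch = zip(features[start:start + batch_size],
                labels[start:start + batch_size])
    for h, y in batch:
        indicator = 1.0 if y == +1 else 0.0
        total += h[j] * (indicator - predict_probability(h, weights))
    return total


# Invented data with two features (intercept plus one word count).
features = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
labels = [+1, -1, +1]
weights = [0.0, 0.0]  # all-zero coefficients give probability 0.5 everywhere

# A single coefficient index j gives one number.
d_w1 = batch_derivative(features, labels, weights, j=1, start=0, batch_size=3)

# Collecting the derivative for every j gives the gradient vector.
gradient = [batch_derivative(features, labels, weights, j, 0, 3)
            for j in range(len(weights))]
print(d_w1, gradient)
```

In the assignment the gradient would have one entry per feature, so fixing j picks out a single entry of that vector.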
Question 5: For what value of B is the term $\sum_{s=1}^{B} \frac{\partial \ell_s(\mathbf{w})}{\partial w_j}$ the same as the full gradient $\frac{\partial \ell(\mathbf{w})}{\partial w_j}$? A numeric answer is expected for this question. Hint: consider the training set we are using now.
Answer: 47780
Question 6: For what value of batch size B above does the stochastic gradient ascent function logistic_regression_SG act as a standard gradient ascent algorithm? A numeric answer is expected for this question. Hint: consider the training set we are using now.
Answer: 47780
Question 7: When you set batch_size = 1, as each iteration passes, how does the average log likelihood in the batch change?
- Increases
- Decreases
- Fluctuates
Question 8: When you set batch_size = len(feature_matrix_train), as each iteration passes, how does the average log likelihood in the batch change?
- Increases
- Decreases
- Fluctuates
Question 9: Suppose that we run stochastic gradient ascent with a batch size of 100. How many gradient updates are performed at the end of two passes over a dataset consisting of 50000 data points?
Answer: 1000
Question 10: Refer to the section Stochastic gradient ascent vs gradient ascent.
In the first figure, how many passes does batch gradient ascent need to achieve a similar log likelihood as stochastic gradient ascent?
- It’s always better
- 10 passes
- 20 passes
- 150 passes or more
Question 11: Questions 11 and 12 refer to the section Plotting the log likelihood as a function of passes for each step size.
Which of the following is the worst step size? Pick the step size that results in the lowest log likelihood in the end.
- 1e-2
- 1e-1
- 1e0
- 1e1
- 1e2
Question 12: Questions 11 and 12 refer to the section Plotting the log likelihood as a function of passes for each step size.
Which of the following is the best step size? Pick the step size that results in the highest log likelihood in the end.
- 1e-4
- 1e-2
- 1e0
- 1e1
- 1e2
Review:
Based on our experience, we encourage you to enroll in this course to learn new skills from specialists. We believe it will be worthwhile.