Here are the answers to Week 1, Quiz 1 of the Coursera course Practical Machine Learning.

**Practical Machine Learning Quiz 1 Answer**

**Question 1)**

**Which of the following are steps in building a machine learning algorithm?**

- Machine learning.
- Statistical inference.
- Artificial intelligence.
- **Collecting data to answer the question.**

**Question 2)**

**Suppose we build a prediction algorithm on a data set and it is 100% accurate on that data set. Why might the algorithm not work well if we collect a new data set?**

- We have too few predictors to get good out of sample accuracy.
- We have used neural networks, which have notoriously bad performance.
- We are not asking a relevant question that can be answered with machine learning.
- **Our algorithm may be overfitting the training data, predicting both the signal and the noise.**

**Question 3)**

**What are typical sizes for the training and test sets?**

- 80% training set, 20% test set.
- 10% test set, 90% training set.
- 90% training set, 10% test set.
- **60% in the training set, 40% in the testing set.**

**Question 4)**

**What are some common error rates for predicting binary variables (i.e. variables with two possible values like yes/no, disease/normal, clicked/didn’t click)?**

- P-values.
- R^2.
- Median absolute deviation.
- **Predictive value of a positive.**

**Question 5)**

**Suppose that we have created a machine learning algorithm that predicts whether a link will be clicked with 99% sensitivity and 99% specificity. The rate the link is clicked is 1/1000 of visits to a website. If we predict the link will be clicked on a specific visit, what is the probability it will actually be clicked?**

- **9%**
- 50%
- 99%
- 90%
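
The 9% figure follows from Bayes' rule. A minimal sketch in base R, using only the numbers stated in the question (the variable names are illustrative, not part of the quiz):

```r
# Positive predictive value via Bayes' rule, using the figures from the question.
sensitivity <- 0.99    # P(predict click | click)
specificity <- 0.99    # P(predict no click | no click)
prevalence  <- 1/1000  # P(click)

true_pos  <- sensitivity * prevalence              # predicted click and clicked
false_pos <- (1 - specificity) * (1 - prevalence)  # predicted click but not clicked

ppv <- true_pos / (true_pos + false_pos)
round(ppv, 3)  # 0.09
```

Because clicks are so rare, false positives outnumber true positives by roughly ten to one, even with a 99% specific classifier.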

**Practical Machine Learning Quiz 2 Answer**

Here are the answers to Week 2, Quiz 2 of the Coursera course Practical Machine Learning.

**Question 1)**

**Load the Alzheimer’s disease data using the commands:**

`library(AppliedPredictiveModeling)`

`data(AlzheimerDisease)`

**Which of the following commands will create non-overlapping training and test sets with about 50% of the observations assigned to each?**

**Answer:**

`adData = data.frame(diagnosis,predictors)`

`trainIndex = createDataPartition(diagnosis, p = 0.50, list = FALSE)`

`training = adData[trainIndex,]`

`testing = adData[-trainIndex,]`

Other option 1:

`adData = data.frame(diagnosis,predictors)`

`train = createDataPartition(diagnosis, p = 0.50,list=FALSE)`

`test = createDataPartition(diagnosis, p = 0.50,list=FALSE)`

Other option 2:

`adData = data.frame(diagnosis,predictors)`

`trainIndex = createDataPartition(diagnosis, p = 0.50)`

`training = adData[trainIndex,]`

`testing = adData[-trainIndex,]`

Other option 3:

`adData = data.frame(predictors)`

`trainIndex = createDataPartition(diagnosis, p=0.5, list=FALSE)`

`training = adData[trainIndex,]`

`testing = adData[-trainIndex,]`
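
To see why a single index works while two independent partitions do not, here is a rough sketch in base R. It uses `sample()` as a stand-in for `createDataPartition()` (which additionally stratifies on the outcome), and the toy size `n` is an assumption made for illustration.

```r
# Toy illustration: one index vector vs. two independent draws.
set.seed(1)
n <- 100
rows <- seq_len(n)

# One index: the training rows and their complement cannot overlap.
train_idx <- sample(rows, size = n / 2)
test_idx  <- rows[-train_idx]
length(intersect(train_idx, test_idx))  # 0

# Two independent draws: nothing prevents the same row landing in both.
train_b <- sample(rows, size = n / 2)
test_b  <- sample(rows, size = n / 2)
length(intersect(train_b, test_b)) > 0  # TRUE for practically any seed
```

The negative index is what guarantees the split is non-overlapping; drawing the test set separately gives no such guarantee.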

**Question 2)**

**Load the cement data using the commands:**

`library(AppliedPredictiveModeling)`

`data(concrete)`

`library(caret)`

`set.seed(1000)`

`inTrain = createDataPartition(mixtures$CompressiveStrength, p = 3/4)[[1]]`

`training = mixtures[ inTrain,]`

`testing = mixtures[-inTrain,]`

**Make a plot of the outcome (CompressiveStrength) versus the index of the samples. Color by each of the variables in the data set (you may find the cut2() function in the Hmisc package useful for turning continuous covariates into factors). What do you notice in these plots?**

- There is a non-random pattern in the plot of the outcome versus index that is perfectly explained by the Age variable.
- There is a non-random pattern in the plot of the outcome versus index.
- There is a non-random pattern in the plot of the outcome versus index that is perfectly explained by the FlyAsh variable.
- **There is a non-random pattern in the plot of the outcome versus index that does not appear to be perfectly explained by any predictor, suggesting a variable may be missing.**

**Question 3)**

**Load the cement data using the commands:**

`library(AppliedPredictiveModeling)`

`data(concrete)`

`library(caret)`

`set.seed(1000)`

`inTrain = createDataPartition(mixtures$CompressiveStrength, p = 3/4)[[1]]`

`training = mixtures[ inTrain,]`

`testing = mixtures[-inTrain,]`

**Make a histogram and confirm the SuperPlasticizer variable is skewed. Normally you might use the log transform to try to make the data more symmetric. Why would that be a poor choice for this variable?**

- The log transform is not a monotone transformation of the data.
- The log transform does not reduce the skewness of the non-zero values of SuperPlasticizer.
- The SuperPlasticizer data include negative values so the log transform can not be performed.
- **There are a large number of values that are the same and even if you took the log(SuperPlasticizer + 1) they would still all be identical so the distribution would not be symmetric.**
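
The point about tied values can be illustrated with simulated data. The sketch below does not use the concrete data; the zero-inflated vector `x` is made up for illustration.

```r
# Simulated zero-inflated variable (not the real SuperPlasticizer column).
set.seed(1)
x <- c(rep(0, 300), rexp(700, rate = 1/8))

# log(x + 1) is monotone, so tied values stay tied after the transform.
lx <- log(x + 1)
sum(x == 0)   # 300
sum(lx == 0)  # still 300: the spike at zero has not gone anywhere
```

A monotone transform can stretch or compress the non-zero part of the distribution, but it cannot spread out a block of identical values.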

**Question 4)**

**Load the Alzheimer’s disease data using the commands:**

`library(caret)`

`library(AppliedPredictiveModeling)`

`set.seed(3433)`

`data(AlzheimerDisease)`

`adData = data.frame(diagnosis,predictors)`

`inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]`

`training = adData[ inTrain,]`

`testing = adData[-inTrain,]`

**Find all the predictor variables in the training set that begin with IL. Perform principal components on these variables with the preProcess() function from the caret package. Calculate the number of principal components needed to capture 90% of the variance. How many are there?**

- 7

- 5

- 10

- **9**

**Question 5)**

**Load the Alzheimer’s disease data using the commands:**

`library(caret)`

`library(AppliedPredictiveModeling)`

`set.seed(3433)`

`data(AlzheimerDisease)`

`adData = data.frame(diagnosis,predictors)`

`inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]`

`training = adData[ inTrain,]`

`testing = adData[-inTrain,]`

**Create a training data set consisting of only the predictors with variable names beginning with IL and the diagnosis.**

**Build two predictive models, one using the predictors as they are and one using PCA with principal components explaining 80% of the variance in the predictors. Use `method="glm"` in the train function.**

**What is the accuracy of each method in the test set? Which is more accurate?**

- Non-PCA Accuracy: 0.72; PCA Accuracy: 0.65
- **Non-PCA Accuracy: 0.65; PCA Accuracy: 0.72**
- Non-PCA Accuracy: 0.91; PCA Accuracy: 0.93
- Non-PCA Accuracy: 0.72; PCA Accuracy: 0.93

**Practical Machine Learning Quiz 3 Answer**

Here are the answers to the Week 3 quiz of the Coursera course Practical Machine Learning.

**Question 1)**

**For this quiz we will be using several R packages. R package versions change over time; the right answers have been checked using the following versions of the packages.**

**AppliedPredictiveModeling: v1.1.6**

**caret: v6.0.47**

**ElemStatLearn: v2012.04-0**

**pgmm: v1.1**

**rpart: v4.1.8**

**If you aren’t using these versions of the packages, your answers may not exactly match the right answer, but hopefully should be close.**

**Load the cell segmentation data from the AppliedPredictiveModeling package using the commands:**

`library(AppliedPredictiveModeling)`

`data(segmentationOriginal)`

`library(caret)`

**Subset the data to a training set and testing set based on the Case variable in the data set.**

**Set the seed to 125 and fit a CART model with the rpart method using all predictor variables and default caret settings.**

**In the final model what would be the final model prediction for cases with the following variable values:**

- TotalIntenCh2 = 23,000; FiberWidthCh1 = 10; PerimStatusCh1 = 2.
- TotalIntenCh2 = 50,000; FiberWidthCh1 = 10; VarIntenCh4 = 100.
- TotalIntenCh2 = 57,000; FiberWidthCh1 = 8; VarIntenCh4 = 100.
- FiberWidthCh1 = 8; VarIntenCh4 = 100; PerimStatusCh1 = 2.

**Answer (in the order of the cases listed above):**

- **PS**
- **WS**
- **PS**
- **Not possible to predict**

**Question 2)**

**If K is small in a K-fold cross validation, is the bias in the estimate of out-of-sample (test set) accuracy smaller or bigger?**

**If K is small, is the variance in the estimate of out-of-sample (test set) accuracy smaller or bigger? Is K large or small in leave one out cross validation?**

- The bias is smaller and the variance is bigger. Under leave one out cross validation K is equal to one.
- The bias is smaller and the variance is smaller. Under leave one out cross validation K is equal to one.
- The bias is smaller and the variance is smaller. Under leave one out cross validation K is equal to the sample size.
- **The bias is larger and the variance is smaller. Under leave one out cross validation K is equal to the sample size.**

**Question 3)**

**Load the olive oil data using the commands:**

`library(pgmm)`

`data(olive)`

`olive = olive[,-1]`

*(NOTE: If you have trouble installing the pgmm package, you can download the `olive` dataset here: olive_data.zip (https://d396qusza40orc.cloudfront.net/predmachlearn/data/olive_data.zip). After unzipping the archive, you can load the file using the `load()` function in R.)*

**These data contain information on 572 different Italian olive oils from multiple regions in Italy. Fit a classification tree where Area is the outcome variable. Then predict the value of area for the following data frame using the tree command with all defaults:**

`newdata = as.data.frame(t(colMeans(olive)))`

**What is the resulting prediction? Is the resulting prediction strange? Why or why not?**

- 4.59965. There is no reason why the result is strange.
- 2.783. There is no reason why this result is strange.
- **2.783. It is strange because Area should be a qualitative variable, but tree is reporting the average value of Area as a numeric variable in the leaf predicted for newdata.**
- 0.005291005 0 0.994709 0 0 0 0 0 0. The result is strange because Area is a numeric variable and we should get the average within each leaf.

**Question 4)**

**Load the South Africa Heart Disease Data and create training and test sets with the following code:**

`library(ElemStatLearn)`

`data(SAheart)`

`set.seed(8484)`

`train = sample(1:dim(SAheart)[1],size=dim(SAheart)[1]/2,replace=F)`

`trainSA = SAheart[train,]`

`testSA = SAheart[-train,]`

**Then set the seed to 13234 and fit a logistic regression model (`method="glm"`, be sure to specify `family="binomial"`) with Coronary Heart Disease (chd) as the outcome and age at onset, current alcohol consumption, obesity levels, cumulative tobacco, type-A behavior, and low density lipoprotein cholesterol as predictors. Calculate the misclassification rate for your model using this function and a prediction on the “response” scale:**

`missClass = function(values,prediction){sum(((prediction > 0.5)*1) != values)/length(values)}`

**What is the misclassification rate on the training set? What is the misclassification rate on the test set?**

- Test Set Misclassification: 0.35; Training Set: 0.31
- Test Set Misclassification: 0.27; Training Set: 0.31
- **Test Set Misclassification: 0.31; Training Set: 0.27**
- Test Set Misclassification: 0.43; Training Set: 0.31
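
To check what the `missClass` helper from the question computes, it can be run on a toy example. This is only a sketch: the labels and predicted probabilities below are hypothetical and have nothing to do with the SAheart data.

```r
# The helper from the question, reformatted for readability.
missClass <- function(values, prediction) {
  sum(((prediction > 0.5) * 1) != values) / length(values)
}

# Hypothetical labels and predicted probabilities, for illustration only.
values     <- c(0, 1, 1, 0)
prediction <- c(0.2, 0.7, 0.4, 0.6)

missClass(values, prediction)  # 0.5: the last two cases are misclassified
```

The function thresholds the predicted probabilities at 0.5 and returns the share of cases where the resulting 0/1 label differs from the true value.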

**Question 5)**

**Load the vowel.train and vowel.test data sets:**

`library(ElemStatLearn)`

`data(vowel.train)`

`data(vowel.test)`

**Set the variable y to be a factor variable in both the training and test set. Then set the seed to 33833. Fit a random forest predictor relating the factor variable y to the remaining variables. Read about variable importance in random forests here:**

http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr

**The caret package uses by default the Gini importance.**

**Calculate the variable importance using the varImp function in the caret package. What is the order of variable importance?**

The order of the variables is:

- x.10, x.7, x.9, x.5, x.8, x.4, x.6, x.3, x.1, x.2
- x.1, x.2, x.3, x.8, x.6, x.4, x.5, x.9, x.7, x.10
- **x.2, x.1, x.5, x.6, x.8, x.4, x.9, x.3, x.7, x.10**
- x.2, x.1, x.5, x.8, x.6, x.4, x.3, x.9, x.7, x.10

**Practical Machine Learning Quiz 4 Answer**

Here are the answers to the Week 4 quiz of the Coursera course Practical Machine Learning.

**Question 1)**

**For this quiz we will be using several R packages. R package versions change over time; the right answers have been checked using the following versions of the packages.**

**AppliedPredictiveModeling: v1.1.6**

**caret: v6.0.47**

**ElemStatLearn: v2012.04-0**

**pgmm: v1.1**

**rpart: v4.1.8**

**gbm: v2.1**

**lubridate: v1.3.3**

**forecast: v5.6**

**e1071: v1.6.4**

**If you aren’t using these versions of the packages, your answers may not exactly match the right answer, but hopefully should be close.**

**Load the vowel.train and vowel.test data sets:**

`library(ElemStatLearn)`

`data(vowel.train)`

`data(vowel.test)`

**Set the variable y to be a factor variable in both the training and test set. Then set the seed to 33833. Fit (1) a random forest predictor relating the factor variable y to the remaining variables and (2) a boosted predictor using the “gbm” method. Fit these both with the train() command in the caret package.**

**What are the accuracies for the two approaches on the test data set? What is the accuracy among the test set samples where the two methods agree?**

- RF Accuracy = 0.9987; GBM Accuracy = 0.5152; Agreement Accuracy = 0.9985
- RF Accuracy = 0.6082; GBM Accuracy = 0.5152; Agreement Accuracy = 0.5152
- RF Accuracy = 0.9881; GBM Accuracy = 0.8371; Agreement Accuracy = 0.9983
- **RF Accuracy = 0.6082; GBM Accuracy = 0.5152; Agreement Accuracy = 0.6361**

**Question 2)**

**Load the Alzheimer’s data using the following commands**

`library(caret)`

`library(gbm)`

`set.seed(3433)`

`library(AppliedPredictiveModeling)`

`data(AlzheimerDisease)`

`adData = data.frame(diagnosis,predictors)`

`inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]`

`training = adData[ inTrain,]`

`testing = adData[-inTrain,]`

**Set the seed to 62433 and predict diagnosis with all the other variables using a random forest (“rf”), boosted trees (“gbm”) and linear discriminant analysis (“lda”) model. Stack the predictions together using random forests (“rf”). What is the resulting accuracy on the test set? Is it better or worse than each of the individual predictions?**

- Stacked Accuracy: 0.80 is better than all three other methods.
- Stacked Accuracy: 0.88 is better than all three other methods.
- Stacked Accuracy: 0.80 is worse than all the other methods.
- **Stacked Accuracy: 0.80 is better than random forests and lda and the same as boosting.**

**Question 3)**

**Load the concrete data with the commands:**

`set.seed(3523)`

`library(AppliedPredictiveModeling)`

`data(concrete)`

`inTrain = createDataPartition(concrete$CompressiveStrength, p = 3/4)[[1]]`

`training = concrete[ inTrain,]`

`testing = concrete[-inTrain,]`

**Set the seed to 233 and fit a lasso model to predict Compressive Strength. Which variable is the last coefficient to be set to zero as the penalty increases? (Hint: it may be useful to look up ?plot.enet).**

- **Cement**
- Water
- Age
- CoarseAggregate

**Question 4)**

**Load the data on the number of visitors to the instructor’s blog from here:**

https://d396qusza40orc.cloudfront.net/predmachlearn/gaData.csv

**Using the commands:**

`library(lubridate) # For year() function below`

`dat = read.csv("~/Desktop/gaData.csv")`

`training = dat[year(dat$date) < 2012,]`

`testing = dat[(year(dat$date)) > 2011,]`

`tstrain = ts(training$visitsTumblr)`

**Fit a model using the bats() function in the forecast package to the training time series. Then forecast this model for the remaining time points. For how many of the testing points is the true value within the 95% prediction interval bounds?**

- **96%**
- 94%
- 100%
- 92%
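
Once a forecast is available, the coverage is simply the share of test values that fall between the lower and upper 95% bounds. A minimal base R sketch with made-up numbers (the vectors `actual`, `lower` and `upper` are hypothetical; in the quiz they would come from the test set and the forecast output):

```r
# Hypothetical test values and 95% interval bounds, for illustration only.
actual <- c(10, 25, 40,  5, 90)
lower  <- c( 0, 10, 20, 10, 30)
upper  <- c(50, 60, 70, 80, 85)

# TRUE where the observed value lies inside its interval.
inside <- actual >= lower & actual <= upper
mean(inside)  # 0.6: three of the five values fall inside their interval
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which is the quantity the question asks for.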

**Question 5)**

**Load the concrete data with the commands:**

`set.seed(3523)`

`library(AppliedPredictiveModeling)`

`data(concrete)`

`inTrain = createDataPartition(concrete$CompressiveStrength, p = 3/4)[[1]]`

`training = concrete[ inTrain,]`

`testing = concrete[-inTrain,]`

**Set the seed to 325 and fit a support vector machine using the e1071 package to predict Compressive Strength using the default settings. Predict on the testing set. What is the RMSE?**

- 6.93
- 45.09
- **6.72**
- 11543.39
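
For reference, RMSE is the square root of the mean squared difference between observed and predicted values. A small base R sketch, assuming a hypothetical helper `rmse()` and made-up numbers rather than the concrete data:

```r
# Root mean squared error between observed and predicted values.
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Hypothetical values, for illustration only.
observed  <- c(30, 42, 55, 61)
predicted <- c(33, 38, 55, 66)

rmse(observed, predicted)  # sqrt(12.5), about 3.54
```

In the quiz, `observed` would be the test set's CompressiveStrength column and `predicted` the model's predictions on that same test set.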