In this at-home lab, two different classification methods will be covered: K-nearest neighbours and logistic regression. You can download the student zip including all needed files for practical 5 here.
Note: the completed homework has to be handed in on Blackboard and will be graded (pass/fail, counting towards your individual assignment grade). The deadline is two hours before the start of your lab. Hand in a PDF or HTML file. If you know how to knit PDF files, you can hand in the knitted PDF file. However, if you have not done this before, you are advised to knit to an HTML file as specified below and, within the HTML browser, 'print' your file as a PDF file.
One of the packages we are going to use is `class`. For this, you will probably need to run `install.packages("class")` before calling the `library()` functions. In addition, you will again need the `caret` package to create a training and a validation split for the dataset (note: to keep this at-home lab compact, we only use a training and validation split, and omit the test dataset for evaluating model fit).
```r
library(MASS)
library(class)
library(caret)
library(ISLR)
library(tidyverse)
```
This practical is mainly based around the `Default` dataset, which contains credit card loan data for 10,000 people. The goal is to classify credit card cases as `yes` or `no`, based on whether the customer will default on their loan.
Create a scatterplot of the `Default` dataset, where `balance` is mapped to the x position, `income` is mapped to the y position, and `default` is mapped to the colour. Can you see any interesting patterns already?

Add `facet_grid(cols = vars(student))` to the plot. What do you see? (A sketch of a plot like this is given below.)
Transform `student` into a dummy variable using `ifelse()` (0 = not a student, 1 = student). Then, randomly split the `Default` dataset into a training set `default_train` (80%) and a validation set `default_valid` (20%) using the `createDataPartition()` function of the `caret` package. A sketch of these steps follows after the note below.

If you haven't used the `ifelse()` function before, feel free to review it in Chapter 5, Control Flow (in particular Section 5.2.2), of Hadley Wickham's book Advanced R, which provides a concise overview of choice functions (`if()`) and vectorised if (`ifelse()`).
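A minimal sketch of the dummy coding and the split; the seed value and the intermediate name `default_df` are our own choices:

```r
set.seed(45)  # arbitrary seed for reproducibility (our choice)

# Recode student from "Yes"/"No" to a 1/0 dummy variable.
default_df <- Default %>%
  mutate(student = ifelse(student == "Yes", 1, 0))

# createDataPartition() returns row indices for the training set.
train_idx     <- createDataPartition(default_df$default, p = 0.8, list = FALSE)
default_train <- default_df[train_idx, ]
default_valid <- default_df[-train_idx, ]
```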
Now that we have explored the dataset, we can start on the task of classification. We can imagine a credit card company wanting to predict whether a customer will default on the loan so they can take steps to prevent this from happening.
The first method we will use is k-nearest neighbours (KNN). It classifies a data point based on a majority vote of the k points closest to it. In R, the `class` package contains a `knn()` function to perform KNN.

Create class predictions for the validation set using the `knn()` function. Use `student`, `balance`, and `income` (but no basis functions of those variables) in the `default_train` dataset. Set k to 5. Store the predictions in a variable called `knn_5_pred`. A sketch of such a call is given below.

Remember: make sure to review the `knn()` function through the help panel in the GUI or by typing `?knn` into the console. For further guidance on the `knn()` function, see Section 4.7.6 in An Introduction to Statistical Learning.
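Assuming `student` is already a 0/1 dummy and the split from before exists, such a call could look like this:

```r
# A minimal sketch: 5-nearest-neighbour predictions for the validation set.
knn_5_pred <- knn(
  train = default_train %>% select(student, balance, income),
  test  = default_valid %>% select(student, balance, income),
  cl    = default_train$default,  # true classes of the training observations
  k     = 5
)
```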
Create two scatterplots of `income` versus `balance`, as in the first plot you made: one with the true class (`default`) mapped to the colour aesthetic, and one with the predicted class (`knn_5_pred`) mapped to the colour aesthetic. Hint: add the predicted class `knn_5_pred` to the `default_valid` dataset before starting your `ggplot()` call of the second plot. What do you see?

Repeat the plots, now with a `knn_2_pred` vector generated from a 2-nearest neighbours algorithm. Are there any differences? (A sketch of the second plot is given below.)
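Following the hint, the second plot could be sketched as follows; the column name `pred_class` is our own choice:

```r
# A minimal sketch: validation data coloured by the KNN predictions.
default_valid %>%
  mutate(pred_class = knn_5_pred) %>%  # pred_class is a hypothetical name
  ggplot(aes(x = balance, y = income, colour = pred_class)) +
  geom_point()
```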
So far we have manually tried two different values of k. Although this is useful for exploring your data, the optimal value of k should be found using cross-validation.

The confusion matrix is an insightful summary of the plots we have made and the correct and incorrect classifications therein. A confusion matrix can be made in R with the `table()` function by entering two `factor`s:
```r
# Rows: predicted class; columns: true class in the validation set.
conf_2NN <- table(predicted = knn_2_pred, true = default_valid$default)
conf_2NN
```
To learn more about confusion matrices, see Section 4.4.3 in An Introduction to Statistical Learning, where they are discussed in the context of another classification method, linear discriminant analysis (LDA).
KNN directly predicts the class of a new observation using a majority
vote of the existing observations closest to it. In contrast to this,
logistic regression predicts the log-odds
of belonging to
category 1. These log-odds can then be transformed to probabilities by
performing an inverse logit transform:
\(p = \frac{1}{1 + e^{-\alpha}}\)
where \(\alpha\) indicates the log-odds of being in class 1 and \(p\) is the probability.
Therefore, logistic regression is a probabilistic classifier as opposed to a direct classifier such as KNN: instead of assigning a class directly, it outputs a probability, which can then be used in conjunction with a cutoff (usually 0.5) to classify new observations.
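As a quick numeric illustration of the inverse logit transform (the example log-odds values are our own):

```r
# Converting log-odds to probabilities; log-odds of 0 give p = 0.5.
alpha <- c(-2, 0, 2)      # example log-odds values
1 / (1 + exp(-alpha))     # manual inverse logit: 0.119, 0.500, 0.881
plogis(alpha)             # base R's built-in equivalent
```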
Logistic regression in R is done with the `glm()` function, which stands for generalised linear model. Here we have to indicate that the residuals are modelled not as a Gaussian (normal) distribution, but as a binomial distribution.
Use `glm()` with argument `family = binomial` to fit a logistic regression model `lr_mod` to the `default_train` data. Use `student`, `income`, and `balance` as predictors.

Now that we have fitted a model, we can use the `predict()` method to output the estimated probabilities for each point in the training dataset. By default, `predict()` outputs the log-odds, but we can transform these back using the inverse logit function from before, or by setting the argument `type = "response"` within the predict function. A sketch of both steps is given below.
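A minimal sketch of the model fit and the probability extraction; the name `prob_train` is our own:

```r
# Fit a logistic regression on the training data.
lr_mod <- glm(default ~ student + income + balance,
              family = binomial, data = default_train)

# type = "response" returns probabilities instead of log-odds.
prob_train <- predict(lr_mod, type = "response")
```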
Visualise the predicted probabilities versus the observed class for the `default_train` data used in `lr_mod`. You can choose for yourself which type of visualisation you would like to make. Write down your interpretations along with your plot.

Another advantage of logistic regression is that we get coefficients we can interpret.
Look at the coefficients of the `lr_mod` model and interpret the coefficient for `balance`. What would the probability of default be for a person who is not a student, has an income of 40000, and a balance of 3000 dollars at the end of each month? Is this what you expect based on the plots we've made before? (A sketch of the mechanics of such a prediction is given below; the interpretation is up to you.)
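One way to compute such a probability is via `predict()` on a one-row data frame; the name `new_person` is our own, and the columns assume the 0/1 dummy coding of `student` from earlier:

```r
# A minimal sketch: predicted default probability for one hypothetical person.
new_person <- tibble(student = 0, income = 40000, balance = 3000)
predict(lr_mod, newdata = new_person, type = "response")
```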
Let's visualise the effect `balance` has on the predicted default probability.
Create a data frame called `balance_df` with 3 columns and 500 rows: `student` always 0, `balance` ranging from 0 to 3000, and `income` always the mean income in the `default_train` dataset.

Use this dataset as the `newdata` in a `predict()` call using `lr_mod` to output the predicted probabilities for different values of `balance`. Then create a plot with the `balance_df$balance` variable mapped to x and the predicted probabilities mapped to y. Is this in line with what you expect? (A sketch of both steps is given below.)
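A minimal sketch of both steps; the column name `prob` is our own:

```r
# 500 hypothetical non-students with average income and varying balance.
balance_df <- tibble(
  student = 0,
  balance = seq(0, 3000, length.out = 500),
  income  = mean(default_train$income)
)

balance_df %>%
  mutate(prob = predict(lr_mod, newdata = balance_df, type = "response")) %>%
  ggplot(aes(x = balance, y = prob)) +
  geom_line()
```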
Now let's do another, slightly less guided, round of KNN and/or logistic regression on a new dataset, in order to predict the outcome for a specific case. We will use the Titanic dataset also discussed in the lecture. The data can be found in the `/data` folder of your project. Before creating a model, explore the data, for example by using `summary()`.
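For instance, along these lines; the file name `titanic.csv` is a guess, so substitute whatever file is actually in your `/data` folder:

```r
# A minimal sketch, assuming a CSV file (the exact file name may differ).
titanic <- read_csv("data/titanic.csv")
summary(titanic)
```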