Classifying bank customers with Log, Net and Tree

Classifying customers that open a term deposit

This data analysis will be a little different from my previous ones in the sense that I will focus a little more on statistical modelling. The data in this post comes from the UCI Machine Learning Repository.

The dataset comes from a real bank marketing campaign in which the bank tried to persuade customers to open a term deposit.

I will use the caret package to fit the following models:

  • Logistic Classification
  • Neural Network
  • Classification Tree

Caret is a great package because it makes so many aspects of data science simple, for example data splitting, model tuning, model comparison and switching between models.

library(inspectdf)
library(caret)
library(tidyverse)
library(psych)
library(doParallel) #for parallel processing
library(cowplot)
library(ggalluvial)
library(ggridges)
library(patchwork)
library(viridis)
library(gghighlight)
library(ggpubr)
library(scales)
library(gganimate)
library(nnet)
library(pROC)
library(NeuralNetTools)
library(rpart.plot)
library(rpart)
library(skimr)

bank <- read.csv("C:/Users/GTSA - Infinity/Desktop/R analyser/bank-additional-full.csv", sep=";")


This is the first time I use the skim() function to view the data; I usually use str() or glimpse().

skim(bank)
Table 1: Data summary
Name bank
Number of rows 41188
Number of columns 21
_______________________
Column type frequency:
factor 11
numeric 10
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
job 0 1 FALSE 12 adm: 10422, blu: 9254, tec: 6743, ser: 3969
marital 0 1 FALSE 4 mar: 24928, sin: 11568, div: 4612, unk: 80
education 0 1 FALSE 8 uni: 12168, hig: 9515, bas: 6045, pro: 5243
default 0 1 FALSE 3 no: 32588, unk: 8597, yes: 3
housing 0 1 FALSE 3 yes: 21576, no: 18622, unk: 990
loan 0 1 FALSE 3 no: 33950, yes: 6248, unk: 990
contact 0 1 FALSE 2 cel: 26144, tel: 15044
month 0 1 FALSE 10 may: 13769, jul: 7174, aug: 6178, jun: 5318
day_of_week 0 1 FALSE 5 thu: 8623, mon: 8514, wed: 8134, tue: 8090
poutcome 0 1 FALSE 3 non: 35563, fai: 4252, suc: 1373
y 0 1 FALSE 2 no: 36548, yes: 4640

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 40.02 10.42 17.00 32.00 38.00 47.00 98.00 ▅▇▃▁▁
duration 0 1 258.29 259.28 0.00 102.00 180.00 319.00 4918.00 ▇▁▁▁▁
campaign 0 1 2.57 2.77 1.00 1.00 2.00 3.00 56.00 ▇▁▁▁▁
pdays 0 1 962.48 186.91 0.00 999.00 999.00 999.00 999.00 ▁▁▁▁▇
previous 0 1 0.17 0.49 0.00 0.00 0.00 0.00 7.00 ▇▁▁▁▁
emp.var.rate 0 1 0.08 1.57 -3.40 -1.80 1.10 1.40 1.40 ▁▃▁▁▇
cons.price.idx 0 1 93.58 0.58 92.20 93.08 93.75 93.99 94.77 ▁▆▃▇▂
cons.conf.idx 0 1 -40.50 4.63 -50.80 -42.70 -41.80 -36.40 -26.90 ▅▇▁▇▁
euribor3m 0 1 3.62 1.73 0.63 1.34 4.86 4.96 5.04 ▅▁▁▁▇
nr.employed 0 1 5167.04 72.25 4963.60 5099.10 5191.00 5228.10 5228.10 ▁▁▃▁▇

We can see that the data consists of 21 variables and 41 188 observations, or customers as we can also call them. There are no missing values in this dataset. A nice thing about the skim function is that it creates small histograms of the numeric variables.


Now I will use the psych package to create scatterplots and correlations for all the variables. The data frame is too big (21 variables) to plot every variable at the same time, so I will first look at the first 10 variables and then at the last 10. The response variable, which is the one we are most interested in, will be included in both plots so we can see how it correlates with the others.

bankfirst <- bank[, 1:10]        # the first 10 variables

bankfirst$y.yes <- bank[, 21]    # add the response variable (column 21, y) so it appears in the plot

pairs.panels(bankfirst)


The last row in the plot is the response variable. As we can see, there are no strong correlations with the response variable at all. Let us therefore investigate the rest of the variables and hope that they contain some information we can use when classifying.

banksecond <- bank[,11:21]

pairs.panels(banksecond)


The last row is again the response variable. Now there is a little more going on in terms of correlation; for example, duration (the length of the call between the bank and the customer, in seconds) has the strongest correlation with the response of all the variables.

We can also see that there are some other variables that have a correlation around 0.20 to 0.30.


Explore only the categorical variables with inspectdf

Here I use the inspectdf package for the first time, to show the frequencies of all the categorical levels in our bank dataset.

bankcat <- bank %>%
  inspect_cat()


bankcat %>% show_plot()


The plot above shows that some factors have many, and sometimes very small, levels. This is a problem since we know from the correlation plots that many variables do not carry much information about our response variable. The many levels will only add computational time when fitting the models and will also make the summary of the logistic model messy.

Looking at the top correlations

Here I still use the inspectdf package, but now with a function that shows the correlations between all pairs of variables, strongest first.

inspect_cor(bank) %>% show_plot()

We can clearly see from the plot above that the strongest correlation among all the variables is between euribor3m and emp.var.rate.

Term Deposit

Now I will start to focus more on individual variables, the first one being term deposit.

bank %>%
  group_by(y) %>%
  count() %>%
  ggplot(aes(x = y, y = n, fill = y)) +
  geom_col(color = "black", alpha = 0.7) +
  theme_minimal_hgrid() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  geom_text(aes(label = n), vjust = 1.3, colour = "black", size = 4) +
  labs(title = "", x = "Term Deposit", y = "Number of cases")

Of the 41 188 customers, only 4 640 opened a term deposit; the rest said no. This distribution is a problem for our classifier since the data is heavily imbalanced: almost 90 % of the customers are "No" compared to around 10 % "Yes".

A classifier will then tend to classify most cases as "No", since that alone yields a high accuracy. To counter this, I can use different resampling methods such as oversampling, undersampling, ROSE or SMOTE.
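
Before going down that route, here is a minimal sketch, not used in the modelling below, of what over- and undersampling do to the class balance, using caret's built-in upSample and downSample helpers (the object names up_bank and down_bank are just for illustration). The models later in this post instead handle the imbalance with SMOTE inside trainControl.

set.seed(123)
predictors_only <- bank[, setdiff(names(bank), "y")]

up_bank   <- upSample(x = predictors_only, y = bank$y, yname = "y")    # duplicates "yes" rows
down_bank <- downSample(x = predictors_only, y = bank$y, yname = "y")  # drops random "no" rows

table(bank$y)        # original, heavily imbalanced
table(up_bank$y)     # both classes now as large as the majority class
table(down_bank$y)   # both classes now as small as the minority class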

Duration of the call

We could see from the correlation plot that duration has the highest correlation with Term deposit. Let us therefore investigate this further with some graphs.

p1 <- ggplot(data = bank, aes(x = duration, y = y, fill = 0.5 - abs(0.5 - ..ecdf..))) +
  stat_density_ridges(rel_min_height = 0.01, geom = "density_ridges_gradient",
                      calc_ecdf = TRUE, jittered_points = TRUE,
                      position = "raincloud", alpha = 0.08, scale = 0.9) +
  scale_x_continuous(expand = c(0.01, 0)) +
  scale_y_discrete(expand = c(0.01, 0)) +
  scale_fill_viridis(name = "Tail probability", direction = -1) +
  theme_minimal_hgrid() +
  theme(legend.position = "none") +
  labs(title = "Duration of the call", x = "Seconds", y = "") +
  xlim(c(-20, 1500))


p2 <- ggplot(data = bank, aes(x = y, y = duration, fill = y)) +
  geom_jitter(alpha = 0.05, aes(color = y)) +
  geom_boxplot() +
  theme_minimal_hgrid() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 50, hjust = 1, size = 9)) +
  labs(title = "Duration and Yes/No to Term deposit", x = "Response", y = "Duration") +
  ylim(c(0, 1500))

p1 + p2

The two plots above clearly indicate that duration is a variable that can be used when predicting which customers open a term deposit. Both the density ridges and the boxplots show that customers who say yes tend to have longer calls with the bank.

The age distribution

Let’s take a deeper look at the age distribution amongst our customers.

ggplot(data = bank, aes(x = age, fill = y)) +
  geom_histogram(bins = 40, color = "black") +
  gghighlight() +
  theme(legend.position = "none") +
  facet_grid(y ~ ., scales = "free_y") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  theme_pubclean() +
  panel_border() +
  scale_x_continuous(breaks = seq(0, 100, 5)) +
  geom_vline(xintercept = c(30, 60), col = "red", linetype = "dashed")

There are almost no customers above 60 or below 25 years old, so I would say that our customers are mostly middle-aged with a slight skew towards the younger side.

The downside is that age does not seem to add any value with respect to term deposit; I cannot see any clear pattern that tells me something.

Cell phone or by home telephone

The bank had two options when calling the customers: calling their cell phone or their landline telephone.

p1 <- bank %>%
  group_by(contact, y) %>%
  count() %>%
  ggplot(aes(x = contact, y = n, fill = y)) +
  geom_col(color = "black", alpha = 0.8, position = "dodge") +
  theme_minimal_hgrid() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  labs(title = "Contact by telephone or cellphone", x = "Contact by", y = "Observations") +
  theme(legend.position = "none")

p2 <- bank %>%
  group_by(contact, y) %>%
  count() %>%
  ggplot(aes(x = contact, y = n, fill = y)) +
  geom_col(color = "black", alpha = 0.8, position = "fill") +
  theme_minimal_hgrid() +
  scale_y_continuous(expand = c(0, 0), labels = percent) +
  labs(title = "Contact by telephone or cellphone", x = "Contact by", y = "Proportion of customers")

p1 + p2

The bank called the customers most often on their cell phone. The right-hand graph indicates that around 15 % of the customers who were called on their cell phone opened a term deposit, while only around 5 % of those called on a landline did.

Job and term deposit

The job variable is one of the factors that contains a lot of levels and does not carry much information about term deposit.

bank %>%
  group_by(job, y) %>%
  count() %>%
  ggplot(aes(x = reorder(job, n), y = n, fill = y)) +
  geom_col(color = "black", alpha = 0.8) +
  theme_minimal_vgrid() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  labs(title = "Job and accepting", x = " ", y = "Amount") +
  coord_flip()

Administrative jobs are the most common among the bank's customers, followed by blue-collar jobs. There is not much more information in the graph, so let's move on.

gganimate job, age and term deposit.

Let's combine the variables age, job and term deposit and create an animated graph using gganimate.

p <- ggplot(data = bank, aes(x = age, fill = y)) +
  geom_density(alpha = 0.8) +
  theme_minimal_hgrid() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  transition_states(job, transition_length = 3, state_length = 1) +
  labs(title = "{closest_state}") +
  enter_fade() +
  exit_shrink() +
  ease_aes('sine-in-out')


animate(p, 20, nframes = 300, fps = 30, type = "cairo")

We can see that some jobs are much more common among younger or older customers. The graph looks good, but it does not really add any information.

Creating an alluvial diagram

I wanted to use an alluvial plot to investigate how the customers "flow" through some of the factors.

# select the variables and count the combinations
testing <- bank %>% 
  count(y, loan, contact, marital) %>%
  drop_na()


testing %>%
  mutate( Loan = fct_rev(as.factor(loan)),
          Contact = fct_rev(as.factor(contact)),
          Marital = fct_rev(as.factor(marital)),
          Accepted = fct_rev(as.factor(y))) %>%
  ggplot(aes(y = n, axis1 = marital, axis2 = loan, axis3 = contact, axis4 = y)) +
  geom_alluvium(aes(fill = y), aes.bind=TRUE, width = 1/12) +
  geom_stratum(width = 1/4, fill = "white", color = "black", size = 0.8) +
  geom_text(stat = "stratum", label.strata = TRUE) +
  scale_x_discrete(limits = c("Marital status", "Loan", "Contact", "Term Deposit"),
                   expand = c(.05, .05)) +
  labs(y = "Cases") +
  theme_minimal() +
  theme(legend.position = "none") +
  ggtitle("Alluvium diagram over the bank")

The variables in the alluvial plot are marital status, loan, contact and term deposit. Again, there is no clear pattern in the plot; I mostly just wanted to make it.

Model preparation

Now I will begin my classification journey. I will be using caret, and the steps ahead are easier if all predictors are numeric, so I create dummy variables for all the factors. I use the dummyVars function in caret to create them.

dmy <- dummyVars(" ~ .", data = bank, fullRank = T)   # fullRank = T drops one dummy per factor
train_transformed <- data.frame(predict(dmy, newdata = bank))

train_transformed$y.yes <- as.factor(train_transformed$y.yes)  # make the response a factor again

The downside of transforming all the factors to dummy variables is that the data frame now consists of 54 columns. Since many of these variables do not contain much information about term deposit, I want to lower the number of columns.

I will therefore use a function in caret that ranks the variables by how useful they are.

Feature Selection

Feature selection means that we look at each variable's performance and how much it adds to the model. The rfe function in caret performs recursive feature elimination, which works like backward selection. The code is commented out because it takes a very long time to run.

# Feature selection using rfe (recursive feature elimination) in caret
# control <- rfeControl(functions = rfFuncs,
#                       method = "repeatedcv",
#                       repeats = 3,
#                       verbose = FALSE)
# outcomeName <- 'y.yes'
# predictors <- names(trainSet)[!names(trainSet) %in% outcomeName]
# Loan_Pred_Profile <- rfe(trainSet[, predictors], trainSet[, outcomeName],
#                          rfeControl = control)
#
# plot(Loan_Pred_Profile, type = c("g", "o"))

Selecting only the best predictors

The 15 best predictors given by the function above are listed below. I create a new data frame with only these 15 predictors and the response variable.

fewdata <- train_transformed %>%
  select("duration", "euribor3m", "pdays", "poutcome.success", "month.oct",
         "contact.telephone", "emp.var.rate", "age", "cons.conf.idx", "month.mar",
         "cons.price.idx", "day_of_week.wed", "poutcome.nonexistent", "campaign",
         "day_of_week.thu", "y.yes")
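
For completeness, the list above could also be pulled straight from the rfe object instead of being typed by hand. A small sketch, assuming the Loan_Pred_Profile object from the commented chunk above is available (commented out here for the same reason as before):

# predictors() returns the selected variables from an rfe object, best first
# head(predictors(Loan_Pred_Profile), 15)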

Splitting the data

Now I will use the createDataPartition function from caret to split the data frame into a training set and a test set. The function makes sure that the class proportions of the response variable stay the same in both sets.

set.seed(123)
trainIndex <- createDataPartition(fewdata$y.yes, p = .75, list = FALSE)
trainData1 <- fewdata[trainIndex,]
testData1 <- fewdata[-trainIndex,]


p1 <- trainData1 %>%
  group_by(y.yes) %>%
  count() %>%
  ggplot(aes(x = y.yes, y = n, fill = y.yes)) +
  geom_col(color = "black", alpha = 0.7) +
  theme_minimal_hgrid() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  geom_text(aes(label = n), vjust = 1.3, colour = "black", size = 4) +
  labs(title = "Train data", x = "Term Deposit", y = "Number of cases") +
  theme(legend.position = "none")

p2 <- testData1 %>%
  group_by(y.yes) %>%
  count() %>%
  ggplot(aes(x = y.yes, y = n, fill = y.yes)) +
  geom_col(color = "black", alpha = 0.7) +
  theme_minimal_hgrid() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  geom_text(aes(label = n), vjust = 1.3, colour = "black", size = 4) +
  labs(title = "Test data", x = "Term Deposit", y = " ")


p1 + p2

The plot above shows that the proportions of the response variable stay the same in the training and test sets, thanks to caret's createDataPartition function.
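
If we want to check this numerically rather than by eye, a quick sketch:

# Class proportions of the response in each split (they should be nearly identical)
round(prop.table(table(trainData1$y.yes)), 3)
round(prop.table(table(testData1$y.yes)), 3)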

Order the right levels

I want the level "Yes" to be the positive level when computing the confusion matrix, so I will recode and reorder the levels of the term deposit variable, which is currently coded 0 for No and 1 for Yes.

# Recode 0/1 to No/Yes and put "Yes" first so that caret treats it as the positive class
testData1$y.yes  <- ifelse(testData1$y.yes == 1, "Yes", "No")
trainData1$y.yes <- ifelse(trainData1$y.yes == 1, "Yes", "No")

trainData1$y.yes <- factor(trainData1$y.yes, levels = c("Yes", "No"))
testData1$y.yes  <- factor(testData1$y.yes, levels = c("Yes", "No"))

Creating the trainControl in caret

Caret allows me to create a trainControl object to use when fitting the classification models. Here I tell caret to use 10-fold cross-validation repeated 5 times. I also tell caret to use the resampling method SMOTE in order to account for the imbalance in term deposit.

ctrl <- trainControl(method = "repeatedcv",              # 10-fold cross-validation...
                     repeats = 5,                        # ...repeated 5 times
                     summaryFunction = twoClassSummary,  # compute ROC, sensitivity and specificity
                     classProbs = TRUE,
                     savePredictions = "final",
                     sampling = "smote")                 # SMOTE to handle the class imbalance

Logistic Classification

Here I fit the logistic classification model. Again, the training code is commented out because it takes a long time to run, so I fitted the model on my own computer, saved it, and simply load the saved model on this website.

# cl <- makePSOCKcluster(5)   # for parallel processing
# registerDoParallel(cl)

# log_model <- train(y.yes ~ .,
#                    data = trainData1,
#                    method = "glm",
#                    family = "binomial",
#                    preProc = c("center", "scale"),
#                    metric = "Sens",
#                    tuneLength = 5,
#                    trControl = ctrl)

# stopCluster(cl)   # <-- when you are done

# saveRDS(log_model, "./log_model.rds")   # <-- saving the model

# I have already fitted this model and it would take too long to do it again every time
# this post is rendered, so I simply load the saved model below.


log_model <- readRDS("C:/Users/GTSA - Infinity/Desktop/R analyser/log_model.rds")

Predict the test data

Here I will simply predict on the test data. I won't spend time showing the model summary and so on in this post; this is only a quick example and not meant to be the best possible approach.

pred_prob <- predict(object = log_model, testData1, type = 'prob')

# ROC curve based on the predicted probability of "Yes"
ho <- pROC::roc(ifelse(testData1[, "y.yes"] == "Yes", 1, 0), pred_prob[["Yes"]])

plot.roc(ho)

The plot above shows the ROC curve for the predictions on the test data. We can see that the curve indicates good performance.

Area under the curve

ho$auc
## Area under the curve: 0.9324

Here we can see that the AUC is almost 1, which is what we want.
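
If we also want an uncertainty estimate around that number, pROC can compute a confidence interval for the AUC. A minimal sketch:

pROC::ci.auc(ho)   # 95 % confidence interval for the AUC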

Variable importance for Logistic

Let's create a plot that shows which variables are the most important ones for the logistic model.

plot(varImp(object= log_model),main="Logistic model - Variable Importance")

We can see that the most important variable is duration.

Draw confusion matrix function

I want to create a function that draws the confusion matrix produced by caret. Below is the function, which I named draw_confusion_matrix.

draw_confusion_matrix <- function(cm) {

  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('Confusion Matrix', cex.main=2)

  # create the matrix 
  rect(150, 430, 240, 370, col='green')
  text(195, 435, 'Yes', cex=1.2)
  rect(250, 430, 340, 370, col='red')
  text(295, 435, 'No', cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col='red')
  rect(250, 305, 340, 365, col='green')
  text(140, 400, 'Yes', cex=1.2, srt=90)
  text(140, 335, 'No', cex=1.2, srt=90)

  # add in the cm results 
  res <- as.numeric(cm$table)
  text(195, 400, res[1], cex=1.6, font=2, col='black')
  text(195, 335, res[2], cex=1.6, font=2, col='black')
  text(295, 400, res[3], cex=1.6, font=2, col='black')
  text(295, 335, res[4], cex=1.6, font=2, col='black')

  # add in the specifics 
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
  text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)

  # add in the accuracy information 
  text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
} 

Here I use the above function to draw the confusion matrix for the Logistic Model.

predictions2<-predict.train(object=log_model, testData1,type="raw")


cm <- confusionMatrix(predictions2, testData1$y.yes, positive = "Yes")

draw_confusion_matrix(cm) #using the draw cm function

We can see that 946 customers were predicted as Yes and actually were Yes; however, 214 were classified as No although they in fact said Yes.
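
The same numbers, and the rates derived from them, can also be read directly off the caret confusionMatrix object. A small sketch:

cm$table                    # the raw confusion matrix
cm$byClass["Sensitivity"]   # share of actual Yes that the model catches
cm$byClass["Specificity"]   # share of actual No that the model catches
cm$overall["Accuracy"]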

Neural Network

Now I will do the same thing but with a neural network.

# cl <- makePSOCKcluster(5)   # for parallel processing
# registerDoParallel(cl)

# nnet_model <- train(y.yes ~ .,
#                     data = trainData1,
#                     method = "nnet",
#                     preProc = c("center", "scale"),
#                     metric = "Sens",
#                     tuneLength = 5,
#                     trControl = ctrl)

# stopCluster(cl)   # <-- when you are done


nnet_model <- readRDS("C:/Users/GTSA - Infinity/Desktop/R analyser/nnet_model.rds")
plotnet(nnet_model$finalModel, y_names = "Term deposit")

plot(varImp(object=nnet_model),main="NNET - Variable Importance")

predictions2<-predict.train(object=nnet_model, testData1,type="raw")
cm <- confusionMatrix(predictions2, testData1$y.yes, positive = "Yes")
draw_confusion_matrix(cm) #using the draw cm function

We can see that the network is a little bit better than the logistic model at predicting the Yes cases.

Classification Tree

Now I create a classification tree.

# cl <- makePSOCKcluster(5)   # for parallel processing
# registerDoParallel(cl)

# rpart_model <- train(y.yes ~ .,
#                      data = trainData1,
#                      method = "rpart",
#                      preProc = c("center", "scale"),
#                      metric = "ROC",
#                      tuneLength = 5,
#                      trControl = ctrl)

# stopCluster(cl)   # <-- when you are done


rpart_model <- readRDS("C:/Users/GTSA - Infinity/Desktop/R analyser/rpart_model.rds")

rpart.plot(rpart_model$finalModel)

We can see from the tree plot above that the only variables used in the tree are duration, month.oct, emp.var.rate, cons.conf.idx and euribor3m.
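
For consistency with the other models, we could also look at the variable importance that caret reports for the tree. A quick sketch:

plot(varImp(object = rpart_model), main = "Classification tree - Variable Importance")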

predictions2<-predict.train(object=rpart_model, testData1,type="raw")
cm <- confusionMatrix(predictions2, testData1$y.yes, positive = "Yes")
draw_confusion_matrix(cm) #using the draw cm function

We can see that the classification tree produces the most true positives, but it also predicts Yes too often, so many of its Yes predictions are in fact No.

Comparing the models

I will use a plot that shows the sensitivity and ROC for each model during the cross-validation in caret.

rValues <- resamples(list(Tree = rpart_model, Net = nnet_model, Logistic = log_model))

bwplot(rValues, metric = c("Sens", "ROC"))

We can see that the logistic classification model has the best sensitivity in the caret cross-validation. However, the network predicted best on the test data. I would still use the logistic classification in production because it gives much more interpretable information about the variables and how the model reasons, and it is also much faster than the network.
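
Since interpretability is the argument for the logistic model, here is a small sketch of how its coefficients could be inspected as odds ratios. The final model inside the caret object is an ordinary glm; keep in mind that the coefficients refer to the centred and scaled predictors used during training, and that the sign depends on which level of y.yes glm treats as the event.

summary(log_model$finalModel)     # the usual glm summary: estimates, standard errors, p-values
exp(coef(log_model$finalModel))   # coefficients expressed as odds ratios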

That's all I had to say. This was a short post just showing some classification methods. There is a lot more I could have done better, for example transforming some variables to factors and so on, but this is just a small test.