Marketing and customer-related decisions are a top priority for every business. With the help of statistical modeling and analytics, it's possible to support decision makers and help them make strategic decisions based on data, not just instincts. The integration of statistical modeling with marketing strategy can also be referred to as marketing analytics.
In this series of two guides, you will learn important techniques for implementing marketing analytics in R.
This guide, Part 1, will cover:
Customer Churn Prediction
RFM Analysis
The next guide, Part 2, will cover:
Clustering
Sales Forecasting
Let's start by loading the required libraries.
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(ggplot2)
library(repr)
Customer acquisition is more expensive than retention, which is why it makes business sense to retain customers, especially profitable ones. Machine learning models can estimate the probability that a customer will leave, or churn; this estimate can then be used to identify valuable customers at risk and retain them. We'll build a logistic regression model to predict customer churn.
In this guide, we will use a fictitious dataset of retail banking customers containing 600 observations and 10 variables, as described below:
Marital_status: Whether the customer is married ("Yes") or not ("No").
Is_graduate: Whether the customer is a graduate ("Yes") or not ("No").
Income: Annual income of the customer (in USD).
Loan_pending: Outstanding loan amount (in USD) still to be paid by the customer.
Satisfaction_score: Satisfaction level of the customer.
Churn: Whether the customer churned ("Yes") or not ("No").
Age: The customer's age in years.
Sex: Whether the customer is male ("M") or female ("F").
Investment: Total investment in stocks and mutual funds (in USD) held by the customer.
Purpose: Purpose of the loan, related to the Loan_pending variable.
Let's start by loading the data.
df_churn = read_csv("data_churn.csv")
glimpse(df_churn)
Output:
Observations: 600
Variables: 10
$ Marital_status     <chr> "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "No",...
$ Is_graduate        <chr> "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Ye...
$ Income             <int> 7000, 8990, 13330, 13670, 19230, 23450, 24000, 2471...
$ Loan_pending       <dbl> 900.0, 809.1, 1199.7, 1230.3, 1730.7, 1876.0, 1920....
$ Satisfaction_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Sa...
$ Churn              <chr> "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No",...
$ Age                <int> 29, 29, 25, 29, 25, 33, 37, 46, 28, 35, 35, 32, 27,...
$ Sex                <chr> "F", "M", "M", "M", "M", "M", "M", "M", "M", "M", "...
$ Investment         <dbl> 2100, 6293, 9331, 9569, 13461, 16415, 16800, 17297,...
$ Purpose            <chr> "Travel", "Travel", "Travel", "Travel", "Travel", "...
The output shows that the dataset has four numerical variables (labeled int or dbl) and six character variables (labeled chr). We will convert the character variables into factor variables using the lines of code below.
# Columns 1, 2, 5, 6, 8, and 10 contain categorical data
factor_cols <- c(1, 2, 5, 6, 8, 10)
df_churn[, factor_cols] <- lapply(df_churn[, factor_cols], factor)
glimpse(df_churn)
Output:
Observations: 600
Variables: 10
$ Marital_status     <fct> No, Yes, Yes, Yes, No, Yes, No, No, Yes, Yes, No, Y...
$ Is_graduate        <fct> Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, No, No, No, ...
$ Income             <int> 7000, 8990, 13330, 13670, 19230, 23450, 24000, 2471...
$ Loan_pending       <dbl> 900.0, 809.1, 1199.7, 1230.3, 1730.7, 1876.0, 1920....
$ Satisfaction_score <fct> Satisfactory, Satisfactory, Satisfactory, Satisfact...
$ Churn              <fct> Yes, Yes, No, Yes, No, No, Yes, No, Yes, Yes, Yes, ...
$ Age                <int> 29, 29, 25, 29, 25, 33, 37, 46, 28, 35, 35, 32, 27,...
$ Sex                <fct> F, M, M, M, M, M, M, M, M, M, F, M, M, F, F, M, M, ...
$ Investment         <dbl> 2100, 6293, 9331, 9569, 13461, 16415, 16800, 17297,...
$ Purpose            <fct> Travel, Travel, Travel, Travel, Travel, Travel, Tra...
Next, we'll split the data into training and test sets: we build the model on the training set and evaluate its performance on the test set. This is called the holdout-validation method of evaluating model performance.
The first line of code below sets the random seed for reproducibility of results. The second line loads the caTools package that will be used for data partitioning, while the third to fifth lines create the training and test sets. The training set contains 70 percent of the data (420 observations of 10 variables) and the test set contains the remaining 30 percent (180 observations of 10 variables).
set.seed(100)
library(caTools)

spl = sample.split(df_churn$Churn, SplitRatio = 0.70)
train = subset(df_churn, spl==TRUE)
test = subset(df_churn, spl==FALSE)

print(dim(train)); print(dim(test))
Output:
[1] 420  10

[1] 180  10
The next step is to estimate the baseline accuracy, one of the initial model evaluation techniques. The code below creates the proportion table for the target class. Since the majority class of the target variable ("Yes") has a proportion of 0.68, the baseline accuracy is 68 percent.
prop.table(table(train$Churn))
Output:
       No       Yes 
0.3166667 0.6833333 
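As a quick sanity check, the baseline accuracy can also be computed directly as the proportion of the majority class; a minimal sketch in base R:
# Baseline: always predict the majority class of Churn
baseline_accuracy <- max(prop.table(table(train$Churn)))
baseline_accuracy  # ~0.68 on this split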
To fit the model, the first line of code below calls the glm() function with the argument family = "binomial", which specifies a logistic regression. The second line prints the summary of the trained model.
model_glm = glm(Churn ~ . , family="binomial", data = train)
summary(model_glm)
Output:
1Call:
2glm(formula = Churn ~ ., family = "binomial", data = train)
3
4Deviance Residuals:
5 Min 1Q Median 3Q Max
6-2.24561 -0.00004 0.00004 0.00007 2.23620
7
8Coefficients:
9 Estimate Std. Error z value Pr(>|z|)
10(Intercept) -1.025e+00 8.100e+03 0.000 0.9999
11Marital_statusYes 4.330e-01 4.566e-01 0.948 0.3430
12Is_graduateYes 9.686e-01 4.571e-01 2.119 0.0341 *
13Income 8.054e-08 9.276e-06 0.009 0.9931
14Loan_pending 1.486e-05 3.188e-05 0.466 0.6411
15Satisfaction_scoreSatisfactory 2.284e+01 7.841e+03 0.003 0.9977
16Age -6.213e-02 1.279e-02 -4.859 1.18e-06 ***
17SexM 1.857e-01 5.599e-01 0.332 0.7402
18Investment -1.604e-06 1.378e-05 -0.116 0.9073
19PurposeHome 2.002e+00 8.100e+03 0.000 0.9998
20PurposeOthers -4.128e+01 3.081e+03 -0.013 0.9893
21PurposePersonal 1.388e+00 2.568e+03 0.001 0.9996
22PurposeTravel -1.942e+01 2.030e+03 -0.010 0.9924
23---
24Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
25
26(Dispersion parameter for binomial family taken to be 1)
27
28 Null deviance: 524.44 on 419 degrees of freedom
29Residual deviance: 168.04 on 407 degrees of freedom
30AIC: 194.04
31
32Number of Fisher Scoring iterations: 19
The significance codes in the output above indicate which feature variables are statistically significant. Here, Age (marked ***) is highly significant and Is_graduate (marked *) is significant at the 0.05 level; the remaining variables have large p-values.
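If you want to pull out the significant predictors programmatically rather than reading the stars, one option is to filter the coefficient table from the model summary on its p-value column. A minimal sketch, using the conventional 0.05 cutoff (an assumption, not part of the original workflow):
# Coefficient table from the model summary
coefs <- summary(model_glm)$coefficients

# Keep rows whose p-value is below 0.05 (here: Is_graduateYes and Age)
coefs[coefs[, "Pr(>|z|)"] < 0.05, , drop = FALSE]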
Let's now evaluate the model performance, which should be higher than the baseline accuracy. We start with the training data, where the first line of code generates predictions on the training set. The second line creates the confusion matrix with a threshold of 0.5, which means that for predicted probabilities greater than or equal to 0.5, the model predicts a 'Yes' response for the 'Churn' variable. The third line prints the accuracy of the model on the training data using the confusion matrix; the accuracy comes out to be 90 percent.
We then repeat this process on the test data, where the accuracy comes out to be 89 percent.
# Predictions on the training set
predictTrain = predict(model_glm, newdata = train, type = "response")

# Confusion matrix on training data
table(train$Churn, predictTrain >= 0.5)
(114+263)/nrow(train)  # Accuracy - 90%

# Predictions on the test set
predictTest = predict(model_glm, newdata = test, type = "response")

# Confusion matrix on test data
table(test$Churn, predictTest >= 0.5)
(46+115)/nrow(test)  # Accuracy - 89%
Output:
     FALSE TRUE
  No   114   19
  Yes   24  263

[1] 0.897619

     FALSE TRUE
  No    46   11
  Yes    8  115

[1] 0.8944444
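Since the caret package is already loaded, the same test-set evaluation could be delegated to its confusionMatrix() function, which also reports sensitivity, specificity, and other metrics alongside accuracy. A sketch, reusing the 0.5 cutoff from above:
# Convert predicted probabilities into class labels at the 0.5 threshold
pred_class <- factor(ifelse(predictTest >= 0.5, "Yes", "No"),
                     levels = levels(test$Churn))

# Accuracy here should match the 0.8944 computed manually above
confusionMatrix(pred_class, test$Churn, positive = "Yes")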
RFM (Recency, Frequency, and Monetary) analysis is a technique that uses customer transaction data to determine the best customers based on how recently they have purchased, how often they purchase, and how much they spend.
For the RFM analysis, we'll use a fictitious dataset of retail store customers containing 92 observations and 3 variables, as described below:
CustId: The unique customer number.
Purchase_date: The date of purchase.
Purchase_value: Value of the purchase (in USD).
Let's load the data and look at the structure.
df_rfm = read_csv("RFM.csv")
glimpse(df_rfm)
Output:
Observations: 92
Variables: 3
$ CustId         <chr> "Id1", "Id2", "Id3", "Id4", "Id5", "Id6", "Id7", "Id8",...
$ Purchase_date  <chr> "01-Oct-19", "02-Oct-19", "03-Oct-19", "04-Oct-19", "05...
$ Purchase_value <dbl> 19.2, 19.8, 19.7, 21.3, 20.2, 18.6, 21.5, 21.3, 21.3, 2...
We are interested in customer-level analytics, so let's look at the number of unique customers using the code below.
length(unique(df_rfm$CustId))
Output:
[1] 25
The output shows there are 25 unique customers. We'll perform the RFM analysis on this data, but first we have to convert the date variable into the proper format, which is done using the first line of code below.
We can observe that the Purchase_date variable covers the period between October 1, 2019, and December 31, 2019. To calculate recency, we'll create a new variable, days_diff, that measures the difference between the date of purchase and a reference date, which is set to January 1, 2020. The second line of code creates this variable, while the third line prints the data structure.
df_rfm$Purchase_date = as.Date(df_rfm$Purchase_date, "%d-%b-%y")

df_rfm$days_diff = round(as.numeric(difftime(time1 = "2020-01-01",
                                             time2 = df_rfm$Purchase_date,
                                             units = "days")), 0)

glimpse(df_rfm)
Output:
Observations: 92
Variables: 4
$ CustId         <chr> "Id1", "Id2", "Id3", "Id4", "Id5", "Id6", "Id7", "Id8",...
$ Purchase_date  <date> 2019-10-01, 2019-10-02, 2019-10-03, 2019-10-04, 2019-1...
$ Purchase_value <dbl> 19.2, 19.8, 19.7, 21.3, 20.2, 18.6, 21.5, 21.3, 21.3, 2...
$ days_diff      <dbl> 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81, 80, 79,...
We are now ready to perform the RFM analysis and will use the sqldf package, which lets us run SQL queries on data frames. The package is loaded in the first line of code below, while the second line performs the computation. The output is stored in the new data frame RFM_data, whose structure is displayed with the third line of code.
library(sqldf)

# Compute recency, frequency, and average purchase amount
RFM_data = sqldf("SELECT CustId,
                         MIN(days_diff) AS 'Recency',
                         COUNT(*) AS 'Frequency',
                         AVG(Purchase_value) AS 'Monetary'
                  FROM df_rfm GROUP BY 1")

glimpse(RFM_data)
Output:
Observations: 25
Variables: 4
$ CustId    <chr> "Id1", "Id10", "Id11", "Id12", "Id13", "Id14", "Id15", "Id16...
$ Recency   <dbl> 72, 63, 22, 21, 20, 19, 18, 17, 10, 9, 8, 1, 7, 6, 5, 4, 37,...
$ Frequency <int> 3, 3, 2, 2, 2, 3, 3, 3, 4, 4, 4, 8, 6, 5, 5, 5, 2, 2, 4, 4, ...
$ Monetary  <dbl> 23.20000, 25.83333, 32.45000, 32.85000, 33.95000, 32.23333, ...
The resulting data frame has the variable CustId and the calculated variables Recency, Frequency, and Monetary. There are 25 observations, in line with the number of unique customers in the original data.
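If you prefer to stay within the tidyverse, the same aggregation can be written with dplyr, which was loaded at the start of this guide; this sketch should reproduce RFM_data up to row ordering:
# Recency = days since most recent purchase; Frequency = purchase count;
# Monetary = average purchase value per customer
RFM_dplyr <- df_rfm %>%
  group_by(CustId) %>%
  summarise(Recency = min(days_diff),
            Frequency = n(),
            Monetary = mean(Purchase_value))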
There are several ways in which the RFM analysis can be used for marketing-related decisions. One approach is to look at the percentile values of these metrics, which can be done with the line of code below.
summary(RFM_data)
Output:
    CustId             Recency        Frequency       Monetary    
 Length:25          Min.   : 1.00   Min.   :2.00   Min.   :23.20  
 Class :character   1st Qu.: 9.00   1st Qu.:3.00   1st Qu.:28.98  
 Mode  :character   Median :19.00   Median :4.00   Median :32.12  
                    Mean   :24.04   Mean   :3.68   Mean   :30.82  
                    3rd Qu.:25.00   3rd Qu.:4.00   3rd Qu.:32.85  
                    Max.   :72.00   Max.   :8.00   Max.   :36.07  
We can draw the following inferences from the above output.
Recency: The mean recency for customers is 24 days, while the 75th percentile value is 25 days.
Frequency: The mean frequency for customers is 3.68 purchases, while the 75th percentile value is 4 purchases.
Monetary: The mean purchase value is $30.82, while the 75th percentile value is $32.85.
These metrics vary across customers, so the marketing team can develop a customized strategy to target customers based on their recency, frequency, and monetary values, as sketched below.
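For instance, one simple, illustrative way to act on these percentiles is to flag "best customers": those in the top quartile on frequency and monetary value and the bottom quartile on recency (lower recency means a more recent purchase). The quartile cutoffs and the best_customer label below are assumptions for illustration, not the only reasonable scheme:
# Quartile-based flags (recency: lower is better)
RFM_data$best_customer <- RFM_data$Recency <= quantile(RFM_data$Recency, 0.25) &
  RFM_data$Frequency >= quantile(RFM_data$Frequency, 0.75) &
  RFM_data$Monetary >= quantile(RFM_data$Monetary, 0.75)

table(RFM_data$best_customer)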
In this guide, you learned about two of the most popular use cases of marketing analytics, customer churn prediction and RFM analysis.
In Part 2, you'll learn about clustering, sales forecasting, and other techniques.
To learn more about data science with R, please refer to the following guides: