Marketing and customer-related decisions are a top priority for every business. With the help of statistical modeling and analytics, it's possible to support decision makers and help them make strategic decisions based on data, not just instincts. The integration of statistical modeling with marketing strategy can also be referred to as marketing analytics.
In this series of two guides, you will learn important techniques for implementing marketing analytics in R.
This guide, Part 1, will cover:
Customer Churn Prediction
RFM Analysis
The next guide, Part 2, will cover:
Clustering
Sales Forecasting
Let's start by loading the required libraries.
library(plyr)
library(readr)
library(dplyr)
library(caret)
library(ggplot2)
library(repr)
Customer acquisition is more expensive than retention, which is why it makes business sense to retain customers, especially profitable ones. Machine learning models can estimate the probability that a customer will leave, or churn; this estimate can then be used to identify valuable customers at risk and retain them. We'll build a logistic regression model to predict customer churn.
In this guide, we will use a fictitious dataset of retail banking customers containing 600 observations and 10 variables, as described below:
Marital_status: Whether the customer is married ("Yes") or not ("No").
Is_graduate: Whether the customer is a graduate ("Yes") or not ("No").
Income: Annual income of the customer (in USD).
Loan_pending: Outstanding loan amount (in USD) still to be paid by the customer.
Satisfaction_score: Satisfaction level of the customer.
Churn: Whether the customer churned ("Yes") or not ("No").
Age: The customer's age in years.
Sex: Whether the customer is male ("M") or female ("F").
Investment: Total investment in stocks and mutual funds (in USD) held by the customer.
Purpose: Purpose of the loan, related to the Loan_pending variable.
Let's start by loading the data.
df_churn = read_csv("data_churn.csv")
glimpse(df_churn)
Output:
Observations: 600
Variables: 10
$ Marital_status     <chr> "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "No",...
$ Is_graduate        <chr> "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Ye...
$ Income             <int> 7000, 8990, 13330, 13670, 19230, 23450, 24000, 2471...
$ Loan_pending       <dbl> 900.0, 809.1, 1199.7, 1230.3, 1730.7, 1876.0, 1920....
$ Satisfaction_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Sa...
$ Churn              <chr> "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No",...
$ Age                <int> 29, 29, 25, 29, 25, 33, 37, 46, 28, 35, 35, 32, 27,...
$ Sex                <chr> "F", "M", "M", "M", "M", "M", "M", "M", "M", "M", "...
$ Investment         <dbl> 2100, 6293, 9331, 9569, 13461, 16415, 16800, 17297,...
$ Purpose            <chr> "Travel", "Travel", "Travel", "Travel", "Travel", "...
The output shows that the dataset has four numerical variables (labeled int or dbl) and six character variables (labeled chr). We will convert the character variables into factor variables using the lines of code below.
# Columns 1, 2, 5, 6, 8, and 10 contain categorical data
factor_cols <- c(1, 2, 5, 6, 8, 10)
df_churn[, factor_cols] <- lapply(df_churn[, factor_cols], factor)
glimpse(df_churn)
Output:
Observations: 600
Variables: 10
$ Marital_status     <fct> No, Yes, Yes, Yes, No, Yes, No, No, Yes, Yes, No, Y...
$ Is_graduate        <fct> Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, No, No, No, ...
$ Income             <int> 7000, 8990, 13330, 13670, 19230, 23450, 24000, 2471...
$ Loan_pending       <dbl> 900.0, 809.1, 1199.7, 1230.3, 1730.7, 1876.0, 1920....
$ Satisfaction_score <fct> Satisfactory, Satisfactory, Satisfactory, Satisfact...
$ Churn              <fct> Yes, Yes, No, Yes, No, No, Yes, No, Yes, Yes, Yes, ...
$ Age                <int> 29, 29, 25, 29, 25, 33, 37, 46, 28, 35, 35, 32, 27,...
$ Sex                <fct> F, M, M, M, M, M, M, M, M, M, F, M, M, F, F, M, M, ...
$ Investment         <dbl> 2100, 6293, 9331, 9569, 13461, 16415, 16800, 17297,...
$ Purpose            <fct> Travel, Travel, Travel, Travel, Travel, Travel, Tra...
Next, we'll split the data into training and test sets: we build the model on the training set and evaluate its performance on the test set. This is called the holdout-validation method of evaluating model performance.
The first line of code below sets the random seed for reproducibility of results. The second line loads the caTools package that will be used for data partitioning, while the third to fifth lines create the training and test sets. The training set contains 70 percent of the data (420 observations of 10 variables) and the test set contains the remaining 30 percent (180 observations of 10 variables).
set.seed(100)
library(caTools)

spl = sample.split(df_churn$Churn, SplitRatio = 0.70)
train = subset(df_churn, spl==TRUE)
test = subset(df_churn, spl==FALSE)

print(dim(train)); print(dim(test))
Output:
[1] 420  10

[1] 180  10
The next step is to estimate the baseline accuracy, one of the initial model evaluation techniques. The code below creates the proportion table for the target class. Since the majority class of the target variable ("Yes") has a proportion of 0.68, the baseline accuracy is 68 percent.
prop.table(table(train$Churn))
Output:
       No       Yes 
0.3166667 0.6833333 
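As a quick sanity check, the baseline accuracy can also be computed directly as the proportion of the majority class; a minimal sketch in base R:
# Baseline: always predict the majority class of Churn
baseline_accuracy <- max(prop.table(table(train$Churn)))
baseline_accuracy  # ~0.68 on this split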
To fit the model, the first line of code below calls the glm() function with the argument family = "binomial", which specifies a logistic regression. The second line prints the summary of the trained model.
model_glm = glm(Churn ~ . , family="binomial", data = train)
summary(model_glm)
Output:
1Call:
2glm(formula = Churn ~ ., family = "binomial", data = train)
3
4Deviance Residuals:
5 Min 1Q Median 3Q Max
6-2.24561 -0.00004 0.00004 0.00007 2.23620
7
8Coefficients:
9 Estimate Std. Error z value Pr(>|z|)
10(Intercept) -1.025e+00 8.100e+03 0.000 0.9999
11Marital_statusYes 4.330e-01 4.566e-01 0.948 0.3430
12Is_graduateYes 9.686e-01 4.571e-01 2.119 0.0341 *
13Income 8.054e-08 9.276e-06 0.009 0.9931
14Loan_pending 1.486e-05 3.188e-05 0.466 0.6411
15Satisfaction_scoreSatisfactory 2.284e+01 7.841e+03 0.003 0.9977
16Age -6.213e-02 1.279e-02 -4.859 1.18e-06 ***
17SexM 1.857e-01 5.599e-01 0.332 0.7402
18Investment -1.604e-06 1.378e-05 -0.116 0.9073
19PurposeHome 2.002e+00 8.100e+03 0.000 0.9998
20PurposeOthers -4.128e+01 3.081e+03 -0.013 0.9893
21PurposePersonal 1.388e+00 2.568e+03 0.001 0.9996
22PurposeTravel -1.942e+01 2.030e+03 -0.010 0.9924
23---
24Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
25
26(Dispersion parameter for binomial family taken to be 1)
27
28 Null deviance: 524.44 on 419 degrees of freedom
29Residual deviance: 168.04 on 407 degrees of freedom
30AIC: 194.04
31
32Number of Fisher Scoring iterations: 19
The significance codes in the output above indicate which feature variables are statistically significant. Here, Age (marked ***) is highly significant and Is_graduate (marked *) is significant at the 0.05 level; the remaining variables have large p-values.
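If you want to pull out the significant predictors programmatically rather than reading the stars, one option is to filter the coefficient table from the model summary on its p-value column. A minimal sketch, using the conventional 0.05 cutoff (an assumption, not part of the original workflow):
# Coefficient table from the model summary
coefs <- summary(model_glm)$coefficients

# Keep rows whose p-value is below 0.05 (here: Is_graduateYes and Age)
coefs[coefs[, "Pr(>|z|)"] < 0.05, , drop = FALSE]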
Let's now evaluate the model performance, which should be higher than the baseline accuracy. We start with the training data, where the first line of code generates predictions on the training set. The second line creates the confusion matrix with a threshold of 0.5, which means that for predicted probabilities greater than or equal to 0.5, the model predicts a 'Yes' response for the 'Churn' variable. The third line prints the accuracy of the model on the training data using the confusion matrix; the accuracy comes out to be 90 percent.
We then repeat this process on the test data, where the accuracy comes out to be 89 percent.
# Predictions on the training set
predictTrain = predict(model_glm, newdata = train, type = "response")

# Confusion matrix on training data
table(train$Churn, predictTrain >= 0.5)
(114+263)/nrow(train)  # Accuracy - 90%

# Predictions on the test set
predictTest = predict(model_glm, newdata = test, type = "response")

# Confusion matrix on test data
table(test$Churn, predictTest >= 0.5)
(46+115)/nrow(test)  # Accuracy - 89%
Output:
     FALSE TRUE
  No   114   19
  Yes   24  263

[1] 0.897619

     FALSE TRUE
  No    46   11
  Yes    8  115

[1] 0.8944444
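Since the caret package is already loaded, the same test-set evaluation could be delegated to its confusionMatrix() function, which also reports sensitivity, specificity, and other metrics alongside accuracy. A sketch, reusing the 0.5 cutoff from above:
# Convert predicted probabilities into class labels at the 0.5 threshold
pred_class <- factor(ifelse(predictTest >= 0.5, "Yes", "No"),
                     levels = levels(test$Churn))

# Accuracy here should match the 0.8944 computed manually above
confusionMatrix(pred_class, test$Churn, positive = "Yes")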
RFM (Recency, Frequency, and Monetary) analysis is a technique that uses customer transaction data to determine the best customers based on how recently they have purchased, how often they purchase, and how much they spend.
For the RFM analysis, we'll use a fictitious dataset of retail store customers containing 92 observations and 3 variables, as described below:
CustId: The unique customer number.
Purchase_date: The date of purchase.
Purchase_value: Value of the purchase (in USD).
Let's load the data and look at the structure.
df_rfm = read_csv("RFM.csv")
glimpse(df_rfm)
Output:
Observations: 92
Variables: 3
$ CustId         <chr> "Id1", "Id2", "Id3", "Id4", "Id5", "Id6", "Id7", "Id8",...
$ Purchase_date  <chr> "01-Oct-19", "02-Oct-19", "03-Oct-19", "04-Oct-19", "05...
$ Purchase_value <dbl> 19.2, 19.8, 19.7, 21.3, 20.2, 18.6, 21.5, 21.3, 21.3, 2...
We are interested in customer-level analytics, so let's look at the number of unique customers using the code below.
length(unique(df_rfm$CustId))
Output:
[1] 25
The output shows there are 25 unique customers. We'll perform the RFM analysis on this data, but first we have to convert the date variable into the proper format, which is done using the first line of code below.
We can observe that the Purchase_date variable covers the period between October 1, 2019, and December 31, 2019. To calculate recency, we'll create a new variable, days_diff, that measures the difference between the date of purchase and a reference date, which is set to January 1, 2020. The second line of code creates this variable, while the third line prints the data structure.
df_rfm$Purchase_date = as.Date(df_rfm$Purchase_date, "%d-%b-%y")

df_rfm$days_diff = round(as.numeric(difftime(time1 = "2020-01-01",
                                             time2 = df_rfm$Purchase_date,
                                             units = "days")), 0)

glimpse(df_rfm)
Output:
Observations: 92
Variables: 4
$ CustId         <chr> "Id1", "Id2", "Id3", "Id4", "Id5", "Id6", "Id7", "Id8",...
$ Purchase_date  <date> 2019-10-01, 2019-10-02, 2019-10-03, 2019-10-04, 2019-1...
$ Purchase_value <dbl> 19.2, 19.8, 19.7, 21.3, 20.2, 18.6, 21.5, 21.3, 21.3, 2...
$ days_diff      <dbl> 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81, 80, 79,...
We are now ready to perform the RFM analysis and will use the sqldf package, which lets us run SQL queries on data frames. The package is loaded in the first line of code below, while the second line performs the computation. The output is stored in the new data frame RFM_data, whose structure is displayed with the third line of code.
library(sqldf)

# Compute recency, frequency, and average purchase amount
RFM_data = sqldf("SELECT CustId,
                         MIN(days_diff) AS 'Recency',
                         COUNT(*) AS 'Frequency',
                         AVG(Purchase_value) AS 'Monetary'
                  FROM df_rfm GROUP BY 1")

glimpse(RFM_data)
Output:
Observations: 25
Variables: 4
$ CustId    <chr> "Id1", "Id10", "Id11", "Id12", "Id13", "Id14", "Id15", "Id16...
$ Recency   <dbl> 72, 63, 22, 21, 20, 19, 18, 17, 10, 9, 8, 1, 7, 6, 5, 4, 37,...
$ Frequency <int> 3, 3, 2, 2, 2, 3, 3, 3, 4, 4, 4, 8, 6, 5, 5, 5, 2, 2, 4, 4, ...
$ Monetary  <dbl> 23.20000, 25.83333, 32.45000, 32.85000, 33.95000, 32.23333, ...
The resulting data frame has the variable CustId and the calculated variables Recency, Frequency, and Monetary. There are 25 observations, in line with the number of unique customers in the original data.
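If you prefer to stay within the tidyverse, the same aggregation can be written with dplyr, which was loaded at the start of this guide; this sketch should reproduce RFM_data up to row ordering:
# Recency = days since most recent purchase; Frequency = purchase count;
# Monetary = average purchase value per customer
RFM_dplyr <- df_rfm %>%
  group_by(CustId) %>%
  summarise(Recency = min(days_diff),
            Frequency = n(),
            Monetary = mean(Purchase_value))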
There are several ways in which the RFM analysis can be used for marketing-related decisions. One approach is to look at the percentile values of these metrics, which can be done with the line of code below.
summary(RFM_data)
Output:
    CustId             Recency        Frequency       Monetary    
 Length:25          Min.   : 1.00   Min.   :2.00   Min.   :23.20  
 Class :character   1st Qu.: 9.00   1st Qu.:3.00   1st Qu.:28.98  
 Mode  :character   Median :19.00   Median :4.00   Median :32.12  
                    Mean   :24.04   Mean   :3.68   Mean   :30.82  
                    3rd Qu.:25.00   3rd Qu.:4.00   3rd Qu.:32.85  
                    Max.   :72.00   Max.   :8.00   Max.   :36.07  
We can draw the following inferences from the above output.
Recency: The mean recency for customers is 24 days, while the 75th percentile value is 25 days.
Frequency: The mean frequency for customers is 3.68 purchases, while the 75th percentile value is 4 purchases.
Monetary: The mean purchase value is $30.82, while the 75th percentile value is $32.85.
These metrics vary across customers, so the marketing team can develop a customized strategy to target customers based on their recency, frequency, and monetary values, as sketched below.
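For instance, one simple, illustrative way to act on these percentiles is to flag "best customers": those in the top quartile on frequency and monetary value and the bottom quartile on recency (lower recency means a more recent purchase). The quartile cutoffs and the best_customer label below are assumptions for illustration, not the only reasonable scheme:
# Quartile-based flags (recency: lower is better)
RFM_data$best_customer <- RFM_data$Recency <= quantile(RFM_data$Recency, 0.25) &
  RFM_data$Frequency >= quantile(RFM_data$Frequency, 0.75) &
  RFM_data$Monetary >= quantile(RFM_data$Monetary, 0.75)

table(RFM_data$best_customer)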
In this guide, you learned about two of the most popular use cases of marketing analytics, customer churn prediction and RFM analysis.
In Part 2, you'll learn about clustering, sales forecasting, and other techniques.
To learn more about data science with R, please refer to the following guides: