
Implementing Marketing Analytics in R: Part 2

Feb 15, 2020 • 16 Minute Read

Introduction

Marketing and customer-related decisions are a top priority for every business. With the help of statistical modeling and analytics, it's possible to support decision makers and help them make strategic decisions based on data, not just instincts. The integration of statistical modeling with marketing strategy can also be referred to as marketing analytics.

In this series of two guides, we explore techniques for implementing marketing analytics in R.

In the previous guide, Part 1, we covered:

  • Customer Churn Prediction

  • RFM Analysis

In this guide, Part 2, we will cover:

  • Clustering

  • Sales Forecasting

  • Other Areas

Let's start by loading the required libraries.

      library(plyr)
library(readr)
library(dplyr)
library(caret)
library(ggplot2)
library(repr)
    

Segmentation

Segmentation is the marketing strategy of dividing or segmenting customers to build and deploy effective marketing plans. The basis of segmenting can be demographic, geographic, or any other metric. The statistical technique of dividing the customers into homogeneous segments is called clustering. Let's start by loading the data.

Data

For this clustering exercise, we'll use a fictitious dataset of retail store customers containing 150 observations and 6 variables, as described below:

  1. mean_HH_income: The mean household income of the customer.

  2. ABV: Average bill value per purchase by a customer.

  3. No_bills_month: Average number of bills in a month from the customer.

  4. satisfaction_score: The customer's satisfaction score. The higher the satisfaction score, the more satisfied the customer is with the store.

  5. family_size: Number of members in the customer's family.

  6. Gender: Whether the customer is female ('1') or male ('0').

Let's load the data and look at the structure.

      df_seg = read_csv("segdata.csv")
glimpse(df_seg)
    

Output:

      Observations: 150
Variables: 6
$ mean_HH_income     <int> 88006, 88005, 88004, 88003, 88002, 88001, 88000, 64...
$ ABV                <int> 355, 354, 353, 352, 351, 350, 349, 348, 347, 346, 3...
$ No_bills_month     <int> 41, 21, 21, 53, 21, 44, 49, 22, 10, 58, 25, 52, 41,...
$ satisfaction_score <int> 1, 3, 4, 4, 2, 1, 3, 4, 4, 3, 5, 1, 5, 3, 3, 5, 4, ...
$ family_size        <int> 4, 5, 3, 7, 6, 9, 6, 5, 8, 10, 3, 3, 3, 8, 7, 3, 3,...
$ Gender             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
    

The next step is to look at the summary of the numerical variables.

      summary(df_seg)
    

Output:

      mean_HH_income       ABV         No_bills_month  satisfaction_score
 Min.   :45000   Min.   : 26.00   Min.   : 5.00   Min.   :1.000     
 1st Qu.:52349   1st Qu.: 62.25   1st Qu.:16.00   1st Qu.:2.000     
 Median :64277   Median :286.50   Median :26.50   Median :3.000     
 Mean   :59873   Mean   :215.39   Mean   :30.20   Mean   :2.873     
 3rd Qu.:64314   3rd Qu.:323.75   3rd Qu.:44.75   3rd Qu.:4.000     
 Max.   :88006   Max.   :355.00   Max.   :60.00   Max.   :5.000     
  family_size         Gender      
 Min.   : 1.000   Min.   :0.0000  
 1st Qu.: 3.000   1st Qu.:0.0000  
 Median : 6.000   Median :0.0000  
 Mean   : 5.513   Mean   :0.3333  
 3rd Qu.: 8.000   3rd Qu.:1.0000  
 Max.   :10.000   Max.   :1.0000
    

From the output above, we can infer that the variables are on very different scales, which will adversely affect the distance calculations used in clustering. To prevent this, we'll normalize the data using the preProcess function from the caret package.

The first line of code below creates the pre-processing object, while the second line applies it to normalize the data. If we look at the summary of the segNorm data, we'll see that all the variables now have a mean of zero.

      p1_object = preProcess(df_seg)

segNorm = predict(p1_object, df_seg)

summary(segNorm)
    

Output:

      mean_HH_income         ABV          No_bills_month    satisfaction_score
 Min.   :-1.6354   Min.   :-1.5124   Min.   :-1.4713   Min.   :-1.31892  
 1st Qu.:-0.8274   1st Qu.:-1.2229   1st Qu.:-0.8291   1st Qu.:-0.61487  
 Median : 0.4842   Median : 0.5679   Median :-0.2160   Median : 0.08918  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
 3rd Qu.: 0.4883   3rd Qu.: 0.8654   3rd Qu.: 0.8495   3rd Qu.: 0.79323  
 Max.   : 3.0935   Max.   : 1.1149   Max.   : 1.7399   Max.   : 1.49728  
  family_size          Gender       
 Min.   :-1.5167   Min.   :-0.7047  
 1st Qu.:-0.8446   1st Qu.:-0.7047  
 Median : 0.1635   Median :-0.7047  
 Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.8356   3rd Qu.: 1.4095  
 Max.   : 1.5077   Max.   : 1.4095
    
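
By default, preProcess() centers and scales the numeric variables, which is what we relied on above. If you prefer to make this explicit, or want to apply the same transformation to new customers later, you can spell out the method argument. Here is a minimal sketch; new_customers is a hypothetical data frame with the same columns as df_seg:

      # Explicit version of the default behaviour: subtract the mean, divide by the standard deviation
p1_object = preProcess(df_seg, method = c("center", "scale"))
segNorm = predict(p1_object, df_seg)

# The fitted pre-processing object can normalize new data on the same scale
# new_norm = predict(p1_object, new_customers)   # new_customers is hypothetical
    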

After normalizing the data, the next step is to run the clustering algorithm. We'll use the popular kmeans algorithm because it is powerful and interpretable. The number of clusters we will create is five, as given by the argument centers=5 in the first line of code. The second line displays the number of observations in each cluster.

      kmeans_Clust = kmeans(segNorm, centers=5, iter.max=1000)

table(kmeans_Clust$cluster)
    

Output:

      1  2  3  4  5 
31 44 13 21 41
    

From the output above, we can infer that the smallest cluster, 3, has 13 observations, while the largest cluster, 2, has 44 observations. If we want to understand the clusters, we can do that with the command below.

The first line of code below adds the cluster assignment as a new variable in the original, un-normalized data, df_seg. The second line uses the split() and lapply() functions to compute and print the mean of each variable within each cluster.

      df_seg$kmeansseg = kmeans_Clust$cluster

lapply(split(df_seg, df_seg$kmeansseg), colMeans)
    

Output:

      $`1`
    mean_HH_income                ABV     No_bills_month satisfaction_score 
      51832.129032          46.645161          22.709677           2.258065 
       family_size             Gender          kmeansseg 
          4.354839           1.000000           1.000000 

$`2`
    mean_HH_income                ABV     No_bills_month satisfaction_score 
      67507.954545         311.727273          36.818182           1.909091 
       family_size             Gender          kmeansseg 
          5.204545           0.000000           2.000000 

$`3`
    mean_HH_income                ABV     No_bills_month satisfaction_score 
      51437.230769         129.076923          28.692308           2.153846 
       family_size             Gender          kmeansseg 
          7.384615           0.000000           3.000000 

$`4`
    mean_HH_income                ABV     No_bills_month satisfaction_score 
      5.182395e+04       1.424762e+02       4.519048e+01       3.619048e+00 
       family_size             Gender          kmeansseg 
      6.142857e+00       9.047619e-01       4.000000e+00 

$`5`
    mean_HH_income                ABV     No_bills_month satisfaction_score 
      64556.170732         304.292683          21.560976           4.219512 
       family_size             Gender          kmeansseg 
          5.804878           0.000000           5.000000
    
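
For this exercise we fixed the number of clusters at five. In practice, a common way to check whether that is a reasonable choice is the elbow method: run kmeans for a range of cluster counts and plot the total within-cluster sum of squares, looking for the point where adding more clusters stops paying off. Here is a minimal sketch on the normalized data; the seed and the range of k are arbitrary choices:

      # Elbow method: total within-cluster sum of squares for k = 1 to 10
set.seed(123)   # kmeans uses random starting centers
wss <- sapply(1:10, function(k) {
  kmeans(segNorm, centers = k, iter.max = 1000, nstart = 25)$tot.withinss
})

plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
    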

Sales Forecasting

Forecasting needs no introduction; it is an important function for marketing teams and is most commonly used to forecast sales. Let's start by loading the data.

Data

For this forecasting exercise, we'll use fictitious monthly sales data containing 81 observations, and the task will be to create a forecasting model to predict sales over the next 12 months.

      library(forecast)
library(fpp2)
library(TTR)
library(dplyr)

dat <- read_csv("salesforecasting.csv")
glimpse(dat)
    

Output:

      Observations: 81
Variables: 3
$ Date  <chr> "01-Mar-13", "01-Apr-13", "01-May-13", "01-Jun-13", "01-Jul-13",...
$ Sales <int> 192, 198, 197, 213, 202, 186, 215, 213, 213, 220, 210, 219, 233,...
$ Class <chr> "Train", "Train", "Train", "Train", "Train", "Train", "Train", "...
    

The next step is to create training and test sets for model validation. The first couple of lines of code below perform this task by using the Class variable, which already has labels Train and Test. The third line of code prints the number of rows in the training and the test data. Note that the test data set has 12 rows because we will be building the model to forecast for 12 months ahead.

      dat_train = subset(dat, Class == 'Train')
dat_test = subset(dat, Class == 'Test')
 
nrow(dat_train); nrow(dat_test)
    

Output:

      [1] 69
[1] 12
    

Preparing the Time Series Object

To run forecasting models in R, we need to convert the data into a time series object, which is done in the first line of code below. The start and end arguments specify the times of the first and last observations, respectively. The argument frequency specifies the number of observations per unit of time.

We will also create a utility function for calculating the Mean Absolute Percentage Error (MAPE), which will be used to evaluate the performance of the forecasting models. The lower the MAPE value, the better the forecasting model. This is done in the remaining lines of the code block below.

      dat_ts <- ts(dat_train[, 2], start = c(2013, 3), end = c(2018, 11), frequency = 12)
 
# Utility function: Mean Absolute Percentage Error (MAPE)
mape <- function(actual, pred){
  mape <- mean(abs((actual - pred)/actual))*100
  return (mape)
}
    
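
As a quick sanity check of the helper, with made-up numbers rather than values from the dataset:

      mape(c(100, 200), c(110, 190))
# [1] 7.5  --> (|100 - 110|/100 + |200 - 190|/200) / 2 * 100
    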

With the data and the mape function prepared, we move to the forecasting technique to be used in this case.

Holt's Trend Method

We'll use Holt's trend method, a classical exponential smoothing algorithm that accounts for the trend component while generating forecasts. This method involves two smoothing equations, one for the level and one for the trend component.
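
For reference, in the standard textbook form of Holt's method, the level, trend, and h-step-ahead forecast equations are:

      Level:    l[t]   = alpha * y[t] + (1 - alpha) * (l[t-1] + b[t-1])
Trend:    b[t]   = beta * (l[t] - l[t-1]) + (1 - beta) * b[t-1]
Forecast: y[t+h] = l[t] + h * b[t]
    

Here y[t] is the observed value, l[t] the level, b[t] the trend, and alpha and beta are the smoothing parameters that holt() estimates and reports in the summary below.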

The first line of code below fits the Holt's trend model and stores it in the object holt_model. The second line prints the model summary and the forecasts for the next 12 months.

      holt_model <- holt(dat_ts, h = 12)
summary(holt_model)
    

Output:

      Forecast method: Holt's method

Model Information:
Holt's method 

Call:
 holt(y = dat_ts, h = 12) 

  Smoothing parameters:
    alpha = 0.8342 
    beta  = 1e-04 
    
    Initial states:
    l = 193.7099 
    b = 3.4287 

  sigma:  16.2391

     AIC     AICc      BIC 
682.6969 683.6492 693.8674 

Error measures:
                     ME     RMSE     MAE        MPE     MAPE      MASE        ACF1
Training set -0.8870433 15.76137 11.8633 -0.4614957 4.084311 0.2366019 0.004106845

Forecasts:
         Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
Dec 2018       382.4585 361.6473 403.2698 350.6305 414.2866
Jan 2019       385.8799 358.7765 412.9833 344.4288 427.3309
Feb 2019       389.3012 357.1116 421.4908 340.0715 438.5309
Mar 2019       392.7225 356.1461 429.2989 336.7838 448.6612
Apr 2019       396.1438 355.6521 436.6355 334.2171 458.0706
May 2019       399.5651 355.5036 443.6267 332.1789 466.9514
Jun 2019       402.9865 355.6225 450.3504 330.5496 475.4233
Jul 2019       406.4078 355.9563 456.8593 329.2489 483.5667
Aug 2019       409.8291 356.4676 463.1906 328.2197 491.4385
Sep 2019       413.2504 357.1288 469.3721 327.4198 499.0811
Oct 2019       416.6717 357.9187 475.4248 326.8168 506.5267
Nov 2019       420.0931 358.8209 481.3652 326.3854 513.8008
    

The output above shows that the MAPE for the training data is 4.1 percent. Let's now evaluate the model's performance on the test data, which is done in the lines of code below. The MAPE on the test data comes out to 2.7 percent, which is an improvement over the training data result.

      df_holt = as.data.frame(holt_model)
dat_test$holt = df_holt$`Point Forecast`
mape(dat_test$Sales, dat_test$holt)
    

Output:

      [1] 2.735082
    
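
It can also help to compare the forecasts and the held-out actuals visually. Here is a minimal sketch using autoplot() and autolayer() from the forecast package; the test observations are turned into a ts object starting in December 2018, the first month after the training period:

      # Plot the fitted model, its 12-month forecast, and the actual test values
test_ts <- ts(dat_test$Sales, start = c(2018, 12), frequency = 12)

autoplot(holt_model) +
  autolayer(test_ts, series = "Actual") +
  xlab("Year") + ylab("Sales")
    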

Overall Forecasting for the Next 12 Months

The accuracy we achieved is respectable, so we can now generate the forecast for the next 12 months using the lines of code below. This time we convert the complete series into a time series object, fit an ARIMA model with auto.arima(), and produce the 12-month forecast.

      df_forecast <- ts(dat[, 2], start = c(2013, 3), end = c(2019, 11), frequency = 12)
arima_model_ahead <- auto.arima(df_forecast)
forecast_12 = forecast::forecast(arima_model_ahead, h=12)
forecast_12
    

Output:

      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
Dec 2019       438.1473 419.0833 457.2113 408.9915 467.3031
Jan 2020       441.9584 417.3683 466.5485 404.3510 479.5658
Feb 2020       446.0598 416.9752 475.1444 401.5787 490.5409
Mar 2020       442.3232 409.3512 475.2952 391.8968 492.7496
Apr 2020       443.8119 407.3648 480.2591 388.0708 499.5530
May 2020       445.3007 405.6821 484.9193 384.7092 505.8921
Jun 2020       449.9827 407.4283 492.5370 384.9014 515.0640
Jul 2020       459.3094 414.0091 504.6097 390.0286 528.5902
Aug 2020       463.9914 416.1024 511.8804 390.7515 537.2313
Sep 2020       462.8675 412.5227 513.2122 385.8718 539.8631
Oct 2020       462.3241 409.6379 515.0103 381.7476 542.9007
Nov 2020       463.5226 408.5947 518.4505 379.5176 547.5275
    

Other Areas

Marketing analytics is a vast field, and it is not possible to cover all the concepts and techniques in one guide, or for that matter even one book. We have covered some of the most widely used marketing analytics techniques above. A few other popular ones are mentioned below.

  1. Customer Lifetime Value (or CLV): By modeling each customer's CLV, you can identify which customers are most valuable to the business.

  2. Natural Language Processing (or NLP): NLP is a domain in itself and is increasingly being used for sentiment analysis, topic modeling, and text classification.

  3. Dimensionality Reduction: This technique is often used with survey and CRM data, which can contain a large number of variables.

  4. Affinity Analysis: This technique identifies which items have a higher likelihood of being bought together.

  5. Marketing Campaign Optimization: In this technique, you can model the effectiveness of a marketing campaign and optimize your marketing budget.

Conclusion