Polynomial Functions Analysis with R

Aug 24, 2020 • 7 Minute Read

Introduction

When you have feature points aligned in almost a straight line, you can use simple linear regression or multiple linear regression (in the case of multiple feature points). But how will you fit a function on a feature(s) whose points are non-linear? In this guide you will learn to implement polynomial functions that are suitable for non-linear points. You will be working in R and should already have a basic knowledge on regression to follow along.

Describing the Original Data and Creating Train and Test Data

Original Data

Consider a dependent variable Ft1 and an independent variable Ft2 with 19 data points as shown:

Ft1	Ft2
38	23
52	85
36	67
92	15
83	200
170	180
140	35
201	156
112	99
132	43
80	92
134	62
150	250
160	240
190	270
145	220
166	260
120	155
142	133

You can visualize the complete data using the ggplot2 library as shown:

      # Load ggplot2 library
library(ggplot2)

# Load data from a CSV file
data <- read.csv("file.csv")

# Visualize the data
ggplot(data) +
  geom_point(aes(Ft1, Ft2),size=3) +
  theme_bw()
    

Creating Train and Test Data

You can split the original data into train and test in a ratio of 75:25 with the following code:

      # Set a seed value for reproducible results
set.seed(70)

# Split the data
ind <- sample(x = nrow(data), size = floor(0.75 * nrow(data)))

# Store the value in train and test dataframes
train <- data[ind,]
test <- data[-ind,]
    

Building Polynomial Regression of Different Degrees

To build a polynomial regression in R, start with the lm function and adjust the formula parameter value. You must know that the "degree" of a polynomial function must be less than the number of unique points.

At this point, you have only 14 data points in the train dataframe, therefore the maximum polynomial degree that you can have is 13. The given code builds four polynomial functions of degree 1, 3, 5, and 9.

      # Order 1
poly_reg1 <- lm(formula = Ft1~poly(Ft2,1), 
                data = train)
# Order 3
poly_reg3 <- lm(formula = Ft1~poly(Ft2,3), 
                data = train)
# Order 5
poly_reg5 <- lm(formula = Ft1~poly(Ft2,5), 
                data = train)
# Order 9
poly_reg9 <- lm(formula = Ft1~poly(Ft2,9), 
                data = train)
    

Once you have successfully built these four models you can visualize them on your training data using the given ggplot code:

      ggplot(train) +
  geom_point(aes(Ft2, Ft1, col = "Original"), cex = 2) +
  stat_smooth(method = "lm", formula = y~poly(x,1), aes(Ft2, poly_reg1$fitted.values, col = "Order 1")) +
  stat_smooth(method = "lm", formula = y~poly(x,3), aes(Ft2, poly_reg3$fitted.values, col = "Order 3")) +
  stat_smooth(method = "lm", formula = y~poly(x,5), aes(Ft2, poly_reg5$fitted.values, col = "Order 5")) +
  stat_smooth(method = "lm", formula = y~poly(x,9), aes(Ft2, poly_reg9$fitted.values, col = "Order 9")) +
  scale_colour_manual("",
                      breaks = c("Original",  "Order 1", "Order 3", "Order 5", "Order 9"),
                      values = c("red","cyan", "blue","orange","green")) +
  theme_bw()
    

Measuring the RSS Value on Train and Test Data

You have all the information to get the RSS value on train data, but to get the RSS value of test data, you need to predict the Ft1 values. Use the given code to do so:

      # Predicting values using test data by each model
poly1_pred <- predict(object = poly_reg1, 
                      newdata =  data.frame(Ft2 = test$Ft2))
poly3_pred <- predict(object = poly_reg3, 
                      newdata =  data.frame(Ft2 = test$Ft2))
poly5_pred <- predict(object = poly_reg5, 
                      newdata =  data.frame(Ft2 = test$Ft2))
poly9_pred <- predict(object = poly_reg9, 
                      newdata =  data.frame(Ft2 = test$Ft2))
    

Now, you can find RSS values for both the data as shown:

      # RSS for train data based on each model
train_rss1 <- mean((train$Ft1 - poly_reg1$fitted.values)^2)  # Order 1
train_rss3 <- mean((train$Ft1 - poly_reg3$fitted.values)^2)  # Order 3
train_rss5 <- mean((train$Ft1 - poly_reg5$fitted.values)^2)  # Order 5
train_rss9 <- mean((train$Ft1 - poly_reg9$fitted.values)^2)  # Order 9

# RSS for test data based on each model
test_rss1 <- mean((test$Ft1 - poly1_pred)^2)  # Order 1
test_rss3 <- mean((test$Ft1 - poly3_pred)^2)  # Order 3
test_rss5 <- mean((test$Ft1 - poly5_pred)^2)  # Order 5
test_rss9 <- mean((test$Ft1 - poly9_pred)^2)  # Order 9
    

train_rss1	train_rss3	train_rss5	train_rss9
0.000	1673.867	1405.703	436.5995

test_rss1	test_rss3	test_rss5	test_rss9
42004.6972	669.9034	725.9172	5385.6354

From the above two tables you can observe that the RSS value for train data starts to decrease after the first degree, which means the higher the degree better the curve fitting and reduced error. However, at the same time the test RSS increases with the increase of the degree, which implies underfitting. You can observe these patterns from the given plot.

      # Visualizing train and test RSS for each model
 
# excluding degree 1
train_rss <- scale(c(train_rss3, train_rss5, train_rss9)) # scaling 
test_rss <- scale(c(test_rss3, test_rss5, test_rss9))      # scaling
orders <- c(3, 5, 9)

ggplot() +
  geom_line(aes(orders, train_rss, col = "Train RSS")) +
  geom_line(aes(orders, test_rss, col = "Test RSS")) +
  scale_colour_manual("", 
                      breaks = c("Train RSS", "Test RSS"), 
                      values = c("green", "red")) +
  theme_bw() + ylab("RSS")
    

From this plot you can deliver an insight that only the polynomial of degree five is optimal for this data, as it will give the lowest error for both the train and the test data.

Conclusion

You have learned to apply polynomial functions of various degrees in R. You observed how underfitting and overfitting can occur in a polynomial model and how to find an optimal polynomial degree function to reduce error for both train and test data.

Ft1	Ft2
38	23
52	85
36	67
92	15
83	200
170	180
140	35
201	156
112	99
132	43
80	92
134	62
150	250
160	240
190	270
145	220
166	260
120	155
142	133

Ft1	Ft2
38	23
52	85
36	67
92	15
83	200
170	180
140	35
201	156
112	99
132	43
80	92
134	62
150	250
160	240
190	270
145	220
166	260
120	155
142	133

Ft1	Ft2
38	23
52	85
36	67
92	15
83	200
170	180
140	35
201	156
112	99
132	43
80	92
134	62
150	250
160	240
190	270
145	220
166	260
120	155
142	133