The domain of analytics that addresses how computers understand text is called Natural Language Processing (NLP). NLP has multiple applications like sentiment analysis, chatbots, AI agents, social media analytics, as well as text classification. In this guide, you will learn how to build a supervised machine learning model on text data, using the popular statistical programming language, 'R'.
The data for this guide comes from Kaggle, a machine learning competition website. It is a women's clothing e-commerce dataset consisting of reviews written by customers. The task is to predict whether a customer will recommend the product. We use a sample of the original dataset that contains 500 rows and three variables, as described below:

1. Clothing ID: This is the unique ID.
2. Review Text: Text containing reviews by the customer.
3. Recommended IND: Binary variable stating whether the customer recommends the product ("1") or not ("0"). This is the target variable.

Let us start by loading the required libraries and the data.
    library(readr)
    library(dplyr)

    # Text mining packages
    library(tm)
    library(SnowballC)

    # Loading the data
    t1 <- read_csv("ml_text_data.csv")
    glimpse(t1)

    Observations: 500
    Variables: 3
    $ Clothing_ID     <int> 1088, 996, 936, 856, 1047, 862, 194, 1117, 996...
    $ Review_Text     <chr> "Yummy, soft material, but very faded looking....
    $ Recommended_IND <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
The above output shows that the data has three variables, but the important ones are 'Review_Text' and 'Recommended_IND'.
Since the text data is not in the traditional format of observations in rows, and variables in columns, we will have to perform certain text-specific steps. The list of such steps is discussed in the subsequent sections.
The variable containing text needs to be converted to a corpus for preprocessing. A corpus is a collection of documents. The first line of code below performs this task. The second line prints the content of the first document in the corpus, while the third line prints the corresponding recommendation score.

    corpus = Corpus(VectorSource(t1$Review_Text))
    corpus[[1]][1]
    t1$Recommended_IND[1]

    [1] "Yummy, soft material, but very faded looking. so much so that i am sending it back. if a faded look is something you like, then this is for you."
    [1] 0
Looking at the review text, it is obvious that the customer was not happy with the product, and hence gave the recommendation score of zero.
The model needs to treat words like 'soft' and 'Soft' as the same. Hence, all the words are converted to lowercase with the lines of code below.

    # content_transformer() lets tm_map apply a base R function such as tolower to the corpus
    corpus = tm_map(corpus, content_transformer(tolower))
    corpus[[1]][1]

    [1] "yummy, soft material, but very faded looking. so much so that i am sending it back. if a faded look is something you like, then this is for you."
The next step is to remove punctuation, that is, everything that isn't a standard number or letter.
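To see the effect of lowercasing and punctuation removal on a single string, here is a minimal sketch. It assumes the tm package is installed and uses removePunctuation() directly on a plain character string rather than on a corpus; the variable names are illustrative.

```r
library(tm)

# One sentence from the first review, kept as a plain character string
review <- "Yummy, soft material, but very faded looking."

# Lowercase first, then strip punctuation characters
cleaned <- removePunctuation(tolower(review))
print(cleaned)
```

The corpus-level calls in this guide do the same thing for every review at once.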
    corpus = tm_map(corpus, removePunctuation)
    corpus[[1]][1]

    [1] "yummy soft material but very faded looking so much so that i am sending it back if a faded look is something you like then this is for you"
Stopwords are common words such as 'i', 'is', 'at', 'me', and 'our'. They occur very frequently in the corpus but do not help in differentiating the target classes, so removing them is important.
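For a quick look at what stopword removal does, the sketch below works on a single made-up sentence instead of the corpus. It assumes the tm package is installed; the sentence and variable names are illustrative, not taken from the dataset.

```r
library(tm)

# stopwords("english") returns a character vector of common English words
sw <- stopwords("english")
print(head(sw))

# removeWords() deletes whole words only, so "clothes" survives when "cloth" is removed
text <- "this is a soft cloth and these are clothes"
kept <- removeWords(text, c("cloth", sw))

# Removal leaves gaps behind; collapse and trim them for display
kept <- trimws(stripWhitespace(kept))
print(kept)
```

Because matching is done on whole words, only the exact tokens supplied are dropped.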
The line of code below uses the tm_map function on the 'corpus' and removes stopwords, as well as the word 'cloth'. The word 'cloth' is removed because this dataset consists of clothing reviews, so this word will not add any predictive power to the model.

    corpus = tm_map(corpus, removeWords, c("cloth", stopwords("english")))
    corpus[[1]][1]

    [1] "yummy soft material faded looking much sending back faded look something like"
The idea behind stemming is to reduce the number of inflectional forms of words appearing in the text. For example, words such as "argue", "argued", "arguing", "argues" are reduced to their common stem "argu". This helps in decreasing the size of the vocabulary space. The lines of code below perform the stemming on the corpus.
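Before the corpus-level call, the example above can be checked on the words themselves. This is a minimal sketch, assuming the SnowballC package is installed and that its English stemmer is the one in play; the variable names are illustrative.

```r
library(SnowballC)

# The inflected forms mentioned in the text
words <- c("argue", "argued", "argues", "arguing")

# wordStem() stems each element of a character vector
stems <- wordStem(words, language = "english")
print(stems)

# Number of distinct vocabulary entries before and after stemming
print(c(before = length(unique(words)), after = length(unique(stems))))
```

All four forms collapse to one stem, which is how stemming shrinks the vocabulary. The corpus-level call follows.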
    corpus = tm_map(corpus, stemDocument)
    corpus[[1]][1]

    [1] "yummi soft materi fade look much send back fade look someth like"
The most commonly used text preprocessing steps are complete. Now we are ready to extract the word frequencies, which will be used as features in our prediction problem. The line of code below uses the function called DocumentTermMatrix from the tm package and generates a matrix. The rows in the matrix correspond to the documents, in our case reviews, and the columns correspond to words in those reviews. Each value in the matrix is the number of times the word appears in that document.

    frequencies = DocumentTermMatrix(corpus)
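The layout of a document-term matrix is easier to see on a tiny input. The sketch below assumes the tm package is installed and uses two made-up documents; the names toy_corpus, toy_dtm, and toy_matrix are illustrative.

```r
library(tm)

# Two tiny, already-cleaned documents
docs <- c("soft fabric soft", "fabric faded")
toy_corpus <- Corpus(VectorSource(docs))

# One row per document, one column per term, cells hold term counts
toy_dtm <- DocumentTermMatrix(toy_corpus)
toy_matrix <- as.matrix(toy_dtm)
print(toy_matrix)
```

Here "soft" is counted twice in the first document and zero times in the second, which is the kind of zero cell that dominates a real review corpus.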
The above command results in a matrix that contains zeroes in many of the cells, a problem called sparsity. It is advisable to remove terms that are absent from almost all documents. The line of code below keeps only the terms whose sparsity is below 99.5 percent, that is, terms that appear in more than 0.5 percent of the reviews.

    sparse = removeSparseTerms(frequencies, 0.995)
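The threshold is easier to reason about on a small case. The sketch below assumes the tm package is installed and uses four made-up documents with a threshold of 0.6 instead of 0.995; all names are illustrative.

```r
library(tm)

# Four tiny documents: "soft" appears in 3, "fit" in 2, every other term in 1
docs <- c("soft fabric", "soft color", "soft fit", "great fit")
toy_dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))
print(Terms(toy_dtm))

# Keep terms whose share of zero entries is below 60 percent,
# which here means terms present in at least 2 of the 4 documents
reduced <- removeSparseTerms(toy_dtm, 0.6)
print(Terms(reduced))
```

A value closer to 1 is more permissive, so 0.995 removes only the very rarest terms.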
The final data preparation step is to convert the matrix into a data frame, a format widely used in 'R' for predictive modeling. The first line of code below converts the matrix into a data frame called 'tSparse'. The second line makes all the variable names R-friendly, while the third line of code adds the dependent variable to the data set.

    tSparse = as.data.frame(as.matrix(sparse))
    colnames(tSparse) = make.names(colnames(tSparse))
    tSparse$recommended_id = t1$Recommended_IND
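To illustrate why the column names are passed through make.names, here is a small base R sketch. The example terms are hypothetical and chosen only to show the kinds of names that would otherwise cause trouble in a model formula.

```r
# Hypothetical terms: a reserved word, a name starting with a digit,
# a name containing a hyphen, and an ordinary word
raw_names <- c("next", "1st", "t-shirt", "soft")

# make.names() turns each one into a syntactically valid R name
safe_names <- make.names(raw_names)
print(data.frame(raw = raw_names, safe = safe_names))
```

Ordinary words pass through unchanged, so most term columns keep their original names.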
Now we are ready for building the predictive model. But before that, it is always a good idea to set the baseline accuracy of the model. The baseline accuracy, in the case of a classification problem, is the proportion of the majority label in the target variable. The line of code below prints the proportion of the labels in the target variable, 'recommended_id'.
    prop.table(table(tSparse$recommended_id)) # 73.6% is the baseline accuracy

        0     1
    0.264 0.736
The above output shows that 73.6 percent of the reviews are from customers who recommended the product. This becomes the baseline accuracy for predictive modeling.
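The idea of a baseline can be made concrete with a few lines of base R. The label vector below is hypothetical: it is built only to match the reported proportions (500 reviews, 73.6 percent recommended) and stands in for the real target column.

```r
# Hypothetical labels matching the reported class proportions
labels <- c(rep(1, 368), rep(0, 132))

# Share of each class and the most frequent class
class_share <- prop.table(table(labels))
majority_class <- as.numeric(names(which.max(class_share)))

# A "model" that always predicts the majority class
always_majority <- rep(majority_class, length(labels))
baseline_accuracy <- mean(always_majority == labels)
print(baseline_accuracy)
```

Any trained model should score above this figure to be worth using.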
For evaluating how the predictive model is performing, we will divide the data into training and test data. The first line of code below loads the caTools package, which will be used for creating the training and test data. The second line sets the 'random seed' so that the results are reproducible.
The third line partitions the data so that 70% of it is kept for training the model. The fourth and fifth lines of code create the training ('trainSparse') and testing ('testSparse') datasets.

    library(caTools)
    set.seed(100)
    split = sample.split(tSparse$recommended_id, SplitRatio = 0.7)
    trainSparse = subset(tSparse, split==TRUE)
    testSparse = subset(tSparse, split==FALSE)
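One property of sample.split worth checking is that it splits within each label, so both partitions keep roughly the original class balance. The sketch below assumes the caTools package is installed and uses a hypothetical label vector built to match the reported proportions, not the real data.

```r
library(caTools)

set.seed(100)

# Hypothetical labels matching the reported class proportions
labels <- c(rep(1, 368), rep(0, 132))

# TRUE marks rows assigned to the training set
split <- sample.split(labels, SplitRatio = 0.7)

# Share of recommended reviews overall, in training, and in test
shares <- c(all = mean(labels),
            train = mean(labels[split]),
            test = mean(labels[!split]))
print(round(shares, 3))
print(table(split))
```

Keeping the class balance similar across partitions makes the test accuracy comparable to the baseline.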
The Random Forest classification algorithm is a collection of several classification trees that operate as an ensemble. It is one of the most robust machine learning algorithms. In 'R', the randomForest library can be used to build the random forest model, which is loaded in the first line of code below. The second line sets the random state for reproducibility, while the third and fourth lines of code convert the target variable into the 'factor' type.
The fifth line trains the random forest algorithm on the training data, while the sixth line uses the trained model to predict on the test data. The seventh line prints the confusion matrix.

    library(randomForest)
    set.seed(100)
    trainSparse$recommended_id = as.factor(trainSparse$recommended_id)
    testSparse$recommended_id = as.factor(testSparse$recommended_id)

    # Lines 5 to 7: train, predict, and print the confusion matrix
    RF_model = randomForest(recommended_id ~ ., data=trainSparse)
    predictRF = predict(RF_model, newdata=testSparse)
    table(testSparse$recommended_id, predictRF)

    # Accuracy
    117/(117+33) # 78%
       predictRF
          0   1
      0  12  28
      1   5 105

    [1] 0.78
The above output shows that out of 150 records in the test data, the model got the predictions correct for 117 of them, giving an accuracy of 78 percent.
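The accuracy can also be computed from the confusion matrix itself rather than typed in by hand. The sketch below rebuilds the reported matrix in base R; the object names are illustrative, and the per-class figures are an extra check that the original does not report.

```r
# Confusion matrix as reported: rows are actual labels, columns are predictions
cm <- matrix(c(12, 5, 28, 105), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
print(cm)

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)

# Share of each actual class that was predicted correctly
recall_by_class <- diag(cm) / rowSums(cm)
print(round(recall_by_class, 3))
```

The per-class shares suggest the model is considerably stronger on recommended reviews than on non-recommended ones, which overall accuracy alone does not show.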
The baseline accuracy we had set for our data was 73.6 percent. The Random Forest model beats this baseline by achieving an accuracy score of 78 percent.
In this guide, you have learned the fundamentals of text cleaning and pre-processing using the powerful statistical programming language, 'R'. You also learned how to build and evaluate a random forest classification model on the text data. The random forest model outperformed the baseline method.