Natural Language Processing (or NLP) is ubiquitous and has multiple applications. A few use cases include email classification into spam and ham, chatbots, AI agents, social media analysis, and classifying customer or employee feedback into positive, negative, or neutral. This guide will demonstrate how to build a supervised machine learning model on text data with Azure Machine Learning Studio.
In this guide, you will take up the task of automating reviews in medicine. Medical literature is voluminous and rapidly changing, which increases the need for reviews. Often such reviews are done manually, which is tedious and time-consuming. You will try to address this problem by building a text classification model that will automate the process.
The dataset you will use comes from a PubMed search, and contains 1,748 observations and four variables, as described below.
title: Variable that consists of the titles of papers retrieved
abstract: Variable that contains the abstracts of papers retrieved
trial: Variable indicating whether the paper is a clinical trial testing a drug therapy for cancer
class: Target variable which indicates whether the paper is a clinical trial (Yes) or not (No)
You will start by loading the data.
Once you have logged into your Azure Machine Learning Studio account, click on the EXPERIMENTS option, listed on the left sidebar, followed by the NEW button.
Next, click on the blank experiment and a new workspace will open. Give the name "Text Analytics" to the workspace.
Next you will load the data into the workspace. Click NEW, and select the DATASET option shown below.
This selection opens the window shown below, which you can use to upload the dataset from your local system.
Once the data is loaded, you can see it in the Saved Datasets option. The file name is nlpdata2.csv. The next step is to drag it from the Saved Datasets list into the workspace. To explore this data, right-click and select the Visualize option as shown below.
Select the different variables to examine the basic statistics. For example, the image below displays the details for the target variable.
You will notice that the target variable takes two unique values. It is also displayed as a string feature, which needs to be converted to a categorical feature.
Start by typing "edit metadata" in the search bar to find the Edit Metadata module, and then drag it into the workspace.
The next step is to click on the Launch column selector option, and select the
Once you have made this selection, the selected column will be displayed in the workspace. Next, from the dropdown options under Categorical, select the Make categorical option. Next, run the experiment.
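As a rough illustration of what converting a string column to a categorical one means, the sketch below maps each distinct label to an integer code. The label values are toy data, and the snippet is not how the Edit Metadata module is implemented.

```python
# Toy labels standing in for the "class" column; the values are illustrative only.
labels = ["Yes", "No", "No", "Yes", "No"]

# Collect the distinct levels in a stable (sorted) order.
categories = sorted(set(labels))

# Map each level to an integer code, then encode every row.
code_of = {level: code for code, level in enumerate(categories)}
codes = [code_of[value] for value in labels]

print(categories)  # ['No', 'Yes']
print(codes)       # [1, 0, 0, 1, 0]
```

Treating the column as a fixed set of levels, rather than free text, is what lets a classifier use it as a two-class target.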
Now you are ready to build the text classifier. However, this is where things begin to get trickier in NLP. The data is in raw text format, which cannot be used as features. So, this requires text pre-processing.
The common pre-processing steps are given below.
The Preprocess Text module is used to perform the above as well as other text cleaning steps. Search for the module and drag it into the workspace. Connect it to the output port of the Edit Metadata module.
You must specify the text variable to be preprocessed. To do this, click on the Launch column selector option, and select the
There are several text cleaning options, and because this is clinical research data that may have a complex structure, you will select all the options. Run the experiment.
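To make the idea of text cleaning concrete, here is a minimal Python sketch of a few typical steps: lowercasing, removing URLs, numbers, and special characters, and dropping stop words. The stop-word list and the sample sentence are made up for illustration, and the function is a hypothetical stand-in, not the Preprocess Text module's actual logic.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to", "was", "with", "for"}

def preprocess(text: str) -> str:
    """Apply a few common cleaning steps and return the cleaned text."""
    text = text.lower()                          # normalize case
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"\d+", " ", text)             # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)        # remove special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

# Made-up sample sentence, not taken from the dataset.
sample = "A Phase 2 trial of Cisplatin in 48 patients: see http://example.org/x."
print(preprocess(sample))  # phase trial cisplatin patients see
```

The order matters in this sketch: URLs are removed before special characters, otherwise fragments of the address would survive as separate tokens.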
The next step is to explore the resulting pre-processed data. Right-click and select the Visualize option.
The output below shows that one more variable, Preprocessed abstract, is added, which contains the changes made in the Preprocess Text module.
Run the experiment.
You have preprocessed the text, and the next step is to generate a set of features. This is done with the Feature Hashing module, which takes the text variable and converts it into a set of features represented as integers. Search for the Feature Hashing module and drag it into the workspace.
Click on the Launch column selector option, and select the Preprocessed abstract variable.
Next, use the Hashing bitsize parameter to specify the number of bits to use when creating the hash table. Keep the default value of ten. Then set the N-grams parameter to two. This parameter defines the maximum length of the word sequences, so a value of two creates two-word sequences (bigrams) along with unigrams. Run the experiment.
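The hashing idea can be sketched in a few lines of Python. This is a simplified illustration, not the module's implementation: the function names are hypothetical, and `zlib.crc32` is used only as a convenient stand-in hash. Each unigram and bigram is hashed into one of 2**bitsize buckets, and the bucket counts form the feature vector.

```python
import zlib

def ngrams(tokens, n_max=2):
    """Return all word sequences of length 1 up to n_max."""
    out = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def hash_features(text, bitsize=10, n_max=2):
    """Map a text to a fixed-length count vector with 2**bitsize buckets."""
    size = 2 ** bitsize
    vector = [0] * size
    for gram in ngrams(text.split(), n_max):
        # Stand-in hash (assumption): any stable hash function would do here.
        bucket = zlib.crc32(gram.encode("utf-8")) % size
        vector[bucket] += 1
    return vector

vec = hash_features("phase trial cisplatin patients", bitsize=10, n_max=2)
print(len(vec))  # 1024
print(sum(vec))  # 7  (4 unigrams + 3 bigrams)
```

Because the vector length is fixed by the bit size, texts of any length map to the same number of columns. Different n-grams can collide in one bucket, which is the price paid for the fixed size.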
Once the experiment run is complete, right-click and select the Visualize option.
Completing the above step will result in the output below. You can see that new features have been added to the data, which now has 1,748 observations and 37 columns.
You have created new features, and the next step is to select the variables of interest. The Select Columns in Dataset module performs this task. Drag it to the workspace.
Select the variables of interest with Launch column selector. The target variable, class, and the preprocessed hashed features will be included in model building.
Run the experiment.
You have converted the text data into a set of independent variables and a target variable. The next step is to build the machine learning model. You will build the classifier with the Two-Class Boosted Decision Tree module. Search for it and drag it into the workspace.
This module creates a binary classifier using a boosted decision tree algorithm. It is an ensemble method in which every tree builds upon the previous tree by correcting its errors. For the data used in this guide, every tree makes predictions on the target variable, class. The final predictions are based on the entire ensemble of trees taken together.
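The idea that each tree corrects the errors of the ones before it can be shown with a deliberately simplified sketch. It assumes one-split trees (stumps) fitted to squared-error residuals on toy data; the function names are hypothetical, and this is not the algorithm or code used by the Azure module.

```python
def fit_stump(X, residuals):
    """Pick the one-feature threshold split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue  # skip splits that leave one side empty
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    return best[1:]  # (feature index, threshold, left value, right value)

def stump_value(stump, row):
    """Return the stump's output for one row."""
    j, threshold, left_mean, right_mean = stump
    return left_mean if row[j] <= threshold else right_mean

def boost(X, y, n_trees=20, learning_rate=0.5):
    """Fit stumps one after another, each on the current residual errors."""
    base = sum(y) / len(y)              # start from the mean label
    scores = [base] * len(y)
    stumps, mse_history = [], []
    for _ in range(n_trees):
        residuals = [yi - si for yi, si in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [si + learning_rate * stump_value(stump, row)
                  for si, row in zip(scores, X)]
        mse_history.append(sum((yi - si) ** 2 for yi, si in zip(y, scores)) / len(y))
    return base, stumps, mse_history

# Toy data (made up): one numeric feature, label 1 only in the middle range.
X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 1, 1, 0, 0]
base, stumps, mse_history = boost(X, y)
print(round(mse_history[0], 4), round(mse_history[-1], 4))
```

No single stump can separate the middle range from both ends, but the training error shrinks as stumps are added, which is the behavior the ensemble relies on.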
The next step is to specify the parameters of the Two-Class Boosted Decision Tree module. To do this, click on the module and you will see several training parameters. For Create trainer mode, select the Single Parameter option that is used when you know how you want to configure the algorithm. The second parameter is Maximum number of leaves per tree, which indicates the maximum number of terminal nodes to be created in any tree. Set this value to 20. Fill the other options as shown below.
Model validation plays an integral part in building powerful and robust machine learning models. Model validation helps ensure that the model performs well on new data, and helps in selecting the best model, the parameters, and the accuracy metrics. The Cross Validate Model module performs this task in Azure Machine Learning Studio. Search and drag the Cross Validate Model module into the workspace, and create the connections as shown below.
You can see the red flag in the Cross Validate Model, which needs to be corrected. Click on the Launch column selector option, and select the target variable, class, as shown below.
Run the experiment.
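The fold mechanics behind cross-validation can be sketched as follows. The helper name is hypothetical, and the sketch splits rows in their original order for simplicity; it is not how the Cross Validate Model module assigns rows to folds.

```python
def k_fold_indices(n_rows, k=10):
    """Split row indices 0..n_rows-1 into k nearly equal, non-overlapping folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

n_rows = 1748  # number of observations in the dataset described above
folds = k_fold_indices(n_rows, k=10)

for fold_id, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train_idx = [i for i in range(n_rows) if i not in held_out]
    # A model would be trained on train_idx and evaluated on test_idx here.

print([len(f) for f in folds])  # [175, 175, 175, 175, 175, 175, 175, 175, 174, 174]
```

Every row is held out exactly once, so each fold's metrics come from data the model did not see during that fold's training.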
You have built the predictive model on text data, and the next step is to evaluate the model performance. The right output port contains the evaluation results by fold. Right-click and select the Visualize option.
The following output will be displayed to show the evaluation results by folds. There are ten folds, zero through nine, and for every fold you have results across several metrics such as accuracy, precision, recall, and so on.
If you scroll downwards, you will see the mean results across the ten folds.
From the above output, you can infer that the mean accuracy, F-score, and AUC values for the boosted tree model are 0.70, 0.64, and 0.75, respectively. These results indicate satisfactory model performance.
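For reference, accuracy, precision, recall, and F-score can be computed from predicted and actual labels as in the sketch below. The function name and the label lists are made up for illustration and are unrelated to the results reported above; AUC is omitted because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive="Yes"):
    """Compute accuracy, precision, recall, and F-score for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}

# Made-up labels for illustration only.
y_true = ["Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes"]
y_pred = ["Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case there are three true positives, three true negatives, one false positive, and one false negative, so all four metrics come out to 0.75.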
Natural language processing is an emerging area of data science and artificial intelligence. You can see text classification at work when you open your Gmail account, where emails are often classified into Primary, Social, and Promotions. Facebook and chatbots are other common application areas. Even traditional industries like banking and manufacturing have adopted text classification.
To learn more about data science and machine learning using Azure Machine Learning Studio, please refer to the following guides: