Extract Key Phrases from Text in Azure Machine Learning Studio

In this guide, you will learn how to use Azure Machine Learning Studio to extract key phrases from a text corpus using medical literature as an example.

By Deepika Singh

Oct 8, 2020 • 8 Minute Read

Introduction

One of the key areas of natural language processing (NLP) is extracting one or more meaningful phrases from a corpus of text. Key phrase extraction is important in many situations, mostly in the consumer space. In this guide, you will learn how to use the Extract Key Phrases from Text module in Azure Machine Learning Studio to extract key phrases from a text corpus.

Problem Statement and Data

In this guide, you will take up the task of automating reviews in medicine. Medical literature is voluminous and rapidly changing, which increases the need for reviews. Often such reviews are done manually, which is tedious and time-consuming. You will try to extract key phrases from the input variable, abstract.

The dataset you will use comes from a PubMed search, and contains 1,748 observations and four variables, as described below.

title: Variable that consists of the titles of papers retrieved
abstract: Variable that contains the abstracts of papers retrieved
trial: Variable indicating whether the paper is a clinical trial testing a drug therapy for cancer
class: Target variable which indicates whether the paper is a clinical trial (Yes) or not (No)

Start by loading the data into the workspace.

Load Data

Once you have logged into your Azure Machine Learning Studio account, click the EXPERIMENTS option, listed on the left sidebar, followed by the NEW button.

Next, click the blank experiment option, and a new workspace will open. Name the workspace Azure ML Experiment.

Next, load the data into the workspace. Click NEW, and select the DATASET option.

This opens a window where you can upload the dataset from your local system.

Once the data is loaded, it appears under Saved Datasets with the file name nlpdata2.csv. Drag it from the Saved Datasets list into the workspace. To explore the data, right-click the dataset and select the Visualize option.

The visualization shows 1,748 rows and four columns.
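
Outside Studio, the same shape check can be sketched in a few lines of Python. This is a minimal sketch, assuming a CSV with the four columns described earlier. The two rows are made-up placeholders rather than records from nlpdata2.csv, and summarize_csv is a hypothetical helper name.

```python
import csv
import io

# Made-up placeholder rows with the same four column names as the dataset.
# These are NOT real records from nlpdata2.csv, and the cell values are
# arbitrary stand-ins.
SAMPLE_CSV = """title,abstract,trial,class
Example title A,Example abstract text A.,placeholder,Yes
Example title B,Example abstract text B.,placeholder,No
"""


def summarize_csv(handle):
    """Return (row_count, column_names) for a CSV file-like object."""
    reader = csv.DictReader(handle)
    rows = list(reader)  # reading the rows also populates reader.fieldnames
    return len(rows), list(reader.fieldnames)


n_rows, columns = summarize_csv(io.StringIO(SAMPLE_CSV))
print(n_rows, columns)
```

To run the check on the real file, pass an open file handle for nlpdata2.csv instead of the in-memory string. If the file matches the description above, the row count should be 1,748.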

Prepare Text

It is important to pre-process text before you run the module to extract key phrases from the corpus. Common pre-processing steps include:

Remove punctuation: The rule of thumb is to remove every character that is not a letter or a digit.
Remove stop words: Stop words are common words such as 'the', 'is', or 'at'. They occur frequently in the corpus but do not help differentiate the target classes. Removing them also reduces the data size.
Conversion to lowercase: Words like 'Clinical' and 'clinical' need to be considered as one word. Hence, words with a capital letter are converted to lowercase.
Stemming: The goal of stemming is to reduce the number of inflectional forms of words appearing in the text. Words such as "argue," "argued," "arguing," and "argues" are reduced to their common stem, "argu". This decreases the size of the vocabulary space.
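
The four steps above can be sketched in plain Python to make them concrete. This is an illustration only, not what the Preprocess Text module runs internally. The stop word list is a tiny made-up sample, and toy_stem is a hypothetical suffix stripper, far cruder than a real stemmer such as Porter's.

```python
import re

# A tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "is", "at", "a", "an", "of", "and", "to", "in", "on"}

# Suffixes tried in order by the toy stemmer (this is not the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "e", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation, remove stop words, then stem."""
    lowered = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace
    # with a space so that words on either side stay separated.
    no_punct = re.sub(r"[^a-z0-9\s]", " ", lowered)
    tokens = [tok for tok in no_punct.split() if tok not in STOP_WORDS]
    return [toy_stem(tok) for tok in tokens]


print(preprocess("The Clinical trial is arguing, argued, and argues!"))
# prints ['clinical', 'trial', 'argu', 'argu', 'argu']
```

The order matters: lowercasing first lets the stop word check match 'The' as well as 'the', and stemming last keeps the stop word list in its readable, unstemmed form.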

The Preprocess Text module performs these steps as well as other text cleaning steps. Search for the module, drag it into the workspace, and connect it to the dataset.

You must specify the text variable to be preprocessed. To do this, click on the Launch column selector option, and select the abstract variable.

Run the experiment and click on Visualize to see the result.

The Preprocessed abstract variable contains the processed text. Comparing it with the abstract variable shows the effect of text preprocessing.

Extract Key Phrases

You have performed the pre-processing step, and the corpus is ready for key phrase extraction. In Azure Machine Learning Studio, the Extract Key Phrases from Text module performs this task. Search for the module and drag it into the workspace.

This module builds upon the natural language processing APIs for key phrase extraction. The module captures the context of the sentence in the form of phrases. To specify the text variable, click the module. Next, click the Launch column selector option, and select the Preprocessed abstract variable.

Run the module and once the run is completed, right-click and select the Visualize option.

The visualization shows how a long text corpus is converted into more meaningful key phrases or words. The key phrases from the first records include day day, patient tetracosactrin, and mg tetracosactrin.
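
This guide does not describe how the module picks its phrases, so the sketch below is a stand-in technique rather than the module's algorithm: it counts how often each two-word sequence occurs across already preprocessed documents and returns the most frequent ones. The function name top_phrases and the token lists are made up for illustration.

```python
from collections import Counter


def top_phrases(token_lists, n=2, k=2):
    """Return the k most frequent n-word phrases across tokenized documents."""
    counts = Counter()
    for tokens in token_lists:
        # Slide a window of n tokens over each document.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i : i + n])] += 1
    return [phrase for phrase, _ in counts.most_common(k)]


# Made-up, already preprocessed token lists used only for illustration.
docs = [
    ["patient", "receiv", "drug", "dose"],
    ["drug", "dose", "improv", "patient", "outcom"],
    ["patient", "outcom", "drug", "dose"],
]
print(top_phrases(docs))
# prints ['drug dose', 'patient outcom']
```

A pure frequency count like this also explains why repeated-word phrases such as day day can surface once stop words and punctuation have been stripped out of the text.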

Conclusion

In this guide, you learned how to perform key phrase extraction with Azure Machine Learning Studio. Key phrase extraction has several application areas, such as monitoring social media and brand sentiment analysis. Some media houses use keyword extraction to understand trending topics, which they use in content production. Research companies use keyword extraction to identify the most representative words in survey responses. Another prominent application is in Search Engine Optimization (SEO), where the main objective is to extract strategic keywords for targeted marketing.

