Deepika Singh

# Building Features from Text Data

• Jul 19, 2019
• 9,268 Views
Data
Python

## Introduction

Text data is different from structured tabular data and, therefore, building features from it requires a different approach. In this guide, you will learn how to extract features from raw text for predictive modeling. You will also learn how to perform text preprocessing steps and create TF-IDF and Bag-of-words (BOW) feature matrices. We will begin by exploring the data.

## Data

In this guide, we will be using tweet data about the company 'Apple'. The objective is to create features that can be used for building a sentiment predictor model.

The dataset contains 1181 observations and 3 variables, as described below:

1. Tweet: Consists of the text of the tweets.

2. Avg: Average sentiment of the tweets (-2 means extremely negative, while +2 means extremely positive). This classification was done using Amazon Mechanical Turk.

3. Sentiment: Consists of the sentiment labels - positive, negative, and neutral.

```python
# Import required libraries
import re
import string
import warnings

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords

# Jupyter/IPython magic for inline plots; remove it when running as a plain script
%matplotlib inline
warnings.filterwarnings("ignore", category=DeprecationWarning)

# If the stopword corpus is not installed yet, run nltk.download('stopwords') once
stop = stopwords.words('english')
```

The first line of code below reads in the data as a pandas DataFrame, while the second line prints the shape: 1,181 observations of 3 variables. The third line prints the first five observations.

```python
dat = pd.read_csv('datatweets.csv')
print(dat.shape)
dat.head(5)
```

Output:

```
(1181, 3)
```

|   | Tweet                                             | Avg  | Sentiment |
|---|---------------------------------------------------|------|-----------|
| 0 | iphone 5c is ugly as heck what the freak @appl... | -2.0 | Negative  |
| 1 | freak YOU @APPLE                                  | -2.0 | Negative  |
| 2 | freak you @apple                                  | -2.0 | Negative  |
| 3 | @APPLE YOU RUINED MY LIFE                         | -2.0 | Negative  |
| 4 | @apple I hate apple!!!!!                          | -2.0 | Negative  |

We will start with a basic analysis of the data. The line of code below prints the number of tweets for each 'Sentiment' label. The output shows that the negative label has the most tweets (541), while the positive label has the fewest (303).

```python
# Count the tweets for each sentiment label
dat.groupby('Sentiment')['Tweet'].count()
```

Output:

```
Sentiment
Negative    541
Neutral     337
Positive    303
Name: Tweet, dtype: int64
```

The sentiment score for the tweets is stored in the variable 'Avg', which ranges from -2 (extremely negative) to +2 (extremely positive). We will explore whether the average sentiment score differs across the 'Sentiment' labels. The line of code below performs this task, and the output shows that the average score is -0.74 for negative tweets and 0.57 for positive tweets.

```python
dat.groupby('Sentiment')['Avg'].mean()
```

Output:

```
Sentiment
Negative   -0.743068
Neutral     0.000000
Positive    0.574257
Name: Avg, dtype: float64
```

## Building Simple Features from Raw Text

Many simple but important features can be extracted from the raw text data, as discussed below.

### Character Length

The hypothesis is that the number of characters in a tweet varies with the sentiment it carries. The first line of code below creates a new variable 'character_cnt' that takes in the text from the 'Tweet' variable and counts its characters. The second line performs the 'groupby' operation on the 'Sentiment' label and prints the average character count for each label.

The output shows that neutral tweets have a lower character count on average (about 85) than positive (about 95) and negative (about 92) tweets. This difference can be useful for separating the neutral tweets from the other types of tweets.

```python
dat['character_cnt'] = dat['Tweet'].str.len()
dat.groupby('Sentiment')['character_cnt'].mean()
```

Output:

```
Sentiment
Negative    91.763401
Neutral     85.379822
Positive    94.825083
Name: character_cnt, dtype: float64
```

### Word Count

Just like the character count in a tweet, the word count can also be a useful feature. The first line of code below creates a new variable 'word_counts' that takes in the text from the 'Tweet' variable and counts the words in the text. The second line performs the 'groupby' operation on the 'Sentiment' label and prints the average word count across the labels.

The output shows that negative tweets have the highest average word count, suggesting that disappointed customers tend to write longer tweets. This difference can be useful for separating the 'Sentiment' labels.

```python
dat['word_counts'] = dat['Tweet'].str.split().str.len()
dat.groupby('Sentiment')['word_counts'].mean()
```

Output:

```
Sentiment
Negative    15.336414
Neutral     12.356083
Positive    14.676568
Name: word_counts, dtype: float64
```

### Average Character Length per Word

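Since we have created the 'character_cnt' and the 'word_counts' features, it is easy to take the ratio of these two variables, which gives the average number of characters per word in each tweet. Because 'character_cnt' also counts the spaces between words, the ratio slightly overstates the true average word length.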

The first line of code below creates a new variable 'characters_per_word' that is the ratio of the number of characters and the number of words in a tweet. The second line performs the 'groupby' operation on the 'Sentiment' label and prints the average character length per word across the labels.

The output shows that neutral sentiments have the highest average character length per word. This inference can be useful for separating the 'Sentiment' labels.

```python
dat['characters_per_word'] = dat['character_cnt'] / dat['word_counts']
dat.groupby('Sentiment')['characters_per_word'].mean()
```

Output:

```
Sentiment
Negative    6.191374
Neutral     7.425695
Positive    6.687928
Name: characters_per_word, dtype: float64
```

### Special Character Count

It is also possible to create a feature that contains the count of special characters like '@' or '#'. The first line of code below creates a new feature 'spl' that takes in the text from the 'Tweet' variable and counts the words starting with the special character '@'. We use the 'startswith' string method for this operation. The second line prints the first five observations of the 'Tweet' and 'spl' variables.

```python
dat['spl'] = dat['Tweet'].apply(lambda tweet: len([word for word in tweet.split() if word.startswith('@')]))
dat[['Tweet', 'spl']].head()
```

Output:

|   | Tweet                                             | spl |
|---|---------------------------------------------------|-----|
| 0 | iphone 5c is ugly as heck what the freak @appl... | 2   |
| 1 | freak YOU @APPLE                                  | 1   |
| 2 | freak you @apple                                  | 1   |
| 3 | @APPLE YOU RUINED MY LIFE                         | 1   |
| 4 | @apple I hate apple!!!!!                          | 1   |

### Number Count

Just like we created a feature on the count of words in a tweet, we can also create a feature on the count of numbers in a tweet. The first line of code below creates a new variable 'num' that takes in the text from the 'Tweet' variable and counts the whitespace-separated tokens made up entirely of digits, so a token such as '5c' is not counted. The second line performs the 'groupby' operation on the 'Sentiment' label and prints the average count of numbers across the labels.

The output shows that the neutral sentiment labels have the lowest average count of numbers in a tweet, whereas the negative tweets have the highest average.

```python
# Number of numeric tokens in each tweet
dat['num'] = dat['Tweet'].apply(lambda tweet: len([word for word in tweet.split() if word.isdigit()]))
dat.groupby('Sentiment')['num'].mean()
```

Output:

```
Sentiment
Negative    0.125693
Neutral     0.068249
Positive    0.108911
Name: num, dtype: float64
```
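The simple features built in this section can also be computed for a single string, which is handy for checking the logic without the dataset. The sketch below is a minimal standard-library illustration; the function name 'simple_text_features' is made up for this example and is not part of the original code, and the guard against empty text is an added assumption.

```python
def simple_text_features(tweet):
    """Return the simple count-based features for one tweet (illustrative helper)."""
    words = tweet.split()        # split on any whitespace
    character_cnt = len(tweet)   # counts every character, spaces included
    word_counts = len(words)
    return {
        "character_cnt": character_cnt,
        "word_counts": word_counts,
        # Guard against empty text to avoid dividing by zero (added assumption)
        "characters_per_word": character_cnt / word_counts if word_counts else 0.0,
        # Words that start with '@'
        "spl": sum(word.startswith("@") for word in words),
        # Tokens made up entirely of digits
        "num": sum(word.isdigit() for word in words),
    }


features = simple_text_features("freak YOU @APPLE")
print(features)
```

For the tweet 'freak YOU @APPLE', this gives 16 characters, 3 words, and one '@' word, which agrees with the second row of the tables shown in this guide.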

## Pre-processing the Raw Text

So far, we have created simple features from the raw text. We can also create more advanced features but, before that, we will have to clean the text. The common pre-processing steps are summarized below:

1. Removing punctuation - the rule of thumb is to remove every character that is not a word character (a letter, digit, or underscore) or whitespace. The first line of code below performs this task.

2. Removing stopwords - these are common words like 'the', 'is', 'at'. They occur very frequently in the corpus but do not help in differentiating the target classes. Removing stopwords also reduces the data size. The second line of code below performs this task.

3. Conversion to lowercase - words like 'Phone' and 'phone' need to be considered as one word. Hence, the text is converted to lowercase. The third line of code below performs this task. Because this step runs after stopword removal, and the stopword list is lowercase, capitalized stopwords such as 'YOU' are not removed, which is visible in the output below.

4. Stemming - the goal of stemming is to reduce the number of inflectional forms of words appearing in the text. This causes words such as "argue", "argued", "arguing", "argues" to be reduced to their common stem "argu". There are many ways to perform stemming, a popular one being the Porter stemmer by Martin Porter. The fourth to sixth lines of code below perform this task.

The last line of code prints a summary of all the new features that we have built so far.

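```python
# 1. Remove punctuation (regex=True is needed in pandas 2.0 and later)
dat['processedtext'] = dat['Tweet'].str.replace(r'[^\w\s]', '', regex=True)
# 2. Remove stopwords
dat['processedtext'] = dat['processedtext'].apply(lambda text: " ".join(word for word in text.split() if word not in stop))
# 3. Convert to lowercase
dat['processedtext'] = dat['processedtext'].str.lower()

# 4. Stemming (lines 4 to 6)
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
dat['processedtext'] = dat['processedtext'].apply(lambda text: " ".join(stemmer.stem(word) for word in text.split()))

# Summary of the new features built so far
dat[['character_cnt', 'word_counts', 'characters_per_word', 'spl', 'num', 'processedtext']].head()
```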

Output:

|   | character_cnt | word_counts | characters_per_word | spl | num | processedtext                               |
|---|---------------|-------------|---------------------|-----|-----|---------------------------------------------|
| 0 | 64            | 11          | 5.818182            | 2   | 0   | iphon 5c ugli heck freak appl iphonecompani |
| 1 | 16            | 3           | 5.333333            | 1   | 0   | freak you appl                              |
| 2 | 16            | 3           | 5.333333            | 1   | 0   | freak appl                                  |
| 3 | 25            | 5           | 5.000000            | 1   | 0   | appl you ruin my life                       |
| 4 | 24            | 4           | 6.000000            | 1   | 0   | appl i hate appl                            |
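In the output above, the tweets 'freak YOU @APPLE' and 'freak you @apple' end up with different processed text. This happens because stopwords are removed before the text is lowercased, so the uppercase 'YOU' does not match the lowercase stopword 'you'. The sketch below illustrates the effect of the step order on a single tweet. It is a minimal standard-library illustration under stated assumptions: 'DEMO_STOPWORDS' is a tiny hypothetical stand-in for NLTK's stopword list, 'clean_tweet' is a name made up for this example, and stemming is left out.

```python
import re

# Tiny stand-in stopword list for illustration only (the guide uses NLTK's English list)
DEMO_STOPWORDS = {"i", "you", "my", "the", "is", "as", "what"}


def clean_tweet(tweet, lowercase_first=True):
    """Strip punctuation, then lowercase and drop stopwords in the chosen order."""
    text = re.sub(r"[^\w\s]", "", tweet)  # keep only word characters and whitespace
    if lowercase_first:
        text = text.lower()
    kept = [word for word in text.split() if word not in DEMO_STOPWORDS]
    # Lowercasing here is a no-op when lowercase_first is True
    return " ".join(kept).lower()


print(clean_tweet("freak YOU @APPLE", lowercase_first=False))  # freak you apple
print(clean_tweet("freak YOU @APPLE", lowercase_first=True))   # freak apple
```

If case-insensitive stopword removal is what you want, lowercasing before the stopword step is one way to get it. Note that doing so would change the processed text, and therefore the feature matrices built from it later in this guide.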

## Term Frequency-Inverse Document Frequency (TF-IDF) Vector

We have cleaned the text, which is now stored in a new variable 'processedtext'. However, in order to use it for building machine learning models, we will have to convert it to word frequency vectors.

One of the most popular methods to do this is through the TF-IDF representation, which is used as a weighting factor in text mining applications. In simple terms, TF-IDF attempts to highlight important words that appear frequently in a document but rarely in the other documents. The terms are briefly explained below:

1. Term Frequency (TF): This measures how often a term occurs within a document.

2. Inverse Document Frequency (IDF): This reduces the weight of terms that appear a lot across documents.
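To make the two terms concrete, the sketch below computes TF-IDF weights by hand for a toy corpus. It is an illustration under stated assumptions, not the library's implementation: it uses raw counts for TF, the smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector, which is the weighting scikit-learn documents as the default for its TfidfVectorizer. Other TF-IDF definitions exist. The documents and the function name 'tfidf_vector' are made up for this example.

```python
import math
from collections import Counter

# Toy corpus for illustration (not taken from the tweet dataset)
docs = ["appl hate appl", "love appl", "hate wait"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    term_freq = Counter(tokens)  # raw count of each term in the document
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


for doc, tokens in zip(docs, tokenized):
    print(doc, {term: round(w, 3) for term, w in tfidf_vector(tokens).items()})
```

In the second document, 'love' receives a higher weight than 'appl' even though both occur once, because 'love' appears in fewer documents of the corpus.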

Now, we will work on creating the TF-IDF vectors for our tweets. The first line of code below imports the 'TfidfVectorizer' from sklearn.feature_extraction.text module. The second line initializes the TfidfVectorizer object, called 'tfidf', while the third line fits and transforms the variable 'processedtext' from the data.

The important arguments we have used in initializing the TfidfVectorizer object are 'max_features' and 'ngram_range'. The 'max_features' argument specifies the maximum number of features to be created, while the argument 'ngram_range=(1,1)' specifies that only unigrams (single words) will be considered for feature creation.

The fourth line prints a summary of the object, which is a sparse matrix with one row per observation (1181) and one column per feature (500).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=500, lowercase=True, analyzer='word', stop_words='english', ngram_range=(1, 1))

dat_tfIdf = tfidf.fit_transform(dat['processedtext'])
dat_tfIdf
```

Output:

```
<1181x500 sparse matrix of type '<class 'numpy.float64'>'
   with 6473 stored elements in Compressed Sparse Row format>
```
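To make the 'ngram_range' argument more concrete, the sketch below lists the word n-grams of a short text for a given range. It is a simplified illustration: 'word_ngrams' is a name made up for this example, and it splits on whitespace only, whereas the vectorizer applies its own tokenization rules.

```python
def word_ngrams(text, ngram_range=(1, 1)):
    """List the word n-grams of a text for every n in the inclusive range."""
    tokens = text.split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the text
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams("appl ruin life", ngram_range=(1, 1)))
# ['appl', 'ruin', 'life']
print(word_ngrams("appl ruin life", ngram_range=(1, 2)))
# ['appl', 'ruin', 'life', 'appl ruin', 'ruin life']
```

With the range (1, 1), only single words become candidate features. Widening the range to (1, 2) adds word pairs, which can capture short phrases at the cost of a larger vocabulary.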

## Bag-of-words Vector

Another popular technique for creating word vectors is the Bag-of-words approach. It is a simple method for identifying topics in a document: each document is represented by the counts of its terms, and word order is ignored. It works on the assumption that the higher the frequency of the term, the higher its importance.
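The idea can be sketched by hand on a toy corpus before turning to the library. The example below is a minimal standard-library illustration; the documents and the helper name 'bow_vector' are made up for this example.

```python
from collections import Counter

# Toy documents for illustration (not taken from the tweet dataset)
docs = ["appl hate appl", "love appl", "hate wait"]

# Vocabulary: every distinct term, sorted so the column order is reproducible
vocabulary = sorted({term for doc in docs for term in doc.split()})


def bow_vector(doc):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]


print(vocabulary)  # ['appl', 'hate', 'love', 'wait']
for doc in docs:
    print(bow_vector(doc), doc)
```

Each document becomes one row of counts, with one column per vocabulary term. Most entries are zero, which is why such matrices are usually stored in a sparse format.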

The first line of code below imports the 'CountVectorizer' utility from the 'sklearn.feature_extraction.text' module. The second line initializes the CountVectorizer object, called 'bag_words', while the third line fits and transforms the variable 'processedtext' from the data. The fourth line prints a summary of the object, which is, again, a sparse matrix with one row per observation (1181) and one column per feature (500).

```python
from sklearn.feature_extraction.text import CountVectorizer
bag_words = CountVectorizer(max_features=500, lowercase=True, ngram_range=(1, 1), analyzer="word")
dat_BOW = bag_words.fit_transform(dat['processedtext'])
dat_BOW
```

Output:

```
<1181x500 sparse matrix of type '<class 'numpy.int64'>'
  with 7181 stored elements in Compressed Sparse Row format>
```

## Conclusion

In this guide, you have learned the fundamentals of building features from raw and processed text data. You can now use the basic as well as the advanced features to build a machine learning model that predicts the sentiment of a tweet.