Text data is different from structured tabular data, so building features from it requires a different approach. In this guide, you will learn how to extract features from raw text for predictive modeling. You will also learn how to perform text preprocessing steps and create TF-IDF and Bag-of-Words (BOW) feature matrices. We will begin by exploring the data.
In this guide, we will be using tweet data about the company 'Apple'. The objective is to create features that can be used for building a sentiment predictor model.
The dataset contains 1,181 observations and 3 variables, as described below:
Tweet: The text of the tweet posted by the user. The Twitter data is publicly available.
Avg: The average sentiment score of the tweet, ranging from -2 (extremely negative) to +2 (extremely positive). This scoring was done using Amazon Mechanical Turk.
Sentiment: The sentiment label for each tweet - 'Negative', 'Neutral', or 'Positive'.
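The guide does not show how the 'Sentiment' label was created, but the group averages reported later (the 'Neutral' group averages exactly 0) suggest it is a simple thresholding of 'Avg'. A minimal sketch, assuming that mapping:

```python
def label_from_avg(avg):
    # Assumed mapping: negative score -> 'Negative', zero -> 'Neutral', positive -> 'Positive'
    if avg < 0:
        return 'Negative'
    elif avg > 0:
        return 'Positive'
    return 'Neutral'

# Once the data is loaded, the label could be reconstructed with:
# dat['Sentiment_check'] = dat['Avg'].apply(label_from_avg)
```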
```python
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
import nltk
import warnings
%matplotlib inline
warnings.filterwarnings("ignore", category=DeprecationWarning)

from nltk.corpus import stopwords
# nltk.download('stopwords')  # run this once if the stopwords corpus is not yet installed
stop = stopwords.words('english')
```
The first line of code below reads in the data as a pandas dataframe, while the second line prints the shape - 1,181 observations of 3 variables. The third line prints the first five observations.
```python
dat = pd.read_csv('datatweets.csv')
print(dat.shape)
dat.head(5)
```
Output:
```
(1181, 3)
```

| | Tweet | Avg | Sentiment |
|---|---|---|---|
| 0 | iphone 5c is ugly as heck what the freak @appl... | -2.0 | Negative |
| 1 | freak YOU @APPLE | -2.0 | Negative |
| 2 | freak you @apple | -2.0 | Negative |
| 3 | @APPLE YOU RUINED MY LIFE | -2.0 | Negative |
| 4 | @apple I hate apple!!!!! | -2.0 | Negative |
We will start by performing a basic analysis of the data. The line of code below counts the number of tweets for each 'Sentiment' label. The output shows that negative tweets are the most common, while positive tweets are the least common.
```python
# Count the number of tweets per sentiment label
dat.groupby('Sentiment')['Tweet'].count()
```
Output:
```
Sentiment
Negative    541
Neutral     337
Positive    303
Name: Tweet, dtype: int64
```
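Since seaborn and matplotlib were imported above but not otherwise used, a quick bar chart can make this class distribution easier to see. This plot is a small addition, not part of the original analysis:

```python
# Visualize the number of tweets per sentiment label
sns.countplot(x='Sentiment', data=dat)
plt.title('Number of tweets per sentiment label')
plt.show()
```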
The sentiment score for the tweets is stored in the variable 'Avg', which ranges from -2 (extremely negative) to +2 (extremely positive). We will explore whether the average sentiment score differs across the 'Sentiment' labels. The line of code below performs this task, and the output shows that the average negative score is -0.74, while the average positive score is 0.57.
```python
dat.groupby('Sentiment')['Avg'].mean()
```
Output:
```
Sentiment
Negative   -0.743068
Neutral     0.000000
Positive    0.574257
Name: Avg, dtype: float64
```
Many simple but important features can be extracted from the raw text data, as discussed below.
The hypothesis is that the length of the characters in a tweet varies across the sentiment it carries. The first line of code below creates a new variable 'character_cnt' that takes in the text from the 'Tweet' variable and calculates the count of the characters in the text. The second line performs the 'groupby' operation on the 'Sentiment' label and prints the average character length across the labels.
The output shows that neutral tweets have a lower character count on average, compared to the positive and negative tweets. This inference can be useful for separating the neutral tweets from the other types of tweets.
```python
dat['character_cnt'] = dat['Tweet'].str.len()
dat.groupby('Sentiment')['character_cnt'].mean()
```
Output:
```
Sentiment
Negative    91.763401
Neutral     85.379822
Positive    94.825083
Name: character_cnt, dtype: float64
```
Just like the character count in a tweet, the word count can also be a useful feature. The first line of code below creates a new variable 'word_counts' that takes in the text from the 'Tweet' variable and calculates the count of the words in the text. The second line performs the 'groupby' operation on the 'Sentiment' label and prints the average word count across the labels.
The output shows that the negative sentiments have the highest average word count, suggesting that the disappointed customers tend to write longer tweets. This inference can be useful for separating the 'sentiment' labels.
```python
dat['word_counts'] = dat['Tweet'].str.split().str.len()
dat.groupby('Sentiment')['word_counts'].mean()
```
Output:
```
Sentiment
Negative    15.336414
Neutral     12.356083
Positive    14.676568
Name: word_counts, dtype: float64
```
Since we have already created the 'character_cnt' and 'word_counts' features, it is easy to take the ratio of these two variables, which gives the average number of characters per word in each tweet.
The first line of code below creates a new variable 'characters_per_word' that is the ratio of the number of characters and the number of words in a tweet. The second line performs the 'groupby' operation on the 'Sentiment' label and prints the average character length per word across the labels.
The output shows that neutral sentiments have the highest average character length per word. This inference can be useful for separating the 'Sentiment' labels.
```python
dat['characters_per_word'] = dat['character_cnt']/dat['word_counts']
dat.groupby('Sentiment')['characters_per_word'].mean()
```
Output:
```
Sentiment
Negative    6.191374
Neutral     7.425695
Positive    6.687928
Name: characters_per_word, dtype: float64
```
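One caveat not covered above: if a tweet contained no words at all, the division would produce infinity or NaN. A defensive variant, purely as a precaution, could look like this:

```python
# Guard against tweets with zero words (which would otherwise divide by zero)
dat['characters_per_word'] = np.where(dat['word_counts'] > 0,
                                      dat['character_cnt'] / dat['word_counts'],
                                      0)
```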
It is also possible to create a feature that contains the count of special characters like '@' or '#'. The first line of code below creates a new feature 'spl' that takes in the text from the 'Tweet' variable and counts the words starting with the special character '@'. We use the startswith() string method to perform this operation. The second line prints the first five observations containing the 'Tweet' and the 'spl' variables.
```python
dat['spl'] = dat['Tweet'].apply(lambda x: len([w for w in x.split() if w.startswith('@')]))
dat[['Tweet','spl']].head()
```
Output:
| | Tweet | spl |
|---|---|---|
| 0 | iphone 5c is ugly as heck what the freak @appl... | 2 |
| 1 | freak YOU @APPLE | 1 |
| 2 | freak you @apple | 1 |
| 3 | @APPLE YOU RUINED MY LIFE | 1 |
| 4 | @apple I hate apple!!!!! | 1 |
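The same pattern works for other special characters. For instance, a 'hashtags' feature (not built in the original guide, but a natural extension) could count words starting with '#':

```python
# Count the number of hashtags in each tweet
dat['hashtags'] = dat['Tweet'].apply(lambda x: len([w for w in x.split() if w.startswith('#')]))
dat[['Tweet', 'hashtags']].head()
```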
Just like we created a feature on the count of words in a tweet, we can also create a feature on the count of numbers in a tweet. The first line of code below creates a new variable 'num' that takes in the text from the 'Tweet' variable and calculates the count of the numbers in the text. The second line performs the 'groupby' operation on the 'Sentiment' label and prints the average count of numbers across the labels.
The output shows that the neutral sentiment labels have the lowest average count of numbers in a tweet, whereas the negative tweets have the highest average.
```python
# Count of numeric tokens in each tweet
dat['num'] = dat['Tweet'].apply(lambda x: len([w for w in x.split() if w.isdigit()]))
dat.groupby('Sentiment')['num'].mean()
```
Output:
```
Sentiment
Negative    0.125693
Neutral     0.068249
Positive    0.108911
Name: num, dtype: float64
```
So far, we have created simple features from the raw text. We can also create more advanced features but, before that, we will have to clean the text. The common pre-processing steps are summarized below:
Removing punctuation - the rule of thumb is to remove everything that is not a word character or whitespace, i.e., punctuation marks and other special symbols. The first line of code below performs this task.
Removing stopwords - these are common words like 'the', 'is', and 'at'. They occur with high frequency in the corpus but do not help in differentiating the target classes, and removing them also reduces the data size. The second line of code below performs this task.
Conversion to lowercase - words like 'Phone' and 'phone' need to be treated as one word, so all text is converted to lowercase. The third line of code below performs this task. (Note that because the stopword list is lowercase, lowercasing before stopword removal would also catch capitalized stopwords such as 'YOU'; the code below keeps the original order.)
Stemming - words are reduced to their root form (for example, 'ugly' becomes 'ugli' and 'iphone' becomes 'iphon') so that different inflections of a word are treated as the same token. The next block of code performs this task using NLTK's PorterStemmer.
The last line of code prints a summary of all the new features that we have built so far.
```python
# Remove punctuation (anything that is not a word character or whitespace)
dat['processedtext'] = dat['Tweet'].str.replace(r'[^\w\s]', '', regex=True)

# Remove stopwords
dat['processedtext'] = dat['processedtext'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

# Convert to lowercase
dat['processedtext'] = dat['processedtext'].apply(lambda x: " ".join(x.lower() for x in x.split()))

# Stemming: reduce each word to its root form
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
dat['processedtext'] = dat['processedtext'].apply(lambda x: " ".join([stemmer.stem(word) for word in x.split()]))

# Summary of all the new features built so far
dat[['character_cnt','word_counts','characters_per_word', 'spl', 'num', 'processedtext']].head()
```
Output:
| | character_cnt | word_counts | characters_per_word | spl | num | processedtext |
|---|---|---|---|---|---|---|
| 0 | 64 | 11 | 5.818182 | 2 | 0 | iphon 5c ugli heck freak appl iphonecompani |
| 1 | 16 | 3 | 5.333333 | 1 | 0 | freak you appl |
| 2 | 16 | 3 | 5.333333 | 1 | 0 | freak appl |
| 3 | 25 | 5 | 5.000000 | 1 | 0 | appl you ruin my life |
| 4 | 24 | 4 | 6.000000 | 1 | 0 | appl i hate appl |
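Before moving on to vectorization, it can be useful to sanity-check the cleaned text by looking at the most frequent tokens. A quick sketch, not part of the original guide:

```python
# Inspect the most common tokens in the processed text
word_freq = pd.Series(" ".join(dat['processedtext']).split()).value_counts()
word_freq.head(10)
```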
We have cleaned the text, which is now stored in a new variable, 'processedtext'. However, in order to use it for building machine learning models, we will have to convert it to word frequency vectors.
One of the most popular methods to do this is through the TF-IDF representation, which is used as a weighting factor in text mining applications. In simple terms, TF-IDF attempts to highlight important words which appear frequently in a document but not across documents. The terms are briefly explained below:
Term Frequency (TF): How frequently a term appears within a document, normalized by the length of that document.
Inverse Document Frequency (IDF): How rare a term is across the whole corpus; terms that appear in many documents receive a lower weight.
The TF-IDF score of a term in a document is the product of these two quantities, as illustrated by the toy computation below.
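As a quick illustration of the idea (scikit-learn's actual formula adds smoothing and normalization, so its numbers will differ), a toy computation on a made-up three-document corpus might look like this:

```python
import math

# Toy corpus: three tiny "documents" of stemmed tokens
docs = [["appl", "hate", "appl"],
        ["love", "appl"],
        ["hate", "iphon", "appl"]]

def tf(term, doc):
    # Term frequency: count of the term normalized by document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log of (number of documents / documents containing the term)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

# 'appl' is frequent in the first document but occurs in every document, so its weight is 0;
# 'hate' is rarer across documents, so it receives a higher TF-IDF weight
print(tf("appl", docs[0]) * idf("appl", docs))
print(tf("hate", docs[0]) * idf("hate", docs))
```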
Now, we will work on creating the TF-IDF vectors for our tweets. The first line of code below imports the 'TfidfVectorizer' from sklearn.feature_extraction.text module. The second line initializes the TfidfVectorizer object, called 'tfidf', while the third line fits and transforms the variable 'processedtext' from the data.
The important arguments we have used in initializing the TfidfVectorizer object are 'max_features' and 'ngram_range'. While the 'max_features' argument specifies the maximum number of features to be created, the argument 'ngram_range=(1,1)' specifies that only unigrams will be considered for feature creation.
The fourth line prints a summary of the object, which is a sparse matrix containing the number of observations (1181) and the number of features (500).
```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=500, lowercase=True, analyzer='word', stop_words='english', ngram_range=(1,1))

dat_tfIdf = tfidf.fit_transform(dat['processedtext'])
dat_tfIdf
```
Output:
```
<1181x500 sparse matrix of type '<class 'numpy.float64'>'
    with 6473 stored elements in Compressed Sparse Row format>
```
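If you want to see which terms ended up as columns, the fitted vectorizer exposes its vocabulary. A short sketch (get_feature_names_out() requires scikit-learn 1.0 or later; older versions use get_feature_names()):

```python
# Peek at the learned vocabulary and view the TF-IDF matrix as a dense DataFrame
feature_names = tfidf.get_feature_names_out()
tfidf_df = pd.DataFrame(dat_tfIdf.toarray(), columns=feature_names)
tfidf_df.head()
```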
Another popular technique for creating word vectors is the Bag-of-words approach. It is a simplistic method for identifying topics in a document. It works on the assumption that the higher the frequency of the term, the higher its importance.
The first line of code below imports the 'CountVectorizer' utility from the 'sklearn.feature_extraction.text' module. The second line initializes the CountVectorizer object, called 'bag_words', while the third line fits and transforms the variable 'processedtext' from the data. The fourth line prints a summary of the object, which is, again, a sparse matrix containing the number of observations (1181) and the number of features (500).
```python
from sklearn.feature_extraction.text import CountVectorizer

bag_words = CountVectorizer(max_features=500, lowercase=True, ngram_range=(1,1), analyzer="word")
dat_BOW = bag_words.fit_transform(dat['processedtext'])
dat_BOW
```
Output:
```
<1181x500 sparse matrix of type '<class 'numpy.int64'>'
    with 7181 stored elements in Compressed Sparse Row format>
```
In this guide, you have learned the fundamentals of building features from the raw and the processed text data. You can now use the basic as well as advanced features for building a machine learning algorithm that can predict the sentiment of a tweet.
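To close the loop, here is a minimal modeling sketch. It combines the TF-IDF matrix with the simple numeric features using scipy.sparse.hstack and fits a logistic regression classifier; the model choice and the train/test split are illustrative assumptions, not part of the original guide:

```python
from scipy.sparse import hstack, csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Combine the TF-IDF matrix with the simple numeric features built earlier
numeric_feats = csr_matrix(dat[['character_cnt', 'word_counts', 'characters_per_word', 'spl', 'num']].values)
X = hstack([dat_tfIdf, numeric_feats]).tocsr()
y = dat['Sentiment']

# Illustrative train/test split and baseline classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```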
To learn more about Natural Language Processing and Text Analytics, please refer to the following guides: