Deepika Singh

Visualizing Text Data Using a Word Cloud

  • Jun 27, 2019
  • 12 Min read

Introduction

Text data has grown exponentially in recent years, resulting in an ever-increasing need to analyze massive amounts of it. A word cloud is an excellent way to analyze text data visually: it displays tags, or words, sized so that the importance of a word is conveyed by its frequency.

In this guide, we will learn how to create word clouds and find important words that can help in extracting insights from the data. We will begin by understanding the problem statement and the data.

Problem Statement

The data concerns a familiar topic that every email user has encountered at some point: "spam" emails, which are unsolicited messages, often advertising a product, containing links to malware, or attempting to scam the recipient.

In this guide, we will use a publicly available dataset, first described in the 2006 conference paper "Spam Filtering with Naive Bayes -- Which Naive Bayes?" by V. Metsis, I. Androutsopoulos, and G. Paliouras. The "ham" messages in this dataset come from the inbox of former Enron Managing Director for Research Vincent Kaminski, one of the inboxes in the Enron Corpus. One source of spam messages in this dataset is the SpamAssassin corpus, which contains hand-labeled spam messages contributed by Internet users. The remaining spam was collected by Project Honey Pot, a project that collects spam messages and identifies spammers by publishing email addresses that humans would know not to contact but that bots might target with spam. The full dataset we will use was constructed as roughly a 75/25 mix of the ham and spam messages.

The dataset contains just two fields:

  1. text - The text of the email.
  2. spam - A binary variable indicating whether the email was spam (1) or not (0).

Let us start by importing the required libraries.

Importing Libraries

```python
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Load all necessary libraries
import numpy as np
import pandas as pd

import string
import collections
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.cm as cm
import matplotlib.pyplot as plt
%matplotlib inline
```

Reading the File and Understanding the Data

The first line of code below reads in the data as a pandas dataframe, while the second line prints the shape: 5726 observations of 2 variables. The third line displays the first five records. There are only two variables, 'text' and 'spam', both explained above. The majority of the emails are 'ham' emails, labeled as '0', constituting 76 percent of the total data.

```python
# Load the data file
df = pd.read_csv('emails2.csv')

# Shape of the dataframe
print('The shape of the dataframe is :', df.shape)

# First few records
df.head()
```

Output:

The shape of the dataframe is : (5726, 2)

|   	| text                                              	| spam 	|
|---	|---------------------------------------------------	|------	|
| 0 	| Subject: naturally irresistible your corporate... 	| 1    	|
| 1 	| Subject: the stock trading gunslinger fanny i...  	| 1    	|
| 2 	| Subject: unbelievable new homes made easy im ...  	| 1    	|
| 3 	| Subject: 4 color printing special request add...  	| 1    	|
| 4 	| Subject: do not have money , get software cds ... 	| 1    	|

Let us check for missing values in the text variable, which can be done by the line of code below. The output shows no missing values.

```python
# Check for null values in the `text` column
df['text'].isnull().sum()
```

Output:

0

We will start by building the word cloud for all the emails that are spam. The first line of code below filters the data to keep only the spam emails, while the second line prints the shape: 1368 observations of 2 variables.

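```python
# .copy() gives an independent dataframe, so later column assignments do not act on a view of df
spam1 = df[df.spam == 1].copy()
print(spam1.shape)
```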

Output:

(1368, 2)
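
As a quick sanity check, the two shapes printed so far are enough to confirm the class split quoted earlier. The sketch below simply redoes that arithmetic in plain Python; the variable names are illustrative and not part of the original notebook. On the dataframe itself, `df['spam'].value_counts(normalize=True)` should give the same proportions.

```python
# Row counts taken from the shapes printed in this guide
total_emails = 5726   # rows reported by df.shape
spam_emails = 1368    # rows reported by spam1.shape

ham_emails = total_emails - spam_emails
ham_share = ham_emails / total_emails
spam_share = spam_emails / total_emails

print(f"ham:  {ham_emails} emails ({ham_share:.1%})")
print(f"spam: {spam_emails} emails ({spam_share:.1%})")
```

The result is roughly 76 percent ham and 24 percent spam, in line with the figure stated above.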
    

Data Cleaning and Preparation

Before building the word cloud, it is important to clean the data. The common steps are carried out in subsequent sections.

Converting the Text to Lower Case

The first line of code below converts the text to lower case, while the second line displays the first five records. This ensures that words like 'Enron' and 'enron' are treated as the same word when counting word frequencies.

```python
spam1['text'] = spam1['text'].str.lower()
spam1['text'].head()
```

Output:

0    subject: naturally irresistible your corporate...
1    subject: the stock trading gunslinger  fanny i...
2    subject: unbelievable new homes made easy  im ...
3    subject: 4 color printing special  request add...
4    subject: do not have money , get software cds ...
Name: text, dtype: object
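
To see why case matters for frequency counts, here is a small stand-alone sketch using a made-up token list (not taken from the dataset). It counts the tokens before and after lowercasing with `collections.Counter`.

```python
from collections import Counter

# Made-up tokens, for illustration only
tokens = ["Enron", "enron", "ENRON", "Free", "free"]

counts_raw = Counter(tokens)                                # case-sensitive counts
counts_lower = Counter(token.lower() for token in tokens)   # counts after lowercasing

print(len(counts_raw), "distinct tokens before lowercasing")
print(len(counts_lower), "distinct tokens after lowercasing")
print(counts_lower.most_common())
```

Without lowercasing, each spelling variant would be counted, and later drawn, as a separate word.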

Splitting and Removing Punctuation from the Text

```python
# Split each email into a list of tokens on single spaces
all_spam = spam1['text'].str.split(' ')
all_spam.head()
```

Output:

0    [subject:, naturally, irresistible, your, corp...
1    [subject:, the, stock, trading, gunslinger, , ...
2    [subject:, unbelievable, new, homes, made, eas...
3    [subject:, 4, color, printing, special, , requ...
4    [subject:, do, not, have, money, ,, get, softw...
Name: text, dtype: object
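
Note the empty strings in the output above (for example, right after 'gunslinger'): splitting on a single space yields an empty token wherever two spaces occur in a row. The sketch below, on a made-up sentence, contrasts `split(' ')` with the argument-free `split()`, which splits on any run of whitespace. It is only an illustration; the guide keeps `split(' ')`, and the empty tokens appear harmless there because the tokens are later re-joined into a single string.

```python
# Made-up sentence with two consecutive spaces after "gunslinger"
double_space = " " * 2
sample = f"the stock trading gunslinger{double_space}fanny is merrill"

tokens_single_space = sample.split(' ')   # keeps an empty token for the extra space
tokens_any_whitespace = sample.split()    # collapses runs of whitespace

print(tokens_single_space)
print(tokens_any_whitespace)
```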

Joining the Entire Text

In this step, we strip the punctuation from each token and then join all the 'text' records. This is required to build the text corpus from which the word cloud will be generated. The lines of code below complete this task for us.

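```python
all_spam_cleaned = []

# Strip leading and trailing punctuation from every token of every email
for text in all_spam:
    text = [x.strip(string.punctuation) for x in text]
    all_spam_cleaned.append(text)

# Peek at the first cleaned email (displayed only if run as the last line of a cell)
all_spam_cleaned[0]

# Join the tokens back into one string per email, then all emails into one corpus
text_spam = [" ".join(text) for text in all_spam_cleaned]
final_text_spam = " ".join(text_spam)
final_text_spam[:500]
```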

Output:

'subject naturally irresistible your corporate identity  lt is really hard to recollect a company  the  market is full of suqgestions and the information isoverwhelminq  but a good  catchy logo  stylish statlonery and outstanding website  will make the task much easier   we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader it isguite ciear that  without good products  effective business organization and practicable aim it  will be hotat nowadays mar'
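
The cleaning loop above relies on `str.strip(string.punctuation)`, which removes punctuation only from the two ends of each token. A small sketch on made-up tokens shows what that does and does not remove:

```python
import string

# Made-up tokens, for illustration only
tokens = ["subject:", ",", "e-mail", "don't", "(free)", "$100!!"]

# strip() only trims characters from the start and end of each token
cleaned = [token.strip(string.punctuation) for token in tokens]
print(cleaned)
# → ['subject', '', 'e-mail', "don't", 'free', '100']

# Tokens that consisted only of punctuation are now empty and can be dropped
non_empty = [token for token in cleaned if token]
print(non_empty)
```

Interior characters such as the hyphen in 'e-mail' survive, and a token made only of punctuation becomes an empty string, which is why the joined text above contains runs of two or more spaces.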

Word Cloud for 'spam' Emails

Let us build the first word cloud. The first line of code generates the word cloud from the 'final_text_spam' corpus, while the second to fifth lines of code display it.

```python
wordcloud_spam = WordCloud(background_color="white").generate(final_text_spam)

# Lines 2 to 5: display the word cloud
plt.figure(figsize=(20, 20))
plt.imshow(wordcloud_spam, interpolation='bilinear')
plt.axis("off")
plt.show()
```

Output:

[Image: word cloud of the spam emails]

In the word cloud displayed above, some words are larger than others. This is because the size of a word in the word cloud reflects its frequency in the corpus: the more often a word occurs, the larger it is drawn. Various parameters can be adjusted to change the display of the word cloud, and the full list can be viewed in a Jupyter notebook using the '?WordCloud' command.
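
As a rough intuition for the frequency-to-size relationship, the sketch below scales a hypothetical font size linearly with each word's count relative to the most frequent word. This is a deliberate simplification and not the layout algorithm the `wordcloud` library uses, which also depends on other settings such as `relative_scaling`. The three counts are taken from the frequency table later in this guide, and `max_font_size` here is just a local variable.

```python
# Counts for three words, copied from the frequency table later in this guide
word_counts = {"business": 844, "free": 606, "want": 364}

max_font_size = 50                       # hypothetical size for the most frequent word
top_count = max(word_counts.values())

# Simplified linear scaling: size = max_font_size * (count / highest count)
font_sizes = {
    word: round(max_font_size * count / top_count, 1)
    for word, count in word_counts.items()
}

for word, size in font_sizes.items():
    print(f"{word}: {size}")
```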

In this guide, we will be using the following parameters:

  1. background_color: This parameter specifies the background color for the word cloud image, with the default color being 'black'.

  2. max_font_size: This parameter specifies the maximum font size for the largest word. If None, the height of the image is used.

  3. max_words: This parameter specifies the maximum number of words, with the default being 200.

  4. stopwords: This parameter specifies the words that will not be considered while building the word cloud. If None, the built-in STOPWORDS list is used.

Let us modify the earlier word cloud to include these parameters. The first line of code below starts from the library's existing list of stopwords. In the previously built word cloud, words like 'subject', 'will', 'enron', 're', etc. are common and do not provide much insight. The second line adds these data-specific words to the stopwords.

The third line generates the word cloud from the 'final_text_spam' corpus. Note that we have changed some optional arguments, namely max_font_size, max_words, and background_color, to better visualize the word cloud.

The fourth to seventh lines of code plot the word cloud. The argument interpolation='bilinear' is used to make the image appear smoother.

```python
stopwords = set(STOPWORDS)
stopwords.update(["subject", "re", "vince", "kaminski", "enron", "cc", "will", "s", "1", "e", "t"])

wordcloud_spam = WordCloud(stopwords=stopwords, background_color="white", max_font_size=50, max_words=100).generate(final_text_spam)

# Lines 4 to 7: display the word cloud
plt.figure(figsize=(15, 15))
plt.imshow(wordcloud_spam, interpolation='bilinear')
plt.axis("off")
plt.show()
```

Output:

[Image: word cloud of the spam emails with custom stopwords, max_font_size=50, and max_words=100]

From the above image, we can see that the stop words are not displayed. Also, we observe that words like new, account, company, program, mail, etc. are the most prominent words in the word cloud. Next, we will learn another technique: extracting the most popular words as a frequency table. In our case, we will extract the thirty most frequent words. The lines of code below perform this task and print the top thirty words along with their counts.

```python
# Drop stopwords, then count how often each remaining word occurs
filtered_words_spam = [word for word in final_text_spam.split() if word not in stopwords]
counted_words_spam = collections.Counter(filtered_words_spam)

word_count_spam = {}

# Keep the thirty most frequent words, in descending order of count
for word, count in counted_words_spam.most_common(30):
    word_count_spam[word] = count

for i, j in word_count_spam.items():
    print('Word: {0}, count: {1}'.format(i, j))
```

Output:

Word: business, count: 844
Word: company, count: 805
Word: email, count: 804
Word: information, count: 740
Word: 5, count: 687
Word: money, count: 662
Word: 2, count: 613
Word: free, count: 606
Word: 3, count: 604
Word: mail, count: 586
Word: one, count: 581
Word: please, count: 581
Word: now, count: 575
Word: 000, count: 560
Word: us, count: 537
Word: click, count: 531
Word: time, count: 521
Word: new, count: 504
Word: make, count: 496
Word: may, count: 489
Word: website, count: 465
Word: adobe, count: 462
Word: 0, count: 450
Word: software, count: 438
Word: message, count: 418
Word: 10, count: 405
Word: list, count: 392
Word: report, count: 391
Word: 2005, count: 374
Word: want, count: 364
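
The table above still contains purely numeric tokens such as '5', '2', and '000', which say little about the content of the emails. One possible refinement, sketched here on a made-up token list, is to drop tokens made up only of digits before counting. The helper name `count_non_numeric` is hypothetical and not part of the original notebook.

```python
from collections import Counter

def count_non_numeric(tokens, stopwords, top_n=30):
    """Return the top_n most common tokens, skipping stopwords, empty tokens, and digit-only tokens."""
    kept = [
        token for token in tokens
        if token and token not in stopwords and not token.isdigit()
    ]
    return Counter(kept).most_common(top_n)

# Made-up tokens, for illustration only
sample_tokens = ["free", "5", "money", "000", "free", "subject", "2005", "money", "free"]
sample_stopwords = {"subject"}

print(count_non_numeric(sample_tokens, sample_stopwords, top_n=2))
# → [('free', 3), ('money', 2)]
```

With the variables from this guide, the equivalent call would be `count_non_numeric(final_text_spam.split(), stopwords)`.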

Conclusion

In this guide, you learned how to build a word cloud and which parameters can be adjusted to improve its appearance. You also learned how to extract the top words while identifying and removing noise using the stopwords set.

We applied these techniques to the 'spam' emails of the dataset. Similar steps could be performed to create a word cloud for the 'ham' emails or for the entire text. The important words can then be used for decision making or as features for model building.
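
To reuse the cleaning steps for the 'ham' emails, they can be wrapped in a small function. The sketch below is one way to do it; `build_corpus` is a hypothetical helper name, and it accepts any iterable of strings so that it can be tried without the dataset.

```python
import string

def build_corpus(texts):
    """Lowercase each text, split on spaces, strip punctuation, and join everything into one string."""
    cleaned_texts = []
    for text in texts:
        tokens = text.lower().split(' ')
        tokens = [token.strip(string.punctuation) for token in tokens]
        cleaned_texts.append(" ".join(tokens))
    return " ".join(cleaned_texts)

# Made-up messages, for illustration only
sample = ["Subject: Meeting tomorrow !", "Re: Research update"]
corpus = build_corpus(sample)
print(corpus.split())
```

With the dataframe from this guide, `build_corpus(df[df.spam == 0]['text'])` should produce the ham corpus, which can then be passed to `WordCloud(...).generate(...)` in the same way as the spam corpus.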
