Visualizing Text Data Using a Word Cloud

Deepika Singh

  • Jun 27, 2019
  • 12 Min read

Machine Learning


Text data has grown exponentially in recent years, creating an ever-increasing need to analyze the massive amounts of such data. A word cloud provides an excellent way to analyze text data through visualization, in the form of tags, or words, where the importance of a word is indicated by its frequency.

In this guide, we will learn how to create word clouds and find important words that can help in extracting insights from the data. We will begin by understanding the problem statement and the data.

Problem Statement

The data concerns a topic familiar to every email user: "spam" emails, which are unsolicited messages, often advertising a product, containing links to malware, or attempting to scam the recipient.

In this guide, we will use a publicly available dataset, first described in the 2006 conference paper "Spam Filtering with Naive Bayes -- Which Naive Bayes?" by V. Metsis, I. Androutsopoulos, and G. Paliouras. The "ham" messages in this dataset come from the inbox of former Enron Managing Director for Research Vincent Kaminski, one of the inboxes in the Enron Corpus. One source of spam messages in this dataset is the SpamAssassin corpus, which contains hand-labeled spam messages contributed by Internet users. The remaining spam was collected by Project Honey Pot, a project that collects spam messages and identifies spammers by publishing email addresses that humans would know not to contact but that bots might target with spam. The full dataset we will use was constructed as roughly a 75/25 mix of the ham and spam messages.

The dataset contains just two fields:

  1. text - The text of the email.
  2. spam - A binary variable indicating if the email was spam or not.

Let us start by importing the required libraries.

Importing Libraries

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Load all necessary libraries
import numpy as np
import pandas as pd

import string
import collections
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.cm as cm
import matplotlib.pyplot as plt
%matplotlib inline

Reading the File and Understanding the Data

The first line of code below reads in the data as a pandas dataframe, while the second line prints the shape - 5726 observations of 2 variables. The third line prints the first five records. There are only two variables - 'text' and 'spam' - as explained above. The majority of the emails are 'ham' emails, labeled '0', constituting 76 percent of the total data.

# Load the data file
df = pd.read_csv('emails2.csv')

# Shape of the dataframe
print('The shape of the dataframe is :', df.shape)

# First few records
df.head()


The shape of the dataframe is : (5726, 2)

|   | text                                               | spam |
|---|----------------------------------------------------|------|
| 0 | Subject: naturally irresistible your corporate...  | 1    |
| 1 | Subject: the stock trading gunslinger fanny i...   | 1    |
| 2 | Subject: unbelievable new homes made easy im ...   | 1    |
| 3 | Subject: 4 color printing special request add...   | 1    |
| 4 | Subject: do not have money , get software cds ...  | 1    |
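The 76 percent ham share mentioned above can be checked directly from the 'spam' column. A quick sketch, shown here on a tiny mock dataframe standing in for the real file:

```python
import pandas as pd

# Mock frame standing in for df = pd.read_csv('emails2.csv')
df = pd.DataFrame({"text": ["a", "b", "c", "d"], "spam": [0, 0, 0, 1]})

# Share of ham (spam == 0) emails: one minus the mean of the binary label
ham_share = 1 - df["spam"].mean()
print(f"{ham_share:.0%}")  # 75% for this mock frame
```

On the real data, `1 - df['spam'].mean()` gives roughly 0.76, matching the figure quoted above.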

Let us check for missing values in the text variable, which can be done by the line of code below. The output shows no missing values.

# Check for null values in the 'text' column
df['text'].isnull().sum()



We will start by building the word cloud for all the emails that are spam. The first line of code below filters the data with spam emails, while the second line prints the shape - 1368 observations of 2 variables.

spam1 = df[df.spam == 1]
print(spam1.shape)


(1368, 2)

Data Cleaning and Preparation

Before building the word cloud, it is important to clean the data. The common steps are carried out in subsequent sections.

Converting the Text to Lower Case

The first line of code below converts the text to lower case, while the second line prints the top five records. This will ensure that words like 'Enron' and 'enron' are treated as the same while counting word frequencies.

spam1['text'] = spam1['text'].str.lower()
print(spam1['text'].head())


0    subject: naturally irresistible your corporate...
1    subject: the stock trading gunslinger  fanny i...
2    subject: unbelievable new homes made easy  im ...
3    subject: 4 color printing special  request add...
4    subject: do not have money , get software cds ...
Name: text, dtype: object

Splitting and Removing Punctuation from the Text

The line of code below splits each email's text on spaces, producing a list of words for every record. The punctuation attached to individual words will be removed in the next step.

all_spam = spam1['text'].str.split(' ')


0    [subject:, naturally, irresistible, your, corp...
1    [subject:, the, stock, trading, gunslinger, , ...
2    [subject:, unbelievable, new, homes, made, eas...
3    [subject:, 4, color, printing, special, , requ...
4    [subject:, do, not, have, money, ,, get, softw...
Name: text, dtype: object

Joining the Text Records

In this step, we will join all the 'text' records. This is required to build the text corpus which will be used to build the word cloud. The lines of code below complete this task for us.

all_spam_cleaned = []

for text in all_spam:
    text = [x.strip(string.punctuation) for x in text]
    all_spam_cleaned.append(text)

text_spam = [" ".join(text) for text in all_spam_cleaned]
final_text_spam = " ".join(text_spam)


'subject naturally irresistible your corporate identity  lt is really hard to recollect a company  the  market is full of suqgestions and the information isoverwhelminq  but a good  catchy logo  stylish statlonery and outstanding website  will make the task much easier   we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader it isguite ciear that  without good products  effective business organization and practicable aim it  will be hotat nowadays mar'
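One detail worth noting about the cleaning step above: `str.strip(string.punctuation)` removes punctuation characters only from the ends of a token, not from its middle, so words fused by internal punctuation stay fused. A quick check:

```python
import string

# Leading/trailing punctuation is removed
print("money!!!".strip(string.punctuation))    # money

# Internal punctuation is left untouched
print("re:subject".strip(string.punctuation))  # re:subject
```

This is usually acceptable for word clouds, since WordCloud's own tokenizer also ignores most punctuation.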

Word Cloud for 'spam' Emails

Let us build the first word cloud. The first line of code below generates the word cloud on the 'final_text_spam' corpus, while the remaining lines display it.

wordcloud_spam = WordCloud(background_color="white").generate(final_text_spam)

plt.figure(figsize=(20, 20))
plt.imshow(wordcloud_spam, interpolation='bilinear')
plt.axis("off")
plt.show()



The word cloud displayed above is good, but some of the words are larger than the others. This is because the size of the word in the word cloud is proportional to the frequency of the word inside the corpus. There are various parameters which can be adjusted to change the display of the word cloud, and the list of such parameters can be viewed using the '?WordCloud' command.

In this guide, we will be using the following parameters:

  1. background_color: This parameter specifies the background color for the word cloud image, with the default color being 'black'.

  2. max_font_size: This parameter specifies the maximum font size for the largest word. If none, the height of the image is used.

  3. max_words: This parameter specifies the maximum number of words, with the default being 200.

  4. stopwords: This parameter specifies the words that will not be considered while building the word cloud. If none, the built-in stopwords list will be used.

Let us modify the earlier word cloud to include these parameters. The first line of code below utilizes the existing list of stopwords. In the previously built word cloud, words like 'subject', 'will', 'us', 'enron', 're', etc. are common words and do not provide much insight. The second line updates the stopwords with these words specific to our data.

The third line generates the word cloud on the 'final_text_spam' corpus. Note that we have changed some optional arguments like max_font_size, max_words, and background_color to better visualize the word cloud.

The fourth to seventh lines of code plot the word cloud. The argument 'interpolation=bilinear' is used to make the image appear smoother.

stopwords = set(STOPWORDS)
stopwords.update(["subject", "re", "vince", "kaminski", "enron", "cc", "will", "s", "1", "e", "t"])

wordcloud_spam = WordCloud(stopwords=stopwords, background_color="white", max_font_size=50, max_words=100).generate(final_text_spam)

plt.figure(figsize=(15, 15))
plt.imshow(wordcloud_spam, interpolation='bilinear')
plt.axis("off")
plt.show()



From the above image, we can see that the stopwords are no longer displayed. We also observe that words like new, account, company, program, and mail are among the most prominent in the word cloud. Next, we will learn another technique for extracting the most popular words: a frequency table. In our case, we will extract the thirty most frequent words. The lines of code below perform this task and print the top thirty words along with their counts.

filtered_words_spam = [word for word in final_text_spam.split() if word not in stopwords]
counted_words_spam = collections.Counter(filtered_words_spam)

word_count_spam = {}

for letter, count in counted_words_spam.most_common(30):
    word_count_spam[letter] = count

for i, j in word_count_spam.items():
    print('Word: {0}, count: {1}'.format(i, j))


Word: business, count: 844
Word: company, count: 805
Word: email, count: 804
Word: information, count: 740
Word: 5, count: 687
Word: money, count: 662
Word: 2, count: 613
Word: free, count: 606
Word: 3, count: 604
Word: mail, count: 586
Word: one, count: 581
Word: please, count: 581
Word: now, count: 575
Word: 000, count: 560
Word: us, count: 537
Word: click, count: 531
Word: time, count: 521
Word: new, count: 504
Word: make, count: 496
Word: may, count: 489
Word: website, count: 465
Word: adobe, count: 462
Word: 0, count: 450
Word: software, count: 438
Word: message, count: 418
Word: 10, count: 405
Word: list, count: 392
Word: report, count: 391
Word: 2005, count: 374
Word: want, count: 364


In this guide, you have learned how to build a word cloud and about the important parameters that can be altered to improve its appearance. You also learned how to extract the top words while identifying and removing noise using the stopwords list.

We applied these techniques to the 'spam' emails of the dataset. Similar steps could be performed to create word clouds for the 'ham' emails or for the entire text. The important words can then be used for decision making or as features for model building.

To learn more about Natural Language Processing with Python, please refer to the following guides: