Vivek Kumar

Wrangling Text Data

  • Mar 25, 2019
  • 7 Min read
  • 19 Views
Data
Pandas

Introduction

This guide introduces basic pre-processing steps for text data using Pandas, such as handling alphanumeric words, extra spaces, invalid characters, and repeating words.

To implement these steps, let us take a file containing reviews of the iPhone X, as shown:

Reviews
Battery health health as received was 96percentage, which I was happy about @896.
It is not a REFURBISHED iphone, it is activated.
I rarely leave reviews but when I opened this phone #$%^@ I was aggravated.
The iPhone came with 89percent of it’s original original factory battery health.
Won't connect with my Bose wireless wireless speakers, and won't connect with my wireless JBL speakers.

We will break our learning into the following objectives:

1. Convert all the reviews to lowercase and remove extra whitespace.
2. Remove the invalid as well as the special characters.
3. Remove the repeating words.
4. Insert a space between the letters and numbers of an alphanumeric word.

The Baseline

To begin the process, we will import the Pandas library and read these five reviews from a reviews.csv file to a DataFrame variable as shown:

```python
# Necessary library
import pandas as pd

# Reading reviews and storing them in a DataFrame variable
reviews = pd.read_csv('reviews.csv')
```
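If reviews.csv is not at hand, the same DataFrame can be built in memory. The following is a minimal sketch that assumes the file has a single column named Reviews, as the tables in this guide suggest; the name sample_reviews is only illustrative.

```python
import pandas as pd

# The five sample reviews from the table above, typed in by hand.
# Assumption: the CSV has a single column named "Reviews".
sample_reviews = [
    "Battery health health as received was 96percentage, which I was happy about @896.",
    "It is not a REFURBISHED iphone, it is activated.",
    "I rarely leave reviews but when I opened this phone #$%^@ I was aggravated.",
    "The iPhone came with 89percent of it’s original original factory battery health.",
    "Won't connect with my Bose wireless wireless speakers, and won't connect with my wireless JBL speakers.",
]

# One row per review, mirroring what pd.read_csv would return for such a file
reviews = pd.DataFrame({"Reviews": sample_reviews})
print(reviews.shape)
```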

Convert All the Reviews to Lowercase and Remove Extra Whitespace

To start the pre-processing, let us first convert all the reviews to lowercase. This can be achieved with the lower method of the Pandas string accessor, str.lower(), as shown:

```python
# Converting reviews to lowercase and storing the results back in the Reviews column
reviews.Reviews = reviews.Reviews.str.lower()
reviews
```

Output:

Reviews
battery health health as received was 96percentage, which i was happy about @896.
it is not a refurbished iphone, it is activated.
i rarely leave reviews but when i opened this phone #$%^@ i was aggravated.
the iphone came with 89percent of it’s original original factory battery health.
won't connect with my bose wireless wireless speakers, and won't connect with my wireless jbl speakers.

Next, as you can observe in the second and third reviews, we have some extra whitespace. To collapse any run of more than one whitespace character into a single space, we can use the following code:

```python
# Removing extra whitespace from the text using apply
reviews.Reviews = reviews.Reviews.apply(lambda x: " ".join(x.split()))
```

Output

Reviews
battery health health as received was 96percentage, which i was happy about @896.
it is not a refurbished iphone, it is activated.
i rarely leave reviews but when i opened this phone #$%^@ i was aggravated.
the iphone came with 89percent of it’s original original factory battery health.
won't connect with my bose wireless wireless speakers, and won't connect with my wireless jbl speakers.

In the above code, split() called without arguments splits each review on any run of whitespace, so extra spaces never produce empty elements, and join then reassembles the words with a single space between them. This anonymous function is applied to each row using apply, and the results are stored back in the Reviews column of the DataFrame.
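To see the effect in isolation, here is a small sketch on a made-up string (not one of the reviews) with leading, trailing, and repeated whitespace. It also shows a vectorized alternative based on str.replace and str.strip, which is a swapped-in technique rather than the approach used in this guide.

```python
import pandas as pd

# Hypothetical sample with leading, trailing, and repeated whitespace (including a tab)
s = pd.Series(["  it is   not a\trefurbished  iphone  "])

# split() with no argument splits on any run of whitespace and drops empty strings
joined = s.apply(lambda x: " ".join(x.split()))

# Vectorized alternative: collapse whitespace runs with a regex, then trim the ends
vectorized = s.str.replace(r"\s+", " ", regex=True).str.strip()

print(repr(joined[0]))      # 'it is not a refurbished iphone'
print(repr(vectorized[0]))  # 'it is not a refurbished iphone'
```

Both forms give the same result on this input.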

Remove the Invalid as Well as the Special Characters

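Now, let us remove invalid and special characters from the reviews. Here, we treat punctuation as our target: the pattern [^\w\s] matches any character that is neither a word character (letter, digit, or underscore) nor whitespace. The code below removes them: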

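```python
# Removing punctuation and other special characters
# regex=True is needed in Pandas 2.0 and later, where str.replace treats the pattern literally by default
reviews.Reviews = reviews.Reviews.str.replace(r'[^\w\s]', '', regex=True)
```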

Output

Reviews
battery health health as received was 96percentage which i was happy about 896
it is not a refurbished iphone it is activated
i rarely leave reviews but when i opened this phone i was aggravated
the iphone came with 89percent of its original original factory battery health
wont connect with my bose wireless wireless speakers and wont connect with my wireless jbl speakers

As the above output shows, all punctuation has been removed from the reviews.
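The sketch below applies the same pattern to a made-up string to show two details that are easy to miss: the underscore counts as a word character and is kept, and deleting a block of symbols that sits between two spaces leaves a double space behind.

```python
import pandas as pd

# Hypothetical sample string, chosen to show what the pattern keeps and removes
s = pd.Series(["about @896. it’s #$%^@ snake_case won't"])

# [^\w\s] matches anything that is neither a word character nor whitespace
cleaned = s.str.replace(r"[^\w\s]", "", regex=True)
print(repr(cleaned[0]))  # 'about 896 its  snake_case wont'

# Collapse the leftover double space with the split/join idiom from the previous section
tidy = cleaned.apply(lambda x: " ".join(x.split()))
print(repr(tidy[0]))     # 'about 896 its snake_case wont'
```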

Remove the Repeating Words

You may have noticed some consecutive repeated words; in the first review, for example, health appears twice in a row. Such consecutive repeats can be removed with a regular expression that captures a word as the first group, then matches one or more occurrences of whitespace followed by a back-reference (\1) to that same word, and replaces the whole match with a single copy.

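```python
# Removing consecutive duplicates
# regex=True is needed in Pandas 2.0 and later, where str.replace treats the pattern literally by default
reviews.Reviews = reviews.Reviews.str.replace(r'\b(\w+)(\s+\1)+\b', r'\1', regex=True)
```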

Output

Reviews
battery health as received was 96percentage which i was happy about 896
it is not a refurbished iphone it is activated
i rarely leave reviews but when i opened this phone i was aggravated
the iphone came with 89percent of its original factory battery health
wont connect with my bose wireless speakers and wont connect with my wireless jbl speakers

As you can observe from the output, the word health in the first row now occurs only once. The same applies to original in the fourth row and to the first wireless in the fifth row.
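To make the role of each part of the pattern concrete, the following sketch runs it with Python's re module on a few made-up strings. The third sample shows why the trailing \b matters: without it, the "the" at the start of "theory" would be treated as a repeat of the preceding "the".

```python
import re

# Same pattern as above: a word, then one or more (whitespace + same word) repeats
pattern = re.compile(r"\b(\w+)(\s+\1)+\b")

samples = [
    "battery health health as received",  # a word repeated once
    "very very very good",                # a word repeated more than once
    "the theory is is sound",             # "the theory" must not be treated as a repeat
]

# Replace each matched run of repeats with a single copy of the captured word
deduped = [pattern.sub(r"\1", s) for s in samples]
for line in deduped:
    print(line)
# battery health as received
# very good
# the theory is sound
```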

Insert Space Between Alphabets and Numbers of an Alphanumeric Word

Among the given reviews, there are two alphanumeric words: 96percentage in the first row and 89percent in the fourth row. Let us split them by inserting a single space between the digits and the letters.

This can be achieved with a regular expression. We again use an anonymous function, defined with the lambda keyword. Inside it, re.findall extracts every run of letters and every run of digits as separate tokens, and ' '.join() puts them back together with a single space between tokens. The regular expression used is [^\W\d]+|\d+: the first alternative, [^\W\d]+, matches one or more word characters that are not digits, and the second, \d+, matches one or more digits. Any character that matches neither alternative is dropped, which is harmless here because punctuation was already removed. The function is applied to every row using the Pandas apply method.

Let us take a look at the code:

```python
# Import regular expression library
import re

# Splitting alphanumeric words
reviews.Reviews = reviews.Reviews.apply(lambda x: ' '.join(re.findall(r"[^\W\d]+|\d+", x)))
```

Output

Reviews
battery health as received was 96 percentage which i was happy about 896
it is not a refurbished iphone it is activated
i rarely leave reviews but when i opened this phone i was aggravated
the iphone came with 89 percent of its original factory battery health
wont connect with my bose wireless speakers and wont connect with my wireless jbl speakers
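The tokenizing behavior is easier to see outside Pandas. This sketch uses re.findall directly; the second input is a made-up string that shows a side effect worth knowing about: because everything that is neither a letter, an underscore, nor a digit is dropped, a decimal such as 3.5 is split into two separate numbers.

```python
import re

# Runs of non-digit word characters (letters, underscore) OR runs of digits
token_pattern = re.compile(r"[^\W\d]+|\d+")

# An alphanumeric word from the reviews is split into its numeric and alphabetic parts
print(token_pattern.findall("96percentage"))  # ['96', 'percentage']

# Hypothetical input: the dot is dropped, so "3.5mm" becomes three tokens
tokens = token_pattern.findall("iphone_x 3.5mm")
print(" ".join(tokens))  # iphone_x 3 5 mm
```

If decimals or other punctuation need to survive, this step would have to use a different pattern.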

Conclusion

This completes a basic overview of text pre-processing using Pandas. We have learned how to convert text to lowercase, remove extra whitespace, punctuation, and consecutive repeating words, and split alphanumeric words.
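As a recap, the steps can be chained into one helper. This is only a sketch: the function name clean_reviews is hypothetical, and it assumes a Series of strings with no missing values.

```python
import re
import pandas as pd

def clean_reviews(series: pd.Series) -> pd.Series:
    """Apply the steps from this guide, in the same order, to a Series of strings."""
    cleaned = series.str.lower()                                             # lowercase
    cleaned = cleaned.apply(lambda x: " ".join(x.split()))                   # extra whitespace
    cleaned = cleaned.str.replace(r"[^\w\s]", "", regex=True)                # punctuation
    cleaned = cleaned.str.replace(r"\b(\w+)(\s+\1)+\b", r"\1", regex=True)   # repeated words
    # Split alphanumeric words; this also normalizes any leftover whitespace
    cleaned = cleaned.apply(lambda x: " ".join(re.findall(r"[^\W\d]+|\d+", x)))
    return cleaned

# The first review from the table at the top of this guide
sample = pd.Series(["Battery health health as received was 96percentage, which I was happy about @896."])
result = clean_reviews(sample)
print(result[0])
# battery health as received was 96 percentage which i was happy about 896
```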