
Cleaning String Data in Python Hands-on Practice
In this lab, you'll master the essentials of text cleaning in Python, starting from basic operations like importing data and manipulating strings, to advanced techniques involving regular expressions and BeautifulSoup. You'll learn efficient methods for handling and preprocessing text data, including string joining, splitting, case conversion, and cleaning HTML content. By the end, you'll have honed your skills in preparing text data for analysis, making you adept at tackling various data science and text analytics challenges.

Basic Text Cleaning
To review the concepts covered in this step, please refer to the Getting Started with String Cleaning Ops module of the Cleaning String Data in Python course.
Understanding how to perform basic text cleaning in Python is important because it forms the foundation of more advanced string cleaning operations. It helps in removing unwanted characters from strings, making them more readable and easier to process.
Before we dive into the text cleaning operations, let's start by importing and loading our data. This will set the stage for us to practice on real examples.
Task 1.1: Importing and Loading Data
Import Pandas and load the data from "Qualitative_Data.csv". After loading the data, print the first few rows to verify it's loaded correctly.
Hint
Use the read_csv() function in the Pandas library to import and read the CSV data. You can print the first few rows using the head() method on the DataFrame.
Solution
    import pandas as pd

    # Load the data
    df = pd.read_csv('Qualitative_Data.csv')

    # Print the first few rows of the DataFrame
    print(df.head())
Task 1.2: Joining a List of Strings
Join a list of strings into a single string using the join() method. The list of strings will be the notes from previous conversations. Print the results.
Hint
Use the join() method on a string that represents the delimiter you want to use to join the strings. For example, if you want to join the strings with a space in between, you would do ' '.join(notes_list).
Solution
    notes_list = df['Notes From Previous Conversations'].tolist()
    notes_string = ' '.join(notes_list)
    print(notes_string)
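One caveat worth checking before you join: join() accepts only strings, so a missing value in the column (which pandas reads as a float NaN) raises a TypeError. Here is a minimal sketch of that behavior and one possible workaround. The sample list is made up and only stands in for the lab's column.

```python
# Hypothetical sample standing in for
# df['Notes From Previous Conversations'].tolist()
notes_list = ["Asked about pricing", float("nan"), "Wants a demo"]

try:
    " ".join(notes_list)
except TypeError as exc:
    # join() refuses any item that is not a str
    print(f"join failed: {exc}")

# Keep only real strings before joining
clean_notes = [note for note in notes_list if isinstance(note, str)]
notes_string = " ".join(clean_notes)
print(notes_string)
```

If the lab's column has no missing values, the plain solution above works unchanged.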
Task 1.3: Splitting a String
Split the string you created in Task 1.2 into a list of words using the split() method. Print the results.
Hint
Use the split() method on the string you want to split. If you want to split the string by spaces, you would do notes_string.split(' ').
Solution
    words_list = notes_string.split(' ')
    print(words_list)
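Note that split(' ') and split() with no argument behave differently. The first splits on every single space character, while the second splits on any run of whitespace. A short sketch on a made-up string (not the lab data) shows the difference:

```python
# Hypothetical sample containing a double space and a newline
notes_string = "Asked  about pricing\nWants a demo"

# Splits on each single space: keeps an empty string and the newline
by_space = notes_string.split(" ")

# Splits on any run of whitespace: no empty strings, newline handled
by_whitespace = notes_string.split()

print(by_space)
print(by_whitespace)
```

Which one you want depends on whether empty entries carry meaning in your data.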
Task 1.4: Removing Specific Characters
In this task, you will remove specific characters from the string you created in Task 1.2 using the replace() method. Let's remove all occurrences of the word 'product'. Print the result.
Hint
Use the replace() method on the string you want to modify. The first argument should be the substring you want to replace, and the second argument should be what you want to replace it with. For example, to remove all occurrences of the word 'banana', you would do example_string.replace('banana', '').
Solution
    notes_string_no_product = notes_string.replace('product', '')
    print(notes_string_no_product)
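Two side effects of replace() are easy to miss: it is case-sensitive, and removing a word leaves the surrounding spaces behind. The sketch below uses an invented sentence to show both, then collapses the leftover whitespace with a split-and-join. The tidy-up step is an optional extra, not part of the lab's solution.

```python
# Hypothetical sample sentence
notes_string = "Product demo went well; the product team liked the product."

# 'Product' with a capital P is not matched
removed = notes_string.replace("product", "")
print(repr(removed))

# Collapse the doubled spaces left behind by the removal
tidy = " ".join(removed.split())
print(repr(tidy))
```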
Vectorized Text Cleaning
To review the concepts covered in this step, please refer to the Getting Started with String Cleaning Ops module of the Cleaning String Data in Python course.
Vectorized text cleaning is important because it allows us to perform operations on entire collections of strings in a fast and efficient manner. This is particularly useful when dealing with large datasets.
Now, let's move on to vectorized text cleaning. The goal here is to learn how to perform operations on entire collections of strings at once. We'll be using the pandas library for this purpose. Specifically, we'll practice how to extract the second word from each string, lowercase all strings in a DataFrame column, and split strings into multiple words. Remember to use the str accessor to call string methods on pandas Series objects.
Task 2.1: Importing Pandas and Loading the Dataset
First, import the pandas library and then load the 'Qualitative_Data.csv' file into a pandas DataFrame. After loading, display the first few rows of the DataFrame to ensure it's loaded correctly.
Hint
To import pandas, use the import keyword followed by pandas as pd. Use the pd.read_csv() function to read the CSV file and df.head() to display the first few rows of the DataFrame.
Solution
    import pandas as pd

    # Load the dataset
    df = pd.read_csv('Qualitative_Data.csv')

    # Display the first few rows
    df.head()
Task 2.2: Extracting the Second Word from Each String
Extract the second word from each string in the 'Notes From Previous Conversations' column. Display the first 5 rows of the resulting Pandas Series object.
Hint
Use the Series.str.split() method to split the strings into words and then use indexing to get the second word. For example, df['column_name'].str.split().str[1].
Solution
    second_words = df['Notes From Previous Conversations'].str.split().str[1]
    second_words.head()
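It helps to know what .str[1] does when a row has no second word. As a rough sketch on a made-up Series (the values are hypothetical, not taken from the lab's CSV), rows with a single word or a missing value come back as missing rather than raising an error:

```python
import pandas as pd

# Hypothetical sample standing in for the lab's column
notes = pd.Series(["Asked about pricing", "Renewal", None])

# Split each string on whitespace, then take the element at index 1
second_words = notes.str.split().str[1]
print(second_words)
```

Only the first row yields a word ('about'); the other two rows are missing values, which you can filter out with dropna() if needed.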
Task 2.3: Lowercasing All Strings
Lowercase all strings in the 'Notes From Previous Conversations' column. Print the first few rows of the results.
Hint
Use the Series.str.lower() method to lowercase all strings. For example, df['column_name'].str.lower().
Solution
    df['Notes From Previous Conversations'] = df['Notes From Previous Conversations'].str.lower()
    df['Notes From Previous Conversations'].head()
Task 2.4: Splitting Strings into Multiple Words
Split the strings in the 'Notes From Previous Conversations' column into multiple words. Display the results.
Hint
Use the Series.str.split() method to split the strings into words. For example, df['column_name'].str.split().
Solution
    words = df['Notes From Previous Conversations'].str.split()
    print(words)
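By default, str.split() returns one list of words per row. If you would rather have one word per column, the expand=True argument does that. The following sketch chains it with str.lower() on a small invented DataFrame; only the column name comes from the lab, the rows are made up.

```python
import pandas as pd

# Hypothetical rows; only the column name matches the lab's dataset
df = pd.DataFrame(
    {"Notes From Previous Conversations": ["Asked About Pricing", "Wants Demo"]}
)

# Lowercase every string in the column
lowered = df["Notes From Previous Conversations"].str.lower()

# expand=True spreads the words across columns instead of returning lists
word_columns = lowered.str.split(expand=True)
print(word_columns)
```

Shorter rows are padded with missing values so that every row has the same number of columns.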
Advanced Text Cleaning with Regular Expressions
Advanced Text Cleaning
To review the concepts covered in this step, please refer to the Advanced String Cleaning Ops module of the Cleaning String Data in Python course.
Understanding how to use regular expressions for text cleaning is crucial as they provide a powerful and flexible way to search, replace, and manipulate text. This is particularly useful when dealing with complex string patterns.
In this step, we'll use the re and bs4 modules in Python to practice some advanced text cleaning techniques on the provided string data. We will learn how to remove URLs, hashtags, specific substrings, and numbers from strings, and how to correct HTML. The re.sub() function is your main tool for this step, but we'll also touch on BeautifulSoup.get_text().
Task 3.1: Removing Numbers from Text
Let's start by removing all numbers from the provided string data. After removing, print the results.
Hint
Use the re.sub() function with the pattern r'\d+' to replace all numbers with an empty string. The \d in the regex pattern represents any digit, and the + sign indicates one or more times. Therefore, \d+ matches one or more digits.
Solution
    import re

    # Assuming 'data' is the provided string with numbers
    cleaned_data = re.sub(r'\d+', '', data)
    print(cleaned_data)
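To see what this pattern does in practice, here is a small sketch on an invented string (the lab's actual data variable is not shown here). Notice that digits inside a token such as 'v2' are removed too, and that the spaces around each number stay behind.

```python
import re

# Hypothetical sample; the lab supplies its own 'data' string
data = "Order 66 shipped in 3 days for plan v2"

# Replace every run of one or more digits with nothing
cleaned_data = re.sub(r"\d+", "", data)
print(repr(cleaned_data))
```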
Task 3.2: Removing Specific Substrings from Text
Next, remove the substring 'product' from the provided string data and print the results.
Hint
Use the re.sub() function with the pattern r'product' to replace the substring 'product' with an empty string. The regex pattern here is a direct match for the word 'product' in the data.
Solution
    # Assuming 'cleaned_data' is the current state of the data
    cleaned_data = re.sub(r'product', '', cleaned_data)
    print(cleaned_data)
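The plain pattern also matches 'product' inside longer words and skips capitalized forms. If that matters for your data, one possible variation (an optional extension, not the lab's required solution) is to add word boundaries with \b and the re.IGNORECASE flag. The sample sentence below is made up.

```python
import re

# Hypothetical sample sentence
text = "Product feedback: the product is great, productivity improved"

# Direct match: misses 'Product' and cuts into 'productivity'
plain = re.sub(r"product", "", text)

# Whole-word, case-insensitive match
whole_word = re.sub(r"\bproduct\b", "", text, flags=re.IGNORECASE)

print(repr(plain))
print(repr(whole_word))
```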
Task 3.3: Cleaning and Correcting HTML
Now, replace all '&amp;' HTML entities in the provided string data with '&' and print the results.
Hint
Use the re.sub() function with the pattern r'&amp;' to replace '&amp;' with '&'. This pattern looks for the exact sequence '&amp;' and replaces it with a single ampersand.
Solution
    # Assuming 'cleaned_data' is the current state of the data
    cleaned_data = re.sub(r'&amp;', '&', cleaned_data)
    print(cleaned_data)
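The re.sub() call handles only the one entity you name. If the text may contain other entities as well, the standard library's html.unescape() decodes them all in one pass. The sketch below compares the two on a made-up string; using html.unescape() is a suggested alternative here, not something the lab requires.

```python
import html
import re

# Hypothetical sample with several different HTML entities
text = "R&amp;D said &quot;ship it&quot; &lt;today&gt;"

# Replaces only the ampersand entity
only_amp = re.sub(r"&amp;", "&", text)

# Decodes every named or numeric entity
all_entities = html.unescape(text)

print(only_amp)
print(all_entities)
```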
Task 3.4: Removing URLs from Text
Remove all URLs from the provided string data and print the results.
Hint
Use the re.sub() function with the pattern r'http[s]?://\S+' to match and replace URLs with an empty string. In this pattern, http[s]? matches 'http' followed by an optional 's', and \S+ matches one or more non-whitespace characters, so the match continues until the next whitespace character.
Solution
    # Assuming 'cleaned_data' is the current state of the data
    cleaned_data = re.sub(r'http[s]?://\S+', '', cleaned_data)
    print(cleaned_data)
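Because \S+ runs until the next whitespace, any punctuation attached to the end of a URL is removed along with it. A quick sketch on an invented sentence (the example.com and example.org addresses are placeholders) makes this visible:

```python
import re

# Hypothetical sample; the URLs are placeholders
text = "Docs at https://example.com/guide, mirror at http://example.org/a?b=1 today"

# The comma after 'guide' is part of the non-whitespace run, so it goes too
no_urls = re.sub(r"http[s]?://\S+", "", text)
print(repr(no_urls))
```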
Task 3.5: Removing Hashtags from Text
Finally, remove all hashtags from the provided string data and print the results.
Hint
Use the re.sub() function with the pattern r'#\w+' to match and replace hashtags with an empty string. The # character in the regex pattern matches the hashtag symbol, and \w+ matches one or more word characters (letters, digits, or underscores).
Solution
    # Assuming 'cleaned_data' is the current state of the data
    cleaned_data = re.sub(r'#\w+', '', cleaned_data)
    print(cleaned_data)
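As a rough illustration on a made-up string, the pattern removes anything that looks like a hashtag, including numeric references such as '#42'. The final whitespace cleanup is an optional extra step, not part of the lab's solution.

```python
import re

# Hypothetical sample text
text = "Loved the launch #ProductHunt #v2_release! Issue #42 is fixed"

# Remove '#' followed by one or more word characters
no_tags = re.sub(r"#\w+", "", text)
print(repr(no_tags))

# Optional: collapse leftover runs of whitespace into single spaces
tidy = re.sub(r"\s+", " ", no_tags).strip()
print(repr(tidy))
```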
Task 3.6: Extracting Text from HTML Tags
Import the BeautifulSoup class from the bs4 module. Write a Python function that takes a string containing HTML as input and returns the text content of the HTML using the get_text() method. After extracting, print the results to verify.
Hint
You can use the BeautifulSoup class from the bs4 module to parse the HTML and extract the text. Here's an example of initializing the BeautifulSoup class with your data: soup = BeautifulSoup(html_string, 'html.parser')
Remember, get_text() will retrieve the text content from the parsed HTML.
Solution
    from bs4 import BeautifulSoup

    def extract_text(html_string):
        # Parse the HTML and return only its text content
        soup = BeautifulSoup(html_string, 'html.parser')
        return soup.get_text()

    # Assuming 'html_data' is the provided string containing HTML
    text_content = extract_text(html_data)
    print(text_content)