Cleaning String Data in Python Hands-on Practice

In this lab, you'll master the essentials of text cleaning in Python, starting from basic operations like importing data and manipulating strings, to advanced techniques involving regular expressions and BeautifulSoup. You'll learn efficient methods for handling and preprocessing text data, including string joining, splitting, case conversion, and cleaning HTML content. By the end, you'll have honed your skills in preparing text data for analysis, making you adept at tackling various data science and text analytics challenges.

Lab Info

Level: Beginner
Last updated: Jan 08, 2026
Duration: 34m

Challenge

Basic Text Cleaning

To review the concepts covered in this step, please refer to the Getting Started with String Cleaning Ops module of the Cleaning String Data in Python course.

Understanding how to perform basic text cleaning in Python is important because it forms the foundation of more advanced string cleaning operations. It helps in removing unwanted characters from strings, making them more readable and easier to process.

Before we dive into the text cleaning operations, let's start by importing and loading our data. This will set the stage for us to practice on real examples.

Task 1.1: Importing and Loading Data

Import Pandas and load the data from "Qualitative_Data.csv". After loading the data, print the first few rows to verify it's loaded correctly.

🔍 Hint

Use the read_csv() function in the Pandas library to import and read the CSV data. You can print the first few rows using the head() method on the dataframe.
🔑 Solution

import pandas as pd

# Load the data
df = pd.read_csv('Qualitative_Data.csv')

# Print the first few rows of the dataframe
print(df.head())
Task 1.2: Joining a List of Strings

Join a list of strings into a single string using the join() method. The list of strings will be the notes from previous conversations. Print the results.

🔍 Hint

Use the join() method on a string that represents the delimiter you want to use to join the strings. For example, if you want to join the strings with a space in between, you would do ' '.join(notes_list).
🔑 Solution

notes_list = df['Notes From Previous Conversations'].tolist()
notes_string = ' '.join(notes_list)
print(notes_string)
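One thing to watch in the join step: str.join() raises a TypeError if any element of the list is not a string, and cells that pandas reads as empty arrive as float NaN values. The sketch below uses a small made-up list (not the lab's data) to show one way to keep only the string entries before joining.

```python
# Hypothetical sample standing in for the notes column; None and NaN mimic empty cells
sample_notes = ["Called about pricing", None, "Wants a demo", float("nan")]

# Keep only real strings so that join() does not raise a TypeError
clean_notes = [note for note in sample_notes if isinstance(note, str)]

joined_notes = " ".join(clean_notes)
print(joined_notes)  # Called about pricing Wants a demo
```

If the column in your copy of the data has no missing values, the plain join in the solution is enough.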
Task 1.3: Splitting a String

Split the string you created in Task 1.2 into a list of words using the split() method. Print the results.

🔍 Hint

Use the split() method on the string you want to split. If you want to split the string by spaces, you would do notes_string.split(' ').
🔑 Solution

words_list = notes_string.split(' ')
print(words_list)
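The choice of separator matters more than it may seem. As a small illustration on a made-up string, split(' ') cuts only on single space characters, while split() with no argument cuts on any run of whitespace and drops empty pieces.

```python
# Hypothetical sample with a double space, a newline, and a trailing space
sample = "great  product\nfast shipping "

by_space = sample.split(' ')   # splits on each single space
by_whitespace = sample.split() # splits on any run of whitespace

print(by_space)       # ['great', '', 'product\nfast', 'shipping', '']
print(by_whitespace)  # ['great', 'product', 'fast', 'shipping']
```

For free-text notes, the no-argument form usually gives a cleaner word list.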
Task 1.4: Removing Specific Characters

In this task, you will remove specific characters from the string you created in Task 1.2 using the replace() method. Let's remove all occurrences of the word 'product'. Print the result.

🔍 Hint

Use the replace() method on the string you want to modify. The first argument should be the substring you want to replace, and the second argument should be what you want to replace it with. For example, to remove all occurrences of the word 'banana', you would do example_string.replace('banana', '').
🔑 Solution

notes_string_no_product = notes_string.replace('product', '')
print(notes_string_no_product)
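Removing a word with replace() leaves the spaces that surrounded it, so the result often contains double spaces. A minimal sketch on a made-up sentence, reusing split() and join() from the earlier tasks to tidy up:

```python
# Hypothetical sample sentence, not taken from the lab data
text = "The product arrived late and the product box was damaged"

removed = text.replace('product', '')
print(removed)  # The  arrived late and the  box was damaged

# Split on whitespace and rejoin with single spaces to collapse the gaps
tidy = ' '.join(removed.split())
print(tidy)     # The arrived late and the box was damaged
```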
Challenge

Vectorized Text Cleaning

To review the concepts covered in this step, please refer to the Getting Started with String Cleaning Ops module of the Cleaning String Data in Python course.

Vectorized text cleaning is important because it allows us to perform operations on entire collections of strings in a fast and efficient manner. This is particularly useful when dealing with large datasets.

Now, let's move on to vectorized text cleaning. We'll use the pandas library to extract the second word from each string, lowercase all strings in a DataFrame column, and split strings into multiple words. Remember to use the str accessor to call string methods on pandas Series objects.

Task 2.1: Importing Pandas and Loading the Dataset

First, import the pandas library and then load the 'Qualitative_Data.csv' file into a pandas DataFrame. After loading, display the first few rows of the DataFrame to ensure it's loaded correctly.

🔍 Hint

To import pandas, use the import keyword followed by pandas as pd. Use the pd.read_csv() function to read the csv file and df.head() to display the first few rows of the DataFrame.
🔑 Solution

import pandas as pd

# Load the dataset
df = pd.read_csv('Qualitative_Data.csv')

# Display the first few rows
df.head()
Task 2.2: Extracting the Second Word from Each String

Extract the second word from each string in the 'Notes From Previous Conversations' column. Display the first 5 rows of the resulting Pandas Series object.

🔍 Hint

Use the Series.str.split() function to split the strings into words and then use indexing to get the second word. For example, df['column_name'].str.split().str[1].
🔑 Solution

second_words = df['Notes From Previous Conversations'].str.split().str[1]
second_words.head()
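It can help to know what happens when a note has fewer than two words. In the sketch below, which uses a made-up Series rather than the lab's column, .str[1] yields a missing value for short or missing entries instead of raising an error, and fillna() is one possible way to replace those gaps.

```python
import pandas as pd

# Hypothetical notes: a normal entry, a one-word entry, and a missing entry
notes = pd.Series(["Customer requested refund", "Resolved", None])

second_words = notes.str.split().str[1]
print(second_words.isna().tolist())  # [False, True, True]

# Optionally replace the missing results with an empty string
second_words_filled = second_words.fillna("")
print(second_words_filled.tolist())  # ['requested', '', '']
```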
Task 2.3: Lowercasing All Strings

Lowercase all strings in the 'Notes From Previous Conversations' column. Print the first few rows of the results.

🔍 Hint

Use the Series.str.lower() function to lowercase all strings. For example, df['column_name'].str.lower().
🔑 Solution

df['Notes From Previous Conversations'] = df['Notes From Previous Conversations'].str.lower()
df['Notes From Previous Conversations'].head()
Task 2.4: Splitting Strings into Multiple Words

Split the strings in the 'Notes From Previous Conversations' column into multiple words. Display the results.

🔍 Hint

Use the Series.str.split() function to split the strings into words. For example, df['column_name'].str.split().
🔑 Solution

words = df['Notes From Previous Conversations'].str.split()
print(words)
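By default, str.split() returns one list per row. As an optional variation, sketched here on a made-up Series, passing expand=True spreads the words across DataFrame columns, and .str.len() on the split result gives a word count per row.

```python
import pandas as pd

# Hypothetical notes, not taken from the lab data
notes = pd.Series(["Follow up Monday", "Send invoice"])

lowered = notes.str.lower()

# One column per word; shorter rows are padded with missing values
columns = lowered.str.split(expand=True)
print(columns)

# Number of words in each note
word_counts = lowered.str.split().str.len()
print(word_counts.tolist())  # [3, 2]
```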
Challenge

Advanced Text Cleaning with Regular Expressions

To review the concepts covered in this step, please refer to the Advanced String Cleaning Ops module of the Cleaning String Data in Python course.

Understanding how to use regular expressions for text cleaning is crucial as they provide a powerful and flexible way to search, replace, and manipulate text. This is particularly useful when dealing with complex string patterns.

In this step, we'll use the re and bs4 modules in Python to practice some advanced text cleaning techniques on the provided string data. We will learn how to remove URLs, hashtags, specific substrings, and numbers from strings, and how to correct HTML entities. The re.sub() function is your main tool for this step, but we'll also touch on BeautifulSoup.get_text().

Task 3.1: Removing Numbers from Text

Let's start by removing all numbers from the provided string data. After removing, print the results.

🔍 Hint

Use the re.sub() function with the pattern r'\d+' to replace all numbers with an empty string. The \d in the regex pattern represents any digit, and the + sign indicates one or more times. Therefore, \d+ matches one or more digits.
🔑 Solution

import re

# Assuming 'data' is the provided string with numbers
cleaned_data = re.sub(r'\d+', '', data)
print(cleaned_data)
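The pattern \d+ removes every run of digits, including digits embedded in tokens such as model names. If only standalone numbers should go, a word-boundary variant is one option. The sample string below is made up for illustration.

```python
import re

# Hypothetical sample containing standalone numbers and a model name with digits
sample = "Called 3 times about model X200 in 2023"

all_digits = re.sub(r'\d+', '', sample)
print(all_digits)       # Called  times about model X in 

# \b requires a word boundary on both sides, so 'X200' is left intact
standalone_only = re.sub(r'\b\d+\b', '', sample)
print(standalone_only)  # Called  times about model X200 in 
```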
Task 3.2: Removing Specific Substrings from Text

Next, remove the substring 'product' from the provided string data and print the results.

🔍 Hint

Use the re.sub() function with the pattern r'product' to replace the substring 'product' with an empty string. The regex pattern here is a direct match for the word 'product' in the data.
🔑 Solution

# Assuming 'cleaned_data' is the current state of the data
cleaned_data = re.sub(r'product', '', cleaned_data)
print(cleaned_data)
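Note that the pattern r'product' is case-sensitive, so a capitalized 'Product' at the start of a sentence would survive. A minimal sketch on a made-up string, using the flags argument of re.sub() to ignore case:

```python
import re

# Hypothetical sample with both capitalized and lowercase occurrences
sample = "Product demo went well; product pricing is unclear"

case_sensitive = re.sub(r'product', '', sample)
print(case_sensitive)    # Product demo went well;  pricing is unclear

case_insensitive = re.sub(r'product', '', sample, flags=re.IGNORECASE)
print(case_insensitive)  #  demo went well;  pricing is unclear
```

If the text has already been lowercased, the flag is unnecessary.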
Task 3.3: Cleaning and Correcting HTML

Now, replace all '&amp;' HTML entities in the provided string data with '&' and print the results.

🔍 Hint

Use the re.sub() function with the pattern r'&amp;' to replace '&amp;' with '&'. This pattern looks for the exact sequence '&amp;' and replaces it with a single ampersand.
🔑 Solution

# Assuming 'cleaned_data' is the current state of the data
cleaned_data = re.sub(r'&amp;', '&', cleaned_data)
print(cleaned_data)
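The re.sub() approach handles one entity at a time. As an alternative that is not part of the lab's solution, the standard library's html.unescape() decodes all named and numeric HTML entities in one call. The sample string is made up.

```python
import html
import re

# Hypothetical sample with an escaped ampersand and escaped angle brackets
raw = "Terms &amp; Conditions &lt;draft&gt; &amp; pricing"

# The lab's approach: only the ampersand entity is decoded
only_ampersands = re.sub(r'&amp;', '&', raw)
print(only_ampersands)  # Terms & Conditions &lt;draft&gt; & pricing

# Standard-library alternative: every entity is decoded
all_entities = html.unescape(raw)
print(all_entities)     # Terms & Conditions <draft> & pricing
```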
Task 3.4: Removing URLs from Text

Remove all URLs from the provided string data and print the results.

🔍 Hint

Use the re.sub() function with the pattern r'http[s]?://\S+' to match and replace URLs with an empty string. In this pattern, http[s]? matches 'http' followed by an optional 's', and \S+ matches one or more non-whitespace characters, so the match continues until the next whitespace character or the end of the string.
🔑 Solution

# Assuming 'cleaned_data' is the current state of the data
cleaned_data = re.sub(r'http[s]?://\S+', '', cleaned_data)
print(cleaned_data)
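Because \S+ consumes everything up to the next whitespace, punctuation that directly follows a URL is removed along with it. Inspecting the matches with re.findall() before substituting is one way to check what will be deleted. The sample below is made up, and 'https?' is simply a shorter spelling of 'http[s]?'.

```python
import re

# Hypothetical sample with URLs followed by a comma and a period
sample = "Docs at https://example.com/guide, mirror at http://example.org."

url_pattern = r'https?://\S+'

matches = re.findall(url_pattern, sample)
print(matches)  # ['https://example.com/guide,', 'http://example.org.']

without_urls = re.sub(url_pattern, '', sample)
print(without_urls)  # Docs at  mirror at 
```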
Task 3.5: Removing Hashtags from Text

Finally, remove all hashtags from the provided string data and print the results.

🔍 Hint

Use the re.sub() function with the pattern r'#\w+' to match and replace hashtags with an empty string. The # character in the regex pattern matches the hashtag symbol, and \w+ matches one or more word characters (letters, digits, or underscores).
🔑 Solution

# Assuming 'cleaned_data' is the current state of the data
cleaned_data = re.sub(r'#\w+', '', cleaned_data)
print(cleaned_data)
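The substitutions from Tasks 3.1 to 3.5 can also be gathered into a single helper. The function below is a hypothetical sketch, not the lab's solution: the name clean_text, the sample string, and the final whitespace-collapsing step are assumptions, and the steps are reordered so that URLs are removed before digits.

```python
import re

def clean_text(text):
    """Apply the regex substitutions from Tasks 3.1-3.5 to one string."""
    text = re.sub(r'&amp;', '&', text)        # decode escaped ampersands
    text = re.sub(r'https?://\S+', '', text)  # drop URLs
    text = re.sub(r'#\w+', '', text)          # drop hashtags
    text = re.sub(r'\d+', '', text)           # drop digits
    text = re.sub(r'product', '', text)       # drop the substring 'product'
    # Extra step (assumption): collapse the whitespace left behind
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical sample, not taken from the lab data
sample = "Loved the product &amp; support! See https://example.com/r/42 #happy 5 stars"
result = clean_text(sample)
print(result)  # Loved the & support! See stars
```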
Task 3.6: Extracting Text from HTML Tags

Import the BeautifulSoup class from the bs4 module. Write a Python function that takes a string containing HTML as input and returns the text content of the HTML using the get_text() method. After extracting, print the results to verify.
🔍 Hint

You can use the BeautifulSoup class from the bs4 module to parse the HTML and extract the text. Here's an example of initializing the BeautifulSoup class with your data:

soup = BeautifulSoup(html_string, 'html.parser')

Remember, get_text() will retrieve the text content from the parsed HTML.
🔑 Solution

from bs4 import BeautifulSoup

# Assuming 'html_data' is the provided string containing HTML
def extract_text(html_string):
    # Parse the HTML with Python's built-in parser
    soup = BeautifulSoup(html_string, 'html.parser')
    # Return only the text content, with the tags removed
    return soup.get_text()

text_content = extract_text(html_data)
print(text_content)
