Cleaning String Data in Python Hands-on Practice

In this lab, you'll practice the essentials of text cleaning in Python, from basic operations such as importing data and manipulating strings to more advanced techniques involving regular expressions and BeautifulSoup. You'll learn efficient methods for handling and preprocessing text data, including string joining, splitting, case conversion, and cleaning HTML content. By the end, you'll be better prepared to clean text data for analysis in data science and text analytics work.

Path Info

Level: Beginner
Duration: 37m
Published: Dec 12, 2023

Table of Contents

  1. Challenge

    Basic Text Cleaning

    To review the concepts covered in this step, please refer to the Getting Started with String Cleaning Ops module of the Cleaning String Data in Python course.

    Understanding how to perform basic text cleaning in Python is important because it forms the foundation of more advanced string cleaning operations. It helps in removing unwanted characters from strings, making them more readable and easier to process.

    Before we dive into the text cleaning operations, let's start by importing and loading our data. This will set the stage for us to practice on real examples.


    Task 1.1: Importing and Loading Data

    Import Pandas and load the data from "Qualitative_Data.csv". After loading the data, print the first few rows to verify it's loaded correctly.

    🔍 Hint

    Use the read_csv() function in the Pandas library to import and read the CSV data. You can print the first few rows using the head() method on the dataframe.

    🔑 Solution
    import pandas as pd
    
    # Load the data
    df = pd.read_csv('Qualitative_Data.csv')
    
    # Print the first few rows of the dataframe
    print(df.head())
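
    As a rough illustration of what read_csv() and head() do, here is a minimal sketch that reads a small made-up CSV from memory instead of Qualitative_Data.csv. The 'Customer' column, the sample rows, and the name df_sample are assumptions for demonstration only; the real file's contents may differ.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for Qualitative_Data.csv.
csv_text = (
    'Customer,Notes From Previous Conversations\n'
    'A,Likes the product\n'
    'B,Asked about pricing\n'
)

# read_csv() accepts a file path or any file-like object.
df_sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (5 when n is omitted).
print(df_sample.head(1))
print(df_sample.shape)
```

    Checking shape and column names right after loading is a quick way to confirm the file was parsed as expected.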
    

    Task 1.2: Joining a List of Strings

    Join a list of strings into a single string using the join() method. The list of strings will be the notes from previous conversations. Print the results.

    🔍 Hint

    Use the join() method on a string that represents the delimiter you want to use to join the strings. For example, if you want to join the strings with a space in between, you would do ' '.join(notes_list).

    🔑 Solution
    notes_list = df['Notes From Previous Conversations'].tolist()
    notes_string = ' '.join(notes_list)
    print(notes_string)
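
    One thing to watch for: join() only accepts strings, and pandas represents an empty CSV cell as a float NaN. The sketch below uses a made-up list in place of the real column (the sample notes and the name clean_notes are assumptions) to show one possible way to filter out non-string entries before joining.

```python
# Made-up stand-in for df['Notes From Previous Conversations'].tolist();
# the NaN mimics what pandas produces for an empty cell.
notes_list = ['Likes the product', float('nan'), 'Asked about pricing']

# str.join() raises TypeError on non-string items, so keep only strings.
clean_notes = [note for note in notes_list if isinstance(note, str)]

notes_string = ' '.join(clean_notes)
print(notes_string)  # Likes the product Asked about pricing
```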
    

    Task 1.3: Splitting a String

    Split the string you created in Task 1.2 into a list of words using the split() method. Print the results.

    🔍 Hint

    Use the split() method on the string you want to split. If you want to split the string by spaces, you would do notes_string.split(' ').

    🔑 Solution
    words_list = notes_string.split(' ')
    print(words_list)
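
    Note that split(' ') and split() with no argument behave differently when the text contains repeated spaces or newlines. A small sketch with a made-up string (not taken from the lab data):

```python
# Made-up text with a double space and a trailing newline.
text = 'great  product\n'

# split(' ') cuts at every single space, so a double space yields an empty string.
by_space = text.split(' ')

# split() with no argument cuts at runs of any whitespace and drops empty pieces.
by_whitespace = text.split()

print(by_space)       # ['great', '', 'product\n']
print(by_whitespace)  # ['great', 'product']
```

    If the goal is a clean list of words, the no-argument form is usually the more forgiving choice.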
    

    Task 1.4: Removing Specific Characters

    In this task, you will remove specific characters from the string you created in Task 1.2 using the replace() method. Let's remove all occurrences of the word 'product'. Print the result.

    🔍 Hint

    Use the replace() method on the string you want to modify. The first argument should be the substring you want to replace, and the second argument should be what you want to replace it with. For example, to remove all occurrences of the word 'banana', you would do example_string.replace('banana', '').

    🔑 Solution
    notes_string_no_product = notes_string.replace('product', '')
    print(notes_string_no_product)
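
    Two details of replace() are easy to miss: it is case-sensitive, and removing a word leaves the spaces around it behind. The sketch below, using a made-up sentence, shows both effects and one possible way to tidy the leftover spacing.

```python
# Made-up sentence; the lab's real text will differ.
text = 'The product is good. Product support was slow.'

# Only the lowercase 'product' matches; 'Product' is left alone.
removed = text.replace('product', '')
print(removed)  # The  is good. Product support was slow.

# Splitting on whitespace and re-joining collapses the double space.
tidy = ' '.join(removed.split())
print(tidy)  # The is good. Product support was slow.
```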
    
  2. Challenge

    Vectorized Text Cleaning

    To review the concepts covered in this step, please refer to the Getting Started with String Cleaning Ops module of the Cleaning String Data in Python course.

    Vectorized text cleaning is important because it allows us to perform operations on entire collections of strings in a fast and efficient manner. This is particularly useful when dealing with large datasets.

    Now, let's move on to vectorized text cleaning. The goal here is to learn how to perform operations on entire collections of strings at once, using the pandas library. Specifically, we'll practice extracting the second word from each string, lowercasing all strings in a column, and splitting strings into words. Remember to use the str accessor to call string methods on pandas Series objects.


    Task 2.1: Importing Pandas and Loading the Dataset

    First, import the pandas library and then load the 'Qualitative_Data.csv' file into a pandas DataFrame. After loading, display the first few rows of the DataFrame to ensure it's loaded correctly.

    🔍 Hint

    To import pandas, use the import keyword followed by pandas as pd. Use the pd.read_csv() function to read the csv file and df.head() to display the first few rows of the DataFrame.

    🔑 Solution
    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('Qualitative_Data.csv')
    
    # Display the first few rows
    df.head()
    

    Task 2.2: Extracting the Second Word from Each String

    Extract the second word from each string in the 'Notes From Previous Conversations' column. Display the first 5 rows of the resulting Pandas Series object.

    🔍 Hint

    Use the Series.str.split() function to split the strings into words and then use indexing to get the second word. For example, df['column_name'].str.split().str[1].

    🔑 Solution
    second_words = df['Notes From Previous Conversations'].str.split().str[1]
    second_words.head()
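
    It can help to know what .str[1] does when a row has fewer than two words or is missing. A minimal sketch on a made-up Series (the sample values and the name notes are assumptions, not the lab data):

```python
import pandas as pd

# Made-up notes: a normal sentence, a single word, and a missing value.
notes = pd.Series(['Likes the product', 'Unsure', None])

# .str.split() turns each string into a list of words;
# .str[1] picks index 1 from each list and yields NaN when it does not exist.
second = notes.str.split().str[1]
print(second)
```

    Rows without a second word come back as missing values rather than raising an IndexError, so a follow-up isna() check may be useful.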
    

    Task 2.3: Lowercasing All Strings

    Lowercase all strings in the 'Notes From Previous Conversations' column. Print the first few rows of the results.

    🔍 Hint

    Use the Series.str.lower() function to lowercase all strings. For example, df['column_name'].str.lower().

    🔑 Solution
    df['Notes From Previous Conversations'] = df['Notes From Previous Conversations'].str.lower()
    df['Notes From Previous Conversations'].head()
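
    Keep in mind that str.lower() returns a new Series and leaves the original untouched unless you assign the result back, as the solution does. A small sketch with made-up values (notes and lowered are hypothetical names):

```python
import pandas as pd

# Made-up notes, including a missing value.
notes = pd.Series(['Likes The PRODUCT', None])

# str.lower() builds a new Series; missing values stay missing.
lowered = notes.str.lower()

print(lowered.iloc[0])  # likes the product
print(notes.iloc[0])    # Likes The PRODUCT
```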
    

    Task 2.4: Splitting Strings into Multiple Words

    Split the strings in the 'Notes From Previous Conversations' column into multiple words. Display the results.

    🔍 Hint

    Use the Series.str.split() function to split the strings into words. For example, df['column_name'].str.split().

    🔑 Solution
    words = df['Notes From Previous Conversations'].str.split()
    print(words)
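
    By default, str.split() returns one list per row. If separate columns are easier to work with, the expand=True option is one alternative; the sketch below shows it on a made-up Series (the values and names are assumptions, and this goes slightly beyond what the task asks for).

```python
import pandas as pd

# Made-up notes with different word counts.
notes = pd.Series(['likes the product', 'asked about pricing today'])

# Default: each row becomes a Python list of words.
as_lists = notes.str.split()

# expand=True: each word gets its own column; shorter rows are padded
# with missing values.
as_columns = notes.str.split(expand=True)

print(as_lists.iloc[0])   # ['likes', 'the', 'product']
print(as_columns.shape)   # (2, 4)
```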
    
  3. Challenge

    Advanced Text Cleaning with Regular Expressions

    Advanced Text Cleaning

    To review the concepts covered in this step, please refer to the Advanced String Cleaning Ops module of the Cleaning String Data in Python course.

    Understanding how to use regular expressions for text cleaning is crucial as they provide a powerful and flexible way to search, replace, and manipulate text. This is particularly useful when dealing with complex string patterns.

    In this step, we'll use the re and bs4 modules in Python to practice some advanced text cleaning techniques on the provided string data. We will learn how to remove numbers, specific substrings, URLs, and hashtags from strings, and how to correct HTML entities. The re.sub() function is your main tool for this step, but we'll also touch on BeautifulSoup.get_text().


    Task 3.1: Removing Numbers from Text

    Let's start by removing all numbers from the provided string data. After removing, print the results.

    🔍 Hint

    Use the re.sub() function with the pattern r'\d+' to replace all numbers with an empty string. The \d in the regex pattern represents any digit, and the + sign indicates one or more times. Therefore, \d+ matches one or more digits.

    🔑 Solution
    import re
    
    # Assuming 'data' is the provided string with numbers
    cleaned_data = re.sub(r'\d+', '', data)
    print(cleaned_data)
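
    To see the pattern at work without the lab's data, here is a minimal sketch on a made-up sentence. The text and the names sample, no_digits, and collapsed are assumptions; the extra whitespace cleanup is optional and not part of the task.

```python
import re

# Made-up sentence containing several numbers.
sample = 'Order 66 shipped in 3 days for $19.99'

# \d+ matches each run of digits; punctuation such as '$' and '.' stays.
no_digits = re.sub(r'\d+', '', sample)
print(no_digits)  # Order  shipped in  days for $.

# Optional: collapse the doubled spaces left behind.
collapsed = re.sub(r'\s+', ' ', no_digits).strip()
print(collapsed)  # Order shipped in days for $.
```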
    

    Task 3.2: Removing Specific Substrings from Text

    Next, remove the substring 'product' from the provided string data and print the results.

    🔍 Hint

    Use the re.sub() function with the pattern r'product' to replace the substring 'product' with an empty string. The regex pattern here is a direct match for the word 'product' in the data.

    🔑 Solution
    # Assuming 'cleaned_data' is the current state of the data
    cleaned_data = re.sub(r'product', '', cleaned_data)
    print(cleaned_data)
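
    A plain pattern like r'product' also matches inside longer words and ignores capitalized forms. The sketch below, on a made-up sentence, compares it with a word-boundary, case-insensitive variant; whether that variant is what you want depends on the data, so treat it as an option rather than the expected solution.

```python
import re

# Made-up sentence; the lab's real text will differ.
sample = 'Product demo: the product and its byproducts'

# Plain match: case-sensitive and also hits 'product' inside 'byproducts'.
plain = re.sub(r'product', '', sample)
print(plain)  # Product demo: the  and its bys

# \b anchors the match at word boundaries; IGNORECASE also catches 'Product'.
whole_word = re.sub(r'\bproduct\b', '', sample, flags=re.IGNORECASE)
print(whole_word)  #  demo: the  and its byproducts
```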
    

    Task 3.3: Cleaning and Correcting HTML

    Now, replace all '&amp;' HTML entities in the provided string data with '&' and print the results.

    🔍 Hint

    Use the re.sub() function with the pattern r'&amp;' to replace '&amp;' with '&'. This pattern matches the exact sequence '&amp;' and replaces it with a single ampersand.

    🔑 Solution
    # Assuming 'cleaned_data' is the current state of the data
    cleaned_data = re.sub(r'&amp;', '&', cleaned_data)
    print(cleaned_data)
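
    As a point of comparison, the standard library's html.unescape() decodes every HTML entity at once, while a targeted re.sub() call only touches the one entity it names. A minimal sketch on a made-up string (sample, only_amp, and all_entities are hypothetical names):

```python
import html
import re

# Made-up text containing several HTML entities.
sample = 'Q&amp;A session &lt;b&gt;'

# Targeted replacement: only the ampersand entity is decoded.
only_amp = re.sub(r'&amp;', '&', sample)
print(only_amp)  # Q&A session &lt;b&gt;

# html.unescape() decodes all entities it recognizes.
all_entities = html.unescape(sample)
print(all_entities)  # Q&A session <b>
```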
    

    Task 3.4: Removing URLs from Text

    Remove all URLs from the provided string data and print the results.

    🔍 Hint

    Use the re.sub() function with the pattern r'http[s]?://\S+' to match and replace URLs with an empty string. In this pattern, http[s]? matches 'http' followed by an optional 's', and \S+ matches one or more non-whitespace characters, so the match runs until the next whitespace character or the end of the string.

    🔑 Solution
    # Assuming 'cleaned_data' is the current state of the data
    cleaned_data = re.sub(r'http[s]?://\S+', '', cleaned_data)
    print(cleaned_data)
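
    Because \S+ consumes every non-whitespace character, punctuation attached to a URL is removed along with it. A short sketch on a made-up sentence (the example.com and example.org addresses are placeholders):

```python
import re

# Made-up sentence with two placeholder URLs.
sample = 'See https://example.com/docs, or http://example.org for details.'

# The comma after the first URL is part of the \S+ run, so it disappears too.
no_urls = re.sub(r'http[s]?://\S+', '', sample)
print(no_urls)  # See  or  for details.
```

    This is usually acceptable for rough cleaning, but it is worth knowing when punctuation matters downstream.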
    

    Task 3.5: Removing Hashtags from Text

    Finally, remove all hashtags from the provided string data and print the results.

    🔍 Hint

    Use the re.sub() function with the pattern r'#\w+' to match and replace hashtags with an empty string. The # character in the regex pattern matches the hashtag symbol, and \w+ matches one or more word characters (letters, digits, or underscores).

    🔑 Solution
    # Assuming 'cleaned_data' is the current state of the data
    cleaned_data = re.sub(r'#\w+', '', cleaned_data)
    print(cleaned_data)
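
    Since \w also covers digits, this pattern removes numeric references such as '#42' as well, while a '#' that is not followed by a word character is left alone. A minimal sketch on a made-up sentence (all names and text here are assumptions):

```python
import re

# Made-up text with hashtags, an issue number, and a language name.
sample = 'Loving the update #NewRelease #v2 see issue #42 and C# notes'

# '#NewRelease', '#v2', and '#42' match; the '#' in 'C#' is followed
# by a space, so it does not match.
no_tags = re.sub(r'#\w+', '', sample)

# Collapse the leftover spaces for readability.
tidy = ' '.join(no_tags.split())
print(tidy)  # Loving the update see issue and C# notes
```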
    

    Task 3.6: Extracting Text from HTML Tags

    Import the BeautifulSoup class from the bs4 module. Write a Python function that takes a string containing HTML as input and returns the text content of the HTML using the get_text() method. After extracting, print the results to verify.

    🔍 Hint

    You can use the BeautifulSoup class from the bs4 module to parse the HTML and extract the text. Here's an example of initializing the BeautifulSoup class with your data:

    soup = BeautifulSoup(html_string, 'html.parser')
    

    Remember, get_text() will retrieve the text content from the parsed HTML.

    🔑 Solution
    from bs4 import BeautifulSoup
    
    def extract_text(html_string):
        # Parse the HTML and return only its text content
        soup = BeautifulSoup(html_string, 'html.parser')
        return soup.get_text()
    
    # Assuming 'html_data' is the provided string containing HTML
    text_content = extract_text(html_data)
    print(text_content)
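
    By default, get_text() concatenates text from adjacent tags with nothing in between, which can glue words together. The sketch below uses a made-up HTML snippet (sample_html and the variable names are assumptions) to compare the default with the separator and strip arguments that get_text() also accepts.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet; the lab's html_data will differ.
sample_html = '<div><h1>Feedback</h1><p>Great <b>support</b> team.</p></div>'
soup = BeautifulSoup(sample_html, 'html.parser')

# Default: text nodes are joined with no separator.
default_text = soup.get_text()
print(default_text)  # FeedbackGreat support team.

# separator=' ' puts a space between text nodes; strip=True trims each one.
spaced_text = soup.get_text(separator=' ', strip=True)
print(spaced_text)  # Feedback Great support team.
```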
    
