Guided: Regex Patterns for Practical Solutions

In this lab, you will process employee data using regular expressions. You will work on structured data to filter employee feedback, extract client information, analyze text, enhance client privacy, and systematically parse addresses.

Path Info

Level: Intermediate
Duration: 2h 5m
Published: Feb 20, 2024

Table of Contents

  1. Challenge

    Introduction

    Hello Developers! Your goal in this lab is to process employee data using regular expressions. You will use Python and a command-line interface to process the data.

    Lab Scope

    You will work on five steps to complete this lab, as shown:

    1. Use negated character class and quantification techniques.
    2. Use non-capturing groups and named capturing groups.
    3. Utilize lookahead and lookbehind assertions.
    4. Implement the re.sub() function.
    5. Implement the re.split() function.

    Lab Structure

    The lab directory structure is shown below:

    • content: This directory holds two data files: train_data.csv (used for creating the functions and validating the test cases) and test_data.csv (to be used with the menu.py file). Each of these files has seven features:
      1. fullID: The complete ID of the employee. It encodes one of four departments: HR (Human Resources), SA (Sales), DE (Developers), and IT (Support). For example, HR-758846
      2. name: The full name of the employee. For example, John Smith
      3. address: The full address of the employee. For example, 35, Second Avenue, Los Angeles, 90210
      4. task: The task assigned to the employee. For example, Project Brave: Work with Emily on the documentation task.
      5. report: The report submitted by the employees on the assigned task. For example, I have been helping Emily for three days.
      6. feedback: Feedback for the employees submitted by their managers. For sales employees, feedback has a date component at the end. For example, John has never let the company down.
      7. cemail: The client email addresses. For example, [email protected]
    • src: This directory contains six files - datafile, step1, step2, step3, step4, and step5. The datafile file is already populated with code and fetches the content of the CSV file. You will be working in the remaining step files.
    • solutions: This directory contains the solutions for each step. You can access these files if you get stuck.
    • menu.py: You must run this file after you have validated all your test cases from the five steps. This file fetches the functions you wrote and applies them to the test_data.csv file.
  2. Challenge

    Use Negated Character Class and Quantification Techniques

    In this first step, you must filter the data to keep only the employees who work in the sales department, along with their feedback.

    You will work in the src/step1.py file using the content/train_data.csv data. The script already has two populated functions:

    1. step1_features() function which fetches the fullID, name, and feedback features for this step.
    2. display_sales_emp() function which displays the names of sales employees.

    In the following task, you will use the negated character class and the `re.match()` function:

    • The negated character class is denoted by [^pattern]. The caret (^) must be placed immediately after the opening square bracket. Use it to match any single character that is not in the set. For instance, [^\d] matches any character that is not a digit.

    • The re.match(pattern, string) function tries to match the pattern only at the beginning of the string. It returns a match object on success and None otherwise.
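
    As a minimal sketch of the two ideas above, the snippet below applies a negated character class and `re.match()` to a few made-up IDs in the fullID format. The sample values and the variable names (`full_ids`, `sales_ids`) are assumptions for illustration and are not taken from the lab files.

```python
import re

# Hypothetical sample IDs in the fullID format described in the lab.
full_ids = ["HR-758846", "SA-120934", "DE-554201", "SA-887310"]

# [^\d]+ consumes one or more non-digit characters, so on "SA-120934"
# it matches the department prefix together with the hyphen.
prefix = re.match(r"[^\d]+", "SA-120934")
print(prefix.group())  # SA-

# re.match() only tries the pattern at the start of the string and
# returns None when the string does not begin with a match.
sales_ids = [fid for fid in full_ids if re.match(r"SA-", fid)]
print(sales_ids)  # ['SA-120934', 'SA-887310']
```

    Because `re.match()` returns None on failure, its result can be used directly as the condition of a list comprehension.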

    The task also requires you to implement a for loop and a list comprehension.

    Syntax of a for loop:

    for i in a_list: 
       print(i)
    

    Syntax of a list comprehension:

    [val_if_cond_true for i in a_list if a_condition]
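
    To make the two syntaxes concrete, here is a small sketch that builds the same list once with a for loop and once with a list comprehension. The `employees` data is hypothetical.

```python
# Hypothetical (department, name) pairs used only for illustration.
employees = [("SA", "John Smith"), ("HR", "Ava Lee"), ("SA", "Maria Diaz")]

# for loop version: collect the names of sales employees.
sales_loop = []
for dept, name in employees:
    if dept == "SA":
        sales_loop.append(name)

# Equivalent list comprehension.
sales_comp = [name for dept, name in employees if dept == "SA"]

print(sales_comp)                # ['John Smith', 'Maria Diaz']
print(sales_loop == sales_comp)  # True
```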

    info> DID YOU KNOW?
    1. The function of the `^` character varies depending on whether it is placed inside or outside the square brackets.
    2. Python also provides a dictionary comprehension `{}` similar to a list comprehension.

    In the following task, you will use greedy quantifiers and the `re.findall()` function, together with a list comprehension that contains nested `for` loops.

    • Greedy quantifiers match as much of the string as possible. Some of the greedy quantifiers are: `*`, `+`, and `{m,n}`.

    • The `re.findall(pattern, string)` function searches a string for all non-overlapping matches of a given pattern.

    • Syntax of a nested `for` loop list comprehension:

    ```python
    [val for sub in parent for val in sub]
    ```

    Lazy quantification works the opposite way: it matches the smallest possible string that satisfies the given pattern. The common lazy quantifiers are: `*?`, `+?`, and `{m,n}?`.
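
    The sketch below combines `re.findall()`, a nested `for` loop list comprehension, and a greedy versus lazy comparison. The feedback strings, the date layout, and the HTML-like sample are assumptions chosen for illustration; the lab data may look different.

```python
import re

# Hypothetical feedback strings; the dates are made up.
feedback = [
    "Met the target on 12-01-2024 and again on 15-02-2024.",
    "No dated entries here.",
]

# \d{2} and \d{4} are fixed-count quantifiers. re.findall() returns
# every non-overlapping match, so this builds a list of lists.
dates_per_row = [re.findall(r"\d{2}-\d{2}-\d{4}", text) for text in feedback]

# The nested for loop list comprehension flattens the list of lists.
all_dates = [date for row in dates_per_row for date in row]
print(all_dates)  # ['12-01-2024', '15-02-2024']

# Greedy versus lazy quantifier on the same input.
tagged = "<b>bold</b> and <i>italic</i>"
greedy = re.findall(r"<.+>", tagged)
lazy = re.findall(r"<.+?>", tagged)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```

    The greedy `.+` runs to the last `>` in the string, while the lazy `.+?` stops at the first `>` it can reach.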
  3. Challenge

    Use Non-Capturing Groups and Named Capturing Groups

    In this step, you will work in the src/step2.py file that holds the already populated step2_features() function. This function returns the fullID and cemail features.

    Groups serve two purposes in regular expressions:

    1. Grouping: Unites multiple tokens together.
    2. Capturing: Captures groups for future references.

    However, you do not always need to refer back to a group. Non-capturing groups cover this case: they are used only for grouping and are written with the syntax (?:pattern).
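
    A minimal sketch of a non-capturing group, using `re.search()` from Python's standard `re` module. The sample sentence and the assumption that an ID is a department code followed by six digits are illustrative only.

```python
import re

# Hypothetical ID embedded in a sentence.
text = "Ticket raised by SA-120934 yesterday."

# (?:HR|SA|DE|IT) groups the alternatives without capturing them,
# so the only numbered group is the six-digit number.
match = re.search(r"(?:HR|SA|DE|IT)-(\d{6})", text)

print(match.group())   # SA-120934
print(match.groups())  # ('120934',)
print(match.span())    # (17, 26)
```

    Without the group, the alternation would apply to the whole pattern; without `?:`, the department code would take up group 1.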

    In the following task, you will use a non-capturing group and the re.search(pattern, string) function, which scans the string and returns a match object for the first location where the pattern matches, or None if there is no match.

    Non-capturing groups are useful when you don't need to reference a specific part of a matched pattern. However, what if you need to extract specific parts from a pattern? This is where named-capturing groups prove invaluable.

    In the following task, you will use named-capturing groups to give names to groups and reference them in the future. They are represented by the syntax: (?P<name>pattern).

    info> INFORMATION ADD-ON!
    The name of a named-capturing group must start with either a letter or an underscore. It cannot start with a digit, though it can include digits after the first character. You can use the following link to learn about the named groups across various programming languages.
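
    In Python, a named-capturing group is written as `(?P<name>pattern)`. The sketch below uses two hypothetical group names, `user` and `domain`, on a made-up email address. The split at `@` is an assumption about what the task might extract.

```python
import re

# Hypothetical client email address.
email = "jane.doe@example.com"

# Two named groups: everything before the '@' and everything after it.
pattern = r"(?P<user>[^@]+)@(?P<domain>[^@]+)"
match = re.search(pattern, email)

print(match.group("user"))    # jane.doe
print(match.group("domain"))  # example.com
print(match.groupdict())      # {'user': 'jane.doe', 'domain': 'example.com'}
```

    Named groups remain reachable by number as well, so `match.group(1)` returns the same value as `match.group("user")`.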

  4. Challenge

    Utilize Lookahead and Lookbehind Assertions

    In this step, you will work in the src/step3.py file. Much like the other steps, the script already has a function, step3_features(), that extracts the task and report features used in this step.

    You will work on the following regular expression assertions:

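    • Lookahead: Matches content depending on the pattern that follows it.
      • Positive Lookahead: (?=pattern), the content must be followed by the pattern.
      • Negative Lookahead: (?!pattern), the content must not be followed by the pattern.
    • Lookbehind: Matches content depending on the pattern that precedes it.
      • Positive Lookbehind: (?<=pattern), the content must be preceded by the pattern.
      • Negative Lookbehind: (?<!pattern), the content must not be preceded by the pattern.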
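
    The four assertions can be sketched as follows. The task and report strings reuse the examples from the lab description; the specific patterns are assumptions for illustration and not the lab's solution.

```python
import re

task = "Project Brave: Work with Emily on the documentation task."
report = "I have been helping Emily for three days."

# Positive lookahead: a word only if it is immediately followed by ':'.
before_colon = re.findall(r"\w+(?=:)", task)
print(before_colon)  # ['Brave']

# Positive lookbehind: the word that comes right after "Project ".
after_project = re.findall(r"(?<=Project )\w+", task)
print(after_project)  # ['Brave']

# Negative lookahead: "Emily" only when it is not followed by " on".
print(re.findall(r"Emily(?! on)", task))    # []
print(re.findall(r"Emily(?! on)", report))  # ['Emily']

# Negative lookbehind: whole numbers that are not preceded by '-'.
counts = re.findall(r"(?<!-)\b\d+", "HR-758846 has 3 open tasks")
print(counts)  # ['3']
```

    The assertions consume no characters: the colon and the word "Project " are checked but are not part of the returned matches.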
  5. Challenge

    Implement the `re.sub()` Function

    Securing the identity of clients is important in any organization. In this step, you will mask the names of your clients in their email addresses with the pattern xxxx.

    To complete the task in this step, you will work in the src/step4.py file that already has a populated function, step4_features(), returning the clients' email addresses, cemail.

    In the following task, you will rely on the re.sub(pattern, replacement, content) function, which scans content for matches of a regex pattern and replaces each match with the replacement text, in this case xxxx. You will also create a new Python function. The syntax of a Python function is:

    def function_name(argument):
       # body
       return a_value

    info> DID YOU KNOW?
    To keep the script short, you can skip creating the `replace_name()` function and instead use the `re.sub()` function directly inside the list comprehension.
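
    A minimal sketch of masking with `re.sub()`, shown both through a `replace_name()` helper and inline in a list comprehension. It assumes that the client name is everything before the `@` and that the whole of it is masked; the sample addresses are made up, and the lab's expected masking rule may differ.

```python
import re

def replace_name(email):
    # Replace everything before the '@' with the mask "xxxx".
    # ^[^@]+ matches the leading run of characters that are not '@'.
    return re.sub(r"^[^@]+", "xxxx", email)

# Hypothetical client email addresses.
cemail = ["jane.doe@example.com", "raj@client.org"]

masked = [replace_name(address) for address in cemail]
print(masked)  # ['xxxx@example.com', 'xxxx@client.org']

# Same result with re.sub() called directly inside the comprehension.
masked_inline = [re.sub(r"^[^@]+", "xxxx", address) for address in cemail]
print(masked == masked_inline)  # True
```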
  6. Challenge

    Implement the `re.split()` Function

    In this final step of the lab, you will split each employee's address into separate components: house number, street name, city, and postal code.

    To do so, you will work in the src/step5.py file which already has a populated function, step5_features(). This function returns the complete address of each employee. You will use the re.split(pattern, content) function to complete this task. This function splits the content by a given pattern. In this case, the pattern is a comma (,). You can also control the number of splits produced by this function using the maxsplit argument.

    Congratulations! You have successfully completed this lab, thus further improving your regular expression knowledge.
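
    For reference, the splitting described in this step can be sketched as below, using the sample address from the lab description. The pattern `,\s*` (a comma plus any following whitespace) is an assumption made here so the parts come back without leading spaces; the lab itself only states that the pattern is a comma.

```python
import re

address = "35, Second Avenue, Los Angeles, 90210"

# Split on a comma followed by optional whitespace.
parts = re.split(r",\s*", address)
print(parts)  # ['35', 'Second Avenue', 'Los Angeles', '90210']

# maxsplit=1 stops after the first split and keeps the rest intact.
first_split = re.split(r",\s*", address, maxsplit=1)
print(first_split)  # ['35', 'Second Avenue, Los Angeles, 90210']
```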

    Test your Functions on New Data

    It is time to apply your functions to unseen data, test_data.csv. Run the menu.py file in the Terminal and follow the menu steps to process the file. Review the output of each step to validate the functionality of your implemented functions.

    info> TIPS FOR AN IMPROVED REGEX PATTERN
    1. Keep the regex simple and prioritize readability even if it increases the regex length.
    2. Do not feed raw data to your regex. First, clean the data.
    3. Use lazy quantifiers where they reduce backtracking; they are not faster in every case.
    4. Use anchors and quantifiers carefully.
    5. Always test your regex pattern against various input data to ensure you have considered all of the cases.
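
    One way to follow the last two tips is to keep a small table of inputs with the outcome you expect and check the pattern against all of them. The sketch below assumes, based on the HR-758846 example, that a valid fullID is a department code, a hyphen, and exactly six digits. The names `FULL_ID` and `cases` are hypothetical.

```python
import re

# re.fullmatch() requires the whole string to match, which has the
# same effect as anchoring the pattern at both ends.
FULL_ID = re.compile(r"(?:HR|SA|DE|IT)-\d{6}")

# Made-up inputs paired with the expected outcome.
cases = {
    "HR-758846": True,
    "SA-12345": False,    # too few digits
    "XX-123456": False,   # unknown department code
    "HR-7588461": False,  # too many digits
    " HR-758846": False,  # leading space
}

results = {text: bool(FULL_ID.fullmatch(text)) for text in cases}
print(results == cases)  # True
```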
