
Guided: Regex Patterns for Practical Solutions
In this lab, you will process employee data using regular expressions. You will work on structured data to filter employee feedback, extract client information, analyze text, enhance client privacy, and systematically parse addresses.

Challenge
Introduction
Hello Developers! Your goal in this lab is to process employee data using regular expressions. You will use Python and a command-line interface to process the data.
Lab Scope
You will work on five steps to complete this lab, as shown:
- Use negated character class and quantification techniques.
- Use non-capturing groups and named capturing groups.
- Utilize lookahead and lookbehind assertions.
- Implement the `re.sub()` function.
- Implement the `re.split()` function.
Lab Structure
The lab directory structure is shown below:
- `content`: This directory holds two data files: `train_data.csv` (used for creating the functions and validating the test cases) and `test_data.csv` (to be used with the `menu.py` file). Each of these files has seven features:
  - `fullID`: The complete ID of the employee. It encodes one of four departments: HR (Human Resources), SA (Sales), DE (Developers), and IT (Support). For example, `HR-758846`
  - `name`: The full name of the employee. For example, `John Smith`
  - `address`: The full address of the employee. For example, `35, Second Avenue, Los Angeles, 90210`
  - `task`: The task assigned to the employee. For example, `Project Brave: Work with Emily on the documentation task.`
  - `report`: The report submitted by the employee on the assigned task. For example, `I have been helping Emily for three days.`
  - `feedback`: Feedback for the employee submitted by their manager. For sales employees, the feedback has a date component at the end. For example, `John has never let the company down.`
  - `cemail`: The client email address. For example, [email protected]
- `src`: This directory contains six files: `datafile`, `step1`, `step2`, `step3`, `step4`, and `step5`. The `datafile` file is already populated with code and fetches the content of the CSV file. You will work in the remaining step files.
- `solutions`: This directory contains the solutions for each step. You can access these files if you get stuck.
- `menu.py`: Run this file after you have validated all your test cases from the five steps. It fetches the functions you wrote and applies them to the `test_data.csv` file.
Challenge
Use Negated Character Class and Quantification Techniques
In this first step, you must pick out the employees who work in the sales department, along with their feedback.
You will work in the `src/step1.py` file using the `content/train_data.csv` data. The script already has two populated functions:
- `step1_features()`: fetches the `fullID`, `name`, and `feedback` features for this step.
- `display_sales_emp()`: displays the names of sales employees.
- The negated character class is denoted by `[^pattern]`. The caret (`^`) must be placed immediately after the opening square bracket. Use it to match any character that is not in the listed set. For instance, `[^\d]` matches any single character that is not a digit.
- The `re.match(pattern, string)` function tries to match the pattern at the beginning of the string only. It returns a match object on success and `None` otherwise.
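As a rough illustration of the two ideas above, the sketch below applies `re.match()` and a negated character class to a few made-up IDs in the `fullID` format. The sample values and the variable names (`full_ids`, `sales_ids`, `prefixes`) are assumptions for illustration only and are not taken from the lab's data or solution files.

```python
import re

# Hypothetical sample IDs in the lab's DEPT-number format.
full_ids = ["HR-758846", "SA-120045", "DE-330912", "SA-998001"]

# re.match() only tries the pattern at the start of the string,
# so "SA-" selects IDs that begin with the sales prefix.
sales_ids = [fid for fid in full_ids if re.match(r"SA-", fid)]

# [^\d] matches any single character that is not a digit;
# [^\d]+ therefore captures the non-digit prefix of each ID.
prefixes = [re.match(r"[^\d]+", fid).group() for fid in full_ids]

print(sales_ids)  # ['SA-120045', 'SA-998001']
print(prefixes)   # ['HR-', 'SA-', 'DE-', 'SA-']
```

Because `re.match()` is anchored at the start, no explicit `^` anchor is needed in either pattern.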
The task also requires you to implement a `for` loop and a list comprehension.

Syntax of a `for` loop:
```python
for i in a_list:
    print(i)
```
Syntax of a list comprehension:
```python
[val_if_cond_true for i in a_list if a_condition]
```
info> **DID YOU KNOW?** <br> 1. The function of the <code>^</code> character varies depending on whether it is placed inside or outside the square brackets. <br>2. Python also provides a dictionary comprehension <code>{}</code> similar to a list comprehension.

In the following task, you will use greedy quantifiers and the `re.findall()` function, together with a nested `for` loop list comprehension.
* Greedy quantifiers match as much of the string as possible. Some of the greedy quantifiers are: `*`, `+`, and `{m,n}`.
* The `re.findall(pattern, string)` function searches a string for all non-overlapping matches of a given pattern.
* Syntax of a nested `for` loop list comprehension:
```python
[val for sub in parent for val in sub]
```
Lazy quantification works the opposite way: it matches the smallest possible string that satisfies the pattern. The common lazy quantifiers are `*?`, `+?`, and `{m,n}?`.
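The following sketch puts `re.findall()`, a nested list comprehension, and the greedy/lazy contrast side by side. The feedback sentences and the `DD-MM-YYYY` date format are assumptions made for the example; the lab's actual feedback text and date layout may differ.

```python
import re

# Hypothetical feedback strings; the date format is an assumption.
feedback = [
    "Met the target on 12-03-2023 and again on 28-03-2023.",
    "Closed two deals on 05-04-2023.",
]

# re.findall() returns every non-overlapping match in each string.
dates_per_row = [re.findall(r"\d{2}-\d{2}-\d{4}", text) for text in feedback]

# Nested list comprehension: flatten the list of lists into one list.
all_dates = [date for row in dates_per_row for date in row]
print(all_dates)  # ['12-03-2023', '28-03-2023', '05-04-2023']

# Greedy vs. lazy on the same input.
tagged = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", tagged)   # .* grabs as much as it can
lazy = re.findall(r"<b>.*?</b>", tagged)    # .*? stops at the first </b>
print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```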
Challenge
Use Non-Capturing Groups and Named Capturing Groups
In this step, you will work in the `src/step2.py` file that holds the already populated `step2_features()` function. This function returns the `fullID` and `cemail` features.
Groups serve two purposes in regular expressions:
- Grouping: Unites multiple tokens together.
- Capturing: Captures groups for future references.

A non-capturing group only groups and is represented by the syntax `(?:pattern)`.
In the following task, you will use a non-capturing group and the `re.search(pattern, string)` function, which scans the string and returns a match object for the first location where the pattern matches.
Non-capturing groups are useful when you don't need to reference a specific part of a matched pattern. However, what if you need to extract specific parts from a pattern? This is where named capturing groups prove invaluable.
In the following task, you will use named capturing groups to give names to groups and reference them later. They are represented by the syntax `(?P<name>pattern)`.
info> **INFORMATION ADD-ON!** <br>The name of a named capturing group must start with either a letter or an underscore. It cannot start with a digit, though it can include digits after the first character.
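To make the difference between the two group types concrete, here is a minimal sketch using `re.search()` on a made-up email address. The address, the list of top-level domains, and the group names `user` and `domain` are all assumptions chosen for illustration; they are not the lab's data or its expected solution.

```python
import re

# Hypothetical client email; real values come from the cemail feature.
email = "jane.doe@example.com"

# Non-capturing group: (?:com|org|net) groups the alternatives
# without creating a numbered group.
match = re.search(r"@\w+\.(?:com|org|net)$", email)
print(match.group())   # @example.com
print(match.groups())  # () because nothing was captured

# Named capturing groups: each part can be retrieved by name.
named = re.search(r"(?P<user>[^@]+)@(?P<domain>[^@]+)", email)
print(named.group("user"))    # jane.doe
print(named.group("domain"))  # example.com
print(named.groupdict())      # {'user': 'jane.doe', 'domain': 'example.com'}
```

`re.search()` returns `None` when nothing matches, so real code would usually check the result before calling `.group()`.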
Challenge
Utilize Lookahead and Lookbehind Assertions
In this step, you will work in the `src/step3.py` file. Much like the other steps, the script already has a function, `step3_features()`, that extracts the `task` and `report` features used in this step.
You will work on the following regular expression assertions:
- Lookahead: Matches content based on what follows it.
  - Positive Lookahead: `(?=pattern)` requires the content to be followed by the pattern.
  - Negative Lookahead: `(?!pattern)` requires the content not to be followed by the pattern.
- Lookbehind: Matches content based on what precedes it.
  - Positive Lookbehind: `(?<=pattern)` requires the content to be preceded by the pattern.
  - Negative Lookbehind: `(?<!pattern)` requires the content not to be preceded by the pattern.
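The four assertions listed above can be tried out on the sample `task` text from the lab description. The patterns and variable names below are only one possible sketch and are not taken from the lab's solution files.

```python
import re

# Sample task text from the lab description.
task = "Project Brave: Work with Emily on the documentation task."

# Positive lookbehind: a word preceded by "Project ".
project = re.search(r"(?<=Project )\w+", task).group()
print(project)  # Brave

# Positive lookahead: a word immediately followed by a colon.
before_colon = re.search(r"\w+(?=:)", task).group()
print(before_colon)  # Brave

# Negative lookahead: capitalized words NOT followed by a colon.
# The trailing \b stops the engine from backtracking to "Brav".
not_before_colon = re.findall(r"\b[A-Z]\w*\b(?!:)", task)
print(not_before_colon)  # ['Project', 'Work', 'Emily']

# Negative lookbehind: capitalized words NOT preceded by "Project ".
not_after_project = re.findall(r"(?<!Project )\b[A-Z]\w*", task)
print(not_after_project)  # ['Project', 'Work', 'Emily']
```

Both negative assertions exclude `Brave`, but for different reasons: one looks at what follows the word, the other at what precedes it. In Python's `re` module, a lookbehind pattern must have a fixed width.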
Challenge
Implement the `re.sub()` Function
Securing the identity of clients is important in any organization. In this step, you will mask the names of your clients in their email addresses with the pattern `xxxx`.
To complete the task in this step, you will work in the `src/step4.py` file that already has a populated function, `step4_features()`, which returns the clients' email addresses, `cemail`.
In the following task, you will rely on the `re.sub(pattern, replacement, content)` function, which scans the content for a regex pattern and replaces each match with user-defined content, in this case `xxxx`. You will also create a new Python function. The syntax of a Python function is:
```python
def function_name(argument):
    # body
    return a_value
```
info> **DID YOU KNOW?** <br>To keep the script short, you can skip creating the <code>replace_name()</code> function and instead use the <code>re.sub()</code> function directly inside the list comprehension.
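A minimal sketch of the masking idea is shown below. The function name `replace_name()` comes from the lab text, but its signature, the pattern, and the sample addresses are assumptions made for this example rather than the lab's reference solution.

```python
import re

def replace_name(email):
    """Mask the part of an email address before the @ with 'xxxx'."""
    # ^[^@]+ matches one or more non-@ characters from the start.
    return re.sub(r"^[^@]+", "xxxx", email)

# Hypothetical client emails; real values come from the cemail feature.
emails = ["jane.doe@example.com", "sam_lee@client.org"]

masked = [replace_name(e) for e in emails]
print(masked)  # ['xxxx@example.com', 'xxxx@client.org']

# Shorter variant: call re.sub() directly inside the comprehension.
masked_inline = [re.sub(r"^[^@]+", "xxxx", e) for e in emails]
print(masked_inline == masked)  # True
```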
Challenge
Implement the `re.split()` Function
In this final step of the lab, you will split each employee's address into separate components: house number, street name, city, and postal code.
To do so, you will work in the `src/step5.py` file, which already has a populated function, `step5_features()`. This function returns the complete `address` of each employee. You will use the `re.split(pattern, content)` function to complete this task. This function splits the content at each match of a given pattern. In this case, the pattern is a comma (`,`). You can also limit the number of splits this function performs using the `maxsplit` argument.
Congratulations! You have successfully completed this lab, thus further improving your regular expression knowledge.
Test your Functions on New Data
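Before moving on to new data, it can help to sanity-check the address split from the final step on a single value. The sketch below uses the sample address from the lab description. The pattern `,\s*` is a small variation on the plain comma: it also consumes the space after each comma. That choice, and the variable names, are assumptions of this example.

```python
import re

# Sample address from the lab description.
address = "35, Second Avenue, Los Angeles, 90210"

# Split on a comma followed by optional whitespace.
parts = re.split(r",\s*", address)
print(parts)  # ['35', 'Second Avenue', 'Los Angeles', '90210']

# Unpack the four components described in the step.
house_number, street, city, postal_code = parts

# maxsplit=1 performs only the first split and keeps the rest intact.
first, rest = re.split(r",\s*", address, maxsplit=1)
print(first)  # 35
print(rest)   # Second Avenue, Los Angeles, 90210
```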
It is time to apply the functions you wrote to untouched data, `test_data.csv`. Run the `menu.py` file in the Terminal and follow the menu steps to process the file. Review the output of each step to validate the functionality of your implemented functions.
info> **TIPS FOR AN IMPROVED REGEX PATTERN**
1. Keep the regex simple and prioritize readability even if it increases the regex length.
2. Do not feed raw data to your regex. First, clean the data.
3. Consider lazy quantifiers where a greedy match would overshoot; they can reduce backtracking, though the gain depends on the pattern and the input.
4. Use anchors and quantifiers carefully.
5. Always test your regex pattern against various input data to ensure you have considered all of the cases.