Feature Engineering in Action

In this Lab, learners perform feature engineering on real-world housing data by handling missing values, skewed distributions, outliers, mixed-type columns, and multicollinearity. They build a baseline Ridge regression model, apply transformations, encoding, scaling, and preprocessing workflows using pipelines and column transformers, then retrain and evaluate the improved model using RMSE and R-squared metrics. Learners also visualize predictions and analyze feature importance in this Machine Learning lab.

Lab Info

Last updated

Jul 10, 2026

Duration

46m

Introduction
Welcome to the Feature Engineering in Action Code Lab. In this hands-on lab, you will explore every stage of a real feature engineering workflow on a house prices dataset. The workflow includes flagging problematic columns, engineering derived columns, applying the right transformation to each column group, and demonstrating that the same machine learning model predicts substantially better once the features are properly prepared.

Concepts

Feature engineering is the process of transforming raw input columns into representations that make patterns easier for a model to learn. Think of it as the data equivalent of preparing ingredients before cooking a meal.

Derived features are new columns computed from existing ones. A house's raw year-built and year-sold columns each carry a limited signal on their own; their difference, HouseAge, is a direct predictor of price.
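As a minimal sketch of the HouseAge idea, assuming the year columns are named `YearBuilt` and `YrSold` (the lab's actual column names may differ) and using made-up rows:

```python
import pandas as pd

# Toy rows; YearBuilt and YrSold are assumed column names.
df = pd.DataFrame({
    "YearBuilt": [1995, 2003, 1960],
    "YrSold": [2008, 2008, 2010],
})

# Age of the house at the time of sale, in years.
df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
print(df)
```

A single subtraction turns two weak columns into one that a linear model can use directly.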

Skew correction reshapes left- or right-skewed numeric distributions so that they are closer to symmetric, which makes their relationship with the target easier for a linear model to fit. A log1p transform is usually enough for mild skew, whereas strongly skewed columns can be reshaped with a Box-Cox transformation, which requires strictly positive inputs.
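The two transforms can be compared on synthetic data. This is a sketch under assumptions: it uses `scipy.stats.boxcox` on lognormal samples rather than the lab's own transformer, and it fits the Box-Cox lambda on the training sample only before reusing it on unseen data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed, strictly positive data (not the lab dataset).
train = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
test = rng.lognormal(mean=0.0, sigma=1.0, size=200)

# log1p needs no fitting and also handles zeros.
train_log = np.log1p(train)

# Box-Cox estimates lambda from the training data...
train_bc, lam = stats.boxcox(train)
# ...and the same lambda is reused for unseen data.
test_bc = stats.boxcox(test, lmbda=lam)

print("raw skew:    ", stats.skew(train))
print("log1p skew:  ", stats.skew(train_log))
print("Box-Cox skew:", stats.skew(train_bc))
```

Both transforms should pull the skewness toward zero, with Box-Cox typically getting closer because its exponent is tuned to the data.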

Encoding strategy converts categorical columns into numbers without losing the structure the model needs. Ordinal columns, such as ExterQual with its quality ratings (Excellent > Good > Typical > Fair > Poor), are mapped to integers that preserve their order. Low-cardinality nominal columns get one-hot encoding. High-cardinality columns get target encoding, which replaces each category with a cross-validated mean of the target; this yields far fewer columns than one-hot encoding and often a stronger signal.
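A small sketch of the first two strategies with scikit-learn encoders. The category spellings and the roof-style values are illustrative assumptions, not the dataset's actual codes.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal: list the categories from worst to best so the integers keep that order.
quality_order = ["Poor", "Fair", "Typical", "Good", "Excellent"]
ratings = np.array([["Good"], ["Typical"], ["Excellent"], ["Fair"]])
ordinal = OrdinalEncoder(categories=[quality_order])
encoded_ord = ordinal.fit_transform(ratings)
print(encoded_ord.ravel())  # [3. 2. 4. 1.]

# Nominal, low cardinality: one column per category, no implied order.
roof = np.array([["Gable"], ["Hip"], ["Gable"], ["Flat"]])  # hypothetical values
onehot = OneHotEncoder(handle_unknown="ignore")
encoded_ohe = onehot.fit_transform(roof).toarray()
print(onehot.categories_[0], encoded_ohe.shape)
```

Passing the order explicitly matters: without it, OrdinalEncoder sorts categories alphabetically, which would scramble the quality ranking.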

RobustScaler standardizes each numeric column using the median and interquartile range instead of the mean and standard deviation. Because it is based on quantiles rather than moments, it is insensitive to the outliers that are common in real-world data, especially housing data.
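To make the median and IQR arithmetic concrete, the sketch below compares scikit-learn's `RobustScaler` with a manual computation on a toy column containing one extreme value.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Manual equivalent: subtract the median, divide by the interquartile range.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The extreme value stays extreme after scaling, but it no longer distorts the centre and spread used to scale the other rows.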

Prerequisites

Before starting this lab, you should be comfortable with:
- Python basics: functions, classes, list comprehensions, and imports
- pandas: reading CSVs, selecting columns by dtype, computing descriptive statistics
- scikit-learn basics: Pipeline, ColumnTransformer, train_test_split, and calling .fit() / .predict()
- Machine Learning fundamentals: what train, validation, and test splits are for; what RMSE and R-squared measure
- Regression concepts: why scaling matters for regularized models; what Ridge regression penalizes
The Scenario

You are a data scientist at TrueHome Analytics. Your team has just received the TrueHome Housing dataset — 1,460 sales records with 80 features covering everything from lot area and basement finish type to garage condition and sale month. Management wants to know how much better a machine learning regression model performs on the dataset with and without feature engineering. Your job is to first train a baseline Ridge model on the raw data with minimal preprocessing and record its RMSE and R-squared metrics. Later on, you must complete a feature engineering pipeline step by step: understand the data, identify its problems, create richer features, apply the appropriate transformations, retrain the model with identical hyperparameters, and report the gains.

The Application Structure

Key files in the lab environment

| File | Purpose |
|------|---------|
| house_price.csv | The TrueHome Housing dataset: 1,460 rows, 80 input features, target column SalePrice |
| understand.py | Use the loaded dataset to compute summary statistics and flag high-missing and outlier columns |
| baseline.py | Build the minimal-preprocessing baseline pipeline and record RMSE / R-squared |
| analysis.py | Engineer derived and interaction features; quantify skew, cardinality, near-zero variance, and collinearity |
| transform.py | Implement the full ColumnTransformer with skew correction, ordinal encoding, and target encoding |
| retrain.py | Retrain Ridge on the engineered feature set and compare results to the baseline |

Complete the tasks in order — each task's outputs (flagged columns, split indices, derived features) are consumed by the tasks that follow.

Run any of your files at any point with:
```
python3 filename.py
```
If you get stuck, refer to the solution code for each step in the solutions/ folder.

Understand the Dataset
Before building any model, you must understand the raw material you are working with. In this step, you will work with the TrueHome Housing dataset and answer three practical questions a data scientist must answer before touching a single feature:
- What does the data look like?
- Which columns are too empty to trust?
- Which columns are full of extremes?
The dataset has already been loaded for you as df. Your job in this step is to implement three analysis functions inside understand.py.

Missing values are cells for which the dataset contains no recorded entry. A feature that is mostly empty, for example one with nearly 95% missing values, provides little useful signal and can even hurt model performance. In this lab, you will flag any column with more than 40% missing values.
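The 40% cutoff could be applied along these lines. The function name `high_missing_columns` and the toy rows are hypothetical; the functions expected in understand.py may have different names and signatures.

```python
import numpy as np
import pandas as pd

def high_missing_columns(df, threshold=0.40):
    """Return the columns whose fraction of missing values exceeds threshold."""
    missing_fraction = df.isna().mean()
    return missing_fraction[missing_fraction > threshold].index.tolist()

# Toy frame, not the lab dataset.
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd", np.nan],   # 80% missing
    "LotArea": [8450, 9600, np.nan, 9550, 14260],       # 20% missing
    "Alley": [np.nan, "Grvl", np.nan, np.nan, "Pave"],  # 60% missing
})
flagged = high_missing_columns(toy)
print(flagged)  # ['PoolQC', 'Alley']
```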

Outliers are values that fall far from the majority of a column's distribution. In this lab, you will use the Interquartile Range (IQR) method to detect outliers. Any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is considered an outlier. Here, Q1 represents the 25th percentile, Q3 represents the 75th percentile, and IQR is equal to Q3 - Q1.

## What information did you gather from the above tasks?

The dataset contains 1,460 rows and 81 columns spread across 43 object, 35 int64, and 3 float64 data types. The analysis further revealed that six columns have more than 40% missing values, while 30 numeric columns contain outliers.
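For reference, the IQR rule used in this step to count outlier columns can be sketched as follows. The helper name `iqr_outlier_mask` is hypothetical and the values are toy data.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

values = pd.Series([10, 12, 11, 13, 12, 11, 95])
mask = iqr_outlier_mask(values)
print(values[mask].tolist())  # [95]
```

Here Q1 is 11 and Q3 is 12.5, so the fences sit at 8.75 and 14.75 and only 95 falls outside them. A numeric column would count as containing outliers whenever its mask has at least one True entry.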

Now, you must build a machine learning model on the same dataset and check its performance.

Baseline Model

A baseline model is the simplest possible version of your pipeline with minimal preprocessing, no feature engineering, and no tuning. Its only job is to give you an honest reference score. Without it, you have no way to measure whether the feature engineering you do in later steps actually helps.

The dataset has already been split into train, validation, and test sets for you via df_split. Your job is to build a Ridge regression pipeline on raw features and record its performance.

Note that the baseline applies no skew correction, no ordinal encoding, no interaction features, and no robust scaling. The engineered model will use the same Ridge estimator with alpha=1.0 and the same row indices, so any score difference you see later is attributable to better features and nothing else.

The baseline gives you two pairs of numbers: RMSE and R-squared on both the validation and test sets. These are your benchmarks for the final engineered model.
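A minimal-preprocessing baseline might look like the sketch below. The exact imputation and encoding choices in baseline.py are not specified here, so the steps shown (median imputation for numeric columns, constant imputation plus one-hot for categoricals) are assumptions, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the housing data.
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "Neighborhood": rng.choice(["A", "B", "C"], n),
})
y = 50 * X["GrLivArea"] + 10_000 * X["OverallQual"] + rng.normal(0, 10_000, n)
X.loc[::25, "GrLivArea"] = np.nan  # inject a few missing values

# Just enough preprocessing for Ridge to run: impute and one-hot encode.
minimal_prep = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["GrLivArea", "OverallQual"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="NA")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Neighborhood"]),
])
baseline = Pipeline([("prep", minimal_prep), ("model", Ridge(alpha=1.0))])

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
baseline.fit(X_train, y_train)
pred = baseline.predict(X_val)

rmse = float(np.sqrt(mean_squared_error(y_val, pred)))
r2 = r2_score(y_val, pred)
print(f"validation RMSE={rmse:,.0f}  R-squared={r2:.3f}")
```

Keeping the estimator and the split fixed is what makes the later comparison meaningful.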

Feature Analysis

Raw columns rarely carry their signal in the most useful form. Before transforming anything, you must understand the structure of what you have, starting with understanding which new columns are worth creating, which existing ones are skewed, how categorical columns are distributed, and which columns are redundant or nearly constant.

Derived Features

A single column often captures only part of a story. TotalBsmtSF tells you basement area, 1stFlrSF tells you first floor area, but a buyer cares about the total liveable footprint.

Derived features combine or transform existing columns into representations that expose these relationships directly, giving the model a signal it could not easily learn from the raw columns.

## Remove Near-Zero Variance (NZV) and Correlated Columns

Near-zero variance (NZV) columns are categorical columns in which a single value dominates. Typically, such a column has 95% or more of its rows sharing the same entry.

Pearson correlation measures the linear relationship between two numeric columns on a scale from −1 to 1. When two columns have high correlation (for the following task, you must consider high correlation to be above 0.75), they carry largely the same information, and keeping both inflates the feature space without adding signal.

Hence, you must drop the NZV columns and the redundant column from each highly correlated pair.
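One possible way to find both kinds of columns is sketched below. The helper names are hypothetical, and the choice to keep the first column of each correlated pair and drop the later one is an assumption; the lab's solution may pick differently.

```python
import numpy as np
import pandas as pd

def near_zero_variance_cols(df, cols, dominance=0.95):
    """Columns whose most frequent value covers at least `dominance` of the rows."""
    return [c for c in cols
            if df[c].value_counts(normalize=True, dropna=False).iloc[0] >= dominance]

def correlated_cols_to_drop(df, cols, threshold=0.75):
    """For each pair with |Pearson r| above threshold, mark the later column."""
    corr = df[cols].corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [c for c in upper.columns if (upper[c] > threshold).any()]

# Toy data, not the lab dataset.
rng = np.random.default_rng(0)
area = rng.normal(1500, 300, 200)
toy = pd.DataFrame({
    "GarageArea": area,
    "GarageCars": area / 250 + rng.normal(0, 0.1, 200),  # nearly a rescaled copy
    "LotFrontage": rng.normal(70, 20, 200),
    "Street": ["Pave"] * 198 + ["Grvl"] * 2,             # 99% one value
    "HouseStyle": ["1Story", "2Story"] * 100,
})

nzv = near_zero_variance_cols(toy, ["Street", "HouseStyle"])
redundant = correlated_cols_to_drop(toy, ["GarageArea", "GarageCars", "LotFrontage"])
toy_reduced = toy.drop(columns=nzv + redundant)
print(nzv, redundant, list(toy_reduced.columns))
```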

Skewed Features

Skewness is a measure of how asymmetric a distribution is. A perfectly symmetric column has skewness of 0.

In housing data, a few luxury properties have enormous values, which produces a long tail to the right and a positive skewness score.

In the following task, you must separate mildly skewed features (skewness between 0.5 and 1) from strongly skewed features (skewness above 1).

## Cardinality

Cardinality is the number of distinct values in a categorical column. One column could have only two unique values, whereas another could have 20 or more. This distinction matters because one-hot encoding a high-cardinality column produces a very wide, sparse matrix in which most entries are zero, adding noise rather than signal.

In the following task, you must use a threshold of 10: columns at or below the cutoff get one-hot encoding, whereas columns above it get target encoding in the feature transformation step.

In this step, you have derived new features and dropped the NZV and correlated columns to arrive at df_eng. The new DataFrame now has 63 columns (excluding TARGET) instead of the initial 81 columns.

In the next step, you will use this engineered DataFrame along with the following variables: MILD_SKEW_COLS, STRONG_SKEW_COLS, NORMAL_NUM_COLS, ORDINAL_COLS, ONE_HOT_ENC_COLS, and TARGET_ENC_COLS.
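As a rough sketch of how such column lists could be built from the thresholds used in this step (0.5 and 1 for skewness, 10 for cardinality): the helper names are hypothetical, the data is synthetic, and taking the absolute value of the skewness is an assumption so that left-skewed columns are bucketed too.

```python
import numpy as np
import pandas as pd

def split_by_skew(df, num_cols, mild=0.5, strong=1.0):
    """Bucket numeric columns by absolute skewness."""
    skew = df[num_cols].skew().abs()
    normal_cols = skew[skew < mild].index.tolist()
    mild_cols = skew[(skew >= mild) & (skew <= strong)].index.tolist()
    strong_cols = skew[skew > strong].index.tolist()
    return normal_cols, mild_cols, strong_cols

def split_by_cardinality(df, cat_cols, cutoff=10):
    """Columns at or below the cutoff get one-hot; the rest get target encoding."""
    n_unique = df[cat_cols].nunique()
    one_hot = n_unique[n_unique <= cutoff].index.tolist()
    target = n_unique[n_unique > cutoff].index.tolist()
    return one_hot, target

rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "symmetric": rng.normal(0, 1, 2000),
    "moderate_tail": rng.gamma(shape=8.0, scale=1.0, size=2000),
    "heavy_tail": rng.lognormal(0, 1, 2000),
    "few_levels": rng.choice(list("ABC"), 2000),
    "many_levels": rng.choice([f"N{i}" for i in range(25)], 2000),
})

num_cols = ["symmetric", "moderate_tail", "heavy_tail"]
normal_cols, mild_cols, strong_cols = split_by_skew(toy, num_cols)
one_hot_cols, target_cols = split_by_cardinality(toy, ["few_levels", "many_levels"])
print(normal_cols, mild_cols, strong_cols)
print(one_hot_cols, target_cols)
```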

Feature Transformation

With the analysis lists from the previous step in hand, this step puts them to work. You must assign a pipeline to each column group tailored to its characteristics — the right imputation strategy, the right transformation, and the right encoding. Finally, you must assemble all six pipelines into a single ColumnTransformer that applies every step consistently across train, validation, and test sets with no data leakage.

What Gets Applied to What?

Before writing any code, here is exactly what each column group receives and why:

| Column Group | Pipeline | What it does |
|---|---|---|
| NORMAL_NUM_COLS | num_pipe() | Median imputation -> RobustScaler |
| MILD_SKEW_COLS | num_pipe(log1p_tr) | Median imputation -> log1p transform -> RobustScaler |
| STRONG_SKEW_COLS | num_pipe(BoxCoxTransformer()) | Median imputation -> fitted Box-Cox -> RobustScaler |
| ORDINAL_COLS | ord_pipe | Constant imputation ("NA") -> OrdinalEncoder with defined category order |
| ONE_HOT_ENC_COLS | ohe_pipe | Constant imputation ("NA") -> OneHotEncoder |
| TARGET_ENC_COLS | te_pipe | Constant imputation ("NA") -> TargetEncoder (5-fold cross-validated) |
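Four of the six column groups above can be sketched with standard scikit-learn parts. The names `num_pipe`, `log1p_tr`, `ord_pipe`, and `ohe_pipe` come from the table, but their bodies here are assumptions, as are the category spellings and the toy rows. The Box-Cox and target-encoding groups are left out to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (FunctionTransformer, OneHotEncoder,
                                   OrdinalEncoder, RobustScaler)

def num_pipe(transformer=None):
    """Median imputation -> optional transform -> RobustScaler."""
    steps = [("impute", SimpleImputer(strategy="median"))]
    if transformer is not None:
        steps.append(("transform", transformer))
    steps.append(("scale", RobustScaler()))
    return Pipeline(steps)

log1p_tr = FunctionTransformer(np.log1p)

# "NA" comes first so that an imputed missing rating maps to the lowest integer.
quality_order = ["NA", "Poor", "Fair", "Typical", "Good", "Excellent"]
ord_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="NA")),
    ("encode", OrdinalEncoder(categories=[quality_order])),
])
ohe_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="NA")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

cols_tr = ColumnTransformer(
    [
        ("normal", num_pipe(), ["OverallQual"]),
        ("mild", num_pipe(log1p_tr), ["LotArea"]),
        ("ordinal", ord_pipe, ["ExterQual"]),
        ("onehot", ohe_pipe, ["MSZoning"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

toy = pd.DataFrame({
    "OverallQual": [5, 7, np.nan, 8],
    "LotArea": [8450, 9600, 11250, np.nan],
    "ExterQual": ["Good", "Typical", np.nan, "Excellent"],
    "MSZoning": ["RL", "RM", "RL", np.nan],
})
out = cols_tr.fit_transform(toy)
print(out.shape)  # (4, 6): 1 + 1 + 1 + 3 one-hot columns
```

Because every step lives inside the transformer, calling `fit` on the training split and `transform` on the other splits reuses the training medians, scales, and categories.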

You must end every numeric pipeline with RobustScaler regardless of what transformation precedes it, because even after skew correction the column must still be centred and scaled for Ridge to apply its regularisation fairly across all features. BoxCoxTransformer is stateful: it fits a shift and a lambda per column on training data and reapplies the same parameters at transform time, so no information from the validation or test set influences the transform. TargetEncoder uses 5-fold cross-validation for a similar reason: it prevents target leakage during training.

## Take Away

COLS_TR is now a single scikit-learn object that encodes every analytical decision made in the previous step. In the next step, you will plug it into a pipeline with TransformedTargetRegressor, so that the entire feature engineering workflow (imputation, skew correction, encoding, and scaling) runs in one .fit() call on training data and applies identically to validation and test sets.

Engineered Feature Model
You have a baseline score, a fully engineered feature set, and an assembled ColumnTransformer. This final step brings everything together. You will plug COLS_TR into a new Ridge pipeline and wrap the target with a TransformedTargetRegressor so the model trains on log-scaled prices and predicts in dollars. You must retrain on the same train split and evaluate on the same validation and test sets.
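The target wrapping can be illustrated in isolation. This sketch assumes `np.log1p` and `np.expm1` as the forward and inverse functions (the lab may use a different pair) and fits Ridge on synthetic features instead of the COLS_TR pipeline.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Synthetic prices that are multiplicative in the features, so log(price) is linear.
y = np.exp(12 + X @ np.array([0.3, 0.2, -0.1]) + rng.normal(0, 0.05, 200))

model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,          # applied to y before fitting
    inverse_func=np.expm1,  # applied to predictions, returning dollars
)
model.fit(X, y)
pred = model.predict(X)

median_rel_error = float(np.median(np.abs(pred - y) / y))
print(f"median relative error: {median_rel_error:.3f}")
```

In the lab, the `regressor` argument would be the full pipeline (COLS_TR followed by Ridge) rather than a bare Ridge estimator.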

The Ridge regressor with alpha=1.0 is identical to the baseline, so any differences in the numbers that follow are the direct, measurable return from the four steps of feature engineering you just completed.

## Results: What Feature Engineering Delivered?

To see the impact of the feature engineering steps, run the following command in the Terminal:
```
python3 steps/observations.py
```
This generates an observations.html file that compares the baseline and engineered models.

Next, click the Web Browser tab beside the Terminal, refresh the page, and open the browser in a new tab if needed. Then open observations.html to view three visualizations that highlight how the models performed.

Use the explanations below to interpret the results.

Average dollar saved per prediction

Across 40 randomly sampled test records, the engineered model's predictions are, on average, $4,017 closer to the actual sale price than the baseline. That is the practical meaning of better features: not just a lower RMSE on a competition leaderboard, but a more accurate dollar figure on a real house.

Prediction Comparison

The chart below plots actual sale prices against predictions from both models. The engineered model's points sit consistently closer to the best-fit line, particularly in the mid and upper price ranges where the baseline tends to over- or underestimate the most.

RMSE Comparison

The RMSE chart below shows the validation and test error for both models side by side. The engineered model's bars are noticeably shorter on both splits, confirming the improvement is consistent and not an artefact of a lucky test set.

Top Five Contributing Features by Coefficient

Ridge assigns a weight to every feature after scaling. The five columns with the largest absolute coefficients, that is, the ones the model relies on most, are:
- MSZoning — general zoning classification
- GrLivArea — above grade living area in square feet
- Exterior2nd — exterior covering material
- Neighborhood — physical location within city limits
- OverallQual — overall material and finish quality
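
A ranking like the one above can be obtained by sorting the absolute Ridge coefficients. The sketch below uses synthetic, already-scaled features with hypothetical names `f0` to `f3`. In the lab, the names would come from the fitted preprocessing step, and a one-hot encoded column contributes one coefficient per category.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names
X = rng.normal(size=(500, 4))
y = 5.0 * X[:, 0] - 3.0 * X[:, 2] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 500)

ridge = Ridge(alpha=1.0).fit(X, y)

# Rank features by the magnitude of their coefficient, ignoring the sign.
ranked = (pd.Series(ridge.coef_, index=feature_names)
          .abs()
          .sort_values(ascending=False))
print(ranked.head(2))
```

Coefficient magnitudes are only comparable because the inputs share a common scale, which is one more reason every numeric pipeline ends with a scaler.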
A Note on How Far This Goes

The above results derive entirely from feature engineering, with the model and its hyperparameters held fixed throughout. Feature engineering is typically the highest-leverage time investment in a real Machine Learning project. Combining it with systematic model selection and hyperparameter tuning compounds these gains further, but it is the features that give the model something worth tuning in the first place.

✨ A Final Tip

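The more well-chosen handcrafted features you create, the better a model tends to perform with its base hyperparameters. Validate each new feature, though, because redundant or noisy columns can hurt rather than help.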

Keep learning! 😊

About the author

Chhaya Wagmi

Written content author.
