Lab: Clustering and Dimensionality Reduction

In this guided lab, you'll combine two realistic datasets (a 1,500-employee compensation database and a year of activity log data with over 30,000 events) to discover natural engagement segments using unsupervised learning. You'll clean messy transactional records (missing IDs, reversals, inconsistent formatting), engineer the classic Recency–Frequency–Monetary feature set alongside compensation features, then apply both linear (PCA) and nonlinear (UMAP) dimensionality reduction to surface latent structure. From there you'll cluster with three different algorithms (K-Means with the elbow method, DBSCAN with k-distance eps tuning, and HDBSCAN, which auto-selects cluster count and handles noise gracefully) and compare them head to head on the same projection. Finally, you'll evaluate segment quality with silhouette scores, profile each cluster's RFM and compensation signature, and assign HR-meaningful labels like "Engaged Champions," "Disengaging Veterans," and "New Hires Ramping." You'll walk away with a complete unsupervised learning playbook you can apply to any entity-segmentation problem.

Lab Info

Level: Intermediate

Last updated: Jun 10, 2026

Duration: 1h 21m

## Introduction to Unsupervised Segmentation

Welcome to this guided lab on clustering and dimensionality reduction! Unsupervised learning sits in a fundamentally different place from the supervised models most ML engineers cut their teeth on. There's no label column telling you what's right. Instead, you're asking the data itself: what natural groupings exist here? That question turns up everywhere in industry: customer segmentation, fraud detection, anomaly hunting, and (as in this lab) people analytics. Getting good answers requires three skills working together: thoughtful feature engineering, dimensionality reduction that surfaces structure without losing it, and clustering algorithms tuned to the shape of your data.

In this lab, you'll work through the complete unsupervised learning pipeline on a realistic, messy, two-table dataset. You'll start by cleaning transactional event data - handling missing IDs, reversals, duplicate records, and inconsistent formatting - then engineer Recency, Frequency, and Monetary features per employee from raw event history. You'll layer compensation features on top, apply scaling, and discover why scaling is non-negotiable for distance-based methods. From there you'll explore PCA for linear dimensionality reduction and UMAP for nonlinear projection, learning when each is the right tool. You'll cluster with K-Means (the workhorse), DBSCAN (density-based, handles noise), and HDBSCAN (auto-selects cluster count and handles variable density). You'll compare them on the same data and learn to read silhouette scores critically. Finally, you'll profile each cluster's signature and assign meaningful labels.

These skills generalize far beyond HR analytics. The pipeline you'll build - clean, engineer, scale, reduce, cluster, evaluate, label - is the same pipeline a customer-segmentation project, a fraud-ring detector, or a user-behavior clustering project might use in the real world. By the end of this lab, you'll have a complete unsupervised playbook you can deploy on any entity-segmentation problem.

info> If you get stuck at any point, the solutions/ folder contains the completed code for every task.

## Data Cleaning and RFM Feature Engineering

The single biggest predictor of whether an unsupervised model produces useful segments is the quality of the features you feed it. Garbage in, garbage clusters. In this step, you'll take two raw HR data exports - an employee profile table and a transactional activity log - and turn them into a clean, scaled feature matrix ready for dimensionality reduction. You'll work in step2_features.py, which contains stubbed functions that read the raw CSVs and (currently) produce nothing useful. By the end of Step 2, this script will produce a feature matrix saved to outputs/features.parquet and a distribution plot to outputs/02_rfm_distributions.png showing the shape of your engineered features.

Your raw inputs live in two separate CSVs: an employee profile (one row per employee with demographics and compensation) and an activity log (one row per transaction). Before you can engineer features, you need to read both files and join them on employee_id using a left join, so every employee from the profile table is preserved even if they never appear in the activity log. That left join is the foundation: it ensures your final clustering covers every employee, not just the active ones.

Raw transactional data is messy. The activity log contains four distinct kinds of dirt: rows with missing employee_id, reversals (negative-quantity transactions that represent refunds/cancellations and shouldn't be counted as engagement), inconsistent casing in event_type (some values are PURCHASE instead of purchase), and whitespace padding in category. You'll also need to drop duplicate event_id rows. Each of these is a real pattern from real-world data exports, and the order you handle them matters: clean the categorical columns before filtering on them.

With a clean activity log in hand, you can now turn raw transactions into the three classic RFM features that summarize each employee's engagement: Recency (days since last activity), Frequency (number of distinct active days), and Monetary value (total points). These three numbers, computed per employee, are the foundation for your unsupervised models. After computing RFM from the activity log, you'll join it back to the employee profile so every employee - including those who never appeared in the activity log - has a complete feature row.

The RFM features capture behavioral engagement, but they're not the whole story. Each employee also has structural compensation attributes - base salary, total bonus payouts, and stock grants - that come straight from the employee profile. Combining behavioral and compensation features lets clustering find groups that differ in both dimensions. Finally, because unsupervised algorithms like K-Means and DBSCAN are extremely sensitive to feature scale, you'll standardize the full feature set with StandardScaler so every column contributes proportionally to distance calculations.
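
The cleaning and RFM steps described in this section might look roughly like the sketch below. It is not the lab's solution code: the column names `quantity`, `points`, and `event_date`, the toy rows, the `as_of` reference date, and the fill values for employees with no activity are all assumptions made for illustration. Only `employee_id`, `event_id`, `event_type`, and `category` come from the lab description.

```python
import pandas as pd

def clean_activity(log: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning rules from this step to a raw activity log."""
    out = log.copy()
    # Normalize the categorical columns first so later filters see clean values.
    out["event_type"] = out["event_type"].str.strip().str.lower()
    out["category"] = out["category"].str.strip()
    out = out.dropna(subset=["employee_id"])       # rows with no employee
    out = out[out["quantity"] >= 0]                # reversals (negative quantity)
    return out.drop_duplicates(subset="event_id")  # repeated copies of one event

def compute_rfm(log: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """One row per employee: recency (days), frequency (active days), monetary."""
    when = pd.to_datetime(log["event_date"])
    grouped = log.assign(when=when, day=when.dt.normalize()).groupby("employee_id")
    rfm = pd.DataFrame({
        "recency": (as_of - grouped["when"].max()).dt.days,
        "frequency": grouped["day"].nunique(),
        "monetary": grouped["points"].sum(),
    })
    return rfm.reset_index()

# Toy data (hypothetical): E3 never appears in the log.
profile = pd.DataFrame({
    "employee_id": ["E1", "E2", "E3"],
    "base_salary": [90_000, 70_000, 65_000],
})
raw_log = pd.DataFrame({
    "event_id":    [1, 2, 2, 3, 4, 5],
    "employee_id": ["E1", "E1", "E1", "E2", None, "E2"],
    "event_type":  ["purchase", "PURCHASE", "PURCHASE", "purchase", "purchase", "purchase"],
    "category":    [" gear", "gear ", "gear ", "gear", "gear", "gear"],
    "quantity":    [1, 2, 2, 1, 1, -1],
    "points":      [10, 20, 20, 5, 7, -5],
    "event_date":  ["2025-01-01", "2025-01-03", "2025-01-03",
                    "2025-01-02", "2025-01-02", "2025-01-04"],
})

clean = clean_activity(raw_log)
rfm = compute_rfm(clean, as_of=pd.Timestamp("2025-01-10"))

# Left join keeps every employee in the profile, active or not.
features = profile.merge(rfm, on="employee_id", how="left")
# Assumption: never-active employees get zero frequency/monetary and a
# recency equal to the one-year log window.
features[["frequency", "monetary"]] = features[["frequency", "monetary"]].fillna(0)
features["recency"] = features["recency"].fillna(365)
print(features)
```

How to fill the RFM columns for employees with no activity is a modeling decision; the values used here are only one plausible choice.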

## Dimensionality Reduction with PCA and UMAP

Your scaled feature matrix has six dimensions - too many to visualize directly, and likely containing correlated, redundant signal. Dimensionality reduction lets you compress the feature space while preserving the structure that matters for clustering. In this step you'll apply two complementary techniques: PCA, a fast linear method whose components are interpretable as variance-explained directions, and UMAP, a nonlinear manifold-learning method that often produces visually crisp, separable clusters in 2D. PCA helps you decide how much dimensionality reduction is justified by the variance; UMAP gives you the 2D embedding you'll use for every subsequent visualization in this lab.

Start with PCA because it's fast, deterministic, and gives you a concrete answer to the question "how many dimensions do you actually need?" The explained-variance-ratio plot - a bar chart of per-component variance plus a cumulative line - lets you see exactly how much information each principal component contributes and where the cumulative curve plateaus. For a small feature set like this one, you'll typically find that the first two or three components capture the overwhelming majority of the variance, which tells you that 2D embeddings won't be hiding much structure.

PCA is great for understanding variance, but linear projections often smear together clusters that are easily separable on a curved manifold. UMAP is the nonlinear counterpart: it preserves local neighborhood structure while pulling distinct populations apart in 2D, making it the workhorse visualization for clustering tasks. By plotting PCA's 2D projection next to UMAP's 2D embedding, you'll see immediately whether the nonlinear method reveals tighter, more visually separated groups that PCA flattens out. This is a strong hint about which embedding will pair best with the clustering algorithms you'll fit next.
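
The variance check described above can be sketched with scikit-learn's `PCA`. The data here is a synthetic stand-in for the six-column feature matrix (two hidden factors plus noise), and the 90% cumulative-variance threshold is an arbitrary assumption, not a value taken from the lab.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: six observed columns driven by two latent factors.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 6))
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to inspect the variance spectrum.
pca = PCA().fit(X_scaled)
ratios = pca.explained_variance_ratio_   # per-component share (the bars)
cumulative = np.cumsum(ratios)           # running total (the line)

# Smallest number of components whose cumulative share reaches 90%.
n_keep = int(np.searchsorted(cumulative, 0.90)) + 1

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:6.1%}  cumulative {c:6.1%}")
print("components needed for 90% of variance:", n_keep)

# 2D linear projection for plotting next to a nonlinear embedding.
X_2d = PCA(n_components=2).fit_transform(X_scaled)
```

UMAP is left out of the sketch because it lives in the separate `umap-learn` package. With that package installed, `umap.UMAP(n_components=2).fit_transform(X_scaled)` should yield the nonlinear counterpart to `X_2d`.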

## Clustering with K-Means, DBSCAN, and HDBSCAN

Now the actual modeling. You'll fit three fundamentally different unsupervised algorithms on the same scaled feature matrix and compare their behavior on the UMAP embedding you created in Step 3. K-Means partitions every point into exactly k roughly spherical clusters - simple, fast, and a great baseline, but it requires you to pick k in advance and forces every employee into some cluster. DBSCAN finds dense regions and labels low-density points as noise (-1), making it ideal when you suspect outliers, but it's notoriously sensitive to the eps parameter. HDBSCAN generalizes DBSCAN to multiple density scales and is usually the most robust choice when cluster densities vary. Comparing all three side-by-side reveals which method respects the structure of your data and which fights against it.

K-Means is the right place to start: it's deterministic given a seed, it scales to large datasets, and the only real choice you have to make is k, the number of clusters. The standard heuristic is the elbow method: fit K-Means for k = 2, 3, 4, …, 10, plot the resulting inertia (within-cluster sum of squared distances), and look for the "elbow" where adding another cluster stops producing dramatic improvements. The elbow suggests the most parsimonious k that still respects the structure of the data.

DBSCAN works on a fundamentally different principle: instead of partitioning, it grows clusters by finding points that have at least min_samples neighbors within distance eps. Points that don't fit any density region are labeled noise (-1). The catch is choosing eps: too small and everything becomes noise; too large and clusters merge into one giant blob. The standard diagnostic is the k-distance plot - sort the distance from each point to its k-th nearest neighbor and plot it; the "knee" in that curve is the natural eps boundary between dense and sparse regions.

HDBSCAN is the modern evolution of DBSCAN. It builds a hierarchy of density-based clusters across multiple eps values simultaneously, then automatically extracts the most stable clusters from that hierarchy. The practical payoff: you don't tune eps at all (you still choose a minimum cluster size), and the algorithm handles variable-density clusters that vanilla DBSCAN struggles with. It still labels noise as -1, so you keep the outlier-detection benefit. For most real-world clustering problems where you have no strong prior on cluster shape or density, HDBSCAN is the strongest default.

Numbers from n_clusters_ only tell you so much; the visual comparison reveals whether each algorithm is finding the structure that actually exists in your data. By plotting K-Means, DBSCAN, and HDBSCAN labels side-by-side on the same UMAP projection, you can immediately see: which methods agree, which produce many tiny clusters, which lump everything into one giant group, and how each handles the outliers. This three-panel figure becomes the most diagnostic artifact of the lab - the one image you'd show a stakeholder to defend the algorithm you finally pick.
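
The elbow and k-distance diagnostics described in this step can be computed with scikit-learn as sketched below. The blob data is synthetic, `min_samples = 5` is an arbitrary choice, and taking the 95th percentile of the k-distance curve is a crude, assumed stand-in for reading the knee off the plot by eye.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix.
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X)

# Elbow method: inertia (within-cluster sum of squares) for k = 2..10.
inertias = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_
for k, value in inertias.items():
    print(f"k={k:2d}  inertia={value:8.2f}")

# k-distance curve: sorted distance to each point's k-th nearest neighbor.
# Querying the training data returns each point as its own first neighbor,
# which matches how DBSCAN's min_samples counts the point itself.
min_samples = 5
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# Assumption: use a high percentile instead of a visually chosen knee.
eps = float(np.quantile(k_dist, 0.95))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
n_noise = int(np.sum(labels == -1))
n_found = len(set(labels)) - (1 if -1 in labels else 0)
print(f"eps={eps:.3f}  clusters={n_found}  noise points={n_noise}")
```

In practice you would plot `inertias` and `k_dist` and pick k and eps from the curves. Recent scikit-learn releases also include an `HDBSCAN` estimator in `sklearn.cluster` that could be fitted on the same `X` for the third panel.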

## Evaluation, Profiling, and Business Labeling

You've fit three clustering algorithms. Now you need to decide which one to actually use - and once you've picked it, you need to translate cluster IDs (0, 1, 2, …) into something a business stakeholder can act on. This step has three parts: first, an internal numerical evaluation using silhouette scores to rank clustering quality without ground-truth labels; second, cluster profiling that summarizes each cluster by the means of your original features so you can see what kind of employee lives in each group; and third, HR-meaningful labeling that maps each numerical cluster to a human-readable persona like "Stable Veterans" or "At-Risk Newcomers," producing a final visualization ready to present to a non-technical audience.

Without ground-truth labels, you can't compute accuracy, but you can compute the silhouette score, which measures how well each point sits inside its cluster relative to the nearest neighboring cluster. Scores near +1 mean tight, well-separated clusters; scores near 0 mean overlapping clusters; negative scores mean points are probably mis-assigned. Computing silhouette for K-Means, DBSCAN (excluding noise), and HDBSCAN (excluding noise) gives you a quantitative ranking that complements the visual side-by-side comparison from Step 4. Keep in mind that dropping noise points tends to flatter DBSCAN and HDBSCAN, so read each score alongside the share of points that method labeled as noise.

Silhouette tells you which clustering is numerically tightest; profiling tells you what each cluster means. The standard approach is to group the original (unscaled) feature matrix by cluster label and compute the mean of every feature within each group. Looking at the resulting table, you can immediately spot the distinguishing patterns: which clusters have high tenure but low engagement, which have low compensation but high frequency, which look like new hires versus long-tenured veterans. This profile table is the bridge from "K-Means produced 4 clusters" to "K-Means produced 4 interpretable employee segments."

The final step makes everything actionable. You'll take the K-Means cluster IDs and replace them with descriptive persona labels you derive from the profile table - names like "Stable Veterans," "At-Risk Newcomers," "Star Performers," or "Disengaged Seniors" - chosen to reflect each cluster's distinguishing features. You'll then produce a final UMAP scatter colored by these human-readable labels with a clean legend. This final figure is exactly the kind of artifact you'd put in a slide deck to defend the segmentation to a non-technical HR partner or executive.
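
The three parts of this step (silhouette with noise excluded, per-cluster profiling, persona mapping) could be sketched as follows. The helper `silhouette_excluding_noise`, the toy arrays, and the pairing of cluster IDs to persona names are hypothetical and chosen only to keep the example small. In the lab, the personas come from reading the real profile table.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_score

def silhouette_excluding_noise(X: np.ndarray, labels: np.ndarray) -> float:
    """Silhouette over non-noise points only; NaN if fewer than 2 clusters remain."""
    keep = labels != -1
    if np.unique(labels[keep]).size < 2:
        return float("nan")
    return float(silhouette_score(X[keep], labels[keep]))

# Toy scaled features and labels; -1 marks a noise point (DBSCAN/HDBSCAN style).
X_scaled = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
    [50.0, 50.0],
])
labels = np.array([0, 0, 0, 1, 1, 1, -1])

score = silhouette_excluding_noise(X_scaled, labels)
noise_share = float(np.mean(labels == -1))  # report this next to the score
print(f"silhouette={score:.3f}  noise share={noise_share:.1%}")

# Profiling: group the ORIGINAL (unscaled) features by cluster and average.
raw = pd.DataFrame({
    "recency":  [5, 7, 6, 90, 80, 100, 300],
    "monetary": [500, 450, 520, 40, 60, 50, 0],
    "cluster":  labels,
})
profile = raw[raw["cluster"] != -1].groupby("cluster").mean()
print(profile)

# Labeling: hypothetical mapping derived from reading the profile table.
personas = {0: "Star Performers", 1: "Disengaged Seniors"}
raw["persona"] = raw["cluster"].map(personas).fillna("Noise")
```

Profiling on unscaled values keeps the table readable in business units (days, points), while the silhouette is computed on the same scaled matrix the algorithms saw.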

## Conclusion

You've now built a complete unsupervised learning pipeline from raw, messy data to a labeled, visualized segmentation suitable for HR handoff. You handled missing IDs, reversals, inconsistent casing, and duplicate records - the kind of dirt every real-world dataset throws at you. You engineered the canonical RFM feature set alongside compensation features, applied log1p and StandardScaler to make them play nicely with distance-based methods, and produced a feature matrix ready for any downstream model.
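
As a reminder of why the log1p and StandardScaler combination helps distance-based methods, here is a minimal sketch on synthetic data. Which columns receive `log1p` in the lab is not stated here, so applying it to a single right-skewed "monetary" column is an assumption, as are the distributions used to generate the data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def skewness(values: np.ndarray) -> float:
    """Sample skewness: third standardized moment."""
    centered = values - values.mean()
    return float(np.mean(centered**3) / np.std(values) ** 3)

rng = np.random.default_rng(7)
recency = rng.integers(0, 365, size=1000).astype(float)   # roughly uniform
monetary = rng.lognormal(mean=6.0, sigma=1.2, size=1000)  # heavy right tail
X = np.column_stack([recency, monetary])

# log1p compresses the long tail; StandardScaler then equalizes the units.
X_log = X.copy()
X_log[:, 1] = np.log1p(X_log[:, 1])
X_scaled = StandardScaler().fit_transform(X_log)

print("skew before log1p:", round(skewness(X[:, 1]), 2))
print("skew after  log1p:", round(skewness(X_log[:, 1]), 2))
print("column std before scaling:", X_log.std(axis=0).round(2))
print("column std after  scaling:", X_scaled.std(axis=0).round(2))
```

Without the scaling step, the column with the larger numeric range would dominate every Euclidean distance that K-Means and DBSCAN compute.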

From there, you saw two complementary dimensionality reduction approaches: PCA for fast, linear, interpretable variance capture, and UMAP for nonlinear local structure preservation. You clustered with three different algorithms - K-Means with the elbow method, DBSCAN with principled eps selection via the k-distance plot, and HDBSCAN with auto cluster-count and noise handling - and compared them honestly on the same projection. You evaluated quality with silhouette scores, profiled each cluster's signature, and assigned business-meaningful labels.

This pipeline - clean, engineer, scale, reduce, cluster, evaluate, label - is the same shape as customer segmentation, fraud detection, anomaly clustering, behavior analytics, or any other unsupervised problem you'll encounter. The specifics of the features change with the domain; the structure of the work doesn't.

As next steps, consider exploring how the segmentation changes if you bring in additional features (department one-hot encoding, tenure, event-type breakdowns), or try clustering on the UMAP projection itself rather than on the original feature space - that's a controversial choice in practice with both fans and critics, and the tradeoffs are worth understanding. You could also dive deeper into HDBSCAN's cluster_persistence_ output to understand which segments are robust versus marginal. Each direction builds directly on what you've done here. Happy clustering!

About the author

Zachary Bennett

Zach is currently a Senior Software Engineer at VMware where he uses tools such as Python, Docker, Node, and Angular along with various Machine Learning and Data Science techniques/principles. Prior to his current role, Zach worked on submarine software and has a passion for GIS programming along with open-source software.

