Introduction to Decision Trees
In this lab, you’ll practice creating a decision tree. When you’re finished, you’ll have a basic decision tree and a fundamental understanding of use cases for them.
Challenge
Introduction to Decision Trees
Welcome to the "Introduction to Decision Trees" lab! This lab is designed to give you a foundational understanding of Decision Trees, including use cases, interpretability, and the advantages and disadvantages of decision trees.
A Decision Tree is a supervised learning algorithm used for both classification and regression. In this lab you will focus on a classification decision tree, which models decisions and their possible consequences in a tree-like structure. Decision trees split data into smaller subsets based on feature conditions, making them easy to interpret while effectively capturing patterns in the data.
Learning Objectives
- Understand use cases of decision trees
- Learn how to implement and train a decision tree, and measure its performance metrics
- Understand ethical considerations and interpretability of decision tree models
Challenge
Creation of Classification Data
For data creation in this lab you will use the code provided below, since synthetic data generation is not the key focus of this lab. The generated data contains features for a classification problem and aligns well with decision trees and random forests, which excel at classification problems requiring high interpretability.
```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=500,      # Number of samples
    n_features=10,      # Number of features
    n_informative=8,    # Number of informative features
    n_redundant=2,      # Number of redundant features
    n_classes=2,        # Binary classification
    random_state=42
)
```
Once the synthetic data is generated, ensure it is properly split into train and test sets with a random seed established.
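The split step might look like the following minimal sketch using scikit-learn's `train_test_split`; the 80/20 split ratio and variable names (`X_train`, `X_test`, and so on) are illustrative choices, not lab requirements. The data generation is repeated here so the snippet runs standalone.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Regenerate the synthetic classification data from the previous step.
X, y = make_classification(n_samples=500, n_features=10, n_informative=8,
                           n_redundant=2, n_classes=2, random_state=42)

# Hold out 20% of the samples for testing; fix the seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```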
-
Challenge
Setting Up Decision Tree Models
For this step you will determine the hyperparameters for the decision tree model.
For initializing your decision tree model you will be using the DecisionTreeClassifier from the scikit-learn library. For this basic demonstration you will only be using 3 key parameters:

criterion: the function used to measure the quality of a split. Three options are available for classification:
- gini: Gini impurity measures the probability of misclassifying a randomly chosen element from the dataset if it were randomly labeled according to the class distribution.
- Faster to compute than entropy because it avoids logarithmic calculations.
- Prefers larger class splits, leading to a more balanced tree.
- entropy: derived from Shannon's information theory, quantifies the amount of disorder (uncertainty) in a dataset.
- Measures "purity" more rigorously than Gini.
- More computationally expensive due to the logarithm.
- Can result in deeper trees since it may favor splits with multiple smaller classes.
- log_loss (or binary cross-entropy): primarily used in probabilistic classification, where predictions are given as probabilities rather than hard class labels.
- Penalizes incorrect probabilistic predictions more heavily.
- Unlike Gini and entropy, log loss works with predicted probabilities rather than discrete class labels.

max_depth: the maximum depth of the tree. There are several other hyperparameters that limit tree depth, but for this example we will simply set max_depth. Choosing a proper depth for your decision tree is important: a shallow tree will underfit the data, while a deep tree will overfit it and fail to generalize. Within this lab a max_depth under 6 is acceptable; feel free to adjust it to change the visualization of the tree. Note: in practice, the best way to determine the optimal depth is Grid Search with Cross-Validation.

random_state: any integer value you desire, to ensure repeatability within training.
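Putting the three parameters together, initialization might look like the sketch below; the specific values (gini, a depth of 4, seed 42) are illustrative choices within the ranges discussed above, not required settings.

```python
from sklearn.tree import DecisionTreeClassifier

# Initialize the classifier with the three key hyperparameters.
dt_model = DecisionTreeClassifier(
    criterion="gini",   # impurity measure: "gini", "entropy", or "log_loss"
    max_depth=4,        # keep the tree shallow enough to avoid overfitting
    random_state=42,    # fixed seed for repeatable training
)
```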
Challenge
Training and Evaluating Decision Trees
Now that your data is collected and the model's hyperparameters are established, you can train the model and generate predictions with the two lines provided below.
```python
dt_model.fit(X_train, y_train)
y_pred = dt_model.predict(X_test)
```
Key Measurements:
Decision trees can be measured with accuracy and the classification report functionality.
- Precision – How many predicted positives are actually positive?
- Recall – How many actual positives were correctly predicted?
- F1-score – Balance between precision and recall.
- Accuracy – Overall correctness (not always the best metric for imbalanced datasets).
- Support – The number of instances per class.
- Macro vs. Weighted Average – Choose based on whether you want equal importance or class-weighted evaluation.
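These measurements can be produced with scikit-learn's accuracy_score and classification_report, as in the end-to-end sketch below; it regenerates the data and model from the earlier steps so it runs standalone, with the same illustrative hyperparameter values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Recreate the data and model from the previous steps.
X, y = make_classification(n_samples=500, n_features=10, n_informative=8,
                           n_redundant=2, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
dt_model = DecisionTreeClassifier(criterion="gini", max_depth=4,
                                  random_state=42)

# Train, predict, and report the key measurements.
dt_model.fit(X_train, y_train)
y_pred = dt_model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred))  # precision, recall, F1, support
```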
Below is a simple list of general scenarios explaining which metric should be valued most and why.
- When False Positives are Costly (e.g., spam detection, fraud detection)
- Key Metric: Precision
- Reason: Avoids mislabeling negatives as positives.
- When False Negatives are Costly (e.g., medical diagnosis, security alerts)
- Key Metric: Recall
- Reason: Ensures most actual positive cases are correctly identified.
- When Both False Positives and False Negatives Matter Equally (e.g., general classification tasks)
- Key Metric: F1-score
- Reason: Balances precision and recall for a fair evaluation.
- When Dealing with an Imbalanced Dataset (e.g., fraud detection with few fraud cases)
- Key Metric: Weighted Average
- Reason: Accounts for class distribution to prevent bias toward majority classes.
- When All Classes Should Be Treated Equally (e.g., multi-class problems where each class is equally important)
- Key Metric: Macro Average
- Reason: Gives equal importance to each class, regardless of frequency.
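When you need the macro or weighted averages programmatically rather than as printed text, classification_report accepts an output_dict=True parameter that returns the same figures as a nested dictionary. The sketch below assumes the same illustrative data and model as the earlier steps.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Recreate the data and model from the previous steps.
X, y = make_classification(n_samples=500, n_features=10, n_informative=8,
                           n_redundant=2, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# output_dict=True returns the report as a nested dictionary.
report = classification_report(y_test, model.predict(X_test), output_dict=True)
print(report["macro avg"]["f1-score"])     # each class weighted equally
print(report["weighted avg"]["f1-score"])  # weighted by class support
```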
Challenge
Visualizing and Interpreting Decision Trees
To visualize the tree you can use the code below. Properly named features increase readability and help you understand and reproduce the decisions made by the model.
```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 6))
plot_tree(
    dt_model,
    feature_names=[f'Feature {i}' for i in range(X.shape[1])],
    class_names=['Class 0', 'Class 1'],
    filled=True
)
plt.title("Decision Tree Visualization")
plt.show()
```
Feel free to alter max_depth and adjust other hyperparameters to see how the visualization and weights of the tree change.
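If you want to inspect the learned rules without a plotting backend, scikit-learn also provides export_text in sklearn.tree, which prints the splits as indented text. The sketch below regenerates the illustrative data and model from the earlier steps so it runs standalone.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Recreate the data and a shallow model from the previous steps.
X, y = make_classification(n_samples=500, n_features=10, n_informative=8,
                           n_redundant=2, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Export the tree as plain text; each indented line is a split or a leaf.
rules = export_text(model,
                    feature_names=[f"Feature {i}" for i in range(X.shape[1])])
print(rules)
```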