Implementing XGBoost in a Random Forest Classifier
In this lab, you’ll practice implementing XGBoost alongside a random forest classifier. When you’re finished, you’ll have built an XGBoost model and compared it against a baseline random forest classifier.
Introduction
Welcome to the lab! This lab will provide an environment where you can use XGBoost within a random forest classifier.
Learning Objectives
- Explain how XGBoost works.
- Review ideal use cases for XGBoost.
- Recognize the difference between AdaBoost and XGBoost.
Prerequisites
- A high-level understanding of random forest models
- Moderate to advanced data engineering experience
Challenge
Synthetic Data Generation
Key Requirements
For this lab, you will choose a few key parameters for the synthetic data generation. It is important to note that XGBoost supports L1 regularization out of the box through the reg_alpha hyperparameter, and L2 regularization through the reg_lambda hyperparameter, without you implementing either yourself. XGBoost also parallelizes training to reduce computing time, making it more efficient on larger datasets than other boosting methods. Finally, XGBoost automatically handles missing values.
Initializing Data Generation
For this section, your key action steps are:
- Run the imports cell in Jupyter, choosing the latest stable Python (Python 3.11 as of publication).
- Add parameters within the ranges below, then run the Synthetic Data Generation cell to generate the data.
With all these benefits of XGBoost in mind, the following parameter ranges are ideal to see the increase in accuracy that XGBoost provides.
- n_samples: between 40000 and 70000
- n_features: between 10 and 20
- n_informative: between n_features - 10 and n_features - 1
- n_redundant: must be equal to n_features - n_informative
- n_classes: between 2 and 5
As you increase the complexity of the dataset, XGBoost's improvements become more obvious.
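The data-generation step can be sketched as follows. The exact cell in the lab may differ; the specific values below are just one choice from the ranges above.

```python
# A minimal sketch of a Synthetic Data Generation cell. The parameter values
# are example choices from the ranges above, not the lab's exact defaults.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

n_features = 15
n_informative = n_features - 5            # within [n_features - 10, n_features - 1]
n_redundant = n_features - n_informative  # required relationship from the ranges

X, y = make_classification(
    n_samples=50_000,       # between 40000 and 70000
    n_features=n_features,
    n_informative=n_informative,
    n_redundant=n_redundant,
    n_classes=3,            # between 2 and 5
    random_state=42,
)

# Hold out a test set for the evaluation step later in the lab.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

Note that n_informative + n_redundant must not exceed n_features, which is exactly what the required relationship above guarantees.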
Challenge
Training and Evaluating Both Models
For this section of the lab, you will learn how the models are trained and how to evaluate them. You can also alter the synthetic data and the models' hyperparameters to see how different settings affect the results. A few key parameters and their effects:
- Increasing max_depth raises the risk of overfitting; decreasing it too far can leave the model unable to capture the data's structure (underfitting).
- Increasing learning_rate trains the model faster, but can overshoot local and global minima of the loss function.
- Increasing n_estimators lengthens training time and raises the risk of overfitting.
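The max_depth effect in particular is easy to see outside the lab. Here is a small sketch using scikit-learn's GradientBoostingClassifier, which exposes the same max_depth, learning_rate, and n_estimators parameters as XGBoost; it stands in here only so the example runs without the xgboost package installed.

```python
# Illustrative sketch: deeper trees widen the gap between training and test
# accuracy, the classic signature of overfitting. GradientBoostingClassifier
# is a stand-in for XGBoost with the same parameter names.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=20, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gaps = {}
for depth in (1, 8):
    clf = GradientBoostingClassifier(
        max_depth=depth,      # deeper trees -> higher overfitting risk
        learning_rate=0.1,    # larger values train faster but may overshoot
        n_estimators=100,     # more trees -> longer training, overfitting risk
        random_state=0,
    ).fit(X_tr, y_tr)
    gaps[depth] = clf.score(X_tr, y_tr) - clf.score(X_te, y_te)
    print(f"max_depth={depth}: train-test accuracy gap = {gaps[depth]:.3f}")
```

The depth-8 model should show a noticeably larger train-test gap than the depth-1 model, even though both may score similarly on the test set.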
Train Both Models
Start by running both cells under Training Both Models, then continue on to the explanation of how XGBoost works as the models train.
Understanding XGBoosting
You will train two models: a model boosted with XGBoost, and a baseline random forest model. XGBoost is a boosting method applied to an ensemble of decision trees, ensuring that each subsequent tree in the model makes up for the errors of the trees before it.
XGBoost uses gradient boosting, meaning that each new tree corrects the errors of the previous trees by minimizing the loss function. This differs from AdaBoost, which increases the weights of misclassified data points between rounds; gradient boosting instead directly minimizes a differentiable loss function (similar to how deep neural networks are trained). Because boosting rounds are inherently sequential, XGBoost's parallelism comes from within each round: the search for the best split is parallelized across features, which is a large part of why it trains quickly on big datasets.
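The core update described above can be sketched by hand. This is not XGBoost's implementation, just the gradient-boosting idea it builds on, using squared-error loss, where the negative gradient is simply the residual:

```python
# Minimal sketch of gradient boosting: each new tree is fit to the negative
# gradient of the loss (for squared error, the residuals) of the ensemble
# built so far, then added to the ensemble with a learning-rate step.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())   # start from a constant model
for _ in range(50):
    residuals = y - prediction           # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)

mse = np.mean((y - prediction) ** 2)
print("final training MSE:", mse)
```

Each pass through the loop is one boosting round; after 50 rounds the ensemble's error is far below that of the initial constant model.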
Note: Random forests trained on small datasets, or that converge before all estimators are created, will hardly benefit from boosting. This makes XGBoost's performance enhancements much more obvious on large, multi-class datasets.
Key Metrics Definitions and Use cases
Accuracy: Measures how often the classifier is correct.
If accuracy is high (close to 1), the model is making correct predictions most of the time. However, accuracy alone can be misleading if the dataset is imbalanced.
Example Use Case: Image Classification (Cats versus Dogs)
- If you have an equal number of cat and dog images, a high accuracy means the model is performing well overall.
- However, if the dataset is imbalanced (for example 95% dogs, 5% cats), accuracy may be misleading. A model that always predicts "dog" will be 95% accurate but useless for identifying cats.
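A tiny numeric check makes the imbalance caveat concrete (the labels here are invented for the example):

```python
# A model that always predicts "dog" on a 95%-dog dataset scores 95% accuracy
# while finding zero cats.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 95 + [0] * 5)   # 1 = dog, 0 = cat
y_pred = np.ones(100, dtype=int)        # always predict "dog"

print("accuracy:  ", accuracy_score(y_true, y_pred))              # 0.95
print("cat recall:", recall_score(y_true, y_pred, pos_label=0))   # 0.0
```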
Precision: Measures the percentage of correctly predicted positive instances out of all predicted positives. A high precision score means that when the model predicts "positive," it's usually correct.
Example Use Case: Spam Email Detection
- If a non-spam email is misclassified as spam (FP), a user might miss an important message.
- High precision ensures that when the model says "spam," it is almost always correct.
- Even if some spam emails sneak through (FN), that’s better than losing an important email.
Recall (Sensitivity or True Positive Rate): Measures how many actual positives were correctly predicted. High recall means the model is good at capturing actual positives, but it may produce more false positives.
Example Use Case: Medical Diagnosis (Cancer Detection)
- If the model fails to detect cancer (FN), it could cost a life.
- A high recall ensures that almost all cancerous cases are detected, even if some healthy people get flagged (FP).
- Further testing can eliminate false positives, but missing real cancer cases is unacceptable.
F1 Score: The harmonic mean of precision and recall, balancing the trade-off between the two.
A high F1 score means the model has a good balance between precision and recall.
Example Use Case: Customer Churn Prediction
- If a company wants to predict which customers will leave, both FP (wrongly thinking a customer will leave) and FN (missing actual churners) matter.
- High precision ensures that marketing efforts go to likely churners.
- High recall ensures that most potential churners are detected.
- A high F1 score balances both, ensuring the best overall effectiveness.
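All four metrics can be computed with scikit-learn on a toy set of predictions, which makes the trade-offs visible (the values are invented for illustration):

```python
# Toy predictions: 3 true positives, 1 false negative, 2 false positives,
# 4 true negatives.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

print("accuracy: ", accuracy_score(y_true, y_pred))    # (3 + 4) / 10 = 0.7
print("precision:", precision_score(y_true, y_pred))   # 3 / (3 + 2) = 0.6
print("recall:   ", recall_score(y_true, y_pred))      # 3 / (3 + 1) = 0.75
print("f1:       ", f1_score(y_true, y_pred))          # harmonic mean ≈ 0.667
```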
Evaluating and Comparing Both Models
Run the cells under Evaluating Both Models.
When evaluating a model, it's key to remember that the model's use case heavily dictates which metric matters most. The key metrics within the classification report are the ones most commonly used to assess a model.
For this lab, depending on the parameters you chose for the synthetic data, you should see that the XGBoost model trained faster than the baseline and that all of its metrics are roughly the same, if not better. Keep in mind that precision and recall trade off against each other, so boosting cannot strictly improve both.
Experiment and Repeat
Working through the lab again:
- Alter your synthetic data parameters to see how different types of datasets can affect how effective XGBoost is.
- Also modify some of the hyperparameters of XGBoost itself to see how different parameterization can affect boosting.
Note: In cases where boosting seems barely better, you can increase the complexity of the synthetic data.
Real skill practice before real-world application
Hands-on Labs are real environments created by industry experts to help you learn. These environments help you gain knowledge and experience, practice without compromising your system, test without risk, destroy without fear, and let you learn from your mistakes. Hands-on Labs: practice your skills before delivering in the real world.
Learn by doing
Engage hands-on with the tools and technologies you’re learning. You pick the skill, we provide the credentials and environment.
Follow your guide
All labs have detailed instructions and objectives, guiding you through the learning process and ensuring you understand every step.
Turn time into mastery
On average, you retain 75% more of your learning if you take time to practice. Hands-on labs set you up for success to make those skills stick.