Course info
Dec 31, 2018
1h 33m

At the core of applied machine learning is supervised machine learning. In this course, Machine Learning with XGBoost Using scikit-learn in Python, you will learn how to build supervised learning models using one of the most accurate algorithms in existence. First, you will discover what XGBoost is and why it has revolutionized competitive modeling. Next, you will explore the importance of data wrangling and see how clean data affects XGBoost’s performance. Finally, you will learn how to build, train, and score XGBoost models for real-world performance. When you are finished with this course, you will have a foundational knowledge of XGBoost that will help you as you move toward becoming a machine learning engineer.

About the author

Mike has Bachelor of Science degrees in Business and Psychology. He's passionate about machine learning and data engineering.

More from the author
Section Introduction Transcripts

Preparing Data for Gradient Boosting
Hello, my name is Mike West, and welcome back to an Introduction to XGBoost Using Scikit-learn in Python. In this module, you'll learn how data is prepared for machine learning models. A model is only as good as the data passed into it. In machine learning, the process of massaging data into a modellable state is called data wrangling. You'll learn what data wrangling is and why it's so important to the machine learning process. The module also discusses the AI hierarchy. You'll learn about the two core types of models: artificial neural networks and traditional models. XGBoost is a traditional model, and data wrangling is the same process for traditional models as it is for artificial neural networks. Machine learning is separated into two core types of learning: supervised and unsupervised. This module covers the difference between the two and why most applied machine learning is supervised learning. You'll also learn about the applied machine learning world versus the world of research or academia. Supervised learning is all about data, and real-world machine learning is all about cleaning data and modeling that cleansed dataset. The module covers the array at a high level. Linear algebra is the mathematics of structured data, and the array is the core object that houses that data. You'll learn what an array is and how to navigate arrays using indexes. You'll also become familiar with data cleansing within the context of an array. Machine learning is very process oriented. Machine learning engineers follow the same steps in order to build predictive models. This module covers that process and explains the importance of data wrangling within it. Lastly, you'll wrangle the Titanic dataset and use XGBoost to create a highly accurate model against that cleansed dataset.
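The array navigation and cleansing ideas described above can be sketched with NumPy. This is a minimal illustration, not the course's actual demo: the values are hypothetical, and the imputation step (replacing a missing age with the column mean) is just one common wrangling technique.

```python
import numpy as np

# A tiny structured dataset: rows are passengers, columns are
# [age, fare]. np.nan marks a missing age (values are hypothetical).
data = np.array([[22.0, 7.25],
                 [np.nan, 71.28],
                 [35.0, 8.05]])

# Navigating the array with indexes: row 0, column 1 is the
# first passenger's fare.
first_fare = data[0, 1]

# A basic cleansing step: replace the missing age with the mean
# of the non-missing ages.
ages = data[:, 0]
ages[np.isnan(ages)] = np.nanmean(ages)
```

After the imputation, every cell in the array holds a number, which is the "modellable state" a library like XGBoost expects.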

Saving the Trained Model
Hello, my name is Mike West, and welcome back to an Introduction to XGBoost Using Scikit-learn in Python. In this module, you'll learn how to save models to disk and retrieve them anytime you need them. Once you've completed and tested your model, you need to save it for safekeeping. There are several approaches to saving your completed Python model for later use. In this module, you'll learn about serializing objects. The process of saving your model using these approaches is called serialization. Serialization is simply converting an object to a byte stream. A bit is a 0 or a 1, and a byte is eight bits; a byte stream is just a flow of bytes, one after another. A byte stream can come from a file, a network connection, a serialized object, a random number generator, etc. The most popular library for saving your models is pickle. Pickle is used for serializing and deserializing a Python object structure. Almost any object in Python can be pickled so that it can be saved to disk. Pickle serializes the object before writing it to file; it's a way to convert a Python object, a list, a dict, etc., into a byte stream. Since its inception, JSON has quickly become the de facto standard for information exchange. JSON, or JavaScript Object Notation, is a minimal, readable format for structuring data. Due to its ubiquity, many like to use JSON for persisting their objects to disk, and Python can take full advantage of it. Pickle isn't perfect. It has some security issues and other flaws you should know about before using it in a production environment; this module covers some of those problems. Lastly, you'll use pickle and JSON in a variety of demonstrations for saving your models to disk and deserializing them for later use.
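The pickle and JSON approaches above can be sketched with the standard library alone. In the course you would serialize the fitted XGBoost estimator itself; here a plain dict stands in for the trained model so the sketch has no external dependencies.

```python
import pickle
import json

# Stand-in for a trained model (hypothetical hyperparameters).
model_artifact = {"learning_rate": 0.1, "n_estimators": 100}

# Pickle: serialize to a byte stream, then deserialize it back.
blob = pickle.dumps(model_artifact)   # bytes
restored = pickle.loads(blob)

# JSON: a human-readable text alternative to raw bytes.
as_json = json.dumps(model_artifact)  # str
from_json = json.loads(as_json)
```

In practice you would write `blob` to a file with `open(path, "wb")` (or use `pickle.dump` directly on the file object), and only unpickle data you trust, given pickle's security caveats.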

Selecting Features in Gradient Boosting
Hello, my name is Mike West, and welcome back to an Introduction to XGBoost Using Scikit-learn in Python. In this module, you'll learn about feature engineering. When your goal is to get the best possible results from a predictive model, you need to get the most you can from the data you have. Features are the numbers fed into the model; if you're working with structured data, think of a feature as a column in an array or table. There are three general classes of feature selection algorithms: filter methods, wrapper methods, and embedded methods. Each method will be defined in this module. You can use the wrong model, or one that is less optimal, and still get good results, because most models can pick up on well-structured data. The flexibility of good features allows you to use less complex models that are faster to run, easier to understand, and easier to maintain. One of the benefits of gradient boosting is that after the boosted trees are constructed, you can retrieve an importance score for each attribute. XGBoost is a gradient boosting model. Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. This module also covers feature construction. This technique is the one most people are referring to when they talk about feature engineering: the process of manually constructing new attributes from raw data. It involves intelligently combining or splitting existing raw features into ones with higher predictive power. Finally, we'll use feature selection and feature engineering in various demonstrations, all of which use XGBoost.
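Feature construction as described above can be sketched with a classic Titanic-style example: combining two raw columns into one feature with (potentially) higher predictive power. This is a hypothetical illustration, not the course's demo; the column values are made up.

```python
import numpy as np

# Two raw Titanic-style features (hypothetical values):
# siblings/spouses aboard and parents/children aboard.
sibsp = np.array([1, 0, 3, 1])
parch = np.array([0, 0, 1, 2])

# Feature construction: combine the raw columns into a single
# "family size" feature (+1 counts the passenger themselves).
family_size = sibsp + parch + 1

# A further derived binary feature: traveling alone or not.
is_alone = (family_size == 1).astype(int)
```

Whether the constructed feature actually improves the model is an empirical question; with XGBoost you could compare importance scores or validation accuracy with and without it.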