Course info
Aug 9, 2018
1h 0m

At the core of applied machine learning is a thorough knowledge of data wrangling. In this course, Data Wrangling with Pandas for Machine Learning Engineers, you will learn how to massage data into a modellable state. First, you will discover what data wrangling is and its importance to the machine learning process. Next, you will explore the Pandas DataFrame and see how data is manipulated within the DataFrame. Finally, you will learn how to build an accurate model with the cleansed dataset. When you are finished with this course, you will have a foundational knowledge of data wrangling that will help you as you move forward to becoming a machine learning engineer.

About the author

Mike has Bachelor of Science degrees in Business and Psychology. He's passionate about machine learning and data engineering.

Section Introduction Transcripts

Course Overview
Hello. My name is Mike West, and welcome to my course Data Wrangling with Pandas for Machine Learning Engineers. While artificial neural networks are getting all the attention, one of the most overlooked aspects of machine learning is the data. Regardless of the algorithm type, almost all machine learning models need well-formatted, structured data to perform optimally. It's the job of the machine learning engineer to wrangle the data into a modellable state. Data wrangling is one of the most difficult and time-consuming parts of machine learning. In the real world, data is dirty and machine learning models are temperamental. These models only want highly structured, well-cleansed data. In this course, we'll provide you with the foundation you need to wrangle those unruly datasets. The course will introduce you to applied data wrangling. You'll watch how developers take real-world datasets and wrangle them into the highly structured numerical entities machine learning models need. The core library used by machine learning engineers to wrangle their data in Python is called pandas. You'll learn how to manipulate tabular data in an array. The array is the core data object in machine learning. Once the data has been properly wrangled, you'll build a highly accurate model that will predict a person's survivability if they were aboard the Titanic at the time of the sinking. Python has become the gold standard in applied machine learning, and a library called pandas is the preferred tool developers use to massage their data into a well-cleansed state. By the end of the course, you'll be familiar with the basics of data wrangling and the process machine learning engineers use to create well-cleansed, model-ready datasets. I hope you will join me on this journey to learn more about data wrangling with Python at Pluralsight.

Getting Started in Data Wrangling
Hello. My name is Mike West, and welcome to my course Data Wrangling with Pandas for Machine Learning Engineers. In this course, we're going to cover the basics of data wrangling. Machine learning is one of the most in-demand careers in the world and will be for a long time to come. A large part of real-world machine learning is data wrangling. This first module will provide some basic information about data wrangling. This includes an explanation of the two core types of machine learning, supervised and unsupervised learning. Machine learning is very process oriented. This first module will cover that process and explain how data wrangling fits into the larger picture. Machine learning models don't like poorly structured data. Oftentimes, the cleaner the data, the better the model's performance. By the end of this module, you'll understand why data wrangling is so important to machine learning. We're also going to cover the skills you need for the course and the skills you do not. Finally, we're going to work through a simple example in Python and use pandas to remove our first attribute from our dataset.
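
Removing an attribute with pandas can be sketched roughly as follows. This is a minimal illustration, not the course's own code: the tiny DataFrame, its column names, and its values are assumptions standing in for the Titanic dataset.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data; the column names and values
# are assumptions for illustration, not the course's actual file.
passengers = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3],
        "Name": ["Passenger A", "Passenger B", "Passenger C"],
        "Age": [22.0, 38.0, 26.0],
        "Survived": [0, 1, 1],
    }
)

# drop() returns a new DataFrame without the named column;
# the original DataFrame is left unchanged.
without_name = passengers.drop(columns=["Name"])

print(without_name.columns.tolist())
```

Because drop() returns a copy by default, the original data stays available if the attribute turns out to be useful later.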

Pandas DataFrame Basics
Hello. I'm Mike West, and welcome back to Data Wrangling in Python for Machine Learning Engineers. This second module provides some more detailed information about the core object in pandas, the DataFrame. This includes defining the DataFrame, understanding why the DataFrame was built on top of NumPy arrays, and why we use methods to manipulate the data. The module will also cover data types. Data types define how data is stored. Using the correct data type is critical to the outcome of your model. For example, you can't use mathematical methods on an object data type, so storing numbers correctly is very important to successful data wrangling. Additionally, this module will cover the sundry components of the DataFrame. The DataFrame has three core components, and you'll learn what they are in this module. In SQL, a SELECT statement is used to retrieve data. In pandas, indexers are used. This module will define indexers and show you how they return data. After loading, merging, and preparing a dataset, a familiar task is to compute group statistics or possibly pivot tables for reporting. In this module, grouping and aggregation will also be covered. Additionally, you'll continue wrangling the Titanic dataset. Upon completing each module, your dataset will be one step closer to the highly structured numerical dataset these models need.
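
The ideas in this module can be sketched in a few lines of pandas. The example below is an illustration under stated assumptions: the data is made up, and reading the "three core components" as the index, the columns, and the underlying values is a common description of the DataFrame rather than something this transcript spells out.

```python
import pandas as pd

# Made-up rows for illustration; Fare is deliberately stored as text.
frame = pd.DataFrame(
    {
        "Pclass": [1, 3, 3, 1],
        "Fare": ["71.28", "7.25", "8.05", "53.10"],
        "Survived": [1, 0, 1, 1],
    }
)

# Data types: text cannot be averaged, so convert Fare to a numeric type first.
frame["Fare"] = pd.to_numeric(frame["Fare"])
mean_fare = frame["Fare"].mean()

# Components (assumed reading): row labels, column labels, and the values.
row_labels = frame.index
column_labels = frame.columns
values = frame.to_numpy()

# Indexers: loc selects by label or boolean mask, iloc selects by position.
first_class = frame.loc[frame["Pclass"] == 1, ["Fare", "Survived"]]
first_cell = frame.iloc[0, 0]

# Grouping and aggregation: mean survival per passenger class.
survival_rate = frame.groupby("Pclass")["Survived"].mean()

print(first_class)
print(survival_rate)
```

The loc call plays roughly the role of a SQL SELECT with a WHERE clause: the mask filters rows and the list names the columns to return.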

Pandas Data Structures
Hello. I'm Mike West, and welcome back to Data Wrangling in Python for Machine Learning Engineers. This third module will provide some more detailed information about the other core object in pandas, the Series. A pandas Series is a one-dimensional array of indexed data. The array is the main data structure in all of machine learning. In this module, the array and other similar structures will be covered. For example, Google has a framework called TensorFlow. A tensor is nothing more than a multi-dimensional array. In this lesson, the importance of domain knowledge will be discussed. Thus far, you've been able to pick some easy attributes to remove. Removing the name attribute and an imported primary key didn't take a lot of real dataset knowledge. In order to wrangle the dataset much further, you'll need more detailed knowledge about the dataset. Python uses a lot of methods and functions. In this module, the core functions used in data wrangling will be covered. For example, almost all datasets have missing values. Handling missing values correctly is critical to improving model performance. Additionally, the module will cover a real-world, abridged guide to data wrangling. This will include a step-by-step process of wrangling real-world data. Finally, you'll wrangle the rest of the dataset. Upon completion of this module, the dataset will be model ready. That means the entire dataset will be composed of numbers.
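
A small sketch of a Series, missing-value handling, and turning text into numbers is shown here. The values are invented, and filling with the median and mapping labels to 0 and 1 are assumed choices for illustration; the course may handle these attributes differently.

```python
import pandas as pd

# A Series is a one-dimensional array of indexed data; None becomes NaN.
ages = pd.Series([22.0, None, 26.0, 35.0], name="Age")

# Count the missing entries before deciding how to handle them.
missing_count = int(ages.isna().sum())

# One common option (an assumption here): fill gaps with the median,
# which pandas computes while skipping the missing values.
filled = ages.fillna(ages.median())

# Models need numbers, so map text categories to integers.
sex = pd.Series(["male", "female", "female", "male"], name="Sex")
sex_encoded = sex.map({"male": 0, "female": 1})

print(filled.tolist())
print(sex_encoded.tolist())
```

Whether to fill, drop, or flag missing values depends on the attribute, which is where the domain knowledge mentioned above comes in.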

Modeling the Cleansed Data
Hello. I'm Mike West, and welcome back to Data Wrangling in Python for Machine Learning Engineers. This fourth module is about modeling our cleansed dataset. Machine learning engineers often affectionately refer to this as the fun part of the job. Building and tweaking models in order to get the best predictive capability is a rewarding process. In this module, the two core types of machine learning models will be covered. There are only two, and you'll learn what they are in this module. Many machine learning models fall into four broad categories. For example, we will learn the difference between classification and regression, two categories of models that are often used in the applied space. In this module, the core steps of building a predictive model in scikit-learn will be covered. Model building, like data wrangling, is a process-oriented endeavor, and scikit-learn has become the gold standard for building traditional machine learning models in Python. Model selection can be very difficult for those new to machine learning. How do you know which model to choose for which project or task? This module will cover a general guide to model selection. Lastly, you're going to be building several models using the dataset that was cleansed throughout the course. Your data wrangling efforts will be rewarded with a successful real-world model.
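
The usual scikit-learn workflow of splitting the data, fitting a model, predicting, and scoring might look like the sketch below. The twelve rows are invented, and logistic regression is an assumed choice of classifier; the transcript does not say which models the course builds.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Invented, already-numeric rows standing in for the wrangled dataset.
data = pd.DataFrame(
    {
        "Pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 1, 2],
        "Sex": [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0],  # assumed coding: 1 = female
        "Age": [38, 35, 54, 27, 30, 14, 22, 35, 4, 20, 45, 28],
        "Survived": [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0],
    }
)

# Step 1: separate the features from the target the model should predict.
X = data[["Pclass", "Sex", "Age"]]
y = data["Survived"]

# Step 2: hold out a test set so the score reflects unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Step 3: fit a classifier (survived or not is a classification task).
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 4: predict on the held-out rows and measure accuracy.
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)

print(predictions, accuracy)
```

With so few rows the accuracy figure means little; the point is the shape of the process, which stays the same when another scikit-learn estimator is swapped in.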