Cleaning Data: Python Data Playbook

Cleaning a dataset is an essential part of any data project, but it can be challenging. This course will teach you the basics of cleaning datasets with pandas, using techniques that you can apply immediately in real-world projects.
Course info
Level
Beginner
Updated
Dec 10, 2018
Duration
1h 8m
Description

At the core of any successful project involving a real-world dataset is a thorough knowledge of how to rid that dataset of missing, bad, or inaccurate data. In this course, Cleaning Data: Python Data Playbook, you'll learn how to use pandas to clean a real-world dataset. First, you'll learn how to understand, view, and explore the data you have. Next, you'll explore how to access just the data that you want to keep in your dataset. Finally, you'll discover different ways to handle bad and missing data. When you're finished with this course, you'll have a foundational knowledge of cleaning real-world datasets with pandas that will help you as you move on to real-world data science or machine learning problems.
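To give a flavor of that first "view and explore" step, here is a minimal sketch in pandas. It assumes the dataset has been saved as a local CSV; the file name tate_artworks.csv is a placeholder, not the course's exact file.

```python
import pandas as pd

# A first look at the data. The file name is a placeholder for
# wherever the course dataset is saved locally.
df = pd.read_csv("tate_artworks.csv")

print(df.shape)    # (rows, columns) -- about 69,000 rows in the course dataset
print(df.head())   # first five rows
print(df.columns)  # column names
df.info()          # dtypes and non-null counts per column
```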

About the author

Chris is a software consultant focused on web, mobile, and machine learning. He uses React, React Native, Node.js, Ruby on Rails, and Python.

Section Introduction Transcripts

Course Overview
Hi everyone. My name is Chris Achard, and welcome to my course, Cleaning Data: Python Data Playbook. I'm an independent software consultant with 10 years of industry experience in web and mobile apps and machine learning. In any project that I've done that deals with a lot of data, cleaning data always seems to take an enormous amount of time, and I've always been looking for ways to speed up that process. With pandas and Python, I've found a toolset that makes it fast and efficient to sniff out bad data, and it will give you the confidence that you can clean up any bad data in your dataset. It's an often overlooked topic, but I'm excited to teach you how to clean data because of how much it can help in your projects. In this course, we are going to look at a real-world dataset with 69,000 rows. Some of the major topics that this course will cover include selecting and renaming columns and rows, filtering the dataset, transforming or manipulating your data, and identifying and dealing with bad data. By the end of this course, you will have the toolset that you need to tackle and clean up any structured dataset and prepare it for the next step in your project, whether that's a machine learning model, sending it off to a database, or connecting it to your API. Before beginning this course, you should have a basic understanding of Python, but you don't need to know anything specifically about pandas before we get started. I hope that you'll join me on this journey of learning how to clean up data with pandas, at Pluralsight.

Removing and Fixing Columns with pandas
Welcome to the second module in the Cleaning Data course. Here we'll learn how to remove and fix the names of columns with pandas. Saying that we're going to fix a column might sound vague, but we really mean one of three things. First, we might drop columns from the dataset if we know that we don't want to use them at all; second, we could change the case of all the columns at once or use a function to change the case for us; and lastly, we can rename either individual columns or all the columns at once. If you're using a dataset for an internal project, it may not be immediately obvious why you want to or need to fix columns in your dataset as one of the first steps of the project. So here are three reasons why fixing columns is important. First, if you're working on a project as a team, it's a hassle to share large files with a bunch of outdated or irrelevant information, so you may fix up your dataset just to make it easier to collaborate with your team. Second, and probably most prevalent, is that your data will likely be used later by another algorithm or system, and in that case, it's imperative that the dataset you produce match the expected input of the other system; renaming or dropping columns is an important part of that. And lastly, if you have a lot of data and many files, it can simply be a hassle to keep track of everything if there are irrelevant or badly named columns floating around. In the demo for this module, we'll continue looking at the Tate Gallery dataset, and we'll start by simply dropping columns that aren't relevant to our analysis. We'll also learn the importance of inplace=True, which applies not just to fixing columns, but to many, many of the methods in pandas. Then we'll look at how to make a batch change to all of our columns, like how to change the case for all columns to lowercase. And finally, we'll look at how to rename columns, either one at a time or all at once, and even as we import the CSV in the first place.
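A minimal sketch of those three column fixes, assuming illustrative file and column names rather than the course's exact ones:

```python
import pandas as pd

df = pd.read_csv("tate_artworks.csv")  # placeholder file name

# 1. Drop columns we know we won't use. inplace=True modifies df
#    directly instead of returning a changed copy.
df.drop(columns=["thumbnailUrl", "thumbnailCopyright"], inplace=True)

# 2. Batch change: lowercase all column names at once.
df.columns = df.columns.str.lower()

# 3. Rename individual columns (pass more keys in the mapping to
#    rename several at once). Renaming at import time is also
#    possible via pd.read_csv(path, header=0, names=[...]).
df.rename(columns={"accession_number": "id"}, inplace=True)
```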

Indexing and Filtering Datasets
Welcome to the third module in Cleaning Data with pandas. Here we'll talk about indexing and filtering datasets. In this module, you'll learn the tools that will really help you dive into and explore your data and narrow your dataset down to exactly what you want. This is one of the most fundamental parts of handling data with pandas, so let's dive in. When we talk about indexing and filtering, we mean viewing just certain data, either a certain row or column, or filtering the data based on some condition. We'll start by looking at how to access just an individual row or column or a group of rows or columns, and we'll continue with the loc and iloc methods, which are powerful ways to dig into your data, and then we'll finish by using the string method contains, which allows us to filter the data using loc. In the demo, we'll continue using the Tate Gallery dataset, and we'll start by learning about using square brackets right on the dataframe. This looks like array access, but there are some important differences that you'll need to keep in mind. Next, we'll continue with loc and iloc, which let us filter by row or column label or by row or column index. And finally, we'll see how to combine loc with string methods like .contains to get customizable filtering of our dataset.
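A short sketch of those three techniques, again with placeholder file and column names:

```python
import pandas as pd

df = pd.read_csv("tate_artworks.csv")  # placeholder file name

# Square brackets select a column (a Series), not a row -- one of
# the ways this differs from plain array access.
artists = df["artist"]

# loc selects by label, iloc by integer position.
first_row = df.iloc[0]                     # first row, by position
subset = df.loc[0:9, ["artist", "title"]]  # rows 0-9 by label, two columns

# Combine loc with the string method .contains for custom filtering.
# na=False treats rows with a missing title as non-matches.
landscapes = df.loc[df["title"].str.contains("Landscape", na=False)]
```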

Handling Bad, Missing, and Duplicate Data
Welcome to the last module in this course about cleaning data with pandas. In this module, we're going to talk about how to handle bad, missing, and duplicate data. Once we understand our data and have filtered it down to the columns and rows that we want, this is the last step to make really clean datasets that we can feed into our models and other processes. There are many types of errors that we may have to fix in our data. So the first thing that you have to figure out is what counts as bad data. It might mean missing values or values that haven't been parsed correctly, or it may mean something else entirely, based on your specific dataset. And once you know what you're looking for, the next thing is to figure out what you want to do with your dataset and what your goal is. If you have a very specific API or machine learning model that your data is going to be fed into, then that might define your goal state for you. Or if you just want to generally clean up your data so that you can interpret it more clearly, then you may choose different options when cleaning the data, and your specific goal will inform your decision about what to do when you find bad data. You can just drop it, fill it with some specific value, or replace one value with another. We can't anticipate the exact needs that you'll have with your own dataset, but I can teach you the tools that pandas has available to clean your dataset. For our demo, we'll continue to look at the Tate dataset, though we'll look at more examples on the entire dataset, because large datasets tend to have more bad data problems that are interesting to solve. We'll start by fixing a systemic issue in many datasets, and we'll look at how to strip white space out of an entire column at once. Then we'll learn how to replace specific bad data with something else and move on to filling bad or missing data with NaN (not-a-number) values, which can help us later. Then we'll be able to drop entire rows based on those NaN values. And we'll finish up by looking for duplicate rows and dropping or otherwise dealing with them.
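A minimal sketch of that cleanup sequence; the file, column names, and the "bad" values below are placeholders, not the course's exact examples:

```python
import pandas as pd

df = pd.read_csv("tate_artworks.csv")  # placeholder file name

# Strip whitespace out of an entire column at once.
df["artist"] = df["artist"].str.strip()

# Replace specific bad data with something else.
df["medium"] = df["medium"].replace("no data", "Unknown")

# Turn unparseable values into NaN so pandas' missing-data tools apply...
df["year"] = pd.to_numeric(df["year"], errors="coerce")

# ...then drop entire rows based on those NaN values.
df.dropna(subset=["year"], inplace=True)

# Finally, find and drop duplicate rows.
print(df.duplicated().sum())  # how many exact duplicate rows exist
df.drop_duplicates(inplace=True)
```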