Beginning Data Exploration and Analysis with Apache Spark

Data preparation is often said to take up 80% of a data scientist's time. This course is all about data preparation: cleaning, transforming, and summarizing data using Spark.
Course info
Rating
(90)
Level
Beginner
Updated
Oct 20, 2016
Duration
1h 57m
Description

Data preparation is a staple task for any data professional, whether you just want to explore data or develop sophisticated Machine Learning models. Spark is an engine that helps do this in a very intuitive way, using functional constructs that abstract the user from all the messiness of working with large datasets. In this course, Beginning Data Exploration and Analysis with Apache Spark, you'll go through exploratory data analysis and data munging with Spark, step-by-step. First, you'll explore RDDs and functional constructs that make processing in Spark extremely intuitive. Next, you'll discover how to transform and clean unstructured data. Finally, you'll learn how to summarize data along dimensions and how to model relationships to build co-occurrence networks. By the end of this course, you'll be able to use Spark to transform data in any way that you would like.
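The functional constructs the course builds on (filter, map, and reduce) have direct counterparts in plain Python, which makes the Spark pipeline easy to preview. A minimal sketch of the same flow using Python's built-ins (the log lines are invented for illustration; in Spark, each step would be a method on an RDD, e.g. `rdd.filter(...).map(...).reduce(...)` on data loaded with `sc.textFile(...)`):

```python
from functools import reduce

# Hypothetical raw log lines; in Spark these would come from
# sc.textFile(...) rather than a plain Python list.
lines = ["error 4", "info 1", "error 2", "warn 3"]

# filter: keep only the error lines (rdd.filter in Spark)
errors = filter(lambda line: line.startswith("error"), lines)

# map: extract the numeric field from each line (rdd.map in Spark)
counts = map(lambda line: int(line.split()[1]), errors)

# reduce: combine all values into one total (rdd.reduce in Spark)
total = reduce(lambda a, b: a + b, counts)
print(total)  # 6
```

The same three-step shape, filter out the noise, map to the field you care about, reduce to a summary, carries over unchanged when the list becomes a distributed RDD.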

About the author

Swetha loves playing with data and crunching numbers to get cool insights. She is an alumna of top schools like IIT Madras and IIM Ahmedabad.

More from the author
Classification Using Tree Based Models
Beginner
1h 57m
Jan 6, 2017
More courses by Swetha Kolalapudi
Section Introduction Transcripts

Course Overview
Hi, everyone. My name is Swetha Kolalapudi, and welcome to my course, Beginning Data Exploration and Analysis with Apache Spark. I am the co-founder of a startup called Loonycorn. Cleaning, transforming, and preparing data is a staple task for any data professional, whether they just want to explore data and play with it, or develop sophisticated machine learning models. Spark is an engine that helps us do this in a very intuitive way, using functional constructs that abstract the user from all the messiness of working with large data sets. This course is all about using Spark and resilient distributed data sets to process complicated data. By the time you are done, you will be comfortable using functional constructs like filter, map, and reduce to transform data and use RDDs and Pair RDDs to summarize and merge data sets. Some of the major topics that we will cover include transforming and cleaning unstructured data, summarizing data along dimensions, and modeling relationships to build co-occurrence networks. By the end of this course, you'll be able to use Spark to transform data in any way that you like. Before beginning the course, you should be familiar with Python at the basic level. I hope you'll join me on this journey to learn Beginning Data Exploration and Analysis with Apache Spark, at Pluralsight.
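The overview mentions using Pair RDDs to summarize data along dimensions. A minimal pure-Python sketch of what `reduceByKey` does on a Pair RDD (the keys and values here are invented for illustration; in Spark you would build the pairs with something like `rdd.map(lambda r: (r.city, r.amount))` and then call `.reduceByKey(lambda a, b: a + b)`):

```python
# Illustrative (key, value) pairs, e.g. (city, sale_amount).
pairs = [("NY", 10), ("SF", 5), ("NY", 7), ("SF", 3)]

def reduce_by_key(pairs, func):
    """Mimic Pair RDD reduceByKey: combine all values that share a key with func."""
    acc = {}
    for key, value in pairs:
        # First value for a key is kept as-is; later values are folded in.
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

totals = reduce_by_key(pairs, lambda a, b: a + b)
print(totals)  # {'NY': 17, 'SF': 8}
```

Grouping by a key and folding the values is exactly "summarizing along a dimension": the key is the dimension (here, city) and the reducing function is the summary (here, a sum).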