Data Science & Hadoop Workflows at Scale With Scalding

Learn how to use Scalding and Algebird, and join Twitter, Etsy, eBay, and others in efficiently extracting value from data and processing it at scale on Hadoop.
Course info
Rating
(67)
Level
Beginner
Updated
Dec 4, 2014
Duration
2h 8m
Description

This course teaches you how to use Scalding, a domain-specific language (DSL) built on Scala and Cascading, to build distributed applications on Hadoop. It also focuses on the data science side, using Algebird, an abstract algebra library for Scala, to solve real-world sketching and streaming problems on distributed systems. You will learn how to reason about a variety of problems, how to build and test locally, and how to deploy on Hadoop. You will also learn the algorithms used to solve problems at scale, where performance, compute and memory resources, and the window of time available to process streaming data are all constraints, and how Scalding and Algebird help you work within them. The course also covers some Scala basics to get you up to speed, and looks at how you can monitor, visualize, and troubleshoot your application's workflow and performance problems. Watch this course if you are considering, or already know how to use, Pig, Hive, or another DSL for Hadoop, and you want not only more power over your workflows but also a DSL that is actively being developed to support up-and-coming execution frameworks such as Apache Tez and Apache Spark, with all the flexibility that a full functional programming language like Scala has to offer. If you're serious about learning how to build enterprise-grade applications on Hadoop, data science, and Lambda architectures, then this course is for you.

About the author

Ahmad is a Data Architect who specializes in implementing high-performance data warehouses and BI systems and enjoys speaking at user groups and conferences.

More from the author
SQL on Hadoop - Analyzing Big Data with Hive
Intermediate
4h 16m
8 Oct 2013
Section Introduction Transcripts

Introduction to Scalding
Hi. My name is Ahmad Alkilani and I'll be guiding you throughout this course, but before we really get into the details I want to give you an idea of what we're going to cover, so you know what to expect. If you've never heard of Scalding, Scala, or Hadoop and you're just curious about some of the data science problems we're going to solve, well, this just might also be the course for you, as I will cover enough of the basics that you're at least comfortable venturing out on your own. So why Scalding for data science, and why should you care? Well, let's take the infamous Hadoop hello world example, the word count.
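
For readers who have not seen it, the word count mentioned above can be sketched in a few lines. This is a minimal single-machine illustration in plain Python, not Scalding code and not the course's own demo; the function name `word_count` and the sample input are assumptions made for the example. On Hadoop the same two steps (split lines into words, then sum a count per word) are spread across many machines.

```python
from collections import Counter


def word_count(lines):
    """Count how often each word appears across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # "Map" step: break the line into lowercase, whitespace-separated words.
        words = line.lower().split()
        # "Reduce" step: add one to the running total for each word seen.
        counts.update(words)
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = ["hello hadoop", "hello scalding"]
    print(word_count(sample))
```

The point of the example in the course is how much boilerplate a framework needs to express these two steps; the logic itself is this small.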

Building Applications With Scalding
Welcome to this module. Hi. My name is Ahmad Alkilani and in this module we'll be looking at Scalding in depth, to the point where you're comfortable building your own applications. Being able to work with data, cleanse it, filter it, and join different data sets together are all essential parts of reaping the most benefit from data science. This module aims to equip you with the tools necessary to do that. Let's take a look at what we're going to cover. We'll start off by showing you how to set up the Scalding REPL and the REPL in Eclipse's Scala worksheet, where we'll be spending most of our time going through demos. The demos we'll walk through will introduce map-side operations like Map and FlatMap, and we'll also look at the different options you have for reductions using Reduce, Fold, and FoldLeft, in addition to group operations, where we'll examine a flight status data set and try to solve some interesting problems. We'll also take a glimpse at some of the functions Scalding has built in that utilize streaming algorithms, like priority queues in the Algebird library. We'll also cover the different types of joins, with a focus on how choosing the correct type is relevant to the performance of your application running on Hadoop. So without any further ado, let's get started.
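
As a rough orientation before the demos, the operations named above can be sketched on an in-memory list. This is plain Python rather than Scalding, and the `FLIGHTS` records, field layout, and function names are all hypothetical stand-ins for the flight status data, chosen only to show how map, flatMap, reduce, fold, and grouping differ.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain

# Hypothetical flight-status records: (carrier, origin, delay_minutes).
FLIGHTS = [
    ("AA", "SEA", 10),
    ("AA", "SFO", 0),
    ("DL", "SEA", 25),
    ("DL", "JFK", 5),
]


def delays(flights):
    # map: exactly one output value per input record.
    return [delay for _, _, delay in flights]


def carrier_and_origin(flights):
    # flatMap: each record yields several values, flattened into one list.
    return list(chain.from_iterable((c, o) for c, o, _ in flights))


def total_delay(flights):
    # reduce: combine values of the same type pairwise (needs non-empty input).
    return reduce(lambda a, b: a + b, delays(flights))


def count_delayed(flights):
    # fold left: start from an initial value; the accumulator (a count)
    # can have a different type than the records being folded.
    return reduce(lambda acc, f: acc + (1 if f[2] > 0 else 0), flights, 0)


def avg_delay_by_carrier(flights):
    # group: collect delays per carrier, then aggregate each group.
    groups = defaultdict(list)
    for carrier, _, delay in flights:
        groups[carrier].append(delay)
    return {c: sum(d) / len(d) for c, d in groups.items()}
```

The practical difference between reduce and fold in this sketch is the starting value: the fold version works on an empty list, while the reduce version does not.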

Scalding on Hadoop
Hi. This is Ahmad Alkilani and welcome to this module. Our focus so far has been to get you up and running with Scalding, and we've been doing all of our work locally. In this module I'll show you how easy it is to move your work to Hadoop, and we'll focus on a few Hadoop-specific features. Then we'll look into how you can visualize your workflow to get a better understanding of your application's behavior, and we'll look at two different ways you can do that. So let's get right to it and start off with a demo.

Data Science With Scalding
Hi and welcome to this module. My name is Ahmad Alkilani and in this module I'll introduce you to some of the basic techniques that make working with big data possible, and kudos on making it this far. I hope you'll find this module to be just as useful for building big data applications and applying data science. What I hope you'll achieve in this module is a change in how you think about big data problems relative to, let's say, a query against a database. If there's something I would like you to take with you out of this module, it is the realization that exact numbers matter less than the speed at which you can get an approximate answer with a high degree of certainty, and the resources you need to get to that answer, be it in terms of memory or processing power. In this module we'll cover monoids, what they mean, and why they're important, and then we'll look into practical applications of monoids in the form of Priority Queues, Bloom Filters, and HyperLogLog. We'll walk through a few demos to show how and where you might use these techniques, and we'll also see how the Algebird library and Scalding make working with these algorithms simpler. So let's get started.
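
To make the monoid idea concrete: a monoid is a type with an associative combine operation and an identity ("zero") value, which is what lets partial results from different machines be merged in any grouping. The sketch below is a toy Bloom filter in plain Python, not Algebird's implementation or API; the class, its parameters (`num_bits`, `num_hashes`), and the helper names are assumptions for illustration. Its combine operation is a bitwise OR and its identity is the empty filter.

```python
import hashlib
from functools import reduce


class BloomFilter:
    """Toy Bloom filter: approximate set membership in a fixed number of bits."""

    def __init__(self, num_bits=1024, num_hashes=3, bits=0):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bits  # a Python int used as a bit set

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        bits = self.bits
        for pos in self._positions(item):
            bits |= 1 << pos
        return BloomFilter(self.num_bits, self.num_hashes, bits)

    def might_contain(self, item):
        # False means "definitely absent"; True means "probably present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def plus(self, other):
        # Monoid combine: bitwise OR. Associative, with the empty filter as zero.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share num_bits and num_hashes")
        return BloomFilter(self.num_bits, self.num_hashes, self.bits | other.bits)


def build_filter(items):
    """Build a filter for one partition of the data."""
    return reduce(lambda bf, item: bf.add(item), items, BloomFilter())


def merge_filters(filters):
    """Merge per-partition filters, starting from the monoid's zero."""
    return reduce(lambda a, b: a.plus(b), filters, BloomFilter())
```

Because the combine step is associative, filters built independently on separate partitions merge into the same result as one filter built over all the data. The trade-off is the one described above: a fixed, small amount of memory in exchange for an answer that can include false positives but never false negatives.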