Getting Started with Stream Processing with Spark Streaming

The Spark Streaming module lets you work with large-scale streaming data using familiar batch-processing abstractions. This course begins with how standard transformations and actions are performed on streams, then moves on to more advanced topics.
Course info
Rating: (60)
Level: Beginner
Updated: Jan 27, 2017
Duration: 2h 35m
Description

Traditional distributed systems like Hadoop work on data stored in a file system, and jobs can run for hours, sometimes days. This is a major limitation when processing real-time data such as trends and breaking news. The Spark Streaming module extends the Spark batch infrastructure to handle data for real-time analysis. In this course, Getting Started with Stream Processing with Spark Streaming, you'll learn the nuances of dealing with streaming data using the same basic Spark transformations and actions that work with batch processing. Next, you'll explore how to extend machine learning algorithms to work with streams. Finally, you'll learn the subtle details of how the streaming K-means clustering algorithm helps find patterns in data. By the end of this course, you'll feel confident in your knowledge, and you can start integrating what you've learned into your own projects.
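
To make the idea concrete, here is a minimal sketch of a streaming word count in Scala. The socket source on localhost:9999 and the one-second batch interval are illustrative assumptions, not details taken from the course.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        // One-second micro-batches: each batch is an ordinary RDD.
        val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
        val ssc = new StreamingContext(conf, Seconds(1))

        // A DStream backed by a TCP text source (host and port are placeholders).
        val lines = ssc.socketTextStream("localhost", 9999)

        // The same transformations used in batch jobs apply to each micro-batch.
        val counts = lines.flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.print() // an output action, run once per batch interval

        ssc.start()
        ssc.awaitTermination()
      }
    }

Because each micro-batch is just an RDD, flatMap, map, and reduceByKey behave exactly as they do in a batch job; only the source and the output action are streaming-specific.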

About the author

A problem solver at heart, Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.

Section Introduction Transcripts

Course Overview
Hi, my name is Janani Ravi, and I'm very happy to meet you today. I have a master's degree in electrical engineering from Stanford, and I have worked at companies such as Microsoft, Google, and Flipkart. At Google, I was one of the first engineers working on real-time collaborative editing in Google Docs, and I hold four patents for its underlying technologies. I currently work on my own startup, Loonycorn, a studio for high-quality video content. Traditional distributed systems work on a large number of files partitioned across multiple machines in a cluster, and jobs may take hours, even days, to run. This is a major limitation when we want to analyze real-time data to see what's trending, or to track things like breaking news. Apache Spark is a general-purpose engine for large-scale data processing, which runs super fast and is very easy and intuitive to use. Spark has a special streaming module, which deals with real-time data. It is built on the discretized stream abstraction, which treats a stream as a sequence of small batches of data. In this course, you'll learn the nuances of dealing with streaming data using the same basic Spark transformations and actions that work with batch processing. This course also shows you how to extend machine learning algorithms to work with streams, and it will help you understand the subtle details of how the streaming k-means clustering algorithm helps find patterns in streaming data. And to top it all off, you'll build a fault-tolerant, real-world project, where you connect to a live stream to track trending hashtags in tweets.
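
As a taste of the streaming k-means material, here is a minimal sketch in Scala using MLlib's StreamingKMeans. The training directory, the "[x,y]" point format, and the parameter choices (k = 3, two-dimensional points) are illustrative assumptions, not the course's own dataset.

    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingKMeansSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingKMeansSketch")
        val ssc = new StreamingContext(conf, Seconds(5))

        // Assumes each line dropped into this directory is a point such as "[1.0,2.0]"
        // (the path and format are placeholders for illustration).
        val trainingData = ssc.textFileStream("/tmp/kmeans-train").map(Vectors.parse)

        // Cluster centers are updated incrementally as each micro-batch arrives;
        // the decay factor controls how quickly older batches are forgotten.
        val model = new StreamingKMeans()
          .setK(3)
          .setDecayFactor(1.0)
          .setRandomCenters(2, 0.0)

        model.trainOn(trainingData)
        model.predictOn(trainingData).print() // cluster assignments per batch

        ssc.start()
        ssc.awaitTermination()
      }
    }

Because the model is updated batch by batch rather than retrained from scratch, it can follow clusters that drift over time, which is what makes it suitable for finding patterns in live data.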