Apache Spark on Databricks

Author: Janani Ravi

7 courses • 14 hours

In Apache Spark on Databricks, you will learn the ins and outs of Apache Spark on Databricks. You will learn how to handle batch data, process streaming data, perform windowing and join operations, run predictive analytics using MLlib, execute graph algorithms, and optimize Apache Spark.

Prerequisites

Intermediate programming experience in Python or Scala. Beginner experience with the DataFrame API.

Beginner

You will learn Spark transformations, actions, visualizations, and functions leveraging the Databricks API. You will also learn how to transform and aggregate batch data using Spark with built-in and user-defined functions, and perform windowing and join operations on batch data.

Getting Started with Apache Spark on Databricks

by Janani Ravi

Oct 25, 2021 / 1h 52m

Description

Azure Databricks allows you to work with big data processing and queries using the Apache Spark unified analytics engine. With Azure Databricks you can set up your Apache Spark environment in minutes, autoscale your processing, and collaborate and share projects in an interactive workspace.

In this course, Getting Started with Apache Spark on Databricks, you will learn the components of the Apache Spark analytics engine, which allows you to process batch as well as streaming data using a unified API. First, you will learn how the Spark architecture is configured for big data processing. You will then learn how the Databricks Runtime on Azure makes it easy to work with Apache Spark on the Azure Cloud Platform, and you will explore the basic concepts and terminology for the technologies used in Azure Databricks.

Next, you will learn the workings and nuances of Resilient Distributed Datasets, also known as RDDs, the core data structure used for big data processing in Apache Spark. You will see that RDDs are the data structures on top of which Spark DataFrames are built. You will study the two types of operations that can be performed on DataFrames, namely transformations and actions, and understand the difference between them. You’ll also learn how Databricks allows you to explore and visualize your data using the display() function, which leverages native Python libraries for visualizations.
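
To make the transformation/action distinction concrete, here is a minimal sketch of the kind of code the course covers. The sample data, column names, and variable names are invented, and display() assumes a Databricks notebook where the SparkSession is already available as spark.

    from pyspark.sql import functions as F

    # Toy DataFrame; on Databricks, `spark` is the pre-created SparkSession.
    sales_df = spark.createDataFrame(
        [("books", 12.50), ("games", 30.00), ("books", 8.75)],
        ["category", "amount"],
    )

    # Transformations are lazy: this only builds a query plan.
    by_category = (sales_df
                   .filter(F.col("amount") > 10)
                   .groupBy("category")
                   .agg(F.sum("amount").alias("total")))

    # Actions trigger execution on the cluster.
    by_category.show()
    row_count = by_category.count()

    # display() is Databricks-specific and renders the result with built-in charts.
    display(by_category)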

Finally, you will get hands-on experience with big data processing operations such as projection, filtering, and aggregation. Along the way, you will learn how to read data from an external source such as Azure cloud storage and how to use built-in functions in Apache Spark to transform your data.
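
As a rough sketch of those operations, the snippet below reads a CSV file from Azure storage and applies a projection, a filter, a built-in function, and an aggregation. The storage account, container, file path, and column names are placeholders.

    from pyspark.sql import functions as F

    # Hypothetical path to a CSV file in an ADLS Gen2 storage account.
    raw_df = (spark.read
              .option("header", True)
              .option("inferSchema", True)
              .csv("abfss://data@examplestorage.dfs.core.windows.net/orders.csv"))

    summary_df = (raw_df
                  .select("order_id", "city", "price")         # projection
                  .filter(F.col("price") > 0)                   # filtering
                  .withColumn("city", F.upper(F.col("city")))   # built-in function
                  .groupBy("city")
                  .agg(F.avg("price").alias("avg_price")))      # aggregation

    display(summary_df)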

When you are finished with this course you will have the skills and ability to work with basic transformations, visualizations, and aggregations using Apache Spark on Azure Databricks.

Table of contents
  1. Course Overview
  2. Overview of Apache Spark on Databricks
  3. Transformations, Actions, and Visualizations
  4. Modify Data Using Spark Functions

Handling Batch Data with Apache Spark on Databricks

by Janani Ravi

Nov 22, 2021 / 2h 21m

Description

Azure Databricks allows you to work with big data processing and queries using the Apache Spark unified analytics engine. Azure Databricks supports a variety of batch sources and makes it seamless to analyze, visualize, and process data on the Azure Cloud Platform.

In this course, Handling Batch Data with Apache Spark on Databricks, you will learn how to perform transformations and aggregations on batch data with selection, filtering, grouping, and ordering queries that use the DataFrame API. You will understand the difference between narrow transformations and wide transformations in Spark, which will help you figure out why certain transformations are more efficient than others. You will also see how you can execute these same transformations by running SQL queries on your data.

Next, you will learn how to implement your own custom user-defined functions to process your data. You will write code in Azure Databricks notebooks to define and register your UDFs and use them to transform your data. You will also learn how to define and use different flavors of vectorized UDFs for data processing and see why vectorized UDFs are often more efficient than regular UDFs. Along the way, you will see how you can read from Azure Cosmos DB as a source for your batch data.

Finally, you will see how you can repartition your data in memory to improve processing performance, use window functions to compute statistics on your data, and combine DataFrames using union and join operations.

When you’re finished with this course, you will have the skills and ability to perform advanced transformations and aggregations on batch data, including defining and using user-defined functions for processing.
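
The sketch below illustrates a few of these techniques side by side: a regular UDF, a vectorized (pandas) UDF, and a window function. The data, column names, and tax rate are invented, and the code assumes a Databricks notebook with spark available.

    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType
    from pyspark.sql.window import Window

    orders = spark.createDataFrame(
        [("east", 100.0), ("east", 250.0), ("west", 80.0)],
        ["region", "amount"],
    )

    # Regular UDF: invoked once per row.
    @F.udf(DoubleType())
    def add_tax(amount):
        return amount * 1.08

    # Vectorized UDF: invoked once per batch as a pandas Series, which usually
    # cuts down on serialization overhead.
    @F.pandas_udf(DoubleType())
    def add_tax_vectorized(amount: pd.Series) -> pd.Series:
        return amount * 1.08

    # Window function: rank orders by amount within each region.
    by_region = Window.partitionBy("region").orderBy(F.desc("amount"))

    result = (orders
              .withColumn("with_tax_udf", add_tax("amount"))
              .withColumn("with_tax_pandas", add_tax_vectorized("amount"))
              .withColumn("rank_in_region", F.rank().over(by_region)))

    result.show()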

Table of contents
  1. Course Overview
  2. Transforming Data Using DataFrames
  3. Transforming Data Using Spark SQL
  4. Applying User-defined Functions to Transform Data
  5. Processing Data Using Joins and Window Functions

Intermediate

You will learn how to use Spark abstractions for streaming data and perform transformations on streaming data using the Spark streaming APIs on Databricks, as well as how to leverage windowing, watermarking, and join operations on streaming data in Spark for your specific use cases.

Processing Streaming Data with Apache Spark on Databricks

by Janani Ravi

Oct 25, 2021 / 2h 51s

Description

Structured streaming in Apache Spark treats real-time data as a table that is being constantly appended. This leads to a stream processing model that uses the same APIs as a batch processing model - it is up to Spark to incrementalize our batch operations to work on the stream. The burden of stream processing shifts from the user to the system, making it very easy and intuitive to process streaming data with Spark.

In this course, Processing Streaming Data with Apache Spark on Databricks, you’ll learn to stream and process data using abstractions provided by Spark structured streaming. First, you’ll understand the difference between batch processing and stream processing and see the different models that can be used to process streaming data. You will also explore the structure and configurations of the Spark structured streaming APIs.

Next, you will learn how to read from a streaming source using Auto Loader on Azure Databricks. Auto Loader automates the process of reading streaming data from a file system and takes care of file management and tracking of processed files, making it easy to ingest data from external cloud storage sources. You will then perform transformations and aggregations on streaming data and write data out to storage using the append, complete, and update output modes.
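
A minimal sketch of this pattern is shown below, assuming a Databricks workspace where Auto Loader (the cloudFiles source) is available; the storage path, schema location, checkpoint location, and column names are placeholders.

    # Ingest JSON files as they arrive in cloud storage.
    events = (spark.readStream
              .format("cloudFiles")
              .option("cloudFiles.format", "json")
              .option("cloudFiles.schemaLocation", "/tmp/schemas/events")
              .load("abfss://landing@examplestorage.dfs.core.windows.net/events/"))

    counts = events.groupBy("event_type").count()

    # "complete" rewrites the whole aggregate on each trigger; "append" and
    # "update" are the other output modes discussed in the course.
    query = (counts.writeStream
             .outputMode("complete")
             .format("delta")
             .option("checkpointLocation", "/tmp/checkpoints/event_counts")
             .start("/tmp/delta/event_counts"))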

Finally, you will learn how to use SQL-like abstractions on input streams. You will connect to an external cloud storage source, an Amazon S3 bucket, and read in your stream using Auto Loader. You will then run SQL queries to process your data. Along the way, you will make your stream processing resilient to failures using checkpointing and you will also implement your stream processing operation as a job on a Databricks Job Cluster.

When you’re finished with this course, you’ll have the skills and knowledge of streaming data in Spark needed to process and monitor streams and identify use cases for transformations on streaming data.

Table of contents
  1. Course Overview
  2. Overview of the Streaming Architecture in Apache Spark
  3. Applying Transformations on Streaming Data
  4. Executing SQL Queries on Streaming Data

Windowing and Join Operations on Streaming Data with Apache Spark on Databricks

by Janani Ravi

Nov 2, 2021 / 2h 2m

Description

Structured Streaming in Apache Spark treats real-time data as a table that is being constantly appended. In such a stream processing model the burden of stream processing shifts from the user to the system, making it very easy and intuitive to process streaming data with Spark. Apache Spark supports a range of windowing and join operations on streaming data using processing time and event time.

In this course, Windowing and Join Operations on Streaming Data with Apache Spark on Databricks, you will learn the difference between stateless operations, which operate on a single streaming entity, and stateful operations, which operate on multiple entities accumulated in a stream. Then, you will explore the different kinds of windows supported by Apache Spark, which include tumbling windows, sliding windows, and global windows.

Next, you will understand the differences between event time, ingestion time, and processing time and see how you can perform windowing operations using both processing time and event time. Along the way, you will connect to an HDInsight Kafka cluster to read records for your input stream. You will then use watermarking to deal with late-arriving data and see how you can use watermarks to limit the state that Apache Spark stores.
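
For example, a sliding-window count on event time with a watermark might look like the following sketch, where events is assumed to be a streaming DataFrame with event_time and sensor_id columns and the paths are placeholders.

    from pyspark.sql import functions as F

    windowed = (events
                # Tolerate records arriving up to 10 minutes late, and let Spark
                # discard window state older than the watermark.
                .withWatermark("event_time", "10 minutes")
                .groupBy(
                    # 15-minute windows that slide every 5 minutes.
                    F.window("event_time", "15 minutes", "5 minutes"),
                    "sensor_id")
                .count())

    query = (windowed.writeStream
             .outputMode("append")   # finalized windows are emitted once the watermark passes
             .format("delta")
             .option("checkpointLocation", "/tmp/checkpoints/windowed")
             .start("/tmp/delta/windowed_counts"))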

Finally, you will perform join operations using streams and explore the types of joins that Spark supports for static-stream joins and stream-stream joins. You will also see how you can connect to Azure Event Hubs to read records.
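
As an illustration, a stream-static join and a stream-stream join could be sketched as below; impressions and clicks are assumed streaming DataFrames with their own event-time columns, campaigns is a static lookup DataFrame, and all column names are invented.

    from pyspark.sql import functions as F

    # Stream-static join: enrich each streaming click with campaign metadata.
    enriched = clicks.join(campaigns, on="campaign_id", how="left")

    # Stream-stream join: both sides carry watermarks so Spark can bound the
    # state it keeps while waiting for matches.
    joined = (impressions.withWatermark("impression_time", "30 minutes")
              .join(
                  clicks.withWatermark("click_time", "30 minutes"),
                  F.expr("""
                      click_ad_id = ad_id AND
                      click_time BETWEEN impression_time AND impression_time + interval 1 hour
                  """)))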

When you are finished with this course, you will have the skills and knowledge of windowing and join operations needed to identify when these powerful transformations should be performed and how they are performed.

Table of contents
  1. Course Overview
  2. Performing Windowing Operations on Data
  3. Exploring Aggregations Using Watermarks
  4. Performing Join Operations on Data

Advanced

You will understand and implement important techniques for predictive analytics, such as regression and classification, using the Apache Spark MLlib APIs on Databricks, and learn how to implement graph algorithms such as Triangle Count and PageRank and visualize them using the GraphFrames API on Databricks. You will also learn how to optimize the performance of Spark clusters by identifying and mitigating performance issues such as data ingestion problems, and by leveraging the new features offered by Spark 3.

Predictive Analytics Using Apache Spark MLlib on Databricks

by Janani Ravi

Oct 26, 2021 / 1h 57m

Description

The Spark unified analytics engine is one of the most popular frameworks for big data analytics and processing. Spark offers comprehensive and easy-to-use APIs for machine learning which you can use to build predictive models for regression and classification and to pre-process data to feed into these models.

In this course, Predictive Analytics Using Apache Spark MLlib on Databricks, you will learn to implement machine learning models using the Spark ML APIs. First, you will understand the different Spark libraries available for machine learning: the older RDD-based library and the newer DataFrame-based library. You will then explore the range of transformers available in Spark for pre-processing data for machine learning, such as scaling and standardization transformers for numeric data and label encoding and one-hot encoding transformers for categorical data.
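
The following sketch shows what those pre-processing transformers typically look like in the DataFrame-based pyspark.ml API; the column names are invented.

    from pyspark.ml.feature import (StringIndexer, OneHotEncoder,
                                    VectorAssembler, StandardScaler)

    # Label-encode a categorical column, then one-hot encode the indices.
    indexer = StringIndexer(inputCol="city", outputCol="city_index")
    encoder = OneHotEncoder(inputCols=["city_index"], outputCols=["city_vec"])

    # Assemble numeric and encoded columns into one feature vector, then scale it.
    assembler = VectorAssembler(inputCols=["city_vec", "sqft", "bedrooms"],
                                outputCol="features_raw")
    scaler = StandardScaler(inputCol="features_raw", outputCol="features")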

Next, you will use linear regression and ensemble models such as random forest and gradient boosted trees to build regression models. You will use these models for prediction on batch data. In addition, you will also see how you can use Spark ML Pipelines to chain together transformers and estimators to build a complete machine learning workflow.
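
Continuing the sketch above, the transformers can be chained with a regression estimator in a Pipeline; train_df and test_df are assumed DataFrames containing the feature columns plus a numeric price label.

    from pyspark.ml import Pipeline
    from pyspark.ml.regression import RandomForestRegressor
    from pyspark.ml.evaluation import RegressionEvaluator

    rf = RandomForestRegressor(featuresCol="features", labelCol="price", numTrees=50)

    pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler, rf])
    model = pipeline.fit(train_df)

    predictions = model.transform(test_df)
    rmse = RegressionEvaluator(labelCol="price", metricName="rmse").evaluate(predictions)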

Finally, you will implement classification models using logistic regression as well as decision trees. You will train the ML model using batch data but perform predictions on streaming data. You will also use hyperparameter tuning and cross-validation to find the best model for your data.
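
A classification example with cross-validated hyperparameter tuning might be sketched like this, assuming train_df already holds a features vector column and a binary label column.

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr = LogisticRegression(featuresCol="features", labelCol="label")

    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1])
            .addGrid(lr.elasticNetParam, [0.0, 0.5])
            .build())

    cv = CrossValidator(estimator=lr,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(labelCol="label"),
                        numFolds=3)

    best_model = cv.fit(train_df).bestModel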

When you’re finished with this course, you’ll have the skills and knowledge needed to create ML models with Spark MLlib and perform predictive analytics using machine learning.

Table of contents
  1. Course Overview
  2. Getting Started with Machine Learning with Apache Spark on Databricks
  3. Performing Regression on Batch Data
  4. Implementing Classification on Streaming Data

Executing Graph Algorithms with GraphFrames on Databricks

by Janani Ravi

Nov 2, 2021 / 1h 34m

Description

The Spark unified analytics engine is one of the most popular frameworks for big data analytics and processing. The GraphFrames package in Apache Spark allows you to represent graphs using a DataFrame-based API. GraphFrames also supports a number of graph algorithms such as shortest paths, PageRank, breadth-first search, and connected components.

In this course, Executing Graph Algorithms with GraphFrames on Databricks, you will explore how graphs can be used to model entities and relationships in the real world. First, you will learn about the different kinds of graphs, such as directed and undirected graphs and weighted and unweighted graphs. Then, you will discover how graphs can be represented using the GraphFrames API in Apache Spark, how you can compute properties of a graph such as the in-degree and out-degree of a vertex, and how you can perform filtering operations on vertices and edges.
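
A small sketch of what this looks like with the GraphFrames API; the vertices and edges are invented, and the graphframes package is assumed to be available on the cluster (for example, attached as a library).

    from graphframes import GraphFrame

    vertices = spark.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
    edges = spark.createDataFrame(
        [("a", "b", "follows"), ("b", "c", "follows"), ("c", "a", "follows")],
        ["src", "dst", "relationship"])

    g = GraphFrame(vertices, edges)

    display(g.inDegrees)                                  # in-degree per vertex
    display(g.outDegrees)                                 # out-degree per vertex
    display(g.edges.filter("relationship = 'follows'"))   # filter edges like any DataFrame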

Next, you will see how you can perform motif searches using GraphFrames in order to detect structural patterns in the graph. After that, you will learn how to use a domain-specific language for motif finding and run stateless and stateful queries on simple as well as complex real-world graphs.
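
For instance, using the hypothetical graph g from the previous sketch, a motif search for pairs of vertices that follow each other could look like this; the filter shows how additional conditions are layered on top of the matched pattern.

    # Find pairs (x, y) where x follows y and y follows x.
    mutual = g.find("(x)-[]->(y); (y)-[]->(x)")

    # Motif results are ordinary DataFrames, so further conditions are just filters.
    display(mutual.filter("x.id < y.id"))   # report each mutual pair only once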

Finally, you will explore the variety of graph algorithms supported by the GraphFrames API, including breadth-first search, shortest paths, triangle count, connected and strongly connected components, and PageRank.
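
Sketched against the same hypothetical graph g, the algorithm calls look like the following; connectedComponents additionally needs a Spark checkpoint directory, and the paths and landmark vertex are placeholders.

    # Needed by connectedComponents(); path is a placeholder.
    spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

    ranks = g.pageRank(resetProbability=0.15, maxIter=10)
    display(ranks.vertices.select("id", "pagerank"))

    search = g.bfs("id = 'a'", "id = 'c'")      # breadth-first search between vertices
    triangles = g.triangleCount()               # triangles each vertex participates in
    components = g.connectedComponents()        # connected component id per vertex
    paths = g.shortestPaths(landmarks=["a"])    # shortest-path lengths to vertex "a"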

When you are finished with this course, you will have the skills and knowledge of graph algorithms in Spark needed to implement graph algorithms using the GraphFrames API provided by Spark.

Table of contents
  1. Course Overview
  2. Getting Started with Graph Algorithms in Spark
  3. Stateful Queries and Motifs
  4. Implementing Graph Algorithms

Optimizing Apache Spark on Databricks

by Janani Ravi

Nov 3, 2021 / 2h 4s

Description

The Apache Spark unified analytics engine is an extremely fast and performant framework for big data processing. However, you might find that your Apache Spark code running on Azure Databricks still suffers from a number of issues. These could be due to the difficulty in ingesting data in a reliable manner from a variety of sources or due to performance issues that you encounter because of disk I/O, network performance, or computation bottlenecks.

In this course, Optimizing Apache Spark on Databricks, you will first explore and understand the issues that you might encounter ingesting data into a centralized repository for data processing and insight extraction. Then, you will learn how Delta Lake on Azure Databricks allows you to store data for processing, insights, and machine learning in Delta tables, and you will see how you can mitigate your data ingestion problems using Auto Loader on Databricks to ingest streaming data.
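
As a rough sketch of that ingestion pattern, Auto Loader can land arriving files directly in a Delta table that batch, SQL, and ML workloads then share; all paths, table names, and options below are placeholders.

    # Incrementally pick up new files from cloud storage.
    incoming = (spark.readStream
                .format("cloudFiles")
                .option("cloudFiles.format", "csv")
                .option("header", "true")
                .option("cloudFiles.schemaLocation", "/tmp/schemas/incoming")
                .load("abfss://landing@examplestorage.dfs.core.windows.net/raw/"))

    # Land the stream in a Delta table used by downstream workloads.
    (incoming.writeStream
     .format("delta")
     .option("checkpointLocation", "/tmp/checkpoints/bronze")
     .toTable("bronze_events"))

    # Downstream consumers read the same table as a batch source.
    bronze_df = spark.read.table("bronze_events")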

Next, you will explore common performance bottlenecks that you are likely to encounter while processing data in Apache Spark: issues involving serialization, skew, spill, and shuffle. You will learn techniques to mitigate these issues and see how you can improve the performance of your processing code using disk partitioning, Z-order clustering, and bucketing.
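
A sketch of those three layout techniques, run from Python in a Databricks notebook; events_df, the table names, and the columns are placeholders, and OPTIMIZE ... ZORDER BY is a Delta Lake command available on Databricks.

    # Disk partitioning: write the Delta table partitioned by a low-cardinality column.
    (events_df.write
     .format("delta")
     .partitionBy("event_date")
     .mode("overwrite")
     .saveAsTable("silver_events"))

    # Z-order clustering: co-locate data files by a frequently filtered column.
    spark.sql("OPTIMIZE silver_events ZORDER BY (customer_id)")

    # Bucketing (Hive-style table): pre-shuffle by a join key to avoid shuffles later.
    (events_df.write
     .format("parquet")
     .bucketBy(16, "customer_id")
     .sortBy("customer_id")
     .mode("overwrite")
     .saveAsTable("events_bucketed"))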

Finally, you will learn how you can share resources on the cluster using scheduler pools and fair scheduling and how you can reduce disk read and write operations using caching on Delta tables.

When you are finished with this course, you will have the skills and knowledge of optimizing performance in Spark needed to get the best out of your Spark cluster.

Table of contents
  1. Course Overview
  2. Exploring and Mitigating Data Ingestion Problems
  3. Diagnosing and Mitigating Performance Problems
  4. Optimizing Spark for Performance