A visual guide to Azure Data Factory
What is Azure Data Factory and how can you use it? In this visual guide to ADF, we break down the what, why, and how of using ADF for data integration.
Jun 08, 2023 • 8 Minute Read
What is a visual guide?
Visual guides are high-resolution “sketchnotes.” They summarize a given topic or content item in a single image, combining illustrations and text to visualize the core ideas. Research tells us that 65% of us are visual learners. We absorb information faster from images, using the “big picture” to detect familiar patterns and connect the dots between concepts in a way that helps retention and recall.
How can you use this visual guide?
I use visual guides as pre-read and post-review resources to bookend my learning journeys. Try it. Scan the image quickly, then read the post and dive into the linked resources. You may find your brain is primed to find keywords and connections quickly, helping reinforce learned concepts.
Now come back and scan the visual again — test your recall or identify areas where you still have gaps in understanding. Or just print the visual and hang it up in your work area as a handy reference. I recommend you download this hi-res version of the visual guide if you take that route!
Visual guide to Azure Data Factory
We are living in a world increasingly populated by connected devices and interactive applications across diverse platforms (mobile, web, IoT). As application developers, we need a way to harness and analyze the large volumes of raw data (relational, non-relational, and other) to derive useful business insights.
Azure Data Factory is a fully managed, serverless data integration solution for ingesting, preparing, and transforming all your data at scale. In this visual guide, we try to answer the following questions:
- What is Data Integration?
- What is Azure Data Factory (ADF)?
- How do we achieve Data Integration with ADF?
- What are the components of ADF?
- What are the key benefits of using ADF?
Read on for a text summary of each, along with links to resources you can use for deeper dives.
What is Data Integration?
At a high level, data integration involves the collection of data from diverse sources, its transformation (including cleansing or augmentation) to create meaningful context for analysis, followed by a load step where the processed data is stored for use by relevant analytics engines.
Let’s set the stage for this discussion with a familiar enterprise scenario.
A gaming company has two data stores: one on-premises (with customer, game, and marketing data) and one in the cloud (with gameplay logs). To gain insights into customer behaviors or gaps in play, they need to correlate data across both, at a large scale. This is where data integration solutions like Azure Data Factory can excel!
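To make the collect, transform, and load steps concrete, here is a minimal sketch of the gaming scenario in plain Python. It does not use Azure Data Factory or any of its APIs; the sample records, field names, and the `extract`, `transform`, and `load` helpers are all hypothetical stand-ins for the two data stores and the destination.

```python
from collections import defaultdict

# Hypothetical sample records standing in for the two data stores.
ON_PREM_CUSTOMERS = [
    {"customer_id": 1, "name": "Ada", "segment": "casual"},
    {"customer_id": 2, "name": "Lin", "segment": "competitive"},
]
CLOUD_GAMEPLAY_LOGS = [
    {"customer_id": 1, "game": "Orbit", "minutes": 30},
    {"customer_id": 1, "game": "Orbit", "minutes": 45},
    {"customer_id": 2, "game": "Maze", "minutes": 120},
    {"customer_id": 3, "game": "Maze", "minutes": 10},  # no matching customer
]


def extract():
    """Collect raw records from both sources (here, in-memory lists)."""
    return list(ON_PREM_CUSTOMERS), list(CLOUD_GAMEPLAY_LOGS)


def transform(customers, logs):
    """Total the play time per customer, then attach it to customer details."""
    minutes_by_customer = defaultdict(int)
    for entry in logs:
        minutes_by_customer[entry["customer_id"]] += entry["minutes"]
    # Logs without a known customer are dropped; customers without logs get 0.
    return [
        {**customer, "total_minutes": minutes_by_customer.get(customer["customer_id"], 0)}
        for customer in customers
    ]


def load(rows, destination):
    """Store processed rows where an analytics engine could read them."""
    destination.extend(rows)
    return destination


warehouse = []  # stand-in for the analytics store
customers, logs = extract()
load(transform(customers, logs), warehouse)
for row in warehouse:
    print(row)
```

At real volumes the same three steps would be spread across connectors, managed compute, and a durable store, which is the part a data integration service takes on.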
Want a quick intro to Azure Data Factory with this scenario? Try this 30-minute module: Introduction to Azure Data Factory.
What is Azure Data Factory (ADF)?
Azure Data Factory is an enterprise-ready cloud-based hybrid data integration service that helps orchestrate data movement and operationalize data processing workflows (pipelines) at scale. It is composed of a set of interconnected systems that provide an end-to-end platform for your data engineering needs, including:
- Data Ingest | ADF comes with 90+ standard connectors to simplify connection to diverse data sources and has a copy activity that simplifies the collection of data in a centralized location, for subsequent processing or transformation.
- Mapping Data Flow | ADF provides “code-free ETL”, enabling you to create and reuse data transformation graphs using a UI-based wizard. The transformation is done automagically on a Spark cluster, without requiring you to maintain or manage your own.
- Azure Compute | ADF can run code directly on any Azure compute, making it easy for you to handcraft your transformation routines and execute them as part of your data-driven workflow.
- Data Ops | ADF works with Azure DevOps and GitHub, making it easier for you to manage your data pipeline ops using your favorite platform. Plus, you have built-in activities to simplify data publication to Azure Synapse Analytics (formerly Azure SQL Data Warehouse), Azure SQL Database, or your favorite BI analytics engine.
- Monitoring & Alerts | ADF integrates seamlessly with Azure Monitor, API, PowerShell, and health panels on the Azure portal, making it possible for you to monitor the execution progress and health of your entire data pipeline at any time.
Want to learn how to achieve data integration with ADF? Check out this learning path: Data Integration at Scale with Azure Data Factory or Azure Synapse Pipeline
What are the core components of Azure Data Factory (ADF)?
In this section, we will look at the core concepts and components of the Azure Data Factory toolkit.
- Pipelines are the logical grouping of activities that perform one unit of work. An ADF instance can have one or more active pipelines, and activities can be scheduled in sequence (chaining) or in parallel (independent) for execution, as desired.
- Activities represent a single processing step in a pipeline. Three types of activities are currently supported — data copy, data transformation, and activity orchestration.
- Datasets represent a data structure that provides a selected view into a data store, ideally for use in defining (and binding) inputs and outputs to a given activity.
- Linked Services represent connection strings that can be used by an activity to establish a connection to an external service, typically pointing to a data source (ingest) or compute resource (transformation) required for execution.
- Mapping Data Flows create and manage data transformation graphs that can be applied to data of any size and be used to build up a reusable library of data transformation routines.
- Integration Runtimes are compute infrastructure used by ADF to provide fully managed data flows, data movement, activity dispatch, and SSIS package execution tasks in data pipelines.
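One way to see how these components relate is to model them as plain Python data structures. The sketch below is a toy model, not the ADF SDK or its JSON schema: the class names, fields, and the `stages` helper are assumptions made for illustration. It shows linked services feeding datasets, datasets bound to activities, and a pipeline that works out which activities can run in parallel and which must be chained.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class LinkedService:
    name: str
    connection: str  # placeholder text, not a real connection string


@dataclass(frozen=True)
class Dataset:
    name: str
    linked_service: LinkedService
    location: str  # e.g. a table or folder inside the linked store


@dataclass
class Activity:
    name: str
    kind: str  # "copy", "transform", or "control" in this toy model
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    depends_on: list = field(default_factory=list)  # names of upstream activities


@dataclass
class Pipeline:
    name: str
    activities: list

    def stages(self):
        """Group activities into stages; activities in one stage are independent."""
        sorter = TopologicalSorter({a.name: set(a.depends_on) for a in self.activities})
        sorter.prepare()
        result = []
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            result.append(ready)
            sorter.done(*ready)
        return result


on_prem = LinkedService("on_prem_sql", "<on-premises connection placeholder>")
cloud = LinkedService("cloud_logs", "<cloud storage connection placeholder>")
lake = LinkedService("data_lake", "<data lake connection placeholder>")

customers = Dataset("customers", on_prem, "dbo.Customers")
gameplay = Dataset("gameplay_logs", cloud, "logs/gameplay")
staged_customers = Dataset("staged_customers", lake, "staging/customers")
staged_gameplay = Dataset("staged_gameplay", lake, "staging/gameplay")
insights = Dataset("insights", lake, "curated/insights")

pipeline = Pipeline(
    "correlate_gameplay",
    [
        Activity("copy_customers", "copy", [customers], [staged_customers]),
        Activity("copy_logs", "copy", [gameplay], [staged_gameplay]),
        Activity(
            "join_data",
            "transform",
            [staged_customers, staged_gameplay],
            [insights],
            depends_on=["copy_customers", "copy_logs"],
        ),
    ],
)
print(pipeline.stages())
```

The two copy activities have no dependencies, so they land in the same stage and could run in parallel; the transformation is chained after both.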
In this context, a few additions to our terminology:
- Pipeline Run is an instance of pipeline execution. A run is typically started by passing arguments to the parameters defined in the pipeline. Runs can be started manually or by a trigger.
- Trigger is a unit of processing whose outcome determines when to activate a pipeline run.
- Parameters are read-only key/value pairs that are populated from the runtime execution context of the pipeline. Datasets and linked services are strongly typed, reusable parameter entities that define the structure (of data) and connection information (of source) for activities.
- Variables are used inside pipelines to store temporary values, e.g., for use with parameters to pass values or context between activities, data flows, and pipelines.
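The difference between read-only parameters and mutable variables can be illustrated with a small Python sketch. This is a conceptual model only: `PipelineRun` and `manual_trigger` are hypothetical names, not part of any ADF API, and the parameter names are made up for the example.

```python
import uuid
from types import MappingProxyType


class PipelineRun:
    """One execution of a pipeline, with fixed parameters and scratch variables."""

    def __init__(self, pipeline_name, declared_parameters, arguments):
        unknown = set(arguments) - set(declared_parameters)
        if unknown:
            raise ValueError(f"undeclared parameters: {sorted(unknown)}")
        self.run_id = str(uuid.uuid4())
        self.pipeline_name = pipeline_name
        # Declared defaults overlaid with the supplied arguments, exposed read-only.
        self.parameters = MappingProxyType({**declared_parameters, **arguments})
        # Variables hold temporary values that activities may update during the run.
        self.variables = {}


def manual_trigger(pipeline_name, declared_parameters, **arguments):
    """Start a run on demand; a scheduled trigger would call the same constructor."""
    return PipelineRun(pipeline_name, declared_parameters, arguments)


run = manual_trigger(
    "correlate_gameplay",
    {"window_days": 7, "region": "all"},
    window_days=30,
)
run.variables["rows_copied"] = 1250  # variables can change during the run

try:
    run.parameters["region"] = "emea"  # parameters cannot
except TypeError:
    read_only = True
else:
    read_only = False

print(run.parameters["window_days"], run.parameters["region"], read_only)
```

Each call to `manual_trigger` yields a separate run with its own identifier, which mirrors the idea that a pipeline run is one instance of pipeline execution.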
Want to dive deeper into these concepts? Start with Concepts: Pipelines and Activities in Azure Data Factory
What are the benefits of Azure Data Factory (ADF)?
Here are seven reasons why you should explore Azure Data Factory for your data integration needs:
1. It’s enterprise ready
Data integration at enterprise volumes requires solutions that scale and are cost-effective. Azure Data Factory is a cloud-based solution that works with both on-prem and cloud-based data stores, to simplify creation and management of data-driven workflows.
2. It’s enterprise data ready
Azure Data Factory comes with built-in support for 90+ connectors that make it easy to integrate with, and ingest data from, familiar enterprise data sources.
3. Code-free transformation
Azure Data Factory has mapping data flows with a UI-based wizard for creating data transformation graphs! Reuse graphs as templates and execute transformations automagically on Spark clusters (without having to own or manage them yourself).
4. Run code on any Azure compute
Azure Data Factory has a sizeable list of supported compute environments and activities that make task dispatch and execution easy, within data pipelines.
5. Many SSIS packages run on Azure
Azure Data Factory can run your SSIS packages in an Azure-SSIS integration runtime, providing tools to test package readiness (lift & shift) and to develop new packages as needed.
6. Seamless data ops
Azure Data Factory makes data pipeline ops easy with automated deployment, simple (reusable) templates, and the ability to use familiar Azure DevOps or GitHub workflows.
7. Secure data integration
Azure Data Factory supports managed virtual networks, which simplify your networking and protect against data exfiltration. Explore the various security strategies in Azure Data Factory for more.
That concludes our rapid tour of the visual guide to Azure Data Factory. We talked about how it works, its core components, and its ability to deliver Code-Free ETL. We learned 7 things you should know about ADF when evaluating data integration solutions. Want to keep going? Here are two more resources to help:
Nitya Narasimhan holds a PhD in Computer Engineering, with 20+ years of software research & development experience spanning distributed & ubiquitous computing and mobile & web application development. She is currently a Cloud Advocate in the Microsoft Developer Relations team, where she spends her time on mobile and cross-platform development (for Azure and Microsoft Surface Duo), visual storytelling, and supporting our amazing developer communities. She's one of ACG's 21 Azure builders to follow.