Analyze Sales Data in DuckDB
A hands-on Code Lab demonstrating how quickly DuckDB can be installed and used to consolidate and analyze a realistic sales dataset (accounts, opportunities, orders, inventory). Learners install DuckDB in seconds, then query CSV, Excel, and Parquet files, and even remote HTTPS sources, together in a single SQL workflow. The lab focuses on answering common business questions (revenue trends, top products, customer activity) while practicing lightweight validation, type correction, joins across heterogeneous sources, and materializing curated summaries.
Challenge
Step 1: Introduction
Overview
This lab explores the variety of ways you can query data using DuckDB. Rather than focusing on a single format or source, you'll work across CSV, Excel, Parquet, local files, HTTP endpoints, and cloud object storage, using the same familiar SQL syntax.
DuckDB meets your data where it lives, so you won't need to move or transform files before querying them.
> Note: This lab uses DuckDB v1.2.2.
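As a quick illustration of that idea, DuckDB can query a file in place with a single SQL statement. The path below is a hypothetical example, not one of the lab's data files:

```sql
-- Query a Parquet file directly where it sits; no import step needed
-- (the file path is a hypothetical placeholder)
SELECT COUNT(*) FROM 'some/path/events.parquet';
```

You'll use the same pattern against the lab's real CSV and Excel files in the steps that follow.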
Working in This Lab
For a smoother workflow, use `scratch_pad.sql` in the file tree to write and refine your SQL queries.
Prerequisites
Before starting this lab, you should have:
- SQL fundamentals: Experience writing basic queries (`SELECT`, `WHERE`, `JOIN`, `GROUP BY`, `ORDER BY`).
- Data concepts: Familiarity with tabular data (rows, columns, primary/foreign keys) and common data types (integers, strings, dates).
- Command-line skills: Comfort navigating a terminal.
- File formats: Prior exposure to CSV and Excel file formats.
- DuckDB: No prior experience necessary; the lab teaches DuckDB from the ground up.
Challenge
Step 2: Start DuckDB
Install DuckDB
Goal: Start the DuckDB CLI and load the extensions required to query all data sources used in this lab.
DuckDB extensions add support for data sources and formats that are not built into the core engine. In this lab, you’ll use extensions to read remote files, Excel workbooks, and cloud-style object storage.
1. Start the DuckDB CLI from the command line:

   ```shell
   duckdb
   ```

2. Install and load the required extensions inside DuckDB:

   ```sql
   -- Remote file access over HTTPS
   INSTALL httpfs;
   LOAD httpfs;

   -- Excel reading via st_read()
   INSTALL spatial;
   LOAD spatial;

   -- Alternative Excel reader (DuckDB 1.x+)
   INSTALL excel;
   LOAD excel;
   ```

3. Confirm all three extensions are active before moving on:

   ```sql
   SELECT extension_name, loaded, installed
   FROM duckdb_extensions()
   WHERE extension_name IN ('httpfs', 'spatial', 'excel');
   ```

   When loaded correctly, the query should return:

   | extension_name | loaded | installed |
   |---|---|---|
   | httpfs | true | true |
   | spatial | true | true |
   | excel | true | true |
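Since the lab pins a specific DuckDB version, it can also be worth confirming that the CLI you started matches it. `version()` is a built-in DuckDB function:

```sql
-- Report the running DuckDB version (should match the version noted in Step 1)
SELECT version();
```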
Step 3: Query Raw Files
Inspect the Data Sources
Goal: Understand what data files exist, where they live, and what each source contains.
Note: If your terminal does not already have a DuckDB session, type `duckdb` to start one.
1. Preview the accounts CSV file using `read_csv_auto`:

   ```sql
   SELECT * FROM read_csv_auto('data/salesforce/accounts.csv') LIMIT 5;
   ```

   Your results should look like this:

   | account_id | account_name | industry | created_date |
   |---|---|---|---|
   | A00001 | Account 1 | Retail | 2021-03-09 |
   | A00002 | Account 2 | Finance | 2022-03-13 |
   | A00003 | Account 3 | Retail | 2020-10-01 |
   | A00004 | Account 4 | Finance | 2022-08-01 |
   | A00005 | Account 5 | Finance | 2020-03-13 |

2. Preview the opportunities CSV file using `read_csv_auto`:

   ```sql
   SELECT * FROM read_csv_auto('data/salesforce/opportunities.csv') LIMIT 5;
   ```

   Your results should look like this:

   | id | account_id | amount | close_date | stage |
   |---|---|---|---|---|
   | O0000001 | A00387 | 194560.19 | 2021-10-17 | Closed Lost |
   | O0000002 | A01794 | 17925.62 | 2021-01-09 | Closed Lost |
   | O0000003 | A03871 | 182264.19 | 2022-12-04 | Proposal |
   | O0000004 | A02510 | 197361.29 | 2020-05-17 | Negotiation |
   | O0000005 | A00385 | 151152.83 | 2023-10-02 | Closed Won |

3. Preview the orders Excel source using `st_read`:

   ```sql
   SELECT * FROM st_read('data/oltp/orders.xlsx') LIMIT 5;
   ```

   Your results should look like this:

   | order_id | account_id | order_total | order_date |
   |---|---|---|---|
   | 1000000 | A02164 | 3293.55 | 2021-06-17 00:00:00 |
   | 1000001 | A02533 | 1738.11 | 2021-02-15 00:00:00 |
   | 1000002 | A01857 | 0.0 | 2024-02-18 00:00:00 |
   | 1000003 | A02719 | 21417.56 | 2022-08-12 00:00:00 |
   | 1000004 | A03232 | 4770.22 | 2023-12-05 00:00:00 |

4. Preview the inventory Excel source using `st_read`:

   ```sql
   SELECT * FROM st_read('data/oltp/inventory.xlsx') LIMIT 5;
   ```

   Your results should look like this:

   | sku | description | quantity | last_stocked |
   |---|---|---|---|
   | P000001 | Product 1 | 913 | 2023-04-29 00:00:00 |
   | P000002 | Product 2 | 133 | 2023-09-25 00:00:00 |
   | P000003 | Product 3 | 582 | 2024-01-24 00:00:00 |
   | P000004 | Product 4 | 680 | 2023-07-10 00:00:00 |
   | P000005 | Product 5 | 451 | 2023-05-25 00:00:00 |
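If you want column names and inferred types without scanning sample rows, DuckDB's `DESCRIBE` works on these file reads too. A sketch using the same accounts file:

```sql
-- Show the inferred schema of the accounts CSV without returning data rows
DESCRIBE SELECT * FROM read_csv_auto('data/salesforce/accounts.csv');
```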
Create and Query a DuckDB Table
Goal: Verify DuckDB's inferred schema and confirm data types are correct. This acts as a gatekeeping check to ensure inferred types won't cause silent data issues downstream.
1. Materialize a CSV source to a temporary table:

   ```sql
   CREATE TEMP TABLE t_accounts AS
   SELECT * FROM read_csv_auto('data/salesforce/accounts.csv');
   ```

2. Inspect the inferred schema with `PRAGMA table_info`:

   ```sql
   PRAGMA table_info('t_accounts');
   ```

3. Use `typeof()` to inspect column types on sample rows:

   ```sql
   SELECT typeof(account_id), typeof(created_date), typeof(industry)
   FROM t_accounts
   LIMIT 5;
   ```

   You should see the following:

   | typeof(account_id) | typeof(created_date) | typeof(industry) |
   |---|---|---|
   | VARCHAR | DATE | VARCHAR |
   | VARCHAR | DATE | VARCHAR |
   | VARCHAR | DATE | VARCHAR |
   | VARCHAR | DATE | VARCHAR |
   | VARCHAR | DATE | VARCHAR |
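If inference ever guesses wrong, you can override column types at read time instead of casting afterward. A sketch against the same accounts file; `types` is a real `read_csv` parameter, but the override shown here is only illustrative (inference already gets this column right):

```sql
-- Force created_date to DATE explicitly, leaving the other columns to inference
CREATE OR REPLACE TEMP TABLE t_accounts AS
SELECT * FROM read_csv('data/salesforce/accounts.csv',
                       types = {'created_date': 'DATE'});
```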
Step 4: Create and Query DuckDB Tables
Goal: Load each raw source into a table so the rest of the lab can reference a clean name instead of a file path.
Direct file queries are useful for quick inspection. Creating tables gives you stable names to use in joins, reports, and repeated analysis.
1. Create a table for each source. In a Terminal with an active DuckDB session, run the following commands:

   ```sql
   INSTALL spatial;
   LOAD spatial;

   CREATE TABLE accounts AS
   SELECT * FROM read_csv_auto('data/salesforce/accounts.csv');

   CREATE TABLE opportunities AS
   SELECT * FROM read_csv_auto('data/salesforce/opportunities.csv');

   CREATE TABLE orders AS
   SELECT * FROM st_read('data/oltp/orders.xlsx');

   CREATE TABLE inventory AS
   SELECT * FROM st_read('data/oltp/inventory.xlsx');
   ```

2. Ensure all tables are accessible in the same DuckDB session:

   ```sql
   SHOW TABLES;
   ```

3. Confirm all tables are registered and return rows:

   ```sql
   SELECT 'orders' AS table_name, COUNT(*) AS row_count FROM orders
   UNION ALL
   SELECT 'accounts', COUNT(*) FROM accounts
   UNION ALL
   SELECT 'inventory', COUNT(*) FROM inventory
   UNION ALL
   SELECT 'opportunities', COUNT(*) FROM opportunities;
   ```

   You should see the following results:

   | table_name | row_count |
   |---|---|
   | orders | 50000 |
   | accounts | 5000 |
   | inventory | 500 |
   | opportunities | 15000 |
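With stable table names in place, the business questions from the lab overview reduce to ordinary joins. For example, revenue by industry, using the column names shown in the Step 3 previews:

```sql
-- Total order revenue per industry, highest first
SELECT a.industry,
       SUM(o.order_total) AS total_revenue
FROM orders AS o
JOIN accounts AS a USING (account_id)
GROUP BY a.industry
ORDER BY total_revenue DESC;
```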
Step 5: Write Your First Report
Goal: Create a CSV file from a DuckDB query result.
You can use DuckDB to output query results directly to CSV, Parquet, or Excel files using the following syntax:

```sql
COPY ( QUERY ) TO 'File.csv';
```

The `COPY ... TO` command writes query results to a file, making it useful for reports, exports, and downstream analysis.

Note: By default, DuckDB writes files to the current directory where the DuckDB session starts. In this lab, that directory is `/home/ps-user/workspace/`.
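As a concrete sketch of that syntax, a spend-per-account report could be exported like this, reusing the `orders` table from Step 4 (the output file name is illustrative):

```sql
-- Write aggregated results straight to a CSV file with a header row
COPY (
    SELECT account_id,
           SUM(order_total) AS total_spend
    FROM orders
    GROUP BY account_id
    ORDER BY total_spend DESC
) TO 'top_accounts.csv' (HEADER, DELIMITER ',');
```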
Step 6: Persist Aggregations to File
Goal: Save curated results so they can be reused or shared without rerunning raw source queries.
Aggregating data into a summary table gives you a reusable result set for reporting and downstream analysis. Instead of rerunning joins against the raw source tables, you can query the curated table directly.
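A minimal sketch of that pattern, reusing the tables created in Step 4 (the summary table name is illustrative):

```sql
-- Materialize a curated summary once...
CREATE TABLE revenue_by_industry AS
SELECT a.industry,
       SUM(o.order_total) AS total_revenue,
       COUNT(*) AS order_count
FROM orders AS o
JOIN accounts AS a USING (account_id)
GROUP BY a.industry;

-- ...then query it directly, without repeating the join
SELECT * FROM revenue_by_industry ORDER BY total_revenue DESC;
```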
Step 7: Access Remote Data with DuckDB
Goal: Use DuckDB's `httpfs` extension to read data from an HTTP endpoint hosted by a Flask sample application and from object storage. With `httpfs`, DuckDB can query remote files using the same SQL patterns you used for local files.

### Conclusion

In this lab you learned:

- Installing and activating DuckDB extensions - powerful abstractions that unlock extra functionality, such as querying HTTP endpoints and S3 storage.
- Querying multiple file types and writing results to dedicated files - one of DuckDB's key advantages: you can start querying raw files with minimal setup or infrastructure.
- Reading raw files into persisted DuckDB tables, stored inside a `.duckdb` database file.
- Querying S3 objects and HTTPS-hosted files directly from DuckDB.
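As a recap, the remote-access pattern from Step 7 has the same shape as the local reads earlier in the lab. The URLs below are hypothetical placeholders, not the lab's actual endpoints:

```sql
-- Read a CSV served over HTTPS (requires the httpfs extension)
SELECT * FROM read_csv_auto('https://example.com/data/sample.csv') LIMIT 5;

-- Read a Parquet object from S3-style storage
SELECT COUNT(*) FROM read_parquet('s3://example-bucket/sample.parquet');
```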
If you enjoyed this content, consider exploring these DuckDB projects and features:
- Community Extensions - A list of open-source, community-developed extensions for DuckDB.
- DuckDB WASM - A live, in-browser example of DuckDB running via WebAssembly.
- DuckLake - DuckDB's lakehouse extension.