Libraries: If you want this lab, consider one of these libraries.
Data

Train a Machine Learning Model with PySpark MLlib

In this Code Lab, you will build and deploy machine learning models at scale using PySpark MLlib on an e-commerce dataset. You’ll start by performing essential data preparation (cleaning, feature engineering) and quickly move on to training two different types of models: a Random Forest classifier and a Linear Regression model for numeric predictions. You’ll evaluate each model with relevant metrics, perform hyperparameter tuning on the regression model, and then combine all the steps into an end-to-end pipeline using MLflow. Finally, you’ll learn how to save, retrieve, and run batch inference with your trained pipeline.

Get started Contact sales

Lab Info

Level

Intermediate

Last updated

Nov 25, 2025

Duration

1h 14m

Challenge

Introduction to the Lab
Building Machine Learning Models with PySpark MLlib

In this lab, you'll learn how to build and evaluate machine learning models using PySpark MLlib, the machine learning library for Apache Spark. You'll work with an e-commerce dataset to predict customer behavior and spending patterns, gaining hands-on experience with distributed machine learning.

🟦 Note: PySpark MLlib enables you to train machine learning models on massive datasets across distributed clusters, providing scalable solutions for real-world ML applications. Its ability to handle large-scale data makes it an essential tool in modern machine learning workflows.

Why It Matters

As machine learning applications grow in complexity and data volume, traditional ML libraries often struggle to scale. PySpark MLlib addresses this challenge by:
- Enabling distributed training of ML models across clusters
- Providing a comprehensive set of ML algorithms optimized for big data
- Supporting end-to-end ML pipelines with feature engineering and model evaluation
- Offering tools for hyperparameter tuning and model selection at scale
Mastering PySpark MLlib will equip you with the skills to build production-ready ML models that can handle real-world data volumes.

Key Concepts

MLlib Pipeline Components:
- Transformers: Convert input data into features
- Estimators: Learn parameters from data to create models
- Pipeline: Orchestrates the sequence of transformations and model training
- Cross-validation: Evaluates model performance across different data splits
Feature Engineering:
- Vector assemblers for combining features
- Standardization and normalization
- Categorical encoding
Model Training and Evaluation:
- Distributed model training
- Performance metrics calculation
- Hyperparameter tuning with cross-validation
Best Practices:
- Data preprocessing and cleaning
- Feature scaling and selection
- Model validation and selection
✅ Important: Understanding these concepts will enable you to build robust, scalable machine learning pipelines that can handle real-world data volumes and provide reliable predictions.

Learning Objectives

By the end of this lab, you will be able to:
1. Prepare and engineer features for ML models using PySpark
2. Train and evaluate Random Forest and Linear Regression models
3. Optimize model performance through hyperparameter tuning
4. Build end-to-end ML pipelines that scale to large datasets
Throughout this lab, you'll work with e-commerce data to predict customer behavior and spending patterns. Each step builds upon the previous one, gradually introducing you to the full power of PySpark MLlib for distributed machine learning.

info> If you feel stuck at any point of the lab, check the solution folder. You can find the step file and task you are solving and review the code explanation. The goal is to learn!

Now that you understand what you'll be learning, begin by setting up your environment and exploring the e-commerce dataset. Click on the Next Step arrow to start your journey into distributed machine learning with PySpark MLlib!
Challenge

Data Preparation and Feature Engineering
In this step of the lab, you'll learn how to prepare and engineer features for machine learning using PySpark MLlib. You'll work with e-commerce data, setting up your PySpark environment, loading the data, and applying various feature engineering techniques.

These skills are crucial for preparing data for machine learning models and ensuring high-quality predictions.
🟦 Why It Matters:

The quality of your machine learning model's predictions directly depends on the quality of your input features. Proper data preparation and feature engineering are essential for model performance.

Creating meaningful features from raw data is often the most important step in machine learning. Good features can make even simple models perform well, while poor features can make even complex models fail.

PySpark MLlib provides scalable implementations of common feature engineering techniques, allowing you to process large datasets efficiently.

The feature engineering steps you'll learn can be integrated into MLlib's Pipeline API, making your workflow reproducible and maintainable.

The techniques covered in this step are commonly used in production machine learning systems, from recommendation engines to fraud detection.
Now, proceed to the first task: loading the e-commerce data. > 🟦 Why It Matters:
Loading multiple datasets correctly is crucial for building comprehensive features for machine learning.

Proper data loading ensures you have the right data structure and content for your ML pipeline.

Efficient data loading sets the stage for optimal feature engineering.
💡 Code Explanation: Here is a break down of the solution code:
1. Dataset Loading:
  
  customers_df = spark.read.parquet("data/ecommerce_customers.parquet") interactions_df = spark.read.parquet("data/ecommerce_interactions.parquet") products_df = spark.read.parquet("data/ecommerce_products.parquet")
  
  spark.read: Accesses Spark's DataFrameReader
  
  .parquet(): Specifies the parquet file format
  
  Each call creates a DataFrame with the schema defined in the parquet files
2. Return Statement:
  
  return customers_df, interactions_df, products_df
  
  Returns a tuple containing all necessary DataFrames for subsequent feature engineering tasks. > 🟦 Why It Matters:
Feature Quality: Well-engineered features are crucial for model performance

Data Transformation: Converting raw data into meaningful features that capture important patterns

Model Readiness: Preparing data in a format suitable for ML algorithms

Target Variable Preparation: Properly preparing target variables for both classification and regression tasks
💡 Code Explanation: Here is a break down of the solution code:
1. Dataset Joining:
  
  df = interactions_df.join(customers_df, "customer_id").join(products_df, "product_id")
  
  Combines all three datasets using customer_id and product_id as join keys
2. Categorical Feature Processing:
  
  categorical_cols = ["device", "membership_level"] indexer = StringIndexer( inputCols=categorical_cols, outputCols=[f"{col}_indexed" for col in categorical_cols], handleInvalid="keep" # Handle any new categories in test data ) encoder = OneHotEncoder( inputCols=[f"{col}_indexed" for col in categorical_cols], outputCols=[f"{col}_encoded" for col in categorical_cols], dropLast=False # Keep all categories to ensure consistent feature dimensions )
  
  Converts categorical variables to numeric indices using StringIndexer
  
  Uses handleInvalid="keep" to handle any new categories in test data
  
  Creates binary features using one-hot encoding
  
  Uses dropLast=False to keep all categories for consistent feature dimensions
3. Target Variable Processing:
  
  target_indexer = StringIndexer( inputCol="interaction_type", outputCol="interaction_type_indexed", handleInvalid="keep" )
  
  Creates a separate StringIndexer for the classification target variable
  
  Ensures the target variable is properly indexed for model training
4. Numeric Feature Processing:
  
  numeric_cols = ["time_spent_seconds", "previous_visits", "price"] numeric_assembler = VectorAssembler( inputCols=numeric_cols, outputCol="numeric_features", handleInvalid="skip" # Skip rows with invalid values ) scaler = StandardScaler( inputCol="numeric_features", outputCol="scaled_features", withStd=True, withMean=True )
  
  Combines numeric features into a vector using VectorAssembler
  
  Uses handleInvalid="skip" to skip rows with invalid values
  
  Standardizes the features using StandardScaler with both mean and standard deviation scaling
  
  Note: purchase_amount is excluded as it's the target variable for regression
5. Feature Combination and Pipeline:
  
  feature_cols = [f"{col}_encoded" for col in categorical_cols] + ["scaled_features"] final_assembler = VectorAssembler( inputCols=feature_cols, outputCol="features", handleInvalid="skip" # Skip rows with invalid values ) pipeline = Pipeline(stages=[indexer, encoder, target_indexer, numeric_assembler, scaler, final_assembler]) model = pipeline.fit(df) transformed_df = model.transform(df)
  
  Combines all encoded categorical features with scaled numeric features
  
  Creates and executes a pipeline with all transformations in the correct order
  
  Includes the target variable indexer in the pipeline
  
  Fits and applies the transformations to the data
6. Data Quality Check:
  
  transformed_df = transformed_df.filter(transformed_df.features.isNotNull())
  
  Ensures no null values in the final feature vectors
  
  This step is crucial for model training as most ML algorithms cannot handle null values > 🟦 Why It Matters:
Model Evaluation: Separate datasets allow unbiased assessment of model performance

Overfitting Prevention: Testing on unseen data helps validate model generalization

Reproducibility: Consistent splits ensure reliable model development
💡 Code Explanation: Here is a break down of the solution code:
1. Train/Test Split:
  
  train_df, test_df = transformed_df.randomSplit([0.8, 0.2], seed=42)
  
  Uses randomSplit() to create 80/20 split
  
  Sets random seed for reproducibility
2. Directory Creation:
  
  os.makedirs("data/processed", exist_ok=True)
  
  Creates output directory if it doesn't exist
  
  Uses exist_ok=True to prevent errors if directory exists
3. Dataset Saving:
  
  train_df.write.mode("overwrite").parquet("data/processed/train_data.parquet") test_df.write.mode("overwrite").parquet("data/processed/test_data.parquet")
  
  Saves both datasets in parquet format
  
  Uses overwrite mode to replace existing files
Challenge

Classification Model Training and Evaluation
In this step of the lab, you'll learn how to train and evaluate machine learning models using PySpark MLlib. You'll work with the prepared features from the previous step to build, train, and assess various machine learning models.

This step will help you understand how to create effective ML pipelines and measure model performance.
🟦 Why It Matters:

Model training and evaluation are critical steps in the machine learning lifecycle that determine the effectiveness of your ML solution.

Understanding how to properly train models and evaluate their performance helps you make informed decisions about model selection and optimization.

PySpark MLlib provides a comprehensive set of tools for training various types of models and evaluating their performance at scale.

The ability to create reproducible ML pipelines ensures consistent results and makes it easier to deploy models to production.

Proper model evaluation helps you identify potential issues like overfitting, underfitting, or bias in your predictions.
Now, proceed to the first task: setting up the model training pipeline. > 🟦 Why It Matters:
Model Configuration: Proper configuration is crucial for model performance

Parameter Selection: Well-chosen parameters help prevent overfitting and underfitting

Reproducibility: Setting a random seed ensures consistent results

Scalability: Random Forest is well-suited for distributed computing
💡 Code Explanation: Here is a break down of the solution code:
1. Random Forest Configuration:
  
  rf = RandomForestClassifier( labelCol="interaction_type_indexed", featuresCol="features", numTrees=100, maxDepth=10, minInstancesPerNode=5, seed=42, impurity="gini" )
  
  labelCol: Specifies the target variable (interaction type)
  
  featuresCol: Points to the feature vector column
  
  numTrees: Number of decision trees in the forest (100 for good balance)
  
  maxDepth: Maximum depth of each tree (10 to prevent overfitting)
  
  minInstancesPerNode: Minimum samples per node (5 for stability)
  
  seed: Random seed for reproducibility
  
  impurity: Uses Gini impurity for splitting criterion
2. Return Statement:
  
  return rf
  
  Returns the configured model for use in training. > 🟦 Why It Matters:
Model Training: The quality of training directly affects model performance

Feature Importance: Understanding which features contribute most to predictions

Model Interpretability: Feature importances help explain model decisions

Performance Optimization: Insights for feature selection and engineering
💡 Code Explanation: Here is a break down of the solution code:
1. Model Training:
  
  model = rf.fit(train_df)
  
  fit(): Trains the model on the provided training data
  
  Creates a RandomForestClassificationModel instance
  
  Learns the patterns in the data to make predictions
2. Feature Importance Extraction:
  
  feature_importances = model.featureImportances
  
  featureImportances: Property containing importance scores
  
  Each score represents how much a feature contributes to predictions
  
  Scores sum to 1.0 across all features
3. Return Statement:
  
  return model, feature_importances
  
  Returns both the trained model and feature importance scores for further use. > 🟦 Why It Matters:
Model Assessment: Metrics help understand how well the model performs

Performance Comparison: Enables comparison with other models or baselines

Model Selection: Helps in choosing the best model for deployment

Quality Assurance: Ensures the model meets performance requirements
💡 Code Explanation: Here is a break down of the solution code:
1. Making Predictions:
  
  predictions = model.transform(test_df)
  
  transform(): Applies the trained model to test data
  
  Creates predictions for each instance in the test set
2. Evaluator Setup:
  
  evaluator = MulticlassClassificationEvaluator( labelCol="interaction_type_indexed", predictionCol="prediction" )
  
  Creates an evaluator for multiclass classification
  
  Configures label and prediction columns
3. Metric Calculation:
  
  accuracy = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"}) precision = evaluator.evaluate(predictions, {evaluator.metricName: "precisionByLabel"}) recall = evaluator.evaluate(predictions, {evaluator.metricName: "recallByLabel"})
  
  Calculates different performance metrics
  
  Each metric provides a different perspective on model performance
4. Return Statement:
  
  return { "Accuracy": accuracy, "Precision": precision, "Recall": recall }
  
  Returns a dictionary containing all calculated metrics for analysis.
Challenge

Linear Regression Model Training and Evaluation
In this step of the lab, you'll learn how to build and evaluate a Linear Regression model using PySpark MLlib. You'll work with the prepared features from Step 2 to predict numeric outcomes, specifically focusing on predicting purchase amounts.

This step will help you understand how to implement regression models and assess their performance using appropriate metrics.
🟦 Why It Matters:

Linear Regression is a fundamental machine learning algorithm that helps predict continuous values, which is crucial for business applications like sales forecasting and price prediction.

Understanding how to implement and evaluate regression models helps you make data-driven decisions about pricing, inventory management, and customer behavior analysis.

PySpark MLlib's Linear Regression implementation allows you to train models on large-scale datasets efficiently.

Proper model evaluation using metrics like RMSE (Root Mean Square Error) helps you understand the accuracy of your predictions and identify areas for improvement.

The ability to interpret regression coefficients provides valuable insights into which features have the strongest impact on your target variable.
Now, proceed to the first task: configuring the Linear Regression model. > 🟦 Why It Matters:
Model Configuration: Proper configuration is crucial for model performance

Parameter Selection: Well-chosen parameters help prevent overfitting and underfitting

Regularization: Elastic net regularization helps with feature selection and model stability

Scalability: Linear Regression is efficient for distributed computing
💡 Code Explanation: Here is a break down of the solution code:
1. Linear Regression Configuration:
  
  lr = LinearRegression( labelCol="purchase_amount", featuresCol="features", maxIter=100, regParam=0.3, elasticNetParam=0.8, standardization=True, solver="auto" )
  
  labelCol: Specifies the target variable (purchase amount)
  
  featuresCol: Points to the feature vector column
  
  maxIter: Maximum number of iterations for optimization (100 for good convergence)
  
  regParam: Regularization parameter (0.3 for balanced regularization)
  
  elasticNetParam: Elastic net mixing parameter (0.8 for balanced L1/L2 regularization)
  
  standardization: Standardizes features for better convergence
  
  solver: Automatically chooses the best optimization algorithm
2. Return Statement:
  
  return lr
  
  Returns the configured model for use in training. > 🟦 Why It Matters:
Model Training: The quality of training directly affects model performance

Feature Impact: Coefficients show how each feature influences predictions

Model Interpretability: Coefficients help explain the relationship between features and target

Performance Optimization: Insights for feature selection and engineering
💡 Code Explanation: Here is a break down of the solution code:
1. Model Training:
  
  model = lr.fit(train_df)
  
  fit(): Trains the model on the provided training data
  
  Creates a LinearRegressionModel instance
  
  Learns the patterns in the data to make predictions
2. Coefficient Extraction:
  
  coefficients = model.coefficients
  
  coefficients: Property containing the learned feature weights
  
  Each coefficient represents the impact of a feature on the target variable
  
  Positive coefficients indicate positive correlation, negative indicate inverse correlation
3. Return Statement:
  
  return model, coefficients
  
  Returns both the trained model and feature coefficients for further use. > 🟦 Why It Matters:
Performance Assessment: RMSE provides a clear measure of prediction accuracy

Model Selection: Helps in comparing different models or configurations

Quality Assurance: Ensures the model meets performance requirements

Business Impact: Translates model performance into practical value
💡 Code Explanation: Here is a break down of the solution code:
1. Making Predictions:
  
  predictions = model.transform(test_df)
  
  transform(): Applies the trained model to test data
  
  Creates predictions for each instance
2. Evaluator Setup:
  
  evaluator = RegressionEvaluator( labelCol="purchase_amount", predictionCol="prediction" )
  
  Creates a RegressionEvaluator for RMSE calculation
  
  Configures with appropriate column names for label and predictions
3. RMSE Calculation:
  
  rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
  
  Calculates RMSE using the evaluator
  
  Specifies "rmse" as the metric to calculate
4. Return Statement:
  
  return rmse
  
  Returns the calculated RMSE value for model assessment.
Challenge

Hyperparameter Tuning
In this step of the lab, you'll learn how to optimize your machine learning models through hyperparameter tuning using PySpark MLlib's CrossValidator and TrainValidationSplit.

You'll work with the Linear Regression model from the previous step to find the optimal configuration.

This step will help you understand how to systematically search for the best model parameters and evaluate their performance using cross-validation.
🟦 Why It Matters:

Hyperparameter tuning is crucial for maximizing model performance and finding the optimal balance between model complexity and generalization.

Cross-validation helps ensure that your model's performance is robust and not just good on a single train-test split.

Understanding how to use ParamGridBuilder and CrossValidator helps you systematically explore the hyperparameter space.

Proper model selection based on cross-validation metrics helps you choose the most reliable model for production use.

The ability to interpret and compare different model configurations helps you make informed decisions about model deployment.
Now, proceed to the first task: setting up the parameter grid for hyperparameter tuning. > 🟦 Why It Matters:
Hyperparameter Search: Parameter grid enables systematic search of hyperparameters

Model Optimization: Helps find the best model configuration

Comprehensive Testing: Tests multiple parameter combinations

Reproducibility: Ensures consistent hyperparameter search

Flexibility: Allows passing existing LinearRegression instances for testing
💡 Code Explanation: Here is a break down of the solution code:
1. Function Signature:
  
  def setup_param_grid(lr=None):
  
  Accepts an optional LinearRegression instance
  
  Makes the function more flexible for testing
2. Linear Regression Setup:
  
  if lr is None: lr = LinearRegression(labelCol="purchase_amount", featuresCol="features")
  
  Creates a base LinearRegression model if none provided
  
  Sets up label and feature columns
3. Parameter Grid Creation:
  
  param_grid = ParamGridBuilder() .addGrid(lr.regParam, [0.1, 0.5]) .addGrid(lr.elasticNetParam, [0.0, 0.5]) .addGrid(lr.maxIter, [50, 100]) .build()
  
  regParam: Regularization parameter values [0.1, 0.5]
  
  elasticNetParam: Elastic net mixing parameter values [0.0, 0.5]
  
  maxIter: Maximum iteration values [50, 100]
  
  Creates 8 different combinations (2 × 2 × 2) > 🟦 Why It Matters:
Model Validation: Cross-validation provides robust model evaluation

Hyperparameter Selection: Helps identify the best parameter combination

Bias-Variance Trade-off: Balances model complexity and generalization

Performance Estimation: Gives reliable estimates of model performance
💡 Code Explanation: Here is a break down of the solution code:
1. Function Signature:
  
  def setup_cross_validation(param_grid, lr=None):
  
  Accepts a parameter grid and an optional LinearRegression instance
  
  Makes the function more flexible for testing
2. Linear Regression Setup:
  
  lr = LinearRegression(labelCol="purchase_amount", featuresCol="features")
  
  Creates a base LinearRegression model
  
  Sets up label and feature columns
3. Evaluator Configuration:
  
  evaluator = RegressionEvaluator( labelCol="purchase_amount", predictionCol="prediction", metricName="rmse" )
  
  Sets up RMSE as the evaluation metric
  
  Configures label and prediction columns
4. Cross-Validator Setup:
  
  cv = CrossValidator( estimator=lr, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=3 )
  
  Combines all components for cross-validation
  
  Uses 3-fold cross-validation for robust evaluation > 🟦 Why It Matters:
Model Selection: Identifies the optimal model configuration

Performance Optimization: Ensures best possible model performance

Generalization: Helps prevent overfitting

Production Readiness: Provides a production-ready model
💡 Code Explanation: Here is a break down of the solution code:
1. Model Training and Selection:
  
  best_model = cv.fit(train_df).bestModel
  
  fit(): Fits the cross-validator to training data
  
  bestModel: Property containing the best performing model
2. Model Evaluation (in main execution):
  
  predictions = best_model.transform(test_df) evaluator = RegressionEvaluator( labelCol="purchase_amount", predictionCol="prediction", metricName="rmse" ) rmse = evaluator.evaluate(predictions)
  
  transform(): Makes predictions on test data
  
  evaluate(): Evaluates model performance using RMSE
  
  Prints best parameters and performance metrics
Challenge

Conclusion to the Lab
Congratulations on completing the "Train a Machine Learning Model with PySpark MLlib" lab!

You've taken a significant step in mastering one of the most powerful tools for distributed machine learning. Here is a recap of what you've accomplished and explore where you can go from here.

Lab Recap

Throughout this lab, you've progressed through key areas of PySpark MLLib:

Data Preparation and Feature Engineering

You began by learning how to:
- Prepare data for machine learning tasks using PySpark
- Handle missing values and outliers
- Create and transform features using PySpark's built-in functions
- Scale and normalize features for better model performance
Building ML Pipelines

You then advanced to building sophisticated ML pipelines:
- Creating and configuring ML pipeline stages
- Implementing feature transformers and estimators
- Building end-to-end ML workflows
- Evaluating model performance using various metrics
Model Training and Optimization

You learned critical aspects of model development:
- Training different types of ML models (classification, regression)
- Implementing cross-validation for robust evaluation
- Performing hyperparameter tuning
By working through these steps, you've gained practical experience that mirrors production ML challenges.

Next Steps

Your PySpark MLLib journey doesn't end here! Consider these paths to further enhance your machine learning skills:

Advanced ML Techniques
- Deep Learning: Explore integration with deep learning frameworks like TensorFlow and PyTorch
- Advanced Algorithms: Learn about more sophisticated algorithms like Random Forests, Gradient Boosting, and Neural Networks
- Unsupervised Learning: Master clustering and dimensionality reduction techniques
- Recommendation Systems: Build collaborative filtering and content-based recommendation systems
Production ML
- Model Deployment: Learn to deploy ML models in production environments
- Model Monitoring: Implement monitoring and maintenance strategies
- A/B Testing: Design and implement A/B testing frameworks
- MLOps: Explore tools and practices for ML operations
Integration and Scaling
- Distributed Training: Deep dive into distributed training strategies
- Feature Stores: Learn about feature management and serving
- Model Registry: Implement model versioning and management
- Real-time Predictions: Build streaming ML applications
Performance and Optimization
- Memory Management: Optimize memory usage in distributed environments
- Training Speed: Learn techniques to accelerate model training
- Resource Utilization: Master cluster resource management
- Model Compression: Explore techniques for model optimization
Remember, expertise in distributed machine learning comes with practice. Try to apply what you've learned to your own datasets or contribute to open-source projects to solidify your understanding.

Happy Machine Learning! 🚀

About the author

Warner Chaves

Warner is a SQL Server Certified Master, MVP, and Principal Consultant at Pythian. He manages clients in many industries and leads a talented team that maintains and innovates with their data solutions. When he's not working in Ottawa, Ontario, he can be found in his home country of Costa Rica.

Real skill practice before real-world application

Hands-on Labs are real environments created by industry experts to help you learn. These environments help you gain knowledge and experience, practice without compromising your system, test without risk, destroy without fear, and let you learn from your mistakes. Hands-on Labs: practice your skills before delivering in the real world.

Learn by doing

Engage hands-on with the tools and technologies you’re learning. You pick the skill, we provide the credentials and environment.

Follow your guide

All labs have detailed instructions and objectives, guiding you through the learning process and ensuring you understand every step.

Turn time into mastery

On average, you retain 75% more of your learning if you take time to practice. Hands-on labs set you up for success to make those skills stick.

Train a Machine Learning Model with PySpark MLlib

Lab Info

Table of Contents

Introduction to the Lab

Building Machine Learning Models with PySpark MLlib

Why It Matters

Key Concepts

Learning Objectives

Data Preparation and Feature Engineering

Classification Model Training and Evaluation

Linear Regression Model Training and Evaluation

Hyperparameter Tuning

Conclusion to the Lab

Lab Recap

Data Preparation and Feature Engineering

Building ML Pipelines

Model Training and Optimization

Next Steps

Advanced ML Techniques

Production ML

Integration and Scaling

Performance and Optimization

About the author

Real skill practice before real-world application

Learn by doing

Follow your guide

Turn time into mastery

Get started with Pluralsight