- Lab
- Data

Train a Machine Learning Model with PySpark MLlib
In this Code Lab, you will build and deploy machine learning models at scale using PySpark MLlib on an e-commerce dataset. You’ll start by performing essential data preparation (cleaning, feature engineering) and quickly move on to training two different types of models: a Random Forest classifier and a Linear Regression model for numeric predictions. You’ll evaluate each model with relevant metrics, perform hyperparameter tuning on the regression model, and then combine all the steps into an end-to-end pipeline using MLflow. Finally, you’ll learn how to save, retrieve, and run batch inference with your trained pipeline.

Path Info
Table of Contents
-
Challenge
Introduction to the Lab
Building Machine Learning Models with PySpark MLlib
In this lab, you'll learn how to build and evaluate machine learning models using PySpark MLlib, the machine learning library for Apache Spark. You'll work with an e-commerce dataset to predict customer behavior and spending patterns, gaining hands-on experience with distributed machine learning.
🟦 Note: PySpark MLlib enables you to train machine learning models on massive datasets across distributed clusters, providing scalable solutions for real-world ML applications. Its ability to handle large-scale data makes it an essential tool in modern machine learning workflows.
Why It Matters
As machine learning applications grow in complexity and data volume, traditional ML libraries often struggle to scale. PySpark MLlib addresses this challenge by:
- Enabling distributed training of ML models across clusters
- Providing a comprehensive set of ML algorithms optimized for big data
- Supporting end-to-end ML pipelines with feature engineering and model evaluation
- Offering tools for hyperparameter tuning and model selection at scale
Mastering PySpark MLlib will equip you with the skills to build production-ready ML models that can handle real-world data volumes.
Key Concepts
MLlib Pipeline Components:
- Transformers: Convert input data into features
- Estimators: Learn parameters from data to create models
- Pipeline: Orchestrates the sequence of transformations and model training
- Cross-validation: Evaluates model performance across different data splits
Feature Engineering:
- Vector assemblers for combining features
- Standardization and normalization
- Categorical encoding
Model Training and Evaluation:
- Distributed model training
- Performance metrics calculation
- Hyperparameter tuning with cross-validation
Best Practices:
- Data preprocessing and cleaning
- Feature scaling and selection
- Model validation and selection
✅ Important: Understanding these concepts will enable you to build robust, scalable machine learning pipelines that can handle real-world data volumes and provide reliable predictions.
Learning Objectives
By the end of this lab, you will be able to:
- Prepare and engineer features for ML models using PySpark
- Train and evaluate Random Forest and Linear Regression models
- Optimize model performance through hyperparameter tuning
- Build end-to-end ML pipelines that scale to large datasets
Throughout this lab, you'll work with e-commerce data to predict customer behavior and spending patterns. Each step builds upon the previous one, gradually introducing you to the full power of PySpark MLlib for distributed machine learning.
info> If you feel stuck at any point of the lab, check the
solution
folder. You can find the step file and task you are solving and review the code explanation. The goal is to learn!Now that you understand what you'll be learning, begin by setting up your environment and exploring the e-commerce dataset. Click on the Next Step arrow to start your journey into distributed machine learning with PySpark MLlib!
-
Challenge
Data Preparation and Feature Engineering
In this step of the lab, you'll learn how to prepare and engineer features for machine learning using PySpark MLlib. You'll work with e-commerce data, setting up your PySpark environment, loading the data, and applying various feature engineering techniques.
These skills are crucial for preparing data for machine learning models and ensuring high-quality predictions.
🟦 Why It Matters:
-
The quality of your machine learning model's predictions directly depends on the quality of your input features. Proper data preparation and feature engineering are essential for model performance.
-
Creating meaningful features from raw data is often the most important step in machine learning. Good features can make even simple models perform well, while poor features can make even complex models fail.
-
PySpark MLlib provides scalable implementations of common feature engineering techniques, allowing you to process large datasets efficiently.
-
The feature engineering steps you'll learn can be integrated into MLlib's Pipeline API, making your workflow reproducible and maintainable.
-
The techniques covered in this step are commonly used in production machine learning systems, from recommendation engines to fraud detection.
Now, proceed to the first task: loading the e-commerce data. > 🟦 Why It Matters:
- Loading multiple datasets correctly is crucial for building comprehensive features for machine learning.
- Proper data loading ensures you have the right data structure and content for your ML pipeline.
- Efficient data loading sets the stage for optimal feature engineering.
💡 Code Explanation: Here is a break down of the solution code:
-
Dataset Loading:
customers_df = spark.read.parquet("data/ecommerce_customers.parquet") interactions_df = spark.read.parquet("data/ecommerce_interactions.parquet") products_df = spark.read.parquet("data/ecommerce_products.parquet")
spark.read
: Accesses Spark's DataFrameReader.parquet()
: Specifies the parquet file format- Each call creates a DataFrame with the schema defined in the parquet files
-
Return Statement:
return customers_df, interactions_df, products_df
Returns a tuple containing all necessary DataFrames for subsequent feature engineering tasks. > 🟦 Why It Matters:
- Feature Quality: Well-engineered features are crucial for model performance
- Data Transformation: Converting raw data into meaningful features that capture important patterns
- Model Readiness: Preparing data in a format suitable for ML algorithms
- Target Variable Preparation: Properly preparing target variables for both classification and regression tasks
💡 Code Explanation: Here is a break down of the solution code:
-
Dataset Joining:
df = interactions_df.join(customers_df, "customer_id").join(products_df, "product_id")
- Combines all three datasets using
customer_id
andproduct_id
as join keys
- Combines all three datasets using
-
Categorical Feature Processing:
categorical_cols = ["device", "membership_level"] indexer = StringIndexer( inputCols=categorical_cols, outputCols=[f"{col}_indexed" for col in categorical_cols], handleInvalid="keep" # Handle any new categories in test data ) encoder = OneHotEncoder( inputCols=[f"{col}_indexed" for col in categorical_cols], outputCols=[f"{col}_encoded" for col in categorical_cols], dropLast=False # Keep all categories to ensure consistent feature dimensions )
- Converts categorical variables to numeric indices using
StringIndexer
- Uses
handleInvalid="keep"
to handle any new categories in test data - Creates binary features using one-hot encoding
- Uses
dropLast=False
to keep all categories for consistent feature dimensions
- Converts categorical variables to numeric indices using
-
Target Variable Processing:
target_indexer = StringIndexer( inputCol="interaction_type", outputCol="interaction_type_indexed", handleInvalid="keep" )
- Creates a separate
StringIndexer
for the classification target variable - Ensures the target variable is properly indexed for model training
- Creates a separate
-
Numeric Feature Processing:
numeric_cols = ["time_spent_seconds", "previous_visits", "price"] numeric_assembler = VectorAssembler( inputCols=numeric_cols, outputCol="numeric_features", handleInvalid="skip" # Skip rows with invalid values ) scaler = StandardScaler( inputCol="numeric_features", outputCol="scaled_features", withStd=True, withMean=True )
- Combines numeric features into a vector using
VectorAssembler
- Uses
handleInvalid="skip"
to skip rows with invalid values - Standardizes the features using
StandardScaler
with both mean and standard deviation scaling - Note:
purchase_amount
is excluded as it's the target variable for regression
- Combines numeric features into a vector using
-
Feature Combination and Pipeline:
feature_cols = [f"{col}_encoded" for col in categorical_cols] + ["scaled_features"] final_assembler = VectorAssembler( inputCols=feature_cols, outputCol="features", handleInvalid="skip" # Skip rows with invalid values ) pipeline = Pipeline(stages=[indexer, encoder, target_indexer, numeric_assembler, scaler, final_assembler]) model = pipeline.fit(df) transformed_df = model.transform(df)
- Combines all encoded categorical features with scaled numeric features
- Creates and executes a pipeline with all transformations in the correct order
- Includes the target variable indexer in the pipeline
- Fits and applies the transformations to the data
-
Data Quality Check:
transformed_df = transformed_df.filter(transformed_df.features.isNotNull())
- Ensures no null values in the final feature vectors
- This step is crucial for model training as most ML algorithms cannot handle null values > 🟦 Why It Matters:
- Model Evaluation: Separate datasets allow unbiased assessment of model performance
- Overfitting Prevention: Testing on unseen data helps validate model generalization
- Reproducibility: Consistent splits ensure reliable model development
💡 Code Explanation: Here is a break down of the solution code:
-
Train/Test Split:
train_df, test_df = transformed_df.randomSplit([0.8, 0.2], seed=42)
- Uses randomSplit() to create 80/20 split
- Sets random seed for reproducibility
-
Directory Creation:
os.makedirs("data/processed", exist_ok=True)
- Creates output directory if it doesn't exist
- Uses
exist_ok=True
to prevent errors if directory exists
-
Dataset Saving:
train_df.write.mode("overwrite").parquet("data/processed/train_data.parquet") test_df.write.mode("overwrite").parquet("data/processed/test_data.parquet")
- Saves both datasets in parquet format
- Uses overwrite mode to replace existing files
-
-
Challenge
Classification Model Training and Evaluation
In this step of the lab, you'll learn how to train and evaluate machine learning models using PySpark MLlib. You'll work with the prepared features from the previous step to build, train, and assess various machine learning models.
This step will help you understand how to create effective ML pipelines and measure model performance.
🟦 Why It Matters:
-
Model training and evaluation are critical steps in the machine learning lifecycle that determine the effectiveness of your ML solution.
-
Understanding how to properly train models and evaluate their performance helps you make informed decisions about model selection and optimization.
-
PySpark MLlib provides a comprehensive set of tools for training various types of models and evaluating their performance at scale.
-
The ability to create reproducible ML pipelines ensures consistent results and makes it easier to deploy models to production.
-
Proper model evaluation helps you identify potential issues like overfitting, underfitting, or bias in your predictions.
Now, proceed to the first task: setting up the model training pipeline. > 🟦 Why It Matters:
- Model Configuration: Proper configuration is crucial for model performance
- Parameter Selection: Well-chosen parameters help prevent overfitting and underfitting
- Reproducibility: Setting a random seed ensures consistent results
- Scalability: Random Forest is well-suited for distributed computing
💡 Code Explanation: Here is a break down of the solution code:
-
Random Forest Configuration:
rf = RandomForestClassifier( labelCol="interaction_type_indexed", featuresCol="features", numTrees=100, maxDepth=10, minInstancesPerNode=5, seed=42, impurity="gini" )
labelCol
: Specifies the target variable (interaction type)featuresCol
: Points to the feature vector columnnumTrees
: Number of decision trees in the forest (100 for good balance)maxDepth
: Maximum depth of each tree (10 to prevent overfitting)minInstancesPerNode
: Minimum samples per node (5 for stability)seed
: Random seed for reproducibilityimpurity
: Uses Gini impurity for splitting criterion
-
Return Statement:
return rf
Returns the configured model for use in training. > 🟦 Why It Matters:
- Model Training: The quality of training directly affects model performance
- Feature Importance: Understanding which features contribute most to predictions
- Model Interpretability: Feature importances help explain model decisions
- Performance Optimization: Insights for feature selection and engineering
💡 Code Explanation: Here is a break down of the solution code:
-
Model Training:
model = rf.fit(train_df)
fit()
: Trains the model on the provided training data- Creates a
RandomForestClassificationModel
instance - Learns the patterns in the data to make predictions
-
Feature Importance Extraction:
feature_importances = model.featureImportances
featureImportances
: Property containing importance scores- Each score represents how much a feature contributes to predictions
- Scores sum to 1.0 across all features
-
Return Statement:
return model, feature_importances
Returns both the trained model and feature importance scores for further use. > 🟦 Why It Matters:
- Model Assessment: Metrics help understand how well the model performs
- Performance Comparison: Enables comparison with other models or baselines
- Model Selection: Helps in choosing the best model for deployment
- Quality Assurance: Ensures the model meets performance requirements
💡 Code Explanation: Here is a break down of the solution code:
-
Making Predictions:
predictions = model.transform(test_df)
transform()
: Applies the trained model to test data- Creates predictions for each instance in the test set
-
Evaluator Setup:
evaluator = MulticlassClassificationEvaluator( labelCol="interaction_type_indexed", predictionCol="prediction" )
- Creates an evaluator for multiclass classification
- Configures label and prediction columns
-
Metric Calculation:
accuracy = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"}) precision = evaluator.evaluate(predictions, {evaluator.metricName: "precisionByLabel"}) recall = evaluator.evaluate(predictions, {evaluator.metricName: "recallByLabel"})
- Calculates different performance metrics
- Each metric provides a different perspective on model performance
-
Return Statement:
return { "Accuracy": accuracy, "Precision": precision, "Recall": recall }
Returns a dictionary containing all calculated metrics for analysis.
-
-
Challenge
Linear Regression Model Training and Evaluation
In this step of the lab, you'll learn how to build and evaluate a Linear Regression model using PySpark MLlib. You'll work with the prepared features from Step 2 to predict numeric outcomes, specifically focusing on predicting purchase amounts.
This step will help you understand how to implement regression models and assess their performance using appropriate metrics.
🟦 Why It Matters:
-
Linear Regression is a fundamental machine learning algorithm that helps predict continuous values, which is crucial for business applications like sales forecasting and price prediction.
-
Understanding how to implement and evaluate regression models helps you make data-driven decisions about pricing, inventory management, and customer behavior analysis.
-
PySpark MLlib's Linear Regression implementation allows you to train models on large-scale datasets efficiently.
-
Proper model evaluation using metrics like RMSE (Root Mean Square Error) helps you understand the accuracy of your predictions and identify areas for improvement.
-
The ability to interpret regression coefficients provides valuable insights into which features have the strongest impact on your target variable.
Now, proceed to the first task: configuring the Linear Regression model. > 🟦 Why It Matters:
- Model Configuration: Proper configuration is crucial for model performance
- Parameter Selection: Well-chosen parameters help prevent overfitting and underfitting
- Regularization: Elastic net regularization helps with feature selection and model stability
- Scalability: Linear Regression is efficient for distributed computing
💡 Code Explanation: Here is a break down of the solution code:
-
Linear Regression Configuration:
lr = LinearRegression( labelCol="purchase_amount", featuresCol="features", maxIter=100, regParam=0.3, elasticNetParam=0.8, standardization=True, solver="auto" )
labelCol
: Specifies the target variable (purchase amount)featuresCol
: Points to the feature vector columnmaxIter
: Maximum number of iterations for optimization (100 for good convergence)regParam
: Regularization parameter (0.3 for balanced regularization)elasticNetParam
: Elastic net mixing parameter (0.8 for balanced L1/L2 regularization)standardization
: Standardizes features for better convergencesolver
: Automatically chooses the best optimization algorithm
-
Return Statement:
return lr
Returns the configured model for use in training. > 🟦 Why It Matters:
- Model Training: The quality of training directly affects model performance
- Feature Impact: Coefficients show how each feature influences predictions
- Model Interpretability: Coefficients help explain the relationship between features and target
- Performance Optimization: Insights for feature selection and engineering
💡 Code Explanation: Here is a break down of the solution code:
-
Model Training:
model = lr.fit(train_df)
fit()
: Trains the model on the provided training data- Creates a
LinearRegressionModel
instance - Learns the patterns in the data to make predictions
-
Coefficient Extraction:
coefficients = model.coefficients
coefficients
: Property containing the learned feature weights- Each coefficient represents the impact of a feature on the target variable
- Positive coefficients indicate positive correlation, negative indicate inverse correlation
-
Return Statement:
return model, coefficients
Returns both the trained model and feature coefficients for further use. > 🟦 Why It Matters:
- Performance Assessment: RMSE provides a clear measure of prediction accuracy
- Model Selection: Helps in comparing different models or configurations
- Quality Assurance: Ensures the model meets performance requirements
- Business Impact: Translates model performance into practical value
💡 Code Explanation: Here is a break down of the solution code:
-
Making Predictions:
predictions = model.transform(test_df)
transform()
: Applies the trained model to test data- Creates predictions for each instance
-
Evaluator Setup:
evaluator = RegressionEvaluator( labelCol="purchase_amount", predictionCol="prediction" )
- Creates a
RegressionEvaluator
for RMSE calculation - Configures with appropriate column names for label and predictions
- Creates a
-
RMSE Calculation:
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
- Calculates RMSE using the evaluator
- Specifies "rmse" as the metric to calculate
-
Return Statement:
return rmse
Returns the calculated RMSE value for model assessment.
-
-
Challenge
Hyperparameter Tuning
In this step of the lab, you'll learn how to optimize your machine learning models through hyperparameter tuning using PySpark MLlib's
CrossValidator
andTrainValidationSplit
.You'll work with the Linear Regression model from the previous step to find the optimal configuration.
This step will help you understand how to systematically search for the best model parameters and evaluate their performance using cross-validation.
🟦 Why It Matters:
-
Hyperparameter tuning is crucial for maximizing model performance and finding the optimal balance between model complexity and generalization.
-
Cross-validation helps ensure that your model's performance is robust and not just good on a single train-test split.
-
Understanding how to use
ParamGridBuilder
andCrossValidator
helps you systematically explore the hyperparameter space. -
Proper model selection based on cross-validation metrics helps you choose the most reliable model for production use.
-
The ability to interpret and compare different model configurations helps you make informed decisions about model deployment.
Now, proceed to the first task: setting up the parameter grid for hyperparameter tuning. > 🟦 Why It Matters:
- Hyperparameter Search: Parameter grid enables systematic search of hyperparameters
- Model Optimization: Helps find the best model configuration
- Comprehensive Testing: Tests multiple parameter combinations
- Reproducibility: Ensures consistent hyperparameter search
- Flexibility: Allows passing existing LinearRegression instances for testing
💡 Code Explanation: Here is a break down of the solution code:
-
Function Signature:
def setup_param_grid(lr=None):
- Accepts an optional
LinearRegression
instance - Makes the function more flexible for testing
- Accepts an optional
-
Linear Regression Setup:
if lr is None: lr = LinearRegression(labelCol="purchase_amount", featuresCol="features")
- Creates a base
LinearRegression
model if none provided - Sets up label and feature columns
- Creates a base
-
Parameter Grid Creation:
param_grid = ParamGridBuilder() .addGrid(lr.regParam, [0.1, 0.5]) .addGrid(lr.elasticNetParam, [0.0, 0.5]) .addGrid(lr.maxIter, [50, 100]) .build()
regParam
: Regularization parameter values [0.1, 0.5]elasticNetParam
: Elastic net mixing parameter values [0.0, 0.5]maxIter
: Maximum iteration values [50, 100]- Creates 8 different combinations (2 × 2 × 2) > 🟦 Why It Matters:
- Model Validation: Cross-validation provides robust model evaluation
- Hyperparameter Selection: Helps identify the best parameter combination
- Bias-Variance Trade-off: Balances model complexity and generalization
- Performance Estimation: Gives reliable estimates of model performance
💡 Code Explanation: Here is a break down of the solution code:
-
Function Signature:
def setup_cross_validation(param_grid, lr=None):
- Accepts a parameter grid and an optional
LinearRegression
instance - Makes the function more flexible for testing
- Accepts a parameter grid and an optional
-
Linear Regression Setup:
lr = LinearRegression(labelCol="purchase_amount", featuresCol="features")
- Creates a base
LinearRegression
model - Sets up label and feature columns
- Creates a base
-
Evaluator Configuration:
evaluator = RegressionEvaluator( labelCol="purchase_amount", predictionCol="prediction", metricName="rmse" )
- Sets up RMSE as the evaluation metric
- Configures label and prediction columns
-
Cross-Validator Setup:
cv = CrossValidator( estimator=lr, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=3 )
- Combines all components for cross-validation
- Uses 3-fold cross-validation for robust evaluation > 🟦 Why It Matters:
- Model Selection: Identifies the optimal model configuration
- Performance Optimization: Ensures best possible model performance
- Generalization: Helps prevent overfitting
- Production Readiness: Provides a production-ready model
💡 Code Explanation: Here is a break down of the solution code:
-
Model Training and Selection:
best_model = cv.fit(train_df).bestModel
fit()
: Fits the cross-validator to training databestModel
: Property containing the best performing model
-
Model Evaluation (in main execution):
predictions = best_model.transform(test_df) evaluator = RegressionEvaluator( labelCol="purchase_amount", predictionCol="prediction", metricName="rmse" ) rmse = evaluator.evaluate(predictions)
transform()
: Makes predictions on test dataevaluate()
: Evaluates model performance using RMSE- Prints best parameters and performance metrics
-
-
Challenge
Conclusion to the Lab
Congratulations on completing the "Train a Machine Learning Model with PySpark MLlib" lab!
You've taken a significant step in mastering one of the most powerful tools for distributed machine learning. Here is a recap of what you've accomplished and explore where you can go from here.
Lab Recap
Throughout this lab, you've progressed through key areas of PySpark MLLib:
Data Preparation and Feature Engineering
You began by learning how to:
- Prepare data for machine learning tasks using PySpark
- Handle missing values and outliers
- Create and transform features using PySpark's built-in functions
- Scale and normalize features for better model performance
Building ML Pipelines
You then advanced to building sophisticated ML pipelines:
- Creating and configuring ML pipeline stages
- Implementing feature transformers and estimators
- Building end-to-end ML workflows
- Evaluating model performance using various metrics
Model Training and Optimization
You learned critical aspects of model development:
- Training different types of ML models (classification, regression)
- Implementing cross-validation for robust evaluation
- Performing hyperparameter tuning
By working through these steps, you've gained practical experience that mirrors production ML challenges.
Next Steps
Your PySpark MLLib journey doesn't end here! Consider these paths to further enhance your machine learning skills:
Advanced ML Techniques
- Deep Learning: Explore integration with deep learning frameworks like TensorFlow and PyTorch
- Advanced Algorithms: Learn about more sophisticated algorithms like Random Forests, Gradient Boosting, and Neural Networks
- Unsupervised Learning: Master clustering and dimensionality reduction techniques
- Recommendation Systems: Build collaborative filtering and content-based recommendation systems
Production ML
- Model Deployment: Learn to deploy ML models in production environments
- Model Monitoring: Implement monitoring and maintenance strategies
- A/B Testing: Design and implement A/B testing frameworks
- MLOps: Explore tools and practices for ML operations
Integration and Scaling
- Distributed Training: Deep dive into distributed training strategies
- Feature Stores: Learn about feature management and serving
- Model Registry: Implement model versioning and management
- Real-time Predictions: Build streaming ML applications
Performance and Optimization
- Memory Management: Optimize memory usage in distributed environments
- Training Speed: Learn techniques to accelerate model training
- Resource Utilization: Master cluster resource management
- Model Compression: Explore techniques for model optimization
Remember, expertise in distributed machine learning comes with practice. Try to apply what you've learned to your own datasets or contribute to open-source projects to solidify your understanding.
Happy Machine Learning! 🚀
What's a lab?
Hands-on Labs are real environments created by industry experts to help you learn. These environments help you gain knowledge and experience, practice without compromising your system, test without risk, destroy without fear, and let you learn from your mistakes. Hands-on Labs: practice your skills before delivering in the real world.
Provided environment for hands-on practice
We will provide the credentials and environment necessary for you to practice right within your browser.
Guided walkthrough
Follow along with the author’s guided walkthrough and build something new in your provided environment!
Did you know?
On average, you retain 75% more of your learning if you get time for practice.