Unsupervised learning is a type of machine learning in which insights are generated from data without a dependent (target) variable. It has several use cases. The most popular is market segmentation, where you divide the market or customer base into clusters, which helps in building a targeted marketing strategy. Other applications include association mining and building recommendation engines. One of the most common unsupervised machine learning techniques is k-means clustering.
K-means clustering is a process of dividing observations into k clusters. The records within a cluster are similar, while the k clusters differ from each other. The success of k-means clustering depends on how well the algorithm can create these partitions. This guide will demonstrate how to configure, train, and understand an unsupervised k-means clustering model in Azure Machine Learning Studio.
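One common way to put a number on "how well the algorithm can create these partitions" is the within-cluster sum of squares: the total squared distance from each record to the center of its cluster. The sketch below is illustrative only. The toy points, the centroids, and the function name are assumptions, and nothing here is taken from Azure Machine Learning Studio.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def within_cluster_sum_of_squares(points, assignments, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    return sum(dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))


# Toy data invented for illustration: two obvious groups in two dimensions.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(1.0, 1.5), (8.5, 8.0)]

# A partition that respects the two groups versus one that mixes them.
good = within_cluster_sum_of_squares(points, [0, 0, 1, 1], centroids)
bad = within_cluster_sum_of_squares(points, [0, 1, 0, 1], centroids)
print(good, bad)
```

A lower value means the records sit closer to their cluster centers, so the partition that respects the two groups scores far lower than the mixed one.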
In this guide, you will work with the Pima Indian diabetes data set available in Azure Machine Learning Studio. This data originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset consists of several variables such as the number of pregnancies the patient has had, their BMI, insulin level, age, and so on. You can have a look at this data here.
You will start by loading the data.
Once you have logged into your Azure Machine Learning Studio account, click on the EXPERIMENTS option, listed on the left sidebar, followed by the NEW button.
Next, click on the blank experiment and name the experiment "Clustering". Under the saved datasets, drag the Pima Indians Diabetes dataset into the workspace.
Once you have loaded the data, the next step is to explore it. To do this, right-click the dataset and select the Visualize option as shown below. This helps you understand the structure of the data.
The data contains 768 rows and nine columns. You can examine any variable by clicking on it.
K-means clustering is an unsupervised machine learning algorithm that groups similar items together based on a similarity metric. In Azure Machine Learning Studio, the K-Means Clustering module is used to configure and create a k-means clustering model. Start by searching for the module and dragging it into the workspace.
You have the module in the workspace, and the next step is to configure it. For Create trainer mode, select the Single Parameter option, which is used when you know how you want to configure the algorithm. The second parameter is Number of Centroids, which indicates the number of clusters to begin with. Set this value to 5. The final number of clusters might be different from this, but it helps the algorithm to start with a number.
The Initialization method is K-Means++, which is the default method for initializing the cluster centroids. Type an integer value in Random number seed to ensure reproducibility. For Metric, select Euclidean, which measures the straight-line distance between a data point and a cluster centroid.
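To give a feel for what these three settings control, here is a minimal sketch of the standard k-means++ seeding procedure using Euclidean distance and a fixed random seed. It is not the Studio module's implementation. The function name and toy data are assumptions, and the sketch assumes the data contains at least k distinct points.

```python
import random
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans_plus_plus_init(points, k, seed):
    """Pick k starting centroids; far-away points are more likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the choice reproducible
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight each point by its squared distance to the nearest chosen centroid.
        # Points already chosen get weight 0, so they are not picked again.
        weights = [min(dist(p, c) ** 2 for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


# Toy data invented for illustration.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0), (5.0, 1.0)]
first_run = kmeans_plus_plus_init(points, k=3, seed=42)
second_run = kmeans_plus_plus_init(points, k=3, seed=42)
print(first_run)
```

Running the function twice with the same seed returns the same starting centroids, which is the point of setting Random number seed.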
The Iterations parameter specifies the number of times the algorithm iterates over the training data before finalizing the selection of centroids. Set this value to 100. Keep the other options at their defaults as shown below.
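The role of an iteration cap can be sketched with the textbook k-means loop: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. The code below is a simplified illustration, not the Studio module's code. The function name, the toy data, and the early stop when assignments no longer change are assumptions.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def run_kmeans(points, initial_centroids, max_iterations=100):
    """Alternate assignment and update steps for at most max_iterations rounds."""
    centroids = list(initial_centroids)  # copy so the caller's list is untouched
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so further iterations would not help
        assignments = new_assignments
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # leave a centroid in place if no points were assigned
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


# Toy data invented for illustration.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
final_centroids, final_assignments = run_kmeans(points, [(1.0, 1.0), (9.0, 8.0)])
print(final_centroids, final_assignments)
```

On this tiny example the loop settles after two passes, well under the cap of 100. The cap matters on larger data, where it bounds the training time.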
The Train Clustering Model module is used to train the clustering model. Search for the module, drag it into the workspace, and connect it with the other modules as shown below.
You can see a red flag next to the Train Clustering Model module, which indicates that something needs to be corrected. Click on the Launch column selector option, and select all the numerical variables, as shown below. You select only numerical variables because Euclidean distance is defined on numeric values and cannot be computed directly on categorical variables.
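Outside the Studio interface, the same "numeric columns only" selection can be sketched in a few lines. The column names and rows below are hypothetical and do not come from the Pima Indians Diabetes dataset. They exist only to show a categorical column being left out.

```python
def numeric_columns(rows):
    """Return the names of columns whose values are all int or float."""
    return [
        name
        for name in rows[0]
        if all(
            isinstance(row[name], (int, float)) and not isinstance(row[name], bool)
            for row in rows
        )
    ]


# Hypothetical records: "region" is categorical, so it cannot feed a distance calculation.
rows = [
    {"bmi": 33.6, "age": 50, "region": "north"},
    {"bmi": 26.6, "age": 31, "region": "south"},
]
selected = numeric_columns(rows)
print(selected)
```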
Run the experiment.
The experiment run is successful, and the result is stored in the right output port of the Train Clustering Model module.
To explore the results, search for the Convert to CSV module and drag it into the workspace, as shown below.
Run the experiment. Next, right-click and select the Download option.
The above operation downloads the dataset, which you can see at the bottom left-hand side, highlighted with a blue box and named "Clustering - 50615."
Open the file and you will see the following output. Clustering has added new variables to the original dataset. The Assignments variable tells us which cluster each observation was assigned to. There are five clusters, numbered zero through four. Additional variables give the distance from each observation to each cluster center. This is shown below.
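The relationship between the distance columns and the cluster assignment can be sketched as follows: compute the distance from an observation to every cluster center, then take the index of the smallest one. This is an illustration with made-up centroids and a hypothetical function name, and it assumes Euclidean distance as configured earlier. It is not code exported from the Studio.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def score_row(point, centroids):
    """Return (assigned cluster index, list of distances to every centroid)."""
    distances = [dist(point, c) for c in centroids]
    assignment = distances.index(min(distances))
    return assignment, distances


# Made-up cluster centers for illustration.
centroids = [(0.0, 0.0), (3.0, 4.0)]
assignment, distances = score_row((3.0, 0.0), centroids)
print(assignment, distances)
```

Here the point is 3.0 away from the first center and 4.0 away from the second, so it is assigned to cluster 0.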
The above step required downloading the data onto your machine. There is another way to examine the cluster results: you can save the result as a dataset in Azure Machine Learning Studio itself. To do this, right-click on the output port of the Convert to CSV module, and select Save as Dataset.
The above step will open a new window where you can provide a name. In this example, the dataset is named "k-means saved dataset".
The data will be saved in the My Datasets folder under Saved Datasets.
It is easy to visualize and perform operations on this dataset. For instance, you can drag the dataset into the workspace, and right-click to Visualize it.
The output shows that the resulting dataset has additional columns. Click on the Assignments variable to see its summary statistics. The Assignments variable takes five unique values that represent the five clusters created.
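If you work with the downloaded CSV file instead, the same check can be sketched in Python by counting how many rows fall into each cluster. The sample below is invented and far smaller than the real file. Only the Assignments column name is taken from the output described above, and the function name is an assumption.

```python
import csv
import io
from collections import Counter


def count_assignments(csv_file):
    """Count the rows per cluster using the Assignments column."""
    return Counter(row["Assignments"] for row in csv.DictReader(csv_file))


# Invented sample standing in for the downloaded file.
sample_csv = io.StringIO(
    "age,Assignments\n"
    "50,0\n"
    "31,2\n"
    "32,0\n"
    "21,4\n"
)
counts = count_assignments(sample_csv)
print(sorted(counts.items()))
```

To run this on the real download, replace the in-memory sample with a file opened using `open(path, newline="")`, where `path` points to wherever you saved the file.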
In many data science projects, you will not have a target variable. Instead, you will have a data set with features, and you will be expected to generate valuable insights from it. Classic examples are Google's search engine, Uber's taxi ride algorithm, Netflix's recommendation engine, and Amazon's market basket analysis. None of these powerful machine learning solutions depends on supervised learning alone; they also rely on unsupervised machine learning algorithms.
To learn more about data science and machine learning using Azure Machine Learning Studio, please refer to the following guides: