Introduction

Data science and machine learning often require formulating hypotheses and testing them with statistical tests. One such common hypothesis testing process is performing a t-test to compare whether two groups have different means. In this guide, you will learn how to compute and analyze t-test statistics with Azure Machine Learning Studio.

In this guide, you will work with a fictitious dataset of loan applicants that includes 600 observations and nine variables, as described below.

- `Is_graduate`: Whether the applicant is a graduate or not.

- `Income`: Annual income of the applicant (in US dollars).

- `Loan_amount`: Loan amount (in US dollars) for which the application was submitted.

- `Credit_score`: Whether the applicant's credit score was good ("Satisfactory") or not ("Not_satisfactory").

- `Age`: The applicant’s age in years.

- `approval_status`: Whether the loan application was approved ("Yes") or not ("No"). This is the dependent variable.

- `Sex`: Gender of the applicant.

- `Investment`: Investment amount (in US dollars) in stocks and mutual funds, as declared by the applicant.

- `Purpose`: Purpose for applying for the loan.

Start by loading the data.

Once you have logged into your Azure Machine Learning Studio account, click on the **EXPERIMENTS** option, listed on the left sidebar, followed by the **NEW** button. Next, click on the blank experiment and give the name **T-test** to the workspace. The following screen will be displayed.

Next, load the data into the workspace. Click **NEW**, and select the **DATASET** option shown below.

The selection above will open a window, shown below, which can be used to upload the dataset from the local system.

Once the data is loaded, you can see it in the **Saved Datasets** option. The file name is **ttestdata.csv**. The next step is to drag it from the **Saved Datasets** list into the workspace. To explore this data, right-click and select the **Visualize** option as shown below.

The data contains 600 rows and nine columns. Selecting any variable will display its statistics, as shown below.
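If you want to reproduce this kind of per-column summary outside Studio, the following is a minimal sketch using only the Python standard library. The rows in `SAMPLE_CSV` are hypothetical stand-ins and not taken from **ttestdata.csv**; only the column names come from this guide, and the `summarize` helper is an illustrative name.

```python
import csv
import io
from statistics import mean, median, stdev

# Hypothetical rows for illustration; only the column names come from the guide.
SAMPLE_CSV = """Is_graduate,Income,Loan_amount
Yes,72000,15000
No,65000,12000
Yes,81000,20000
No,70000,9000
"""


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "stdev": stdev(values),  # sample standard deviation (n - 1 denominator)
    }


# Parse the CSV text and pull out one numeric column.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
income = [float(row["Income"]) for row in rows]
summary = summarize(income)
print(summary)
```

To run this against a real file, you would replace `io.StringIO(SAMPLE_CSV)` with an opened file handle.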

The data is loaded into the workspace and ready for statistical tests.

Before moving ahead to the statistical tests, you should understand a few important terms.

- Mean: Mean represents the arithmetic average of the data. It is calculated by taking the sum of the values and dividing by the number of observations.

- Null and Alternative Hypotheses: Statistical tests rely on testing a null hypothesis, which is specific to each case. The null hypothesis assumes the absence of an effect or relationship. For example, when comparing two groups with a t-test, the null hypothesis is that the two group means are equal. The alternative hypothesis is the opposite of the null hypothesis.

- t-test: In general, a t-test helps you compare whether two groups have the same or different means. It is also used to test if a single group mean is comparable to a hypothesized mean.

- P-value: For any statistical test, the p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. It is used to decide whether to reject or fail to reject the null hypothesis. The decision rule used in this guide is that if the p-value for the test is less than 0.05, we reject the null hypothesis. If the p-value is greater than or equal to 0.05, we fail to reject the null hypothesis.
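The mean and the decision rule defined above can be expressed in a few lines of Python. This is only an illustrative sketch: the function name `decide` and the sample values are assumptions made for the example, and the 0.05 threshold is the one stated in this guide.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of observations."""
    return sum(values) / len(values)


def decide(p_value, threshold=0.05):
    """Apply the decision rule: reject only when p is strictly below the threshold."""
    if p_value < threshold:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# Hypothetical observations, not taken from the loan dataset.
values = [10, 20, 30, 40]
print(arithmetic_mean(values))
print(decide(0.03))
print(decide(0.05))
```

Note that a p-value of exactly 0.05 falls on the "fail to reject" side, matching the rule as stated.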

The following sections outline the implementation of a t-test in Azure Machine Learning Studio.

The idea behind the one-sample t-test is to compare the mean of a vector against a theoretical mean. In our data, you will consider the `Income` variable and test the null hypothesis that the mean income of the applicants is USD $72,500. This can be any number based on your research.
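To make the comparison concrete, here is a minimal sketch of the standard one-sample t statistic: the difference between the sample mean and the hypothesized mean, divided by the standard error of the mean. It is not the Studio module's internal code. The function name `one_sample_t` and the income values are hypothetical; only the hypothesized mean of 72,500 comes from this guide.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    standard_error = stdev(sample) / sqrt(n)  # stdev uses the n - 1 denominator
    if standard_error == 0:
        raise ValueError("sample has zero variance")
    return (mean(sample) - mu0) / standard_error, n - 1


# Hypothetical incomes; the real dataset has 600 observations.
incomes = [68000, 71000, 75500, 80000, 69500]
t_stat, df = one_sample_t(incomes, 72500)
print(t_stat, df)
```

A t statistic far from zero, in either direction, is evidence against the null hypothesis. How far is "far" depends on the degrees of freedom.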

Drag and drop the **Test Hypothesis using t-Test** module into the workspace.

Click on the **Launch column selector** option and select the `Income` variable.

Next, complete the other arguments to configure the module. For **t-Test type**, select **SingleSampleSet** because you are considering one variable. For the option **Null hypothesized µ**, type the value of 72500. This specifies the expected mean income value against which the variable mean will be tested. Next, under **Hypothesis type**, select the default option of **TwoTail**. This is the most common type of test; it checks for a difference from the hypothesized mean in either direction. The **α** parameter specifies the confidence factor. Set this value to 0.95.
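The difference between a two-tailed and a one-tailed test shows up in how a t statistic is converted to a p-value. The sketch below does that with the Student's t density, integrated numerically by Simpson's rule so that only the standard library is needed. It is an approximation for illustration and not how the Studio module computes its result. The function names and the `tail` labels (`"two"`, `"greater"`, `"less"`) are assumptions made for this example.

```python
from math import exp, lgamma, log, pi


def t_pdf(x, df):
    """Density of Student's t distribution (log-gamma avoids overflow for large df)."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))


def upper_tail(t, df, steps=2000):
    """Approximate P(T > |t|) using Simpson's rule on [0, |t|]; steps must be even."""
    a = abs(t)
    if a == 0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 0.5 - area)


def p_value(t, df, tail="two"):
    """Convert a t statistic to a p-value for the chosen alternative."""
    tail_area = upper_tail(t, df)
    if tail == "two":
        return 2 * tail_area  # extreme values in either direction count
    if tail == "greater":
        return tail_area if t >= 0 else 1 - tail_area
    if tail == "less":
        return tail_area if t <= 0 else 1 - tail_area
    raise ValueError("tail must be 'two', 'greater', or 'less'")


# Same t statistic, different alternatives.
print(p_value(2.0, 20, tail="two"))
print(p_value(2.0, 20, tail="greater"))
```

For a positive t statistic, the two-tailed p-value is twice the one-tailed "greater" p-value, which is why a two-tailed test is the more conservative choice when you have no prior expectation about the direction of the difference.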

Run the experiment. Once the module run is complete, the left output port of the **Test Hypothesis using t-Test** module gives the probability score.

Right-click and select **Visualize**.

The output shows that the p-value is lower than 0.05, which means that the null hypothesis is rejected.

You can confirm this with the right output port of the **Test Hypothesis using t-Test** module. Right-click and select **Visualize**.

The output below shows that the null hypothesis is to be rejected.

Based on the results of the t-test, you reject the null hypothesis that the mean income of the applicants is USD $72,500.

The idea behind the paired-sample t-test is to compare the means of two sets of paired measurements, where each pair comes from the same subject (here, the same applicant). In this data, you will consider the variables `Income` and `Loan_amount`. The null hypothesis to be tested is that the difference between the expected means of the two variables is zero.
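A paired-sample t-test is equivalent to a one-sample t-test on the per-row differences, which the following sketch makes explicit. The function name `paired_t` and the sample values are hypothetical and chosen only for illustration; this is the textbook formula and not the Studio module's own code.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second, mu0=0.0):
    """Return (t statistic, degrees of freedom) for a paired-sample t-test."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    # Each difference comes from one pair, for example one applicant.
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    standard_error = stdev(diffs) / sqrt(n)
    if standard_error == 0:
        raise ValueError("differences have zero variance")
    return (mean(diffs) - mu0) / standard_error, n - 1


# Hypothetical paired values for four applicants.
income = [72000, 65000, 81000, 70000]
loan_amount = [15000, 12000, 20000, 9000]
t_stat, df = paired_t(income, loan_amount)
print(t_stat, df)
```

Here `mu0` plays the role of the hypothesized difference in means, which is zero for the null hypothesis in this section.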

Drag and drop the **Test Hypothesis using t-Test** module into the workspace. To configure the module, click on the **Launch column selector** option and select the `Income` and `Loan_amount` variables.

Next, complete the other arguments. For **t-Test type**, select **PairedSamples** because you are considering two variables. For the option **Null hypothesized µ**, type the value of zero. This specifies that the expected difference in the means of the two variables is zero. Set the other options as shown below.

Run the experiment, then right-click on the left output port of the module and select **Visualize**.

The output shows that the p-value is lower than 0.05, which means that the null hypothesis is rejected.

Next, right-click on the right output port of the **Test Hypothesis using t-Test** module, and select **Visualize**. The output shows that the null hypothesis is to be rejected.

Based on the results of the t-test, you can reject the null hypothesis that the difference between the expected means of the variables `Income` and `Loan_amount` is zero.

Hypothesis testing has several applications in the areas of market research, quality testing, and inferential statistics. It is used in the pharmaceutical industry in new drug discovery. It is also used in the manufacturing industry to compare two processes or machines. Hypothesis testing using t-tests is also used in diagnostic analytics, an important area of machine learning.

In this guide, you learned how to perform hypothesis testing using t-tests in Azure Machine Learning Studio. You can learn more about this concept with this guide on R.

To learn more about data science and machine learning using Azure Machine Learning Studio, please refer to the following guides:
