Working with SciPy Stats Module

Vivek Kumar
May 9, 2019

Introduction

In the previous guide in this series, you were introduced to the SciPy linear algebra module. In this guide, you will learn how to use SciPy to conduct statistical analysis.

By the end of this guide, you'll have a strong understanding of the following topics:

  1. Working with distributions
  2. Analyzing one and two samples
  3. Kernel density estimation

The Baseline

Throughout this guide, we will be using the following libraries:

import numpy as np
import scipy as sp
from scipy import stats
python

Working with Distributions

Distributions can be either continuous or discrete. In SciPy, methods are available for 98 continuous distributions and 14 discrete distributions. You can find their names and counts using the following code:

# List of continuous distributions
continuous_dist = [d for d in dir(stats) if isinstance(getattr(stats, d), stats.rv_continuous)]
continuous_dist

# Output:
# ['alpha',
#  'anglit',
#  'arcsine',
#  'argus',
#  'beta',
#  'betaprime',
#  'bradford',
#  'burr',
#  'burr12',
#  'cauchy',
#  'chi',
#  'chi2',
#  'cosine',
#  .
#  .
#  .
#  't',
#  'trapz',
#  'triang',
#  'truncexpon',
#  'truncnorm',
#  'tukeylambda',
#  'uniform',
#  'vonmises',
#  'vonmises_line',
#  'wald',
#  'weibull_max',
#  'weibull_min',
#  'wrapcauchy'] 

# Total number of continuous distributions
len(continuous_dist) # 98

# List of discrete distributions
discrete_dist = [d for d in dir(stats) if isinstance(getattr(stats, d), stats.rv_discrete)]
discrete_dist

# Output:
# ['bernoulli',
#  'binom',
#  'boltzmann',
#  'dlaplace',
#  'geom',
#  'hypergeom',
#  'logser',
#  'nbinom',
#  'planck',
#  'poisson',
#  'randint',
#  'skellam',
#  'yulesimon',
#  'zipf']

# Total number of discrete distributions
len(discrete_dist) # 14
python

Now that we know which distributions the SciPy stats module currently supports, let us implement some basic distribution operations:

# ============================================================
# We can pass a vector to compute cdf or pdf at various points
# ============================================================
stats.norm.cdf(np.array([5, 8, 7]))

# Output:
# array([0.99999971, 1.        , 1.        ])

stats.norm.pdf(np.array([-1, 0, 1]))

# Output:
# array([0.24197072, 0.39894228, 0.24197072])

# Inference:
# The above results suggest that basic methods like pdf and cdf
# support vectorization.

# ==============================================
# Computing mean, variance, and standard deviation
# ==============================================
stats.norm.mean(), stats.norm.var(), stats.norm.std()

# Output:
# (0.0, 1.0, 1.0)

stats.norm.stats(moments="mv")

# Output:
# (array(0.), array(1.))

# ==============================================================
# The percent point function (ppf) is the inverse of the cdf.
# For example, ppf(0.9) gives the 90th percentile; the median
# of a distribution is ppf(0.5).
# ==============================================================
stats.norm.ppf(0.9)

# Output:
# 1.2815515655446004

# ============================================================
# To generate the sequence of random variates, you can utilize
# the size argument in the rvs method
# ============================================================
stats.norm.rvs(size=5)

# Output: (random)
# array([ 1.12143471,  1.13711089,  0.70657281, -0.73941283,  0.09078616])

# To get the same numbers every time, you can either pass random_state
# to the rvs method or call the numpy.random.seed() method beforehand.
# (An explicit random_state takes precedence over the global seed.)

np.random.seed(42) # Alternative to random_state argument

stats.norm.rvs(size=5, random_state=42)

# Output:
# array([ 0.49671415, -0.1382643 ,  0.64768854,  1.52302986, -0.23415337])
python
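
The same methods work for discrete distributions, except that they expose pmf instead of pdf. Here is a small sketch with the binomial distribution (the parameters n=10, p=0.5 are illustrative, not from the original guide):

# Probability of exactly 5 successes in 10 fair coin flips
stats.binom.pmf(5, n=10, p=0.5)

# Output:
# 0.24609375
python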

All continuous distributions take loc and scale as keyword arguments to adjust the location and scale of the distribution. For instance, for the normal distribution, loc is the mean and scale is the standard deviation.

# ====================
# Shifting and Scaling
# ====================
stats.norm.stats(loc=2, scale=3, moments="mv")

# Output:
# (array(2.), array(9.))

stats.uniform.cdf([0, 1, 5, 2, 4, 5], loc=1, scale=4)

# Output:
# array([0.  , 0.  , 1.  , 0.25, 0.75, 1.  ])

stats.norm.rvs(5, size=7) # Here 5 is passed to loc

# Output: (random)
# array([4.51287462, 4.40760608, 4.13600923, 5.04852163, 4.16904988,
#        5.27045683, 4.94976189])
python

Passing arguments such as loc and scale (and, for a distribution like gamma, a shape parameter) again and again can become quite cumbersome. SciPy provides a way to avoid this: freezing the distribution. This is shown in the code below:

# Freezing the distribution
rv = stats.gamma(1, scale=2.)

# By freezing the distribution in rv, you no longer have to
# pass the shape or scale parameters to each method call.
python
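
For example, methods can now be called directly on the frozen object. This is a minimal sketch; for a gamma distribution with shape a = 1 and scale = 2, the mean is a * scale and the standard deviation is sqrt(a) * scale, so both equal 2.0:

# Methods of the frozen distribution need no extra arguments
rv.mean(), rv.std()

# Output:
# (2.0, 2.0)
python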

Now that you have learned to use SciPy for basic operations on distributions, you can also build your own distribution (continuous or discrete) within SciPy, as shown here with an example of a continuous distribution:

# ==================================
# Building a continuous distribution
# ==================================
class deterministic_gen(stats.rv_continuous):
    def _cdf(self, x):
        # Step function: all of the probability mass sits at x = 0
        return np.where(x < 0, 0., 1.)
    def _stats(self):
        # mean, variance, skew, kurtosis
        return 0., 0., 0., 0.

cont_dist = deterministic_gen(name="deterministic")

cont_dist.cdf(np.arange(-2, 2, 0.5))
# array([0., 0., 0., 0., 1., 1., 1., 1.])

cont_dist.pdf(np.arange(-2, 2, 0.5))
# array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
#        5.83333333e+04, 4.16333634e-12, 4.16333634e-12, 4.16333634e-12])

# The pdf is obtained by numerically differentiating the cdf, so the
# jump at x = 0 shows up as a very large spike there.
python

Similarly, you can use rv_discrete to create a discrete distribution, either by subclassing it as above or by passing the outcomes and their probabilities directly, as sketched below.
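
As a minimal sketch of the direct route, rv_discrete accepts a values argument holding the outcomes and their probabilities (the numbers here are illustrative):

# A custom discrete distribution defined by outcomes and probabilities
xk = np.arange(7)
pk = (0.1, 0.2, 0.3, 0.1, 0.1, 0.0, 0.2)
custm = stats.rv_discrete(name='custm', values=(xk, pk))

custm.pmf(2)   # probability of outcome 2 -> 0.3
custm.mean()   # expected value -> 2.7
python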

Analyzing One and Two Samples

Starting with a one-sample test, let us create random input data and analyze it using descriptive statistics, the t-test, the KS-test, and the tails of the distribution, as shown in the code:

# ==========================================
# Creation of reproducible random variates
# using a sample from Student t distribution
# ==========================================
x = stats.t.rvs(10, size=1000, random_state=42)

# ======================
# Descriptive Statistics
# ======================
x.min(), x.max(), x.mean(), x.var(), x.std()

# Output:
# (-4.472953407828721, # Minimum
#  5.482947945852735, # Maximum
#  -0.02473291726931078, # Mean
#  1.2521483080679578, # Variance
#  1.1189943288810529) # Standard Deviation

# Let us observe how the sample properties
# compare to their theoretical counterparts.

m, v, s, k = stats.t.stats(10, moments='mvsk')
n, (smin, smax), sm, sv, ss, sk = stats.describe(x)

sstr = '%-14s mean = %6.4f, variance = %6.4f, skew = %6.4f, kurtosis = %6.4f'
sstr % ('distribution:', m, v, s, k)
sstr % ('sample:', sm, sv, ss, sk)

# Output:
# distribution:  mean = 0.0000, variance = 1.2500, skew = 0.0000, kurtosis = 1.0000
# sample:        mean = -0.0247, variance = 1.2534, skew = 0.1019, kurtosis = 0.8509

# ======================
# Performing the t-test
# ======================
't-statistic = %6.3f pvalue = %6.4f' %  stats.ttest_1samp(x, m)

# Output:
# t-statistic = -0.699 pvalue = 0.4850

# ==================
# Performing KS-test
# ==================
'KS-statistic D = %6.3f pvalue = %6.4f' % stats.kstest(x, 't', (10,))

# Output:
# KS-statistic D =  0.018 pvalue = 0.9116

# Inference:
# Both p-values are well above 0.05, so we cannot reject the null
# hypotheses: that the sample mean equals the theoretical mean (t-test)
# and that the sample follows a t distribution with 10 degrees of
# freedom (KS-test).

# ===========================================
# To check the upper tail of the distribution
# we can use the ppf method
# ===========================================
crit01, crit05, crit10 = stats.t.ppf([1-0.01, 1-0.05, 1-0.10], 10)
'critical values from ppf at 1%%, 5%% and 10%% %8.4f %8.4f %8.4f' % (crit01, crit05, crit10)

# Output:
# critical values from ppf at 1%, 5% and 10%   2.7638   1.8125   1.3722
python

Now, let us take two samples and analyze whether they have the same statistical properties, using two-sample tests.

# ========================================
# Test with samples having identical means
# ========================================
rvs1 = stats.norm.rvs(loc=5, scale=10, size=1000, random_state=42)
rvs2 = stats.norm.rvs(loc=5, scale=10, size=1000, random_state=52)
stats.ttest_ind(rvs1, rvs2)

# Output:
# Ttest_indResult(statistic=0.5737630194402733, pvalue=0.5661927549326405)

# ========================================
# Test with samples having different means
# ========================================
rvs3 = stats.norm.rvs(loc=8, scale=10, size=1000, random_state=92)
stats.ttest_ind(rvs1, rvs3)

# Output:
# Ttest_indResult(statistic=-7.055949818457435, pvalue=2.3543548122721355e-12)

# =======
# KS Test
# =======
stats.ks_2samp(rvs1, rvs2)

# Output:
# Ks_2sampResult(statistic=0.021999999999999992, pvalue=0.9672441518664989)

stats.ks_2samp(rvs1, rvs3)

# Output:
# Ks_2sampResult(statistic=0.15100000000000002, pvalue=1.9396897659433313e-10)

# Inference:
# For rvs1 and rvs2 (identical means), the large p-values mean neither
# test rejects its null hypothesis (equal means for the t-test, same
# distribution for the KS test); for rvs1 and rvs3, the tiny p-values
# reject both.
python

Kernel Density Estimation

A common task in statistical analysis is estimating the probability density function of a random variable; this is often termed density estimation. SciPy provides scipy.stats.gaussian_kde for computing univariate and multivariate estimates.
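
Here is a minimal univariate sketch, assuming a sample drawn from a standard normal distribution (the sample size and evaluation grid are illustrative choices):

# Fit a Gaussian kernel density estimate to a random sample
sample = stats.norm.rvs(size=200, random_state=42)
kde = stats.gaussian_kde(sample)

# Evaluate the estimated density on a grid of points
grid = np.linspace(-3, 3, 7)
kde(grid)  # densities at the grid points, roughly bell-shaped
python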

Conclusion

In this guide, you have learned about the implementation of basic routines used in statistical analysis in Python. To familiarize yourself with other methods, you can explore the scipy.stats subpackage.

To learn more about SciPy, you can also refer to the other guides in this series.
