PCA Demo¶

This notebook uses randomly generated data to illustrate the concept of maximum variance.


Setup¶

Imports:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.sparse.linalg as spla
from sklearn.decomposition import PCA

RNG seed:

rng = np.random.default_rng(20201114)

Compute a PCA¶

Let’s draw a batch of multivariate normal samples in 3 dimensions, with some correlation between the dimensions:

means = [5, 10, -2]
mcv = [
    [1, 0.7, 0.2],
    [0.7, 1, -0.3],
    [0.2, -0.3, 1]
]
n = 500
data = rng.multivariate_normal(means, mcv, size=n)
data
array([[ 5.31358291, 10.10454979, -1.28170015],
       [ 4.63890901,  9.22500074, -0.22529381],
       [ 5.85599736, 10.77474896, -1.95883841],
       ...,
       [ 4.4931347 ,  9.93950184, -0.95846614],
       [ 4.91058303,  9.73080603, -2.05767846],
       [ 7.23444003, 12.52878242, -2.53911509]])

We now have 500 rows and 3 columns.
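We can confirm that directly from the array’s shape:

data.shape
(500, 3)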

Let’s confirm the data has its expected means:

obs_means = np.mean(data, axis=0)
obs_means
array([ 4.99862798,  9.95351358, -1.9301308 ])

That is close to our configured means. Let’s see the covariance matrix:

np.cov(data.T)
array([[ 1.04670135,  0.74075551,  0.22196162],
       [ 0.74075551,  1.06484044, -0.31102634],
       [ 0.22196162, -0.31102634,  1.04063369]])

That also matches what we configured!

Let’s do scatterplots:

sns.pairplot(pd.DataFrame(data))
<seaborn.axisgrid.PairGrid at 0x25699799e50>
[Figure: pairwise scatterplots of the three data columns]

We can see strong positive correlation between columns 0 and 1, and a milder negative correlation between columns 1 and 2.
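To quantify those relationships, here is a small added check (not part of the original output) using NumPy’s correlation matrix:

# correlation matrix of the three columns; the off-diagonal entries
# should be roughly consistent with the covariances we configured
np.corrcoef(data.T)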

PCA¶

Now we’re going to mean-center the data in preparation for the PCA - thank you, NumPy broadcasting:

mc_data = data - obs_means
mc_data
array([[ 0.31495493,  0.15103621,  0.64843064],
       [-0.35971896, -0.72851284,  1.70483699],
       [ 0.85736938,  0.82123538, -0.02870762],
       ...,
       [-0.50549327, -0.01401174,  0.97166465],
       [-0.08804494, -0.22270755, -0.12754766],
       [ 2.23581205,  2.57526884, -0.60898429]])

And now we’ll compute a 2D PCA:

P, sig, Qt = spla.svds(mc_data, 2)
P.shape
(500, 2)
Qt.shape
(2, 3)
sig
array([24.41604313, 29.99470823])
Qt
array([[ 0.34825608, -0.20454769,  0.91481033],
       [-0.68168465, -0.72513793,  0.0973705 ]])

Now we’re going to compute the explained variance from the singular values, using the standard relationship between the singular values of the mean-centered data and the variance along each component:

\[\lambda_i = s_i^2 / (n - 1)\]
ev = np.square(sig) / (n - 1)
ev
array([1.19467568, 1.80297099])
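As a cross-check (an added sketch, not in the original notebook), these values should match the two largest eigenvalues of the sample covariance matrix we computed earlier:

# eigenvalues of the sample covariance matrix, in increasing order;
# the largest two should be approximately equal to ev above
np.linalg.eigvalsh(np.cov(data.T))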

The last row of the \(Q^T\) matrix explains the most variance, because svds returns the rows in order of increasing singular value instead of decreasing.
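If we prefer the conventional decreasing-variance order, a small sketch like this reorders the svds results (names such as Qt_desc are just illustrative):

# svds returns results in increasing singular-value order; flip everything
# so that the first component explains the most variance
order = np.argsort(sig)[::-1]
sig_desc = sig[order]
ev_desc = ev[order]
Qt_desc = Qt[order, :]
P_desc = P[:, order]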

Let’s actually look at the line:

plt.scatter(mc_data[:, 0], mc_data[:, 1])
# draw the first principal direction (last row of Qt) as a line through the
# origin in the (dim 0, dim 1) plane; its slope is Qt[1,1] / Qt[1,0]
pca_xs = np.linspace(np.min(mc_data[:, 0]), np.max(mc_data[:, 0]))
pca_ys = pca_xs * Qt[1, 1] / Qt[1, 0]
plt.plot(pca_xs, pca_ys, color='red')
#plt.scatter(Qt[:, 0], Qt[:, 1], color='red')
# plt.arrow(0, 0, Qt[1, 0] * 3, Qt[1, 1] * 3, color='red', width=0.05, head_width=0.3)
# plt.arrow(0, 0, Qt[0, 0] * 3, Qt[1, 1] * 3, color='green', width=0.05, head_width=0.3)

plt.ylabel('Dim. 1')
plt.xlabel('Dim. 0')
plt.show()
[Figure: mean-centered data in dimensions 0 and 1 with the first principal direction drawn in red]

What if we want to see both principal directions at once?

# scale each direction to 2 standard deviations along its component
lens = np.sqrt(ev) * 2
plt.scatter(mc_data[:, 0], mc_data[:, 1], color='lightsteelblue')
# first principal direction (largest variance) in red
plt.arrow(0, 0, Qt[1, 0] * lens[1], Qt[1, 1] * lens[1], color='red', width=0.05, head_width=0.3)
# second principal direction in green - note its dim-1 coordinate is Qt[0, 1]
plt.arrow(0, 0, Qt[0, 0] * lens[0], Qt[0, 1] * lens[0], color='green', width=0.05, head_width=0.3)

plt.ylabel('Dim. 1')
plt.xlabel('Dim. 0')
plt.show()
[Figure: data with both principal directions drawn as arrows]

Comparing with scikit-learn PCA¶

Let’s compare this with scikit-learn’s PCA class.

pca = PCA(2)
pca.fit(data)
PCA(n_components=2)

It learns the means:

pca.mean_
array([ 4.99862798,  9.95351358, -1.9301308 ])

It learns our component matrix \(Q^T\), although the rows are in the opposite order (decreasing explained variance):

pca.components_
array([[ 0.68168465,  0.72513793, -0.0973705 ],
       [-0.34825608,  0.20454769, -0.91481033]])

And it learns the explained variances:

pca.explained_variance_
array([1.80297099, 1.19467568])

These match what we got from svds, up to some possible sign flips and the reversed ordering of the components.
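To make that comparison concrete, here is a small added sketch that checks the agreement numerically, allowing for the reversed order and arbitrary signs:

# put the svds results into decreasing-variance order
Qt_desc = Qt[::-1, :]
ev_desc = ev[::-1]

# explained variances should agree directly
print(np.allclose(ev_desc, pca.explained_variance_))

# each matching pair of component rows should be parallel or anti-parallel
# unit vectors, so the absolute dot products should all be 1
dots = np.abs(np.sum(Qt_desc * pca.components_, axis=1))
print(np.allclose(dots, 1.0))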