Wednesday, December 11, 2013

Implementing Principal Component Analysis (PCA) in Python

I took a look at PCA (principal component analysis). I'm not sure whether this is already implemented somewhere else, but a quick review of my college notes (reference needed) led me to the code below. The data (reference needed), which the code reads from a file named pcaData, is:
x y
2.5 2.4
0.5 0.7
2.2 2.9
1.9 2.2
3.1 3.0
2.3 2.7
2 1.6
1 1.1
1.5 1.6
1.1 0.9

'''
 *@author beck 
 *@date Sep 14, 2012
 *PCA with Python 
 *bekoc.blogspot.com 
'''
import numpy as np
import matplotlib.pyplot as plt


# Load the two columns from the space-delimited file "pcaData", skipping the "x y" header row.
xs = np.loadtxt("pcaData", delimiter=" ", skiprows=1, usecols=(0, 1))  # numpy array - similar to C array notation.
# get the mean of each column
meanx = np.average(xs[:, 0])
meany = np.average(xs[:, 1])

correctedX = xs[:, 0] - meanx  # X data with the mean subtracted
correctedY = xs[:, 1] - meany  # Y data with the mean subtracted
data = np.array([correctedX, correctedY])  # shape (2, n): one row per variable
print(data.shape)
covData = np.cov(data)  # calculate covariance matrix (np.cov treats each row as a variable)

# eigenvectors[:, k] is the unit eigenvector that belongs to eigenvalues[k]
eigenvalues, eigenvectors = np.linalg.eig(covData)

print(eigenvectors)
print(eigenvectors[0][0])  # x component of the first eigenvector
print(eigenvectors[1][0])  # y component of the first eigenvector
# draw each eigenvector as a line through the origin with slope (y component / x component)
x = list(range(-2, 3))
y = [eigenvectors[1][0] * i / eigenvectors[0][0] for i in x]
y1 = [eigenvectors[1][1] * i / eigenvectors[0][1] for i in x]

print(x)
print(y)
plt.plot(x, y, linestyle='--', label='eigenvector1')
plt.plot(x, y1, linestyle='--', label='eigenvector2')
plt.plot(data[0, :], data[1, :], marker='+', linestyle=' ', label="Normalized data")

#plt.plot(xs[:,0],xs[:,1],marker='+',linestyle=' ')
plt.ylim([-2, 2])
plt.xlim([-2, 2])
plt.title('PCA example')
plt.legend()
plt.show()
The code covers steps 1 to 3 of the summary below and plots both eigenvectors; it does not yet project the data (step 5).
PCA summary:
1- Given a dataset with n dimensions (features), calculate the normalized data (the data with the mean of each dimension subtracted)
2- Calculate the covariance matrix of the normalized data
3- Calculate the eigenvalues and eigenvectors of the covariance matrix
4- The eigenvector with the largest eigenvalue is the principal component
5- Choose the p eigenvectors with the largest eigenvalues and multiply them with your normalized data
6- Now your data has p dimensions
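The six steps can be sketched end to end, including the projection in steps 5 and 6 that the script above leaves out. This is only a minimal sketch, not the code from my notes: the helper name pca_project is made up for illustration, the ten points are typed in directly instead of being read from pcaData, and np.linalg.eigh is swapped in for np.linalg.eig because the covariance matrix is symmetric.

```python
import numpy as np

# The ten (x, y) points listed at the top of the post, typed in directly.
xs = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])


def pca_project(points, p):
    """Hypothetical helper: project points (one row per sample) onto p eigenvectors."""
    centered = points - points.mean(axis=0)            # step 1: subtract the mean
    cov = np.cov(centered, rowvar=False)               # step 2: covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)    # step 3: eigen-decomposition
    order = np.argsort(eigenvalues)[::-1]              # step 4: largest eigenvalue first
    components = eigenvectors[:, order[:p]]            # step 5: keep p eigenvectors
    return centered.dot(components), eigenvalues[order]  # step 6: p-dimensional data


projected, sorted_eigenvalues = pca_project(xs, p=1)
print(projected.shape)       # one column per kept eigenvector
print(sorted_eigenvalues)    # eigenvalues, largest first
```

With p=1 every point is reduced to a single coordinate along the principal component. The sign of that coordinate depends on the sign the solver happens to give the eigenvector, so only the relative positions are meaningful.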
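In the plot, the green dashed line is the eigenvector with the largest eigenvalue. It points along the direction of greatest variance, which is the most significant relationship between the dimensions.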
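One way to put a number on "most significant" is to divide each eigenvalue by the sum of all eigenvalues, which gives the share of the total variance that lies along each eigenvector. The snippet below is a small sketch under the same assumption as before (the data is typed in rather than loaded from pcaData); the variable names explained and principal are my own.

```python
import numpy as np

# The ten (x, y) points listed at the top of the post, typed in directly.
xs = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

# np.cov subtracts the column means itself, so the raw data can be passed in.
cov = np.cov(xs, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eig(cov)

# Share of the total variance that lies along each eigenvector.
explained = eigenvalues / eigenvalues.sum()

# The principal component is the column that belongs to the largest eigenvalue.
principal = eigenvectors[:, np.argmax(eigenvalues)]

for value, share in zip(eigenvalues, explained):
    print("eigenvalue %.4f explains %.1f%% of the variance" % (value, 100 * share))
print("principal direction: %s" % principal)
```

For this dataset nearly all of the variance falls along one eigenvector, which is why dropping the other one loses very little information.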

Please refer to the simple and concise tutorial on the georgemdallas blog.
