Principal Component Analysis



Problem setting
Suppose you are presented with data in the form of a sequence of d-dimensional vectors x_i, 1 ≤ i ≤ n. This is a very general setting. For instance, the data sequence can be a collection of membrane potential waveforms (as in spike sorting) or a set of images represented as vectors of pixels (as in machine learning). Taking the convention to represent data as column vectors, we can pull all the data together and form the data matrix
$$X = \begin{pmatrix} x_{1,1} & x_{1,2} & \dots & x_{1,n} \\ x_{2,1} & x_{2,2} & \dots & x_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{d,1} & x_{d,2} & \dots & x_{d,n} \end{pmatrix},$$
where each column is a data sample. Analyzing, and hopefully understanding, the result of an experiment often consists in uncovering regularity or structure in the data matrix. Unfortunately, measured data is often "messy" in the sense that it is too high-dimensional for us to detect structure by direct inspection, and in the sense that noise and/or redundancy often impairs data visualization. Principal Component Analysis (PCA) is a handy tool to reveal structure via dimensionality reduction and denoising of the data. In a nutshell and loosely speaking, PCA consists in detecting characteristic "features" of the data that can be ranked by degree of relevance: the more relevant a feature, the more it explains the variability of the data. PCA is successful when considering only a few of the most relevant "features" is enough to describe the data satisfactorily. Thus, successful PCA offers the possibility to perform dimensionality reduction, as the data can be represented in a space whose dimension is given by the number of kept "features". At the same time, successful PCA can be seen as denoising of the data, since the ignored "features" are most likely due to noise or redundancy in the data collection process.
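To make the column convention concrete, here is a minimal numpy sketch (not part of the original text) of assembling a d x n data matrix whose columns are the samples x_i; the dimensions and sample values are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200                                       # dimension of each sample, number of samples
samples = [rng.normal(size=d) for _ in range(n)]    # hypothetical measurements (e.g. waveforms)

# Each column of X is one data sample x_i, matching the convention above.
X = np.column_stack(samples)                        # shape (d, n)
print(X.shape)                                      # (5, 200)
```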
Before making the above statements more precise, we first need to make the assumption that our data vectors have zero mean. Such an assumption incurs no loss of generality, as we can always subtract the sample mean from the original data samples to form a zero-mean vector sequence:
$$x_i \leftarrow x_i - \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad 1 \le i \le n.$$
The approach taken by PCA is to look for data "features" in the data covariance matrix, defined for zero-mean data as
$$C_{XX} = \frac{1}{n-1} X X^T.$$
The reason for looking for features of X in the covariance matrix C_XX is that if a few "features" are enough to characterize the data, we expect the data to lie on a low-dimensional manifold, as opposed to filling the whole d-dimensional space it lives in. If that low-dimensional manifold is not too convoluted, it will lie in a low-dimensional vectorial space and the covariance matrix will capture the few directions along which the data primarily varies. Notice that an intrinsic limitation of PCA is that it can only detect linear features; as such, it is not well suited to discover data features that result from highly non-linear transformations of the raw data.
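As a side illustration (not from the notes themselves), the following numpy sketch performs exactly these two preliminary steps: subtracting the sample mean from every column of the data matrix and forming the covariance matrix of the zero-mean data; the function name is ours, chosen for illustration.

```python
import numpy as np

def center_and_covariance(X):
    """Center the d x n data matrix X (columns are samples) and return its covariance."""
    n = X.shape[1]
    X_centered = X - X.mean(axis=1, keepdims=True)   # subtract the sample mean from each column
    C_XX = X_centered @ X_centered.T / (n - 1)       # d x d covariance matrix of zero-mean data
    return X_centered, C_XX

# For reference, np.cov(X) computes the same matrix: numpy treats rows as
# variables and columns as observations by default, with the 1/(n-1) factor.
```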
The idea behind PCA

Given a data matrix X, let us look for the direction, i.e. the unit vector v, such that the orthogonal projection of the data onto v best captures the overall variability of the data. The projection coefficients of each data sample x_i onto v define a collection of numbers c_i, 1 ≤ i ≤ n, which can be written in vectorial form as
$$c = X^T v.$$
The variance of the data accounted for by the vector v is defined as the variance of the projection coefficients c, which (for zero-mean data) can be expressed as:
$$V(c) = \frac{1}{n-1} \sum_{i=1}^{n} c_i^2 = \frac{1}{n-1} c^T c.$$
Now, our problem is to find the unit vector v for which V(c) is maximum, which can be stated formally as
$$\max_{\|v\| = 1} V(c),$$
where the expression under max indicates that we restrict our search to unit vectors. In order to tackle the above optimization problem, it is convenient to first reformulate the variance V(c) in terms of the covariance matrix C_XX, which contains all the information required for PCA:
$$V(c) = \frac{1}{n-1} (X^T v)^T (X^T v) = v^T \left( \frac{1}{n-1} X X^T \right) v = v^T C_{XX} v.$$
Our optimization problem then consists in finding
$$\max_{\|v\| = 1} v^T C_{XX} v.$$
The only difficulty involved in this problem is the fact that we restrict ourselves to unit vectors. Without this constraint, we would just look for v by setting the derivative of v^T C_XX v with respect to the components of v to zero and solving the resulting system of equations. Although not formally exact, it turns out that this approach gives the right answer anyway. To see why, we need to realize that we can take into account the constraint of unit length by optimizing the function
$$F(v, \lambda) = v^T C_{XX} v - \lambda (v^T v - 1),$$
which depends on v and a new parameter λ called the Lagrange multiplier. Notice that whenever v^T v ≠ 1, it is always possible to make F as positive or as negative as desired by varying λ, and there is no optimum. These loose observations are the reason for introducing the function F; a complete justification of this fact is beyond the scope of this class. Let us directly proceed with the optimization of F by first computing the derivatives with respect to v_k:
$$\frac{\partial F}{\partial v_k} = 2 \sum_{j=1}^{d} (C_{XX})_{k,j} v_j - 2 \lambda v_k,$$
where we have used the fact that C_XX is a symmetric matrix. Setting these derivatives to zero yields a system of d equations:
$$\sum_{j=1}^{d} (C_{XX})_{k,j} v_j = \lambda v_k, \qquad 1 \le k \le d.$$
This system can be conveniently written in matrix form as
$$C_{XX} v = \lambda v,$$
making apparent the fact that the vector v that maximizes the projected variance is an eigenvector of C_XX. Moreover, setting the derivative of F with respect to λ to zero yields
$$v^T v = 1,$$
which is not a problem since eigenvectors are defined modulo their length: we can always choose a unit eigenvector. Now, which eigenvector should we choose? By the spectral theorem, we know that, in general, symmetric matrices have d distinct real eigenvalues s_i, 1 ≤ i ≤ d. Moreover, for covariance matrices, we know that these eigenvalues are all non-negative. Suppose we pick an eigenvector v of unit length associated with eigenvalue λ. Then we have
$$v^T C_{XX} v = v^T (\lambda v) = \lambda\, v^T v = \lambda,$$
where the first equality is due to the fact that v is a λ-eigenvector and the last equality is due to the fact that ‖v‖ = 1. This shows that, to maximize v^T C_XX v, one has to choose an eigenvector associated with the top eigenvalue of the spectral decomposition of C_XX.
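The result of this derivation is easy to check numerically. Below is a minimal sketch (our own, not part of the notes) that obtains the top eigenvector with np.linalg.eigh, which returns the eigenvalues of a symmetric matrix in ascending order, and verifies that no other unit vector accounts for more projected variance.

```python
import numpy as np

def top_principal_direction(C_XX):
    """Return the unit eigenvector of C_XX with the largest eigenvalue, and that eigenvalue."""
    eigenvalues, eigenvectors = np.linalg.eigh(C_XX)   # ascending eigenvalues, orthonormal columns
    return eigenvectors[:, -1], eigenvalues[-1]

# Sanity check on synthetic data: the projected variance v^T C_XX v of any
# unit vector never exceeds the top eigenvalue, as the argument above shows.
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 100))
A = A - A.mean(axis=1, keepdims=True)
C = A @ A.T / (A.shape[1] - 1)
v, s_top = top_principal_direction(C)
u = rng.normal(size=4)
u /= np.linalg.norm(u)                                 # arbitrary unit vector
assert u @ C @ u <= s_top + 1e-12
```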
The above analysis shows that we can extract from the covariance matrix a particular direction that best accounts for the data variability. The key idea is to use linear algebra to find that direction as the top eigenvector of the data covariance matrix. The following section generalizes this idea by considering all the eigenvectors of the covariance matrix and introduces PCA as the "best" change of basis to account for data variability.

PCA and change of basis
The most direct way to understand PCA is to consider the following linear algebra question: assuming the data lives in a d-dimensional vectorial space, what is the best choice of basis to represent the data? Intuitively, a "good" basis would be a basis in which we expect the data structure to be salient. However, it is hard to imagine an automated procedure that produces such a basis without knowledge of the data's characteristic "features" in the first place. Alternatively, we can try to find the basis in which the data covariance matrix is as simple as possible, that is, in diagonal form. Remember that the data covariance matrix is diagonal if the components of the centered data vector are uncorrelated. PCA achieves exactly this goal.
To see how it works, let us remember that a change of basis affects the coordinates of the data via a change-of-basis matrix P. Specifically, if x_i is the original data coordinate vector, the new coordinate vector y_i is obtained via matrix multiplication by P: y_i = P x_i. Incidentally, we can consider the data matrix in the new coordinates: Y = P X, where P is the same yet-to-be-defined change-of-basis matrix that simplifies the data covariance. To find P, we are going to use the fact that the covariance matrix of the new coordinates C_YY is related to the covariance matrix of the original coordinates C_XX by:
$$C_{YY} = \frac{1}{n-1} Y Y^T = \frac{1}{n-1} (P X)(P X)^T = P \left( \frac{1}{n-1} X X^T \right) P^T = P C_{XX} P^T,$$
which makes apparent what the "good" choice for the change-of-basis matrix P is. From the previous section, we know that a good candidate basis should include the top eigenvector of C_XX. This suggests using the spectral theorem to consider the full eigen decomposition of C_XX,
$$C_{XX} = V D V^T, \qquad D = \mathrm{diag}(s_1, \dots, s_d),$$
where the eigenvalues are ordered such that s_1 ≥ s_2 ≥ ... ≥ s_d ≥ 0 and the matrix V is orthogonal, i.e. V V^T = I. This allows us to rewrite the covariance C_YY in the form
$$C_{YY} = P C_{XX} P^T = P V D V^T P^T = (P V) D (P V)^T.$$
Choosing P equal to the transpose of the orthogonal matrix obtained via the eigen decomposition, P = V^T, gives P V = V^T V = I and therefore
$$C_{YY} = D.$$
Thus, when considered in the basis defined by the eigenvectors of C_XX, the covariance of the data is equal to the diagonal matrix D, whose diagonal entries satisfy s_1 ≥ s_2 ≥ ... ≥ s_d ≥ 0. As intended, all the off-diagonal terms are zero, which means that the covariance between the data components in the eigenvector basis is zero: ⟨y_i y_j⟩ = 0 for i ≠ j. Moreover, as V is an orthogonal matrix, we can interpret the components of y as the projection coefficients of x onto the eigenvectors v_i, 1 ≤ i ≤ d, which constitute an orthonormal basis: y_i = v_i^T x.
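As an illustration of this change of basis (a sketch of ours, assuming the centered d x n data matrix of the previous sections), the snippet below builds P = V^T from the eigen decomposition of C_XX and checks that the covariance of Y = P X is indeed the diagonal matrix D.

```python
import numpy as np

def diagonalize_covariance(X_centered):
    """Change of basis to the eigenvectors of the covariance of X_centered (d x n, zero-mean rows)."""
    n = X_centered.shape[1]
    C_XX = X_centered @ X_centered.T / (n - 1)
    eigenvalues, V = np.linalg.eigh(C_XX)        # ascending order
    order = np.argsort(eigenvalues)[::-1]        # reorder so that s_1 >= s_2 >= ... >= s_d
    eigenvalues, V = eigenvalues[order], V[:, order]
    P = V.T                                      # change-of-basis matrix
    Y = P @ X_centered                           # coordinates in the eigenvector basis
    C_YY = Y @ Y.T / (n - 1)
    # Off-diagonal entries of C_YY vanish: the new components are uncorrelated.
    assert np.allclose(C_YY, np.diag(eigenvalues), atol=1e-8)
    return Y, eigenvalues, V
```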
In turn, we can interpret the eigenvalue s_i as the variance of the data when projected onto the eigenvector v_i. Depending on the field of study, the eigenvectors v_i are also called principal components or singular vectors. These eigenvectors can be thought of as data "features" that can be retrieved from the data covariance matrix. Projecting the data onto the first k eigenvectors produces a k-dimensional representation while preserving as much of the data variability as possible. Indeed, the data variability captured by the first k eigenvectors is the sum of the first k eigenvalues, s_1 + ... + s_k, where we remember that the eigenvalues are ranked in decreasing order. The fraction of the data variability accounted for by the first k components is therefore given by
$$f_k = \frac{s_1 + \dots + s_k}{s_1 + \dots + s_k + \dots + s_d},$$
where the denominator s_1 + ... + s_d is the total variance of the data.
The closer f_k is to one, the more faithful the projection, i.e. the more accurate the dimensionality reduction.
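Putting the last two points together, here is a minimal sketch (our own, with a hypothetical threshold) of dimensionality reduction by keeping the first k eigenvectors, together with the fraction f_k of the variance retained.

```python
import numpy as np

def reduce_dimension(X_centered, k):
    """Project the zero-mean d x n data onto its first k principal directions."""
    n = X_centered.shape[1]
    C_XX = X_centered @ X_centered.T / (n - 1)
    eigenvalues, V = np.linalg.eigh(C_XX)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, V = eigenvalues[order], V[:, order]
    Y_k = V[:, :k].T @ X_centered                    # k x n reduced representation
    f_k = eigenvalues[:k].sum() / eigenvalues.sum()  # fraction of total variance retained
    return Y_k, f_k

# A common heuristic (not prescribed by these notes) is to pick the smallest k
# whose f_k exceeds a chosen threshold, e.g. 0.9.
```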