A friendly introduction to Principal Component Analysis

part 1: minimizing reconstruction error

We will work from the outside in: we will view PCA first as a way of finding a smaller representation of a dataset. This is a typical machine learning problem: find a compressed representation of the data such that the reconstructions are as close to the original as possible. This is a simple view of PCA, an we’ll be able to compute it with nothing more than gradient descent with a few extra tricks for satisfying constraints.

Most of the technical stuff only becomes necessary when we want to understand why PCA works so well: this is where the spectral theorem and the eigenvalues and -vectors, come in to the story, they give us a deeper understanding of what we’re doing. We’ll look at these subjects in part two.

The spectral theorem is the heart of the method, so it pays to discuss it in some detail. We’ll state it and explain what it means in part 2, and leave the proof to part 3.

Finally, in part 4 we’ll look at the the singular value decomposition. This matrix decomposition is not only the most popular method for computing the solutions to PCA, it is also the go-to method for computing many related problems, like least squares fits, and matrix ranks.

The basics

Let’s begin by setting up some basic notation. We are faced with some high-dimensional dataset of instances (examples of whatever we’re studying) described with real-valued features. That is, we have \(\gc{n}\) instances \(\x_\gc{i}\) and each instance is described by a vector of \(\bc{m}\) real values. We describe the dataset as a whole as an \(\gc{n} \times \bc{d}\) matrix \(\X\); that is, we arrange the examples as rows, and the features as columns.

For a simple example, imagine that we have a dataset of 100 people with 2 features measured per person: their monthly salary and their income over the course of a quarter (i.e. three months). The second is just the first times 3, so this data is redundant. One value can be computed from the other, so we really only need to store one number per person. Here’s what that looks like in a scatterplot.

Our intuition that we really only need one number to represent the data is reflected in the fact that the data form a line. So long as we know what that line is, we only need to store how far along the line each instance is.

Of course, data in the wild is never this clean. Let’s introduce some small variations between the monthly salary and the income after three months. Some people may have changed jobs, some people may get bonuses or vacation allowances, some people may have extra sources of income. Here’s a more realistic version of the data.

The data is no longer perfectly linear, but it still seems pretty linear. If we imagine the same line we imagined on the last plot, and represent each person as a dot along that line, we lose a little information, but we still get a pretty good reconstruction of the data.

If you know how to do linear regression, you can probably work out how to draw such a line through the data, predicting the income from the salary or the other way around (and PCA is very similar to linear regression in many ways). However we’ll need something that translates to higher dimensions, where we don’t have a single target feature to predict. To do so, we’ll develop a method purely from these first principles:

  1. We want to represent the data using a small set of numbers per instance.
  2. We will limit ourselves to linear transformations of the data to derive this small set.
  3. We want to minimize the error in our reconstruction of the data. That is, when we map the data back to the original representation, as best we can, we want it to be as close as possible to the original.

One-dimensional PCA

We'll develop a one-dimensional version of PCA first. That is, we will represent each instance $\x_\gc{i}$ (row $\gc{i}$ in our data matrix $\X$), by a single number $z_\gc{i}$ as best we can. We will call $z_\gc{i}$ the latent representation of $\x_\gc{i}$.

To start with, we will assume that the data are mean-centered. That is, we have subtracted the mean of the data so that the mean of the new dataset is $0$ for all features.

Our task is to find a linear transformation from \(\x_\gc{i}\) to \(z_\gc{i}\), and another linear transformation back again.

A linear transformation from a vector $\x_\gc{i}$ to a single number is just the dot product with a vector of weights. We'll call this vector $\rc{\v}$. A linear transformation from a single number $z_\gc{i}$ to a vector is just the multiplication of $z_\gc{i}$ by another vector of weights. We'll call this vector $\rc{\w}$. This gives us

$$ \begin{align*} z_\gc{i} &= \rc{\v}^T \x_\gc{i} \\ \x'_\gc{i} &= z_\gc{i} \rc{\w} \end{align*} $$

where $\x'_\gc{i}$ is the reconstruction for instance $\gc{i}$.

Look closely at that second line. It expresses exactly the intuition we stated earlier: we will choose one line, represented by the vector \(\rc{\w}\), and then we just represent each data point by how far along the line it falls, or in other words, we represent \(\x’_\gc{i}\) as a multiple of \(\rc{\w}\).

All allowed values of \(\x'_\gc{i}\) are on the dotted red line, which is defined by our choice of \(\rc{\w}\). Where each individual ends up is defined by the multiplier \(z_\gc{i}\), wich is determined by the weights \(\rc{\v}\). Our objective is to choose \(\rc{\v}\) and \(\rc{\w}\) so that the reconstruction error, the distance between \(\x_\gc{i}\) and \(\x'_\gc{i}\) is minimized (over all \(\gc{i}\)).

We can simplify this picture in two ways.

First, note that many different vectors $\rc{\w}$ define the same dotted line in the image above. So long as the vector points in the same direction, any length of vector defines the same line, and if we rescale \(z_\gc{i}\) properly, the reconstructed points \(\x'_\gc{i}\) will also be the same. To make our solution unique, we will constrain \(\rc{\w}\) to be a unit vector. That is, a vector with length one: \(\rc{\w}^T\rc{\w} = 1\).

The second simplification is that we can get rid of \(\rc{\v}\). Imagine that we have some fixed \(\rc{\w}\) (that is, the dotted line is fixed). Can we work out which choice of \(\x_\gc{i}\) on the line will minimize the reconstruction error? We will go into a bit of detail here, since it helps to set up some intuitions that will be important in the later sections.

If you remember your linear algebra, you'll know that this happens when the line of the reconstruction error is orthogonal to the dotted line. In more fancy language, the optimal \(\x_\gc{i}\) is the orthogonal projection of \(\x_\gc{i}\) onto the dotted line. If you look at the image it's not too difficult to convince yourself that this it true. You can imagine the reconstruction error as a kind of rubber band pulling on $\x_\gc{i}$, and the point where it's orthogonal is where it comes to rest.

In higher dimensions, however, such physical intuitions will not always save us. Since the relation between orthogonality and least squares is key to understanding PCA, we will take some time to prove this properly.

Best approximation theorem (1D) Let $\rc{\w}, \x \in \mathbb{R}^n$, let $\rc{W}$ be the line of all multiples of \(\rc{\w}\), and let $\rc{\hat \w}$ be the orthogonal projection of $\x$ onto $\rc{W}$. Then, for any other $\rc{\bar \w}$ in $\rc{W}$, we have $$ \text{dist}(\x, \rc{\hat \w}) < \text{dist}(\x, \rc{\bar \w}) $$ where $\text{dist}(\a, \b)$ denotes the Euclidean distance $\|\a - \b\|$

Proof (Adapted from [2, Ch. 7 Thm. 9]). Note that $\a - \b$ is the vector that points from the tip of $\a$ to the bottom of $\b$. We can draw three vectors $\gc{\bar \w - \x}$, $\bc{\hat\w -\x}$ and $\rc{\bar \w - \hat \w}$ as follows:

By basic vector addition, we know that

$$ \gc{\bar \w - \x} = \bc{\hat\w -\x} + \rc{\bar \w - \hat \w} \text{,} $$

so the three vectors form a triangle (when we arrange them as shown in the picture).

We also know, by construction, that $\bc{\hat\w -\x}$ is orthogonal to $\rc{\bar \w - \hat \w}$, so the triangle is right-angled.

Since we have a right angled triangle, the Pythagorean theorem tells us that the lengths of the sides of the triangles are related by

$$ \gc{\text{dist}(\x, \bar \w)}^2 = \bc{\text{dist}(\hat \w, \x)}^2+ \rc{\text{dist}(\bar \w, \hat \w)}^2 \p $$

Since $\rc{\text{dist}(\bar \w, \hat \w)} > 0$ (because they are not the same point), we know that $\gc{\text{dist}(\x, \bar \w)}$ must be strictly larger than $\bc{\text{dist}(\hat \w, \x)}$.

So, the best reconstruction $\x'_\gc{i}$ of the point $\x_\gc{i}$ on the line defined by $z \cdot \rc{\w}$ (however we choose $\rc{\w}$) is the orthogonal projection of $\x_\gc{i}$ onto $\rc{\w}$. So how do we compute an orthogonal projection? Let's look at what we have:

We've projected $\x_\gc{i}$ down onto $\rc{\w}$ and we've given the vector from the projection to the original the name $\bc{\r}$. By vector addition we know that $z\rc{\w} + \bc{\r} = \x_\gc{i}$, so $\bc{\r} = \x_\gc{i} - z\rc{\w}$.

Two vectors are orthogonal if their dot product is zero, so we're looking for a $z$ such that $z\rc{\w}^T\bc{\r} = 0$, or equivalently, $\rc{\w}^T\bc{\r} = 0$. We rewrite

\[0 = \rc{\w}^T\bc{\r} = \rc{\w}^T\left(\x_\gc{i} - z\rc{\w}\right) = \rc{\w}^T\x_\gc{i} - z\rc{\w}^T\rc{\w} \p\]

This gives us $z = \rc{\w}^T\x_\gc{i} / \rc{\w}^T\rc{\w}$. And, since we'd already defined $\rc{\w}$ to be a unit vector (so $\rc{\w}^T\rc{\w} = 1$), we get $z = \rc{\w}^T\x_\gc{i}$.

Let's retrace our steps. We had two weight vectors: $\rc{\v}$ to encode $\x_\gc{i}$ into the single number $z_\gc{i}$, and $\rc{\w}$ to decode $\x_\gc{i}$ as $z_\gc{i}\rc{\w}$. We've now seen that for any given $\rc{\w}$, the best choice of $z_\gc{i}$ is $\rc{\w}^T\x_\gc{i}$. In other words, we can set $\rc{\v}$ equal to $\rc{\w}$ and use it to encode and to decode.

So, after all that, we can finally state precisely what we're looking for. Given $\rc{\w}$ our reconstruction is $\x'_\gc{i} = z_\gc{i} \cdot \rc{\w} = \oc{\w^T\x_i \cdot \w}$ . This means we can state our goal as the following constrained optimization problem:

\[\begin{align*} &\argmin{\rc{\w}} \sum_\gc{i}|| \oc{\w^T\x_i \cdot \w} - \x_\gc{i} || \\ &\;\;\;\;\text{such that } \rc{\w}^T\rc{\w} = 1 \p \end{align*}\]

How do we solve this? This is a simple problem and there are fast ways to solve it exactly. But we’ve done a lot of math already, and it’s time to show you some results, so we’ll just solve this by gradient descent for now. Basic gradient descent doesn’t include constraints, but in simple cases like these, we can use projected gradient descent: after each gradient upate, we project the parameters back to the subset of parameter space that satisfies the constraint (in this case simply by dividing $\rc{\w}$ by its length).

We start by initializing $\rc{\w}$ to some arbitrary direction. Here’s what the projections of the income data onto that $\rc{\w}$ look like.

The sum of the lengths of the blue lines is what we want to minimize. Clearly, there are better options than this choice of $\rc{\w}$. After a few iterations of gradient descent, this is what we end up with.

You can think of the blue lines of the reconstruction error as pulling on the line of $\rc{\w}$ and the line of $\rc{\w}$ as pivoting on the origin.

For any dataset (of however many dimensions), the optimal line $\rc{\w}$ is uniquely determined. It’s called the first principal component.

What can we say about the meaning of the elements of $\rc{\w}$? Remember that it does two things: it encodes from $\x$ to $z$ and it decodes from $z$ to $\x’$. The encoding is a dot product: a weighted sum over the elements of $\x$.

In the first example of the income data, before we added the noise, the second feature was always exactly three times the first feature. In that case, we could just remember the first feature, and forget the second. That would be equivalent to encoding with the vector $\rc{(1, 0)}$. The compressed representation $z$ would be equivalent to the first feature and we could decode with $z \times\rc{(1, 3)}$. Or, we could encode with $\rc{(0, 1)}$ and decode with $\rc{(\tfrac{1}{3}, 1)}$.

Why are the encoding and decoding vectors different in these cases? Because when we proved that they were the same, we assumed that they were unit vectors. Our encoding vector is a unit vector, but the corresponding decoding vector isn’t. PCA provides solution for which the encoder and the decoder are the same. It takes a mixture of both features, in different proportions (in our case $1/\sqrt{10}$ and $3/\sqrt{10}$) . There are a a lot of perspectives on exactly what this mixture means. We’ll illustrate the first by looking at a higher dimensional dataset.

We’ll use a dataset of grayscale images of faces called the Olivetti dataset [1]. Here is a small sample:

We will describe each pixel as a feature with a value between 0 (black) and 1 (white). The images are \(64 \times 64\) pixels, so each image can be described as a single vector of \(\bc{4096\text{ real values}}\) .

The Olivetti data contains \(\gc{400 \text{ images}}\), so we end up with a data matrix \(\X\) of \(\gc{400} \times \bc{4096}\).

This is a data scientist’s worst nightmare: data with many more features than instances. With so many features, the space of possible instances is vast, and we only have a tiny number to learn from. Our saving grace is that, like the income/salary example, the features in this dataset are highly dependent: knowing the value of one pixel allows us to say a lot about the value of other pixels.

For instance, pixels that are close together more often than not have similar values. The images are often roughly symmetric. All faces will have mostly uniform patches of skin in roughly the same place, and so on.

In short, while our dataset is expressed in \(\bc{4096}\) dimensions, we can probably express the same information in many fewer numbers, especially if we are willing to trade off a little accuracy for better compression.

The procedure we will use to find the first principal component for this data is exactly the same as before—search for a unit vector that minimizes the reconstruction loss—except now the instances have 4096 features, so $\rc{\w}$ has 4096 dimensions. First, let’s look at the reconstructions.

You’re disappointed, I can tell. Well, to be fair, we’ve compressed each image into a single number, we shouldn’t be surprised that there isn’t much left after we reconstruct it. But that doesn’t mean that 1D PCA doesn’t offer us anything useful.

What we can do is look at the first principal component in data space: $\rc{\w}$ is a vector with one element per pixel, so we can re-arrange it into an image and see what each element in the vector tells us about the original pixels of the data. We’ll color the positive elements of $\rc{\w}$ red and the negative values blue.

It looks like this:

If we think of this as the encoding vector, we can see a heatmap of which parts of the image the encoding looks at: the darker the red, the more the value of that pixel is added to $z$. The darker the blue, the more it is subtracted.

If we think of this as the decoding vector, we can see that the larger $z$ is, the more of the red areas gets added to the decoded image, but the more of the blue areas get subtracted. That is, two red pixels are positively correlated, and a red and a blue pixel are negatively correlated. A bright red pixel and a light red pixel have the same relation as our monthly salary and quarterly income: one is (approximately) a multiple of the other.

Another interpretation of the principal component is that it places the images in the dataset on a single line, which mean it orders the images along a single direction.

To see if there’s any interpretable meaning to this ordering, we try moving along the line and decoding the points we find. We can start at the origin (the vector $\mathbf 0$). If we decode that, we get the mean of our dataset (the so called mean face). By adding or subtracting a little bit of the principal component, we can see what happens to the face.

A few things are happening at once: the skin is becoming less clear, the lines in the face become more pronounced, the glasses becomes more pronounced and the mouth curves upward. Most of these are consistent with moving from a young subject to an old one.

We can test this on the faces in our dataset as well; our principal component is a direction in the data space, so we can start with an image from our data, and take small steps in the direction of the principal component, or in the opposite direction.

Here’s what we get.

Note the manipulation of the mouth, in particular in face 41. As the corners of the mouth go up, the bottom lip goes from curving outward, with a shadow under the lip to curving inwards, folding under the teeth, with the shadow turning into a highlight.

Here we see the real power of PCA. While the reconstructions may not be much to write home about, the principal component itself allows us, using nothing but a linear transformation consisting of a single vector, to perform realistic manipulation of facial data, based on a dataset of just 400 examples.

n-Dimensional PCA

Enough playing around in a one dimensional latent space. What if we want to improve our latent representations by giving them a little more capacity? How do we do that in a way that gives us better reconstructions, but keeps the meaningful directions in the latent space?

Let’s start by updating our notation. $\x_\gc{i}$ is still the same vector, row $\gc{i}$ in our data matrix $\X$, containing $\bc{m}$ elements, as many as we have features. $\z_\gc{i}$ is now also a vector (note the boldface). $\z_\gc{i}$ has $\rc{k}$ elements, where $\rc{k}$ is the number of latent features, a parameter we set. In the example above, $\rc{k}=1$. We’ll drop the $\gc{i}$ subscript to clarify the notation.

Let's say we set $\rc{k}=2$. If we stick to the rules we've followed so far—linear transformations by unit vectors—we end up with the following task: find two unit vectors $\rc{\w}_\rc{1}$ and $\rc{\w}_\rc{2}$ and define $\z = (z_\rc{1}, z_\rc{2})$ by $z_\rc{1} = \x ^T\rc{\w}_\rc{1}$ and $z_\rc{2} = \x^T \rc{\w}_\rc{2}$. Each latent vector gives us a reconstruction of $\x$. We sum these together to get our complete reconstruction.

We can combine the two vectors $\rc{\w_1}$ and $\rc{\w_2}$ in a single matrix $\rc{\W}$ (as its columns) and write

\[\begin{align*} \z &= \rc{\W}^T\x \\ \x' &= \rc{\W}\z \end{align*}\]

Or, in diagram form:

This would already work fine as a dimensionality reduction method. You can think of this as an autoencoder, if you’re familiar with those. However, we can add one more rule to improve our reduced representation. We will require that $\rc{\w_2}$ is orthogonal to $\rc{\w_1}$.

In general, each component $\rc{\w}_r$ we add should be orthogonal to all components before it: for $\rc{k} = 3$ we add another unit vector $\rc{\w_3}$, which should be orthogonal to both $\rc{\w_1}$ and $\rc{\w_3}$.

We can summarize these constraints neatly in one matrix equation: the matrix $\rc{\W}$, whose columns are our $\rc{\w}$ vectors, should satisfy:

\[\rc{\W}^T\rc{\W} = \I\]

where $\I$ is the $\rc{k} \times \rc{k}$ identity matrix. This equation combines both of our constraints: unit vectors, and mutually orthogonal vectors. On the diagonal of $\rc{\W}^T\rc{\W}$, we get the dot product of every column of $\rc{\W}$ with itself (which should be $1$ so that it is a unit vector) and off the diagonal we get the dot product of every column of $\rc{\W}$ with every other column (which should be $0$, so that they are orthogonal).

How do we find our $\rc{\W}$? The objective function remains the same: the sum of squared distances between the instances $\x$ and their reconstructions $\x’$. To satisfy the constraints, we can proceed in two different ways. We’ll call these the combined problem and the iterative problem.

The combined problem is simply to add the matrix constraint above and stick it into our optimization function. This gives us

\[\begin{align*} &\argmin{\rc{\W}} \sum_\x ||\rc{\W}\rc{\W}^T\x - \x||^2 \\ &\;\;\text{such that } \rc{\W}^T\rc{\W} = I \end{align*}\]

The iterative problem defines optima for the vectors $\rc{\w_r}$ in sequence. We use the same one-dimensional approach as before, and we find the principal components one after the other. Each step we add the constraint that the next principal component should be orthogonal to all the ones we’ve already found.

To put it more formally, we choose each $\rc{\w_1}, \ldots, \rc{\w_k}$ in sequence by optimizing

\[\;\;\;\;\;\;\;\; \rc{\w_r} = \begin{cases} &\argmin{\rc{\w}} \sum_\x ||\x^T\rc{\w} \times \rc{\w} - \x||^2 \\ &\;\;\;\;\;\;\text{such that } \rc{\w}^T\rc{\w} = 1, \\ &\;\;\;\;\;\;\text{and } \rc{\w}\perp\rc{\w_i} \text{ for } \rc{i} \in [1 \ldots \rc{r}-1] \end{cases}\]

These approaches are very similar. In fact, they’re sometimes confused as equivalent in the literature. Let’s look how they relate in detail.

The vector $\rc{\w_1}$ defined in the iterative problem, is the same vector we found in the one dimensional setting above: the first principal component. If the two problems are equivalent (i.e. they have the same set of solutions), this vector should always be one of the columns of $\rc{\W}$ in the combined problem.

To show that this isn’t guaranteed, we can look at the case where $\rc{k} = \bc{m}$. That is, we use as many latent features as we have features in our data. In this case, the first vector $\rc{\w}$ returned by the iterative approach is still the first principal component, as we’ve defined it above. However, for the combined approach, we can set $\rc{\W} = \I$ for a perfect solution: clearly the columns of $\I$ are orthogonal unit vectors, and $\rc{\W}\rc{\W}^T\x - \x = \x^T\I\I^T - \x = \x -\x = \mathbf{0}$, so the solution is optimal.

In short, a solution to the combined problem may not be a solution to the iterated approach. What about the other way around, does solving the iterated problem always give us a solution to the combined problem? Certainly the vectors returned are always mutually orthogonal unit vectors, so the constraint is satisfied. Do we also reach a minimum? It turns out that we do.

Optimality of the iterative approach. A solution to the iterative problem is also a solution to the combined problem.

We will prove this in the second part. For now, you’ll have to take my word for it. The combined problem has a large set of solutions, and the iterative approach provides a kind of unique point of reference within that space.

We can say more about this later, but for now we will equate the iterative solution with PCA: the $\rc{\W}$ which not only minimizes the reconstruction error as a whole, but also each column of $\rc{\W}$ minimizes the reconstruction error in isolation, constrained to the subspace orthogonal to the preceding columns. The combined problem does not give us the principal components.

Using the iterative approach, and solving it by projective gradient descent, we can have a look at what the other principal components look like. Let’s start with our income data, and derive the second principal component.

This is a bit of an open door: in two dimensions, there is only one direction orthogonal to the first principal component. Still, plotting both components like this gives some indication of what the second component is doing. The first component captures the main difference between the people in our dataset: their monthly salary. The second component captures whatever is left; all the noise we introduced like end of year bonuses and alternative sources of income.

Note how PCA reconstructs the original data points. Given the components $\rc{\w_1}, \ldots, \rc{\w_k}$, we represent each $\x$ as a weighted sum over these vectors where the latent features $z_\rc{1}, \ldots, z_\rc{k}$ are the weights: $$ \x' = z_\rc{1}\rc{\w_1} + \ldots + z_\rc{k}\rc{\w_k} $$

That’s it for the income dataset. We’ve reached $\rc{k} = \bc{m}$, so we can go no further.

Let’s turn to the dataset of faces, where there are many more principal components to explore.

If we compute the first 30 principal components, we get the following reconstructions.

You can still tell the originals from the reconstructions, but many of the salient features now surivive the compression process: the direction of the gaze, the main proportions of the face, the basic lighting, and so on. By looking at the first five principal components, we can see how this is done.

The first one we’ve seen already. It’s flipped around, this time, with the blues and reds reversed, but it defines the same line in space. Note that the magnitude of the higher PCs is much lower: the first principal component does most of the work of putting the image in the right region of space, and the more PCs we add, the more they fine-tune the details.

The second PC captures mostly lighting information. Adding it to a picture adds to the left side of the image, and subtracts from the right side. We can see this by applying it to some faces from the data.

The third PC does the same thing, but for top-to-bottom lighting changes. The fourth is a bit more subtle. It’s quite challenging to tell from the plot above what the effect is. Here’s what we see when we apply it to some faces.

The PC seems to capture a lot of the facial features we associate with gender. The faces on the left look decidedly more “female”, and the faces on the right more male. It’s a far cry from the face manipulation methods that are currently popular, but considering that we have only 400 examples, we are only allowed a linear transformation, and that the method originated in 1901 [4], it’s not bad.

Before we finish up, let’s look at two examples of PCA, as it’s used in research.

Let’s start with a problem which crops up often in the study of fossils and other bones. A trained anatomist can look at, say, a shoulder bone fossil, and tell instantly whether it belongs to a chimpanzee (which is not very rare) or an early ancestor of humans (which is extremely rare). Unfortunately such judgements are usually based on a kind of unconscious instinct, shaped by years of experience, which makes it hard to back it up scientifically. “This is is Hominid fossil, because it looks like one to me,” isn’t a very rigorous argument.

PCA is often used to turn such a snap judgement into a more rigorous analysis. We take a bunch of bones that are entirely indistinguishable to the layperson, and we measure a bunch of features, like the distances between various parts on the bone. We then apply PCA and scatterplot the first two principal components.

Here is what such a scatterplot looks like for a collection of scapulae (shoulder bones) of various great apes and hominids.

Reproduced from [4].

The figure is from [4], which is available online, but the literature is full of pictures like these. Here, the authors took scans of about 350 scapulae. We can clearly see different species forming separate clusters. If we find a new scapula, we can simply measure it, project it to the first two principal components, and show that it ends up among the Homo Ergaster to prove that our find is special. What’s more, not only can we tell the Hominin fossils apart, we see that they seem to lie on a straight line from chimpanzees to modern humans, giving a clue as to how human evolution progressed.

One final example, to really hammer home the magic of PCA. In [5] (also available), the authors applied PCA to a database of 1387 European DNA sequences. For each instance, they measured about half a million sites on the DNA sequences to use as features. This is similar to the face example: many more features than instances.

The authors applied PCA, and plotted the first two principal components. They colored the instances by the person’s ancestral country of origin (the country of origin of the grandparents if available, otherwise the person’s own country of origin). Here is what they saw.

Reproduced from [5].

The first principal component corresponds roughly to how far north the person lives (or their ancestors did) and the second principal component to how far east they live. This means that a scatterplot shows up as a fuzzy map of Europe. If we sent a thousand DNA samples off to aliens on the other side of the galaxy, they could work out a rough image of the geography of our planet.


Wrapping up, what have we learned so far? We’ve defined PCA as an iterative optimization problem designed to compress high dimensional data into fewer dimensions, and to minimize the resulting reconstruction loss. We’ve shown one simple way to find this solution, a laborious and inaccurate way, but enough to get the basic idea across. We then looked at various practical uses of PCA: analyzing human faces, human fossils, and human DNA. We showed that in many cases, PCA magically teases out quite high-level semantic features hidden in the data: the species of the fossil, the location of the subject in Europe, or the subject’s age.

What we haven’t discussed fully, is where this magical property comes from. We’ve shown that it isn’t just the compression objective, since optimizing just that in one single optimization doesn’t lead to the PCA solution. Among the set of all solutions that minimize the reconstruction error, the PCA solution takes a special place. Why that’s the case, and why it this should emerge form optimizing the principal components one by one, greedily if you will, instead of all together, we will discuss in the next part. To do so, we’ll need to dig into the subject of eigenvectors, the underlying force behind almost everything that is magical about linear algebra.

Acknowledgements. Many thanks to Emile van Krieken for corrections and suggestions.


[1] Olivetti Faces data. (sklearn) Produced by AT&T Laboratories Cambridge, between 1992 and 1994.
[2] Linear Algebra and its applications. David C. Lay, 1994. Addison Wesley.
[3] On Lines and Planes of Closest Fit to Systems of Points in Space. K. Pearson. 1901 doi:10.1080/14786440109462720.
[4] Fossil hominin shoulders support an African ape-like last common ancestor of humans and chimpanzees, Young et al. 2015 PNAS
[5] Genes mirror geography within Europe, Novembre et al. 2008 Nature.


Helpful linear algebra properties

The following properties may be good to reference when following the proofs and more technical parts in this series. Anything that holds for matrix multiplication holds fort vector/matrix multiplication and for vector/vector multiplication as a special case.

If you want an exercise to test yourself, see which of the following properties you need to show that: for a dataset $\x_1, \ldots, \x_\gc{n}$ the dot product of the mean of the data with a vector $\rc{\w}$ is the same as the mean of the dot products of each individual instance with $\rc{\w}$.

General best approximation theorem

We proved earlier that the orthogonal projection of a point $\x$ onto a line $\rc{W}$ is the closest point in $\rc{W}$ to $\x$. To generalize that result, we’ll take at a set of vectors $\rc{\w_1}, \ldots, \rc{\w_k}$ and look at the subspace of points defined by all sums of all multiples of one or more of the vectors. This subspace is called the space spanned by the vectors, which we write as $\text{Span } \{ \rc{\w_1}, \ldots, \rc{\w_k} \}$. Any point that we can define by summing one or more multiples of vectors in $\{ \rc{\w_1}, \ldots, \rc{\w_k} \}$ is in the set $\text{Span } \{ \rc{\w_1}, \ldots, \rc{\w_k} \}$.

We call a vector $\x$ orthogonal to a set $\rc{\W}$ if it is orthogonal to every vector in $\rc{\W}$

Best approximation theorem (n-dimensional) Let $\rc{\w_1}, \ldots, \rc{\w_k}, \x \in \mathbb{R}^n$, let $\rc{W} = \text{Span } \{\rc{\w_1}, \ldots, \rc{\w_k} \}$, and let $\rc{\hat \w}$ be an orthogonal projection of $\x$ onto $\rc{W}$. Then, for any other $\rc{\bar \w}$ in $\rc{W}$, we have

$$ \text{dist}(\x, \rc{\hat \w}) < \text{dist}(\x, \rc{\bar \w}) \p $$

Proof. The proof has the same structure as that of the 1-dimensional case. As before, draw three vectors $\gc{\bar \w - \x}$, $\bc{\hat\w -\x}$ and $\rc{\bar \w - \hat \w}$.


By vector addition, these three vectors can be arranged in a triangle.

$\rc{\bar \w}$ and $\rc{\hat \w}$ are both in $\rc{W}$ so $\rc{\bar \w - \hat \w}$ is too.

Since $\rc{\bar \w - \hat \w} \in \rc{W}$, $\bc{\hat\w -\x}$ is orthogonal to $\rc{\bar \w - \hat \w}$ by definition. This makes our triangle a right-angled one again, and as before, we have

$$ \gc{\text{dist}(\x, \bar \w)}^2 = \bc{\text{dist}(\hat \w, \x)}^2+ \rc{\text{dist}(\bar \w, \hat \w)}^2 \p $$

from which the theorem follows.