# Assignment 6: PCA

- Due Mar 22, 2016 by 9am
- Points 10
- Submitting a file upload
- Available Mar 15, 2016 at 9am - Mar 27, 2016 at 9am (12 days)

## PCA

**CLPS 1291: Assignment 6**

Due: **9:00AM** on **Mar 22**

Before you get started:

Look at the Assignment Guidelines for formatting and coding style information, submission guidelines, etc. If you have any questions related to the assignment, please post them in this Discussion.

**As a reminder, we will neither accept answers that fail to follow the given template, nor consider code written outside of the allotted space. We will only review functions that follow our conventions and results documents submitted in the requested form.**

You will need to use this skeleton code and data files to complete the assignment. See the skeleton code for more helpful guidelines.

We expect you to turn in the following:

- results.pdf - a PDF containing your outputs and descriptions.
- assignment6.zip - a zip file containing:
  - assignment6.m - a MATLAB script containing all of the code necessary for this assignment.

**1. PCA and Eigenfaces**

Eigenfaces are sets of eigenvectors that can be used in face recognition applications. Each eigenface, as we'll see in a bit, is a vector of pixel intensities. We can use PCA to determine which eigenfaces account for the largest amount of variance in our data, and to eliminate those that don't account for much. This process lets us determine how many dimensions are necessary to recognize a face as 'familiar'.

Here, we're going to work with a dataset of faces, and determine how many principal components are necessary to represent these faces.

**1. a) Load the Dataset**

First, you'll need to load the face data (available on Canvas). By the end of this step, you should have a 2429x361 array, 'faces'. Each of the 2429 rows in this matrix should contain a 361-dimensional vector corresponding to a 19x19 pixel image. Each individual number in the array represents a pixel intensity.

**1. b) Visualizing Faces**

Now, we'll visualize the faces represented within each of these vectors! First, we'll need to use the 'reshape' function to convert each 1x361 row back into a 19x19 matrix. Do this for 4 random faces (rows) within your 'faces' dataset.

Next, you can visualize each of those four face matrices by using 'imagesc'. Using the 'subplot' function, you should be able to display all four faces in the same figure window.

To make your faces look prettier, use 'colormap(gray)' (sets the images to standard grayscale) and 'axis image' (sets the axes box to fit tightly around the image).
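As a rough illustration of the reshape, subplot, and imagesc steps described above, here is a minimal sketch. It uses random stand-in data in place of the real 'faces' matrix (an assumption, so the images will look like noise), and the variable names 'idx' and 'face' are only illustrative. Depending on how the pixel data was stored, a reshaped image may need to be transposed to appear upright.

```matlab
% Stand-in data (assumption): random values in place of the real face matrix.
faces = rand(2429, 361);

% Pick 4 distinct random row indices.
idx = randperm(size(faces, 1), 4);

figure;
for k = 1:4
    % Convert the 1x361 row back into a 19x19 matrix.
    face = reshape(faces(idx(k), :), 19, 19);

    subplot(2, 2, k);      % 2x2 grid, k-th panel
    imagesc(face);         % show the matrix as a scaled image
    axis image;            % fit the axes box tightly around the image
    title(sprintf('Row %d', idx(k)));
end
colormap(gray);            % standard grayscale for the whole figure
```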

Add this figure to 'results.pdf'.

**1. c) Preprocess faces dataset**

Before we can run PCA on the faces dataset, we'll need to preprocess it. In order to increase the symmetry of our principal components, we'll need to normalize the dataset and add mirror-symmetric faces (matrices complementing those that already exist) to it. In order to do this, we'll need to do a few things:

- Calculate the average intensity of each pixel across all rows of 'faces' and store this as a variable. If done correctly, this variable should contain a 1x361 vector.
- Create a matrix matching the size of 'faces' (2429x361), in which each row is a copy of that average vector.
- Subtract this new matrix from 'faces', and store the result as a variable.
- Finally, use the 'fliplr' function to flip each row in 'faces' from left to right (for example, if row = [1 2 3], fliplr(row) = [3 2 1]). Save all of these flipped rows in a new matrix. This matrix represents a set of mirror-symmetric faces.
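The steps above can be sketched as follows. This is one possible reading rather than the required solution: the data is a random stand-in, all variable names are hypothetical, and two choices are assumptions on our part, namely flipping the mean-subtracted matrix and stacking the flipped rows underneath it to form the updated dataset.

```matlab
% Stand-in data (assumption): random values in place of the real face matrix.
faces = rand(2429, 361);

% Average of each pixel across all faces: a 1x361 row vector.
meanFace = mean(faces, 1);

% Matrix the same size as 'faces' in which every row is the mean vector.
meanMatrix = repmat(meanFace, size(faces, 1), 1);

% Mean-subtracted data.
facesCentered = faces - meanMatrix;

% fliplr reverses the column order of a matrix, which reverses every row.
% Assumption: we flip the centered data rather than the raw data.
facesFlipped = fliplr(facesCentered);

% Assumption: the updated dataset is the centered faces followed by the
% flipped faces, giving a 4858x361 matrix.
facesAll = [facesCentered; facesFlipped];
```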

**1. d) Run PCA!**

Now we can run PCA on the dataset! This should look just like the line of code you used on last assignment's toy dataset. Retrieve the principal components (a.k.a. eigenfaces) and the variance explained by each PC (these should respectively be the first and third output arguments of *pca*).
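For reference, a call that keeps only the first and third outputs might look like the sketch below. It runs on a small random toy matrix rather than the face data, the names 'PC' and 'eigenvalues' are just the ones used elsewhere in this handout, and it assumes the 'pca' function from the Statistics and Machine Learning Toolbox is available.

```matlab
rng(0);                          % fixed seed so the toy example is repeatable
data = randn(200, 10);           % toy stand-in: 200 observations, 10 variables

% First output: principal components, one per column.
% Third output: variance explained by each component.
% The second output (the projected data) is skipped with ~.
[PC, ~, eigenvalues] = pca(data);

disp(size(PC));                  % rows match the number of variables
disp(size(eigenvalues));         % one variance value per component
```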

How many principal components does this return? Why? Add your response to 'results.pdf'.

**1. e) Project the Data into Principal Component Space**

Next, we'll project one of the data points (one of the rows from 'faces') into the principal component space. This will also require a few steps:

Calculate the mean of your updated 'faces' dataset (like you did earlier). Then, retrieve a random row from 'faces' and subtract the mean of faces from that row.

If 'PC' is the matrix of principal components (an output of the 'pca' function) and 'r' is a row of a matrix of data points in the original space, you can project 'r' into the principal component space by multiplying 'r' by 'PC'.

Now, we can reshape the 1x361 row back into a 19x19 square matrix (so that it might somewhat resemble a face when we visualize it later).

Create a new figure window and use 'imagesc' to visualize this row (a face) projected in PC space. Remember to use 'colormap(gray)' and 'axis image' again, so it looks nice!
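The projection steps can be sketched roughly as below. The data is a random stand-in for the preprocessed dataset (an assumption), and 'facesAll', 'rowIdx', 'projected', and 'projImage' are hypothetical names. The second output of 'pca' is kept only as a sanity check, since it should already contain the same projection for every row.

```matlab
rng(1);
facesAll = randn(500, 361);          % stand-in for the preprocessed dataset

[PC, score, ~] = pca(facesAll);      % PC: 361x361, one component per column

% Subtract the dataset mean from one randomly chosen row.
meanFace = mean(facesAll, 1);
rowIdx = randi(size(facesAll, 1));
r = facesAll(rowIdx, :) - meanFace;

% Project the centered row into principal component space.
projected = r * PC;                  % 1x361 vector of PC coordinates

% Sanity check: should be close to zero, because 'score' holds the
% projections that pca computed itself.
disp(max(abs(projected - score(rowIdx, :))));

% Reshape to 19x19 and display.
projImage = reshape(projected, 19, 19);
figure;
imagesc(projImage);
colormap(gray);
axis image;
```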

Does this look like a face? If not, why? Add your figure and response to 'results.pdf'.

**2. Minimally Accounting for Maximal Variance**

**2. a) Principal Components Needed to Account for Most Variance**

Determine the smallest number *n* of top principal components (ranked by variance explained, which is the third output argument of the 'pca' function) that you have to keep in order to account for at least 95% of the total variance.

HINT: 'pca' returns principal components *already sorted* in DESCENDING ORDER by amount of variance explained. So, you just have to do a few things:

Use the function 'cumsum' to calculate the cumulative sum of all of the elements in 'eigenvalues'. Save this as a new variable. If you're having trouble understanding how 'cumsum' works, use the 'help' command!

HINT: If you've done this correctly, the output of 'cumsum' should be the same size as 'eigenvalues'.

The *last* element of your 'cumsum' output is the sum of all the eigenvalues, so you can use it as the TOTAL variance of your 'faces' dataset.

Now, we can use the output of 'cumsum' and the total variance we just found to determine the *smallest* number of components that account for at least 95% of total variance. Save this number as a variable 'n'. HINT: Use the 'find' command to search for these items within the output of 'cumsum'!
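To make the 'cumsum' and 'find' hints concrete, here is a small sketch on made-up numbers. The 'eigenvalues' vector below is hypothetical and is not taken from the face data; 'cumVar' and 'totalVar' are illustrative names.

```matlab
% Hypothetical variances, already sorted in descending order.
eigenvalues = [5; 3; 1.2; 0.5; 0.3];

% Running total of variance: [5; 8; 9.2; 9.7; 10].
cumVar = cumsum(eigenvalues);

% The last element is the total variance.
totalVar = cumVar(end);

% Index of the first component count that reaches at least 95% of the total.
n = find(cumVar >= 0.95 * totalVar, 1, 'first');

disp(n);   % 4 for these made-up values (9.2 is below 9.5, 9.7 is above)
```

Passing 1 and 'first' to 'find' returns only the first matching index, which is the smallest number of components that meets the threshold.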

**2. b) Visualize First n PCs**

Show the first *n* principal components you determined above as images. Use 'colormap(gray)' and 'axis image' again, to make everything pretty!

HINT: Use the 'subplot' function you used to visualize faces earlier to display all of these principal components in one figure window.
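One way to lay out an arbitrary number of panels is sketched below. The value of 'n' here is a placeholder (not the answer to part 2. a), the data is random stand-in data, and 'nRows' and 'nCols' are hypothetical helper variables that pick a roughly square grid.

```matlab
rng(2);
data = randn(500, 361);          % stand-in for the preprocessed faces
[PC, ~, ~] = pca(data);

n = 6;                           % placeholder; use the n you found in 2. a)

% Choose a roughly square subplot grid with at least n panels.
nCols = ceil(sqrt(n));
nRows = ceil(n / nCols);

figure;
for k = 1:n
    subplot(nRows, nCols, k);
    % Each column of PC is one principal component (eigenface).
    imagesc(reshape(PC(:, k), 19, 19));
    axis image;
    axis off;
    title(sprintf('PC %d', k));
end
colormap(gray);
```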

What properties does each eigenvector seem to code for? Based on your intuitive understanding of faces, why do you think these properties explain the most variance within the dataset? Add your figure and response to 'results.pdf'.

**2. c) Plot Cumulative Explained Variance vs. Number of PCs**

Now, we're going to create a scatter plot (in a new figure window!) illustrating how the percentage of cumulative explained variance is affected by number of principal components. You should have *number of PCs* on the x axis and *percentage of cumulative explained variance* on the y axis. HINT: You'll need to use the output of 'cumsum'!

% -- Your code here -- %
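A plot of this kind could be produced along the following lines. The 'eigenvalues' vector is made up for illustration, and 'pctExplained' is a hypothetical name for the cumulative sum rescaled to a percentage.

```matlab
% Hypothetical variances, already sorted in descending order.
eigenvalues = [5; 3; 1.2; 0.5; 0.3];

% Cumulative variance as a percentage of the total.
cumVar = cumsum(eigenvalues);
pctExplained = 100 * cumVar / cumVar(end);

% Number of PCs on the x axis, cumulative percentage on the y axis.
numPCs = (1:numel(eigenvalues))';

figure;
scatter(numPCs, pctExplained, 'filled');
xlabel('Number of PCs');
ylabel('Cumulative explained variance (%)');
ylim([0 100]);
```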

Where does your data plateau? Based on this, how many PCs should you select (in order to represent about 95% of total variance with the fewest dimensions)? Compare this number to the observations you made of the face images you produced in part 2. b). How well does the number from the graph match your perceptual assessment of how many PCs you need to recognize the faces?

Add your figure and responses to 'results.pdf'.

**2. Extra Credit**

Begin by downloading the data set available at http://www.cs.columbia.edu/CAVE/databases/SLAM_coil-20_coil-100/coil-100/coil-100.zip. This is a database from the Columbia Object Image Library (COIL) that contains color images of 100 objects that were photographed from several angles on a motorized turntable.

Read in the COIL-100 image data as you did with the faces. Remember to convert the images to grayscale.
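A reading loop might look roughly like the sketch below. Several things here are assumptions: the folder name 'coil-100', the '.png' extension, and the choice to keep only every 4th pixel (the 'step' variable) so that the data matrix stays small. If the folder is missing, the loop simply does nothing and 'images' stays empty.

```matlab
imageDir = 'coil-100';                      % hypothetical path to the unzipped data
files = dir(fullfile(imageDir, '*.png'));   % assumed file extension

step = 4;                                   % assumption: keep every 4th pixel
images = [];

for k = 1:numel(files)
    img = imread(fullfile(imageDir, files(k).name));

    % Convert color images to grayscale.
    if size(img, 3) == 3
        img = rgb2gray(img);
    end

    % Subsample and convert to double so PCA can operate on the values.
    img = double(img(1:step:end, 1:step:end));

    % Allocate the data matrix once the image size is known.
    if isempty(images)
        images = zeros(numel(files), numel(img));
    end

    images(k, :) = img(:)';                 % one image per row
end
```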

Now run PCA on the data set you produced in the step above.

Visualize some of the first principal components (for example, the first 25) of these images. What do you see?