Hierarchical Clustering and Intro to PCA
CLPS 1291: Assignment 5
Due: 9:00AM on Mar 15
Before you get started:
Look at the Assignment Guidelines for formatting and coding style information, submission guidelines, etc. If you have any questions related to the assignment, please post them in this Discussion.
As a reminder, we will neither accept answers that fail to follow the given template, nor consider code written outside of the allotted space. We will only review functions that follow our conventions and results documents submitted in the requested form.
You will need to use this skeleton code and data files to complete the assignment. See the skeleton code for more helpful guidelines.
We expect you to turn in the following:
- results.pdf - a pdf containing your outputs and descriptions.
- assignment5.zip - a zip file containing:
  - assignment5.m - a MATLAB script containing all of the code necessary for this assignment
1. Hierarchical Clustering
Hierarchical clustering is another way to organize the values in a dissimilarity matrix. The algorithm places all of the items in a row at the bottom of a 'tree' and repeatedly merges the most similar items or clusters. As a result, the most similar individual items end up next to each other, the most similar pairs end up next to each other, and so on. This is supposed to model the human cognitive process of characterizing stimuli. It can be a useful alternative or supplement to MDS when the items being categorized cannot be effectively described by a single physical dimension (since it might be difficult to accurately recover Euclidean distances between these kinds of data points). For the first part of this assignment, we will be taking another look at the color and animal datasets from last week. This time, we will use a hierarchical clustering algorithm to represent the dissimilarity data.
1. a) Load the Datasets
First, you'll need to load the color and animal datasets (just like you did last week!). As a reminder, 'colors.mat' and 'animals.mat' are both structures, each containing two fields:
names = cell array containing the names of each item. For the color dataset, this corresponds to wavelengths. For the animal dataset, this corresponds to types of animal.
dsim = matrix of perceived dissimilarities between each item, as determined by human judgments. These values have been adjusted to be on a scale from 0 to 1. A value of 1 means that two elements are the most dissimilar, while a value of 0 means that they are identical.
Load the file 'colors.mat' and save it as a variable. Save the fields within this structure as their own variables. Then do the same with 'animals.mat'.
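As a minimal sketch of the loading step, the snippet below writes a tiny stand-in file and reads it back, so it runs without the course data. The toy values and the variable names 'toy', 'toyFile', 'colorNames', and 'colorDsim' are hypothetical, and the sketch assumes the real file stores 'names' and 'dsim' as top-level variables; if it stores a single struct variable instead, you would index into that struct first.

```matlab
% Build a tiny stand-in file so this sketch runs without the course data.
% ASSUMPTION: the real 'colors.mat' stores 'names' and 'dsim' as top-level
% variables. The values below are made up for illustration only.
toy.names = {'434', '445', '465'};              % hypothetical labels
toy.dsim  = [0 0.1 0.6; 0.1 0 0.4; 0.6 0.4 0];  % hypothetical dissimilarities
toyFile = [tempname '.mat'];
save(toyFile, '-struct', 'toy');                % fields become variables

% Calling load with an output argument returns a struct
% with one field per variable stored in the file.
colors = load(toyFile);

% Save each field as its own variable.
colorNames = colors.names;
colorDsim  = colors.dsim;

delete(toyFile);                                % clean up the stand-in file
```

With the real data you would pass 'colors.mat' (and then 'animals.mat') to 'load' instead of the temporary file.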
1. b) Perform Hierarchical Clustering
Now, we will run hierarchical clustering to try to understand how people organize these items relative to each other in psychological space. You'll want to create a dendrogram -- a tree-like structure with the item names at the bottom (the 'leaves' of the tree), and lines ('branches') connecting the items. These lines indicate the points at which different items are clustered together, based on their perceived similarity.
Items that sit far apart along the bottom of the tree are usually dissimilar. For example, when looking at animal data, you might expect the points 'earthworm' and 'chimpanzee' to be far apart. Keep in mind, though, that the left-to-right order of the leaves is only a rough guide, because branches can be flipped without changing the clustering. The more reliable measure is the height at which the lines join: the lower the lines join, the more similar the connected data points are. For example, a line connecting the very similar animal points 'chimpanzee' and 'gorilla' would intuitively be lower than a line connecting the points 'chimpanzee' and 'goldfish'.
There is a function you can use to complete this task -- see if you can figure out what it is! This should only take one line of code per data set. HINT: You're trying to create a dendrogram...
Create a new figure window and run hierarchical clustering on the color dataset. Make sure your dendrogram has a title! Then do the same thing for the animal dataset.
Add both of your dendrograms to 'results.pdf'.
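One possible way to go from a dissimilarity matrix to a dendrogram is sketched below on a made-up 4-item matrix. The item names, the dissimilarity values, and the choice of 'average' linkage are all assumptions for illustration; the assignment does not say which linkage method to use. The sketch relies on 'squareform', 'linkage', and 'dendrogram' from the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical 4-item dissimilarity matrix (symmetric, zero diagonal).
names = {'chimpanzee', 'gorilla', 'goldfish', 'trout'};
dsim  = [0    0.10 0.80 0.90;
         0.10 0    0.70 0.85;
         0.80 0.70 0    0.20;
         0.90 0.85 0.20 0   ];

% linkage expects distances in vector form (as returned by pdist),
% so convert the square matrix with squareform first.
distVec = squareform(dsim);

% ASSUMPTION: average linkage; other methods ('single', 'complete', ...)
% can produce different trees.
tree = linkage(distVec, 'average');

figure;
dendrogram(tree, 'Labels', names);
title('Toy animal dendrogram (average linkage)');
ylabel('Dissimilarity at merge');
```

In this toy example the two primates merge first, at the height of their pairwise dissimilarity, which is the kind of pattern to look for in the real data.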
1. c) Comparing Hierarchical Clustering and MDS Results
Comment on how well the spatial representation returned by the MDS (last assignment) and HC capture your intuitions about the similarity between colors and animals. Was there a difference in which kind of representation seemed to capture your intuitions about the similarity of animals and the similarity of colors? Why do you think this is? Name two more kinds of stimuli that you think would be represented best using hierarchical clustering, and two more kinds of stimuli that you think would be best captured by a spatial representation (MDS).
Add your response to 'results.pdf'.
2. Intro to Principal Component Analysis (PCA)
PCA models a different cognitive process: the compression of perceptual dimensions in memory encoding. The PCA algorithm lets us determine which dimensions of our data points are most necessary for creating a recognizable representation of an item. We can then toss out the dimensions that contribute the least variance to our data. This method is especially useful when studying things like object recognition.
Here, we are going to create and work with a toy 2D dataset in order to get a better understanding of how PCA works. This dataset will have one direction with a large variance, and another direction with a small variance. We will then visualize what happens when you use PCA to reduce the data set from 2 dimensions to 1 dimension.
2. a) Create a 2D Dataset
First, we will use the function 'mvnrnd' to generate a dataset of random vectors from the multivariate normal distribution. This function requires three inputs:
- MU: a vector representing the mean of the data
- SIGMA: a covariance matrix
- N: the number of rows you want in your output matrix
To start off, create a dataset with a mean MU = [-1 2], SIGMA = [1 2; 2 5], and N = 100.
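A minimal sketch of this step, using the parameters given above, might look like the following. The fixed random seed is an addition for reproducibility and is not required by the assignment.

```matlab
rng(1);                      % ASSUMPTION: fixed seed, only for repeatability

MU    = [-1 2];              % mean of the two dimensions
SIGMA = [1 2; 2 5];          % covariance matrix (symmetric, positive definite)
N     = 100;                 % number of rows (samples) to draw

data = mvnrnd(MU, SIGMA, N); % N-by-2 matrix, one sample per row

% Quick sanity checks: the sample statistics should be near MU and SIGMA.
disp(size(data));
disp(mean(data));
disp(cov(data));
```

With only 100 samples, the sample mean and covariance will be close to, but not exactly equal to, MU and SIGMA.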
2. b) Create a Scatter Plot of Raw Data
Now, we'll use the 'scatter' function to visualize your dataset as a scatter plot. For fun, we can also specify the color of each data point! In order to assign each data point a random color, you'll need to create a matrix with the same number of rows as your data set (each row corresponding to a datapoint) and three columns (one for each red/green/blue value). Save your color matrix as a variable. Now, make a new figure window and use 'scatter' to plot your dataset (using your color matrix to color your data markers!). Make sure to give your scatter plot a title! HINT: If you're unsure how to do this, type 'help scatter' in your command window.
What do you notice? What shape is your data cloud? Why do you think that is?
Add your scatter plot and responses to 'results.pdf'.
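The coloring step can be sketched as follows. The variable name 'pointColors', the marker size of 36, and the 'filled' option are illustrative choices rather than requirements.

```matlab
rng(1);                                  % fixed seed for repeatability
data = mvnrnd([-1 2], [1 2; 2 5], 100);  % same toy dataset as in 2. a)

% One random RGB triplet per data point; rand returns values in [0, 1],
% which is the range scatter expects for RGB colors.
pointColors = rand(size(data, 1), 3);

figure;
% Arguments: x, y, marker size, per-point colors.
scatter(data(:, 1), data(:, 2), 36, pointColors, 'filled');
title('Raw 2D dataset');
xlabel('Dimension 1');
ylabel('Dimension 2');
```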
2. c) Run PCA on Your Data
MATLAB has a built-in 'pca' function that takes in your dataset and returns, among other things, a matrix whose columns contain the coefficients of each principal component of the dataset (one per dimension). You'll need to request several outputs here: [COEFF, SCORE, LATENT, TSQUARED, EXPLAINED, ~] = pca(DATA). Each name on the left-hand side is an output variable that 'pca' fills with an array of data. Here is what you should use:
COEFF = the principal component (PC) coefficients. Here, you should use 'PC' as the first output.
SCORE = a representation of your dataset in PC space. This tells MATLAB to output a matrix 'score' with the same dimensions as your dataset. Use 'score' as the second output.
LATENT = the eigenvalues of the covariance matrix of your dataset (the sample covariance, which approximates the SIGMA you used to create it). Each eigenvalue represents the amount of variance in your data explained by the corresponding PC. In order to get this array of eigenvalues, use 'eigenvalues' as the third output.
TSQUARED = Hotelling's T-squared statistic for each observation, computed using all of the PCs of your data set. Use 'tsquared' as the fourth output.
EXPLAINED = this is the interesting part! Use 'explained' as your fifth output, and this returns the percentage of variance within your dataset that can be explained by each PC.
Now, put all that together and run PCA on the data set! This should only be one line of code.
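Put together, the call might look like the sketch below, which regenerates the toy dataset so that it runs on its own. The output names follow the ones suggested above; the final 'disp' line is only an illustrative check.

```matlab
rng(1);                                  % fixed seed for repeatability
data = mvnrnd([-1 2], [1 2; 2 5], 100);  % same toy dataset as in 2. a)

% Request the first five outputs of pca and ignore the sixth
% (the estimated mean of each column).
[PC, score, eigenvalues, tsquared, explained, ~] = pca(data);

% Columns of PC are the principal directions, sorted so that the first
% column explains the most variance. 'explained' holds percentages.
disp(PC);
disp(explained');
```

Because 'explained' is expressed in percent, its entries sum to 100, and the first entry should be much larger than the second for this dataset.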
2. d) Plot Principal Components
Now, we can plot the two PCs! Each PC is a direction vector, so you'll plot it as a line that runs along that direction and passes through the mean of your data. In order to do this, we'll use the 'plot' function, which takes the x-coordinates and y-coordinates of a set of points as inputs and connects those points with line segments.
To draw a line along a PC, convert the PC to the two endpoints of a line. You can do this by multiplying each component of the PC by an array of arbitrary size (let's say [-15 15] for now). The 'plot' function will then connect these endpoints to create a line corresponding to each PC.
First, you'll need to find the mean of your data (MATLAB has a 'mean' function you can use). This will output a 1-by-2 array: the first element is the mean of the first column of your dataset (the x-coordinates), and the second element is the mean of the second column (the y-coordinates). You'll need to add an item from this output to each input of 'plot' in the next step.
Now, make a new figure window and plot your principal components! Make each line a different color, and remember to give your figure axis labels and a descriptive title.
REMEMBER: the 'pca' function outputs an array with the PCs you need! Remember to check the MathWorks documentation if you need a reminder on how the functions work!
HINT: In order to get both lines to show up in the same figure, type 'hold on' before plotting the first line, and 'hold off' after plotting the second line.
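The endpoint construction described above can be sketched as follows. The names 'mu' and 't', the line colors, and the use of 'axis equal' and 'legend' are illustrative assumptions, not part of the assignment's template.

```matlab
rng(1);                                  % fixed seed for repeatability
data = mvnrnd([-1 2], [1 2; 2 5], 100);  % same toy dataset as in 2. a)
PC = pca(data);                          % columns are the two PCs

mu = mean(data);   % 1-by-2: [mean of column 1, mean of column 2]
t  = [-15 15];     % arbitrary scale factors giving the two endpoints

figure;
hold on;
% x-coordinates use the first component of each PC, y-coordinates the
% second; adding mu shifts the line so it passes through the data mean.
plot(mu(1) + t * PC(1, 1), mu(2) + t * PC(2, 1), 'r-');
plot(mu(1) + t * PC(1, 2), mu(2) + t * PC(2, 2), 'b-');
hold off;

axis equal;        % equal scaling so the perpendicular PCs look perpendicular
xlabel('Dimension 1');
ylabel('Dimension 2');
title('Principal components of the toy dataset');
legend('PC 1', 'PC 2');
```

Without 'axis equal', the two lines can appear to meet at an angle other than 90 degrees, even though the PCs are orthogonal.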
2. e) Project Data Into Eigenvector Space
Now, we're going to project the data into eigenvector space and visualize it as a scatter plot. First, you'll need to multiply your raw dataset by your principal components, and save the output as a new variable. If you do this correctly, your new variable should be the same size as your raw dataset.
Now, make a new figure window and create a scatter plot using your projected data. Make sure to give your plot a title!
How does this compare to the initial scatter plot you made of your raw data? Describe any differences you observe.
Add your scatter plot and responses to 'results.pdf'.
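A minimal sketch of the projection follows. It multiplies the raw data by the PC matrix as described above, and, as an aside that is not part of the assignment, it also checks how this relates to the 'score' output of 'pca', which is computed from mean-centered data. The variable names 'projected' and 'centered' are hypothetical.

```matlab
rng(1);                                  % fixed seed for repeatability
data = mvnrnd([-1 2], [1 2; 2 5], 100);  % same toy dataset as in 2. a)
[PC, score] = pca(data);

% Project the raw data onto the principal components (100-by-2 result).
projected = data * PC;

% Aside: pca centers the data before projecting, so 'score' equals the
% projection of the mean-centered data. It differs from 'projected' only
% by a constant shift.
centered = bsxfun(@minus, data, mean(data));
maxDiff  = max(max(abs(centered * PC - score)));

figure;
scatter(projected(:, 1), projected(:, 2), 'filled');
axis equal;
xlabel('PC 1');
ylabel('PC 2');
title('Data projected into eigenvector space');
```

In the projected coordinates the two axes are uncorrelated, so the tilted data cloud from the raw scatter plot should now appear aligned with the axes.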
2. f) Dimensionality Reduction
Drop one of the principal components and project your data back into the original space. In other words, create a scatter plot of your data along the principal component which explains the greatest amount of variance in your data. Then, we'll overlay this scatter plot with the lines you created earlier (corresponding to both PCs). This will let you visualize how removing a principal component changes the way your data is represented in space.
First, you'll need to multiply the retained column of your projected dataset (the variable you created in the previous step) by the transpose of the corresponding principal component, and save the output as a new variable. If you do this correctly, your new variable should be the same size as your raw dataset.
Next, create a new figure window and use 'scatter' to make a scatter plot of your updated dataset.
Finally, without creating a new figure window, use the code you used in step 2. d) to plot your principal components as vectors. Make sure to give your figure axis labels and a title!
What effect did removing a PC have on your data? Why do you think this is? What do you think would happen if you were using a dataset with more than 2 dimensions?
Add your figure and responses to 'results.pdf'.
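One way to sketch the reduction is shown below: keep only the first column of the projected data and map it back with the transpose of the first PC. This is an interpretation of the step above, and the variable name 'reduced' is hypothetical.

```matlab
rng(1);                                  % fixed seed for repeatability
data = mvnrnd([-1 2], [1 2; 2 5], 100);  % same toy dataset as in 2. a)
PC = pca(data);
projected = data * PC;                   % projection from step 2. e)

% Keep only PC 1: (100-by-1) times (1-by-2) gives a 100-by-2 matrix
% whose points all lie along the direction of PC 1.
reduced = projected(:, 1) * PC(:, 1)';

mu = mean(data);
t  = [-15 15];

figure;
scatter(reduced(:, 1), reduced(:, 2), 'filled');
hold on;
plot(mu(1) + t * PC(1, 1), mu(2) + t * PC(2, 1), 'r-');
plot(mu(1) + t * PC(1, 2), mu(2) + t * PC(2, 2), 'b-');
hold off;
axis equal;
xlabel('Dimension 1');
ylabel('Dimension 2');
title('Data reconstructed from PC 1 only');
```

Because the raw data were not mean-centered before projecting, the reconstructed points fall on a line through the origin that is parallel to the PC 1 line rather than on top of it. Centering first and adding the mean back afterwards would place them directly on the PC 1 line.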
In parts of this assignment, we played around with generating random datasets from the normal distribution. The ability to sample from a particular distribution is an incredibly valuable skill. Now, we'll look at a variety of ways of performing this task. Remember to look up any unfamiliar functions on MathWorks or using the 'help' function in the command window.
First, we'll go about making a graph with three subplots.
- Uniform distribution - The simplest distribution is the uniform, often denoted $U(a, b)$, where $a$ and $b$ define the minimum and maximum of the range of evenly distributed random values. Using the 'rand' function, produce 100 values drawn from $U(0, 1)$. Display a histogram of these on the first subplot.
- Convert this random sample into one drawn from the normal distribution, where $\mu$, the mean, is zero and $\sigma$, the standard deviation, is 1. To do this, you'll have to plug each sample into the normal distribution equation: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$, where $\mu$ is the mean, $\sigma$ is the standard deviation, and $x$ is a given sample. In subplot 2, create a histogram of these values. Then, using the 'normpdf' function (which evaluates the probability density function, essentially the relative likelihood of seeing a given value at each point in the distribution), produce a normal curve of these values and superimpose that line over the histogram. (Look up 'hold on' if you're having trouble.)
- Finally, we'll teach you the easy way to sample the normal distribution. This time, use the function 'normrnd' to generate 100 values $x \sim N(0, 1)$ (~ means 'is distributed as'). Plot a histogram of these on the third subplot and, once again, superimpose a normal curve of these values using 'normpdf'.
Title and label the subplots and graph appropriately. What do you notice? How does the normal sample by hand compare to that produced by MATLAB?
Next, redo the three subplots above with 10,000 samples instead of 100. Include this graph as well. How does it compare to the first one? What do you notice?
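The building blocks for these subplots can be sketched as follows. This is not a full solution: the middle panel only checks that the density formula evaluated by hand agrees with 'normpdf', and the grid 'xs', the panel titles, and the 'pdf' histogram normalization are illustrative assumptions.

```matlab
rng(1);                            % fixed seed for repeatability
n     = 100;
mu    = 0;                         % mean of the normal distribution
sigma = 1;                         % standard deviation

u = rand(n, 1);                    % n draws from U(0, 1)
z = normrnd(mu, sigma, n, 1);      % n draws from N(0, 1)

% Evaluate the normal density formula by hand on a grid of x values.
xs     = linspace(-4, 4, 200);
byHand = (1 / (sigma * sqrt(2 * pi))) * exp(-(xs - mu).^2 / (2 * sigma^2));

figure;
subplot(1, 3, 1);
histogram(u);
title('Uniform samples');

subplot(1, 3, 2);
plot(xs, byHand, 'k-');
hold on;
plot(xs, normpdf(xs, mu, sigma), 'r--');   % should overlap the hand curve
hold off;
title('Density formula vs normpdf');

subplot(1, 3, 3);
% 'pdf' normalization puts the histogram on the same scale as the curve.
histogram(z, 'Normalization', 'pdf');
hold on;
plot(xs, normpdf(xs, mu, sigma), 'r-');
hold off;
title('normrnd samples');
```

Changing 'n' to 10000 should make the third histogram follow the curve much more closely.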
For even more extra credit, look up the equation for the 'Poisson Distribution.' Repeat the steps above to convert uniform random samples to ones drawn from a Poisson and compare. For step 3, MATLAB has a helpful function for drawing Poisson samples. There is also a function that will calculate the Poisson pdf, allowing you to plot the fit curve as above. Include both the 100 and 10,000 sample plots along with the same explanations as above.
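As a hedged sketch of the Poisson version, the snippet below uses 'poissrnd' and 'poisspdf' from the Statistics and Machine Learning Toolbox. To turn uniform samples into Poisson samples it uses the inverse-CDF transform via 'poissinv', which is one standard technique and not necessarily the conversion the assignment has in mind. The rate 'lambda = 4' and the variable names are arbitrary choices for illustration.

```matlab
rng(1);                                % fixed seed for repeatability
lambda = 4;                            % ASSUMPTION: arbitrary Poisson rate
n      = 100;

u           = rand(n, 1);              % uniform samples on (0, 1)
fromUniform = poissinv(u, lambda);     % inverse-CDF transform to Poisson
direct      = poissrnd(lambda, n, 1);  % built-in Poisson sampler

k   = 0:15;                            % counts at which to evaluate the pmf
pmf = poisspdf(k, lambda);

figure;
subplot(1, 2, 1);
% Integer-centered bins of width 1, so 'pdf' heights match probabilities.
histogram(fromUniform, 'BinMethod', 'integers', 'Normalization', 'pdf');
hold on;
plot(k, pmf, 'ro-');
hold off;
title('Poisson via inverse CDF');

subplot(1, 2, 2);
histogram(direct, 'BinMethod', 'integers', 'Normalization', 'pdf');
hold on;
plot(k, pmf, 'ro-');
hold off;
title('Poisson via poissrnd');
```

Both panels should look alike, and both should track the plotted probabilities more closely as 'n' grows.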