# Assignment 7: K-Means

- Due Apr 5, 2016 by 9am
- Points 10
- Submitting a file upload
- Available Mar 22, 2016 at 9am - Apr 10, 2016 at 9am

## K-Means

**CLPS 1291: Assignment 7**

Due: April 5 at 9:00 AM

Before you get started:

Look at the Assignment Guidelines for formatting and coding style information, submission guidelines, etc. If you have any questions related to the assignment, please post them in this Discussion.

**As a reminder, we will neither accept answers that fail to follow the given template, nor consider code written outside of the allotted space. We will only review functions that follow our conventions and results documents submitted in the requested form.**

You will need to use this skeleton code and data files to complete the assignment. See the skeleton code for more helpful guidelines.

We expect you to turn in the following:

- results.pdf - a pdf containing your outputs and descriptions.
- assignment7.zip - a zip file containing:
  - assignment7.m - a MATLAB script containing all of the code necessary for this assignment

### 1. K-Means Clustering: Concept Practice

K-means clustering attempts to model the cognitive process of *category learning* as minimizing an energy function. Here, "energy" refers to the distance between individual data points and the cluster mean. The k-means algorithm creates and updates centroids that represent prototypes of various clusters (categories). Before working with a real dataset, we will take a look at a toy 2D dataset in order to get a better sense of what the output of the k-means algorithm should look like.
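Concretely, the "energy" that k-means minimizes can be written as the total squared distance between each point and the centroid of its assigned cluster (this is the standard k-means objective; the notation here is ours, not from the skeleton code):

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

where \(\mu_j\) is the mean of the points in cluster \(C_j\). The algorithm alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points; each step can only decrease \(J\), so the algorithm always converges, though possibly to a local minimum rather than the global one.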

You'll need to use the file 'kmeans_practice.pdf' (available in resources.zip on Canvas) for this part of the assignment.

This document has four figures, each depicting the same scatterplot of a 2D dataset. The figures have different k values and different random initializations of centroids. Given the initial centroids shown in each figure, where do you think the centroids will be located when the algorithm is finished running? For each figure, illustrate (using the drawing software of your choice) what you think a stable solution of the algorithm will look like.

Your answer should include:

- An outline surrounding each cluster
- Symbols depicting the final centroids
- A brief written explanation -- Why is this the solution found by k-means?

REMEMBER: Many things can influence the k-means output (density of points in each cluster, local minima, initialization position, etc.)!

Add your illustrated figures and explanations to 'results.pdf'.

### 2. K-Means Clustering: Authorship Attribution

The work behind this assignment is based on the research of Neal Fox, a recently-graduated CLPS doctoral student who studied linguistics and authorship attribution.

In this part of the assignment, you'll have the chance to apply your knowledge about k-means clustering to a current research problem: discovering groupings of documents according to the identity of their author. Don't worry too much about the background information (though it IS really cool!) -- we will guide you through the assignment step-by-step, and ask you conceptual questions based on your intuition and information you've already learned in class.

**2. Background Information: How the model of authorship attribution works**

When we don't know who the author of a document is, how can we make predictions about their identity, or find other documents that are also likely to be by this same unknown person? This is a longstanding question in computational linguistics, and work on this question is currently informing research on the Bible, Shakespeare, plagiarism detection, and forensic science, just to name a few areas. For example, you can probably tell whether your mother or your father wrote an email to you, even if you don't know which email address it came from. You might imagine that we are tracking some simple word statistics that characterize their speaking/writing style (e.g., your mom uses the word 'nevertheless' in the same places your father might use the word 'however'). In this lab, you will think about how we recruit our implicit knowledge of the statistics underlying their respective "styles" in order to identify them while reading that email.

The models we are going to implement rely on the assumption that the statistics you're tracking during your interactions with mom or dad are about lexical frequency (how often different words appear). Examining a corpus of texts by different authors, we will first look at the relative frequencies of all words in the corpus and see how predictive of authorship they are.

Then, we will look at another model using frequencies of a very specific class of words that we call 'topic-independent words'. Linguists might call them 'function words' and computer scientists often refer to them as 'stop words', but the key is that they are words that are distributed fairly consistently across documents about all different topics; they primarily carry grammatical information, but not much content. Some examples are 'the', 'and', 'of', and 'respectively'. We will try to find out how informative each of these two measures is as we try to group documents based on authorship. The statistics that a model is making use of (tracking) are called the 'features' of that model.

In the corpus you will examine (with a total of 49,197 words), there are 11,180 *different* words. Every document can thus be represented as a vector that has 11,180 entries in it. Some words will never appear in a given document, so the entry for that feature will be 0, and the entry for every word that does appear in the document will be equal to the relative frequency of that word in the document (that is, the # of times that word occurs in the document divided by the total # of words in the document). In the topic-independent feature list (Fox, Ehmoda & Charniak, 2011), there are only 476 words, so each document can be represented by a much shorter (476-dimensional) column vector. If each document is represented by a vector, you can think of documents as points in a very high dimensional space. Documents that are close together in that space contain similar words in similar frequencies.

The critical assumption underlying computational models of authorship attribution is that, whatever features a model is using, documents by a single author tend to be relatively consistent in those words' frequencies. Clustering points in the high-dimensional space is therefore just like trying to find the groupings of documents that center around some set of average distributions, each of which represents an author.
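To make the relative-frequency representation concrete, here is a minimal MATLAB sketch using a made-up toy vocabulary of 3 words and 2 documents (the real matrices in this assignment are built the same way, just much larger):

```matlab
% Toy example: 2 documents x 3 vocabulary words, entries are raw counts
counts = [3 0 1;
          0 2 2];

% Divide each row by that document's total word count to get relative frequencies
relFreq = bsxfun(@rdivide, counts, sum(counts, 2));

% Each row of relFreq now sums to 1, e.g. relFreq(1,:) is [0.75 0 0.25]
disp(relFreq);
```

Each document is now a point in a 3-dimensional space; in the full model the same idea gives an 11,180-dimensional point per document.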

You will be testing how well k-means clustering does at grouping together documents by the same author!

**Authorship and movie attribution using word frequency**

First, we will use the first model (full word-frequency features) to group by author and by movie. Remember, this model is quite large -- there are over 10,000 features for each vector! Remind you of something we've done before? ;)

**2. a) Overview/loading datasets -- read carefully before continuing!**

NOTE: All of the datasets you need are available on Canvas!

Make sure you add the folder containing these files to your Matlab path!

72 movie reviews were downloaded from sources such as "Yahoo! Movies." This movie review corpus has the unique trait of being "fully crossed" with respect to "movie reviewed" and "movie reviewer": 6 reviewers reviewed each of 12 movies. For example, there are six reviews of "The Social Network," one by each author in the corpus.

Two variables were created from this dataset:

- 'true_authors.mat' is the true mapping of documents 1-72 to each document's author.
- 'true_movies.mat' is the true mapping of documents 1-72 to each document's topic -- that is, the mapping from a document to which movie the document was a review of.

The variable 'prob_docsT.mat' contains the full dataset, where each document is represented by its count for each of the 11,180 different words that exist in the corpus (that is, every word that appears at least once in the corpus). It contains a 72 x 11,180 matrix: each row is a document, and each column is a feature (one of the 11,180 unique words appearing across all 72 documents).

The variable 'TI_prob_docsT.mat' represents the full dataset in a different way. Here, each document is represented by its count for each of the 476 topic-independent words that exist in the corpus. It contains a 72 x 476 matrix: again, each row is a document and each column is a feature.

The file 'evaluateClusters.m' is a Matlab script that evaluates how "good" the clusters are in a quantitative way via a modified version of a measure known as "normalized mutual information." **You do not need to understand how it is calculated or what exactly it means!** Essentially, it is an estimate of how good one clustering is compared to another with respect to the true classes. A minimum accuracy score of 0 means that there is no overlap between the clusters and the true classes. A maximum accuracy score of 1 roughly means that there is perfect overlap between the clusters and the true classes.

**2. b) Test 'evaluateClusters'**

To use 'evaluateClusters.m', you need to open the file, change some variable names as described in the script's comments, save the script, and then type "evaluateClusters" in the command window. This will print the adjusted normalized mutual information score for you.

To better understand how this measure works, we will run the script on 'true_authors.mat' and 'true_movies.mat' to compare how well these two overlap.
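In sketch form, the comparison looks like this (the exact variable names to edit are described in the script's own comments; the names below are assumptions):

```matlab
% Load the two true mappings into the workspace
load('true_authors.mat');    % true document-to-author mapping
load('true_movies.mat');     % true document-to-movie mapping

% Edit evaluateClusters.m so its two inputs point at these variables,
% save the script, then run it:
evaluateClusters             % prints the adjusted normalized mutual information
```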

What is the adjusted normalized mutual information between clusters? Why? HINT: 0 = NO overlap between clusters and true classes, and 1 = PERFECT overlap between clusters and true classes.

Now run the script on 'true_authors.mat' and 'true_authors.mat'. What is the adjusted normalized mutual information between clusters now? Why?

Add your responses to 'results.pdf'.

**2. c) Run k-means (k = true number of authors)**

Matlab has a built-in k-means algorithm, which you can call with the function 'kmeans'. Use this function to run k-means on the full dataset ('prob_docsT.mat') a few times. Set k = 6 (the true number of authors).

HINT: use 'help kmeans' if you aren't sure what to do! You should only need two arguments.
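A minimal sketch of this step is below. One assumption: the variable name stored inside 'prob_docsT.mat' is taken to match the file name; check the actual name with 'whos' after loading.

```matlab
% Load the full word-frequency dataset (72 documents x 11,180 word features)
load('prob_docsT.mat');

k = 6;                          % true number of authors
idx = kmeans(prob_docsT, k);    % idx(i) = cluster label (1..6) assigned to document i
```

Because k-means starts from random centroid initializations, rerunning this can produce different assignments, which is exactly what 2. d) asks you to think about.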

**2. d) Display k-means output as a bar graph**

Now, we'll display our kmeans output as a bar graph. Each of the 72 movie review documents should be on the x axis, with the y axis representing the potential author of the document (author 1, 2, 3, 4, 5, or 6). This will let us visualize which documents the kmeans algorithm matched to which authors.

Make a new figure window, and create a bar graph of the kmeans output. Give your graph a title, and label the x and y axes.

HINT: Use 'help bar' if you aren't sure what to do!
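One possible sketch, assuming 'idx' holds the cluster assignments returned by 'kmeans' in the previous step:

```matlab
figure;                                      % open a new figure window
bar(idx);                                    % one bar per document, height = cluster label
title('K-means author clusters (k = 6)');    % any descriptive title is fine
xlabel('Document (movie review 1-72)');
ylabel('Assigned cluster (candidate author)');
```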

Try running your code a few times. Even though we are using the same 'kmeans' command with the same inputs each time, are the outputs consistently the same vector? Why or why not?

Add your bar graph (just one version, please!) and your responses to 'results.pdf'.

**2. e) Run and visualize k-means (k = true number of movies)**

This time, change the k value to 12 (the true number of movies). Try running 'kmeans' once with the new parameters.

Again, visualize your results as a bar graph (with a title and x and y axis labels!). Remember, this time your y-axis represents the 12 possible movies the documents could be attributed to.

Add your bar graph to 'results.pdf'.

**2. f) Evaluate authorship and movie attribution (word frequency model)**

By editing and running the *evaluateClusters.m* script, find out how well the clustering algorithm does at determining the best groupings of movies (compare true_movies to a k = 12 clustering) and of authors (compare true_authors to a k = 6 clustering).

Describe your results. Why do you think this accuracy score might be higher for one group than the other?

HINT: You don't need to know about how accuracy score is computed to get this! Think intuitively about what the features are and how they relate to movies or authors.

Add the 'evaluateClusters' output for each of these two groupings and your responses to 'results.pdf'.

**3. Authorship and movie attribution using topic-independent features**

Now, we will use the second model (topic-independent features) to group by author and by movie. Remember, this model is much smaller - there are fewer than 500 features in each vector rather than over 10,000. The features themselves are quite different, though.

**3. a) Run and visualize k-means (k = true number of authors)**

Use the k-means algorithm again to find the best clusters using k = 6 (true number of authors), but this time use the data from 'TI_prob_docsT'.

Make another bar graph!

Add your figure to 'results.pdf'.

**3. b) Run and visualize k-means (k = true number of movies)**

Now, use the k-means algorithm (yes, again) to find the best clusters using k = 12 (true number of movies) with the data from 'TI_prob_docsT'.

Make another bar graph!

Add your figure to 'results.pdf'.

**3. c) Evaluate authorship and movie attribution (topic-independent features model)**

Use 'evaluateClusters.m' to evaluate how well this model does at grouping documents: By author? By movie?

Which model ('prob_docsT' or 'TI_prob_docsT') appears to be better at each task (authorship attribution and movie attribution)? Why?

Add the 'evaluateClusters' output for each of these two groupings and your responses to 'results.pdf'.

**Extra Credit!**

Obviously, when someone is doing authorship attribution on a set of documents, it is not always clear how many authors are represented in the set. There might be only 1 author who was very prolific, or there could be a different author for each document. Let's assume you do not know how many authors contributed to this set of 72 documents, but you have a hunch, based on some other information, that it's somewhere between 2 and 30 authors.

One of the strengths of the adjusted normalized mutual information score (the output of evaluateClusters) is that it can help you select the "best" value of k. The "best" value will be the one that yields the highest score, which means there is high overlap between the veridical classes of the documents (the true authors) and the results of the clustering, while also keeping k to a reasonable size (you don't want a single document in every cluster, because then your clusters are meaningless).

Modify evaluateClusters.m to find the optimal k for the topic-independent model. Report your results and comment.
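One way to structure the search is sketched below. Note the loud assumptions: 'evaluateClustersFn(clusters, trueLabels)' is a *hypothetical* function-style wrapper (the provided evaluateClusters.m is a script you edit by hand, so you would need to refactor it into a function or edit it inside the loop), and the loaded variable names are assumed to match the file names.

```matlab
% Load the topic-independent features and the true author labels
load('TI_prob_docsT.mat');     % 72 x 476 matrix (assumed variable name)
load('true_authors.mat');      % true author labels (assumed variable name)

kValues = 2:30;                % the hunch: between 2 and 30 authors
scores = zeros(size(kValues));

for i = 1:numel(kValues)
    idx = kmeans(TI_prob_docsT, kValues(i));          % cluster with this k
    scores(i) = evaluateClustersFn(idx, true_authors); % hypothetical wrapper
end

[bestScore, bestIdx] = max(scores);   % highest adjusted NMI wins
bestK = kValues(bestIdx);
```

Because each k-means run depends on its random initialization, you may want to repeat each k several times and keep the best (or average) score before comparing across k values.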