CLPS 1291: Assignment 3
Due: 9:00AM on Mar 1
Before you get started:
Look at the Assignment Guidelines for formatting and coding style information, submission guidelines, etc. If you have any questions related to the assignment, please post them in this Discussion.
As a reminder, we will neither accept answers that fail to follow the given template, nor consider code written outside of the allotted space. We will only review functions that follow our conventions and results documents submitted in the requested form.
You will need to use this skeleton code and text file to complete the assignment.
We expect you to turn in the following:
- results.pdf - a PDF containing your outputs and descriptions.
- assignment3.zip - a zip file containing:
  - assignment3.m - a MATLAB script containing all of the code necessary for this assignment
  - MobyWordFreqs.txt - a text file containing the 30 most frequently used words in Moby Dick (see below for instructions)
  - MobyLetterFreqs.csv - a .csv file containing the frequencies of the letters in Moby Dick (see below for instructions)
Text Parsing: Moby Dick
The goal of this assignment is to help you become familiar with MATLAB I/O (input/output) functions and to develop skills for parsing and processing various data files.
We will be working with Herman Melville's novel Moby Dick throughout the assignment. After reading this book into MATLAB as a .txt file, we can extract some interesting data from the text! Specifically, we will determine the relative frequency of individual letters and words in the document and display our findings as output graphs. You won't have to code this from scratch, but you will have to fill in some code.
The specific instructions for this assignment can be found within the skeleton code. We give a general overview below. Please read the instructions carefully, and only write code within the designated "your code here" areas. This will make your assignment easier to grade, and ensure that the provided skeleton code continues to run properly.
1. a) Load the Document
In this section, you'll learn how to load a document into MATLAB and extract text information.
NOTE: This part of the script might take a little while to run!
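As a rough illustration of the loading step, here is a minimal sketch that reads a text file into MATLAB and splits it into lowercase words. The file name and all variable names are assumptions made for this example; the skeleton code may load and name things differently, and its instructions take precedence.

```matlab
% Write a tiny sample file so this sketch runs on its own.
% (Hypothetical file name; the assignment supplies its own text file.)
sampleFile = fullfile(tempdir, 'sample_text.txt');
fid = fopen(sampleFile, 'w');
fprintf(fid, '%s', 'Call me Ishmael. Some years ago, never mind how long.');
fclose(fid);

rawText   = fileread(sampleFile);                    % whole file as one char vector
cleanText = lower(rawText);                          % fold case so "The" and "the" match
cleanText = regexprep(cleanText, '[^a-z\s]', ' ');   % replace non-letters with spaces
words     = regexp(strtrim(cleanText), '\s+', 'split');  % cell array of words

disp(numel(words));   % number of words found in the sample
```

Replacing punctuation with spaces (rather than deleting it) keeps words separated when they are joined only by punctuation, such as a dash.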
1. b) Determine the Top 30 Words and Output as .txt
Calculate the top 30 most frequently used words in Moby Dick! Then save this list of words as a .txt file.
Your text file should be called MobyWordFreqs.txt.
- It should have a descriptive header.
- Each word and its frequency should be on a different line. These should be listed in descending order (i.e., with the MOST frequent word at the top of the list).
For example:
Line 1: 1. Whale 100
Line 2: 2. Boat 99
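One possible way to count and rank words is sketched below on a toy word list. The variable names, the output file name, and the header text are all assumptions for illustration, not the names the skeleton code requires.

```matlab
% Toy word list standing in for the parsed novel (assumption for this sketch).
words = {'the', 'whale', 'the', 'sea', 'whale', 'the'};

[uniqueWords, ~, idx] = unique(words);     % idx maps each word to its unique entry
counts = accumarray(idx(:), 1);            % occurrences of each unique word
[sortedCounts, order] = sort(counts, 'descend');
sortedWords = uniqueWords(order);
topN = min(30, numel(sortedWords));        % the toy list has fewer than 30 words

% Hypothetical output file name, written to the temp folder for this sketch.
fid = fopen(fullfile(tempdir, 'WordFreqsSketch.txt'), 'w');
fprintf(fid, 'Rank. Word Count\n');        % descriptive header line
for k = 1:topN
    fprintf(fid, '%d. %s %d\n', k, sortedWords{k}, sortedCounts(k));
end
fclose(fid);
```

Using `unique` together with `accumarray` avoids looping over every word in the text, which matters for a document as long as a novel.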
1. c) Calculate Letter Frequency and Output as .csv
Calculate the frequency of all letters in Moby Dick and then save the results in a .csv file.
The resulting file should be called MobyLetterFreqs.csv and contain three columns, the first with the letters, the second with their frequencies (number of appearances), and the third with their frequency as a percent of all letters (use the % symbol).
Letter | Count | Percent Total
A | 8500 | 9%
B | 4000 | 1%
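The three-column layout described above could be produced along the following lines. This is a sketch on a short sample string; the variable names and the output file name are assumptions, and the two-decimal percent format is just one reasonable choice.

```matlab
txt = lower('Call me Ishmael.');       % short sample standing in for the novel
letters = 'a':'z';
counts = zeros(1, 26);
for k = 1:26
    counts(k) = sum(txt == letters(k));   % appearances of the k-th letter
end
pct = 100 * counts / sum(counts);         % share of all letters, in percent

% Hypothetical output file name, written to the temp folder for this sketch.
fid = fopen(fullfile(tempdir, 'LetterFreqsSketch.csv'), 'w');
fprintf(fid, 'Letter,Count,Percent Total\n');
for k = 1:26
    % '%%' prints a literal percent sign in the output
    fprintf(fid, '%s,%d,%.2f%%\n', upper(letters(k)), counts(k), pct(k));
end
fclose(fid);
```

Only the characters a through z are counted here, so spaces and punctuation do not affect the percentages.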
After completing this step, you can try opening up the data in Excel! It's pretty cool!
1. d) Letter Frequency Bar Graph
Display the letter frequencies you calculated as a bar graph, and include this figure in your results.pdf.
What do you notice about your results? Which letters are the most frequent? Why? Add your response to results.pdf.
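A labeled bar graph of letter counts might be drawn as in the sketch below. The counts come from a toy sentence rather than the novel, and the variable names and label text are assumptions for this example.

```matlab
txt = 'the quick brown fox jumps over the lazy dog';   % toy sentence
letters = 'a':'z';
counts = arrayfun(@(c) sum(txt == c), letters);        % one count per letter

figure;
bar(counts);
set(gca, 'XTick', 1:26, 'XTickLabel', num2cell(upper(letters)));  % A..Z tick labels
xlabel('Letter');
ylabel('Number of appearances');
title('Letter frequencies (toy sentence)');
```

Setting the tick labels explicitly keeps every letter visible on the x axis instead of the default numeric ticks.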
2. Word Length vs. Frequency
Grzybek et al. (2007) explain that there is an inverse relationship between word frequency and length. Their math is quite complicated, but essentially, they argue that frequency(w) is proportional to a term in the length of w that involves some negative number, x. This has interesting implications for the English language! For example, this suggests that certain forms of a word are used more often, even when the root word is the same.
In this section, we'll try to replicate that effect (in a simplified manner)!
NOTE: Our code handles much of the more complicated computations for you. To ensure that it runs with your script, please make sure to give variables in the sections for which you are responsible the names that we request! (Also, please avoid changing our code.)
At the end, you'll plot the mean word length within each bin as a function of word frequency. Make sure to add a title and label your x and y axes! Have your graph start at zero. Remember that the word length data is stored in the array meanWordLengths, which we've created for you!
What does the graph look like? Does it fit the predicted form? Add a copy of the graph and your response to results.pdf.
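The plotting step could look roughly like the sketch below. The numbers are made-up placeholders so the example runs on its own, and binFreqs is a hypothetical name for the x values; in the assignment, meanWordLengths and the frequency bins come from the skeleton code.

```matlab
% Placeholder values only (not results from the novel).
binFreqs        = [10 50 200 1000 5000];     % hypothetical frequency per bin
meanWordLengths = [8.1 6.9 5.6 4.2 3.0];     % stand-in for the provided array

figure;
plot(binFreqs, meanWordLengths, '-o');
xlabel('Word frequency (occurrences)');
ylabel('Mean word length (letters)');
title('Mean word length vs. word frequency');
xlim([0, max(binFreqs)]);                    % x axis starts at zero
ylim([0, max(meanWordLengths) + 1]);         % y axis starts at zero
```

Calling `xlim` and `ylim` after `plot` overrides the automatic axis limits, which otherwise may not include zero.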
Analyzing text offers a lot of opportunities for interesting findings. Now is your chance to show us something beyond the required frequency analysis!
Project Gutenberg has made over 50,000 public domain texts available for free as ebooks, in many different languages. Using texts from this source, show us something interesting! Create a separate file called 'extracredit.pdf' with at least one output graph and a description of your findings. Here are some ideas:
- Compare the same text across multiple languages. Can letter frequency help you predict the language of a new text?
- Perform second order statistics on the results you already acquired.
- Determine how frequently certain letters co-occur (e.g., the frequency of "th" vs. "pl").
- Compare word frequency between authors. Do certain words seem to be characteristic of one person over another?
- Anything else you can think of!
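As a starting point for the letter co-occurrence idea above, letter pairs can be counted with a substring search. This is a minimal sketch on a toy phrase; the variable names are assumptions for this example.

```matlab
txt = lower('The path that they planned');   % toy phrase
txt = regexprep(txt, '[^a-z ]', '');         % keep only letters and spaces

% strfind returns the start index of every match, so numel gives the count.
countTh = numel(strfind(txt, 'th'));
countPl = numel(strfind(txt, 'pl'));

fprintf('th: %d, pl: %d\n', countTh, countPl);
```

Keeping the spaces in the text means pairs are only counted within words, not across word boundaries.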