In this homework, we will finally finish what we set out to do: some text analysis on the LegCo speeches.
In the first lecture (still remember that?), we talked about information and media literacy. Among the things that this entails is the ability to "Analyze messages in a variety of forms by identifying the author, purpose and point of view, and evaluating the quality and credibility of the content". In this last homework, we will use Python and text analysis to identify the speaker from LegCo speeches, and their emphasis on certain political issues.
To begin, download the following files. Make sure you save them in the same directory.
Name_StudentID_HW2-6.py
The first task is to finish the buildDictionary() function. buildDictionary() takes in a file, reads it in, and then builds up a word-frequency dictionary for all the words in the file. At the end of the function, it returns the word-frequency dictionary that it constructed.
Take a look at the outline for the buildDictionary() function. The wordFreq variable is there for you already -- that is what the dictionary is going to be called inside the function. The return statement is there for you too. Now fill in the rest of the programming statements. The comments should give you some help.
When you are done, run the program by pressing F5. You should get the following output:
Building the word frequency dictionary for cat.txt
The wordFrequency dictionary for cat.txt is {'the': 0.09090909090909091, 'and': 0.09090909090909091, 'mat': 0.09090909090909091, 'on': 0.09090909090909091, 'cat': 0.09090909090909091, 'sitting': 0.09090909090909091, 'hat': 0.09090909090909091, 'a': 0.18181818181818182, 'wearing': 0.09090909090909091, 'is': 0.09090909090909091}
Building the word frequency dictionary for dog.txt
The wordFrequency dictionary for dog.txt is {'the': 0.09090909090909091, 'dog': 0.09090909090909091, 'and': 0.09090909090909091, 'standing': 0.09090909090909091, 'mat': 0.09090909090909091, 'on': 0.09090909090909091, 'hat': 0.09090909090909091, 'a': 0.18181818181818182, 'wearing': 0.09090909090909091, 'is': 0.09090909090909091}
(The line breaks may be in different places, and the order of the keys may be different, but the keys and their linked values should be the same.)
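If you get stuck, here is one minimal way the function could look. This is only a sketch -- the comments in the outline file may ask for slightly different steps. The key observation is that the sample output shows *relative* frequencies, i.e. each word's count divided by the total number of words in the file:

```python
# A minimal sketch of buildDictionary() -- follow your outline's comments
# for the official version.  Assumes the file contains plain words
# separated by whitespace, with no punctuation to strip.
def buildDictionary(filename):
    wordFreq = {}
    with open(filename) as infile:
        words = infile.read().lower().split()
    for word in words:
        # Count each occurrence of the word
        wordFreq[word] = wordFreq.get(word, 0) + 1
    for word in wordFreq:
        # Convert raw counts to relative frequencies (count / total words)
        wordFreq[word] = wordFreq[word] / len(words)
    return wordFreq
```

Note the use of dict.get(word, 0), which returns 0 the first time a word is seen instead of raising a KeyError.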
The next task is to write the function calculateDistance(). calculateDistance() takes as arguments two word-frequency dictionaries and a list of words that we are interested in, and it calculates the distance between the two word-frequency dictionaries, based on the Pythagorean Theorem.
If you look inside the calculateDistance function, you should see a variable, distance, which is initialized to 0.0. Your job is to fill in the rest of this function. Remember that the distance is calculated as the square root of the sum of squares of the differences (gosh that is a mouthful!) -- that is, we calculate the difference between the frequencies for each word, then we square it, add up the squared differences, and then, finally, we take the square root.
Remember also that the power operator in Python is **. That is, 3 squared can be calculated as 3**2. Also, remember that taking a square root is equivalent to raising a number to the power 0.5 -- i.e. the square root of 4 can be calculated as 4**0.5.
The comments in the function outline should be helpful. When you are done writing the function, run the program by pressing F5 again. This will reload the program and rerun it. You should get the following output (after the rest of the output from Task 1):
The distance between cat.txt and dog.txt based on the keywords is 0.128564869306645
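Putting the formula into code, one possible shape for the function is the sketch below. The get() default handles a word that appears in only one of the two dictionaries -- check whether your outline's comments want this handled the same way:

```python
# A sketch of calculateDistance() -- the square root of the sum of
# squared differences in word frequencies, over the words in wordList.
def calculateDistance(freq1, freq2, wordList):
    distance = 0.0
    for word in wordList:
        # A word missing from a dictionary counts as frequency 0
        f1 = freq1.get(word, 0.0)
        f2 = freq2.get(word, 0.0)
        distance = distance + (f1 - f2) ** 2   # add up the squared differences
    return distance ** 0.5                     # square root, via ** 0.5
```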
This is an independent task. You may discuss this task only with teaching staff.
In this task, we will use the functions that you just created to identify the author (or, more precisely, the speaker) of a text. You will be given 5 files that contain transcripts of speeches. 4 of the files come from speeches by the same person, and 1 comes from a different speaker (let's call him/her the "outlier").
How would we do this? Well, it turns out that people are fairly consistent in how they use stopwords when writing, and, to a lesser extent, when speaking. Therefore, stopword frequencies can be used as a kind of "signature" for a speaker. We will test the hypothesis that for any text in our list, the one from the outlier will have the biggest difference in stopword frequencies when compared to the other texts.
1. Comment out the line doTasks1_and_2() in the program (all the way at the bottom), and uncomment the line doTask3().
2. Download speeches.zip and unzip it. You should get five text files, a0.txt, a1.txt, a2.txt, a3.txt, a4.txt. Make sure that the files are in the same directory as your Python program.
3. Run the program. Its results will be saved in authorship.csv.
4. Open authorship.csv in Google Sheets and rename the tab Authorship. Name the spreadsheet Name_StudentID_HW2-6.
5. Visualize the distance matrix in the same way as similarityVisualization in Activity 1-4.
6. In Cell A10 of the Authorship sheet, write down which file you think comes from the outlier.
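To make the hypothesis concrete, here is a sketch of how the outlier could be picked out in code. findOutlier() is *not* in the outline file -- it only illustrates the idea behind doTask3() -- and it includes compact stand-ins for the Task 1 and Task 2 functions so that it runs on its own:

```python
# Illustrative sketch only: doTask3() in the outline may organize this
# differently.  Compact stand-ins for buildDictionary()/calculateDistance():
def buildDictionary(filename):
    words = open(filename).read().lower().split()
    return {w: words.count(w) / len(words) for w in set(words)}

def calculateDistance(freq1, freq2, wordList):
    return sum((freq1.get(w, 0.0) - freq2.get(w, 0.0)) ** 2
               for w in wordList) ** 0.5

def findOutlier(filenames, stopwords):
    dicts = {name: buildDictionary(name) for name in filenames}
    # Sum each file's distance to every other file; the outlier is the
    # file whose stopword frequencies are, in total, farthest from the rest.
    totals = {name: sum(calculateDistance(dicts[name], dicts[other], stopwords)
                        for other in filenames if other != name)
              for name in filenames}
    return max(totals, key=totals.get)
```

The stopword list itself would come from the program; any fixed list of common words works for the sketch.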
This is an independent task. You may discuss this task only with teaching staff.
In this final task, we will use the same technique, but with keywords instead of stopwords, to look at how similar the legislators are in terms of the issues that they care about (we will assume that legislators are more likely to mention the issues that they care about in their speeches).
1. Comment out the line doTask3() in the program, and uncomment the line doTask4().
2. Download legislators.zip and unzip it. You should get 36 text files, each with the name of a legislator who said something during the 2014-15 year when LegCo was in session. Make sure that the files are in the same directory as your Python program.
3. Run the program. Its results will be saved in politicalView.csv.
4. Add politicalView.csv to the HW2-6 Google Sheets spreadsheet that you created in Task 3. Rename the tab PoliticalView. The similarity matrix should span Columns B:AK and Rows 1:37.
5. Visualize the distance matrix in the same way as similarityVisualization in Activity 1-4. You will probably want to make Columns B:AK narrower so that it will all fit within the screen.
6. In Cell AM2, write a Google Sheets formula that finds the largest distance to Abraham.
7. Cell AM2 now gives a number, which is the distance. That's not very interesting to us. In Cell AN2, use the MATCH and OFFSET functions to find the name of the legislator who is the least similar to Abraham. (Remember that larger distance = smaller similarity. Also, see if you can get just the legislator's name without the trailing ".txt". The Google Sheets function LEFT extracts the leftmost n characters of a string, and the LEN function gets the length of a string. Put those two together and you should be able to strip the .txt extension.)
8. Copy AM2:AN2 down for all the legislators. Put Largest Distance in Cell AM1 and Least Similar Person in Cell AN1.
9. Next, find the most similar legislator. Because each legislator's distance to his/her own speeches is 0, MIN won't work. Luckily, Google Sheets contains a SMALL function which takes a second argument, n, so we can find the nth smallest value. In Cell AO2, find the smallest distance to Abraham's speeches (excluding Abraham's own speeches).
10. In Cell AP2, find the name of the legislator who is the most similar to Abraham.
11. Copy AO2:AP2 down for all the legislators. In Cells AO1 and AP1, put Smallest Distance and Most Similar Person respectively.
12. In Cell A40, write down your answer.
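For example, assuming the legislators' names run across Row 1 and Abraham's distances sit in B2:AK2, formulas along the following lines would work. These are a sketch only -- adjust the ranges to match your own sheet:

```
AM2:  =MAX($B2:$AK2)
AN2:  =LEFT(OFFSET($A$1, 0, MATCH(AM2, $B2:$AK2, 0)),
            LEN(OFFSET($A$1, 0, MATCH(AM2, $B2:$AK2, 0))) - 4)
AO2:  =SMALL($B2:$AK2, 2)
AP2:  =LEFT(OFFSET($A$1, 0, MATCH(AO2, $B2:$AK2, 0)),
            LEN(OFFSET($A$1, 0, MATCH(AO2, $B2:$AK2, 0))) - 4)
```

MATCH finds the position of the distance within the row, OFFSET walks that many columns from A1 to pick up the name in Row 1, and LEFT/LEN together chop the 4-character ".txt" off the end. SMALL(range, 2) skips the 0 on the diagonal (Abraham's distance to himself). When you copy these down, the relative row references update for each legislator automatically.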
Share your Name_StudentID_HW2-6.py program and Name_StudentID_HW2-6 Google Sheets document with PolyUCOMP1D04@gmail.com.
Note: Before you turn in your Python files, make sure they run without any errors (save your Python file, then select Run > Run Module or hit F5 on your keyboard)! If nothing appears in the Shell, don't worry, as long as no red error messages appear. If they don't run, i.e. if red stuff starts appearing in the Shell, points will be taken off!