Homework 2-6

Adapted from Brown University CSCI0931. Used with Permission.

In this homework, we will finally finish what we set out to do: some text analysis on the LegCo speeches.

In the first lecture (still remember that?), we talked about information and media literacy. Among the things that this entails is the ability to "Analyze messages in a variety of forms by identifying the author, purpose and point of view, and evaluating the quality and credibility of the content". In this last homework, we will use Python and text analysis to identify the speaker from LegCo speeches, and their emphasis on certain political issues.

To begin, download the following files. Make sure you save them in the same directory.

Task 1: Building a word frequency dictionary

The first task is to finish the buildDictionary() function. buildDictionary takes in a file, reads it, and builds a word-frequency dictionary for all the words in the file. At the end of the function, it returns the word-frequency dictionary that it constructed.

Take a look at the outline for the buildDictionary() function. The wordFreq variable is there for you already -- that is what the dictionary is called inside the function. The return statement is there for you too. Now fill in the rest of the programming statements. The comments should give you some guidance.
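To make the idea concrete, here is one possible sketch of such a function. This is not necessarily how your outline is structured -- the exact tokenization (lowercasing, how punctuation is handled) is an assumption here, so follow the comments in the outline you were given:

```python
def buildDictionary(filename):
    # wordFreq maps each word to its relative frequency in the file
    wordFreq = {}
    with open(filename) as f:
        words = f.read().lower().split()   # assumption: split on whitespace
    # First pass: count how many times each word appears
    for word in words:
        wordFreq[word] = wordFreq.get(word, 0) + 1
    # Second pass: divide each count by the total number of words
    total = len(words)
    for word in wordFreq:
        wordFreq[word] = wordFreq[word] / total
    return wordFreq
```

Notice that the frequencies are fractions of the total word count, which is why the sample output below shows values like 1/11 ≈ 0.0909 and 2/11 ≈ 0.1818.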

When you are done, run the program by pressing F5. You should get the following output:

Building the word frequency dictionary for cat.txt
The wordFrequency dictionary for cat.txt is {'the': 0.09090909090909091, 'and': 0.09090909090909091, 
'mat': 0.09090909090909091, 'on': 0.09090909090909091, 'cat': 0.09090909090909091, 'sitting': 
0.09090909090909091, 'hat': 0.09090909090909091, 'a': 0.18181818181818182, 'wearing': 
0.09090909090909091, 'is': 0.09090909090909091}
Building the word frequency dictionary for dog.txt
The wordFrequency dictionary for dog.txt is {'the': 0.09090909090909091, 'dog': 0.09090909090909091, 
'and': 0.09090909090909091, 'standing': 0.09090909090909091, 'mat': 0.09090909090909091, 'on': 
0.09090909090909091, 'hat': 0.09090909090909091, 'a': 0.18181818181818182, 'wearing': 0.09090909090909091, 
'is': 0.09090909090909091}

(The line breaks may be in different places, and the order of the keys may be different, but the key-value pairs should be the same.)

Task 2: Writing the distance function

The next task is to write the function calculateDistance(). calculateDistance takes as arguments two word-frequency dictionaries and a list of words that we are interested in, and it calculates the distance between the two word-frequency dictionaries, based on the Pythagorean theorem (this is the Euclidean distance).

If you look inside the calculateDistance function, you should see a variable, distance, which is initialized to 0.0. Your job is to fill in the rest of this function. Remember that the distance is calculated as the square root of the sum of squares of the differences (gosh that is a mouthful!) -- that is, we calculate the difference between the frequencies for each word, then we square it, add up the squared differences, and then, finally, we take the square root.

Remember also that the power operator in Python is **. That is, 3 squared can be calculated as 3**2. Also, remember that square root is equivalent to raising a number to the power 0.5 -- i.e. the square root of 4 can be calculated as 4**0.5.
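Putting those pieces together, the function might look something like the sketch below. This assumes a word missing from a dictionary counts as frequency 0.0 -- check the comments in your outline, since the provided files may guarantee every keyword is present:

```python
def calculateDistance(freq1, freq2, wordList):
    distance = 0.0
    for word in wordList:
        # assumption: treat a missing word as frequency 0.0
        f1 = freq1.get(word, 0.0)
        f2 = freq2.get(word, 0.0)
        distance = distance + (f1 - f2) ** 2   # add up the squared differences
    return distance ** 0.5                     # take the square root at the end
```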

The comments in the function outline should be helpful. When you are done writing the function, run the program by pressing F5 again. This will reload and rerun the program. You should get the following output (after the rest of the output from Task 1):

  The distance between cat.txt and dog.txt based on the keywords is 0.128564869306645

Task 3: Identifying the Author/Speaker

This is an independent task. You may discuss this task only with teaching staff.

In this task, we will use the functions that you just created to identify the author (or, more precisely, the speaker) of a text. You will be given 5 files that contain transcripts of speeches. 4 of the files come from speeches by the same person, and 1 comes from a different speaker (let's call him/her the "outlier").

How would we do this? Well, it turns out that people are fairly consistent in how they use stopwords when writing, and, to a lesser extent, when speaking. Therefore, stopword frequencies can be used as a kind of "signature" for a speaker. We will test the hypothesis that for any text in our list, the one from the outlier will have the biggest difference in stopword frequencies when compared to the other texts.
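One way to test this hypothesis is to sum, for each file, its distances to all the other files: the outlier should have the largest total. The sketch below works on a dictionary mapping filenames to word-frequency dictionaries (as built in Task 1); the function name and arguments are placeholders, not part of the assignment's outline:

```python
def findOutlier(freqDicts, wordList):
    # freqDicts: {filename: word-frequency dictionary}
    # wordList:  the stopwords we compare on
    totals = {}
    for name, d1 in freqDicts.items():
        total = 0.0
        for other, d2 in freqDicts.items():
            if other == name:
                continue
            # Euclidean distance over the words in wordList
            sq = sum((d1.get(w, 0.0) - d2.get(w, 0.0)) ** 2 for w in wordList)
            total += sq ** 0.5
        totals[name] = total
    # The file furthest from everything else is our outlier candidate
    return max(totals, key=totals.get)
```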

Task 4: Calculating Similarities in Political Issues

This is an independent task. You may discuss this task only with teaching staff.

In this final task, we will use the same technique, but with keywords instead of stopwords, to look at how similar the legislators are in terms of the issues that they care about (we will assume that legislators are more likely to mention the issues that they care about in their speeches).
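In code, this amounts to computing the pairwise distances between legislators over a keyword list -- a small distance means two legislators emphasize similar issues. A hedged sketch, with placeholder names and example keywords that may differ from the ones you are given:

```python
def distanceTable(freqDicts, keywords):
    # freqDicts: {legislator: word-frequency dictionary}
    # Returns the distance for each unordered pair of legislators
    names = sorted(freqDicts)
    table = {}
    for a in names:
        for b in names:
            if a < b:   # each pair only once
                sq = sum((freqDicts[a].get(w, 0.0) - freqDicts[b].get(w, 0.0)) ** 2
                         for w in keywords)
                table[(a, b)] = sq ** 0.5
    return table
```

Sorting the resulting table by distance then shows which pairs of legislators are closest on the chosen issues.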

Handin

Share your Name_StudentID_HW2-6.py program and Name_StudentID_HW2-6 Google Sheets document with PolyUCOMP1D04@gmail.com.

Note: Before you turn in your Python files, make sure they run without any errors (save your Python file, then select Run > Run Module or press F5 on your keyboard)! If nothing appears in the Shell, don't worry, as long as no red error messages appear. If they don't run, i.e. if red stuff starts appearing in the Shell, points will be taken off!