In this homework, we will finally finish what we set out to do: some text analysis on the LegCo speeches.
In the first lecture (still remember that?), we talked about information and media literacy. Among the things that this entails is the ability to "Analyze messages in a variety of forms by identifying the author, purpose and point of view, and evaluating the quality and credibility of the content". In this last homework, we will use Python and text analysis to identify the speaker from LegCo speeches, and their emphasis on certain political issues.
To begin, download the following files. Make sure you save them in the same directory.
Name_StudentID_HW2-6.py
The first task is to finish the buildDictionary() function. buildDictionary() takes in a file, reads it in, and then builds up a word-frequency dictionary for all the words in the file. At the end of the function, it returns the word-frequency dictionary that it constructed.
Take a look at the outline for the buildDictionary() function. The wordFreq variable is there for you already -- that is what the dictionary is going to be called inside the function. The return statement is there for you too. Now fill in the rest of the programming statements. The comments should give you some help.
When you are done, run the program by pressing F5. You should get the following output:
Building the word frequency dictionary for cat.txt
The wordFrequency dictionary for cat.txt is {'the': 0.09090909090909091, 'and': 0.09090909090909091, 'mat': 0.09090909090909091, 'on': 0.09090909090909091, 'cat': 0.09090909090909091, 'sitting': 0.09090909090909091, 'hat': 0.09090909090909091, 'a': 0.18181818181818182, 'wearing': 0.09090909090909091, 'is': 0.09090909090909091}
Building the word frequency dictionary for dog.txt
The wordFrequency dictionary for dog.txt is {'the': 0.09090909090909091, 'dog': 0.09090909090909091, 'and': 0.09090909090909091, 'standing': 0.09090909090909091, 'mat': 0.09090909090909091, 'on': 0.09090909090909091, 'hat': 0.09090909090909091, 'a': 0.18181818181818182, 'wearing': 0.09090909090909091, 'is': 0.09090909090909091}
(The line breaks may be in different places, and the order of the keys may be different, but the keys and their linked values should be the same.)
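If you get stuck, here is one minimal way the function could look. This is only a sketch -- the comments in the outline file may ask for slightly different steps. The key observation is that the sample output shows *relative* frequencies, i.e. each word's count divided by the total number of words in the file:

```python
# A minimal sketch of buildDictionary() -- follow your outline's comments
# for the official version.  Assumes the file contains plain words
# separated by whitespace, with no punctuation to strip.
def buildDictionary(filename):
    wordFreq = {}
    with open(filename) as infile:
        words = infile.read().lower().split()
    for word in words:
        # Count each occurrence of the word
        wordFreq[word] = wordFreq.get(word, 0) + 1
    for word in wordFreq:
        # Convert raw counts to relative frequencies (count / total words)
        wordFreq[word] = wordFreq[word] / len(words)
    return wordFreq
```

Note the use of dict.get(word, 0), which returns 0 the first time a word is seen instead of raising a KeyError.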
The next task is to write the function calculateDistance(). calculateDistance() takes as arguments two word-frequency dictionaries and a list of words that we are interested in, and it calculates the distance between the two word-frequency dictionaries, based on the Pythagorean Theorem.
If you look inside the calculateDistance function, you should see a variable, distance, which is initialized to 0.0. Your job is to fill in the rest of this function. Remember that the distance is calculated as the square root of the sum of squares of the differences (gosh that is a mouthful!) -- that is, we calculate the difference between the frequencies for each word, then we square it, add up the squared differences, and then, finally, we take the square root.
Remember also that the power operator in Python is **. That is, 3 squared can be calculated as 3**2. Also, remember that taking a square root is equivalent to raising a number to the power 0.5 -- i.e. the square root of 4 can be calculated as 4**0.5.
The comments in the function outline should be helpful. When you are done writing the function, run the program by pressing F5 again. This will reload the program and rerun it. You should get the following output (after the rest of the output from Task 1):
The distance between cat.txt and dog.txt based on the keywords is 0.128564869306645
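Putting the formula into code, one possible shape for the function is the sketch below. The get() default handles a word that appears in only one of the two dictionaries -- check whether your outline's comments want this handled the same way:

```python
# A sketch of calculateDistance() -- the square root of the sum of
# squared differences in word frequencies, over the words in wordList.
def calculateDistance(freq1, freq2, wordList):
    distance = 0.0
    for word in wordList:
        # A word missing from a dictionary counts as frequency 0
        f1 = freq1.get(word, 0.0)
        f2 = freq2.get(word, 0.0)
        distance = distance + (f1 - f2) ** 2   # add up the squared differences
    return distance ** 0.5                     # square root, via ** 0.5
```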
This is an independent task. You may discuss this task only with teaching staff.
In this task, we will use the functions that you just created to identify the author (or, more precisely, the speaker) of a text. You will be given 5 files that contain transcripts of speeches. 4 of the files come from speeches by the same person, and 1 comes from a different speaker (let's call him/her the "outlier").
How would we do this? Well, it turns out that people are fairly consistent in how they use stopwords when writing, and, to a lesser extent, when speaking. Therefore, stopword frequencies can be used as a kind of "signature" for a speaker. We will test the hypothesis that for any text in our list, the one from the outlier will have the biggest difference in stopword frequencies when compared to the other texts.
1. Comment out the line doTasks1_and_2() in the program (all the way at the bottom), and uncomment the line doTask3().
2. Download speeches.zip and unzip it. You should get five text files, a0.txt, a1.txt, a2.txt, a3.txt, a4.txt. Make sure that the files are in the same directory as your Python program.
3. Run the program. Its results will be saved in authorship.csv.
4. Open authorship.csv in Google Sheets and rename the tab Authorship. Name the spreadsheet Name_StudentID_HW2-6.
5. Visualize the distance matrix in the same way as similarityVisualization in Activity 1-4.
6. In Cell A10 of the Authorship sheet, write down which file you think comes from the outlier.
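To make the hypothesis concrete, here is a sketch of how the outlier could be picked out in code. findOutlier() is *not* in the outline file -- it only illustrates the idea behind doTask3() -- and it includes compact stand-ins for the Task 1 and Task 2 functions so that it runs on its own:

```python
# Illustrative sketch only: doTask3() in the outline may organize this
# differently.  Compact stand-ins for buildDictionary()/calculateDistance():
def buildDictionary(filename):
    words = open(filename).read().lower().split()
    return {w: words.count(w) / len(words) for w in set(words)}

def calculateDistance(freq1, freq2, wordList):
    return sum((freq1.get(w, 0.0) - freq2.get(w, 0.0)) ** 2
               for w in wordList) ** 0.5

def findOutlier(filenames, stopwords):
    dicts = {name: buildDictionary(name) for name in filenames}
    # Sum each file's distance to every other file; the outlier is the
    # file whose stopword frequencies are, in total, farthest from the rest.
    totals = {name: sum(calculateDistance(dicts[name], dicts[other], stopwords)
                        for other in filenames if other != name)
              for name in filenames}
    return max(totals, key=totals.get)
```

The stopword list itself would come from the program; any fixed list of common words works for the sketch.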
This is an independent task. You may discuss this task only with teaching staff.
In this final task, we will use the same technique, but with keywords instead of stopwords, to look at how similar the legislators are in terms of the issues that they care about (we will assume that legislators are more likely to mention the issues that they care about in their speeches).
1. Comment out the line doTask3() in the program, and uncomment the line doTask4().
2. Download legislators.zip and unzip it. You should get 36 text files, each with the name of a legislator who said something during the 2014-15 year when LegCo was in session. Make sure that the files are in the same directory as your Python program.
3. Run the program. Its results will be saved in politicalView.csv.
4. Add politicalView.csv to the HW2-6 Google Sheets spreadsheet that you created in Task 3. Rename the tab PoliticalView. The similarity matrix should span Columns B:AK and Rows 1:37.
5. Visualize the distance matrix in the same way as similarityVisualization in Activity 1-4. You will probably want to make Columns B:AK narrower so that it will all fit within the screen.
6. In Cell AM2, write a Google Sheets formula that finds the largest distance to Abraham.
7. Cell AM2 now gives a number, which is the distance. That's not very interesting to us. In Cell AN2, use the MATCH and OFFSET functions to find the name of the legislator who is the least similar to Abraham. (Remember that larger distance = smaller similarity. Also, see if you can get just the legislator's name without the trailing ".txt". The Google Sheets function LEFT extracts the leftmost n characters of a string, and the LEN function gets the length of a string. Put those two together and you should be able to strip the .txt extension.)
8. Copy AM2:AN2 down for all the legislators. Put Largest Distance in Cell AM1 and Least Similar Person in Cell AN1.
9. Next, find the most similar legislator. Because each legislator's distance to his/her own speeches is 0, MIN won't work. Luckily, Google Sheets contains a SMALL function which takes a second argument, n, so we can find the nth smallest value. In Cell AO2, find the smallest distance to Abraham's speeches (excluding Abraham's own speeches).
10. In Cell AP2, find the name of the legislator who is the most similar to Abraham.
11. Copy AO2:AP2 down for all the legislators. In Cells AO1 and AP1, put Smallest Distance and Most Similar Person respectively.
12. In Cell A40, write down your answer.
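For example, assuming the legislators' names run across Row 1 and Abraham's distances sit in B2:AK2, formulas along the following lines would work. These are a sketch only -- adjust the ranges to match your own sheet:

```
AM2:  =MAX($B2:$AK2)
AN2:  =LEFT(OFFSET($A$1, 0, MATCH(AM2, $B2:$AK2, 0)),
            LEN(OFFSET($A$1, 0, MATCH(AM2, $B2:$AK2, 0))) - 4)
AO2:  =SMALL($B2:$AK2, 2)
AP2:  =LEFT(OFFSET($A$1, 0, MATCH(AO2, $B2:$AK2, 0)),
            LEN(OFFSET($A$1, 0, MATCH(AO2, $B2:$AK2, 0))) - 4)
```

MATCH finds the position of the distance within the row, OFFSET walks that many columns from A1 to pick up the name in Row 1, and LEFT/LEN together chop the 4-character ".txt" off the end. SMALL(range, 2) skips the 0 on the diagonal (Abraham's distance to himself). When you copy these down, the relative row references update for each legislator automatically.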
Share your Name_StudentID_HW2-6.py program and Name_StudentID_HW2-6 Google Sheets document with PolyUCOMP1D04@gmail.com.
Note: Before you turn in your Python files, make sure they run without any errors (save your Python file, then select Run > Run Module or hit F5 on your keyboard)! If nothing appears in the Shell, don't worry, as long as no red error messages appear. If they don't run, i.e. if red stuff starts appearing in the Shell, points will be taken off!