In this homework, you will start from the last task in Activity 2-4 from class. You will look at a different way of doing the same problem.
Some words of wisdom: your program will most likely not work the first try. Advice to avoid sitting there clueless about what went wrong:
The first task makes sure that the starter code works for you and provides you examples of iterating through lists in a different way.
Download HW2-4.py, which is more or less what you wrote for the final task in ACT2-4. Also download MobyDickShort.txt and MobyDick.txt and save them to the same folder as your program.
The starter code contains for loops that demonstrate how to iterate through a list in two different ways. You are familiar with the first one. The second one, however, gives you more power, since it allows you to look at the elements before and after the current element during the loop. For example, you can print 'facebook comes after twitter' using the second technique, but you cannot do that with the first.
Run the following two lines:
vocab = buildVocab('MobyDickShort.txt')
print("The size of the vocabulary is", len(vocab))
The output should be:
The size of the vocabulary is 1013
If you run into trouble on the last step, make sure that MobyDickShort.txt and MobyDick.txt are saved to the correct folder (they have to be in the same folder as your program). Run the program by hitting F5. Then inspect the value of the variable vocab by typing vocab in the interactive interpreter. You should see a list of strings (words), and if you then inspect the length of vocab (using the built-in function len(), as in len(vocab)), it should say 1013. If the program does not work as expected, send us an email describing what you did and any error messages you got. Remember to make an honest effort to solve the problem yourself before contacting us!
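The two ways of iterating through a list described above can be sketched as follows. This is only an illustration, not the starter code itself: the list sites and its contents are made up for the example.

```python
# Hypothetical example list (not taken from HW2-4.py).
sites = ["google", "twitter", "facebook", "youtube"]

# Technique 1: loop over the elements directly.
# Inside the loop we only know the current element.
for site in sites:
    print(site)

# Technique 2: loop over the indices with range().
# Starting at 1 means every element has a previous element to look at.
messages = []
for index in range(1, len(sites)):
    current = sites[index]
    previous = sites[index - 1]
    messages.append(current + " comes after " + previous)

for message in messages:
    print(message)
```

With the second technique, the index gives access to the neighbours of the current element, which is what makes a message like 'facebook comes after twitter' possible.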
You can probably imagine that the way we get rid of duplicates in a list is slow. Now imagine another way in which we can do the same thing. If we have a list:
myList = ["kitten", "cat", "dog", "apple", "cat", "dog"]
and we sort it first:
myList = ["apple", "cat", "cat", "dog", "dog", "kitten"]
then, as we go through myList, we can check the current element against the previous one. If it is the same as the previous one, then it's already in the new list. Otherwise, we add it in.
For example, suppose that we are at the element at index 1, the first "cat". We check the previous element, at index 0. That is "apple". Since the list is sorted, we know that we haven't seen "cat" before, and so we can add it to the new list.
Now suppose that we go to the next element, which is at index 2 (the second "cat"). We check the previous element, at index 1. That is "cat", which is the same as the current element. Therefore, we skip over this element and do not add it to the new list.
This method is a lot faster than the previous one. Python also has a sort function that will sort a list quickly.
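The walk-through above can be traced with a few lines of Python. This is a sketch only: it sorts the example list and reports, for each index, whether the current element differs from the previous one. The list isNew is a name invented for this illustration, and building the new list from these comparisons is the part left for you.

```python
myList = ["kitten", "cat", "dog", "apple", "cat", "dog"]
myList.sort()  # sorts in place, so duplicates end up next to each other
print(myList)

# Compare each element with the one before it (indices 1, 2, 3, ...).
isNew = []  # hypothetical helper list, used only for this trace
for index in range(1, len(myList)):
    current = myList[index]
    previous = myList[index - 1]
    isNew.append(current != previous)
    print(index, previous, current, current != previous)
```

At index 1 the comparison is True (the first "cat" follows "apple"), and at index 2 it is False (the second "cat" follows "cat"), matching the two cases described above.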
Try to implement removeDuplicatesFast. The instructions are as follows:
1. Write a for loop that uses 1 instead of 0 as the first argument to range(), so that the loop starts at the second element.
2. Name the loop variable (the name right after the for keyword) index. Then each iteration of the loop, index gets assigned a different value (1, 2, 3, ...).
3. Assign the element at index to a variable called current, and the element just before it to a variable called previous.
4. Write an if statement to check if current is different from previous. If current and previous are different, this means we have not seen current before, so we want to add it to the new list (check Line 34 for the syntax; you have two options for how to do this: either an append or a +).
5. Once the loop has finished building the new list, return it.
6. Write a test function called testRemoveDuplicatesFast(). Provide at least three test cases. The point of a test function is to provide tricky test cases that might fool the function you're testing, so come up with interesting cases. What should the output be if all the input words are the same? What if they're all different? Remember, removeDuplicatesFast() sorts the words you give it before filtering them. Verify using the test function that removeDuplicatesFast() works properly.
Now, let's deploy our new method of removing duplicates.
1. Change the vocabulary function so that it uses removeDuplicatesFast() instead of removeDuplicatesSlow().
2. [...] removeDuplicatesFast() function definition.
3. Where readFile() is called, instead of reading in MobyDickShort.txt, try MobyDick.txt. We are no longer afraid of processing large files!
4. Try using removeDuplicatesSlow() to go through the whole MobyDick.txt. It took 10 minutes to run on my computer. How long does it take on yours?
Congratulations! You have just completed a software upgrade.
Rename your program Name_StudentID_HW2-4.py and share it with PolyUCOMP1D04@gmail.com.
Note: Before you turn in your Python files, make sure they run without any errors! (Save your Python file. Then select Run > Run Module or hit F5 on your keyboard.) If nothing appears in the Shell, don't worry as long as no red error messages appear. If they don't run, i.e. if red error messages start appearing in the Shell, points will be taken off!