Homework 2-4

Adapted from Brown University CSCI0931. Used with Permission.

Reminders

For the following problems you may discuss the concepts that will help solve these problems with classmates and course staff. You may not simply copy down the answers of your classmates as that is a violation of the collaboration policy. The one exception to this rule are those problems marked as “(Independent)”. You may discuss independent problems with course staff only.
When you write your functions, remember to write down what your functions does and what the arguments mean, by commenting your code. Remember that comments start with a #.

Advice

In this homework, you will start from the last task in Activity 2-4 from class. You will look at a different way of doing the same problem.

Some words of wisdom: your program will most likely not work the first try. Advice to avoid sitting there clueless about what went wrong:

Task 1

The first task makes sure that the starter code works for you and provides you examples of iterating through lists in a different way.

  1. Download the starter code HW2-4.py, which is more or less what you wrote for the final task in ACT2-4. Also download MobyDickShort.txt and MobyDick.txt and save it to the same folder as your program.
  2. Take a look at the starter code. There are three portions:
    1. Lines 1-63 are the functions that we wrote in class for ACT 2-4. You may not have implemented them in exactly the same way, but the general idea should be the same.
    2. Lines 64-101 are the functions that you will be writing for this homework.
    3. Lines 102-125 are the programming statements that don't belong in any function. Python needs to read in the function definitions before it can use them, and it always reads from top to bottom, so it's good practice to put all programming statements at the bottom.
  3. Lines 114-115 and Lines 121-125, show two for loops that demonstrate how to iterate through a list in two different ways. You are familiar with the first one. The second one, however, gives you more power, since it allows you to look at the elements before and after the current element during the loop. For example, you can print 'facebook comes after twitter' using the second technique, but you cannot do that with the first.
  4. At the very very bottom of the file, we want to write in two programming statements that will call our vocabulary-building function, and then calculate the size of the vocabulary:
    vocab = buildVocab('MobyDickShort.txt')
    print("The size of the vocabulary is", len(vocab))
    Then let the program run. It should print out a bunch of things, and at the end, something like The size of the vocabulary is 1013.

If you run into trouble on the last step, make sure that MobyDickShort.txt and MobyDick.txt is saved to the correct folder (it has to be in the same folder as your program). Run the program by hitting F5. Then, inspect the value of the variable vocab by typing vocab in the interactive interpreter. You should see a list of strings (words), and if you then inspect the length of vocab (using the built-in function len(vocab)), it should say 1013. If the program does not work as expected, send us an email with what you did and errors you get if any at all. Remember to give an honest effort to solve your problem(s) before contacting us!

Task 2

You can probably imagine that the way we get rid of duplicates in a list is slow. Now imagine another way in which we can do the same thing. If we have a list:

myList = ["kitten", "cat", "dog", "apple", "cat", "dog"]
and suppose that we have some way of quickly sorting it by alphabetical order to get
myList = ["apple", "cat",  "cat", "dog", "dog", "kitten"]
Hopefully, you can see how this makes our life easier. On each iteration as we pass through myList, we can check the current element against the previous one. If it is the same as the previous one, then it's already in the new list. Otherwise, we add it in.

For example, supposing that we are at element index 1, the first "cat". We check the previous element, at index 0. That is "apple". Since the list is sorted, we know that we haven't seen "cat" before, and so we can add it to the new list.

Now suppose that we go to the next element, which is at index 2 (the second "cat"). We check the previous element, at index 1. That is "cat", which is the same as the current element. Therefore, we skip over this element and do not add it to the new list.

This method is a lot faster than the previous one. Python also has a sort function that will sort a list quickly.

Try to implement removeDuplicatesFast. The instructions are as follows:

  1. Loop through the sorted word list in the iterating-by-index way (refer to the provided example in Lines 121-125 in the Python code). One subtle difference is that you want to start with the second word, since (1) you already put the first one in the result list, and (2) you will compare a word to the previous one, but the first word has no previous. Start a for-loop as such, by supplying 1 instead of 0 as the first argument to range().
  2. Now we write the body of the loop (statements that will be executed multiple times). Suppose you called your loop variable (the variable right after the for keyword) index. Then each iteration of the loop, index gets assigned a different value (1, 2, 3, ...).
    1. Write an assignment statement that assigns the element of the list at position index to a variable called current. (Remember how to index into a list? The square brackets?)
    2. Write another assignment statement that assigns the element of the list at the previous position to a variable called previous.
    3. Write an if statement to check if current is different from previous. If current and previous are different, this means we have seen current before. If not, we want to add it to the list (check Line 34 for the syntax -- you have two options on how to do this -- either an append, or a +).
  3. After the loop is completed (be careful with the indentation), the result should hold the list without duplicates. return it.
  4. Fill in the test function testRemoveDuplicatesFast(). Provide at least three test cases. The point of a test function is to provide tricky test cases that might fool the function you're testing, so come up with interesting cases. What should the output be if all the input words are the same? What if they're all different? Remember, removeDuplicatesFast() sorts the words you give it before filtering them. Verify using the test function that removeDuplicatesFast() works properly.

Task 3

Now, let's deploy our new method of removing duplicates.

  1. Modify the vocabulary function so that it uses removeDuplicatesFast() instead of removeDuplicatesSlow().
  2. Run the program as in Task 1 and inspect the vocabulary list and its size. You should get the same answer as before (both methods are correct, just that the new one is faster). Make sure you understand why this new way of getting rid of duplicates works and is faster than the old one. Write your explanation in the comments before the removeDuplicatesFast() function definition.
  3. Now, in the function readFile(), instead of reading in MobyDickShort.txt, try MobyDick.txt. We are no longer afraid of processing large files!
  4. Run the program again. Are you impressed with how fast it computes the vocabulary list? This time, do not try to inspect the whole list because it takes too long to print. You can also try to use removeDuplicatesSlow() to go through the whole MobyDick.txt. It took 10 minutes to run on my computer. How long does it take on yours?

Congratulations! You have just completed a software upgrade.

Handin

Rename your program Name_StudentID_HW2-4.py and share it with PolyUCOMP1D04@gmail.com.

Note: Before you turn in your Python files, make sure they run without any errors(Save your Python file. Then select Run > Run Module or hit F5 on your keyboard)! If nothing appears in the Shell, don't worry as long as no red error messages appear. If they don't run, i.e. if red stuff starts appearing in the shell, points will be taken off!