Homework 1-4

Adapted from Brown University CSCI0931. Used with Permission.
Due Oct 19, 2015, 23:59

Reminders

For the following problems you may discuss the concepts that will help solve these problems with classmates and course staff. You may not simply copy down the answers of your classmates as that is a violation of the collaboration policy. The one exception to this rule is Task 5, marked as “(Independent)”. You may discuss independent problems with course staff only.

Setup

Copy this spreadsheet to your Google Drive and name it Name_StudentID_HW1-4.

Task 1

Matrix multiplication

In class last week, we built a spreadsheet that took all the voting records (+1 for Yes, -1 for No, 0 for missing, etc.) and compared them pairwise. The comparison used the idea that if you multiply two voting records and sum up the resulting products, you get the "agree minus disagree" number. (If you take absolute values before multiplying, you get "agree + disagree" instead). This notion of sums of pairwise products comes up over and over again in mathematics, economics, etc. It's almost never done with pairs of rows from a table, but instead with a row from one table and a column from another. When we do this, each table (which should be a table of NUMBERS) gets called a MATRIX (plural: matrices), and the process of computing all possible row-column sum-products is called "matrix multiplication." You're going to play with that a little bit in this problem.

  1. In the "Matrix Practice" sheet, you'll see matrix A, in pink, whose rows are labeled with the names of people, and whose columns are labeled with items for purchase. You'll also see matrix B, a table of prices (in dollars per item) for the three items, for each of two months. Notice that the number of COLUMNS of A is the same as the number of ROWS of B.

    Here is a description of how to compute the matrix product A* B:
    We place the matrices as shown in B12 through F16: the first one to the left, the second one up and to the right. We then fill in the space (marked in green) in the lower right. Each cell in the green space lies in the same row as some row of matrix A and in the same column as some column of matrix B. We place in this cell the SUMPRODUCT of that row and that column. When you use SUMPRODUCT, you can pass either two rows or two columns. To have the function to work with a row and a column, use the TRANSPOSE function in one of the arguments, transforming a row into a column or a column into a row. Your function should look like:

    SUMPRODUCT("row", TRANSPOSE("column"))
    or
    SUMPRODUCT(TRANSPOSE("row"), "column")

    where "row" and "column" are specified in the matrices.
  2. Do that now for the upper left corner of the green area, and use fill handles to extend the definition to the rest of the green area. The SUMPRODUCT formula you wrote wasn't too complex, but there's a far simpler way to produce the same answer.
  3. In the second area (rows 20-24), enter, in cell E23, the formula = MMULT(something, something else ) where the "something" is the range that defines the pink matrix, and the "something else" is the range defining the blue matrix. The results should be the same as those in the green area above.
  4. MMULT doesn't actually require that its two arguments (the things inside the parentheses) be placed in the way we placed them for the SUMPRODUCT computation: they can be anywhere. You'll see that in E7 to F8, we've made another green box. Enter a formula in E7 that will multiply the two original matrices (just to the left of the green box) and produce a result that should be the same as the one you got for the last two steps.

Task 2

Meaning of matrix multiplication

If we write the ROW labels for matrix A and the COLUMN labels for matrix B near our green matrix, the top left corner is "Robin, July". What does the number 20 that's in that spot mean? Well, it was computed as (1 x 2) + (2 x 3) + (3 x 3) so it's the total cost for Robin to buy her basket of goods in july: 1 soap at $2, 2 eggs at $3, and 3 pears at $4.

  1. At the bottom of the page, fill in the two yellow cells by looking up the appropriate values in any of the green arrays.

Task 3

Experiments with coin flips, making plots, etc.

Open the "Random Coin Flips" tab of the spreadsheet. Using the RANDBETWEEN() function, we've made 25 rows of 8 random numbers each, representing coin-flips: 0 is heads, 1 is tails. You're going to make a summary of these results. You might expect that if you flip a coin 8 times, on average it'll come up heads four times.

  1. Test that conjecture by entering, in column I, a formula to compute how many heads you got in each 8-flip trial. It'd be best if you entered just one formula in cell I2, and then filled down, of course.
  2. What's the smallest number you could possibly get in column I? (Hint: zero). The largest? Fill in your answers in the spaces provided.
  3. In cell A31, enter the word "Value", and beneath it enter a list of the possible values, from smallest to largest, starting with the smallest in cell A32.
  4. In cell B31, enter the label "Value Count", and in cell B32, enter formula to count how many times, in the 25 experiments, you got exactly zero heads. (Please try not to put the numbers directly into the formula. Make reference to the data in A32:A40). Fill that formula down, and see whether it gives you the counts of how many times you got 1, 2, 3, etc. heads.
  5. In cells A31 to B40 (more or less), you've got some data. Try to figure out a way to make a graph of that data, with the "values" on the x-axis and the value-counts on the y-axis. It could be a line graph, or bar graph, or anything else. Insert the graph onto this sheet.

Task 4: Data importing

Now we're going to go over some more data importing. We have been importing XML and CSV files so far, but IMPORTXML also allows us to import HTML files.

The first thing that we need to do is to disable Javascript on our browser. Javascript is a programming language that generates fancy effects (like pages that change content without reloading, etc). However, IMPORTXML does not understand Javascript. Therefore, we need to see the webpage just as IMPORTXML would "see" it.

Note: We could disable Javascript completely in our browser, but the problem is that Google Sheets requires Javascript to run! Therefore, we need to do this:

  1. Follow the instructions on this page. Click on "Do not allow any site to run Javascript"
  2. Do not click "Done" yet. Click on "Manage Exceptions..." in the Javascript section.
  3. We are going to tell Google Chrome that we will allow Google Docs to run Javascript, even though nobody else is allowed to do so. In the window that comes up, in "Hostname Pattern", put in https://docs.google.com. Make sure that the corresponding behavior is "Allow". Click on "Done" to close the window.
  4. Click on "Done" to close the Content Settings window. Javascript should now be disallowed on your browser. Note that this will "break" many sites. To allow it again when you're done with this hoemwork, you can go back into Settings and reenable it.

Go to the ImportExamples tab of the spreadsheet. At the top is an example of performing a google search and using importXML to extract the data that was returned (i.e., all the web pages you could click on). The xpath part of the formula is more complex than anything we saw in class; it's

//h3[@class='r']/a/@href

Let's interpret that. The first part, //h3, says to search for anything with the tag h3, which is the style used to display results in a google search.

To see this, go to Google (go to www.google.com, do not search from the address bar), search for "PolyU", and when the results show up, note two things:

  1. The URL has (probably) changed to a long, complicated string of text that contains https://www.google.com/search?q=PolyU somewhere within it.
  2. Place your cursor above the first site returned (the PolyU official page), right click and select "Inspect Element". A new panel will open down below showing you all about this element: it has an h3 tag, with the attribute class="r".
  3. Look in cell B3 and examine the XPath of the IMPORTXML command. The first part says "look for h3 tags with the class attribute equal to 'r'". The second part says "and within there, look for 'a' tags, and tell me the value of the 'href' attribute." It turns out that "a" tags are hyperlinks, and the "href" attribute is the URL for the linked page.
  4. Now click on cell B8 and note that it contains the IMPORTXML command. The cells from B8:B21 contain the search results from the Google search. These search results were imported by the IMPORTXML command.
  5. We need to do a little processing on the search results returned by Google:
    • The URLs found by this XPath begin with a "/url..." because they are relative URLs. In other words, all these links require "https://www.google.com" prepended to them in order to work. The absolute URL for each result is kept in the data-href attribute, but we cannot access these because of a bug in Google Spreadsheets (it cannot correctly parse hypenated names). To get around this, in cells C8:C21, we use substitute(cell name or string, target, replacement) to replace every instance of "/url?q=" with "", effectively getting rid of it. We also use substitute to get the right urls for image or news searches google returns, since those behave a bit differently than regular url results.
    • Another problem is that the URLs contain a command which instructs the browser to load up the cached version (a pre-saved version of the page) if the original site is inaccessible for some reason. If you look closely at the strings in B8:B21, you will see that almost all of them contain &sa=U&ved. Everything after this should be stripped away in order to get the correct URL. That's what the cells in D8:D21 do.
  6. We want you to fill in the formula in cell B5 to instead return the number of "results" listed at the top of the page (about 641,000 when we tried it!). Once again, use "Inspect Element" to help you. A BIG hint: if you right-click on the needed data in the "Elements" window that comes up, you will be able to copy the XPath directly. Compare that with the XPath that you would construct yourself.
  7. Let's try importing another kind of data. Create a new sheet called "MTR" and import the MTR routes and fares.

    In cell A1, use the IMPORTDATA command, using the following URL as the data source: "http://resource.data.one.gov.hk/mtr/data/mtr_lines_fares.csv"

    What we have just imported is online CSV data.

Task 5 (Independent)

Building your own app

Some words are more common than others; a quick way to tell is to do a quick google search. "Synchronicity" has about 5 million results, while "Diachronicity" has only about 9000.

Open a new tab on your spreadsheet. In that tab, build a spreadsheet that lets a viewer enter two words and then shows which one has more results in a Google search.

To find the URL, you'll do a Google search for some word and see what the URL is, and then try to alter that to handle other words. But beware: if you type your query in the address bar (as I often do), the resulting URL will look like this:
https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#safe=off&q=synchronicity
or possibly something even more complex. On the other hand, if you go to google.com and type your query there, the URL of the results page will be more like
https://www.google.com/search?safe=off&q=synchronicity
The part about safe=off can be safely removed, so all you need is .../search?q=synchronicity. Use this simpler form. Google Spreadsheets doesn't seem to be able to read from the more complicated URL for the results page.

Your spreadsheet should have the following:

Handin

Share the following spreadsheet with PolyuComp1D04@gmail.com

  • Name_StudentID_HW1-4
  • In the above file name, "Name" should be replaced with your official name and "StudentID" with your studentID. If it is not, we will take off points. Make sure every task has been completed.