Activity 1-1

Adapted from Brown University CSCI0931. Used with Permission.

Task 0: Get the Files We Need

  1. Visit this web page to access all the Legco meeting records. Scroll down until you get to the "Meeting Schedule, Voting Results and Records of Proceedings" table. We will use the "Minutes/Voting Results" for the meeting on "7 January 2015". Click on the "XML" file.

    We'll soon be importing data like this into a spreadsheet. Google Spreadsheets has a function, "ImportXML", for doing this, and Excel (on Windows) has something similar.

    If we were using Excel, you'd first have to download the data from the link above as an XML file and then convert it to CSV. We're first going to teach you how to import a CSV file into Google Spreadsheets from your desktop and then we'll teach you how to import data directly from the internet.

    We have already pre-processed the file and filtered out the data we need. The data is stored in something called CSV (Comma Separated Values) format. Open this formatted file by left-clicking it. You may need to change the text encoding (View->Encoding) to "Unicode (UTF-8)" to see the Chinese characters.
    What you're looking at is data in CSV format. Files saved like this should normally be named with a .csv ending, although we added a .txt so you could see what it actually looks like. (Files with a .csv ending are automatically downloaded or opened up in Google Spreadsheets by Chrome, so you can't see the file contents directly.) You should be able to see how it correlates to the XML data you saw on the website. We have removed some of the excess data fields before creating the CSV file, so not every field from the XML is present in the CSV file.
  2. Click the download icon in the center of the toolbar at the top of the document, and download it to your desktop

Task 1: Import a CSV File into Google Spreadsheets

We now have a CSV file that encodes senate data. We want to get this data into a spreadsheet. Go to Google Sheets. Then go to File > Import > Upload. Pick the Legco data file that you downloaded. You will be prompted to choose how the file will be imported. Select Insert new sheet(s) as your import action, tab as the separator character, and click import. A new sheet should open up containing data from your CSV file.

Questions

  1. Look at the table. How many rows are there?
  2. Examine all the columns. How are these related to the XML/CSV file itself?
  3. Which columns contain useful information?
  4. Suppose you have another CSV file with a set of older (pre-2012) Legco records. How would you add the information in that file to this table?

Task 2: Import data from an online XML file

In Task 1, we did most of the work for you (converting the Legco records to CSV). Suppose you didn't have us to do that for you. We'll now teach you how to import data directly from the internet.
  1. We have downloaded all the Legco XML records to the PolyU machines so we don't overload the Legco machines for our class! Visit this web page to get the voting records for January 7, 2015.
  2. Create a new sheet in the spreadsheet from Task 1, and in cell B3 enter the URL of the voting record. In cell B2, enter "URL:" as a label for B3.
  3. In cell D3, enter =importXML(B3, "//vote[@number=1]/*/member/@name-ch"). In a moment or two, the rest of the column should be filled with legislator names.
  4. Now, in cell E3, enter something that will generate the XPath that we will feed into importXML to import each legislator's votes for vote number 1. Ask the TA to check it for you.
  5. Now in cell F3, type in an importXML that will use the data in $B$3 and $E3 to import in the vote for Vote 1 for Jasper Tsang Yok Sing. Check that the vote is really correct by looking at the XML document.
  6. Extend the formula to import the votes for the rest of the legislators. Why did we put the $ signs around B3?
  7. In cell F2, enter something that will import the text of the item being voted on.

Questions

  1. Look at the table of names and votes. How many rows are there?
  2. Can you think of a way that you could have found out the number of rows before the legislator names were loaded?
  3. We had you enter the URL in one cell, and then refer to it from other cells. Change the URL to update the spreadsheet to display vote 2 rather than vote 1. Would this have been harder if you'd put the URL directly into each importXML expression? Would it be possible to make this even easier?

Task 3: Format the Data

  1. We're going to be using the spreadsheet that you imported from the CSV file in Task 1 for this. Rename it to something like "Activity 1-1".
  2. Get familiar with the data. How many rows are there? How many votes are in the first session? How many votes in the second session?
  3. We'll spend some time formatting the data to make it easier to work with.
    1. Click on the little arrow to the right of Column F ("name-ch"), and insert a new column to the right.
    2. Create a header for that new column. Call it "name".
    3. Each of the cells in that column should be a combination of the legislator's English and Chinese name. We'll start with the Chinese name so it's easier to sort with later (the English naming convention is inconsistent). In cell G2, enter something that will combine TSANG Yok-sing's English and Chinese names to give you "曾鈺成 TSANG Yok-sing".
    4. Using Fill Handles (or better yet, keyboard shortcuts), extend that formula all the way to the end of the spreadsheet.
  4. The next thing that we're going to do is to make a new, trimmed spreadsheet.
    1. Select the entire table by clicking on the cell in the top left corner of the spreadsheet (the cell to the left of the header row (A, B, C, D, ...) and to the top of the header column (1, 2, 3, ...)).
    2. Go to Edit->Copy, or better yet, use a keyboard shortcut. to copy the entire spreadsheet. This may take a few seconds.
    3. Click on the "+" sign at the bottom left of the spreadsheet to add a new sheet.
    4. Click on the new tab that comes up, then go to "Edit->Paste Special->Paste Values Only". This will copy the raw values from the old sheet over to the new one. This may take a few seconds.
    5. Now spend a few moments deleting unwanted data. We only need the following columns:
      • vote-date
      • number
      • motion-en
      • motion-ch (optional, if you want to keep it)
      • name
      • vote
  5. Rename the new spreadsheet with a different, descriptive name (maybe use the word "trimmed").
  6. We're going to start messing around with the order of the items in the sheet in a bit. However, while we do that, we do not want the first row -- the header row -- to get messed up. So we want to freeze it so it always stays as the header row. Go to "View->Freeze", and then select "1 Row". Your header row should always stay as the top row now, no matter how much you scroll up or down (try it.)
  7. Now, we are going to cut out some of the less interesting motions from the records (otherwise it'll be too big for Google Spreadsheets to handle). If you look at the names of the motions, you will see that there're a lot of motions that the legislators have to go through; we're going to keep only the votes from the first time that a bill is proposed, and when a motion is moved. So we need to look for the rows with a motion that starts with "FIRST PROPOSED RESOLUTION..." and "MOTION MOVED BY...". We want to keep only those rows which contain those words in the "motion-en" column. Figure out how to delete all the rows of the spreadsheet that correspond to votes that are not of interest. How can we tell which rows correspond to those which contain the needed words? Take a look at the motion-en column. Some of the cells in this column contain the words "FIRST PROPOSED RESOLUTION", or "THE FIRST PROPOSED RESOLUTION..." or "MOTION MOVED BY...". We want to keep only the rows with similar text in the motion-en column. Hint: this should only take a few minutes; it does not involve deleting the unwanted rows one by one. Also, it's possible to sort rows in alphabetical order. Hint 2: If you've learned elsewhere about "Filter"s for data, please don't use them; the results may not work well with later steps! Hint 3: You can select a range of rows by selecting top row with the Shift key held down, and then select the bottom row with the Shift key still held down.
  8. The resulting spreadsheet should be a lot smaller — how many rows are there now? You should have 2031 rows, including the header.
  9. We want a unique identifier for the vote of each bill in this congress. Which two columns together produce a unique identifier for each vote?
  10. Add another column to the table by entering vote_id column in cell F1. In row 2 of the vote_id column, write a formula to combine the “vote-date” for this row with the “number” for this row, placing a colon between them. Use Fill down to apply this formula to all the other rows.
    ABCDEF
    1vote-datenumbermotion-ennamevotevote-id
    2201303201FIRST PROPOSED RESOLUTION...何俊仁 Albert HOPresent20130320:1
    3201303201FIRST PROPOSED RESOLUTION...何俊賢 Steven HOYes20130320:1
    4..................
  11. Add another column to the table, just to the right of vote_id, and title it numerical-vote. In the second row of that column, enter a formula that produces 1 if a legislator voted “No,” and a 2 if s/he voted “Yes.”, and zero for anything else. Verify that nothing went wrong and that no cell contains the word “ERROR.” Hint: Try writing an expression that uses multiple IF functions "nested" within each other. The table should now look like this:
    ABCDEFG
    1vote-datenumbermotion-ennamevotevote-idnumerical-vote
    2201303201FIRST PROPOSED RESOLUTION...何俊仁 Albert HOYes20130320:12
    3201303201FIRST PROPOSED RESOLUTION...何俊賢 Steven HOYes20130320:12
    4.....................

Task 4: Summarize the Data with a Pivot Table

  1. Select the entire table (do you remember how to do that?) and select Data > Pivot Table from the toolbar. This will open up a new sheet for a pivot table based on the information you selected. If you don't know how to use pivot tables, review the tutorial.
  2. Let the row labels be Name, the column labels be vote_id, and the value in the cells be the numerical_vote that you just created.
  3. Change the value in the cells to be the maximum rather than the sum of the votes by selecting the right value from the Summarize by: menu. Try to play with sorting your values and see just how many different ways you can sort your values - you'll probably be using more of these options later.
  4. Delete the Grand Total row and column by unchecking Show totals in the rows and columns boxes

Task 5: Use Conditional Formatting to Highlight Blank Cells

  1. Select the entire table and select Format > Conditional Formatting from the toolbar. This will open up a the toolbar that lets you add rules to format cells depending on some formula.
  2. Use conditional formatting to color all blank cells red. If you don't know how to do this, review the tutorial.
  3. Delete all the blank columns to the right of your table. Ditto for the unused rows at the bottom of the table.
  4. Verify that there are no blank cells in the table.