CS 202 Fall 2012

Homework Assignment #5 : Due Monday 5:00pm October 15

This homework has three parts, none of which involves any programming! (Don't worry, we'll make up for it in our next assignment!) This homework focuses on Big Data. In the first part, you'll use Google's N-Gram Viewer to search for phrases in more than 5 million books. In the second part, you'll explore and visualize a large data set of your choice to answer a policy question. In the third part, you'll comment on issues of privacy with personal data.

Part A: Google N-Gram Viewer (3 points)

Google Labs' Ngram Viewer is a tool that lets you search for words in a database of 5 million books from across centuries.

To begin, watch this 15-minute TED talk by Erez Lieberman Aiden and Jean-Baptiste Michel to see how it works and some of the surprising facts they have learned.

From this talk, answer the following questions:

  • What are n-grams? Why did Google release n-grams instead of the full text of the books?
  • In what year did "thrived" become more popular than "throve" (at least in texts)?
  • What is culturomics?
  • Why was the word “beft” popular in texts before 1800?
Now, experiment with the Google n-gram viewer itself. You should pick some subject that you are interested in and see how the popularity of some related words has varied over time. You can pick anything you like, but if you pick a subject that you have some additional knowledge of, you will likely find it easier to interpret the resulting data. Of course, don't use any of the search terms used in the TED talk!

You should compare between two and four n-grams that are related to your topic. You should write a short essay describing what you found. Your essay must include the following points:

  1. Specify the exact query you gave to retrieve your results (e.g., the n-gram phrases, the range of years, and the corpus language). You should describe how your n-grams are related (i.e., the overall subject you are investigating).
  2. The graph that was produced. You can include the image in your document or as a separate file, whichever is most convenient for you.
  3. An objective description of the popularity of each n-gram relative to one another and over time. Is the popularity of these search terms increasing or decreasing over time? Has the relative popularity of the terms changed at all over time? Are there distinct moments in time when the popularity of the search term has abruptly increased or decreased?
  4. A subjective discussion about the relative popularity of the terms and their popularity over time. Why do you think some of the n-grams are more popular than others? How does the popularity of each n-gram correlate with what was going on in the real world? What information from other sources do you have to back up your speculations?
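
If you are curious what "popularity" means in points 3 and 4: the viewer plots, roughly, how often a phrase occurs in a given year as a share of all phrases of the same length in that year. The optional Python sketch below illustrates that idea on a made-up toy corpus. It is not Google's data or code, and the names `ngrams`, `relative_frequency`, and `toy_corpus` are hypothetical, chosen only for this example.

```python
from collections import Counter

def ngrams(text, n):
    """Return the list of n-word sequences (n-grams) in text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def relative_frequency(phrase, text):
    """Share of the same-length n-grams in text that equal phrase."""
    n = len(phrase.split())
    grams = ngrams(text, n)
    if not grams:
        return 0.0
    return Counter(grams)[phrase.lower()] / len(grams)

# Made-up one-sentence "corpora" keyed by year, for illustration only.
toy_corpus = {
    1850: "send a telegraph to the city",
    1950: "use the telephone not the telegraph",
    2000: "the telephone rang and the telephone stopped",
}

# Print the share of each word per year, as the viewer would plot it.
for year in sorted(toy_corpus):
    text = toy_corpus[year]
    print(year,
          relative_frequency("telegraph", text),
          relative_frequency("telephone", text))
```

Dividing by the number of n-grams in each year matters because far more books exist for recent years; raw counts alone would make almost every phrase look like it is growing.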

Part B: Visualization (4 points)

A number of fascinating data sets exist online, along with tools to help you visualize that data. Many Eyes and Google Public Data Explorer are two popular sites for sharing visualizations.

In this part of the assignment, you will use Google Public Data Explorer to visualize a data set; you should pick a data set that allows you to answer a social question you find interesting or that could help steer public policy (for example, one of the questions we discussed in the lecture on Big Data). A huge number of data sets are available for you to use.

After you've picked a data set of interest, explore the four different types of graphs you can produce: line, bar, map, and scatter. You can select the type of graph by picking one of the four icons near the top left of the page.

In many cases, to visualize interesting results you'll want to change what is being Compared and/or what is shown on one of the axes. To change this, use the dropdown menus along the left side of the page and along the axes. For example, many of the data sets default to comparing data across different states of the US, but you can change this by selecting the dropdown arrow symbol and selecting a different category for comparison. Likewise, for the scatter graph, the default uses the same metric along both the x and y axes, resulting in a boring x=y graph; you will want to change the metric being shown on either the x or y axis to look at correlations.
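
To see why the default scatter graph is uninformative, note that any metric is perfectly correlated with itself. The optional Python sketch below uses made-up numbers, not data from Google Public Data Explorer, and `pearson` is a hypothetical helper written only for this illustration.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Made-up values of two metrics for five regions (illustration only).
metric_a = [1, 2, 3, 4, 5]
metric_b = [2, 1, 4, 3, 5]

# Same metric on both axes: every point lies on the line x=y.
print(pearson(metric_a, metric_a))

# Different metrics on the two axes: the points scatter.
print(pearson(metric_a, metric_b))
```

The first value is 1 (up to rounding), which tells you nothing new; the second is about 0.8, which is the kind of relationship worth discussing in your write-up.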

After you've explored some data, tell us what you found. To do this, you must produce three graphs; all three graphs will probably be of the same data set, but they must compare different categories of data, use different metrics, or be of different types (e.g., line, bar, map, or scatter).

  1. Include your three graphs in a single document (e.g., MS Word). Don't just include links to your graphs on the Google website. The easiest way to include the graphs directly is probably to grab a screenshot. Make sure the graphs include the x and y labels and the data categories being shown.
  2. State the societal or public policy question you are trying to answer with the data shown on the graphs.
  3. Describe what is being shown in each of the three graphs. We expect a few sentences about each graph.
  4. Explain in a few paragraphs how the data across the graphs helps you to answer your societal question.

Part C: Blown to Bits (3 points)

The fact that massive amounts of data are being collected about you can be both a blessing and a curse. Begin by reading Chapter 2 of Blown to Bits. Then, consider one of these two scenarios:
  • A store tracks every purchase you make over time; the store even tracks the location of your shopping cart at each moment in the store.
  • A web search engine tracks every search you have ever performed and the pages you end up viewing.
For one of the above two scenarios, discuss the following issues:
  • How could this type of information collection benefit you as an individual? Be specific and give examples.
  • How does collecting this type of information benefit the store or the search company? Why do they want to collect this information? Again, give concrete examples.
  • How could this information collection be negative for an individual (even someone with nothing to hide)?
  • Do you think the benefits outweigh the drawbacks for most individuals? Why or why not?

Turning in your Homework

Save each of your files as a .doc, .docx, or .pdf file. Turn in all files through your Learn@UW account. Double-check that you really submitted each of your files!


Time: TuTh 9:30-10:45
Room: 1325 CS
Lab: 1370 CS (1st floor)

Prof Andrea Arpaci-Dusseau

Office Hours
TuTh 10:45-12:00
7375 Computer Sciences
Email: dusseau "at" cs.wisc.edu

Teaching Assistant:
Benjamin Bramble
Lab Hours (CS 1370)
Wed 2:00-4:00

Teaching Assistant:
Sharad Punuganti
Lab Hours (CS 1370)
Thu 1:30-3:30