CS 202 Fall 2012
Homework Assignment #5 : Due Monday 5:00pm October 15
This homework has three parts, none of which involve any programming! (Don't worry, we'll make up for it in our next assignment!) This homework focuses on Big Data. In the first part, you'll use Google's N-Gram Viewer to search for phrases in more than 5 million books. In the second part, you'll explore and visualize a large data set of your choice to answer a policy question. In the third part, you'll comment on issues of privacy with personal data.
Part A: Google N-Gram Viewer (3 points)
Google Labs' Ngram Viewer is a tool that lets you search for words in a database of 5 million books from across centuries. To begin, watch this 15-minute TED talk by Erez Lieberman Aiden and Jean-Baptiste Michel to see how it works and some of the surprising facts they have learned.
From this talk, answer the following questions:
- What are n-grams? Why did Google release n-grams instead of the full text of the books?
- In what year did "thrived" become more popular than "throve" (at least in texts)?
- What is culturomics?
- Why was the word “beft” popular in texts before 1800?
You should compare between two and four n-grams that are related to your topic. You should write a short essay describing what you found. Your essay must include the following points:
- Specify the exact query you gave to retrieve your results (e.g., the n-gram phrases, the range of years, and the corpus language). You should describe how your n-grams are related (i.e., the overall subject you are investigating).
- The graph that was produced. You can include the image in your document or as a separate file, whichever is most convenient for you.
- An objective description of the popularity of each n-gram relative to one another and over time. Is the popularity of these search terms increasing or decreasing over time? Has the relative popularity of the terms changed at all over time? Are there distinct moments in time when the popularity of the search term has abruptly increased or decreased?
- A subjective discussion about the relative popularity of the terms and their popularity over time. Why do you think some of the n-grams are more popular than others? How does the popularity of each n-gram correlate with what was going on in the real world? What information from the other sources do you have to back up your speculations?
Part B: Visualization (4 points)
A number of fascinating data sets exist on-line as well as tools to help you visualize that data. Many Eyes and Google Public Data Explorer are two popular sites for sharing visualizations.
In this part of the assignment, you will use Google Public Data Explorer to visualize a data set; you should pick a data set that allows you to answer a social question you find interesting or that could help steer public policy. (For example, the questions we discussed in the Lecture on Big Data.) A huge number of data sets are available for you to use.
After you've picked a data set of interest, explore the four different types of graphs you can produce: line, bar, map, and scatter. You can select the type of graph by picking one of the four icons near the top left of the page.
In many cases, to visualize interesting results you'll want to change what is being Compared and/or what is shown on one of the axes. To change this, use the dropdown menus along the left side of the page and along the axes. For example, many of the data sets default to comparing data across different States of the US. But, you can change this by selecting the dropdown arrow symbol and selecting a different category for comparison. Likewise, for the scatter graph, the default uses the same metric along both the x and y axes resulting in a boring x=y graph; you will want to change the metric being shown on either the x or y axis to look at correlations.
After you've explored some data, tell us what you found. To do this, you must produce three graphs; all three graphs will probably be of the same data set, but they must compare different categories of data, use different metrics, or be of different types (e.g., line, bar, map, or scatter).
- Include in a single document (e.g., MSWord) your three graphs. Don't just include links to your graphs on the Google website. The easiest way to include the graphs directly is probably to grab a screenshot. Make sure the graphs include the x and y labels and data categories being shown.
- State the societal or public policy question you are trying to answer with the data shown on the graphs.
- Describe what is being shown in each of the three graphs. We expect a few sentences about each graph.
- Explain in a few paragraphs how the data across the graphs helps you to answer your societal question.
Part C: Blown to Bits (3 points)
The fact that massive amounts of data are being collected about you can be both a blessing and a curse. Begin by reading Chapter 2 of Blown to Bits. Then, consider one of these two scenarios:
- A store tracks every purchase you make over time; the store even tracks the location of your shopping cart at each moment in the store.
- A web search engine tracks every search you have ever performed and the pages you end up viewing.
- How could this type of information collection benefit you as an individual? Be specific and give examples.
- How does collecting this type of information benefit the store or the search company? Why do they want to collect this information? Again, give concrete examples.
- How could this information collection be negative for an individual (even someone with nothing to hide)?
- Do you think the benefits outweigh the drawbacks for most individuals? Why or why not?
Turning in your Homework
Save your files as a .doc, .docx, or .pdf file. Turn in all files through your Learn@UW account. Double check that you really submitted each of your files!
Fall 2012
Time: TuTh 9:30-10:45
Room: 1325 CS
Lab: 1370 CS (1st floor)
Instructor:
Prof Andrea Arpaci-Dusseau
Office Hours
TuTh 10:45-12:00
Office:
7375 Computer Sciences
Email: dusseau "at" cs.wisc.edu
Teaching Assistant:
Benjamin Bramble
Lab Hours (CS 1370)
Wed 2:00-4:00
Teaching Assistant:
Sharad Punuganti
Lab Hours (CS 1370)
Thu 1:30-3:30