Instructor: Hyunseung Kang
Email: khyunsWHALE@whartonWHALE.upenn.edu (remove all marine mammals from the e-mail address)
Office: 434 JMHH
Office hours: Mon/Tues/Wed/Thur 12:15 p.m. - 1:30 p.m.
Syllabus: Syllabus
Updates
- Extra office hours will be on July 14 (Sat) from 1PM to 5PM and July 15 (Sun) from 7PM to 9PM in the Statistics Department (4th floor JMHH).
Course Overview
The course aims to equip students with the tools needed to analyze real-world data and to justify their use through mathematical theory. Together, we will study basic concepts related to statistical inference and examine commonly used methods, with an emphasis on understanding when and how to apply them. Students will also learn how to use these methods in the statistical software R.
Prerequisites
The official prerequisite for the course is STAT 430. The effective prerequisite is fluency with basic probabilistic reasoning and analysis (e.g., probability distributions and densities; joint distributions; conditional probability, independence, correlation, and covariance; moment generating functions; law of large numbers; central limit theorem; etc.). For a refresher/overview of these topics, please refer to A First Course in Probability by Sheldon Ross.
It would be helpful to have previous exposure to linear algebra, but it is not required. Previous exposure to the statistical computing software R is also not required.
Textbook
There is no required textbook for this course. The course material largely draws on the best parts of the textbooks listed below and is presented through lectures and lecture notes. However, if you wish to purchase a textbook, Devore is available at the Bookstore.
- (Recommended) Probability and Statistics for Engineering and the Sciences, 8th Ed., J. Devore, Brooks/Cole, Cengage Learning, 2011
- (Recommended) Regression Analysis by Example, 4th Ed., S. Chatterjee and A. Hadi, Wiley-Interscience, 2006
- (Recommended) Statistics and Data Analysis: from Elementary to Intermediate, A.C. Tamhane and D.D. Dunlop, Prentice Hall, 2000
Statistical Computing
The statistical computing software R (latest version) will be used in the course. It is free and can be downloaded from the R-project website, http://www.r-project.org. The website also contains a list of manuals for using the software. Basic usage of R will be illustrated in class and through sample code posted on the course website. Again, no previous exposure to the software is required.
Grading Policy
- Assignments (25%), due every Monday before class begins
- Weekly quizzes (35%), every Monday at the beginning of class
- Final project (40%) due Thur, August 9th
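As a rough illustration of how the three weights above combine, the sketch below computes a weighted course score. It is written in Python for brevity (the course itself uses R). The function name, the 0-100 scale for each component, and the sample scores are assumptions made for illustration; the syllabus does not say how component scores are normalized or rounded.

```python
# Weights taken from the grading policy above (they sum to 100%).
WEIGHTS = {"assignments": 0.25, "quizzes": 0.35, "final_project": 0.40}


def weighted_course_score(scores):
    """Combine component scores (each assumed to be on a 0-100 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {"assignments": 90.0, "quizzes": 80.0, "final_project": 85.0}
print(round(weighted_course_score(example), 2))  # prints 84.5
```

Because the final project carries the largest weight, a given change in its score moves the total more than the same change in assignments or quizzes.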
Final Project (Due Thur, August 9th)
In the final project, students will analyze a real-world data set of their choosing using the tools learned in class. The final project should focus on what statistical tools were used, whether the tools were appropriate in the setting, and why the tools were important in the analysis. Students may also develop new tools for analysis, as long as they are justified by theory.
Students may work in groups of up to three people. Each group will submit a one-page executive summary (single-spaced, 12-point type, 1-inch margins) providing an overview of the project. The group will also submit a technical report containing the details of its analysis. Both documents must be in a single PDF file (no .txt, .doc, .docx, .tex, etc.). In the technical report, students are expected to provide some mathematical justification of their analysis and include relevant numerical analysis (e.g., p-values, t-tests, F-tests, etc.) of the data set.
If you are interested and the quality of your analysis is exceptional, your instructor will help you get the final project published in an academic journal.
Also, here is a list of websites where you can obtain freely accessible data sets. This is only a small fraction of what's available online.
- Yahoo! Sandbox : This website is great for different types of data, from linguistics to imaging. They also have apps that leverage the data and conduct statistical analysis. Most of the data is from Yahoo users.
- Machine learning repository at UCI : The website contains all sorts of data from a variety of fields including, but not limited to, biology, social science, economics, game theory, and web traffic data. It includes information about where the data was obtained and what variables are included in the data.
- Machine learning data at Stanford (click Data) : Like the UCI repository, the website contains a huge array of data from a variety of fields, primarily biology and medicine. It also contains nicely formatted imaging data (the ZIP code data) that can be easily opened in R.
- Sports data: The website provides links to various sports-related data on the web.
- StatLib Datasets: It contains all sorts of nicely formatted data from a variety of disciplines.
- Data surfing website: It contains a collection of links to a variety of data in many disciplines.
- Yahoo! Finance data : You can download stock data for any U.S.-based company using the guidelines on this page. For example, here is the Google stock data from its IPO to July 2, 2012 that I got by following the instructions from this website.
- US Department of Labor: Bureau of Labor Statistics: This website is a gold mine for any data related to the U.S. economy. I would suggest using the "Text Files", which provide a link to the data set plus any information needed to understand the notation in the data.
- U.S. Census 2010: It contains the recent census data. The data is aggregated by ZIP code or county.
- Image Processing Database : This website contains a wide array of image-based data. Despite its rich offering, you will unfortunately have to spend some time reformatting the images into an R-friendly format. If you are familiar with the image processing literature, simply download the packages recommended here (under General Image Processing). If you aren't, I would suggest converting all your images to grayscale using GIMP, Adobe Photoshop, or Matlab.
- Image data from NYU : This website contains relatively R-friendly image data. The offering is small in comparison to the link above, but it's much easier to work with. To use this data, you will need to use Matlab at some point to convert all the .mat data into R-friendly formats like .csv or .txt. The conversion is very easy and should take you less than 15 minutes (including the Google search about how to do this in Matlab). All the computers in Huntsman are equipped with Matlab.
- Social Science Database at Yale : This is Yale Library's data platform for data in the social sciences.
Lecture Notes
- Lecture 0 - Introduction (Powerpoint) or Lecture 0 - Introduction (PDF)
- Lecture 1 - Population and Sample (Powerpoint) or Lecture 1 - Population and Sample (PDF)
- Lecture 2 - Summarizing Data (Powerpoint) or Lecture 2 - Summarizing Data (PDF) with R Code and Cell Count Code
- Lecture 3 - Properties of Summary Statistics: Sampling Distribution (Powerpoint) or Lecture 3 - Properties of Summary Statistics: Sampling Distribution (PDF) with R Code
- Lecture 4 - Confidence Intervals (Powerpoint) or Lecture 4 - Confidence Intervals (PDF) with R Code
- Lecture 5 to 7 - Hypothesis Testing, Lecture Notes (PDF)
- Lecture 7 to 10 - Simple Linear Regression, Lecture Notes (PDF) with R Code
- Quiz Grades (Data)
- Spam Mail (Data) for simple linear regression between any two columns
- Wine Quality (Data) for simple linear regression between any two columns
- Lecture 10 to 17 - Multiple Linear Regression, Lecture Notes (PDF). Lecture notes for Model Selection and Time Series
- TV Death (Data) and fun link for introduction to multiple regression
- Are you Fat? (Data) for basic multiple regression
- Longevity (Data) for basic multiple regression
- Common Household Food (Data) for basic multiple regression
- Wine Quality (Data) for basic multiple regression. Between flavinoids and proline, polynomial regression can be used
- Spam Mail (Data) for basic multiple regression
- Tax Return Forms (Data) for One-way and Two-way ANOVA
- Sentencing Criminals (Data) for ANOVA, ANCOVA, and multiple regression
- Wage (Data) for ANOVA, ANCOVA, and multiple regression
- Quartic regression (Data) for polynomial regression
- Boston Housing (Data) for model selection
- Google stock (Data) for time series data
- NASDAQ Closing in 2011 (Data) for time series data
- CO2 Levels at Mauna Loa (Data) for time series
- Facebook Stock (Data) for time series
- Generalized Linear Models: Logistic, Logit, and Poisson Regression, Lecture Notes (PDF)
- Final Review Guide
Homework
- Homework 1 (Due Monday July 9, 2012 before class starts!) with Facebook data
- Homework 1.5 (Due Monday July 16, 2012 before class starts!)
- Homework 2 (Due Monday July 16, 2012 before class starts!) with Penn Sex Survey (Data) and Penn Sex Survey (Description)
- Homework 3 (Due Monday July 23, 2012 before class starts!)
- Penn Sex Survey (Data) with Penn Sex Survey (Description)
- 14-Cancer Gene Expression (Microarray) Data. More information about the data can be found on this website.
- Baseball Hitters (Post 1970) Data (thanks to Professor Shane Jensen)
- Homework 4 (Due Thursday, August 2, 2012 before 5PM)
- Optional Homework (Due Thursday, August 9, 2012, before class starts!)
Quiz
- Quiz 1 : Mean = 6.47, Median = 6, SD = 1.55, IQR = 1.5, Max = 9
- Quiz 2 : Mean = 8.97, Median = 9.5, SD = 2.31, IQR = 3.25, Max = 12.5
- Quiz 3 : Mean = 6.92, Median = 7, SD = 1.78, IQR = 2, Max = 9.5
- Quiz 4 : Mean = 4.50, Median = 4.50, SD = 2.01, IQR = 2.25, Max = 9
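The summaries reported for each quiz (mean, median, SD, IQR, max) can be reproduced from raw scores along the following lines. This is a Python sketch with made-up scores, not actual class data; the course itself uses R. The page does not state which quantile convention or SD denominator was used, so the "inclusive" quantile method (which corresponds to R's default quantile type) and the sample SD are assumptions, as is the helper name `summarize`.

```python
import statistics


def summarize(scores):
    """Return the five summaries reported for each quiz."""
    # Quartiles via linear interpolation; "inclusive" is an assumed convention.
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "sd": statistics.stdev(scores),  # sample SD (n - 1 denominator)
        "iqr": q3 - q1,
        "max": max(scores),
    }


# Hypothetical quiz scores, for illustration only.
scores = [4, 5, 6, 6, 7, 8, 9]
summary = summarize(scores)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Comparing the mean with the median, and the SD with the IQR, gives a quick sense of skew and outliers in a score distribution.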