ORIE 4740 Project Information (Spring 2021)
In the final project, the techniques taught in the class are used to analyze a large dataset chosen by the students. Students work in teams of 2-4 students. Each team finds the necessary data, carries out the project, and writes a project report.
1. Due dates
2. Project Teams
3. Project Assignment
4. Submission
5. Grading
6. Sample Project Reports
- • Team and dataset: Once you form your group and decide which dataset to use, email your TA Sam Tan (sst76) ASAP with the following information:
- i) the names and NetIDs of your group members,
- ii) source of the dataset(s),
- iii) a brief description of the data, and
- iv) the numbers of observations and predictors in the dataset (to the best of your knowledge).
- The general rule is that no two groups may use the same dataset. In case of a conflict, the first group that emails Sam will have priority. On Ed Discussions we will maintain a list of datasets that have been chosen.
- • Project report and peer evaluation form (24% of final grade): Due May 25th, to be submitted online via Gradescope.
You should work in a team of 2 to 4 students. Please try to form a team yourself; if you have trouble finding teammates then let me know and I will help you find a team. You may not work alone.
You may use the "Search for Project Teammates!" megathread on Ed Discussion.
You will apply tools that you have learned in 4740 to a dataset of your choice.
- (a) Simulated (artificially generated) datasets are not allowed.
- (b) You may NOT use a dataset used before in HW/labs, nor any of the datasets from ISLR.
- (c) You may NOT use the UC Irvine Machine Learning Repository.
- (d) You may NOT use CMU Statlib.
You may obtain a dataset from a company, for instance if you have had an internship in a company and they are willing to provide you with such a dataset for this purpose. You may use a dataset from a research project at this university or another university, with permission. I highly encourage you to look around for a dataset on a topic that particularly interests you, rather than using generic datasets from data mining websites. Example: say I am interested in doing a project related to beer. A web search on “beer dataset” brings up a dataset with 1 million + beer reviews, from BeerAdvocate. You are allowed to use datasets from other textbooks, but you cannot do an analysis similar to that done in the textbook on the same dataset.
Here are a few data sources:
- • Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/
- • Yelp dataset challenge: https://www.yelp.com/dataset_challenge/
- • https://www.kaggle.com
- • https://www.quandl.com
- • "The 50 Best Public Datasets for Machine Learning": https://medium.com/datadriveninvestor/the-50-best-public-datasets-for-machine-learning-d80e9f030279
- • The call center data from: http://iew3.technion.ac.il/serveng/callcenterdata/index.html
- • Marine environment data at http://www.ices.dk/marine-data/dataset-collections/Pages/default.aspx
- • Datasets from the KDD Cup
- • Various datasets from Yahoo! http://webscope.sandbox.yahoo.com/
Note that some datasets in the above links are not acceptable as per rules (a)-(d) above. Regardless of the source of the data, this source must be referenced in your report. If the dataset is not in the public domain, then you must obtain permission for its use in this class project. No two groups may use the same dataset; if two groups propose the same dataset by chance, the one that emails Sam Tan first will have priority.
What is required? Each team must find the necessary data, carry out the data analysis, and write a project report. The analysis should be motivated by one or two particular scientific / commercial goals, such as (for the veteran’s data example): “We seek to predict whether or not an individual will contribute, and the contribution amount if they do contribute, based on their demographic characteristics and contribution history. This prediction can be used to choose which individuals receive solicitations, or to estimate the total expected contributions in order to guide the organization’s financial planning.”
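To make the second use in the example goal concrete: if one model gives each individual a probability of contributing and another gives the predicted amount conditional on contributing, the total expected contribution is the sum of probability times amount. The Python sketch below is only an illustration under those assumptions. The numbers are invented, and the names `expected_total`, `select_for_solicitation`, and the per-solicitation `cost` are hypothetical and not part of the assignment.

```python
from typing import List, Tuple

# Each pair is (predicted probability of contributing,
#               predicted amount in dollars if the person contributes).
Prediction = Tuple[float, float]


def expected_total(predictions: List[Prediction]) -> float:
    """Sum of probability * conditional amount over all individuals."""
    return sum(p * amount for p, amount in predictions)


def select_for_solicitation(predictions: List[Prediction], cost: float) -> List[int]:
    """Indices of individuals whose expected contribution exceeds the assumed cost."""
    return [i for i, (p, amount) in enumerate(predictions) if p * amount > cost]


# Invented values, for illustration only.
predictions = [(0.5, 20.0), (0.25, 40.0), (0.125, 8.0)]
print(expected_total(predictions))                # 21.0
print(select_for_solicitation(predictions, 2.0))  # [0, 1]
```

In a real project the probabilities and amounts would come from fitted models (for example logistic and linear regression), and the quality of these estimates would need to be assessed on held-out data.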
The data analysis that you perform needs to be more than a direct mapping of one of our lab analyses to another dataset. You will need to use more than one of the approaches that we have learned in the class. An example of a data analysis with sufficient scope is:
For a data set like the veteran’s data that has a continuous outcome variable and a binary outcome variable: Applying linear regression to predict the continuous outcome and applying logistic regression and decision trees to predict the binary outcome, while handling missing data. Comparing the results from logistic regression and decision trees, and recommending which should be used.
The data analyses that you perform should be appropriate for the goal(s) that you have stated. You should choose one or several data sets that are appropriate to address the goal(s) you have stated. If you have more than one data set or more than one goal, the project should form a coherent whole, rather than being two or three unrelated data analyses. For instance, you may have a single scientific goal, and use two data sets to address this goal. Or, you might have a single data set, with which you address two related scientific questions.
If you analyze a single data set then it should have a reasonably large number of observations (at least 1000); otherwise, two smaller data sets suffice but they should each contain at least 600 observations. One of your data sets should have at least five predictors. (You should discuss with the professor or TA if you have a strong reason to use a dataset that does not satisfy the above requirements.)
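As a rough illustration, the size requirements above can be written as a quick self-check in Python. This is a sketch under one reading of the rule, not an official checker: the function name `meets_size_requirements` and the `(observations, predictors)` tuple layout are hypothetical, and since the text does not say how projects with more than two datasets are handled, the sketch treats them like the two-dataset case.

```python
from typing import List, Tuple

# Thresholds taken from the requirements stated above.
MIN_OBS_SINGLE = 1000  # a single dataset needs at least 1000 observations
MIN_OBS_MULTI = 600    # each of two smaller datasets needs at least 600
MIN_PREDICTORS = 5     # at least one dataset needs five or more predictors


def meets_size_requirements(datasets: List[Tuple[int, int]]) -> bool:
    """Check (n_observations, n_predictors) pairs against the size rules.

    Assumption: with two or more datasets, every dataset must have at
    least MIN_OBS_MULTI observations.
    """
    if not datasets:
        return False
    has_enough_predictors = any(p >= MIN_PREDICTORS for _, p in datasets)
    if len(datasets) == 1:
        enough_obs = datasets[0][0] >= MIN_OBS_SINGLE
    else:
        enough_obs = all(n >= MIN_OBS_MULTI for n, _ in datasets)
    return enough_obs and has_enough_predictors


print(meets_size_requirements([(1500, 8)]))           # True
print(meets_size_requirements([(700, 3), (650, 6)]))  # True
print(meets_size_requirements([(800, 8)]))            # False
```

A dataset that fails this check is not automatically ruled out; as noted above, exceptions can be discussed with the professor or TA.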
Sometimes a data analysis yields negative or inconclusive results. For instance, perhaps none of the predictors were significant in the model, even though they seemed like reasonable predictors. Perhaps the predictions were poor, and the methods chosen, although they were a reasonable choice and had good promise, turned out to not work well. These are acceptable results, as long as all of the analyses and conclusions are correct. You might in this case suggest alternative approaches in your conclusion.
Work on the project is to be done entirely by the project group; communication between groups regarding project work is not allowed. You may not apply a technique that has been previously applied to the same data set in a published or unpublished work, if you are aware or could reasonably be expected to be aware of the existence of their work. You should cite in the bibliography any and all published or unpublished written works or spoken communications that have influenced your analysis.
You should employ at least one technique covered in class/ISLR, but are free to use any additional methods beyond class. You are encouraged to use R, but using another language (such as Python) is allowed.
Regarding time series and text data:
Some problems may involve time series data, e.g., stock prices. Some problems may involve text data, e.g., tweets, news articles, comments on Yelp, etc. Techniques for time series analysis and natural language processing are not covered by this course. While the course staff are happy to provide general guidance, you may want to learn some of these techniques yourself if you want to work on such a dataset.
You may discuss your plan with the professor or TAs, including:
- • The proposed scientific/commercial goal(s) of the analysis;
• The proposed data set(s) that will be used, including their source, number of variables and data points;
• The proposed data analyses to be performed;
• What figures or tables you might include;
• Why you expect the data set(s) and analysis methods to successfully address your goal;
• Any other details at your discretion
The report should be no more than 10 pages (double-spaced, 11+ pt font) and should contain:
- • Title page with authors and abstract
- • Introduction telling what the project is about, what your team has accomplished, and a brief statement of results and conclusions.
- • One or more sections describing the project
- • Conclusions
- • Bibliography
Tables and figures can be interspersed in the text or at the end of the report. All tables and figures should be numbered and referred to by number. The report should not contain raw computer output. Rather, any computer output should be in a table or figure, with explanation in the main text. Do not hand in the code (R, Python, etc.) for your analysis, but the instructor reserves the right to ask for your code if he deems it necessary.
Given the page limit, you should present your results in a concise and informative manner. Highlight the most interesting/important findings; summarize and interpret your results rather than just providing a list of numbers. You do not need to explain a standard algorithm, but feel free to provide references for an algorithm not covered in class. On the other hand, you may want to provide some details if you use a less well-known algorithm, modify an existing one, or adopt a new approach.
If your report has more than 10 pages, then there is no guarantee that the extra pages will be read by the instructor or graders.
3.3 Peer evaluation
Each student is asked to fill out this peer evaluation form, which assesses each individual’s contribution to the group. This form is due the same day as the final project report.
The project report and peer evaluation form should be submitted electronically on Gradescope.
• Each student should submit their own peer evaluation form.
• Only one member of each team needs to submit the project report.
Grades will be based on:
- • Validity of the goal(s): Are the goal(s) of the project well-defined and well-explained?
• Data sets and data analyses: Are the data sets and analyses selected appropriate to address the goal(s)?
• Scope of the data analysis: Is the proposed data analysis of a sufficient scope?
• Conclusions: Are the conclusions comprehensive and valid?
• Creativity: Are the problem formulation, methodology and analysis interesting and creative?
• Clarity and conciseness of the report: A wordy report will get a lower grade than one saying the same amount in less space.
• Team size: Projects done by larger teams are expected to be more extensive.
• Individual’s contribution to the group: as assessed by peer evaluations
Sample project reports can be found on Canvas. Please do not circulate these reports outside this class.
Note: These reports are not necessarily among the ones that received the highest grades in previous years, and may even contain errors and flaws. They simply give you a sense of the scope and structure of a project, as well as the range of possible techniques and outcomes.