ORIE 4740 Project Information (Fall 2018)
In the final project, the techniques taught in the class are used to analyze a large dataset chosen by the students. Students work in teams of 2-4 students. Each team finds the necessary data, carries out the project, and writes a project report.
1. Due dates
2. Project Teams
3. Project Assignment
4. Submission
5. Grading
6. Sample Project Reports
- • Team and dataset: Once you form your group and decide on which dataset to use, email the instructor ASAP with the names and NetID of your group members as well as the source of the dataset(s). The general rule is that no two groups may use the same dataset. In case of a conflict, the first group that emails the instructor will have priority. On Piazza I will maintain a list of datasets that have been chosen.
- • Project report and peer evaluation form (22% of final grade): December 14th (Friday), 11:59 PM, to be submitted online via Blackboard.
You should work in a team of 2 to 4 students. Please try to form a team yourself; if you have trouble finding teammates then let me know and I will help you find a team. You may not work alone.
You may use the "Search for Teammates!" function on Piazza.
You will apply tools that you have learned in 4740 to a dataset of your choice.
- (a) Simulated (artificially generated) datasets are not allowed.
- (b) You may NOT use a dataset used before in HW/labs, nor any of the datasets from ISLR.
- (c) You may NOT use the UC Irvine Machine Learning Repository.
- (d) You may NOT use CMU Statlib.
You may obtain a dataset from a company, for instance if you have had an internship at a company and they are willing to provide you with such a dataset for this purpose. You may use a dataset from a research project at this university or another university, with permission. I highly encourage you to look around for a dataset on a topic that particularly interests you, rather than using generic datasets from data mining websites. Example: say I am interested in doing a project related to beer. A web search on “beer dataset” brings up a dataset with more than 1 million beer reviews from BeerAdvocate. You are allowed to use datasets from other textbooks, but you cannot do an analysis similar to the one done in the textbook on the same dataset.
Here are a few data sources:
- • Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/
- • Yelp dataset challenge: https://www.yelp.com/dataset_challenge/
- • https://www.kaggle.com
- • https://www.quandl.com
- • "The 50 Best Public Datasets for Machine Learning": https://medium.com/datadriveninvestor/the-50-best-public-datasets-for-machine-learning-d80e9f030279
- • The call center data from: http://iew3.technion.ac.il/serveng/callcenterdata/index.html
- • Marine environment data at http://www.ices.dk/marine-data/dataset-collections/Pages/default.aspx
- • Datasets from the KDD Cup
- • Various datasets from Yahoo! http://webscope.sandbox.yahoo.com/
Note that some datasets in the above links are not acceptable as per rules (a)-(d) above. Regardless of the source of the data, this source must be referenced in your report. If the dataset is not in the public domain, then you must obtain permission for its use in this class project. No two groups may use the same dataset; if two groups propose the same dataset by chance, the one that emails me first will have priority.
What is required? Each team must find the necessary data, carry out the data analysis, and write a project report. The analysis should be motivated by one or two particular scientific / commercial goals, such as (for the veteran’s data example): “We seek to predict whether or not an individual will contribute, and the contribution amount if they do contribute, based on their demographic characteristics and contribution history. This prediction can be used to choose which individuals receive solicitations, or to estimate the total expected contributions in order to guide the organization’s financial planning.”
The data analysis that you perform needs to be more than a direct mapping of one of our lab analyses to another dataset. You will need to use more than one of the approaches that we have learned in the class. An example of a data analysis with sufficient scope is:
For a data set like the veteran’s data that has a continuous outcome variable and a binary outcome variable: Applying linear regression to predict the continuous outcome and applying logistic regression and decision trees to predict the binary outcome, while handling missing data. Comparing the results from logistic regression and decision trees, and recommending which should be used.
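The binary-outcome part of such an analysis might be sketched as follows. This is only an illustration under stated assumptions, not a template for the report: the data are toy values generated solely so that the sketch runs (simulated data are not allowed in the project itself), the helper names are hypothetical, missing values are filled with training-set column means, and a depth-one tree (a "stump") stands in for a full decision tree. The linear regression for the continuous outcome is omitted.

```python
import math
import random

def column_means(rows):
    """Mean of the observed (non-None) values in each column."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return means

def impute(rows, means):
    """Replace each None with the supplied mean for its column."""
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Logistic regression fitted by batch gradient descent; returns (weights, bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() from overflowing
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_logistic(model, X):
    w, b = model
    return [1 if b + sum(wj * xj for wj, xj in zip(w, xi)) > 0 else 0 for xi in X]

def fit_stump(X, y):
    """Depth-one tree: the (feature, threshold, polarity) with fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({xi[j] for xi in X}):
            for polarity in (0, 1):  # class predicted when x[j] > t
                errors = sum((polarity if xi[j] > t else 1 - polarity) != yi
                             for xi, yi in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, polarity)
    return best[1:]

def predict_stump(model, X):
    j, t, polarity = model
    return [polarity if xi[j] > t else 1 - polarity for xi in X]

def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)

# Toy data (assumption, for illustration only): two predictors, binary outcome,
# with roughly 10% of the second predictor missing.
random.seed(0)
rows, labels = [], []
for _ in range(200):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    labels.append(1 if x1 + 0.5 * x2 + random.gauss(0, 0.5) > 0 else 0)
    rows.append([x1, None if random.random() < 0.1 else x2])

# Hold out the last 50 rows; imputation means come from the training rows only.
train_rows, test_rows = rows[:150], rows[150:]
y_train, y_test = labels[:150], labels[150:]
means = column_means(train_rows)
X_train, X_test = impute(train_rows, means), impute(test_rows, means)

logit_acc = accuracy(y_test, predict_logistic(fit_logistic(X_train, y_train), X_test))
stump_acc = accuracy(y_test, predict_stump(fit_stump(X_train, y_train), X_test))
print(f"logistic regression: {logit_acc:.2f}, decision stump: {stump_acc:.2f}")
```

In a real project the comparison would typically rest on cross-validation and on more than raw accuracy, and the recommendation between the two methods would be argued in the report text.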
The data analyses that you perform should be appropriate for the goal(s) that you have stated. You should choose one or several data sets that are appropriate to address the goal(s) you have stated. If you have more than one data set or more than one goal, the project should form a coherent whole, rather than being two or three unrelated data analyses. For instance, you may have a single scientific goal, and use two data sets to address this goal. Or, you might have a single data set, with which you address two related scientific questions.
If you analyze a single data set then it should have a reasonably large number of observations (at least 500); otherwise, two smaller data sets suffice but they should each contain at least 300 observations. One of your data sets should have at least five predictors.
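As a quick sanity check, the size rule above can be written as a small function. This is a sketch of one reading of the rule: the function name and input format are hypothetical, and it assumes that projects with more than two data sets are held to the same 300-observation minimum per data set as the two-data-set case.

```python
def meets_size_requirements(datasets):
    """datasets: list of (n_observations, n_predictors) tuples, one per data set."""
    if not datasets:
        return False
    # A single data set needs at least 500 observations; with several data sets
    # (assumption: two or more), each needs at least 300.
    min_obs = 500 if len(datasets) == 1 else 300
    enough_rows = all(n_obs >= min_obs for n_obs, _ in datasets)
    # At least one data set must have five or more predictors.
    enough_predictors = any(n_pred >= 5 for _, n_pred in datasets)
    return enough_rows and enough_predictors

print(meets_size_requirements([(1200, 8)]))           # True
print(meets_size_requirements([(350, 3), (400, 6)]))  # True
print(meets_size_requirements([(450, 10)]))           # False: a single set needs 500 rows
```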
Sometimes a data analysis yields negative or inconclusive results. For instance, perhaps none of the predictors were significant in the model, even though they seemed like reasonable predictors. Perhaps the predictions were poor, and the methods chosen, although they were a reasonable choice and had good promise, turned out to not work well. These are acceptable results, as long as all of the analyses and conclusions are correct. You might in this case suggest alternative approaches in your conclusion.
Work on the project is to be done entirely by the project group; communication between groups regarding project work is not allowed. You may not apply a technique that has been previously applied to the same data set in a published or unpublished work, if you are aware or could reasonably be expected to be aware of the existence of that work. You should cite in the bibliography any and all published or unpublished written works or spoken communications that have influenced your analysis.
You should employ at least one technique covered in class/ISLR, but are free to use any additional methods beyond class. You are encouraged to use R, but using another language (such as Python) is allowed.
3.1 Project Plan
You may discuss your plan with your professor/TAs, including:
- • The proposed scientific/commercial goal(s) of the analysis;
- • The proposed data set(s) that will be used, including their source, number of variables and data points;
- • The proposed data analyses to be performed;
- • What figures or tables you might include;
- • Why you expect the data set(s) and analysis methods to successfully address your goal;
- • Any other details at your discretion.
The report should be no more than 15 pages (double-spaced, 11+ pt font) and should contain:
- • Title page with authors and abstract
- • Introduction telling what the project is about, what your team has accomplished, and a brief statement of results and conclusions.
- • One or more sections describing the project
- • Conclusions
- • Bibliography
Tables and figures can be interspersed in the text or at the end of the report. All tables and figures should be numbered and referred to by number. The report should not contain raw computer output. Rather, any computer output should be in a table or figure, with explanation in the main text. Do not hand in the code (R, Python, etc.) for your analysis, but the instructor reserves the right to ask for your code if he deems it necessary.
If your report has more than 15 pages, then there is no guarantee that the extra pages will be read by the instructor or graders.
3.3 Peer evaluation
Each student is asked to fill out this peer evaluation form, which assesses each individual’s contribution to the group. This form is due the same day as the final project report.
The project report and peer evaluation form should be submitted electronically on Blackboard.
- • Each student should submit their own peer evaluation form.
- • Only one member of each team needs to submit the project report.
Grades will be based on:
- • Validity of the goal(s)
- • Whether the data set(s) and data analyses selected are appropriate to address that goal
- • Sufficient scope of the data analysis
- • Comprehensiveness and validity of the conclusions
- • Creativity
- • Clarity and conciseness of the report. A wordy report will get a lower grade than one saying the same amount in less space.
- • Number of students. Projects done by larger teams are expected to be more extensive.
- • Each individual’s contribution to the group, as assessed by peer evaluations
Sample project reports can be found on Blackboard. Please do not circulate these reports outside this class.
Note: These reports are not necessarily among the ones that received the highest grades in previous years, and may even contain errors and flaws. They simply give you a sense of the scope and structure of the project, as well as the range of possible techniques and outcomes.