Objective:
To creatively apply knowledge gained through the course of the semester to a substantial data analysis problem of your own choosing.
Deliverable:
For your final project, you will find a dataset and apply your data analysis skills to a new problem based on the data. You will turn in a PDF report discussing your efforts. Do not include code in your report.
Grading standards:
Your entry will be graded on the following elements:
75% Project writeup
35% Exploratory data analysis
35% Description of technical approach
30% Analysis of performance of method
25% Project presentation
33% Clearly motivated problem
33% Clear description of technical approach
33% Clear presentation of results
Description:
The final project is designed to give you a chance to explore a data science project end-to-end, with minimal restrictions.
For this project, you must:
Select a dataset to analyze (perhaps one from Kaggle?)
Define a question or task to be performed
What is your goal in analyzing this dataset? Is it a prediction problem? Or are you searching for patterns?
If appropriate, define a cost function to be optimized
Choose an analysis strategy
If appropriate, define a model
If appropriate, choose an inference algorithm to answer your question, given a model
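As a rough illustration of what "defining a cost function" can mean for a prediction task, the sketch below compares two common choices, mean squared error and mean absolute error, on made-up numbers. The function names and the toy values are assumptions for illustration only; your project may call for a different cost entirely.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference; penalizes large misses heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference; penalizes misses in proportion to their size."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values (hypothetical): three perfect predictions and one large miss.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [3.0, 5.0, 7.0, 1.0]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(f"MSE = {mse}, MAE = {mae}")
```

The single large miss dominates the squared-error cost far more than the absolute-error cost, so the choice of cost function determines which kinds of mistakes your method works hardest to avoid.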
You are welcome to use any publicly available code on the internet to help you. For example, you may wish to use the Stan language to help you construct an HMC sampler. Other possibilities include PyMC, the Venture probabilistic programming language, BayesDB, etc.
Your writeup should be a serious report on the dataset you chose, the problem you set out to solve, the technical approach you took (and your rationale for it), the results of any exploratory data analysis, and the results of your final model / inference / optimization algorithm.
Your writeup should address questions similar to those in your recommender engine report.
This writeup must include five main sections:
A discussion of the dataset
Where did it come from? Who published it?
Who cares about this data?
A discussion of the problem to be solved
Is this a classification problem? A regression problem?
Is it supervised? Unsupervised?
What sort of background knowledge do you have that you could bring to bear on this problem?
What other approaches have been tried? How did they fare?
A discussion of your exploration of the dataset
Before you start coding, you should look at the data. What does it include? What patterns do you see?
Any visualizations of the data that you deem relevant
A clear, technical description of your approach. This section should include:
Background on the approach
Description of the model you use
Description of the inference / training algorithm you use
Description of how you partitioned your data into a test/training split
An analysis of how your approach worked on the dataset
What was your final RMSE (or other appropriate performance metric) on your held-out test set?
Did you overfit? How do you know?
Was your first algorithm the one you ultimately used for your submission? Why did you (or didn't you) iterate your design?
Did you solve (or make any progress on) the problem you set out to solve?
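The questions above about data partitioning, RMSE, and overfitting can be sketched in a few lines. The example below is a minimal sketch using only the Python standard library. It assumes synthetic data, a simple one-variable least-squares line as the model, and an 80/20 split; the helper names (train_test_split, fit_line, rmse) are hypothetical and defined here, not taken from any library.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows with a fixed seed, then split off a test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Ordinary least-squares fit of y = a*x + b to (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def rmse(rows, a, b):
    """Root mean squared error of the line y = a*x + b on (x, y) pairs."""
    return math.sqrt(sum((y - (a * x + b)) ** 2 for x, y in rows) / len(rows))


# Synthetic data (an assumption for illustration): y = 2x + 1 plus Gaussian noise.
noise = random.Random(42)
data = [(float(x), 2.0 * x + 1.0 + noise.gauss(0.0, 1.0)) for x in range(100)]

train, test = train_test_split(data, test_fraction=0.2, seed=0)
a, b = fit_line(train)            # fit on the training rows only
train_rmse = rmse(train, a, b)
test_rmse = rmse(test, a, b)      # evaluate on rows the model never saw
print(f"train RMSE = {train_rmse:.3f}, test RMSE = {test_rmse:.3f}")
```

A test RMSE far above the training RMSE is one common sign of overfitting; roughly equal values suggest the model generalizes about as well as it fits. Fixing the shuffle seed keeps the split reproducible, which makes the numbers in your report repeatable.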
Possible sources of interesting datasets
CrowdFlower
KDD Cup
UCI repository
Kaggle (current and past)
Data.gov
AWS
World Bank
BYU CS478 datasets
data.utah.gov
Google Research
BYU DSC competition