6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
# SUSY dataset (DO NOT PUSH)
SUSY.csv
SUSY.csv.gz
SUSY_small.csv
SUSY_tiny.csv
*.csv
Binary file not shown.
150 changes: 150 additions & 0 deletions Kaggle Tabular Data.ipynb
@@ -0,0 +1,150 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "e64a1b88",
"metadata": {},
"source": [
"# Tabular Kaggle Project\n",
"\n",
"Guideline for steps for the Kaggle Tabular Project. You will \"turn in\" a GitHub repository, modeled after [Project Template](https://github.com/UTA-DataScience/ProjectTempate) on the day of the final, Friday, May 1 at 11 – 1:30 pm. During the final period we will have about 5 minutes to go over your project and your results.\n",
"\n",
"You can find a list of possible Tabular datasets here on [Excel File in Teams](https://mavsuta.sharepoint.com/:x:/r/teams/Course_2262_data_3402_001-tImyQiF6rCJKf/Shared%20Documents/General/TabularDatasets.xlsx?d=w4ae5174d4ac5455aa4a8f03e70918898&csf=1&web=1&e=Lqtpue). You are not limited to these datasets. If you find a Kaggle challenge not listed that you would like to attempt, please check with Dr. Farbin to make sure it is viable. Note that the requirement is that the data you use is tabular, meaning that it can be represented as a table, therefore excluding images, video, audio, and other more raw data formats as well as data that is more structure. Note that Kaggle hosts datasets without well defined competition associated with them, which will require you to define the problem and assessment metrics. Please select datasets associated with competitions. \n",
"\n",
"Your first task is to select a challange / dataset. I would like everyone to come to Lecture on Wednesday April 8 with at least one dataset in mind. I will ask students who select datasets not from the provided list to share the links so we can evaluate the dataset in class.\n",
"\n",
"This notebook outlines the steps you should follow. The file(s) in the GitHub repository should contain these steps. Note that you will be only considering classification projects. \n",
"\n",
"## Define Project\n",
"\n",
"* Provide Project link.\n",
"* Short paragraph describing the challenge. \n",
"* Briefly describe the data.\n"
]
},
{
"cell_type": "markdown",
"id": "a65cd3e3",
"metadata": {},
"source": [
"## Data Loading and Initial Look\n",
"\n",
"* Load the data. \n",
"* Count the number of rows (data points) and features.\n",
"* Any missing values? \n",
"* Make a table, where each row is a feature or collection of features:\n",
" * Is the feature categorical or numerical\n",
" * What values? \n",
" * e.g. for categorical: \"0,1,2\"\n",
" * e.g. for numerical specify the range\n",
" * How many missing values\n",
" * Do you see any outliers?\n",
" * Define outlier.\n",
"* For classification is there class imbalance?\n",
"* What is the target:\n",
" * Classification: how is the target encoded (e.g. 0 and 1)?\n",
" * Regression: what is the range?"
]
},
{
"cell_type": "markdown",
"id": "27c59841",
"metadata": {},
"source": [
"## Data Visualization\n",
"\n",
"* For classification: compare histogram every feature between the classes. Lots of examples of this in class.\n",
"* For regression: \n",
" * Define 2 or more class based on value of the regression target.\n",
" * For example: if regression target is between 0 and 1:\n",
" * 0.0-0.25: Class 1\n",
" * 0.25-0.5: Class 2\n",
" * 0.5-0.75: Class 3\n",
" * 0.75-1.0: Class 4\n",
" * Compare histograms of the features between the classes.\n",
" \n",
"* Note that for categorical features, often times the information in the histogram could be better presented in a table. \n",
"* Make comments on what features look most promising for ML task."
]
},
{
"cell_type": "markdown",
"id": "ba73f3b0",
"metadata": {},
"source": [
"## Data Cleaning and Preperation for Machine Learning\n",
"\n",
"* Perform any data cleaning. Be clear what are you doing, for what feature. \n",
"* Determinine if rescaling is important for your Machine Learning model.\n",
" * If so select strategy for each feature.\n",
" * Apply rescaling.\n",
"* Visualize the features before and after cleaning and rescaling.\n",
"* One-hot encode your categorical features."
]
},
{
"cell_type": "markdown",
"id": "39c8d295",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"## Machine Learning\n",
"\n",
"\n",
"### Problem Formulation\n",
"\n",
"* Remove unneed columns, for example:\n",
" * duplicated\n",
" * categorical features that were turned into one-hot.\n",
" * features that identify specific rows, like ID number.\n",
" * make sure your target is properly encoded also.\n",
"* Split training sample into train, validation, and test sub-samples.\n",
"\n",
"### Train ML Algorithm\n",
"\n",
"* You only need one algorithm to work. You can do more if you like.\n",
"* For now, focus on making it work, rather than best result.\n",
"* Try to get a non-trivial result.\n",
"\n",
"### Evaluate Performance on Validation Sample\n",
"\n",
"* Compute the usual metric for your ML task.\n",
"* Compute the score for the kaggle challenge.\n",
"\n",
"### Apply ML to the challenge test set\n",
"\n",
"* Once trained, apply the ML algorithm the the test dataset and generate the submission file.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "12b0e44d",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
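
The "Data Loading and Initial Look", "Problem Formulation", and "Train ML Algorithm" steps in the notebook ask for a class-imbalance check, a train/validation/test split, and a non-trivial result. As a minimal, dependency-free sketch (not part of this PR), the snippet below checks class balance, splits rows three ways, and computes the majority-class baseline that a trained model would need to beat. The function names, the 60/20/20 split fractions, and the synthetic rows are assumptions made for illustration; a real project would more likely use pandas and scikit-learn.

```python
import random
from collections import Counter


def class_balance(labels):
    """Return the fraction of rows belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and cut them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def majority_baseline_accuracy(train_labels, val_labels):
    """Accuracy of always predicting the most common training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for y in val_labels if y == majority)
    return hits / len(val_labels)


# Synthetic stand-in for a real dataset (assumption): (feature, label)
# pairs in which 80% of the rows belong to class 0.
rows = [(i, 0) for i in range(80)] + [(i, 1) for i in range(80, 100)]

balance = class_balance([label for _, label in rows])
train, val, test = train_val_test_split(rows)
baseline = majority_baseline_accuracy(
    [label for _, label in train], [label for _, label in val]
)

print(balance)
print(len(train), len(val), len(test))
print(f"majority-class baseline accuracy on validation: {baseline:.2f}")
```

With imbalanced classes the baseline accuracy can already look high, so a validation score only counts as non-trivial if it clearly exceeds this number.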