Data Mining and Machine Learning Shaina Race, PhD

Data Mining


Topic Video Slides + Notes Code + Data Exercises + Homework
Introduction to Data Mining
(Modeling Part I)

(Self-paced by 10/5)

Bias-Variance Tradeoff
(Train-Valid-Test)


Missing Values

Transactional Data

Variable Transformations

Introduction Exercises
(Solutions)
Association Analysis

Association Grocery Data (SAS)
Viya Code
Association Analysis in R
Association Analysis in Python
Grocery Data (.csv)
Exercises
(Solutions)

Due BY 9/25 5pm:
Homework
Homework Data (.csv)
public.orderData2 (SAS Viya)
Submit link

Due BY 9/25 11:59pm:
Homework Quiz

Classification and Regression Trees (CART) Part 1

CART Viya Demo 1
Telco Churn (.csv)
Viya Tree Details
A Python run-through
Exercises (#2,3,5,7,8)
(Solutions)

Classification and Regression Trees (CART) Part 2

CART Viya Demo 2
Breast Cancer (SAS Dataset)
Demo 2 in R
Breast Cancer Description (.txt)
Decision Trees in R
Breast Cancer (Rdata)

Exercises (ALL problems)
(Solutions)

Homework
Homework Data (.csv)
Homework Data (.RData)
Homework Data Dictionary (.txt)
Submit link

Due BY Friday 10/16 11:59pm:
Homework Quiz

Clustering Part 1
k-means

Clustering
Clustering in R (Code)
Adult Data (.RData)
Adult Data (.csv)
The Curse of Dimensionality

(Self-paced by 10/9)

Curse of Dimensionality.mp4 Slides Carl Sagan's Introduction to Flatland
(Just for fun)
Clustering Part 2
Hierarchical Clustering

Clustering HierarchicalClust using k-means centroids in R (Adult data)


Clustering in SAS (code)
Breast Cancer (SAS Dataset)
Exercises
(Solutions)
(Detail Solution Problem 2)

Clustering Lab

Choose your Own Adventure - Clustering
TeenSNS Data (.sas7bdat)
(also in public library)
TeenSNS Data (.csv)
TeenSNS Data (.RData)
TeenSNS Description (.txt)

Link to Submit Profiles
One submission per team!
k-NearestNeighbor

kNN

PenDigitTrain (SAS)
PenDigitTest (SAS)
PenDigit Description (.txt)

kNN in Base SAS (.pdf)

kNN in R
PenDigits.Rdata
kNN Exercises
(kNN Solutions)

Modeling Part II

(Self-paced by 10/19)

Model Evaluation

Exercises
(Solutions)
Ensemble Models

Ensemble Models Telco Churn (.csv)
Review

Review Slides (with solutions)
Exam Open Monday 10/19
Due Friday 10/23 11:59pm

You may use any notes but may not accept exam-specific help from another person
Ensemble Clustering
(Faculty Workshop - 10/1)
Recording Consensus Clustering R Code
Adult Data (.RData)
Adult Data (.csv)

Machine Learning

Topic Video Slides + Notes Code + Data Exercises + Homework
MODELING COMPETITION
.RData (Train, Valid, Test)

.csv Training Data
.csv Validation Data
.csv Test Data (for submission)

Project Description
Sample Submission
Bagging
and
Random Forests
Random Forests
Reference Text
Random Forests in R (.pdf)
Random Forests in R (Code)

PenDigits.Rdata

PenDigitTrain (SAS)
PenDigitTest (SAS)
PenDigit Description (.txt)

Python Tutorial
Exercises (#1a-f, 2, 3)
(Solutions)
Boosting
Gradient Boosting
Gradient Boosting

Reference Text
Gradient Boosting in R
Python Tutorial
Exercises
(Solutions)
Regularized Regression
Ridge and LASSO
Regularized Regression Ridge/Lasso in R (.pdf)
Ridge/Lasso in R (Code)
Hitters Data (Rdata)
Hitters Description
Hitters Data (SAS)

Python Tutorial
Exercises
(Solutions)
Neural Networks Neural Networks

Backpropagation Example
(Requires Calculus)

Backpropagation Video (3Blue1Brown)
Neural Networks in R
Concrete Data (SAS)
Concrete Data (.RData)
Concrete Description (.txt)
Exercises
(Solutions)

MIDTERM QUIZ (ML Topics 1-4 PLUS k-means clustering)


Support Vector Machines (SVM)
and Kernels
Support Vector Machines SVM + Grid Search (R Code)
SVM + Grid Search in R (.pdf)
Customer Churn (Rdata)
Customer Churn Train (SAS)
Customer Churn Train (SAS)
Support Vector Regression in R
Exercises
(Solutions)
Naive Bayes Classifier
Naive Bayes Naive Bayes in R (.pdf)
SMS Data (.RData)
Naive Bayes in R (.R)
PCA Intro to SMS Data (.R)
Model Agnostic Interpretability
Individual Conditional Expectation (ICE)
Partial Dependence (PDP)
Permutation Variable Importance
Model Interpretability

Online Textbook
Code to create plots from Slides

Model Agnostic Interpretability
Accumulated Local Effects (ALE)
LIME
Shapley Values
Model Interpretability

Online Textbook
Code to create plots from Slides


FINAL QUIZ (ML Topics 5-8 PLUS Regularized Regression)
DUE: FRIDAY 11/20 11:59PM

Explainable Boosting Machine
(Faculty Workshop - 4/1)
Recording Slides notebook
Cool Links
Cheat Sheets for Open Source Data Science.
Finding the k in k-means with Python
NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set
Visualizing and Comparing Decision Boundaries of Classifiers. Another example with code in IPython notebook format is here.
The Neural Network Playground.
Tutorial on Sequence Analysis in R, including exploration using other variables (sequences by gender, for example)
Part 1 (Sequence Analysis and Plots).
Part 2 (Clustering).
Part 3 (Advanced Analysis).
DataCamp Tutorials/Courses
Complete (teachy) Regularized Regression Tutorial in Python
The Illustrated Cliff Notes for Introduction to Statistical Learning Textbook.
Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost.