Read in the data. The load() function reads in a dataset with 20532 columns, so it may take some time. Consider clearing your environment (or opening a new RStudio window) if you have other work open.
load('geneCancerUCI.RData')
table(cancerlabels$Class)
##
## BRCA COAD KIRC LUAD PRAD
##  300   78  146  141  136
Original Source: The Cancer Genome Atlas Pan-Cancer Analysis Project
We will want to plot the data points according to their class labels, so we should pick a color palette suited to categorical attributes.
library(RColorBrewer)
display.brewer.all()
palette(brewer.pal(n = 8, name = "Dark2"))
The first step is typically to explore the data. With more than 20,000 input variables, we can't look at every pairwise scatter plot. For the fun of it, let's look at a few of them, picked at random. First pick two column numbers at random, then draw the plot, coloring by the label.
randomColumns = sample(2:20532,2)
plot(cancer[,randomColumns], col = as.factor(cancerlabels$Class))
Project this data onto the first two principal components. View the projection for both correlation and covariance PCA.
Color the projection by Class.
What proportion of the variance is captured in your 2-dimensional plot?
Bonus points if you can render the projection in 3 dimensions!
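As a starting point for the exercises above, here is a minimal sketch of covariance and correlation PCA using base R's prcomp(). It runs on a small synthetic matrix `X` with made-up class labels `classes`, both of which are stand-ins and not part of the lab data. With the real data you would presumably use the numeric columns of `cancer` and `cancerlabels$Class` instead. It also assumes that column 1 of `cancer` is a sample ID, which the `sample(2:20532, 2)` call suggests but does not confirm.

```r
# Synthetic stand-in for the expression matrix: 60 samples, 50 "genes".
# (Hypothetical data; replace X and classes with the real objects.)
set.seed(1)
classes <- factor(rep(c("A", "B", "C"), each = 20))
X <- matrix(rnorm(60 * 50), nrow = 60)
X[classes == "B", 1:10]  <- X[classes == "B", 1:10]  + 3  # shift class B
X[classes == "C", 11:20] <- X[classes == "C", 11:20] + 3  # shift class C

# Correlation PCA rescales every column to unit variance, which fails on
# constant columns, so drop any zero-variance columns first.
X <- X[, apply(X, 2, var) > 0, drop = FALSE]

pca_cov <- prcomp(X, center = TRUE, scale. = FALSE)  # covariance PCA
pca_cor <- prcomp(X, center = TRUE, scale. = TRUE)   # correlation PCA

# Proportion of total variance captured by the first two components.
prop_var2 <- function(p) sum(p$sdev[1:2]^2) / sum(p$sdev^2)
prop_cov <- prop_var2(pca_cov)
prop_cor <- prop_var2(pca_cor)

# Side-by-side projections onto PC1 and PC2, colored by class.
par(mfrow = c(1, 2))
plot(pca_cov$x[, 1:2], col = classes, main = "Covariance PCA")
plot(pca_cor$x[, 1:2], col = classes, main = "Correlation PCA")
```

The zero-variance filter matters more on real expression data than on this toy matrix, since a gene that is constant across all samples makes prcomp() stop with an error when `scale. = TRUE`. For the 3-dimensional bonus, the first three columns of `pca_cov$x` could be passed to a 3-D plotting package such as rgl or scatterplot3d, which is not shown here.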