Cancer Genetics PCA

Read in the data. The load() function reads in a dataset that has 20532 columns and may take some time. You may want to clear your environment (or open a new RStudio window) if you have other work open.

load('geneCancerUCI.RData')
table(cancerlabels$Class)

## 
## BRCA COAD KIRC LUAD PRAD 
##  300   78  146  141  136

Original Source: The Cancer Genome Atlas Pan-Cancer Analysis Project

BRCA = Breast Invasive Carcinoma
COAD = Colon Adenocarcinoma
KIRC = Kidney Renal Clear Cell Carcinoma
LUAD = Lung Adenocarcinoma
PRAD = Prostate Adenocarcinoma

We will want to plot the data points colored by their class labels, so we should pick out a nice color palette suited to categorical attributes.

library(RColorBrewer)
display.brewer.all()

palette(brewer.pal(n = 8, name = "Dark2"))

The first step is typically to explore the data. Obviously we can’t look at ALL the scatter plots of input variables. For the fun of it, let’s look at a few of these scatter plots which we’ll pick at random. First pick two column numbers at random, then draw the plot, coloring by the label.

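# Pick two column numbers at random (sampling starts at column 2)
randomColumns = sample(2:20532, 2)
# Color each point by its class; as.factor() ensures the labels index into the palette
plot(cancer[, randomColumns], col = as.factor(cancerlabels$Class))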
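Coloring by label works because a factor is stored as integer codes, and those codes index into the current palette. The following is a minimal sketch of that mechanism on simulated data; `fakeLabels` and `fakeData` are hypothetical stand-ins for `cancerlabels$Class` and two columns of `cancer`, not the real data.

```r
# Simulated stand-ins (hypothetical; not the real gene expression data)
set.seed(1)
classNames <- c("BRCA", "COAD", "KIRC", "LUAD", "PRAD")
fakeLabels <- factor(sample(classNames, 50, replace = TRUE), levels = classNames)
fakeData <- matrix(rnorm(100), ncol = 2)

# A factor is stored as integer codes; those codes pick entries from palette()
pointColors <- palette()[as.integer(fakeLabels)]

# Using pointColors here gives the same coloring as col = fakeLabels
plot(fakeData, col = pointColors, pch = 19, xlab = "Column A", ylab = "Column B")

# A legend makes the color-to-class mapping explicit
legend("topright", legend = levels(fakeLabels),
       col = palette()[seq_along(levels(fakeLabels))], pch = 19)
```

If the labels were stored as character strings rather than a factor, wrapping them in `as.factor()` first would give the same behavior.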

Lab:

Project the data onto the first two principal components. View the projection for both correlation and covariance PCA.
Color the projection by Class.
What proportion of variance is captured in your 2-dimensional plot?
Bonus points if you can render the projection in 3 dimensions!
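As a starting point, the mechanics of the first three tasks can be sketched on a small simulated matrix using base R's `prcomp()`. This is only an illustration under stated assumptions: `fakeData`, `fakeLabels`, and the helper `propVar()` are hypothetical names, and the simulated values have nothing to do with the cancer data.

```r
# Simulated stand-in for the expression matrix (hypothetical data)
set.seed(42)
n <- 60
fakeLabels <- factor(rep(c("A", "B", "C"), each = n / 3))
fakeData <- matrix(rnorm(n * 10), nrow = n)
fakeData[, 1] <- fakeData[, 1] + 3 * as.integer(fakeLabels)  # separate the groups
fakeData[, 2] <- 10 * fakeData[, 2]                          # one high-variance column

# Covariance PCA centers the columns; correlation PCA also scales them to unit variance
pcaCov <- prcomp(fakeData, center = TRUE, scale. = FALSE)
pcaCor <- prcomp(fakeData, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by the first k components
propVar <- function(pca, k = 2) sum(pca$sdev[1:k]^2) / sum(pca$sdev^2)
propCov <- propVar(pcaCov)
propCor <- propVar(pcaCor)

# The projected data (scores) live in $x; plot the first two, colored by label
par(mfrow = c(1, 2))
plot(pcaCov$x[, 1:2], col = fakeLabels, pch = 19, main = "Covariance PCA")
plot(pcaCor$x[, 1:2], col = fakeLabels, pch = 19, main = "Correlation PCA")
```

Two things to watch for when adapting this to the real data. `prcomp()` stops with an error when asked to scale a constant column, so if any genes have zero variance they would need to be dropped before running correlation PCA. Also, the sampling range `2:20532` used earlier suggests that column 1 may not be an expression value; if that is the case, it should be excluded from the PCA input.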

Shaina Race