% compile: % graphics.off(); rm(list=ls()); % library(knitr); % knit("rnaseq2.rnw"); % for (i in 1:2) system('R CMD pdflatex rnaseq2.tex') % extract R code % purl('rnaseq2.rnw') \documentclass{article} \usepackage{amsmath} \usepackage{natbib} \usepackage{mathpazo} %\usepackage{enumerate} \usepackage{soul} \usepackage{cases} \setlength{\parindent}{0cm} \usepackage{xhfill} \usepackage[labelformat=empty]{caption} \date{July, 2016} % common abbreviations % bold symbols \newcommand{\bdelta}{\boldsymbol \delta} \newcommand{\bI}{\boldsysmbol I} \newcommand{\bM}{\boldsymbol M} \newcommand{\bC}{\boldsymbol C} \newcommand{\bmu}{\boldsymbol \mu} \newcommand{\bomega}{\boldsymbol \omega} \newcommand{\balpha}{\boldsymbol \alpha} \newcommand{\bpi}{\boldsymbol \pi} \newcommand{\bP}{\boldsymbol P} \newcommand{\bR}{\boldsymbol R} \newcommand{\bS}{\boldsymbol S} \newcommand{\bSigma}{\boldsymbol \Sigma} \newcommand{\btau}{\boldsymbol \tau} \newcommand{\bbeta}{\boldsymbol \beta} \newcommand{\bt}{\boldsymbol t} \newcommand{\bU}{\boldsymbol U} \newcommand{\bV}{\boldsymbol V} \newcommand{\bX}{\boldsymbol X} \newcommand{\bx}{\boldsymbol x} \newcommand{\bZ}{\boldsymbol Z} \newcommand{\bz}{\boldsymbol z} \newcommand{\bY}{\boldsymbol Y} \newcommand{\by}{\boldsymbol y} \newcommand{\bh}{\boldsymbol h} \newcommand{\ba}{\boldsymbol a} \newcommand{\bb}{\boldsymbol b} \newcommand{\cS}{\mathcal{S}} \newcommand{\btheta}{\boldsymbol \theta} \newcommand{\htheta}{\hat{\theta}} \newcommand{\bvarepsilon}{\boldsymbol \varepsilon} %column vectors \newcount\colvec[1] \newcommand*\colvec[1]{ \global\colveccount#1 \begin{\pmatrix} \colvecnext } \def\colvecnext#1{ #1 \global\advance\colveccount-1 \ifnum\colveccount>0 \\ \expandafter\colvecnext \else \end{pmatrix} \fi } <>= options(digits=3,width=100) opts_chunk$set(echo=TRUE, tidy=FALSE, include=TRUE, dev='png', fig.width=6, fig.height=3.5, comment=' ', dpi=300, cache=TRUE, lazy.load=FALSE, background='grey93') @ \title{Differential Expression Analysis of RNA--Seq Data} \begin{document} \maketitle \tableofcontents- %------------------------------------------- \section{Required packages} %------------------------------------------- <>= library(genefilter) library(ggplot2) library(plyr) library(LSD) library(gplots) library(S4Vectors) library(IRanges) library(edgeR) library(DESeq2) library(RColorBrewer) library(stringr) library(topGO) library(biomaRt) library(dplyr) library(EDASeq) library(fdrtool) library(xlsx) @ \section{Introduction} Here we will present an introduction of a general workflow for a typical RNA-Seq data analysis: \begin{itemize} \item data import \item data preprocessing \item differential expression analysis (DEA) \item gene set enrichment analysis (GSEA) \end{itemize} We will compare the expression profiles of wild-type (WT) mice to mice with a deletion in MYC enhancer which is essential for proper lip/palate formation. \section{Data import and preprocessing} We will start with the counts data of each gene for every sample, which is stored in a \file{.txt} file: <>= countData <- read.table("bottomly_count_table.txt", header=T, row.names=1) # import condition data phenoData <- read.table("bottomly_phenodata.txt", header=T, row.names=1) condition <- phenoData$strain @ \subsection{Quality control of the count data} Quality control is the essential step for RNA-Seq data analysis. Let's check how many genes have non-zero counts in all samples: <>= idx.nz <- apply(countData, 1, function(x) all(x > 0)) sum(idx.nz) @ In a typical RNA-Seq experiment, there will be at least several thousand genes that are expressed in all samples. If the number of non-zero genes is very low, there is usually something wrong with at least some of the samples. \section{Analysis using edgeR} We now can feed the count data as well as the phenotypic data to the \texttt{edgeR}: <>= # create DGEList object library(edgeR) y <- DGEList(counts=countData, group=condition, remove.zeros=TRUE) y @ `y` is an instance of \texttt{DGEList}, an \textcolor{blue}{S4} class for storing read counts and associated information from next-generation sequencing. \subsection{Normalization} As different libraries will have a different sequencing depths, normalization should be conducted to ensure that the parameters across samples are comparable. Generally, the ratios of the size factors should roughly match the ratios of library sizes. Dividing each column of the count data by the corresponding size factors (\textcolor{blue}{the product of library size and normalization factor}) yields normalized counts, which can be scaled to give a count per million (CPM) interpretation. <>= # normalization y <- calcNormFactors(y) @ Now the size factors are stored in \texttt{y$samples$norm.factors}: <>= y$samples$lib.size y$samples$norm.factors all(getOffset(y) == log(y$samples$lib.size * y$samples$norm.factors)) prod(y$samples$norm.factors) @ We can see that the normalization factors are adjusted to multiply to 1. Boxplots will illustrate the skewed distribution and zero-inflation of the count data for each sample: <>= # boxplot library(vioplot) par(mfrow=c(1,11)) for (i in 1:11) { vioplot(y$counts[,i]/(y$samples$lib.size[i] * y$samples$norm.factors[i])) } @ Can log-transformation turn the heavily right-skewed data to approximately symmetric? <>= vioplot(log(y$counts[,1]/(y$samples$lib.size[1] * y$samples$norm.factors[1]))) @ Although it is still not symmetric, however, it is less skewed now, right? \subsubsection{Exercise 1} Have a looook at the normalization method, can you figure out which sample is chosen as the reference sample for computing normalization factors using ``TMM'' normalization method? \subsection{Dispersion estimation} It is known that a standard Poisson model can only account for the technical noise in RNA-Seq data. In the Poisson model the variance is equal to the mean, while in real RNA-Seq data the variance is greater than the mean, a general phenomenon often encountered in count data and referred to as ``overdispersion''. A popular way to model this is to use a Negative-Binomial distribution (NB), which includes an additional parameter dispersion parameter $\alpha$ such that $E(NB) = \mu$ and \begin{align*} \text{Var[NB($\mu$, $\alpha$)]} = \mu + \alpha \cdot \mu^2 \end{align*} Hence, the variance is greater than the mean. Hence, the first step in the analysis of differential expression, is to obtain an estimate of the dispersion parameter for each gene. The typical shape of the dispersion fit is an exponentially decaying curve. The asymptotic dispersion for highly expressed genes can be seen as a measurement of biological variability in the sense of a squared coefficient of variation: a dispersion value of 0.01 means that the gene's expression tends to differ by typically $\sqrt{0.01}$ = 10\% between samples of the same treatment group. For weak genes, the Poisson noise is an additional source of noise, which is added to the dispersion. <>= # plot biological coefficients of variation versus gene ### abundance (log2(counts_per_million)) (CPM) y <- estimateCommonDisp(y) y <- estimateTrendedDisp(y) y <- estimateTagwiseDisp(y) plotBCV(y) @ \subsection{Statistical testing of Differential expression} Exact test is used to test for differential expression between two groups of negative binomial count libaries: <>= ## exact test for DEA res <- exactTest(y, pair=c(unique(y$samples$group)[1],unique(y$samples$group)[2])) tp <- topTags(res, 20000)$table @ Count data sets typically show a (left-facing) trombone shape in average vs mean--difference plots (MA-plots), reflecting the higher variability of log ratios at lower counts. In addition, points will typically be centered around a log ratio of 0 if the normalization factors are calculated appropriately, although this is just a general guide: <>= # MA-plot: logFC ~ logCPM #pdf("bottomly-edgeR-MA.pdf") plot(tp$logCPM, tp$logFC, xlim=c(-3, 20), ylim=c(-12, 12), pch=20, cex=.3, col = ifelse( tp$FDR < .10, "red", "black" ), xlab="logCPM", ylab="logFC") #dev.off() @ Multiple dimensionality scaling (MDS) can be used to plot samples on a 2-d scatterplot so that the distances on the plot approximate the typical log2 fold change between the samples: <>= ## MDS-plot: multi-dimensionality scaling (MDS) #pdf("bottomly-edgeR-MDS.pdf") plotMDS(y, col=ifelse(y$samples$group=='C57BL/6J', "red", "blue"), cex=.4) #dev.off() @ \subsubsection{Exercise 2} How many genes are significantly differential expressed between these two groups based on the BH-corrected P-values? \subsection{Volcano plot} <>= ## Volcano plot plot(tp$logFC, -log(tp$FDR), type='p', cex=.3, xlab="logFC", ylab="-log(P)", col=ifelse(abs(tp$logFC)>3 & -log(tp$FDR)>6, "red", "black")) abline(v=c(-3,3)) abline(h=6) @ We can also output the analysis result into a file: <>= # output CSV write.csv(tp, "bottomly_edgeR_mut-vs-wt.csv") @ \section{Analysis using DESeq2} `DESeq2` is used to estimate the variance-mean dependencies in count data from RNA-Seq assays and test for differential expression based on a model using the negative binomial distribution. \subsection{Creating a DESeqDataSet} <> library(DESeq2) # create DESeqDataSet: countData, colData, design des <- DESeqDataSetFromMatrix( countData = as.matrix(countData), colData = data.frame(condition), design = ~condition) @ The DataSet contains the required information for <>= rowRanges(des) mcols(rowRanges(des)) @ \subsection{Obtain extra annotation information using \texttt{biomaRt}} We first set up the mart and then get the annotation via the function \texttt{getBM} in \Rpackage{biomaRt}. <>= ## for mm10 library(biomaRt) ensembl76 <- useEnsembl("ensembl", dataset="mmusculus_gene_ensembl") bm <- getBM( attributes = c("ensembl_gene_id", "external_gene_name", "description"), filter="ensembl_gene_id", values = rownames(des), mart=ensembl76) # arrange rows of bm in ascending order according to the gene ids bm <- arrange(bm, ensembl_gene_id) head(bm) @ We then add this as row (or feature) metadata to the DESeq2 table using a join command: <>= DESeq2Features <- data.frame(ensembl_gene_id = rownames(des)) # join them together rowData <- dplyr::left_join(DESeq2Features, bm, by="ensembl_gene_id") rowData <- as(rowData, "DataFrame") ### add the annotation to the DESeq2 table mcols(rowRanges(des)) <- c(mcols(rowRanges(des)), rowData) save(des, file="geneCounts.RData") @ Then we can load the count data for later analylsis: <>= load("geneCounts.RData") des @ \section{Quality control and Normalization of the count data} After creating a count table and feeding it to the DESeq2 object, the following step is the quality control of the data. Let's check how many genes have non-zero counts in all samples: <>= GeneCounts <- counts(des) idx.nz <- apply(GeneCounts, 1, function(x) { all(x > 0)}) sum(idx.nz) @ In a typical RNA-Seq experiment, there will be at least several thousand genes that are expressed in all samples. If the number of expressed genes is very low, there must be something wrong with at least some of the samples (e.g. the sample preparation was not done properly, or there was some problem with the sequencing). \subsection{Normalization } As different libraries will be sequenced to different depths, offsets are built in the statistical model of \Biocpkg{DESeq2} to ensure that parameters are comparable. The term normalization is often used for that, but it should be noted that the raw read counts are not actually altered. \Biocpkg{DESeq2} defines a virtual reference sample by taking the median of each gene's values across samples and then computes size factors as the median of ratios of each sample to the reference sample. Generally, the ratios of the size factors should roughly match the ratios of the library sizes. Dividing each column of the count table by the corresponding size factor yields normalized count values, which can be scaled to give a counts per million interpretation. Thus, if all size factors are roughly equal to one, the libraries have been sequenced equally deeply. Count data sets typically show a (left-facing) trombone shape in average vs mean--difference plots (MA--plots), reflecting the higher variability of log ratios at lower counts. In addition, points will typically be centered around a log ratio of 0 if the normalization factors are calculated appropriately, although this is just a general guide. <>= #### estimate size factors des <- estimateSizeFactors(des) sizeFactors(des) # plot densities of counts for the different samples ### to assess their distributions multiecdf(counts(des, normalized = T)[idx.nz ,], xlab="mean counts", xlim=c(0, 1000)) multidensity( counts(des, normalized = T)[idx.nz ,], xlab="mean counts", xlim=c(0, 1000)) ### looks good! ## check pairwise MA plots #pdf("pairwiseMAs.pdf") library(EDASeq) MA.idx = t(combn(1:8, 2)) for( i in 1:15){ MDPlot(counts(des, normalized = T)[idx.nz ,], c(MA.idx[i,1],MA.idx[i,2]), main = paste( colnames(des)[MA.idx[i,1]], " vs ", colnames(des)[MA.idx[i,2]] ), ylim = c(-3,3)) } #dev.off() ## looks good, no systematic shift visible! @ \subsection{PCA and sample heatmaps} A useful first step in an RNA-Seq analysis is often to assess overall similarity between samples: Which samples are similar to each other, which are different? Does this fit to the expectation from the experiment's design? We use the \Rfunction{dist} to calculate the Euclidean distance between samples. To avoid that the distance measure is dominated by a few highly variable genes, and have a roughly equal contribution from all genes, we use it on the regularized log-transformed data. The aim of the regularized log-transform is to stabilize the variance of the data and to make its distribution roughly symmetric since many common statistical methods for exploratory analysis of multidimensional data, especially methods for clustering and ordination (e.g., principal-component analysis and the like), work best for (at least approximately) homoskedastic data; this means that the variance of an observable quantity (i.e., here, the expression strength of a gene) does not depend on the mean. In RNA-Seq data, however, variance grows with the mean. For example, if one performs PCA directly on a matrix of normalized read counts, the result typically depends only on the few most strongly expressed genes because they show the largest absolute differences between samples. A simple and often used strategy to avoid this is to take the logarithm of the normalized count values plus a small pseudocount; however, now the genes with low counts tend to dominate the results because, due to the strong Poisson noise inherent to small count values, they show the strongest relative differences between samples. Note that this effect can be diminished by adding a relatively high number of pseudocounts, e.g. 32, since this will also substantially reduce the variance of the fold changes. As a solution, \Biocpkg{DESeq2} offers the regularized-logarithm transformation, or `rlog` for short. For genes with high counts, the rlog transformation differs not much from an ordinary log2 transformation. For genes with lower counts, however, the values are shrunken towards the genes' averages across all samples. Using an empirical Bayesian prior in the form of a ridge penality, this is done such that the rlog-transformed data are approximately homoskedastic. Note that the rlog transformation is provided for applications other than differential testing. For differential testing it is always recommended to apply the DESeq function to raw counts. Note the use of the function \Rfunction{t} to transpose the data matrix. We need this because dist calculates distances between data rows and our samples constitute the columns. We visualize the distances in a heatmap, using the function \Rfunction{heatmap.2} from the \CRANpkg{gplots} package. The heatmap is saved as a \file{.pdf} file. <>= ### produce rlog-transformed data rld <- rlogTransformation(des, blind=TRUE) ## create a distance matrix between the samples pdf("HeatmapPlots.pdf") distsRL <- dist(t(assay(rld))) mat <- as.matrix(distsRL) library(RColorBrewer) hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) library(gplots) heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(5, 5)) dev.off() @ Another way to visualize sample-to-sample distances is a principal components analysis (PCA). In this ordination method, the data points (i.e., here, the samples) are projected onto the 2D plane such that they spread out optimally. Here, we use the function \Rfunction{plotPCA} which comes with \Biocpkg{DESeq2}. The term specified as \Robject{intgroup} are the column names from our sample data; they tell the function to use them to choose colors. <>= DESeq2::plotPCA(rld, intgroup=c("condition")) @ \section{Differential expression analysis} \subsection{Dispersion estimation} It is well known that a standard Poisson model can only account for the technical noise in RNA--Seq data. In the Poisson model the variance is equal to the mean, while in real RNA-Seq data the variance is greater than the mean, which is often encountered in general count data and referred to as ``overdispersion''. A popular way to model this is to use a Negative-Binomial distribution (NB) with additional parameter dispersion parameter $\alpha$ such that \begin{align*} \text{E[NB($\mu$, $\alpha$)]} = \mu \\ \text{Var[NB($\mu$, $\alpha$)]} = \mu + \alpha \cdot \mu^2 \end{align*} Hence, the first step in the analysis of differential expression, is to obtain an estimate of the dispersion for each gene. The typical shape of the dispersion fit is an exponentially decaying curve. The asymptotic dispersion for highly expressed genes can be seen as a measurement of biological variability in the sense of a squared coefficient of variation: a dispersion value of 0.01 means that the gene's expression tends to differ by typically $\sqrt{0.01}$ = 10\% between samples of the same treatment group. <>= des <- estimateDispersions(des) plotDispEsts(des) @ \subsection{Statistical testing of Differential expression} We can perform the statistical testing for differential expression and extract its results. Calling \Rfunction{nbinomWaldTest} performs the test for differential expression, while the call to the \Rfunction{results} function extracts the results of the test and returns adjusted $p$-values according to the Benjamini-Hochberg rule to control the FDR. Wald testis a test for coefficients in a regression model, which is based on a $z$-score, i.e. a $N(0,1)$ distributed (under $H_0$) test statistic. <>= des <- nbinomWaldTest(des) des2Res <- results(des, pAdjustMethod = "BH") ### number of siginificant DE-genes table(des2Res$padj < 0.05) @ Then, we can identify the differentially expressed genes. \subsubsection{Inspection and correction of $p$-values} The null $p$-values follow a uniform distribution on the unit interval $[0,1]$ if they are computed using a continuous null distribution. Significant $p$-values thus become visible as an enrichment of $p$-values near zero in the histogram. Thus, $p$-value histogram of ``correctly'' computed $p$-values will have a rectangular shape with a peak at 0. A histogram of $p$--values should always be plotted in order to check whether they have been computed correctly. We also do this here: <>= hist(des2Res$pvalue, col = "lavender", main = "WT vs Deletion", xlab = "p-values") @ We can see that this is clearly not the case for the $p$-values returned by \Biocpkg{DESeq2} in this case. Very often, if the assumed variance of the null distribution is too high, we see hill-shaped $p$-value histogram. If the variance is too low, we get a U-shaped histogram, with peaks at both ends. Here we have a hill-shape, indicating an overestimation of the variance in the null distribution. Thus, the $N(0,1)$ null distribution of the Wald test is not appropriate here. The dispersion estimation is not condition-specific and estimates only a single dispersion estimate per gene. This is sensible, since the number of replicates is usually low. However, if we have e.g. batches or ``outlying'' samples that are consistently a bit different from others within a group, the dispersion within the experimental group can be different and a single dispersion parameter not be appropriate. For an example of the estimation of multiple dispersions, see the analysis performed in: \href{http://dx.doi.org/10.1073/pnas.1307202110}{Reyes et. al. - Drift and conservation of differential exon usage across tissues in primate species, 2013} Fortunately, there is software available to estimate the variance of the null-model from the test statistics. This is commonly referred to as ``empirical null modelling''. Here we use the \CRANpkg{fdrtool} for this using the Wald statistic as input. This packages returns the estimated null variance, as well as estimates of various other FDR--related quantities and the $p$-values computed using the estimated null model parameters. An alternative and widely used-package for this task is \CRANpkg{locfdr}. <>= ### remove filtered out genes by independent filtering, ### they have NA adj. pvals des2Res <- des2Res[ !is.na(des2Res$padj), ] ### remove genes with NA pvals (outliers) des2Res <- des2Res[ !is.na(des2Res$pvalue), ] ### remove adjsuted pvalues, since we add the fdrtool results later on ### (based on the correct p-values) des2Res <- des2Res[, -which(names(des2Res) == "padj")] ### use z-scores as input to FDRtool to re-estimate the p-value FDR.des2Res <- fdrtool(des2Res$stat, statistic= "normal", plot = T) ### null model variance FDR.des2Res$param[1, "sd"] ### add values to the results data frame, also ad new BH- adjusted p-values des2Res[,"padj"] <- p.adjust(FDR.des2Res$pval, method = "BH") @ \CRANpkg{fdrtool} estimates the variance of the null model, which is less than the theoretical sd of 1, as expected from the $p$-value histogram. We can plot the histogram of the ``correct'' $p$-values. <>= hist(FDR.des2Res$pval, col = "royalblue4", main = "WT vs Deletion, correct null model", xlab = "CORRECTED p-values") @ As we can see, the null model is now correct and the histogram has the expected shape. \subsubsection{Extracting differentially expressed genes } We can now extract the number of differentially expressed genes and produce an MA plot of the two-group comparison. Similar to its usage in the initial quality control, a MA plot provides a useful overview for an experiment with a two-group comparison: The plot represents each gene with a dot. The $x$-axis is the average expression over all samples, the y axis the log2 fold change between treatment and control. Genes with an FDR value below a threshold (here 0.1) are shown in red. This plot demonstrates that only genes with a large average normalized count contain more information to yield a significant call. Also note \Biocpkg{DESeq2}'s shrinkage estimation of log fold changes (LFCs): When count values are too low to allow an accurate estimate of the LFC, the value is ``shrunken'' towards zero to avoid that these values, which otherwise would frequently be unrealistically large, dominate the top-ranked log fold changes. <>= table(des2Res[,"padj"] < 0.05) plotMA(des2Res) @ We now identify the differentially expressed genes at an FDR of $0.1$. \section{Gene ontology enrichment analysis} We can now try characterize the identified differentially expressed genes a bit better by performing an GO enrichment analysis. Essentially the gene ontology (\url{http://www.geneontology.org/}) is hierarchically organized collection of functional gene sets. \subsection{Matching the background set} The function \Rfunction{genefinder} from the \Biocpkg{genefilter} package will be used to find background genes that are similar in expression to the differentially expressed genes. The function tries to identify 10 genes for each \textbf{DE-gene} that match its expression strength. We then check whether the background has roughly the same distribution of average expression strength as the foreground by plotting the densities. We do this in order not to select a biased background since the gene set testing is performed by a simple Fisher test on a $2\times 2$ table, which uses only the status of a gene, i.e. whether it is differentially expressed or not and not its fold-change or absolute expression. Note that the chance of a gene being identified as DE will most probably depend on its gene expression for RNA-Seq data (potentially also its gene length etc.). Thus it is important to find a matching background. Here, we explicitly model the background. <>= ## get average expressions overallBaseMean <- as.matrix(des2Res[, "baseMean", drop = F]) sigGenes <- rownames(subset(des2Res, padj < 0.05)) anno <- mcols(rowRanges(des))[,1:3] anSig <- subset(anno, ensembl_gene_id %in% sigGenes) backG <- genefinder(overallBaseMean, anSig$ensembl_gene_id, 10, method = "manhattan") ## get identified similar genes backG <- rownames(overallBaseMean)[as.vector(sapply(backG, function(x)x$indices))] ## remove DE genes from background backG <- setdiff(backG, anSig$ensembl_gene_id) ## number of genes in the background length(backG) multidensity( list( all= log2(des2Res[,"baseMean"]) , fore=log2(des2Res[anSig$ensembl_gene_id, "baseMean"]), back=log2(des2Res[backG, "baseMean"])), xlab="log2 mean counts", main = "Matching for enrichment analysis") @ We can see that the matching returned a sensible result, we can now perform the actual testing. For this purpose we use the \Biocpkg{topGO} package which implements a nice interface to Fisher testing and also has additional algorithms taking the GO structure into account, by e.g. only reporting the most specific gene set in the hierarchy. For more details, see the paper: \begin{itemize} \item \href{http://dx.doi.org/10.1093/bib/bbr002}{Alexa et. al. -- Improved scoring of functional groups from gene expression data by decorrelating GO graph structure - Bioinformatics, 2006} \end{itemize} The GO has three top ontologies, cellular component (CC), biological processes (BP), and molecular function (MF). Here we test all of them subsequently. \subsection{Running topGO} We first create a factor \Robject{alg} which indicates for every gene in our universe (union of background and DE-genes), whether it is differentially expressed or not. It has the ENSEMBL ID's of the genes in our universe as names and contains 1 if the gene is DE and 0 otherwise. << create factor of interesting genes>>= onts = c( "MF", "BP", "CC" ) geneIDs = rownames(overallBaseMean) inUniverse = geneIDs %in% c(anSig$ensembl_gene_id, backG) inSelection = geneIDs %in% anSig$ensembl_gene_id alg <- factor( as.integer( inSelection[inUniverse] ) ) names(alg) <- geneIDs[inUniverse] @ We first initialize the \Biocpkg{topGO} data set for each top ontology, using the GO annotations contained in the annotation data base for mouse \Biocannopkg{org.Mm.eg.db}. The \Robject{nodeSize} parameter specifies a minimum size of a GO category we want to use: i.e. here categories with less than 5 genes are not included in the testing. Since we have ENSEMBL IDs as our key, we have to specify it since \Biocpkg{topGO} uses ENTREZ identifiers by default. Now the tests can be run. \Biocpkg{topGO} offers a wide range of options, for details see the paper or the package vignette. Here we run two common tests: an ordinary Fisher test for every GO category, and the ``elim" algorithm, which tries to incorporate the hierarchical structure of the GO and to ``decorrelate" it. The ``elim" algorithm starts processing the nodes/GO category from the highest (bottommost) level and then iteratively moves to nodes from a lower level. If a node is scored as significant, all of its genes are marked as removed in all ancestor nodes. This way, the ``elim" algorithm aims at finding the most specific node for every gene. The tests use a 0.01 $p$-value cutoff by default. We order by the classic algorithm to make the analysis comparable to the one performed in the original paper. << run tests, results='hide' >>= tab = as.list(onts) names(tab) = onts ### test all three top level ontologies for(i in 1:3){ ## prepare data tgd <- new( "topGOdata", ontology=onts[i], allGenes = alg, nodeSize=5, annot=annFUN.org, mapping="org.Mm.eg.db", ID = "ensembl" ) ## run tests resultTopGO.elim <- runTest(tgd, algorithm = "elim", statistic = "Fisher" ) resultTopGO.classic <- runTest(tgd, algorithm = "classic", statistic = "Fisher" ) ## look at results tab[[i]] <- GenTable( tgd, Fisher.elim = resultTopGO.elim, Fisher.classic = resultTopGO.classic, orderBy = "Fisher.classic" , topNodes = 200) } @ We can now look at the results, we look at the top 200 GO categories according to the ``Fisher classic" algorithm. The function \Rfunction{GenTable} produces a table of significant GO categories. Finally, we bind all resulting tables together and write them to an \file{.csv}--file that can be view with spreadsheet programs. << process results>>= topGOResults <- rbind.fill(tab) write.csv(topGOResults, file = "topGOResults.csv") @ Gene set enrichment analysis is still a field of very extensive research in bioinformatics. For additional approaches see the \Biocpkg{topGO} vignette and the references therein and also GeneSetEnrichment workflow in Bioconductor: \begin{itemize} \item \url{http://bioconductor.org/packages/release/BiocViews.html#___GeneSetEnrichment} \end{itemize} \end{document}