# Lab 5B: RNA-seq Data Analysis ## Outline 1. RNA-Seq experimental design 2. Transcript quantification, and normalization 3. Differential expression analysis 4. Pathway anaysis 5. RNA-Seq assembly 6. Alternative splicing analysis ## Transcriptome profiling 1. Entire transcriptome can be measured by microarray or RNA-Seq. 2. Widely-sed techniques, provide insight into biological system, albeit a snapshot - highly dynamic and complex process (splicing, gene methylation, RNA stability/degradation, miRNA regulation etc.) ## Public repositories - [Gene expression Omnibus (GEO)](https://www.ncbi.nlm.nih.gov/geo/) - [Sequence Read Archive (SRA)](https://www.ncbi.nlm.nih.gov/sra/) - [ENCODE](http://encodeproject.org) - [Allen brain atlas](https://www.brain-map.org) - [Genotype-Tissue Expression Project (GTEX)](https://gtexportal.org) - [The Cancer Genome Atlas (TCGA)](https://cancergenome.nih.gov) ## Profiling Designs - Disease vs. control - Gene knockdown/knockout vs. wildtype - Effect of treatment/stimulus/drug - Clinical applications * Tumor-normal pairs * Good prognosis vs. poor prognosis * Patient subgroups responding to different treatments. * "Gene signature" to predict the treatment response. - Time course - Different tissues/stages of development ## Premise of gene expression profiling - Compare gene expression in different conditions - Differentially expressed genes may provide some biological insights. - But not magical solutions! Large amounts of descriptive data generated - what to do next? ## Limitations of gene expression data - Comprehensive but inherently limited to descriptive results, no matter how well experiment performed or data analyzed. - Produce large amount of information: subjective interpretation, can be mined in different ways, alway muth left untouched (often publically available). - Expensive and time-consuming so often published as a stand-alone experiment. - However best used as starting point for further work - following-up hypotheses from gene expression data to uncover mechanistic/causal effects can produce elegant studies. ## Choice of microarray or RNA-Seq? - Choice usually depends how detailed a characterisation of the transcriptome is required. * Gene-level changes - microarrays sufficient * Isoform structure, splicing, novel transcripts - RNA-Seq - Many low-expressed genes in a given sample type in both technologies. ## Consideration of experiment design - Number of replicates * depens on context - type of sample, size of effect, heterogeneity within condition * Cell line (n=3) -> mouse model (n=5) -> human samples (n=10) -> human clinical samples for heterogeneous disease (n=100) - Sequencing depth (number of reads per sample) * depends on experiment questions: quantifying gene expression -> splicing anlysis -> certain library prep methods (Ribo-depletion) - Single or paired-end sequencing? * Increased mapped reads with PE data - better anchoring - Good experimental design principles * Sufficient replication/depth for purpose * Avoid confounding factors when obtaining/preparing samples - gene expression data highly sensitive to many factors. * Be aware of potential effects of unrelated factors on the data, which may need to be accounted for to optimise studies. ## RNA-Seq sequencing workflow - reverse transcription from RNA to cDNA - fragmentation of cDNA - ligating adapters to fragments - melting - attaching to the flowcell - amplifying - reading sequences - from laser snapshots to sequenced reads ### mRNA sequencing protocols - Enrichment for mRNA: * oligo-dT capture * rRNA depletion (riboDep) ### non-mRNA sequencing protocols - total RNA (coding and non-coding) - targetd RNA sequencing (probes) - ribosome profiling (sequencing ribosome-protected mRNA fragments) - microRNA by size selection - nuclear RNA (pre-mRNA, chromatin-associated RNA and nucleoplasmic RNA) - CAGE (cap analysis; introducing a biotin group to the cap structure, capturing it with oligo primers) - SAGE (sequencing 3'-end; chopping off 11 bp from 3'-end) - single-cell RNA sequencing ### Quality control - biases common for any NGS data * Amplification bias: pre-sequencing amplification to get enough input yield. * bridge amplification on the flow cell. * introduction of single-nucleotide errors. * depending on the sequence content fragments anneal/melt with variable efficiency. * errors introduced by the polymerase during actual seqeuncing: partial solution, number of reads containing this error. * signal not strong enough for the CCD camera: partial solution, Phred quality score * Phred score: per-base quality of the sequencer's base calling. * Phred score $Q$ is related to the base calling error probability $P$. * Illumina sequencing quality: >99.9\% * Illumina sequencing quality drops towards the end of the read. * This happens mainly due to phasing (blocking does not work perfectly) * - biases specific to RNA-Seq ## Alignment ### Alignment strategies - Gapped alignment - Spliced junctions detection ### Alignment format - Sequence Alignment Map (SAM) format - Binary SAM (BAM) format (for the sake of storage) - SAMtools, BAMtools - [Specification](https://samtools.github.io/hts-specs/SAMv1.pdf) ### SAM file - __Header__: information about the reference. - Contains one line per alignment! - For one paired-end fragment, there will be at least two lines (one line for each end). - Multiple alignments for one read == multiple lines in a SAM file. - Multiple columns * 1 - Read name * 2 - Bitwise flag * 3 - Reference sequence name (chromosome name) * 4 - Leftmost position of the alignment * 5 - Mapping quality * 6 - CIGAR string * 7 - Ref. name for the alignment of the second end. * 8 - Leftmost position of the second end - CIGAR string * `M`: match * `X`: mismatch * `I`: insertion * `D`: deletion * `N`: big insertion (defined only for RNA aligners) ### Post-alignment quality control - number of mapped reads (>80%) - number of uniquely mapped reads (>70%) - number of non-duplicates (>70%) - number of properly-paired reads (varies) - check for contamination: e.g., number of reads mapped to chrY - number of reads remaining after all filtering steps should be > 20 million. ### Biases in alignments - mappability bias - sequencability bias - 5'-3' bias - reference bias ### Transcript quantification - Many tools available, all claiming to be the best performer - which to choose? - Review articles can be helpful. * A survey of best practices for RNA-seq data analysis. * Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data. * A benchmark for RNA-seq quantification pipelines. - Sensible pipelines should give reasonable results - make sure suitable for your data/purpose, key parameters are set correctly. - Tools can have differing input requirements and output formats - make sure using the expected format for a given tool. >> During the course of this study we discovered a number of assumptions that the programs tacitly made and that affected the interpretation of the results. Therefore, a specific recommendation that we can make to developers is to ensure that sufficiently detailed information on input requirements, potential pitfalls and the implication of specific options (ideally including usage examples) is provided. Kanitz et al. Genome Biology (2015) 16:150 ### Tools choice - __Adapter trimming__: `Trimmomatic`, `Cutadapt` - __Read mapping__: `Bowtie2`, `Tophat2`, `GSNAP`, `STAR`, `Subread`, `HISAT` - __Quantification__: `HTSeq-count`, `featureCounts`, `RSEM`, `eXpress` - __Differential__: `edgeR`, `DESeq2`, `limma/voom`, `BaySeq`, `CuffDiff2`, `Ballgown` - __De novo assembly__: `Trinity`, `SOAPdenovo-trans`, `Abyss-trans`, `Oases` ### General mapping issues - Multi-mapped reads (map to more than one location) * Gene families, pseudogenes, repeat regions, MHC region - Potential PCR duplicates (multiple reads mapping to identical position) * May exclude some genuine duplicate fragments, especially for high expressed genes ### Expression quantification methods Process of using aligned reads to quantify gene/transcript abundance. - Count-based (`HTSeq`, `featureCounts`) * Find overlap of aligned reads and known annotated features. * ouput is raw counts. - Model-based (e.g., `Cufflinks`, `RSEM`) * Find the optimal set of isoforms and their relative abundances from the observed data (aligned reads) * Short reads tend to map to multiple isoforms, hence the need for probabilistic models. * Output usually RPKM/FPKM or TPM ### `HTSeq` - Python-based suite of scripts/tools including `htseq-count` * Annotations from Ensembl/RefSeq etc. * Reads overlapping gene features counted. * Careful setting of parameters, e.g. strandedness option in particular * Usage: `htseq-count [options] ` ### `featureCounts` ### `RSEM` ### `Cufflinks` ## Differential expression analysis (DEA) - `edgeR` - empirical Bayes estimation and exact tests based on the negative binomial model. - `DEGseq` - Poisson model. - `DESeq` - negative binomial model. - `baySeq` - empirical Bayes approach. ## Experiment 1. Download the RNA-Seq data. 2. Download the reference genome and reference transcriptome. 3. FastQC on the data and preprocess the data. 4. Align the sequence data to the reference genome and gtf file. 5. Quantify the gene expression. 6. Conduct differential expression analysis (DEA) of the data. 7. Decipher the results using gene set enrichment analysis. ### Exercise Apply the following different pipelines to find the differential expressed gene list, and use venn-diagram to find the consistence and difference between these pipelines. - [Tophat2 + Cufflinks + Cuffdiff](https://wikis.utexas.edu/display/bioiteam/Tophat-Cufflinks-Cuffdiff%2C+allowing+for+novel+transcripts) - [Hisat2 + Stringtie + Ballgown](https://davetang.org/muse/2017/10/25/getting-started-hisat-stringtie-ballgown/) - [STAR + RSEM + DESeq2](https://ycl6.gitbooks.io/rna-seq-data-analysis/content/) - [Kallisto + sleuth](https://scilifelab.github.io/courses/rnaseq/labs/kallisto) ### Deafness Data on the server - `/home/data/bi390/lab5B/*.fastq.gz` - `/home/data/bi390/lab5B/metadata.csv`