Biology has undergone a revolution driven by the maturation of high-throughput RNA and DNA sequencing technologies. RNA sequencing can quantify the expression levels of thousands of genes at once and produce terabytes of data. These data can be combined with millions of DNA sequencing reads to identify genetic mutations that affect gene or protein levels. How can we determine the quality of such large data sets? How can we make sense of such vast data to prioritize genes or genetic variants that may help us treat human diseases? How can we protect ourselves from spurious and irreproducible results? In this course, we will learn fundamental techniques for analyzing RNA and DNA sequencing data. The course will begin by surveying the technology behind high-throughput sequencing and will progress to the alignment of RNA reads to a reference genome. We will then learn how to identify differentially expressed genes using methods that correct for potential biases and correlation structure in the data. Next, we will combine DNA sequences with gene expression data to understand how genetic variation produces differences in gene expression levels. Students interested in learning widely applicable bioinformatics techniques will benefit from this course. Students who complete this course will be able to read and assess the quality of high-throughput sequencing data, align RNA or DNA reads to a reference genome, quantify differences in gene expression between groups, and associate DNA sequence variation with gene expression variation. We will use the R programming language and Bioconductor libraries. Evaluation will be through quizzes, homework and a final project.


Prerequisites: Biology I: Cellular Processes of Life or equivalent, and either Python I or Data Science I.

