Workshops and Tutorials - CloudTech'17
Distributed computing for High-throughput biological datasets, computational biology, genomics and transcriptomics
Department of Computer Science, Centre for Systems and Synthetic Biology, Royal Holloway, University of London.
Jamie Alnasir completed his PhD in Computer Science and Molecular Biology at Royal Holloway, University of London. He also holds a Master's in Pharmacy and Chemistry from Kingston University and St George's Hospital Medical School. He is interested in applying computational techniques to problems in the sciences, in particular using distributed computing.
In this workshop we will introduce and review a selection of distributed computing technologies that are commonly applied to bioinformatics and biological datasets, and which are typically deployed as clusters, grids and clouds. We will focus on clusters and clouds, comparing batch-scheduled clusters with Hadoop-based MapReduce clusters (Hadoop and Spark) and discussing the HDFS distributed filesystem and YARN in more detail. We will discuss high-throughput datasets such as those generated by next-generation sequencing technologies, which are massively parallel and hence produce large volumes of data, i.e. Big Data. Processing such large datasets is integral to research and applications in specialist fields such as genomics and personalised medicine. For instance, as of January 2017, the Sequence Read Archive (SRA) alone, which stores DNA/RNA sequencing data, contains over 9 petabases (9.377x10^15, over 9 quadrillion letters, roughly a petabyte) of data in over 30,000 experimental studies. Hadoop MapReduce-based systems are ideally suited to these tasks because their architecture and file system are designed for massive scalability and fault tolerance. As Hadoop technologies are increasingly ubiquitous in industry and a number of cloud service providers offer these platforms, we will briefly review some distributed bioinformatics tools that have been implemented using MapReduce to handle such datasets. We conclude with an example of a transcriptomics analysis system implemented in Apache Spark, which can be used to quantify bias in the form of sequence-specific deviations in reads mapped to a reference genome. Understanding such bias is an integral part of gene expression studies and requires distributed technologies to process large datasets.
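To make the MapReduce idea concrete, the following is a minimal sketch in plain Python of how map, shuffle and reduce steps might count the short sequence found at the start of each mapped read and compare it with a uniform expectation. It is not the workshop's Spark system: the function names, the toy reads, the choice of k = 2 and the uniform baseline of 1 / 4**k are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k=2):
    """Map step: emit (k-mer, 1) for the first k bases of one read."""
    if len(read) >= k:
        yield read[:k], 1


def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def start_kmer_counts(reads, k=2):
    """Run map, shuffle and reduce in sequence on a single machine."""
    pairs = chain.from_iterable(map_read(read, k) for read in reads)
    return reduce_counts(shuffle(pairs))


def relative_deviation(counts, k=2):
    """Observed frequency divided by a uniform expectation of 1 / 4**k.

    The uniform baseline is an illustrative assumption, not a
    statement about how real bias analyses choose their reference.
    """
    total = sum(counts.values())
    expected = 1 / 4 ** k
    return {kmer: (n / total) / expected for kmer, n in counts.items()}


# Hypothetical toy reads; real input would come from an alignment file.
reads = ["ACGTAC", "ACTTGA", "GGCATA", "ACCCGT"]
counts = start_kmer_counts(reads)
deviation = relative_deviation(counts)
```

On a cluster, the map and reduce steps would run in parallel over partitions of the reads; in PySpark's RDD API the closest counterparts are probably `flatMap` and `reduceByKey`, with the framework handling the shuffle.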