Azure for Genomic-scale Workloads
by Lynn Langit and Kelly Kermode
When building data analysis pipelines for genomic-scale workloads, the typical approach is to use a Data Lake architecture. In this article, we detail how to implement this architecture on Azure. The general Data Lake pattern (shown below) includes several surface areas, each backed by a common cloud service. Of note are the three major surface areas: Blob Storage for files (the data layer), Batch Compute for Virtual Machine clusters (the compute layer), and Interactive Analysis for query-on-files (the analysis layer).
Files are the Data
In the Data Lake pattern, data, usually in the form of files of various types, is stored in Blob Storage containers. The compute layer takes the form of a job controller, such as Azure Batch, that spawns a burstable cluster of Virtual Machines and releases them when the job completes. And implementing appropriate security, via authentication, encryption, and endpoint protection, is a key part of any solution. The two sketches below show the data and compute layers in turn.
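First, a minimal data-layer sketch using the azure-storage-blob Python SDK; the account URL, credential, container, and file names here are hypothetical placeholders:

```python
# pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient

# Hypothetical storage account and credential -- substitute your own.
service = BlobServiceClient(
    account_url="https://mygenomicslake.blob.core.windows.net",
    credential="<account-key-or-sas-token>",
)
container = service.get_container_client("raw-reads")

# Upload one sequencing file into the data layer.
with open("sample01.fastq.gz", "rb") as data:
    container.upload_blob(
        name="runs/2024-01/sample01.fastq.gz",
        data=data,
        overwrite=True,
    )
```

Passing a SAS token as the credential, rather than the account key, keeps with the security guidance above: a SAS token can be scoped to a single container and set to expire.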
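Next, a compute-layer sketch using the azure-batch SDK to provision a small burstable pool; the account, pool id, and VM size are again hypothetical. A job controller would then submit jobs and tasks to this pool and resize it to zero nodes when the run finishes:

```python
# pip install azure-batch
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

# Hypothetical Batch account -- substitute your own.
credentials = SharedKeyCredentials("mybatchaccount", "<batch-account-key>")
client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.westus2.batch.azure.com"
)

# Define a four-node Ubuntu pool; Batch provisions the VMs on demand.
pool = batchmodels.PoolAddParameter(
    id="alignment-pool",
    vm_size="STANDARD_D2_V3",
    target_dedicated_nodes=4,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="canonical",
            offer="ubuntuserver",
            sku="18.04-lts",
            version="latest",
        ),
        node_agent_sku_id="batch.node.ubuntu 18.04",
    ),
)
client.pool.add(pool)
```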
Additionally, it is becoming increasingly common to add an optimization layer over Blob Storage. This is most often done with a FUSE library such as blobfuse, which exposes a container as a local filesystem so that existing pipeline tools can read and write blobs without modification. Another recent pattern, particularly relevant for bioinformatics, is incorporating cloud-hosted genomic reference data directly into cloud pipelines instead of maintaining private copies of reference genomes. Both patterns are sketched below.
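blobfuse itself is mounted at the operating-system level, but the same file-style ergonomics are available from Python via fsspec with the adlfs backend, as in this sketch (the account name, key, and paths are hypothetical):

```python
# pip install adlfs fsspec
import fsspec

# Open the hypothetical storage account as a filesystem. The "az" protocol
# is provided by adlfs; paths take the form "container/blob-path".
fs = fsspec.filesystem(
    "az", account_name="mygenomicslake", account_key="<account-key>"
)

# List and read blobs as if they were local files.
print(fs.ls("raw-reads/runs/2024-01"))
with fs.open("raw-reads/runs/2024-01/sample01.fastq.gz", "rb") as f:
    header = f.read(1024)  # peek at the first kilobyte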
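As for hosted reference data, public genomic datasets are typically published in read-only Blob containers that allow anonymous access, so a pipeline can pull a reference file straight from its URL. The URL below is a hypothetical placeholder, not a real dataset endpoint:

```python
from azure.storage.blob import BlobClient

# Hypothetical public container holding reference genomes; anonymous
# access works because no credential is supplied.
blob = BlobClient.from_blob_url(
    "https://publicgenomics.blob.core.windows.net/references/GRCh38/GRCh38.fa.fai"
)
fai_bytes = blob.download_blob().readall()
print(fai_bytes[:200].decode())
```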