Azure for Genomic-scale Workloads

Jul 13, 2020

by Lynn Langit and Kelly Kermode

When building data analysis pipelines for genomic-scale workloads, the typical approach is to use a Data Lake architecture. In this article, we'll detail using this architecture on Azure. The general Data Lake pattern (shown below) includes several surface areas and uses common cloud services. Of note are the three major surface areas: Blob Storage for files (the data layer), Batch compute for virtual machine clusters (the compute layer), and interactive analysis for query-on-files (the analysis layer).

Cloud Data Lake Pattern
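
To make the data layer concrete, here is a minimal sketch of landing a file in Blob Storage with the azure-storage-blob Python SDK. The connection string, container name, and file path are placeholders, not values from a real pipeline.

```python
# Minimal sketch: land a genomic data file in the Blob Storage data layer.
# Requires: pip install azure-storage-blob
# The connection string, container, and file names are placeholders.
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "<your-storage-account-connection-string>"

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container = service.get_container_client("genomics-data-lake")

# Upload a local FASTQ file into the container ("files are the data").
with open("sample_001.fastq.gz", "rb") as data:
    container.upload_blob(name="raw/sample_001.fastq.gz", data=data, overwrite=True)

print("Uploaded sample_001.fastq.gz to the data layer")
```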

Files are the Data

In the Data Lake pattern, data, which is usually in the form of various types of files, is stored in Blob Storage containers. The compute plane in this pattern takes the form of a job controller that spawns a burstable cluster. Implementing appropriate security, via authentication, encryption, and endpoint protection, is a key part of any solution.
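
On Azure, the job-controller role maps naturally to Azure Batch. Below is a hedged sketch of submitting work to an existing Batch pool with the azure-batch Python SDK; the account URL, key, pool ID, and command line are placeholders, and the sketch assumes the (burstable, autoscaling) pool has already been created.

```python
# Minimal sketch: submit a job and task to an Azure Batch pool (the compute layer).
# Requires: pip install azure-batch
# Account URL, key, pool ID, and command line are placeholders.
import azure.batch.models as batchmodels
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

credentials = SharedKeyCredentials("<batch-account-name>", "<batch-account-key>")
client = BatchServiceClient(
    credentials, "https://<batch-account>.<region>.batch.azure.com"
)

# Create a job bound to an existing pool of VMs.
job = batchmodels.JobAddParameter(
    id="variant-calling-job",
    pool_info=batchmodels.PoolInformation(pool_id="genomics-pool"),
)
client.job.add(job)

# Each task is one unit of work, e.g. aligning one sample.
task = batchmodels.TaskAddParameter(
    id="align-sample-001",
    command_line="/bin/bash -c 'echo aligning sample_001'",
)
client.task.add(job_id="variant-calling-job", task=task)
```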

Additionally, it is becoming increasingly common to add an optimization layer over Blob Storage. This is most commonly done using a FUSE library (on Azure, blobfuse), which lets pipeline tools read and write blob data as if it were a local file system. Another recent pattern, which is particularly relevant for bioinformatics, is the incorporation of cloud-hosted genomic reference data, such as public reference genomes hosted in Blob Storage, into cloud pipelines.
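
As a sketch of the reference-data pattern, the azure-storage-blob SDK can read from a public dataset container anonymously, so pipelines can consume hosted reference files without copying them first. The account URL and blob path below are hypothetical placeholders; substitute the actual public reference-data account used by your pipeline.

```python
# Minimal sketch: read cloud-hosted reference data from a public container.
# Requires: pip install azure-storage-blob
# The account URL and blob path are hypothetical placeholders.
from azure.storage.blob import ContainerClient

container = ContainerClient(
    account_url="https://<public-dataset-account>.blob.core.windows.net",
    container_name="reference",
    credential=None,  # public containers allow anonymous read access
)

# Read just the first 200 bytes of a reference index file,
# without downloading the whole object.
downloader = container.download_blob("grch38/GRCh38.fa.fai", offset=0, length=200)
print(downloader.readall().decode())
```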
