
Cloud-native Bioinformatics: HPC to GCP

Lynn Langit
5 min read · May 8, 2020

Working with a team from Imperial College London, we started by reviewing a genomic analysis workflow that currently runs on their HPC cluster. The workflow performs a complete analysis of single-cell/nuclei RNA-sequencing data. I began with the workflow process steps generated by the Nextflow script used to run it.

Next, I reviewed how the team currently runs this workflow by connecting with them and observing a run of the analysis on their HPC cluster. Of note, the workflow uses containerization via Singularity / Docker.

I also asked the team to share run logs as visualizations from Nextflow Tower, so that I could quickly see which processes in the workflow had the highest resource needs. Shown below is one of the useful visualizations, showing memory usage.

Nextflow Tower showing memory usage for each process in scFlow on HPC
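
When Tower charts are not at hand, a similar ranking can be approximated from a tab-separated trace file. The sketch below is not part of the original analysis. It assumes a trace with `name` and `peak_rss` columns, task names of the form `process (tag)`, and memory values written like `2 GB`. The sample rows are made up for illustration.

```python
import csv
import io
import re

# Multipliers for the human-readable units assumed in the peak_rss column.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_memory(text):
    """Convert a string such as '1.5 GB' to bytes; return 0 if it cannot be parsed."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*", text)
    if not match:
        return 0
    return float(match.group(1)) * _UNITS[match.group(2)]


def peak_memory_by_process(trace_file):
    """Return (process, peak_bytes) pairs, highest memory use first."""
    peaks = {}
    for row in csv.DictReader(trace_file, delimiter="\t"):
        # Assumed naming: 'process (tag)'; keep only the process part.
        process = row["name"].split(" (")[0]
        peaks[process] = max(peaks.get(process, 0), parse_memory(row["peak_rss"]))
    return sorted(peaks.items(), key=lambda item: item[1], reverse=True)


# Hypothetical trace rows, for illustration only.
sample = "name\tpeak_rss\nqc (1)\t2 GB\nqc (2)\t3 GB\ncluster (1)\t512 MB\n"
ranked = peak_memory_by_process(io.StringIO(sample))
for process, peak in ranked:
    print(f"{process}: {peak / 1024**3:.2f} GiB")
```

Taking the maximum per process, rather than the mean, matches the goal of finding which processes need the largest machines.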

To The Cloud

A useful starting point when running any genomics workflow on the public cloud is a subsampled synthetic dataset. Real-world on-premises workflows can run for hours, days or even weeks. To optimize for the cloud, we will be more productive with fast feedback: did the workflow run correctly? How long did it take?…

15 minutes or less is the goal for each test.
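
One way to shrink an input for a fast test run is to keep a reproducible random fraction of its cells. This is a minimal sketch of that idea, not the method used in the project. The function name `subsample_cells`, the barcode format, and the 5% fraction are all hypothetical.

```python
import random


def subsample_cells(barcodes, fraction, seed=0):
    """Return a reproducible random subset of cell barcodes, in input order."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one cell, but never more than are available.
    keep = min(len(barcodes), max(1, round(len(barcodes) * fraction)))
    rng = random.Random(seed)  # fixed seed so every test run sees the same cells
    chosen = set(rng.sample(range(len(barcodes)), keep))
    return [barcode for i, barcode in enumerate(barcodes) if i in chosen]


# Hypothetical barcodes, for illustration only.
barcodes = [f"CELL{i:04d}" for i in range(1000)]
subset = subsample_cells(barcodes, fraction=0.05)
print(len(subset), "of", len(barcodes), "cells kept")
```

Fixing the seed keeps run times comparable between tests, since each run processes the same cells.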
