Cloud-native Hello World for Bioinformatics

PART ONE — RATIONALE

Lynn Langit
Nov 28, 2019

So often, the first task my customers ask me to work on with them is to ‘cloud-scale’ their computationally complex tool or data pipeline. I find that this approach can be premature. I like to first ask…

“How could I try it out on the cloud?”

In this multi-part series, I’ll discuss how my team has collaborated with bioinformatics researchers worldwide to answer this question using cloud-native services, tools and patterns.

Hello Open Source Library

Because we have been working with bioinformatics researchers, we are often asked to scale open source tools and libraries which the researchers have developed as part of their work. However, this is not always the case — as the bioinformatics industry matures, standardized analysis tools, such as The Broad Institute’s open source GATK (Genome Analysis Toolkit), are gaining more adoption. In the latter case, the significant ‘Hello World’ code consists of the tool’s configuration values, rather than the tool’s source code.
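To make this concrete, below is a minimal sketch (in Python, which I’ll use for all examples here) of a ‘Hello World’-scale run of GATK’s HaplotypeCaller. The file names are placeholder assumptions for a small test dataset; the point is that the significant ‘code’ is the handful of configuration values passed to the tool.

```python
# Minimal sketch: a 'Hello World'-scale GATK run driven from Python.
# Assumes GATK is installed and on the PATH; reference.fasta and
# sample.bam are placeholders for a small test dataset.
import subprocess

def run_gatk_hello_world(reference: str, bam: str, out_vcf: str) -> None:
    """Call variants on a tiny input so the whole run finishes in minutes."""
    subprocess.run(
        [
            "gatk", "HaplotypeCaller",
            "-R", reference,  # reference genome (FASTA)
            "-I", bam,        # aligned reads (BAM)
            "-O", out_vcf,    # output variant calls (VCF)
        ],
        check=True,  # fail loudly if the tool returns a non-zero exit code
    )

if __name__ == "__main__":
    run_gatk_hello_world("reference.fasta", "sample.bam", "hello_world.vcf")
```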

Also, researchers are usually working toward eventual publication, so a key goal is to create testable systems that are fully reproducible by other researchers — researchers who are often on different teams at different locations around the globe.

The first task my team performs in these situations is to attempt to use the tool or library locally for a ‘Hello World’-scale test analysis. This is often an arduous process, requiring installation of SDKs, languages, libraries, etc…

50%+ of the time we stop after 1+ hours of installation attempts

We also often discover issues with the analysis data. These issues have included, but are not limited to, the following:

  • Data quality is poor or unusable
  • Data size is too big to test — needs to be partitioned
  • Data privacy is limited to researchers (human health data) — we need mock data for testing
  • Data permissions are too restrictive — no one outside the team can access
  • Data location is local — data is on someone’s laptop (and not backed up)

We want to experience the potential pain of local installation first (and discover the ‘data situation’), so that we can be confident as we work to build a better alternative — Hello World using cloud tools and services.

We define a cloud-native Hello World as an example that can be set up and run in less than 15 minutes. Note that setup time is included in that goal.

The ideal test time is less than 5 minutes.

To accomplish setup with the fewest steps, we recommend using known cloud services and patterns, such as vendor (or third-party) templates, rather than hand-written scripts. We also set up suggested baseline security best practices, such as individual user accounts via IAM roles for AWS testing, etc…
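As an illustration, here is a minimal sketch (Python with boto3) of that kind of baseline: an individual IAM user for a test collaborator. The user name and the AWS-managed read-only policy attached here are illustrative assumptions; a real project would scope permissions more tightly.

```python
# Minimal sketch (boto3): baseline IAM setup for a test collaborator.
# The user name and policy choice are illustrative, not a prescription.
import boto3

iam = boto3.client("iam")

def create_test_user(user_name: str) -> None:
    """Create an individual IAM user and grant read-only access for testing."""
    iam.create_user(UserName=user_name)
    iam.attach_user_policy(
        UserName=user_name,
        # AWS-managed read-only policy; swap for a tighter custom policy later
        PolicyArn="arn:aws:iam::aws:policy/ReadOnlyAccess",
    )

if __name__ == "__main__":
    create_test_user("hello-world-tester")
```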

Often, we find that the research team has had no time to consider using any of these tools or patterns — researchers are focused on conducting their research first and publishing it second. When the size of their analysis ‘no longer runs on my machine’ or ‘takes too long on the HPC cluster’, researchers start to look to the public cloud as a potential solution. Then, and only then, do they think about environment reproducibility.

This approach has had some unfortunate consequences. One study found that 70% of the reviewed research could not be reproduced computationally and was, therefore, not useful.

Building one or more cloud-native Hello World implementations early not only enables collaboration with other teams earlier in the research, but also gives the team important practice with cloud services. Because the goal at this stage is usability rather than scalability, it’s a low-cost situation that allows the team to make mistakes cheaply (in terms of both run time and cloud service costs) as they learn.

This approach helps the team to ‘skill up’, so that when they want to scale up for actual research, they already have cloud basics (such as understanding security groups, IAM roles, etc…) in place. It also allows them to share their scripts, tools and/or tool configurations outside of their group, facilitating more collaboration even earlier in their work.

Scripts, Templates and More

Creating reproducible cloud infrastructure is a ‘solved problem’, in that all major cloud vendors, and also several third-party offerings (such as HashiCorp’s Terraform), provide templates to define and create cloud infrastructure.

Going beyond templates to use vendor services for even faster setup, for example by hosting templates with metadata on the AWS Marketplace, is an advanced example of this type of pattern.

However, my team has found that the bioinformatics research industry, being relatively new to using cloud services, faces a rather steep learning curve in moving from clicking in the AWS console to start EC2 instances to using source-controlled CloudFormation templates.
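As a small example of what that destination looks like, here is a minimal sketch (Python with boto3) that launches a stack from a version-controlled CloudFormation template. The stack name and template file name are placeholder assumptions.

```python
# Minimal sketch (boto3): create infrastructure from a source-controlled
# CloudFormation template instead of clicking through the console.
import boto3

cfn = boto3.client("cloudformation")

def launch_hello_world_stack(stack_name: str, template_path: str) -> None:
    """Create a stack from a versioned template so anyone can reproduce it."""
    with open(template_path) as f:
        template_body = f.read()
    cfn.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        # Required only if the template itself creates IAM resources
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    # Block until the infrastructure is fully created (or the create fails)
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)

if __name__ == "__main__":
    launch_hello_world_stack("hello-world-pipeline", "template.yaml")
```

Because the template lives in source control, a collaborator can recreate an identical environment with one command, which is exactly the reproducibility that publication requires.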

A key pattern is to work with the research teams to build Hello World-scale, cloud-native implementations first.

In the next part of this series, we’ll examine a couple of examples in depth, the first of which will be CSIRO’s VariantSpark library running on the AWS Marketplace. As a preview, the architecture is shown below.

VariantSpark Hello World on the AWS Marketplace
