Cloud-native Hello World for Bioinformatics

PART THREE — Working with custom cloud VM images for GCP

In the series to date, I’ve presented the general rationale for building quick start examples (or ‘Hello World’) for bioinformatics tools in part one.

In part two, I reviewed one such example for the VariantSpark library, runnable via smart templates in the AWS Marketplace.

In this part three, I’ll cover how to convert a locally runnable example into a reusable cloud example by working with custom Google Cloud Platform Virtual Machine images running on Google Compute Engine.

One aspect of working in a cloud-native way is to START all technical work in the public cloud rather than on a local laptop. For example, building Hello World examples which can be run quickly on public cloud infrastructure enables more researchers to try out tools and libraries with minimal effort, since they won’t have to waste time locating, installing and configuring software dependencies on their local laptops.

Many bioinformatics tools and libraries have several required dependencies. To run a hello world example on a local machine, the current assumption is that the researcher will install those dependencies locally. At best, this results in lost productivity, as installing SDKs, libraries, etc. can pull in even more dependencies. Configuring these dependencies, which can include setting paths, variables and permissions, takes even more time and effort. Worst case, your Hello World example won’t be used at all, because the researcher is unwilling or unable to set up the test environment locally.

Working cloud-first can serve as an alternative.

Creating hello world examples which use custom cloud Virtual Machine images can improve the researcher experience substantially. For example, a researcher could select an appropriately pre-configured image file and use it to easily start a Virtual Machine instance for testing.

After the instance is available, they can run the bioinformatics tool’s hello world example quickly and easily. When testing is complete, they simply shut down and/or delete the VM instance. A side benefit of this approach is that there is no disruption to their local machine, because they didn’t install anything there.

Nextflow is a bioinformatics pipelining library written in the Java-dependent Groovy programming language. Nextflow is designed to support quick setup of scalable workflows for bioinformatics analysis jobs. These workflows use chained bioinformatics tools and can be run in a number of environments, e.g. locally, on HPC clusters, or in the public cloud. The Nextflow website includes a number of Hello World (quickstart) examples. All of these examples are designed to be run locally.

One such example creates and runs a simple workflow which showcases the ability of Nextflow to parallelize tasks. This example workflow uses just one tool: blast. From the Nextflow site: “the example splits a FASTA file into chunks and executes for each of them a BLAST query in a parallel manner. Then, all the sequences for the top hits are collected and merged to a single result file.”
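The split step at the heart of this workflow can be sketched in plain shell. This is an illustration only, not the example’s actual main.nf logic, and the file and chunk names below are made up:

```shell
#!/usr/bin/env bash
# Illustrative only: split a FASTA file into per-sequence chunks,
# the way the example scatters BLAST queries over parallel tasks.
cat > sample.fa <<'EOF'
>seq1
ACGTACGT
>seq2
TTGGCCAA
>seq3
GGGCCCAT
EOF

# Each '>' record goes to its own chunk file: chunk.0, chunk.1, chunk.2.
# In the real pipeline each chunk would feed a parallel BLAST task, and
# the per-chunk results would be merged into one result file at the end.
awk '/^>/{n++} {f = "chunk." (n-1); print > f}' sample.fa

ls chunk.*
```

In the real pipeline Nextflow handles the scatter and gather for you; this sketch just shows the shape of the data flow.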

The example is also designed to demonstrate how a Docker container which includes a bioinformatics tool (in this case the blast tool) can be run in a Nextflow pipeline.

If the required dependencies (Java 8 and Docker) are installed locally, then the example can be run in two steps from the researcher’s local terminal. The steps are as follows:

  1. Download the Nextflow libraries by entering the command below:
$ curl -fsSL https://get.nextflow.io | bash

When the first command is run, the researcher will see output in their local terminal similar to that shown below. Note that at the end of the download, the message “Nextflow installation completed.” will display.

Downloading Nextflow locally

  2. Launch the Nextflow pipeline execution using the command shown below:

$ ./nextflow run blast-example -with-docker

This command will automatically download the pipeline’s GitHub repository and the associated Docker images. Because of this, the first execution of the second command listed above can take a couple of minutes to complete.

This second command runs the blastn example using the included main.nf Nextflow script. The script first starts a Docker container image which includes the blastn tool, then runs blastn on the sample data and prints the results to the terminal window. A sample output is shown below.

Running the example for the blast bioinformatics tool

The Nextflow team has done a nice job writing a quick-to-execute hello world sample. End-to-end this takes less than 5 minutes to run — so long as the development environment is set up correctly.

Now let’s perform the steps to make this sample cloud-native. The first consideration is that the required dependencies are not, in general, trivial to locate, download, install and configure. As mentioned, the Nextflow scripting language is an extension of the Groovy programming language. Groovy is a language for the Java platform, and thus requires a JVM (Java Virtual Machine) to run. The Nextflow team has noted the Java 8 requirement in the example’s source `.travis.yml` file, shown below.

sudo: required
jdk: openjdk8
services:
  - docker
install:
  - sudo apt-get -qq update
  - sudo apt-get -qq -y install graphviz realpath
  - curl -fsSL get.nextflow.io | bash
  - docker pull nextflow/examples
script:
  - ./nextflow run main.nf -with-docker

You’ll also note that the file above uses a `docker pull` command to pull a Docker container image named `nextflow/examples`.

So, in order to use this example in the cloud, the researcher needs a compute environment that includes Java 8 and Docker. Ideally, the environment would also be quickly reproducible, so that other researchers can also try out the example scripts.

Using Google Cloud Platform, what would be the easiest way for them to accomplish this? There are several ways to meet these requirements. These include the following:

  • Use the GCP console to create a VM instance from a customized image. This image must include Java 8 and Docker installed and configured appropriately.
  • Run a `gcloud` CLI script to start a customized GCE VM instance. This script should include a VM startup script that installs and configures the requirements.
  • Execute a GCP deployment using a GCP template, which references either the custom image or the gcloud & startup scripts.
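For reference, the second option might look roughly like the sketch below. The instance name, zone, and file names are my own illustrative choices; the commands are written to a script file here so they can be reviewed before being run against a real GCP project:

```shell
#!/usr/bin/env bash
# Sketch of the gcloud + startup-script option.
# Instance name and zone are illustrative; replace them with your own values.
cat > startup.sh <<'EOF'
#! /bin/bash
# Runs on first boot: install Java and Docker
apt-get update
apt-get install -yq openjdk-11-jdk
curl -sSL https://get.docker.com/ | sh
EOF

cat > create-vm.sh <<'EOF'
#!/bin/bash
gcloud compute instances create nextflow-test-vm \
  --zone us-central1-a \
  --image-family debian-9 --image-project debian-cloud \
  --metadata-from-file startup-script=startup.sh
EOF
bash -n create-vm.sh && echo "create-vm.sh syntax OK"
```

Running `create-vm.sh` in a project with billing enabled would boot the VM and apply the startup script on first boot.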

As a starting point, I chose to create an example using the first option listed above: create an instance of a Google Compute Engine (GCE) Virtual Machine based on a custom GCE image file. The image file must include an operating system AND both dependencies (Java and Docker).

Of course, the researcher could create a VM instance from a GCE default image and then install the dependencies — however this isn’t optimal, because they would still have to locate, install and configure the correct versions of both Java and Docker.

For this article, I created and tested an instance configured as described. I then created a GCE VM image file and exported it to a public GCS bucket.

To use this custom image file to create a customized GCE VM instance, do the following:

  1. CREATE a bucket in your GCS project.
    NOTE: If you are new to GCP, I suggest signing up for the GCP Free Trial, which includes $300 USD in service credit. After you sign up, log in to the GCP console and create a new GCP project.
  2. IMPORT the image file (from this link) into your bucket
  3. CREATE a GCE image from the image file in your bucket using the GCP console (GCE → Create an Image → Source: Cloud Storage File).
  4. CREATE a GCE instance from the new GCE image. This is your copy of the custom VM instance. Use the GCP console (GCE → Create Instance). Screen shown below.
Creating a GCE VM instance from a custom image
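The four console steps above also have `gcloud`/`gsutil` equivalents. Below is a sketch with illustrative bucket, image, and instance names (only the source image URL comes from this article); it is written to a file rather than executed, since it requires a real GCP project:

```shell
#!/usr/bin/env bash
# gcloud/gsutil equivalents of the four console steps (names are illustrative).
cat > import-image.sh <<'EOF'
#!/bin/bash
# 1. create a bucket in your project
gsutil mb gs://my-nextflow-bucket
# 2. copy the exported image file into your bucket
gsutil cp gs://nextflow-quickstart/vm-images/snapshot-nextflow-helloworld.tar.gz \
  gs://my-nextflow-bucket/
# 3. create a GCE image from the image file in your bucket
gcloud compute images create nextflow-hello-world \
  --source-uri gs://my-nextflow-bucket/snapshot-nextflow-helloworld.tar.gz
# 4. create a GCE instance from the new image
gcloud compute instances create nextflow-test-vm \
  --image nextflow-hello-world --zone us-central1-a
EOF
bash -n import-image.sh && echo "import-image.sh syntax OK"
```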

After the instance is available, SSH to the GCE instance to open a terminal window. In that window, researchers can verify that the required dependencies are installed using the two commands shown below. Both should return version numbers (for this image, the results should be `java 11` and `docker 19`).

java --version
docker --version
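A small shell check can confirm both tools are on the PATH before going further. This is my own convenience sketch; the `check_deps` helper is not part of Nextflow or GCP:

```shell
#!/usr/bin/env bash
# check_deps: report any tools missing from the PATH.
check_deps() {
  local missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# On the custom image, both dependencies should be found:
check_deps java docker && echo "ready to run Nextflow" || echo "install missing tools first"
```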

Researchers can then use the Nextflow blast example on this configured instance using the two commands from the Nextflow site for ‘example 3’. First to get the Nextflow tool, run the command below in the terminal window of the GCE VM instance.

curl -fsSL https://get.nextflow.io | bash

Expected output should look like the screen shown below. Note that the Nextflow version is 19 at the time I am writing this article.

Nextflow 19 running on GCE

Researchers can now run the Nextflow blast script example, which runs the bioinformatics blast tool inside the VM from a Docker container image instance. The command is shown below.

./nextflow run blast-example -with-docker

The expected blast results appeared in less than a minute and an example is shown below. Voila! GCP Cloud-native Hello World for Nextflow.

Blast Nextflow pipeline example running on GCE

Researchers can then close the SSH window and stop the VM. They can restart the VM if they would like to repeat the test. They can optionally delete the VM instance when their testing is complete.
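The stop/restart/delete lifecycle can also be driven from the `gcloud` CLI. The instance name and zone below are illustrative; the commands are written to a file for review rather than run directly:

```shell
#!/usr/bin/env bash
# VM lifecycle from the command line (instance name and zone are illustrative).
cat > vm-lifecycle.sh <<'EOF'
#!/bin/bash
# stop the VM between test sessions (no compute charges while stopped)
gcloud compute instances stop nextflow-test-vm --zone us-central1-a
# restart it to repeat the test
gcloud compute instances start nextflow-test-vm --zone us-central1-a
# delete it when testing is complete
gcloud compute instances delete nextflow-test-vm --zone us-central1-a --quiet
EOF
bash -n vm-lifecycle.sh && echo "vm-lifecycle.sh syntax OK"
```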

Because I am writing not only for the users of Hello World examples (researchers), but also for the teams who build this type of sample, in the next sections of this article I’ll cover how I quickly built and deployed this custom GCE image.

First, I needed to find the best-fit base GCE image. In this case I simply used a standard Debian 9 Linux image on GCE. I then needed to install both Java (version 8 or above) and Docker on this image.

NOTE: I tried to use a GCE container-optimized image (which had Docker tools pre-installed) but found that the lack of Linux tools (such as apt-get, dpkg and others) made it too time consuming to get those tools and then to install Java 8, so I abandoned this approach quickly.

I found it quite difficult to find accurate instructions for installing both Java 8 and Docker on a GCE image. The steps I used to get this working properly are below:

To install Java 8 or above (I used Java 11):

sudo apt-get install -yq openjdk-11-jdk

To verify that Java is installed:

java --version

To install Docker:

sudo curl -sSL https://get.docker.com/ | sh

Then, after the install completes, add your username to the docker group so that you can run docker as non-root:

sudo usermod -aG docker lynnlangit

Log out and then log back in to your GCE terminal to apply this update. Then run the command below to update the installed packages (including docker):

sudo apt-get update && sudo apt-get upgrade

Verify that docker is running:

docker --version
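The install steps above can be collected into a single provisioning script, which is roughly what a startup-script version of this image would contain. This is a sketch: run it on a fresh Debian GCE instance, and note that `$USER` stands in for your own username:

```shell
#!/usr/bin/env bash
# Consolidated provisioning steps for the custom image, written to a file
# for review (intended for a fresh Debian GCE instance, not your laptop).
cat > provision.sh <<'EOF'
#!/bin/bash
set -e
# install Java 11 (satisfies the "Java 8 or above" requirement)
sudo apt-get update
sudo apt-get install -yq openjdk-11-jdk
# install Docker
sudo curl -sSL https://get.docker.com/ | sh
# allow the current user to run docker as non-root (re-login required)
sudo usermod -aG docker "$USER"
EOF
bash -n provision.sh && echo "provision.sh syntax OK"
```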

I then tested the Nextflow blast hello world example on this configured instance using the two commands from the Nextflow site for ‘example 3’. First, to get the Nextflow tool, I ran the command below in the terminal window of the GCE VM instance.

curl -fsSL https://get.nextflow.io | bash

Then I ran the Nextflow script example, which uses Docker to containerize the bioinformatics blast tool, as shown below.

./nextflow run blast-example -with-docker

The expected blast results appeared in less than a minute. So my custom GCE image was working properly. The next step was to prepare a copy of the image for reuse.

To prepare the instance for imaging, I first stopped the VM instance and then created a GCE image from that stopped instance using the GCP console. Because GCE VM images are only available to users in the GCP project they were created in, stopping here would not meet my need for a publicly available base VM image. Shown below, from the GCP documentation, is the workflow to create such an image.

Workflow to create a public, customized GCE VM image

To make my image publicly available, I next needed to export the image as a file into a GCS bucket. This bucket needs to have public permissions set on it, so I created a GCS bucket and set its permissions to `allUsers/read`, which is public access for GCS. I then used the `gcloud` tool (example shown below) to create and export a zipped file from my custom VM image. This command puts that image file in the newly-created public GCS bucket.

Note that the example below is a single command, split across two lines with a backslash.

gcloud compute images export --destination-uri gs://nextflow-quickstart/vm-images/snapshot-nextflow-helloworld.tar.gz \
--image nextflow-hello-world --project gcp-for-bioinformatics

Now this image can be used by anyone with the URL as a basis for quickly creating a pre-configured GCE VM instance so that they can easily try out Nextflow.

As mentioned, in creating this quick example I did NOT use two other common patterns for cloud-native Hello World examples. One option would’ve been to create a startup script which installs and configures Java 8 and Docker on a base GCE VM, put that startup script in a public GCS bucket, and then write a gcloud statement which creates a VM using that script. Researchers would then just run the gcloud script to create an instance of a customized VM.

Here I am using what I call ‘just enough automation’.

Because I wanted to make a copy of the working VM image as quickly as possible, I just clicked in the console and used gcloud to create and export my VM image to a public bucket. I do expect that, as I continue to test Nextflow, I’ll create this scripted version.

Creating a GCP deployment seemed like overkill in terms of the time needed to write and test a configuration file (written in the Jinja/YAML or Python dialects). Although I do like the ‘install everything/remove everything’ aspect of deployments, at this stage I am only working with a single VM, so using Deployments just wasn’t time-effective for this level of configuration automation.
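For completeness, such a Deployment Manager configuration would look roughly like the sketch below. It is untested here; the resource names are illustrative, and it assumes the custom image already exists in the project:

```yaml
# Hypothetical config.yaml sketch for GCP Deployment Manager (untested;
# names are illustrative, and the custom image must already exist).
resources:
- name: nextflow-test-vm
  type: compute.v1.instance
  properties:
    zone: us-central1-a
    machineType: zones/us-central1-a/machineTypes/n1-standard-1
    disks:
    - deviceName: boot
      boot: true
      autoDelete: true
      initializeParams:
        sourceImage: global/images/nextflow-hello-world
    networkInterfaces:
    - network: global/networks/default
      accessConfigs:
      - name: External NAT
        type: ONE_TO_ONE_NAT
```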

Interestingly, Nextflow’s documentation references a pre-configured AWS VM image for Nextflow testing (in AWS, these images are called AMIs, or Amazon Machine Images). Of note is that this AMI is in the AWS EU (Ireland) region. The ID is ‘ami-4b7daa32'.

As I continue to test various bioinformatics pipeline (DSL) scripting languages and tools, I’ll continue to share my ‘just enough automation’ stories.

Cloud Architect who codes, Angel Investor