Cloud-native Hello World for Bioinformatics


In part one of this series, I discussed the rationale for building a cloud-native ‘Hello World’ as an important first step in moving bioinformatics analysis successfully to the public cloud.

In this article, I’ll dig into the details of one (of many) examples of using this approach. This example is built on the AWS Cloud and uses a number of AWS services, the most significant of which is the AWS Marketplace.

The AWS Marketplace is a set of services designed to make building reproducible cloud environments easier. The Marketplace includes a web UI (shown below), a command-line tool (CLI) and an API-based programming model.

The AWS Marketplace includes multiple delivery methods.

The Marketplace is designed to allow builders to set up what I like to call ‘smarter templates’. These templates can be run with default service configuration values or customized, which allows users to quickly build up complex infrastructures. These infrastructures often include multiple cloud service instances and their associated configurations. As of this writing, there are over 7,000 of these smarter templates, which are called ‘solutions’, in the marketplace.

In addition to the large number of solutions available in the Marketplace, it’s important to note that AWS has been expanding the Marketplace capabilities to be able to host smart templates which support a number of delivery methods (including VMs, Containers and more).

What’s a Smart Template?

The basis of operation for most of the items listed in the Marketplace is one or more AWS CloudFormation YAML or JSON templates. Templates define AWS service configurations (for S3 buckets, EC2 instances, etc.). In addition to templates, CloudFormation includes Stacks (defined as a set of related resources managed as a single unit) and Change Sets (defined as summaries of your proposed changes to a stack). The diagrams below show how the templates, stacks and change sets are designed to work.

First, shown below (from AWS documentation) is a diagram of the process of building cloud infrastructure by creating a stack from a template. When the user clicks ‘create stack’, AWS CloudFormation uses the associated template to allocate and configure AWS service instances such as EC2 virtual machines, S3 storage buckets, IAM security roles, etc.

Building AWS infrastructure using a CloudFormation template and stack
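To make the template idea concrete, here is a minimal sketch of a parameterized CloudFormation template, built as a Python dictionary and serialized to JSON. It is not the VariantSpark template. The parameter names (`InstanceTypeParam`, `ImageIdParam`), the logical IDs (`DataBucket`, `Worker`) and the default values are my own assumptions for illustration; only the top-level sections and the `Ref` function are standard CloudFormation.

```python
import json


def build_template():
    """Return a small CloudFormation template as a plain dictionary."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Hello World sketch: one S3 bucket and one EC2 instance",
        # Parameters are what make a template reusable: the user can
        # accept the defaults or override them at stack-creation time.
        "Parameters": {
            "InstanceTypeParam": {  # hypothetical parameter name
                "Type": "String",
                "Default": "t2.micro",
                "AllowedValues": ["t2.micro", "m5.xlarge"],
                "Description": "EC2 instance size",
            },
            "ImageIdParam": {  # hypothetical parameter name
                "Type": "AWS::EC2::Image::Id",
                "Description": "AMI to launch (region-specific)",
            },
        },
        # Resources are the service instances the stack will create.
        "Resources": {
            "DataBucket": {"Type": "AWS::S3::Bucket"},
            "Worker": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "InstanceType": {"Ref": "InstanceTypeParam"},
                    "ImageId": {"Ref": "ImageIdParam"},
                },
            },
        },
        # Outputs surface values (here, the bucket name) after creation.
        "Outputs": {
            "BucketName": {"Value": {"Ref": "DataBucket"}},
        },
    }


template_json = json.dumps(build_template(), indent=2)
```

The resulting `template_json` string is the kind of document CloudFormation reads when a stack is created. A real solution template would typically be much longer.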

In addition to using stacks to create cloud infrastructure, they can be used to manage and maintain it. Shown below (from AWS documentation) is the process for updating that infrastructure via a change set.

Updating AWS infrastructure using a CloudFormation template, change set and updated stack
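A change set is essentially a preview of what an update would add, modify or remove. The toy function below mimics that idea by comparing the `Resources` sections of two template dictionaries. This is only an illustration of the concept under my own simplifying assumptions; it is not how CloudFormation itself computes change sets, which also account for parameters and property-level replacement rules. All names here are hypothetical.

```python
def summarize_changes(old_template, new_template):
    """List (action, logical_id) pairs describing how resources differ."""
    old = old_template.get("Resources", {})
    new = new_template.get("Resources", {})
    changes = []
    for logical_id in sorted(set(old) | set(new)):
        if logical_id not in old:
            changes.append(("Add", logical_id))
        elif logical_id not in new:
            changes.append(("Remove", logical_id))
        elif old[logical_id] != new[logical_id]:
            changes.append(("Modify", logical_id))
    return changes


# Version 1: a bucket and a small instance.
v1 = {"Resources": {
    "DataBucket": {"Type": "AWS::S3::Bucket"},
    "Worker": {"Type": "AWS::EC2::Instance",
               "Properties": {"InstanceType": "t2.micro"}},
}}
# Version 2: a larger instance plus a new bucket for logs.
v2 = {"Resources": {
    "DataBucket": {"Type": "AWS::S3::Bucket"},
    "Worker": {"Type": "AWS::EC2::Instance",
               "Properties": {"InstanceType": "m5.xlarge"}},
    "LogBucket": {"Type": "AWS::S3::Bucket"},
}}

preview = summarize_changes(v1, v2)
# preview == [("Add", "LogBucket"), ("Modify", "Worker")]
```

The point of the preview is that the user sees the proposed changes before anything is applied to the running stack.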

In addition to templates, stacks and change sets, the AWS Marketplace adds the ability to attach several types of metadata to a stack. This metadata can include a description of the solution’s infrastructure. It can also include instructions and direction on how to configure that infrastructure. Additionally, it can contain a selection of suggested configuration options (in our case, EC2 instance size) and, importantly, a service-cost estimator.

Although the original purpose of the AWS Marketplace was to enable commercial vendors to sell (lease) configured versions of their software running on AWS, the CSIRO Bioinformatics team used the Marketplace a bit differently. This team decided to focus on the ease-of-use features, such as template metadata, so that they could make it simpler for bioinformatics researchers world-wide to try out their open source genomic-scale custom machine learning algorithm (VariantSpark) on AWS.

To start, the team fully utilized the metadata tagging capabilities in the AWS Marketplace. When a user enters the search term ‘bioinformatics’, the CSIRO VariantSpark solution appears on the first AWS Marketplace results page, as shown below.

Results of searching on the term ‘bioinformatics’ in the AWS Marketplace

In working with a number of new-to-AWS-Marketplace customers over the years, I’ve found that they’ve often skipped this tagging step, which is unfortunate. As mentioned, with over 7,000 entries in the AWS Marketplace, focusing on discoverability is an important first step in creating a broadly usable solution.

If people can’t find it, they can’t use it

Another important aspect of this sample is that the team built a reference infrastructure that can be run using the AWS Free Tier. Although the ultimate goal of building analysis infrastructure on the public cloud is to improve speed and cost of running genome-sized workloads, taking this ‘try it fast and free first’ approach implements the cloud-native Hello World pattern elegantly.

Try it out — fast and free

Details matter too. Notice also that the team is using AWS Marketplace solution versioning, the current published version being 1.0.10.

In addition to following best practices of naming, tagging, versioning and pricing their solution clearly, the CSIRO team did a great job describing how their example works in the AWS Marketplace VariantSpark solution ‘Product Overview’ and ‘Highlights’ sections. They included both a time-to-run and cost estimate in their concluding sentence.

“VariantSpark can process 200 samples with 20M variables in 1 hour consuming $3 of AWS resources. VariantSpark compute time increases linearly with both variables and samples.”
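The quoted figures make it possible to do a rough back-of-the-envelope estimate for other dataset sizes. The sketch below assumes that ‘linearly with both variables and samples’ means time scales with the product of the two, and that cost scales with time on the same cluster at the same prices. Both are my assumptions, not statements from the CSIRO team, so treat the results as a rough guide only.

```python
# Reference point quoted in the Marketplace listing.
BASE_SAMPLES = 200
BASE_VARIABLES = 20_000_000
BASE_HOURS = 1.0
BASE_COST_USD = 3.0


def estimate_run(samples, variables):
    """Return (hours, cost_usd), assuming time ~ samples * variables."""
    scale = (samples / BASE_SAMPLES) * (variables / BASE_VARIABLES)
    return BASE_HOURS * scale, BASE_COST_USD * scale


# Doubling the samples doubles the estimate under this assumption.
hours, cost = estimate_run(400, 20_000_000)
# hours == 2.0, cost == 6.0
```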

Also of note, the team took the time to record a short screencast showing how to set the solution up (including how to fill out the CloudFormation template parameters) and linked it on the main AWS Marketplace page.

In the ‘Usage Information’ section, further enhancing the usability of the solution, the team included an example Jupyter notebook, sample data and also a summary diagram of the AWS infrastructure that will be built (shown below).

Diagram of VariantSpark AWS Marketplace Example

A series of large yellow buttons in the UI guides the user through the process of quickly building and testing this example. First, the user must subscribe to the solution. Subscribing is a one-time operation and reflects the history of the Marketplace: many commercial vendors use this feature to lease their software to customers.

Next, the user is presented with the yellow ‘Continue to Launch’ button shown below. Note the small number of configuration options presented at this phase of the process. This is by design, to make setup simpler: users need only select the Region.

Select your Region and Launch!

In the next screen users can select between a ‘Cloud Formation’ or ‘Service Catalog’ launch. They would normally choose the first option for testing. The second would be used in Enterprise scenarios (where the ability to launch solutions from the AWS Marketplace should be controlled by launch policies associated with the AWS Service Catalog).

Clicking ‘launch’ here opens the template in the CloudFormation UI. The AWS Marketplace VariantSpark solution has an associated CloudFormation template file, which is stored in an AWS S3 bucket. Users click ‘next’ to go to the main configuration screen. There they enter a name for their stack, the number of CPUs (default is 32) and their local IP address (for the Jupyter notebook client), then click ‘next’ a couple more times to accept all template defaults. Note that the last configuration page states that some IAM resources will be created; users must manually check the acknowledgement box on that page for the solution to function.
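The same launch can be scripted. The sketch below builds the arguments for the boto3 `create_stack` call, mirroring the console steps above: stack name, CPU count, client IP address and the IAM acknowledgement (expressed in the API as the `CAPABILITY_IAM` capability). The parameter keys `CpuCount` and `ClientIp`, the example IP address and the template URL are hypothetical; the real keys are whatever the solution’s template defines. If a template creates IAM resources with custom names, `CAPABILITY_NAMED_IAM` would be needed instead.

```python
def build_create_stack_args(stack_name, template_url,
                            cpu_count=32, client_ip="203.0.113.10/32"):
    """Assemble keyword arguments for CloudFormation's create_stack."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        # Parameter keys below are hypothetical placeholders.
        "Parameters": [
            {"ParameterKey": "CpuCount", "ParameterValue": str(cpu_count)},
            {"ParameterKey": "ClientIp", "ParameterValue": client_ip},
        ],
        # API equivalent of ticking the IAM acknowledgement box.
        "Capabilities": ["CAPABILITY_IAM"],
    }


def launch_stack(cloudformation_client, **kwargs):
    """Create the stack using any client that offers create_stack()."""
    return cloudformation_client.create_stack(
        **build_create_stack_args(**kwargs))


args = build_create_stack_args(
    stack_name="variantspark-hello",
    template_url="https://example-bucket.s3.amazonaws.com/template.yaml",
)
```

With boto3 installed and credentials configured, `launch_stack(boto3.client("cloudformation"), stack_name=..., template_url=...)` would submit the request. Passing the client in as an argument keeps the function easy to test without touching AWS.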

Users review stack (instance creation) event notifications in the CloudFormation console to view progress as shown by example below.

Review status of instance creation in CloudFormation stacks console

Users will need to wait ~10 minutes, then they’ll see the stack status change to the green text message which says ‘CREATE_COMPLETE’. At that point, they can navigate to the ‘Outputs’ tab on their stack page, then click on the Jupyter URL hyperlink to open and run the example notebook, a portion of which is shown below.

A portion of the included, example Jupyter Notebook
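The ‘wait for CREATE_COMPLETE, then open the Jupyter URL’ step can also be done programmatically. The helper below pulls a named output from a response shaped like the one boto3’s `describe_stacks` returns. The output key `JupyterUrl` and the sample response are made up for illustration; the real key is whatever the solution’s template declares in its outputs.

```python
def get_stack_output(describe_stacks_response, output_key):
    """Return the named output value once the stack is CREATE_COMPLETE."""
    stack = describe_stacks_response["Stacks"][0]
    if stack["StackStatus"] != "CREATE_COMPLETE":
        return None  # still creating, or creation failed
    for output in stack.get("Outputs", []):
        if output["OutputKey"] == output_key:
            return output["OutputValue"]
    return None


# Hand-written sample in the shape of a describe_stacks response.
sample_response = {
    "Stacks": [{
        "StackName": "variantspark-hello",
        "StackStatus": "CREATE_COMPLETE",
        "Outputs": [
            {"OutputKey": "JupyterUrl",  # hypothetical output key
             "OutputValue": "http://example.invalid:8888"},
        ],
    }]
}

jupyter_url = get_stack_output(sample_response, "JupyterUrl")
```

In a real script, one option is to call the boto3 waiter `get_waiter("stack_create_complete")` first, so the code blocks until the stack is ready rather than polling by hand.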

While the notebook is running, users can view the workload overhead using the Ganglia tool and associated visualizers, again shown below. AWS includes Ganglia as part of its EMR (Elastic MapReduce, or managed Hadoop/Spark) service, which this solution uses because CSIRO’s VariantSpark library is written to run on top of a Spark cluster.

Viewing AWS EMR cluster resource usage via the Ganglia tools

In addition to the many advantages discussed already, yet another benefit of using the AWS Marketplace and associated CloudFormation stacks for ‘Hello World’ scenarios is that users can quickly and cleanly REMOVE all service instances with one action when they are done testing. In this case, the template created 24 service instances (called ‘Resources’); a partial view is shown below.

24 Resources (AWS service instances) are created using this AWS Marketplace example
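For a quick inventory of what a stack created, the resource list can be grouped by type. The sketch below works on entries shaped like the `StackResourceSummaries` returned by boto3’s `list_stack_resources`. The sample entries are invented for illustration and are not the actual 24 resources in the VariantSpark stack.

```python
from collections import Counter


def count_resource_types(resource_summaries):
    """Count stack resources by their CloudFormation resource type."""
    return Counter(item["ResourceType"] for item in resource_summaries)


# Invented sample data, not the real VariantSpark resource list.
sample_resources = [
    {"LogicalResourceId": "Cluster", "ResourceType": "AWS::EMR::Cluster"},
    {"LogicalResourceId": "ClusterRole", "ResourceType": "AWS::IAM::Role"},
    {"LogicalResourceId": "InstanceRole", "ResourceType": "AWS::IAM::Role"},
    {"LogicalResourceId": "DataBucket", "ResourceType": "AWS::S3::Bucket"},
]

counts = count_resource_types(sample_resources)
# counts["AWS::IAM::Role"] == 2
```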

In the CloudFormation UI, users can simply select the stack and then click ‘delete’ to delete all associated service instances when they are done testing. After the termination process completes, the stack is removed from the CloudFormation console. Users can also verify that the associated EC2 instances (for example) have been terminated by viewing the EC2 console (example shown below). Additionally, if users want to make a copy of the stack for their own use (or further customization), they can easily do that by clicking on the ‘create stack’ button in the UI.

Deleting the CloudFormation example stack automatically terminates associated resources.
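The one-action cleanup has a scripted equivalent too. This sketch calls `delete_stack` and then blocks on the `stack_delete_complete` waiter, both of which exist on the boto3 CloudFormation client. The client is passed in as an argument so the function can be exercised with a stand-in object; the stack name in the usage note is hypothetical.

```python
def delete_stack_and_wait(cloudformation_client, stack_name):
    """Delete a stack and block until CloudFormation reports it is gone."""
    cloudformation_client.delete_stack(StackName=stack_name)
    waiter = cloudformation_client.get_waiter("stack_delete_complete")
    waiter.wait(StackName=stack_name)
    return stack_name
```

With boto3 and credentials in place, `delete_stack_and_wait(boto3.client("cloudformation"), "variantspark-hello")` would remove every resource the stack owns. It is still worth checking the EC2 console afterwards, as described above, since deletion can fail for some resources (a non-empty S3 bucket is a common case).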

CSIRO implemented a cloud-native ‘Hello World’ for their VariantSpark bioinformatics library using the following best practices:

  • Used parameterized templates, rather than scripts, with ‘starter-level’ default values to allow users to build complex infrastructure
  • Included metadata in the vendor marketplace to make their templates even more usable. Their smart templates create a solution which is discoverable, understandable, and compelling, i.e. ‘try it easily — fast and free’
  • Included Hello World scaled example data and an example analysis Jupyter notebook
  • Included cluster monitoring tools

Congratulations to Dr. Denis Bauer and her team on a job well done!

The CSIRO Bioinformatics team

In the next article, I’ll examine another approach in this space. The patterns are similar but not identical, in part due to the choice of cloud vendor. That article will cover a cloud-native Hello World bioinformatics example running on GCP (Google Cloud Platform).

Cloud Architect who codes, Angel Investor