Cloud-native Hello World for Bioinformatics

PART TWO — AWS MARKETPLACE EXAMPLE

In part one of this series, I discussed rationale for building a cloud-native ‘Hello World’ as an important first step in moving bioinformatics analysis successfully to the public cloud.

In this article, I’ll dig into the details of one (of many) examples of using this approach. This example is built on the AWS Cloud and uses a number of AWS services, the most significant of which is the AWS Marketplace.

What is the AWS Marketplace?

The AWS Marketplace includes multiple delivery methods.

The Marketplace is designed to allow builders to set up what I like to call ‘smarter templates’. These templates can be run with default service configuration values or customized, which allows users to quickly build up complex infrastructures. These infrastructures often include multiple cloud service instances and their associated configurations. As of this writing, there are over 7,000 of these smarter templates, which are called ‘solutions’, in the marketplace.

In addition to the large number of solutions available in the Marketplace, it’s important to note that AWS has been expanding the Marketplace capabilities to be able to host smart templates which support a number of delivery methods (including VMs, Containers and more).

What’s a Smart Template?

First, the diagram below (from AWS documentation) shows the process of building cloud infrastructure by creating a stack from a template. When the user clicks ‘create stack’, AWS CloudFormation uses the associated template to allocate and configure AWS service instances such as EC2 virtual machines, S3 storage buckets and IAM security roles.

Building AWS Infrastructure using a CloudFormation template and stack
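As a rough illustration of the template-to-stack flow described above, here is a minimal Python sketch. It assumes a boto3 CloudFormation client is passed in as `cfn_client`. The single-bucket template, the logical name `HelloBucket` and the stack name are hypothetical, and far simpler than the VariantSpark template.

```python
import json


def build_template() -> dict:
    """Return a tiny CloudFormation template (hypothetical: one S3 bucket)."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Hello World stack: a single S3 bucket",
        "Resources": {
            "HelloBucket": {"Type": "AWS::S3::Bucket"},
        },
        "Outputs": {
            "BucketName": {"Value": {"Ref": "HelloBucket"}},
        },
    }


def create_stack(cfn_client, stack_name: str) -> str:
    """Ask CloudFormation to build a stack from the template; return its id."""
    response = cfn_client.create_stack(
        StackName=stack_name,
        TemplateBody=json.dumps(build_template()),
    )
    return response["StackId"]


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   create_stack(boto3.client("cloudformation"), "hello-world-demo")
```

Passing the client in as an argument keeps the template-building part testable without touching a real AWS account.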

In addition to using stacks to create cloud infrastructure, they can be used to manage and maintain it. Shown below (from AWS documentation) is the process for updating that infrastructure via a change set.

Updating AWS Infrastructure using a CloudFormation template, change set and updated stack
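The change set flow can be sketched in Python along the following lines. This is a sketch under assumptions: `cfn_client` is a boto3 CloudFormation client, and the change set name `hello-update` is a hypothetical placeholder. The change set is only previewed here; applying it is a separate, explicit call.

```python
import json


def summarize_changes(change_set: dict) -> list:
    """Turn a describe_change_set response into one readable line per change."""
    lines = []
    for change in change_set.get("Changes", []):
        rc = change["ResourceChange"]
        lines.append(
            f'{rc["Action"]}: {rc["LogicalResourceId"]} ({rc["ResourceType"]})'
        )
    return lines


def preview_update(cfn_client, stack_name: str, new_template: dict,
                   change_set_name: str = "hello-update") -> list:
    """Create a change set for an existing stack and list what would change."""
    cfn_client.create_change_set(
        StackName=stack_name,
        ChangeSetName=change_set_name,
        TemplateBody=json.dumps(new_template),
        ChangeSetType="UPDATE",
    )
    # Block until CloudFormation has finished computing the change set.
    cfn_client.get_waiter("change_set_create_complete").wait(
        StackName=stack_name, ChangeSetName=change_set_name
    )
    description = cfn_client.describe_change_set(
        StackName=stack_name, ChangeSetName=change_set_name
    )
    return summarize_changes(description)


# After reviewing the summary, the update is applied with:
#   cfn_client.execute_change_set(StackName=..., ChangeSetName=...)
```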

In addition to templates, stacks and change sets, the AWS Marketplace adds the ability to attach several types of metadata to a solution’s stack. This metadata can include a description of the solution’s infrastructure and instructions on how to configure it. It can also contain a selection of suggested configuration options (in our case, EC2 instance size) and, importantly, a service-cost estimator.

Although the original purpose of the AWS Marketplace was to enable commercial vendors to sell (lease) configured versions of their software running on AWS, the CSIRO Bioinformatics team used the Marketplace a bit differently. This team decided to focus on the ease-of-use features, such as template metadata, so that they could make it simpler for bioinformatics researchers world-wide to try out their open source genomic-scale custom machine learning algorithm (VariantSpark) on AWS.

Finding Configurations on the AWS Marketplace

Results of searching on the term ‘bioinformatics’ in the AWS Marketplace

In working with a number of new-to-AWS-Marketplace customers over the years, I’ve found that they’ve often skipped the step of tagging their listing with searchable keywords, which is unfortunate. As mentioned, with over 7,000 entries in the AWS Marketplace, focusing on discoverability is an important first step in creating a broadly usable solution.

If people can’t find it, they can’t use it

Another important aspect of this sample is that the team built a reference infrastructure that can be run using the AWS Free Tier. Although the ultimate goal of building analysis infrastructure on the public cloud is to improve speed and cost of running genome-sized workloads, taking this ‘try it fast and free first’ approach implements the cloud-native Hello World pattern elegantly.

Try it out — fast and free

Details matter too. Notice also that the team is using AWS Marketplace solution versioning, the current published version being 1.0.10.

Reviewing the Solution

“VariantSpark can process 200 samples with 20M variables in 1 hour consuming $3 of AWS resources. VariantSpark compute time increases linearly with both variables and samples.”
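To make the quoted figures concrete, here is a back-of-the-envelope Python sketch that scales the reference run. It rests on loud assumptions that the quote does not state: that the two linear effects multiply, that the line passes through the origin, and that cost tracks run time on a comparable cluster. Treat the numbers as rough guides, not benchmarks.

```python
# Reference point quoted on the Marketplace page.
REF_SAMPLES = 200
REF_VARIABLES = 20_000_000
REF_HOURS = 1.0
REF_COST_USD = 3.0


def estimate_run(samples: int, variables: int) -> tuple:
    """Scale the reference run linearly in samples and in variables.

    Assumption (not from the source): the two linear effects multiply,
    and cost grows in proportion to run time.
    """
    scale = (samples / REF_SAMPLES) * (variables / REF_VARIABLES)
    return REF_HOURS * scale, REF_COST_USD * scale


hours, cost = estimate_run(samples=1_000, variables=40_000_000)
print(f"~{hours:.0f} hours, ~${cost:.0f}")  # prints "~10 hours, ~$30"
```

Five times the samples and twice the variables gives a scale factor of 10 under these assumptions, which is where the 10 hours and $30 come from.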

Also of note: the team took the time to record a short screencast of how to set the solution up (showing how to fill out the CloudFormation template parameters) and linked it on the main AWS Marketplace page.

In the ‘Usage Information’ section, further enhancing the usability of the solution, the team included an example Jupyter notebook, sample data and also a summary diagram of the AWS infrastructure that will be built (shown below).

Diagram of VariantSpark AWS Marketplace Example

Running the Solution

Next, the user is presented with the yellow ‘Continue to Launch’ button. Note the small number of configuration options presented at this phase of the process. This is by design: users need only select the Region, which keeps setup simple. The screen is shown below.

Select your Region and Launch!

In the next screen users can select between a ‘Cloud Formation’ or ‘Service Catalog’ launch. They would normally choose the first option for testing. The second would be used in Enterprise scenarios (where the ability to launch solutions from the AWS Marketplace should be controlled by launch policies associated with the AWS Service Catalog).

Clicking ‘launch’ here opens the template in the CloudFormation UI. The AWS Marketplace VariantSpark solution has an associated CloudFormation template file, which is stored in an AWS S3 bucket. Users click ‘next’ to go to the main configuration screen. Here they enter a name for their stack, the number of CPUs (default is 32) and their local IP address (for the Jupyter notebook client), then click ‘next’ a couple more times to accept all template defaults. Note that the last configuration page states that some IAM resources will be created; users must manually check the acknowledgement box for the solution to function.
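For readers who prefer scripting to clicking, the same choices map onto a CloudFormation `create_stack` request. The Python sketch below only builds the request. The parameter keys `CpuCount` and `ClientCidr` are hypothetical stand-ins (the real template's parameter names may differ), the template URL is supplied by the caller, and `CAPABILITY_IAM` is assumed to be the API counterpart of the IAM acknowledgement checkbox.

```python
import ipaddress


def build_launch_request(stack_name: str, template_url: str,
                         cpu_count: int = 32,
                         client_ip: str = "203.0.113.7") -> dict:
    """Build keyword arguments for a boto3 CloudFormation create_stack call."""
    # A bare address becomes a single-host network, e.g. 203.0.113.7/32.
    # This also raises ValueError early if the address is malformed.
    client_cidr = str(ipaddress.ip_network(client_ip))
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,  # template file stored in S3
        "Parameters": [
            # Hypothetical parameter keys, for illustration only.
            {"ParameterKey": "CpuCount", "ParameterValue": str(cpu_count)},
            {"ParameterKey": "ClientCidr", "ParameterValue": client_cidr},
        ],
        # Explicit acknowledgement that the stack creates IAM resources.
        "Capabilities": ["CAPABILITY_IAM"],
    }


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   request = build_launch_request("variantspark-demo", "<template URL>")
#   boto3.client("cloudformation").create_stack(**request)
```

Restricting access to a single client address mirrors the console step where users enter their local IP for the Jupyter notebook client.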

Users review stack (instance creation) event notifications in the CloudFormation console to view progress as shown by example below.

Review status of instance creation in CloudFormation stacks console

Users will need to wait ~10 minutes, after which the stack status changes to ‘CREATE_COMPLETE’, shown in green text. At that point, they can navigate to the ‘Outputs’ tab on their stack page, then click the Jupyter URL hyperlink to open and run the example notebook, a portion of which is shown below.

A portion of the included, example Jupyter Notebook
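The wait-then-open-the-notebook step described above can also be scripted. This Python sketch assumes a boto3 CloudFormation client and a hypothetical output key named `JupyterUrl`; the key the real stack uses is shown on its ‘Outputs’ tab.

```python
def find_output(describe_stacks_response: dict, output_key: str):
    """Return the value of one stack output, or None if it is absent."""
    stack = describe_stacks_response["Stacks"][0]
    for output in stack.get("Outputs", []):
        if output["OutputKey"] == output_key:
            return output["OutputValue"]
    return None


def wait_for_notebook_url(cfn_client, stack_name: str,
                          output_key: str = "JupyterUrl"):
    """Block until the stack reaches CREATE_COMPLETE, then read its output."""
    # The waiter polls the stack status and raises if creation fails.
    cfn_client.get_waiter("stack_create_complete").wait(StackName=stack_name)
    response = cfn_client.describe_stacks(StackName=stack_name)
    return find_output(response, output_key)


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   print(wait_for_notebook_url(boto3.client("cloudformation"), "variantspark-demo"))
```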

While the notebook is running, users can view the workload overhead using the Ganglia tool and associated visualizers, again shown below. AWS includes Ganglia as part of its EMR (Elastic MapReduce, or managed Hadoop/Spark) service, which this solution uses because CSIRO’s VariantSpark library is written to run on top of a Spark cluster.

Viewing AWS EMR cluster resource usage via the Ganglia tools

Cleaning Up

24 Resources (AWS service instances) are created using this AWS Marketplace example

In the CloudFormation UI, users can simply select the stack and then click ‘delete’ to delete all associated service instances when they are done testing. Additionally, if users want to make a copy of the stack for their own use (or further customization) they can easily do that by clicking on the ‘create stack’ button in the UI. After the termination process completes, the stack is removed from the CloudFormation console. Users can also verify that the associated EC2 instances (for example) have been terminated by viewing the EC2 console (example shown below).

Deleting the CloudFormation example stack automatically terminates associated resources.
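Cleanup can be scripted in the same spirit. The Python sketch below assumes a boto3 CloudFormation client. It tallies resource types from a single `list_stack_resources` response page, a simplification that ignores pagination, and then deletes the stack and waits for the deletion to finish.

```python
from collections import Counter


def count_resource_types(list_stack_resources_response: dict) -> Counter:
    """Tally resources by type from one list_stack_resources response page."""
    summaries = list_stack_resources_response.get("StackResourceSummaries", [])
    return Counter(summary["ResourceType"] for summary in summaries)


def delete_stack_and_wait(cfn_client, stack_name: str) -> None:
    """Delete the stack and block until CloudFormation reports it is gone."""
    cfn_client.delete_stack(StackName=stack_name)
    cfn_client.get_waiter("stack_delete_complete").wait(StackName=stack_name)


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   cfn = boto3.client("cloudformation")
#   print(count_resource_types(cfn.list_stack_resources(StackName="variantspark-demo")))
#   delete_stack_and_wait(cfn, "variantspark-demo")
```

Listing the resources before deleting gives a quick record of what the stack created, which is handy when confirming afterwards that nothing was left running.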

Learnings

  • Used parameterized templates, rather than scripts, with ‘starter-level’ default values so that users can build complex infrastructure
  • Included metadata in the vendor marketplace to make their templates even more usable. Their smart templates create a solution which is discoverable, understandable, and compelling, i.e. ‘try it easily, fast and free’
  • Included Hello World scaled example data and an example analysis Jupyter notebook
  • Included cluster monitoring tools

Congratulations to Dr. Denis Bauer and her team on a job well done!

The CSIRO Bioinformatics team

In the next article, I’ll examine another approach in this space. The patterns are similar but not identical, in part due to the choice of cloud vendor. That article will cover a cloud-native Hello World bioinformatics example running on GCP (Google Cloud Platform).

Cloud Architect who codes
