Scaling Custom Machine Learning on AWS

Lynn Langit
8 min read · Oct 2, 2018


Understanding the Challenge

Bioinformatics is one of the most interesting and challenging areas in which to work on scaling big data machine learning solutions. These challenges include not only the size and scale of genomic data (3 Billion DNA 'letters' per person), but also the potential to improve feedback loops for important research in human health, such as understanding significant variants in genomic data for potential CRISPR-Cas9 research. This research can have a profound impact on diseases such as cancer.

VariantSpark is a custom Machine Learning library

VariantSpark is a custom machine learning library for genomic data, built on Apache Spark core in Scala. VariantSpark implements a wide RandomForest ML algorithm with custom splits, which allows it to analyze input data with up to 100 Million features. The team (led by Dr. Denis Bauer) at CSIRO Bioinformatics in Sydney, Australia created this algorithm. This paper describes the origins of VariantSpark in more detail.

VariantSpark — wide RandomForest Machine Learning

When I joined this project as a remote Cloud Advisor, the team at CSIRO wanted to learn how best to utilize the public cloud to scale VariantSpark job runs. The project goals are as follows:

  • Speed — running one job on the current shared on-premise Hadoop/Spark cluster requires waiting for cluster availability. Also, even after the cluster is available, a single VariantSpark analysis job run can take up to 500 compute hours to complete.
  • Predictability — cost (for cloud services) and speed (time to run a job)
  • Simplicity — many bioinformatics research teams have one dedicated DevOps team member, or none at all, so easy setup, monitoring & maintenance is needed.
  • Reproducibility — using tools to capture AWS service configuration as code, so that research can be verified and reproduced, is also critical for this domain.

In this multi-part blog series, I’ll detail our work, moving towards scaling VariantSpark jobs using the AWS cloud. In this first post, I’ll detail all of the actions we took to prepare BEFORE we started the actual scaling work.

Goal 1: Understand current state and project goals

As with any technical project, we began by reviewing the existing implementation. The CSIRO team currently uses a shared, on-premises managed Hadoop/Spark cluster. When they want to perform a VariantSpark analysis, they submit a job request to their internal IT group and wait for availability on the shared cluster. After their job is scheduled, they wait for results. While input data sizes do vary, for a large analysis the actual job run time can be several hundred hours.

Understanding Input Data

Genomic sequencer output data is presented for further evaluation in a variety of data formats. VariantSpark is designed to work with the following data types: .csv, .txt, .vcf, .bgz, .bz2 and parquet. The sizes of the input data can vary widely between analysis jobs; currently, "typical" input data sizes range from 1 GB to 5 TB. VariantSpark uses two types of files — an input file (the large file for the workload) and a features (or label) file. Shown below is a portion of an input file.

VariantSpark input file in .vcf format

There are also other data characteristics to consider. These include the partitioning scheme (for parquet data) and the size of the job in memory. Apache Spark works most effectively when all of the input data fits into the in-memory executors on each worker machine in the cluster (a small sketch of checking this follows the job-type list below). To facilitate testing, the CSIRO team created a selection of sample input data (which we call 'synthetic data') to reflect the different types and sizes of analysis jobs.

Job Types — # rows (samples) * # features (columns) → total data size

  1. Tiny (demo) — 2.5k samples * 17k features → 1 GB
  2. Small (partial GWAS) — 5k samples * 2.5M features → 5 GB
  3. Huge 1 (big GWAS 1) — 10k samples * 50M features → 217 GB
  4. Huge 2 (big GWAS 2) — 5k samples * 100M features → 221 GB

Why so many features? The human genome has 3 Billion data points.
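To make the in-memory fit mentioned above concrete, here is a minimal PySpark sketch for checking how a parquet input is partitioned and caching it to see whether it fits in executor memory. The input path and core count are placeholders for illustration, not the team's actual values.

```python
from pyspark.sql import SparkSession

# Placeholder path to one of the synthetic parquet inputs described above.
INPUT_PATH = "s3://my-bucket/synthetic/huge1.parquet"
TOTAL_EXECUTOR_CORES = 160  # e.g. 10 nodes x 16 cores

spark = SparkSession.builder.appName("variantspark-input-check").getOrCreate()

df = spark.read.parquet(INPUT_PATH)

# How the parquet data is partitioned matters: too few partitions starves
# the executors, too many adds scheduling overhead.
print("partitions:", df.rdd.getNumPartitions())

# Repartition to a small multiple of the core count, cache, and materialize;
# the Spark UI storage tab then shows whether the data fits fully in memory.
df = df.repartition(TOTAL_EXECUTOR_CORES * 2)
df.cache()
print("rows:", df.count())
```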

Understanding Compute Resources

The computing power available to the team (the on-premises shared Hadoop/Spark cluster) consists of the following specifications:

  • 12 servers, each with 16 CPUs & 96 GB RAM
  • When testing the largest-sized VariantSpark jobs, 10 of the 12 available servers were needed to cache all of the job data; that cluster size is effectively 160 CPUs & ~1 TB RAM. Spark is configured to use 'whole node executors' as well (see the configuration sketch after this list).
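For readers less familiar with Spark sizing, here is a minimal sketch of what a 'whole node executor' layout looks like in configuration terms for a cluster like the one above. The memory and overhead values are my assumptions for 96 GB nodes, not CSIRO's actual settings.

```python
from pyspark.sql import SparkSession

# One executor per server, using all of a node's cores and most of its RAM.
# The memory/overhead split below is an assumed example, not the real config.
spark = (
    SparkSession.builder
    .appName("variantspark-whole-node-executors")
    .config("spark.executor.instances", "10")      # 10 of the 12 servers
    .config("spark.executor.cores", "16")          # all cores on a node
    .config("spark.executor.memory", "80g")        # leave headroom for OS/overhead
    .config("spark.executor.memoryOverhead", "8g")
    .getOrCreate()
)
```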

Reviewing the OSS Library

We reviewed the state of the VariantSpark OSS library on GitHub and worked with the team to perform the following preparatory actions:

  • Built the code and ran all tests. Used IDE code coverage tools to evaluate which areas of the code base were covered by unit tests, with emphasis on the ML algorithm areas.
  • Refactored code, mostly using safe renaming, to remove the need for most code comments. Deleted commented-out code blocks.
  • Added ScalaDocs to key machine learning sections of the code.
  • Created a task list on GitHub for the next work steps.
  • Pushed a compiled VariantSpark.jar file to a public Maven repository.

Our community created a Python API (wrapper) for usability too! Key code is shown below.

Community Partners DIUS coded a Python wrapper for VariantSpark
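Since the code screenshot doesn't reproduce well here, below is a minimal sketch of what driving VariantSpark from the Python wrapper looks like. The class and method names (VarsparkContext, import_vcf, load_label, importance_analysis, important_variables) are my recollection of the varspark package and the file names are placeholders; check the GitHub repository for the exact API.

```python
from pyspark.sql import SparkSession
from varspark import VarsparkContext  # name assumed; see the varspark repo

spark = SparkSession.builder.appName("variantspark-demo").getOrCreate()
vc = VarsparkContext(spark)

# Load the (large) genotype input and the small label/phenotype file.
features = vc.import_vcf("hipster.vcf.bz2")             # placeholder file names
labels = vc.load_label("hipster_labels.txt", "Hipster")

# Run the wide RandomForest importance analysis and list the top variants.
analysis = features.importance_analysis(labels, n_trees=500, seed=13)
print(analysis.important_variables(20))
```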

Goal 2: Select Cloud Services

The team at CSIRO has been developing a number of research tools on the public cloud, including applications such as GT-Scan-2 on AWS. They wanted to consider the available options for scaling VariantSpark jobs on the public cloud as well.

As with other client work, the first level of consideration was the type of compute resources we wished to test. Shown below is a list of AWS services that we considered, grouped by type.

One of the key constraints is that, at the time we began work, the team at CSIRO Bioinformatics didn't have any dedicated DevOps professional on their team. For this reason, we decided NOT to start by designing a solution for scaling, but rather to first build a solution whose main purpose was quick demonstrations for a wide variety of public audiences — both bioinformaticians and technical professionals.

Goal 3: Building a Sample

As previously stated, the goal of this phase of the work was to create a sample. This sample should make it fast, easy, free and fun for 'customers' to try VariantSpark on the cloud. Due to privacy considerations, we incorporated realistic (rather than real) data, and the team created a fake phenotype — Hipsterism.

Hipsterism traits

We also created a sample Jupyter notebook for this synthetic affliction. A Jupyter notebook, rather than a bash script, was used so that new users could more easily understand the way VariantSpark works. In bioinformatics, Jupyter notebooks are becoming increasingly pervasive, as they improve reproducible research by containing not only executable code, but also data visualizations and markdown text. This text helps to document experiments and jobs.

Using SaaS to get Customer Feedback

For this portion of the work, I recommended using a SaaS (Spark as a service) cloud solution due to its ease of setup and use. We built our sample using Databricks on AWS. Using a free community Databricks cluster and the HipsterIndex Databricks notebook example allows a new user to run their first job with minimal setup steps.

The Databricks AWS community edition includes 1 hour of free managed Spark cluster compute on a single-node cluster. Learn about our work — link. Here is a presentation which includes a screencast demo showing how to run the example (the demo starts at 18:00).

Additionally, our work on this sample has become part of the official Databricks documentation, as an example of a genomics application built on Apache Spark.

Goal 4: Selecting Services for Scaling

Having successfully built our sample, it was time to get to work on scaling VariantSpark jobs. We first needed to determine which AWS services to use for testing VariantSpark jobs at scale, exploring which AWS compute and data services would be the best fit for testing, scaling and performance verification.

First, we discarded AWS Lambda, because we are working with a stateful machine learning algorithm and Lambda functions are stateless. We could then consider the following:

  • Unmanaged Virtual Machines — EC2
  • Managed Virtual Machines — EMR
  • Unmanaged Containers — ECS or EKS
  • Managed Containers — Fargate or SageMaker

Again, we looked to reduce the size of this list by removing some options. We removed EC2 because of the lack of DevOps team members on most bioinformatics teams. The amount of administrative overhead to set up and maintain a Hadoop/Spark cluster on EC2 wasn't a fit for the size of the team at CSIRO.

We removed Fargate because we preferred to use Kubernetes, rather than ECS, for container management. As of this writing, Fargate does not yet support EKS. We feel that Kubernetes has emerged as a standard for container management in the public cloud, and we favor flexibility in our implementation.

That left AWS EMR, EKS and SageMaker as the basis for the start of our scaling work.

Goal 5: Prepare for Testing at Scale

Now that we had done a good amount of preparatory work, we had just a few more steps to complete before we could start our scaling tests. We needed to design and build test harnesses. This included working with the team to generate sample data and a couple of sample scripts. For convenience, we run these scripts from Jupyter notebooks — so that we can add documentation about our testing run conditions directly in the notebook. We also needed to verify the AWS account, AWS region and IAM users, and to set up AWS service limits and billing alarms for our work.
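As one example of those account guardrails, here is a sketch of creating a billing alarm with boto3; the alarm name, threshold and SNS topic ARN are placeholders.

```python
import boto3

# Billing metrics are published in us-east-1, regardless of workload region.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="variantspark-monthly-spend",   # placeholder name
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,                             # check every 6 hours
    EvaluationPeriods=1,
    Threshold=500.0,                          # placeholder budget (USD)
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder ARN
)
```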

Additionally, we chose to try both AWS CloudFormation and Terraform, so that we could select the best automation process for this work and for the team's ongoing move to using cloud services.

From Terraform — Infrastructure as Code

Finally, we worked with the group in Sydney to build and run a couple of example Docker containers locally. We did so because using containers for running compute was new to them. We both created sample containers (such as this one, which includes the common bioinformatics tool blastn, sample data and an example in an included Jupyter notebook) and also used the biocontainers registry. We had the team use the Docker desktop tools to run these containers locally.
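To keep the examples in this post in Python, here is a sketch of running one of those containers locally with the Docker SDK for Python; the image tag is a placeholder for whichever blastn biocontainer you pull, and the team equivalently used the Docker desktop tools or the docker CLI.

```python
import docker  # the Docker SDK for Python (pip install docker)

client = docker.from_env()

# Pull and run a blastn container locally; the image tag is a placeholder.
output = client.containers.run(
    image="biocontainers/blast:latest",
    command="blastn -version",
    remove=True,   # clean up the container after it exits
)
print(output.decode())
```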

Now that we had completed this preparatory work, we were ready to start work on scaling. In the next blog post I’ll detail how our work on scaling VariantSpark on AWS has been progressing.
