Scaling Custom Machine Learning on AWS — Part 3 Kubernetes

  • In part one — we detailed the project itself and also preparation work on the VariantSpark library.
  • In part two — we detailed scaling VariantSpark analysis jobs on the AWS cloud using the Elastic MapReduce (EMR) service.

Goal 1: Use Machine Learning Containers

Because many people on this project had little experience with container-based workloads, we decided to first work with individual Docker containers for machine learning before attempting to build an AWS container cluster for VariantSpark. We started with a container that included:

  • the common ‘blastn’ bioinformatics tool
  • the Jupyter notebook service
  • sample analysis data
  • a sample notebook
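As a sketch of that first step, a single Jupyter container with the sample data mounted can be stood up with a couple of Docker commands. The image name, mount path, and conda channel below are assumptions for illustration, not the exact ones we used:

```shell
# Pull a community Jupyter image (image name is an assumption)
docker pull jupyter/datascience-notebook

# Run the notebook service, publishing the Jupyter port and mounting
# a local directory containing the sample analysis data and notebook
docker run -d -p 8888:8888 \
  -v "$PWD/sample-data":/home/jovyan/work \
  jupyter/datascience-notebook

# Inside the container, a tool such as 'blastn' can be installed,
# e.g. via the bioconda channel:
#   conda install -c bioconda blast
```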
AWS SageMaker ML lifecycle phases
We also experimented with AWS SageMaker, with several goals:
  1. to understand the benefits of using AWS-optimized ML containers
  2. to see how the Jobs (training) and Models (hosting) phases worked
  3. to use an algorithm with similarities to VariantSpark (XGBoost also uses a tree-based approach — see the blog which compares both algorithms)
  4. to try the hyperparameter tuning feature
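As an illustration of the Jobs (training) phase, a SageMaker XGBoost training job can be launched from the AWS CLI roughly as follows. The job name, role ARN, bucket names, and region-specific training image URI are all placeholders to adjust for your own account:

```shell
# Launch a SageMaker training job using the built-in XGBoost algorithm
# (all names/ARNs/URIs below are placeholders)
aws sagemaker create-training-job \
  --training-job-name xgboost-demo-001 \
  --role-arn arn:aws:iam::123456789012:role/SageMakerRole \
  --algorithm-specification TrainingImage=<region-xgboost-image-uri>,TrainingInputMode=File \
  --input-data-config '[{"ChannelName":"train","DataSource":{"S3DataSource":{"S3DataType":"S3Prefix","S3Uri":"s3://my-bucket/train/","S3DataDistributionType":"FullyReplicated"}}}]' \
  --output-data-config S3OutputPath=s3://my-bucket/output/ \
  --resource-config InstanceType=ml.m4.xlarge,InstanceCount=1,VolumeSizeInGB=10 \
  --stopping-condition MaxRuntimeInSeconds=3600
```

The hyperparameter tuning feature mentioned above wraps jobs like this one in a `create-hyper-parameter-tuning-job` call that searches over a declared parameter range.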
Spark 2.3 and Kubernetes

Goal 2: Select a Container Service

Now we were ready to select one or more AWS container orchestration services to test for scaling VariantSpark jobs using container clusters. There are a number of services to choose from on the AWS cloud; we considered the following for this project:

  • EKS (Elastic Kubernetes Service) — AWS service for managing containers using open source Kubernetes. We chose to test this service and will detail our work in the remainder of this blog post.
  • ECS (Elastic Container Service) — AWS service for managing containers using their own container-management system. We chose not to test this service, as we preferred to use Kubernetes.
  • Fargate w/ECS — AWS service for managing containers using their own higher-level container-management system. We chose not to test this service, for the same reason as above.
  • Fargate w/EKS — AWS service for managing containers using Kubernetes with a higher-level container-management system. At the time of work, Fargate w/EKS was not yet available, so we were unable to test this service.
  • SageMaker — AWS service for managed containers for machine learning workloads. We began testing this service but stopped in order to focus on Kubernetes, as SageMaker uses ECS and we preferred to work with Kubernetes.

We chose open source Kubernetes using EKS.

Starting our work with EKS, we wanted to first build a container that included Spark, Spark ML and Scala and run it locally. We found a couple of helpful blog posts to start this process. We were able to replicate this work quickly. We were particularly encouraged by being able to easily ‘containerize’ Spark and to interact with the local file system (rather than HDFS).
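Spark 2.3's native Kubernetes support is what makes this possible: `spark-submit` can target a Kubernetes API server directly. A minimal sketch using the SparkPi example shipped with Spark, where the API server address and image repository are placeholders:

```shell
# Submit a Spark job to a Kubernetes cluster (Spark 2.3+ native support);
# the cluster endpoint and container image name are placeholders
bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=<repo>/spark:2.3.0 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```

The `local://` scheme tells Spark the jar is already inside the container image, which is what made the 'containerize Spark' approach attractive.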

Build a Spark ML Kubernetes POC

For the next step, we wanted to get our custom Spark ML/Scala container working on EKS. To do this we created Terraform templates and used the kops tool in lieu of EKS (as the latter wasn't yet publicly available for testing). The architecture we built from is shown below.
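Standing up the interim cluster with kops can be sketched as follows. The cluster name, state-store bucket, zones, node count, and instance size are assumptions, not our exact values:

```shell
# kops stores cluster state in an S3 bucket (bucket name is a placeholder)
export KOPS_STATE_STORE=s3://my-kops-state-store

# Create the cluster definition, then apply it
kops create cluster \
  --name=variantspark.k8s.local \
  --zones=us-east-1a,us-east-1b \
  --node-count=3 \
  --node-size=m4.2xlarge
kops update cluster variantspark.k8s.local --yes
```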

Spark/Scala on EKS & S3

Goal 3: Setup VS-EKS for Testing

Finally we were ready to test our custom VariantSpark container on EKS. During this time, AWS released the Kubernetes-native EKS service to GA (general availability), which removed the need for us to use kops.

VariantSpark on AWS EKS configuration
Setting up a VariantSpark EKS cluster using Terraform templates
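With EKS at GA, the core of such a Terraform template reduces to an `aws_eks_cluster` resource. A minimal sketch, in which the cluster name, IAM role, and subnet references are placeholders:

```hcl
# Minimal EKS control-plane resource (names and references are placeholders;
# the IAM role and VPC/subnet resources must be defined elsewhere)
resource "aws_eks_cluster" "variantspark" {
  name     = "vs-eks"
  role_arn = aws_iam_role.eks_cluster.arn

  vpc_config {
    subnet_ids = [aws_subnet.a.id, aws_subnet.b.id]
  }
}
```

Worker nodes are defined separately (e.g., as an auto-scaling group of EC2 instances that join the cluster), which is where the instance sizing decisions discussed below come in.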

Goal 4: Plan Configuration & Parameters

Now that we had a working sample environment, our next step was to begin testing the scaling of VariantSpark jobs. At this point, we wanted to be able to compare the resources needed on-premises, on EMR, and on EKS, so we worked diligently to use the most similar configurations in our testing.

Service Types

The Kubernetes Master works with the Spark Driver on EKS

As with our EMR configurations, we tested many combinations of EKS configurations and Spark/VariantSpark job execution parameters. We were able to get a ‘like-for-like’ comparison once we understood where to make parameter changes in each environment: EMR uses Hadoop (YARN) settings, whereas EKS uses Kubernetes/Spark Driver settings.
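For example, executor sizing that EMR accepts as YARN-side `spark-submit` arguments is expressed on EKS as Spark-on-Kubernetes conf values. A sketch with illustrative numbers, where the jar name and API server address are placeholders:

```shell
# EMR (YARN): executor sizing passed directly to spark-submit
spark-submit --master yarn \
  --num-executors 8 --executor-cores 4 --executor-memory 24G \
  variant-spark.jar

# EKS (Kubernetes): the same sizing expressed as conf values
spark-submit --master k8s://https://<k8s-apiserver-host>:443 \
  --conf spark.executor.instances=8 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=24G \
  variant-spark.jar
```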

A VariantSpark job runs in four main steps:
  • Prepare Spark Executors — runs within milliseconds; requires minimal resources.
  • Zip with Index — prepares data; runs within minutes; requires appropriate compute, IO, and memory.
  • Count at Importance — loads data; runs within 10 minutes; requires enough executor memory to hold all of the data. IMPORTANT: if cluster memory isn't sufficient, the data will ‘spill’ to disk and this step runs 30–50x slower.
  • Run Decision Trees — analyzes the data using multiple steps per tree; run time varies (minutes to hours) depending on the (configured) number of trees; requires appropriate compute and memory. The number of decision trees affects the quality of the analysis results. For our initial testing we started with a very small number of trees (100); we later expanded our testing to include analyses that produced as many as 10,000 trees. Properly accounting for tree-building needs is a key aspect of scaling VariantSpark.
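A rough way to sanity-check the Count at Importance memory requirement before launching a job is to assume only about half of each executor's memory is usable for caching (Spark reserves part of the heap; `spark.memory.fraction` defaults to 0.6) and compare that against the input data size. The numbers below are purely illustrative:

```shell
# Back-of-envelope check that cached data fits in cluster memory.
# Assumes roughly half of each executor's memory is usable for storage;
# all numbers are illustrative, not recommendations.
data_gb=60          # input data size once loaded
executors=8
executor_mem_gb=24
usable=$(awk -v e="$executors" -v m="$executor_mem_gb" 'BEGIN { printf "%d", e * m * 0.5 }')
echo "usable cache memory: ${usable} GB"
if [ "$usable" -ge "$data_gb" ]; then
  echo "data should fit in memory"
else
  echo "WARNING: expect spill to disk (30-50x slower)"
fi
```

The Spark UI's Storage tab (shown below) is the ground truth for whether 100% of the data actually stayed cached.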
VariantSpark job steps in the Spark Web UI

Goal 5: Monitor for Effective Sizing

We worked with a number of tools and interfaces to understand the resources needed for each VariantSpark analysis job. These tools allowed us to ‘see into’ the job executions, so that we could understand how to configure and scale the AWS EKS resources. The tools that we used included the following:

  • Kubernetes Web UI — pods and nodes (also the ‘kubectl’ Kubernetes client tool)
  • Terraform templates — output from Terraform runs told us when we exceeded AWS-allocated resources (i.e. not enough of a certain size or type of EC2 instance, or out of Elastic IP addresses, etc.)
  • Spark Web UI — live job steps, executor resource usage and logs
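Alongside the web UIs, a few `kubectl` spot-checks covered most of our day-to-day needs (pod names below are placeholders):

```shell
# Quick health checks on an EKS cluster running a VariantSpark job
kubectl get nodes -o wide            # are all worker nodes Ready?
kubectl get pods                     # driver + executor pods running?
kubectl describe pod <executor-pod>  # look for scheduling / OOM events
kubectl logs <driver-pod> | tail     # Spark driver log output
```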
Viewing an undersized cluster for a VariantSpark job in Kubernetes UI
Verifying the VariantSpark EKS job step durations via the Spark UI
Verifying 100% of data is cached via the Spark UI

Next Steps: From Batch to Pipeline

As with any project, although we created repeatable processes for scaling VariantSpark analysis jobs on the AWS cloud, there is still more work to be done.

Vision for VariantSpark Pipeline
  1. CLIENT — a Jupyter notebook client, which connects to multiple AWS endpoints. These endpoints are presented as public (API) gateway endpoints to AWS compute processes (Lambda, EKS, EMR, etc…)
  2. DATA — a Lambda-driven job-analysis process, which kicks off by evaluating the size, type, and quantity of data in one or more designated S3 buckets. It then calls an intermediate Lambda which generates an analysis of the AWS compute resources needed for the input data size. This Lambda estimates the time to prepare the compute resources, the time to run them, and the estimated service cost. If the input data size is larger than a defined threshold, then data compression, conversion, and/or partitioning prior to analysis would be suggested and communicated back to the user via a message in the notebook client.
  3. DATA PREP (optional) — for ‘large’ data, the user could choose to initiate a ‘data prep’ step via the notebook interface, which would initiate downstream compute to compress, convert, and/or partition the input data and output optimized data to a destination S3 bucket.
  4. ANALYSIS — the user would be presented with a set of choices in the notebook, e.g. ‘run job using AWS’ or ‘run job locally’. These choices would include the estimated time to prepare the environment, the time to execute the job, and the estimated compute cost (for AWS services).
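The DATA step's first action could be as simple as summarizing the contents of the designated bucket before estimating compute needs. A sketch using the AWS CLI, where the bucket and prefix are placeholders:

```shell
# Summarize object count and total size for the input data
# (bucket/prefix names are placeholders)
aws s3 ls s3://my-input-bucket/vcf/ --recursive --summarize \
  | tail -2     # prints the "Total Objects:" and "Total Size:" summary lines
```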

    — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
  1. Can EKS use EC2 Spot Instances (this requires the EKS cluster autoscaler as well)? The answer appears to be yes — blog here.
  2. Can we use EKS auto-scaling for precision (rather than EC2 auto-scaling)? Again, the answer appears to be yes with Kubernetes 1.12 or above — link
  3. Can we use AWS tools to build CI/CD for the EKS/custom Docker container and the VariantSpark jar file itself? We believe so; a partial potential architecture is shown below.
Potential CI/CD Pipeline for VariantSpark jar file


