AWS SageMaker for Bioinformatics

Lynn Langit
3 min read · Jan 11, 2018

written by Samantha Langit

The Bioinformatics Research Challenge

As a college student who needs to hand in bioinformatics homework, I decided to try AWS SageMaker for running a Jupyter notebook, because it lets me share my work and make it reproducible for others.

The other reason I chose a cloud-hosted notebook is to build the skills I need for performing genomic research. Workflows developed by the research community need to be shared globally, and ever-growing dataset sizes require these workflows to be executable in the cloud. Cloud-hosted Jupyter notebooks are easily accessible and can use a variety of methods, including machine learning algorithms, to analyze massive amounts of genomic data.

To demonstrate these benefits, I am setting up a scalable Jupyter notebook. I am using the API of GT-Scan2, developed for genomic analysis by the team at CSIRO Bioinformatics in Sydney, Australia. This tool locates optimal CRISPR-Cas9 binding sites within a given region of the genome. For my example, I want to find target sites for gene editing that are active in one organ but inactive in another. This may be of value for increasing neutrophils in the heart but not in the blood, as shown in this notebook.

AWS SageMaker

As mentioned, SageMaker provides scalable cloud hosting for Jupyter notebooks.

Its appeal for this use case is that SageMaker involves zero installation steps, which lets me get started within 5 minutes. It incorporates the ability to select flexible machine sizes, including those with GPUs. For other research, SageMaker also includes built-in machine learning models. In this case, I’ll be using only the Jupyter notebook capabilities. Shown below is the instance I initialized.

Cloud-hosted Jupyter notebook

Here are the steps I took:

  1. Initialize: After signing in to the AWS web console, I opened SageMaker from the list of services. From there, I created a notebook instance by clicking ‘Create notebook instance’. SageMaker prompted me to name it, choose a machine type (I chose medium for this case), and associate an IAM role with it. The role I selected had IAM permissions for any bucket on my account, because I wanted the flexibility to choose a bucket downstream. Finally, I clicked ‘Create’ and waited 5 minutes for the instance to load. Shown below are the fields I filled out.
  2. Test: When the notebook environment was ready, I opened the instance and tested whether it would run as expected. To do this, I picked a sample notebook from the given list, ‘trusted’ that sample within the notebook environment, and used shift+enter to run a cell. With that successful, I could shut down that sample notebook.
  3. Customize: I then made a new folder at the root level and named it for my specific use case, so it could hold my existing notebook. I uploaded the relevant Jupyter notebook into it, then ran it the same way I ran the sample notebook. I can, of course, change any values or add code for further analysis.
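For readers who prefer scripting to clicking, the Initialize step above can be sketched with boto3, the AWS SDK for Python. This is a minimal sketch and not the method I used (I used the console). The instance name, the role ARN, and the exact machine type `ml.t2.medium` are assumptions for illustration, and the name check reflects my understanding of the SageMaker naming rule.

```python
import re

# Naming rule as I understand it from the SageMaker API reference (an assumption):
# at most 63 characters, alphanumerics and hyphens, no leading or trailing hyphen.
_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")


def build_notebook_request(name, role_arn, instance_type="ml.t2.medium"):
    """Build the keyword arguments for SageMaker's CreateNotebookInstance call.

    `instance_type` defaults to ml.t2.medium, which is an assumed reading of
    the 'medium' machine size chosen in the console.
    """
    if len(name) > 63 or not _NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid notebook instance name: {name!r}")
    return {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,
        "RoleArn": role_arn,
    }


def create_notebook_instance(params):
    """Create the instance and block until it is in service.

    Requires boto3 and configured AWS credentials; imported lazily so the
    request builder above can be used without either.
    """
    import boto3

    client = boto3.client("sagemaker")
    response = client.create_notebook_instance(**params)
    # Equivalent to waiting on the console page for the instance to load.
    waiter = client.get_waiter("notebook_instance_in_service")
    waiter.wait(NotebookInstanceName=params["NotebookInstanceName"])
    return response["NotebookInstanceArn"]
```

Calling `create_notebook_instance(build_notebook_request("gtscan2-demo", "<your role ARN>"))` would then stand in for the console clicks; both the name and the placeholder ARN are hypothetical. Keeping the request builder separate from the API call makes it possible to check the parameters without touching an AWS account.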

Summary and Next Steps

Above is a screenshot which shows the research results from the notebook. I added a quick plot (cell 47) to summarize findings. It took me only around 20 minutes from service setup to notebook execution. Also, the sample notebook performance in the SageMaker environment was fast. Given these results, I intend to add SageMaker to my research toolkit.

If you’d like to try out this example yourself, you can get the Jupyter notebook source code from https://s3.us-east-2.amazonaws.com/csiro-graphics/notebook-casestudy.ipynb

If you’d like to connect with the team at CSIRO Bioinformatics, reach out to the lead, Dr. Denis Bauer, via https://bioinformatics.csiro.au/.
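Before uploading the notebook to your own instance, it can help to take a quick look at what it contains. The sketch below uses only the Python standard library; it assumes the file is still hosted at the URL above and follows the version 4 notebook format, in which cells sit in a top-level `cells` list. The function names are my own.

```python
import json
import urllib.request
from collections import Counter

# URL taken from this article; it may no longer be live.
NOTEBOOK_URL = (
    "https://s3.us-east-2.amazonaws.com/csiro-graphics/notebook-casestudy.ipynb"
)


def fetch_notebook(url=NOTEBOOK_URL, timeout=30):
    """Download an .ipynb file and parse it as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_notebook(notebook):
    """Count cells by type, assuming the nbformat 4 layout."""
    cells = notebook.get("cells", [])
    return dict(Counter(cell.get("cell_type", "unknown") for cell in cells))
```

Running `summarize_notebook(fetch_notebook())` returns a small dictionary of cell counts per type, which is a quick way to confirm the download is a valid notebook before you upload it.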

Samantha Langit is a bioinformatics student at Harvey Mudd College in Claremont, CA.

Reviewed by Dr. Denis Bauer (CSIRO Bioinformatics), Lynn Langit (AWS Community Hero)
