The Top Five 2017 AWS Re:Invent Announcements Impacting Bioinformatics

Todd Harris
Dec 5, 2017

The sixth annual Amazon Web Services (AWS) Re:Invent conference was held last week in Las Vegas. As in years past, Re:Invent 2017 was a dizzying week of new announcements, enhancements to existing products, and interesting prognostications on the future of cloud computing.

The first Re:Invent conference was held in 2012. Six thousand attendees gorged themselves on keynotes, sessions, hackathons, bootcamps, cloud evangelism, and video games. Not much has changed, except for the number of attendees. 2017 boasted 43,000 attendees with a conference campus spread across multiple venues on the Las Vegas Strip. It was not difficult to get your 10,000 daily steps in this year.

I’ve been fortunate to attend Re:Invent every year. In 2012, I’d already been using AWS for nearly 5 years and was thoroughly convinced of its utility and value. In those early days, there was still a lot of reluctance and skepticism about the cloud, particularly in academic settings. To some extent, these biases still exist, which partially explains the slower uptake of the cloud for academic projects. By now, I think it’s overwhelmingly clear that academic compute and basic research projects should be leveraging the many benefits of building and deploying on the cloud.

Without further ado, here are my top five announcements from the 2017 AWS Re:Invent impacting bioinformatics.

1. Amazon SageMaker

Amazon SageMaker is a fully managed service for building, training, and deploying machine learning workflows in the AWS cloud. Machine learning has always played an important role in bioinformatics. Simplifying training and deployment of ML workflows will have a profound impact on bioinformatics and big data. For one, SageMaker offers the opportunity to introduce ML approaches to a broader audience, and to a broader range of research topics. Of the hundreds of announcements at Re:Invent, SageMaker is the one I’m most excited to put to use.

2. Amazon Neptune

Bioinformatics is all about highly connected data. These relationships are often not concretely known and are difficult to model in relational database management systems. Graph databases are a perfect fit for fluid biological data. Amazon Neptune is the latest entry into the crowded graph database space. Although many commercial options are currently available, they often force unsavory decisions and raise significant issues of cost and vendor lock-in. Neptune is still in a preview phase and I haven’t had any direct experience with it, so I can’t address how it will perform against these challenges. However, given the quality of the other hosted database solutions from AWS, coupled with Neptune’s deep integration with other AWS tools, I expect it to be a worthy contender. As a highly available and scalable managed database supporting graph APIs and designed for the cloud from the ground up, Neptune could be an amazing tool for bioinformatics projects, especially projects looking to empower developers and decentralize management and system administration.

3. AWS Fargate

AWS Fargate promises to bring the serverless revolution to containers. Containers already have a strong presence in bioinformatics and have greatly simplified the maintenance and deployment of applications that may be, ahem, short on documentation. Still, container deployments have required managing infrastructure. Fargate is a launch type for Amazon ECS that lets you launch containers without having to manage the underlying infrastructure. You don’t have to choose an instance type or family, or manage scaling or clusters. Just define CPU and memory, IAM, and networking, and let Fargate handle the rest. While we are on containers, AWS also introduced Amazon Elastic Container Service for Kubernetes (EKS). Although it doesn’t rank in my top five, it does bear mention here.
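To make "just define CPU and memory, IAM, and networking" a little more concrete, here is a minimal Python sketch using boto3. It is an illustration under assumptions, not a recipe from the conference: the task family `align-reads`, the image name `example/aligner:latest`, and both helper functions are hypothetical, and the subnets, security groups, and role ARN would come from your own account. The first function only builds the request, so it runs without AWS credentials; the second makes the actual ECS calls.

```python
def fargate_task_definition(family, image, cpu="256", memory="512",
                            execution_role_arn=None):
    """Build keyword arguments for ECS register_task_definition (Fargate)."""
    params = {
        "family": family,
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",  # the network mode Fargate tasks require
        "cpu": cpu,               # CPU units as a string; 256 = 0.25 vCPU
        "memory": memory,         # MiB as a string
        "containerDefinitions": [
            {"name": family, "image": image, "essential": True}
        ],
    }
    if execution_role_arn is not None:
        params["executionRoleArn"] = execution_role_arn  # the IAM part
    return params


def run_on_fargate(cluster, task_params, subnets, security_groups):
    """Register the task definition and start one task (needs credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    ecs = boto3.client("ecs")
    registered = ecs.register_task_definition(**task_params)
    arn = registered["taskDefinition"]["taskDefinitionArn"]
    return ecs.run_task(
        cluster=cluster,
        taskDefinition=arn,
        launchType="FARGATE",  # no EC2 instances to pick or scale
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "ENABLED",
            }
        },
    )


# Hypothetical names, for illustration only.
params = fargate_task_definition("align-reads", "example/aligner:latest")
print(params["requiresCompatibilities"], params["cpu"], params["memory"])
```

Notice what is missing: there is no instance type, AMI, or autoscaling group anywhere in the request.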

4. Amazon Comprehend

Did I say that I was most excited about SageMaker? Well, I’m also pretty psyched about the introduction of Amazon Comprehend. Comprehend is a natural language processing (NLP) managed service that relies on machine learning to process text. At the end of the day, much of the most interesting information in bioinformatics lives in text. Comprehend offers a really cool way to get at that information. It can extract key phrases, known vocabularies, and custom lexica. It also does expected things like weighting occurrences and displaying them in context. Of course, it has an API and integrates with other AWS services, too.
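As a rough sketch of what calling the Comprehend API might look like from Python, the snippet below uses boto3's `detect_key_phrases` call. The helper names, the region, the 0.9 score cutoff, and the sample response are my own assumptions for illustration; the sample response is hand-written to mimic the shape of the API's output, not real service output.

```python
def detect_key_phrases(text, region="us-east-1"):
    """Send text to Comprehend for key-phrase extraction (needs credentials)."""
    import boto3  # imported lazily so the helper below works without boto3

    client = boto3.client("comprehend", region_name=region)
    return client.detect_key_phrases(Text=text, LanguageCode="en")


def top_key_phrases(response, min_score=0.9):
    """Keep confidently detected phrases, highest score first."""
    phrases = [
        (p["Text"], p["Score"])
        for p in response["KeyPhrases"]
        if p["Score"] >= min_score
    ]
    return sorted(phrases, key=lambda pair: pair[1], reverse=True)


# Hand-written sample in the shape of a detect_key_phrases response.
sample = {
    "KeyPhrases": [
        {"Text": "a loss-of-function allele", "Score": 0.97,
         "BeginOffset": 0, "EndOffset": 25},
        {"Text": "the daf-2 gene", "Score": 0.99,
         "BeginOffset": 29, "EndOffset": 43},
        {"Text": "some cases", "Score": 0.42,
         "BeginOffset": 47, "EndOffset": 57},
    ]
}
for text, score in top_key_phrases(sample):
    print(f"{score:.2f}  {text}")
```

Each phrase comes back with a confidence score and character offsets, which is what makes the weighting and in-context display possible.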

5. Amazon Glacier Select

Last but not least is Amazon Glacier Select. Really, you ask? A storage enhancement made my top five list? Yes. Here’s why. Biology (and bioinformatics) is about data. Data is expensive to generate and expensive to keep around. You either pay a lot for storage, throw your data away and commit to regenerating it later, or place it in essentially inaccessible archival storage.

That’s where Glacier comes in. Glacier is an AWS archival service for data not requiring quick, real-time access. Glacier Select is a new enhancement that lets you execute SQL queries against a Glacier archive. Since it is archival storage and partially “frozen”, you also specify how quickly you would like your results returned; standard queries take about 3–5 hours. Results can be deposited in an S3 bucket for further analysis. Since we’re talking about AWS here, of course there is an API that you can build into existing data warehousing applications. I’m psyched about cheap archival storage. And I’m super psyched about cheap archival storage that actually remains queryable. I think this relatively humble announcement will find many applications in bioinformatics and big data analytics.
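Here is a hedged sketch of what submitting such a query could look like with boto3's Glacier `initiate_job` call. The helper names, vault, archive ID, bucket, and the column names in the SQL are all placeholders I made up, and the sketch assumes the archive is an uncompressed CSV file with a header row. Building the job description needs no credentials; only `submit_select_job` talks to AWS.

```python
VALID_TIERS = {"Expedited", "Standard", "Bulk"}


def glacier_select_job(archive_id, sql, bucket, prefix, tier="Standard"):
    """Build the jobParameters dict for a Glacier Select query."""
    if tier not in VALID_TIERS:
        raise ValueError(f"tier must be one of {sorted(VALID_TIERS)}")
    return {
        "Type": "select",
        "ArchiveId": archive_id,
        "Tier": tier,  # how quickly you want the results back
        "SelectParameters": {
            # Assumes a CSV archive whose first row holds column names.
            "InputSerialization": {"csv": {"FileHeaderInfo": "USE"}},
            "ExpressionType": "SQL",
            "Expression": sql,
            "OutputSerialization": {"csv": {}},
        },
        # Results are written under this S3 bucket and prefix.
        "OutputLocation": {"S3": {"BucketName": bucket, "Prefix": prefix}},
    }


def submit_select_job(vault, job_parameters):
    """Submit the job and return its ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    glacier = boto3.client("glacier")
    response = glacier.initiate_job(vaultName=vault, jobParameters=job_parameters)
    return response["jobId"]


# Placeholder identifiers and hypothetical column names.
job = glacier_select_job(
    archive_id="EXAMPLE-ARCHIVE-ID",
    sql="SELECT s.sample_id, s.gene FROM archive s WHERE s.gene = 'BRCA1'",
    bucket="example-results-bucket",
    prefix="glacier-select/brca1",
)
print(job["Type"], job["Tier"])
```

The query runs against the archive in place, so only the matching rows land in S3 rather than the whole archive.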

There were many other announcements that have direct application to bioinformatics. I’d highly encourage you to watch the keynotes from Andy Jassy and Werner Vogels to dive in a little deeper. From there, you can watch most sessions from Re:Invent online, too.

This is my first post on Medium. Liked it? Hated it? Let me know how I can improve. And if you’re so inclined, please follow me on Twitter. And a quick thanks and shout-out to Aram Rasa Taghavi for some interesting discussions on Medium and motivation.

Todd Harris

Facilitating discovery at the intersection of genetics, genomics, bioinformatics, and big data. Traumatic Brain Injury survivor making every day count.