Five Commands Build Two NCBI APIs on the Cloud via Pulumi

Sixing Huang · Published in Star Gazers · Mar 24, 2021


In my previous articles “Serve NCBI Taxonomy in AWS, Serverlessly” and “Build Your Own GraphQL GenBank in AWS”, I outlined my approaches to serving NCBI’s resources as REST or GraphQL APIs in AWS. They were straightforward, but they were also the hard way of doing things: followers had to click and type a lot before the services were up and running, an error-prone and tedious process.

Do it once, OK. Do it twice, no.

What if we could describe all these steps in scripts and just let software run through them and build all that infrastructure for us? The benefits are many: the scripts are declarative, so they serve as excellent documentation. They are big time savers, not only because they execute quickly, but also because they are reusable, now and in the future, for me and for you. Finally, it is easy to maintain or tear down the whole infrastructure by modifying the scripts or issuing new commands.

Figure 1. The Pulumi homepage.

In this age of automation, this is already a reality. It is called Infrastructure as Code (IaC). The open-source Pulumi is one of the contenders in this field. Unlike Cloud Deployment Manager from Google and CloudFormation from AWS, Pulumi supports several cloud providers. It also differs from the configuration-file-based Terraform, another big name in the arena, in that Pulumi orchestrates deployments with programming languages such as Python or Go. Pulumi glues different cloud resources together in a program through the management APIs provided by the cloud providers. It is easy to learn for users who are proficient in one of its supported languages. Pulumi’s function names and syntax are close to Terraform’s, so Terraform users can master it quickly.
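To give a taste of how this looks, here is a minimal Pulumi program in Python; it is a standalone illustration, not part of the NCBI projects, and just declares an S3 bucket and exports its name:

import pulumi
import pulumi_aws as aws

# "pulumi up" creates the bucket, "pulumi destroy" removes it again
bucket = aws.s3.Bucket("demo-bucket")

# Exported values show up in the command output and in "pulumi stack output"
pulumi.export("bucket_name", bucket.id)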

I have reshaped my two NCBI projects with the help of Pulumi over the past week. The guiding principle: minimal commands. For this purpose, I packed the commands for data import and source-code building into scripts. Now, instead of two lengthy protocols, I have two Python projects that are tested and ready for future deployment. In this article, I will go through the processes and describe the multiple gotchas that took me hours to fix. The code is available in my GitHub repositories: https://github.com/dgg32/ncbi-taxonomy-pulumi and https://github.com/dgg32/graphql_genbank_pulumi.

1. GraphQL GenBank

In my previous article, I set up a nano GraphQL GenBank that serves the metadata and the sequence data of three GBK files. With GraphQL, users can retrieve exactly the data fields they need. The architecture consists of only two components: DynamoDB and AppSync. DynamoDB stores the data, and AppSync acts as the GraphQL interface.

Figure 2. The architecture for GraphQL Genbank

The Pulumi script for this project is also very straightforward. Below is the demo version of the __main__.py:
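(A trimmed sketch follows; the resource names, the GraphQL schema and the resolver template are illustrative rather than the exact code from the repository, and the line numbers mentioned in the next paragraph refer to the full script.)

import json
import pulumi
import pulumi_aws as aws

# DynamoDB table that stores the GenBank records, keyed by accession number
genbank_table = aws.dynamodb.Table("genbank",
    attributes=[aws.dynamodb.TableAttributeArgs(name="accession", type="S")],
    hash_key="accession",
    billing_mode="PAY_PER_REQUEST")

# Service role that AppSync assumes in order to read from DynamoDB
appsync_role = aws.iam.Role("appsync-dynamodb-role",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Principal": {"Service": "appsync.amazonaws.com"}}]}))

aws.iam.RolePolicy("appsync-dynamodb-policy",
    role=appsync_role.id,
    policy=genbank_table.arn.apply(lambda arn: json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:Scan", "dynamodb:Query"],
            "Resource": [arn, f"{arn}/*"]}]})))

# The GraphQL schema defines the inputs and outputs of the API
schema = """
type Sequence {
  accession: String!
  organism: String
  country: String
  sequence: String
}
type Query {
  getSequence(accession: String!): Sequence
  allSequences: [Sequence]
}
"""

api = aws.appsync.GraphQLApi("genbank-graphql",
    authentication_type="API_KEY",
    schema=schema)

api_key = aws.appsync.ApiKey("genbank-key", api_id=api.id)

# The data source connects the API with the DynamoDB table through the service role
data_source = aws.appsync.DataSource("genbank-datasource",
    api_id=api.id,
    name="genbank_datasource",
    type="AMAZON_DYNAMODB",
    service_role_arn=appsync_role.arn,
    dynamodb_config=aws.appsync.DataSourceDynamodbConfigArgs(
        table_name=genbank_table.name))

# One resolver per query field; this one backs allSequences with a DynamoDB Scan,
# getSequence would follow the same pattern with a GetItem operation
aws.appsync.Resolver("all-sequences-resolver",
    api_id=api.id,
    type="Query",
    field="allSequences",
    data_source=data_source.name,
    request_template='{"version": "2017-02-28", "operation": "Scan"}',
    response_template="$util.toJson($ctx.result.items)")

# Exported values for the data import script and for the client
pulumi.export("table_name", genbank_table.name)
pulumi.export("genbank_graph_ql_api_endpoint", api.uris["GRAPHQL"])
pulumi.export("genbank_graph_ql_api_key", api_key.key)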

A DynamoDB table can be easily defined in the first few lines. Then I set up a service role with access to the database (lines 10 to 19). Later, the AppSync “data source” will assume this role (line 36) and thereby connect the API with the database. The GraphQLApi block (lines 22 to 26) contains the schema of the API. It effectively defines the inputs and outputs of the API, in other words, the interface. An ApiKey (lines 29 to 30) is necessary for client authentication. The rest of the code consists of the resolvers. They do all the heavy lifting by implementing each function given in the schema. Finally, the script stores and shows the ApiKey and the API endpoint so that we can reach the API from the internet. The table_name is needed for the data import after Pulumi.

To set up the whole infrastructure in AWS, just cd into the project folder and enter:

pulumi up -y

But the step above does not include the data import. To put the three GBK files as items into DynamoDB, just cd into “data_import_and_client” and run:

python gbk_to_dynamodb.py ./data/ $(pulumi stack output table_name)

The “gbk_to_dynamodb.py” script accepts two arguments: the GBK data folder and the table_name. Since I exported the table_name in line 65, I could simply reference it with the “pulumi stack output” command.
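A rough sketch of such an import script, assuming Biopython and boto3 and the “accession” hash key from the sketch above (the exact field handling in the repository may differ):

import glob
import os
import sys

import boto3
from Bio import SeqIO

def main(gbk_folder, table_name):
    table = boto3.resource("dynamodb").Table(table_name)
    # Parse every GenBank file in the folder and upload one item per record
    for path in glob.glob(os.path.join(gbk_folder, "*.gbk")):
        for record in SeqIO.parse(path, "genbank"):
            source = record.features[0]   # the "source" feature carries the metadata
            table.put_item(Item={
                "accession": record.id,
                "organism": record.annotations.get("organism", ""),
                "country": source.qualifiers.get("country", [""])[0],
                "sequence": str(record.seq)})
            print(f"Imported {record.id} from {path}")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])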

Finally, to test the API, I could reuse my demo client from my previous article:

python query.py $(pulumi stack output genbank_graph_ql_api_endpoint) $(pulumi stack output genbank_graph_ql_api_key --show-secrets)

With the endpoint and the ApiKey, “query.py” queries the GraphQL GenBank for all the accession numbers and country data. Again, the two $(…) substitutions pulled the exported values from Pulumi and spared me from entering them explicitly.
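A sketch of what such a client can look like, assuming the requests library; the allSequences field mirrors the schema sketched earlier:

import sys

import requests

QUERY = """
query {
  allSequences {
    accession
    country
  }
}
"""

def main(endpoint, api_key):
    # AppSync API-key authentication goes into the x-api-key header
    response = requests.post(endpoint,
                             json={"query": QUERY},
                             headers={"x-api-key": api_key})
    response.raise_for_status()
    print(response.json())

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])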

To tear down the project, issue:

pulumi destroy -y

This command should destroy all the infrastructure on AWS.

2. NCBI Taxonomy as REST API

Figure 3. The architecture for NCBI Taxonomy REST API.

In another article, I created a REST API to serve the NCBI Taxonomy in AWS. It relies on the crosstalk among API Gateway, Lambda and Aurora. The architecture looks deceptively easy, but it took me four days to transform everything into a Pulumi project. The reason is that many complexities are hidden behind the AWS web console, and they all need to be configured explicitly in Pulumi.

Furthermore, in my original design, Aurora was serverless. I imported the data via Cloud9 because serverless Aurora does not allow public access. Now with Pulumi, I wanted to import the data locally via a script to achieve full automation. For this reason, I used provisioned Aurora this time. As mentioned in my original article, serverless Aurora responds with a delay after waking up from inactivity. The provisioned version seems to be immune to this problem.

Unlike the two-stage setup in the GraphQL project, the NCBI Taxonomy API project needed three steps: two Pulumi steps sandwiching a pure Python one. The first step was to set up the Aurora database in Pulumi. In the second step, I imported the taxonomy data into the database, wrote the Aurora endpoint into the Lambda source code to point it to the database, and zipped the source code. The final step was to set up Lambda with the zip file and configure the API Gateway in Pulumi. The second step is necessary because, to my understanding, the Aurora endpoint value cannot be resolved into a plain string during the first Pulumi step. Only after the first step has exported the endpoint can I finalize the Lambda source code in step 2.

2.1 Provision Aurora

Here is the demo version of my first-step Pulumi script:
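(A trimmed sketch, assuming a MySQL-compatible Aurora engine; the CIDR block, the credentials and the second export are placeholders rather than the exact values from the repository.)

import pulumi
import pulumi_aws as aws

# Security group for the Lambda function; step 3 attaches it to the function
lambda_sg = aws.ec2.SecurityGroup("pulumi-ncbi-lambda",
    description="Lambda to Aurora",
    egress=[aws.ec2.SecurityGroupEgressArgs(
        protocol="-1", from_port=0, to_port=0, cidr_blocks=["0.0.0.0/0"])])

# Security group for Aurora: accept MySQL traffic from my IP and from Lambda
aurora_sg = aws.ec2.SecurityGroup("pulumi-ncbi",
    description="Aurora access",
    ingress=[
        aws.ec2.SecurityGroupIngressArgs(
            protocol="tcp", from_port=3306, to_port=3306,
            cidr_blocks=["203.0.113.10/32"]),            # placeholder for my IP
        aws.ec2.SecurityGroupIngressArgs(
            protocol="tcp", from_port=3306, to_port=3306,
            security_groups=[lambda_sg.id])],            # traffic from Lambda
    egress=[aws.ec2.SecurityGroupEgressArgs(
        protocol="-1", from_port=0, to_port=0, cidr_blocks=["0.0.0.0/0"])])

# Provisioned Aurora cluster with a single cluster instance
cluster = aws.rds.Cluster("ncbi-taxonomy",
    engine="aurora-mysql",
    database_name="taxonomy",
    master_username="admin",
    master_password="change-me-please",   # placeholder; use pulumi config secrets
    vpc_security_group_ids=[aurora_sg.id],
    skip_final_snapshot=True)

instance = aws.rds.ClusterInstance("ncbi-taxonomy-instance",
    cluster_identifier=cluster.id,
    engine=cluster.engine,
    instance_class="db.t3.medium",
    publicly_accessible=True)

# Exported for step 2 (data import) and step 3 (Lambda configuration)
pulumi.export("instance-endpoint", instance.endpoint)
pulumi.export("lambda_security_group_id", lambda_sg.id)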

In this script, I defined two security groups. The first one, “pulumi-ncbi-lambda”, was there to let Lambda communicate with Aurora. The second one, “pulumi-ncbi”, was there to let Aurora accept access from my IP address and from Lambda. The rest of the code set up an Aurora cluster with just one cluster instance. Finally, the instance endpoint was exported for step 2. To run this, cd into the “database” folder and issue:

pulumi up -y

2.2 Data import and Lambda source code

Stay in the same folder and issue the command for step 2:

python zip_and_import.py $(pulumi stack output instance-endpoint)

This step will populate the Aurora database, finalize and zip the Lambda source code.
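A sketch of the idea behind “zip_and_import.py”, assuming pymysql against the MySQL-compatible cluster above; the file names, the credentials and the single nodes table are illustrative, and the actual import logic in the repository is more involved:

import shutil
import sys

import pymysql

ENDPOINT_PLACEHOLDER = "AURORA_ENDPOINT"

def import_taxonomy(endpoint):
    # Load the NCBI nodes.dmp dump into a table of the provisioned Aurora database
    connection = pymysql.connect(host=endpoint, user="admin",
                                 password="change-me-please", database="taxonomy")
    with connection.cursor() as cursor:
        cursor.execute("""CREATE TABLE IF NOT EXISTS nodes (
            tax_id INT PRIMARY KEY, parent_tax_id INT, node_rank VARCHAR(64))""")
        with open("data/nodes.dmp") as handle:
            for line in handle:
                fields = [field.strip() for field in line.split("|")]
                cursor.execute(
                    "INSERT INTO nodes (tax_id, parent_tax_id, node_rank) VALUES (%s, %s, %s)",
                    (fields[0], fields[1], fields[2]))
    connection.commit()
    connection.close()

def finalize_lambda_source(endpoint):
    # Write the Aurora endpoint into the Lambda source so that the function
    # knows which database to talk to, then zip the source for step 3
    with open("lambda_source/lambda_function_template.py") as handle:
        source = handle.read()
    with open("lambda_source/lambda_function.py", "w") as handle:
        handle.write(source.replace(ENDPOINT_PLACEHOLDER, endpoint))
    shutil.make_archive("lambda", "zip", "lambda_source")

if __name__ == "__main__":
    aurora_endpoint = sys.argv[1]
    import_taxonomy(aurora_endpoint)
    finalize_lambda_source(aurora_endpoint)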

2.3 Lambda and API Gateway

Here is the final Pulumi demo script that configures both Lambda and API Gateway:
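(A trimmed sketch; the bucket, the subnet ids, the security-group id from step 1 and the export name are placeholders, and it uses a Lambda proxy integration for brevity, while the full script in the repository sets up MethodResponse and IntegrationResponse for every function as described below. The line numbers in the next paragraphs refer to that full script.)

import json
import pulumi
import pulumi_aws as aws

# --- Lambda deployment ---
code_bucket = aws.s3.Bucket("ncbi-lambda-code")
code_object = aws.s3.BucketObject("lambda-zip",
    bucket=code_bucket.id,
    source=pulumi.FileAsset("../database/lambda.zip"))   # zip produced in step 2

lambda_role = aws.iam.Role("ncbi-lambda-role",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Principal": {"Service": "lambda.amazonaws.com"}}]}))

aws.iam.RolePolicyAttachment("ncbi-lambda-vpc-access",
    role=lambda_role.name,
    policy_arn="arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole")

# Placeholders: the subnets of the VPC and the security group defined in step 1
subnet_ids = ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"]
lambda_sg_id = "sg-0123456789abcdef0"

taxonomy_fn = aws.lambda_.Function("ncbi-taxonomy",
    s3_bucket=code_bucket.id,
    s3_key=code_object.key,
    handler="lambda_function.lambda_handler",
    runtime="python3.8",
    timeout=30,
    role=lambda_role.arn,
    vpc_config=aws.lambda_.FunctionVpcConfigArgs(
        security_group_ids=[lambda_sg_id],
        subnet_ids=subnet_ids))

# --- API Gateway ---
rest_api = aws.apigateway.RestApi("ncbi-taxonomy-api")

# Grant API Gateway the permission to invoke the Lambda function
aws.lambda_.Permission("apigw-permission",
    action="lambda:InvokeFunction",
    function=taxonomy_fn.name,
    principal="apigateway.amazonaws.com",
    source_arn=rest_api.execution_arn.apply(lambda arn: f"{arn}/*/*"))

functions = ["getdictpathbytaxid", "getnamebytaxid", "gettaxidbyname",
             "getrankbytaxid", "getparentbytaxid", "getsonsbytaxid"]

integrations = []
for name in functions:
    resource = aws.apigateway.Resource(f"{name}-resource",
        rest_api=rest_api.id, parent_id=rest_api.root_resource_id, path_part=name)
    method = aws.apigateway.Method(f"{name}-method",
        rest_api=rest_api.id, resource_id=resource.id,
        http_method="GET", authorization="NONE")
    integration = aws.apigateway.Integration(f"{name}-integration",
        rest_api=rest_api.id, resource_id=resource.id,
        http_method=method.http_method,
        integration_http_method="POST", type="AWS_PROXY",
        uri=taxonomy_fn.invoke_arn,
        opts=pulumi.ResourceOptions(depends_on=[method]))
    integrations.append(integration)

# Deploy only after every method and integration exists, otherwise API Gateway
# raises the "REST API Doesn't Contain Any Methods" error
deployment = aws.apigateway.Deployment("prod-deployment",
    rest_api=rest_api.id, stage_name="prod",
    opts=pulumi.ResourceOptions(depends_on=integrations))

pulumi.export("endpoint", deployment.invoke_url)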

The first part of the code (lines 6–30) established the Lambda deployment. It started by uploading the Lambda source code to S3, from where Lambda picked it up. Lambda needed an IAM role with some policies and the security group defined in step 1. Additionally, it also needed a list of subnets for its deployment. Lines 33–36 set up a Lambda permission that granted API Gateway access to the function.

Lines 33–67 defined the APIs. Line 41 declared six functions: “getdictpathbytaxid”, “getnamebytaxid”, “gettaxidbyname”, “getrankbytaxid”, “getparentbytaxid” and “getsonsbytaxid”. Each of them required an identical setup and contained five components: Resource, Method, Integration, MethodResponse and IntegrationResponse. I also added the “depends_on” ResourceOption so that the components get created in the right order and the “REST API Doesn’t Contain Any Methods” error is avoided. Finally, I deployed the API in line 69 and exported its endpoint in line 71.

Finally, cd into the “lambda_api_gateway” folder and issue the command:

pulumi up -y

After a short while, the command should finish setting up the Lambda and API Gateway. The endpoint of the API deployment will be shown in the output.

Figure 4. Example output of my Pulumi execution.

To test the API, open a browser window and enter the address with your deployment endpoint, function and query string (such as “taxid=” or “name=”). For example:

https://[endpoint]/prod/getnamebytaxid?taxid=1224

The API call above should return “Proteobacteria”:

Figure 5. Example output from my API call.

You can test the other functions and make sure the API returns the desired outputs. In case you want to tear down the whole infrastructure, issue:

pulumi destroy -y

first in the “lambda_api_gateway” folder. It will destroy the Lambda and API Gateway. In order to destroy Aurora, cd into the “database” folder and issue another “pulumi destroy -y”. That will finally clean up all traces on AWS.

Conclusion

There they are: two NCBI services as APIs, built with a minimal set of Python commands. With the help of Pulumi, I am able to code the infrastructure for the majority of my cloud projects. The projects are portable, easy to maintain and easy to share. Great!

But, as always, the devil is in the details. Because Pulumi just wraps the cloud APIs in programming languages, users still need a good understanding of each cloud provider’s specifications before they can call the correct Pulumi methods and enter the correct parameters. Also, the AWS web console hides many complexities such as permission settings and definition orders. In Pulumi, all these complexities come out of the woodwork, and they all demand correct configuration. So writing a Pulumi script is not only a means of automation; it also deepens my understanding of cloud resource management.

So, now it is time for you to automate your own cloud deployment with Pulumi.
