Microsoft Azure: Multi-Instance infrastructure with AppService as the center of the Universe.
TL;DR
Multi-Instance is a public cloud deployment of an application where each customer is using their own copy of the software and has their data stored in their own specific RDBMS.
~ Brian Sommer
Not so long ago I was asked by a PM to help their project team with setting up Azure infrastructure and hosting their application there. The idea was already there, but one minor thing was left to cover: implementation of that idea…
One of the main requirements was: all our customers (environments) must be isolated from each other and must not share anything. That's where multi-instance came from. Another requirement was to minimize the number of servers (VMs) as much as possible, which is why we are going to talk about Azure AppServices.
General idea
First, I’d like to present a little drawing, so that you have an idea of what we are talking about:
System RG is a resource group containing system resources/utilities: a virtual network, a secret vault, a database storing information about customers, and a DNS zone.
We are “mycompany.com”, planning to sell cloud based software (SaaS). This setup implies having plenty of customers and therefore resource groups.
QA/DEV RG: resource groups with QA and Development environments.
Customers RG is a resource group for a specific customer, say one named “Customer”.
Faced issues
If you are planning to give access to resources in an Azure subscription that is connected to a corporate Azure AD, things get a little bit problematic. I simply didn't have access to view or create users in that AD, so I did the following:
- Created a separate Azure AD;
- Created all service accounts and users for all third party vendors there;
- Sent them invites to my subscription (no need to confirm invitation);
- Assigned required role.
As for service accounts, you might want to set the password of an individual account to "never expire", and you can do that only with the help of PowerShell, so be ready. You can always change the password policy through the Office365 admin center, but it will be applied to all users in this Azure AD.
Since we care about CI/CD, bulk updates of customer environments, automated deployments, and so on, we had to set up a management server and other system resources. Exposing this server to the internet was not the way we wanted to go, so how do we access it? There are a couple of options; we considered all the variations and narrowed them down to a jump host or a VPN into Azure. Later on we decided to go for a VPN server, so I started to dig into the available VPN options.
VPN, but not for you!
As I wrote above, we didn’t want to expose anything to the internet except the customer front end(s). A point-to-site VPN was good for both the customer and the development team. Doesn’t sound like a problem, right? There is a native Azure service for that: VPN Gateway.
I did some quick research on Azure VPN Gateway, and MS says I can connect to my infrastructure from “ANYWHERE”.
Well, in Microsoft’s ideal world everybody uses Windows 10 with the latest updates on PCs/laptops, Windows Mobile on every smartphone, and washing machines running Windows Server 2016R2. They simply don’t have a client for non-Windows machines that can establish an SSTP VPN connection (which is what Azure VPN Gateway uses for point-to-site). I managed to google a couple of options for macOS, but none that I would use for a production environment.
I'm not trying to advertise, but I should say a couple of words about the chosen VPN solution: OpenVPN Access Server. In my opinion it’s one of the best VPN solutions for production, and I’ve convinced a couple of customers to use it. I guess now I deserve a “local distributor reward”, if there is such a thing :)
Application insights
The application consists of a Java back end and an Angular front end. They are served by an Azure AppService configured for hosting Java web applications; the data is stored in an MS SQL DB (Azure SQL).
As I already mentioned at the beginning of this article, each customer environment is isolated from the others, created in a different resource group, and has its own SQL server. The SQL server firewall is configured to allow connections only from the specific app service (the AppService outbound IP addresses are whitelisted in the SQL server firewall).
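Just to illustrate, here is a minimal sketch of how such firewall rules can be created with the Azure CLI (the resource group, SQL server and appservice names below are made up):
# Hypothetical names, purely for illustration
customername="customer1"
rg="${customername}RG"
sqlserver="${customername}-sql"
appservice="${customername}-app"
# Read the appservice outbound IP addresses (comma-separated list)
outbound_ips=$(az webapp show --name "$appservice" --resource-group "$rg" --query outboundIpAddresses --output tsv)
# Create one firewall rule per outbound IP
n=0
for ip in ${outbound_ips//,/ }; do
  az sql server firewall-rule create --resource-group "$rg" --server "$sqlserver" \
    --name "appservice-${n}" --start-ip-address "$ip" --end-ip-address "$ip"
  n=$((n+1))
done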
E-Mail notifications / SMTP service
The application has to be able to send e-mail, so we had to figure out which SMTP server we could use. Unlike AWS, Azure doesn’t have a native SMTP service; however, there is a way: SendGrid.
A detailed how-to is well described here.
Azure customers can unlock 25,000 free emails each month. These 25,000 free monthly emails will give you access to advanced reporting and analytics and all APIs (Web, SMTP, Event, Parse and more). For information about additional services provided by SendGrid, visit the SendGrid Solutions page.
Deployment
One of the main challenges was automated deployment, including: creation of an environment for a specific customer, application deployment, and DB schema deployment. To accomplish the goal I used the tool I’m most comfortable with, Jenkins, and configured a job for each of the following actions there:
Create Environment
This job creates an environment in Azure for a specific customer, based on pre-cooked JSON templates; the customer’s name is passed to the job as a parameter. I’m not going to share the whole deployment script for security reasons, but I’ll explain what it does (bash) step by step; a rough sketch of the main steps also follows a bit further below:
- Replace placeholders in parameters.json
- Generate DB password and write it to Azure KeyVault
- Create Resource Group as a logical unit to store all resources
- Create CNAME in DNS Zone mycompany.com, basically <customername>.mycompany.com
- By referring to template.json, create the appservice, appservice plan, SQL server with an empty database, and a storage account with a container for the backups we are going to place there later.
- Assign an additional host name to the appservice (the CNAME we created a couple of steps above)
- Upload wildcard TLS certificate (*.mycompany.com) to newly created appservice
- Configure binding, so appservice is accepting requests on HTTPS and my custom domain name
- Create “Allow” rules in the SQL server firewall to allow access only from the appservice outbound IP addresses (reading the outbound IP addresses by querying the appservice with az cli).
If you wonder what we do in case we need to deploy a DB schema from Jenkins to the SQL server: we add the management (Jenkins) server IP addresses dynamically and temporarily before the job is executed and remove them at the end of any DB-related job.
- Force HTTPS on the appservice by uploading this web.config file to the FTP endpoint:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<system.webServer>
<rewrite>
<rules>
<!-- BEGIN rule TAG FOR HTTPS REDIRECT -->
<rule name="Force HTTPS" enabled="true">
<match url="(.*)" ignoreCase="false" />
<conditions>
<add input="{HTTPS}" pattern="off" />
</conditions>
<action type="Redirect" url="https://{HTTP_HOST}/{R:1}" appendQueryString="true" redirectType="Permanent" />
</rule>
<!-- END rule TAG FOR HTTPS REDIRECT -->
</rules>
</rewrite>
</system.webServer>
</configuration>
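As promised, here is a rough, simplified sketch of what the resource-creation part of such a job could look like. All names (vault, zone, templates, placeholder tokens) are illustrative, and a lot of error handling is skipped:
#!/bin/bash
set -e
customername="$1"                      # passed in as a Jenkins job parameter
rg="${customername}RG"
# 1. Replace placeholders in parameters.json (placeholder token is hypothetical)
sed -i "s/__CUSTOMER__/${customername}/g" parameters.json
# 2. Generate a DB password and write it to Azure KeyVault
dbpassword=$(openssl rand -base64 24)
az keyvault secret set --vault-name "mycompany-vault" --name "${customername}-sql-admin" --value "$dbpassword"
# 3. Resource Group as a logical unit to store all resources
az group create --name "$rg" --location westeurope
# 4. CNAME <customername>.mycompany.com in the DNS zone
az network dns record-set cname set-record --resource-group "SystemRG" --zone-name "mycompany.com" \
  --record-set-name "$customername" --cname "${customername}-app.azurewebsites.net"
# 5. Appservice, appservice plan, SQL server with empty DB, storage account (template.json);
#    the generated password is also passed to the template, omitted here
az group deployment create --resource-group "$rg" --template-file template.json --parameters @parameters.json
# 6-8. Custom host name, wildcard certificate and HTTPS binding
az webapp config hostname add --webapp-name "${customername}-app" --resource-group "$rg" \
  --hostname "${customername}.mycompany.com"
thumbprint=$(az webapp config ssl upload --name "${customername}-app" --resource-group "$rg" \
  --certificate-file wildcard.pfx --certificate-password "$PFX_PASS" --query thumbprint --output tsv)
az webapp config ssl bind --name "${customername}-app" --resource-group "$rg" \
  --certificate-thumbprint "$thumbprint" --ssl-type SNI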
Deploy DB Schema
Deploying the DB schema, creating the required tables, and so on. Here we have a bunch of SQL scripts and Flyway Migrate (a Gradle plugin) to execute them and keep the whole process under control. It helps us avoid unnecessary DB changes: if the DB schema is already at the latest available version, it just skips everything and reports:
Schema [database_name] is up to date. No migration necessary.
- Read DB server address, db name, db user, db password parameters from azure using CLI;
- Check if current management server is allowed to connect to SQL. Update SQL server firewall if needed;
- Replace placeholders in flywayMigrate profile with real values, pulled from azure;
- Execute flywayMigrate with Gradle (a simplified sketch of the whole sequence follows below)
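A simplified sketch of that sequence, assuming the password sits in KeyVault and the Flyway settings live in gradle.properties with hypothetical placeholder tokens:
#!/bin/bash
set -e
customername="$1"
rg="${customername}RG"
# Pull connection details from Azure (names are illustrative)
db_server=$(az sql server list --resource-group "$rg" --query "[0].fullyQualifiedDomainName" --output tsv)
db_password=$(az keyvault secret show --vault-name "mycompany-vault" --name "${customername}-sql-admin" --query value --output tsv)
# Put the real values into the flywayMigrate profile
sed -i "s/__DB_SERVER__/${db_server}/g; s/__DB_PASSWORD__/${db_password}/g" gradle.properties
# Run the migrations; Flyway skips everything if the schema is already up to date
./gradlew flywayMigrate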
Build & Deploy back end / front end
These two jobs are designed to build and deploy back end and front end.
To build the Java back end we pull a bunch of parameters from Azure and include them in the application properties. Basically, we query Azure for things like the DB connection strings configured on the appservice and the URL of the Azure blob storage where the application stores some files (e-mail templates, profile pictures and so on), and then just build it with Gradle.
I haven’t said anything about the deployment itself yet. There aren’t many ways to deploy to an Azure app service; the only one for us was ordinary FTP deployment, and to make it a bit more secure I used the FTPS endpoint instead.
So in the case of the back end application we have a single WAR file when the build is done; we take this file and upload it to the Tomcat webapps folder via FTPS, and the rest is done by Tomcat.
However, in the case of the front end we have tons of HTML files, scripts, style sheets, etc. Uploading all of those over FTPS was a little problematic, so we came up with the following idea, which might even sound weird, but it worked fine for us:
Once the npm build is done we compress everything with zip, rename the archive to “frontend.war”, then upload that single file via FTPS to Tomcat and let it do its job.
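The whole trick fits in a few lines; the FTPS host, credentials and output folder below are placeholders for what Azure shows in the appservice publish profile:
# Build the Angular front end
npm install
npm run build
# Package the build output as a "WAR" (it is really just a zip archive)
cd dist && zip -r ../frontend.war . && cd ..
# Upload the single file to the Tomcat webapps folder over FTPS (--ssl-reqd forces TLS)
curl --ssl-reqd -T frontend.war --user "${FTPS_USER}:${FTPS_PASS}" \
  "ftp://${FTPS_HOST}/site/wwwroot/webapps/frontend.war"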
Health check: a thing you need!
One day it came to mind that it would be nice to know how much time it takes to deploy the front end or back end. We also couldn't live without knowing whether the deployment of a new version had been successful; we kind of needed some feedback from our little application. So here’s what I’m doing to make sure our version is deployed.
Front end:
When the application is built, I inject a file with the current git HEAD into the package that is uploaded to the web server later. After deployment I curl the URL of this little file every 5 seconds and check whether the git HEAD in the file matches the one we built right before deployment. Once they match, the script returns exit code “0” and the Jenkins job exits with “SUCCESS”. If the git HEAD is still different after 10 minutes, the build fails and I get an e-mail from Jenkins that something broke there. Here’s what I’m talking about:
gitHeadId=$(git rev-parse --short HEAD)
# Verification if deployment was successful
retry_count=0
until [ "$retry_count" -gt 600 ] || [ "$returned_head" = "$gitHeadId" ]; do
  sleep 5
  echo "$(( retry_count+=5 ))s since new version FTP upload."
  returned_head=$(curl -s --max-time 1 "https://${customername}.mycompany.com/githead.txt")
done
if [ "$returned_head" = "$gitHeadId" ]; then
  echo "SUCCESS: New version is deployed, git head $gitHeadId"
else
  echo "ERROR: Timeout of $retry_count seconds reached"
  exit 1
fi
Back end
This one was tricky, since I cannot just inject a file into the Java back end. Well, maybe I can, but why should I learn how to do it when there is a team of skilled Java developers :) So I asked the development team to add a “HEALTHCHECK” resource to the API that does the following:
- If the application is running and can query the database and get the expected results, it returns JSON with “success”
- Returns JSON with current git head
We do basically the same as for the front end, just with a slightly different health check:
gitHeadId=$(git rev-parse --short HEAD)
# Verification if deployment was successful
retry_count=0
until [ "$retry_count" -gt 600 ] || [ "$returned_version" = "${customername}_${gitHeadId}" ]; do
  sleep 5
  echo "$(( retry_count+=5 ))s since new version upload."
  returned_version=$(curl -s --max-time 5 "https://${customername}.mycompany.com/BE/healthcheck?includeVersions=true" | jq -r '.version')
done
if [ "$returned_version" = "${customername}_${gitHeadId}" ]; then
  echo "SUCCESS: New version is deployed, version: $returned_version"
else
  echo "ERROR: Timeout of $retry_count seconds reached."
  exit 1
fi
As you might have noticed, I’m using jq here. At the beginning I tried to use as few third-party tools as possible, but later on I gave up, since it’s a lightweight tool and it’s the best at parsing JSON data.
Tomcat parallel deployment and clean-up
We’ve made use of parallel deployment in Tomcat to minimize downtime. Still, I have to remove the old versions after the new one is deployed:
#Remove old wars
echo "Removing old versions"
wars=($(curl -k -l "$URL" --user "${FTPS_USER}:${FTPS_PASS}" | grep -v "${warFilename}" | grep "^backend.*\.war"))
n=1
for i in "${wars[@]}"
do
  echo "Removing old version: $i"
  curl -k -v -u "${FTPS_USER}:${FTPS_PASS}" "$URL" -Q "DELE site/wwwroot/webapps/$i"
  n=$((n+1))
done
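For context: Tomcat’s parallel deployment keys off the “app##version” naming convention, so the uploaded WAR carries a version suffix. The suffix below (a timestamp) is just an example; the real one only has to sort higher than the previous deployment:
# "backend##<version>.war" files are treated as versions of the same app;
# new sessions go to the highest version string
version=$(date +%Y%m%d%H%M%S)   # hypothetical versioning scheme
warFilename="backend##${version}.war"
curl --ssl-reqd -T build/libs/backend.war --user "${FTPS_USER}:${FTPS_PASS}" \
  "ftp://${FTPS_HOST}/site/wwwroot/webapps/${warFilename}"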
Delete environment
This job’s name is pretty self-explanatory. It removes the resource group for a specific customer, and by doing that all resources in that RG get deleted. Additionally, it removes the DNS record (CNAME) from the mycompany.com zone and the database admin password from Azure Key Vault.
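A stripped-down sketch of that cleanup (the vault, zone and naming conventions are illustrative):
#!/bin/bash
set -e
customername="$1"
# Delete the whole resource group and everything inside it
az group delete --name "${customername}RG" --yes --no-wait
# Remove the CNAME from the mycompany.com zone
az network dns record-set cname delete --resource-group "SystemRG" --zone-name "mycompany.com" \
  --name "$customername" --yes
# Remove the SQL admin password from Azure KeyVault
az keyvault secret delete --vault-name "mycompany-vault" --name "${customername}-sql-admin"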
Re-Key DB
We all know about the importance of changing passwords from time to time for security reasons. Same here: it would be nice to get the SQL admin password changed regularly, and since in this case we are not talking about a change in one single place, it makes sense to write a script that does all the updates for us (a rough sketch follows after the list):
- Generate new password, apply changes to SQL server setting
- Write new value to Azure KeyVault
- Update application with new settings
- Restart application
- Another variation of HealthCheck loop, described above
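And the promised sketch, again with illustrative names; the connection string template and all error handling are simplified:
#!/bin/bash
set -e
customername="$1"
rg="${customername}RG"
sqlserver=$(az sql server list --resource-group "$rg" --query "[0].name" --output tsv)
# 1. Generate a new password and apply it to the SQL server
newpassword=$(openssl rand -base64 24)
az sql server update --resource-group "$rg" --name "$sqlserver" --admin-password "$newpassword"
# 2. Write the new value to Azure KeyVault
az keyvault secret set --vault-name "mycompany-vault" --name "${customername}-sql-admin" --value "$newpassword"
# 3. Update the application with the new settings (connection string is illustrative)
az webapp config connection-string set --resource-group "$rg" --name "${customername}-app" \
  --connection-string-type SQLAzure \
  --settings DefaultConnection="Server=tcp:${sqlserver}.database.windows.net,1433;Database=databasename;User ID=dbadmin;Password=${newpassword};Encrypt=true;"
# 4. Restart the application; the health-check loop described above then verifies it came back
az webapp restart --resource-group "$rg" --name "${customername}-app"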
Backup/Restore Azure appservice and its database
Microsoft offers automated backups of Azure SQL, which is a pretty good option; however, we want to run backups on demand. The use case: take a backup every time before a new version deployment, so in case the new version has some problems or, god forbid, hidden bugs, we can roll back to the version that was running right before we started the update. We tried a couple of different options here; let me start with the failed one:
0. FlywayMigrate, SqlCmd (Failed)
We are already familiar with this tool, so I wanted to use it to back up SQL. It’s not going to be a long story: Azure SQL does not allow the “BACKUP DATABASE” SQL statement. Obviously, the situation is the same with another command line tool, SqlCmd. Done with this option.
1. Copy to another database
A full copy of the DB content to another database running on the same SQL server. Azure CLI helps here:
az sql db copy ...
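For reference, with placeholder names the full command could look roughly like this:
az sql db copy --resource-group "${customername}RG" --server "${customername}-sql" \
  --name databasename --dest-name "databasename_backup_$(date +%Y%m%d)"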
This option was the easiest to implement (at that stage). But having another DB running on the same SQL server is not optimal in terms of resource cost, so we moved on to another solution:
2. Export to Azure Blob storage
This method implies taking a DB backup as a BACPAC and storing it on Azure blob storage, using our favorite Azure CLI:
az sql db export ...
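Again with placeholder names, the full export command looks roughly like this:
az sql db export --resource-group "${customername}RG" --server "${customername}-sql" --name databasename \
  --admin-user dbadmin --admin-password "$dbpassword" \
  --storage-key-type StorageAccessKey --storage-key "$storageAccKey" \
  --storage-uri "https://${customername}storage.blob.core.windows.net/backups/databasename.bacpac"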
We had been using this method for some time, until one day I noticed that a backup I had started early in the morning was still running at around 5 PM. It turned out there was a known issue, last reviewed Jun 16, 2014… In short:
This problem occurs when many customers make an import or export request at the same time in the same region.
The Azure SQL Database Import/Export Service provides a limited number of Compute virtual machines (VMs) per region to process the import and export operations. The Compute VM is hosted per region to make sure that the import or export avoids cross-region bandwidth delays and charges. If too many requests are made at the same time in the same region, significant delays occur in processing the operations. The time that is required to complete requests can vary from a few seconds to many hours.
~ Microsoft
We can’t afford such delays, so another option popped up.
3. Backup to server’s HDD
In the article about the known issue (link above), they mention the DACFx API and offer to use it in our code as a workaround. That’s not our case: we can only have scripts (bash/PowerShell). There is another way to back up (take a snapshot of) a database to a BACPAC file: sqlpackage.exe. This command line tool gets installed automatically when you install the MSSQL engine on a Linux server, but I don’t really want to have it on my Linux machine; it looks a little bit weird to me. I do have a Windows server (a UI-testing Jenkins slave), so I could use it, recall my PowerShell skills and write some scripts for backup/restore. But first install sqlpackage.exe and the Microsoft SQL 2014 Feature Pack. If we skip all the error handling, variable definitions, comments, etc., the line that does the job looks like this:
.\SqlPackage.exe /Action:Export /SourceDatabaseName:database_name /SourceServerName:"tcp:$db_server,1433" /tf:D:\backups\$customername\databasename.bacpac /su:$dbuser /sp:$dbpassword /p:Storage=File
This method worked much faster than the previous one, but I wasn’t happy with storing backups on a local HDD. Of course we configured backups of this VHD in Azure, but anyway… So let’s move on to the next and final solution.
4. AppService + DB backup to Azure Blob
We can configure backups in the Azure Portal on the AppService blade; the same can be done with the Azure CLI. Using this method we get a backup of the appservice files and settings, and optionally we can include the database in the backup, which in our case is extremely useful. A few words about the script: first we need to define variables with the Azure storage account name and access key, then generate a SAS key for access to blob storage.
#Populate variables with Azure storage details
storageAccName="${customername}storage"
storageAccKey=$(az storage account keys list --account-name "$storageAccName" --resource-group "${customername}RG" --query "[0].value" --out tsv)
#Generate SAS valid for 1 hour
SASkey=$(az storage account generate-sas --services bfqt --resource-types sco --permissions rwdlacup --ip "0.0.0.0-254.254.254.254" --expiry $(date -u -d '59 minutes' +%Y-%m-%dT%H:%MZ) --account-name "$storageAccName" --account-key "$storageAccKey")
Generating a SAS key (Shared Access Signature) was new to me, and I found it a bit tricky when dealing with it for the first time, so I'd like to shed some light on this. The first thing to understand: it is a temporary access key to your storage. Parameter explanation:
--services bfqt
: Means we are generating SAS for Blob, File, Queue, Table services;
--resource-types sco
: Specifies that the resource types for which the SAS is valid are Service, Container, and Object. This means that the specified permissions are granted for all appropriate operations for the specified services.
--permissions rwdlacup
: Permissions are read, write, delete, list, add, create, update, process.
--ip "0.0.0.0-254.254.254.254"
: Allow access from all IP addresses. Don't miss this parameter, otherwise you will be getting a "Bad Request" error. There might be a better way to specify this parameter instead of the full range, like querying opendns.com or another resource to find out the external IP of the server we are working on and then allowing only that one address.
--expiry $(date -u -d '59 minutes' +%Y-%m-%dT%H:%MZ)
: Sets the SAS to expire about 1 hour from now; pay attention to the date format.
The storage account name and key are already defined above, so we are pretty much done here. Once we have the SAS it's time to tell Azure to start the backup. Assuming we have a blob container named “backups”, our command would look like this:
#Create backup
az webapp config backup create --webapp-name ${appServiceName} --resource-group ${customername}RG --container-url "https://${customername}storage.blob.core.windows.net/backups/?$(echo $SASkey | tr -d '"')&sr=b" --backup-name Backup1 --db-name databasename --db-type SqlAzure --db-connection-string "$(az webapp config connection-string list --resource-group ${customername}RG --name ${appServiceName} | jq -r '.[]."value"."value"')"
&sr=b
: It's important to append this little key to the SAS URL string; it tells Azure that the resource is a “Blob”.
--db-type, --db-name, --db-connection-string
: These are specified in order to include the database in the backup. We query the appservice to get the value of --db-connection-string; I should mention that we populate this setting upon environment creation.
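The backup itself runs asynchronously, so it can be handy to check on it afterwards, for example:
az webapp config backup list --webapp-name ${appServiceName} --resource-group ${customername}RG --output table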
Taking into account the pros and cons of all the methods, we ended up using this last backup option. Now let's get back to our Jenkins jobs.
Auto testing (UI) and Code quality control
For UI testing with TestNG we have a Selenium server and Chrome installed on a Windows 2012R2 Jenkins slave. SonarQube is used as the code quality control system, also installed on the same Windows machine. Configuration of both is pretty straightforward, and remember you are working in a team: your QA engineers are always ready to help set things up. The only part I would pay attention to here is UI testing. We need to create a separate user and configure auto-login for this user in Windows; you might want to restrict access to the registry keys, since the login and password are stored there in clear text. Once the user is created, make a little batch file that starts the Jenkins slave and add a task to Task Scheduler to run this script on the user's logon. It's important not to log this user off from the server, otherwise all UI tests will fail.
To be honest, I have always been on good terms with Windows servers, so there were no big issues, only some minor difficulties, for instance a hanging Git CLI.
Remarks about Jenkins configuration
Not all of these jobs are needed by the development team, therefore I installed yet another plugin for role-based access in Jenkins and published only 2 deployment jobs to the team:
Create_env_Deploy_DB_BE_FE, sequence of jobs:
Create Environment -> Deploy DB -> Deploy WAR (BE) -> Deploy FE
Update_application, sequence:
Backup DB -> Deploy DB -> Deploy WAR (BE) -> Deploy FE
I’m not going to get too deep into all the details, but I'll share some key points:
- Jenkins is configured to use TLS certificate and DNS name in our domain "mycompany.com".
How to make Jenkins use your TLS certificate:
1) Convert the PFX cert to JKS:
keytool -importkeystore -srckeystore certname.pfx -srcstoretype pkcs12 -destkeystore certname.jks -deststoretype JKS
Enter the source and destination passwords; the password of the destination JKS must be the same as the PFX password, otherwise you'll spend some more hours googling.
2) Edit 2 lines in /etc/sysconfig/jenkins:
JENKINS_HTTPS_KEYSTORE=/path/to/certname.jks
JENKINS_HTTPS_KEYSTORE_PASSWORD=cert_password
3) Restart the Jenkins service and we are all done.
- Installed some plugins: for role-based access, for fetching multiple repositories in a single freestyle job, the email extension plugin, and some others;
- Configured all jobs to accept parameters, such as customer name, git branch and others.
- Some more plugins, for TestNG results aggregation and integration with SonarQube
- OH! Nearly forgot my favorite plugin in Jenkins "Green balls"
In this setup Jenkins is the center of the Universe, therefore we added a couple of slaves with 2–3 executors each and configured backups of all nodes. Later on we are planning to develop a nice, concise admin interface/visualizer that will contain all admin tasks, such as bulk upgrades, backups, etc. of all environments, and will interact with the API of our Jenkins.
Hope it's been informative for you. I'd like to thank all of you who managed to read this all the way to the end. New stories are coming soon; take care and remember to use Green Balls ;)
P.S. If you wonder what my opinion about Azure as a cloud is: it's cheap, good, but not the best.
Why? I have faced a few major cloud-wide issues at Azure, such as a certificate problem, some fancy errors, and fairly frequent general slowness.