Durable Functions Blue-Green Deployment Strategies

Durable Functions is a very cool programming model. However, if you just deploy a new version to production using slot swapping, it will destroy your currently running jobs. I have come up with a deployment strategy for Durable Functions. These are NOT proven strategies; rather, I want to start a discussion with professionals using the diagrams in this blog. Once I confirm a firm strategy, I'll share it in the near future.

What is the problem?

Durable Functions has three parts. An orchestrator manages the activities, which do the actual work. An orchestrator client starts, stops, and checks the status of the orchestrator. To make a task "durable", the orchestrator saves its state into a storage table and reads/replays it using the Event Sourcing pattern. The components use storage queues when they want to talk to each other. For more detail, you can refer to this page.

For example, orchestrator code looks like this.
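The original code sample was shown as an image. As a stand-in, here is a toy Python sketch of the pattern, not the real Durable Functions API (`say_hello`, `orchestrator`, and `run_orchestrator` are all illustrative names): an orchestrator is written as a generator that yields activity calls, and the runtime drives it and feeds the results back in.

```python
# Toy sketch of the orchestrator/activity pattern (NOT the real
# Durable Functions API; all names here are illustrative only).

def say_hello(city):
    # An "activity": the worker that does an actual piece of work.
    return f"Hello {city}!"

def orchestrator():
    # The orchestrator yields activity calls; the runtime executes
    # them and sends the results back into the generator.
    result1 = yield ("say_hello", "Tokyo")
    result2 = yield ("say_hello", "Seattle")
    return [result1, result2]

def run_orchestrator(orch, activities):
    # A toy "runtime" loop standing in for the Durable Task framework.
    gen = orch()
    try:
        name, arg = next(gen)
        while True:
            output = activities[name](arg)
            name, arg = gen.send(output)
    except StopIteration as stop:
        return stop.value

print(run_orchestrator(orchestrator, {"say_hello": say_hello}))
# -> ['Hello Tokyo!', 'Hello Seattle!']
```

In the real framework, each yield point is where the state is persisted to the storage table and replayed later.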

If you change

  • Method signature of the Activity
  • Orchestrator logic

it will break the current job execution. After such a change, the storage table and queue data are no longer consistent with the new code, so the current job execution will fail. The official documentation mentions three strategies to deal with this issue:

  • Do nothing
  • Stop all in-flight instances
  • Side-by-side deployments

You can refer to the details on this page.

"Do nothing" simply lets the in-flight instances throw errors. Depending on the app, that could be an acceptable solution. "Stop all in-flight instances" means removing all the queues, which is fine during development. But what if you use Durable Functions in production, for example with an ordering system? We don't want to stop the current jobs. In that case, we can use side-by-side deployments.
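To see why "do nothing" ends in errors, here is a toy model of history replay (my own simplification of the Event Sourcing replay described above): the runtime replays the recorded events against the orchestrator code, and if the code no longer asks for the same activities that the history recorded, the replay fails.

```python
# Toy illustration of why in-flight instances break after a code change.
# (Simplified model; the real Durable Task replay is more involved.)

def replay(orchestrator, history):
    gen = orchestrator()
    try:
        call = next(gen)
        for recorded_call, recorded_result in history:
            if call != recorded_call:
                raise RuntimeError(
                    "history recorded %r but the code asked for %r"
                    % (recorded_call, call))
            call = gen.send(recorded_result)
    except StopIteration as stop:
        return stop.value
    return call  # the next activity that still needs to run

# History written by v1 of the orchestrator before the deployment.
history = [(("say_hello", "Tokyo"), "Hello Tokyo!")]

# v2 renamed the activity: a breaking change.
def orchestrator_v2():
    result = yield ("greet", "Tokyo")  # was ("say_hello", "Tokyo") in v1
    return [result]

try:
    replay(orchestrator_v2, history)
except RuntimeError as e:
    print(e)
```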

Based on this strategy, I came up with a CD pipeline strategy for Durable Functions.

Side-by-side deployments strategy

I’d like to share the fundamental idea with you.

Deployment without a breaking change

Update 26/03/2018 : This strategy doesn't work. For example, if I create a long-running activity, run the Durable Function, and then deploy the new version, the activity will be replayed. Activities should be designed to be idempotent.

Let's consider the case where there is no breaking change. I don't recommend sharing a storage account between the production and staging deployment slots. If you share one, the two Durable Functions apps share the same queues and storage table, which causes very weird behavior because they confuse each other's messages. It should be avoided.

In this case, I want to keep the current tasks running against Storage Account A, but the app should change to the new version (v2.0) and keep processing those tasks.

In this case, you can have a separate storage account for each slot. When the slot swap occurs, the apps are swapped, but the storage account settings are kept on their slots.

You can do this with an Azure Functions feature. In the app settings, you can control whether each App Settings entry is swapped. If you check "Slot Setting", the value sticks to the slot and doesn't swap. In this example, the slot swap will swap the Whoami value but not the Whoami Keep value.

App Settings Slot Setting
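The swap behavior can be sketched as a small simulation (illustration only, not Azure code); `WhoamiKeep` stands for the post's "Whoami Keep" setting with the Slot Setting checkbox checked:

```python
# Toy simulation of an App Service slot swap (illustration only).
# Settings whose names are in `slot_settings` stick to their slot;
# everything else (including the app itself) swaps between slots.

def swap(production, staging, slot_settings):
    new_prod = dict(staging)
    new_stag = dict(production)
    for key in slot_settings:
        # Sticky values stay with their original slot.
        new_prod[key] = production[key]
        new_stag[key] = staging[key]
    return new_prod, new_stag

production = {"app": "v1.0", "Whoami": "production", "WhoamiKeep": "production"}
staging    = {"app": "v2.0", "Whoami": "staging",    "WhoamiKeep": "staging"}

# "WhoamiKeep" has the Slot Setting checkbox checked.
prod, stag = swap(production, staging, slot_settings={"WhoamiKeep"})
print(prod)  # {'app': 'v2.0', 'Whoami': 'staging', 'WhoamiKeep': 'production'}
```

Marking the storage account connection string the same way is what lets the new app pick up the old slot's storage account.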

Since I haven't run the experiment yet, I'm not sure, but this strategy might not work. If an activity is processing a task when the deployment occurs, the deployment might kill the current execution, and the orchestrator might wait for the task to finish until a timeout occurs. I'll investigate this.

Deployment with a breaking change

Now consider a breaking change. In this case, if we swap the apps, the swap will break the current tasks. I want the current app (v1.0) to keep processing the current tasks to the end, even after the swap is executed. The new app should use a new storage account to avoid breaking the current tasks.

In this case, just let the configuration swap together with the apps between the Production and Staging slots (that is, don't check "Slot Setting" for the storage account settings).

With this strategy, the old app keeps working against the old storage account.

Why not a TaskHubs strategy?

We can achieve a similar strategy if we use the TaskHub setting. If you don't know about task hubs, please refer to this page. In this case, both slots point to the same storage account. The task hub name is part of the host.json settings. If you set the task hub to 1.0, the Durable Function uses task hub 1.0, which includes its own queues and storage table. It works like a namespace.
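For reference, the task hub name is configured in host.json. In the Functions v1-era Durable Functions extension it might look like this (the hub name `MyTaskHubV2` is just a placeholder):

```json
{
  "durableTask": {
    "HubName": "MyTaskHubV2"
  }
}
```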

If you swap the apps, the two apps point to different task hubs. It looks nice.

It is good because the queues and tables are kept forever! However, in the deployment pipeline, you would have to increment the TaskHub version on every deployment. Also consider that if you deploy a third app (v3.0), it will break the current tasks, and if you deploy frequently, the number of task hubs grows a lot. Again, it is a trade-off. Also, a friend of mine tried this and said it doesn't work. I'm not sure whether they missed something, so I'll try it myself.

Mixing both strategies (if possible)

You might mix the strategies: in case of no breaking change, use the first strategy, and once you have a breaking change, use the second one. In this case, you identify the kind of change in the pipeline and choose the strategy automatically. It might work. But wait! What happens if I use strategy one just after strategy two? If you deploy a new app on top of the old app on the staging slot, you will break the current tasks that are still running on the old app. We need to wait until those tasks are completed.

No problem: the same rules still work.

  • Non-breaking change -> swap the storage account settings
  • Breaking change -> keep the storage account settings

Let’s consider the cases.

Breaking change -> Non-breaking change

Let's use two slots. It causes some complexity, but you can keep the old app (v1.0), which still has some running jobs. App (v1.0) and App (v2.0) have a breaking change between them; however, App (v2.0) and App (v3.0) don't.

Keeping App (v1.0) alive, you can deploy the new version.

Breaking change -> Breaking change

It also works. It keeps App (v1.0), App (v2.0), and App (v3.0) without losing the current jobs.

You can also use a three-slots strategy for breaking changes together with the TaskHubs strategy.

What we discussed

We can control whether the storage account settings swap:

  • Non-breaking change -> swap the storage account settings (this can possibly cause a problem; it needs investigation)
  • Breaking change -> keep the storage account settings

The breaking change with TaskHubs strategy seems good. However, it will increase the number of task hubs if you deploy frequently. Also, the third deployment might stop the old app's tasks.

If you deploy an app with a breaking change and then deploy another new app to the Staging slot right after that, it might destroy the current task execution. To avoid this you can use a multiple-slots strategy. However, it introduces complexity. It's a trade-off.

Update 26/03/2018 : We can only use the breaking change strategy.

Deployment Pipeline for the Breaking Change Strategy

Let's consider the pipeline for the breaking change strategy.

  1. Build / Test / Zip the app
  2. Check that all Durable Functions tasks on the Staging slot are completed (optional)
  3. Remove/clear the storage queue data of the Staging slot
  4. Deploy the new app to the Staging slot
  5. Swap the apps together with the storage account settings in App Settings
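Step 2 could be sketched like this; in a real pipeline the statuses would come from the Durable Functions instance-status API, but here they are passed in directly (the status names follow the documented runtime statuses):

```python
# Sketch of the "all tasks completed" gate (step 2). In a real pipeline
# you would fetch these statuses from the Durable Functions status API;
# here they are passed in directly for illustration.

TERMINAL_STATUSES = {"Completed", "Failed", "Terminated"}

def safe_to_deploy(instance_statuses):
    # Deploy only when no orchestration instance is still in flight.
    return all(s in TERMINAL_STATUSES for s in instance_statuses)

print(safe_to_deploy(["Completed", "Failed"]))   # True
print(safe_to_deploy(["Completed", "Running"]))  # False
```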

If you want to use the breaking change with TaskHubs strategy:

  1. Build / Test the app
  2. Increment the version of the TaskHub and zip it with the new app
  3. Check that all Durable Functions tasks on the Staging slot are completed (optional)
  4. Deploy the new app to the Staging slot
  5. Clean up the old task hubs (storage queues and tables) (optional)
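Step 2 (incrementing the TaskHub version) could be sketched like this; the `...V<n>` naming convention is my own assumption:

```python
import re

# Sketch of pipeline step 2: derive the next task hub name by
# incrementing a version suffix. The "NameV<n>" naming convention
# is an assumption, not anything Durable Functions mandates.

def next_task_hub(name):
    match = re.fullmatch(r"(.*?)V(\d+)", name)
    if match:
        return f"{match.group(1)}V{int(match.group(2)) + 1}"
    return name + "V2"  # no version suffix yet: start at V2

print(next_task_hub("OrderHubV3"))  # OrderHubV4
print(next_task_hub("OrderHub"))    # OrderHubV2
```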

This might depend on how frequently you deploy your apps. You also need to consider whether it is acceptable for the third app to destroy the first app's tasks. If it isn't, you can consider a multiple-slots strategy.

If we want to mix the strategies, we need an "if" clause in the pipeline that identifies whether the new app has a breaking change.
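Such an "if" clause might be sketched like this; the signature model (activity name mapped to its parameter list) is a deliberate simplification of what a real check would compare:

```python
# Sketch of the "if" clause: compare activity signatures between the
# deployed version and the new version to decide which strategy to use.
# The signature model (name -> parameter list) is my own simplification.

def is_breaking(old_signatures, new_signatures):
    for name, params in old_signatures.items():
        # Removing an activity or changing its parameters breaks replay.
        if new_signatures.get(name) != params:
            return True
    return False

v1 = {"SayHello": ["city"], "Charge": ["orderId", "amount"]}
v2 = {"SayHello": ["city"], "Charge": ["orderId", "amount", "currency"]}

print(is_breaking(v1, v2))  # True: Charge's signature changed
print(is_breaking(v1, v1))  # False
```

A change to the orchestrator's own logic would also be breaking, which a signature diff alone cannot detect, so a real check would need more than this.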


I've considered four strategies. The breaking change strategy (storage swap or TaskHubs) might be the best one. I'll implement and try these strategies starting tomorrow.

I'd like to share this idea with others and start a discussion with experts using these diagrams. That is why I wrote this post. If you have any comments, please let me know on the following issue. I'm the owner of the issue.