Chaos Engineering for Amazon SNS (Simple Notification Service) with Gremlin
Gremlin is a simple, safe, and secure service for performing Chaos Engineering experiments through a SaaS-based platform. Amazon Simple Notification Service (SNS) provides fully managed pub/sub messaging, SMS, email, and mobile push notifications. Datadog is a monitoring service for cloud-scale applications, providing monitoring of servers, databases, tools, and services, through a SaaS-based data analytics platform.
This tutorial shows:
- How to create a Gremlin Scenario for Amazon SNS
- How to measure the impact/results of your Gremlin Scenario
Chaos Engineering Hypothesis
For the purposes of this tutorial, we will run a Gremlin Scenario on Amazon SNS. We will focus on network-related Chaos Engineering attacks.
When Amazon Simple Notification Service (SNS) is unreachable from our API servers, we are not able to dispatch events to our event pipeline. As a result, some Tier2/Tier3 services (e.g. Slack Integration, Mixpanel, Salesforce, Datadog) will respond slowly or not at all.
When a single zone is targeted, we can expect some failures of the mentioned services, but not 100% failure.
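The single-zone expectation can be written down as a check before running the experiment. The following is a minimal sketch, assuming you can count attempted and failed event dispatches for the affected services during the attack window; the function name and its inputs are hypothetical and not part of Gremlin or SNS.

```python
def partial_failure_hypothesis_holds(attempted: int, failed: int) -> bool:
    """Return True when some, but not all, dispatch attempts failed.

    Hypothetical helper: `attempted` and `failed` are counts you collect
    yourself (for example from application logs) during the attack window.
    """
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    if not 0 <= failed <= attempted:
        raise ValueError("failed must be between 0 and attempted")
    failure_rate = failed / attempted
    # The hypothesis predicts failures strictly between 0% and 100%.
    return 0.0 < failure_rate < 1.0


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from this tutorial.
    print(partial_failure_hypothesis_holds(attempted=200, failed=60))
```

A result of False with zero failures suggests the attack did not reach the dispatch path; a result of False with 100% failure suggests the blast radius was wider than one zone.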
Failure Scenario: SNS unreachable from API Server
Gremlin Attack Scope (Blast Radius)
Target: Tag service: api, [zone:us-west-1a], Random 100%
Gremlin: Network > Blackhole
Attack Parameters: Length 60s (1m) for the first run, then 300s (5m) for the second run; Hostname sns.us-west-1.amazonaws.com
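To confirm from a targeted host that the blackhole is actually in effect, a simple TCP probe against the SNS endpoint can help. This is a rough sketch, not part of Gremlin: the function name, the port (443), and the timeout are assumptions, and the `connect` parameter exists only so the probe can be exercised without network access.

```python
import socket

SNS_HOST = "sns.us-west-1.amazonaws.com"  # hostname from the attack parameters


def is_reachable(host, port=443, timeout=3.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        conn = connect((host, port), timeout)
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution errors.
        return False
    conn.close()
    return True


if __name__ == "__main__":
    state = "reachable" if is_reachable(SNS_HOST) else "unreachable"
    print(f"{SNS_HOST} is {state}")
```

Run it on an API server before, during, and after the attack; during the attack window the probe should report the endpoint as unreachable.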
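Would you like to run this scenario yourself? Here’s the JSON. It reflects the 60-second run; edit the target and the blackhole arguments to match your own environment.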
{
  "name": "SNS unreachable from API Server",
  "description": "When Amazon Simple Notification Service (SNS) is unreachable from our API servers, we are not able to dispatch events to our event pipeline. As a result, some Tier2/Tier3 services will behave slowly or not at all (e.g. Slack Integration, Mixpanel, Salesforce, Datadog). When a single zone is targeted, we can expect some failures of the mentioned services, but not 100% failure.",
  "recommended_scenario_id": "",
  "graph": {
    "nodes": {
      "0": {
        "target_definition": {
          "target_type": "Container",
          "strategy_type": "Random",
          "strategy": {
            "attrs": {
              "multiSelectLabels": {
                "annotation.kubernetes.io/config.source": [
                  "api"
                ]
              }
            },
            "type": "RandomPercent",
            "percentage": 100
          }
        },
        "impact_definition": {
          "infra_command_type": "blackhole",
          "infra_command_args": {
            "cli_args": [
              "blackhole",
              "-l",
              "60",
              "-h",
              "^api.gremlin.com",
              "-p",
              "^53"
            ],
            "providers": [],
            "type": "blackhole"
          }
        },
        "type": "InfraAttack",
        "id": "0"
      }
    },
    "start_id": "0"
  }
}
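Because the experiment is run twice with different lengths, it can be convenient to derive the second run from the same JSON rather than editing it by hand. The sketch below assumes only the structure shown above (a `cli_args` list in which the value after `"-l"` is the length in seconds); the helper name is hypothetical, and the embedded scenario is trimmed to the fields the helper touches.

```python
import copy
import json


def with_attack_length(scenario: dict, node_id: str, seconds: int) -> dict:
    """Return a copy of `scenario` with the node's "-l" (length) value replaced."""
    updated = copy.deepcopy(scenario)
    node = updated["graph"]["nodes"][node_id]
    args = node["impact_definition"]["infra_command_args"]["cli_args"]
    flag_index = args.index("-l")  # raises ValueError if there is no length flag
    args[flag_index + 1] = str(seconds)
    return updated


# Trimmed-down scenario with the same shape as the JSON above.
scenario = {
    "name": "SNS unreachable from API Server",
    "graph": {
        "nodes": {
            "0": {
                "impact_definition": {
                    "infra_command_type": "blackhole",
                    "infra_command_args": {
                        "cli_args": [
                            "blackhole", "-l", "60",
                            "-h", "^api.gremlin.com",
                            "-p", "^53",
                        ],
                        "providers": [],
                        "type": "blackhole",
                    },
                },
                "type": "InfraAttack",
                "id": "0",
            }
        },
        "start_id": "0",
    },
}

# Derive the 300-second variant without mutating the original scenario.
five_minute_run = with_attack_length(scenario, "0", 300)

if __name__ == "__main__":
    print(json.dumps(five_minute_run["graph"]["nodes"]["0"]["impact_definition"], indent=2))
```

Working on a deep copy keeps the 60-second definition intact, so both variants can be stored side by side.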
Results of this Gremlin Scenario
Observations
- Passed the first experiment (60s).
- Did not pass the second experiment (300s).
Follow up Actions
- Improve SNS monitoring by adding more monitors in Datadog: https://app.datadoghq.com/monitors/
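As a starting point for such a monitor, the sketch below builds a Datadog metric-monitor definition that alerts when the number of messages published to SNS drops. It only constructs and prints the payload; nothing is sent to Datadog. The metric name `aws.sns.number_of_messages_published`, the evaluation window, the threshold, and the notification handle are all assumptions to verify against your own Datadog account and traffic levels.

```python
import json


def build_sns_publish_drop_monitor(min_published: int = 1, window: str = "last_5m") -> dict:
    """Build a monitor definition that fires when published messages fall below a floor.

    Assumptions: the metric name and threshold are illustrative and must be
    checked against the metrics your Datadog AWS integration actually reports.
    """
    query = (
        f"sum({window}):"
        "sum:aws.sns.number_of_messages_published{*}.as_count()"
        f" < {min_published}"
    )
    return {
        "name": "SNS publish volume dropped",
        "type": "metric alert",
        "query": query,
        # Hypothetical notification handle; replace with your own.
        "message": "SNS publish volume is below the expected floor. @your-team",
        "tags": ["service:api", "chaos:sns-blackhole"],
        "options": {"thresholds": {"critical": min_published}},
    }


if __name__ == "__main__":
    print(json.dumps(build_sns_publish_drop_monitor(), indent=2))
```

With only one zone targeted, publish volume is expected to fall rather than reach zero, so the floor would need tuning against normal traffic before the monitor is useful.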
Expected: N
Detected: N
Handled: N
Ready to Automate: N