Chaos Engineering for Amazon SNS (Simple Notification Service) with Gremlin

Tammy Butow
Published in Chaos Engineering
Sep 17, 2020

Gremlin is a simple, safe, and secure service for performing Chaos Engineering experiments through a SaaS-based platform. Amazon Simple Notification Service (SNS) provides fully managed pub/sub messaging, SMS, email, and mobile push notifications. Datadog is a monitoring service for cloud-scale applications, providing monitoring of servers, databases, tools, and services, through a SaaS-based data analytics platform.

This tutorial shows:

  • How to create a Gremlin Scenario for Amazon SNS
  • How to measure the impact/results of your Gremlin Scenario

Chaos Engineering Hypothesis

For the purposes of this tutorial, we will run a Gremlin Scenario on Amazon SNS. We will focus on network-related Chaos Engineering attacks.

When Amazon Simple Notification Service (SNS) is unreachable from our API servers, we cannot dispatch events to our event pipeline. As a result, some Tier2/Tier3 services (e.g. Slack Integration, Mixpanel, Salesforce, Datadog) will respond slowly or stop working altogether.

When a single zone is targeted, we can expect some failures of the mentioned services, but not 100% failure.
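The hypothesis can be written down as a check before running the experiment. This is a minimal sketch: the function name, the way dispatch attempts are counted, and the sample numbers are all hypothetical and not part of Gremlin or SNS.

```python
def evaluate_hypothesis(failed_dispatches, total_dispatches):
    """Classify an observed failure rate against the single-zone hypothesis."""
    if total_dispatches <= 0:
        raise ValueError("total_dispatches must be positive")
    rate = failed_dispatches / total_dispatches
    if rate == 0:
        return "no impact observed"
    if rate < 1:
        # Some, but not all, dispatches failed: this is what the hypothesis predicts.
        return "partial failure, hypothesis holds"
    return "total failure, hypothesis rejected"


# Hypothetical counts collected while the attack is running.
print(evaluate_hypothesis(failed_dispatches=40, total_dispatches=120))
```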

Failure Scenario: SNS unreachable from API Server

Gremlin Attack Scope (Blast Radius)

Target: Tag service: api, [zone:us-west-1a], Random 100%

Gremlin: Network > Blackhole

Attack Parameters: Length 60s (1m) → 300s (5m), Hostname sns.us-west-1.amazonaws.com
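The two stages differ only in length, so their blackhole arguments can be generated from one helper. This is a sketch in the `cli_args` shape Gremlin uses in scenario JSON. The helper name is hypothetical, and joining an excluded hostname (prefixed with `^`) and a targeted hostname with a comma is an assumption to verify against your Gremlin version.

```python
def blackhole_args(length_s, hostnames, excluded_hostnames=("api.gremlin.com",)):
    """Build a cli_args-style list for a Gremlin blackhole attack."""
    if length_s <= 0:
        raise ValueError("length_s must be positive")
    # A leading ^ marks a hostname as excluded from the attack.
    hosts = [f"^{h}" for h in excluded_hostnames] + list(hostnames)
    return ["blackhole", "-l", str(length_s), "-h", ",".join(hosts)]


SNS_HOST = "sns.us-west-1.amazonaws.com"

# Stage 1 runs for 60s, stage 2 for 300s, as in the attack parameters above.
stages = [blackhole_args(length, [SNS_HOST]) for length in (60, 300)]
for args in stages:
    print(" ".join(args))
```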

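Would you like to run this scenario yourself? Here’s the JSON for the 60s stage. Note that this export selects containers by a Kubernetes label and lists only the exclusions ^api.gremlin.com and ^53 (a leading ^ excludes a hostname or port), so adjust the target and add the SNS hostname to match the scope above: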

{
  "name": "SNS unreachable from API Server",
  "description": "When Amazon Simple Notification Service (SNS) is unreachable from our API servers, we are not able to dispatch events to our event pipeline. As a result, some Tier2/Tier3 services will behave slowly or not at all (e.g. Slack Integration, Mixpanel, Salesforce, Datadog). When a single zone is targeted, we can expect some failures of the mentioned services, but not 100% failure.",
  "recommended_scenario_id": "",
  "graph": {
    "nodes": {
      "0": {
        "target_definition": {
          "target_type": "Container",
          "strategy_type": "Random",
          "strategy": {
            "attrs": {
              "multiSelectLabels": {
                "annotation.kubernetes.io/config.source": [
                  "api"
                ]
              }
            },
            "type": "RandomPercent",
            "percentage": 100
          }
        },
        "impact_definition": {
          "infra_command_type": "blackhole",
          "infra_command_args": {
            "cli_args": [
              "blackhole",
              "-l",
              "60",
              "-h",
              "^api.gremlin.com",
              "-p",
              "^53"
            ],
            "providers": [],
            "type": "blackhole"
          }
        },
        "type": "InfraAttack",
        "id": "0"
      }
    },
    "start_id": "0"
  }
}
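The JSON above carries a 60s length, while the second stage of the experiment runs for 300s. One way to derive the longer stage is to copy the scenario and rewrite the value after `-l`. This is a sketch only: `with_length` is a hypothetical helper, and the dictionary is a trimmed stand-in holding just the fields the helper touches.

```python
import copy
import json


def with_length(scenario, seconds):
    """Return a copy of a scenario dict whose attack nodes run for `seconds`."""
    updated = copy.deepcopy(scenario)
    for node in updated["graph"]["nodes"].values():
        args = node["impact_definition"]["infra_command_args"]["cli_args"]
        # The value that follows "-l" is the attack length in seconds.
        args[args.index("-l") + 1] = str(seconds)
    return updated


# Trimmed stand-in for the scenario JSON above.
scenario_60s = {
    "name": "SNS unreachable from API Server",
    "graph": {
        "nodes": {
            "0": {
                "impact_definition": {
                    "infra_command_type": "blackhole",
                    "infra_command_args": {
                        "cli_args": ["blackhole", "-l", "60", "-h", "^api.gremlin.com", "-p", "^53"],
                        "type": "blackhole",
                    },
                },
                "type": "InfraAttack",
                "id": "0",
            }
        },
        "start_id": "0",
    },
}

scenario_300s = with_length(scenario_60s, 300)
new_args = scenario_300s["graph"]["nodes"]["0"]["impact_definition"]["infra_command_args"]["cli_args"]
print(json.dumps(new_args))
```

The original dictionary is left untouched, so both stages can be kept side by side.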

Results of this Gremlin Scenario

Observations

  • The first experiment (60s) passed.
  • The second experiment (300s) did not pass.

Follow-up Actions

Improve SNS monitoring by adding more monitors in Datadog: https://app.datadoghq.com/monitors/
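As one possible starting point, the sketch below builds a monitor definition that alerts when SNS publish volume drops. It only constructs the payload that could be sent to Datadog's monitor API and makes no network call. The metric name is assumed to be the one reported by Datadog's Amazon SNS integration. The region tag, threshold, and notification handle are hypothetical placeholders to replace with your own values.

```python
import json


def sns_publish_monitor(region, min_published, notify_handle):
    """Build a Datadog monitor payload that fires when SNS publishes drop."""
    query = (
        "sum(last_5m):sum:aws.sns.number_of_messages_published"
        f"{{region:{region}}}.as_count() < {min_published}"
    )
    return {
        "name": f"SNS publish volume dropped in {region}",
        "type": "metric alert",
        "query": query,
        "message": f"Fewer than {min_published} SNS publishes in 5 minutes. {notify_handle}",
        "options": {
            # The critical threshold must match the value used in the query.
            "thresholds": {"critical": min_published},
            # Also alert if the metric stops reporting altogether.
            "notify_no_data": True,
            "no_data_timeframe": 10,
        },
    }


payload = sns_publish_monitor("us-west-1", 1, "@hypothetical-oncall")
print(json.dumps(payload, indent=2))
```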

Expected: N

Detected: N

Handled: N

Ready to Automate: N
