Using Neo4j Graph Database to map your AWS Infrastructure

Nick Doyle
Aug 2, 2018 · 8 min read
S3 bucket access visualization: here you can see just one principal (the main account) having full control (yellow, labeled “FULL_CONTROL”) on all buckets (purple), which are in four regions (green)

Update May 2020: now based on Neo4j 4.0

TL;DR — run a live local webserver with:

docker run \
-d \
--name aws_map_myaccount \
--env NEO4J_AUTH=$NEO4J_AUTH \
--env AWS_TO_NEO4J_LIMIT_REGION=ap-southeast-2 \
--env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
--env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
--env AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
-p 80:7474 \
-p 7687:7687 \
rdkls/aws_infra_map_neo4j

where NEO4J_AUTH is of the format neo4j/mypassword

For a while I’ve been kicking around the idea of combining 2 things:

  • Graph Databases
  • The Cloud

In a certain way, applying things that graph databases are good at to the cloud.
On researching I was surprised to see that nobody else seems to be doing this.

This post is a writeup of a quick hack I made in this area, and thoughts about where it might lead.
I think there’s scope to do some very interesting (and potentially lucrative) work in this area, for those with the time and interest.
Basically what I’ll be covering in the post is this:


Graph & Cloud

Talking to people about graph databases is interesting; many people have no idea what they are, or, beyond that, why they’re useful beyond novelty value.
Often people trying to explain graph feel frustrated, as if others “just don’t get it” (personally I think that yes, some lazy brains just don’t feel like making the effort to think — but we can always improve our communication)

I’ll try to summarize:

  • Despite the name, “relational” databases are NOT good at relationships
  • Graph databases ARE good at relationships

If you’ve ever worked as a DBA, particularly in BI/analysis, to the point where you have fucking had enough of JOINs, I’m sure you see the appeal.

Just to digress on that first point, relationships being (as, or more) important than the data themselves always brings to mind Systems Thinking — but that’s a whole other Thing.

Now the Cloud.

Common challenges I see are:

  • Nontrivial architectures comprising many interrelated components
    (lambdas, ASGs, caches, dbs, containers etc)

My primary goal was this: coming into a new AWS account, if I could run a scraper that jammed all the account’s info into a graph database, it would be a great way to explore and understand that account.

I should also point out that I wanted something better than static analysis of e.g. CloudFormation or Terraform templates; I wanted to know what the real deal was as-deployed, warts and all, ideally with scope for realtime updates.

But there are other potential benefits, such as exploration & analysis on:

  • Data lineage and governance (hello GDPR)

The Beginning — Awless and Google Badwolf

 █████╗ ██╗    ██╗██╗     ███████╗███████╗███████╗
██╔══██╗██║    ██║██║     ██╔════╝██╔════╝██╔════╝
███████║██║ █╗ ██║██║     █████╗  ███████╗███████╗
██╔══██║██║███╗██║██║     ██╔══╝  ╚════██║╚════██║
██║  ██║╚███╔███╔╝███████╗███████╗███████║███████║
╚═╝  ╚═╝ ╚══╝╚══╝ ╚══════╝╚══════╝╚══════╝╚══════╝

After this idea kicked around up there for a while, what really prompted me to have a crack was the excellent tool awless: a supercharged, more-standard and less-painful AWS CLI.
If you do any engineering at all with AWS I highly recommend trying it.

Once I started using it, I noticed that it locally stored details of your AWS environment in a graph database — namely using Google BADWOLF.


Right, so my infrastructure data is already there; great, no need to scrape it all with boto.

It’s in a standard format, right? Wrong. Badwolf gives the user flexibility with the schema, being only “loosely modeled” on the W3C RDF standard. As far as my limited graph knowledge goes, some reasons for this might be the temporal aspect, as well as meta-modeling (relating relationships) … and also “just because” … UFO built by insane space aliens …

So I had a look at the data files, and despite not being totally 100% RDF-compliant, they weren’t far off. Unfortunately they did differ for each resource type/file, so standardising has to be done for each type of resource, but it’s not too bad.
An example is in the “infrastructure.nt”:

<i-0a74e88bee42b4bb5> <cloud:launched> "2018-06-27T23:39:31Z"^^<xsd:dateTime> .
  1. Problem: The resource identifiers just don’t have URI schemes
     Solution: Just prepend resource://

Changing such to:

<resource://i-0a74e88bee42b4bb5> <cloud:launched> "2018-06-27T23:39:31Z" .

Made it valid RDF and able to be loaded in the next step.

(the code I put together for this is in “awless_to_neo.py” in the source repo for this post)
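A minimal Python sketch of that normalisation (the regexes here are illustrative, not the exact ones in awless_to_neo.py):

```python
import re

def normalize_triple(line: str) -> str:
    """Rewrite one awless N-Triples line into loadable RDF.

    Two illustrative fixes:
    1. drop typed-literal annotations like ^^<xsd:dateTime>, since
       their datatype IRIs aren't absolute;
    2. prepend resource:// to any IRI with no scheme, e.g. <i-0a74...>.
    """
    # Remove datatype annotations such as ^^<xsd:dateTime>
    line = re.sub(r'\^\^<[^>]*>', '', line)
    # Add a resource:// scheme to IRIs that lack one
    line = re.sub(r'<(?![A-Za-z][A-Za-z0-9+.-]*:)([^>]+)>',
                  r'<resource://\1>', line)
    return line

triple = '<i-0a74e88bee42b4bb5> <cloud:launched> "2018-06-27T23:39:31Z"^^<xsd:dateTime> .'
print(normalize_triple(triple))
# <resource://i-0a74e88bee42b4bb5> <cloud:launched> "2018-06-27T23:39:31Z" .
```

Predicates like <cloud:launched> already have a scheme, so the negative lookahead leaves them alone.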

Loading to Neo4j


When I first realized I had RDF-like data, I looked round for ways to easily load that.

And surprise, surprise, this guy Jesús Barrasa has already written Neo4j modules to make this possible.

(I say surprise surprise because only a couple months back I became quite familiar with his thoughts on ontologies in neo4j, when I had to do some data lineage and fault finding for a consulting gig — but that’s another story)

They worked pretty well.

I found I did need to hack around some fields / labels after import to make the display nicer (this code also in “awless_to_neo.py” at end of post).
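For flavour, that tidy-up is mostly string munging on type IRIs to get nice node labels; a hypothetical example (the exact IRIs and field names awless emits are in the repo):

```python
def display_label(type_iri: str) -> str:
    """Derive a tidy Neo4j node label (e.g. 'Instance') from an RDF
    type IRI such as 'resource://cloud-owl:Instance'.
    The IRI shapes here are assumptions, not awless's exact output.
    """
    # Keep the part after the last '/' and ':', then capitalise it
    tail = type_iri.rstrip('/').split('/')[-1].split(':')[-1]
    return tail[:1].upper() + tail[1:]

print(display_label('resource://cloud-owl:Instance'))  # Instance
print(display_label('cloud-owl:subnet'))               # Subnet
```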

All good.

I hope that right now you are, just as I was, itching to take awless and the rdf import modules right out of the picture, and insert directly into Neo4j with one beautiful script. Yes. That would be a lovely improvement for future work on this.
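As a rough sketch of what that might look like (the graph model and names below are my assumptions, not something from the repo): page through boto3 and emit idempotent Cypher MERGE statements straight into Neo4j.

```python
def instance_to_cypher(instance: dict) -> str:
    """Render one EC2 instance description (the shape returned by
    boto3's describe_instances) as an idempotent Cypher statement.
    The :Instance / :VPC model here is purely for illustration.
    """
    iid = instance["InstanceId"]
    vpc = instance["VpcId"]
    return (
        f"MERGE (i:Instance {{id: '{iid}'}}) "
        f"MERGE (v:VPC {{id: '{vpc}'}}) "
        "MERGE (i)-[:IN_VPC]->(v)"
    )

# With credentials and a Neo4j driver session you would then do:
#   import boto3
#   ec2 = boto3.client("ec2")
#   for page in ec2.get_paginator("describe_instances").paginate():
#       for reservation in page["Reservations"]:
#           for inst in reservation["Instances"]:
#               session.run(instance_to_cypher(inst))
```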

Packaging — Docker

Having hacked the basics together I wanted a way to hand off to someone to try out.

I could tell them to set up neo4j, install the script etc etc etc

Or I could just chuck it all in a docker container, so I did that.

Source is on GitHub, with a built Docker image on Docker Hub.

docker run \
-d \
--name aws_map_myaccount \
--env NEO4J_AUTH=$NEO4J_AUTH \
--env AWS_TO_NEO4J_LIMIT_REGION=ap-southeast-2 \
--env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
--env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
--env AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
-p 80:7474 \
-p 7687:7687 \
rdkls/aws_infra_map_neo4j

(variables substituted per your environment) and you’ll get your own account mapped and available on port 80.

Some Results

Here you can see regions in dark green. Most resources concentrated in a primary, then a secondary, with some (probably unintentional/orphaned) resources in 2 others.
Which VPCs (green) have Subnets (blue) in which AZs (red), in which Regions (purple)

Integrate with other frontends

Another cool thing you can do once you have the data is point other tools such as Linkurious, or the new (but basic) (but free!) GraphXR at it.

I think GraphXR is looking pretty cool. A bit clunky, but it looks sweet, and I’m loving the 3d effect. IMO this sort of interface, “just like in the movies” where we’re exploring data in 3d, will have big benefits in future. Here’s an example of pointing GraphXR at my localhost: basically you go there, create a new project, point it at your local Neo4j instance, “configure search index” and run some sort of search to get things showing up.

Actual AWS Account Infra data loaded into GraphXR. The large aggregation is the VPC, with subnets etc coming out of it. On the left is another VPC, with its own Lambdas (green) and Security Groups (red)

Environment Variables

  • NEO4J_AUTH: credentials for the Neo4j instance, in the format neo4j/mypassword
  • AWS_TO_NEO4J_LIMIT_REGION: optionally limit the scrape to a single region
  • AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN: your AWS credentials

Code

Code for the work in this post can be found on GitHub.
A built Docker image, including Neo4j community edition, awless and related utils, is available on Docker Hub.

Possible future work

Better / custom graph layouts needed
Possibly using Sigma.js
In my trials, Dagre (with the default NetworkSimplex ranker) gave a decent layout, but to get the best results I think I’d need to stub in a static hierarchy, i.e. be able to specify Account > VPC > Subnet etc.
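The ranking logic I have in mind could be as simple as this (Dagre itself is JavaScript; this Python sketch just shows the idea, and the hierarchy itself is an assumption):

```python
# Assumed containment hierarchy, top layer to bottom layer
HIERARCHY = ["Account", "Region", "AvailabilityZone", "VPC", "Subnet", "Instance"]

def layout_rank(label: str) -> int:
    """Rank a node label for a layered (Dagre-style) layout;
    unknown labels sink to the bottom layer."""
    try:
        return HIERARCHY.index(label)
    except ValueError:
        return len(HIERARCHY)

print(layout_rank("VPC"))     # 3
print(layout_rank("Lambda"))  # 6
```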


Dagre with default NetworkSimplexRanker
Getting close, but not as good as I’d like

Elasticsearch + Kibana plugin to index + explore AWS accounts

Relate to other log / data indexed in ES for drilldown

Opportunity to commercialize this — particularly when tied into existing ELK data — machine logs, VPC flow firewall etc

Other graph engines

Janusgraph/gremlin server backed by dynamodb — still requires ec2/docker long-running gremlin/janus api

AWS Neptune — requires minimum db.r4.large @ $0.348/hr

IBM Cloud — run on free K8s cluster with janus backend


References & Further Reading

JESÚS BARRASA — Importing RDF data into Neo4j

Andy Robbins writes about BloodHound, using graph to automate attack — and defense — paths in an Active Directory environment

Dagre layout

Dagre bindings for sigma.js

Awless: A Mighty CLI for AWS

Chromeo — ‘Bonafied Lovin’ Chromeo

That’s It!

If you’ve read this far thanks, hope you found it interesting.
If you have any suggestions, insights, or karaaaaazy konspiracy theories feel free to let me know in the comments, and I will feel free to keep on insulting you. Until next time.
—— — — — — — — —— brrrzzzzzttt —— — — — — — — — -

Nick Doyle

Written by

Computer Scientist. Agile Enthusiast. Past lives include Perl Hacker, Web Developer, DBA, Tech Lead, Motorcycle Instructor, Forensic Data Analyst, & Cloud Guy

