Hacking on the cloud: an effective and cost-optimized approach

Tran Quoc Viet
Altitude
Feb 7, 2022 · 12 min read

A few words before we start: this article isn’t about hacking, so you don’t have to worry about not having security-related skills to keep reading. It only takes the Reconnaissance and Scanning phases of hacking, which are usually very time consuming, as a model for the main subject: designing an effective cloud solution while keeping its cost optimized.

Also, hacking without permission is illegal. The only legitimate purpose of hacking is to secure networks: you think like a hacker in order to defend against one.

In case you didn’t know, Altitude is a fully cloud-integrated platform that simplifies the process of building smarter accommodation, and since we handle both our partners’ and our customers’ data, security is one of our foremost concerns. Here at Altitude we always remind ourselves of this:

A breached system is a dead system

to remind ourselves of how important our customers’ data is. They put their trust in us, and we must respond in kind.

Who am I?

My name is Tran Quoc Viet, and I currently work as a Technical Advisor at Altitude. Since Altitude’s biggest concern is the safety of its customers’ data, we built a security team from the early days, and I had the honor of being one of the team’s first members.

During my time with the team, I developed a strong interest in security, and since then I have been working as an ethical hacker for Altitude and her partners in my spare time. My task is to look for vulnerabilities in our networks and report them.

With all the security skills and tools built up over the years, and as Altitude’s cloud solution architect, I decided one day to move all those tools into the cloud, as my machine had started to take weeks to finish its given tasks. My choice was of course AWS, since Altitude also runs on it.

AWS cloud services

Problems with cloud-based hacking tools

My naive old self back then thought the solution was simply to get some AWS EC2 instances, move my tools over (dockerized and ready to run anywhere) and call it a day. But reality isn’t that simple: my tools need a lot of CPU and RAM, either one or both, so after doing the math, a few high-tier EC2 instances were required, and even with reserved instances it would have cost me around $250/month. And that was the cost of the EC2 instances alone, so, poor as I was, I gave up on that solution.

Cloud solutions cost a lot if you don’t pay attention to what you use

There was something else I noticed: the inconsistent usage. Below are the CPU utilization charts of Altitude and of my tools:

My Tools CPU utilization chart
Altitude CPU utilization chart

The problem is pretty obvious: Altitude is a huge system used by many users across the world, so it must have high availability and its usage always stays high. My tools, on the other hand, sit mostly idle, except when I run them against a target.

Having noticed that difference, I gave up on trying to imitate Altitude’s cloud structure and instead designed a completely different approach.

Before a cloud solution architect can design a solution, they must first understand the problem and the goal of the target system. That is what we will try to do in this section.

The 5 steps of ethical hacking

5 steps of ethical hacking (credit: itperfection.com)

There are a lot of models for ethical hacking, but the 5-step model is probably one of the most popular. Since hacking isn’t this article’s main subject, let’s quickly go over all five.

Reconnaissance

In this phase hackers try to gather as much information about the target as they can. This is the most important phase of the five.

If you know the enemy and know yourself, you need not fear the result of a hundred battles.
-Sun Tzu-

The more information a hacker can gather, the better chance of success they have. There is no such thing as “too much information” in this phase; they will try to gather everything, no matter how trivial. A few examples (a small scripted illustration follows the list):

  • IP ranges and IP-related information such as whois records
  • Domains, subdomains, and domain owner information
  • Domain records such as A, TXT, CNAME, DKIM, …
  • Emails, phone numbers, names, addresses, keywords, …
  • TCP and UDP services
  • and a lot more
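To give a concrete feel for how trivial (but tedious) this data gathering is, here is a minimal Python sketch of the kind of lookups a recon script automates. It is only a sketch under my own assumptions: it wraps the local `whois` and `dig` binaries, and the exact tools you automate will of course vary.

```python
import socket
import subprocess

def basic_recon(domain: str) -> dict:
    """Collect a few trivial data points about a target domain."""
    info = {"domain": domain}
    try:
        info["ip"] = socket.gethostbyname(domain)  # resolve the domain to an IP
    except socket.gaierror:
        info["ip"] = None
    # Shell out to the local `whois` and `dig` binaries, if they are installed
    for key, cmd in {"whois": ["whois", domain],
                     "txt_records": ["dig", "+short", "TXT", domain]}.items():
        try:
            info[key] = subprocess.run(cmd, capture_output=True,
                                       text=True, timeout=30).stdout
        except (FileNotFoundError, subprocess.TimeoutExpired):
            info[key] = None
    return info

print(basic_recon("example.com"))
```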

Scanning

With the information gathered during the Reconnaissance phase, the hacker scans the network for specific information. For example, with the collected IPs, they can scan for open ports, vulnerabilities, and anything else associated with those IPs.

From my experience, most of the Scanning phase boils down to running scanning tools on the input from the Reconnaissance phase. Of course, to get good results you must know which tools to run on which data. A good hacker also has their own or personally tuned tools, which set them apart from the rest.
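As an illustration, a “port scanning” stage is often little more than wrapping Nmap with the right flags and feeding it the IPs collected earlier. A rough sketch (the target IPs here are just documentation placeholders):

```python
import subprocess

def run_nmap(ip: str, out_file: str) -> None:
    """Service/version scan of all TCP ports, saved as a normal-format Nmap report."""
    # -sV: probe open ports for service/version info
    # -p-: scan all 65535 TCP ports
    # -oN: write the report in Nmap's normal output format
    subprocess.run(["nmap", "-sV", "-p-", "-oN", out_file, ip], check=True)

# Feed in the IPs collected during the Reconnaissance phase
for ip in ["203.0.113.10", "203.0.113.11"]:
    run_nmap(ip, f"nmap_{ip}.txt")
```

A full-port version scan like this against a single host can already run for hours, which is exactly why it shouldn’t block anything else in the toolbox.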

One important thing to note: the Scanning phase usually takes a VERY LONG TIME, even after narrowing down the target surface.

Scanning might take DAYS to finish

Gaining Access

This phase is when the attack actually starts. With all the information from the Recon and Scanning phases, the hacker is able to draft an attack plan and execute it.

Maintaining Access

Once they have successfully gained access to the network, the hacker will want to keep that access for future exploitation and attacks. There are many ways to do this, such as creating a backdoor or taking over an administrator account inside the network.

Clearing Tracks

Remove anything that might indicate that the network has been breached, or anything that might lead back to the hacker: logs, sent emails, generated files, bash history, …

The goal: an automated hacking toolbox

It was pretty fun at first, but then I realized I was repeating pretty much the same steps over and over with each new target, especially during the Recon and Scanning phases. So I decided to automate the process.

My aim was to build a cloud hacking toolbox which, after accepting the target information and my attack scenario config as input, can automatically perform the Reconnaissance and Scanning phases using all my tools. The time-consuming steps run automatically, while I can go live my life until the notification arrives.

And that is the goal we want to achieve.

The problems

The inconsistent usage: I already mentioned this above, so I won’t go over it again. The objective is to cut the cloud cost while the toolbox is idle.

Long-running processes: some tools take hours, or even days, to finish.

High customization: no hacker runs all of his tools against every target. He must pick which ones to run depending on the target’s characteristics; for example, it makes no sense to run WPScan against a non-WordPress target. The toolbox must be flexible enough to fit any attack plan.
And as a hacker continues with his work, he will employ more tools, so the process of adding a new tool should also be easy.

Designing a solution

Now, with the goal and the problems identified, it’s time to get our hands dirty. The first task is to design the automation flow for the Recon and Scanning phases.

Automation recon + scanning flow

As two of our main problems are long-running processes and customization, a queue-based architecture is highly favored. An event-driven architecture is also a very good option, but the price of AWS’s managed Kafka service (Amazon MSK) is too high for this project.

First, let’s go over all the actors we see in the image:

  • Command center: accepts and logs commands from the user, and generates the action stages for each command.
  • Action stages: each stage represents one hacking action. For example, the Recon phase consists of stages such as scanning for subdomains, probing subdomains, and so on. Each stage has its corresponding tool (a sample stage message follows this list).
  • Action queue: you can think of it as a message box: you put in a message, and later someone takes it out, reads it and does what it says. ActiveMQ, RabbitMQ, … are a few popular examples.
  • Artifact: a ZIP/RAR file containing all the files, documents and reports generated by a tool. Since some tools generate A LOT of files, it is better to have a central storage for them, which I call the Artifact Storage.
  • Recon and Scan services: the Recon and Scan phases actually consist of many services, but to keep things simple let’s assume they are a single service.
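To make the “action stage” idea concrete, here is the rough shape of a single stage message as I imagine it on the queue. The field names are my own invention, not any AWS or ActiveMQ standard:

```python
# Hypothetical action-stage message; the command center produces these and the
# Recon and Scan services consume them from the Action Queue.
stage_message = {
    "command_id": "cmd-0042",        # which user command this stage belongs to
    "stage": "probe_subdomains",     # which hacking action to perform
    "tool": "httpx",                 # which tool the worker should run
    "target": "example.com",
    "input_artifact": "s3://artifact-storage/cmd-0042/subdomains.zip",
    "options": {"threads": 50, "timeout": 10},
}
```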

With the actors in place, our automation flow is as follows (a simplified worker sketch follows the list):

  • The hacker triggers the flow by sending a command to the command center.
  • The command center generates the action stages, then sends the first one to the action queue.
  • When a message is put into the queue, the Recon and Scan service picks it up, parses it to get the stage information and uses the corresponding tool to process the stage. After the tool finishes its job, an artifact is generated and uploaded to the Artifact Storage.
  • Once the tool finishes its task, the stage is considered done and the service informs the command center. The next action stage is generated, put on the queue and processed by the Recon and Scan service. This continues until the command center decides there are no action stages left.
  • Once all the action stages are finished, the user is notified via email/SMS. Since Recon and Scanning usually take a lot of time, a notification is required (you don’t want to check the status every few hours, do you?).
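Put together, the consumer side of that flow can be sketched roughly like this. It assumes the message has already been pulled off the queue; `notify_command_center`, the `subfinder` stage and the `artifact-storage` bucket are just placeholders for whatever tools and storage you actually use.

```python
import json
import subprocess
import boto3

s3 = boto3.client("s3")

def process_stage(message_body: str) -> None:
    """Handle one action-stage message taken from the Action Queue."""
    stage = json.loads(message_body)
    report = f"/tmp/{stage['command_id']}_{stage['stage']}.txt"

    # 1. Run the tool that corresponds to this stage (subfinder is just an example)
    subprocess.run(["subfinder", "-d", stage["target"], "-o", report], check=True)

    # 2. Upload the generated artifact to the Artifact Storage (an S3 bucket here)
    s3.upload_file(report, "artifact-storage",
                   f"{stage['command_id']}/{stage['stage']}.txt")

    # 3. Tell the command center the stage is done so it can queue the next one
    notify_command_center(stage["command_id"], stage["stage"])  # placeholder
```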

This queue-based architecture helps us solve two of our problems:

  • Long-running processes: with a queue-based solution, no service has to wait for a response from another. For example, the command center doesn’t have to open a processing thread to call the Nmap service and keep that thread open for 2 hours (or more) until Nmap finishes its task. Instead, the command center simply drops a message into the queue and calls it a day until Nmap contacts it later with the scan result. No more wasted resources!
  • High customization: each action stage accepts a queue message as input and produces an artifact as output, so there is almost no dependency between stages (except when an artifact is used by another stage, but that is just a matter of ordering). This way you can easily add a new action stage in the future without worrying that it might break the flow (see the sketch below).
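On the command center side, “highly customizable” mostly means that an attack scenario is just an ordered list of stage names, so adding a new tool is adding one entry. A hypothetical sketch (`send_to_action_queue` stands in for the real queue client):

```python
# Hypothetical command-center logic: turn one user command into ordered stages.
STAGE_TEMPLATES = {
    "full_recon": ["scan_subdomains", "probe_subdomains", "port_scan", "screenshot"],
    "wordpress":  ["scan_subdomains", "port_scan", "wpscan"],  # only for WordPress targets
}

def generate_stages(command_id, target, scenario):
    return [{"command_id": command_id, "stage": name, "target": target}
            for name in STAGE_TEMPLATES[scenario]]

stages = generate_stages("cmd-0042", "example.com", "full_recon")
send_to_action_queue(stages[0])  # placeholder: only the first stage is queued
```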

The final problem: inconsistent usage

If you ask 100 system architects how to deal with inconsistent usage, all of them will give you the same answer: scalability.

Sample of scalability

In short, it is the measure of a system’s ability to increase or decrease in performance and cost in response to changes in application and system processing demands.

Take Altitude as an example: during a normal period, we have ten instances hosting all our services, but during the holiday period the number of requests spikes to 2 or 3 times the normal load. Thanks to Altitude’s scalability, the system automatically increases the number of instances to match the high demand without us having to do anything.

As for our toolbox, the difference between the low-demand and high-demand periods is even more pronounced, because when you aren’t performing any hacking task, the system does almost nothing.

So how can we achieve scalability for our solution?

Two categories of services

The hard part of scalability is selecting which targets to scale. Thanks to the queue-based architecture we employed, it won’t be too hard.

Our hacking toolbox consists of many services, but in general you can divide them into 2 categories:

  • High-processing services: tools which consume a lot of CPU and RAM and take a very long time to finish, for example the Nmap service which scans for open ports. Except when there are tasks for them in the Action Queue, they do absolutely nothing.
  • API-related services: the command center, artifact storage and notification service belong to this category. They don’t require many resources, but they should always be in a ready state, waiting for requests from the user or from other services, and they should have short response times.

Our scalability strategy

With our services divided so cleanly, it is quite easy to decide on a scalability strategy.

High-processing services are put into scalable instances, which are booted up whenever there is a new task in the Action Queue and terminated when there are no more tasks. Amazon has a perfect fit for this: the AWS Auto Scaling service.

AWS Auto Scaling
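As a sketch of how that strategy maps onto AWS Auto Scaling, the two scaling policies below (using boto3, with a made-up group name `nmap-workers`) add a worker when messages are waiting and drop the group to zero when the queue stays empty; the alarms that trigger them are sketched in the infrastructure section further down.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Scale out: add one worker instance when the "queue has messages" alarm fires
scale_out = autoscaling.put_scaling_policy(
    AutoScalingGroupName="nmap-workers",
    PolicyName="scale-out-on-queue-messages",
    PolicyType="SimpleScaling",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# Scale in: shut the whole group down when the queue has been empty for a while
scale_in = autoscaling.put_scaling_policy(
    AutoScalingGroupName="nmap-workers",
    PolicyName="scale-in-on-empty-queue",
    PolicyType="SimpleScaling",
    AdjustmentType="ExactCapacity",
    ScalingAdjustment=0,
)

print(scale_out["PolicyARN"], scale_in["PolicyARN"])
```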

For the API-related services, there are 2 options:

  • Option 1: put them into EC2 instances that are always in a ready state. This approach allows the API to respond quickly, without delay, since there is no boot-up time, but you have to pay a monthly fee for these instances. If the APIs are called frequently, this is the preferred option. In our case it isn’t.
  • Option 2: convert the API services into serverless functions with AWS Lambda. The advantage of serverless is that you only pay for what you use; in short, the cost depends on the number of requests you make that month. In our case the number of requests is quite low, so this is the approach we will take. A side note: serverless functions have a boot-up delay (a cold start, going from terminated to ready state), so they are a bit slower than the first approach, but in our case that isn’t a problem.
AWS Serverless Computing (Credit: Xenonstack)
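For a feel of what option 2 looks like in practice, here is a minimal Lambda handler for the getCommandStatus endpoint mentioned later, assuming an API Gateway proxy integration; `look_up_status` is a placeholder for whatever store the command center keeps its state in.

```python
import json

def get_command_status(event, context):
    """Hypothetical CommandCenter.getCommandStatus handler behind API Gateway."""
    # With a proxy integration, API Gateway passes the HTTP request in `event`;
    # the function only runs (and is only billed) while a request is in flight.
    command_id = (event.get("queryStringParameters") or {}).get("commandId")
    status = look_up_status(command_id)  # placeholder: e.g. a DynamoDB lookup
    return {
        "statusCode": 200,
        "body": json.dumps({"commandId": command_id, "status": status}),
    }
```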

And that’s it, folks: we have our solution for building the hacking toolbox. All that’s left is to design the infrastructure, and that is what we will do next.

With the solution figured out, designing the infrastructure is quite straightforward:

The toolbox infrastructure
  • A VPC with one private subnet and one public subnet, spanning 2 Availability Zones. With this design, even if for some reason one zone becomes inaccessible, our services can still be deployed in the other.
  • An internet gateway attached to the VPC so our services can access the internet.
  • An Amazon MQ broker deployed in the public subnet to act as our Action Queue. NOTE: the Amazon MQ brokers belong to the same security group, which only allows access from our Lambdas and private services.
  • A CloudWatch event that monitors the Amazon MQ status and informs AWS Auto Scaling whenever a new message arrives (sketched below).
  • AWS Auto Scaling does the following:
    - If a topic hasn’t had any new messages for a period of time (say, 2 hours), terminate the auto scaling group associated with that topic. For example, if the topic for the Scanning service has had no messages for 2 hours, turn off all the Scanning tool services.
    - If a new message arrives on a topic, boot up the services inside the auto scaling group associated with that topic.
  • The Lambda functions, each corresponding to one of our API endpoints.
  • The API gateway: since we cannot access the Lambda functions directly, all HTTPS calls must go through the API gateway. For example, the following request
    GET https://gatewaydomain.com/command/getCommandStatus
    will be sent to the API gateway, which will route it to the CommandCenter.getCommandStatus Lambda function.

With this infrastructure, whenever your toolbox isn’t in use (idle for more than 2 hours), all the high-processing (and expensive) tools are turned off. The Lambda functions are only charged when you call them, so you pretty much don’t have to pay any money during the idle period (not counting the storage fee).
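One way to wire up the “CloudWatch monitors Amazon MQ” part above is with two metric alarms on the broker’s QueueSize metric, pointed at the scaling policies from the earlier sketch. Treat this as an assumption-heavy sketch: the broker and queue names are placeholders, and the namespace/dimensions shown are for an ActiveMQ broker on Amazon MQ.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# ARNs returned by the put_scaling_policy calls in the earlier sketch
scale_out_policy_arn = "arn:aws:autoscaling:..."
scale_in_policy_arn = "arn:aws:autoscaling:..."

def queue_alarm(name, comparison, period, evaluation_periods, action_arn):
    """Alarm on the depth of one queue of our Amazon MQ (ActiveMQ) broker."""
    cloudwatch.put_metric_alarm(
        AlarmName=name,
        Namespace="AWS/AmazonMQ",
        MetricName="QueueSize",
        Dimensions=[{"Name": "Broker", "Value": "action-queue-broker"},
                    {"Name": "Queue", "Value": "scanning-stages"}],
        Statistic="Maximum",
        Period=period,
        EvaluationPeriods=evaluation_periods,
        Threshold=0,
        ComparisonOperator=comparison,
        AlarmActions=[action_arn],
    )

# Boot the workers as soon as at least one message is waiting
queue_alarm("scanning-queue-has-messages", "GreaterThanThreshold",
            60, 1, scale_out_policy_arn)

# Terminate them after the queue has been empty for 2 hours (24 x 5 minutes)
queue_alarm("scanning-queue-idle-2h", "LessThanOrEqualToThreshold",
            300, 24, scale_in_policy_arn)
```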

Conclusion

This infrastructure is exactly what I am using right now, and it helps me save a lot of money and time while getting the job done.

I hope this article shows you the importance of designing a solution that suits the system and its usage. Please look forward to our next article.
