Introducing TinyApprover: Extend traditional alerting with approvals

Published in Open Door Security · 9 min read · Mar 3, 2024

Disclaimer: The viewpoints expressed on this blog, including all articles, comments, and other content, are solely my own and do not in any manner reflect the views, policies, or positions of my employer. All content is created and published by me in a personal capacity, and is not endorsed, sponsored, or influenced by my current or any past employers.

Background

A risk factor haunting every organization is its ability to make calculated, fast business decisions. These can be anything from an engineering team choosing a platform for a new critical product feature to executives weighing the value of a strategic acquisition with competing offers.

Across all of these decisions, the common denominator is time sensitivity. A decision made too late is no longer strategic and often loses most of its value. For startups, this can make or break success. What good is a feature if it takes 12 months to ship and by then customers have moved on to a competitor?

In addition, decision-making ability sometimes degrades as an organization scales. Organizational structures change as headcount grows, and this must be carefully managed from the top so that small teams keep a high level of autonomy. Examples of this in practice include “Bias for Action” in Amazon’s leadership principles and the infamous phrase popularized by Mark Zuckerberg, “Move fast and break things”.

The surprising thing, though, is that most decisions are actually quite small, yet together they have the largest collective impact. Harvard Business Review calls these “microdecisions”:

…microdecisions — small decisions made many times by many workers at the customer interface — can have a major impact on the business. How they are made can be the difference between sloppy and effective execution, and between profit and loss.

An employee makes many microdecisions during a typical workday. (Image: Forbes)

Enough of these MBA terms. What does this mean in real business applications?

Enter: Operations teams

Operations teams are a part of almost all companies in the world. They are the powerhouse behind the scenes that keeps the lights on, triaging and following up on issues as they come in. Their work is distinct from that of product teams, which can more easily adhere to an Agile (read: planned) style of work.

Some of the most common operations roles in tech are:

Site Reliability Engineer
SOC Analyst
Security Engineer
On-call Software Engineer
IT Associate (less common title nowadays)

Let’s focus on security & IT teams to narrow down our scope, specifically teams that process and review identity and access management (IAM) access requests. IAM requests could come in any of the following ways:

Initial new hire Okta/Active Directory access setup
Provisioning new Okta apps
Kubernetes role permission changes
AWS policy permission changes
AWS Service Control Policy (SCP) changes

These types of work are precisely the “microdecisions” we defined previously. While they may not seem to be significant in the context of the job descriptions, managing IAM is highly complex with many moving parts:

“Small” changes are seldom made in isolation; each one affects the overall security strategy.
Missed SLOs (service level objectives) can delay new hires from ramping up or teams from deploying critical product features.
UX is typically lacking. It can be frustrating to submit several tickets to helpdesk for access to different software and services.

Due to these factors, it’s relevant to address ROI.

ROI

Some forecasts have the global IAM market growing from $15.7B last year to $32.6B by 2028, more than doubling in five years. Internal access provisioning is almost always a cost center, with each of the Big 4 offering entire professional services divisions for IAM.

We discussed in the previous article how difficult it is to quantify ROI in security. However, given 82% of breaches involve the human element, it’s critical to make sure access management is resourced well in an organization.

A robust access management program may in fact be the best security investment an organization can make. It also tends to be the hardest.

Bringing it back

In the context of the IAM use case, many lean teams operate in a “hybrid-ops” fashion, rotating responsibility for on-call (operations) work. Many on-call engineers also do project work during downtime until the moment a page or message arrives.

This pattern, known as the “context switch”, has been shown in studies to lead to faster work but more stress during work hours. UC Irvine researchers found it takes more than 23 minutes to refocus after a distraction. How’s that for productivity?

What if there was a way to reduce the cognitive load involved with the context switch? Or maybe even present all of the relevant information at once, reducing the depth of the context switch?

Imagine how sweet it would be to have a fleet of carrier pigeons bringing you bits and pieces of information to complete your work…

Solution

Enter a term called Human-in-the-Loop (HITL).

HITL is a fundamental part of realistically addressing a Human-Computer Interaction (HCI) design problem.

In simple terms, all HITL means is an explicit layer of human approval for any automatic actions — such as sending an email or scheduling a meeting.

This framework has traditionally been reserved for production machine learning models, but we can apply it to nearly any business decision-making process. In turn, this removes much of the risk of letting automation act on its own.
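To make the idea concrete, here is a minimal Python sketch of a human-in-the-loop gate. All names (`ProposedAction`, `run_with_approval`, the `approver` callback) are hypothetical and only illustrate the pattern: the automated action never runs unless a human explicitly says yes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    """An automated action waiting for human sign-off (hypothetical type)."""
    description: str
    execute: Callable[[], str]


def run_with_approval(action: ProposedAction,
                      approver: Callable[[str], bool]) -> str:
    """Run the action only if the human approver explicitly returns True."""
    if approver(action.description) is True:
        return action.execute()
    # Anything other than an explicit True is treated as a denial.
    return f"denied: {action.description}"


# Example: the approver is a stand-in for a page or chat prompt to a human.
send_email = ProposedAction(
    description="Send onboarding email to new hire",
    execute=lambda: "email sent",
)
result = run_with_approval(send_email, approver=lambda text: False)
```

The strict `is True` check is a deliberate choice in this sketch: an ambiguous or missing answer should never be read as approval.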

Wait, but how?

Now that you understand the business value, let’s dive into the technical details. This section is meant to provide a quick overview. Stay tuned for next week’s article, which goes much deeper technically.

Let’s get some components ready:

An LLM App connected to data sources via retrieval-augmented generation (RAG)
An AWS Account (to host the app)
A PagerDuty subscription

LLM App

This is the “bread and butter” of the application. Basically, this connects to your internal data sources (AWS, Okta, Workday, Slack) with a read-only role.

Upon receiving an IAM request to validate, the app queries the data sources for relevant IAM context and combines it with your custom prompt. We then instruct the model to return a value indicating whether the IAM request should be approved.

Sound experimental? It is, and LLMs are not deterministic by nature. However, the gold here is the context retrieved: we’ll send the approval request to your team no matter what value is returned. This can help your team be better informed when the time comes to handle an interrupt request.
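As a rough illustration of this step, the Python sketch below shows one way the retrieved context and a custom prompt might be combined, and how the model’s free-text reply could be reduced to a fixed value. The function names, the prompt wording, and the `NEEDS_REVIEW` fallback are assumptions for illustration, and the actual model call is deliberately left out.

```python
from enum import Enum


class Recommendation(Enum):
    APPROVE = "APPROVE"
    DENY = "DENY"
    NEEDS_REVIEW = "NEEDS_REVIEW"  # fallback when the reply is unclear


def build_prompt(custom_prompt: str, request_text: str,
                 context: dict[str, str]) -> str:
    """Combine the team's custom prompt, the request, and the retrieved context."""
    context_lines = "\n".join(
        f"- {source}: {facts}" for source, facts in sorted(context.items())
    )
    return (
        f"{custom_prompt}\n\n"
        f"Access request:\n{request_text}\n\n"
        f"Retrieved context:\n{context_lines}\n\n"
        "Answer with exactly one word: APPROVE or DENY."
    )


def parse_recommendation(raw_reply: str) -> Recommendation:
    """Map the model's free-text reply to a fixed value, failing closed."""
    answer = raw_reply.strip().upper().rstrip(".")
    if answer == "APPROVE":
        return Recommendation.APPROVE
    if answer == "DENY":
        return Recommendation.DENY
    # Anything unexpected is never treated as an approval.
    return Recommendation.NEEDS_REVIEW
```

Because the output is only a recommendation, an unparseable reply costs nothing: the human still receives the request along with the context.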

Plus, the technology is getting better and better every day with many organizations already adopting it in production. Try it out and you’ll likely be surprised by the results.

AWS Account

We deploy a CloudFormation stack here. More info to come in the technical article.

PagerDuty Account

This is where requests are processed — we’ll alert one of your team members that they need to approve or deny a request. Their response is recorded and sent to the app.

The Framework

Step 1. Configure your IAM Requests to be sent to the LLM App

Got a Jira-based setup where requesters submit tickets for access?

Just set up incoming requests to fire at the LLM app with a webhook. This one is straightforward but dependent on your use case.
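As a sketch of the receiving side, the Python below turns a Jira webhook body into a small request object. It assumes the issue-created payload carries `webhookEvent`, `issue.key`, and `issue.fields` (summary, description, reporter); check these field names against your own Jira instance. `AccessRequest` and `parse_jira_webhook` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessRequest:
    ticket_key: str
    requester: str
    summary: str
    details: str


def parse_jira_webhook(body: str) -> Optional[AccessRequest]:
    """Turn a Jira webhook body into an AccessRequest, or None if it is not a new ticket."""
    event = json.loads(body)
    if event.get("webhookEvent") != "jira:issue_created":
        return None
    issue = event.get("issue", {})
    fields = issue.get("fields", {})
    reporter = fields.get("reporter") or {}
    return AccessRequest(
        ticket_key=issue.get("key", ""),
        # Prefer the email address; fall back to the display name.
        requester=reporter.get("emailAddress") or reporter.get("displayName", "unknown"),
        summary=fields.get("summary", ""),
        details=fields.get("description") or "",
    )
```

Ignoring every event other than ticket creation keeps edits and comments on existing tickets from generating duplicate approval requests.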

Step 2. Configure your privileged actions

What happens when requests are approved?

Configure code that provisions the proper access when a request is approved. This may be changing an AWS policy or setting an Okta role.
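One possible shape for this configuration is a small registry that maps each request type to the function that grants the access. The Python sketch below is an assumption about structure only: the request types, parameter names, and functions are hypothetical, and the bodies are placeholders rather than real Okta or AWS calls.

```python
from typing import Callable, Dict

# Registry mapping a request type to the code that grants the access.
_ACTIONS: Dict[str, Callable[[dict], str]] = {}


def privileged_action(request_type: str):
    """Decorator that registers a provisioning function for one request type."""
    def register(func):
        _ACTIONS[request_type] = func
        return func
    return register


@privileged_action("okta_group")
def add_to_okta_group(params: dict) -> str:
    # Placeholder: a real implementation would call the Okta API here.
    return f"would add {params['user']} to Okta group {params['group']}"


@privileged_action("aws_policy")
def attach_aws_policy(params: dict) -> str:
    # Placeholder: a real implementation would call AWS IAM here.
    return f"would attach {params['policy_arn']} to role {params['role']}"


def provision(request_type: str, params: dict, approved: bool) -> str:
    """Run the registered action only for approved requests of a known type."""
    if not approved:
        return "not approved: no changes made"
    action = _ACTIONS.get(request_type)
    if action is None:
        raise ValueError(f"no privileged action configured for {request_type!r}")
    return action(params)
```

Keeping the approval check inside `provision` means no caller can reach a privileged function through this path without an explicit approval flag.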

Step 3. Set up your data sources

What data sources does your team use to make decisions when a request comes in?

Do they look at Slack or other internal documentation? Do they look at existing Okta groups or AWS permissions? What about looking at the requester and their organization unit in Workday?

The theme here is to provide access to the same data sources a human would use to approve or deny a request. For example, is the user making this access request right after logging in from a new location for the first time?

Automating these lookups not only improves efficiency but also makes it practical to run security checks that would be tedious to perform manually.
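A minimal sketch of this idea in Python treats every data source as a plain lookup function and collects the results into one context dictionary. The source names and stub lookups below are hypothetical stand-ins for read-only Okta, Workday, and login-history queries.

```python
from typing import Callable, Dict

# A data source is any function that takes a requester ID and returns a short text summary.
DataSource = Callable[[str], str]


def gather_context(requester: str, sources: Dict[str, DataSource]) -> Dict[str, str]:
    """Query every configured source; record failures instead of aborting."""
    context = {}
    for name, lookup in sources.items():
        try:
            context[name] = lookup(requester)
        except Exception as exc:  # one broken integration should not hide the rest
            context[name] = f"unavailable ({exc})"
    return context


# Stub lookups; real ones would call each system with a read-only role.
def okta_groups(requester: str) -> str:
    return "groups: engineering, vpn-users"


def workday_org(requester: str) -> str:
    return "org unit: Platform Engineering"


def login_history(requester: str) -> str:
    raise TimeoutError("login service timed out")


context = gather_context(
    "alice@example.com",
    {"okta": okta_groups, "workday": workday_org, "logins": login_history},
)
```

Recording a failed lookup as "unavailable" rather than dropping it tells the approver which evidence is missing, which is itself useful when deciding.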

Step 4. Set up PagerDuty

Your team will be paged when a new access request comes in, complete with all of the relevant context in a PagerDuty incident. To approve or deny a request, they’ll add comments to the incident.
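For illustration, the page could be raised through PagerDuty’s Events API v2 by sending a `trigger` event as JSON to `https://events.pagerduty.com/v2/enqueue`. The Python sketch below only builds that event body and does not send it. The function name, the `source` value, and the layout of `custom_details` are assumptions.

```python
def build_pagerduty_event(routing_key: str, ticket_key: str, summary: str,
                          recommendation: str, context: dict) -> dict:
    """Build an Events API v2 'trigger' event carrying the request and its context."""
    return {
        "routing_key": routing_key,  # integration key of the PagerDuty service
        "event_action": "trigger",
        "dedup_key": f"iam-request-{ticket_key}",  # one incident per ticket
        "payload": {
            # Truncate defensively; PagerDuty limits the summary length.
            "summary": f"[IAM approval] {summary}"[:1024],
            "source": "tinyapprover-sketch",  # hypothetical source name
            "severity": "info",
            "custom_details": {
                "ticket": ticket_key,
                "llm_recommendation": recommendation,
                "context": context,
            },
        },
    }
```

Reusing the ticket key in `dedup_key` means a retried webhook updates the existing incident instead of paging the on-call engineer twice.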

From here, the response is sent via a webhook to your AWS account, where the requested access change is either executed or rejected. Voila!
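On the receiving end, a handler might verify the webhook and translate the responder’s comment into a decision. The Python sketch below assumes PagerDuty v3 webhooks: an HMAC-SHA256 signature delivered as `v1=<hex>` in the `X-PagerDuty-Signature` header, and incident notes arriving as `incident.annotated` events with the note text in `event.data.content`. Confirm both against the PagerDuty documentation. The convention that a comment of exactly "approve" or "deny" is the decision is purely an assumption of this example.

```python
import hashlib
import hmac
import json
from typing import Optional


def signature_is_valid(secret: str, raw_body: bytes, signature_header: str) -> bool:
    """Check the body against the comma-separated 'v1=<hex>' signatures in the header."""
    digest = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    expected = f"v1={digest}".encode()
    candidates = [s.strip().encode() for s in signature_header.split(",")]
    return any(hmac.compare_digest(expected, candidate) for candidate in candidates)


def decision_from_webhook(raw_body: bytes) -> Optional[bool]:
    """Return True (approve), False (deny), or None if the event is not a decision."""
    event = json.loads(raw_body).get("event", {})
    if event.get("event_type") != "incident.annotated":
        return None
    note = ((event.get("data") or {}).get("content") or "").strip().lower()
    if note == "approve":
        return True
    if note == "deny":
        return False
    return None  # any other comment is ignored
```

Verifying the signature before acting matters here more than usual, since this endpoint is what ultimately triggers a privileged change.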

Framework Summary

This framework uses new technology to propose a repeatable process for handling interrupt IAM access requests. It addresses security concerns by still requiring explicit approval from one of your team’s humans before action is taken, hence “human in the loop”.

As covered above, even if the LLM gets the call wrong, we’ll deliver the alert with all of the relevant context to your incident responders. Over time and with more use, your team can leverage something called reinforcement learning from human feedback (RLHF), which means that the model can be fine-tuned or adjusted so its outputs line up with the expected patterns in your organization.

This is getting long now, so we’ll wrap up by answering some of the most pertinent questions.

FAQ

Why couple with incident response?

Sense of urgency

Incident response platforms evoke a sense of urgency more so than simple ticketing/project management tools like Jira, ServiceNow Ticketing, etc.

On-call schedules, uptime SLOs, and escalation logic all ensure time-sensitive acknowledgement of incidents.

Low barrier to implementation

Companies that care about uptime have to use incident response platforms to ensure engineers get alerted when there are problems, whether the platform is built in-house or comes from PagerDuty, OpsGenie, or ServiceNow.

Using existing licenses to save money is a prime example of “Doing more with less” by not increasing fixed expenses.

Existing webhook support

Most platforms, including PagerDuty, support outgoing webhooks in response to actions in the account. Why build your own platform if you can use an existing vendor?

Other applications

Use cases of this framework aren’t limited to IAM or even security. You could probably take this framework and apply it to anything involving operations work (comment some use cases!).

Regardless, here are some more security use cases:

Slackbot responses for questions in a public channel
Approving breakglass activities in the event of a major incident (root production account access, role assumption by a user with a Separation of Duties conflict)
Improving Identity Threat Detection & Response (ITDR) with logical actions based on outputs

Try it out

Stay tuned for a complete technical design post and GitHub project link coming next week! Following this, I’ll cover a use case from beginning to end applying the framework.

If you’re interested in applying this framework to your organization, I’m happy to chat about your pain points and where you are today in your journey. Just send me a DM!