How foundation models reshape cybersecurity tooling

The progress is great, but the opportunities are greater

Harpinder Singh
Innovation Endeavors
10 min read · Jun 27, 2023


By Harpi Singh and Dhruv Iyer


Thanks to the blazing pace of innovation in the space over the past several months, foundation models (FMs) have been embedded into tooling for nearly every domain, including design, FP&A, HR, sales, marketing, and software engineering. It should come as no surprise, therefore, that the technology has made its way into cybersecurity products as well.

With the hype of RSAC 2023 now firmly behind us, we thought it would be helpful to give an overview of the current “Foundation Models x Cybersecurity” landscape, discuss the implications for startups, and add to a growing list of predictions about where this technology will be most impactful for the security industry.

But first, a quick nod to infrastructure (Microsoft and Google)

Two platforms are providing foundational dev infrastructure and bringing the power of foundation models to security tooling: Microsoft’s Azure OpenAI Service and Google Cloud’s Security AI Workbench.

Microsoft was the first Cloud Service Provider (CSP) to go to market with a developer platform specifically for generative AI. The Azure OpenAI Service (the result of a partnership between Microsoft and OpenAI) made LLMs commercially available to developers, including pre-trained models like GPT, which can be fine-tuned on proprietary data. Google responded by integrating generative capabilities with their own AI development and MLOps platform called Vertex AI, leveraging the Pathways Language Model (PaLM) architecture. Amazon eventually released its own development platform called Bedrock that supports Anthropic’s Claude model, among others.

In late March and April 2023, Microsoft and Google began expanding their generative capabilities into the security space. Microsoft announced Security Copilot, which presents users with a single prompt bar that can launch incident investigations, reverse engineer scripts, explain vulnerabilities, and generate security reports. Google released Security AI Workbench, which brought generative capabilities to VirusTotal (explaining vulnerabilities) and the Chronicle SOC (threat investigations). Along with this, Google also debuted Sec-PaLM, a version of its PaLM model fine-tuned on security data.

Microsoft and Google’s approach to generative AI and cybersecurity creates an interesting “frenemy” dynamic for startups. On the one hand, they provide developer platforms for building on top of leading foundation models: GPT-4 for the Azure stack and PaLM/Sec-PaLM for GCP. On the other hand, these companies have shown a desire to vertically integrate these models into their own security products, creating competitive pressure for startups. Unlike pure-play model providers such as OpenAI and Anthropic, which prefer to stay at the infrastructure level and rarely verticalize into the application space, CSPs seem equally comfortable operating at both the infrastructure and application layers.

Search (Endor Labs)

A few cybersecurity vendors use large language models to provide semantic search in their applications. This is especially relevant for companies where a search interface over a proprietary data lake is a core value proposition.

DroidGPT by Endor Labs, for example, uses ChatGPT to offer users natural language search over their OSS vulnerability database. This allows users to ask questions like “what packages are vulnerable to log4j?” or “what vulnerabilities does Apache Struts have?” The tool becomes especially interesting when users leverage the reasoning capabilities of LLMs to make security decisions based on this data. For example, users can ask, “what packages have similar functionality to log4j, but with better security?” With this copilot-like search interface, developers can potentially start fixing security issues in a self-serve way.
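
To make this concrete, here is a rough sketch of how a copilot-style search over a vulnerability dataset could be wired up. To be clear, this is not Endor Labs’ implementation: the advisory file, record fields, and prompt are illustrative assumptions, using the OpenAI chat API (pre-1.0 SDK) as the model backend.

```python
# Illustrative sketch only (not DroidGPT). Assumes the openai<1.0 SDK and a
# local JSON dump of advisories; the file name and record fields are made up.
import json
import openai

def ask_vuln_db(question: str, advisories_path: str = "advisories.json") -> str:
    with open(advisories_path) as f:
        advisories = json.load(f)  # e.g. [{"package": "log4j-core", "cve": "CVE-2021-44228", ...}]

    # Naive keyword pre-filter so the prompt stays within the context window.
    tokens = [t.strip("?,.").lower() for t in question.split()]
    hits = [a for a in advisories
            if any(t and t in json.dumps(a).lower() for t in tokens)]

    prompt = (
        "You are a software supply-chain security assistant.\n"
        f"Advisory records:\n{json.dumps(hits[:20], indent=2)}\n\n"
        f"Question: {question}\n"
        "Answer using only the records above and cite package names."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]

print(ask_vuln_db("What packages are vulnerable to log4j?"))
```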

Adding a natural language search interface to one’s product has clear upsides: it offers users more flexibility in expressing their queries and resilience to issues like typos. But, if a keyword-based search experience has been working for your users, it’s highly unlikely that NLP search will dramatically improve product-market fit. There may still be value in pursuing it, but the opportunity cost should be well understood.

Code writing (Veracode, Semgrep, and Snyk)

SAST scanners have traditionally been limited to identifying issues without the ability to offer remediations. LLMs are poised to change that, and companies like Veracode, Snyk, and Semgrep are showing how. For example, when users run a Veracode scan from the command line, Veracode Fix will gather the results and suggest several options for fixes.

Similarly, the Semgrep bot can provide insight into whether the violation of a Semgrep rule is likely to be a true or false positive and, in the case of true positives, can suggest a fix to the developer.

A screenshot from the Semgrep GPT-4 blog post showing the Semgrep Assistant making recommendations
Image: Semgrep
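
To illustrate the pattern rather than any vendor’s actual implementation, here is a minimal sketch of the “scan, then ask an LLM for a fix” flow. The finding format and rule ID are hypothetical, and the model is asked for a reviewable diff rather than editing files directly.

```python
# Minimal sketch of an LLM-assisted autofix step (not Veracode, Snyk, or
# Semgrep code). Assumes the openai<1.0 SDK; the finding fields are made up.
import openai

def suggest_fix(finding: dict, source: str) -> str:
    prompt = (
        f"A static analysis scanner reported rule '{finding['rule_id']}' "
        f"({finding['message']}) at {finding['path']}:{finding['line']}.\n\n"
        f"Source file:\n{source}\n\n"
        "Return a unified diff that fixes the issue without changing behavior, "
        "followed by a one-sentence explanation."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"]

finding = {
    "rule_id": "example.insecure-md5-password-hash",  # hypothetical rule ID
    "message": "MD5 used for password hashing",
    "path": "auth.py",
    "line": 12,
}
print(suggest_fix(finding, open("auth.py").read()))  # auth.py is a stand-in
```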

While all of these tools seek to embed themselves in the developer workflow, they have made different decisions as to how: Veracode and Snyk integrate into the IDE and make direct changes, while Semgrep integrates into the source code manager and opens PRs.

It’s interesting to consider whether a new company can be created that leverages the code-writing capabilities of FMs. A new competitor in the SAST space is unlikely, given the tremendous distribution and data disadvantages it would face. However, we think there are more greenfield cases where code generation could be interesting, and we will discuss one example later on.

Vulnerability explanation (VirusTotal)

VirusTotal, a Google acquisition, shows how LLMs can be used to audit the findings of a vulnerability scan. It leverages Google’s Sec-PaLM model to provide additional context and insight into (1) whether a vulnerability is a true or false positive, (2) how the vulnerability works, and (3) why the user should or should not disregard the findings of the scanner.

In the examples below, VirusTotal identifies a malicious code snippet in a file that scanners had otherwise marked safe, and it reasons through why another alert is actually a false positive.

A screenshot of VirusTotal flagging a false negative
A screenshot of VirusTotal flagging a false positive
Images: VirusTotal

VirusTotal is likely using some form of chain-of-thought prompting to get Sec-PaLM to reason through and explain its conclusions logically. By providing a detailed explanation of how the attack works, what the impact is, and why a user should care, the product helps streamline communication between development and security teams.
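
As an illustration of what that kind of prompting might look like (our speculation, not Google’s actual Sec-PaLM prompt), a step-by-step triage prompt could be as simple as the following sketch:

```python
# Speculative example of chain-of-thought-style triage, not Sec-PaLM's prompt.
# Assumes the openai<1.0 SDK; the encoded PowerShell snippet is a truncated,
# made-up sample.
import openai

TRIAGE_PROMPT = """You are a malware analyst. Analyze the snippet step by step:
1. Describe what the code does.
2. Explain how an attacker could abuse it, if at all.
3. Conclude with "VERDICT: true positive" or "VERDICT: false positive"
   and one sentence on why the reader should or should not care.

Scanner verdict: {verdict}
Snippet:
{snippet}
"""

def triage(snippet: str, scanner_verdict: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": TRIAGE_PROMPT.format(verdict=scanner_verdict, snippet=snippet),
        }],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"]

print(triage("powershell -enc JABjAGwAaQBlAG4AdAAgAD0A...", "clean"))
```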

The ability of LLMs to reason about why a piece of code presents a security risk hints at a deeper, latent understanding of security and threat modeling concepts. Although vulnerability explanation is likely to remain a feature of a much larger product (i.e. VirusTotal), it would be interesting to see what new companies can be built on top of this newly discovered “security sense”.

Incident response and threat intelligence (SentinelOne, Recorded Future, Orca)

The last major category we’ll cover in this post is incident response and threat intelligence. These workflows are typically owned by a SOC analyst or security engineer and involve (1) triaging an alert, (2) reviewing the security log data, (3) piecing together the timeline of events and kill chain, (4) assessing the breach impact, and (5) implementing security fixes.

Within this broad category are three distinct subcategories, represented by Recorded Future, SentinelOne, and Orca Security.

Recorded Future’s AI can answer questions about global security incidents and events, giving SOC analysts context on active incidents around the world. For example, a security analyst can search for finance-related cyberattacks from the last three days, and Recorded Future will return a natural language summary of relevant attacks over that period.

SentinelOne’s GPT-enabled Purple AI product is aimed at helping security analysts investigate a potential incident through a single chat interface. The tool also suggests actions and follow-up questions to help drive investigation and remediation.

A screenshot of SentinelOne’s Purple AI product providing a chat experience for incident response and threat investigation
Image: SentinelOne

Lastly, Orca Security provides GPT-4 enabled remediation steps to help asset owners resolve alerts. For example, GPT can instruct the user on how to implement a least-privilege IAM role.

Screenshot of Orca Security’s GPT integration providing a recommendation on how to remediate a finding
Image: Orca Security
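
To make the least-privilege example concrete, the remediation such a tool might propose could look something like the sketch below: replace a wildcard policy with one scoped to a single bucket and read-only actions. The role, policy, and bucket names are made up, and any change like this should only be applied after human review.

```python
# Hedged example of a "least-privilege IAM role" remediation, not Orca's output.
# Role, policy, and bucket names are hypothetical; applied via boto3 for AWS.
import json
import boto3

scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],  # instead of "s3:*"
        "Resource": [
            "arn:aws:s3:::reports-bucket",
            "arn:aws:s3:::reports-bucket/*",
        ],
    }],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="reporting-service-role",       # hypothetical role
    PolicyName="reports-bucket-read-only",
    PolicyDocument=json.dumps(scoped_policy),
)
```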

The key insight for all of these platforms is that AI can surface the relevant data and insights directly, instead of making users hunt for them across different views. In that vein, it’s interesting to think about how AI can short-circuit these highly manual “button-clicking” workflows and create value quickly.

Where we go from here

The trend of incumbents adding foundation models to their tools illustrates two things. First, foundation models have astounding capabilities in the security domain and will massively reshape how we interact with security tools. Second, the opportunities where new companies can be created are more subtle than simply reinventing legacy tools with AI.

This second point is worth some deeper discussion. Because incumbents will almost certainly be adding generative capabilities to their products in the next few months, we believe their distribution advantages (i.e. scale and brand awareness) will pose serious impediments to any challenger startup. Thus, we believe that the next massive cybersecurity company will not be one that reinvents existing tooling with AI, but instead, will be one that leverages AI to solve a net-new problem. There are three particular domains that we believe fit this profile: penetration testing, security reviews, and security-as-code generation.

Penetration testing

Pentesting has always been more of an art than a science. Knowing where to look for potential vulnerabilities, how to probe the system for unexpected behavior, and whether to dig into a potential attack vector or move on to a different target requires deep expertise and judgment. The return on this effort is that pentests produce very strong signals: every successful pentest finding represents a real, exploitable attack. Although pentesting should be done consistently, the high degree of manual effort required means many companies do it only a few times per year or before a major release.

Given that LLMs have demonstrated an ability to perform chain-of-thought reasoning, it’s interesting to consider how they could be applied to bring true automation to pentesting. The models could be given API access to the same open-source tools a human would have and could be taught how to make the appropriate exploration-versus-exploitation tradeoff to identify meaningful vulnerabilities. Interestingly, a recent open-source project called PentestGPT has attempted to do just that.
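
A toy version of such an agent loop (inspired by, but not taken from, PentestGPT) might look like the sketch below: the model sees the findings so far, picks the next command from a small whitelist of tools, and decides when further probing is unlikely to pay off. The tools, prompt, and control flow are illustrative placeholders, and anything like this should only ever be pointed at systems you are authorized to test.

```python
# Toy exploration-versus-exploitation loop for automated pentesting. This is
# an illustrative sketch, not PentestGPT. Assumes the openai<1.0 SDK and that
# nmap/nikto are installed; only run against systems you are authorized to test.
import subprocess
import openai

ALLOWED = {"nmap": ["nmap", "-sV"], "nikto": ["nikto", "-h"]}

def next_action(findings: list[str]) -> str:
    summary = "\n- ".join(findings) if findings else "none yet"
    prompt = (
        "You are assisting an authorized penetration test.\n"
        f"Findings so far:\n- {summary}\n"
        f"Available tools: {', '.join(ALLOWED)}.\n"
        "Reply with 'TOOL <name> <target>' to gather more information, "
        "or 'STOP <summary>' if further probing is unlikely to pay off."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"].strip()

findings: list[str] = []
for _ in range(5):  # bounded exploration budget
    action = next_action(findings)
    if not action.startswith("TOOL"):  # the model chose to stop (or went off-script)
        print(action)
        break
    _, tool, target = action.split(maxsplit=2)
    result = subprocess.run(ALLOWED[tool] + [target], capture_output=True, text=True)
    findings.append(f"{tool} {target}: {result.stdout[:500]}")
```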

Although PentestGPT is certainly impressive, it’s unclear whether LLMs will ever match a human’s ability to explore obscure corner cases and thus fully replace human pentesters. However, even if AI cannot replace pentesters outright, a tool with the capabilities of 80–90% of human pentesters could still be commercially successful by enabling inexpensive, continuous testing and helping manual pentesters use their time more efficiently. This type of automation could go head-to-head with companies like Synack and Cobalt, which have built massive businesses around bringing the gig-economy model to pentesting.

Security reviews

Similar to pentesting, security reviews (and threat modeling) have traditionally been manual processes that only specialized, highly skilled individuals could perform. In a security review, a security engineer will review the design document for a service, understand the infrastructure, roles, and dependent systems, and make recommendations on how to improve the security of the overall system. Oftentimes, a security review is used as a release blocker for a new product or happens on an annual cadence, but it should really be a continuous process. It would be interesting to see how the natural language understanding and code comprehension capabilities of foundation models can be combined to make security reviews more frequent and light-touch.

Security-as-code generation

Security-as-code is an umbrella term to describe any form of codified security, including policy-as-code (PAC), detection-as-code (DAC), and (secure) infrastructure-as-code (IAC). This trend of “codifying everything” is exciting because it creates a standardized interface to describe and affect the behavior of cloud systems in predictable, replicable, and auditable ways.

Given that foundation models are able to write code, translate between natural and programming languages, and reason about security through chain-of-thought, we think there is an opportunity for a product that creates alignment between all of your existing security tooling and infrastructure through the mediums of PAC, DAC, and IAC, acting as something of a universal coordination layer. As an illustrative example, suppose that your organization has a natural-language rule that says “all resources must be tagged with ‘PII/No PII’”. An LLM could interpret this statement and automatically generate an OPA policy that requires tagging, an IAC template that ensures new resources are created with tags, and a detection-as-code alert for any untagged resources that are discovered. LLMs may also be able to infer detection rules from policy and infrastructure, infer secure infrastructure from detection rules and policy, and infer policy from existing detection rules and infrastructure.
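
As a hedged sketch of what this coordination layer could look like in practice, one could feed the tagging rule above to an LLM and ask it to emit the three artifacts together. The choice of Rego (for OPA), Terraform, and a generic detection rule here is one plausible stack, not a prescribed one, and the prompt assumes the openai<1.0 SDK.

```python
# Illustrative prompt for generating PAC, IAC, and DAC artifacts from a single
# natural-language rule. The rule is the tagging example from the text; the
# target formats are an assumption, not a recommendation.
import openai

RULE = "All cloud resources must be tagged with either 'PII' or 'No PII'."

prompt = (
    f"Organizational security rule: {RULE}\n\n"
    "Produce three artifacts that enforce this rule:\n"
    "1. An OPA/Rego policy that denies untagged resources at admission time.\n"
    "2. A Terraform validation block that rejects resources missing the tag.\n"
    "3. A detection rule that alerts on already-deployed resources without the tag.\n"
    "Label each artifact clearly."
)

resp = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(resp["choices"][0]["message"]["content"])
```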

Ultimately, we’re bullish on the potential of foundation models to change the way we interact with cybersecurity tooling. Over the next few years, we will likely see a combination of existing tools being reinvented in an AI-first way and new products emerging to solve net-new problems in the domain. This technology will almost certainly make security faster, cheaper, and easier to integrate, and that is a world we’re excited to live in.

If you’re helping to build the future of generative AI in cybersecurity, we’d love to hear from you. Feel free to reach out to diyer [at] innovationendeavors [dot] com.

Huge thanks to Clint Gibler (tl;dr sec), Harshit Chitalia, and Prahathess Rengasamy for their feedback and comments.
