AI Is Changing AppSec Faster Than We Expected — And That’s a Good Thing

February 2026 was a month that made the application security world pay attention. Anthropic launched Claude Code Security — a system that had already found over 500 zero-day vulnerabilities in production open-source codebases before it shipped. Days earlier, the open-source Raptor framework showed that a properly orchestrated LLM could autonomously run Semgrep scans, execute CodeQL queries, validate whether findings are exploitable, generate proof-of-concept exploits, and produce patches. All in a single workflow.

These are not incremental improvements to existing tools. They represent something fundamentally different: AI that reasons about code rather than matching patterns against it.

For those of us who do application security professionally — and who have watched decades of SAST tools produce mountains of false positives while missing the vulnerabilities that actually matter — this is the most interesting development in a long time. And, perhaps surprisingly, it is good news for human pentesters.

What Anthropic Actually Built

On February 20, 2026, Anthropic released Claude Code Security as a limited research preview for Enterprise and Team customers, with expedited free access for open-source maintainers. The announcement came fifteen days after the company published research showing that Claude Opus 4.6 had discovered more than 500 previously unknown vulnerabilities in production open-source codebases — bugs that had survived years of expert review, automated scanning, and fuzzing campaigns.

The technical approach is what makes this significant. Claude Code Security does not scan for known patterns or match against vulnerability signatures. Instead, it reads code the way a security researcher would: it traces data flows across components, understands how different parts of an application interact, and reasons about edge cases that emerge from that interaction. Every finding goes through a multi-stage verification process where the model attempts to prove or disprove its own results before they reach a human analyst.

The example that best illustrates the difference is a heap buffer overflow that Claude found in CGIF, a library for creating GIF images. The vulnerability existed in how the library handled LZW compression. CGIF assumed that compressed output would always be smaller than uncompressed input — which is almost always true. But Claude recognised that if the LZW dictionary filled up and triggered resets, the compressed output could exceed the uncompressed buffer size. The model generated a proof-of-concept by deliberately maxing out the LZW symbol table to force the insertion of clear tokens, causing the overflow.

This is the kind of vulnerability that traditional tools cannot find. Coverage-guided fuzzers struggle with it because triggering the bug requires a specific sequence of operations that makes the algorithm behave counterintuitively. Even with 100% line and branch coverage, a fuzzer could miss it. Finding it requires understanding what LZW compression does conceptually and reasoning about when its assumptions break down. That is not pattern matching. That is analysis.
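The broken assumption is easy to demonstrate with a toy encoder. The sketch below is not CGIF's implementation — the dictionary layout, clear-code handling, and fixed 12-bit code width are illustrative GIF-style choices — but it shows the core point: for an input with no repeated byte pairs, every LZW code covers a single byte, so a 12-bit code stream ends up 1.5 times the size of the 8-bit input. "Compressed output is always smaller than input" simply is not true.

```python
def lzw_compressed_bits(data: bytes, code_bits: int = 12) -> int:
    """Toy GIF-style LZW encoder that only counts output bits."""
    def fresh_dict():
        return {bytes([i]): i for i in range(256)}

    dictionary = fresh_dict()
    next_code = 258            # 256 = clear code, 257 = end-of-information
    max_code = (1 << code_bits) - 1
    out_codes = 0
    w = b""
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc             # keep extending the current match
            continue
        out_codes += 1         # emit the code for w
        if next_code <= max_code:
            dictionary[wc] = next_code
            next_code += 1
        else:
            out_codes += 1     # dictionary full: emit clear code, reset
            dictionary = fresh_dict()
            next_code = 258
        w = bytes([byte])
    if w:
        out_codes += 1         # flush the final pending code
    return out_codes * code_bits

data = bytes(range(256))       # no byte pair ever repeats: LZW's worst case
print(lzw_compressed_bits(data) // 8, "compressed bytes for", len(data), "input bytes")
```

Run against 256 distinct bytes, the encoder emits 256 codes of 12 bits each: 384 output bytes for 256 input bytes. An output buffer sized to the input overflows, which is exactly the class of assumption the CGIF bug violated.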

Raptor: When the Pentester’s Toolkit Becomes Autonomous

While Anthropic was building a commercial product, a team of respected security researchers — Gadi Evron, Daniel Cuthbert, Thomas Dullien (better known as Halvar Flake), and Michael Bargury — released Raptor as an open-source framework. Their premise was simple: take Claude Code, which was designed for software development, and configure it for adversarial thinking.

The result is an autonomous security research pipeline. When you run Raptor’s /agentic command against a codebase, it executes a multi-phase workflow:

  1. Static analysis — Raptor runs Semgrep rules and CodeQL queries against the target, using the same tools that security teams already rely on.
  2. Exploitability validation — Instead of dumping raw SAST results into a spreadsheet, the LLM evaluates each finding in context. Can this actually be triggered? What conditions are required? Is there a realistic attack path?
  3. LLM-powered deep analysis — Claude reasons about the findings, identifying complex vulnerabilities that rule-based tools flag incorrectly or miss entirely.
  4. Exploit generation — For confirmed vulnerabilities, Raptor can generate proof-of-concept exploits to demonstrate real-world impact.
  5. Patch generation — The same model that found and exploited the vulnerability proposes a fix.
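A minimal version of step 2 can be sketched even without an LLM in the loop. Before findings reach model-powered review, a cheap pre-filter over Semgrep's JSON output can discard the obvious noise; the `results`, `check_id`, `path`, and `extra.severity` fields below follow Semgrep's real output schema, but the filtering policy itself is a hypothetical example, not Raptor's actual logic.

```python
import json

# Sample findings in Semgrep's JSON output shape (abbreviated).
SAMPLE = json.loads("""{"results": [
  {"check_id": "python.lang.security.audit.sql-injection", "path": "app/views.py",
   "extra": {"severity": "ERROR", "metadata": {"confidence": "HIGH"}}},
  {"check_id": "python.lang.maintainability.unused-variable", "path": "tests/util.py",
   "extra": {"severity": "INFO", "metadata": {"confidence": "LOW"}}}
]}""")

def worth_llm_review(finding: dict) -> bool:
    """Hypothetical pre-filter: escalate only findings likely to matter.

    Low-severity hits and anything in test code are dropped before the
    (expensive) contextual exploitability analysis.
    """
    in_tests = finding["path"].startswith("tests/")
    return finding["extra"]["severity"] in {"ERROR", "WARNING"} and not in_tests

queue = [f for f in SAMPLE["results"] if worth_llm_review(f)]
print(len(queue), "of", len(SAMPLE["results"]), "findings escalated")
```

The LLM layer then does what no static rule can: for each escalated finding, reason about whether an attacker can actually reach the sink with controlled input.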

Halvar Flake demonstrated this practically by using Raptor’s patching capability to address recently disclosed vulnerabilities in FFmpeg. The framework examined the vulnerabilities, pinpointed their locations, and generated fixes. The patches needed some refinement before being finalised, but the bulk of the analytical and implementation work was done autonomously.

What makes Raptor particularly relevant is that it is not replacing the pentester’s tools — it is orchestrating them. Semgrep and CodeQL are still doing what they do well: fast, deterministic scanning against known patterns. The LLM layer adds what those tools have always lacked: contextual reasoning about whether a finding matters, how it can be exploited, and how it should be fixed.

Why This Is Good News for Pentesters

When AI security tools make headlines, the first question from practitioners is always the same: does this replace us? The honest answer, based on what we have seen in February 2026, is no — but it changes what the job looks like.

Consider the current reality of an application penetration test. Whether you are assessing a web application, an API, or a mobile backend, a significant portion of the engagement goes to activities that are necessary but not intellectually demanding: running scanners, triaging false positives, documenting known vulnerability classes, writing reproduction steps for issues that follow well-understood patterns. These are hours spent on process, not on the deep analysis that actually finds the vulnerabilities clients care about most.

Agentic tools like Claude Code Security and Raptor compress that process layer dramatically. If an LLM can run your static analysis, validate which findings are real, generate proof-of-concept exploits, and draft patches — all before a human touches the results — then the pentester’s time shifts entirely toward the work that requires human judgment: understanding business logic, identifying trust boundary violations that only make sense in context, testing authentication and authorisation flows that depend on how the organisation actually operates, and communicating findings in ways that drive remediation.

This is the difference between a pentester who spends three days running tools and two days analysing results, versus one who gets pre-validated, contextualised findings on day one and spends the full engagement on analysis. The depth and quality of the assessment go up. The value the client receives goes up. The pentester’s work becomes more interesting and more impactful.

There is one condition, though, and anyone who has done enough black-box assessments will appreciate the irony: these tools work best when they have access to source code.

An LLM can reason about a heap buffer overflow in CGIF because it can read the code that implements LZW compression. Raptor can validate whether a Semgrep finding is exploitable because it can trace the data flow through the actual source. The agentic security revolution is, fundamentally, a white-box revolution. For pentesters, the implication is clear: the engagements where you get source code access will produce dramatically better results than those where you do not. If you needed another argument for why clients should provide source code during assessments — and we have been making this argument for years — AI tooling just made it ten times stronger.
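The white-box advantage is easy to see in miniature. The sketch below is a deliberately crude source-level taint check using Python's `ast` module; the two-rule policy — flag `os.system` calls on variables assigned from `input()` — is an illustrative assumption, not how Claude or Raptor analyse code. The point is that it works at all only because the source is available to parse: from the outside, the same data flow is invisible.

```python
import ast

# Target source under review (only possible with source code access).
SRC = '''
import os
cmd = input()
os.system(cmd)
'''

tree = ast.parse(SRC)
tainted = set()   # variable names assigned directly from input()
findings = []     # line numbers of tainted os.system calls

for node in ast.walk(tree):
    # Rule 1: record names bound from input() as attacker-controlled.
    if isinstance(node, ast.Assign):
        v = node.value
        if isinstance(v, ast.Call) and isinstance(v.func, ast.Name) and v.func.id == "input":
            for t in node.targets:
                if isinstance(t, ast.Name):
                    tainted.add(t.id)
    # Rule 2: flag os.system(...) calls fed a tainted name.
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        if node.func.attr == "system":
            for a in node.args:
                if isinstance(a, ast.Name) and a.id in tainted:
                    findings.append(node.lineno)

print("tainted command execution at lines:", findings)
```

Real tools trace flows across functions, files, and sanitisers, but the dependency is the same: no source, no trace.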

What This Means for Product Security Teams

The implications extend beyond penetration testing. For organisations building software, agentic AI tools change the economics of their security programmes.

Continuous scanning becomes genuinely useful. Traditional SAST tools produce so many false positives that development teams learn to ignore them. When an LLM validates findings before they reach a developer’s queue, the signal-to-noise ratio improves enough that continuous scanning becomes actionable rather than annoying.

The remediation bottleneck shrinks. The 2025 Checkmarx Future of AppSec report found that 81% of organisations knowingly shipped vulnerable code. The gap between vulnerabilities discovered and vulnerabilities fixed continues to widen. AI-generated patches — even imperfect ones that need human review — dramatically reduce the friction between finding a bug and fixing it.

Security scales with development velocity. Modern engineering teams ship code continuously. Security review has always struggled to keep pace. Agentic tools that can review code changes in near-real-time, flag genuine issues, and suggest fixes bring security closer to the speed of development without requiring a proportional increase in security headcount.

This does not mean product security teams become redundant. It means their role shifts from manually reviewing everything to curating, validating, and prioritising AI-assisted findings — and focusing human attention on the architectural and design-level security decisions that AI cannot yet make well.

Where This Is Heading

The trajectory is clear. Within the next 12 to 18 months, we expect to see agentic security tools become a standard part of the CI/CD pipeline for mature engineering organisations. The combination of traditional SAST engines (Semgrep, CodeQL, SonarQube) with LLM-powered validation and remediation will become the default approach — not because the AI is perfect, but because it is substantially better than either approach alone.

For penetration testing firms, including BSG, this means adapting our methodology to leverage these tools where they add value while focusing our human expertise where it matters most: complex application logic, multi-step attack chains, business context that no model understands as well as a researcher who has spoken with the client’s engineering team.

For development teams — especially those investing in security training for their engineers — it means the barrier to meaningful security testing is dropping. You do not need a six-figure security budget to get competent automated vulnerability detection and remediation suggestions. Open-source tools like Raptor demonstrate that powerful agentic security pipelines can be assembled from existing components.

And for the AppSec profession as a whole, it means the job is about to get more interesting. The tedious parts are being automated. The parts that require genuine expertise — understanding systems, reasoning about risk, communicating with humans — are becoming more valuable, not less.

The future of AppSec is not AI versus humans. It is AI making humans better at the parts of the job that actually matter.

FAQ

Can AI tools like Claude Code Security replace human penetration testers?

Not in their current form. Agentic AI tools excel at finding known vulnerability patterns, validating findings, and generating patches — tasks that consume significant time in a traditional assessment. However, they cannot assess business logic in organisational context, understand trust relationships between systems, or make judgment calls about risk prioritisation that depend on business factors. The most effective approach combines AI-assisted scanning with human expertise for complex analysis.

How does Claude Code Security differ from traditional SAST tools like Semgrep or CodeQL?

Traditional SAST tools match code against predefined patterns and rules. Claude Code Security reads and reasons about code — understanding data flows, component interactions, and algorithmic edge cases. The CGIF heap buffer overflow example illustrates this: the vulnerability required understanding how the LZW compression algorithm behaves under specific conditions, something no pattern-matching tool could detect. However, traditional SAST tools remain faster and more deterministic for known vulnerability classes.

What is Raptor and how does it use AI for security testing?

Raptor is an open-source framework that turns Claude Code into an autonomous security research agent. It orchestrates traditional tools (Semgrep, CodeQL, AFL++) with LLM-powered analysis to scan code, validate whether findings are exploitable, generate proof-of-concept exploits, and produce patches — all in a single automated pipeline. It was created by security researchers Gadi Evron, Daniel Cuthbert, Thomas Dullien (Halvar Flake), and Michael Bargury.

Do these AI security tools work on black-box assessments?

Their greatest strength is in white-box (source code) analysis. An LLM can reason about vulnerabilities most effectively when it can read the source code, trace data flows, and understand implementation details. Black-box testing — where no source code is available — benefits less from current agentic approaches, though AI can still assist with fuzzing, response analysis, and pattern recognition. This reinforces the value of providing source code access during security assessments.

Should development teams adopt AI security tools now or wait?

Teams should begin integrating these tools now, particularly for code review and vulnerability triage. Claude Code Security is available for Enterprise and Team customers, and Raptor is open-source. Start with AI-assisted validation of existing SAST findings to reduce false positives, then expand to automated patch suggestions. The tools are imperfect but already provide meaningful value, and waiting for perfection means missing months of improved security coverage.