LLM Penetration Testing: 2026 Methodology Guide

Penetration Testing May 2, 2026

LLM penetration testing is not a normal web-app pentest with a chatbot bolted on. The attack surface includes the prompt layer, model behaviour, retrieval (RAG), tool and agent invocation, and output handling — and the most damaging failures usually live in the seams between those layers.

A useful engagement is therefore architecture-led. You start from the application’s components, integrations, and business-critical actions, then derive a test plan that targets the real trust boundaries — rather than running a generic checklist of “interesting prompts”.

Why LLM Apps Need a Different Pentest

A web application pentest already covers endpoints, roles, authn/authz, injection, access control, session management, and business logic. That work remains necessary for an LLM app. It is not sufficient.

It helps to think about LLM applications as a five-layer attack surface:

Prompt layer — system prompts, templates, conversation memory, and policy guardrails. This is a new control plane that influences downstream decisions and tool invocation.
Model behaviour — how the model responds to adversarial input: instruction-following conflicts, refusal patterns, hallucination under pressure, over-trust of retrieved content. The pentest question is rarely “can we jailbreak it?” and almost always “can we make it do the wrong thing inside this app’s trust boundaries?”
Retrieval / RAG — ingestion, chunking, embeddings, vector search, re-ranking, context assembly, and permissions. RAG is an identity and data-isolation problem disguised as a search feature.
Tools and agents — function calls, plugins, connectors, and autonomous workflows that take real actions: email, ticket creation, code changes, financial actions, data exports. This is a new perimeter.
Output handling — wherever model output is rendered, stored, indexed, executed, or fed into other systems. Classic injection concerns reappear, but the payload generator is now the application itself.

A standard web-app pentest typically under-tests these areas because it treats the LLM as just another API dependency, focuses on endpoints and parameters rather than instruction hierarchies and tool constraints, and rarely validates the cross-boundary flows that matter most: untrusted content lands in retrieval → enters context → changes tool invocation → triggers a privileged downstream action.

For the catalogue of risk categories, see BSG’s reference post on the OWASP LLM Top 10. This guide focuses on how a pentest is actually executed and what a buyer should expect from one.

The Realistic 2026 Threat Model

Enterprise incidents around LLM features cluster around three practical risks: indirect prompt injection, data leakage through retrieval, and tool and agent abuse. These are less headline-friendly than jailbreak demos, but they map directly to business impact: unauthorised actions, data exposure, policy violations, and workflow compromise.

Indirect prompt injection is the underrated risk

Direct prompt injection — a user typing “ignore your instructions” into a chatbot — is the easy case. Mature applications already build defences: input filtering, refusal patterns, tool gating, and monitoring.

Indirect prompt injection is more dangerous because it exploits the application’s own trust pipeline. The malicious instructions are not typed by a user; they arrive embedded in content the application ingests:

Uploaded documents (PDFs, DOCX, slides) with hidden instructions in metadata, white-on-white text, or footnotes.
Web pages scraped for RAG sources.
Shared knowledge bases, support tickets, or wiki articles used as retrieval corpora.
Emails or chat logs summarised by an assistant.

When the model retrieves this content into its context, it can treat attacker-authored text as instructions rather than data. The result is misleading answers, disclosure of sensitive context, or tool calls the user never requested. The relevant question for a buyer is not “can you jailbreak the model?” but “can untrusted content cross our trust boundaries and influence privileged actions?”

RAG leakage is an access-control problem

Most retrieval failures we see are access-control failures expressed through search:

Cross-tenant leakage — tenant A receives chunks from tenant B because of indexing mistakes, metadata mishandling, or a shared vector store without strict partitioning.
Cross-permission leakage — within one tenant, a user retrieves content from documents they should not be able to see, because retrieval scores by semantic relevance without enforcing authorisation at query time.
Over-broad context assembly — the application pulls “similar” chunks from sensitive sources into context even when not needed, increasing exposure through model output, transcripts, and logs.

A competent provider tests retrieval pipelines as if they were a data access layer: identity, authorisation, and isolation first; relevance second.

Tool and agent abuse is the new perimeter

The moment an LLM can call tools, it can move from “wrong answer” to “wrong action”. Most buyer-critical risks live here: financial workflows, customer support actions, account changes, data exports, code changes, and integrations.

Agentic systems amplify the risk because they chain tools, accumulate side effects, operate over longer horizons with memory, and may act on partially verified information before triggering irreversible actions. For a closer look at how attackers manipulate tool invocation, see our analysis of malicious skills and tool-abuse threats.

When testing tool abuse, focus on three things:

Authority — what credentials does the tool use (user-scoped, app-scoped, service account)?
Constraints — which parameters are allowed, which actions need human approval, and what rate limits exist?
Observability — are tool calls logged with enough context to attribute and investigate?

Jailbreak demonstrations rarely answer any of these. A provider whose strongest deliverable is a clever prompt has not tested your application’s actual trust boundaries.

A Practical LLM Pentesting Methodology

A competent LLM pentest is architecture-led and evidence-driven. The tester should treat the system as a distributed application with extra control inputs and trust-boundary crossings — not as a chat interface to break.

Phase 1 — Threat modelling

The engagement starts with a threat-modelling workshop. The objective is to identify where untrusted input can influence privileged outcomes. Expect a competent provider to ask for:

Architecture diagram covering the app, LLM provider, RAG store, tool layer, and authn/authz.
Data classification and tenancy model.
Tool inventory with permissions, marking which actions are irreversible.
Prompt composition: system prompt, templates, memory, policies, and how context is assembled.
Retrieval design: indexing, metadata, authorisation enforcement, query-time filters, logging.
Deployment context: staging vs production, feature flags, rate limits, monitoring.

The output is a model of inputs (chat, files, URLs, connectors), trust boundaries, business-critical actions (refunds, approvals, data exports, password resets, code merges), and prioritised attack hypotheses mapped to OWASP LLM Top 10 and MITRE ATLAS categories. Frameworks tell you the classes of issues; the threat model decides which classes matter for your system.

Phase 2 — Reconnaissance

Recon clarifies what the application actually does. Typical activities:

Integration mapping — identify the tool endpoints, internal services, and third-party APIs the assistant can reach.
Model and policy fingerprinting — determine which model families and safety layers are used, and whether different models handle classification, generation, and tool selection.
System prompt and policy exposure — assess whether system prompts, tool schemas, or routing logic can be extracted directly or via indirect injection. The objective is not “steal the prompt” as a trophy; it is to see whether hidden controls can be surfaced and abused.

By the end of recon, the tester should already be producing a test plan tied to the threat model: specific abuse paths, data targets, and tool actions to validate.

Phase 3 — Active testing

Active testing combines adversarial prompting with classic application security technique. Reproducibility and boundary validation are what make this a pentest rather than a demo.

The scenarios that matter most:

Indirect prompt injection via realistic corpora — uploaded documents, scraped pages, KB articles, support tickets. Chat-only payloads are insufficient.
RAG isolation testing — prove or disprove strict separation by tenant and by permission. Cover metadata handling, query-time filters, connector permissions, and “semantic overshare” behaviours.
Tool abuse — parameter manipulation, privilege escalation through tool credentials, bypass of user confirmation, and chaining of innocuous tools into high-impact actions.
LLM-specific supply chain checks — how prompts, retrieval corpora, embeddings, and tool schemas are managed and updated; integrity controls on agent “skills” and tool definitions.

Each scenario should map to one or more OWASP LLM Top 10 categories and to the relevant MITRE ATLAS techniques, and each should have an explicit success criterion: unauthorised data disclosure, cross-tenant retrieval, tool execution without proper authorisation, bypass of approval workflows, downstream injection, or measurable cost and availability impact.

Ask the provider how evidence will be produced. A good answer includes captured requests and responses (redacted as needed), tool-call traces, and minimal repro scripts — not screenshots of a chat transcript.

Phase 4 — Output handling and downstream impact

Output handling is where LLM apps rejoin classic AppSec, but with new payload sources:

Rendering — if the UI renders Markdown or HTML, can the model be induced to output content that triggers XSS or unsafe links via downstream renderers?
Workflow injection — when output is copied into tickets, emails, CRM notes, or code, can the model produce content that exploits those systems (HTML injection in email clients, formula injection in spreadsheets, unsafe Markdown in internal portals)?
Tool-call SSRF and injection — when tools accept URLs or file paths, can a prompt injection steer the agent to fetch internal resources or hit metadata services?

Many LLM features expose new endpoints — chat completions, retrieval search, tool proxies — that need standard API hardening. Use API security testing as the baseline; LLM testing extends it, never replaces it.

Phase 5 — Reporting

Reporting is what justifies the budget. Buyers should expect:

Clear scope and assumptions (what was tested, what was not, what access was provided).
A documented threat model capturing trust boundaries and critical actions.
Reproducible findings: steps, inputs, expected vs actual outcomes, and evidence.
Severity ratings tied to business impact (data classification, tenancy, regulated data, fraud exposure, availability).
Implementable remediation guidance: design recommendations such as authorisation at retrieval time, control suggestions such as approval gates for high-impact tools, and concrete engineering tasks.

A “list of prompts that broke the model” is not a penetration test report.

What a Buyer Should Expect

Scope and access level

Scoping should cover assets (chat UI, APIs, tool proxy, retrieval pipeline, ingestion connectors, admin consoles), data (tenant boundaries, sensitive datasets, transcript storage), identities (end users, admins, service accounts, tool credentials), and environments (staging vs production). On access level:

Black-box validates external exposure but misses retrieval and tool authorisation failures without observability.
Grey-box — limited architecture and log access — is usually the best value for LLM pentests.
White-box — code and configuration review combined with testing — is appropriate for high-risk assistants and agentic systems with privileged tools.

A good provider will explain what additional access (tool-call logs, retrieval logs, permission matrices) improves coverage and what can be safely tested without it.

Deliverables

Ask explicitly for: a threat model document, findings with severity tied to business impact, mapping to OWASP LLM Top 10 and MITRE ATLAS, and prioritised remediation. If you also need broader assurance, combine with traditional penetration testing and application security work.

Timeline and re-test

Plan for a phased engagement: artefact collection and threat modelling, retrieval and tool validation across roles, reporting with reproducible scenarios, and a bounded re-test window for confirmed findings. The provider should define what qualifies for re-test (specific fixes) versus what becomes a new scope item (new tools, new corpora, major architecture changes).

Anti-patterns to walk away from

No threat modelling, no architecture review, no questions about tools or retrieval.
Deliverable is a list of jailbreak prompts or a transcript of the model “saying something it shouldn’t”.
Findings without reproduction steps, evidence, or impact analysis.
No coverage of RAG isolation, connector permissions, or tool credentials.
Framework mapping used as a substitute for application-specific analysis.

Common Findings We See

Patterns repeat because teams assemble similar building blocks: RAG, tool calls, connectors, shared vector stores. The examples below are anonymised composites; they illustrate how findings should be framed and evidenced, not any specific client environment.

Cross-tenant RAG context leak. A shared vector store indexed multiple tenants with inconsistent metadata filters. Under certain queries, retrieval returned semantically similar chunks from a different tenant, which then surfaced in the model’s answer.

Indirect prompt injection via uploaded document. An assistant ingested a user-uploaded PDF containing hidden instructions. When the document was later retrieved, those instructions caused the assistant to disclose internal operational notes — including parts of hidden prompt logic — and to attempt an unauthorised tool call. Root cause: no provenance labelling, and no separation between “data” and “instructions” in the prompt template.

Tool abuse to bypass an approval gate. An agent had access to both a “search user” tool and an “update account” tool. Through intermediate steps, an attacker steered the agent to select an account outside the requester’s authorisation scope, then triggered an action using service-account credentials. The UI enforced approval; the tool proxy did not bind calls to the requesting user’s permissions.

How LLM Pentesting Fits the Wider AppSec Programme

LLM security testing should not be a one-off. The right cadence depends on how quickly your prompts, corpora, tools, and integrations change.

Annual baseline — validate the overall architecture, trust boundaries, and highest-risk tool paths.
Per-release or per-major-change — re-test when you add tools, connectors, new corpora, new tenant models, or new deployment patterns.
Continuous controls — automated regression tests for known prompt injection patterns, RAG permission checks, and tool-call policy enforcement. Not a substitute for a pentest, but they reduce drift between engagements.

Pre-deployment testing is usually safer and allows more aggressive scenarios. Production testing is sometimes needed for real connector behaviour, true tenant isolation validation, or observability verification — but it should be scoped tightly: clear test windows, safe targets, rate limits, and rollback plans.

A pentest complements rather than replaces secure SDLC controls, API hardening, and data governance. Where penetration testing finds and fixes specific vulnerabilities, red teaming is more appropriate when you need to measure detection and response: how the organisation reacts to a sustained prompt injection campaign, how employees use internal assistants, what data they paste, what approvals are bypassed in practice. Mature programmes use both.

Frameworks and Standards Worth Knowing

Frameworks give you a shared vocabulary; threat modelling tells you what matters for your system.

OWASP LLM Top 10 — the standard taxonomy for risk categories.
OWASP GenAI Red Teaming Guide — structures assessment across model evaluation, implementation testing, infrastructure, and runtime behaviour.
MITRE ATLAS — adversary tactics and techniques for AI systems; useful for mapping realistic attack paths and aligning with threat intelligence.
NIST AI RMF — programme-level governance (Govern / Map / Measure / Manage). The GenAI profile (NIST AI 600-1) positions red teaming as a core “Measure” activity.
OWASP AI Exchange — extensive practical guidance across AI security and privacy, useful as a reference library.

FAQ

What is LLM penetration testing?

LLM penetration testing is a security assessment focused on the unique attack surface of LLM-powered applications: prompt composition, model behaviour under adversarial input, retrieval pipelines, tool and agent invocation, and output handling. The goal is to identify and reproduce real abuse paths — such as data leakage via retrieval or unauthorised actions via tools — and to deliver remediation guidance tied to business impact.

How is LLM pentesting different from a regular pentest?

A regular pentest covers endpoints, parameters, and classic web and API issues. LLM pentesting must also test instruction hierarchies, indirect prompt injection via retrieved content, retrieval authorisation and tenancy isolation, tool-call constraints, and downstream impact of model output. The riskiest failures usually occur between components — retrieval to context to tool calls — which standard methodologies do not cover deeply.

Should we pentest the model itself or just the application around it?

For most buyers, the priority is the application around the model: prompts, retrieval, tools, identities, permissions, and output handling. If you fine-tune or self-host models, model-level testing also matters — particularly data poisoning risks and unsafe behaviour evaluation. Even when the model is a managed service, you still need to validate how your application constrains it.

How often should we pentest our LLM application?

At minimum, before launch and annually thereafter. Re-test whenever you add tools, connectors, corpora, or tenant models, or when you change how context is assembled and authorised. If your assistant changes frequently, invest in continuous regression controls and use periodic pentests to validate the full end-to-end threat model.

What deliverables should we expect from an LLM pentest?

A documented threat model, reproducible attack scenarios with evidence, severity ratings tied to business impact, and prioritised remediation guidance. Findings should map to OWASP LLM Top 10 and MITRE ATLAS for governance and tracking. A list of jailbreak prompts or chat transcripts is not sufficient for engineering remediation or risk acceptance.

Conclusion

LLM-powered applications introduce new control surfaces and new trust-boundary crossings. The risks that matter to enterprise buyers are indirect prompt injection, retrieval-driven data leakage, and tool and agent abuse — not the jailbreak demos that get the headlines.

When buying LLM security testing, prioritise providers who start from threat modelling, validate your specific integrations and access boundaries, and deliver reproducible scenarios with remediation guidance mapped to recognised frameworks.

Building or shipping LLM features? BSG’s application security team runs LLM penetration tests against the OWASP LLM Top 10, MITRE ATLAS, and OWASP GenAI Red Teaming Guide — covering prompt injection resilience, RAG data isolation, tool abuse, and output handling. Talk to our team about scoping an engagement.