Cyber Defense Exercises: 40 Teams, 15 Countries
Most organizations test their defenses with tabletop exercises — facilitated discussions where someone reads a scenario and teams talk through their response. Tabletops test process. They don’t test capability.
BSG has spent over a decade on the offensive side — penetration testing, red teaming, and running CTF competitions at security conferences across Europe. A few years ago, we won SANS Grid NetWars, a defensive investigation tournament, and something clicked: the same skills that make a good attacker make a good exercise designer. We know how real adversaries operate because that’s what we do every day. Building realistic exercises for blue teams was a natural next step.
Since then, we’ve been building and running live adversary simulation exercises for blue teams — from 5-person SOC squads to multinational events with 40+ teams competing simultaneously. It’s a newer part of our practice, but it builds directly on everything we’ve learned from years of offensive work. The exercises use real infrastructure, real attack tools, and real defensive tooling. Participants detect, investigate, and respond to multi-stage attacks while our red team operates against them in real time.
This post shares what we’ve learned so far — the infrastructure decisions, scenario design trade-offs, and the blue team gaps that keep showing up across industries and geographies.
Why Tabletops Fall Short
There’s nothing wrong with tabletop exercises. They’re valuable for testing communication plans, escalation procedures, and executive decision-making. But they have a fundamental limitation: participants know it’s a simulation.
In a tabletop, nobody has to parse a SIEM alert at 2 AM. Nobody struggles with a tool they’ve never configured before. Nobody discovers that the runbook they wrote six months ago references a server that was decommissioned last quarter.
Live exercises reveal the gaps that tabletops can’t surface. We’ve watched experienced teams freeze when their SIEM returned unexpected results. We’ve seen well-documented playbooks fail because the analyst executing them had never actually run the commands. These aren’t failures of knowledge — they’re failures of practice.
The regulatory landscape is catching up to this reality. DORA now requires threat-led penetration testing for financial entities. NIS2 mandates incident response exercises for essential entities across the EU by October 2026. Both regulations explicitly distinguish between discussion-based exercises and operational exercises that test real technical capabilities.
Building the Exercise Infrastructure
Our first design decision was the most consequential: build everything in the cloud, deploy with Infrastructure-as-Code, and require nothing from participants except a laptop.
We deploy complete exercise environments on AWS — Active Directory domains, Linux servers, SIEM platforms, EDR agents, and network monitoring tools. Each team gets their own isolated environment with identical configurations. The infrastructure is defined in code, which means we can tear it down, rebuild it, and reproduce it exactly for future events.
This approach solves three problems simultaneously:
Repeatability. When you run the same exercise for a new group, the environment is identical. No drift, no forgotten configurations, no “it worked last time.” Every team faces exactly the same starting conditions.
Scale. For the 40-team multinational event, we deployed 40 parallel environments. Manually configuring that many instances would take weeks. With IaC, it takes hours.
Isolation. Each team operates in their own environment. A mistake in one team’s instance — accidentally shutting down a domain controller, for example — doesn’t affect anyone else. This is critical for learning. Teams need freedom to make mistakes without consequences beyond their own sandbox.
The trade-off is cost. Cloud infrastructure for 40 parallel environments isn’t cheap. For smaller engagements (one or two teams), it’s straightforward. At scale, infrastructure management becomes its own workstream, requiring a dedicated engineer throughout the event.
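To make the repeatability and scale points concrete, here is a minimal sketch of how per-team deployments can be generated from a single definition. The team naming, the Terraform variable names, and the region are illustrative assumptions, not our actual configuration:

```python
# Sketch: one identical environment per team, differing only in team_id.
# Variable names and region are illustrative, not our real configuration.

def deployment_plan(num_teams: int, region: str = "eu-central-1"):
    """Build one Terraform workspace and apply command per team."""
    plan = []
    for n in range(1, num_teams + 1):
        team = f"team-{n:02d}"
        plan.append({
            "workspace": team,
            "commands": [
                # -or-create selects the workspace, creating it if absent
                f"terraform workspace select -or-create {team}",
                f"terraform apply -auto-approve "
                f"-var team_id={team} -var region={region}",
            ],
        })
    return plan
```

Because the only per-team input is `team_id`, every environment is guaranteed identical at deploy time — the drift-free repeatability described above falls out of the structure rather than discipline.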
Designing Attack Scenarios That Actually Teach
Scenario design is where most exercises succeed or fail. The temptation is to build the most sophisticated, multi-stage APT campaign possible. We learned quickly that complexity without calibration frustrates rather than teaches.
Calibrating Difficulty
We design scenarios with progressive difficulty tiers. Early stages test fundamental skills: Did the team notice a suspicious logon event? Can they identify a known malware hash? Later stages introduce lateral movement, privilege escalation, and data exfiltration that require correlation across multiple data sources.
The key insight: if teams can’t complete the early stages, adding advanced stages doesn’t teach them — it demoralizes them. We build checkpoints into the schedule where the exercise director can assess team progress and adjust the red team’s pace accordingly.
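The gating between tiers can be expressed very simply. This is a sketch of the idea, not our scenario engine; the stage names and the 50% unlock threshold are illustrative assumptions:

```python
# Sketch of tier gating: a team unlocks the next tier only after clearing
# enough of the current one. Stage names and threshold are illustrative.

STAGES = {
    1: ["suspicious_logon", "known_malware_hash"],    # fundamentals
    2: ["lateral_movement", "privilege_escalation"],  # single-host depth
    3: ["data_staging", "exfiltration"],              # cross-source correlation
}
UNLOCK_THRESHOLD = 0.5  # clear half a tier to unlock the next

def unlocked_tiers(solved: set) -> list:
    """Return the list of tiers currently available to a team."""
    tiers = [1]
    for tier in sorted(STAGES)[:-1]:
        flags = STAGES[tier]
        if sum(f in solved for f in flags) / len(flags) >= UNLOCK_THRESHOLD:
            tiers.append(tier + 1)
        else:
            break
    return tiers
```

The exercise director can tune pacing at a checkpoint by adjusting the threshold rather than rewriting the scenario.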
Mapping to MITRE ATT&CK
Every attack scenario maps to specific MITRE ATT&CK techniques. This isn’t just for documentation — it structures the after-action review. Instead of vague feedback like “you need to improve detection,” we can say: “Your team detected Initial Access (T1566) but missed Lateral Movement via Pass-the-Hash (T1550.002) in 85% of cases.”
After the exercise, each team receives a MITRE ATT&CK heat map showing which techniques they detected, which they missed, and where their tooling has blind spots. This artifact alone has driven more security investment decisions than any risk assessment report we’ve produced.
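The coverage summary behind the heat map reduces to comparing what the red team executed against what the blue team detected. A minimal sketch — the technique IDs are real ATT&CK identifiers, but the data shape is an illustrative assumption:

```python
# Sketch: derive a per-team coverage report from executed vs. detected
# techniques. Technique IDs are real ATT&CK IDs; data shape is illustrative.

EXECUTED = {  # techniques the red team actually ran during the exercise
    "T1566":     "Phishing",
    "T1550.002": "Pass the Hash",
    "T1048":     "Exfiltration Over Alternative Protocol",
}

def coverage(detected: set) -> dict:
    """Label each executed technique as detected or missed."""
    return {
        tid: ("detected" if tid in detected else "missed")
        for tid in EXECUTED
    }

# Example: a team that caught the phish but nothing downstream
report = coverage({"T1566"})
```

Aggregating these reports across teams is what produces the per-technique percentages quoted in the after-action review.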
Balancing Realism and Achievability
We typically build two scenario tracks: an opportunistic attacker (commodity malware, credential stuffing, lateral movement via known exploits) and a persistent threat actor (custom tooling, living-off-the-land techniques, slow and deliberate lateral movement).
Mixed-maturity groups get both. Experienced teams that clear the commodity track quickly move to the APT track. Less experienced teams spend their time on realistic threats they’ll actually encounter. Nobody sits idle, and nobody drowns.
Running a 40-Team Event: What We Didn’t Expect
Scaling from single-team exercises to a multinational event with 40 teams across 15+ countries introduced challenges we hadn’t anticipated.
Scoring at Scale
We use CTFd for scoring — teams earn points for correctly identifying indicators of compromise, answering investigation questions, and completing containment actions. Progressive difficulty means early flags are worth fewer points, and late-stage flags require correlating evidence across multiple systems.
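The progressive point structure can be sketched as a small function. The base values and the correlation bonus are illustrative assumptions, not our actual CTFd configuration:

```python
# Sketch of progressive flag values: later tiers are worth more, and flags
# that require correlating evidence across systems earn a bonus.
# All numbers are illustrative, not our production scoring.

BASE_POINTS = {1: 100, 2: 250, 3: 500}

def flag_value(tier: int, systems_correlated: int = 1) -> int:
    """Point value for a flag, scaled by cross-system correlation effort."""
    return BASE_POINTS[tier] + 50 * max(0, systems_correlated - 1)
```

This keeps early flags cheap enough that every team scores something, while the leaderboard is decided by the correlation-heavy late stages.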
What we didn’t expect: teams approach the same attack chain in fundamentally different ways depending on their regional training background. Teams from some countries prioritized containment speed. Others focused on thorough forensic documentation before taking any action. Both approaches have merit, and our scoring needed to accommodate that.
Communication Across Time Zones
A multinational exercise means participants in time zones spanning 10+ hours. We structured the event so core activities happened in a shared window, with asynchronous challenges available outside that window. The lesson: never assume synchronous participation is possible at scale.
The Observer Problem
With 40 teams, you can’t embed a BSG operator with each one. We assigned one observer per 6-8 teams, supplemented by automated telemetry. Observers monitored team chat channels, flagged teams that were stuck, and injected hints when teams hit dead ends.
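One piece of that automated telemetry can be as simple as watching for scoring silence. A sketch of the idea — the data shape and the 45-minute threshold are illustrative assumptions:

```python
# Sketch: flag teams with no scoring activity for a while so an observer
# can check in. Threshold and data shape are illustrative.
from datetime import datetime, timedelta

STUCK_AFTER = timedelta(minutes=45)

def stuck_teams(last_submission: dict, now: datetime) -> list:
    """Teams whose most recent flag submission exceeds the threshold."""
    return sorted(
        team for team, ts in last_submission.items()
        if now - ts > STUCK_AFTER
    )
```

A check like this doesn't replace observers; it tells a 1-per-8 observer which chat channels to read first.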
The data from observers proved more valuable than the scoring results. Scoring tells you what a team found. Observers tell you how they found it — or why they didn’t.
Common Blue Team Gaps (Anonymized)
After running exercises for dozens of teams, patterns emerge. These gaps appeared consistently across industries, team sizes, and geographies:
Log correlation remains the biggest weakness. Most teams could identify individual suspicious events. Far fewer could connect a phishing email to a compromised credential to lateral movement to data staging. The skills exist in isolation; the analytical workflow connecting them is where teams break down.
Detection coverage drops sharply after initial access. Teams detected Initial Access techniques in about 70% of cases. For Lateral Movement, that number dropped below 30%. For Collection and Exfiltration, below 15%. The tools were usually capable of detecting these later stages; the teams weren’t practiced at investigating what the tools surfaced.
Handoffs between team members fail silently. Shift changes, analyst rotations, and cross-team communication are where investigations stall. An analyst discovers a suspicious process, documents it in a ticket, and hands off. The next analyst reads the ticket but interprets it differently. The original context — the analyst’s intuition about why the process was suspicious — doesn’t transfer.
Playbooks are written for the happy path. Runbooks assume the tool works, the server responds, and the attacker behaves predictably. In exercises, at least one of those assumptions fails. Teams with rigid playbooks froze. Teams with strong fundamentals and flexible thinking adapted.
What We’d Change Next Time
No exercise is perfect. NIST SP 800-84 covers exercise planning methodology well, but the hard lessons come from running them. Here’s what we’d do differently:
Start the AAR sooner. We originally ran the full exercise, then conducted after-action reviews the following day. By then, the adrenaline had faded and teams struggled to recall specific decisions. Now we run mini-debriefs after each exercise phase, while the experience is fresh.
Invest more in pre-exercise baselines. Understanding each team’s starting capability lets us calibrate difficulty more precisely and measure improvement more accurately. A brief skills assessment survey before the event pays for itself in exercise quality.
Build in more “quiet” observation time. Some of the most valuable insights came from watching teams during low-activity periods — not during the intense attack phases. How teams triage a backlog of alerts, and how they prioritize when nothing is screaming at them, reveal their operational maturity.
Is a Live Exercise Right for Your Team?
If your team has never run a live adversary simulation exercise, start with a focused one-day incident response drill. It’s less resource-intensive, delivers immediate value, and gives you a baseline to build from.
If your organization falls under DORA or NIS2, live exercises aren’t optional — they’re a compliance requirement. Starting early gives your team time to practice and improve before audit deadlines.
For teams that have completed foundational exercises and want to test against sophisticated, multi-stage attacks, a full adversary simulation with a live red team is the next step. It’s the closest thing to a real incident without the consequences.
We’ve built cyber defense training programs that scale from single-team drills to multinational events. Whether you’re preparing for a regulatory deadline or building a culture of continuous improvement, the exercise format adapts to your team’s maturity and goals.
BSG has spent 12+ years on the offensive side — penetration testing, red teaming, and winning CTF competitions. Adversary simulation exercises are a newer extension of that work, born from the realization that the best defense training comes from people who think like real attackers. Because that’s what we are.