The Three Layers You Need Before Giving Your Team Claude Code
Handing Cursor and Claude Code to developers without the right scaffolding is how you end up explaining a production incident to your CEO. Three layers separate teams that compound with AI from teams that explode with it.
A senior backend engineer I work with ran a “quick cleanup” on a Thursday afternoon. Claude Code, a database connection to staging, ten minutes of vibes-driven refactoring. He asked the agent to “remove obsolete test reservations.” The agent wrote `DELETE FROM reservs WHERE created_at < '2024-01-01'`. No `is_test = 1` filter. No dry run. No second pair of eyes.
It ran in less than a second. It nuked 340,000 real reservations.
The rollback took 11 hours.
The engineer was not junior. The tool was not broken. Claude Code did exactly what he asked — fast, confident, and catastrophic. The missing piece was not talent or caution. It was scaffolding.
Most companies today are sitting on a version of this story waiting to happen. They rolled out Cursor, Claude Code, Copilot, maybe a custom internal agent. Productivity went up on paper. Outages went up in Slack. Nobody connected the dots, because the causal chain takes months to surface.
This article is about the three layers that separate teams that compound with AI from teams that blow up with it. I’ve seen them work. I’ve also seen what happens when you skip any of them.
Why “Just Give Them The Tool” Doesn’t Work
In the old world, a developer’s blast radius was limited by friction. To do something dangerous in production, you had to:
- Know the AWS CLI well enough to run the command.
- Know the database schema well enough to write the query.
- Know where the logs were to debug it after.
- Manually type it out, usually in front of a colleague.
Every step was a speed bump. Most incidents were caught by Step 2 or Step 3, because the dev realized mid-typing that something felt off.
AI removed every one of those speed bumps.
A developer with Claude Code who barely knows AWS can now execute a perfectly formed `aws rds modify-db-instance` in twelve seconds. A developer who never learned SQL can run a seventeen-line JOIN against production from one prompt. They don’t pause to check. They trust the output because the output looks right.
The old system relied on friction as an unofficial safety mechanism. AI deleted the friction. Nobody replaced the safety mechanism.
That’s the gap the AI Enablement Engineer fills. And the way they fill it is by building three layers — each solving a different class of failure.
If you read the article on the AI Enablement Engineer role, think of this as the actual job description.
The Three Layers
```
┌───────────────────────────────────────────────────────┐
│ LAYER 3: CONTROLLED SPECIFICATION                     │
│ What gets built in the first place                    │
│ → Spec-Driven Development, approval gates, compliance │
├───────────────────────────────────────────────────────┤
│ LAYER 2: CONTROLLED WORKFLOWS                         │
│ How risky operations get done                         │
│ → Skills, guided procedures, institutional context    │
├───────────────────────────────────────────────────────┤
│ LAYER 1: CONTROLLED EXECUTION                         │
│ What dangerous actions can actually run               │
│ → Internal MCP, whitelists, AI gates, audit trail     │
└───────────────────────────────────────────────────────┘
```
Skipping any layer leaves a specific gap. Skipping all three is the reservation-deletion story from the opening paragraph.
Layer 1: Controlled Execution — The MCP That Owns Your Dangerous Operations
The foundation is simple to describe, hard to accept: developers should not have raw access to production infrastructure anymore.
Not Claude Code with AWS credentials. Not Cursor with a production DB connection. Not a terminal with kubectl pointed at the prod cluster.
Instead, you build an internal MCP server — let’s call the real one I’ve seen in the wild “Otto” — that wraps every dangerous operation behind a layer of policy, audit, and AI-driven sanity checking.
What that looks like concretely:
Database writes. db_update doesn’t just execute the SQL. It runs a four-gate safety pipeline: pattern validation → row count check via EXPLAIN → an AI risk analysis using Claude on the query itself → only then execute with COMMIT. A DELETE is dry-run by default; it requires confirm=true to actually run. No WHERE clause? Rejected. JOIN in a delete? Rejected. Touching a blocked table? Rejected.
Large queries. db_query refuses anything that EXPLAIN estimates will scan more than 500,000 rows without explicit force=true. Results over 1,000 rows auto-export to S3 with a presigned URL instead of dumping into the agent’s context.
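Otto’s internals aren’t public, but the deterministic parts of these two gates fit in a few lines. This is a minimal sketch, not the real implementation: the blocked-table list, return messages, and threshold are assumptions, and the real pipeline adds the EXPLAIN row estimate and the Claude risk analysis before anything commits.

```python
import re

BLOCKED_TABLES = {"users", "payments"}  # hypothetical blocklist
MAX_SCAN_ROWS = 500_000                 # threshold from the article

def gate_update(sql: str, confirm: bool = False) -> str:
    """Pattern-validation gate for db_update: reject the statement
    shapes that destroy production, dry-run deletes by default."""
    s = sql.strip().rstrip(";")
    for table in BLOCKED_TABLES:
        if re.search(rf"(?i)\b{table}\b", s):
            return f"REJECTED: blocked table '{table}'"
    if re.match(r"(?i)^delete\b", s):
        if not re.search(r"(?i)\bwhere\b", s):
            return "REJECTED: DELETE without WHERE"
        if re.search(r"(?i)\bjoin\b", s):
            return "REJECTED: JOIN in DELETE"
        if not confirm:
            return "DRY-RUN: pass confirm=True to execute"
    return "EXECUTE"

def gate_query(estimated_rows: int, force: bool = False) -> str:
    """Row-count gate for db_query: `estimated_rows` would come
    from EXPLAIN in the real tool; refuse big scans unless forced."""
    if estimated_rows > MAX_SCAN_ROWS and not force:
        return "REJECTED: scan too large, pass force=True"
    return "EXECUTE"
```

Note that the query from the opening story would have stopped at the dry-run gate: `gate_update("DELETE FROM reservs WHERE created_at < '2024-01-01'")` returns a dry-run, not an execution.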
Shell commands on instances. ssm_run_command accepts only whitelisted commands. No arbitrary shell. ssm_hotpatch for file deployment is dry-run by default, and can target specific environments, never all at once by accident.
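The whitelist idea is even simpler to sketch. The entries below are assumptions, not the real deployment’s list; the point is exact-match only, dry-run by default, one environment at a time:

```python
# Hypothetical whitelist; real entries would mirror the team's runbooks.
ALLOWED = {
    "systemctl status app",
    "df -h",
    "tail -n 100 /var/log/app/current.log",
}

def ssm_run_command(command: str, env: str, dry_run: bool = True) -> str:
    """Run a shell command on instances in one environment, whitelist-only.
    No arbitrary shell: anything not exactly in ALLOWED is rejected."""
    if env not in ("staging", "production"):
        raise ValueError(f"unknown environment: {env!r}")
    if command not in ALLOWED:
        return f"REJECTED: {command!r} is not whitelisted"
    if dry_run:
        return f"DRY-RUN: would run {command!r} on {env}"
    return f"EXECUTED: {command!r} on {env}"
```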
Every call audited. Every invocation — who, what, when, input, output — indexed in OpenSearch. Searchable, queryable, attributable.
Auth that matches the org. Google SSO with domain restrictions. Role-based permissions. Per-user API tokens you can revoke in one call.
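“Root cause analysis becomes a query” is literal with an OpenSearch-backed audit index. A hedged sketch of what that query might look like; the field names (`user`, `tool`, `timestamp`) are assumptions about the index mapping, not Otto’s actual schema:

```python
def build_audit_query(user: str, tool: str, since: str) -> dict:
    """OpenSearch query DSL for: every call `user` made with `tool`
    since `since`, newest first."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"user.keyword": user}},
                    {"term": {"tool.keyword": tool}},
                    {"range": {"timestamp": {"gte": since}}},
                ]
            }
        },
        "sort": [{"timestamp": {"order": "desc"}}],
    }
```

Handed to an OpenSearch client’s `search()` call against the audit index, this answers “who ran what, when” in one request instead of an archaeological dig.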
The philosophy is not “prevent developers from working fast.” It’s “prevent developers from working fast in the three specific ways that destroy production.” Everything else stays unconstrained.
Two things happen once you build this:
- The blast radius of any AI agent is bounded by what the MCP permits. A hallucinated `DROP TABLE` from a prompt injection attack is rejected at the gate, not at the database.
- You have an audit trail. When something does go wrong, you can trace exactly which user, which agent, and which prompt produced which call. Root cause analysis becomes a query, not an archaeological dig.
Layer 1 alone is not enough. It controls what can run, not what developers decide to run. That’s Layer 2.
Layer 2: Controlled Workflows — Skills That Inject Context And Guardrails
A developer asks Claude Code: “Fix the Bugsnag error UndefinedMethodError in ReservationController.”
Without a controlled workflow, the agent will read the stack trace, pattern-match against the code, propose a fix, and potentially ship it. The fix might be wrong because:
- The error might already be documented in `docs/recommendations/bugs/` with the real root cause.
- The affected area might have a known security issue in `docs/recommendations/security/` — and the “fix” might widen the hole.
- The team’s convention for this type of error might be “add null check + log to New Relic”, not “wrap in try-catch”.
- The right tool to investigate might be `otto bugsnag_get_error`, not a direct API call that won’t be audited.
The agent doesn’t know any of this. Nobody told it. And the developer, under pressure, won’t remember to inject all this context by hand every time.
Enter skills.
A skill is a guided procedure. In Claude Code, it’s a markdown file with a YAML header and a playbook. Invoking /bug-flow or /check-security loads that playbook into the agent’s context and forces it through a specific sequence:
- Before investigating, read `docs/recommendations/` for known issues in this area.
- Before fixing, confirm the error exists in Bugsnag via `otto bugsnag_get_error`, not a manual lookup.
- Before deploying, run `/check-security` and cross-reference with the 218 findings in `docs/recommendations/security/`.
- Before marking done, run `/linter`, `/testing`, and `/check-architecture`.
Every skill defers to Layer 1 where possible (“Prefer otto for SSM commands”). Every skill points to living documentation. Every skill is a short, opinionated checklist that keeps the agent — and the developer — from skipping the steps that usually get skipped.
A real deployment in a codebase I know has 28 of these: bug-flow, check-security, check-architecture, check-performance, deep-review, hotpatch, migration, incident-research, local-debug, requirement-validate, requirement-design, requirement-tasks, and so on. Each one is maybe 150 lines. Together they form a paved road for the whole team.
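Concretely, a skill file is small enough to read in one screen. This is an illustrative sketch of what a `/bug-flow` skill might look like, assembled from the steps described above; the header fields and exact wording are assumptions, not the real deployment’s file:

```markdown
---
name: bug-flow
description: Guided procedure for investigating and fixing a Bugsnag error
---

# Bug Flow

1. Read `docs/recommendations/` for known issues touching the affected area.
2. Confirm the error exists via `otto bugsnag_get_error`, never a manual lookup.
3. Propose a fix that follows team conventions for this error class,
   not a blanket try-catch.
4. Run `/check-security` and cross-reference `docs/recommendations/security/`.
5. Run `/linter`, `/testing`, and `/check-architecture` before marking done.
```

The YAML header makes the skill discoverable; the numbered playbook is what gets loaded into the agent’s context when the skill is invoked.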
The key design principle: skills are not a permission system. They’re a context system. A dev with a skill gets faster and safer because the agent has the right institutional knowledge loaded. A dev without the skill can still do the work — it’s just lonelier and more error-prone.
The organizational effect is profound: tribal knowledge stops being tribal. The senior engineer’s instinct to “always check docs/recommendations/ before touching the payment flow” becomes one line in a skill file that applies to every new hire from day one.
Layer 2 controls how risky operations actually get done, with the full institutional context injected. But it doesn’t control whether the operation should happen in the first place. That’s Layer 3.
Layer 3: Controlled Specification — The Spec-Driven Framework That Fixes The Real Bottleneck
Here’s the uncomfortable truth that breaks most AI rollouts:
The bottleneck was never typing code. The real bottleneck is clarity.
Claude Code can write a 400-line feature in fifteen minutes. If the feature spec was ambiguous, you now have a 400-line feature that solves the wrong problem, with confident naming and decent tests, merged into main before anyone noticed.
This is the most dangerous failure mode in the whole stack, because it’s invisible. Layer 1 protects you from DDL disasters. Layer 2 protects you from workflow shortcuts. Layer 3 protects you from building entirely the wrong thing at ten times the usual speed.
The remedy is Spec-Driven Development: a framework that forces clarity before code.
The shape of it:
Two explicit processes, separated.
- Product Discovery answers “Should we build this?” — opportunity brief, research, prototype, Go/No-Go.
- Development Workflow answers “How do we build this?” — Epic → Validation → Slicing → Feature → Design → Tasks → Implement → Review → Deploy.
Discovery is optional for bug fixes and hotfixes. Development is not.
The Hierarchy of Technical Impact.
| Level | What | Visibility | Cost of Error |
|---|---|---|---|
| 3 | Aesthetics — colors, copy, spacing | High | Low |
| 2 | Behavior — flows, states, edge cases | Medium | Medium |
| 1 | Foundation — entities, business logic, integration contracts | Low | Critical |
The pattern: what’s easiest to see is cheapest to fix. What’s invisible is expensive to change. Your PRD must define Level 1 completely before any code is written. Level 2 gets sketched. Level 3 stays flexible.
The reason this matters specifically for AI is what I call the AI Cascade Effect: errors at Level 1 create a domino effect in AI-assisted work. When the foundation is wrong, each subsequent prompt to fix errors consumes more context, explanations get more convoluted, and the model starts hallucinating solutions on top of patches. What starts as a small data model mistake becomes thousands of tokens spent explaining why the fix for the fix needs another fix. Get Level 1 right, or pay the token tax forever.
Full breakdown of the AI Cascade Effect →
Seven approval gates, explicit RACI.
Each gate has a named owner:
| Gate | Approver | What it Gates |
|---|---|---|
| 1 | PM + TL | Epic validation — is the requirement complete? |
| 2 | PM + TL | Slicing — are features independently valuable and deployable? |
| 3 | TL | Design — does the architecture align with system standards? |
| 4 | DEV | Tasks — are they complete, testable, ordered correctly? |
| 5 | TL | Code review — quality, security, correctness |
| 6 | DEV | Quality checks — all automated checks pass |
| 7 | PM + TL | DoD — all acceptance criteria met |
AI generates artifacts at every step — the Epic validation checklist, the feature design, the task breakdown. Humans approve at each gate. AI accelerates execution; humans own decisions.
Compliance by default.
GDPR, EU AI Act, and sector regulations belong in the Epic validation gate, not as a post-development checkbox. Security vulnerabilities can be patched; compliance failures invalidate architectures. A data model that doesn’t support “right to be forgotten” doesn’t need a fix — it needs a rebuild.
Rethinking slicing.
A less obvious consequence of AI: slicing overhead stays constant (branch management, PR coordination, integration testing) while development time collapses. If AI-assisted work completes an epic in 16 hours but slicing it into 4 features adds 18 hours of coordination overhead, you’ve slowed yourself down. The framework pushes toward single-owner epics when cohesive, slicing only for genuine business or risk reasons.
Why slicing in the AI era breaks traditional Agile →
The deliverables of Layer 3 are not documents for the sake of documents. They’re the substrate the other two layers operate on. Skills reference Epic docs. The MCP audit trail links back to feature requirements. Compliance constraints baked into the Epic shape what ssm_hotpatch or db_update are allowed to do.
Why You Can’t Skip Any Layer
The three layers are not independent. They fail in predictable ways when isolated:
Layer 1 alone. Devs still make decisions in a vacuum. They call db_query against the right table for the wrong reason. The tool is safe; the intent was wrong. You prevent DROP TABLE disasters but ship features that solve non-problems.
Layer 2 alone. Skills inject context, but the agent still runs raw SQL through a plain connection. A single bad prompt and you’re back to the 340k-deleted-reservations story. The workflow is good; the execution is unguarded.
Layer 3 alone. Great specs, clean approvals, excellent Epic docs. Then a dev opens Cursor and freehand-implements it against production. You spent three weeks defining the Foundation, and the implementation blew it up on day one.
All three together. The spec defines what to build. The skill guides how to build it, with full context. The MCP enforces what can actually execute. A bad prompt has to fail three different gates before it reaches anything real.
The investment isn’t theoretical. The team I’m drawing from shipped each layer with roughly one quarter of a single engineer’s focus: one person, about three months per layer. The stack now serves an engineering org of ~50 people across two companies.
One engineer. 50x leverage. Which is the whole point of the AI Enablement Engineer role.
The Uncomfortable Honest Part
Most companies reading this will not build these layers. Here’s what they will do instead:
- Buy Cursor/Claude Code seats for the whole team.
- Run a 60-minute internal training.
- Wait for velocity to increase.
- Notice outages increase.
- Blame the tool.
The Ferrari problem doesn’t go away because you don’t name it. And the longer you wait to build the layers, the more production incidents you’ll debug under the flag of “unrelated.”
The teams that build this scaffolding in 2026 will look back in 2028 and wonder how anyone ever worked without it. The teams that don’t will be explaining to their CEO why a developer’s Thursday-afternoon cleanup cost 11 hours of rollback.
Your choice.
Building the three-layer stack is a book-length argument I’ve compressed into one article. If you want the full version — including the AI Enablement Engineer as a role, the founder-mode management approach that makes it possible, and the end of the Product Owner as we knew it — The Broken Telephone is where all three threads converge.
John Macias
Author of The Broken Telephone